JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER ENGINEERING
ISSN: 0975-6760 | NOV 12 TO OCT 13 | VOLUME 02, ISSUE 02
Page 209

EXTRACTING OPINION FROM WEB SITES USING NATURAL LANGUAGE PROCESSING


¹FORAM JOSHI

¹Student, Department of Computer Science and Engineering, Noble Engineering College, Junagadh, Gujarat.

er.foramjoshi@gmail.com

ABSTRACT: The world is full of valuable data, but finding the data we need within that bulk is not an easy task. In this paper I first give the basic idea of the data extraction process and then explain the processing steps of natural language processing. After that I present my proposed structure, which extracts opinions from specific web sites and processes each review. At the end of the natural language processing steps we can identify whether a particular review or comment is good, medium or bad, based on ranking. Future work on this proposal is its implementation for various opinion-based web sites.





KEY WORDS: Natural Language Processing, Data Extraction, Opinion

1. INTRODUCTION

The world is full of valuable data, but obtaining the data we require in a formatted way is not an easy task. For product listings, business directories, inventories, etc., managing large amounts of data is very tedious. A number of techniques are therefore available, and a number of software tools based on those techniques exist to analyze the data. We are going to implement one such intelligent technique, with which the data can be manipulated easily [1]. Data extraction means identifying specific pieces of data in an unstructured or semi-structured textual document. The technique transforms unstructured data or information in a corpus of documents or web pages into specifically formatted data, which can then be handled like a traditional database [2]. The traditional approach to extracting data from web sources is to write specialized programs called wrappers. A wrapper identifies the data of interest and maps it to some suitable format, such as a relational database or XML [3].
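To make the wrapper idea concrete, the following is a minimal sketch in Python, not the implementation used in this paper. It assumes, purely for illustration, that reviews appear in <p class="review"> elements; the class name ReviewWrapper, the function wrap_to_xml and the sample page are all hypothetical. The wrapper identifies the data of interest and maps it to a small XML document.

```python
from html.parser import HTMLParser
from xml.etree import ElementTree as ET


class ReviewWrapper(HTMLParser):
    """Collects the text of every <p class="review"> element.

    The markup convention is an assumption made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p" and ("class", "review") in attrs:
            self._in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False

    def handle_data(self, data):
        # Keep text only while inside a review paragraph
        if self._in_review:
            self.reviews[-1] += data


def wrap_to_xml(html):
    """Identify the data of interest and map it to a simple XML format."""
    parser = ReviewWrapper()
    parser.feed(html)
    parser.close()
    root = ET.Element("reviews")
    for text in parser.reviews:
        ET.SubElement(root, "review").text = text.strip()
    return ET.tostring(root, encoding="unicode")


# Hypothetical input page: two reviews and one unrelated paragraph
page = (
    "<html><body><h1>Phone X</h1>"
    '<p class="review">Great battery.</p>'
    "<p>Ad banner</p>"
    '<p class="review">Poor screen.</p>'
    "</body></html>"
)
xml_out = wrap_to_xml(page)
print(xml_out)
```

For the sample page, only the two review paragraphs are kept and the unrelated paragraph is filtered out. A wrapper of this kind is tied to the markup of one site, which is one reason the paper looks at NLP-based extraction instead.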

In other words, the input to the system is a collection of sites (e.g. different domains), while the output is a representation of the relevant information from the source sites, according to specific extraction criteria [4]. Such techniques can be applied for data extraction to different types of text, such as newspaper articles, web pages, scientific articles, newsgroup messages, classified ads, medical notes, etc. [2]



2. HOW DOES EXTRACTION WORK?




Figure [1]: Meaning of data extraction in pictorial format.

First, the machine finds a large amount of data; when we want some specific type of data, it filters out the rest. Figure [1] gives the basic idea: initially we want data of type 1, so the machine processes all the data and returns only the required type 1 data. This is what we call wrapping the data. As discussed, a number of techniques are available for data extraction, such as natural language processing, languages and grammars, machine learning, information retrieval, databases and ontologies [3].

Among these techniques, I am going to implement the natural language processing technique. During my work on this topic I found that it is the technique to choose when we want reliable data in a simple way. Figure [2] gives the flow of extracting the data.




Figure [2]: Flow of extracting the data.

Natural language processing is the automatic ability to understand text or audio speech data and extract valuable information from it [2]. The ultimate objective of natural language processing is to allow people to communicate with computers in much the same way they communicate with each other. More specifically, natural language processing facilitates access to a database or a knowledge base, provides a friendly user interface, facilitates language translation and conversion, and increases user productivity by supporting English-like input [4]. Natural language processing covers a vast area. It is used in main fields such as automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition, natural language generation, natural language understanding and optical character recognition, and in sub-fields such as information retrieval, information extraction and speech processing [5].






3. PROCESSING STEPS IN NATURAL LANGUAGE PROCESSING

Based on the above comparison, we can see directly that NLP-based tools support only simple types of objects, such as text documents, and do not support complex types of objects such as images or speech for data extraction. What we are going to do is develop or modify the tools so that they support not only text but also speech or images, i.e. complex types of data. We chose NLP-based tools for data extraction because this is the only technique that fully supports non-HTML resources; it is not always the case that all data is in HTML format. One more advantage is that it provides semi-automation, compared with tools based on other techniques, which are either manual or automatic. For these reasons we chose an NLP-based technique. A number of techniques are available within NLP; they are listed below with brief descriptions [6].



Table [2]: Selected Processing Steps in NLP-based Document Processing System


Generally, data extraction is done in a number of ways, some of which are briefly explained below: [7]

1. Named Entity Recognition: A specific type of information extraction in which the goal is to extract formal names of particular types of entities such as people, places, organizations, etc.

2. Relation Extraction: Once entities are recognized, identify specific relations between entities.

3. Web Extraction: Many web pages are generated automatically from an underlying database. Therefore, the HTML structure of the pages is fairly specific and regular, i.e. semi-structured. However, the output is intended for human consumption, not machine interpretation. A data extraction system for such generated pages allows the web site to be viewed as a structured database. The process of extracting from such pages is sometimes referred to as screen scraping.

4. Regular Expressions: A language for composing complex patterns from simpler ones. An individual character is a regex.

   a. Union: If e1 and e2 are regexes, then (e1|e2) is a regex that matches whatever either e1 or e2 matches.
      i. (u|e)nabl(e|ing) matches unable, enabling

   b. Concatenation: If e1 and e2 are regexes, then e1e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2.

   c. Repetition (Kleene closure): If e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1.
      i. (un|en)*able matches able, enununenable
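The three regex operations above can be tried directly with Python's standard re module. This is only an illustrative sketch; the helper name matches is hypothetical, and fullmatch is used so that the whole word, not just a part of it, must fit the pattern.

```python
import re

# Union: (e1|e2) matches whatever either e1 or e2 matches.
union = re.compile(r"(u|e)nabl(e|ing)")

# Concatenation: e1e2 matches e1 immediately followed by e2.
concat = re.compile(r"(un)(able)")

# Repetition (Kleene closure): e1* matches zero or more strings matching e1.
closure = re.compile(r"(un|en)*able")


def matches(pattern, text):
    """Return True if the entire string is matched by the pattern."""
    return pattern.fullmatch(text) is not None


for word in ["unable", "enabling", "able", "enununenable", "disable"]:
    print(word, matches(union, word), matches(concat, word), matches(closure, word))
```

Here "unable" and "enabling" match the union pattern, while "able" and "enununenable" match the closure pattern because (un|en) may repeat zero or more times; "disable" matches none of the three.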


4. PROPOSED STRUCTURE FOR OPINION EXTRACTION


Figure [3] gives a clear idea of the inputs we take. Based on some predefined rules and a thesaurus, we can categorize each review.

What I am going to implement starts from any web site that is based on collecting opinions or reviews. We process its text with the help of predefined rules that specify by which criteria a particular comment or opinion is good, medium or abusive. If there are N rules matching the same piece of text, we first rank the rules preliminarily according to their own extraction accuracy [9].
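The rule matching and ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the system proposed here: the Rule structure, the keyword sets and the accuracy values are all hypothetical, and a real system would derive the rules from a thesaurus and measure each rule's accuracy on labelled reviews.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    label: str           # category assigned when the rule fires
    keywords: frozenset  # words that trigger the rule
    accuracy: float      # assumed extraction accuracy of the rule, 0..1


# Hypothetical rules and accuracies, for illustration only
RULES = [
    Rule("good", frozenset({"great", "excellent", "love"}), 0.90),
    Rule("medium", frozenset({"okay", "average", "fine"}), 0.70),
    Rule("bad", frozenset({"poor", "broken", "terrible"}), 0.85),
    Rule("abuse", frozenset({"idiot", "stupid"}), 0.95),
]


def matching_rules(review, rules=RULES):
    """Return all rules that match the review, most accurate first."""
    words = set(re.findall(r"[a-z']+", review.lower()))
    hits = [rule for rule in rules if rule.keywords & words]
    return sorted(hits, key=lambda rule: rule.accuracy, reverse=True)


def classify(review, rules=RULES):
    """Label a review with the top-ranked matching rule, or None if no rule matches."""
    ranked = matching_rules(review, rules)
    return ranked[0].label if ranked else None


print(classify("Great phone but the screen is poor"))
print(classify("The seller is an idiot, terrible service"))
print(classify("It arrived on Monday"))
```

In the first review both the "good" and the "bad" rule match, and the more accurate rule decides the label. Returning None when nothing matches is a choice made for this sketch; how unmatched reviews are treated is not specified in the text above.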




Figure [3]: The process of our extraction method


5. EXPERIMENTAL DATA


Depending on the number of extracted keywords, their corresponding sentences are selected to generate the extractive summary, which is then post-processed to turn it into a concise abstractive summary [8].

We then identify the rank of particular reviews. Figure [4] shows the possibility of extracting keywords automatically as well as manually.
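The keyword-driven sentence selection described above can be sketched as below. This is an assumption-laden illustration, not the method of [8]: the function names, the sentence splitter and the scoring by keyword count are all hypothetical choices, and the abstractive post-processing step is not covered.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, keywords, max_sentences=2):
    """Select the sentences containing the most keywords, kept in original order."""
    keywords = {k.lower() for k in keywords}
    scored = []
    for index, sentence in enumerate(split_sentences(text)):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        score = len(words & keywords)  # number of distinct keywords present
        if score:
            scored.append((score, index, sentence))
    # Highest keyword count first; ties broken by position in the text
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Restore the original sentence order for readability
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


review = (
    "The battery lasts two days. Delivery was on Tuesday. "
    "The screen is bright and the battery charges fast. I bought it online."
)
summary = extractive_summary(review, {"battery", "screen"})
print(summary)
```

The keywords passed in may come from automatic or manual selection; the function treats both the same way.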



Figure [4]: Comparison of the automatic selection vs. manual selection of keywords


6. CONCLUSION AND FUTURE WORK

In this paper, we describe a novel approach for opinion extraction using natural language processing that identifies whether an opinion is good or bad; if it is abusive, it is removed automatically. Future work for this approach is its implementation for various review-based or opinion-based web sites.


7. REFERENCES


1. Mozenda Web Scraper - Web Data Extraction. http://www.youtube.com/watch?v=gvWGSBRuZ5E

2. Natural Language Processing. http://en.wikipedia.org/wiki/Natural_language_processing

3. Yuequn Li, Wenji Mao, Daniel Zeng, Luwen Huangfu and Chunyang Liu. A Brief Survey of Web Data Extraction Tools.

4. Natural Language Processing 68. www.hit.ac.il/staff/leonidm/information-system/ch-68.html

5. Natural Language Processing. http://www.seogrep.com/natural-language-processing/

6. Mary D. Taffet. Application of Natural Language Processing Techniques to Enhance Web-Based Retrieval of Genealogical Data.

7. Parag M. Joshi, Sam Liu. Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing. To be published in the 9th ACM Symposium on Document Engineering, DocEng'09, Munich, Germany, September 16-18, 2009.

8. Jagadish S Kallimani, Srinivasa. Information Extraction by an Abstractive Text Summarization for an Indian Regional Language.

9. Yuequn Li, Wenji Mao, Daniel Zeng, Luwen Huangfu and Chunyang Liu. Extracting Opinion Explanations from Chinese Online Reviews.