Extraction, Representation & Querying of Unstructured, Unreliable Digital Information


Extraction, Representation & Querying of
Unstructured, Unreliable Digital Information

B.Tech. Seminar Report
Submitted in partial fulfillment of the requirements
for the degree of
Bachelor of Technology

by
Aaditya Kumar Ramdas
Roll No: 05005027

under the guidance of
Prof. Soumen Chakrabarti

Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
Contents

1 Introduction 1

2 Information Extraction & Annotation 2
  2.1 DIPRE 2
  2.2 Snowball 3
  2.3 SemTag-Seeker 3
  2.4 TextRunner 4

3 Representation & Query Semantics 6
  3.1 Modelling Uncertainty & Imprecision 6
    3.1.1 ExDB 6
    3.1.2 Trio 7
  3.2 Querying & Query Processing 10
    3.2.1 Probabilistic Queries in ExDB 10
    3.2.2 TriQL - Trio Query Language 10
  3.3 Top-k Queries 11
  3.4 Corroborating Answers 12
  3.5 SphereSearch 13

4 Conclusion 15
  4.1 Future Work 15
  4.2 Acknowledgements 16

A Survey of Tools 17
  A.1 Lucene 17
  A.2 Joomla! 17
  A.3 Twine 18
Abstract

The Internet, which continues to boom, is increasingly considered to be a huge resource of terribly organized, unreliable digital information. Users' needs and expectations are also on the rise, and there is an urgent need for quick and efficient access to reliable information. In this report, we look at issues ranging from the process of information extraction to the representation and querying of the extracted data, and attempt to understand and critically analyze the present state of affairs.
Chapter 1
Introduction
The Internet is growing exponentially, and that growth shows no sign of halting in the near future. A large part of this vast, publicly available, versatile storehouse of information is in a highly unstructured form, namely text or, rather, HTML. While one part of the world is trying to introduce some structure into the Internet in the form of XML (or, recently, OOXML), we must realize that it will take a very long while (if ever) to convert the entire Internet's information into a semi-structured form (one that, firstly, everyone agrees upon). Furthermore, direct authoring of structured data is rare and generally not free.
IE  Information extraction (IE) distills structured data or knowledge from unstructured text (whether "organically grown" text or text generated via scripts from relational databases or XML) by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extract abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns.
"Pipeline"  Attempts are being made to tap the information that is currently available by annotating it in certain ways (none of which will ever be completely correct, but satisfactory ways nevertheless exist) and then extracting this semi-structured data. This data, which is often unreliable, has to be stored in a database (which must be modelled to suit our needs; there is no universal method for doing this), and then, finally, to make this effort worthwhile, we must come up with simple yet effective ways to query this database and, from the end users' point of view, to search the Internet quickly and accurately for the information they are looking for.
Present eorts Dierent ways of eciently modelling the various steps involved (from IE,to
representation,to querying and ranking) is far from a resolved problem.Dierent approaches
for every single step exist,each having its own interesting issues,pros and cons.With plenty
of research going on in these areas,newer thoughts and models arise.Implementation of these
is also something to be looked into,with optimizations and intelligent changes being of huge
importance,since even milliseconds of speed-up is boon.
What's ahead  We will explore, in varying depths, the different stages of this "web-mining pipeline" over the next few chapters, and gain an insight into the present software and content management systems that exist on the Internet, highlighting difficulties, drawbacks and areas of improvement along the way.
Chapter 2
Information Extraction & Annotation
There have been several efforts to annotate the web using different models and heuristics on these models [JB01]. The semantic web has been an increasingly popular vehicle for annotation, and there have also been attempts to analyze HTML and convert it to XML for better structure. Bootstrapping is often used, and automated semantic annotation is relatively successful. The different systems may use their own in-house corpus or work as a "carnivore" on another corpus. Extraction mechanisms range from sequentially tokenized models (HMMs, CRFs), to controlled-domain systems with substantial training (DIPRE, Snowball, KnowItAll), to open-domain "self-supervised" systems (TextRunner). In the following sections, we shall have a look at some systems of varying complexity and the underlying methods that they use.
2.1 DIPRE

DIPRE, Dual Iterative Pattern Expansion, is a concept introduced by Brin [Bri98]. DIPRE works best on the World Wide Web, where a tuple tends to appear in several documents in different locations, and this redundancy can be exploited. DIPRE is initially trained with a small example set and a regular expression that the entities must match. DIPRE then looks for instances of what it has learnt in the documents and examines the text that surrounds the learnt tuples.

Generating Patterns

A DIPRE pattern is a 5-tuple (order, urlprefix, left, middle, right) and is generated by grouping together occurrences of seed tuples that have equal strings separating the entities (middle), and then setting the left and right strings to the longest common substrings of the context on the left and on the right of the entities, respectively. The order reflects the order in which the entities appear, and urlprefix is set to the longest common substring of the source URLs where the seed tuples were discovered. After generating a number of patterns from the initial seed tuples, DIPRE scans the available documents in search of segments of text that match the patterns. As a result of this process, DIPRE generates new tuples and uses them as the new seeds. DIPRE then starts the process all over again, searching for these new tuples in the documents to identify new promising patterns.
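The pattern-generation step above can be sketched in a few lines of Python. This is purely illustrative code of our own, not Brin's implementation: the occurrence format and function names are assumptions, and we follow the common reading that left is the longest common suffix of the left contexts and right the longest common prefix of the right contexts.

```python
# Minimal sketch of DIPRE-style pattern generation. Each occurrence of a
# seed tuple is a dict with keys: url, left, middle, right, order.
from os.path import commonprefix

def longest_common_suffix(strings):
    """Longest common suffix, computed via commonprefix on reversed strings."""
    return commonprefix([s[::-1] for s in strings])[::-1]

def make_pattern(occurrences):
    """Group occurrences that share the same entity order and middle string,
    then build one (order, urlprefix, left, middle, right) pattern per group."""
    by_middle = {}
    for occ in occurrences:
        by_middle.setdefault((occ["order"], occ["middle"]), []).append(occ)
    patterns = []
    for (order, middle), group in by_middle.items():
        patterns.append({
            "order": order,
            "urlprefix": commonprefix([o["url"] for o in group]),
            "left": longest_common_suffix([o["left"] for o in group]),
            "middle": middle,
            "right": commonprefix([o["right"] for o in group]),
        })
    return patterns
```

A new pattern is kept only if it is specific enough; in practice DIPRE discards patterns whose contexts are too short, since those would extract too many bogus tuples.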
2.2 Snowball
Snowball [AG00] is a scalable system built for one of the main purposes of information extraction: converting unstructured text to structured tables over which precise queries can be made. It introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention, and keeps only the most reliable ones for the next iteration. Its authors also develop a scalable evaluation methodology and present some metrics.
Generating Tuples

The Snowball system builds on DIPRE's pattern generation mechanism. A Snowball pattern is a 5-tuple (left, tag1, middle, tag2, right), where tag1 and tag2 are named-entity tags and left, middle and right are vectors associating weights to terms. An example of a Snowball pattern is

  {(the, 0.2)}, LOCATION, {(-, 0.5), (based, 0.5)}, ORGANIZATION, {}

This pattern will match strings like "the Irving-based Exxon Corporation", where the word "the" (left context) precedes a location (Irving), which is in turn followed by the strings "-" and "based" (middle context) and an organization.
The degree of match Match(t_P, t_S) between two 5-tuples t_P = (l_P, t_1, m_P, t_2, r_P) (with tags t_1 and t_2) and t_S = (l_S, t_1', m_S, t_2', r_S) (with tags t_1' and t_2') is defined as:

  Match(t_P, t_S) = l_P . l_S + m_P . m_S + r_P . r_S   (if the tags match)
  Match(t_P, t_S) = 0                                   (otherwise)
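The degree-of-match computation can be sketched as follows. This is an illustrative sketch of our own: we assume the context vectors are represented as term-to-weight dictionaries, so each dot product is a sum over shared terms.

```python
# Sketch of Snowball's degree-of-match between two 5-tuples.
def dot(u, v):
    """Dot product of two sparse term-weight vectors (dicts)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

def match(t_p, t_s):
    """t_p and t_s are 5-tuples (left, tag1, middle, tag2, right);
    returns the summed dot products if the tags agree, else 0."""
    l_p, tag1_p, m_p, tag2_p, r_p = t_p
    l_s, tag1_s, m_s, tag2_s, r_s = t_s
    if (tag1_p, tag2_p) != (tag1_s, tag2_s):
        return 0.0
    return dot(l_p, l_s) + dot(m_p, m_s) + dot(r_p, r_s)
```

Snowball uses this match score both to decide which pattern an occurrence belongs to and to estimate the confidence of patterns and tuples across iterations.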
2.3 SemTag-Seeker

Seeker is a platform for large-scale text analytics, and SemTag is an application written on the platform to perform automated semantic tagging of large corpora. It demonstrates the viability of bootstrapping a web-scale semantic network [DEG+03].

The bulk of websites don't contain annotations, and hence application developers cannot rely on such annotations. On the other side, website creators are unlikely to add annotations in the absence of applications that use them. A natural approach to break this cycle and provide an early set of widespread semantic tags is automated generation. SemTag seeks to provide an automated process for adding these tags to the existing HTML corpus on the Web.
SemTag works in 3 phases:

Spotting Pass  Documents are retrieved from the Seeker store, tokenized, and then processed to find all instances of the approximately 72K labels that appear in the TAP taxonomy, which is the ontological support used by SemTag. Each resulting label is saved with ten words on either side as a "window" of context around the particular candidate object.

Learning Pass  A representative sample of the data is then scanned to determine the corpus-wide distribution of terms at each internal node of the taxonomy. The Taxonomy Based Disambiguation algorithm performs disambiguation of references to entities within a large-scale ontology.

Tagging Pass  Finally, the windows are scanned once more to disambiguate each reference. When a string is finally determined to refer to an actual TAP object, a record is entered into a database of final results containing the URL, the reference and any other associated metadata.
2.4 TextRunner

TextRunner [MBE07] is a fully implemented, highly scalable system which demonstrates a new type of information extraction, called OIE (Open Information Extraction), in which the system makes a single, data-driven pass over the entire corpus and extracts a large set of relational tuples without requiring any human input.

Challenges

Automation of relation extraction  Traditional information extraction uses hand-labeled inputs, but TextRunner requires the process of extracting relations to be automated.

Corpus heterogeneity on the Web  Tools like parsers and named-entity taggers become less accurate because the web corpus differs from the data used to train the tools.

Scalability and efficiency  Open IE systems are effectively restricted to a single, fast pass over the data so that they can scale to huge document collections.
Novel Components

Single Pass Extractor  The TextRunner extractor makes a single pass over all documents, tagging sentences with part-of-speech tags and noun-phrase chunks as it goes. For each pair of noun phrases that are not too far apart, and subject to several other constraints, it applies a classifier to determine whether or not to extract a relationship. If the classifier deems the relationship trustworthy, a tuple of the form t = (e_i, r_j, e_k) is extracted, where r_j is the relation between entities e_i and e_k.
Self-supervised Classier While full parsing is too expensive to apply to the Web,Tex-
tRunner uses a parser to generate training examples for extraction.Using several heuristic
constraints,it automatically labels a set of parsed sentences as trustworthy or untrustworthy
extractions (positive and negative examples,respectively).The classier is trained on these ex-
amples,using features such as the part of speech tags on the words in the relation.The classier
is then able to decide whether a sequence of POS-tagged words is a correct extraction with high
accuracy.
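The single-pass extraction step can be illustrated with a heavily simplified toy sketch. Everything here is an assumption for illustration: real TextRunner uses a trained classifier and an NP chunker, whereas below the input is already chunked into labelled spans and a stub predicate stands in for the classifier.

```python
# Toy sketch of single-pass tuple extraction. A chunked sentence is a list
# of (text, label) spans, where label is "NP" for a noun phrase, "O" otherwise.
def extract_tuples(chunked_sentence, max_gap=4,
                   trustworthy=lambda rel: bool(rel)):
    """For each pair of adjacent noun phrases that are not too far apart,
    treat the words between them as a candidate relation and keep it if
    the (stub) classifier deems it trustworthy."""
    tuples = []
    nps = [i for i, (_, lab) in enumerate(chunked_sentence) if lab == "NP"]
    for a, b in zip(nps, nps[1:]):
        relation = " ".join(text for text, _ in chunked_sentence[a + 1:b])
        if b - a - 1 <= max_gap and trustworthy(relation):
            tuples.append((chunked_sentence[a][0], relation,
                           chunked_sentence[b][0]))
    return tuples
```

The max_gap constraint mirrors the "not too far apart" condition in the text; the stub trustworthy predicate is where the self-supervised classifier would plug in.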
Synonym Resolution  Because TextRunner has no pre-defined relations, it may extract many different strings representing the same relation. Also, as with all information extraction systems, it can extract multiple names for the same object. The Resolver system performs an unsupervised clustering of TextRunner's extractions to create sets of synonymous entities and relations. Resolver uses a novel, unsupervised probabilistic model to determine the probability that any pair of strings is co-referential, given the tuples that each string was extracted with.

Query Interface  TextRunner builds an inverted index of the extracted tuples and spreads it across a cluster of machines. This architecture supports fast, interactive, and powerful relational queries. Users may enter words in a relation or entity, and TextRunner quickly returns the entire set of extractions matching the query. For example, a query for "Newton" will return tuples like (Newton, invented, calculus). Users may opt to query for all tuples matching synonyms of the keyword input, and may also opt to merge all tuples returned by a query into sets of tuples that are deemed synonymous.
Chapter 3
Representation & Query Semantics
In traditional database management systems (DBMSs), every data item is either in the database or it isn't, the exact value of every data item is known, and how a data item came into existence is an auxiliary fact, if recorded at all. Many database applications inherently require more flexibility and accountability in their data. A data item may belong in the database with some amount of confidence, or its value may be approximate. Furthermore, how a data item came to exist, particularly if it was derived using other (possibly inexact) data at some point in time, can be an important fact, sometimes as important as the data item itself. Hence it is important to consider how this inexact data should be modelled and represented. Also, storage of extracted information can get complicated if there is varying data collected from different web sources about the same object (like the height of Mt. Everest or the mileage of Hamilton's McLaren-Mercedes), so it is important to consider how to corroborate these answers by scoring them (and possibly representing their uncertainty).
3.1 Modelling Uncertainty & Imprecision

Probabilistic databases have recently become a field of increasing interest. With plenty of work already done on ordinary relational databases, this relatively newer face of data storage and interpretation has been a cause for much excitement, mainly because it seems to be the most natural way to model information, especially digital information extracted by crawling billions of web pages made at varying points of time, many of them unreliable and old, like blogs and news reports. Hence, modelling uncertainty and imprecision attracts a large number of web-miners and an even larger amount of money.
3.1.1 ExDB
ExDB [MJC07], or Extraction Database, is a structured web query system which uses IE systems to extract data, schema and constraint information from the web, and models them as probabilistic tuples to account for the inevitably flawed extraction. It tries to structure all the web's unstructured data, store it well, and make it easy to access and query. ExDB uses TextRunner's OIE (open information extraction): the system makes a single, data-driven pass over the entire corpus and extracts a large set of relational tuples without requiring any human input. It is domain-independent and self-supervised (meaning that relations are learnt on the fly and modelled probabilistically). ExDB also offers a language for describing probabilistic queries over this extracted probabilistic database.

Base-level concepts  Objects are the data values in the system, e.g. Bombay, table, pizza. Predicates are binary tables populated by pairs of objects, e.g. born-in(Aaditya, Chennai), drinks(mammal, water). Semantic types are unary tables populated by objects, e.g. city(Bombay), pianist(Elton). The system should also have predefined data types like integer, date, etc.
Relationships

- Synonyms denote two equivalent objects, predicates or types, e.g. Bombay and Mumbai refer to the same place, and drinks and drank refer to the same predicate. ExDB uses these to answer queries that don't exactly match the query text.

- Inclusion dependencies describe a subset relationship between predicates, e.g. invented could be a subset of discovered.

- Functional dependencies are useful for queries with negations, to explain why a particular object is not an answer to a query.
Imprecision & Uncertainty

- (Probabilistic Data and Schema) Tuples in the predicate tables and in the semantic type tables are probabilistic: each tuple in these relations has a value 0 <= p <= 1 describing the probability that the tuple is in or out.

- (Synonyms) Predicate, type, and object synonyms are also probabilistic; there is always a chance that Einstein and A. Einstein refer to different entities. The predicates invented and created are only partly equivalent. Items in each synonym set are present with some value 0 <= p <= 1.

- (Probabilistic Constraints) Probabilistic inclusion dependencies are similar to probabilistic predicate synonyms that are applied only "one-way". Probabilistic FDs allow us to make statements about the data with less than full confidence, which is important given that the managed data is not curated; the extraction set may include obscure or confusing cases that violate a constraint that is generally true.
Points to note  It is important to note again the difference between the database query paradigm deployed by ExDB and the keyword-in, documents-out paradigm of search engines. In ExDB the granularity of the data is a concept (a word or short phrase), not a document. Oftentimes a query's answer is obtained by joining multiple facts that have been extracted from multiple, unrelated Web pages. Search engines make no attempt to integrate information across the page boundary.
3.1.2 Trio
Trio [Wid06] is a new database system that manages not only data, but also the accuracy and lineage of the data. All of these may be queried independently or simultaneously.
Features

- Data values may be inexact (approximate, uncertain or incomplete) and expressed as a range of values or with a confidence.

- Queries operate over inexact data and return answers that may themselves be inexact.

- Lineage is an integral part of the model; it captures updates, program-based derivations, bulk data loads, and imports of data from outside sources.

- Accuracy may be queried (say, "within 1%" or "> 98%") together with lineage ("imported on 4/1/08") as a combination as well.

- Lineage can be used to enhance data modifications; changes in the accuracy of data can be made.

- Trio is NOT a comprehensive temporal DBMS, a DBMS for semi-structured data, the last word in managing lineage or inexact data, or a federated/distributed system.
Data  The basic data in TDM follows the standard relational model: a database is a set of uniquely-named relations. The schema of each relation is a set of typed attributes, uniquely named within the relation. An instance of a relation is a set of tuples, each containing a value (or NULL) for each of the attributes. An instance of a database consists of an instance for each relation in the database.
Accuracy  Inaccuracy of the data in a Trio database instance may occur at the attribute level, tuple level, and/or relation level.

1. Atomic values: An attribute value may be an approximation of some (unknown) exact value. It may be:

   (a) An exact value.
   (b) A set of possible values, each with an associated probability in the range [0,1], such that the probabilities sum to less than or equal to 1.
   (c) Endpoints for a range of possible values, when values are drawn from an ordered and possibly continuous domain (e.g., integer, float, date). The basic Trio model assumes a uniform probability distribution across values in a minimum/maximum range.
   (d) A Gaussian distribution over a range of possible values (again assuming an ordered domain), denoted by a mean/standard-deviation pair.

2. Tuples: A tuple may have an associated confidence, typically indicating the likelihood that the tuple is actually in the relation. By default each tuple has confidence = 1. As a shortcut we also permit relation-level confidence: a relation with confidence = c is equivalent to a relation whose tuples all have confidence = c.

3. Missing Data: A relation may have an associated coverage, a value in the range [0,1], typically indicating how much of the correct relation is likely to actually be present. By default, all relations have coverage = 1.
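The four categories of atomic-value approximation can be modelled as simple value classes. This is a sketch of our own devising (Trio's internal encoding differs); each variant reports an expected value so that a query processor could fall back to a point estimate.

```python
# Sketch of Trio-style attribute-level uncertainty as four value classes.
class Exact:
    """Category (a): an exact value."""
    def __init__(self, v): self.v = v
    def expected(self): return self.v

class Alternatives:
    """Category (b): possible values with probabilities summing to <= 1."""
    def __init__(self, pairs):
        assert sum(p for _, p in pairs) <= 1 + 1e-9
        self.pairs = pairs
    def expected(self):
        return sum(v * p for v, p in self.pairs)

class UniformRange:
    """Category (c): endpoints of a range, uniform distribution assumed."""
    def __init__(self, lo, hi): self.lo, self.hi = lo, hi
    def expected(self): return (self.lo + self.hi) / 2

class Gaussian:
    """Category (d): a mean/standard-deviation pair."""
    def __init__(self, mean, stddev): self.mean, self.stddev = mean, stddev
    def expected(self): return self.mean
```

Note that for Alternatives the probabilities may sum to less than 1, leaving residual mass for "no value at all", which is exactly the flexibility TDM calls for.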
Lineage  In general, the lineage of data describes how the data came into existence and how it has evolved over time. When updates occur, new data values are inserted in the database while old values are "expired" but remain accessible in the system. Similarly, deleted data is expired but not actually expunged. The lineage relation has the following schema: Lineage(tupleID, derivation-type, time, how-derived, lineage-data).
Advantages

1. Historical lineage: If a data item I is derived from data D at time T, and D is subsequently updated, we can still obtain I's original lineage data from the expired portion of the database.

2. Phantom lineage: As part of lineage we may be interested in explaining why certain data is not in the database.

3. Versioning: The no-overwrite approach enables Trio to support at least some level of versioning.

When t was derived  Tuple t was either derived now, because t is the result of a function defined over other data in the database (e.g., t is part of a view), or t was derived at a given time in the past, which we refer to as a snapshot derivation.
How t was derived

- Query-based: t was derived by a TriQL (or SQL) query.

- Program-based: t was derived as the result of running a program, which may have accessed the database, and whose results were added to the database.

- Update-based: t was derived as the result of modifying a previous database tuple (by a query or a program).

- Load-based: t was inserted as part of a bulk data load.

- Import-based: t was added to the database as part of an import process from one or more outside data sources.
What data was used to derive t

- For program-based derivations, the lineage data may be unknown; it may be a list of relations accessed by the program (possibly zero-length), it may be the contents of those relations at the time of derivation, or it may be the entire database at the time of derivation if which relations were accessed is not known.

- For update-based derivations, the lineage obviously includes the previous value of the updated tuple. In addition, if the update was precipitated by a TriQL update statement or a program, the lineage may include that of a query-based or program-based derivation.

- For load-based derivations, the lineage data may be unknown, or it may be the contents of one or more load files at the time of derivation.

- For import-based derivations, the lineage data may be unknown, or it may be the entire set of imported data, as in load-based derivations.
3.2 Querying & Query Processing

Representation of data is not nearly enough. If the end user is going to have a good searching experience, the querying mechanism and processing should be flexible and easy to use. Depending on which representation is chosen, different types of issues arise, some of which are discussed in the following subsections.
3.2.1 Probabilistic Queries in ExDB

ExDB queries [MJC07] can constrain variables to particular types and can be composed of multiple clauses. For example:

  q(?x, ?y, ?z) :- invented(<scientist> ?x, ?y), died-in(<year> ?x, ?z), (?z < 1900)

Queries that contain no projections are fairly straightforward. An ExDB query involves a series of joins against tables in the Web Data Model. The probability of a joined tuple is the product of the local tuples' probabilities. We are usually satisfied with just the top-k tuples (as ranked by probability), so we use top-k queries to try to obtain results as quickly as possible (a topic which we will explore soon).
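The product rule for joined tuples can be sketched as a naive nested-loop join over two probabilistic tables. The table layout (dicts mapping tuples to probabilities) and the function name are our own assumptions for illustration; ExDB's execution engine is far more sophisticated.

```python
# Sketch of probabilistic join evaluation for a query like the example
# above: join invented(scientist, invention) with died-in(scientist, year),
# keep year < year_cutoff, and rank results by probability.
def prob_join(invented, died_in, year_cutoff=1900):
    """Each input table maps a tuple to its probability; a joined result's
    probability is the product of the local tuples' probabilities."""
    results = []
    for (sci, inv), p1 in invented.items():
        for (sci2, year), p2 in died_in.items():
            if sci == sci2 and year < year_cutoff:
                results.append(((sci, inv, year), p1 * p2))
    return sorted(results, key=lambda r: -r[1])
```

Multiplying probabilities implicitly assumes the joined tuples are independent, which is the simplest semantics one can give such a join.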
Projective Queries  Consider the query q(?s) :- invented(<scientist> ?s, ?i), which should rank scientists by the probability that the scientist invented something (the actual inventions are irrelevant). A scientist, say Tesla, appears in the output of q whenever a tuple invented(Tesla, I_0) is in the database. There may be many inventions I_1, ..., I_m such that invented(Tesla, I_i). Any of these is sufficient to return Tesla as an answer for q.

In ExDB, these tuples are only present probabilistically. The probability of the query should be the probability that any tuple invented(Tesla, I_i) is truly in the database. At web scale, one cannot hope to quickly perform such a huge disjunction. Also, many of the purported inventions might be erroneous and inaccurate (for reasons like idiosyncratic language). This is an issue, since a disjunction of a large number of low-probability extractions can result in a significant probability.
3.2.2 TriQL - Trio Query Language

TriQL [Wid06] extends the semantics of SQL queries so they can be issued against data that includes accuracy and lineage, and it adds new features to SQL for queries involving accuracy and lineage explicitly. Regardless of the formal model, one absolute premise of TriQL is closure: the result of a TriQL query (or sub-query) over a Trio database is itself in TDM, i.e., TriQL query results incorporate the same model of accuracy and lineage as Trio stored data.
Issues with Accuracy

- When we join two tuples, each with confidence < 1, what is the confidence of the result tuple?

- When we combine two relations with coverage < 1, what is the coverage of the result relation for the different combination operators (e.g., join, union, intersection)?

- When we aggregate values in tuples with confidence < 1, what is the form of our result?

- How do we aggregate values from multiple approximation categories (say, an attribute having an approximate value, from a tuple of a certain confidence, in a relation having coverage < 1)?
Accuracy & Lineage additions to SQL

- It should be possible to pose queries like "if <attribute> is exact", "if the <time> approximation is within 10 minutes" or "has a > 98% chance of being in the database".

- Lineage queries like "Identify the program and its parameters used to derive the region attribute in a summary-by-region table" or "Retrieve all tuples whose derivation includes data from relation ObsP for a specific participant P".

- Integrating the above two aspects should be possible, as in "Retrieve all summary data whose derivation query filters out approximate times with a > 10-minute approximation" or "Retrieve only data derived entirely from sightings in observation relations with confidence >= 98%".

- Update propagation should take place. If we upgrade the confidence of an observation relation ObsP based on new information about participant P, data derived using ObsP should have its accuracy updated automatically.
3.3 Top-k Queries

The question of top-k queries often arises in data-exploration, decision-making and data-cleaning scenarios. A traditional top-k query returns the k objects with the maximum scores based on some scoring function. In an uncertain world, however, we wish to define two types of top-k queries:
U-Topk  Let D be an uncertain database with possible worlds space W = {W_1, W_2, ..., W_n}. Let T = {T_1, T_2, ..., T_m} be a set of k-length tuple vectors, where for each T_i in T: tuples of T_i are ordered according to scoring function F, and T_i is the top-k answer for a non-empty set of possible worlds W(T_i) subset of W. A U-Topk query, based on F, returns T* in T, where

  T* = argmax_{T_i in T} ( sum_{w in W(T_i)} Pr(w) )

A U-Topk query answer is a tuple vector with the maximum aggregated probability of being the top-k across all possible worlds. This type of query fits scenarios where we need to restrict all top-k tuples to belong together to the same world(s).
U-kRanks  Let D be an uncertain database with possible worlds space W = {W_1, W_2, ..., W_n}. For i = 1...k, let x_i^1, ..., x_i^m be a set of tuples, where each tuple x_i^j appears at rank i in a non-empty set of possible worlds W(x_i^j) subset of W based on scoring function F. A U-kRanks query, based on F, returns x_i*, i = 1...k, where

  x_i* = argmax_{x_i^j} ( sum_{w in W(x_i^j)} Pr(w) )
A U-kRanks query answer is a set of tuples that might not together form the most probable top-k vector. However, each tuple is a clear winner at its rank over all worlds. This type of query fits data-exploration scenarios, where the most probable tuples at the top ranks are required without restricting them to belong to the same world(s).
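The two semantics can be contrasted with a brute-force sketch over an explicit possible-worlds space. This is purely illustrative code of our own; practical algorithms avoid enumerating worlds. Each world is given as a (probability, ranked tuple list) pair, with the list already sorted by the scoring function F.

```python
# Brute-force U-Topk and U-kRanks over an enumerated possible-worlds space.
from collections import defaultdict

def u_topk(worlds, k):
    """Return the k-vector with maximum aggregated probability of being
    the top-k, together with that probability."""
    agg = defaultdict(float)
    for pr, ranked in worlds:
        agg[tuple(ranked[:k])] += pr
    return max(agg.items(), key=lambda kv: kv[1])

def u_kranks(worlds, k):
    """Return, for each rank i = 1..k, the tuple most likely to sit at
    rank i across all worlds (winners may come from different worlds)."""
    winners = []
    for i in range(k):
        agg = defaultdict(float)
        for pr, ranked in worlds:
            if i < len(ranked):
                agg[ranked[i]] += pr
        winners.append(max(agg.items(), key=lambda kv: kv[1])[0])
    return winners
```

Note that on the same input the two queries can disagree: the U-kRanks winners need not be the top-k of any single world, which is precisely the distinction drawn above.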
Points to note  Interesting and efficient algorithms for these queries have been proposed in [MSC07] and [KYK08]; they will not, however, be covered here. Work on top-k query processing algorithms has focused on adaptively reducing the amount of processing done by query evaluation techniques, by ignoring data that would not be useful in identifying the best answers to a query. Standard techniques consider each answer individually and hence can easily identify the maximum possible score of an answer. However, in the case of corroboration, the score of an answer can always increase, hence answers cannot be discarded easily.
3.4 Corroborating Answers

Even though corroboration was initially introduced to find accurate and believable answers to user queries, some of its methods can also be used at the IE stage itself: there are various ways to assign a score to a particular value/answer, and we can relate probabilities to this score as well. We shall look at the highly simplistic, but practically effective, scoring functions discussed in [WM07].
Relevance score of a page

Importance  We consider that a page highly ranked by the search engine provides better information than one with a lower rank. The importance of the (i+1)th page is the product of the importance of the ith page and (1 - a), where a represents the speed at which the importance of a web page drops as we traverse the search engine result list, quantifying the decrease in importance of lower-ranked pages. For a page p with rank r(p) as returned by the search engine, the importance of p is

  I(p) = (1 - a)^(r(p) - 1)
Duplication  Since we are looking into different web pages and trying to corroborate answers, duplicated pages, which tend to have similar content, should not carry as much weight in the corroboration as pages that contain original content. However, they should still be taken into account as corroborative evidence. A possible solution is to halve the score of a page each time duplication is detected. Therefore, if for a given page there exist d_m pages sharing the same domain and d_c pages generated from copy/paste information, then we have:

  D(p) = 1 / 2^(d_m + d_c)

Therefore, the relevance score of a page is:

  s(p) = I(p) * D(p) = (1 - a)^(r(p) - 1) / 2^(d_m + d_c)
Score of answers within a page p

The number of different answers  The score of each answer is computed as the score of the page s(p) divided by the number of answers extracted, N(p). In other words, as the number of answers extracted within one page grows, the score assigned to each answer decreases, assuming that answers within the same page are equally relevant. Therefore the score of each extracted answer is as follows:

  s(x, p) = s(p) / N(p)
Prominence of answers  All answers extracted from the same page are rarely equally helpful in answering queries. There may be incorrect answers extracted from the page that are irrelevant to the query. We assume that in most cases the relevant answers will be close to the query subject in the web page. Assuming the nearest answer is at a distance of d_min characters and a general answer x is at a distance d_x from the query subject on the same page, the score is adjusted as:

  s(x, p) = s(p) / N(p) * d_min / d_x

Therefore, we assign a score for an answer x from a page p as:

  s(x, p) = 1 / N(p) * d_min / d_x * (1 - a)^(r(p) - 1) / 2^(d_m + d_c)
Score of an answer x Finally, the score of an answer x (considering the first n pages) is given
by

score(x) = Σ_{i=0}^{n} s(x, p_i)
Comments The IE techniques used here are very basic and can be improved by using struc-
tural information within the page, as done by systems like Snowball, TextRunner and KnowItAll.
Also, work on top-k query algorithms with cross-document scoring needs to be done before they
can be applied in these cases.
3.5 SphereSearch
The SphereSearch Engine [JGW05] provides unified ranked retrieval over heterogeneous XML
and web data. Its search capabilities include vague structural conditions, text content conditions,
and relevance ranking based on IR statistics and statistically quantified ontological relationships.
Features
• It is simpler than existing query languages for XML like XPath or XQuery, but it provides
ranked retrieval with support for concept-, context- and abstraction-aware search.
• It handles XML and Web data uniformly by automatically converting HTML data into
XML (HTML2XML), with heuristics and the use of linguistic tools for named-entity recog-
nition and to generate semantically meaningful XML tags (GATE - General Architecture
for Text Engineering).
• It extends XML-IR techniques to arbitrary graphs, with XPath-style search conditions
across document/page boundaries and a scoring/ranking model that reflects the compact-
ness of a matching subgraph.
What's dierent?Existing retrieval systems consider only the content of a element (docu-
ment) itself to assess its relevance for a query,often using a scoring model that gives high scores
to elements where the keywords in the query appear frequently.In SphereSearch,this type of
score is provided by the node score of a node.For a condition t expressed in terms of K,the
node score ns(n;t) of node n is dened in terms of exp(K) of all terms that are similar to K us-
ing the ontology,and sim(K;x) which is the ontology based similarity of K to another termx as:
max
x2exp(K)
sim(K;x)  ns(n;x)
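Under the assumption of a toy ontology expansion table and a simple term-frequency node
score (both made up for this example, not taken from SphereSearch), the expansion formula
can be sketched as:

```python
# Illustrative ontology-expanded node score:
# ns(n, K) = max over x in exp(K) of sim(K, x) * ns_exact(n, x).
# The expansion table, similarity values and tf-style base score are assumptions.
EXPANSIONS = {"car": ["car", "automobile", "vehicle"]}   # exp(K)
SIMILARITY = {("car", "car"): 1.0,
              ("car", "automobile"): 0.9,
              ("car", "vehicle"): 0.6}                   # sim(K, x)

def exact_node_score(node_terms, term):
    """Hypothetical base score: relative frequency of the term in the node."""
    if not node_terms:
        return 0.0
    return node_terms.count(term) / len(node_terms)

def node_score(node_terms, k):
    """ns(n, K) = max_{x in exp(K)} sim(K, x) * ns(n, x)."""
    return max(SIMILARITY[(k, x)] * exact_node_score(node_terms, x)
               for x in EXPANSIONS[k])
```

A node containing "automobile" thus still matches a query for "car", at a score discounted
by the ontological similarity of the two terms.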
But in the presence of linked documents, and when content is spread over several documents,
this sort of scoring isn't enough. Instead, we want to promote the scores of elements where the
requested keyword appears frequently in the content of other elements in their neighbourhood.
For such context-aware scoring, the concept of a sphere is introduced - the set of nodes at a fixed
distance from a center node. The sphere score of a node is aggregated from the nodes in spheres
around it, with less weight given to nodes in outer (larger) spheres.
We dene S
d
(n) as the set of all nodes whose distance to n is d.The sphere score with
respect to a keyword condition t is dened as:
s
d
(n;t) = 
v2S
d
(n)
ns(v;t)
and the sphere score of n with respect to t is defined in terms of a sphere size limit D and a
fractional damping factor α as:

s(n, t) = Σ_{i=1}^{D} s_i(n, t) · α^i
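As an illustration, the sphere construction and damped aggregation above can be sketched
as follows. The graph, the precomputed node scores and the damping value are assumptions
made for the example, not part of SphereSearch itself.

```python
# Illustrative sphere-score computation on an undirected graph.
# Node scores ns(v, t) are assumed precomputed and passed in as a dict.
from collections import deque

def spheres(graph, n, max_d):
    """S_d(n): nodes at shortest-path distance d from n, for d = 0..max_d (BFS)."""
    dist = {n: 0}
    queue = deque([n])
    layers = {d: set() for d in range(max_d + 1)}
    layers[0].add(n)
    while queue:
        u = queue.popleft()
        if dist[u] == max_d:
            continue  # do not expand beyond the sphere size limit
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                layers[dist[v]].add(v)
                queue.append(v)
    return layers

def sphere_score(graph, node_scores, n, max_d, alpha):
    """s(n, t) = sum_{i=1..D} s_i(n, t) * alpha^i,
    where s_i(n, t) sums ns(v, t) over all v in S_i(n)."""
    layers = spheres(graph, n, max_d)
    return sum((alpha ** i) * sum(node_scores.get(v, 0.0) for v in layers[i])
               for i in range(1, max_d + 1))
```

Nodes one hop away contribute with weight α, two hops away with weight α², and so on,
which is exactly the "less weight to outer spheres" behaviour described above.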
Several other ranking methods, like EntityRank, now employ cross-document ranking
schemes. However, many loose ends have been left untouched, and further work can be
done in this area.
Chapter 4
Conclusion
The entire process of web mining can be broken down, roughly, into three parts. As we saw in
Chapter 2, information extraction has been an issue since the very beginning - search engines
and other such applications were required to extract digital information from terribly huge
amounts of unstructured data like text and provide some element of structure. Once that is
done, the issue of effectively representing the extracted information was bound to arise.
Recently, as was covered in depth in the first part of Chapter 3, when the issue of uncertainty
and unreliable information came about, probabilistic representations started to appear, bringing
with them their own set of issues. Obviously, from the end user's point of view, extraction
and representation were never the problem. The latter part of Chapter 3 covered some of the
queries we could make, their semantics and flexibility, along with the issue of ranking pages and
going beyond single-document to cross-document ranking schemes.
Now we shall look at some of the areas which could still be of genuine interest for further
research. Finally, we will see a survey of some interesting tools and software that have caused
a few waves in the content management and semantic searching user-driven world.
4.1 Future Work
There is work to be done at all stages of the "pipeline".
• The so-called "self-supervised" extractors and annotators actually work on heuristics and
basic training data. The validity of the probabilities assigned to the training data, as well
as the heuristics used, are without any justification.
• In the area of representation and management of uncertain databases, the issue of joining
relations where attributes, tuples and relations have fractional approximation, confidence
and coverage has not been satisfactorily dealt with. Also, aggregations based on these
uncertain elements could be a subject of discussion.
• Changes made to the probabilities of uncertain elements need to be propagated through
the database into those relations which were derived from them - this is not easy!
• The kind of flexibility and options available to the user making queries on uncertain data,
and the processing of these queries, have been handled differently in different systems, each
having its pros and cons.
• Ranking, which has until now been based on single-document methods, is shifting to more
reliable and intuitive cross-document querying, with interesting proposals coming up for
each.
• Finally, we would like to put together several systems, each dealing with a different com-
ponent of the pipeline, and analyze the overall performance.
4.2 Acknowledgements
I would like to thank my guide, Prof. Soumen Chakrabarti, for the consistent direction he has
given my work and for the insightful and entertaining meetings.
Appendix A
Survey of Tools
A.1 Lucene
Open-source Java search engine It aims to fetch billions of pages per month, maintain
an index of those pages, and return high-quality (with transparent ranking computation) results
to over 1000 searches per second at minimal cost. An index refers to a sequence of documents
(a.k.a. a directory), a document is a sequence of fields, a field is a named sequence of terms,
and a term (which is what matters for searching) is a text string.
Its analyzers are fairly interesting - here is a list of analyzers and what they do when
analyzing a sentence like "The quick brown fox jumps over the lazy dog.":
• Whitespace Analyzer - Simplest built-in analyzer ("the lazy dog." becomes [the] [lazy] [dog.])
• Simple Analyzer - Lowercases and splits at non-letter boundaries ([dog.] becomes [dog])
• Stop Analyzer - Lowercases and removes stop words (removes [the])
• Snowball Analyzer - Applies a stemming algorithm ([jumps] becomes [jump])
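To make the differences concrete, here is a toy re-implementation of the analyzer chain in
Python. This is not Lucene's actual code: the stop-word list and the crude suffix-stripping
stand-in for the Snowball stemmer are assumptions made purely for illustration.

```python
# Toy sketch of the four Lucene analyzers listed above (not Lucene itself).
STOP_WORDS = {"the", "over", "a", "an"}  # made-up, minimal stop-word list

def whitespace_analyzer(text):
    """Split on whitespace only; punctuation and case are preserved."""
    return text.split()

def simple_analyzer(text):
    """Lowercase and split at non-letter boundaries."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

def stop_analyzer(text):
    """Like the simple analyzer, but drop stop words."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

def crude_stem(token):
    """Very rough stand-in for a Snowball stemmer: strip a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def snowball_like_analyzer(text):
    """Stop analysis followed by stemming."""
    return [crude_stem(t) for t in stop_analyzer(text)]
```

Running the example sentence through the chain shows each stage's effect: the whitespace
analyzer keeps [dog.] with its period, the simple analyzer lowercases and drops it, the stop
analyzer removes [the], and the stemming stage reduces [jumps] to [jump].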
There are several kinds of queries as well: TermQuery (find by key), RangeQuery (text,
date or numeric ranges), BooleanQuery (combine queries into expressions), PrefixQuery, Wild-
cardQuery, FuzzyQuery, etc. There is a QueryParser which turns human-readable expressions
into query instances. Lucene hopes to jump to a much larger scale and be an open-source
search engine that can compete with Google.
A.2 Joomla!
Joomla! is a free, open-source framework and content publishing system designed for quickly
creating highly interactive multi-language Web sites, online communities, media portals, blogs
and e-commerce applications.
Yet another CMS? With a fully documented library of developer resources, Joomla! allows
the customization of every aspect of a Web site, including presentation, layout, administration,
and rapid integration with third-party applications, making it one of the best CMS packages
available today.
Features
• FTP Layer: The FTP Layer allows file operations (such as installing Extensions or updat-
ing the main configuration file) without having to make all the folders and files writable.
• The use of MySQL, PHP, the Apache web server, and other such open-source software.
• Business or organizational directories, banner advertising systems
• Document management, image and multimedia galleries, dynamic form builders
• E-commerce and shopping cart engines, paid subscription services
• Forums and chat software, shout-box, calendars, polls, login (soon LDAP), e-mail newslet-
ters, etc.
A.3 Twine
Twine is a new service that intelligently helps you share, organize and find information with
people you trust. In Twine you can safely share information and knowledge, and collaborate
around common interests, activities and goals. It is primarily aimed at groups of people working
as collaborative teams, and it combines social networking capabilities with wiki and database
capabilities.
Features
• Twine is smart. It looks at data and auto-tags it. It then tries to understand it and build
a semantic graph from it.
• There is a privacy model which maintains access information about different users and
makes sure you see only what you have access to.
• There are APIs to take data in and out of Twine freely - either author articles or
include web content, statically or dynamically.
 Twine builds a semantic prole for every user - it builds up a relationship between yourself
and other entities based on the information you or others have shared.
 Every member of twine gets an email address.Twine processes email contents,learns more
about the person who sends them or the topic itself.
 Eventually they plan to suck your RSS feeds,your bookmarks (say from refox),your
Desktop les,etc.into Twine.
 After learning enough about you,and the semantic prole is rich enough,the system can
start making recommendations,at a user or group level.
 (Search) If you search for some topic,Twine will search for it in the semantic subgraph
you have access to.Ranking is done depending on how the content is related to you,to
the people connected to you or to your interests.
18
Bibliography
[AG00] Eugene Agichtein and Luis Gravano. Snowball: Extracting relations from large plain-
text collections. In Proceedings of the Fifth ACM International Conference on Digital
Libraries, 2000.
[Bri98] Sergey Brin. Extracting patterns and relations from the world wide web. In
WebDB Workshop at the 6th International Conference on Extending Database Tech-
nology, EDBT'98, 1998.
[DEG+03] Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran,
Tapas Kanungo, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Ja-
son Y. Zien. SemTag and Seeker: bootstrapping the semantic web via automated
semantic annotation. ACM Press, 2003.
[JB01] Raymond J. Mooney and Razvan Bunescu. Mining knowledge from text using infor-
mation extraction. SIGKDD, 2001.
[JGW05] J. Graupmann, R. Schenkel, and G. Weikum. The SphereSearch engine for unified
ranked retrieval of heterogeneous XML and web documents. VLDB, 2005.
[KYK08] K. Yi, F. Li, G. Kollios, and D. Srivastava. Efficient processing of top-k queries in
uncertain databases. ICDE, 2008.
[MBE07] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and
Oren Etzioni. TextRunner: Open information extraction on the web. In Proceedings
of Human Language Technologies: The Annual Conference of the North American
Chapter of the Association for Computational Linguistics (NAACL-HLT). Associa-
tion for Computational Linguistics, 2007.
[MJC07] Michael J. Cafarella, Christopher Re, Dan Suciu, Oren Etzioni, and Michele Banko.
Structured querying of web text. CIDR, 2007.
[MSC07] M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Top-k query processing in uncertain
databases. ICDE, 2007.
[Wid06] J. Widom. Trio: A system for integrated management of data, accuracy and lineage.
VLDB, 2006.
[WM07] M. Wu and A. Marian. Corroborating answers from multiple web sources. WebDB,
2007.