KnowItNow: Fast, Scalable Information Extraction from the Web


Dec 4, 2013



Michael J. Cafarella, Doug Downey, Stephen Soderland, Oren Etzioni

The Problem


Numerous NLP applications rely on search-engine queries to:


Extract information from the web.


Compute statistics over the Web corpus.



Search engines are extremely helpful for several
linguistic tasks such as:


Computing usage statistics.


Finding a subset of web documents to analyze in
depth.


Problem With Search Engines


Search engines were not designed as building blocks for NLP
applications. As a result:



An NLP application is forced to issue literally millions of
queries to search engines, which increases processing time and
limits scalability.



Fetching web documents is also time-consuming.



Search engines limit the use of programmatic
queries to their engines:


Google has placed hard quotas on the number of daily queries a
program can issue.


Other engines force applications to introduce “courtesy waits” between
queries.

Example of the Problem: KnowItAll


KnowItAll works in a generate-and-test
architecture, extracting information in 2 stages:


First, it uses a small set of domain-independent
extraction patterns to generate candidate facts.


Second, it automatically tests the plausibility of the
candidate facts it extracts using pointwise mutual information
(PMI) statistics computed from search-engine hit counts.

1st Stage in KnowItAll


Take the generic pattern “NP1 such as
NPList2”.


This indicates that the head of each simple
noun phrase (NP) in NPList2 is a member of
the class named in NP1.


Take as example the pattern for class City, and
the sentence “We provide tours to cities such as
Paris, London, and Berlin.”


KNOWITALL extracts three candidate cities from
the sentence: Paris, London, Berlin.
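
The first stage can be illustrated with a small sketch. This is not KnowItAll's actual extractor: it assumes a regular expression in place of a real noun-phrase chunker, and the function name `extract_candidates` and the "capitalized words" test for proper nouns are illustrative choices only.

```python
import re

def extract_candidates(sentence: str, class_plural: str) -> list[str]:
    """Return candidate instances that follow '<class_plural> such as ...'."""
    pattern = re.compile(
        rf"\b{re.escape(class_plural)}\s+such\s+as\s+([^.;!?]+)", re.IGNORECASE
    )
    match = pattern.search(sentence)
    if not match:
        return []
    # Split the list on commas and on a final "and"/"or".
    parts = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", match.group(1))
    # Assumption: a candidate is a run of capitalized words (toy proper-noun test).
    proper = re.compile(r"^[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*$")
    return [p.strip() for p in parts if proper.match(p.strip())]

candidates = extract_candidates(
    "We provide tours to cities such as Paris, London, and Berlin.", "cities"
)
print(candidates)
```

For the example sentence this yields the three candidate cities named above; each would still need to be verified in the second stage.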

2nd Stage in KnowItAll


KnowItAll needs to assess the likelihood of
the information it found.


Verify that Paris is actually a city.


It does that by computing the PMI between
Paris and a set of k discriminator phrases
that tend to have high mutual information
with city names (e.g. “Paris is a city”).


This requires at least k search-engine
queries for every candidate extraction!
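
A rough sketch of why this is expensive follows. It assumes the PMI score is computed as the ratio Hits(discriminator + candidate) / Hits(candidate); the hit counts in `FAKE_HITS` are made up, and `hit_count` is a hypothetical stand-in for a real search-engine call.

```python
# Hypothetical hit counts; a real system would query a search engine instead.
FAKE_HITS = {
    "Paris": 1000,
    "Paris is a city": 50,
    "city of Paris": 200,
}
queries_issued = []

def hit_count(query: str) -> int:
    """Stand-in for a search-engine hit count; records each query issued."""
    queries_issued.append(query)
    return FAKE_HITS.get(query, 0)

def pmi_scores(candidate: str, discriminators: list[str]) -> dict[str, float]:
    """Score a candidate as Hits(discriminator + candidate) / Hits(candidate)."""
    base = hit_count(candidate)  # one query for the candidate alone
    scores = {}
    for template in discriminators:
        joint = hit_count(template.format(candidate))  # one query per discriminator
        scores[template] = joint / base if base else 0.0
    return scores

scores = pmi_scores("Paris", ["{} is a city", "city of {}"])
print(scores)
print(len(queries_issued), "queries for one candidate")
```

With k = 2 discriminators the sketch issues three queries for a single candidate, so the query load grows with both k and the number of candidates.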

The Solution


A novel architecture for Information
Extraction which does not depend on Web
search-engine queries: KnowItNow.


Works in 2 stages, like KnowItAll:


Uses a specialized search engine called the Binding Engine
(or BE) which efficiently returns bindings in response to
variabilized queries.


Uses URNS, a combinatorial model, which estimates the
probability that each extraction is correct without using any
additional search engine queries

The Binding Engine vs. The Traditional Engine

The Traditional Engine


Take the search query (“Cities such as <NounPhrase>”).


Perform a traditional search engine query.


For each URL returned:


obtain the document contents.


find the searched-for terms in the document text.


Run the noun phrase recognizer to determine if
text found satisfies the linguistic type requirement


If it does, return the string.
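
The steps above can be sketched over a toy in-memory corpus. Everything here is an assumption for illustration: the URLs and texts in `CORPUS` are invented, `fetch` stands in for a slow disk or network access, and `noun_phrase` is a toy recognizer that accepts a run of capitalized words.

```python
import re

# Toy corpus standing in for the Web; URLs and text are made up.
CORPUS = {
    "http://example.com/a": "We fly to cities such as Paris and to many islands.",
    "http://example.com/b": "Big cities such as traffic jams? No. Cities grow.",
    "http://example.com/c": "Cheap hotels in cities such as Berlin are rare.",
}

def search(terms: list[str]) -> list[str]:
    """Step 1: return URLs of documents that contain every query term."""
    hits = []
    for url, text in CORPUS.items():
        tokens = set(re.findall(r"\w+", text.lower()))
        if all(t.lower() in tokens for t in terms):
            hits.append(url)
    return hits

def fetch(url: str) -> str:
    """Step 2: obtain the document contents (the slow part in a real system)."""
    return CORPUS[url]

def noun_phrase(text: str):
    """Toy recognizer: a run of capitalized words at the start of the text."""
    m = re.match(r"[A-Z]\w*(?:\s+[A-Z]\w*)*", text)
    return m.group(0) if m else None

def bindings(phrase: str) -> list[str]:
    results = []
    for url in search(phrase.split()):
        text = fetch(url)
        # Step 3: locate the searched-for terms, then test what follows them.
        for m in re.finditer(re.escape(phrase) + r"\s+", text, re.IGNORECASE):
            np = noun_phrase(text[m.end():])
            if np:
                results.append(np)
    return results

print(bindings("cities such as"))
```

Note that the second document matches the search but yields no binding, so its fetch is wasted work.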


Problems With Traditional Engine


The search itself doesn’t take a long time, even if
there are multiple search queries.



The second stage fetches a large number of
documents, each fetch likely resulting in a random
disk seek; this stage executes slowly.



This disk access is slow regardless of whether it
happens on a locally-cached copy or on a remote
document server.



The Binding Engine




Why not use a table to store a list of terms and
documents containing them?!




The Binding Engine supports these queries:


Typed variables (such as NounPhrase).


String-processing functions (such as “head(X)” or
“ProperNoun(X)”).


Standard query terms.


It processes a variable by returning every possible string in the
corpus that has a matching type, and that can be substituted for the
variable and still satisfy the user's query.

How Does the Binding Engine Work?


It uses a novel approach called the “neighborhood index”.


The neighborhood index is an augmented
inverted index structure.


For each term in the corpus, the index keeps a list
of documents in which the term appears and a list
of positions where the term occurs.


The index also keeps a list of left-hand and right-hand
neighbors at each position (adjacent text
strings that satisfy a recognizer, e.g. NounPhrase).
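
A minimal sketch of the idea, not the Binding Engine's actual data layout: each posting stores the document, the position, and the neighboring strings that satisfy a toy NounPhrase recognizer (a run of capitalized tokens). The names `build_index` and `query` are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text)

def right_np(tokens, i):
    """Toy NounPhrase recognizer: capitalized tokens starting at index i."""
    j = i
    while j < len(tokens) and tokens[j][0].isupper():
        j += 1
    return " ".join(tokens[i:j]) or None

def left_np(tokens, i):
    """Toy NounPhrase recognizer: capitalized tokens ending just before index i."""
    j = i
    while j > 0 and tokens[j - 1][0].isupper():
        j -= 1
    return " ".join(tokens[j:i]) or None

def build_index(docs):
    """Map each term to postings of (doc_id, position, left, right neighbor)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        tokens = tokenize(text)
        for pos, tok in enumerate(tokens):
            entry = (doc_id, pos, left_np(tokens, pos), right_np(tokens, pos + 1))
            index[tok.lower()].append(entry)
    return index

def query(index, terms):
    """Bindings for '<terms> <NounPhrase>' read from the index alone."""
    terms = [t.lower() for t in terms]
    where = [{(d, p) for d, p, _, _ in index.get(t, [])} for t in terms]
    last = len(terms) - 1
    results = []
    for d, p, _, right in index.get(terms[-1], []):
        start = p - last
        # The terms must be adjacent and the last one needs a right neighbor.
        if right and all((d, start + k) in where[k] for k in range(len(terms))):
            results.append(right)
    return results

docs = [
    "We fly to cities such as Paris and New York.",
    "Cheap hotels in cities such as Berlin are rare.",
]
index = build_index(docs)
print(query(index, ["cities", "such", "as"]))
```

The recognizer runs only inside `build_index`; `query` never touches the document text, which is the property that removes the per-document fetch.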

How is The Binding Engine Better?







K is the number of concrete terms in the query.


B is the number of variable bindings found in the corpus.


N is the number of documents in the corpus.


Expensive processing such as part-of-speech tagging or shallow
syntactic parsing is performed only once, while building the index,
and is not needed at query time.

How is The Binding Engine Better?







Average time to return the relevant bindings
in response to a set of queries:


0.06 CPU minutes for BE.


8.16 CPU minutes for Nutch (private search engine).



Disadvantages of The Binding Engine


It consumes a large amount of disk space, as
parts of the corpus text are folded into the
index several times.



The neighborhood index required four times
the disk space of a standard inverted
index.

The URNS Model


We need a way to test that the extractions
from the Binding Engine are correct



KnowItAll issues queries to search engines
and uses the PMI model to verify extractions.



PMI is very effective, but it is also very slow.

How Does URNS Work?



URNS is a probabilistic model


It takes the form of a classic “balls-and-urns” model from combinatorics.



Each extraction is modeled as a
labeled ball in an urn.



A label represents either an instance of
the target class or relation, or
represents an error.

How Does URNS Work?


C - the set of unique target labels; |C| is the number
of unique target labels in the urn.


E - the set of unique error labels; |E| is the number of
unique error labels in the urn.


num(b) - the function giving the number of balls
labeled by b, where b ∈ C ∪ E.


num(B) - the multi-set giving the number of balls for
each label b ∈ B.

How Does URNS Work?


The goal of an IE system is to discern which of
the labels it extracts are in fact elements of C.


Given that a particular label x was extracted k times
in a set of n draws from the urn, what is the
probability that x ∈ C?
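
One way to answer this question is sketched below. It assumes draws with replacement, so each label's count follows a binomial likelihood, and that num(C) and num(E) are known; s denotes the total number of balls. The function name `prob_in_target` and the ball counts are illustrative, and estimating num(C) and num(E) in practice is outside this sketch.

```python
def prob_in_target(k: int, n: int, num_c: list[int], num_e: list[int]) -> float:
    """P(x in C | x was drawn k times in n draws), given ball counts per label.

    num_c and num_e list the number of balls for each target and error label.
    """
    s = sum(num_c) + sum(num_e)  # total number of balls in the urn

    def likelihood(r: int) -> float:
        # Probability pattern of seeing a label with r balls k times in n draws;
        # the binomial coefficient is omitted because it cancels in the ratio.
        p = r / s
        return p ** k * (1 - p) ** (n - k)

    target = sum(likelihood(r) for r in num_c)
    total = target + sum(likelihood(r) for r in num_e)
    return target / total if total else 0.0

# Hypothetical urn: two frequent target labels, five rare error labels.
num_c = [10, 10]
num_e = [1, 1, 1, 1, 1]
for k in (1, 3):
    print(k, prob_in_target(k, n=10, num_c=num_c, num_e=num_e))
```

In this toy urn the probability rises sharply with k, because target labels are repeated far more often than error labels.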


Alternative to URNS


Items that were extracted more often are
more likely to be true.


i.e. extractions with higher frequencies are assumed to be true.


Experiments


Recall:
how many distinct extractions does
each system return at high precision?


Time:
how long did each system take to
produce and rank its extractions?


Extraction Rate:
how many distinct high-quality extractions does the system return
per minute? The extraction rate is simply
recall divided by time.
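
As a quick illustration of the third metric, with made-up numbers rather than results from the experiments:

```python
def extraction_rate(recall: int, minutes: float) -> float:
    """Distinct high-quality extractions returned per minute."""
    if minutes <= 0:
        raise ValueError("time must be positive")
    return recall / minutes

# Hypothetical figures: 1200 distinct extractions produced in 4 minutes.
print(extraction_rate(recall=1200, minutes=4.0))
```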

KnowItNow vs. KnowItAll

[Figures: KnowItNow vs. KnowItAll, tested on “Country”, “CapitalOf”, “Corp”, and “CeoOf”.]

Contributions


A novel architecture for Information
Extraction which does not depend on Web
search-engine queries.


Extracts tens of thousands of facts from the
Web in minutes instead of days.


KnowItNow's extraction rate is two to three
orders of magnitude greater than KnowItAll's.