KnowItNow: Fast, Scalable Information Extraction from the Web

cowphysicistInternet and Web Development

Dec 4, 2013 (4 years and 5 months ago)


KnowItNow: Fast, Scalable Information
Extraction from the Web

Michael J. Cafarella, Doug
Downey, Stephen Soderland,
Oren Etzioni

The Problem

Numerous NLP applications rely on search
engine queries to:

Extract information from the web.

Compute statistics over the Web corpus.

Search engines are extremely helpful for several
linguistic tasks such as:

Computing usage statistics.

Finding a subset of web documents to analyze in

Problem With Search Engines

Search engines were not designed as building blocks for NLP
applications. As a result:

An NLP application is forced to issue literally millions of
queries to search engines; increasing processing time and
limiting scalability.

Fetching web documents is also time

Search engines are limiting the use of programmatic
queries to their engines

Google has placed hard quotas on the number of daily queries a
program can issue.

Other engines force applications to introduce “courtesy waits” between

Example of the Problem “KnowItAll”

KnowItAll works in a generate
architecture extracting Information in 2 stages:

First, it Utilizes a small set of domain independent
extraction patterns to
generate candidate

Second, it automatically
tests the plausibility of
candidate facts it extracts using
mutual information
(PMI) statistics computed from
engine hit counts.


Stage in KnowItAll

Take the generic pattern “NP1 such as

This indicates that the head of each simple
noun phrase (NP) in NPList2 is a member of
the class named in NP1.

Take as example the pattern for class City, and
the sentence “We provide tours to cities such as
Paris, London, and Berlin.”

KNOWITALL extracts three candidate cities from
the sentence: Paris, London, Berlin.


Stage in KnowItAll

KnowItAll needs to assess the likelihood of
the information it found.

Verify that Paris is actually a city.

It does that by computing the PMI between
Paris and a set of k
discriminator phrases
that tend to have high mutual information
with city names. (Paris is a city)

This requires at least k search
queries for every candidate extraction!

The Solution

A novel architecture for Information
Extraction which does not depend on Web
engine queries; KnowItNow.

Works over 2 stages like KnowItAll:

Uses a specialized search engine called the Binding Engine
(or BE) which efficiently returns bindings in response to
variabilized queries.

Uses URNS, a combinatorial model, which estimates the
probability that each extraction is correct without using any
additional search engine queries

The Binding Engine vs.

The Traditional Engine

The Traditional Engine

Take the search query (“Cities such as

Perform a traditional search engine query.

For each such URL:

obtain the document contents.

find the searched
for terms in the document text.

Run the noun phrase recognizer to determine if
text found satisfies the linguistic type requirement

If it does, return the string.

Problems With Traditional Engine

The search itself doesn’t take a long time. Even if
there are multiple search queries

The second stage fetches a large number of
documents, each fetch likely resulting in a random
disk seek; this stage executes slowly.

this disk access is slow regardless of whether it
happens on a locally
cached copy or on a remote
document server.

The Binding Engine

Why not use a table to store a list of terms and
documents containing them?!

The Binding Engine supports these queries:

yped variables
(such as

processing functions
(such as “head(X)” or

Standard query terms.

It processes a variable by returning every possible string in the
corpus that has a matching type, and that can be substituted for the
variable and still satisfy the user's query.

How the Binding Engine Works?

It uses a novel approach called the

“neighborhood index”

The neighborhood index is an augmented
inverted index structure.

For each term in the corpus, the index keeps a list
of documents in which the term appears and a list
of positions where the term occurs.

The index also keeps a list of left
hand and right
at each position. (Adjacent text
strings that satisfy a recognizer, e.g.

How is The Binding Engine Better?

is the number of concrete terms in the query.

is the number of variable bindings found in the corpus.

is the number of documents in the corpus.

Expensive processing such as part
speech tagging or shallow
syntactic parsing is performed only once, while building the index,
and is not needed at query time.

How is The Binding Engine Better?

Average time to return the relevant bindings

in response to a set of queries.

0.06 CPU minutes for

8.16 CPU minutes for Nutch (Private search engine)

Disadvantages of The Binding Engine

It consumes a large amount of disk space, as
parts of the corpus text are folded into the
index several times.

The neighborhood index increased disk
space four times that of a standard inverted

The URNS Model

We need a way to test that the extractions
from the Binding Engine are correct

KnowItAll issues queries to search engines
and uses the PMI model to verify extractions.

PMI is very efficient but it is also very slow.

How URNS works?

URNS is a probabilistic model

It takes the form of a classic “balls
urns” model from combinatorics.

Each extraction is modeled as a
labeled ball in an urn.

represents either an instance of
the target class or relation, or
represents an error

How URNS works?


the set of unique target labels; |C| is the number
of unique target labels in the urn.


the set of unique error labels; |E| is the number of
unique error labels in the urn.


the function giving the number of balls
labeled by b where b is a subset of C U E.

num(B) is the multi
set giving the number of balls for
each label b, where b is a subset of B.

How URNS works?

The goal of an IE system is to discern which of
the labels it extracts are in fact elements of C.

Given that a particular label
was extracted
in a set of
draws from the urn, what is the
probability that
x is a subset of C

Alternative to URNS

Items that were extracted more often are
more likely to be true.

i.e. Extractions with higher frequencies are true.


how many distinct extractions does
each system return at high precision?

how long did each system take to
produce and rank its extractions?

Extraction Rate:
how many distinct high
quality extractions does the system return
per minute? The extraction rate is simply
recall divided by time.

KnowItNow vs. KnowItAll

Tested on relation “Country”

KnowItNow vs. KnowItAll

Tested on relation “CapitalOf”

KnowItNow vs. KnowItAll

Tested on relation “Corp”

KnowItNow vs. KnowItAll

Tested on relation “CeoOf”

KnowItNow vs. KnowItAll


A novel architecture for Information
Extraction which does not depend on Web
engine queries.

Extract tens of thousands of facts from the
Web in minutes instead of days.

KnowItNow's extraction rate is two to three
orders of magnitude greater than KnowItAll's.