Distilling Structured Data

Oct 16, 2013


Information Extraction:
Distilling Structured Data
from Unstructured Text.

Andrew McCallum

Presented by

Lalit Bist


Information Extraction to the Rescue

A Tour of Example Applications

Mine the Text Directly

Information Extraction, the Web, and the Future

Information Extraction

It is the process of filling database
records from unstructured or loosely
formatted text.

Information extraction populates a
database from unstructured or loosely
structured text; data mining then
discovers patterns in that database.

Information extraction involves five
major subtasks.

Components of Information Extraction

Segmentation finds the starting and
ending boundaries of the text snippets
that will fill a database field.

Classification determines which
database field is the correct destination
for each text segment.

Components of Information Extraction

Association determines which fields belong together in the
same record.

It is sometimes referred to as relation extraction for
the case in which two entities are being associated.

Normalization puts information in a standard
format in which it can be reliably compared.

Deduplication collapses redundant information so that
there are no duplicate records in the database.
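
Normalization and deduplication can be sketched in a few lines of Python. This is a minimal illustration, not a method from the talk: the record fields (`name`, `phone`), the helper names, and the sample data are all invented for the example, and the phone rule assumes US-style numbers.

```python
import re

def normalize_phone(raw: str) -> str:
    """Normalization: reduce a US-style phone number to its digits."""
    digits = re.sub(r"\D", "", raw)
    # Assumption for this sketch: a leading "1" on an 11-digit number
    # is a country code and can be dropped.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def normalize_name(raw: str) -> str:
    """Normalization: collapse whitespace and lowercase for comparison."""
    return " ".join(raw.split()).lower()

def deduplicate(records):
    """Deduplication: keep the first record seen for each normalized key."""
    seen = {}
    for rec in records:
        key = (normalize_name(rec["name"]), normalize_phone(rec["phone"]))
        seen.setdefault(key, rec)
    return list(seen.values())

# Hypothetical extracted records; the first two describe the same person.
records = [
    {"name": "Jane  Doe", "phone": "(413) 555-0100"},
    {"name": "jane doe", "phone": "1-413-555-0100"},
    {"name": "John Roe", "phone": "413.555.0199"},
]
unique = deduplicate(records)
```

Only after both fields are put in a standard format do the first two records compare as equal, which is why normalization comes before deduplication.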

Example Applications

A job search website:

It claimed to have twice as many job openings
in its database as monster.com.

It automatically extracted its job openings
directly from more than 60,000 company websites.

Another application extracts information about people
on the web, creating cross-referenced records of names,
job titles, employment histories, and educational
backgrounds.

Example Applications

CiteSeer.org extracts citation
information from academic research
papers, including the paper’s title,
publication venue, year, etc.

Verity.com: MeciClaim can extract
various fields from medical insurance
claim forms, enabling semi-automated
processing and faster throughput.

How do they do it?

By writing regular expressions or
hand-tuned programmed rules.

The rules draw on the words, word order,
and grammar of the text.

Statistical and machine-learning methods:
these are methods that
automatically tune their own rules or
parameters to maximize performance
on a set of example texts that have
been correctly labeled by hand.
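
As an illustration of the hand-written regular expression approach, here is a small sketch that segments a citation string into fields. The pattern, the field names, and the `extract_citation` helper are assumptions made for this example; the pattern only handles citations shaped like "Authors. Year. Title. Venue." and would need hand-tuning for other styles.

```python
import re

# Hand-tuned pattern for citations of the form:
#   Authors Year. Title. Venue
CITATION = re.compile(
    r"^(?P<authors>.+?)\s+"          # shortest prefix before the year
    r"(?P<year>(?:19|20)\d{2})\.\s+"  # four-digit year followed by a period
    r"(?P<title>[^.]+)\.\s*"          # title runs up to the next period
    r"(?P<venue>.*)$"                 # whatever remains is the venue
)

def extract_citation(text):
    """Return a dict of fields, or None if the pattern does not match."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None

fields = extract_citation(
    "Lafferty, J., McCallum, A., and Pereira, F. 2001. "
    "Conditional random fields: Probabilistic models for segmenting "
    "and labeling sequence data. Proceedings of the ICML."
)
```

A single regular expression performs segmentation and classification at once: the group boundaries are the segment boundaries, and the group names are the database fields.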

How do they do it?

Statistical model: HMM (hidden Markov model)

A finite-state machine with probabilities on the state
transitions and probabilities on the per-state word
emissions.

Widely used in the 1990s for extraction from English.

States of the machine are assigned to different
database fields, and the highest-probability state
path associated with a sequence of words indicates
which subsequences of the words belong to those
database fields.
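
Finding the highest-probability state path is usually done with the Viterbi algorithm. The sketch below is a toy under stated assumptions: the three states (`city`, `state`, `zip`), every probability, and the `unk` value for unseen words are invented by hand for illustration, whereas a real extractor would estimate them from labeled text.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the highest-probability state path for a word sequence."""
    if not words:
        return []
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{}]
    for s in states:
        emit = emit_p[s].get(words[0], unk)
        best[0][s] = (math.log(start_p[s]) + math.log(emit), None)
    for t in range(1, len(words)):
        best.append({})
        for s in states:
            emit = math.log(emit_p[s].get(words[t], unk))
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = (score + emit, prev)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Hand-set toy model; each state corresponds to one database field.
states = ["city", "state", "zip"]
start_p = {"city": 0.8, "state": 0.1, "zip": 0.1}
trans_p = {
    "city": {"city": 0.3, "state": 0.6, "zip": 0.1},
    "state": {"city": 0.1, "state": 0.1, "zip": 0.8},
    "zip": {"city": 0.4, "state": 0.3, "zip": 0.3},
}
emit_p = {
    "city": {"amherst": 0.5, "new": 0.25, "york": 0.25},
    "state": {"ma": 0.5, "ny": 0.5},
    "zip": {"01003": 0.5, "10001": 0.5},
}

path = viterbi(["new", "york", "ny", "10001"], states, start_p, trans_p, emit_p)
# path == ["city", "city", "state", "zip"]
```

Consecutive words with the same state form one segment, so here "new york" fills the city field, "ny" the state field, and "10001" the zip field.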

How do they do it?

Some of these machine-learning
methods use decision trees or
if-then-else rules.

This approach is often followed in
systems that use machine learning to
create formatting-based extractors
(called wrappers).
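
To show what if-then-else extraction rules look like, here is a tiny rule set written by hand. It is only an illustration: a learning system would induce rules like these from labeled examples, and the field names and tests below are assumptions chosen for US-style address tokens.

```python
def classify_token(token: str) -> str:
    """Assign a token to a database field using if-then-else rules."""
    if token.isdigit() and len(token) == 5:
        return "zip"       # five digits look like a ZIP code
    elif token.isupper() and len(token) == 2:
        return "state"     # two capital letters look like a state code
    elif token[:1].isupper():
        return "city"      # any other capitalized word
    else:
        return "other"

labels = [classify_token(tok) for tok in ["Amherst", "MA", "01003"]]
```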

Are these methods perfect?

It depends on the regularity of the text input
and the strength of the extraction method.

Extraction from somewhat regular text,
such as postal address blocks or research
paper citations, reaches high percentage
accuracy.

Extracting protein names is more difficult;
accuracies in a recent competition were in the 80s.

Deduplication increases the accuracy.

Shop around before you buy.

Is the product an unchangeable black box?
How much can you tune the extractor to
your own purposes?

If you can tune it yourself, how? By writing
rules? How flexible is this rules language?
What subtleties will it let you capture? Does
it let you express weights or “votes” on
certain outcomes? How does it capture
dependencies and conflicts among the rules?

Shop around before you buy.

Can you train it using machine learning? That
is, if you can tune it yourself, can you do so
by providing examples of data with correct
answers (and have the extractor self-tune
with machine learning)? What machine
learning methods are employed, and how
flexible are the features it uses?

Is it designed mostly for leveraging HTML
formatting regularities? Does this paradigm
match your needs?

Upcoming Trends

Estimating uncertainty, managing
multiple hypotheses.

Easier training, semi-supervised
learning, interactive extraction.

Alternative Variation: Mine
the Text Directly

Instead of building a structured database,
it suggests using a loose mixture of text
extraction and data mining.

These methods leverage whatever
limited structured information is
available and use data mining tools that
are robust enough to operate directly
on the raw text.

Information Extraction, the
Web, and the Future

The WWW is the largest repository of knowledge.

But it is not in database form with records
and fields that can be easily manipulated and
understood by computers.

In the future, machines will access this immense
knowledge base, and we will be able to
perform pattern analysis, knowledge
discovery, reasoning, and semi-automated
decision making. Information extraction will
be a key part of making this possible.


