CSC 9010: Text Mining

Information Retrieval

Dr. Paula Matuszek

©2012 Paula Matuszek


Knowledge is captured in large quantities and many forms

Much of the knowledge is in unstructured text

books, journals, papers


web pages, blogs, tweets

A very old process

accelerated greatly with the invention of the printing press

and again with the invention of computers


and again with the advent of the web

Thus the increasing importance of text mining!

A question!

People interact with all that information
because they want to KNOW something;
there is a question they are trying to
answer or a piece of information they want.
They have an information need.

Hopefully there is some information
somewhere that will satisfy that need

At its most general, information retrieval is
the process of finding the information that
meets that need.

Basic Information Retrieval

Simplest approach:

Knowledge is organized into chunks (pages)

Goal is to return appropriate chunks

Not a new problem

But some new solutions!

Web Search engines

Text mining includes this process also

still dealing with lots of unstructured text

finding the appropriate “chunk” can be viewed as a
classification problem

Search Engines

Goal of search engine is to return
appropriate chunks

Steps involved include

asking a question

finding answers

evaluating answers

presenting answers

Value of a search engine depends on
how well it does on all of these.

Asking a question

Reflect some information need

Query Syntax needs to allow information need to
be expressed


Combining terms

Simple: “required”, NOT (+ and −)

Boolean expressions with and/or/not and nested parentheses

Variations: strings, NEAR, capitalization.

Simplest syntax that works

Typically more acceptable if predictable

Another set of problems when information isn’t
text: graphics, music

Finding the Information

Goal is to retrieve all relevant chunks. Too time-
consuming to do in real time, so search engines
index documents in advance


Two basic approaches

Index and classify by hand

Index and classify automatically


For BOTH approaches deciding what to index on (e.g.,
what is a keyword) is a significant issue.




Indexing by Hand

Indexing by hand involves having a person look at
information items and assign them to categories.

Assumes taxonomy of categories exists

Each document can go into multiple categories

Creates high quality indices

Expensive to create

Supports hierarchical browsing for retrieval as well
as search

Inter-rater reliability is an issue; requires training and
checking to get consistent category assignment

Indexing By Hand

For focused collections, even very large ones,
hand indexing is feasible:

ACM papers

NY Times hand-indexes all abstracts

For the web as a whole, not yet feasible

Evolving solutions

social bookmarking: delicious, reddit, digg

Hash tags: twitter, Google+

Sometimes pre-structured; more often completely
ad hoc

In some domains these have become a de facto
standard

Automated Indexing

Automated indexing involves parsing documents to
pull out key words and creating a table which links
keywords to documents

Doesn’t have any predefined categories or taxonomy

Can cover a much higher proportion of the
information available

Can update more quickly

Much lower quality, therefore important to have
some kind of relevance ranking


Information Extraction for Indexing

Good information extraction tools can be
used to extract the important terms

Using gazetteers and ontologies to identify
terms

Using named entity and other rules to
assign categories

I2EOnDemand is a good example

Automating Search

Always involves balancing factors:

Recall, Precision

Which is more important varies with query and
with coverage

Speed, storage, completeness, timeliness

Query response needs to be fast

Documents searched need to be current

Ease of use vs power of queries

Full Boolean queries very rich, very confusing.

Simplest is “and”ing together keywords; fast, but less expressive
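As a concrete illustration, here is a minimal sketch (not any particular engine's implementation) of “and”ing keywords, assuming the index simply maps each term to the set of IDs of documents containing it:

    # Minimal sketch: evaluate an AND query against an inverted index.
    # Assumes `index` maps each term to the set of IDs of documents
    # containing that term (a simplification of real engines).
    def and_query(index, terms):
        postings = [index.get(t.lower(), set()) for t in terms]
        if not postings:
            return set()
        postings.sort(key=len)          # intersect smallest sets first
        result = postings[0]
        for p in postings[1:]:
            result = result & p
        return result

    index = {"text": {1, 2, 3}, "mining": {2, 3}, "web": {3, 4}}
    print(and_query(index, ["text", "mining"]))   # {2, 3}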

Search Engine Basics

A spider or crawler starts at a web page, identifies all
links on it, and follows them to new web pages.

A parser processes each web page and extracts
individual words.

An indexer creates/updates a hash table which
connects words with documents

A searcher uses the hash table to retrieve documents
based on words

A ranking system decides the order in which to
present the documents: their relevance
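A hedged sketch of the first two components, using only the Python standard library; a real spider would also need robots.txt handling, politeness delays, and deduplication:

    # Minimal sketch of the spider and parser steps (standard library only).
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags as a page is parsed."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl_one(url):
        """Fetch one page; return its text and the absolute links on it."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        return html, [urljoin(url, link) for link in extractor.links]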

Selecting Relevant Documents


Assume we already have a corpus of documents defined.

The goal is to return a subset of those documents.

Individual documents have been separated into
individual files

Remaining components must parse, index,
find, and rank documents.

Traditional approach is based on the words in
the documents (predates the web)

Extracting Lexical Features

Process a string of characters

assemble characters into tokens (tokenizer)

choose tokens to index

In place (a problem for the www)

Standard lexical analysis problem

Lexical Analyser Generator, such as lex

Tokenizers such as the NLTK and
GATE tokenizers

Lexical Analyser

Basic idea is a finite state machine

Triples of input state, transition token, output state

Must be very efficient; gets used a LOT







(Diagram: finite-state tokenizer; transitions on character class, with blank or EOF ending a token)
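A minimal sketch of such a machine in Python, assuming just two states (inside and outside a token), with alphanumeric characters driving the transitions:

    # Two-state tokenizer: (state, character class) determines the next
    # state; leaving the in-token state emits a token. Blank/EOF ends one.
    def fsm_tokenize(text):
        tokens, current, in_token = [], [], False
        for ch in text + " ":               # trailing blank stands in for EOF
            if ch.isalnum():                # enter or stay in the token state
                current.append(ch)
                in_token = True
            elif in_token:                  # any other character ends the token
                tokens.append("".join(current))
                current, in_token = [], False
        return tokens

    print(fsm_tokenize("Text mining, 2012!"))   # ['Text', 'mining', '2012']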

Design Issues for Lexical Analyser


Punctuation:

treat as whitespace?

treat as characters?

treat specially?




Digits:

assemble into numbers?

treat as characters?

treat as punctuation?

Lexical Analyser

Output of lexical analyser is a string of tokens

Remaining operations are all on these tokens

We have already thrown away some
information; this makes processing more
efficient, but limits the power of our search

can’t distinguish “VITA” from “Vita”

Can be somewhat remedied at “relevance” step


Stemming

Additional processing at the token level

Turn words into a canonical form:

“cars” into “car”

“children” into “child”

“walked” into “walk”

Decreases the total number of different
tokens to be processed

Decreases the precision of a search, but
increases its recall

NLTK, GATE, other stemmers
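For example, with NLTK (assuming it is installed; the lemmatizer also needs nltk.download('wordnet')):

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    print(stemmer.stem("cars"))      # 'car'
    print(stemmer.stem("walked"))    # 'walk'
    print(stemmer.stem("children"))  # 'children' -- suffix stripping misses it

    # Irregular forms need a dictionary-based lemmatizer instead:
    print(WordNetLemmatizer().lemmatize("children"))  # 'child'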

Noise Words (Stop Words)

Function words that contribute little or
nothing to meaning

Very frequent words

If a word occurs in every document, it is not
useful in choosing among documents

However, need to be careful, because this
is corpus-dependent

Often implemented as a discrete list
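For example, using NLTK's built-in English list (assumes nltk.download('stopwords') has been run):

    from nltk.corpus import stopwords

    stops = set(stopwords.words("english"))
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    print([t for t in tokens if t not in stops])   # ['cat', 'sat', 'mat']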

Example Corpora

We are assuming a fixed corpus. Some
sample corpora:

Medline Abstracts

Email. Anyone’s email.

Reuters corpus

Brown corpus

Textual fields, structured attributes

Textual: free, unformatted, no metadata

Structured: additional information beyond the text itself

Structured Attributes for Medline

Pubmed ID





Textual Fields for Medline


Abstract:

Reasonably complete standard academic English

Captures the basic meaning of the document

Title:

Short, formalized

Captures most critical part of meaning

Serves as a proxy for the abstract

Structured Fields for Email

To, From, Cc, Bcc


Content type


Content length

Subject (partially)

Text fields for Email


Subject line

Format is structured, content is arbitrary.

Captures most critical part of content.

Serves as a proxy for content, but may be inaccurate.

Body of email

Highly irregular, informal English.

Entire document, not summary.

Spelling and grammar irregularities.

Structure and length vary.


Creating the Index

We have a tokenized, stemmed
sequence of words

Next step is to parse document,
extracting index terms

Assume that each token is a word and we
don’t want to recognize any more complex
structures than single words.

When all documents are processed,
create index

Basic Indexing Algorithm

For each document in the corpus

get the next token

save the posting in a list

doc ID, frequency

For each token found in the corpus

calculate #doc, total frequency

sort by frequency

This is the inverted index
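A minimal sketch of this algorithm in Python, with the corpus assumed to be a dict mapping doc IDs to token lists:

    # Build an inverted index: each posting is a (doc ID, frequency) pair.
    from collections import Counter, defaultdict

    def build_inverted_index(corpus):
        index = defaultdict(list)
        for doc_id, tokens in corpus.items():
            for token, freq in Counter(tokens).items():
                index[token].append((doc_id, freq))    # save the posting
        return index

    corpus = {1: ["text", "mining", "text"], 2: ["web", "mining"]}
    index = build_inverted_index(corpus)
    print(index["mining"])   # [(1, 1), (2, 1)] -- once in each document
    print(index["text"])     # [(1, 2)] -- twice in document 1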

Fine Points

Dynamic corpora: requires incremental updating

Finer-resolution data (e.g., character position)

Giving extra weight to proxy text (typically
by doubling or tripling frequency count)

Format-specific processing

In HTML, want to ignore tags

In email, may want to ignore quoted material

Choosing Keywords

Don’t necessarily want to index on every token

Takes more space for index

Takes more processing time

May not improve our resolving power

How do we choose keywords?



Manually or automatically

Exhaustivity vs specificity

Manually Choosing Keywords

Unconstrained vocabulary: allow creator of
document to choose whatever he/she wants

“best” match

captures new terms easily

easiest for person choosing keywords

Constrained vocabulary: a hand-built, fixed set of terms

can include hierarchical and other relations

more consistent

easier for searching; possible “magic bullet”

Examples of Constrained Vocabularies

ACM Computing Classification System

H: Information Systems

H3: Information Storage and Retrieval

H3.3: Information Search and Retrieval


Information Filtering

Query formulation

Relevance feedback



Medline Headings

L: Information Science

L01: Information Science

L01.700: Medical Informatics

L01.700.508: Medical Informatics Applications

L01.700.508.280: Information Storage and Retrieval

MedlinePlus [L01.700.508.280.730]

Automated Vocabulary Selection

Frequency: Zipf’s Law.

In a natural language corpus, the frequency of a word is
inversely proportional to its rank in a frequency-ordered list

Within one corpus, words with mid-range
frequencies are typically “best”

We have used this in NLTK classification, ignoring
the most frequent terms in creating the BOW.

Document-oriented representation bias: lots of terms

Query-oriented representation bias: only the
“most typical” words. Assumes that we are
comparing across documents.
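A quick way to see the Zipf pattern on any corpus (corpus.txt here is a placeholder for any plain-text file): rank words by frequency and note that rank × frequency stays roughly constant.

    from collections import Counter

    tokens = open("corpus.txt").read().lower().split()
    for rank, (word, freq) in enumerate(Counter(tokens).most_common(10), 1):
        print(rank, word, freq, rank * freq)   # rank*freq ~ constant (Zipf)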

Choosing Keywords

“Best” depends on actual use; if a word
only occurs in one document, may be
very good for retrieving that document;
not, however, very effective overall.

Words which have no resolving power
within a corpus may be best choices
across corpora

Keyword Choice for WWW

For the web, we don’t have a fixed corpus of documents

New terms appear fairly regularly, and are
likely to be common search terms

Queries that people want to make are wide-
ranging and unpredictable

Therefore: can’t limit keywords, except
possibly to eliminate stop words.

Even stop words are language-specific, so
determine the language first.

Comparing and Ranking Documents

Once our search engine has retrieved a
set of documents, we may want to

Rank them by relevance

Which are the best fit to my query?

This involves determining what the query is
about and how well the document answers it

Compare them

Show me more like this.

This involves determining what the document
is about.

Determining Relevance by Keywords

The typical web query consists entirely of keywords

Retrieval can be binary: present or absent

More sophisticated is to look for degree of
relatedness: how much does this document
reflect what the query is about?

Simple strategies:

How many times does word occur in document?

How close to head of document?

If multiple keywords, how close together?

Keywords for Relevance Ranking

Count: repetition is an indication of emphasis

Very fast (usually in the index)

Reasonable heuristic

Unduly influenced by document length

Can be "stuffed" by web designers

Position: Lead paragraphs summarize content

Requires more computation

Also a reasonable heuristic

Less influenced by document length

Harder to "stuff"; can only have a few keywords near
the top

Keywords for Relevance Ranking

Proximity for multiple keywords

Requires even more computation

Obviously relevant only if have multiple keywords

Effectiveness of heuristic varies with information
need; typically either excellent or not very helpful
at all

Very hard to "stuff"

All keyword methods

Are computationally simple and adequately fast

Are effective heuristics

typically perform as well as in-depth natural
language methods for standard search
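A toy scoring function combining the three heuristics; the weights and thresholds are purely illustrative assumptions, not from the slides:

    def keyword_score(tokens, query_terms):
        """Score one tokenized document against a keyword query."""
        positions = {t: [i for i, tok in enumerate(tokens) if tok == t]
                     for t in query_terms}
        score = 0.0
        for pos in positions.values():
            score += len(pos)                    # count: repetition
            if pos and pos[0] < 50:              # position: near the head
                score += 2.0
        hits = sorted(p for ps in positions.values() for p in ps)
        # proximity: reward query terms occurring within 5 tokens of each other
        score += sum(1.0 for a, b in zip(hits, hits[1:]) if b - a < 5)
        return score

    doc = "text mining finds patterns in large text collections".split()
    print(keyword_score(doc, ["text", "mining"]))   # 9.0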

Comparing Documents

“Find me more like this one” really means that
we are using the document as a query.

This requires that we have some conception
of what a document is about overall.

Depends on context of query. We need to

Characterize the entire content of this document

Discriminate between this document and others in
the corpus

This is basically a document classification problem.

Describing an Entire Document

So what is a document about?

TF*IDF: we can simply list keywords in
order of their TF*IDF values

A document is about all of them to some
degree: it is at some point in some
vector space of meaning
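The standard formulation weights a term by its frequency in the document (tf) times the log of its inverse document frequency (idf = log(N/df) for an N-document corpus). A minimal sketch:

    import math
    from collections import Counter

    def tf_idf(docs):
        """docs: list of token lists -> one {term: weight} dict per doc."""
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        return [{t: freq * math.log(n / df[t])
                 for t, freq in Counter(doc).items()} for doc in docs]

    docs = [["text", "mining", "text"], ["web", "mining"], ["web", "pages"]]
    print(tf_idf(docs)[0])   # 'text' scores high: frequent here, rare elsewhere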

Vector Space

Any corpus has a defined set of terms (the index)

These terms define a
knowledge space

Every document is somewhere in that
knowledge space: it is or is not about each
of those terms.

Consider each term as a vector. Then

We have an n-dimensional vector space

Where n is the number of terms (very large!)

Each document is a point in that vector space

The document's position in this vector space
can be treated as what the document is about
Similarity Between Documents

How similar are two documents?

Measures of association

How much do the feature sets overlap?

Simple matching coefficient: takes into account
the number of features the two documents share

Cosine similarity

similarity of angle of the two document vectors

not sensitive to vector length

Same basic similarity ideas as
classification and clustering
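A minimal cosine-similarity sketch over sparse {term: weight} vectors, such as the TF*IDF dictionaries sketched earlier:

    import math

    def cosine(a, b):
        """Cosine of the angle between two sparse term-weight vectors."""
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    print(cosine({"text": 2.0, "mining": 1.0}, {"web": 1.0, "mining": 1.0}))
    # ~0.316: the documents share only the term 'mining'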

Additional Search Engine Issues

Freshness: how often to revisit documents

Eliminate duplicate documents

Eliminate multiple documents from one site

Provide good context

Non-content-based features: citation
graphs (the basis of PageRank)

Search Engine Optimization

Beyond Simple Search

Information Extraction on queries to
recognize some common patterns

Airline flights

tracking #s

Rich IE systems like I2E

Taxonomy browsing

Beyond Unstructured Text

“Improved” search

Specific types of text

Non-text search constraints

Faceted Search

Searching non-text information assets

Personalizing search

Improved Search

Modern search engines tweak your query a
lot. Google, for instance, says it will normally

suggest spelling corrections and alternative spellings

personalize your search by using information such
as sites you’ve visited before

include synonyms of your search terms

find results that match similar terms to those in
your query

stem your search terms


Domain of Documents

May be desirable to limit search to specific types
of document

Google gives you, among other things, image,
news, book, and scholarly-paper search
May be based on the source

Or may be text mining (classification) at work

Text Constraints

We may know some things about
documents that are not captured in the text

Information included in the document

date created and modified

author, department

keywords or tags

Information that can be determined by
examining the document


images included?

reading level

Faceted Search

Faceted Search: constrain search along several
dimensions

Research topic in information retrieval, especially
over the last ten years

Flamenco project at Berkeley

CiteSeer project at Penn State

Labeling with facets has the same issues as hand
indexing

except when you already have the information in a
database somewhere

Has become popular primarily for online
shopping

Faceted Search Examples

Search Amazon for “rug”

Generally applicable facets: Department,
Amazon Prime Eligible, Average Customer
Review, Price, Discount, Availability

Product-specific facets: size, material,
pattern, style, theme, color

Flamenco Fine Arts demo

Searching Non-Text Assets

Our information chunks might not be text

images, sounds, videos, maps

Images, sounds, videos often based on
proxy information: captions, tags

Information Extraction useful for maps

Still an active research area, typically at
the intersection of information retrieval
and machine learning.

Personalized Search

The searches you have done in the past tell a lot
about your information needs

If a search engine has and makes use of that
information it can improve your searches

web page history


explicit sign-in

Relevance can be influenced by pages you’ve visited

May go beyond that to use information such as your
blog posts, friends’ links, etc.

Can be a privacy issue


Summary

Information Retrieval is the process of finding and
providing to the user chunks of information which
are relevant to some information need

Where the chunks are text, free text search tools
are the common approach

Text mining tools such as document classification
and information extraction can improve the
relevance of search results

This is not new with the web, but the web has had a
massive impact on the area

It continues to evolve rapidly.