CSC 9010: Text Mining


©2012 Paula Matuszek

CSC 9010: Text Mining
Applications:

Information Retrieval

Dr. Paula Matuszek

Paula.Matuszek@villanova.edu

Paula.Matuszek@gmail.com

(610) 647-9789


Knowledge


Knowledge is captured in large quantities and many
forms


Much of the knowledge is in unstructured text


books, journals, papers


letters


web pages, blogs, tweets


A very old process


accelerated greatly with the invention of the printing press


and again with the invention of computers


and again with the advent of the web


Thus the increasing importance of text mining!


A question!


People interact with all that information
because they want to KNOW something;
there is a question they are trying to
answer or a piece of information they want.
They have an information need.


Hopefully there is some information
somewhere that will satisfy that need


At its most general, information retrieval is
the process of finding the information that
meets that need.


Basic Information Retrieval


Simplest approach:


Knowledge is organized into chunks (pages)


Goal is to return appropriate chunks


Not a new problem


But some new solutions!


Web Search engines


Text mining includes this process also


still dealing with lots of unstructured text


finding the appropriate “chunk” can be viewed as a
classification problem


Search Engines


Goal of search engine is to return
appropriate chunks


Steps involved include


asking a question


finding answers


evaluating answers


presenting answers


Value of a search engine depends on
how well it does on all of these.


Asking a question


Reflect some information need


Query Syntax needs to allow information need to
be expressed


Keywords


Combining terms


Simple: “required”, NOT (+ and -)


Boolean expressions with and/or/not and nested parentheses


Variations: strings, NEAR, capitalization.


Simplest syntax that works


Typically more acceptable if predictable


Another set of problems when information isn’t
text: graphics, music
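The simplest syntax above — “and”-ing keywords together — can be sketched in a few lines. The function name and the sample documents here are my own illustrations, not part of any particular engine:

```python
def matches_all(query_terms, document):
    """AND semantics: True only if every query term occurs in the document."""
    words = set(document.lower().split())
    return all(term.lower() in words for term in query_terms)

docs = [
    "text mining extracts knowledge from unstructured text",
    "the printing press accelerated the spread of knowledge",
]
# Both terms must be present for a document to match
hits = [d for d in docs if matches_all(["text", "knowledge"], d)]
```

Boolean expressions with OR/NOT and nesting would need a real query parser; this is the “simplest syntax that works” case.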


Finding the Information


Goal is to retrieve all relevant chunks. Too time-consuming to do in real time, so search engines index pages.


Two basic approaches


Index and classify by hand


Automate


For BOTH approaches deciding what to index on (e.g.,
what is a keyword) is a significant issue.


stemming?


stopwords?


capitalization?


Indexing by Hand


Indexing by hand involves having a person look at
information items and assign them to categories.


Assumes taxonomy of categories exists


Each document can go into multiple categories


Creates high quality indices


Expensive to create


Supports hierarchical browsing for retrieval as well
as search


Inter-rater reliability is an issue; requires training and checking to get consistent category assignment


Indexing By Hand


For focused collections, even very large ones,
feasible


Medline


ACM papers


NY Times hand-indexes all abstracts


For the web as a whole, not yet feasible


Evolving solutions


social bookmarking: delicious, reddit, digg


Hash tags: twitter, Google+


Sometimes pre-structured. More often completely freeform


In some domains these become a folksonomy.


Automated Indexing


Automated indexing involves parsing documents to
pull out key words and creating a table which links
keywords to documents


Doesn’t have any predefined categories or
keywords


Can cover a much higher proportion of the
information available


Can update more quickly


Much lower quality, therefore important to have
some kind of relevance ranking


Or IE-Based


Good information extraction tools can be
used to extract the important terms


Using gazetteers and ontologies to identify
terms


Using named entity and other rules to
assign categories


I2EOnDemand is a good example


Automating Search


Always involves balancing factors:


Recall, Precision


Which is more important varies with query and
with coverage


Speed, storage, completeness, timeliness


Query response needs to be fast


Documents searched need to be current


Ease of use vs power of queries


Full Boolean queries very rich, very confusing.


Simplest is “and”ing together keywords; fast,
straightforward.


Search Engine Basics


A spider or crawler starts at a web page, identifies all
links on it, and follows them to new web pages.


A parser processes each web page and extracts
individual words.


An indexer creates/updates a hash table which
connects words with documents


A searcher uses the hash table to retrieve documents
based on words


A ranking system decides the order in which to
present the documents: their
relevance


Selecting Relevant Documents


Assume:


we already have a corpus of documents defined.


goal is to return a subset of those documents.


Individual documents have been separated into
individual files


Remaining components must parse, index,
find, and rank documents.


Traditional approach is based on the words in
the documents (predates the web)


Extracting Lexical Features


Process a string of characters


assemble characters into tokens (tokenizer)


choose tokens to index


In place
(problem for www)


Standard lexical analysis problem


Lexical Analyser Generator, such as lex


Tokenizers such as the NLTK and
GATE tokenizers


Lexical Analyser


Basic idea is a finite state machine


Triples of input state, transition token, output state







Must be very efficient; gets used a LOT

(Diagram: a three-state machine. State 0 loops on blank; A-Z moves 0 → 1; A-Z loops on 1; blank or EOF moves 1 → 2, emitting a token.)
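The triples above translate directly into code. A minimal sketch (the function name is mine, and character classes are simplified to `isalpha` rather than a strict A-Z range):

```python
def fsm_tokenize(text):
    """Three-state tokenizer: state 0 = between tokens,
    state 1 = inside a token, emit on blank/EOF (state 2)."""
    tokens, current, state = [], [], 0
    for ch in text + " ":              # trailing blank stands in for EOF
        if state == 0:
            if ch.isalpha():           # letter: start a token, 0 -> 1
                current.append(ch)
                state = 1
            # blank: stay in state 0
        else:                          # state 1: inside a token
            if ch.isalpha():           # letter: extend the token
                current.append(ch)
            else:                      # blank/EOF: emit token, back to 0
                tokens.append("".join(current))
                current, state = [], 0
    return tokens
```

Real lexers generated by tools like lex compile such tables into very tight loops, since, as noted, this code path gets used a LOT.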


Design Issues for Lexical Analyser


Punctuation


treat as whitespace?


treat as characters?


treat specially?


Case


fold?


Digits


assemble into numbers?


treat as characters?


treat as punctuation?


Lexical Analyser


Output of lexical analyser is a string of
tokens


Remaining operations are all on these
tokens


We have already thrown away some
information; makes more efficient, but limits
the power of our search


can’t distinguish “VITA” from “Vita”


Can be somewhat remedied at “relevance” step


Stemming


Additional processing at the token level


Turn words into a canonical form:


“cars” into “car”


“children” into “child”


“walked” into “walk”


Decreases the total number of different
tokens to be processed


Decreases the precision of a search, but
increases its recall


NLTK, GATE, other stemmers
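A toy suffix-stripping stemmer shows the idea; real stemmers such as NLTK's PorterStemmer apply many more rules, and irregular forms like “children” → “child” need a lookup table rather than suffix rules:

```python
def simple_stem(word):
    """Toy stemmer: strip a few common suffixes from longer words.
    Illustration only -- does NOT handle irregular forms."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            if suffix == "ies":                 # ponies -> pony
                return word[: -len(suffix)] + "y"
            return word[: -len(suffix)]         # walked -> walk
    return word
```

The length check keeps short words like “is” intact; overly aggressive stripping is exactly how stemming loses precision.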


Noise Words (Stop Words)


Function words that contribute little or
nothing to meaning


Very frequent words


If a word occurs in every document, it is not
useful in choosing among documents


However, need to be careful, because this
is corpus-dependent


Often implemented as a discrete list
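The “discrete list” implementation is a simple set-membership filter. The stop list below is a tiny sample of my own; real lists are much longer and, as noted above, corpus-dependent:

```python
# Tiny illustrative stop list -- real lists are corpus- and language-dependent
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```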


Example Corpora


We are assuming a fixed corpus. Some
sample corpora:


Medline Abstracts


Email. Anyone’s email.


Reuters corpus


Brown corpus


Textual fields, structured attributes


Textual: free, unformatted, no meta-information


Structured: additional information beyond the
content


Structured Attributes for
Medline


Pubmed ID


Author


Year


Keywords


Journal


Textual Fields for Medline


Abstract


Reasonably complete standard academic
English


Capturing the basic meaning of document


Title


Short, formalized


Captures most critical part of meaning


Proxy for abstract


Structured Fields for Email


To, From, Cc, Bcc


Dates


Content type


Status


Content length


Subject (partially)


Text fields for Email


Subject


Format is structured, content is arbitrary.


Captures most critical part of content.


Proxy for content -- but may be inaccurate.


Body of email


Highly irregular, informal English.


Entire document, not summary.


Spelling and grammar irregularities.


Structure and length vary.


Indexing


We have a tokenized, stemmed
sequence of words


Next step is to parse document,
extracting index terms


Assume that each token is a word and we
don’t want to recognize any more complex
structures than single words.


When all documents are processed,
create index


Basic Indexing Algorithm


For each document in the corpus


get the next token


save the posting in a list


doc ID, frequency


For each token found in the corpus


calculate #doc, total frequency


sort by frequency


This is the inverted index
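The algorithm above, sketched minimally in Python. The dict-of-token-lists corpus representation and the function name are my own choices for illustration:

```python
from collections import Counter, defaultdict

def build_inverted_index(corpus):
    """corpus: dict mapping doc_id -> list of tokens.
    Returns token -> list of (doc_id, frequency) postings,
    each posting list sorted by frequency, highest first."""
    index = defaultdict(list)
    for doc_id, tokens in corpus.items():
        # save one posting (doc ID, frequency) per token per document
        for token, freq in Counter(tokens).items():
            index[token].append((doc_id, freq))
    # sort each posting list by frequency
    for postings in index.values():
        postings.sort(key=lambda p: p[1], reverse=True)
    return dict(index)
```

A production index would also handle incremental updates and store positional data; see the “Fine Points” slide.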


Fine Points


Dynamic Corpora: requires incremental
algorithms


Higher-resolution data (e.g., character position)


Giving extra weight to proxy text (typically
by doubling or tripling frequency count)


Document-type-specific processing


In HTML, want to ignore tags


In email, maybe want to ignore quoted
material


Choosing Keywords


Don’t necessarily want to index on every
word


Takes more space for index


Takes more processing time


May not improve our
resolving power


How do we choose keywords?


Manually


Statistically


Exhaustivity vs specificity


Manually Choosing Keywords


Unconstrained vocabulary: allow creator of
document to choose whatever he/she wants


“best” match


captures new terms easily


easiest for person choosing keywords


Constrained vocabulary: hand-crafted ontologies


can include hierarchical and other relations


more consistent


easier for searching; possible “magic bullet”
search


Examples of Constrained Vocabularies


ACM Computing Classification System
(www.acm.org/class/1998)


H: Information Retrieval


H3: Information Storage and Retrieval



H3.3: Information Search and Retrieval

»
Clustering

»
Information Filtering

»
Query formulation


»
Relevance feedback

»

etc.


Medline Headings
(www.nlm.nih.gov/mesh/MBrowser.html)


L: Information Science


L01: Information Science



L01.700: Medical Informatics



L01.700.508: Medical Informatics Applications



L01.700.508.280: Information Storage and Retrieval

»
MedlinePlus [L01.700.508.280.730]



Automated Vocabulary Selection


Frequency: Zipf’s Law.


In a natural language corpus, frequency of a word is
inversely proportional to its position in a frequency
table.


Within one corpus, words with middle frequencies are typically “best”


We have used this in NLTK classification, ignoring
the most frequent terms in creating the BOW.


Document-oriented representation bias: lots of keywords/document


Query-oriented representation bias: only the “most typical” words. Assumes that we are comparing across documents.
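Zipf's law is easy to check empirically: tabulate rank against frequency and rank × frequency should be roughly constant. A minimal sketch (function name is my own):

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word first.
    Zipf's law predicts frequency is roughly constant / rank."""
    ranked = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(ranked, start=1))
```

On a real natural-language corpus the head of this table is dominated by stop words and the tail by hapaxes, which is why the middle frequencies index best.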


Choosing Keywords


“Best” depends on actual use; if a word
only occurs in one document, may be
very good for retrieving that document;
not, however, very effective overall.


Words which have no resolving power
within a corpus may be best choices
across corpora


Keyword Choice for WWW


We don’t have a fixed corpus of documents


New terms appear fairly regularly, and are
likely to be common search terms


Queries that people want to make are wide-ranging and unpredictable


Therefore: can’t limit keywords, except
possibly to eliminate stop words.


Even stop words are language-dependent. So determine language first.


Comparing and Ranking Documents


Once our search engine has retrieved a
set of documents, we may want to


Rank them by relevance


Which are the best fit to my query?


This involves determining what the query is about and how well the document answers it


Compare them


Show me more like this.


This involves determining what the document is about.


Determining Relevance by
Keyword


The typical web query consists entirely of
keywords.


Retrieval can be binary: present or absent


More sophisticated is to look for degree of relatedness: how much does this document reflect what the query is about?


Simple strategies:


How many times does word occur in document?


How close to head of document?


If multiple keywords, how close together?
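The first two simple strategies above can be combined into a toy scorer. The weights here are illustrative only, not any actual engine's formula:

```python
def relevance_score(query_terms, tokens):
    """Toy heuristic: each occurrence of a query term adds 1 (count),
    plus a bonus that shrinks with distance from the head (position)."""
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok.lower() in query_terms:
            score += 1.0             # count: repetition signals emphasis
            score += 1.0 / (i + 1)   # position: earlier occurrences score higher
    return score
```

Proximity for multiple keywords would require tracking positions of each term and comparing them, which is why it costs more computation.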


Keywords for Relevance Ranking


Count: repetition is an indication of emphasis


Very fast (usually in the index)


Reasonable heuristic


Unduly influenced by document length


Can be "stuffed" by web designers


Position: Lead paragraphs summarize content


Requires more computation


Also a reasonable heuristic


Less influenced by document length


Harder to "stuff"; can only have a few keywords near
beginning


Keywords for Relevance Ranking


Proximity for multiple keywords


Requires even more computation


Obviously relevant only if have multiple keywords


Effectiveness of heuristic varies with information
need; typically either excellent or not very helpful
at all


Very hard to "stuff"


All keyword methods


Are computationally simple and adequately fast


Are effective heuristics


typically perform as well as in-depth natural language methods for standard search


Comparing Documents


"Find me more like this one" really means that we are using the document as a query.


This requires that we have some conception
of what a document is about overall.


Depends on context of query. We need to


Characterize the entire content of this document


Discriminate between this document and others in
the corpus


This is basically a document classification problem.


Describing an Entire
Document


So what is a document about?


TF*IDF: can simply list keywords in
order of their TF*IDF values


Document is about all of them to some degree: it is at some point in some vector space of meaning
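One common TF*IDF formulation — plain term frequency times log inverse document frequency; many weighting variants exist, and the function name is my own:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """corpus: list of token lists.
    TF = occurrences of term in this document;
    IDF = log(N / number of documents containing term)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)
```

Note that a term occurring in every document gets IDF = log(1) = 0, which is exactly the stop-word intuition from earlier: no resolving power within the corpus.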


Vector Space


Any corpus has a defined set of terms (its index)


These terms define a
knowledge space


Every document is somewhere in that knowledge space -- it is or is not about each of those terms.


Consider each term as a vector. Then


We have an n-dimensional vector space


Where n is the number of terms (very large!)


Each document is a point in that vector space


The document position in this vector space
can be treated as what the document is
about
.


Similarity Between Documents


How similar are two documents?


Measures of association


How much do the feature sets overlap?


Simple Matching coefficient: take into account
exclusions


Cosine similarity


similarity of angle of the two document vectors


not sensitive to vector length


Same basic similarity ideas as
classification and clustering
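Cosine similarity over term-frequency vectors can be sketched in a few lines (function name is my own):

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors:
    1.0 means same direction, 0.0 means no shared terms.
    Insensitive to document length (vector magnitude)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In practice the raw counts would usually be replaced by TF*IDF weights before comparing.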


Additional Search Engine
Issues


Freshness: how often to revisit documents


Eliminate duplicate documents


Eliminate multiple documents from one site


Provide good context


Non-content-based features: citation graphs (basis of PageRank)


Search Engine Optimization


Beyond Simple Search


Information Extraction on queries to
recognize some common patterns


Airline flights


tracking #s


Rich IE systems like I2E


Taxonomy browsing


Beyond Unstructured Text


“Improved” search


Specific types of text


Non-text search constraints


Faceted Search


Searching non-text information assets


Personalizing search


Improved Search


Modern search engines tweak your query a
lot. Google, for instance, says it will normally


suggest spelling corrections and alternative
spellings


personalize your search by using information such
as sites you’ve visited before


include synonyms of your search terms


find results that match similar terms to those in
your query


stem your search terms

(http://support.google.com/websearch/bin/answer.py?hl=en&p=g_verb&answer=1734130)


Domain of Documents


May be desirable to limit search to specific types
of document


Google gives you, among other things


news


books


blogs


recipes


patents


discussions


May be based on the source


Or may be text mining (classification) at work :-)


Non-Text Constraints


We may know some things about
documents that are not captured in the
tokens:


Meta-information included in the document


date created and modified


author, department


keywords or tags


Information that can be determined by
examining the document


language


images included?


reading level


Faceted Search


Faceted Search: constrain search along several
attributes


An active research topic in information retrieval, especially over the last ten years


Flamenco project at Berkeley (flamenco.berkeley.edu)


CiteSeer project at Penn State (citeseer.ist.psu.edu)


Labeling with facets has the same issues as hand-indexing


except when you already have the information in a
database somewhere


Has become popular primarily for online
commerce.


Faceted Search Examples


Search Amazon for “rug”


Generally applicable facets: Department,
Amazon Prime Eligible, Average Customer
Review, Price, Discount, Availability


Object-specific facets: size, material, pattern, style, theme, color


Flamenco Fine Arts demo


http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/famuseum/Flamenco


Searching Non-Text Assets


Our information chunks might not be text


images, sounds, videos, maps


Images, sounds, videos often based on
proxy information: captions, tags


Information Extraction useful for maps


Still an active research area, typically at
the intersection of information retrieval
and machine learning.
http://www.gwap.com/gwap/


Personalized Search


What searches you have done in the past tells a lot
about your information needs


If a search engine has and makes use of that
information it can improve your searches


web page history


cookies


explicit sign-in


Relevance can be influenced by pages you’ve visited, click-throughs


May go beyond to use information such as your blog
posts, friends links, etc.


Can be a privacy issue


Summary


Information Retrieval is the process of finding and
providing to the user chunks of information which
are relevant to some information need


Where the chunks are text, free text search tools
are the common approach


Text mining tools such as document classification
and information extraction can improve the
relevance of search results


This is not new with the web, but the web has had a
massive impact on the area


It continues to evolve rapidly