IR Lecture 1


Information Retrieval



Information retrieval is concerned with representing, searching, and manipulating large collections of electronic text and other human-language data.


Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).


1. Basic techniques (Boolean Retrieval)
2. Searching, browsing, ranking, retrieval
3. Indexing algorithms and data structures


IR sits at the intersection of natural language processing (NLP), databases (DB), and machine learning / artificial intelligence (ML-AI).


Database Management


Library and Information Science


Artificial Intelligence


Natural Language Processing


Machine Learning


Database Management

Focused on structured data stored in relational tables rather than free-form text.

Focused on efficient processing of well-defined queries in a formal language (SQL).

Clearer semantics for both data and queries.

The recent move towards semi-structured data (XML) brings it closer to IR.

Library and Information Science


Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).

Concerned with effective categorization of human knowledge.

Concerned with citation analysis and bibliometrics (the structure of information).

Recent work on digital libraries brings it closer to CS & IR.



Artificial Intelligence


Focused on the representation of knowledge, reasoning, and intelligent action.

Formalisms for representing knowledge and queries:
First-order predicate logic
Bayesian networks

Recent work on web ontologies and intelligent information agents brings it closer to IR.


Natural Language Processing


Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse.

The ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

Natural Language Processing: IR Directions

Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).

Methods for identifying specific pieces of information in a document (information extraction).

Methods for answering specific natural-language questions from document corpora.

Machine Learning

Focused on the development of computational systems that improve their performance with experience.

Automated classification of examples based on learning concepts from labeled training examples (supervised learning).

Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

Machine Learning: IR Directions

Text Categorization:
Automatic hierarchical classification (Yahoo).
Adaptive filtering/routing/recommending.
Automated spam filtering.

Text Clustering:
Clustering of IR query results.
Automatic formation of hierarchies (Yahoo).

Learning for Information Extraction

Text Mining


Information Retrieval

The indexing and retrieval of textual documents.

Searching for pages on the World Wide Web is the most recent and most widely used application.

Concerned firstly with retrieving documents relevant to a query.

Concerned secondly with retrieving from large sets of documents efficiently.


Information Retrieval Systems

Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.

Find:
A ranked set of documents that are relevant to the query.



Most IR systems share a basic architecture and organization, adapted to the requirements of specific applications.

Like any technical field, IR has its own jargon.

The next page illustrates the major components of an IR system.

IR System (diagram): a query string and a document corpus go into the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).


Before conducting a search, a user has an information need. This information need drives the search process. The information need is sometimes referred to as a topic.

The user constructs and issues a query to the IR system. Typically this query consists of a small number of terms (instead of "word" we use «term»).

A major task of a search engine is to maintain and manipulate an inverted index for a document collection. This index forms the principal data structure used by the engine for searching and relevance ranking.

The index provides a mapping between terms and the locations in the collection in which they occur.
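A minimal sketch of such a mapping, assuming whitespace tokenization and crude lowercase normalization (both my own simplifications), with a toy two-document collection:

```python
from collections import defaultdict

# Toy collection; the document IDs and texts are illustrative only.
docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Inverted index: term -> sorted list of docIDs in which the term occurs.
index = defaultdict(list)
for doc_id in sorted(docs):
    seen = set()
    for token in docs[doc_id].lower().split():
        term = token.strip(".,;:'")        # crude normalization
        if term and term not in seen:
            index[term].append(doc_id)     # docIDs arrive in sorted order
            seen.add(term)

print(index["brutus"])    # [1, 2]
print(index["capitol"])   # [1]
```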

Relevance

Relevance is a subjective judgment and may include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her intended use of the information (information need).

The simplest notion of relevance is that the query string appears verbatim in the document.

A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
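A small sketch contrasting the two notions (the function names are mine, and whitespace tokenization is assumed):

```python
from collections import Counter

def verbatim_match(query, doc):
    """Strictest notion: the query string appears literally in the document."""
    return query.lower() in doc.lower()

def bag_of_words_score(query, doc):
    """Looser notion: count occurrences of the query terms, ignoring order."""
    counts = Counter(doc.lower().split())
    return sum(counts[t] for t in query.lower().split())

doc = "the noble brutus hath told you caesar was ambitious"
print(verbatim_match("caesar brutus", doc))      # False: no verbatim occurrence
print(bag_of_words_score("caesar brutus", doc))  # 2: both terms occur, in any order
```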


Problems

May not retrieve relevant documents that include synonymous terms:
“restaurant” vs. “café”
“Turkey” vs. “TR”

May retrieve irrelevant documents that include ambiguous terms:
“bat” (baseball vs. mammal)
“Apple” (company vs. fruit)
“bit” (unit of data vs. act of eating)

For instance, the word “bank” has several distinct lexical definitions, including “financial institution” and “edge of a river”.


IR System Architecture

(Diagram of major components:) User Interface; Text Operations (producing the logical view of text and queries); Query Operations; Indexing and a Database Manager over the Text Database, producing the Index (an inverted file); Searching; Ranking; Retrieved Docs and Ranked Docs; User Feedback; with the user need entering as a text query.

IR Components

Text Operations form index words (tokens), as sketched after this list:
Stopword removal
Stemming

Indexing constructs an inverted index of word-to-document pointers.

Searching retrieves documents that contain a given query token from the inverted index.

Ranking scores all retrieved documents according to a relevance metric.
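Toy versions of the two text operations named above; the stopword list and the suffix-stripping rule are illustrative stand-ins, not a real stemmer such as Porter's algorithm:

```python
# Minimal stopword removal and a toy suffix-stripping "stemmer".
STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "are"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def toy_stem(token):
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the romans are marching to the capitol".split()
print(remove_stopwords(tokens))                         # ['romans', 'marching', 'capitol']
print([toy_stem(t) for t in remove_stopwords(tokens)])  # ['roman', 'march', 'capitol']
```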


IR Components

User Interface manages interaction with the user:
Query input and document output.
Relevance feedback.
Visualization of results.

Query Operations transform the query to improve retrieval:
Query expansion using a thesaurus.
Query transformation using relevance feedback.


Web Search

Application of IR to HTML documents on the World Wide Web.

Differences:
Must assemble the document corpus by spidering the web.
Can exploit the structural layout information in HTML (XML).
Documents change uncontrollably.
Can exploit the link structure of the web.



Web Search System (diagram): a Web Spider assembles the document corpus; the IR System takes a query string and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).

Other IR-Related Tasks


Automated document categorization


Information filtering (spam filtering)


Information routing


Automated document clustering


Recommending information or products


Information extraction


Information integration


Question answering



History of IR

1940-50's:

World War II marked the official formation of information representation and retrieval. Because of the war, a massive number of technical reports and documents were produced to record the research and development activities surrounding weaponry production.


History of IR

1960-70's:

Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.

Development of the basic Boolean and vector-space models of retrieval.

Prof. Salton and his students at Cornell University were the leading researchers in the area.


IR History Continued

1980's:

Large document database systems, many run by companies:
Lexis-Nexis (on April 2, 1973, LEXIS launched publicly, offering full-text searching of all Ohio and New York cases)
Dialog (manual to computerized information retrieval)
MEDLINE (Medical Literature Analysis and Retrieval System)



IR History Continued

1990's: Networked Era

Searching FTPable documents on the Internet:
Archie
WAIS

Searching the World Wide Web:
Lycos
Yahoo
Altavista


IR History Continued

1990's continued:

Organized competitions:
NIST TREC (the Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks)

Recommender systems:
Ringo
Amazon

Automated text categorization & clustering



Recent IR History

2000's:

Link analysis for web search:
Google (PageRank)

Automated information extraction:
WhizBang (built highly structured topic-specific / data-centric databases; see the white paper “Information Extraction and Text Classification” via WhizBang! Corp)
Burning Glass (Burning Glass's technology for reading, understanding, and cataloging information directly from free-text resumes and job postings is truly state-of-the-art)

Question answering:
TREC Q/A track (question answering systems return an actual answer, rather than a ranked list of documents, in response to a question)



Recent IR History

2000's continued:

Multimedia IR:
Image
Video
Audio and music

Cross-Language IR:
DARPA TIDES (Translingual Information Detection, Extraction and Summarization)

Document Summarization

An example information retrieval problem

A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia.

One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia.



The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping.

Grepping through text can be a very effective process, especially given the speed of modern computers.

With modern computers, for simple querying of modest collections (the size of Shakespeare's Collected Works is a bit under one million words of text in total), you really need nothing more.
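A minimal sketch of this linear-scan (grepping) approach for the query Brutus AND Caesar AND NOT Calpurnia; the play texts below are tiny placeholders, not the real plays:

```python
# Each play is one string; in practice you would read the full texts.
plays = {
    "Julius Caesar": "... Brutus ... Caesar ... Calpurnia ...",
    "Antony and Cleopatra": "... Brutus ... Caesar ...",
    "The Tempest": "... Miranda ... Prospero ...",
}

def satisfies(text):
    """Brutus AND Caesar AND NOT Calpurnia, by scanning the whole text."""
    words = set(text.lower().split())
    return "brutus" in words and "caesar" in words and "calpurnia" not in words

print([title for title, text in plays.items() if satisfies(text)])
# ['Antony and Cleopatra']
```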

But for many purposes, you do need more:

To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the same sentence”.

To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

Indexing

The way to avoid linearly scanning the texts for each query is to index the documents in advance.

Basics of the Boolean Retrieval Model

Term-document incidence matrix:
An entry is 1 if the term occurs in the document. Example: Calpurnia occurs in Julius Caesar.
An entry is 0 if the term doesn't occur. Example: Calpurnia doesn't occur in The Tempest.



So we have a 0/1 vector for each term.

To answer the query Brutus AND Caesar AND NOT Calpurnia:
Take the vectors for Brutus, Caesar, and Calpurnia.
Complement the vector of Calpurnia.
Do a (bitwise) AND on the three vectors:
110100 AND 110111 AND 101111 = 100100
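This bitwise computation can be reproduced directly; a small sketch using Python integers as 6-bit vectors, where the Calpurnia vector 010000 is inferred from its complement 101111 given above:

```python
# Reproducing the computation above with Python integers as 6-bit vectors.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000              # inferred: its 6-bit complement is 101111

not_calpurnia = ~calpurnia & 0b111111   # complement restricted to 6 bits
result = brutus & caesar & not_calpurnia
print(format(result, "06b"))            # 100100
```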

Answers to the query:

Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii

Lord Polonius:
I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.


Bigger collections

Consider N = 10^6 documents, each with about 1000 tokens: a total of 10^9 tokens.

On average 6 bytes per token, including spaces and punctuation, so the size of the document collection is about 6 × 10^9 bytes = 6 GB.

Assume there are M = 500,000 distinct terms in the collection.

The incidence matrix then has M × N = 500,000 × 10^6 = half a trillion 0s and 1s.

But the matrix has no more than one billion 1s, so it is extremely sparse.

What is a better representation? We only record the 1s.
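The figures above are easy to verify explicitly; a small back-of-the-envelope sketch (the variable names are mine):

```python
# Back-of-the-envelope figures from above, computed explicitly.
N = 10**6              # documents
tokens_per_doc = 1000
M = 500_000            # distinct terms

total_tokens = N * tokens_per_doc        # 10**9 tokens
collection_bytes = 6 * total_tokens      # ~6 GB at 6 bytes per token
matrix_cells = M * N                     # 5 * 10**11 (half a trillion) 0s and 1s
max_ones = total_tokens                  # at most one 1 per token occurrence

print(total_tokens, collection_bytes, matrix_cells)
print("fraction of 1s is at most", max_ones / matrix_cells)   # 0.002: very sparse
```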

Inverted index

For each term t, we must store a list of all documents that contain t.

Identify each document by a docID, a document serial number.

Can we use fixed-size arrays for this?

Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

What happens if the word Caesar is added to document 14?

Inverted index

We need variable-size postings lists.
On disk, a continuous run of postings is normal and best.
In memory, we can use linked lists or variable-length arrays.
There are some tradeoffs in size / ease of insertion.
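A small sketch of the in-memory option, using Python lists as variable-length arrays; the postings follow the lists shown above, and the helper name add_posting is mine:

```python
import bisect

# In-memory postings lists as ordinary Python lists, one per term, sorted by docID.
postings = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted."""
    plist = postings.setdefault(term, [])
    pos = bisect.bisect_left(plist, doc_id)
    if pos == len(plist) or plist[pos] != doc_id:   # avoid duplicate entries
        plist.insert(pos, doc_id)

add_posting("Caesar", 14)       # the earlier question: Caesar added to document 14
print(postings["Caesar"])       # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```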

The dictionary stores the terms; each term points to its postings list, sorted by docID (more later on why). Each entry in a postings list is a posting.

Brutus → 1, 2, 4, 11, 31, 45, 173, 174
Caesar → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101

Sec. 1.2

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
Tokenizer → token stream: Friends, Romans, Countrymen
Linguistic modules → modified tokens: friend, roman, countryman
Indexer → inverted index:
friend → 2, 4
roman → 1, 2
countryman → 13, 16

More on these steps later.

Sec. 1.2
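A rough sketch of this pipeline; the "linguistic modules" step is shown here as lowercasing plus a tiny irregular-plural table, which is my own simplification rather than real linguistic normalization:

```python
IRREGULAR = {"countrymen": "countryman"}   # toy irregular-plural table

def tokenize(text):
    """Tokenizer: split on whitespace and strip surrounding punctuation."""
    return [t.strip(".,;:!?") for t in text.split()]

def normalize(tokens):
    """'Linguistic modules': lowercase and crudely singularize each token."""
    out = []
    for t in tokens:
        t = t.lower()
        t = IRREGULAR.get(t, t[:-1] if t.endswith("s") else t)
        out.append(t)
    return out

def index_doc(index, doc_id, text):
    """Indexer: append doc_id to each term's postings list (no duplicates)."""
    for term in normalize(tokenize(text)):
        plist = index.setdefault(term, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)

index = {}
index_doc(index, 1, "Friends, Romans, countrymen.")
print(index)    # {'friend': [1], 'roman': [1], 'countryman': [1]}
```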

Tokenization and preprocessing

Doc 1. I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

After tokenization and preprocessing:

Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

Indexer steps: Token sequence

From the preprocessed documents above, generate the sequence of (modified token, document ID) pairs.

Indexer steps: Sort

Sort the pairs by term, and then by docID.

Sec. 1.2

Indexer steps: Dictionary & Postings

Multiple term entries in a single document are merged.
Split into a Dictionary and Postings.
Document frequency information is added.

Why frequency? Will discuss later.

Sec. 1.2
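A compact sketch of these indexer steps over the two example documents; treating whitespace-separated tokens as terms is my simplification:

```python
# Collect (term, docID) pairs, sort by term then docID, and merge duplicates
# into a dictionary (term -> document frequency) plus postings lists.
docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

pairs = sorted((term, doc_id) for doc_id, text in docs.items()
               for term in text.split())

dictionary = {}   # term -> document frequency
postings = {}     # term -> sorted list of docIDs, duplicates within a doc merged
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)
        dictionary[term] = dictionary.get(term, 0) + 1

print(dictionary["brutus"], postings["brutus"])   # 2 [1, 2]
print(dictionary["caesar"], postings["caesar"])   # 2 [1, 2]
print(dictionary["killed"], postings["killed"])   # 1 [1]
```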

Where do we pay in storage?

In the dictionary: the terms and their counts, plus pointers to the postings lists. In the postings: the lists of docIDs.

Later in the course:
How do we index efficiently?
How much storage do we need?

Sec. 1.2

The index we just built

How do we process a query?
Later: what kinds of queries can we process?

Sec. 1.3

Query processing: AND

Consider processing the query: Brutus AND Caesar

Locate Brutus in the Dictionary and retrieve its postings.
Locate Caesar in the Dictionary and retrieve its postings.
“Merge” the two postings lists:

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34

Sec. 1.3

The merge

Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:

Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Intersection → 2, 8

If the list lengths are x and y, the merge takes O(x + y) operations.
Crucial: postings are sorted by docID.

Sec. 1.3

Intersecting two postings lists (a “merge” algorithm)
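A sketch of this intersection in Python, assuming both postings lists are already sorted by docID (the function name intersect is mine):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```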


Boolean queries: Exact match

The Boolean retrieval model allows asking a query that is a Boolean expression:
Boolean queries use AND, OR and NOT to join query terms.
Views each document as a set of words.
Is precise: a document either matches the condition or it does not.

Perhaps the simplest model to build an IR system on.

The primary commercial retrieval tool for three decades.

Many search systems you still use are Boolean: email, library catalogs, Mac OS X Spotlight.

Sec. 1.3