Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Information Retrieval and Vector Space Model
Outline
Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model
Introduction to IR
The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.
(Lyman & Varian, 2000)
Growth of textual information
How can we help manage and exploit all the information?
Literature
Email
WWW
Desktop
News
Intranet
Blog
Information overflow
What is Information Retrieval (IR)?
Narrow sense:
IR = search engine technologies (e.g., Google, library information systems)
IR = text matching/classification
Broad sense: IR = text information management:
General problem: how to manage text information?
How to find useful information? (retrieval) Example: Google
How to organize information? (text classification) Example: automatically assign emails to different folders
How to discover knowledge from text? (text mining) Example: discover correlations between events
Formalizing IR Tasks
Vocabulary: V = {w1, w2, ..., wT} of a language
Query: q = q1, q2, ..., qm, where qi ∈ V
Document: di = di1, di2, ..., di,mi, where dij ∈ V
Collection: C = {d1, d2, ..., dN}
Relevant document set: R(q) ⊆ C; generally unknown and user-dependent
The query provides a "hint" about which documents should be in R(q)
IR: find the approximate relevant document set R'(q)
Source: This slide is borrowed from [1]
Evaluation measures
The quality of many retrieval systems depends on how well they manage to rank relevant documents.
How can we evaluate rankings in IR?
IR researchers have developed evaluation measures specifically designed to evaluate rankings. Most of these measures combine precision and recall in a way that takes account of the ranking.
Precision & Recall
Source: This slide is borrowed from [1]
In other words:
Precision is the percentage of items in the returned set that are relevant.
Recall is the percentage of all relevant documents in the collection that appear in the returned set.
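As a minimal sketch of these two definitions (the document IDs below are made up for illustration), precision and recall of an unranked result set can be computed as:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 3 of the 6 relevant docs were found
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d3", "d5", "d6", "d7"])
print(p, r)  # 0.75 0.5
```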
Evaluating Retrieval Performance
Source: This slide is borrowed from [1]
IR System Architecture
[Architecture diagram: the user submits a query through the interface; indexing builds the document representation (Doc Rep) from the docs, and a query representation (Query Rep) is built from the query; a ranking component searches and returns results; the user's relevance judgments feed back into query modification.]
Indexing Documents
Break documents into words
Stop list
Stemming
Construct index
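The indexing steps above can be sketched as follows; the stop list here is a tiny illustrative one, not a standard list, and stemming is left out for brevity:

```python
import re
from collections import defaultdict

STOP_WORDS = {"of", "in", "a", "the", "to"}  # tiny illustrative stop list

def build_index(docs):
    """Break documents into words, drop stop words, and build an inverted index
    mapping each remaining term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

index = build_index({"d1": "Shipment of gold damaged in a fire",
                     "d2": "Delivery of silver arrived in a silver truck"})
print(sorted(index["silver"]))  # ['d2']
```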
Searching
Given a query, score documents efficiently.
The basic question: given a query, how do we know if document A is more relevant than document B?
If document A uses more query words than document B
If word usage in document A is more similar to that in the query
...
We should find a way to compute the relevance between the query and the documents.
The Notion of Relevance
Relevance(Rep(q), Rep(d)) has been modeled in several families of approaches:
Similarity (different representations and similarity measures): vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89); ...
Probability of relevance, P(r=1|q,d), r ∈ {0,1}: regression model (Fox 83); generative models, either by doc generation: classical prob. model (Robertson & Sparck Jones, 76), or by query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
Probabilistic inference, P(d→q) or P(q→d) (different inference systems): prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
Today's lecture: relevance as similarity (the vector space model).
Relevance = Similarity
Assumptions:
Query and document are represented similarly; a query can be regarded as a "document"
Relevance(d,q) ≈ similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = similarity(Rep(q), Rep(d))
Key issues:
How to represent the query/document? The Vector Space Model (VSM)
How to define the similarity measure?
Vector Space Model (VSM)
The vector space model is one of the most widely used models for ad-hoc retrieval.
It is used in information filtering, information retrieval, indexing, and relevancy rankings.
VSM
Represent a doc/query by a term vector
Term: basic concept, e.g., word or phrase
Each term defines one dimension; N terms define a high-dimensional space
E.g., d = (x1, ..., xN), where xi is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in the vector space
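A minimal sketch of representing a document as a term vector; the vocabulary and sample text are made up, and raw term counts stand in for "importance":

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Represent a document as a raw term-frequency vector over a fixed
    vocabulary, one dimension per term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

vocab = ["gold", "silver", "truck"]
d = to_vector("a truck of gold arrived in a truck", vocab)
print(d)  # [1, 0, 2]
```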
VS Model: illustration
[Illustration: documents D1-D11 and a query plotted as points in a 3-dimensional term space with axes "Java", "Microsoft", and "Starbucks"; relevance corresponds to proximity to the query vector.]
Some Issues about the VS Model
There is no consistent definition of the basic concept (term)
How to assign weights to words is not determined
A weight in the query indicates the importance of the term
How to Assign Weights?
Different terms have different importance in a text.
A term weighting scheme plays an important role in the similarity measure.
Higher weight = greater impact.
We now turn to the question of how to weight words in the vector space model.
There are three components in a weighting scheme:
gi: the global weight of the i-th term
tij: the local weight of the i-th term in the j-th document
dj: the normalization factor for the j-th document
TF Weighting
Idea: a term is more important if it occurs more frequently in a document.
Formulas: let f(t,d) be the frequency count of term t in doc d
Raw TF: TF(t,d) = f(t,d)
Log TF: TF(t,d) = log f(t,d)
Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
Normalization of TF is very important!
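The three TF formulas above can be sketched directly; the token list below is illustrative:

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    """Raw TF, log TF, and maximum-frequency-normalized TF for one term,
    following the formulas on the slide (natural log used for log TF)."""
    counts = Counter(doc_tokens)
    f = counts[term]
    raw = f
    log_tf = math.log(f) if f > 0 else 0.0
    max_norm = 0.5 + 0.5 * f / max(counts.values())
    return raw, log_tf, max_norm

tokens = "to be or not to be".split()
raw, log_tf, max_norm = tf_variants("be", tokens)
print(raw, max_norm)  # 2 1.0
```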
TF Methods
IDF Weighting
Idea: a term is more discriminative if it occurs in fewer documents.
Formula: IDF(t) = 1 + log(n/k)
n: total number of docs
k: number of docs containing term t (document frequency)
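A minimal sketch of this IDF formula, using the natural logarithm and a made-up collection of token sets:

```python
import math

def idf(term, docs):
    """IDF(t) = 1 + log(n/k), where n is the collection size and k is the
    number of documents containing the term."""
    n = len(docs)
    k = sum(1 for d in docs if term in d)
    return 1 + math.log(n / k) if k else 0.0

docs = [{"gold", "fire"}, {"silver", "truck"}, {"gold", "truck"}]
print(round(idf("silver", docs), 4))  # 1 + log(3/1) -> 2.0986
```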
IDF weighting Methods
TF Normalization
Why?
Document length variation
"Repeated occurrences" are less informative than the "first occurrence"
Two views of document length:
A doc is long because it uses more words
A doc is long because it has more content
Generally penalize long docs, but avoid over-penalizing.
TF-IDF Weighting
TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
Common in doc → high TF → high weight
Rare in collection → high IDF → high weight
Imagine a word-count profile: what kinds of terms would have high weights?
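Combining the two factors, a sketch of TF-IDF weighting with raw TF and IDF(t) = 1 + log(n/k) as above; the toy collection is made up:

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, docs_tokens):
    """weight(t,d) = TF(t,d) * IDF(t), with raw TF and IDF(t) = 1 + log(n/k)."""
    tf = Counter(doc_tokens)[term]
    k = sum(1 for d in docs_tokens if term in d)
    idf = 1 + math.log(len(docs_tokens) / k) if k else 0.0
    return tf * idf

docs = [["gold", "truck"], ["silver", "silver", "truck"], ["gold", "fire"]]
w = tfidf("silver", docs[1], docs)
print(round(w, 3))  # 2 * (1 + log(3/1)) -> 4.197
```

Note how "silver" scores highly here for both reasons at once: it occurs twice in the document (high TF) and in only one of the three documents (high IDF).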
How to Measure Similarity?
Given a document Di = (wi1, ..., wiN) and a query Q = (wq1, ..., wqN), where w = 0 if a term is absent:
Dot product similarity: sim(Q, Di) = sum over j = 1..N of (wqj * wij)
Cosine similarity (normalized dot product):
sim(Q, Di) = [sum over j of (wqj * wij)] / [sqrt(sum over j of wqj^2) * sqrt(sum over j of wij^2)]
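As a sketch, the dot-product and cosine measures can be implemented directly over weight vectors:

```python
import math

def cosine_similarity(q, d):
    """Normalized dot product between a query vector and a document vector."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = [1.0, 1.0, 0.0]
d = [2.0, 2.0, 0.0]  # same direction as q, so cosine = 1 regardless of length
print(round(cosine_similarity(q, d), 6))  # 1.0
```

The normalization is what distinguishes cosine from the plain dot product: a document twice as long, with the same word-usage proportions, gets the same cosine score.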
VS Example: Raw TF & Dot Product
query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"

Raw TF counts, with TF*IDF weights in parentheses (fake IDF values):

Term       IDF    Doc1     Doc2     Doc3     Query
Info.      2.4    2 (4.8)  1 (2.4)  -        1 (2.4)
Retrieval  4.5    1 (4.5)  -        -        1 (4.5)
Travel     2.8    -        2 (5.6)  -        -
Map        3.3    -        1 (3.3)  -        -
Search     2.1    1 (2.1)  -        -        -
Engine     5.4    1 (5.4)  -        -        -
Govern.    2.2    -        -        1 (2.2)  -
President  3.2    -        -        1 (3.2)  -
Congress   4.3    -        -        1 (4.3)  -

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
Example
Q: "gold silver truck"
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
Document frequency of the j-th term: dfj
Inverse document frequency: idfj = log10(n / dfj)
TF*IDF is used as the term weight here.
Example (Cont'd)

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176
Information Retrieval and Vector Space
Model
Example (Cont'd)
TF*IDF weights (t1-t11 are the terms with ids 1-11 above):

doc  t1  t2     t3     t4     t5     t6     t7  t8  t9     t10    t11
D1   0   0      0.477  0      0.477  0.176  0   0   0      0.176  0
D2   0   0.176  0      0.477  0      0      0   0   0.954  0      0.176
D3   0   0.176  0      0      0      0.176  0   0   0      0.176  0.176
Q    0   0      0      0      0      0.176  0   0   0.477  0      0.176

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0.477)(0) + (0)(0.176) + (0.176)(0) ≈ 0.031
SC(Q, D2) ≈ 0.486
SC(Q, D3) ≈ 0.062
The ranking would be D2, D3, D1.
This SC (similarity coefficient) uses the dot product.
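A sketch that reproduces the slide's numbers (raw TF times idf = log10(n/df), dot-product SC):

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Document frequency and idf = log10(n/df) over the collection vocabulary
vocab = sorted({w for text in docs.values() for w in text.split()})
n = len(docs)
df = {t: sum(1 for text in docs.values() if t in text.split()) for t in vocab}
idf = {t: math.log10(n / df[t]) for t in vocab}

def weights(text):
    """Raw-TF * idf weight vector over the vocabulary."""
    tf = Counter(text.split())
    return [tf[t] * idf[t] for t in vocab]

q = weights(query)
scores = {d: sum(wq * wd for wq, wd in zip(q, weights(text)))
          for d, text in docs.items()}
for d in sorted(scores, key=scores.get, reverse=True):
    print(d, round(scores[d], 3))
# D2 0.486
# D3 0.062
# D1 0.031
```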
Advantages of VS Model
Empirically effective! (top TREC performance)
Intuitive
Easy to implement
Well-studied / most evaluated
The SMART system: developed at Cornell, 1960-1999; still widely used
Warning: many variants of TF-IDF!
Disadvantages of VS Model
Assumes term independence
Assumes query and document are represented the same way
Lots of parameter tuning!
Improving the VSM Model
We can improve the model by:
Reducing the number of dimensions:
eliminating all stop words and very common terms
stemming terms to their roots
Latent Semantic Analysis
Not retrieving documents below a defined cosine threshold
Normalizing term frequencies in documents and queries [1]
[Figures: the normalized document-frequency and query-frequency formulas from [1] appeared here and are not reproduced.]
Stop List
Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, ...
A stop list contains stop words that are not to be used as index terms:
Prepositions
Articles
Pronouns
Some adverbs and adjectives
Some frequent words (e.g., "document")
The removal of stop words usually improves IR effectiveness.
A few "standard" stop lists are commonly used.
Stemming
Reason: different word forms may bear similar meaning (e.g., search, searching): create a "standard" representation for them.
Stemming: removing some endings of words
dancer, dancers, danced, dancing → dance
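A toy suffix-stripping stemmer in the spirit of this example; it is illustrative only, not the Porter algorithm, and like real stemmers it does not guarantee that every form collapses onto the same stem:

```python
def naive_stem(word):
    """Toy suffix stripper: remove the first matching suffix, keeping at
    least a 3-letter stem. Real systems use Porter-style or dictionary-based
    stemmers."""
    for suffix in ("ing", "ers", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("dancer", "dancers", "dance", "danced", "dancing"):
    print(w, "->", naive_stem(w))
# dancer/dancers/danced/dancing all map to "danc"; "dance" is left as-is
```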
Stemming (Cont'd)
Two main methods:
Linguistic/dictionary-based stemming: high stemming accuracy; high implementation and processing costs; higher coverage
Porter-style stemming: lower stemming accuracy; lower implementation and processing costs; lower coverage; usually sufficient for IR
Latent Semantic Indexing (LSI) [3]
Reduces the dimensions of the term-document space
Attempts to solve the synonymy and polysemy problems
Uses Singular Value Decomposition (SVD), which identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
Based on the principle that words that are used in the same contexts tend to have similar meanings.
LSI Process
In general, the process involves:
constructing a weighted term-document matrix
performing a Singular Value Decomposition on the matrix
using the matrix to identify the concepts contained in the text
LSI statistically analyses the patterns of word usage across the entire document collection.
References
[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze, Introduction to Information Retrieval.
[2] https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
[3] https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
[4] https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
[5] https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
[6] https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
[7] https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
[8] https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
[9] https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
[10] Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention ….