Vector Space Model


Computational Linguistics Course

Instructor: Professor Cercone
Presenter: Morteza Zihayat

Information Retrieval and Vector Space Model

Outline

Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model


Introduction to IR

"The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth."
(Lyman & Varian, 2000)

Growth of textual information

How can we help manage and exploit all the information?

Sources: Literature, Email, WWW, Desktop, News, Intranet, Blog

Information overflow


What is Information Retrieval (IR)?

Narrow sense:
  IR = search-engine technologies (IR = Google, library information systems)
  IR = text matching/classification

Broad sense: IR = text information management
  General problem: how to manage text information?
  How to find useful information? (retrieval)
    Example: Google
  How to organize information? (text classification)
    Example: automatically assign emails to different folders
  How to discover knowledge from text? (text mining)
    Example: discover correlations between events


Formalizing IR Tasks

Vocabulary: V = {w1, w2, …, wT} of a language
Query: q = q1, q2, …, qm, where qi ∈ V
Document: di = di1, di2, …, dimi, where dij ∈ V
Collection: C = {d1, d2, …, dN}
Relevant document set: R(q) ⊆ C
  Generally unknown and user-dependent
  The query provides a "hint" on which documents should be in R(q)
IR: find the approximate relevant document set R'(q)

Source: This slide is borrowed from [1]

Evaluation measures


The quality of many retrieval systems depends on
how well they manage to rank relevant
documents.



How can we evaluate rankings in IR?


IR researchers have developed evaluation measures
specifically designed to evaluate rankings.


Most of these measures combine
precision

and
recall

in a
way that takes account of the ranking.



Precision & Recall


Source: This slide is borrowed from [1]



In other words:



Precision is the percentage of relevant items in the
returned set



Recall is the percentage of all relevant documents in the collection that are in the returned set.
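In set notation (the standard definitions), with Rel the set of relevant documents and Ret the returned set:

Precision = |Rel ∩ Ret| / |Ret|
Recall = |Rel ∩ Ret| / |Rel|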



Evaluating Retrieval Performance

Source: This slide is borrowed from [1]


IR System Architecture

(System diagram: the user interacts through an INTERFACE, issuing a query and relevance judgments; docs are converted to a document representation (Doc Rep) during INDEXING and the query to a query representation (Query Rep); SEARCHING ranks documents against the query (Ranking) and returns results; Feedback uses the judgments for QUERY MODIFICATION.)

Indexing Document

Break documents into words
Stop list
Stemming
Construct index
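As a rough illustration of these indexing steps, here is a minimal Python sketch; the toy stop list, the crude suffix-stripping "stemmer", and the two sample documents are invented for the example (a real system would use a standard stop list and a Porter-style stemmer):

from collections import defaultdict

STOP_WORDS = {"of", "in", "a", "the", "to", "and"}          # toy stop list

def crude_stem(word):
    # toy suffix stripping; a real system would use a Porter-style stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_documents(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():                  # break document into words
            if token in STOP_WORDS:                         # apply the stop list
                continue
            index[crude_stem(token)][doc_id] += 1           # stem, then count
    return index

docs = {"d1": "Shipment of gold damaged in a fire",
        "d2": "Delivery of silver arrived in a silver truck"}
print(dict(index_documents(docs)["silver"]))                # {'d2': 2}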

Searching


Given a query, score documents efficiently


The basic question:


Given a query, how do we know if document A is more
relevant than B?


If document A uses more query words than document B


Word usage in document A is more similar to that in
query


….


We should find a way to compute the relevance between the query and documents

The Notion of Relevance

Relevance ≈ a function of the representations (Rep(q), Rep(d)); three families of models:

  Similarity (different representations & similarity measures):
    Vector space model (Salton et al., 75)
    Prob. distr. model (Wong & Yao, 89)

  Probability of relevance, P(r=1|q,d), r ∈ {0,1} (generative models):
    Regression model (Fox, 83)
    Classical prob. model (Robertson & Sparck Jones, 76) - doc generation
    LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a) - query generation

  Probabilistic inference, P(d→q) or P(q→d) (different inference systems):
    Prob. concept space model (Wong & Yao, 95)
    Inference network model (Turtle & Croft, 91)

Today's lecture: the vector space model

Relevance = Similarity

Assumptions
  Query and document are represented similarly
  A query can be regarded as a "document"
  Relevance(d,q) ≈ similarity(d,q)
  R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = similarity(Rep(q), Rep(d))

Key issues
  How to represent the query/document?  →  the Vector Space Model (VSM)
  How to define the similarity measure?


Vector Space Model (VSM)

The vector space model is one of the most widely used models for ad-hoc retrieval
Used in information filtering, information retrieval, indexing, and relevancy ranking

VSM

Represent a doc/query by a term vector
  Term: basic concept, e.g., a word or phrase
  Each term defines one dimension
  N terms define a high-dimensional space
  E.g., d = (x1, …, xN), where xi is the "importance" of term i

Measure relevance by the distance between the query vector and the document vector in the vector space

VS Model: illustration

(Illustration: documents D1-D11 plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks, together with a query vector; the question marks ask which documents lie closest to the query.)

Some Issues about VS Model

There is no consistent definition for the basic concept (term)
How to assign weights to terms is not prescribed by the model
  Weight in a query indicates the importance of the term


How to Assign Weights?


Different terms have different importance in a text


A term weighting scheme plays an important role
for the similarity measure.


Higher weight = greater impact


We now turn to the question of how to weight
words in the vector space model.




There are three components in a weighting scheme:
  gi: the global weight of the i-th term
  tij: the local weight of the i-th term in the j-th document
  dj: the normalization factor for the j-th document
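These components are usually combined multiplicatively (a common convention; the slide does not state the exact form): the final weight of the i-th term in the j-th document is aij = gi * tij * dj, where dj is typically an inverse length norm, i.e. the weight is divided by the length of the document's weight vector so that long documents are not unduly favoured.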


TF Weighting

Idea: a term is more important if it occurs more frequently in a document
Formulas: let f(t,d) be the frequency count of term t in doc d
  Raw TF: TF(t,d) = f(t,d)
  Log TF: TF(t,d) = log f(t,d)
  Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
Normalization of TF is very important!
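A small Python sketch of the three TF variants above (the toy document is invented; the log variant is guarded at zero counts, which the slide's formula leaves implicit):

import math
from collections import Counter

doc = "information retrieval search engine information".split()   # toy document
freq = Counter(doc)                       # f(t, d)
max_freq = max(freq.values())             # MaxFreq(d)

def raw_tf(t):
    return freq[t]                                    # TF(t,d) = f(t,d)

def log_tf(t):
    return math.log(freq[t]) if freq[t] > 0 else 0.0  # TF(t,d) = log f(t,d)

def max_norm_tf(t):
    return 0.5 + 0.5 * freq[t] / max_freq             # 0.5 + 0.5*f(t,d)/MaxFreq(d)

for t in ("information", "retrieval"):
    print(t, raw_tf(t), round(log_tf(t), 3), max_norm_tf(t))
# information 2 0.693 1.0
# retrieval 1 0.0 0.75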

TF Methods



IDF Weighting

Idea: a term is more discriminative if it occurs in fewer documents

Formula: IDF(t) = 1 + log(n/k)
  n: total number of docs
  k: number of docs containing term t (document frequency)
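For example, with n = 1000 documents and a term t appearing in k = 10 of them, IDF(t) = 1 + log(1000/10) = 1 + 2 = 3 using base-10 logarithms (the slide does not fix the base; changing it only rescales the weights).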

IDF weighting Methods



TF Normalization


Why?


Document length variation


“Repeated occurrences” are less informative than the “first
occurrence”


Two views of document length


A doc is long because it uses more words
A doc is long because it has more content
Generally penalize long docs, but avoid over-penalizing

TF-IDF Weighting

TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  Common in doc → high TF → high weight
  Rare in collection → high IDF → high weight
Imagine a word-count profile: what kind of terms would have high weights?

How to Measure Similarity?

Given a document Di = (wi1, …, wiN) and a query Q = (wq1, …, wqN), with weight 0 if a term is absent:

Dot product similarity: S(Q, Di) = Σj wqj * wij

Cosine similarity: sim(Q, Di) = Σj wqj * wij / ( sqrt(Σj wqj^2) * sqrt(Σj wij^2) )   (normalized dot product)
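A minimal Python sketch of both measures on sparse term-weight dictionaries; the example weights reuse the query and doc1 weights from the raw-TF example on the next slide, so the dot product reproduces Sim(q,doc1):

import math

def dot(q, d):
    # dot product; terms missing from a vector contribute weight 0
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    # normalized dot product
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot(q, d) / (norm_q * norm_d) if norm_q and norm_d else 0.0

q    = {"information": 2.4, "retrieval": 4.5}
doc1 = {"information": 4.8, "retrieval": 4.5, "search": 2.1, "engine": 5.4}
print(round(dot(q, doc1), 2), round(cosine(q, doc1), 3))    # 31.77 0.711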


VS Example: Raw TF & Dot Product

query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"

Term weights, shown as tf (tf*idf), with fake IDF values:

IDF (fake):  Info. 2.4, Retrieval 4.5, Travel 2.8, Map 3.3, Search 2.1, Engine 5.4, Govern. 2.2, President 3.2, Congress 4.3
Query:  Info. 1 (2.4), Retrieval 1 (4.5)
Doc1:   Info. 2 (4.8), Retrieval 1 (4.5), Search 1 (2.1), Engine 1 (5.4)
Doc2:   Info. 1 (2.4), Travel 2 (5.6), Map 1 (3.3)
Doc3:   Govern. 1 (2.2), President 1 (3.2), Congress 1 (4.3)

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
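Working these out: Sim(q,doc1) = 11.52 + 20.25 = 31.77, Sim(q,doc2) = 5.76, and Sim(q,doc3) = 0, so doc1 is ranked first, then doc2, then doc3.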

Example

Q: "gold silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"

Document frequency of the j-th term: dfj
• Inverse document frequency: idfj = log10(n / dfj)
tf*idf is used as the term weight here

Example (Cont'd)

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176

Example (Cont'd)

tf*idf is used here:

doc  t1  t2     t3     t4     t5     t6     t7  t8  t9     t10    t11
D1   0   0      0.477  0      0.477  0.176  0   0   0      0.176  0
D2   0   0.176  0      0.477  0      0      0   0   0.954  0      0.176
D3   0   0.176  0      0      0      0.176  0   0   0      0.176  0.176
Q    0   0      0      0      0      0.176  0   0   0.477  0      0.176

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0.477)(0) + (0)(0.176) + (0.176)(0) ≈ 0.031
SC(Q, D2) = 0.486
SC(Q, D3) = 0.062

The ranking would be D2, D3, D1.
• This SC uses the dot product.
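The scores above can be reproduced with a short Python sketch of the same computation (raw term frequency as tf, idf = log10(n/df), dot-product scoring):

import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# document frequency and idf = log10(n / df), as on the previous slides
df = Counter()
for text in docs.values():
    df.update(set(text.split()))
n = len(docs)
idf = {t: math.log10(n / df[t]) for t in df}

def tfidf(text):
    """Sparse tf*idf vector, with raw term frequency as tf."""
    tf = Counter(text.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

q_vec = tfidf(query)
for name, text in docs.items():
    d_vec = tfidf(text)
    sc = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())   # dot product
    print(name, round(sc, 3))   # prints D1 0.031, D2 0.486, D3 0.062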


Advantages of VS Model

Empirically effective! (Top TREC performance)
Intuitive
Easy to implement
Well studied / most evaluated
  The SMART system
    Developed at Cornell: 1960-1999
    Still widely used
Warning: many variants of TF-IDF!

Disadvantages of VS Model

Assumes term independence
Assumes the query and documents are represented in the same way
Lots of parameter tuning!


Improving the VS Model

We can improve the model by:
  Reducing the number of dimensions
    eliminating all stop words and very common terms
    stemming terms to their roots
    Latent Semantic Analysis
  Not retrieving documents below a defined cosine threshold
  Normalizing the term frequencies of documents and queries, as given in [1] (one common form is sketched below)
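The exact normalization formulas from [1] are not reproduced in these notes; a common form, consistent with the maximum-frequency normalization shown earlier (an assumption, not necessarily the formula used in [1]), is:

  Document terms: tfij = freqij / MaxFreq(dj)
  Query terms: tfiq = 0.5 + 0.5 * freqiq / MaxFreq(q)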

Stop List

Function words do not bear useful information for IR
  of, not, to, or, in, about, with, I, be, …
Stop list: contains stop words, which are not used as index terms
  Prepositions
  Articles
  Pronouns
  Some adverbs and adjectives
  Some frequent words (e.g., "document")
The removal of stop words usually improves IR effectiveness
A few "standard" stop lists are commonly used.

Stemming

Reason:
  Different word forms may bear similar meanings (e.g., search, searching): create a "standard" representation for them

Stemming:
  Removing some endings of words, e.g.:
    dancer, dancers, dance, danced, dancing  →  dance

Stemming (Cont'd)

Two main methods:
  Linguistic/dictionary-based stemming
    high stemming accuracy
    high implementation and processing costs, and higher coverage
  Porter-style stemming
    lower stemming accuracy
    lower implementation and processing costs, and lower coverage
    usually sufficient for IR
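As a quick illustration of Porter-style stemming, here is a sketch using NLTK's PorterStemmer (one widely available implementation), applied to the dancer example from the previous slide:

from nltk.stem import PorterStemmer   # requires the nltk package

stemmer = PorterStemmer()
for word in ["dancer", "dancers", "dance", "danced", "dancing"]:
    print(word, "->", stemmer.stem(word))
# Several of the surface forms collapse to a shared stem, so they will
# match each other at indexing and retrieval time.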

Latent Semantic Indexing (LSI) [3]

Reduces the dimensionality of the term-document space
Attempts to address synonymy and polysemy
Uses Singular Value Decomposition (SVD)
  identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
Based on the principle that words that are used in the same contexts tend to have similar meanings.

LSI Process

In general, the process involves:
  constructing a weighted term-document matrix
  performing a Singular Value Decomposition on the matrix
  using the matrix to identify the concepts contained in the text

LSI statistically analyses the patterns of word usage across the entire document collection.
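A minimal numeric sketch of these steps using NumPy's SVD; the tiny term-document matrix, the choice of k = 2 concept dimensions, and the query-folding step are illustrative assumptions rather than part of the slides:

import numpy as np

# rows = terms, columns = documents (e.g., tf-idf weights)
A = np.array([
    [1.0, 0.0, 1.0],   # "gold"
    [0.0, 2.0, 0.0],   # "silver"
    [0.0, 1.0, 1.0],   # "truck"
    [1.0, 0.0, 1.0],   # "shipment"
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * Vt
k = 2                                              # keep the top-k "concepts"
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_concepts = np.diag(s_k) @ Vt_k                 # documents in concept space
q = np.array([1.0, 0.0, 1.0, 0.0])                 # query "gold truck" as a term vector
q_concepts = q @ U_k @ np.diag(1.0 / s_k)          # fold the query into the same space
print(doc_concepts.round(2))
print(q_concepts.round(2))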

References

Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299

Thanks For Your Attention ….
