# Vector Space Model

Dec 4, 2013

Computational Linguistics Course

Instructor: Professor Cercone

Presenter: Morteza Zihayat

Information Retrieval and Vector Space Model

Outline

- Introduction to IR
- IR System Architecture
- Vector Space Model (VSM)
- How to Assign Weights?
- TF-IDF Weighting
- Example of VS Model
- Improving the VS Model


Introduction to IR

> The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth.
>
> (Lyman & Varian 00)

Growth of textual information

How can we help manage and exploit all the information?

- Literature
- Email
- WWW
- Desktop
- News
- Intranet
- Blog

Information overflow


What is Information Retrieval (IR)?

Narrow sense:

- IR = search engine technologies (e.g. Google, a library information system)
- IR = text matching/classification

Broad sense: IR = text information management:

- General problem: how to manage text information?
- How to find useful information? (retrieval)
- How to organize information? (text classification)
  - Example: automatically assign emails to different folders
- How to discover knowledge from text? (text mining)
  - Example: discover correlations between events


- Vocabulary: V = {w1, w2, …, wT} of a language
- Query: q = q1, q2, …, qm, where each qi ∈ V
- Document: di = di1, di2, …, di,mi, where each dij ∈ V
- Collection: C = {d1, d2, …, dN}
- Relevant document set: R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query provides a "hint" on which documents should be in R(q)
- IR: find the approximate relevant document set R'(q)

Source: This slide is borrowed from [1]

Evaluation measures

- The quality of many retrieval systems depends on how well they manage to rank relevant documents.
- How can we evaluate rankings in IR?
- IR researchers have developed evaluation measures specifically designed to evaluate rankings.
- Most of these measures combine precision and recall in a way that takes account of the ranking.

Precision & Recall

Source: This slide is borrowed from [1]

In other words:

- Precision is the percentage of items in the returned set that are relevant.
- Recall is the percentage of all relevant documents in the collection that are in the returned set.

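As a quick sketch of these two definitions (the function name and toy document IDs are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for an unranked returned set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 returned docs are relevant; 3 of the 6 relevant docs were found
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)  # 0.75 0.5
```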

Evaluating Retrieval Performance

Source: This slide is borrowed from [1]

IR System Architecture

(Figure: the user interacts with the system through an INTERFACE, issuing a query and receiving results, and providing relevance judgments. INDEXING turns the docs into a document representation (Doc Rep) and the query into a query representation (Query Rep); SEARCHING ranks documents against the query; user judgments feed a Feedback component that drives QUERY MODIFICATION.)

Indexing Documents

- Break documents into words
- Stop list
- Stemming
- Construct index

Searching

Given a query, score documents efficiently.

The basic question: given a query, how do we know if document A is more relevant than document B?

- Document A uses more query words than document B
- Word usage in document A is more similar to that in the query
- …

We should find a way to compute the relevance between a query and documents.

The Notion of Relevance

(Figure: a taxonomy of retrieval models.)

- Relevance ≈ Similarity(Rep(q), Rep(d)), with different representations and similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
- Relevance as P(r=1|q,d) (generative and regression models):
  - Regression model (Fox 83)
  - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
  - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Probabilistic inference P(d→q), with different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)

Today's lecture: the vector space model.

Relevance = Similarity

Assumptions:

- Query and document are represented similarly
- A query can be regarded as a "document"
- Relevance(d,q) ≈ similarity(d,q)

R(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) = similarity(Rep(q), Rep(d))

Key issues:

- How to represent a query/document? Vector Space Model (VSM)
- How to define the similarity measure?


Vector Space Model (VSM)

- The vector space model is one of the most widely used models for ad-hoc retrieval.
- Used in information filtering, information retrieval, indexing, and relevancy rankings.

VSM

- Represent a doc/query by a term vector
  - Term: a basic concept, e.g. a word or phrase
  - Each term defines one dimension
  - N terms define a high-dimensional space
  - E.g. d = (x1, …, xN), where xi is the "importance" of term i
- Measure relevance by the distance between the query vector and the document vector in the vector space
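A minimal sketch of this representation, using a hypothetical four-term vocabulary (names and toy text are mine):

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Represent a text as a raw term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]  # one dimension per term

vocab = ["information", "retrieval", "travel", "map"]
print(term_vector("information retrieval information", vocab))  # [2, 1, 0, 0]
```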

VS Model: illustration

(Figure: documents D1–D11 and a query plotted in a three-dimensional term space with axes Java, Microsoft, and Starbucks; "?" marks documents whose relevance to the query is to be judged by their distance from the query vector.)

- There is no consistent definition of the "basic concept" (term)
- How to assign weights to words has not been determined
  - The weight in the query indicates the importance of a term


How to Assign Weights?

- Different terms have different importance in a text
- A term weighting scheme plays an important role in the similarity measure
  - Higher weight = greater impact
- We now turn to the question of how to weight words in the vector space model

There are three components in a weighting scheme:

- g_i: the global weight of the i-th term
- t_ij: the local weight of the i-th term in the j-th document
- d_j: the normalization factor for the j-th document


TF Weighting

- Idea: a term is more important if it occurs more frequently in a document
- Formulas: let f(t,d) be the frequency count of term t in doc d
  - Raw TF: TF(t,d) = f(t,d)
  - Log TF: TF(t,d) = log f(t,d)
  - Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
- Normalization of TF is very important!
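The three TF formulas above can be sketched as follows (the function name is mine; raw whitespace tokenization assumed):

```python
import math
from collections import Counter

def tf_variants(term, doc_tokens):
    """Raw, log, and max-normalized TF for one term, per the slide's formulas."""
    counts = Counter(doc_tokens)
    f = counts[term]
    raw = f
    log_tf = math.log(f) if f > 0 else 0.0           # log TF undefined at f = 0
    max_norm = 0.5 + 0.5 * f / max(counts.values())  # 0.5 + 0.5*f(t,d)/MaxFreq(d)
    return raw, log_tf, max_norm

doc = "information retrieval information search".split()
print(tf_variants("information", doc))  # raw=2, log≈0.693, max-norm=1.0
```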

TF Methods


IDF Weighting

- Idea: a term is more discriminative if it occurs in fewer documents
- Formula: IDF(t) = 1 + log(n/k)
  - n: total number of docs
  - k: number of docs containing term t (document frequency)

IDF Weighting Methods

TF Normalization

- Why?
  - Document length variation
  - "Repeated occurrences" are less informative than the "first occurrence"
- Two views of document length:
  - A doc is long because it uses more words
  - A doc is long because it has more content
- Generally penalize long docs, but avoid over-penalizing

TF-IDF Weighting

- TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
  - Common in the doc → high TF → high weight
  - Rare in the collection → high IDF → high weight
- Imagine a word-count profile: what kinds of terms would have high weights?
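A hedged sketch combining raw TF with the IDF formula above (the natural-log base is my assumption; the slide leaves the base unspecified, and the base only rescales weights):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, collection):
    """weight(t,d) = TF(t,d) * IDF(t), with raw TF and IDF(t) = 1 + log(n/k)."""
    tf = Counter(doc_tokens)[term]
    n = len(collection)
    k = sum(1 for d in collection if term in d)  # document frequency of the term
    idf = 1 + math.log(n / k) if k else 0.0
    return tf * idf

# Toy collection of token lists; "gold" appears in 2 of 3 docs.
docs = [["gold", "fire"], ["silver", "truck"], ["gold", "truck"]]
w = tf_idf("gold", docs[0], docs)  # tf=1, idf = 1 + ln(3/2)
print(w)
```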

How to Measure Similarity?

Given a document $D_i = (w_{i1}, \ldots, w_{iN})$ and a query $Q = (w_{q1}, \ldots, w_{qN})$, with $w = 0$ if a term is absent:

Dot product similarity:

$$S(Q, D_i) = \sum_{j=1}^{N} w_{qj}\, w_{ij}$$

Cosine (normalized dot product):

$$\mathrm{sim}(Q, D_i) = \frac{\sum_{j=1}^{N} w_{qj}\, w_{ij}}{\sqrt{\sum_{j=1}^{N} w_{qj}^{2}}\,\sqrt{\sum_{j=1}^{N} w_{ij}^{2}}}$$
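Both measures can be sketched directly from the formulas (the toy weight vectors are illustrative):

```python
import math

def dot(q, d):
    """Dot-product similarity between equal-length weight vectors."""
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    """Cosine similarity: dot product normalized by both vector lengths."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(d, d))
    return dot(q, d) / norm if norm else 0.0

q = [2.4, 4.5, 0.0]
d = [4.8, 4.5, 2.1]
print(dot(q, d))     # ≈ 31.77
print(cosine(q, d))  # ≈ 0.90
```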


VS Example: Raw TF & Dot Product

query = "information retrieval"
doc1 = "information retrieval search engine information"
doc2 = "travel information map travel"
doc3 = "government president congress"

Cell entries are TF(weight), where weight = TF × (fake) IDF:

|            | Info.  | Retrieval | Travel | Map    | Search | Engine | Govern. | President | Congress |
|------------|--------|-----------|--------|--------|--------|--------|---------|-----------|----------|
| IDF (fake) | 2.4    | 4.5       | 2.8    | 3.3    | 2.1    | 5.4    | 2.2     | 3.2       | 4.3      |
| Doc1       | 2(4.8) | 1(4.5)    |        |        | 1(2.1) | 1(5.4) |         |           |          |
| Doc2       | 1(2.4) |           | 2(5.6) | 1(3.3) |        |        |         |           |          |
| Doc3       |        |           |        |        |        |        | 1(2.2)  | 1(3.2)    | 1(4.3)   |
| Query      | 1(2.4) | 1(4.5)    |        |        |        |        |         |           |          |

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5

Sim(q,doc2) = 2.4*2.4

Sim(q,doc3) = 0
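The example can be checked programmatically; the sketch below rebuilds the weighted vectors from the fake IDF values and takes dot products (variable names are mine):

```python
# Fake IDF values from the slide's table.
idf = {"information": 2.4, "retrieval": 4.5, "travel": 2.8, "map": 3.3,
       "search": 2.1, "engine": 5.4, "government": 2.2, "president": 3.2,
       "congress": 4.3}

def weigh(text):
    """Weighted vector as a dict: raw TF times fake IDF."""
    tokens = text.split()
    return {t: tokens.count(t) * idf[t] for t in set(tokens)}

def dot(q, d):
    return sum(w * d.get(t, 0.0) for t, w in q.items())

q = weigh("information retrieval")
doc1 = weigh("information retrieval search engine information")
doc2 = weigh("travel information map travel")
doc3 = weigh("government president congress")
print(dot(q, doc1), dot(q, doc2), dot(q, doc3))  # ≈ 31.77, 5.76, 0.0
```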


Example

Q: "gold silver truck"

- D1: "Shipment of gold damaged in a fire"
- D2: "Delivery of silver arrived in a silver truck"
- D3: "Shipment of gold arrived in a truck"

- df_j: document frequency of the j-th term
- Inverse document frequency: idf_j = log10(n / df_j)
- tf*idf is used as the term weight here


Example (Cont'd)

| Id | Term     | df | idf   |
|----|----------|----|-------|
| 1  | a        | 3  | 0     |
| 2  | arrived  | 2  | 0.176 |
| 3  | damaged  | 1  | 0.477 |
| 4  | delivery | 1  | 0.477 |
| 5  | fire     | 1  | 0.477 |
| 6  | gold     | 2  | 0.176 |
| 7  | in       | 3  | 0     |
| 8  | of       | 3  | 0     |
| 9  | silver   | 1  | 0.477 |
| 10 | shipment | 2  | 0.176 |
| 11 | truck    | 2  | 0.176 |


Example (Cont'd)

tf*idf is used here.

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) = 0.031

SC(Q, D2) = 0.486

SC(Q, D3) = 0.062

The ranking would be D2, D3, D1.

- This similarity coefficient (SC) uses the dot product.

| doc | t1 | t2    | t3    | t4    | t5    | t6    | t7 | t8 | t9    | t10   | t11   |
|-----|----|-------|-------|-------|-------|-------|----|----|-------|-------|-------|
| D1  | 0  | 0     | 0.477 | 0     | 0.477 | 0.176 | 0  | 0  | 0     | 0.176 | 0     |
| D2  | 0  | 0.176 | 0     | 0.477 | 0     | 0     | 0  | 0  | 0.954 | 0     | 0.176 |
| D3  | 0  | 0.176 | 0     | 0     | 0     | 0.176 | 0  | 0  | 0     | 0.176 | 0.176 |
| Q   | 0  | 0     | 0     | 0     | 0     | 0.176 | 0  | 0  | 0.477 | 0     | 0.176 |

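The full example can be recomputed end-to-end; this sketch rebuilds the idf table (base-10 log, per the slide; documents lowercased) and ranks by dot product:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Vocabulary and document frequencies over the 3-doc collection.
vocab = sorted({t for text in docs.values() for t in text.split()})
df = {t: sum(t in d.split() for d in docs.values()) for t in vocab}
idf = {t: math.log10(len(docs) / df[t]) for t in vocab}

def weights(text):
    """tf * idf weight for every vocabulary term."""
    toks = text.split()
    return {t: toks.count(t) * idf[t] for t in vocab}

qw = weights(query)
scores = {}
for name, text in docs.items():
    dw = weights(text)
    scores[name] = sum(qw[t] * dw[t] for t in vocab)  # dot product

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D2', 'D3', 'D1']
```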


Advantages of the VS Model

- Empirically effective! (Top TREC performance)
- Intuitive
- Easy to implement
- Well-studied / most evaluated
- The SMART system
  - Developed at Cornell: 1960-1999
  - Still widely used
- Warning: many variants of TF-IDF!

Disadvantages of the VS Model

- Assumes term independence
- Assumes query and document are represented the same way
- Lots of parameter tuning!


Improving the VS Model

We can improve the model by:

- Reducing the number of dimensions
  - eliminating all stop words and very common terms
  - stemming terms to their roots
  - Latent Semantic Analysis
- Not retrieving documents below a defined cosine threshold
- Normalizing term frequencies: the normalized frequency of a term i in document j (and of a query term) is given in [1]

Stop List

- Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
- Stop list: contains stop words, which are not used as index terms
  - Prepositions
  - Articles
  - Pronouns
  - Some frequent words (e.g. "document")
- The removal of stop words usually improves IR effectiveness
- A few "standard" stop lists are commonly used.
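A minimal sketch of stop-word removal (the stop list here is a tiny illustrative subset, not one of the "standard" lists):

```python
# Tiny illustrative stop list; real systems use much larger standard lists.
STOP_WORDS = {"of", "not", "to", "or", "in", "about", "with", "i", "be", "a", "the"}

def remove_stop_words(tokens):
    """Drop stop words before indexing."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("shipment of gold damaged in a fire".split()))
# ['shipment', 'gold', 'damaged', 'fire']
```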

Stemming

- Reason: different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them
- Stemming: removing some endings of words, e.g.
  - dancer, dancers, dance, danced, dancing → dance
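A crude suffix-stripping sketch in the Porter style (illustrative only, far simpler than the real Porter algorithm; note it yields the stem "danc" rather than the dictionary word "dance"):

```python
# Strip common endings, longest first; keep at least a 3-letter stem.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def crude_stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([crude_stem(w) for w in ["dancer", "dancers", "danced", "dancing"]])
# ['danc', 'danc', 'danc', 'danc']
```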

Stemming (Cont'd)

Two main methods:

- Linguistic/dictionary-based stemming
  - high stemming accuracy
  - high implementation and processing costs, higher coverage
- Porter-style stemming
  - lower stemming accuracy
  - lower implementation and processing costs, lower coverage
  - usually sufficient for IR

Latent Semantic Indexing (LSI) [3]

- Reduces the dimensions of the term-document space
- Attempts to solve the synonymy and polysemy problems
- Uses Singular Value Decomposition (SVD)
  - identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
- Based on the principle that words that are used in the same contexts tend to have similar meanings

LSI Process

In general, the process involves:

- constructing a weighted term-document matrix
- performing a Singular Value Decomposition on the matrix
- using the resulting matrices to identify the concepts contained in the text

LSI statistically analyses the patterns of word usage across the entire document collection.

References

- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_model-updated.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
- Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299