IR Projectx


Oct 21, 2013


Improved TF-IDF Ranker

Presentation by Muralidhar Chouhan

Contents

Introduction
Outline of our approach
Background
o TF-IDF ranker
o Semantic similarity between sentences
Details of our approach
Results
Conclusion
References

Introduction

Traditional information retrieval systems are particularly susceptible to the problems posed by the richness of natural language, in particular the multitude of ways in which the same concept can be described.

The overall context of the user input and of the document is ignored.

The traditional TF-IDF ranker ignores the relatedness of concepts; it searches only for exact word matches.

Introducing a semantic analyzer is expected to improve retrieval performance.





Introduction (cont.)

The aim of the project is to use the traditional TF-IDF ranker together with a semantic analyzer to retrieve documents, and to compare the performance of the new system with that of the traditional TF-IDF ranker.


Introduction (cont.)

This project uses:
o The Text Retrieval Conference (TREC) data set named Confusion Track, for validation [6]
o The WordNet lexical database
o The .NET Framework (WordNet.Net)

[Diagram: Traditional TF-IDF Ranker. Elements: Input Query, Documents, Pre-processor, TF-IDF Ranker, (Doc ID, Weight) pairs, Primary filter, Final Docs]

Outline of our approach

[Diagram: TF-IDF Ranker with introduction of semantic knowledge. Elements: Input Query, Documents, Pre-processor, TF-IDF Ranker, (Doc ID, Weight) pairs, Primary filter, Semantic similarity, Final Docs]

Outline of our approach (cont.)

[Diagram elements: Input Query, Documents, Pre-processor, TF-IDF Ranker II, WordNet semantic analyzer, (Doc ID, Semantic score) pairs, (Doc ID, Keywords) pairs, Corpus of (Word, DF) pairs, Final Docs. Notes in the diagram: find the keywords from each doc; use TF and DF (from the corpus).]


Outline of our approach (cont.)

[Diagram elements: Docs obtained from the traditional TF-IDF approach, Pre-processor (tokenize, remove stop words)]

Outline of our approach (cont.)

Background

TF-IDF ranker:

TF-IDF is used as a weighting factor in information retrieval and text mining.

Terms that appear often in a document should get high weights. The more often a document contains a term, the more likely it is that the document is about that term. This is captured using term frequency (TF).

Terms that appear in many documents should get a low weight, which is captured using inverse document frequency (IDF).

The weight of a term in a document is calculated using the formula below [5], where N is the number of documents in the collection:

$W_{i,j} = TF_{i,j} \times \log(N / DF_i)$
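The formula can be tried out on concrete counts. The following is a minimal Python sketch, not the project's .NET code; the function name `tfidf_weight`, the use of the natural logarithm, and the handling of DF = 0 are assumptions, since the slides do not state them.

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """Weight of a term in a document: TF * log(N / DF).

    Assumptions: natural logarithm; a term with DF == 0 gets weight 0.
    """
    if df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# A term occurring 3 times in a document and in 10 of 1000 documents:
w_rare = tfidf_weight(tf=3, df=10, n_docs=1000)
# The same count for a term that occurs in every document:
w_common = tfidf_weight(tf=3, df=1000, n_docs=1000)
```

A term that appears in every document gets weight 0 regardless of its term frequency, which is the intended effect of the IDF factor.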




Semantic similarity between sentences:

Semantic similarity between sentences is calculated using semantic information and word order information.

This project uses an implementation that calculates the semantic relatedness between two sets of strings.

The implementation uses the WordNet lexical database to calculate the semantic relatedness.

The score lies between 0 and 1, where 0 represents the lowest similarity and 1 the highest.


WordNet:

WordNet is the product of a research project at Princeton University [4].

Information in WordNet is organized around logical groupings called synsets.

Each synset consists of a list of synonymous word forms and semantic pointers that describe relationships between the current synset and other synsets.

In WordNet, the words of each part of speech (nouns, verbs, ...) are organized into taxonomies in which each node is a set of synonyms (a synset) representing one sense.

WordNet (cont.)

If a word has more than one sense, it appears in multiple synsets at various locations in the taxonomy.

WordNet defines relations between synsets and relations between word senses. A relation between synsets is a semantic relation, and a relation between word senses is a lexical relation.

WordNet (cont.)

For example, the shortest path between male and female in Fig. 1 is male-person-female, so the minimum path length is 2. The minimum path length between female and teacher is 5.
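The path lengths in this example can be reproduced with a breadth-first search over a small hierarchy. The sketch below is illustrative only: Fig. 1 is not reproduced in this text, so the intermediate nodes between person and teacher (adult, professional, educator) are assumptions chosen to match the two stated path lengths, and the names `EDGES`, `build_graph`, and `path_length` are hypothetical.

```python
from collections import deque

# Hypothetical fragment of an is-a hierarchy. Only "male", "person",
# "female" and "teacher" are named in the text; the rest is assumed.
EDGES = [
    ("person", "male"), ("person", "female"), ("person", "adult"),
    ("adult", "professional"), ("professional", "educator"),
    ("educator", "teacher"),
]

def build_graph(edges):
    """Build an undirected adjacency map from (parent, child) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def path_length(graph, start, goal):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

GRAPH = build_graph(EDGES)
male_female = path_length(GRAPH, "male", "female")
female_teacher = path_length(GRAPH, "female", "teacher")
```

With this toy hierarchy, `male_female` is 2 and `female_teacher` is 5, matching the values quoted above.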


Details of our approach

Traditional TF-IDF Ranker

Step 1: Preprocess the input query
o Tokenization
o Remove stop words
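The two preprocessing operations can be sketched in a few lines of Python. This is an illustration under assumptions: the stop-word list is a tiny made-up sample, and the lowercase alphanumeric tokenization rule and the name `preprocess` are hypothetical rather than taken from the project.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the one used in the project.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def preprocess(text: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

query_terms = preprocess("The richness of natural language")
```

Here `query_terms` is `["richness", "natural", "language"]`.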


Step 2: Apply the TF-IDF ranker

The TF-IDF ranker counts the number of times each word appears in each of the documents, as shown below.

        D1      D2      D3      ...     DN      DF
W1      TF11    TF12    ...     ...     TF1N    DF1
W2      TF21    TF22    ...     ...     TF2N    DF2
W3      TF31    TF32    ...     ...     TF3N    DF3
:       :       :       :       :       :       :
Wn      TFn1    TFn2    ...     ...     TFnN    DFn

Here $TF_{i,j}$ is the term frequency of word $W_i$ in document $D_j$, and $DF_i$ is the document frequency of word $W_i$ in the document collection.
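A table of this shape can be built from tokenized documents with simple counting. The sketch below assumes documents are already preprocessed into token lists; the function name `build_tf_df` and the sample documents are hypothetical.

```python
from collections import Counter

def build_tf_df(docs: dict[str, list[str]], terms: list[str]):
    """Return (tf, df): tf[term][doc_id] is a count, df[term] a document count."""
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    tf = {t: {doc_id: counts[doc_id][t] for doc_id in docs} for t in terms}
    df = {t: sum(1 for doc_id in docs if counts[doc_id][t] > 0) for t in terms}
    return tf, df

# Made-up sample collection of three tokenized documents.
docs = {
    "D1": ["semantic", "search", "semantic"],
    "D2": ["keyword", "search"],
    "D3": ["ranking", "documents"],
}
tf, df = build_tf_df(docs, ["semantic", "search"])
```

Each entry of `tf` corresponds to one row of the table, and `df` to its last column.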

Details of our approach (cont.)

Calculating the weight:

The weight of each word is calculated using the formula below.

$W_{i,j} = TF_{i,j} \times \log(N / DF_i)$

              D1      D2      D3      ...     DN      DF
W1            W11     W12     ...     ...     W1N     DF1
W2            W21     W22     ...     ...     W2N     DF2
W3            W31     W32     ...     ...     W3N     DF3
:             :       :       :       :       :       :
Wn            Wn1     Wn2     ...     ...     WnN     DFn
Weight sum    S1      S2      ...     ...     SN
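The weight-sum row can be computed by adding up the per-word weights for each document. The following sketch assumes the sum runs over the query words and uses the natural logarithm; both are assumptions, as are the name `document_scores` and the sample counts.

```python
import math

def document_scores(tf, df, n_docs):
    """Sum TF * log(N / DF) over all terms for each document."""
    scores = {}
    for term, per_doc in tf.items():
        if df[term] == 0:
            continue  # a term absent from the collection contributes nothing
        idf = math.log(n_docs / df[term])
        for doc_id, count in per_doc.items():
            scores[doc_id] = scores.get(doc_id, 0.0) + count * idf
    return scores

# Made-up counts for two query words over three documents.
tf = {"semantic": {"D1": 2, "D2": 0, "D3": 0},
      "search": {"D1": 1, "D2": 1, "D3": 0}}
df = {"semantic": 1, "search": 2}
scores = document_scores(tf, df, n_docs=3)
```

In this example D3 contains neither query word, so its weight sum is 0.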



Details of our approach (cont.)

Step 3: Retrieve the documents

Sort all the documents according to their weights. Pick the top Q documents for further processing. Q is chosen such that the weight of each selected document crosses a particular threshold $d_1$.
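The sort-and-threshold step can be sketched as follows. The threshold value, the sample scores, and the name `top_documents` are hypothetical, and treating "crosses" as strictly greater than is an assumption.

```python
def top_documents(scores: dict[str, float], threshold: float) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs whose score exceeds threshold, best first."""
    kept = [(doc_id, s) for doc_id, s in scores.items() if s > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

scores = {"D1": 2.6, "D2": 0.4, "D3": 0.0, "D4": 1.1}
d1 = 1.0  # hypothetical threshold value
selected = top_documents(scores, d1)
```

Here Q is simply the length of `selected`, so it follows from the threshold rather than being fixed in advance.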


Improved TF-IDF Ranker

Step 1: Choose the top S documents from Step 3 of the previous method. Here we use another threshold $d_2$ ($d_2 < d_1$) to get the set of docs for further processing.


Step 2: Extract the keywords (words that have high TF and low DF) from each document.

        Doc     DF      Weight
W1      TF1     DF1     We1
W2      TF2     DF2     We2
W3      TF3     DF3     We3
:       :       :       :
Wn      TFn     DFn     Wen

Details of our approach (cont.)

Corpus containing the IDF (log(N/DF)) of each word from the docs
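Keyword extraction with a precomputed IDF corpus can be sketched as below. This is an illustration, not the project's code: the cut-off `k`, the IDF values, the rule that words missing from the corpus get weight 0, and the name `extract_keywords` are all assumptions.

```python
from collections import Counter

def extract_keywords(tokens, idf, k=3):
    """Return the k tokens with the highest TF * IDF weight in one document."""
    counts = Counter(tokens)
    # Assumption: words missing from the IDF corpus get weight 0.
    weights = {t: c * idf.get(t, 0.0) for t, c in counts.items()}
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:k]

# Hypothetical IDF corpus and one tokenized document.
idf = {"wordnet": 2.0, "synset": 1.5, "ranker": 1.0, "the": 0.0}
tokens = ["wordnet", "synset", "the", "wordnet", "ranker", "the", "the"]
keywords = extract_keywords(tokens, idf, k=2)
```

Although "the" is the most frequent token, its IDF of 0 keeps it out of the keyword list, which is the "high TF and low DF" criterion in action.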

Details of our approach (cont.)

Step 3: For each document, calculate the semantic similarity score between its keyword set and the input query.
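To make the idea of a set-to-set score concrete, here is a deliberately simplified sketch. It is not the WordNet-based implementation the project uses: the averaging of best matches is a swapped-in stand-in technique, and `set_similarity`, `toy_word_sim`, and the `RELATED` table are hypothetical.

```python
def set_similarity(query_terms, keywords, word_sim):
    """Average, over query terms, of the best word_sim against any keyword."""
    if not query_terms or not keywords:
        return 0.0
    best = [max(word_sim(q, k) for k in keywords) for q in query_terms]
    return sum(best) / len(best)

# Hypothetical word-level similarity: 1.0 for identical words, a fixed
# value for pairs in a toy relatedness table, 0.0 otherwise.
RELATED = {frozenset(("car", "automobile")): 0.9}

def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return RELATED.get(frozenset((a, b)), 0.0)

score = set_similarity(["car", "engine"],
                       ["automobile", "engine", "wheel"],
                       toy_word_sim)
```

As long as `word_sim` returns values in [0, 1], the resulting score also lies between 0 and 1. A document that never contains the word "car" can still score well for a query about cars, which exact-match TF-IDF cannot achieve.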








Step 4: Sort the docs with respect to their scores. Eliminate the docs whose score is less than a specified threshold (b = 0.5).
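Step 4 amounts to a sort followed by a filter. In the sketch below, the name `final_documents` and the sample scores are made up; only the threshold b = 0.5 comes from the slides.

```python
def final_documents(semantic_scores, b=0.5):
    """Sort by semantic score (descending) and drop scores below b."""
    ranked = sorted(semantic_scores.items(), key=lambda pair: pair[1], reverse=True)
    return [(doc_id, s) for doc_id, s in ranked if s >= b]

result = final_documents({"D1": 0.82, "D2": 0.35, "D3": 0.5, "D4": 0.67})
```

A document scoring exactly 0.5 is kept here, because only scores less than the threshold are eliminated.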


Step 5: Display the docs.



Confusion Track result set

Results

Results: Old system vs. new system

Results (cont.)

Calculating precision & recall for 10 queries
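For reference, precision and recall for a single query follow the standard definitions and can be computed as in this sketch. The document IDs are made up and do not come from the Confusion Track results; the name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["D1", "D2", "D3", "D4"], ["D1", "D3", "D7"])
```

In this example precision is 2/4 and recall is 2/3.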

Results (cont.)

Precision & recall bar chart: Old system vs. new system

Results (cont.)

[Bar chart. Horizontal axis: 1 to 10. Vertical axis: 0 to 1.2. Series: TF IDF (P), TF IDF (R), Semantic (P), Semantic (R)]
Screenshots

Traditional TF-IDF Ranker

Screenshots (cont.)

Improved TF-IDF Ranker (with semantic knowledge)


Conclusion

This project has improved the traditional TF-IDF ranker by introducing a semantic analyzer.

It showed that using the semantic analyzer yields good precision and recall values.

It used a data set from the Text Retrieval Conference (TREC) to validate the project.

One limitation of the TF-IDF ranker is that terms that occur in the query text but cannot be found in the documents get zero scores.


References

[1] R. Rada, H. Mili, E. Bicknell, and M. Blettner, “Development and Application of a Metric on Semantic Nets,” IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 1, pp. 17-30, 1989.

[2] Y. Li et al., “Sentence Similarity Based on Semantic Nets and Corpus Statistics,” IEEE Trans. on Knowledge and Data Engineering, vol. 18, no. 8, 2006.

[3] T. Dao and T. Simpson, “Measuring similarity between the sentences.” Web.

[4] R. Richardson, A. F. Smeaton, and J. Murphy, “Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words,” School of Computer Applications, Dublin City University. Web.

[5] TfIdf Ranker, http://vetsky.narod2.ru/catalog/tfidf_ranker/. Web.

[6] Confusion track, TREC dataset, http://trec.nist.gov/data/t5_confusion.html. Web.


Thank you