COMP 4321 Search Engines for Web and Corporate Data

4 Dec 2013

Homework 1


1. [40 points]

(a) A new semester has begun, and every new semester students need to take several brand-new courses. However, memorizing each course web page’s URL is a problem. At the end of the first lecture, you want to check out what lab 1 is about. What can you do to reach comp4321’s lab page?



Sol: No standard solution. Some students may choose to bookmark the URL, some may memorize it, and some may use a search engine by typing queries into the search box.


(b) Suppose Professor Lee recently opened a new course named “Information Retrieval”. He spent several days building the course website and slides. After finishing the materials, Professor Lee makes the files accessible online. Several days later, students can find the course website by searching Google for “Information Retrieval HKUST”. We know that there are billions of webpages on the Internet. How can Google detect the new course webpages created by Prof. Lee?


Sol: Google periodically crawls the pages on the Internet, so newly created websites or pages can be detected.
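As a rough illustration of periodic crawling, the sketch below runs a breadth-first crawl twice over a toy in-memory "web" and reports the pages that appear only in the second pass. It assumes new pages are discovered by following links from pages the crawler already knows, which the solution does not spell out. The URLs, the `web` dictionary, and the names `crawl` and `fetch_links` are all hypothetical, and nothing here describes how Google is actually implemented.

```python
from collections import deque

def crawl(seeds, fetch_links):
    """Breadth-first crawl: return every URL reachable from the seed URLs."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen

# Toy "web" (hypothetical URLs): each page maps to the pages it links to.
web = {
    "hkust.example/cse": ["hkust.example/cse/courses"],
    "hkust.example/cse/courses": ["hkust.example/cse/comp4321"],
    "hkust.example/cse/comp4321": [],
}

def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return web.get(url, [])

first_pass = crawl(["hkust.example/cse"], fetch_links)

# A new course page goes online and is linked from the course list.
web["hkust.example/cse/ir"] = []
web["hkust.example/cse/courses"].append("hkust.example/cse/ir")

# The next periodic crawl reaches the new page through that link.
second_pass = crawl(["hkust.example/cse"], fetch_links)
new_pages = second_pass - first_pass
print(new_pages)
```

Comparing the two passes is only one way to spot new pages; the point is that a page becomes visible to the crawler once a periodic pass can reach it.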


(c) Type the query “comp4321 lab” into Google, Bing, and Yahoo, and look at the results returned by the three search engines. Which search engine has the best performance in your experience? How do you judge which search engine is better?


Sol: Google is the best, since its first few results are exactly the information I want to find, whereas neither Bing nor Yahoo returns the page we expect. For evaluation, we can use precision and recall. Of course, students may not use the specific terms “precision” and “recall”, but they may have a general, intuitive idea of them. As long as students provide solutions related to the concepts of precision and recall, we can give them full marks.
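To make the two measures concrete, here is a minimal sketch that computes precision (the fraction of returned results that are relevant) and recall (the fraction of relevant pages that are returned). The page identifiers and relevance judgments are invented for illustration and are not the actual results of any search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgments for the query "comp4321 lab".
relevant = {"labs-index", "lab1", "lab2"}

# Hypothetical top-4 results from two engines.
engine_a = ["labs-index", "lab1", "news", "forum"]
engine_b = ["news", "forum", "blog", "shop"]

p_a, r_a = precision_recall(engine_a, relevant)
p_b, r_b = precision_recall(engine_b, relevant)
print(f"engine A: precision={p_a:.2f}, recall={r_a:.2f}")
print(f"engine B: precision={p_b:.2f}, recall={r_b:.2f}")
```

Under these made-up judgments, engine A returns two of the three relevant pages among its four results, while engine B returns none, which matches the intuitive judgment that A is better.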


(d) In the search result pages P1, P2, and P3 returned from Google, Bing, and Yahoo, respectively, what do you see above the first result? For example, Google displays “About 78 results (0.28 seconds)”. You need to list the information for Bing and Yahoo. From the differences, can you get some idea of what action(s) the search engines performed on the query so that they can better understand the user’s needs?


Sol: In Bing, the result page shows:





In Yahoo, the result page shows:





From the comparison we can see that Bing separates “comp4321” into “comp 4321”. Bing does this separation automatically, as a preprocessing step on the query. Besides separating “comp4321” into “comp” and “4321”, it also compensates for shortcomings in the query. Users cannot always type clear and clean queries, so most of the time a search engine needs to preprocess the queries first.
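The separation step described above can be imitated with a small regular expression. This is only a sketch of one possible preprocessing rule (insert a space at every letter/digit boundary, then normalize case and whitespace); the function name `split_alnum` is hypothetical, and it is not a description of what Bing actually runs.

```python
import re

def split_alnum(query):
    """Insert a space at each letter/digit boundary, then lowercase and collapse spaces."""
    spaced = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", query)
    return " ".join(spaced.lower().split())

print(split_alnum("comp4321 lab"))  # comp 4321 lab
```

A real engine would likely combine several such rules (spelling correction, for instance), which this sketch does not attempt.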


Note that questions (c) and (d) are open questions. The snapshots listed in (d) are the search results of “Yahoo” and “Bing” from a month ago (before Oct 2012). However, after continuous clicking of the URL on the lab page (probably by comp4321 students), “Yahoo” and “Bing” re-ranked our lab page to a high place, so the comp4321 lab page now appears in the results of these search engines.

2. [60 points]

A small document collection contains only the following three short documents:

D1: to be or not to be

D2: to live or not to live

D3: to publish or perish



(a) Draw the inverted file structure (using the style in the lecture slides) for the above three documents, with enough information in the index to support tf*idf ranking (no normalization of tf or idf is needed) and phrase search. All terms are indexed (no stopword removal or stemming is needed); assume that the entire document collection contains only these three documents.


Sol:

Term       Postings
be         D1: 2, 6
live       D2: 2, 6
not        D1: 4       D2: 4
or         D1: 3       D2: 3       D3: 3
perish     D3: 4
publish    D3: 2
to         D1: 1, 5    D2: 1, 5    D3: 1


The numbers in the postings are word positions. Note that in this design, I don’t keep
explicit values for tf and idf, which can be obtained by counting.
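
The design above (positions only, with tf and idf obtained by counting) could be sketched as follows. This is a minimal illustration rather than the structure from the lecture slides: the names `build_index`, `tf`, `idf`, and `phrase_match` are hypothetical, and idf is taken as log2(N/df), the form used in the solution to part (b).

```python
import math
from collections import defaultdict

docs = {
    "D1": "to be or not to be",
    "D2": "to live or not to live",
    "D3": "to publish or perish",
}

def build_index(docs):
    """Map each term to {doc_id: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

index = build_index(docs)

def tf(term, doc_id):
    """Term frequency: the number of positions stored for the term in the document."""
    return len(index.get(term, {}).get(doc_id, []))

def idf(term):
    """Inverse document frequency: log2(N / number of documents containing the term)."""
    return math.log2(len(docs) / len(index[term]))

def phrase_match(first, second, doc_id):
    """True if `second` occurs immediately after `first` in the document."""
    following = set(index.get(second, {}).get(doc_id, []))
    return any(p + 1 in following for p in index.get(first, {}).get(doc_id, []))

for term in sorted(index):
    print(term, index[term])
```

For example, `phrase_match("to", "be", "D1")` checks the stored positions of "to" (1, 5) against those of "be" (2, 6), which is why word positions are enough to support phrase search.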

(b) Give the document vectors for the three documents using tf*idf weights. All words are retained.


       be     live     not     or     perish     publish     to
D1
D2
D3

Sol:

       be           live         not            or             perish         publish        to
D1     2*log2(3)    0            1*log2(3/2)    1*log2(3/3)    0              0              2*log2(3/3)
D2     0            2*log2(3)    1*log2(3/2)    1*log2(3/3)    0              0              2*log2(3/3)
D3     0            0            0              1*log2(3/3)    1*log2(3/1)    1*log2(3/1)    1*log2(3/3)


D1 = <3.17, 0, 0.585, 0, 0, 0, 0>

D2 = <0, 3.17, 0.585, 0, 0, 0, 0>

D3 = <0, 0, 0, 0, 1.585, 1.585, 0>
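
The vectors above can be checked with a short script. This sketch assumes whitespace tokenization, raw term counts for tf, and idf = log2(N/df) as in the table; the helper name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "to be or not to be",
    "D2": "to live or not to live",
    "D3": "to publish or perish",
}

# Vocabulary in alphabetical order: be, live, not, or, perish, publish, to.
vocab = sorted({term for text in docs.values() for term in text.split()})

# Document frequency: the number of documents containing each term.
df = Counter(term for text in docs.values() for term in set(text.split()))

def tfidf_vector(text):
    """Unnormalized tf*idf weights, one per vocabulary term."""
    counts = Counter(text.split())
    return [counts[term] * math.log2(len(docs) / df[term]) for term in vocab]

vectors = {doc_id: tfidf_vector(text) for doc_id, text in docs.items()}
for doc_id, vec in vectors.items():
    print(doc_id, [round(w, 3) for w in vec])
```

Terms that occur in all three documents ("or" and "to") get idf = log2(3/3) = 0, so they contribute nothing to any vector regardless of their tf.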


(c) Compute the inner product, cosine, and Jaccard similarity values between the documents and the query <live, or, publish>.

Sol: Q = <0, 1, 0, 1, 0, 1, 0>

INNER (Q,D1) = 0*3.17+1*0+0*0.585+1*0+0*0+1*0+0*0 = 0

INNER (Q,D2) = 0*0+1*3.17+0*0.585+1*0+0*0+1*0+0*0 = 3.17

INNER(Q,D3) = 0*0+1*0+0*0+1*0+0*1.585+1*1.585+0*0 = 1.585

|Q| = sqrt(1^2 + 1^2 + 1^2) = 1.7321

|D1| = sqrt(3.17^2 + 0.585^2) = 3.224

|D2| = sqrt(3.17^2 + 0.585^2) = 3.224

|D3| = sqrt(1.585^2 + 1.585^2) = 2.242

COS(Q, D1) = (0)/(1.7321*3.224) = 0

COS(Q, D2) = (3.17)/(1.7321*3.224) = 0.568

COS(Q, D3) = (1.585)/(1.7321*2.242) = 0.408

JACCARD(Q, D1) = INNER(Q,D1) / (|Q|^2 + |D1|^2 - INNER(Q,D1)) = 0

JACCARD(Q, D2) = INNER(Q,D2) / (|Q|^2 + |D2|^2 - INNER(Q,D2)) = (3.17)/(3 + 10.394 - 3.17) = 0.31

JACCARD(Q, D3) = INNER(Q,D3) / (|Q|^2 + |D3|^2 - INNER(Q,D3)) = (1.585)/(3 + 5.027 - 1.585) = 0.25
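
The three measures can be recomputed with the sketch below, which uses the rounded document vectors from part (b) and the binary query vector Q. The function names are hypothetical, and `jaccard` follows the formula used in this solution, INNER / (|Q|^2 + |D|^2 - INNER).

```python
import math

def inner(q, d):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, d))

def cosine(q, d):
    """Inner product divided by the product of the vector lengths."""
    denom = math.sqrt(inner(q, q)) * math.sqrt(inner(d, d))
    return inner(q, d) / denom if denom else 0.0

def jaccard(q, d):
    """Jaccard similarity for weighted vectors, as defined in the solution."""
    dot = inner(q, d)
    denom = inner(q, q) + inner(d, d) - dot
    return dot / denom if denom else 0.0

# Term order: be, live, not, or, perish, publish, to.
Q = [0, 1, 0, 1, 0, 1, 0]
D = {
    "D1": [3.17, 0, 0.585, 0, 0, 0, 0],
    "D2": [0, 3.17, 0.585, 0, 0, 0, 0],
    "D3": [0, 0, 0, 0, 1.585, 1.585, 0],
}

for name, vec in D.items():
    print(name,
          round(inner(Q, vec), 3),
          round(cosine(Q, vec), 3),
          round(jaccard(Q, vec), 2))
```

All three measures rank the documents in the same order for this query: D2 first, then D3, then D1.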