Searching the Web

cowphysicist, Internet and Web Development

Dec 4, 2013

1

Announcements


Research Paper due today


Research Talks



Nov. 29 (Monday) Kayatana and Lance


Dec. 1 (Wednesday) Mark and Jeremy


Dec. 3 (Friday) Joe and Anton


Dec. 6 (Monday) Colin and Paul




2

Web Search

Lecture 23

3

Searching the Web


Only search what is indexed


1999: 800 million documents indexed by
Northern Light [7]


Largest index: 16% of the indexable web


2004: about 8 billion URLs indexed by Google
[1]


Largest index: ?% of the indexable web

4

Visualizing the Web


View the web as a directed graph of
nodes and edges


set of abstract nodes (the pages)


joined by directional edges (the
hyperlinks)


Structure provides significant insight
about the content


5

Example Graph [6]

6

Citation Analysis[2]


Use structure to identify important, or
prominent, nodes


Garfield’s impact factor


Quantitative “score” for each journal
proportional to the average number of
citations per paper published in the previous
two years


More heavily cited journals have more overall
impact on a field


Consider it better to receive citations from
an important journal

7

Influence Weights


Pinski and Narin’s notion of influence weights


strength of the connection from one journal to
another



percentage of citations in the first journal that refer
to the second


equilibrium: the weight of each journal J is equal
to the sum of the weights of all journals citing J
(scaled by the strengths of the connections)


If a journal receives regular citations from other
journals of large weight, it will acquire large weight
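
The equilibrium described above can be sketched as a repeated averaging step. This is a minimal illustration, not Pinski and Narin's own procedure: the function name `influence_weights`, the fixed iteration count, and the citation counts are all made up for the example, and every journal is assumed to cite at least one other journal.

```python
def influence_weights(citations, iterations=100):
    """citations[i][j] = number of citations in journal i that refer to journal j."""
    n = len(citations)
    # Connection strength: share of journal i's citations that go to journal j.
    strength = [[c / sum(row) for c in row] for row in citations]
    weights = [1.0 / n] * n
    for _ in range(iterations):
        # Weight of journal j = sum of the weights of the journals citing it,
        # each scaled by the strength of its connection to j.
        weights = [
            sum(weights[i] * strength[i][j] for i in range(n)) for j in range(n)
        ]
    return weights


# Hypothetical citation counts among three journals (illustrative numbers only).
citations = [
    [0, 8, 2],
    [6, 0, 4],
    [5, 5, 0],
]
weights = influence_weights(citations)
```

Because each row of strengths sums to 1, the weights keep summing to 1 while they settle toward the equilibrium.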

8

On the web


Lots of dead-ends in the link structure


Prominent sites may have no links to
outside world


Use “smoothing” operation, giving all
pages a small, positive connection
strength to every other page


Compute equilibrium weights with
respect to modified connection
strengths
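
One way to picture the smoothing step is sketched below. The helper names `smooth` and `equilibrium`, the value of `epsilon`, and the three-page example are assumptions made for illustration; the slides do not say how large the added strength should be.

```python
def smooth(strength, epsilon=0.01):
    """Give every page a small positive strength to every other page."""
    smoothed = []
    for i, row in enumerate(strength):
        # Add epsilon to every off-diagonal entry, then renormalise the row.
        bumped = [x + (epsilon if j != i else 0.0) for j, x in enumerate(row)]
        total = sum(bumped)
        smoothed.append([x / total for x in bumped])
    return smoothed


def equilibrium(strength, iterations=200):
    """Repeatedly redistribute weight along the connection strengths."""
    n = len(strength)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            sum(weights[i] * strength[i][j] for i in range(n)) for j in range(n)
        ]
    return weights


# Hypothetical link strengths; the last page is a dead end with no out-links.
links = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
smoothed = smooth(links)
weights = equilibrium(smoothed)
```

Without the smoothing step the dead-end row sums to zero, so weight that reaches the last page would simply disappear on the next iteration.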

9

Different Model on the Web


Prominent sites do not link to other
prominent sites


Search engines won’t link to other search
engines because they are competitors


Each wants to keep users on its own site


Large collection of pages link to many
prominent sites in a focused manner


act as resource lists and guides to search
engines

10

Hubs and Authorities


Authorities - most prominent sources of
primary content for a topic


Hubs - high-quality guides and resource
lists that direct users to recommended
authorities


Each page is assigned a hub weight and
an authority weight


authority weight - proportional to the sum of
the hub weights of pages that link to it


hub weight - proportional to the sum of the
authority weights of the pages that it links to
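
The two mutually dependent weights can be computed by alternating the two updates and rescaling. The sketch below is a minimal illustration under assumptions: the function name `hits`, the unit-length rescaling, the iteration count, and the tiny link graph are all chosen for the example.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Rescale to unit length so the weights stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical pages: two resource lists pointing at three content sites.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB", "siteC"],
    "siteA": [],
}
hub, auth = hits(links)
```

In this example `guide2` ends up as the strongest hub because it points to every authority, and `siteA` and `siteB` outrank `siteC` because both guides recommend them.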

11

Simplified PageRank Algorithm[5]


Formula used by Google to rank
pages




Let $u$ be a web page


$F_u$ is the set of pages $u$ points to


$B_u$ is the set of pages that point to $u$


$N_u = |F_u|$


$c$ is a factor used for normalization


$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
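
A minimal sketch of the simplified formula, iterated until the ranks settle, with $c = 1$. The function name `simplified_pagerank`, the iteration count, and the three-page graph are assumptions for illustration, and every page is assumed to have at least one out-link (no sinks).

```python
def simplified_pagerank(links, iterations=100):
    """links maps each page u to F_u, the list of pages u points to."""
    pages = list(links)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v in pages:
            n_v = len(links[v])          # N_v = |F_v|
            for u in links[v]:           # v is a member of B_u
                new_rank[u] += rank[v] / n_v
        rank = new_rank                  # c = 1, so no rescaling is applied
    return rank


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(links)
```

For this graph the ranks settle at about A = 0.4, B = 0.2, C = 0.4, which satisfies the formula: B receives half of A's rank, and C receives the other half plus all of B's.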
12

Simplified PageRank Calculation

where $c = 1$

13

PageRank Formula


Account for sinks





Complete Formula




$d$ is empirically set to about 0.15 to 0.2 by the
system


$$R(u) = d + (1 - d) \sum_{v \in B_u} \frac{R(v)}{N_v}$$
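
The sketch below shows how a constant term keeps rank from vanishing at a sink. It assumes the damped form $R(u) = d + (1 - d) \sum_{v \in B_u} R(v)/N_v$ with $d = 0.15$, which is one reading consistent with the stated range for $d$; the function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(links, d=0.15, iterations=100):
    """links maps each page to the list of pages it points to (may be empty)."""
    pages = list(links)
    rank = {u: 1.0 for u in pages}
    for _ in range(iterations):
        # Every page starts each round with the constant share d.
        new_rank = {u: d for u in pages}
        for v in pages:
            for u in links[v]:
                # A sink has no out-links, so it passes nothing on.
                new_rank[u] += (1 - d) * rank[v] / len(links[v])
        rank = new_rank
    return rank


# Hypothetical graph in which C is a sink: A -> B, A -> C, B -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": []}
rank = pagerank(links)
```

Page A has no in-links, so it keeps exactly the constant share $d$, while C accumulates rank from both A and B without draining the rest of the graph.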
14

Using Queries to find Documents

Vector Space Model


Content Relevance

Slide by Mark Levene [3]

15

Term Frequency (TF)


Count number of
occurrences of each term.


Bag of words approach


Ignore stopwords such as is, a, of, the


Stemming - computer is replaced by comput, as are
its variants: computers, computing, computation,
computer and computed.


Normalise TF by dividing
by doc length, byte size of
doc or max num of
occurrences of a word in
the bag.
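
The steps above can be sketched as follows. The suffix list is a toy stand-in for a real stemmer, the stopword set contains only the four words named on the slide, and normalising by the most frequent term is just one of the three options listed; the names `stem` and `term_frequencies` are hypothetical.

```python
from collections import Counter

STOPWORDS = {"is", "a", "of", "the"}
SUFFIXES = ("ation", "ers", "ing", "er", "ed")  # toy list, not a real stemmer


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Return TF per stem, normalised by the count of the most frequent term."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(stem(w) for w in words)
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


tf = term_frequencies("chess computer programming chess game chess game is a")
```

Here `chess` occurs three times and gets TF 1.0, `game` gets 2/3, and the stopwords `is` and `a` are dropped before counting.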


(Figure: example bag of words: chess, computer, programming, chess, game, chess, game, is, a)

Slide by Mark Levene [3]

16

Inverse Document Frequency (IDF)

$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$


$N$ is the number of documents in the corpus.


$n_i$ is the number of docs in which word $i$ appears.


Log dampens the effect of IDF.


IDF is also number of bits to represent the
term.
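
A small sketch of the IDF computation over a toy corpus. The base-2 logarithm is an assumption chosen to match the "number of bits" reading; the function name and the four documents are made up for the example.

```python
import math


def inverse_document_frequency(documents):
    """documents is a list of term lists; returns log2(N / n_i) for each term."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it occurs there.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / n) for term, n in doc_freq.items()}


# Hypothetical corpus of four tiny documents.
documents = [
    ["the", "chess", "game"],
    ["the", "chess", "computer"],
    ["the", "computer"],
    ["the", "programming"],
]
idf = inverse_document_frequency(documents)
```

A word that appears in every document (`the`) gets IDF 0, a word in half the documents (`chess`) gets 1, and a word in a single document (`game`) gets 2.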

Slide by Mark Levene [3]

17

Ranking with TF-IDF





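$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i, \qquad \mathrm{score}_j = \sum_{i \in q} w_{i,j}$$


$i$ refers to word (or term) $i$


$j$ refers to document $j$, so $\mathrm{TF}_{i,j}$ is the frequency of term $i$ in doc $j$


$q$ is the query, which is a sequence of
terms


$\mathrm{score}_j$ is the score for document $j$
given $q$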


Rank results according to the scoring
function.
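
The scoring function can be sketched end to end on a toy corpus. This is an illustration under assumptions: TF is normalised by document length, IDF uses the natural logarithm, query terms that appear in no document are skipped, and the function name and documents are hypothetical.

```python
import math
from collections import Counter


def rank_documents(query, documents):
    """Score each document as the sum over query terms of TF * IDF."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    scores = {}
    for j, tf in enumerate(counts):
        score = 0.0
        for term in query:
            n_i = sum(1 for c in counts if term in c)
            if n_i == 0:
                continue  # term appears nowhere, so it cannot affect the ranking
            idf = math.log(n_docs / n_i)
            score += (tf[term] / len(documents[j])) * idf
        scores[j] = score
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Hypothetical corpus, indexed by position in the list.
documents = [
    ["chess", "game", "chess"],
    ["computer", "programming"],
    ["game", "computer", "game", "game"],
]
order, scores = rank_documents(["chess", "game"], documents)
```

Document 0 ranks first because it contains the rarer query term `chess`; document 1 contains neither term and scores zero.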




Slide by Mark Levene [3]

18

Factor in Link Metrics

$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \cdot \mathrm{PR}_j$$


Multiply by the PageRank $\mathrm{PR}_j$ of the document $j$ (web
page).


We do not know exactly how Google
factors in the PR; it may be that log(PR) is
used.
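
To see why the choice between PR and log(PR) matters, the sketch below applies both to two made-up pages. The function name `combined_score`, the content scores, and the PageRank values are hypothetical; nothing here describes how Google actually combines the signals.

```python
import math


def combined_score(content_score, pagerank, use_log=False):
    """Multiply a content score by PR, or by log(PR) to dampen the link signal."""
    link_factor = math.log(pagerank) if use_log else pagerank
    return content_score * link_factor


# Hypothetical pages: "a" matches the query well, "b" has the higher PageRank.
pages = {"a": (0.9, 2.0), "b": (0.25, 8.0)}

plain = {p: combined_score(s, pr) for p, (s, pr) in pages.items()}
damped = {p: combined_score(s, pr, use_log=True) for p, (s, pr) in pages.items()}
```

With the plain product, page `b` wins on PageRank alone; with the logarithm, the better content match `a` comes out ahead.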

Slide by Mark Levene [3]

19

Rate of change on the Web [4]


Search engines update their index
periodically in order to keep up with
the evolving web


obsolete index leads to irrelevant or “broken”
search results


update both content and link structure


Source of change


content of pages changes


new pages are added

20

What’s new on the Web?


New pages are created at a rate of 8% a
week [4]


New pages borrow significant amount
of content from old pages


After one year, 50% of the content on
the web is new


Only 20% of the pages available today will
still be accessible after one year

21

New Link Structure


After a year, about 80% of links on
the Web will be replaced with new
ones


25% change per week


week-old rankings may not reflect the
current ranking of the pages very well

22

Change in old pages


After one week


30% of the changed pages have a difference > 5%


After one year


less than 50% of the changed pages have a difference > 5%


Creation of new pages is a more significant
source of change on the Web


23

Impact on Search Engines


Need to continually update links


this data changes more rapidly than content


most links persist for less than 6 months


Pages are removed and replaced by new ones
at rapid rates


Sometimes better to use a cached version of
a page


Pages that persist usually do not change
very much


Past change does not predict future change

24

Citations

[1] Google. Google. www.google.com

[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.

[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt

[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.

[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.

[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002.

[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.