1
Announcements
•
Research Paper due today
•
Research Talks
–
Nov. 29 (Monday) Kayatana and Lance
–
Dec. 1 (Wednesday) Mark and Jeremy
–
Dec. 3 (Friday) Joe and Anton
–
Dec. 6 (Monday) Colin and Paul
2
Web Search
Lecture 23
3
Searching the Web
•
Only search what is indexed
–
1999: 800 million documents indexed by
Northern Light [7]
•
Largest index: 16% of the indexable web
–
2004: 8 billion URLs indexed by Google
[1]
•
Largest index: ?% of the indexable web
4
Visualizing the Web
•
View the web as a directed graph of
nodes and edges
–
set of abstract nodes (the pages)
–
joined by directional edges (the
hyperlinks)
•
Structure provides significant insight
about the content
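As a rough illustration of the graph view, the sketch below stores a small made-up web as an adjacency list and inverts it to find the pages that link to each page. The page names and the helper `backlinks` are hypothetical and only serve the example.

```python
# Hypothetical four-page web: each key links to the pages in its list.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def backlinks(links):
    """Invert the directed graph: map each page to the pages that link to it."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming

incoming = backlinks(links)
print(incoming["C"])  # pages whose hyperlinks point at C
```

Here page C has three incoming edges and page D has none, which is the kind of structural signal the following slides build on.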
5
Example Graph [6]
6
Citation Analysis[2]
•
Use structure to identify important, or
prominent, nodes
•
Garfield’s impact factor
–
Quantitative “score” for each journal
proportional to the average number of
citations per paper published in the previous
two years
–
More heavily cited journals have more overall
impact on a field
•
Consider it better to receive citations from
an important journal
7
Influence Weights
•
Pinski and Narin’s notion of influence weights
–
strength of the connection from one journal to
another
•
percentage of citations in the first journal that refer
to the second
–
equilibrium: the weight of each journal J is equal
to the sum of the weights of all journals citing J
(scaled by the strengths of the connections)
•
If a journal receives regular citations from other
journals of large weight, it will acquire large weight
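One way to write the equilibrium condition down, using symbols that are not on the slide and are assumed here only for illustration: let w_J be the weight of journal J and s_{I,J} the strength of the connection from journal I to journal J.

```latex
s_{I,J} = \frac{\text{citations in } I \text{ that refer to } J}{\text{all citations in } I},
\qquad
w_J = \sum_{I} w_I \, s_{I,J}
```

The second equation says that the weights are a fixed point: each journal's weight is the strength-scaled sum of the weights of the journals citing it.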
8
On the web
•
Lots of dead ends in the link structure
–
Prominent sites may have no links to
outside world
–
Use “smoothing” operation, giving all
pages a small, positive connection
strength to every other page
•
Compute equilibrium weights with
respect to modified connection
strengths
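The smoothing step can be sketched as follows. This is a minimal illustration, not the method of any particular search engine: the smoothing amount `epsilon`, the three-page example and the function name are all assumptions. Each page gets a small positive connection strength to every other page, rows are renormalised, and the equilibrium weights are found by repeated averaging (power iteration).

```python
def equilibrium_weights(strength, epsilon=0.01, iterations=100):
    """Equilibrium weights for smoothed connection strengths.

    strength[i][j] is the fraction of page i's links that point to page j.
    epsilon is a hypothetical smoothing amount, not a value from the slides.
    """
    n = len(strength)
    smoothed = []
    for i, row in enumerate(strength):
        # Add epsilon to the connection from page i to every other page.
        bumped = [s + (epsilon if j != i else 0.0) for j, s in enumerate(row)]
        total = sum(bumped)
        smoothed.append([s / total for s in bumped])
    weights = [1.0 / n] * n
    for _ in range(iterations):
        # New weight of j: weights of pages pointing to j, scaled by strength.
        weights = [
            sum(weights[i] * smoothed[i][j] for i in range(n))
            for j in range(n)
        ]
    return weights

# Page 2 is a dead end: it receives links but has none of its own.
strength = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
weights = equilibrium_weights(strength)
print(weights)
```

Without the smoothing, the all-zero row for page 2 could not be normalised and weight would drain out of the system at the dead end.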
9
Different Model on the Web
•
Prominent sites do not link to other
prominent sites
–
Search engines won’t link to other search
engines because they are competitors
–
Each wants to keep users on its own site
•
Large collection of pages link to many
prominent sites in a focused manner
–
act as resource lists and guides to search
engines
10
Hubs and Authorities
•
Authorities
–
most prominent sources of
primary content for a topic
•
Hubs
–
high-quality guides and resource
lists that direct users to recommended
authorities
•
Each page is assigned a hub weight and
an authority weight
–
authority weight: proportional to the sum of
the hub weights of pages that link to it
–
hub weight: proportional to the sum of the
authority weights of the pages that it links to
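The two mutually dependent definitions above can be computed by alternating updates. The sketch below is one simple way to do it, assuming a fixed number of iterations and Euclidean normalisation after each round; the page names and the function name are made up for the example.

```python
import math

def hubs_and_authorities(links, iterations=50):
    """Iteratively compute hub and authority weights for a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages that link to it.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both sets of weights so they stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth

# Hypothetical graph: two resource lists pointing at three content sites.
links = {
    "guide1": ["siteA", "siteB"],
    "guide2": ["siteA", "siteB", "siteC"],
    "siteA": [],
    "siteB": [],
    "siteC": [],
}
hub, auth = hubs_and_authorities(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

The guides end up with all the hub weight and the content sites with all the authority weight, even though the content sites never link to each other.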
11
Simplified PageRank Algorithm[5]
•
Formula used by Google to rank
pages
–
Let u be a web page
–
F_u is the set of pages u points to
–
B_u is the set of pages that point to u
–
N_u = |F_u| is the number of links from u
–
c is a factor used for normalization
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
12
Simplified PageRank Calculation
where c = 1
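A small calculation with c = 1 can be sketched as below, assuming the simplified formula R(u) = c Σ_{v ∈ B_u} R(v) / N_v. The three-page graph and the function name are illustrative choices, not taken from the slide's figure.

```python
def simplified_pagerank(forward, iterations=100, c=1.0):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v.

    forward[v] is F_v, the list of pages that v points to.
    """
    pages = list(forward)
    rank = {u: 1.0 / len(pages) for u in pages}
    for _ in range(iterations):
        new_rank = {u: 0.0 for u in pages}
        for v, targets in forward.items():
            for u in targets:
                # v splits its rank evenly over its N_v outgoing links.
                new_rank[u] += c * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
forward = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = simplified_pagerank(forward)
print(rank)  # approaches A = 0.4, B = 0.2, C = 0.4
```

At the fixed point, A passes 0.2 to each of B and C, B passes its 0.2 on to C, and C returns its full 0.4 to A, so every page receives exactly the rank it holds.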
13
PageRank Formula
•
Account for sinks
•
Complete Formula
–
d is empirically set to about 0.15 to 0.2 by the
system
$$R(u) = d + (1 - d) \sum_{v \in B_u} \frac{R(v)}{N_v}$$
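The effect of the d term on a sink can be sketched as follows. The code assumes the complete formula has the form R(u) = d + (1 - d) Σ_{v ∈ B_u} R(v) / N_v with d = 0.15; the graph, the starting ranks and the iteration count are assumptions made for the example.

```python
def pagerank(forward, d=0.15, iterations=200):
    """Iterate R(u) = d + (1 - d) * sum over v in B_u of R(v) / N_v."""
    rank = {u: 1.0 for u in forward}
    for _ in range(iterations):
        # Every page starts each round with the baseline amount d.
        new_rank = {u: d for u in forward}
        for v, targets in forward.items():
            for u in targets:
                new_rank[u] += (1 - d) * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: B and C link only to each other, so they form a sink.
forward = {"A": ["B"], "B": ["C"], "C": ["B"]}
rank = pagerank(forward)
print(rank)
```

Page A has no incoming links, so its rank settles at exactly d, while B and C share the rest instead of absorbing all of it.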
14
Using Queries to find Documents
Vector Space Model
–
Content Relevance
Slide by Mark Levene [3]
15
Term Frequency (TF)
•
Count number of
occurrences of each term.
•
Bag of words approach
•
Ignore stopwords such as is, a, of, the, …
•
Stemming: computer is replaced by comput, as are
its variants computers, computing, computation
and computed.
•
Normalise TF by dividing
by doc length, byte size of
doc or max num of
occurrences of a word in
the bag.
[Figure: bag of words with the terms chess, computer, programming, chess, game, chess, game, is, a]
Slide by Mark Levene [3]
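A minimal sketch of the term-frequency steps: tokenise, drop stopwords, count, and normalise by the maximum count in the bag. The stopword list is just the four words named on the slide, the example sentence is made up, and stemming is left out to keep the sketch short.

```python
import re
from collections import Counter

# Only the stopwords named on the slide; a real list would be longer.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text):
    """Return TF per term, normalised by the most frequent term in the bag."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

tf = term_frequencies("Computer chess programming: chess is a game, the game of chess")
print(tf)
```

Dividing by the maximum count is one of the three normalisations listed above; dividing by document length would only change the denominator.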
16
Inverse Document Frequency (IDF)
$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$
•
N is the number of documents in the corpus.
•
n_i is the number of docs in which word i appears.
•
Log dampens the effect of IDF.
•
IDF is also number of bits to represent the
term.
Slide by Mark Levene [3]
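The IDF formula log(N / n_i) can be sketched directly. The base of the logarithm is not stated on the slide; base 2 is assumed here because of the remark about bits. The four-document corpus is invented for the example.

```python
import math

def inverse_document_frequency(corpus):
    """Return IDF_i = log2(N / n_i) for every term i in the corpus."""
    n_docs = len(corpus)
    doc_counts = {}
    for doc in corpus:
        # Count each term once per document, however often it occurs.
        for term in set(doc):
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log2(n_docs / n_i) for term, n_i in doc_counts.items()}

corpus = [
    ["chess", "game", "opening"],
    ["chess", "computer"],
    ["chess", "programming", "game"],
    ["computer", "programming"],
]
idf = inverse_document_frequency(corpus)
print(idf)
```

A term that appears in a single document out of four gets an IDF of 2 bits, while one that appears in three of the four gets well under 1.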
17
Ranking with TF-IDF
$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i, \qquad \mathrm{score}_j = \sum_{i \in q} w_{i,j}$$
•
i refers to word (or term) i in doc j
•
j refers to document j
•
q is the query, which is a sequence of
terms
•
score_j is the score for document j given q
•
Rank results according to the scoring
function.
Slide by Mark Levene [3]
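Putting TF and IDF together, a scoring function can be sketched as below. It assumes the convention that the score of a document is the sum, over query terms, of the term's TF in that document times the term's IDF. TF is normalised by document length, the natural logarithm is used, and the documents and query are made up; all of these are choices made for the example.

```python
import math
from collections import Counter

def score_documents(docs, query):
    """Score each document as the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    # n_i: the number of documents in which each term appears.
    doc_freq = Counter(term for words in docs.values() for term in set(words))
    scores = {}
    for name, words in docs.items():
        counts = Counter(words)
        score = 0.0
        for term in query:
            if counts[term]:
                tf = counts[term] / len(words)
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        scores[name] = score
    return scores

docs = {
    "d1": ["chess", "game", "chess"],
    "d2": ["computer", "programming"],
    "d3": ["chess", "computer", "programming", "game"],
}
scores = score_documents(docs, ["chess", "game"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Document d1 ranks first because the query terms make up all of its content, d3 second because they make up half, and d2 last with a score of zero.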
18
Factor in Link Metrics
$$w_{i,j} = \mathrm{TF}_{i,j} \cdot \mathrm{IDF}_i \cdot \mathrm{PR}_j$$
•
Multiply by the PageRank PR_j of document (web
page) j.
•
We do not know exactly how Google
factors in the PR; it may be that log(PR) is
used.
Slide by Mark Levene [3]
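A toy illustration of how multiplying by PageRank can reorder results. The content scores and PageRank values below are invented numbers, and the plain product is only the simplest reading of the slide; how a real engine combines the two is not known.

```python
def combine_with_pagerank(content_scores, pagerank):
    """Multiply each document's content score by its PageRank."""
    return {doc: score * pagerank[doc] for doc, score in content_scores.items()}

# Hypothetical values: d1 matches the query better, d3 is better linked.
content_scores = {"d1": 0.40, "d3": 0.20}
pagerank = {"d1": 0.5, "d3": 2.0}

combined = combine_with_pagerank(content_scores, pagerank)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

With these numbers the better-linked page d3 overtakes d1; using a logarithm of the PageRank instead would soften that effect.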
19
Rate of change on the Web [4]
•
Search engines update their index
periodically in order to keep up with
the evolving web
–
obsolete index leads to irrelevant or “broken”
search results
–
update both content and link structure
•
Sources of change
–
content of pages changes
–
new pages are added
20
What’s new on the Web?
•
New pages are created at a rate of 8% per
week [4]
–
New pages borrow a significant amount
of content from old pages
–
After one year, 50% of the content on
the web is new
•
Only 20% of pages available today will be
accessible after one year
21
New Link Structure
•
After a year, about 80% of links on
the Web will be replaced with new
ones
•
About 25% new links are created per week
–
week-old rankings may not reflect the
current ranking of the pages very well
22
Change in old pages
•
After one week
–
30% of the changed pages
–
difference > 5%
•
After one year
–
less than 50% of changed pages
–
difference > 5%
•
Creation of new pages more significant
source of change on the Web
23
Impact on Search Engines
•
Need to continually update links
–
this data changes more rapidly than content
–
most links persist for less than 6 months
•
Pages are removed and replaced by new ones
at rapid rates
–
Sometimes better to use a cached version of
the page
•
Pages that persist usually do not change
very much
–
Past change does not predict future change
24
Citations
[1] GOOGLE. Google. www.google.com
[2] J. Kleinberg. Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), 1999.
[3] M. Levene. Lecture 4: Searching the Web. www.dsc.bbk.ac.uk/~mark/download/lec4_searching_the_web.ppt
[4] A. Ntoulas et al. What’s New on the Web? The Evolution of the Web from a Search Engine Perspective. In Proceedings of the Thirteenth International World Wide Web Conference, New York, May 17-22, 2004.
[5] L. Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper, 1998.
[6] I. Rogers. The Google PageRank Algorithm and How It Works. www.iprcom.com/papers/pagerank, April 2002.
[7] E. Selberg and O. Etzioni. On the Stability of Web Search Engines. In Proceedings of RIAO 2000 Conference, Paris, April 12-14, 2000.