Overview of Web Ranking Algorithms: HITS and PageRank




April 6, 2006

Presented by: Bill Eberle

Overview


Problem


Web as a Graph


HITS


PageRank


Comparison



Problem


Specific queries (scarcity problem).


Broad-topic queries (abundance problem).


Goal: to find the smallest set of
“authoritative” sources.

Web as a Graph


Web pages as nodes of a graph.


Links as directed edges.

[Figure: an example graph in which "my page", www.uta.edu, and www.google.com are nodes joined by directed edges.]
Link Structure of the Web


Forward links (out-edges).

Backward links (in-edges).


Approximation of importance/quality: a
page may be of high quality if it is
referred to by many other pages, and by
pages of high quality.


HITS


HITS (Hyperlink-Induced Topic Search)


“Authoritative Sources in a Hyperlinked
Environment”, Jon Kleinberg, Cornell
University. 1998.



Authorities and Hubs


An authority is a page that has relevant information about the topic.

A hub is a page that has a collection of links to pages about that topic.





[Figure: a hub h pointing to authorities a1 through a4.]
Authorities and Hubs (cont.)


Good hubs are the ones that point to good authorities.

Good authorities are the ones that are pointed to by good hubs.



[Figure: hubs h1 through h5 pointing to authorities a1 through a6.]

Finding Authorities and Hubs


First, construct a focused subgraph of the WWW.

Second, compute hubs and authorities from the subgraph.


Construction of Subgraph

[Figure: a topic is submitted to a search engine, which returns the root set of pages; a crawler then follows forward links from the root set to produce the expanded set of pages.]
Root Set and Base Set

Use the query term to collect a root set of pages from a text-based search engine (AltaVista).
Root Set and Base Set (cont.)

Expand the root set into a base set by including (up to a designated size cut-off; a code sketch follows below):

All pages linked to by pages in the root set.

All pages that link to a page in the root set.

[Figure: the base set containing the root set.]
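A minimal sketch of this expansion step. The three callables are assumed helpers, not anything from the paper; Kleinberg's experiments used t ≈ 200 results and an in-link cut-off of d ≈ 50 per page:

```python
def build_base_set(query, search_engine, links_from, links_to, t=200, d=50):
    """Expand a root set into a base set. Assumed helpers:
    search_engine(query, limit) returns ranked result URLs,
    links_from(p) returns p's forward links, links_to(p) its back-links."""
    root_set = set(search_engine(query, limit=t))   # top-t text-search results
    base_set = set(root_set)
    for page in root_set:
        base_set.update(links_from(page))           # all pages the root page links to
        base_set.update(links_to(page)[:d])         # back-links, cut off at d per page
    return base_set
```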
Hubs & Authorities Calculation


Iterative algorithm on Base Set
: authority weights
a
(p), and
hub weights
h
(p).


Set authority weights
a
(p) = 1, and hub weights
h
(p) = 1
for all p.


Repeat following two operations

(and then re
-
normalize
a

and
h

to have unit norm):

v
1

p

v
2

v
3


h(v
2
)


h(v
3
)



p
q
p
a

to
points

h(q)
)
(
v
1

p


a(v
1
)

v
2

v
3


a(v
2
)


a(v
3
)



q
p
a
p
h

to
points

(q)
)
(

h(v
1
)
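A minimal sketch of this iteration in Python, assuming the base set arrives as a dict of forward links whose targets are all keys of the dict (illustrative, not the authors' code):

```python
import math

def hits(graph, iterations=50):
    """graph: dict mapping each page in the base set to the pages it links to."""
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # a(p): sum of h(q) over pages q that point to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # h(p): sum of a(q) over pages q that p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        # re-normalize a and h to unit norm
        na = math.sqrt(sum(x * x for x in auth.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in hub.values())) or 1.0
        auth = {p: x / na for p, x in auth.items()}
        hub = {p: x / nh for p, x in hub.items()}
    return auth, hub
```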

Example

[Figure: four pages, each starting with hub weight 0.45 and authority weight 0.45.]

Example (cont.)

[Figure: after one update, the (hub, authority) weights become (0.9, 0.45), (1.35, 0.9), (0.45, 0.9), and (0.45, 0.9).]

Algorithmic Outcome

Applying iterative multiplication (power iteration) converges to an eigenvector from any "non-degenerate" initial vector.

Hubs and authorities are the outcome of this process.

The principal eigenvector contains the highest hubs and authorities.

Results

Although HITS is only link-based (it completely disregards page content), results are quite good on many tested queries.

When the authors tested the query "search engines":

The algorithm returned Yahoo!, Excite, Magellan, Lycos, and AltaVista.

However, none of these pages described themselves as a "search engine" (at the time of the experiment).

Issues

Starting from a narrow topic, HITS tends to drift toward a more general one.

A peculiarity of hub pages: their many links can cause algorithm drift, since one hub can point to authorities in different topics.

Pages from a single domain/website can dominate the result if they all point to one page, which is not necessarily a good authority.

Possible Enhancements

Use weighted sums for the link calculation.

Take advantage of "anchor text", the text surrounding the link itself.

Break hubs into smaller pieces and analyze each piece separately, instead of treating the whole hub page as one.

Disregard or minimize the influence of links within one domain.

IBM expanded HITS into CLEVER, but it was not seen as a viable real-time search engine.

PageRank


“The PageRank Citation Ranking:
Bringing Order to the Web”, Lawrence
Page and Sergey Brin, Stanford
University. 1998.



Basic Idea


Back-links coming from important pages convey more importance to a page. For example, if a web page has a link off the Yahoo! home page, it may be just one link, but it is a very important one.

A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and the case when a page has a few highly ranked back-links.

Definition

My page's rank is the sum of the ranks of all the pages pointing to me, each divided by the number of links on that page:

Rank(u) = \sum_{v \in B_u} Rank(v) / N_v

where B_u is the set of pages with links to u, and N_v is the number of links from v.
Simplified PageRank Example

Rank(u) is the rank of page u, where c is a normalization constant (c < 1, to cover for pages with no outgoing links):

Rank(u) = c \sum_{v \in B_u} Rank(v) / N_v

Expanded Definition

R(u): page rank of page u

c: factor used for normalization (< 1)

B_u: set of pages pointing to u

N_v: number of outbound links of v

R(v): page rank of page v that points to u

E(u): distribution of web pages to which a random surfer periodically jumps (total set to 0.15)

R(u) = c \sum_{v \in B_u} R(v) / N_v + c E(u)
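To make the formula concrete, here is a tiny worked computation; every number below is a made-up assumption for illustration:

```python
# R(u) = c * sum_{v in B_u} R(v)/N_v + c * E(u), with made-up inputs.
c = 0.85                                      # assumed normalization factor
rank = {"v1": 0.30, "v2": 0.10, "v3": 0.05}   # R(v) for the pages pointing to u
n_out = {"v1": 3, "v2": 1, "v3": 5}           # N_v: outbound link counts
E_u = 0.15 / 10                               # uniform E over a 10-page web

R_u = c * sum(rank[v] / n_out[v] for v in rank) + c * E_u
print(R_u)  # 0.85 * (0.10 + 0.10 + 0.01) + 0.85 * 0.015 = 0.19125
```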




Problem 1 - Rank Sink

A cycle of pages that is pointed to by some incoming link.

The loop will accumulate rank but never distribute it.

Problem 2 - Dangling Links

In general, many Web pages do not have either back-links or forward links.

Dangling links do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated, then added back (see the sketch below).
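A minimal sketch of that removal step, assuming the link structure is a dict of out-link sets (illustrative, not the paper's code):

```python
def remove_dangling(out_links):
    """Repeatedly drop pages with no forward links; removing one dangling
    page can leave its parents dangling, so iterate to a fixed point."""
    g = {u: set(vs) for u, vs in out_links.items()}
    while True:
        dangling = {u for u, vs in g.items() if not vs}
        if not dangling:
            return g
        for u in dangling:
            del g[u]
        for u in g:
            g[u] -= dangling
```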

Random Surfer Model


PageRank corresponds to the probability
distribution of a random walk on the web
graph.





Solution - Escape Term

The escape term E(u) can be thought of as a random surfer who periodically gets bored and jumps to a different page, rather than staying in the loop forever.

We take E to be a vector over all the web pages that accounts for each page's escape probability (a user-defined parameter).

R(u) = c \sum_{v \in B_u} R(v) / N_v + c E(u)




PageRank Computation

R_0 ← S (initialize the vector over web pages)

Loop:

R_{i+1} ← A^T R_i (new ranks: sum of normalized back-link ranks)

d ← ||R_i||_1 − ||R_{i+1}||_1 (compute normalizing factor)

R_{i+1} ← R_{i+1} + d E (add escape term)

δ ← ||R_{i+1} − R_i||_1 (control parameter)

While δ > ε (stop when converged)

A code sketch of this loop follows below.
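A minimal sketch of the loop above, assuming a dense numpy matrix with A[u, v] = 1/N_u when u links to v; a real implementation would use the sorted on-disk link structure instead:

```python
import numpy as np

def pagerank(A, E, eps=1e-8):
    """Power iteration with an escape vector E (entries of E sum to ~0.15)."""
    n = A.shape[0]
    R = np.full(n, 1.0 / n)         # R_0 <- S, a uniform start vector
    while True:
        R_next = A.T @ R            # R_{i+1} <- A^T R_i
        d = R.sum() - R_next.sum()  # ||R_i||_1 - ||R_{i+1}||_1 (rank lost to dangling pages)
        R_next = R_next + d * E     # add escape term
        delta = np.abs(R_next - R).sum()
        R = R_next
        if delta < eps:             # stop when converged
            return R
```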




Matrices


A is designated to be a matrix, u and v correspond to the
columns of this matrix.










Given that A is a matrix, and R be a vector over all the Web
pages, the dominant eigenvector is the one associated with
the maximal eigenvalue.

Example

[Figure: the matrix A^T for a small example graph.]

Example (cont.)

[Figure: the matrix A, the rank vector R, and its normalized form for the same graph.]

A x = λ x

| A − λ I | = 0

R = c A R = M R

c: eigenvalue

R: eigenvector of A

Implementation

1. Map each URL to an id.

2. Store each hyperlink in a database.

3. Sort the link structure by parent id.

4. Remove dangling links.

5. Calculate the PageRank, giving each page an initial value.

6. Iterate until convergence.

7. Add the dangling links back.


Example

[Figure: Page A links to Page B and Page C (N_A = 2), Page B links to Page C (N_B = 1), and Page C links to Page A (N_C = 1).]

Which of these three has the highest page rank?

Rank(A) = Rank(C) / 1

Rank(B) = Rank(A) / 2

Rank(C) = Rank(A) / 2 + Rank(B) / 1
Example (cont.)

Re-write the system of equations as a matrix-vector product:

[ Rank(A) ]   [  0    0    1 ] [ Rank(A) ]
[ Rank(B) ] = [ 1/2   0    0 ] [ Rank(B) ]
[ Rank(C) ]   [ 1/2   1    0 ] [ Rank(C) ]

The PageRank vector is simply an eigenvector (scalar * vector = matrix * vector) of the coefficient matrix.

Example (cont.)

[Figure: the same three pages with their converged ranks: PageRank(A) = 0.4, PageRank(B) = 0.2, PageRank(C) = 0.4. A quick numerical check follows below.]
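As a quick check of these values (illustrative, using numpy), the rank vector is the eigenvector of the coefficient matrix for eigenvalue 1:

```python
import numpy as np

M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
vals, vecs = np.linalg.eig(M)
i = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue closest to 1
R = np.real(vecs[:, i])
R = R / R.sum()                     # normalize the ranks to sum to 1
print(R)                            # approx. [0.4, 0.2, 0.4] for pages A, B, C
```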

Example (cont.)

[Table: iterations 0 through 12 of PR(A), PR(B), and PR(C) with d = 0.5, showing the values converging.]

Convergence

PageRank computation converges in O(log |V|) iterations.




Other Applications


Help users decide if a site is trustworthy.


Estimate web traffic.


Spam detection and prevention.


Predict citation counts.


Issues


Users are not random walkers.


Starting point distribution (actual usage
data as starting vector).


Bias towards main pages.


Linkage spam.


No query specific rank.

PageRank vs. HITS

PageRank (Google):

Computed for all web pages stored in the database, prior to the query.

Computes authorities only.

Trivial and fast to compute.

HITS (CLEVER):

Performed on the set of retrieved web pages for each query.

Computes authorities and hubs.

Easy to compute, but real-time execution is hard.

References


“Authoritative Sources in a Hyperlinked
Environment”, Jon Kleinberg, Cornell
University. 1998.


“The PageRank Citation Ranking:
Bringing Order to the Web”, Lawrence
Page and Sergey Brin, Stanford
University. 1998.