PageRank: A Link Analysis Approach

Source: Introduction to Information Retrieval, Chapter 21
Manning, Raghavan, and Schutze
Cambridge, 2008

2

The Web as a Directed Graph

- Assumption 1: A hyperlink is a quality signal
  - A hyperlink between pages denotes that the author perceived relevance
- Assumption 2: An anchor text describes the target page
  - We use anchor text somewhat loosely here: the text surrounding the
    hyperlink
  - Example. "You can find cheap cars <a href=http://...>here</a>"
- Examples of hyperlinks that violate these two assumptions?

3

Document Text +/- Anchor Text

- Searching on document text + anchor text is often more effective than
  searching on document text only
- Example. Query "IBM"
  - Matches IBM's copyright page
  - Matches many spam pages
  - Matches the IBM Wikipedia article
  - May not match the IBM home page (if the IBM home page is mostly graphical)
- Searching on anchor text is better for the query "IBM"
  - Represent each page by all the anchor text pointing to it
  - In this representation, the page with the most occurrences of "IBM" is
    www.ibm.com

4

Indexing Anchor Text

- Anchor text is often a better description of a page's content than the
  page itself
- Anchor text can be weighted more highly than document text (based on
  Assumptions 1 and 2)
- Indexing anchor text can have unexpected side effects: Google bombs
  - A Google bomb is a search with "bad" results due to maliciously
    manipulated anchor text
  - Google introduced a new weighting function in 01/2007 to fix it

5

Origins of PageRank: Citation Analysis

- Citation analysis: analyze citations in the scientific literature
  - "Miller (2001)" is a hyperlink linking two scientific articles
- One application of these "hyperlinks" in the scientific literature:
  - Measure the similarity of two articles by the overlap of other articles
    that cite both of them
  - This is called co-citation similarity
- Is there "co-citation similarity" on the Web?
  - Yes: co-cited Web pages tend to be similar
- Measure: citation frequency / citation rank
  - An article's vote is weighted according to its citation impact
  - Circular? No: can be formalized in a well-defined way
  - Basis for PageRank: the idea was invented in the context of citation
    analysis
- Citation analysis is a big deal: impact of publications

6

Basis for PageRank: Random Walk

- Imagine a Web surfer doing a random walk on the Web
  - Start at a random page, which is treated as a state
  - At each step, go out of the current page along one of its links, chosen
    according to the transition probability
- In the steady state, each page has a long-term visit rate
  - This long-term visit rate is the page's PageRank
- Pages that are visited more often in the walk are more important
- PageRank = steady-state probability = long-term visit rate
- Can be modeled as a Markov chain

7

Basis for PageRank: Markov Chains

- A Markov chain consists of N states, where state = page, plus an N x N
  transition probability matrix P
- At each step, we are on exactly one of the pages
- For 1 <= i, j <= N, the matrix entry Pij in [0, 1] denotes the probability
  of j being the next page, given that the current page is i
- Clearly, for all i, the row sum ∑_{j=1}^{N} Pij = 1
- Markov chains are abstractions of random walks

8

Teleport

- The Web is full of dead ends, i.e., pages without outgoing links
  - A random walk can get stuck in dead ends
- At a dead end, jump to a random Web page
- At a non-dead end, with probability 10%, jump to a random page
  - With the remaining probability (90%), go out on a random hyperlink,
    e.g., choose each of the 4 hyperlinks of the page with probability
    (1 - 0.1) / 4 = 0.225
- With teleporting, a walk cannot get stuck in a dead end
- Over a long time period, the random walk visits each state in proportion
  to the steady-state probability distribution, regardless of where the
  walk starts
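The teleporting walk above can be sketched as a short simulation. This is an illustrative sketch, not code from the slides: the graph, the function name `walk`, and the parameter values are assumptions (the example graph matches the three-page example used later in the deck).

```python
import random

# adj[i] lists the pages that page i links to; an empty list is a dead end.
def walk(adj, teleport=0.1, steps=100_000, seed=42):
    random.seed(seed)
    n = len(adj)
    visits = [0] * n
    page = random.randrange(n)               # start at a random page
    for _ in range(steps):
        if not adj[page] or random.random() < teleport:
            page = random.randrange(n)       # dead end or teleport: jump anywhere
        else:
            page = random.choice(adj[page])  # follow a random outlink
        visits[page] += 1
    return [v / steps for v in visits]       # long-term visit rates

# Three-page graph used later in the slides: A and C link to B; B links to A and C.
rates = walk(adj=[[1], [0, 2], [1]], teleport=0.5)
print(rates)  # B (index 1) is visited most often
```

With enough steps, the visit rates approach the steady-state distribution regardless of the starting page.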

9


The Transition Probability Matrix

- Construction process of the transition probability matrix P:
  1. Given an N x N adjacency matrix A, for each row of A that has no 1's,
     set each column value to 1/N
  2. For all the other rows in A:
     a) Divide each 1 in A by the number of 1's in its row
     b) Multiply the matrix in (a) by (1 - α), the random-walk probability,
        to yield M, where 0 < α < 1 is the teleport probability
     c) Add α/N to every entry of M to obtain P
- The probability distribution of the surfer's position at any time can be
  depicted by a probability vector x



10


The Transition Probability Matrix

- Example. Consider the Web graph with pages A, B, and C, where A and C each
  link to B (probability 1), and B links to A and C (probability 0.5 each):

      A (djacency) =  0 1 0        A' =  0   1   0
                      1 0 1              1/2 0   1/2
                      0 1 0              0   1   0

- Let α = 0.5, and thus 1 - α = 0.5 = 1/2. Hence,

      M (ultiplication) =  0   1/2 0
                           1/4 0   1/4
                           0   1/2 0

  and adding α/N = 0.5 / 3 = 1/6 to each entry in M yields P:

      P (robability) =  1/6  2/3  1/6
                        5/12 1/6  5/12
                        1/6  2/3  1/6
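The three construction steps and this example can be checked with a short script. A sketch, not slide code: the function name `transition_matrix` is illustrative, and exact fractions are used so the output is directly comparable with the slide's matrices.

```python
from fractions import Fraction

def transition_matrix(A, alpha):
    N = len(A)
    P = []
    for row in A:
        ones = sum(row)
        if ones == 0:
            # step 1: a dead-end row becomes 1/N in every column
            P.append([Fraction(1, N)] * N)
        else:
            # steps 2a-2c: row-normalize, scale by (1 - alpha), add alpha/N
            P.append([(1 - alpha) * Fraction(v, ones) + alpha / N for v in row])
    return P

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
P = transition_matrix(A, alpha=Fraction(1, 2))
print(P[0])  # [Fraction(1, 6), Fraction(2, 3), Fraction(1, 6)]
```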

11

Markov Chains

[Figure: a sample Web graph, its Link (Adjacency) matrix, and the
corresponding transition probability matrix P]

12


Probability Vectors

- A probability (row) vector x = (x1, . . . , xN) indicates where the random
  walk is, i.e., the walk is on page i with probability xi, e.g.,

      ( 0.05 0.01 0.0 . . . 0.2 . . . 0.01 0.05 0.03 )
        1    2    3   . . . i   . . . N-2  N-1  N

  where ∑_{i=1}^{N} xi = 1
- If the probability vector is x = (x1, . . . , xN) at this step, then row i
  of the transition probability matrix P dictates where to go next from
  state i
- From x, the next state is distributed as xP
- Example. Let π1 be the long-term visit rate (PageRank) of page 1;
  π = (π1 π2) = (0.25 0.75)



13


Computing the PageRank

- Regardless of where we start, the steady state π is eventually reached
  - Start with (almost) any distribution x
  - After one step, we're at xP
  - After two steps, we're at xP^2
  - After k steps, we're at xP^k
- Multiply x by increasing powers of P until the product is stable; this is
  the power (iteration) method
- Example. Let x = (0.5, 0.5) and

      P =  0.25 0.75
           0.25 0.75

  - xP   = (0.25, 0.75)
  - xP^2 = (0.25, 0.75)
  - Convergence in one iteration!
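The power method on this 2-state example can be sketched in a few lines; the helper name `step` is illustrative, and it simply computes the vector-matrix product xP.

```python
# One step of the chain: the next distribution is xP.
def step(x, P):
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

P = [[0.25, 0.75],
     [0.25, 0.75]]
x = [0.5, 0.5]
x = step(x, P)
print(x)           # [0.25, 0.75]
print(step(x, P))  # [0.25, 0.75] again: converged in one iteration
```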





14


The PageRank Computation

- If π (the probability vector) is the initial distribution over the states,
  then the distribution at time t is πP^t
- As t grows large, πP^t = πP^(t+1), which yields the PageRank values
- Example. Consider the matrix P computed on the previous slides:

      P (robability) =  1/6  2/3  1/6
                        5/12 1/6  5/12
                        1/6  2/3  1/6

  - Assume that the surfer starts in state A, where the initial distribution
    vector is x0 = (1, 0, 0)
  - After one step, the distribution vector is x0 P = (1/6, 2/3, 1/6) = x1
  - After two steps, x1 P = (1/6, 2/3, 1/6) P = (1/3, 1/3, 1/3) = x2



15

The PageRank Computation

- Example. (Continued). Eventually, the sequence of probability vectors is

      x0   1     0     0
      x1   1/6   2/3   1/6
      x2   1/3   1/3   1/3
      x3   1/4   1/2   1/4
      x4   7/24  5/12  7/24
      ...  ...   ...   ...
      x∞   5/18  4/9   5/18

- The steady-state probability distribution x∞ gives the PageRank values of
  the Web pages
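The iteration table above can be reproduced exactly with fractions. A sketch under the same setup: P is the 3-page matrix from the earlier example (α = 0.5), and the helper name `step` is illustrative.

```python
from fractions import Fraction

P = [[Fraction(1, 6),  Fraction(2, 3), Fraction(1, 6)],
     [Fraction(5, 12), Fraction(1, 6), Fraction(5, 12)],
     [Fraction(1, 6),  Fraction(2, 3), Fraction(1, 6)]]

def step(x):
    # next distribution is xP
    return [sum(x[i] * P[i][j] for i in range(3)) for j in range(3)]

x = [Fraction(1), Fraction(0), Fraction(0)]  # x0: surfer starts in state A
x = step(x)
print(x)  # x1 = (1/6, 2/3, 1/6)
x = step(x)
print(x)  # x2 = (1/3, 1/3, 1/3)
for _ in range(100):
    x = step(x)
print([float(v) for v in x])  # approaches (5/18, 4/9, 5/18) ≈ (0.278, 0.444, 0.278)
```

The steady state (5/18, 4/9, 5/18) is a fixed point: applying `step` to it returns it unchanged.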

16

How Important is PageRank?

- Frequent claim: PageRank is the most important component of Web ranking
- The reality:
  - There are several components that are at least as important, e.g.,
    anchor text, phrases, proximity, headings, etc.
  - Rumor has it that PageRank in its original form has a negligible impact
    on ranking, since link spam is difficult, yet crucial, to detect
  - However, variants of a page's PageRank are still an essential part of
    ranking