PageRank: A Link Analysis Approach
Source: Introduction to Information Retrieval, Chapter 21
Manning, Raghavan, and Schütze
Cambridge University Press, 2008
The Web as a Directed Graph
Assumption 1: A hyperlink is a quality signal
A hyperlink between pages denotes that the author perceived relevance
Assumption 2: The anchor text describes the target page
We use anchor text somewhat loosely here: the text surrounding the hyperlink
Example. “You can find cheap cars <a href=http://...>here</a>”
Can you think of examples of hyperlinks that violate these two assumptions?
Document Text + Anchor Text
Searching on document text + anchor text is often more effective than searching on document text only
Example. Query “IBM”
• Matches IBM’s copyright page
• Matches many spam pages
• Matches the IBM Wikipedia article
• May not match the IBM home page (if the IBM home page is mostly graphical)
Searching on anchor text is better for the query IBM
Represent each page by all the anchor text pointing to it
In this representation, the page with the most occurrences of IBM is www.ibm.com
Indexing Anchor Text
Anchor text is often a better description of a page’s content than the page itself
Anchor text can be weighted more highly than document text (based on Assumptions 1 and 2)
Indexing anchor text can have unexpected side effects – Google bombs
A Google bomb is a search with “bad” results due to maliciously manipulated anchor text
Google introduced a new weighting function in 01/2007 to fix this
Origins of PageRank: Citation Analysis
Citation analysis: analyze citations in the scientific literature
“Miller (2001)” is a hyperlink linking two scientific articles
One application of these “hyperlinks” in the scientific literature:
Measure the similarity of 2 articles by the overlap of other articles citing them
This is called co-citation similarity
Is there any “co-citation similarity on the Web”? Yes: similar Web pages
Measure: citation frequency / citation rank
An article’s vote is weighted according to its citation impact
Circular? No: can be formalized in a well-defined way
Basis of PageRank: invented in the context of citation analysis
Citation analysis is a big deal: impact of publications
Basis for PageRank: Random Walk
Imagine a Web surfer doing a random walk on the Web
Start at a random page, which is treated as a state
At each step, go out of the current page to one of its links, based on the transition probability
In the steady state, each page has a long-term visit rate
This long-term visit rate is the page’s PageRank
Pages that are visited more often in the walk are more important
PageRank = steady-state probability = long-term visit rate
Can be modeled as a Markov chain
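The random-surfer idea above can be sketched as a short simulation. This is a toy illustration, not from the slides: the graph, the 10% teleport rate (introduced on a later slide), and the function name are all my assumptions.

```python
import random

def random_walk_visit_rates(links, steps=100_000, teleport=0.1, seed=0):
    """Simulate a random surfer; long-term visit rates approximate PageRank.

    links: dict mapping page -> list of outgoing links (hypothetical toy graph).
    """
    rng = random.Random(seed)
    pages = list(links)
    page = rng.choice(pages)                 # start at a random page (state)
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        out = links[page]
        if not out or rng.random() < teleport:
            page = rng.choice(pages)         # dead end or teleport: jump anywhere
        else:
            page = rng.choice(out)           # otherwise follow a random out-link
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

# Toy graph: A <-> B <-> C (B links to both neighbours)
rates = random_walk_visit_rates({"A": ["B"], "B": ["A", "C"], "C": ["B"]})
```

Since both A and C link only to B, B is visited most often, i.e., it gets the highest PageRank.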
Basis for PageRank: Markov Chains
A Markov chain consists of N states, where state = page, plus an N × N transition probability matrix P
At each step, we are on exactly one of the pages
For 1 ≤ i, j ≤ N, the matrix entry Pij ∈ [0, 1] denotes the probability of j being the next page, given that the current page is i
Clearly, for all i, Σ(j=1..N) Pij = 1
Markov chains are abstractions of random walks
Teleport
The Web is full of dead ends, i.e., pages w/o outgoing links
A random walk can get stuck in dead ends
At a dead end, jump to a random web page
At a non-dead end, with probability 10%, jump to a random page
With the remaining probability (90%), go out on a random hyperlink, e.g., choose one of the 4 hyperlinks of the page, each with probability (1 − 0.1) / 4 = 0.225
With teleporting, a walk cannot get stuck in a dead end
Over a long time period, the random walk visits each state in proportion to the steady-state probability distribution, regardless of where the walk starts
The Transition Probability Matrix
Construction process of the transition probability matrix P:
1. Given an N × N adjacency matrix A, for each row of A that has no 1’s, set each column value to 1/N
2. For all the other rows in A:
a) Divide each 1 in A by the number of 1’s in its row
b) Multiply the matrix in (a) by (1 − α), the random-walk probability, to yield M, where 0 < α < 1 is the teleport probability
c) Add α/N to every entry of M to obtain P
The probability distribution of the surfer’s position at any time can be depicted by a probability vector x
The Transition Probability Matrix (cont.)
Example. Consider the following Web graph and matrices
[Web graph: A → B (1), C → B (1), B → A (0.5), B → C (0.5)]
A(djacency) =
  0  1  0
  1  0  1
  0  1  0
A′ (row-normalized) =
  0    1    0
  1/2  0    1/2
  0    1    0
Let α = 0.5, and thus 1 − α = 0.5 = 1/2. Hence,
M(ultiplication) = (1 − α) · A′ =
  0    1/2  0
  1/4  0    1/4
  0    1/2  0
and adding α/N = 0.5 / 3 = 1/6 to each entry in M yields
P(robability) =
  1/6   2/3  1/6
  5/12  1/6  5/12
  1/6   2/3  1/6
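The construction steps can be checked with a short sketch. Exact arithmetic via `fractions` reproduces the example matrices; the function name `transition_matrix` is mine, not from the slides.

```python
from fractions import Fraction

def transition_matrix(A, alpha):
    """Step 1: dead-end rows become uniform 1/N.
    Steps 2a-2c: normalize each other row, scale by (1 - alpha),
    and add alpha/N to every entry."""
    N = len(A)
    P = []
    for row in A:
        ones = sum(row)
        if ones == 0:                                   # step 1: dead end
            P.append([Fraction(1, N)] * N)
        else:                                           # steps 2a-2c
            P.append([(1 - alpha) * Fraction(v, ones) + alpha / N
                      for v in row])
    return P

A = [[0, 1, 0],      # A -> B
     [1, 0, 1],      # B -> A, B -> C
     [0, 1, 0]]      # C -> B
P = transition_matrix(A, Fraction(1, 2))
# P == [[1/6, 2/3, 1/6], [5/12, 1/6, 5/12], [1/6, 2/3, 1/6]]
```

Every row of P sums to 1, as required of a transition probability matrix.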
Markov Chains
[Figure: a sample Web graph with its Link (Adjacency) matrix and transition probability matrix P]
Probability Vectors
A probability (row) vector x = (x1, . . . , xN) indicates where the random walk is, i.e., the walk is on page i with probability xi, with Σ(i=1..N) xi = 1
Example. x = ( 0.05 0.01 0.0 . . . 0.2 . . . 0.01 0.05 0.03 ) over pages 1, 2, 3, . . . , i, . . . , N−2, N−1, N
If the probability vector is x = (x1, . . . , xN) at this step, then row i of the transition probability matrix P dictates where to go next from state i
From x, the next state is distributed as xP
Example. Let π1 be the long-term visit rate (PageRank) of page 1: π = (π1 π2) = (0.25 0.75)
Computing the PageRank
Regardless of where we start, the steady state π is eventually reached
Start with (almost) any distribution π
After one step, we’re at πP
After two steps, we’re at πP^2
After k steps, we’re at πP^k
Multiply π by increasing powers of P until the product is stable; this is the power (iteration) method
Example. Let π = (0.5, 0.5) and
P =
  0.25  0.75
  0.25  0.75
Then πP = (0.25, 0.75) and πP^2 = (0.25, 0.75): convergence in one iteration!
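The two-state example can be verified directly. This is a minimal sketch; `step` is a hypothetical helper for the row-vector-times-matrix product x ↦ xP.

```python
from fractions import Fraction

def step(x, P):
    """One power-iteration step: return the row vector xP."""
    N = len(P)
    return [sum(x[i] * P[i][j] for i in range(N)) for j in range(N)]

P = [[Fraction(1, 4), Fraction(3, 4)],
     [Fraction(1, 4), Fraction(3, 4)]]
pi = [Fraction(1, 2), Fraction(1, 2)]   # start with pi = (0.5, 0.5)

pi1 = step(pi, P)    # (1/4, 3/4)
pi2 = step(pi1, P)   # (1/4, 3/4): stable after a single iteration
```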
The PageRank Computation
If π (the probability vector) is the initial distribution over the states, then the distribution at time t is πP^t
As t grows large, πP^t = πP^(t+1), which yields the PageRank values
Example. Consider the following matrix, computed earlier:
P(robability) =
  1/6   2/3  1/6
  5/12  1/6  5/12
  1/6   2/3  1/6
Assume that the surfer starts in state A, where the initial distribution vector is x0 = (1, 0, 0)
After one step the distribution vector is x0 P = (1/6, 2/3, 1/6) = x1
After two steps, x1 P = (1/6, 2/3, 1/6) P = (1/3, 1/3, 1/3) = x2
The PageRank Computation (cont.)
Example. (Continued). Eventually, the sequence of probability vectors is
  x0 :  1     0     0
  x1 :  1/6   2/3   1/6
  x2 :  1/3   1/3   1/3
  x3 :  1/4   1/2   1/4
  x4 :  7/24  5/12  7/24
  …  :  …     …     …
  x∞ :  5/18  4/9   5/18
x∞ is the steady-state probability distribution, i.e., the PageRank values of the Web pages
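The table above can be reproduced by repeated multiplication by P. A sketch using the matrix from the earlier example, in floating point; the iteration count of 50 is my choice, picked so the remaining error is negligible.

```python
# Transition matrix from the earlier example (alpha = 0.5)
P = [[1/6, 2/3, 1/6],
     [5/12, 1/6, 5/12],
     [1/6, 2/3, 1/6]]

x = [1.0, 0.0, 0.0]                  # x0: the surfer starts in state A
for _ in range(50):                  # power iteration: x <- xP
    x = [sum(x[i] * P[i][j] for i in range(3)) for j in range(3)]

# x approaches the steady state (5/18, 4/9, 5/18) ~ (0.2778, 0.4444, 0.2778)
```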
16
How Important is PageRank
Frequent claim:
PageRank
is the most important component
of Web ranking
The reality:
There are several components that are at least as important,
e.g., anchor text, phrases, proximity, headings, etc.
Rumor has it that
PageRank
in its original form has a
negligible impact on ranking, since
link spam
is difficult
and crucial to detect
However, variants of a page’s
PageRank
are still an essential
part of ranking