Linear Algebra in Use: Ranking Web Pages with an Eigenvector

Maia Bittner, Yifei Feng

Abstract—Google PageRank is an algorithm that uses the underlying, hyperlinked structure of the web to determine the theoretical number of times a random web surfer would visit each page. Google converts these numbers into probabilities, and uses these probabilities as each web page's relative importance. Then, the most important pages for a given query can be returned first in search results. In this paper, we focus on the math behind PageRank, which includes eigenvectors and Markov Chains, and on explaining how it is used to rank webpages.

Google PageRank, Eigenvector, Eigenvalue, Markov Chain
I. INTRODUCTION

The internet is humanity's largest collection of information, and its democratic nature means that anyone can contribute more information to it. Search engines help users sort through the billions of available web pages to find the information that they are looking for. Most search engines use a two-step process to return web pages based on the user's query. The first step involves finding which of the pages the search engine has indexed are related to the query, either by containing the words in the query, or by more advanced means that use semantic models. The second step is to order this list of relevant pages according to some criterion. For example, the first web search engines, launched in the early '90s, used text-based matching systems as their criterion for ordering returned results by relevancy. This ranking method often resulted in returning exact matches in unauthoritative, poorly written pages before results that the user could trust. Even worse, this system was easy to exploit by page owners, who could fill their pages with irrelevant words and phrases in the hope of ranking highly. These problems prompted researchers to investigate more advanced methods of ranking.

Larry Page and Sergey Brin were researching a new kind of search engine at Stanford University when they had the idea that pages could be ranked by link popularity. The underlying social basis of their idea was that more reputable pages are linked to by other pages more often. Page and Brin developed an algorithm that could quantify this idea, and in 1998 christened the algorithm PageRank [2] and published their first paper on it. Shortly afterwards, they founded Google, a web search engine that uses PageRank to help rank the returned results. Google's famously useful search results [3] helped it reach almost $29 billion in revenue in 2010 [1]. The original algorithm organized the indexed web pages such that the links between them were used to construct the probability of a random web surfer navigating from one page to another. This system can be characterized as a Markov Chain, a mathematical model described below, in order to take advantage of its convenient properties.

In this paper, we will explain how the interlinking structure of the web and the properties of Markov Chains can be used to quantify the relative importance of each indexed page. We will examine Markov Chains, eigenvectors, and the power iteration method, as well as some of the problems that arise when using this system to rank web pages.
A. Markov Chains

Markov chains are mathematical models that describe particular types of systems. For example, we can construe the number of students at Olin College who are sick as a Markov Chain if we know how likely it is that a student will become sick. Let us say that if a student is feeling well, she has a 5% chance of becoming sick the next day, and that if she is already sick, she has a 35% chance of feeling better tomorrow. In our example, a student can only be healthy or sick; these two states are called the state space of the system. In addition, we've decided to only ask how the students are feeling in the morning, and their health on any day only depends on how they were feeling the previous morning. This constant, discrete increase in time makes the system time-homogeneous. We can generate a set of linear equations that will describe how many students at Olin College are healthy and sick on any given day. If we let m_k indicate the number of healthy students and n_k indicate the number of sick students at morning k, then we get the following two equations:

m_{k+1} = 0.95 m_k + 0.35 n_k
n_{k+1} = 0.05 m_k + 0.65 n_k
Putting this system of linear equations into matrix notation, we get:

[ m_{k+1} ]   [ 0.95  0.35 ] [ m_k ]
[ n_{k+1} ] = [ 0.05  0.65 ] [ n_k ]    (1)

We can take this matrix full of probabilities and call it P, the transition matrix.

P = [ 0.95  0.35 ]
    [ 0.05  0.65 ]    (2)
The columns can be viewed as representing the present state of the system, and the rows can be viewed as representing the future state of the system. The first column accounts for the students who are healthy today, and the second column accounts for the students who are sick today, while the first row indicates the students who will be healthy tomorrow and the second row indicates the students who will be sick tomorrow. The intersecting elements of the transition matrix represent the probability that a student will transition from that column state to the row state. So we can see that p_{1,1} indicates that 95% of the students who are healthy today will be healthy tomorrow. The total number of students who will be healthy tomorrow is represented by the first row, so it is 95% of the students who are healthy today plus 35% of the students who are sick today. Similarly, the second row shows us the number of students who will be sick tomorrow: 5% of the students who are healthy today plus 65% of the students who are sick today.

You can see that each column sums to one, to account for 100% of the students who are sick and 100% of the students who are healthy in the current state. Square matrices like this, which have nonnegative real entries and whose columns each sum to one, are called column-stochastic.
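As a quick numerical check of this property, here is a minimal Python/NumPy sketch (the snippet and its variable names are ours, added only for illustration):

    import numpy as np

    # Transition matrix P from Eqn. (2): column 1 is "healthy today", column 2 is "sick today".
    P = np.array([[0.95, 0.35],
                  [0.05, 0.65]])

    # A matrix is column-stochastic if all entries are nonnegative
    # and every column sums to one.
    print(np.all(P >= 0))    # True
    print(P.sum(axis=0))     # [1. 1.]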
We can find the total number of students who will be in a state on day k+1 by multiplying the transition matrix by a vector containing how many students were in each state on day k.

P x_k = x_{k+1}    (3)
For example, if 270 students at Olin College are healthy today, and 30 are sick, we can find the state vector for tomorrow:

[ 0.95  0.35 ] [ 270 ]   [ 267 ]
[ 0.05  0.65 ] [  30 ] = [  33 ]    (4)

which shows that tomorrow, 267 students will be healthy and 33 students will be sick. To find the next day's values, you can multiply again by the transition matrix:

[ 0.95  0.35 ] [ 267 ]   [ 265.2 ]
[ 0.05  0.65 ] [  33 ] = [  34.8 ]    (5)

which is the same as

[ 0.95  0.35 ]^2 [ 270 ]   [ 265.2 ]
[ 0.05  0.65 ]   [  30 ] = [  34.8 ]    (6)

So we can see that in order to find m_k and n_k, we can multiply P^k by the state vector containing m_0 and n_0, as in the equation below:

[ m_k ]   [ 0.95  0.35 ]^k [ m_0 ]
[ n_k ] = [ 0.05  0.65 ]   [ n_0 ]    (7)
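To make Eqns. (4) through (7) concrete, here is a minimal NumPy sketch that reproduces the numbers above (the variable names are ours and the snippet is only illustrative):

    import numpy as np

    P = np.array([[0.95, 0.35],
                  [0.05, 0.65]])   # transition matrix from Eqn. (2)
    x0 = np.array([270, 30])       # 270 healthy and 30 sick students today

    x1 = P @ x0                                      # Eqn. (4): [267., 33.]
    x2 = P @ x1                                      # Eqn. (5): [265.2, 34.8]
    x2_direct = np.linalg.matrix_power(P, 2) @ x0    # Eqn. (6): same as x2

    print(x1, x2, x2_direct)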
If we continue to multiply the state vector by the transition matrix for very high values of k, we will see that it eventually converges upon a steady state, regardless of initial conditions. This is represented by vector q in Eqn. (8).

Pq = q    (8)

Being able to find this steady-state vector is the main advantage of using a column-stochastic matrix to model a system. Column-stochastic transition matrices are always used to represent the known probabilities of transitioning between states in Markov Chains. To model a system as a Markov Chain, it must be a discrete-time process that has a finite state space, and the probability distribution for any state must depend only on the previous state. Every situation that can be classified as a Markov Chain has these steady-state values at which the system will remain constant, regardless of the initial state. This steady-state vector is a specific example of an eigenvector, explained below.
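This convergence can also be checked numerically. The sketch below (ours, with an arbitrary tolerance and starting vector) simply keeps multiplying by P until the state vector stops changing:

    import numpy as np

    P = np.array([[0.95, 0.35],
                  [0.05, 0.65]])
    x = np.array([270.0, 30.0])    # any starting distribution of 300 students

    for _ in range(1000):
        x_next = P @ x
        if np.allclose(x_next, x, atol=1e-10):
            break
        x = x_next

    print(x)        # approx. [262.5, 37.5], the steady-state vector q
    print(P @ x)    # multiplying again leaves it (nearly) unchanged: Pq = q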
B. Eigenvalues and Eigenvectors

An eigenvector is a nonzero vector x that, when multiplied by a matrix A, only scales in length and does not change direction, except for potentially a reversal in direction.

Ax = λx    (9)

The corresponding amount that an eigenvector is scaled by, λ, is called its eigenvalue. There are several techniques to find the eigenvalues and eigenvectors of a matrix. We will demonstrate one technique below with matrix A.

A = [ 1  2  5 ]
    [ 0  3  0 ]
    [ 0  0  4 ]
How to find the eigenvalues for matrix A: We know that λ is an eigenvalue of A if and only if the equation

Ax = λx    (10)

has a nontrivial solution. This is equivalent to finding λ such that

(A - λI)x = 0    (11)

The above equation has a nontrivial solution when the determinant of A - λI is zero.

det(A - λI_3) =
| 1-λ    2     5  |
|  0    3-λ    0  |
|  0     0    4-λ |
= (4 - λ)(1 - λ)(3 - λ) = 0

Solving for λ, we get that the eigenvalues are λ_1 = 1, λ_2 = 3, and λ_3 = 4. If we solve Av_i = λ_i v_i, we will get the corresponding eigenvector for each eigenvalue:
v_1 = [ 1 ]    v_2 = [ 1 ]    v_3 = [ 5 ]
      [ 0 ]          [ 1 ]          [ 0 ]
      [ 0 ]          [ 0 ]          [ 3 ]

This means that if v_2 is transformed by A, the result will scale v_2 by its eigenvalue, 3.
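These values can also be obtained directly with a numerical eigenvalue routine. A minimal sketch (ours; NumPy returns unit-length eigenvectors, so they appear as scalar multiples of the v_i above, and the order of the eigenvalues may vary):

    import numpy as np

    A = np.array([[1., 2., 5.],
                  [0., 3., 0.],
                  [0., 0., 4.]])

    eigenvalues, eigenvectors = np.linalg.eig(A)
    print(eigenvalues)     # e.g. [1. 3. 4.] (order may vary)
    print(eigenvectors)    # columns are eigenvectors, scaled to unit length

    # Check the defining property A v = lambda v for one of the pairs.
    v = eigenvectors[:, 1]
    print(np.allclose(A @ v, eigenvalues[1] * v))    # True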
You can see in Eqn. (8) that the steady state of a Markov Chain has an eigenvalue of 1. This is why those steady-state vectors are a special case of eigenvectors. Because they are column-stochastic, all transition matrices of Markov Chains will have an eigenvalue of 1 (we invite the reader to prove this in Exercise 4). A system having an eigenvalue of 1 is the same as it having a steady state.

In some matrices, we may get repeated roots when solving det(A - λI) = 0. We will demonstrate this for the column-stochastic matrix P:

P = [ 0  1  0  0   0  ]
    [ 1  0  0  0   0  ]
    [ 0  0  0  1  1/2 ]
    [ 0  0  0  0  1/2 ]
    [ 0  0  1  0   0  ]
Figure 1. A small web of 4 pages, connected by directional links
To find the eigenvalues, solve:

det(P - λI_5) =
| -λ   1    0    0    0  |
|  1  -λ    0    0    0  |
|  0   0   -λ    1   1/2 |
|  0   0    0   -λ   1/2 |
|  0   0    1    0   -λ  |
= -(1/2)(λ - 1)^2 (λ + 1)(2λ^2 + 2λ + 1) = 0
When we solve the characteristic equation, we find that the five eigenvalues are λ_1 = 1, λ_2 = 1, λ_3 = -1, λ_4 = -1/2 - i/2, and λ_5 = -1/2 + i/2. Since 1 appears twice as an eigenvalue, we say that it has algebraic multiplicity of 2. The number of linearly independent eigenvectors associated with eigenvalue 1 is called the geometric multiplicity of λ = 1. The reader can confirm that in this case, λ = 1 has geometric multiplicity of 2 with associated eigenvectors x and y.
x = [ √2/2 ]    y = [  0  ]
    [ √2/2 ]        [  0  ]
    [  0   ]        [ 2/3 ]
    [  0   ]        [ 1/3 ]
    [  0   ]        [ 2/3 ]
We can see that when the transition matrix of a Markov chain has geometric multiplicity greater than one for the eigenvalue 1, it is unclear which independent eigenvector should be used to represent the steady state of the system.
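A short numerical check of this multiplicity (a sketch of ours; it counts how many eigenvalues equal 1 and inspects the associated eigenvectors):

    import numpy as np

    P = np.array([[0, 1, 0, 0, 0  ],
                  [1, 0, 0, 0, 0  ],
                  [0, 0, 0, 1, 0.5],
                  [0, 0, 0, 0, 0.5],
                  [0, 0, 1, 0, 0  ]], dtype=float)

    w, V = np.linalg.eig(P)
    ones = np.isclose(w, 1)
    print(w)                      # contains 1 twice, -1, and -1/2 +/- i/2
    print(int(ones.sum()))        # 2: the eigenvalue 1 appears twice
    print(np.real(V[:, ones]))    # two independent eigenvectors spanning that eigenspace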
II. PAGERANK

When the founders of Google created PageRank, they were trying to discern the relative authority of web pages from the underlying structure of links that connects the web. They did this by calculating an importance score for each web page. Given a webpage, call it page k, we can use x_k to denote the importance of this page among the total number of n pages. There are many different ways that one could calculate an importance score. One simple and intuitive way to do page ranking is to count the number of links from other pages to page k, and assign that number as x_k. We can think of each link as being one vote for the page it links to, and of the number of votes a page gets as showing the importance of the page. In the example network of Figure 1, there are four webpages. Page 1 is linked to by pages 2 and 4, so its importance score is x_1 = 2. In the same way, we can get x_2 = 3, x_3 = 1, x_4 = 2. Page 2 has the highest importance score, indicating that page 2 is the most important page in this web.
However, this approach has several drawbacks. First, the pages that have more links to other pages would have more votes, which means that a website can easily gain more influence by creating many links to other pages. Second, we would expect that a vote from an important page should weigh more than a vote from an unimportant one, but every page's vote is worth the same amount with this method. A way to fix both of these problems is to give each page an amount of voting power equivalent to its importance score. So webpage k, with an importance score of x_k, has a total voting power of x_k. Then we can equally distribute x_k to all the pages it links to. We can define the importance score of a page as the sum of all the weighted votes it gets from the pages that link to it. So if webpage k has a set of pages S_k linked to it, we have

x_k = Σ_{j ∈ S_k} x_j / n_j    (12)

where n_j is the number of links from page j. If we apply this method to the network of Figure 1, we get a system of linear equations:
x_1 = x_2/1 + x_4/2
x_2 = x_1/3 + x_3/2 + x_4/2
x_3 = x_1/3
x_4 = x_1/3 + x_3/2

which can be written in the matrix form x = Lx, where x = [x_1, x_2, x_3, x_4]^T and

L = [  0   1   0   1/2 ]
    [ 1/3  0  1/2  1/2 ]
    [ 1/3  0   0    0  ]
    [ 1/3  0  1/2   0  ]
L is called the link matrix of this network system since it encapsulates the links between all the pages in the system. Because we've evenly distributed x_k to each of the pages k links to, the link matrix is always column-stochastic. As we defined earlier, vector x contains the importance scores of all web pages. To find these scores, we can solve Lx = x. You'll notice that this looks similar to Eqn. (8), and indeed, we can transform this problem into finding the eigenvector with eigenvalue λ = 1 for the matrix L! For matrix L, the eigenvector for λ = 1 is [0.387, 0.290, 0.129, 0.194]^T. So we know the importance score of each page is x_1 ≈ 0.387, x_2 ≈ 0.290, x_3 ≈ 0.129, x_4 ≈ 0.194. Note that with this more sophisticated method, page 1 has the highest importance score instead of page 2. This is because page 2 only links to page 1, so it casts its entire vote to page 1, boosting up its score.
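Before moving on, here is a hedged NumPy sketch that recovers the importance scores above. It builds L for Figure 1 from a plain list of out-links (the out_links dictionary is our own encoding of the figure) and then extracts the eigenvector for λ = 1:

    import numpy as np

    # out_links[j] = pages that page j links to (1-indexed, as in Figure 1)
    out_links = {1: [2, 3, 4], 2: [1], 3: [2, 4], 4: [1, 2]}

    n = len(out_links)
    L = np.zeros((n, n))
    for j, targets in out_links.items():
        # column j distributes page j's vote evenly over its outgoing links
        L[np.array(targets) - 1, j - 1] = 1.0 / len(targets)

    w, V = np.linalg.eig(L)
    v = np.real(V[:, np.argmin(np.abs(w - 1))])   # eigenvector for the eigenvalue closest to 1
    scores = v / v.sum()                          # normalize so the scores sum to 1

    print(scores)   # approx. [0.387, 0.290, 0.129, 0.194]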
Knowing that the link matrix is a column-stochastic matrix, let us now look at the problem of ranking internet pages in terms of a Markov Chain system. For a network with n pages, the ith entry of the n × 1 vector x_k denotes the probability of visiting page i after k clicks. The link matrix L is the transition matrix such that the entry l_ij is the probability of clicking on a link to page i when on page j. Finding the importance score of a page is the same as finding its entry in the steady-state vector of the Markov chain that describes the system. For example, say we start from page 1, so that the vector representing our beginning state is x_0 = [1, 0, 0, 0]^T. To find the probability of ending up on each web page after n clicks, we do:

x_n = L^n x_0    (13)
where x_n represents the state after n clicks (the probability of being on each page), and L is the link matrix, or transition matrix. So by calculating the powers of L, we can determine the steady-state vector. This process is called the Power Iteration method, and it converges on an estimate of the eigenvector for the greatest eigenvalue, which is always 1 in the case of a column-stochastic matrix. For example, by raising the link matrix L to the 25th power, we have

B ≈ [ 0.387  0.387  0.387  0.387 ]
    [ 0.290  0.290  0.290  0.290 ]
    [ 0.129  0.129  0.129  0.129 ]
    [ 0.194  0.194  0.194  0.194 ]

If we multiply matrix B by our initial state x_0, we get our steady-state vector

s = [ 0.387  0.290  0.129  0.194 ]^T

which shows the probability of each page being visited. This power iteration process gives us approximately the same result as finding the eigenvector of the link matrix, but is often more computationally feasible, especially for matrices with a dimension of around 1 billion, like Google's. These computational savings are why this is the method by which Google actually calculates the PageRank of web pages [5]. By taking powers of the matrix to estimate eigenvectors, Google is doing the reverse of many applications, which diagonalize matrices into their eigenvector components in order to take them to a high power. Few applications actually use the power iteration method, since it is only appropriate given a narrow range of conditions. The sparseness of the web's link matrix, and the need to know only the eigenvector corresponding to the dominant eigenvalue, make this an application well suited to take advantage of the power iteration method.
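The corresponding power iteration loop is short. This is a simplified sketch of the idea, not Google's production implementation; in practice the iteration uses sparse matrix-vector products rather than dense powers of L:

    import numpy as np

    def power_iteration(L, num_steps=100):
        """Estimate the steady-state (ranking) vector of a column-stochastic matrix L."""
        n = L.shape[0]
        x = np.full(n, 1.0 / n)     # any probability vector works as a start
        for _ in range(num_steps):
            x = L @ x               # one more "click" of the random surfer
            x = x / x.sum()         # guard against rounding drift
        return x

    L = np.array([[0,   1, 0,   0.5],
                  [1/3, 0, 0.5, 0.5],
                  [1/3, 0, 0,   0  ],
                  [1/3, 0, 0.5, 0  ]])

    print(power_iteration(L))   # approx. [0.387, 0.290, 0.129, 0.194]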
A. Subwebs

We now address a problem that this model has when faced with real-world constraints. We refer to this problem as disconnected subwebs, as shown in Figure 2. If there are two or more groups of linked pages that do not link to each other, it is impossible to rank all pages relative to each other. The matrix shown below is the link matrix for the web shown in Figure 2:

Figure 2. Here are two small subwebs, which do not exchange links

A = [ 0  1  0  0   0  ]
    [ 1  0  0  0   0  ]
    [ 0  0  0  1  1/2 ]
    [ 0  0  0  0  1/2 ]
    [ 0  0  1  0   0  ]
Mathematically, this problem poses itself as being a multi-dimensional Markov Chain. Link matrix A has a geometric multiplicity of 2 for the eigenvalue of 1, as we showed in Section I.B. It's unclear which of the two associated eigenvectors should be chosen to form the rankings. The two eigenvectors are essentially eigenvectors for each subweb, and each shows rankings which are accurate locally, but not globally.

Google has chosen to solve this problem of subwebs by introducing an element of randomness into the link matrix. Defining a matrix S as an n × n matrix with all entries 1/n, and a value m between 0 and 1 as a relative weight, we can replace link matrix A with:

M = (1 - m)A + mS    (14)

If m > 0, there will be no parts of matrix M which represent entirely disconnected subwebs, as every web surfer has some probability of reaching another page regardless of the page they're on. In the original PageRank algorithm, an m value of 0.15 was used. Today, it is speculated by those outside of Google that the value currently in use lies between 0.1 and 0.2. The larger the value of m, the more heavily the random matrix is weighted, and the more egalitarian the corresponding PageRank values are. If m is 1, a web surfer has equal probability of getting to any page on the web from any other page, and all links would be ignored. If m is 0, any subwebs contained in the system will cause the eigenvalue 1 to have a multiplicity greater than 1, and there will be ambiguity in the system.
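A hedged sketch of Eqn. (14) applied to the disconnected web of Figure 2: with m = 0.15 the modified matrix M has a single eigenvalue equal to 1, so all five pages can be ranked together (the printed numbers are simply what this toy example yields, not published PageRank values):

    import numpy as np

    A = np.array([[0, 1, 0, 0, 0  ],
                  [1, 0, 0, 0, 0  ],
                  [0, 0, 0, 1, 0.5],
                  [0, 0, 0, 0, 0.5],
                  [0, 0, 1, 0, 0  ]], dtype=float)

    n = A.shape[0]
    m = 0.15                          # damping weight used in the original algorithm
    S = np.full((n, n), 1.0 / n)      # random-surfer matrix: every entry is 1/n
    M = (1 - m) * A + m * S           # Eqn. (14)

    w, V = np.linalg.eig(M)
    print(int(np.isclose(w, 1).sum()))             # 1: the ambiguity is gone
    v = np.real(V[:, np.argmin(np.abs(w - 1))])
    print(v / v.sum())                             # a single global ranking over all 5 pages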
III. DISCUSSION

We've shown how systems that can be characterized as a Markov Chain will converge to a steady state, and that these steady-state values can be found by using either the characteristic equation or the power iteration method. We then investigated how the web can be viewed as a Markov Chain when the state is which page a web surfer is on, and the hyperlinks between pages dictate the probability of transitioning from one state to another. With this characterization, we can view the steady-state vector as the proportional amount of time a web surfer would spend on every page, and hence as a valuable metric for which pages are more important.
REFERENCES

[1] O'Neill, Nick. Why Facebook Could Be Worth $200 Billion. All Facebook. Available at http://www.allfacebook.com/is-facebooks-valuation-hype-a-macro-perspective-2011-01
[2] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
[3] Granka, Laura A., Joachims, Thorsten, and Gay, Geri. Eye-Tracking Analysis of User Behavior in WWW Search. SIGIR, 2004. Available at http://www.cs.cornell.edu/People/tj/publications/granketal04a.pdf
[4] Grinstead, Charles M., and Snell, J. Laurie. Introduction to Probability: Chapter 11, Markov Chains. Available at http://www.dartmouth.edu/chance/teachingaids/booksarticles/probabilitybook/Chapter11.pdf
[5] Ipsen, Ilse, and Rebecca M. Wills. Analysis and Computation of Google's PageRank. 7th IMACS International Symposium on Iterative Methods in Scientific Computing, Fields Institute, Toronto, Canada, 5 May 2005. Available at http://www4.ncsu.edu/ipsen/ps/slidesimacs.pdf
Figure 3. A small web of 5 pages
IV. EXERCISES

Exercise 1: Find the eigenvalues and corresponding eigenvectors of the following matrix.

A = [ 3  0  1  0 ]
    [ 4  3  2  0 ]
    [ 0  0  1  0 ]
    [ 2  0  3  2 ]
Exercise 2: Given the column-stochastic matrix P:

P = [ 0.6  0.3 ]
    [ 0.4  0.7 ]

find the steady-state vector for P.

Exercise 3: Create a link matrix for the network with 5 internet pages in Figure 3, then rank the pages.

Exercise 4: In Section I.B we claim that all transition matrices for Markov chains have 1 as an eigenvalue. Why is this true for every column-stochastic matrix?
V. SOLUTIONS TO EXERCISES

Solution 1: The eigenvalues are λ_1 = 2, λ_2 = 3, λ_3 = 3, and λ_4 = 1, and the corresponding eigenvectors are

x_1 = [ 0 ]    x_2 = [ 0 ]    x_3 = [ 3 ]    x_4 = [ 1 ]
      [ 0 ]          [ 1 ]          [ 2 ]          [ 2 ]
      [ 0 ]          [ 0 ]          [ 0 ]          [ 2 ]
      [ 1 ]          [ 0 ]          [ 6 ]          [ 4 ]
Solution 2: The steady-state vector for P is

s = [ 0.4286 ]
    [ 0.5714 ]
Solution 3: The link matrix is

A = [  0   1  1/3   0   0 ]
    [ 1/3  0  1/3   0   0 ]
    [  0   0   0   1/2  1 ]
    [ 1/3  0  1/3   0   0 ]
    [ 1/3  0   0   1/2  0 ]

The ranking vector is

x = [ 1/4  1/6  1/4  1/6  1/6 ]^T
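A quick numerical check of Solution 3 (a sketch with our own variable names): multiplying the link matrix by the ranking vector should return the same vector, confirming it is the eigenvector for λ = 1.

    import numpy as np

    A = np.array([[0,   1, 1/3, 0,   0],
                  [1/3, 0, 1/3, 0,   0],
                  [0,   0, 0,   0.5, 1],
                  [1/3, 0, 1/3, 0,   0],
                  [1/3, 0, 0,   0.5, 0]])

    x = np.array([1/4, 1/6, 1/4, 1/6, 1/6])

    print(np.allclose(A @ x, x))   # True: x is the steady-state ranking vector
    print(A.sum(axis=0))           # every column sums to 1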
Solution 4: By definition, every column in a column-stochastic matrix contains no negative numbers and sums to 1. As shown in Section I.B, eigenvalues are the numbers that, when subtracted from the main diagonal of a matrix, cause its determinant to equal 0. Since every column of P sums to 1, subtracting 1 from each diagonal entry makes every column of P - I sum to 0. The rows of P - I therefore add up to the zero row, so the rows are linearly dependent and det(P - I) = 0. Therefore, 1 is always an eigenvalue of a column-stochastic matrix.