Hyperlink Analysis for the Web


Information Retrieval


Input: Document collection

Goal: Retrieve documents or text with information content that is relevant to the user's information need

Two aspects:

1. Processing the collection

2. Processing queries (searching)

Classic information retrieval


Ranking is a function of query term frequency within the document (tf) and across all documents (idf)

This works because of the following assumptions in classical IR:

Queries are long and well specified: "What is the impact of the Falklands war on Anglo-Argentinean relations"

Documents (e.g., newspaper articles) are coherent, well authored, and are usually about one topic

The vocabulary is small and relatively well understood
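
As a concrete reference point, one common form of the classic tf-idf ranking score (notation mine, not from the slides) is:

$$\text{score}(q, d) = \sum_{t \in q} \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}$$

where $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$, $\mathrm{df}_t$ is the number of documents containing $t$, and $N$ is the collection size.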


Web information retrieval


None of these assumptions hold:

Queries are short: 2.35 terms on average

Huge variety in documents: language, quality, duplication

Huge vocabulary: 100s of millions of terms

Deliberate misinformation

Ranking is a function of the query terms and of the hyperlink structure

Connectivity-based ranking


Ranking Returned Documents

Query dependent ranking

Query independent ranking

Hyperlink analysis

Idea: Mine the structure of the web graph

Each web page is a node

Each hyperlink is a directed edge


Query dependent ranking

Assigns a score that measures the quality and relevance of a selected set of pages to a given user query.

The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.

Building a neighborhood graph

A start set of documents matching the query is fetched from a search engine (say, the top 200 matches).

The start set is augmented by its neighborhood, which is the set of documents that either hyperlink to or are hyperlinked to by documents in the start set.

Since the indegree of nodes can be very large, in practice a limited number of these documents (say, 50) is included.

Each document in both the start set and the neighborhood is modeled by a node. There exists an edge from node A to node B if and only if document A hyperlinks to document B.

Hyperlinks between pages on the same Web host can be omitted.
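
A minimal sketch of this construction in Python, assuming hypothetical helpers `search(query, k)` (top-k result URLs from a search engine), `out_links(page)` and `in_links(page, limit)`:

```python
from urllib.parse import urlparse

def build_neighborhood_graph(query, search, out_links, in_links,
                             start_size=200, back_limit=50):
    """Build the query-specific neighborhood graph described above."""
    start_set = set(search(query, k=start_size))        # top matches from a search engine
    nodes = set(start_set)

    for page in start_set:
        nodes.update(out_links(page))                    # forward neighbors
        nodes.update(in_links(page, limit=back_limit))   # back neighbors (indegree capped)

    def host(url):
        return urlparse(url).netloc

    # Edge A -> B iff document A hyperlinks to document B; same-host links omitted.
    edges = {(a, b) for a in nodes for b in out_links(a)
             if b in nodes and host(a) != host(b)}
    return nodes, edges
```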

Neighborhood graph (figure): the query results form the start set (Result 1 ... Result n); the forward set (f1 ... fs) contains pages the start set links to, and the back set (b1 ... bm) contains pages that link to it. There is an edge for each hyperlink, but no edges within the same host. This is the subgraph associated with each query.

HITS [K'98]

Goal: Given a query, find:

Good sources of content (authorities)

Good sources of links (hubs)

Authority comes from in-edges. Being a good hub comes from out-edges.

Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities.


Intuition (figure): page A is an authority, receiving in-edges from pages q1 ... qk; page H is a hub, with out-edges to pages r1 ... rk.
HITS details
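
The standard HITS update rules (Kleinberg's formulation, consistent with the intuition above) compute the authority score A and hub score H of each page p over the neighborhood graph's edge set E as:

$$A[p] = \sum_{(q,p) \in E} H[q], \qquad H[p] = \sum_{(p,q) \in E} A[q]$$

with both vectors normalized after every iteration.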



HITS


Kleinberg proved that the H and A vectors will eventually converge, i.e., that termination is guaranteed.

In practice we found the vectors to converge in about 10 iterations.

Documents are ranked by hub and authority scores respectively.

The algorithm does not claim to find all relevant pages, since there may be some that have good content but have not been linked to by many authors.
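
A minimal sketch of this iteration on such a neighborhood graph (edges given as (source, target) pairs; the normalization and fixed iteration count are my choices, not prescribed by the slides):

```python
import math

def hits(nodes, edges, iterations=10):
    """Iteratively compute hub (H) and authority (A) scores;
    ~10 iterations is usually enough for the ranking to stabilize."""
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority of n: sum of hub scores of pages linking to n.
        auth = {n: sum(hub[src] for (src, dst) in edges if dst == n) for n in nodes}
        # Hub score of n: sum of authority scores of pages n links to.
        hub = {n: sum(auth[dst] for (src, dst) in edges if src == n) for n in nodes}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth
```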

Problems with the HITS algorithm (1)

Only a relatively small part of the Web graph is considered; adding edges to a few nodes can change the resulting hub and authority scores considerably.

It is relatively easy to manipulate these scores.

Problems with the HITS algorithm (2)

We often find that the neighborhood graph contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises.

The most highly ranked authorities and hubs tend not to be about the original topic.

For example, when running the algorithm on the query "jaguar and car", the computation drifted to the general topic "car" and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.


Improvements

To avoid "undue weight" of the opinion of a single person:

All the documents on a single host have the same influence on the document they are connected to as a single document would.

Ideas:

If there are k edges from documents on a first host to a single document on a second host, we give each edge an authority weight of 1/k.

If there are l edges from a single document on a first host to a set of documents on a second host, we give each edge a hub weight of 1/l.
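
A sketch of how these edge weights could be computed from the cross-host edge set of the neighborhood graph (`urlparse` stands in for a proper host function):

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc

def edge_weights(edges):
    """Assign each edge an authority weight 1/k and a hub weight 1/l."""
    # k: number of documents on the same source host pointing at the same target document.
    k_count = defaultdict(int)   # (source_host, target_doc) -> k
    # l: number of documents on the same target host pointed at by the same source document.
    l_count = defaultdict(int)   # (source_doc, target_host) -> l
    for src, dst in edges:
        k_count[(host(src), dst)] += 1
        l_count[(src, host(dst))] += 1

    auth_w = {(s, d): 1.0 / k_count[(host(s), d)] for (s, d) in edges}
    hub_w = {(s, d): 1.0 / l_count[(s, host(d))] for (s, d) in edges}
    return auth_w, hub_w
```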


Improvements

To solve the topic drift problem, content analysis can be used.

Ideas:

Eliminating non-relevant nodes from the graph

Regulating the influence of a node based on its relevance.

Improvements

Computing Relevance Weights for Nodes

The documents in the start set are used to define a broader query, and every document in the graph is matched against this query.

Specifically, the concatenation of the first 1000 words from each document is considered to be the query Q, and similarity(Q, D) is computed for each document D.

All nodes whose weights are below a threshold are pruned.

Improvements

Regulating the Influence of a Node

Let W[n] be the relevance weight of a node n

W[n]*A[n] is used instead of A[n] for computing the hub scores.

W[n]*H[n] is used instead of H[n] for computing the authority scores.

This reduces the influence of less relevant nodes on the scores of their neighbors.
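
In terms of the HITS sketch above, only the two update lines change; `W` is the relevance-weight map (a hypothetical name):

```python
# Weighted HITS updates: scale each neighbor's contribution by its relevance weight W[n].
auth = {n: sum(W[src] * hub[src] for (src, dst) in edges if dst == n) for n in nodes}
hub  = {n: sum(W[dst] * auth[dst] for (src, dst) in edges if src == n) for n in nodes}
```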

Query-independent ordering

First generation: using link counts as simple measures of popularity.

Two basic suggestions:

Undirected popularity: each page gets a score = the number of in-links plus the number of out-links (3 + 2 = 5).

Directed popularity: score of a page = number of its in-links (3).

Query processing

First retrieve all pages meeting the text query (say, venture capital).

Order these by their link popularity (either variant on the previous slide).

Spamming simple popularity

Exercise: How do you spam each of the following heuristics so your page gets a high score?

Each page gets a static score = the number of in-links plus the number of out-links.

Static score of a page = number of its in-links.


Pagerank scoring

Imagine a browser doing a random walk on web pages:

Start at a random page

At each step, go out of the current page along one of the links on that page, equiprobably

"In the steady state" each page has a long-term visit rate; use this as the page's score.

(Figure: a page with three out-links, each followed with probability 1/3.)

Not quite enough

The web is full of dead-ends.

Random walk can get stuck in dead-ends.

Makes no sense to talk about long-term visit rates.

Teleporting

At a dead end, jump to a random web page.

At any non-dead end, with probability 10%, jump to a random web page.

With remaining probability (90%), go out on a random link.

The 10% is a parameter.
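
A small simulation of this walk (the 10% teleport probability and the step count are illustrative; `out_links` maps each page to a list of its successors):

```python
import random
from collections import Counter

def random_surfer(pages, out_links, steps=100_000, teleport=0.10):
    """Simulate the teleporting random walk and estimate long-term visit rates."""
    visits = Counter()
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        succs = out_links.get(current, [])
        if not succs or random.random() < teleport:
            current = random.choice(pages)     # dead end or teleport: jump anywhere
        else:
            current = random.choice(succs)     # otherwise follow a random out-link
    return {p: visits[p] / steps for p in pages}
```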

Result of teleporting

Now cannot get stuck locally.

There is a long-term rate at which any page is visited (not obvious, will show this).

How do we compute this visit rate?

Markov chains

A Markov chain consists of n states, plus an n × n transition probability matrix P.

At each step, we are in exactly one of the states.

For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.

(Figure: states i and j with a transition edge labeled P_ij; a self-loop with P_ii > 0 is OK.)

Markov chains

Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.

Markov chains are abstractions of random walks.

Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case:

Ergodic Markov chains

A Markov chain is ergodic if:

you have a path from any state to any other

for any start state, after a finite transient time T_0, the probability of being in any state at a fixed time T > T_0 is nonzero.

Ergodic Markov chains

For any ergodic Markov chain, there is a unique long-term visit rate for each state.

Steady-state probability distribution.

Over a long time period, we visit each state in proportion to this rate.

It doesn't matter where we start.

Probability vectors

A probability (row) vector x = (x_1, ..., x_n) tells us where the walk is at any point.

E.g., (0 0 0 ... 1 ... 0 0 0), with the 1 in position i of n, means we're in state i.

More generally, the vector x = (x_1, ..., x_n) means the walk is in state i with probability x_i.

$\sum_{i=1}^{n} x_i = 1$.
Change in probability vector

If the probability vector is x = (x_1, ..., x_n) at this step, what is it at the next step?

Recall that row i of the transition probability matrix P tells us where we go next from state i.

So from x, our next state is distributed as xP.

Steady state example

The steady state looks like a vector of probabilities a = (a_1, ..., a_n):

a_i is the probability that we are in state i.

(Figure: two states 1 and 2; from state 1 the walk moves to state 2 with probability 3/4 and stays put with probability 1/4; from state 2 it moves to state 1 with probability 1/4 and stays put with probability 3/4.)

For this example, a_1 = 1/4 and a_2 = 3/4.
How do we compute this vector?

Let a = (a_1, ..., a_n) denote the row vector of steady-state probabilities.

If our current position is described by a, then the next step is distributed as aP.

But a is the steady state, so a = aP.

Solving this matrix equation gives us a.
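
For the two-state example above, assuming the transitions read off the figure (stay in state 1 with probability 1/4, move 1 to 2 with probability 3/4, move 2 to 1 with probability 1/4, stay in state 2 with probability 3/4), writing out a = aP together with a_1 + a_2 = 1 gives:

$$a_1 = \tfrac{1}{4} a_1 + \tfrac{1}{4} a_2, \qquad a_2 = \tfrac{3}{4} a_1 + \tfrac{3}{4} a_2, \qquad a_1 + a_2 = 1 \;\Rightarrow\; a_1 = \tfrac{1}{4},\; a_2 = \tfrac{3}{4}.$$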

One way of computing a

Recall, regardless of where we start, we eventually reach the steady state a.

Start with any distribution (say x = (1 0 ... 0)).

After one step, we're at xP; after two steps at xP^2, then xP^3 and so on.

"Eventually" means for "large" k, xP^k = a.

Algorithm: multiply x by increasing powers of P until the product looks stable.
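
A minimal sketch of this procedure with NumPy (the tolerance and iteration cap are my choices):

```python
import numpy as np

def steady_state(P, tol=1e-10, max_iter=1000):
    """Multiply a start distribution by increasing powers of P until it stabilizes."""
    n = P.shape[0]
    x = np.zeros(n)
    x[0] = 1.0                      # start in state 0: x = (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ P              # one step of the chain
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next
```

For the two-state example, with the transition matrix assumed above, `steady_state(np.array([[0.25, 0.75], [0.25, 0.75]]))` converges to approximately (0.25, 0.75).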

Google's approach

Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)

Quality of a page is related to its in-degree

Recursion: Quality of a page is related to

its in-degree, and to

the quality of pages linking to it

PageRank [BP '98]

Definition of PageRank

Consider the following infinite random walk (surf):

Initially the surfer is at a random page

At each step, the surfer proceeds

to a randomly chosen web page with probability d

to a randomly chosen successor of the current page with probability 1-d

The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.

PageRank (cont.)

By random walk theorem:

PageRank = stationary probability for this Markov chain, i.e.

$PageRank(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{PageRank(q)}{outdegree(q)}$

where n is the total number of nodes in the graph
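
A direct transcription of this formula into an iterative computation (the graph is given as a dict of successor lists; the value of d and the iteration count are illustrative):

```python
def pagerank(successors, d=0.15, iterations=50):
    """Iterate PageRank(p) = d/n + (1 - d) * sum over edges (q, p) of PageRank(q)/outdegree(q).

    `successors` maps every page to its list of out-links; this simple sketch assumes
    no dead ends (every page has at least one out-link) and that every linked page
    also appears as a key of `successors`."""
    nodes = list(successors)
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        nxt = {p: d / n for p in nodes}          # teleport share d/n for every page
        for q in nodes:
            share = (1 - d) * pr[q] / len(successors[q])
            for p in successors[q]:
                nxt[p] += share                  # q passes an equal share to each successor
        pr = nxt
    return pr
```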
PageRank (cont.)

(Figure: page P with incoming links from page A, which has four out-links, and page B, which has three out-links.)

PageRank of P is (1-d) * (1/4 the PageRank of A + 1/3 the PageRank of B) + d/n

PageRank

Used in Google's ranking function

Query-independent

Summarizes the "web opinion" of the page importance

PageRank vs. HITS

PageRank:

Computation: once for all documents and queries (offline)

Query-independent; requires combination with query-dependent criteria

Hard to spam

HITS:

Computation: required for each query

Query-dependent

Relatively easy to spam

Quality depends on the quality of the start set


We want top-ranking documents to be both relevant and authoritative

Relevance is being modeled by cosine scores

Authority is typically a query-independent property of a document

Assign to each document d a query-independent quality score in [0,1]; denote this by g(d)

Net score

Consider a simple total score combining cosine relevance and authority:

net-score(q,d) = g(d) + cosine(q,d)

Can use some other linear combination than an equal weighting

Now we seek the top K docs by net score
Top K by net score - fast methods

First idea: Order all postings by g(d)

Key: this is a common ordering for all postings

Thus, can concurrently traverse query terms' postings for:

Postings intersection

Cosine score computation

Why order postings by g(d)?

Under g(d)-ordering, top-scoring docs likely to appear early in postings traversal

In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early

Short of computing scores for all docs in postings
Champion lists in g(d)-ordering

Can combine champion lists with g(d)-ordering

Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}

Seek top-K results from only the docs in these champion lists

High and low lists

For each term, we maintain two postings lists called high and low

Think of high as the champion list

When traversing postings on a query, only traverse high lists first

If we get more than K docs, select the top K and stop

Else proceed to get docs from the low lists

Can be used even for simple cosine scores, without global quality g(d)

A means for segmenting the index into two tiers
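
A sketch of the traversal logic, assuming per-term `high` and `low` postings lists (dicts from term to doc lists) and a `score(doc, query_terms)` function are available (all names hypothetical):

```python
def top_k_with_tiers(query_terms, high, low, score, K=10):
    """Try the high (champion) tier first; fall back to the low tier only if
    fewer than K candidate docs are found."""
    candidates = set()
    for t in query_terms:
        candidates |= set(high.get(t, []))          # traverse high lists first
    if len(candidates) < K:
        for t in query_terms:
            candidates |= set(low.get(t, []))       # not enough docs: add low lists
    ranked = sorted(candidates, key=lambda d: score(d, query_terms), reverse=True)
    return ranked[:K]
```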

Impact-ordered postings

We only want to compute scores for docs for which wf_{t,d} is high enough

We sort each postings list by wf_{t,d}

Now: not all postings in a common order!

How do we compute scores in order to pick off top K?

Two ideas follow

1. Early termination

When traversing t's postings, stop early after either:

a fixed number r of docs

wf_{t,d} drops below some threshold

Take the union of the resulting sets of docs, one from the postings of each query term

Compute only the scores for docs in this union

2. idf-ordered terms

When considering the postings of query terms:

Look at them in order of decreasing idf

High idf terms likely to contribute most to score

As we update score contribution from each query term:

Stop if doc scores relatively unchanged

Can apply to cosine or some other net scores

Other applications

Web Pages Collection

The crawling process usually starts from a set of source Web pages. The Web crawler follows the source page hyperlinks to find more Web pages.

This process is repeated on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages have been collected.

The crawler has to decide in which order to collect hyperlinked pages that have not yet been crawled.

The crawlers of different search engines make different decisions, and so collect different sets of Web documents.

A crawler might try to preferentially crawl "high quality" Web pages.

Other applications

Web Page Categorization

Geographical Scope

Whether a given Web page is of interest only to people in a given region or is of nation- or worldwide interest is an interesting problem for hyperlink analysis.

For example, a weather-forecasting page is interesting only to the region it covers, while the Internal Revenue Service Web page may be of interest to U.S. taxpayers throughout the world.

A page's hyperlink structure also reflects its range of interest.

Local pages are mostly hyperlinked to by pages from the same region, while hyperlinks to pages of nationwide interest are roughly uniform throughout the country.

This information lets search engines tailor query results to the region the user is in.

Reference

Monika Henzinger, "Hyperlink Analysis for the Web," IEEE Internet Computing, 2001.

J. Cho, H. García-Molina, and L. Page, "Efficient Crawling through URL Ordering," Proc. Seventh Int'l World Wide Web Conf., Elsevier Science, New York, 1998.

S. Chakrabarti et al., "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text," Proc. Seventh Int'l World Wide Web Conf., Elsevier Science, New York, 1998.

K. Bharat and M. Henzinger, "Improved Algorithms for Topic Distillation in a Hyperlinked Environment," Proc. 21st Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 98), ACM Press, New York, 1998.

L. Page et al., "The PageRank Citation Ranking: Bringing Order to the Web," Stanford Digital Library Technologies, Working Paper 1999-0120, Stanford Univ., Palo Alto, Calif., 1998.

I. Varlamis et al., "THESUS, a Closer View on Web Content Management Enhanced with Link Semantics," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, June 2004.