Hyperlink Analysis for the Web
Information Retrieval
• Input: a document collection
• Goal: retrieve documents or text whose information content is relevant to the user’s information need
• Two aspects:
  1. Processing the collection
  2. Processing queries (searching)
Classic information retrieval
• Ranking is a function of query term frequency within the document (tf) and across all documents (idf)
• This works because of the following assumptions in classical IR:
  – Queries are long and well specified: “What is the impact of the Falklands war on Anglo-Argentinean relations?”
  – Documents (e.g., newspaper articles) are coherent, well authored, and usually about one topic
  – The vocabulary is small and relatively well understood
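The tf–idf ranking described above can be sketched as follows. This is a minimal illustration using raw term counts for tf and a logarithmic idf; real systems use one of several weighting variants, and the toy documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc, collection):
    """Score one document: sum over query terms of tf(t, d) * idf(t)."""
    n_docs = len(collection)
    tf = Counter(doc)                                # term frequency in doc
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in collection if t in d)    # document frequency of t
        if df == 0 or tf[t] == 0:
            continue
        score += tf[t] * math.log(n_docs / df)       # tf * idf
    return score

docs = [["falklands", "war", "anglo", "argentinean", "relations"],
        ["war", "history"],
        ["cooking", "recipes"]]
scores = [tf_idf_score(["falklands", "war"], d, docs) for d in docs]
```

A document matching the rarer query term ("falklands") outranks one matching only the common term ("war").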
Web information retrieval
• None of these assumptions hold on the Web:
  – Queries are short: 2.35 terms on average
  – Huge variety in documents: language, quality, duplication
  – Huge vocabulary: hundreds of millions of terms
  – Deliberate misinformation
• Ranking is a function of the query terms and of the hyperlink structure
Connectivity-based ranking
• Ranking returned documents:
  – Query-dependent ranking
  – Query-independent ranking
• Hyperlink analysis
  – Idea: mine the structure of the Web graph
  – Each Web page is a node
  – Each hyperlink is a directed edge
Query-dependent ranking
• Assigns a score that measures the quality and relevance of a selected set of pages with respect to a given user query.
• The basic idea is to build a query-specific graph, called a neighborhood graph, and perform hyperlink analysis on it.
Building a neighborhood graph
• A start set of documents matching the query is fetched from a search engine (say, the top 200 matches).
• The start set is augmented by its neighborhood, the set of documents that either hyperlink to or are hyperlinked to by documents in the start set.
  – Since the indegree of nodes can be very large, in practice only a limited number of these documents (say, 50) is included.
• Each document in the start set and the neighborhood is modeled by a node; there is an edge from node A to node B if and only if document A hyperlinks to document B.
  – Hyperlinks between pages on the same Web host can be omitted.
Neighborhood graph
• [Figure] The query results (Result 1 … Result n) form the start set; the forward set contains pages f1 … fs linked to by the results, and the back set contains pages b1 … bm linking to the results. There is an edge for each hyperlink, but no edges within the same host.
• This subgraph is associated with each query.
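The construction described above can be sketched as follows. The `out_links` and `in_links` maps are hypothetical stand-ins for a connectivity server, and hosts are extracted with a deliberately crude prefix rule; this is an illustration, not the original implementation.

```python
def build_neighborhood_graph(start_set, out_links, in_links, back_limit=50):
    # out_links / in_links: page -> list of linked-to / linking pages
    # (hypothetical precomputed maps standing in for a connectivity server).
    host = lambda page: page.split("/")[0]              # crude host extraction
    nodes = set(start_set)
    for p in start_set:
        nodes.update(out_links.get(p, []))              # forward set
        nodes.update(in_links.get(p, [])[:back_limit])  # back set, capped
    # An edge A -> B for each hyperlink, but none within the same host.
    edges = {(a, b) for a in nodes for b in out_links.get(a, [])
             if b in nodes and host(a) != host(b)}
    return nodes, edges

out_links = {"h1/a": ["h2/b", "h1/c"], "h2/b": ["h3/d"]}
in_links = {"h1/a": ["h4/e"]}
nodes, edges = build_neighborhood_graph(["h1/a"], out_links, in_links)
```

Note the same-host link ("h1/a" to "h1/c") contributes a node but no edge.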
HITS [K’98]
• Goal: given a query, find
  – good sources of content (authorities)
  – good sources of links (hubs)
• Authority comes from in-edges; being a good hub comes from out-edges.
• Better authority comes from in-edges from good hubs; being a better hub comes from out-edges to good authorities.
Intuition
q
1
q
k
...
A
q
2
r
1
r
k
r
2
...
H
p
HITS details
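The standard HITS update rules (the details this slide refers to) can be sketched as follows: each authority score is the sum of the hub scores of pages linking to it, each hub score is the sum of the authority scores of pages it links to, and both vectors are normalized every pass. This is a hedged reconstruction of Kleinberg's iteration, not code from the original paper.

```python
def hits(nodes, edges, iterations=10):
    """Alternate authority/hub updates over a directed edge set."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # auth(p) = sum of hub scores of pages q with an edge q -> p
        auth = {n: sum(hub[q] for (q, p) in edges if p == n) for n in nodes}
        # hub(p) = sum of auth scores of pages p links to
        hub = {n: sum(auth[p] for (q, p) in edges if q == n) for n in nodes}
        for vec in (auth, hub):                       # L2-normalize each pass
            norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
            for n in vec:
                vec[n] /= norm
    return hub, auth

nodes = {"h1", "h2", "a1", "a2"}
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1")}
hub, auth = hits(nodes, edges)
```

Here "a1" (pointed to by both hubs) gets the top authority score, and "h1" (pointing to both authorities) the top hub score.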
HITS
• Kleinberg proved that the H and A vectors eventually converge, i.e., termination is guaranteed.
  – In practice the vectors converge in about 10 iterations.
• Documents are ranked by hub and authority scores respectively.
• The algorithm does not claim to find all relevant pages, since some may have good content but have not been linked to by many authors.
Problems with the HITS algorithm (1)
• Only a relatively small part of the Web graph is considered, so adding edges to a few nodes can change the resulting hub and authority scores considerably.
  – It is relatively easy to manipulate these scores.
Problems with the HITS algorithm (2)
• The neighborhood graph often contains documents not relevant to the query topic. If these nodes are well connected, the topic drift problem arises.
  – The most highly ranked authorities and hubs tend not to be about the original topic.
  – For example, when running the algorithm on the query “jaguar and car”, the computation drifted to the general topic “car” and returned the home pages of different car manufacturers as top authorities, and lists of car manufacturers as the best hubs.
Improvements
• To avoid giving “undue weight” to the opinion of a single person:
  – All the documents on a single host should have the same influence on the document they are connected to as a single document would.
• Ideas:
  – If there are k edges from documents on a first host to a single document on a second host, give each edge an authority weight of 1/k.
  – If there are l edges from a single document on a first host to a set of documents on a second host, give each edge a hub weight of 1/l.
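The 1/k and 1/l edge weights above can be computed as a sketch like this, where `host` is an assumed page-to-host function standing in for real URL parsing:

```python
from collections import Counter

def edge_weights(edges, host):
    # k: number of edges from documents on host(a) into the single document b
    into_doc = Counter((host(a), b) for (a, b) in edges)
    # l: number of edges from the single document a into documents on host(b)
    from_doc = Counter((a, host(b)) for (a, b) in edges)
    return {(a, b): (1.0 / into_doc[(host(a), b)],    # authority weight 1/k
                     1.0 / from_doc[(a, host(b))])    # hub weight 1/l
            for (a, b) in edges}

host = lambda page: page.split("/")[0]
edges = {("h1/x", "h2/t"), ("h1/y", "h2/t"), ("h1/x", "h2/u")}
w = edge_weights(edges, host)
```

Two pages on host h1 link to "h2/t", so each of those edges carries authority weight 1/2; "h1/x" links to two pages on h2, so its edges carry hub weight 1/2.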
Improvements
• To solve the topic drift problem, content analysis can be used.
• Ideas:
  – Eliminate non-relevant nodes from the graph.
  – Regulate the influence of a node based on its relevance.
Improvements
• Computing relevance weights for nodes:
  – The documents in the start set are used to define a broader query, and every document in the graph is matched against this query.
  – Specifically, the concatenation of the first 1000 words from each start-set document is taken as the query Q, and similarity(Q, D) is computed for each document D.
• All nodes whose weights are below a threshold are pruned.
Improvements
• Regulating the influence of a node:
  – Let W[n] be the relevance weight of a node n.
  – W[n] · A[n] is used instead of A[n] for computing the hub scores.
  – W[n] · H[n] is used instead of H[n] for computing the authority scores.
• This reduces the influence of less relevant nodes on the scores of their neighbors.
Query-independent ordering
• First generation: using link counts as simple measures of popularity.
• Two basic suggestions:
  – Undirected popularity: each page gets a score equal to the number of its in-links plus the number of its out-links (e.g., 3 + 2 = 5).
  – Directed popularity: the score of a page is the number of its in-links (e.g., 3).
Query processing
• First retrieve all pages matching the text query (say, “venture capital”).
• Order these by their link popularity (either variant on the previous slide).
Spamming simple popularity
• Exercise: how do you spam each of the following heuristics so your page gets a high score?
  – Each page gets a static score equal to the number of its in-links plus the number of its out-links.
  – Static score of a page = number of its in-links.
PageRank scoring
• Imagine a browser doing a random walk on Web pages:
  – Start at a random page.
  – At each step, go out of the current page along one of the links on that page, equiprobably (e.g., with probability 1/3 from a page with three out-links).
• “In the steady state” each page has a long-term visit rate; use this as the page’s score.
Not quite enough
• The Web is full of dead ends.
  – The random walk can get stuck in dead ends.
  – Then it makes no sense to talk about long-term visit rates.
Teleporting
• At a dead end, jump to a random Web page.
• At any non-dead end, with probability 10%, jump to a random Web page.
  – With the remaining probability (90%), go out on a random link.
  – The 10% is a parameter.
Result of teleporting
• Now the walk cannot get stuck locally.
• There is a long-term rate at which any page is visited (not obvious; we will show this).
• How do we compute this visit rate?
Markov chains
• A Markov chain consists of n states, plus an n × n transition probability matrix P.
• At each step, we are in exactly one of the states.
• For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i.
• [Figure] An edge from state i to state j is labeled P_ij; a self-loop P_ii > 0 is OK.
Markov chains
• Clearly, for all i, Σ_{j=1}^{n} P_ij = 1.
• Markov chains are abstractions of random walks.
• Exercise: represent the teleporting random walk from three slides ago as a Markov chain.
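One way to approach the exercise above is to build the transition matrix of the teleporting walk directly from an adjacency list: dead ends jump uniformly, and all other states teleport with probability alpha (the 10% parameter) or follow a random out-link otherwise. This is a sketch under those assumptions.

```python
def transition_matrix(adj, n, alpha=0.1):
    """Markov chain P for the teleporting random walk over n states."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        out = adj.get(i, [])
        if not out:                          # dead end: jump uniformly
            P[i] = [1.0 / n] * n
        else:                                # teleport, else random out-link
            P[i] = [alpha / n] * n
            for j in out:
                P[i][j] += (1 - alpha) / len(out)
    return P

# Two pages: page 0 links to page 1; page 1 is a dead end.
P = transition_matrix({0: [1], 1: []}, n=2)
```

Every row of P sums to 1, as required of a transition probability matrix.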
Ergodic Markov chains
• A Markov chain is ergodic if:
  – there is a path from any state to any other state, and
  – for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero.
Ergodic Markov chains
• For any ergodic Markov chain, there is a unique long-term visit rate for each state.
  – This is the steady-state probability distribution.
• Over a long time period, we visit each state in proportion to this rate.
• It doesn’t matter where we start.
Probability vectors
• A probability (row) vector x = (x_1, …, x_n) tells us where the walk is at any point.
• E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we’re in state i.
• More generally, the vector x = (x_1, …, x_n) means the walk is in state i with probability x_i, so Σ_{i=1}^{n} x_i = 1.
Change in probability vector
• If the probability vector is x = (x_1, …, x_n) at this step, what is it at the next step?
• Recall that row i of the transition probability matrix P tells us where we go next from state i.
• So from x, our next state is distributed as xP.
Steady state example
• The steady state looks like a vector of probabilities a = (a_1, …, a_n):
  – a_i is the probability that we are in state i.
• [Figure] Two states: from either state the walk moves to state 1 with probability 1/4 and to state 2 with probability 3/4.
• For this example, a_1 = 1/4 and a_2 = 3/4.
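The claimed steady state for the two-state example can be checked numerically. The transition matrix below is a reconstruction of the lost figure (from either state, move to state 1 with probability 1/4 and to state 2 with probability 3/4); it is an assumption chosen to be consistent with the slide's answer a = (1/4, 3/4).

```python
P = [[0.25, 0.75],       # reconstructed transition matrix for the figure
     [0.25, 0.75]]
a = [0.25, 0.75]         # claimed steady state (a1, a2)

# One step of the walk starting from a: compute the row vector aP.
aP = [sum(a[i] * P[i][j] for i in range(2)) for j in range(2)]
```

Since aP equals a, the distribution a is unchanged by a step of the walk, i.e., a = aP.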
How do we compute this vector?
• Let a = (a_1, …, a_n) denote the row vector of steady-state probabilities.
• If our current position is described by a, then the next step is distributed as aP.
• But a is the steady state, so a = aP.
• Solving this matrix equation gives us a.
One way of computing a
• Recall that regardless of where we start, we eventually reach the steady state a.
• Start with any distribution, say x = (1 0 … 0).
• After one step, we’re at xP; after two steps at xP², then xP³, and so on.
• “Eventually” means that for “large” k, xP^k = a.
• Algorithm: multiply x by increasing powers of P until the product looks stable.
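The iteration on this slide (x, xP, xP², …) can be sketched directly; for an ergodic chain the products settle down to a. The stopping rule ("looks stable") is made concrete here with a small tolerance, which is a design choice rather than part of the slide.

```python
def power_iterate(P, tol=1e-10):
    """Multiply a start distribution by increasing powers of P until stable."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)          # start at x = (1, 0, ..., 0)
    while True:
        # nxt = xP: next-step distribution of the walk
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(nxt[j] - x[j]) for j in range(n)) < tol:
            return nxt
        x = nxt

a = power_iterate([[0.25, 0.75],
                   [0.25, 0.75]])        # the two-state example above
```

For the two-state example this recovers a = (1/4, 3/4) regardless of the starting state.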
Google’s approach
• Assumption: a link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A).
  – The quality of a page is related to its in-degree.
• Recursion: the quality of a page is related to
  – its in-degree, and to
  – the quality of the pages linking to it.
PageRank [BP ’98]
Definition of PageRank
• Consider the following infinite random walk (surf):
  – Initially the surfer is at a random page.
  – At each step, the surfer proceeds
    • to a randomly chosen Web page with probability d, or
    • to a randomly chosen successor of the current page with probability 1 − d.
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.
PageRank (cont.)
By the random walk theorem:
• PageRank is the stationary probability for this Markov chain, i.e.

  PageRank(p) = d/n + (1 − d) · Σ_{(q,p) ∈ E} PageRank(q) / outdegree(q)

  where n is the total number of nodes in the graph and E is its set of edges.
PageRank (cont.)
• [Figure] Page P has in-links from page A (which has 4 out-links) and page B (which has 3 out-links).
• The PageRank of P is (1 − d) · (1/4 of the PageRank of A + 1/3 of the PageRank of B) + d/n.
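The fixed-point equation on the previous slide can be solved by simple iteration, applying the formula to every page until the scores settle. This sketch assumes every node has at least one out-link (dead ends would need the uniform-jump treatment from the teleporting slides).

```python
def pagerank(nodes, edges, d=0.15, iterations=50):
    """Iterate PageRank(p) = d/n + (1-d) * sum over in-links (q, p)
    of PageRank(q) / outdegree(q)."""
    n = len(nodes)
    outdeg = {p: sum(1 for (q, r) in edges if q == p) for p in nodes}
    pr = {p: 1.0 / n for p in nodes}                 # uniform start
    for _ in range(iterations):
        pr = {p: d / n + (1 - d) * sum(pr[q] / outdeg[q]
                                       for (q, r) in edges if r == p)
              for p in nodes}
    return pr

# A 3-cycle a -> b -> c -> a: by symmetry every page gets rank 1/3.
pr = pagerank({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("c", "a")})
```

The scores form a probability distribution (they sum to 1), matching the "fraction of steps" definition.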
PageRank
• Used in Google’s ranking function.
• Query-independent.
• Summarizes the “Web opinion” of the page importance.
PageRank vs. HITS
• PageRank:
  – Computation: once for all documents and queries (offline)
  – Query-independent: requires combination with query-dependent criteria
  – Hard to spam
• HITS:
  – Computation: required for each query
  – Query-dependent
  – Relatively easy to spam
  – Quality depends on the quality of the start set
We want top-ranking documents to be both relevant and authoritative
• Relevance is modeled by cosine scores.
• Authority is typically a query-independent property of a document.
  – Assign to each document d a query-independent quality score in [0, 1]; denote this by g(d).
Net score
• Consider a simple total score combining cosine relevance and authority:
  – net-score(q, d) = g(d) + cosine(q, d)
  – Can use some other linear combination than an equal weighting.
• Now we seek the top K docs by net score.
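Selecting the top K docs by the combined score can be sketched in a few lines. The quality and cosine values below are hypothetical numbers chosen only to illustrate the equal-weight combination.

```python
import heapq

def top_k_by_net_score(query, docs, g, cosine, k):
    """net-score(q, d) = g(d) + cosine(q, d); return the k best docs."""
    return heapq.nlargest(k, docs, key=lambda d: g(d) + cosine(query, d))

# Toy query-independent quality g(d) and cosine scores (illustrative only).
quality = {"d1": 0.8, "d2": 0.1, "d3": 0.5}
cos = {"d1": 0.3, "d2": 0.6, "d3": 0.7}
top = top_k_by_net_score("q", list(quality), quality.get,
                         lambda q, d: cos[d], k=2)
```

Here "d3" wins (0.5 + 0.7 = 1.2) even though "d1" has the higher quality score, because the net score weighs both components equally.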
Top K by net score: fast methods
• First idea: order all postings by g(d).
• Key: this is a common ordering for all postings.
• Thus we can concurrently traverse query terms’ postings for
  – postings intersection
  – cosine score computation
Why order postings by g(d)?
• Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal.
• In time-bound applications (say, we must return whatever search results we can in 50 ms), this allows us to stop the postings traversal early,
  – short of computing scores for all docs in the postings.
Champion lists in g(d)-ordering
• Can combine champion lists with g(d)-ordering.
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}.
• Seek the top-K results from only the docs in these champion lists.
High and low lists
• For each term, we maintain two postings lists called high and low.
  – Think of high as the champion list.
• When traversing postings on a query, traverse the high lists first.
  – If we get more than K docs, select the top K and stop.
  – Else proceed to get docs from the low lists.
• Can be used even for simple cosine scores, without a global quality g(d).
• A means for segmenting the index into two tiers.
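The high/low traversal above can be sketched as candidate gathering: take docs from the high lists, and touch the low lists only when that yields fewer than K candidates. The term and doc IDs are invented for the example.

```python
def candidates_high_low(query_terms, high, low, k):
    """Gather candidate docs from high (champion) lists, falling back
    to the low lists only if fewer than k docs were found."""
    docs = set()
    for t in query_terms:
        docs.update(high.get(t, []))
    if len(docs) < k:                    # not enough: open the low tier
        for t in query_terms:
            docs.update(low.get(t, []))
    return docs

high = {"ipod": ["d1", "d2"]}            # per-term high postings (toy data)
low = {"ipod": ["d3", "d4"]}             # per-term low postings (toy data)
```

With K = 2 the high list alone suffices; with K = 3 the low tier is consulted.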
Impact-ordered postings
• We only want to compute scores for docs for which wf_{t,d} is high enough.
• We sort each postings list by wf_{t,d}.
• Now: not all postings are in a common order!
• How do we compute scores in order to pick off the top K?
  – Two ideas follow.
1. Early termination
• When traversing t’s postings, stop early after either
  – a fixed number r of docs, or
  – wf_{t,d} drops below some threshold.
• Take the union of the resulting sets of docs,
  – one from the postings of each query term.
• Compute the scores only for docs in this union.
2. idf-ordered terms
• When considering the postings of query terms, look at them in order of decreasing idf.
  – High-idf terms are likely to contribute most to the score.
• As we update the score contribution from each query term:
  – Stop if doc scores are relatively unchanged.
• Can apply to cosine or some other net scores.
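The idf-ordered accumulation can be sketched as follows. The stopping rule here, cutting off once a term's largest possible contribution falls below a threshold, is one hedged reading of "stop if doc scores are relatively unchanged"; the postings and idf values are toy data.

```python
def idf_ordered_scores(query_terms, postings, idf, eps=1e-3):
    """Accumulate scores term by term in decreasing-idf order,
    stopping when the remaining terms can barely move any score."""
    scores = {}
    for t in sorted(query_terms, key=lambda t: idf[t], reverse=True):
        plist = postings.get(t, {})              # doc -> weight wf_{t,d}
        if not plist:
            continue
        if max(plist.values()) * idf[t] < eps:
            break                                # low-idf tail barely matters
        for d, w in plist.items():
            scores[d] = scores.get(d, 0.0) + w * idf[t]
    return scores

postings = {"rare": {"d1": 2.0}, "the": {"d1": 5.0, "d2": 5.0}}
idf = {"rare": 3.0, "the": 1e-5}
scores = idf_ordered_scores(["rare", "the"], postings, idf)
```

The high-idf term "rare" is processed first and dominates; the stopword-like "the" is skipped entirely because its maximum contribution is negligible.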
Other applications
• Web page collection
  – The crawling process usually starts from a set of source Web pages. The Web crawler follows the source pages’ hyperlinks to find more Web pages.
  – This process is repeated on each new set of pages and continues until no more new pages are discovered or until a predetermined number of pages have been collected.
  – The crawler has to decide in which order to collect hyperlinked pages that have not yet been crawled.
  – The crawlers of different search engines make different decisions, and so collect different sets of Web documents.
    • A crawler might try to preferentially crawl “high quality” Web pages.
Other applications
• Web page categorization
• Geographical scope
  – Whether a given Web page is of interest only to people in a given region, or is of nationwide or worldwide interest, is an interesting problem for hyperlink analysis.
    • For example, a weather-forecasting page is interesting only to the region it covers, while the Internal Revenue Service Web page may be of interest to U.S. taxpayers throughout the world.
  – A page’s hyperlink structure also reflects its range of interest.
    • Local pages are mostly hyperlinked to by pages from the same region, while hyperlinks to pages of nationwide interest are roughly uniform throughout the country.
  – This information lets search engines tailor query results to the region the user is in.
References
• M. Henzinger, “Hyperlink Analysis for the Web,” IEEE Internet Computing, 2001.
• J. Cho, H. García-Molina, and L. Page, “Efficient Crawling through URL Ordering,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
• S. Chakrabarti et al., “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text,” Proc. Seventh Int’l World Wide Web Conf., Elsevier Science, New York, 1998.
• K. Bharat and M. Henzinger, “Improved Algorithms for Topic Distillation in Hyperlinked Environments,” Proc. 21st Int’l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR ’98), ACM Press, New York, 1998.
• L. Page et al., “The PageRank Citation Ranking: Bringing Order to the Web,” Stanford Digital Library Technologies Working Paper 1999-0120, Stanford Univ., Palo Alto, Calif., 1998.
• I. Varlamis et al., “THESUS, a Closer View on Web Content Management Enhanced with Link Semantics,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 6, June 2004.