Web Search - Summer Term 2006
VI. Web Search - Ranking (cont.)
(c) Wolfgang Hürst, Albert-Ludwigs-University
The Evolution of Search Engines
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)
1st generation: Use only "on-page" text data
  - Word frequency, language
  - 1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd generation: Use off-page, web-specific data
  - Link (or connectivity) analysis
  - Click-through data (what results people click on)
  - Anchor text (how people refer to a page)
  - From 1998 (made popular by Google, but used by everyone now)
PageRank [1], introduced by Brin and Page, used by Google
HITS [2], introduced by Kleinberg (used by Teoma?)
Link-based ranking: HITS
Motivation (compare PageRank):
Broad-topic queries deliver a (too) large set of relevant results.
Therefore: Ranking based on the authority of a web page (cf. PageRank: quality / importance)
Link: Interpreted as a conferral of authority
Goal: Find pages with high authority (balance between relevance and popularity)
Link-based ranking: HITS (cont.)
Basic idea: Consider a sub-graph of the web graph that contains as many relevant pages as possible.
Analyze the graph's link structure to find:
  - Authorities = the most authoritative or definitive subset of relevant pages (for ranking)
  - Hubs = pages pointing to many related authorities (for their identification)
Authorities and Hubs - Example
Example: Query “search engine”
AUTHORITIES:
  - www.google.com
  - www.teoma.com
  - www.alltheweb.com
  - www.altavista.com
HUBS:
  - dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Search_Engines_and_Directories/
  - searchenginewatch.com
Authorities and Hubs - Basic idea
Approach:
  - Generate a query-dependent sub-graph
  - Recursively calculate hubs and authorities
Assume S is the set of pages in this sub-graph; then S should
  - be rather small
  - contain lots of relevant pages
  - contain the most important authorities
Basic idea to generate such a sub-graph:
  - Get an initial root set based on any IR criteria
  - Include the local neighborhood of this set
Authorities and Hubs - Base set
GIVEN:
  - QUERY Q
  - TEXT-BASED SEARCH ENGINE SE
  - CONSTANTS T AND D (NAT. NUMBERS)
  - SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q
ALGORITHM TO CALCULATE SUBGRAPH S(Q)
  S(Q) := R(Q)
  FOR EACH PAGE P IN R(Q)
    T+(P) := SET OF PAGES LINKED BY P
    T-(P) := SET OF PAGES LINKING TO P
    ADD ALL PAGES FROM T+(P) TO S(Q)
    IF |T-(P)| < D THEN
      ADD ALL PAGES FROM T-(P) TO S(Q)
    ELSE
      ADD RANDOM SUBSET OF D PAGES FROM T-(P) TO S(Q)
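
The base-set construction above can be sketched in Python. This is only an illustration under stated assumptions: the function name `build_base_set`, the dictionaries `out_links` and `in_links` (standing in for T+(P) and T-(P)), and the optional `rng` argument are hypothetical and not part of the lecture. The sketch samples exactly D predecessors in the ELSE branch, which is one reading of "random subset".

```python
import random


def build_base_set(root_set, out_links, in_links, d, rng=None):
    """Expand a root set R(Q) into a base set S(Q).

    root_set  -- iterable of pages returned by the text search engine
    out_links -- dict: page -> pages it links to      (T+(P), assumed given)
    in_links  -- dict: page -> pages linking to it    (T-(P), assumed given)
    d         -- cap on the number of predecessors taken per root page
    """
    rng = rng or random.Random()
    root_set = list(root_set)
    base = set(root_set)                      # S(Q) := R(Q)
    for p in root_set:
        base.update(out_links.get(p, ()))     # add all pages linked by P
        preds = list(in_links.get(p, ()))
        if len(preds) < d:
            base.update(preds)                # few predecessors: take all
        else:
            base.update(rng.sample(preds, d)) # many: take d at random
    return base
```

The cap D keeps a single very popular root page (with thousands of in-links) from dominating the size of S(Q).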
Query-dependent base set - Comments
Why only use a sub-graph?
  - Advantage of query dependence
  - Reduces processing time (online calculation!)
Why not just take the root set?
  - Appearance of query terms does not necessarily represent relevance (or authority)
  - Larger network is needed for link analysis
In original work: Heuristics for special cases
  - Remove intrinsic links, i.e. links from the same domain (navigational links, etc.)
  - Consider only a certain number of links from one domain to a page p (to avoid spamming)
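
The two heuristics could look roughly like this in Python. This is a sketch under assumptions, not the original implementation: "same domain" is approximated by comparing host names of absolute URLs, and the function names and the per-domain limit `m` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def remove_intrinsic_links(edges):
    """Drop links whose source and target share a host name.

    edges -- iterable of (source_url, target_url) pairs, absolute URLs assumed
    """
    return [(s, t) for s, t in edges
            if urlsplit(s).hostname != urlsplit(t).hostname]


def limit_links_per_domain(edges, m):
    """Keep at most m links from any one source host to the same target page."""
    counts = defaultdict(int)
    kept = []
    for s, t in edges:
        key = (urlsplit(s).hostname, t)
        if counts[key] < m:
            counts[key] += 1
            kept.append((s, t))
    return kept
```

Applying `remove_intrinsic_links` first and `limit_links_per_domain` second mirrors the order in which the heuristics are listed above.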
Calculating Hubs and Authorities
There is a mutually reinforcing relationship between Hubs and Authorities:
  - A good Hub links to many good Authorities
  - A good Authority is linked by many good Hubs
Hence, use an iterative algorithm to estimate a Hub value and an Authority value for each page:
  - Hubs: O-Operation
  - Authorities: I-Operation
Calculating Hubs and Authorities (cont.)
[Figure: two diagrams, one for the O-Operation (hubs) and one for the I-Operation (authorities), each showing a page p and its neighbors q1, q2, q3.]
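
For reference, the two operations can be written as follows, using the definitions from Kleinberg's paper [2]. The symbols are assumptions for this note: x_p is the authority weight of page p, y_p its hub weight, and E the set of links in the sub-graph.

```latex
% I-operation: the authority weight of p is the sum of the hub weights
% of all pages q that link to p
x_p \leftarrow \sum_{q \,:\, (q,p) \in E} y_q

% O-operation: the hub weight of p is the sum of the authority weights
% of all pages q that p links to
y_p \leftarrow \sum_{q \,:\, (p,q) \in E} x_q
```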
Calculating Hubs and Authorities (cont.)
GIVEN:
  - SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q))
  - CONSTANT NUMBER K
ALGORITHM TO CALCULATE HUBS AND AUTHORITIES
  X(0) := (1, 1, ..., 1)
  Y(0) := (1, 1, ..., 1)
  FOR i = 1, ..., K
    CALCULATE NEW WEIGHTS X(i) BY APPLYING THE I-OPERATION TO X(i-1), Y(i-1)
    CALCULATE NEW WEIGHTS Y(i) BY APPLYING THE O-OPERATION TO X(i), Y(i-1)
    NORMALIZE X(i) AND Y(i)
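
The iteration above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the names `hits` and `normalize`, the dictionary-based graph `links`, and the default `k=20` are assumptions made for the example. The slide does not say which norm to use; the sketch uses the Euclidean norm, as in [2].

```python
import math


def normalize(weights):
    """Scale a weight dict so that its squared values sum to 1."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:
        return dict(weights)
    return {p: v / norm for p, v in weights.items()}


def hits(links, k=20):
    """Return (authority, hub) weights after k iterations.

    links -- dict: page -> iterable of pages it links to (the sub-graph G)
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    outgoing = {p: set(links.get(p, ())) for p in pages}
    incoming = {p: set() for p in pages}
    for p, targets in outgoing.items():
        for q in targets:
            incoming[q].add(p)

    x = {p: 1.0 for p in pages}   # authority weights, X(0)
    y = {p: 1.0 for p in pages}   # hub weights, Y(0)
    for _ in range(k):
        # I-operation: authority of p = sum of hub weights of pages linking to p
        x = {p: sum(y[q] for q in incoming[p]) for p in pages}
        # O-operation: hub of p = sum of the new authority weights p links to
        y = {p: sum(x[q] for q in outgoing[p]) for p in pages}
        x, y = normalize(x), normalize(y)
    return x, y
```

For a toy graph such as `{"h1": ["a1", "a2"], "h2": ["a1"]}`, page `a1` ends up with a higher authority weight than `a2`, and `h1` with a higher hub weight than `h2`.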
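Calculating Hubs and Authorities (cont.)
Convergence: see the literature [2].
Basic idea: The iteration is a power iteration. With A the adjacency matrix of G, the normalized authority weights converge to the principal eigenvector of A^T A, and the normalized hub weights to the principal eigenvector of A A^T.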
PageRank vs. HITS
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)

PageRank
  + Hard to spam
  + Computes quality signal for all pages
  - Non-trivial to compute
  - Not query specific
  - Does not work on small graphs
  Proven to be effective for general purpose ranking

HITS
  + Easy to compute, but real-time execution is hard
  + Query specific
  + Works on small graphs
  - Local graph structure can be manufactured
  - Provides a signal only when there is direct connectivity (e.g. home pages)
  Well suited for supervised directory construction

Commercial search engines using HITS
(Maybe?) Teoma, now search.ask.com
"Teoma's underlying technology is an extension of the HITS algorithm …", C. Sherman, April 2002,
http://dc.internet.com/news/article.php/1002061 (no longer online)
References - HITS
[1] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
[2] Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Vol. 46, No. 5, September 1999
General Web Search Engine Architecture
[Figure (cf. [1], Fig. 1): architecture diagram with the components Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Page Repository, Collection Analysis Module, Indexer Module, and Indexes (Structure, Utility, Text); labeled flows: Queries, Results, Usage Feedback.]