VI. Web Search -

photofitterInternet and Web Development

Dec 4, 2013 (3 years and 4 months ago)

53 views

Web Search


Summer Term 2006



VI. Web Search
-

Ranking (cont.)

(c) Wolfgang Hürst, Albert
-
Ludwigs
-
University

The Evolution of Search Engines

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

1st generation
: Use only "on page", text data

-

Word frequency, language

1995
-
1997 (AltaVista, Excite, Lycos, etc.)

2nd gen.
: Use off
-
page, web
-
specific data

-

Link (or connectivity) analysis

-

Click
-
through data (what results people


click on)

-

Anchor
-
text (how people refer to a page)

From 1998 (made popular by Google but
everyone now)

PageRank

[2], introduced by Brin
and Page, used by Google

HITS

[3], introduced by Kleinberg
(used by Teoma?)

Link
-
based ranking: HITS

Motivation

(compare PageRank):

Broad
-
topic queries:


deliver (too) large set of relevant results

Therefore:


Ranking based on the
authority

of a web


page (cf. PageRank: quality / importance)

Link:


Interpreted as a
conferral of authority

Goal:


Find pages with high authority (balance


between relevance and popularity)

Link
-
based ranking: HITS (cont.)

Basic idea
:

Consider sub
-
graph of the web graph that
contains as much relevant pages as
possible

Analyze the graph's link structure to find:


Authorities

= the most authoritative


or definitive subset of relevant pages


(for ranking)


Hubs

= Pages pointing to many related


authorities (for their identification)

Authorities and Hubs
-

Example

Example: Query “search engine”

www.google.com

www.teoma.com

www.alltheweb.com

www.altavista.com

AUTHORITIES

HUBS

dir.yahoo.com/

Computers_and_Internet/

Internet/World_Wide_Web/

Searching_the_Web/

Search_Engines_and_Directories/

searchenginewatch.com

Authorities and Hubs
-

Basic idea

Approach
:

-

Generate a query
-
dependent sub
-
graph

-

Recursively calculate hubs and authorities

Assume S is the set of
pages in this sub
-
graph
, then S should be

-

rather small

-

contain lots of relevant pages

-

contain the most important authorities

Basic idea

to generate such a sub
-
graph:

-

Get initial
root set

based on any IR criteria

-

Include the local
neighborhood

of this set

Authorities and Hubs
-

Base set

GIVEN
:

-

QUERY Q

-

TEXT
-
BASED SEARCH ENGINE SE

-

CONSTANTS T AND D (NAT. NUMBERS)

-

SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q

ALGORITHM TO CALCULATE SUBGRAPH S(Q)

S(Q) := R(Q)

FOR EACH

PAGE P
IN

R(Q)


T+(P) := SET OF PAGES LINKED BY P


T
-
(P) := SET OF PAGES LINKING TO P


ADD ALL PAGES FROM T+(P) TO S(Q)


IF

|T
-
(P)| < D


THEN

ADD ALL PAGES FROM T
-
(P) TO S(Q)


ELSE

ADD RANDOM SUBSET OF T
-
(P) TO S(Q)

Query
-
dependent base set
-

Comments

Why only use a
sub
-
graph?

-

Advantage of query dependence

-

Reduces processing time (online calculation!)

Why not just take the root set?

-

Appearance of query terms does not


necessarily represent relevance (or authority)

-

Larger network is needed for link analysis

In original work:
Heuristics for special cases

-

Remove intrinsic links, i.e. links from the


same domain (navigational links, etc.)

-

Consider only a certain number of links from


one domain to a page p (to avoid spamming)

Calculating Hubs and Authorities

Obviously, there exists a mutual reinforcing
relationship between Hubs and Authorities:

-

A good Hub links to many good Authorities

-

A good Authority is linked by many Hubs

Hence, use an iterative algorithm to estimate a
Hub
and
Authority value
, respectively

Hubs: O
-
Operation

Authorities: I
-
Operation

Calculating Hubs and Authorities

Hubs: O
-
Operation

Authorities: I
-
Operation

q1

q2

q3

PAGE p

q1

q2

q3

PAGE p

Calculating Hubs and Authorities

GIVEN
:

-

SUB
-
GRAPH G WITH N PAGES (FROM BASE SET S(Q))

-

CONSTANT NUMBER K

ALGORITHM TO CALCULATE HUBS AND AUTHOR.

X0 := (1, 1, ..., 1)

Y0 := (1, 1, ..., 1)

FOR

i = 1, ..., K


CALCULATE NEW WEIGHTS Xi BY


APPLYING THE I
-
OPERATION TO Xi
-
1, Yi
-
1


CALCULATE NEW WEIGHTS Yi BY


APPLYING THE O
-
OPERATION TO Xi, Yi
-
1


NORMALIZE Xi AND Yi

Calculating Hubs and Authorities

Convergence:
see lit.

Basic idea:

PageRank vs. HITS

PageRank

TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002

HITS

-

Hard to spam

-

Computes quality


signal for all pages

-

Easy to compute, real
-


time execution is hard

-

Query specific

-

Works on small graphs

-

Non
-
trivial to compute

-

Not query specific

-

Does not work on


small graphs

-

Local graph structure


can be manufactured

-

Provides a signal only


when there is direct


connectivity (e.g.


home pages)

Proven to be effective for
general purpose ranking

Well suited for supervised
directory construction

+

+

-

-

Commercial search engines using HITS

(Maybe?) Teoma, now search.ask.com

"Teomas underlying technology is an extension of
the HITS algorithm …", C. Sherman, April 2002,
http://dc.internet.com/news/article.php/1002061

(Not online anymore)

References
-

HITS

[1] S. BRIN, L. PAGE: "THE ANATOMY OF A
LARGE
-
SCALE HYPERTEXTUAL WEB
SEARCH ENGINE", WWW 1998


[2] JON KLEINBERG: "AUTHORITATIVE
SOURCES IN A HYPERLINKED
ENVIRONMENT", JOURNAL OF THE ACM,
VOL. 46, NO. 5, SEPTEMBER 1999

General Web Search Engine Architecture

CLIENT

QUERY
ENGINE

RANKING

CRAWL
CONTROL

CRAWLER(S)

USAGE FEEDBACK

RESULTS

QUERIES

WWW

COLLECTION
ANALYSIS MOD.

INDEXER
MODULE

PAGE


REPOSITORY

INDEXES

STRUCTURE

UTILITY

TEXT

(CF. [1] FIG. 1)