The Anatomy of a Large-Scale Hypertextual Web Search Engine


A review by:


Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith

November 2005

2

Agenda


Introduction


Overview of Google


PageRank


Motivation & Description


Example


Issues & Comparison


Further Work


Application


Conclusions

3

Introduction


About the paper


Brin & Page, 1998, Stanford University


Details a prototype search engine, Google


Covers both architecture and algorithms


Cited in web metrics with relation to significance


Also relevant to Web Graph Properties


PageRank


Covered in a separate paper from Brin & Page



Is the primary metric used in the paper

4

Overview : What is Google?


Web search engine


Tackles the scalability and manipulation issues faced by previous crawlers


Academic


Built on strong understanding of web metrics


Use of hyperlink structures


Transparent


Initially released into the public domain


Support for informatics research

5

Overview : Architecture

[Figure: architecture diagram with components URL Server, Crawler, Store Server, Repository, Indexer, URL Resolver, Anchors, Lexicon, Barrels, Links, Doc Index, Sorter, PageRank, Searcher and Checksums]

6

Overview: Google Architecture

(Explanation for handout only.)


URL Server: Finds pages to surf.

Crawler: Downloads pages and places them in the repository.

Store Server: Document compression.

Repository: Cached copies of most web pages.

Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file.

URL Resolver: Converts relative URLs into absolute URLs and creates the Links file.

Links file: Ordered pairs of document IDs where a hyperlink exists between them.

Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon.

Lexicon: Dictionary of all possible search keywords.

Doc Index: Maps document identifier codes to URLs.

PageRank: An influential web metric used to sort Google’s matches.

Searcher: Performs searches!
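The URL Resolver step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `doc_index` and `anchors` structures and the `resolve_links` helper are hypothetical stand-ins for the Doc Index and Anchors file, not the formats Google used.

```python
from urllib.parse import urljoin

def resolve_links(doc_index, anchors):
    """Turn (source URL, href) pairs into (source doc ID, target doc ID) pairs."""
    url_to_id = {url: doc_id for doc_id, url in doc_index.items()}
    links = []
    for source_url, href in anchors:
        absolute = urljoin(source_url, href)  # relative URL -> absolute URL
        if absolute in url_to_id:             # skip targets with no doc ID
            links.append((url_to_id[source_url], url_to_id[absolute]))
    return links

# Hypothetical Doc Index (doc ID -> URL) and Anchors entries
doc_index = {
    0: "http://example.com/",
    1: "http://example.com/a/page.html",
    2: "http://example.com/b.html",
}
anchors = [
    ("http://example.com/a/page.html", "../b.html"),
    ("http://example.com/", "a/page.html"),
    ("http://example.com/", "http://elsewhere.example/x"),
]
links = resolve_links(doc_index, anchors)
print(links)
```

The resulting list of ordered document-ID pairs plays the role of the Links file, which is the input PageRank needs.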







7

Overview : Forward Index


Indexer identifies keyword ‘hits’ in a document

Maps document (page) IDs to word IDs in the Lexicon

Word IDs partially sorted into barrels

64 of these

Word IDs within a barrel are unsorted.

An individual document may be spread over several barrels.


However, not useful for search!
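A minimal sketch of a forward index split into barrels may help here. The partitioning rule (each barrel takes a contiguous range of word IDs) and the names `build_forward_barrels`, `docs` and `lexicon` are assumptions made for illustration; the slides only say that there are 64 barrels and that word IDs are partially sorted into them.

```python
import math

NUM_BARRELS = 64  # barrel count given on the slide

def build_forward_barrels(docs, lexicon, num_barrels=NUM_BARRELS):
    """Return barrels[b] = list of (doc_id, word_id) hits in document order."""
    # Assumed rule: each barrel covers a contiguous range of word IDs
    width = max(1, math.ceil(len(lexicon) / num_barrels))
    barrels = [[] for _ in range(num_barrels)]
    for doc_id, words in docs.items():
        for word in words:
            word_id = lexicon[word]
            barrels[word_id // width].append((doc_id, word_id))
    return barrels

# Hypothetical two-document collection, using 2 barrels to keep it readable
docs = {0: ["web", "search", "web"], 1: ["search", "engine"]}
lexicon = {"engine": 0, "search": 1, "web": 2}
barrels = build_forward_barrels(docs, lexicon, num_barrels=2)
print(barrels)
```

In this toy run document 0 ends up in both barrels, and the word IDs inside the first barrel are not in sorted order, which matches the two properties listed above.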

8

Overview : Inverted Index


Want to know in which documents a keyword occurs

Need the ‘Inverted Index’

Sorts the forward index into its inverted form

Function performed by the ‘Sorter’
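The inversion performed by the Sorter can be sketched as a sort followed by a grouping. The `invert` function and the dictionary layout are hypothetical; the real system sorts barrels on disk rather than Python dictionaries.

```python
from itertools import groupby

def invert(forward_index):
    """forward_index: doc ID -> list of word IDs. Returns word ID -> doc IDs."""
    # Re-sort the (word, document) hits by word ID, dropping repeated hits
    hits = sorted({(w, d) for d, word_ids in forward_index.items() for w in word_ids})
    return {w: [d for _, d in group] for w, group in groupby(hits, key=lambda h: h[0])}

# Hypothetical forward index: document 0 contains words 2 and 1, document 1 words 1 and 0
forward_index = {0: [2, 1, 2], 1: [1, 0]}
inverted_index = invert(forward_index)
print(inverted_index)
```

A search for word ID 1 is then a single lookup, `inverted_index[1]`, rather than a scan over every document.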

9

Overview : Ranking System


Proximity of keyword ‘hits’


This is the sum of the distances between them

Hits have ‘types’

Types: body text, heading text, anchor text, URL, …

Relative font size factor used

Count how many hits occur of each type and range of proximity values

Apply a function to each type-proximity count

These form a type-proximity vector, C

10

Overview : Ranking System (2)


V = C · W (dot product) is computed.

W is the importance associated with each type-proximity class.

Combine V with the PageRank score



Effect of increasing hits declines


Prevents large scale manipulation


[Figure: plot of f(x) against hit count, x]
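The scoring idea on these two slides can be sketched as follows. The slides do not give the tapering function f or the rule for combining V with PageRank, so the logarithmic `taper`, the `alpha` mixing weight and all numbers below are assumptions chosen only to show the shape of the computation.

```python
import math

def taper(count):
    """Assumed f(x): grows with the hit count but with diminishing returns."""
    return math.log1p(count)

def ir_score(counts, weights):
    """V = C . W, where C holds the tapered counts per type-proximity class."""
    c = [taper(x) for x in counts]
    return sum(ci * wi for ci, wi in zip(c, weights))

def final_rank(counts, weights, pagerank, alpha=0.5):
    # Illustrative linear mix; the actual combination rule is not stated
    return alpha * ir_score(counts, weights) + (1 - alpha) * pagerank

# Hypothetical classes: [body-text hits, anchor-text hits]
weights = [1.0, 5.0]
honest = final_rank([3, 1], weights, pagerank=1.2)
stuffed = final_rank([300, 1], weights, pagerank=1.2)
print(honest, stuffed)
```

Because `taper` flattens out, repeating a keyword a hundred times more often raises the score far less than a hundredfold, which is the anti-manipulation effect the slide describes.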

11

PageRank : Motivation


Academic Citation Analysis* attempted, but…


Web has no formal quality control or peer review


Possible to inflate citation counts artificially


Web pages vary more than academic papers


Consider:


One link from the University’s main page, or one link from Yahoo’s main page…

Which citation should carry the higher weight?

*Also known as bibliometrics

12

PageRank : Description


Informal Definition:


“A page has a high rank if the sum of the ranks of its backlinks is high”

Handles ‘Yahoo’ case on previous slide

Intuitive Definition:

Corresponds to the Random Surfer Model

User keeps clicking on links ‘linearly’ then gets bored and restarts at a random location


Now for the maths…

13

PageRank : Description (2)


Formal Definition:

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + cE(u)$$

c is a ‘dampening’ factor, was 0.85

$N_v$ is the number of out-links from page v

$B_u$ is the set of backlinks of the current page u (the pages that link to u)

cE(u) corresponds to the surfer getting ‘bored’




14

PageRank : Example


Considering an example network

Calculating A:

$$R(A) = (1 - c) + c\left(\frac{R(B)}{N(B)} + \frac{R(C)}{N(C)} + \frac{R(E)}{N(E)}\right)$$

c = dampening factor

N = out-degree

R = PageRank

[Figure: example network with nodes A, B, C, D and E]

15

PageRank : Example (2)


Initially set all PageRank to 1






First Iteration:

In-Links   Rank (R)   Out-Links (N)   R/N
B          1          1               1
C          1          2               0.5
E          1          2               0.5

$$R(A) = (1 - 0.85) + 0.85(1 + 0.5 + 0.5) = 1.85$$

[Figure: example network with nodes A, B, C, D and E]

16

PageRank : Example (3)


Repeat process for B, C, D and E


Feed computed values into next iteration



Iteration   1        2        3        4        5        6
A           1.8500   1.2479   1.1967   1.5230   1.3412   1.2954
B           0.4333   0.4333   0.6380   0.4930   0.4807   0.5593
C           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
D           1.0000   1.7225   1.2107   1.1672   1.4445   1.2900
E           0.8583   0.7981   0.9772   0.9084   0.8668   0.9277
Order       ADCEB    DACEB    DACEB    ADCEB    DACEB    ADCEB
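The iteration can be reproduced with a short script. The edge list below is an assumption: the network diagram did not survive in this copy, so the links were inferred from the in-link table for A and from the iteration values (for example, B's first value of 0.4333 implies a single in-link from a page with three out-links). Every page is updated from the previous iteration's values.

```python
C = 0.85  # dampening factor from the slides

# Assumed edge list (page -> pages it links to), inferred from the table values
OUT_LINKS = {
    "A": ["D"],
    "B": ["A"],
    "C": ["A", "E"],
    "D": ["B", "C", "E"],
    "E": ["A", "C"],
}

def iterate(ranks, out_links=OUT_LINKS, c=C):
    """One synchronous update: R(u) = (1 - c) + c * sum(R(v) / N(v))."""
    new_ranks = {}
    for page in out_links:
        incoming = sum(
            ranks[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        new_ranks[page] = (1 - c) + c * incoming
    return new_ranks

def order(ranks):
    """Pages from highest to lowest rank; ties broken alphabetically."""
    return "".join(sorted(ranks, key=lambda p: (-ranks[p], p)))

ranks = {page: 1.0 for page in OUT_LINKS}  # initially set all PageRank to 1
history = []
for i in range(1, 7):
    ranks = iterate(ranks)
    history.append(ranks)
    print(i, {p: round(r, 4) for p, r in ranks.items()}, order(ranks))
```

Under these assumptions the printed ranks agree with the table to four decimal places. A and D keep swapping the top position over the first six iterations because A's only out-link points to D.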

17

PageRank : Analysis


Converges in log n time

Constrained by the time to build a full-text index more than anything

Rank ‘Sinks’

Caused by two pages that point to each other but not to any other pages: rank accumulates


Solved by random surfer model


Manipulation



‘Google Bombing’


French Military ‘Victories’ links to ‘Defeats’


‘Miserable Failure’ links to George Bush biography
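The rank sink problem listed on this slide, and the way the random surfer's restart counters it, can be seen on a toy graph. The three-page web below is hypothetical; setting c = 1 switches the random jump off.

```python
def step(ranks, out_links, c):
    """R(u) = (1 - c) + c * sum(R(v) / N(v)); c = 1 removes the random jump."""
    return {
        page: (1 - c) + c * sum(
            ranks[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        for page in out_links
    }

# Hypothetical web: A links into a loop formed by B and C, which link only to each other
out_links = {"A": ["B"], "B": ["C"], "C": ["B"]}

def run(c, iterations=20):
    ranks = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        ranks = step(ranks, out_links, c)
    return ranks

no_jump = run(c=1.0)     # the B-C loop absorbs all of the rank
with_jump = run(c=0.85)  # A keeps the (1 - c) share supplied by random restarts
print(no_jump, with_jump)
```

Without the jump, A's rank drains to zero after one step and never returns; with it, every page retains at least (1 - c).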

18

19

PageRank : Comparison


Web Graph Properties


Uses graph of the entire web: depends on full crawl

More sophisticated than simply summing in/out-degrees

Web Page Significance

Uses Boolean Spread Activation

match all words

Enhanced citation analysis

building on work of Kleinberg, Egghe & Rousseau

Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities




20

PageRank : Further Work


Personalised PageRank, Haveliwala, 1999


In-memory, block-oriented algorithm

PageRank can be computed in an hour on a PIII 450 MHz using less than 100 MB of main memory

Compute PageRank on the client-side

Use local information: bookmarks, searches, history


Provide the link structure of the web on a DVD


11/11/05, “Personalized Search” released

21

PageRank : Further Work (2)


Topic-Sensitive PageRank, Haveliwala, 2002

Improve Google by giving weight to the informational relationship between sites


A) Uniform Results


Similar to ‘current’ Google but with topics


B) Personalised to a particular user


Based on previous searches and users’ surfing habits
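One common way to bias PageRank towards a topic or a user is to replace the uniform random jump with a jump concentrated on chosen pages. The sketch below illustrates that general idea only; it is not Haveliwala's published algorithm, and the `biased_pagerank` function, the `jump` vector and the three-page graph are hypothetical. Ranks are normalised to sum to 1 here, unlike the earlier worked example.

```python
def biased_pagerank(out_links, jump, c=0.85, iterations=100):
    """R(u) = (1 - c) * jump[u] + c * sum(R(v) / N(v)), with jump summing to 1."""
    ranks = {page: 1.0 / len(out_links) for page in out_links}
    for _ in range(iterations):
        ranks = {
            page: (1 - c) * jump.get(page, 0.0) + c * sum(
                ranks[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            for page in out_links
        }
    return ranks

# Hypothetical three-page cycle, so link structure alone favours no page
out_links = {"A": ["B"], "B": ["C"], "C": ["A"]}
uniform = biased_pagerank(out_links, {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3})
topical = biased_pagerank(out_links, {"A": 1.0})  # every restart lands on page A
print(uniform, topical)
```

With a uniform jump the three pages rank equally; concentrating the jump on A lifts A above the others, and pages close to A in the link graph benefit more than distant ones.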

22

Applications : Google


Google Inc.


Largest search engine


Technologies utilised by others (e.g. Yahoo!)


Biggest ever technology IPO, 2004


Redefining search


Set a trend for other search providers


Raised importance of quality web search results


Combining information retrieval methods


Business model based on advertising


Potential area for conflict


Over 100 factors now influence results

23

Applications : PageRank


Back-link prediction


Desire for optimal web crawling strategy


Better indicator than citation counts!


Improving user navigation


‘The PageRank Proxy’


Providing PageRank information with links


Establishing trust


Wealth of authors on the web, who to trust?


Use PageRank to rate trust

24

Applications : The Future


Internal Development


Project no longer in academic realm


Lacks the transparency initially intended


Role of PageRank unclear


Likely focus on extensions and results tuning


External Development



APIs


Allowing innovative use of Google technologies


Open Source Code


Focused on developing infrastructure



25

Conclusions


Academic Background


Success from strong academic understanding


Raised profile of informatics and search


Good platform for future research


Success as a failure


Intention for transparency and use in academia


Commercial success has removed transparency


Potentially bad for further research in this area

26

Summary


We have seen:


The architecture used by Google


PageRank as a web metric


Strengths and potential manipulations


The commercial success of Google


Applications


Potential areas of future research



27

References


Work by Brin & Page (now at Google)


Brin, S., Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107--117.

Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project.

More papers at: http://www.google.com on many aspects of web metrics and search in general


PageRank


http://www.iprcom.com/papers/pagerank/


Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu


http://en.wikipedia.org/wiki/Google_bomb

28

References (2)


Further Developments


Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’. Technical report, Stanford University, Stanford, CA, 1999.

Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’. In Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002.


Commercial Aspect


http://money.cnn.com/2004/04/29/technology/google/


http://www.google.com/corporate/history.html


Web Metrics


Dhyani, D., Ng, W. K. and Bhowmick, S. (2002), ‘A survey of web metrics’, ACM Computing Surveys, 34(4):469--503.