The Anatomy of a Large-Scale Search Engine


1

The Anatomy of a Large-Scale Search Engine

Sergey Brin, Lawrence Page

Dept. of Computer Science, Stanford University

2

Outline

Introduction

Design Goals

System Features

System Anatomy

Results & Performance

Conclusion



3

Introduction

Google: a large-scale search engine

Designed to crawl and index the web efficiently

Crawler: downloads web pages

Gives better search results than existing systems

The name comes from "googol", 10^100

Engineering a search engine is a challenging task.

Two ways of surfing the web:

High-quality, human-maintained lists (Yahoo!)

too slow to improve, can't cover esoteric topics

4

Introduction (Cont.)

Expensive to build and maintain.

Search engines (Google)

search by keywords

too many low-quality matches

people try to mislead automated search engines.

Challenges in creating a search engine:

Fast crawling technology

Efficient storage space

Handling queries quickly

5

Introduction (Cont.)

Scaling with the web

Improved hardware performance

exceptions: disk seek time, operating system robustness

Google's data structures are optimized for fast and efficient access.

Google is a centralized search engine.


6

Design Goals

Improved search quality

Junk results often wash out any results that a user is interested in

As the collection size grows, we need tools with very high precision

Use of hypertextual information

Quality filtering: link structure and link text

Support novel research on large-scale web data




7

System features

PageRank: bringing order to the web

Most web search engines have largely ignored the link graph

Link maps containing 518 million hyperlinks

Corresponds well with people's idea of importance: citation importance.

Example: B and C link to A, so B and C are backlinks of A.

8



For this example:

PR(A) = (1 - d) + d(PR(T1)/3 + PR(T2) + PR(T3) + PR(T4)/2)


Motivation:


Pages that are cited from many places are worth looking
at.


Pages that are cited from an important place are worth
looking at.

9

System features


PR(A) = (1 - d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Assume page A has pages T1...Tn which point to it.

The parameter d can be set between 0 and 1 (usually 0.85).

C(A) is the number of links going out of page A.


Random Surfer


Given a random URL


Clicks randomly on links


After a while gets bored and gets a new random URL


d is the probability, at each page, that the "random surfer" gets bored and requests another random page.
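A minimal sketch of iterating the formula above over a toy link graph; the graph, iteration count, and function name are illustrative only, not the production implementation:

    # Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T) for each page T linking to A).
    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = list(links)
        pr = {p: 1.0 for p in pages}                 # initial ranks
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                incoming = [t for t in pages if page in links[t]]   # backlinks of `page`
                new_pr[page] = (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming)
            pr = new_pr
        return pr

    # Example from the earlier diagram: B and C are backlinks of A.
    print(pagerank({"A": [], "B": ["A"], "C": ["A"]}))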

10

System features


Difference from traditional methods


Not counting links from pages equally


Normalizing by the number of links in a page

Anchor Text


Associate link text with the page it points to.


Advantages:

Anchors often provide more accurate descriptions than the pages themselves

Can exist for documents that can't be indexed (images, non-text documents)

Can return pages that haven't been crawled yet

11

12

System features

Other Features


Location information: use of proximity in search


Visual information: relative font size


Full raw HTML is available


Users can view a cached version of pages



13

System Anatomy

14

System Anatomy

Designed to avoid disk seeks whenever possible.

Web pages are fetched, compressed, and stored in the repository.

The indexer parses each document into hits (stored in barrels) and anchors.
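A minimal sketch of storing a fetched page as a compressed repository record; zlib and the exact header fields here are assumptions, not necessarily the original on-disk format:

    import struct, zlib

    def store(repo, doc_id: int, url: str, html: bytes):
        """Append one compressed record (docID, url, page) to the open repository file."""
        page = zlib.compress(html)                   # pages are compressed before storage
        url_bytes = url.encode("utf-8")
        repo.write(struct.pack("<IHI", doc_id, len(url_bytes), len(page)))  # fixed-size header (assumed widths)
        repo.write(url_bytes)
        repo.write(page)

    # with open("repository", "ab") as repo:
    #     store(repo, 1, "http://example.com/", b"<html>...</html>")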

15

Major Data Structures



Hit Lists



Forward Index



Inverted Index



Crawling the web



Indexing the web



Life of Query



The Ranking system

16


Hit List

What is a hit list?

A hit list is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information.

Stored in both the forward and inverted indices.

Encoded with a hand-optimized compact encoding (less space, less bit manipulation).

2 bytes of storage per hit.

Plain hit: cap (1 bit), imp (3 bits), position (12 bits)

Fancy hit: cap (1 bit), imp (3 bits), type (4 bits), position (8 bits)

Anchor hit: cap (1 bit), imp (3 bits), type (4 bits), docID hash (4 bits), position (4 bits)
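To make the 2-byte layout concrete, here is a small sketch that packs and unpacks a plain hit (1 capitalization bit, 3 importance bits, 12 position bits); the field order and helper names are assumptions for illustration:

    # Pack a plain hit into 2 bytes: cap (1 bit) | imp (3 bits) | position (12 bits).
    def pack_plain_hit(cap: int, imp: int, pos: int) -> int:
        pos = min(pos, 0xFFF)                 # positions past 4095 are capped (simplification)
        return (cap & 0x1) << 15 | (imp & 0x7) << 12 | pos

    def unpack_plain_hit(hit: int):
        return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF

    hit = pack_plain_hit(cap=1, imp=3, pos=12)
    assert unpack_plain_hit(hit) == (1, 3, 12)
    assert hit < 2**16                        # fits in 2 bytes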

22

23

Forward Index

Given a docID, get its wordIDs and hit lists.

Partially sorted and stored in forward barrels.

Each barrel holds a range of wordIDs.

Duplicated docIDs exist across the barrels.

Instead of storing the actual wordID, each wordID is stored as a relative difference from the minimum wordID of its barrel, so 24 bits suffice (2^24 ≈ 16 million).


Forward barrel record layout: a docID, then for each word in the document a wordID (24 bits), nhits (8 bits), and the hits themselves, terminated by a null wordID.
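A sketch of writing one forward-barrel record in this layout, with each wordID stored as a 24-bit delta from the barrel's minimum wordID; the docID width and the terminator value are assumptions:

    import struct

    def write_forward_record(out, doc_id, barrel_min_wordid, word_hits):
        """word_hits: {wordID: [packed 2-byte hits]} for one document; out is a binary file."""
        out.write(struct.pack("<I", doc_id))                 # docID (4 bytes, assumed width)
        for word_id, hits in word_hits.items():
            delta = word_id - barrel_min_wordid              # 24-bit relative wordID
            out.write(delta.to_bytes(3, "little"))
            out.write(struct.pack("<B", len(hits)))          # nhits: 8 bits
            out.write(struct.pack(f"<{len(hits)}H", *hits))  # 2-byte hits
        out.write((0xFFFFFF).to_bytes(3, "little"))          # terminating "null" wordID (assumed sentinel)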

24

Inverted Index

Given a wordID, get the list of docIDs (with their hit lists) that contain the word.

Stored in the same barrels as the forward index, after processing by the sorter.

For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into.

Two sets of inverted barrels: one set for hit lists that include title or anchor hits, another for all hit lists.

Lexicon entry: wordID, ndocs (number of documents containing the word), plus a pointer into the inverted barrel.

Doclist entry: docID (27 bits), nhits (5 bits), followed by the hits.
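A sketch of the lookup path: the lexicon maps a wordID to an offset in an inverted barrel, and the doclist is read from there; the field widths follow the figure above, but the exact on-disk format is an assumption:

    import struct

    def read_doclist(barrel, lexicon, word_id):
        """lexicon: {wordID: (byte offset in barrel, ndocs)}; barrel: open binary file."""
        offset, ndocs = lexicon[word_id]
        barrel.seek(offset)                                      # pointer into the barrel for this wordID
        docs = []
        for _ in range(ndocs):
            doc_id, nhits = struct.unpack("<IB", barrel.read(5))         # docID + nhits (assumed widths)
            hits = struct.unpack(f"<{nhits}H", barrel.read(2 * nhits))   # 2-byte hits
            docs.append((doc_id, hits))
        return docs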

25

Crawling the web

Fast distributed crawling system.

URL Server & Crawlers are implemented in Python.

A single URL server and 3 crawlers. Each crawler keeps about 300 connections open at the same time, for roughly 600 KB/sec of data.

Internal cached DNS lookup.

Each connection moves through several states:

looking up DNS

connecting to the host

sending the request

receiving the response

Asynchronous IO to manage events.
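A minimal sketch of one crawler along these lines: many simultaneous connections, an internal DNS cache, and asynchronous IO; the connection limit matches the slide, but the fetching details are simplified for illustration:

    import asyncio, socket
    from urllib.parse import urlparse
    from functools import lru_cache

    @lru_cache(maxsize=100_000)              # internal DNS cache: each host resolved once
    def resolve(host: str) -> str:
        return socket.gethostbyname(host)    # blocking, but cached; fine for a sketch

    async def fetch(url: str) -> bytes:
        parts = urlparse(url)
        ip = resolve(parts.hostname)                                          # looking up DNS
        reader, writer = await asyncio.open_connection(ip, parts.port or 80)  # connecting to host
        writer.write(f"GET {parts.path or '/'} HTTP/1.0\r\n"
                     f"Host: {parts.hostname}\r\n\r\n".encode())              # sending request
        await writer.drain()
        body = await reader.read()                                            # receiving response
        writer.close()
        return body

    async def crawl(urls, max_connections=300):
        sem = asyncio.Semaphore(max_connections)       # keep ~300 connections open at once
        async def worker(u):
            async with sem:
                try:
                    return u, await fetch(u)
                except OSError:
                    return u, b""
        return await asyncio.gather(*(worker(u) for u in urls))

    # pages = asyncio.run(crawl(["http://example.com/"]))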

26

Indexing the web


Parsing

Must handle a huge range of errors:

HTML typos

Non-ASCII characters

HTML tags nested hundreds deep

They developed their own parser.

Indexing documents into barrels

Turning words into wordIDs

In-memory hash table: the lexicon

New additions to the lexicon are logged to a file
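A minimal sketch of the in-memory word-to-wordID hash table with new additions logged to a file; the class name and log format are assumptions:

    class Lexicon:
        def __init__(self, log_path="lexicon.log"):
            self.word_to_id = {}                              # in-memory hash table
            self.log = open(log_path, "a", encoding="utf-8")  # append-only log of new words

        def word_id(self, word: str) -> int:
            """Return the wordID, assigning and logging a new one for unseen words."""
            if word not in self.word_to_id:
                self.word_to_id[word] = len(self.word_to_id)
                self.log.write(f"{word}\t{self.word_to_id[word]}\n")   # log the new addition
            return self.word_to_id[word]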

27

Indexing the web


Parallelization

shared lexicon of 14 million words, plus a log of all the extra words

Sorting

Creating the inverted index

Produces two types of barrels:

one for title and anchor hits

one for the full text

Sorters run in parallel

The sorting is done in main memory
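A sketch of the sorting step done in main memory: forward postings (docID, wordID, hits) are regrouped by wordID to form inverted postings, and a predicate splits out the title/anchor barrel; the predicate and data shapes are assumptions:

    from collections import defaultdict

    def invert(forward_postings):
        """forward_postings: (doc_id, word_id, hits) tuples.
        Returns {word_id: [(doc_id, hits), ...]} keyed in wordID order, built in main memory."""
        inverted = defaultdict(list)
        for doc_id, word_id, hits in forward_postings:
            inverted[word_id].append((doc_id, hits))
        return dict(sorted(inverted.items()))

    def split_barrels(forward_postings, is_title_or_anchor):
        """forward_postings must be a list (it is scanned twice).
        Produces the two barrel types: title/anchor hits only, and full text."""
        short = invert(p for p in forward_postings if is_title_or_anchor(p[2]))
        full = invert(forward_postings)
        return short, full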


28

Searching

1. Parse the query

2. Convert words into wordIDs

3. Seek to the start of the doclist in the short barrel for every word

4. Scan through the doclists until there is a document that matches all of the search terms

5. Compute the rank of that document

6. If we're at the end of the short barrels, start at the doclist of the full barrel for every word and go to step 4

7. If we're not at the end of any doclist, go to step 4

8. Sort the documents by rank and return the top K.
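A toy, in-memory version of steps 3-8; the real system seeks through on-disk barrels, so the data structures and the rank() function here are placeholders:

    def search(query_words, lexicon, short_barrels, full_barrels, rank, k=10):
        """barrels map wordID -> [(docID, hits), ...]; rank(doc, word_ids) is a placeholder."""
        word_ids = [lexicon[w] for w in query_words]          # steps 1-2: parse, convert to wordIDs
        seen, ranked = set(), []
        for barrels in (short_barrels, full_barrels):         # short barrels first, full barrels second
            doclists = [{d for d, _ in barrels.get(wid, [])} for wid in word_ids]
            matches = (set.intersection(*doclists) - seen) if doclists else set()
            for doc in matches:                               # step 4: docs matching all terms
                ranked.append((rank(doc, word_ids), doc))     # step 5: compute rank
            seen |= matches
            if len(ranked) >= k:                              # stop early if the short barrels suffice
                break
        ranked.sort(reverse=True)                             # step 8: sort by rank
        return [doc for _, doc in ranked[:k]]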

29

30

The Ranking system

Google uses PageRank(TM) to determine the relative importance of each page it crawls on the web. Among the characteristics PageRank evaluates are the text included in links to a site, the text on each page, and the PageRank of the sites linking to the site being evaluated.

For a single-word search, check the hit list for that word.

In a multi-word search, hits occurring close together in a document are weighted higher than hits occurring far apart.
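A sketch of the proximity idea: hits of different query words that occur close together score higher than hits far apart; the distance bins and weights are made up for illustration:

    def proximity_score(positions_a, positions_b):
        """Score two words' hit positions in one document: closer pairs weigh more."""
        weights = [10, 8, 6, 4, 2, 1]      # from "adjacent" down to "not even close" (illustrative)
        best = min(abs(a - b) for a in positions_a for b in positions_b)
        return weights[min(max(best, 1), len(weights)) - 1]

    # "bill" at position 4 and "clinton" at position 5 score as an adjacent pair.
    print(proximity_score([4, 40], [5, 90]))   # -> 10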

31

32

Results

Example query: "Bill Clinton"

Returns results from whitehouse.gov

Email address of the president

All the results are high-quality pages

No broken links

No Bill without Clinton and vice versa.



33

Storage Requirements

Using compression on the repository

About 55 GB for all the data used by the search engine

Most queries can be answered using just the short inverted index

With better compression, a high-quality search engine could fit onto a 7 GB drive of a new PC.

34

35

System Performance

It took 9 days to download 26 million pages

48.5 pages per second

The indexer and crawler ran simultaneously

The indexer runs at 54 pages per second

The sorters ran in parallel on 4 machines; the whole sorting process took 24 hours.


36

Conclusion

Scalable Search Engine

High quality search results

Search techniques


PageRank


Anchor Text


Proximity information

Search features

Catalog, site search, cached links, similar pages, who links to you, file types

Speed: efficient algorithms, thousands of low-cost PCs networked together