The Anatomy of a Large-Scale Hypertextual Web Search Engine



Sergey Brin & Lawrence Page




Presented by:

Siddharth Sriram & Joseph Xavier

Department of Electrical and Computer Engineering

Overview



Built @ Stanford University


Presented as a prototype of a large-scale search engine


26 million pages, 147 GB


Google ~ googol (10^100)


Issues


Scaling


Exploiting structure in Hypertext


PageRank Algorithm


Architecture


Data Structures, Crawling, Indexing, Searching


Results




PageRank Algorithm computed over the Web's link graph (see the sketch after this list)


Anchor Text


Associate the anchor text of a link with the page it points to


Information Retrieval


TREC => well-controlled, homogeneous collections


Traditional IR techniques are not equipped to handle hypertext documents


The Vector Space Model alone is not enough for the Web
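
A minimal sketch of the PageRank iteration, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1..Tn are the pages linking to A and C(T) is T's outgoing-link count. The dict-based graph, damping factor d = 0.85, and fixed iteration count are illustrative assumptions, not the paper's implementation.

    # Minimal PageRank sketch following the paper's formula; all inputs
    # here are illustrative.

    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {p: 1.0 for p in pages}          # uniform starting values
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                # Sum contributions from every page that links to `page`.
                incoming = sum(pr[src] / len(targets)
                               for src, targets in links.items()
                               if page in targets)
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    # Example: a tiny three-page link graph.
    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))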


Architecture


URL Server


Distributed Crawlers


Storeserver


Repository


Indexer


Barrels


URL Resolver


Sorter


DumpLexicon


Searcher
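
A toy end-to-end flow wiring a few of the components above (crawler, indexer, sorter); everything here is a hypothetical miniature to make the hand-offs concrete, with pages as plain strings and each stage reduced to a few lines.

    from collections import defaultdict

    # Hypothetical miniature of the dataflow: URL server -> crawlers ->
    # storeserver/repository -> indexer -> sorter -> searcher.

    def crawl(url):
        """Stands in for the distributed crawlers fed by the URL server."""
        fake_web = {"a.html": "fast web search", "b.html": "web index"}
        return fake_web.get(url, "")

    def index_pages(repository):
        """Indexer: page text -> a forward-index-like mapping."""
        return {doc: text.split() for doc, text in repository.items()}

    def invert(forward):
        """Sorter: forward mapping -> inverted mapping (word -> docs)."""
        inverted = defaultdict(list)
        for doc, words in forward.items():
            for word in words:
                inverted[word].append(doc)
        return inverted

    urls = ["a.html", "b.html"]                 # from the URL server
    repository = {u: crawl(u) for u in urls}    # storeserver output
    inverted_index = invert(index_pages(repository))
    print(inverted_index["web"])                # searcher lookup: ['a.html', 'b.html']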

Data Structures


BigFiles


Repository


Document Index


Lexicon


Hit Lists


Forward Index


Inverted Index

Repository


Full HTML of every webpage


Compressed using zlib


Prefixed by docID, length, URL


Documents stored one after another
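
A sketch of one plausible record layout for the repository, following the slide's description: records prefixed by docID, length, and URL, packed back to back, with the HTML compressed by zlib. The exact field widths are assumptions.

    import struct, zlib

    # Illustrative record layout: [docID][compressed length][URL length]
    # [URL][zlib-compressed HTML], records packed back to back.

    def pack_record(doc_id, url, html):
        payload = zlib.compress(html.encode("utf-8"))
        header = struct.pack("<IIH", doc_id, len(payload), len(url))
        return header + url.encode("utf-8") + payload

    def unpack_records(blob):
        offset = 0
        while offset < len(blob):
            doc_id, clen, ulen = struct.unpack_from("<IIH", blob, offset)
            offset += struct.calcsize("<IIH")
            url = blob[offset:offset + ulen].decode("utf-8"); offset += ulen
            html = zlib.decompress(blob[offset:offset + clen]).decode("utf-8")
            offset += clen
            yield doc_id, url, html

    blob = pack_record(1, "http://example.com/", "<html>hello</html>")
    print(list(unpack_records(blob)))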


Document Index


Fixed-width ISAM index, ordered by docID


Stores document status, a pointer into the repository, and a document checksum


If the document has been crawled, a pointer to a variable-width docinfo file (URL and title) is stored


Otherwise, the pointer points into a URLlist containing just the URL
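
A sketch of a fixed-width document-index entry; the field sizes are assumptions, but the point is that fixed-width records let entry i be found at byte offset i * entry_size, which is what makes ISAM-style random access by docID cheap.

    import struct

    # Hypothetical entry: status (1B), repository offset (8B), checksum (4B).
    ENTRY = struct.Struct("<BQI")

    def write_entry(buf, doc_id, status, repo_ptr, checksum):
        ENTRY.pack_into(buf, doc_id * ENTRY.size, status, repo_ptr, checksum)

    def read_entry(buf, doc_id):
        return ENTRY.unpack_from(buf, doc_id * ENTRY.size)

    index = bytearray(ENTRY.size * 100)  # room for 100 documents
    write_entry(index, 42, 1, 123456, 0xDEADBEEF)
    print(read_entry(index, 42))         # (1, 123456, 3735928559)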


Hit Lists


Two kinds of hits: plain, and fancy (URL, title, anchor text, meta tag)


2 bytes for each hit


Length of the hit list stored before the hits
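
The paper packs a plain hit into its 2 bytes as 1 capitalization bit, 3 bits of font size, and 12 bits of word position (fancy hits reuse the layout with the font-size field set to 7). A small sketch of that packing:

    # Pack/unpack the paper's 2-byte plain-hit layout.

    def pack_plain_hit(capitalized, font_size, position):
        assert 0 <= font_size < 7 and 0 <= position < 4096
        return (capitalized << 15) | (font_size << 12) | position

    def unpack_plain_hit(hit):
        return (hit >> 15) & 1, (hit >> 12) & 7, hit & 0xFFF

    hit = pack_plain_hit(capitalized=1, font_size=3, position=100)
    print(hex(hit), unpack_plain_hit(hit))   # 0xb064 (1, 3, 100)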


Forward Index


Stored in 64 barrels.


If a document contains words in a barrel, its docID is recorded in that barrel, along with a list of wordIDs and hit lists.


Each wordID stored as a relative difference
from the minimum wordID in a barrel. (24
bits for the wordID, 8 for hitlist length).
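
A sketch of that encoding: each wordID is stored as a 24-bit offset from the barrel's minimum wordID, packed with an 8-bit hit-list length into one 32-bit word. The exact bit order is an assumption.

    # Delta-encode a wordID against the barrel minimum, plus hit count.

    def pack_word_entry(word_id, barrel_min, num_hits):
        delta = word_id - barrel_min
        assert 0 <= delta < (1 << 24) and 0 <= num_hits < 256
        return (delta << 8) | num_hits

    def unpack_word_entry(entry, barrel_min):
        return barrel_min + (entry >> 8), entry & 0xFF

    entry = pack_word_entry(word_id=5_000_123, barrel_min=5_000_000, num_hits=3)
    print(unpack_word_entry(entry, barrel_min=5_000_000))  # (5000123, 3)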

Inverted Index


Same barrels as forward index, but
processed by the sorter.


For every wordID, a doclist of docIDs is generated, with the corresponding hit lists.


Two sets of inverted barrels, one for hitlists
with anchor or title text, another for all
hitlists.
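
A toy inversion pass in the spirit of the sorter: forward entries become per-word doclists, with title/anchor hits split into a separate "short" barrel that the searcher consults first. The hit representation here is a simplification.

    from collections import defaultdict

    def invert(forward):
        """forward: docID -> {word: [hits]}; returns (short, full) barrels."""
        short, full = defaultdict(list), defaultdict(list)
        for doc_id, words in forward.items():
            for word, hits in words.items():
                full[word].append((doc_id, hits))
                fancy = [h for h in hits if h["type"] in ("title", "anchor")]
                if fancy:
                    short[word].append((doc_id, fancy))
        return short, full

    forward = {
        1: {"web": [{"type": "title", "pos": 0}, {"type": "plain", "pos": 7}]},
        2: {"web": [{"type": "plain", "pos": 3}]},
    }
    short, full = invert(forward)
    print(short["web"])  # only doc 1, with its title hit
    print(full["web"])   # both documents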

Indexing the Web


Parser


flex used to generate a lexical
analyzer


“involved a fair amount of work”


Indexing Documents into barrels


Every word hashed into wordID


Occurrences translated into hitlists and written into forward barrels


The base lexicon needs to be shared across parallel indexers


Extra words are written to a log and processed by one final indexer
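
A toy version of this indexing path; the hash function, the barrel assignment by wordID modulo, and all sizes are illustrative stand-ins (the paper assigns barrels by wordID range).

    import zlib
    from collections import defaultdict

    NUM_BARRELS = 64
    base_lexicon = {"web": 1, "search": 2}      # the shared, read-only lexicon
    new_word_log = []                           # processed later by a final indexer
    barrels = [defaultdict(dict) for _ in range(NUM_BARRELS)]

    def word_id(word):
        wid = base_lexicon.get(word)
        if wid is None:
            wid = zlib.crc32(word.encode())     # stand-in hash for unknown words
            new_word_log.append((word, wid))
        return wid

    def index_document(doc_id, words):
        for pos, word in enumerate(words):
            wid = word_id(word)
            barrel = barrels[wid % NUM_BARRELS] # illustrative barrel choice
            barrel[doc_id].setdefault(wid, []).append(pos)

    index_document(7, ["web", "search", "anatomy"])
    print(new_word_log)                         # [('anatomy', <hash>)]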






Searching

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist, go to step 4.

8. Sort the documents that have matched by rank and return the top k.
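
A simplified version of steps 3-5 and 8: sorted doclists are intersected by repeatedly advancing the list with the smallest current docID; the short/full barrel switch of steps 6-7 is omitted and the rank function is a stub.

    def intersect(doclists):
        """Each doclist is a sorted list of docIDs; yields docs matching all terms."""
        positions = [0] * len(doclists)
        while all(p < len(dl) for p, dl in zip(positions, doclists)):
            current = [dl[p] for p, dl in zip(positions, doclists)]
            if min(current) == max(current):        # all terms match this doc
                yield current[0]
                positions = [p + 1 for p in positions]
            else:                                   # advance the lagging list
                lagging = current.index(min(current))
                positions[lagging] += 1

    def search(query_doclists, rank, k=10):
        matches = [(rank(doc), doc) for doc in intersect(query_doclists)]
        return sorted(matches, reverse=True)[:k]

    # Two terms, docIDs 3 and 9 appear in both doclists.
    print(search([[1, 3, 5, 9], [3, 4, 9]], rank=lambda doc: 1.0 / doc))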


Ranking…


A count weight is generated for each word in the query


The dot product is taken with the type-weight vector (for single-word queries) or with the type-prox-weight vector (for multi-word queries)


Combined with PageRank to give the final score
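
A sketch of the single-word scoring just described; the type-weight values, the log damping of counts, and the weighted-sum combination with PageRank are all assumptions (the paper only says the IR score and PageRank are combined).

    import math

    TYPE_WEIGHTS = {"title": 6.0, "anchor": 5.0, "url": 4.0, "plain": 1.0}

    def ir_score(hits):
        score = 0.0
        for hit_type, weight in TYPE_WEIGHTS.items():
            count = sum(1 for h in hits if h == hit_type)
            count_weight = math.log1p(count)    # damped so extra hits taper off
            score += count_weight * weight
        return score

    def final_score(hits, pagerank, alpha=0.5):
        # A weighted sum is one plausible way to "combine" with PageRank.
        return alpha * ir_score(hits) + (1 - alpha) * pagerank

    print(final_score(["title", "plain", "plain"], pagerank=0.8))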

Results


High-quality results for most queries


Repository compressed with zlib at roughly a 3:1 ratio


9 days to download 26 million pages


Indexer and crawler ran simultaneously


Future work:


Query caching, smart disk allocation, updates


User context, relevance feedback


Footnote


foot in mouth!!



“… we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.”