Presented by Vamsee Raja Jarugula for the CIS 764 presentation, Kansas State University.



Presentation Overview


Problem Definition


Design Goals


Google Search Engine Characteristics


Google Architecture


Scalability


Conclusions


Vamsee Raja Jarugula CIS 764

Problem


The Web is vast and ever-expanding, and it is being flooded with data.



This data is heterogeneous and comes in many forms:


Text


Images


ASCII


Java applets




Lists maintained by humans cannot keep up with this growth.



Human attention is confined to 10-1000 documents



Previous search methodologies relied on keyword matching, producing inferior-quality results.






Solution = Search Engine


Search engines let users retrieve the text or documents of their choice with a click of the mouse.


Some examples of search engines:

Google, AltaVista, MetaCrawler, Kosmix.


For a comprehensive list of search engines, visit:

http://en.wikipedia.org/wiki/List_of_search_engines




Specific Design Goals


Deliver results that have very high precision even
at the expense of recall



Bring search engine technology into the academic
realm in order to support novel research activities



Make search engine technology transparent, i.e.
advertising shouldn’t bias results



Make the system user-friendly.



Google Search Engine Features



Uses link structure of web (PageRank)


Uses the text surrounding hyperlinks (anchor text) to improve
the accuracy of document retrieval


Other features include:


Takes into account word proximity in documents


Uses font size, word position, etc. to weight words


Stores full raw HTML pages



PageRank For Layman



Imagine a web surfer doing a simple random walk on the
entire web for an infinite number of steps.


Occasionally the surfer gets bored and, instead of
following a link pointing outward from the current page,
jumps to another random page.


At some point, the percentage of time spent at each page
will converge to a fixed value.


This value is known as the PageRank of the page.
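The random walk described above can be simulated directly. Below is a minimal sketch, assuming a hypothetical three-page web and a "boredom" parameter of 0.85; both are illustrative choices, not from the slides. The visit fractions it converges to are the PageRanks:

```python
import random

def random_surfer(links, steps=100_000, d=0.85, seed=0):
    """Simulate the bored random surfer: with probability d follow a
    random outgoing link, otherwise jump to a uniformly random page.
    Returns the fraction of steps spent on each page."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < d and links[page]:
            page = rng.choice(links[page])   # follow an outgoing link
        else:
            page = rng.choice(pages)         # "bored": jump anywhere
    return {p: v / steps for p, v in visits.items()}

# Toy 3-page web: A and B each link to C; C links back to A.
ranks = random_surfer({"A": ["C"], "B": ["C"], "C": ["A"]})
```

On this toy graph, C, which every other page links to, ends up with the largest visit fraction, while B, which nothing links to, is visited only via random jumps.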



PageRank For Techies

N(p): # outgoing links from page p

B(p): set of pages that point to p

d: damping factor; (1 - d) is the probability of getting “bored” and jumping to a random page

R(p): PageRank of p


R(p) = (1 - d) + d * Σ_{q ∈ B(p)} R(q)/N(q)
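This recurrence can be solved by fixed-point iteration. A minimal sketch of that computation, using the definitions above and a toy three-page graph with no dangling pages (the graph and iteration count are illustrative assumptions):

```python
def pagerank(links, d=0.85, iters=50):
    """Iterate R(p) = (1 - d) + d * sum over q in B(p) of R(q)/N(q)
    until it converges (assumes every page has outgoing links)."""
    backlinks = {p: [] for p in links}          # B(p)
    for q, targets in links.items():
        for p in targets:
            backlinks[p].append(q)
    ranks = {p: 1.0 for p in links}
    for _ in range(iters):
        ranks = {p: (1 - d) + d * sum(ranks[q] / len(links[q])
                                      for q in backlinks[p])
                 for p in links}
    return ranks

# A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

After convergence, each page's rank satisfies the recurrence exactly; C, which receives links from both A and B, ranks highest.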





Why do we need d?


In the real world virtually all web graphs are not
connected, i.e. they have dead ends, islands, etc.


If we don’t have d we get “rank leaks”
for graphs that are not connected, which leads to
numerical instability.




Justifications for using PageRank


Attempts to model user behavior


Captures the notion that the more a page is pointed to
by “important” pages, the more it is worth looking at


Takes into account global structure of web



Google Architecture

Implemented in C and C++ on Solaris and Linux


Reference: The Anatomy of a Large-Scale Hypertextual Web Search Engine,
Sergey Brin and Lawrence Page.

Preliminary

“Hitlist” is defined as the list of occurrences of a particular
word in a particular document, including additional
meta info:

- position of word in doc

- font size

- capitalization

- descriptor type, e.g. title, anchor, etc.
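As a rough sketch, a single hit could be modeled as below. The field names and the container keyed by (word, docID) are my own illustration; in the actual system each hit was hand-packed into two bytes:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document (illustrative fields)."""
    position: int        # position of the word in the doc
    font_size: int       # relative font size
    capitalized: bool    # capitalization flag
    hit_type: str        # descriptor type: "title", "anchor", "url", "plain", ...

# Hitlist for the (word, docID) pair ("google", 42) -- hypothetical data.
hitlist = {("google", 42): [Hit(0, 7, True, "title"),
                            Hit(15, 3, False, "plain")]}
```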






Google Architecture (cont.)

- URL Server: keeps track of URLs that have been and need to be crawled.

- Store Server: compresses and stores web pages.

- Crawlers: multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once.

- Indexer: uncompresses and parses documents; stores link information in the anchors file.

- Anchors file: stores each link and the text surrounding it.

- URL Resolver: converts relative URLs into absolute URLs.

- Repository: contains the full HTML of every web page; each document is prefixed by docID, length, and URL.

Core figure referred from Sergey Brin and Larry Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine.

Google Architecture (cont.)

- URL Resolver: maps absolute URLs into docIDs stored in the Doc Index; stores anchor text in “barrels”; generates a database of links (pairs of docIDs).

- Indexer: parses documents and distributes hit lists into “barrels.”

- Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.

- Lexicon: in-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel which the wordID falls into.

- Barrels: partially sorted forward indexes sorted by docID; each barrel stores hitlists for a given range of wordIDs.

- Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, just the URL.

Core figure referred from Sergey Brin and Larry Page.
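The Sorter's job of turning the forward barrels into an inverted index can be sketched as follows. This is a simplification: wordIDs are plain strings and hitlists are plain position lists here, not the packed barrel format:

```python
from collections import defaultdict

def invert(forward_index):
    """Convert a forward index {docID: {wordID: hitlist}} into an
    inverted index {wordID: [(docID, hitlist), ...]} sorted by docID."""
    inverted = defaultdict(list)
    for doc_id, words in forward_index.items():
        for word_id, hits in words.items():
            inverted[word_id].append((doc_id, hits))
    for doclist in inverted.values():
        doclist.sort()              # keep each doclist ordered by docID
    return dict(inverted)

inv = invert({1: {"cat": [3], "dog": [7]},
              2: {"cat": [1, 9]}})
```

Keeping each doclist sorted by docID is what lets query evaluation scan multiple doclists in lockstep.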

Google Architecture (cont.)

- The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the Searcher. The lexicon stores ~14 million words.

- The new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks are used to answer queries.

- There are 2 kinds of barrels: short barrels, whose hit lists include title or anchor hits, and long barrels for all hit lists.

Core figure referred from Sergey Brin and Larry Page.

Google Query Evaluation

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist, go to step 4.

8. Sort the documents that have matched by rank and return the top k.
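Steps 3-8 can be sketched as below, under two stated simplifications: doclists are plain sets of docIDs rather than positional hitlists, and both barrels are always scanned instead of taking the short-barrel-first early exit:

```python
def evaluate(query_words, short_barrel, full_barrel, rank_fn, k=10):
    """Find documents containing every query word, first in the short
    barrel (title/anchor hits), then in the full barrel; rank the
    matches and return the top k."""
    matches = set()
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(word, set()) for word in query_words]
        if doclists:
            # a document matches only if it appears in every doclist
            matches |= set.intersection(*doclists)
    return sorted(matches, key=rank_fn, reverse=True)[:k]

short = {"google": {1}, "search": {2}}          # title/anchor hits only
full = {"google": {1, 2, 3}, "search": {2, 3}}  # all hits
top = evaluate(["google", "search"], short, full, rank_fn=lambda d: -d)
```

Here `rank_fn` stands in for the real ranking function; the toy `lambda d: -d` simply prefers lower docIDs.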



Single Word Query Ranking


The hitlist is retrieved for the single query word


Each hit can be one of several types: title, anchor, URL,
large font, small font, etc.


Each hit type is assigned its own weight


Type-weights make up a vector of weights


# of hits of each type is counted to form count vector


Dot product of two vectors is used to compute IR score


IR score is combined with PageRank to compute final rank




Multi-word Query Ranking


Similar to single-word ranking, except proximity
must now be analyzed


Hits occurring closer together are weighted higher


Each proximity relation is classified into 1 of 10
values ranging from a phrase match to “not even
close”


Counts are computed for every type of hit and
proximity
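One way to bin a pair of hit positions into 10 proximity values, from a phrase match (adjacent words) down to "not even close"; the cutoffs here are illustrative, not Google's actual classification:

```python
def proximity_bin(pos_a, pos_b, bins=10):
    """Classify the distance between two hit positions into one of
    `bins` values: 0 for adjacent words (a phrase match), up to
    bins - 1 for 'not even close'. Illustrative cutoffs."""
    distance = abs(pos_a - pos_b)
    if distance <= 1:
        return 0                      # adjacent: phrase match
    return min(distance - 1, bins - 1)
```

Counts would then be accumulated per (hit type, proximity bin) pair and dotted with a weight vector, as in the single-word case.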



Scalability



Cluster architecture combined with Moore’s Law makes
for high scalability. At the time of writing:


~ 24 million documents indexed in one week


~518 million hyperlinks indexed


Four crawlers collected 100 documents/sec






Summary of Key Optimization Techniques


Each crawler maintains its own DNS lookup cache


Uses flex to generate a lexical analyzer with its own stack for parsing
documents


Parallelization of indexing phase


In-memory lexicon


Compression of repository


Compact encoding of hitlists accounting for major space savings


Indexer is optimized so it is just faster than the crawler so that
crawling is the bottleneck


Document index is updated in bulk


Critical data structures placed on local disk


Overall architecture designed to avoid disk seeks wherever possible





References:


http://video.google.com/videoplay?docid=-1400721382961784115


http://google.stanford.edu


http://en.wikipedia.org/wiki/List_of_search_engines


The Anatomy of a Large-Scale Hypertextual Web Search Engine,
Sergey Brin and Lawrence Page (pdf).



The audio presentation of my lecture will be posted
on my homepage and will be given to Dr. Hankley.


www.cis.ksu.edu/~vamsee


