Search Engines and Google


Dec 4, 2013 (3 years and 11 months ago)


Francisco Velázquez
3. Nov. 2010
1
Motivation

Human maintained lists are subjective,
expensive to build and maintain, slow to
improve and cannot cover all esoteric topics.

Automated search engines that rely on
keyword matching return low quality matches.

Advertisers mislead automated search engines.

Scalability in search engines must meet WWW
growth.
2
Content

3 Tier Framework

Components of a search
engine

Crawler

PageRank

Indices

Map-Reduce Parallelism
Framework

Finding Similar Pages

Jaccard Measure of
Similarity

Minhashing

Locality-Sensitive
Hashing

Google
3
3 Tier Framework
http://goo.gl/CmFF
4
The components of a search engine
5
Crawler

A process that
downloads web pages to
a Page Repository.

Examines pages for links
to other pages and inserts
the ones that are not in
the Page Repository into
the set of pages to be
crawled.
http://goo.gl/gG3s
6
Crawler

Challenge: Terminating the search
Description: Dynamically generated pages could create an endless loop.
Solution: Limit the number of pages to crawl with a “depth” limit per site.

Challenge: Managing the repository
Description:
1. Duplicate URLs queued for crawling.
2. Duplicated pages due to mirror sites, different routes, plagiarism, etc.
Solution:
1. An efficient index for checking stored pages.
2. Minhash and locality-sensitive hashing signatures.

Challenge: Selecting the next page
Description: How to prioritise the next page to be crawled?
Solution: Give priority to “important” pages.

Challenge: Speeding up the crawl
Description:
1. How many processes should run simultaneously?
2. How to synchronise them so that they do not crawl the same site?
3. How to avoid what amounts to a DoS attack on a site?
Solution:
1. Scale to several machines.
2. Assign processes to entire hosts or sites.
3. Do not issue frequent requests to a single site. Several processes can share a single machine because each spends much of its time idle.
7
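The crawl loop, the depth limit and the duplicate-URL check from the table above can be sketched as follows. This is a minimal illustration, not a real crawler: the `WEB` dictionary is a hypothetical stand-in for HTTP downloads, the names `crawl` and `max_depth` are assumptions, and depth is counted from the seed page rather than per site.

```python
from collections import deque

# Hypothetical toy "web": URL -> list of URLs it links to.
WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/2", "a.example/"],
    "a.example/2": ["a.example/3"],
    "a.example/3": ["a.example/4"],
    "b.example/": ["a.example/"],
}

def crawl(seed, max_depth):
    """Breadth-first crawl that never follows links from pages at max_depth."""
    repository = {}                 # page repository: URL -> outgoing links
    seen = {seed}                   # URLs already queued, to avoid duplicates
    frontier = deque([(seed, 0)])   # set of pages still to be crawled
    while frontier:
        url, depth = frontier.popleft()
        links = WEB.get(url, [])    # stands in for downloading the page
        repository[url] = links
        if depth == max_depth:
            continue                # depth limit stops endless generated pages
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return repository

pages = crawl("a.example/", max_depth=2)
print(sorted(pages))
```

A production crawler would replace the `seen` set with an on-disk index and add per-host politeness delays, as the solutions column suggests.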
Query Processing in
Search Engines

Search engine queries are not like SQL
queries

Require inverted indices

Disk access is too expensive if the user is to
get an acceptable response time

Matched records are ranked before showing
to the user
8
PageRank

Algorithm for identifying
“important” pages

A Web page is important if
many important pages link to it
http://goo.gl/gKsQ
http://goo.gl/CsuN
9
Recursive Formulation
of PageRank

[Figure: "The Web in 1839", a link graph of three pages: Yahoo!, Amazon and Microsoft]

Transition matrix (rows and columns ordered Yahoo!, Amazon, Microsoft):

$$M = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix}$$

The matrix $M$, the transition matrix of the Web, has element $m_{ij}$ in row $i$ and column $j$, where
1. $m_{ij} = 1/r$ if page $j$ has a link to page $i$, and there are a total of $r \geq 1$ pages that $j$ links to
2. $m_{ij} = 0$ otherwise
10
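As a sketch of how the definition of $m_{ij}$ turns a link graph into the transition matrix, the snippet below rebuilds $M$ for the three-page example. The `links` dictionary is an assumption inferred from the columns of $M$ (the original figure did not survive), and `transition_matrix` is an illustrative name.

```python
from fractions import Fraction

# Assumed link structure, inferred from the columns of M: page -> pages it links to.
links = {
    "Yahoo!": ["Yahoo!", "Amazon"],
    "Amazon": ["Yahoo!", "Microsoft"],
    "Microsoft": ["Amazon"],
}
pages = ["Yahoo!", "Amazon", "Microsoft"]

def transition_matrix(links, pages):
    """Return M with M[i][j] = 1/r if page j links to page i, else 0."""
    M = [[Fraction(0)] * len(pages) for _ in pages]
    for j, source in enumerate(pages):
        out = links[source]            # the r pages that page j links to
        for target in out:
            M[pages.index(target)][j] = Fraction(1, len(out))
    return M

M = transition_matrix(links, pages)
for row in M:
    print([str(x) for x in row])
```

Each column sums to 1 because every page here has at least one outgoing link.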
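Suppose $y$, $a$, and $m$ represent the PageRanks of Yahoo!, Amazon and Microsoft, that is, the fractions of the time the random walker spends at each page:

$$\begin{bmatrix} y \\ a \\ m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} y \\ a \\ m \end{bmatrix}$$

Starting from $y = a = m = 1/3$:

$$\begin{bmatrix} 2/6 \\ 3/6 \\ 1/6 \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}$$

$$\begin{bmatrix} 5/12 \\ 4/12 \\ 3/12 \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} 2/6 \\ 3/6 \\ 1/6 \end{bmatrix}$$

After repeating the process several times (entries ordered Yahoo!, Amazon, Microsoft):

$$\begin{bmatrix} 9/24 \\ 11/24 \\ 4/24 \end{bmatrix}, \begin{bmatrix} 20/48 \\ 17/48 \\ 11/48 \end{bmatrix}, \ldots, \begin{bmatrix} 2/5 \\ 2/5 \\ 1/5 \end{bmatrix}$$

The values are probabilities, so $y + a + m = 1$ throughout.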
11
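The repeated multiplication on this slide can be reproduced with exact fractions. This is a small sketch under the assumption that the walk starts from the uniform estimate 1/3; `step` and `history` are illustrative names.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Transition matrix of the three-page example (Yahoo!, Amazon, Microsoft).
M = [
    [half, half, 0],
    [half, 0, 1],
    [0, half, 0],
]

def step(M, p):
    """One iteration: multiply the current estimate p by M."""
    return [sum(M[i][j] * p[j] for j in range(len(p))) for i in range(len(M))]

p = [Fraction(1, 3)] * 3     # initial estimate: equal rank for every page
history = []
for _ in range(4):
    p = step(M, p)
    history.append(p)

for estimate in history:
    print([str(x) for x in estimate])

# The limit (2/5, 2/5, 1/5) is a fixed point of the iteration.
limit = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
print(step(M, limit) == limit)
```

`Fraction` reduces automatically, so 2/6 prints as 1/3 and 20/48 as 5/12; the values are the same as on the slide.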
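Spider Traps and Dead
Ends

Microsoft becomes a spider trap: it links only to itself, so the Microsoft column of $M$ (rows Yahoo!, Amazon, Microsoft) becomes $(0, 0, 1)^T$
[Figure: graph of Yahoo!, Amazon and Microsoft with Microsoft as a spider trap]

Microsoft becomes a dead end: it has no outgoing links, so the Microsoft column of $M$ becomes $(0, 0, 0)^T$
[Figure: graph of Yahoo!, Amazon and Microsoft with Microsoft as a dead end]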
12
Spider traps and dead
ends solution

Limit the time that the random walker is allowed to
wander at random

Pick a constant β < 1, typically in the range 0.8 to 0.9.

Taxation rate: 1 - β

If the walker gets stuck in a spider trap, it will
disappear and be replaced by a new walker after a few
time steps

If the walker reaches a dead end and disappears, a
new walker will take over shortly
13
Microsoft becomes a spider trap

$$P_{new} = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{bmatrix} P_{old} + 0.2 \begin{bmatrix} 1/3 \\ 1/3 \\ 1/3 \end{bmatrix}$$

After several iterations: Yahoo! = 7/33, Amazon = 5/33, Microsoft = 21/33
14
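To illustrate the effect of taxation on the spider-trap example, the sketch below runs the same iteration twice: once without taxation (β = 1) and once with β = 0.8. The function name `iterate` and the fixed iteration count are assumptions made for illustration.

```python
def iterate(M, teleport, beta, iterations=100):
    """Repeat P_new = beta * M * P_old + (1 - beta) * teleport."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [
            beta * sum(M[i][j] * p[j] for j in range(n)) + (1 - beta) * teleport[i]
            for i in range(n)
        ]
    return p

# Microsoft (third column) links only to itself: a spider trap.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
uniform = [1 / 3, 1 / 3, 1 / 3]

untaxed = iterate(M_trap, uniform, beta=1.0)   # the trap absorbs nearly all rank
taxed = iterate(M_trap, uniform, beta=0.8)     # taxation spreads some rank back

print([round(x, 4) for x in untaxed])
print([round(x, 4) for x in taxed])
print([round(x * 33, 4) for x in taxed])       # compare with 7, 5 and 21
```

Without taxation the walker ends up at Microsoft almost surely; with β = 0.8 the other two pages keep a non-zero share.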
Teleport Sets

Selected set of nodes

Eliminate spam and pages that are not relevant
to the search topic

Nodes are selected from trusted open
directories, keywords in pages on a topic,
users’ bookmarks, recently searched
keywords, etc.
15
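[Figure: "The Web in 1839", a link graph of Yahoo!, Amazon and Microsoft]

With a teleport set containing only Amazon:

$$\begin{bmatrix} y \\ a \\ m \end{bmatrix} = 0.8 \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} y \\ a \\ m \end{bmatrix} + 0.2 \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$

After several iterations: Yahoo! = 10/31, Amazon = 15/31, Microsoft = 6/31

$$P_{new} = \beta M P_{old} + (1 - \beta) t$$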
16
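A minimal sketch of the general formula $P_{new} = \beta M P_{old} + (1 - \beta) t$ with a teleport set, using the three-page example. The helper names `teleport_vector` and `topic_pagerank` are assumptions, and spreading the teleport probability evenly over the set is one common choice rather than something the slides specify.

```python
PAGES = ["Yahoo!", "Amazon", "Microsoft"]
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

def teleport_vector(pages, teleport_set):
    """Vector t: equal weight on each page of the teleport set, 0 elsewhere."""
    share = 1.0 / len(teleport_set)
    return [share if page in teleport_set else 0.0 for page in pages]

def topic_pagerank(M, t, beta=0.8, iterations=200):
    """Iterate P_new = beta * M * P_old + (1 - beta) * t from a uniform start."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [
            beta * sum(M[i][j] * p[j] for j in range(n)) + (1 - beta) * t[i]
            for i in range(n)
        ]
    return p

t = teleport_vector(PAGES, {"Amazon"})
ranks = dict(zip(PAGES, topic_pagerank(M, t)))
for page, rank in ranks.items():
    print(page, round(rank * 31, 4))   # compare with 10, 15 and 6
```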
Link Spam

Spam farming in order to
accumulate and
concentrate PageRank on
a few pages

Links to the spam farm
from publicly accessible
blogs, with messages like
“I agree with you. See
x1234.mySpam.Farm.com”

[Figure: spam farm S receiving links from outside]
17
Link Spam Solution

Compute the TrustRank of pages

TrustRank: Topic-specific PageRank computed with a
Teleport set consisting of only “trusted” pages

Manual trusted pages collection

Use teleport sets of serious pages such as
universities

Compute the difference between the PageRank and
TrustRank for each page. This difference is the
negative TrustRank
18
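The difference between PageRank and TrustRank can be sketched on a tiny hypothetical graph. Everything here is assumed for illustration: the page names, the link structure (a two-page trusted cluster and a three-page spam farm with no links between them, which is a simplification), and the function name `pagerank`.

```python
def pagerank(links, teleport_set, beta=0.8, iterations=200):
    """Taxed PageRank; teleports land uniformly on the pages in teleport_set."""
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {
            p: ((1 - beta) / len(teleport_set) if p in teleport_set else 0.0)
            for p in pages
        }
        for source, targets in links.items():
            for target in targets:
                new[target] += beta * rank[source] / len(targets)
        rank = new
    return rank

# Hypothetical graph: page -> pages it links to (no dead ends).
links = {
    "university": ["library"],
    "library": ["university"],
    "spam_target": ["farm1", "farm2"],
    "farm1": ["spam_target"],
    "farm2": ["spam_target"],
}
trusted = {"university"}

page_rank = pagerank(links, set(links))    # ordinary PageRank: teleport anywhere
trust_rank = pagerank(links, trusted)      # TrustRank: teleport only to trusted pages
negative_trust_rank = {p: page_rank[p] - trust_rank[p] for p in links}

for page in sorted(links):
    print(page, round(negative_trust_rank[page], 4))
```

In this toy graph the spam farm keeps its ordinary PageRank but receives essentially no TrustRank, so its difference is large and positive, while the trusted page's difference is negative.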
Indices
Documents with ids 0, 1, 2

0: the cat is fat
1: was raining cats and dogs
2: Fido the dog

Inverted Index

Term      Documents
and       1
cat       0, 1
dog       1, 2
fat       0
fido      2
is        0
raining   1
the       0, 2
was       1
19
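The inverted index for the three example documents can be built with a few lines of Python. This is a sketch: the hand-written `STEMS` table (mapping "cats" to "cat" and "dogs" to "dog") is an assumption standing in for a real stemmer, and `build_inverted_index` and `search` are illustrative names.

```python
from collections import defaultdict

documents = {
    0: "the cat is fat",
    1: "was raining cats and dogs",
    2: "Fido the dog",
}

# Hypothetical, hand-written stemming table; a real engine would use a stemmer.
STEMS = {"cats": "cat", "dogs": "dog"}

def build_inverted_index(documents):
    """Map each normalised term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[STEMS.get(word, word)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

def search(index, *terms):
    """Ids of documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(documents)
for term, ids in index.items():
    print(term, ids)
print(search(index, "cat", "dog"))
```

A multi-word query is answered by intersecting the document lists of its terms, which is why search engines need inverted indices rather than SQL-style scans.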
Inverted Indices

Essential for
Web queries

Uses indirect
buckets for
space
efficiency

[Figure: inverted index entries for "cat" and "dog" point to buckets, which in turn point to the documents "... the cat is fat ...", "... was raining cats and dogs ..." and "... Fido the dog ..."]
20
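Storing more information in the
inverted index

[Figure: buckets for "cat" and "dog" whose entries record the type, position and document of each occurrence. Types shown: title, header, anchor, text. Positions shown: 5, 10, 3, 57, 100, 12. Documents shown: Doc 1, Doc 2, Doc 3. Other labels in the figure: "Cat", "Dog", "Dogs compared with cats"]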
21
Map-Reduce Parallelism Framework

Large-scale parallel
machines share high load
operations such as joins

Distributed architectures

Grid, networks and
corporate DBs

MRP paradigm expresses
large-scale computations
[Figure: Execution of map and reduce functions. Input -> Map -> key-value pairs -> sort intermediate key-value pairs by keys -> Reduce -> output lists]
22
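The map, sort and reduce stages can be sketched in a single process with a word count, the usual introductory example. The names `map_fn`, `reduce_fn` and `map_reduce` are assumptions, and a real framework would run the map and reduce calls in parallel across machines.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    """Map: emit one (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values that share a key into one output pair."""
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: every input pair produces intermediate key-value pairs.
    intermediate = [pair for key, value in inputs for pair in map_fn(key, value)]
    # Sort phase: bring pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct key.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]

inputs = [(0, "the cat is fat"), (2, "Fido the dog")]
counts = map_reduce(inputs, map_fn, reduce_fn)
print(counts)
```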
Jaccard Measure of
Similarity

Finding Similar Items

Jaccard similarity is the ratio of the sizes of the
intersection and union of the sets S and T:
$|S \cap T| / |S \cup T|$
{1,2,3} and {1,3,4,5} have ratio 2/5

A k-gram or k-shingle is a substring of length k
of a document.

A document beginning “A number of …” has the 3-shingles
“A n”, “ nu”, “num”, and so on.
23
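Both definitions on this slide fit in a few lines. This is a minimal sketch: `jaccard` and `shingles` are illustrative names, and returning 1.0 for two empty sets is a convention assumed here, not something the slides state.

```python
def jaccard(s, t):
    """Size of the intersection divided by size of the union."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0          # assumed convention for two empty sets
    return len(s & t) / len(s | t)

def shingles(text, k):
    """Set of all substrings of length k (k-shingles) of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

similarity = jaccard({1, 2, 3}, {1, 3, 4, 5})
print(similarity)

three_shingles = shingles("A number", 3)
print(sorted(three_shingles))
```

Two documents can then be compared by taking the Jaccard similarity of their shingle sets.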
Minhashing

It is a technique to form a short signature for
each set

Estimates the Jaccard similarity using signatures

A minhash value of a set S is the first element of a
randomly permuted universal set that is a
member of S

Example: if the universal set of elements is {1,2,3,4,5} and a
permuted order is (3,5,4,2,1), then the hash value
for the set {2,3,5} is 3.
24
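The minhash definition and the slide's example can be checked directly. The sketch below uses explicit random permutations, which is practical only for a tiny universal set; the names `minhash`, `signature` and `estimate_similarity`, the seed and the choice of 200 permutations are all assumptions.

```python
import random

def minhash(permutation, s):
    """First element of the permuted universal set that is a member of s."""
    for element in permutation:
        if element in s:
            return element
    raise ValueError("set is empty or not drawn from the universal set")

def signature(s, permutations):
    """One minhash value per permutation."""
    return [minhash(p, s) for p in permutations]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature components on which the two sets agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

example = minhash((3, 5, 4, 2, 1), {2, 3, 5})
print(example)

universe = list(range(1, 6))
rng = random.Random(0)
permutations = [rng.sample(universe, len(universe)) for _ in range(200)]
a, b = {1, 2, 3}, {1, 3, 4, 5}
est = estimate_similarity(signature(a, permutations), signature(b, permutations))
print(est)   # an estimate of the true Jaccard similarity 2/5
```

The estimate works because, for a random permutation, two sets have the same minhash value with probability equal to their Jaccard similarity.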
Locality-Sensitive
Hashing (LSH)

Minhashing is fast, but there are still too
many pairs of sets to compare

LSH hashes sets to buckets so that “similar”
elements are assigned to the same bucket

Trades off the number of buckets (constrained by
memory) against the chance of missing a pair of
similar elements
25
[Figure: dividing signatures into bands and hashing based on the values in a band. The n signatures are split into b bands of r rows each, and each band is hashed to buckets]

Threshold similarity: $s = (1/b)^{1/r}$

[Figure: the probability that a pair of signatures will appear together in at least one bucket, plotted against their similarity s. Both axes run from 0 to 1]
26
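The shape of the curve can be computed from the standard banding result: two signatures whose components agree with probability s share at least one bucket with probability $1 - (1 - s^r)^b$. That formula is not on the slide and is quoted here as a well-known property of the technique; the values r = 5 and b = 20 are purely illustrative.

```python
def candidate_probability(s, r, b):
    """Probability that two signatures agree on all r rows of at least one of b bands."""
    return 1 - (1 - s ** r) ** b

def threshold(r, b):
    """Similarity at which the S-curve rises most steeply, about (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)

r, b = 5, 20   # illustrative values: 20 bands of 5 rows, 100 minhash values
print(round(threshold(r, b), 3))
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r, b), 4))
```

Pairs well below the threshold are rarely compared, while pairs well above it are almost always found, which is the trade-off between bucket count and missed pairs.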
Combining Minhashing
and LSH
1. Compute minhash signatures with as many hash functions as the desired accuracy requires
2. Perform LSH to get candidate pairs of signatures that hash to the same bucket for at least one band
3. For each candidate pair, compute the estimate of their Jaccard similarity by counting the number of components in which their signatures agree
4. Optionally, for each pair whose signatures are sufficiently similar, compute their true Jaccard similarity by examining the sets themselves
27
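The four steps above can be sketched end to end on three small sets. Several simplifications are assumed: explicit permutations instead of hash functions, the tuple of band values used directly as the bucket key, and made-up example sets, seed and parameters (`rows = 4`, `bands = 10`).

```python
import random
from collections import defaultdict
from itertools import combinations

def minhash_signature(s, permutations):
    """Step 1: one minhash value per permutation of the universal set."""
    return [next(x for x in perm if x in s) for perm in permutations]

def lsh_candidates(signatures, rows, bands):
    """Step 2: pairs whose signatures fall in the same bucket for some band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

def jaccard(s, t):
    return len(s & t) / len(s | t)

sets = {
    "A": set(range(0, 20)),
    "B": set(range(0, 19)) | {50},   # near-duplicate of A
    "C": set(range(30, 50)),         # shares nothing with A or B
}
universe = list(range(51))
rows, bands = 4, 10
rng = random.Random(42)
permutations = [rng.sample(universe, len(universe)) for _ in range(rows * bands)]

signatures = {name: minhash_signature(s, permutations) for name, s in sets.items()}
candidates = lsh_candidates(signatures, rows, bands)
for a, b in sorted(candidates):
    agree = sum(x == y for x, y in zip(signatures[a], signatures[b]))
    estimate = agree / (rows * bands)          # step 3: estimated similarity
    true_value = jaccard(sets[a], sets[b])     # step 4: exact similarity
    print(a, b, round(estimate, 2), round(true_value, 2))
```

Only candidate pairs are ever compared, so the disjoint set C is never checked against A or B.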
Google Apps
28
Anatomy of a Google
Search

Uses links, PageRank, anchors, proximity and
visual presentation (e.g. bold text is weighted
higher) in its search logic.
1. Search the index
2. Analyze the web pages for relevance
3. Evaluate the site’s reputation
4. Rank the web pages
29
Google’s System Anatomy
http://goo.gl/yYbb
30
Google particularities

PageRank

Anchor text

Location information and use of proximity in
search

Visual presentations such as font,
capitalization and size of words are weighted
differently
31
References

The Anatomy of a Large-Scale Hypertextual
Web Search Engine.
http://infolab.stanford.edu/backrub/google.html

Database Systems: The Complete Book.
Second Edition. Hector Garcia-Molina, Jeffrey
D. Ullman, Jennifer Widom
32
Questions
francisvifi.uio.no
33