1
Random Sampling from a Search Engine's Index
Ziv Bar-Yossef and Maxim Gurevich
Department of Electrical Engineering, Technion
Presentation at group meeting, Oct. 24
Allen, Zhenjiang Lin
2
Outline
Introduction
Search Engine Samplers
Motivation
The Bharat-Broder Sampler (WWW'98)
Infrastructure of Proposed Methods
Search Engines as Hypergraphs
Monte Carlo Simulation Methods
– Rejection Sampling
The Pool-based Sampler
The Random Walk Sampler
Experimental Results
Conclusions
3
Search Engine Samplers
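[Figure: the sampler sends queries to the search engine's public interface, receives the top k results for each query, and outputs a random document x ∈ D, where D is the set of indexed documents in the engine's index of the Web.]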
4
Motivation
Useful tool for search engine evaluation:
Freshness
Fraction of up-to-date pages in the index
Topical bias
Identification of overrepresented/underrepresented topics
Spam
Fraction of spam pages in the index
Security
Fraction of pages in index infected by viruses/worms/trojans
Relative Size
Number of documents indexed compared with other search engines
5
Size Wars
August 2005: We index 20 billion documents.
September 2005: We index 8 billion documents, but our index is 3 times larger than our competition's.
So, who's right?
6
Why Does Size Matter, Anyway?
Comprehensiveness
A good crawler covers the most documents possible
Narrow-topic queries
E.g., get homepage of John Doe
Prestige
A marketing advantage
7
Measuring size using random samples
[BharatBroder98, CheneyPerry05, GulliSignorni05]
Sample pages uniformly at random from the search engine's index
Two alternatives
Absolute size estimation
Sample until collision
Collision expected after $k \sim N^{1/2}$ random samples (birthday paradox)
Return $k^2$
Relative size estimation
Check how many samples from search engine A are present in search engine B and vice versa
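The two estimators above can be sketched as follows. This is a minimal illustration, not the authors' code: `draw_uniform` is a hypothetical stand-in for a perfect uniform sampler over an index of `N` documents, and `fraction_present` assumes we can test membership in the other engine's index directly.

```python
import random

def estimate_size_by_collision(draw, rng):
    """Draw samples until the first repeat; return k**2, where k is the number of draws."""
    seen = set()
    k = 0
    while True:
        x = draw(rng)
        k += 1
        if x in seen:
            return k * k
        seen.add(x)

def fraction_present(samples, other_index):
    """Fraction of the given samples that also appear in other_index."""
    return sum(1 for x in samples if x in other_index) / len(samples)

# Hypothetical stand-in for a uniform sampler over an index of N documents.
N = 10_000

def draw_uniform(rng):
    return rng.randrange(N)

rng = random.Random(0)
estimates = [estimate_size_by_collision(draw_uniform, rng) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)
print("mean of k^2 over 200 runs:", mean_estimate, "true N:", N)
```

A single run of the collision estimator is noisy and only recovers N up to a constant factor, so it should be read as an order-of-magnitude estimate. For relative size, the two fractions (A's samples found in B, and B's samples found in A) are compared against each other.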
8
Related Work
Random Sampling from a Search Engine's Index
[BharatBroder98, CheneyPerry05, GulliSignorni05]
Anecdotal queries
[SearchEngineWatch, Google, BradlowSchmittlein00]
Queries from user query logs
[LawrenceGiles98, DobraFeinberg04]
Random sampling from the whole web
[Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
9
The Bharat-Broder Sampler: Preprocessing Step
[Figure: a large corpus C is processed into a lexicon L that lists terms with their corpus frequencies: (t_1, freq(t_1, C)), (t_2, freq(t_2, C)), …]
10
The Bharat-Broder Sampler
[Figure: the BB sampler draws two random terms t_1, t_2 from the lexicon L, sends the query "t_1 AND t_2" to the search engine, receives the top k results, and returns a random document from the top k results.]
Samples are uniform only if:
• all queries return the same number of results (≤ k)
• all documents are of the same length
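A toy sketch of this sampler shows where the bias comes from when documents differ in length. Everything here is hypothetical: `TOY_INDEX`, `toy_search`, and the lexicon are made up for illustration, and terms are drawn uniformly from the lexicon, which is an assumption (the slide only says "two random terms").

```python
import random
from collections import Counter

# Hypothetical toy index: document -> set of terms it contains.
TOY_INDEX = {
    "short.html": {"web"},
    "medium.html": {"web", "search", "index"},
    "long.html": {"web", "search", "index", "sample", "query"},
}
LEXICON = ["web", "search", "index", "sample", "query"]

def toy_search(t1, t2):
    """Conjunctive query: documents containing both terms, in a fixed order."""
    return sorted(d for d, terms in TOY_INDEX.items() if t1 in terms and t2 in terms)

def bharat_broder_sample(lexicon, search, k, rng):
    """Pick two random terms, query 't1 AND t2', return a random top-k result (or None)."""
    t1, t2 = rng.sample(lexicon, 2)
    results = search(t1, t2)[:k]
    return rng.choice(results) if results else None

rng = random.Random(0)
counts = Counter(bharat_broder_sample(LEXICON, toy_search, 10, rng) for _ in range(2000))
print(counts)
```

In this toy index the longest document matches every two-term query, while the one-term document can never be returned, which is the length bias the slide's conditions rule out.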
11
The Bharat-Broder Sampler: Drawbacks
Documents have varying lengths
Bias towards long documents
Some queries have more than k matches
Bias towards documents with high static rank
12
Two novel samplers
A pool-based sampler
Guaranteed to produce near-uniform samples
Needs a lexicon / query pool
A random walk sampler
After sufficiently many steps, guaranteed to produce near-uniform samples
Does not need an explicit lexicon / pool at all!
Focus of this talk: the pool-based sampler
13
Search Engines as Hypergraphs
results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries
Query pool hypergraph:
Vertices: indexed documents
Hyperedges: { results(q) | q ∈ P }
[Figure: example query pool hypergraph. Vertices: www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoot.com, en.wikipedia.org/wiki/BBC. Hyperedges: “news”, “bbc”, “google”, “maps”.]
14
Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)|
Document degree: deg(x) = |queries(x)|
Examples:
card(“news”) = 4, card(“bbc”) = 3
deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
[Figure: the example query pool hypergraph, with hyperedges “news”, “bbc”, “google”, “maps” over the ten example documents.]
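The definitions above translate directly into code. The membership of each hyperedge in `POOL_RESULTS` is an assumption: the figure itself did not survive, so each document is assigned to the queries whose word appears in its URL, which reproduces the cardinalities and degrees quoted on the slide.

```python
# Assumed hyperedge membership for the example (reconstructed, not from the source figure).
POOL_RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}

def card(q):
    """Query cardinality: number of documents returned on query q."""
    return len(POOL_RESULTS[q])

def queries(x):
    """Set of pool queries that return document x."""
    return {q for q, docs in POOL_RESULTS.items() if x in docs}

def deg(x):
    """Document degree: number of pool queries that return x."""
    return len(queries(x))

all_docs = sorted({x for docs in POOL_RESULTS.values() for x in docs})
total_degree = sum(deg(x) for x in all_docs)
print(card("news"), card("bbc"), deg("www.cnn.com"), deg("news.bbc.co.uk"), total_degree)
```

Under this assumed membership the degrees sum to 13, the denominator used in the degree-distribution example.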
15
Sampling documents uniformly
Sampling documents from D uniformly: Hard
Sampling documents from D non-uniformly: Easier
Will show later: can sample documents proportionally to their degrees:
$p(x) = \deg(x) / \sum_{x' \in D} \deg(x')$
16
Sampling documents by degree
p(news.bbc.co.uk) = 2/13
p(www.cnn.com) = 1/13
[Figure: the example query pool hypergraph, with hyperedges “news”, “bbc”, “google”, “maps” over the ten example documents.]
17
Monte Carlo Simulation
We need: Samples from the uniform distribution
We have: Samples from the degree distribution
Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
Yes!
Monte Carlo Simulation Methods
Rejection Sampling
Importance Sampling
Metropolis-Hastings
Maximum-Degree
18
Rejection Sampling Algorithm
Sampling values from an arbitrary probability distribution f(x) by using an instrumental distribution g(x)
The algorithm (due to John von Neumann) is as follows:
Sample x from g(x) and u from U(0,1)
Check whether or not $u < f(x) / (M g(x))$.
If this holds, accept x as a realization of f(x);
if not, reject the value of x and repeat the sampling step.
M > 1 is an appropriate bound on f(x) / g(x).
Proof: the probability of drawing and accepting x in one round is $p_{RS}(x) = g(x) \cdot \frac{f(x)}{M g(x)} = \frac{f(x)}{M}$, which is proportional to f(x).
The acceptance probability is valid because $\frac{f(x)}{M g(x)} \le 1 \iff M \ge \frac{f(x)}{g(x)}$, $\forall x \in D$.
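The algorithm above can be sketched for a small discrete case. The target `TARGET`, the uniform instrumental distribution, and the function names are all made up for illustration; M is set to the largest ratio f(x)/g(x).

```python
import random
from collections import Counter

def rejection_sample(f, g, draw_from_g, M, rng):
    """Return one sample distributed according to f, using candidates drawn from g."""
    while True:
        x = draw_from_g(rng)
        u = rng.random()  # u ~ U(0, 1)
        if u < f(x) / (M * g(x)):
            return x

# Hypothetical target distribution f and uniform instrumental distribution g.
TARGET = {"a": 0.5, "b": 0.3, "c": 0.2}
SUPPORT = list(TARGET)

def f(x):
    return TARGET[x]

def g(x):
    return 1.0 / len(SUPPORT)

def draw_from_g(rng):
    return rng.choice(SUPPORT)

M = max(f(x) / g(x) for x in SUPPORT)  # smallest valid bound on f/g, here 1.5

rng = random.Random(0)
counts = Counter(rejection_sample(f, g, draw_from_g, M, rng) for _ in range(20000))
print({x: counts[x] / 20000 for x in SUPPORT})
```

The printed frequencies should be close to the target probabilities; a larger M stays correct but rejects more candidates.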
19
Rejection Sampling: An Example
Sampling u.a.r. from the square: g(x), Easy
Sampling u.a.r. from the disc: f(x), Hard
Since f(x) = F on the disc and g(x) = G on the square (both constant), set M = F/G;
Generate a candidate point x from the unit square, g(x);
If x is in the unit disc, f(x) = F ≠ 0, thus f(x)/(M g(x)) = 1, accept x;
If x is in square \ disc, f(x) = 0, thus f(x)/(M g(x)) = 0, reject x;
Therefore, x is sampled u.a.r. from the unit disc.
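A minimal sketch of this example follows. It assumes the enclosing square is [-1, 1] × [-1, 1] and the disc has radius 1 centered at the origin, so that the disc actually fits inside the square; since the acceptance probability is either 0 or 1, no coin toss is needed.

```python
import random

def sample_unit_disc(rng):
    """Draw candidates uniformly from the square; accept the first one inside the disc."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # f(x)/(M g(x)) = 1 inside the disc, 0 outside
            return x, y

rng = random.Random(0)
points = [sample_unit_disc(rng) for _ in range(5000)]
inner_fraction = sum(1 for x, y in points if x * x + y * y <= 0.25) / len(points)
print("fraction within radius 0.5:", inner_fraction)
```

For a uniform sample, the fraction of points within radius 0.5 should be near 1/4, the ratio of the two disc areas.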
20
Monte Carlo Simulation
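π: Target distribution
In our case: π = uniform on D
p: Trial distribution
In our case: p = degree distribution
Bias weight of p(x) relative to π(x): $w(x) \propto \pi(x) / p(x)$
In our case: $w(x) \propto 1/\deg(x)$
[Figure: a p-Sampler emits weighted samples (x_1, w(x_1)), (x_2, w(x_2)), … from p; the Monte Carlo Simulator turns them into a sample x from π; together they form a π-Sampler.]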
21
Bias Weights
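Unnormalized forms of π and p: $\pi(x) = \hat{\pi}(x) / Z_{\pi}$, $p(x) = \hat{p}(x) / Z_{p}$
$Z_{\pi}, Z_{p}$: (unknown) normalization constants
Examples:
π = uniform: $\hat{\pi}(x) = 1$
p = degree distribution: $\hat{p}(x) = \deg(x)$
Bias weight (up to the normalization constants): $w(x) = \hat{\pi}(x) / \hat{p}(x) = 1/\deg(x)$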
22
Rejection Sampling [von Neumann]
C: envelope constant
C ≥ w(x) for all x
The algorithm:
accept := false
while (not accept)
    generate a sample x from p
    toss a coin whose heads probability is w(x)/C
    if coin comes up heads,
        accept := true
return x
In our case: C = 1 and acceptance prob = 1/deg(x)
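The loop above can be sketched for the slides' case, with bias weight 1/deg(x) and envelope constant C = 1. The hyperedge membership in `POOL_RESULTS` is an assumption reconstructed from the example's cardinalities and degrees, and `draw_by_degree` is a hypothetical stand-in that samples the degree distribution directly, which a real sampler cannot do.

```python
import random
from collections import Counter

# Assumed hyperedge membership for the example (reconstructed, not from the source figure).
POOL_RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
DOCS = sorted({x for docs in POOL_RESULTS.values() for x in docs})
DEG = {x: sum(x in docs for docs in POOL_RESULTS.values()) for x in DOCS}

def draw_by_degree(rng):
    """Trial distribution p: documents drawn proportionally to their degree."""
    return rng.choices(DOCS, weights=[DEG[x] for x in DOCS], k=1)[0]

def draw_uniform_by_rejection(rng, C=1.0):
    """Rejection sampling with bias weight w(x) = 1/deg(x) and envelope constant C."""
    accept = False
    while not accept:
        x = draw_by_degree(rng)
        w = 1.0 / DEG[x]
        if rng.random() < w / C:  # coin with heads probability w(x)/C
            accept = True
    return x

rng = random.Random(0)
counts = Counter(draw_uniform_by_rejection(rng) for _ in range(20000))
print({x: round(counts[x] / 20000, 3) for x in DOCS})
```

Each of the ten documents should come out with frequency near 1/10, even though the trial distribution favors the degree-2 documents.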
23
Pool-Based Sampler
[Figure: the degree distribution sampler sends queries q_1, q_2, … to the search engine and receives results(q_1), results(q_2), …; it emits documents sampled from the degree distribution with corresponding weights, (x_1, 1/deg(x_1)), (x_2, 1/deg(x_2)), …; rejection sampling turns these into a uniform sample x.]
Degree distribution: $p(x) = \deg(x) / \sum_{x'} \deg(x')$
24
Sampling documents by degree
Select a random query q
Select a random x ∈ results(q)
Documents with high degree are more likely to be sampled
If we sample q uniformly, we “oversample” documents that belong to narrow queries: the weights of queries are different.
We need to sample q proportionally to its cardinality
[Figure: the example query pool hypergraph, with hyperedges “news”, “bbc”, “google”, “maps” over the ten example documents.]
25
Sampling documents by degree (2)
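Select a query q proportionally to its cardinality
Select a random x ∈ results(q)
Analysis:
$\Pr[x] = \sum_{q \in queries(x)} \frac{card(q)}{\sum_{q' \in P} card(q')} \cdot \frac{1}{card(q)} = \frac{\deg(x)}{\sum_{q' \in P} card(q')}$
Since $\sum_{q' \in P} card(q') = \sum_{x'} \deg(x')$, this is the degree distribution.
[Figure: the example query pool hypergraph, with hyperedges “news”, “bbc”, “google”, “maps” over the ten example documents.]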
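The two-step procedure can be checked empirically on the running example. The hyperedge membership in `POOL_RESULTS` is an assumption reconstructed from the example's cardinalities and degrees, and this sketch reads cardinalities straight from the table, which sidesteps the real difficulty of not knowing them in advance.

```python
import random
from collections import Counter

# Assumed hyperedge membership for the example (reconstructed, not from the source figure).
POOL_RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}

def draw_query_by_cardinality(rng):
    """Step 1: pick a query with probability proportional to card(q)."""
    pool = list(POOL_RESULTS)
    weights = [len(POOL_RESULTS[q]) for q in pool]
    return rng.choices(pool, weights=weights, k=1)[0]

def draw_document_by_degree(rng):
    """Step 2: pick a document uniformly from results(q)."""
    q = draw_query_by_cardinality(rng)
    return rng.choice(POOL_RESULTS[q])

rng = random.Random(0)
n = 26000
counts = Counter(draw_document_by_degree(rng) for _ in range(n))
print("news.bbc.co.uk:", counts["news.bbc.co.uk"] / n, "expected about", 2 / 13)
print("www.cnn.com:", counts["www.cnn.com"] / n, "expected about", 1 / 13)
```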
26
Degree Distribution Sampler
[Figure: a cardinality distribution sampler emits a query q sampled from the cardinality distribution; q is sent to the search engine, which returns results(q); a document x is then sampled uniformly from results(q), so x is sampled from the degree distribution.]
27
Sampling queries by cardinality
Sampling queries from pool uniformly: Easy
Sampling queries from pool by cardinality: Hard
Requires knowing cardinalities of all queries in the search engine
Use Monte Carlo methods to simulate biased sampling via uniform sampling:
Target distribution: the cardinality distribution
Trial distribution: uniform distribution on the query pool
28
Sampling queries by cardinality
Bias weight of cardinality distribution relative to the uniform distribution: $w(q) = card(q)$
Can be computed using a single search engine query
Use rejection sampling:
Envelope constant for rejection sampling: $C = k$
Queries are sampled uniformly from the pool
Each query q is accepted with probability $card(q) / k$
29
Complete Pool-Based Sampler
[Figure: a uniform query sampler emits a uniform query sample with weights, (q, card(q)), …; rejection sampling turns it into queries sampled from the cardinality distribution; the degree distribution sampler sends each query to the search engine, receives (q, results(q)), …, and emits documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …; a second rejection sampling step outputs a uniform document sample.]
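The whole pipeline can be sketched end to end on the running example. Several things here are assumptions: the hyperedge membership in `POOL_RESULTS` is reconstructed from the example's numbers, `search` is a hypothetical stand-in for the search engine, the result limit `K` is chosen so that no query overflows, and `degree` is computed by scanning the whole pool, which is only feasible in a toy setting.

```python
import random
from collections import Counter

# Assumed hyperedge membership for the example (reconstructed, not from the source figure).
POOL_RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
POOL = list(POOL_RESULTS)
K = 4  # hypothetical result limit; no query in this pool overflows

def search(q):
    """Stand-in for the search engine: top K results for query q."""
    return POOL_RESULTS[q][:K]

def degree(x):
    """Toy degree computation by scanning the pool."""
    return sum(1 for docs in POOL_RESULTS.values() if x in docs)

def draw_query_by_cardinality(rng):
    """Rejection sampling: uniform queries, accepted with probability card(q)/K."""
    while True:
        q = rng.choice(POOL)
        results = search(q)
        if rng.random() < len(results) / K:
            return q, results

def pool_based_sample(rng):
    """Degree-distributed documents, accepted with probability 1/deg(x)."""
    while True:
        _, results = draw_query_by_cardinality(rng)
        x = rng.choice(results)
        if rng.random() < 1.0 / degree(x):
            return x

rng = random.Random(0)
counts = Counter(pool_based_sample(rng) for _ in range(20000))
print({x: round(c / 20000, 3) for x, c in sorted(counts.items())})
```

Each accepted query costs one search-engine call in this sketch, so the acceptance rates of the two rejection steps determine the query cost per uniform sample.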
30
Dealing with Overflowing Queries
Problem: Some queries may overflow (card(q) > k)
Bias towards highly ranked documents
Solutions:
Select a pool P in which overflowing queries are rare (e.g., phrase queries)
Skip overflowing queries
Adapt rejection sampling to deal with approximate weights
Theorem: Samples of PB sampler are at most ε away from uniform. (ε = overflow probability of P)
31
Creating the query pool
[Figure: a large corpus C is processed into a query pool P = {q_1, q_2, …}.]
Example:
P = all 3-word phrases that occur in C
If “to be or not to be” occurs in C, P contains:
“to be or”, “be or not”, “or not to”, “not to be”
Choose P that “covers” most documents in D
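Building such a pool from a corpus takes only a few lines. In this sketch, splitting on whitespace and lowercasing are assumptions; the slides do not say how phrases are tokenized.

```python
def three_word_phrases(text):
    """All consecutive 3-word phrases in text, in order of appearance."""
    words = text.lower().split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def build_query_pool(corpus):
    """Query pool P: the set of 3-word phrases occurring in any corpus document."""
    pool = set()
    for document in corpus:
        pool.update(three_word_phrases(document))
    return pool

print(three_word_phrases("to be or not to be"))
```

A document with fewer than three words contributes no phrases, so it cannot be covered by this pool.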
32
A random walk sampler
Define a graph G over the indexed documents
(x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅
Run a random walk on G
Limit distribution = degree distribution
Use MCMC methods to make limit distribution uniform.
Metropolis-Hastings
Maximum-Degree
Does not need a preprocessing step
Less efficient than the pool-based sampler
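A Metropolis-Hastings version of this walk can be sketched on the running example. This is an illustration under assumptions, not the authors' sampler: the hyperedge membership in `POOL_RESULTS` is reconstructed from the example's numbers, the graph is built from the whole pool up front (which the real sampler avoids), self-loops are omitted, and "degree" here means the number of neighbors in G.

```python
import random
from collections import Counter

# Assumed hyperedge membership for the example (reconstructed, not from the source figure).
POOL_RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}

def build_graph(pool_results):
    """Connect two distinct documents iff some pool query returns both."""
    neighbors = {x: set() for docs in pool_results.values() for x in docs}
    for docs in pool_results.values():
        for x in docs:
            neighbors[x].update(y for y in docs if y != x)
    return {x: sorted(ys) for x, ys in neighbors.items()}

def mh_transition_probability(graph, x, y):
    """Probability of moving from x to a neighbor y in one Metropolis-Hastings step."""
    dx, dy = len(graph[x]), len(graph[y])
    return (1.0 / dx) * min(1.0, dx / dy)

def metropolis_hastings_walk(graph, start, steps, rng):
    """Propose a uniform neighbor; accept with min(1, deg(x)/deg(y)), else stay put."""
    x = start
    for _ in range(steps):
        y = rng.choice(graph[x])
        if rng.random() < min(1.0, len(graph[x]) / len(graph[y])):
            x = y
    return x

graph = build_graph(POOL_RESULTS)
rng = random.Random(0)
print(metropolis_hastings_walk(graph, "www.cnn.com", 300, rng))
```

The transition probabilities are symmetric, min(1/deg(x), 1/deg(y)), which is why the uniform distribution is stationary for this walk.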
33
Bias towards Long Documents
[Figure: bar chart. x-axis: deciles of documents ordered by size (1 to 10); y-axis: percent of documents from sample (0% to 60%); series: Pool Based, Random Walk, BharatBroder.]
34
Relative Sizes of Google, MSN and Yahoo!
Google = 1
Yahoo! = 1.28
MSN Search = 0.73
35
Top-Level Domains in Google, MSN and Yahoo!
[Figure: bar chart. x-axis: top level domain name (com, org, net, uk, edu, de, au, gov, ca, us, it, no, es, ie, info); y-axis: percent of documents from sample (0% to 60%); series: Google, MSN, Yahoo!.]
36
Conclusions
Two new search engine samplers
Pool-based sampler
Random walk sampler
Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.
Samplers show little or no bias in experiments.
37
Thank You