Random Sampling from a Search Engine's Index - Department of ...


Nov 18, 2013 (3 years and 9 months ago)


Random Sampling from a Search Engine’s Index

Ziv Bar-Yossef and Maxim Gurevich

Department of Electrical Engineering, Technion


Presentation at group meeting, Oct. 24

Allen, Zhenjiang Lin



Outline


Introduction


Search Engine Samplers


Motivation


The Bharat-Broder Sampler (WWW’98)


Infrastructure of Proposed Methods


Search Engines as Hypergraphs


Monte Carlo Simulation Methods


Rejection Sampling


The Pool-based Sampler


The Random Walk Sampler


Experimental Results


Conclusions


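Search Engine Samplers

[Diagram: the sampler sends queries to the search engine's public interface and receives the top k results for each; it outputs a random document x ∈ D, where D is the set of documents from the Web that the search engine has indexed.]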


Motivation


Useful tool for search engine evaluation:


Freshness


Fraction of up-to-date pages in the index


Topical bias


Identification of overrepresented/underrepresented topics


Spam


Fraction of spam pages in the index


Security


Fraction of pages in index infected by viruses/worms/trojans


Relative Size


Number of documents indexed compared with other search engines

Size Wars

August 2005: We index 20 billion documents.

September 2005: We index 8 billion documents, but our index is 3 times larger than our competition’s.

So, who’s right?


Why Does Size Matter, Anyway?


Comprehensiveness


A good crawler covers the most documents possible


Narrow-topic queries


E.g., get homepage of John Doe



Prestige


A marketing advantage

Measuring size using random samples

[BharatBroder98, CheneyPerry05, GulliSignorni05]


Sample pages uniformly at random from the
search engine’s index


Two alternatives


Absolute size estimation


Sample until collision


Collision expected after $k \sim N^{1/2}$ random samples, where N is the number of indexed documents (birthday paradox)


Return $k^2$


Relative size estimation


Check how many samples from search engine A are present in search engine B and vice versa
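
A minimal Python sketch of the two estimators, assuming a uniform sampler is already available. The function names, the `max_draws` cap, and the membership tests `in_a` / `in_b` are hypothetical. The ratio used for the relative size is one standard way to combine the two overlap fractions and is an assumption here, not something the slide spells out.

```python
def estimate_absolute_size(sample_uniform, max_draws=10_000_000):
    """Draw uniform samples until the first repeat and return k**2.

    k is the number of draws made when the first collision occurs.
    This is a crude, order-of-magnitude estimate of the index size N.
    """
    seen = set()
    for k in range(1, max_draws + 1):
        x = sample_uniform()
        if x in seen:
            return k * k
        seen.add(x)
    raise RuntimeError("no collision within max_draws")


def estimate_relative_size(samples_a, samples_b, in_a, in_b):
    """Estimate |A| / |B| from uniform samples of each index.

    The fraction of A-samples found in B approximates |A and B| / |A|,
    and the fraction of B-samples found in A approximates |A and B| / |B|,
    so dividing the second by the first approximates |A| / |B|.
    Fails with ZeroDivisionError if no A-sample is found in B.
    """
    a_in_b = sum(1 for x in samples_a if in_b(x)) / len(samples_a)
    b_in_a = sum(1 for x in samples_b if in_a(x)) / len(samples_b)
    return b_in_a / a_in_b
```

Both estimators are only as good as the uniformity of the samples they are fed, which is why the rest of the talk concentrates on the sampler itself.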

Related Work


Random Sampling from a Search Engine’s Index

[BharatBroder98, CheneyPerry05, GulliSignorni05]


Anecdotal queries


[SearchEngineWatch, Google, BradlowSchmittlein00]


Queries from user query logs [LawrenceGiles98, DobraFeinberg04]


Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

The Bharat-Broder Sampler: Preprocessing Step

[Diagram: a large corpus C is processed into a lexicon L, a list of terms with their frequencies in C: $t_1$, freq($t_1$, C); $t_2$, freq($t_2$, C); …]

The Bharat-Broder Sampler

[Diagram: the BB sampler draws two random terms $t_1$, $t_2$ from the lexicon L, sends the query "$t_1$ AND $t_2$" to the search engine, receives the top k results, and outputs a random document from the top k results.]

Samples are uniform only if:

all queries return the same number of results ≤ k

all documents are of the same length
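
The query loop of this sampler can be sketched in a few lines of Python. This is an illustration under assumptions, not the original implementation: terms are drawn uniformly from the lexicon, empty result sets are simply retried, and `make_toy_search` is a hypothetical stand-in for a real search engine interface.

```python
import random


def make_toy_search(docs):
    """Hypothetical search engine: docs maps a document name to its set of terms."""
    def search(query):
        terms = query.split(" AND ")
        # Rank alphabetically; a real engine would rank by its own static score.
        return sorted(d for d, words in docs.items() if all(t in words for t in terms))
    return search


def bharat_broder_sample(lexicon, search, k, rng=random, max_tries=1000):
    """Pick two random terms, query 't1 AND t2', return a random top-k result."""
    for _ in range(max_tries):
        t1, t2 = rng.sample(lexicon, 2)           # two distinct random terms
        results = search(f"{t1} AND {t2}")[:k]    # only the top k are visible
        if results:
            return rng.choice(results)
    raise RuntimeError("no query returned results")
```

With a small k, documents that rank low for every query can never be returned, and documents containing many terms match more queries; both effects are visible even in a toy index.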

The Bharat-Broder Sampler: Drawbacks


Documents have varying lengths


Bias towards long documents




Some queries have more than k matches


Bias towards documents with high static rank

Two novel samplers


A pool-based sampler


Guaranteed to produce near-uniform samples


Needs a lexicon / query pool


A random walk sampler


After sufficiently many steps, guaranteed to produce near-uniform samples


Does not need an explicit lexicon / pool at all!

Focus of this talk


Search Engines as Hypergraphs


results(q)

= { documents returned on query q }


queries(x)

= { queries that return x as a result }


P

= query pool = a set of queries


Query pool hypergraph:


Vertices:


Indexed documents


Hyperedges: { results(q) | q ∈ P }

[Figure: example query pool hypergraph. Vertices (documents): www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoot.com, en.wikipedia.org/wiki/BBC. Hyperedges (queries): “news”, “bbc”, “google”, “maps”.]

Query Cardinalities and Document Degrees


Query cardinality:


card(q) = |results(q)|


Document degree:
deg(x) = |queries(x)|


Examples:


card(“news”) = 4, card(“bbc”) = 3


deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
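
The definitions can be checked on a small example in Python. The query-to-results mapping below is an assumption: the figure's edges are not recoverable from the text, so the mapping was chosen to agree with the cardinalities and degrees quoted in the examples.

```python
# Hypothetical results(q) for each query in the pool (assumed, see lead-in).
RESULTS = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def card(q):
    """Query cardinality: number of documents returned on q."""
    return len(RESULTS[q])


def queries(x):
    """Set of pool queries that return document x."""
    return {q for q, docs in RESULTS.items() if x in docs}


def deg(x):
    """Document degree: number of pool queries that return x."""
    return len(queries(x))


DOCS = sorted({x for docs in RESULTS.values() for x in docs})
TOTAL_DEGREE = sum(deg(x) for x in DOCS)   # equals the sum of all cardinalities
```

Summing degrees over documents and summing cardinalities over queries count the same (query, document) pairs, so the two totals always agree.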


Sampling documents uniformly


Sampling documents from D uniformly


Hard


Sampling documents from D non-uniformly: Easier


Will show later: can sample documents proportionally to their degrees:

$p(x) = \deg(x) / \sum_{x'} \deg(x')$

Sampling documents by degree



p(news.bbc.co.uk) = 2/13


p(www.cnn.com) = 1/13


Monte Carlo Simulation


We need:

Samples from the
uniform distribution


We have:

Samples from the
degree distribution


Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?


Yes!

Monte Carlo Simulation Methods: Rejection Sampling, Importance Sampling, Metropolis-Hastings, Maximum-Degree

Rejection Sampling Algorithm

Sampling values from an arbitrary probability distribution $f(x)$ by using an instrumental distribution $g(x)$.

The algorithm (due to John von Neumann) is as follows:


Sample $x$ from $g(x)$ and $u$ from $U(0,1)$.


Check whether or not $u < f(x) / (M g(x))$.


If this holds, accept $x$ as a realization of $f(x)$;


if not, reject the value of $x$ and repeat the sampling step.

$M > 1$ is an appropriate bound on $f(x) / g(x)$.


Proof: the probability of proposing $x$ and accepting it in one round is $p_{RS}(x) = g(x) \cdot f(x) / (M g(x)) = f(x) / M$, which is proportional to $f(x)$.

The acceptance probability is valid because $f(x) / (M g(x)) \le 1 \iff M \ge f(x) / g(x)$, $\forall x \in D$.
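
The steps above can be sketched generically in Python. The three-point target distribution, the uniform proposal, and all names are assumptions chosen for illustration.

```python
import random


def rejection_sample(f, g, g_sample, M, rng):
    """Draw one sample from density f using proposals from density g.

    Requires f(x) <= M * g(x) for every x that g_sample can return.
    """
    while True:
        x = g_sample(rng)            # candidate from the instrumental distribution
        u = rng.random()             # u ~ U(0, 1)
        if u < f(x) / (M * g(x)):    # accept with probability f(x) / (M g(x))
            return x


# Toy target on {0, 1, 2}; the proposal is uniform on the same set.
TARGET = {0: 0.5, 1: 0.3, 2: 0.2}
f = TARGET.get
g = lambda x: 1.0 / 3.0
g_sample = lambda rng: rng.randrange(3)
M = 1.5   # the largest ratio f/g is 0.5 / (1/3) = 1.5
```

Calling `rejection_sample(f, g, g_sample, M, random.Random(0))` repeatedly yields values whose frequencies approach those in `TARGET`; a larger M stays correct but rejects more candidates.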

Rejection Sampling: An Example


Sampling u.a.r. from the square, $g(x)$: Easy


Sampling u.a.r. from the disc, $f(x)$: Hard


Since $f(x) = F$ on the disc and $g(x) = G$ on the square (both constant), set $M = F/G$;


Generate a candidate point $x$ from the unit square, $g(x)$;


If $x$ is in the unit disc, $f(x) = F \ne 0$, thus $f(x) / (M g(x)) = 1$, accept $x$;


If $x$ is in the square but not in the disc (square \ disc), $f(x) = 0$, thus $f(x) / (M g(x)) = 0$, reject $x$;


Therefore, $x$ is sampled u.a.r. from the unit disc.
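
A short Python sketch of this example. As an assumption for concreteness, the disc has radius 1 and the proposal square is [-1, 1] x [-1, 1], so the disc fits inside the square; the acceptance test reduces to checking whether the candidate lies in the disc.

```python
import random


def sample_disc(rng):
    """Return a point distributed uniformly in the disc of radius 1."""
    while True:
        # Candidate drawn uniformly from the enclosing square.
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        # Inside the disc the acceptance probability is 1, outside it is 0.
        if x * x + y * y <= 1.0:
            return x, y
```

The expected acceptance rate is the area ratio of disc to square, about 0.785, so on average fewer than two candidates are needed per sample.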

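Monte Carlo Simulation


$\pi$: Target distribution


In our case: $\pi$ = uniform on D


$p$: Trial distribution


In our case: $p$ = degree distribution


Bias weight of $p(x)$ relative to $\pi(x)$: $w(x) \propto \pi(x) / p(x)$


In our case: $w(x) = 1/\deg(x)$

[Diagram: a p-sampler supplies samples from $p$ with their weights, $(x_1, w(x_1)), (x_2, w(x_2)), \ldots$, to a Monte Carlo simulator, which outputs a sample $x$ from $\pi$; together they form a $\pi$-sampler.]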

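Bias Weights


Unnormalized forms of $\pi$ and $p$: $\pi(x) = \hat{\pi}(x) / Z_\pi$, $p(x) = \hat{p}(x) / Z_p$


$Z_\pi$, $Z_p$: (unknown) normalization constants


Examples:


$\pi$ = uniform: $\hat{\pi}(x) = 1$


$p$ = degree distribution: $\hat{p}(x) = \deg(x)$


Bias weight: $w(x) = \hat{\pi}(x) / \hat{p}(x) = 1/\deg(x)$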

Rejection Sampling [von Neumann]


C: envelope constant


C ≥ w(x) for all x


The algorithm:


accept := false

while (not accept)

    generate a sample x from p

    toss a coin whose heads probability is w(x) / C

    if coin comes up heads, accept := true

return x


In our case: C = 1 and acceptance prob = 1/deg(x)
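
A runnable Python sketch of this loop for the case C = 1 and w(x) = 1/deg(x). The toy query-to-results mapping is an assumption (it matches the degrees and cardinalities quoted in the earlier examples), and the degree-distribution sampler is simulated by picking a (query, document) pair uniformly.

```python
import random

# Hypothetical results(q) for a four-query pool (assumed for illustration).
RESULTS = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
DOCS = sorted({x for docs in RESULTS.values() for x in docs})
DEG = {x: sum(x in docs for docs in RESULTS.values()) for x in DOCS}
INCIDENCES = [(q, x) for q, docs in RESULTS.items() for x in docs]


def sample_by_degree(rng):
    """Trial distribution p: a uniform (query, document) pair gives x with prob deg(x)/13."""
    return rng.choice(INCIDENCES)[1]


def sample_uniform(rng, C=1.0):
    """Rejection sampling: accept x from p with probability w(x)/C, w(x) = 1/deg(x)."""
    while True:
        x = sample_by_degree(rng)
        w = 1.0 / DEG[x]
        if rng.random() < w / C:     # the coin toss with heads probability w(x)/C
            return x
```

On this toy pool a candidate is accepted with overall probability 10/13, so the correction to uniform costs only a few extra draws.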

Pool-Based Sampler

[Diagram: the degree distribution sampler sends queries $q_1, q_2, \ldots$ to the search engine and receives results($q_1$), results($q_2$), …; it outputs documents sampled from the degree distribution with corresponding weights, $(x_1, 1/\deg(x_1)), (x_2, 1/\deg(x_2)), \ldots$, which rejection sampling turns into a uniform sample $x$.]

Degree distribution: $p(x) = \deg(x) / \sum_{x'} \deg(x')$

Sampling documents by degree


Select a random query q


Select a random x ∈ results(q)


Documents with high degree are more likely to be sampled


If we sample q uniformly, we “oversample” documents that belong to narrow queries: each document in results(q) is chosen with probability 1/card(q), so the weights of queries are different.


We need to sample q proportionally to its cardinality

Sampling documents by degree (2)


Select a query q proportionally to its cardinality


Select a random x ∈ results(q)


Analysis:

$\Pr(x) = \sum_{q \in \text{queries}(x)} \frac{\text{card}(q)}{\sum_{q' \in P} \text{card}(q')} \cdot \frac{1}{\text{card}(q)} = \frac{\deg(x)}{\sum_{q' \in P} \text{card}(q')}$

Since $\sum_{q' \in P} \text{card}(q') = \sum_{x'} \deg(x')$, this is the degree distribution.
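
The effect of the query distribution can be computed exactly on a toy pool. In the Python sketch below the query-to-results mapping is an assumption (chosen to match the cardinalities quoted earlier), and exact fractions are used so that no sampling noise is involved.

```python
from fractions import Fraction

# Hypothetical results(q) for a four-query pool (assumed for illustration).
RESULTS = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def doc_distribution(query_probs):
    """Exact Pr(x) when q is drawn from query_probs and x uniformly from results(q)."""
    dist = {}
    for q, pq in query_probs.items():
        for x in RESULTS[q]:
            dist[x] = dist.get(x, Fraction(0)) + pq * Fraction(1, len(RESULTS[q]))
    return dist


total = sum(len(docs) for docs in RESULTS.values())
uniform_q = {q: Fraction(1, len(RESULTS)) for q in RESULTS}
by_card = {q: Fraction(len(docs), total) for q, docs in RESULTS.items()}

uniform_case = doc_distribution(uniform_q)   # degree-1 documents get unequal mass
card_case = doc_distribution(by_card)        # every document gets deg(x) / total
```

With uniform queries, two documents of degree 1 receive different probabilities depending on how narrow their query is; with cardinality-proportional queries, every document's probability depends only on its degree.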

Degree Distribution Sampler

[Diagram: the cardinality distribution sampler outputs a query q sampled from the cardinality distribution; q is sent to the search engine, which returns results(q); x is then sampled uniformly from results(q), giving a document sampled from the degree distribution.]

Sampling queries by cardinality


Sampling queries from pool uniformly: Easy


Sampling queries from pool by cardinality: Hard


Requires knowing cardinalities of all queries in the search engine


Use Monte Carlo methods to simulate biased sampling via uniform sampling:


Target distribution: the cardinality distribution


Trial distribution: uniform distribution on the query pool

Sampling queries by cardinality


Bias weight of cardinality distribution relative to the uniform distribution: $w(q) = \text{card}(q)$


Can be computed using a single search engine query


Use rejection sampling:


Envelope constant for rejection sampling: $C = k$ (card(q) ≤ k for every query that does not overflow)


Queries are sampled uniformly from the pool


Each query q is accepted with probability $\text{card}(q) / k$
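
A Python sketch of this step, assuming the envelope constant is the result limit k and that overflowing queries are simply skipped. The function name and the `card` callback, which stands for one search engine query, are hypothetical.

```python
import random


def sample_query_by_cardinality(pool, card, k, rng):
    """Return a query from pool with probability proportional to card(q).

    Loops forever if every query in the pool is empty or overflows.
    """
    while True:
        q = rng.choice(pool)         # uniform trial sample from the pool
        c = card(q)                  # costs a single search engine query
        if c > k:                    # overflowing query: skipped in this sketch
            continue
        if rng.random() < c / k:     # accept with probability card(q) / k
            return q
```

Queries with no results are never accepted, and the expected number of search engine queries per accepted sample is k divided by the average cardinality over the pool.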

Complete Pool-Based Sampler

[Diagram: a uniform query sampler produces a uniform query sample with weights, (q, card(q)), …; rejection sampling turns it into queries sampled from the cardinality distribution; each such query is sent to the search engine, giving (q, results(q)), …; the degree distribution sampler then outputs documents sampled from the degree distribution with corresponding weights, (x, 1/deg(x)), …; a second rejection sampling step outputs a uniform document sample x.]
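
The whole pipeline can be put together on a toy pool. This Python sketch rests on several assumptions: the query-to-results mapping is hypothetical, the result limit K is set so that no query overflows, and deg(x) is read off the toy mapping rather than computed from the document's content.

```python
import random

# Hypothetical results(q) for a four-query pool (assumed for illustration).
RESULTS = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
POOL = list(RESULTS)
DOCS = sorted({x for docs in RESULTS.values() for x in docs})
K = 4   # assumed result limit; no query in this toy pool overflows


def deg(x):
    """Number of pool queries that return x (read off the toy mapping)."""
    return sum(x in docs for docs in RESULTS.values())


def pool_based_sample(rng):
    """Return one document, uniformly distributed over DOCS."""
    while True:
        # Stage 1: uniform query, accepted with probability card(q) / K.
        q = rng.choice(POOL)
        if rng.random() >= len(RESULTS[q]) / K:
            continue
        # Stage 2: uniform document from results(q), i.e. degree distribution.
        x = rng.choice(RESULTS[q])
        # Stage 3: accept with probability 1 / deg(x) to cancel the degree bias.
        if rng.random() < 1.0 / deg(x):
            return x
```

Each round outputs a given document with probability 1/16 regardless of its degree, which is why the accepted samples are uniform.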

Dealing with Overflowing Queries


Problem: Some queries may overflow (card(q) > k)


Bias towards highly ranked documents


Solutions:


Select a pool P in which overflowing queries are rare (e.g., phrase queries)


Skip overflowing queries


Adapt rejection sampling to deal with approximate weights


Theorem: Samples of the pool-based (PB) sampler are at most ε-away from uniform.

(ε = overflow probability of P)

Creating the query pool

[Diagram: a large corpus C is processed into the query pool P: $q_1$, $q_2$, …]


Example: P = all 3-word phrases that occur in C


If “to be or not to be” occurs in C, P contains: “to be or”, “be or not”, “or not to”, “not to be”


Choose P that “covers” most documents in D
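
A minimal Python sketch of building such a pool, assuming whitespace tokenization and a corpus given as a list of strings; the function name `phrase_pool` is hypothetical.

```python
def phrase_pool(corpus_texts, n=3):
    """Collect every run of n consecutive words that occurs in the corpus."""
    pool = set()
    for text in corpus_texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool
```

Applied to the single text "to be or not to be", this returns the four phrases listed above; a real pool would also need lowercasing and punctuation handling, which are left out here.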

A random walk sampler


Define a graph G over the indexed documents


(x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅


Run a random walk on G


Limit distribution = degree distribution


Use MCMC methods to make limit distribution uniform.


Metropolis-Hastings


Maximum-Degree


Does not need a preprocessing step


Less efficient than the pool-based sampler
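
A Python sketch of the Metropolis-Hastings variant on an explicit toy graph. Everything here is an assumption for illustration: the query-to-results mapping is hypothetical, the walk moves to a uniformly chosen neighbor in G, and "degree" in the acceptance rule means the number of neighbors in G. A real sampler would have to discover neighbors through search engine queries.

```python
import random
from fractions import Fraction

# Hypothetical results(q) for a four-query pool (assumed for illustration).
RESULTS = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}
DOCS = sorted({x for docs in RESULTS.values() for x in docs})


def neighbors(x):
    """Documents y != x that share at least one query with x."""
    return sorted({y for docs in RESULTS.values() if x in docs for y in docs if y != x})


def mh_step(x, rng):
    """Propose a uniform neighbor y; move with probability min(1, d(x)/d(y))."""
    y = rng.choice(neighbors(x))
    if rng.random() < min(1.0, len(neighbors(x)) / len(neighbors(y))):
        return y
    return x


def random_walk_sample(start, steps, rng):
    """Run the corrected walk for a fixed number of steps and return the end point."""
    x = start
    for _ in range(steps):
        x = mh_step(x, rng)
    return x


def transition_row(x):
    """Exact one-step transition probabilities from x, for checking stationarity."""
    nx = neighbors(x)
    row = {x: Fraction(0)}
    for y in nx:
        move = Fraction(1, len(nx)) * min(Fraction(1), Fraction(len(nx), len(neighbors(y))))
        row[y] = move
        row[x] += Fraction(1, len(nx)) - move   # rejected proposals stay at x
    return row
```

The transition matrix built from `transition_row` is symmetric, so the uniform distribution is stationary; how many steps are needed before the walk is close to uniform depends on the graph and is not addressed by this sketch.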

Bias towards Long Documents

[Figure: percent of documents from sample (0% to 60%) for each decile (1 to 10) of documents ordered by size, comparing the Pool Based, Random Walk, and Bharat-Broder samplers.]

Relative Sizes of Google, MSN and Yahoo!

Google = 1

Yahoo! = 1.28

MSN Search = 0.73

Top-Level Domains in Google, MSN and Yahoo!

[Figure: percent of documents from sample (0% to 60%) by top level domain name (com, org, net, uk, edu, de, au, gov, ca, us, it, no, es, ie, info), comparing Google, MSN, and Yahoo!]

Conclusions


Two new search engine samplers


Pool-based sampler


Random walk sampler


Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.


Samplers show little or no bias in experiments.

Thank You