Efficient Search Engine Measurements
Maxim Gurevich
Technion
Ziv Bar-Yossef
Technion and Google
Search Engine Benchmarks
State of the art:
No objective benchmarks for search engines
Need to rely on “anecdotal” studies or on subjective search engine reports
Users, advertisers, partners cannot compare search engines
Our goal:
Design search engine benchmarking techniques
Accurate
Efficient
Objective
Transparent
Search Engine Corpus Evaluation
Corpus size
How many pages are indexed?
Search engine overlap
What fraction of the pages indexed by search engine A are also indexed by search engine B?
Freshness
How old are the pages in the index?
Spam resilience
What fraction of the pages in the index are spam?
Duplicates
How many unique pages are there in the index?
Search Engine Corpus Metrics
[Diagram: the search engine holds an index of documents D taken from the web and exposes it through a public interface. Callouts on the slide: “Target function” and “Focus of this talk”.]
• Corpus size
• Number of unique pages
• Overlap
• Average age of a page
Search Engine Estimators
[Diagram: the estimator sends queries to the search engine's public interface, receives the top k results for each query, and outputs an estimate of |D|, where D is the set of indexed documents.]
Success Criteria
Estimation accuracy:
Bias: E(Estimate − |D|)
Amortized cost (cost times variance):
Amortized query cost
Amortized fetch cost
Amortized function cost
Previous Work
Average metrics:
Anecdotal queries
[SearchEngineWatch, Google, BradlowSchmittlein00]
Queries from user query logs
[LawrenceGiles98, DobraFeinberg04]
Random queries
[BharatBroder98, CheneyPerry05, GulliSignorini05,
BarYossefGurevich06, Broder et al 06]
Random sampling from the web
[Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
Sum metrics:
Random queries
[Broder et al 06]
Our Contributions
A new search engine estimator
Applicable to both sum metrics and average metrics
Arbitrary target functions
Arbitrary target distributions (measures)
Less bias than the Broder et al estimator
In one experiment, empirical relative bias was reduced from 75% to 0.01%
More efficient than the BarYossefGurevich06 estimator
In one experiment, query cost was reduced by a factor of 375
Techniques
Approximate ratio importance sampling
Rao-Blackwellization
Roadmap
Recast the Broder et al corpus size estimator as an importance sampling estimator
Describe the “degree mismatch problem” (DMP)
Show how to overcome DMP using approximate ratio importance sampling
Discuss Rao-Blackwellization
Gloss over some experimental results
Query Pools
Pre-processing step: Create a query pool
[Diagram: a training corpus C of web documents yields the query pool P = {q1, q2, …}.]
Working example: P = all length-3 phrases that occur in C
If “to be or not to be” occurs in C, P contains: “to be or”, “be or not”, “or not to”, “not to be”
Choose P that “covers” most documents in D
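The working example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and splitting on whitespace after lowercasing is an assumed tokenization.

```python
def length3_phrases(text):
    """Return the set of all phrases of three consecutive words in text."""
    words = text.lower().split()  # assumed tokenization: whitespace, lowercased
    return {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}


def build_query_pool(corpus):
    """Union of the length-3 phrases of every document in the training corpus C."""
    pool = set()
    for document in corpus:
        pool |= length3_phrases(document)
    return pool


pool = build_query_pool(["to be or not to be"])
print(sorted(pool))
```

A document with fewer than three words contributes no phrases, so it is not covered by this pool.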
The Search Engine Graph
P = query pool
neighbors(q) = { documents returned on query q }
deg(q) = |neighbors(q)|
neighbors(x) = { queries that return x as a result }
deg(x) = |neighbors(x)|
[Figure: bipartite graph. Queries: “news”, “bbc”, “google”, “maps”. Documents: www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, maps.yahoo.com, www.bbc.co.uk, www.mapquest.com, en.wikipedia.org/wiki/BBC.]
deg(“news”) = 4, deg(“bbc”) = 3
deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
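The graph definitions can be made concrete with a small sketch. The edge list below is an assumption: the figure's edges are not in the slide text, so they are chosen only to be consistent with the four degrees stated above.

```python
# Hypothetical edges (query -> returned documents), consistent with the stated degrees.
neighbors_of_query = {
    "news": {"www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"},
    "bbc": {"news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"},
    "google": {"news.google.com", "www.google.com", "maps.google.com"},
    "maps": {"maps.google.com", "maps.yahoo.com", "www.mapquest.com"},
}


def invert(graph):
    """Map each document x to neighbors(x), the queries that return it."""
    inverted = {}
    for q, docs in graph.items():
        for x in docs:
            inverted.setdefault(x, set()).add(q)
    return inverted


neighbors_of_doc = invert(neighbors_of_query)
deg_query = {q: len(docs) for q, docs in neighbors_of_query.items()}
deg_doc = {x: len(queries) for x, queries in neighbors_of_doc.items()}
```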
Corpus Size as an Integral
E = edges in the queries-documents graph
Lemma: |D| = Σ_{(q,x) ∈ E} 1/deg(x)
Proof:
Contribution of edge (q,x) to the sum: 1/deg(x)
Total contribution of edges incident to x: 1
Total contribution of all edges: |D|
[Figure: the search engine graph restricted to the queries “news”, “bbc”, “google” and the documents www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, en.wikipedia.org/wiki/BBC.]
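The lemma (the sum of 1/deg(x) over all edges equals the number of documents) can be checked exactly on a toy graph. The edge list is hypothetical, since the figure's edges are not in the slide text; exact fractions avoid rounding error.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical edges (query, document) for the three-query figure.
edges = [
    ("news", "www.cnn.com"), ("news", "www.foxnews.com"),
    ("news", "news.google.com"), ("news", "news.bbc.co.uk"),
    ("bbc", "news.bbc.co.uk"), ("bbc", "www.bbc.co.uk"),
    ("bbc", "en.wikipedia.org/wiki/BBC"),
    ("google", "news.google.com"), ("google", "www.google.com"),
    ("google", "maps.google.com"),
]

deg = Counter(x for _, x in edges)            # deg(x) = number of edges incident to x
documents = {x for _, x in edges}             # documents with at least one edge
total = sum(Fraction(1, deg[x]) for _, x in edges)

print(total, len(documents))
```

The identity only counts documents that have at least one incident edge, which is why the query pool should cover most of D.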
Corpus Size as an Integral
Express corpus size as an integral: |D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)
Target measure: π(q,x) = 1/deg(x)
Target function: f(q,x) = 1
Monte Carlo Estimation
Monte Carlo estimation of the integral:
Sample (Q,X) according to π
Output f(Q,X)
Works only if:
π is a proper distribution
We can easily sample from π
BUT,
In our case π is not a distribution
Even if it were, sampling from π = 1/deg(x) may not be easy
So instead, we sample (Q,X) from an easy “trial distribution” p
Sampling Edges, Easily
[Diagram: the sampler draws a random query Q from P, submits it to the search engine, receives the top k results, and outputs (Q,X), where X is a random result of Q.]
Sample an edge (q,x) with probability p(q,x) = 1/(|P| · deg(q))
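A minimal sketch of this edge sampler, assuming a hypothetical in-memory graph in place of a live search engine and assuming every query in the pool returns at least one result:

```python
import random
from fractions import Fraction

# Hypothetical toy graph: query -> list of results (neighbors(q)).
graph = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
}


def sample_edge(graph, rng):
    """Pick a uniform query Q from P, then a uniform result X of Q."""
    q = rng.choice(list(graph))   # assumes every query has at least one result
    x = rng.choice(graph[q])
    return q, x


def edge_probability(graph, q):
    """p(q,x) = 1 / (|P| * deg(q)); the value does not depend on x."""
    return Fraction(1, len(graph) * len(graph[q]))


q, x = sample_edge(graph, random.Random(0))
total = sum(edge_probability(graph, q_) for q_ in graph for _ in graph[q_])
```

Summing p over all edges gives 1, so p is a proper distribution over edges, unlike the target measure.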
Importance Sampling (IS) [Marshal56]
We have: A sample (Q,X) from p
We need: Estimate the integral
So we cannot use simple Monte Carlo estimation
Importance sampling comes to the rescue…
Compute an “importance weight” for (Q,X): w(Q,X) = π(Q,X) / p(Q,X)
Importance sampling estimator: IS(Q,X) = f(Q,X) · w(Q,X)
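A sketch of the importance sampling estimator on a toy graph, taking the weight to be the standard ratio π/p, so that IS(q,x) = |P| · deg(q) / deg(x). The graph and all names are hypothetical. The exact expectation is computed by enumeration and compared with a Monte Carlo average.

```python
import random
from fractions import Fraction

# Hypothetical toy graph: query -> results.
graph = {
    "news": ["cnn", "foxnews", "news.google", "news.bbc"],
    "bbc": ["news.bbc", "www.bbc", "wiki/BBC"],
    "google": ["news.google", "www.google", "maps.google"],
}
documents = {x for results in graph.values() for x in results}
deg_doc = {x: sum(x in results for results in graph.values()) for x in documents}


def is_estimate(q, x):
    """IS(q,x) = f * pi / p = 1 * (1/deg(x)) * (|P| * deg(q))."""
    return Fraction(len(graph) * len(graph[q]), deg_doc[x])


# Exact expectation under p(q,x) = 1 / (|P| * deg(q)).
expectation = sum(
    Fraction(1, len(graph) * len(graph[q])) * is_estimate(q, x)
    for q in graph
    for x in graph[q]
)

# Monte Carlo average of the same estimator over sampled edges.
rng = random.Random(0)
samples = []
for _ in range(10000):
    q = rng.choice(list(graph))
    x = rng.choice(graph[q])
    samples.append(float(is_estimate(q, x)))
average = sum(samples) / len(samples)

print(expectation, len(documents))
```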
IS: Bias Analysis
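The derivation for this slide is not in the extracted text. A sketch of the standard argument, assuming the importance weight is w = π/p, that p(q,x) > 0 on every edge, and using the target measure and target function defined earlier:

```latex
\begin{align*}
\mathbb{E}_p\bigl[\mathrm{IS}(Q,X)\bigr]
  &= \sum_{(q,x)\in E} p(q,x)\, f(q,x)\, \frac{\pi(q,x)}{p(q,x)} \\
  &= \sum_{(q,x)\in E} f(q,x)\, \pi(q,x)
   = \sum_{(q,x)\in E} \frac{1}{\deg(x)}
   = |D| .
\end{align*}
```

Under these assumptions the estimator is unbiased, provided the degrees deg(Q) and deg(X) are computed exactly.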
Computing the Importance Sampling Estimator
We need to compute IS(Q,X) = |P| · deg(Q) / deg(X)
Computing |P| is easy: we know P
How to compute deg(Q) = |neighbors(Q)|?
Since Q was submitted to the search engine, we know deg(Q)
How to compute deg(X) = |neighbors(X)|?
Fetch content of X from the web
pdeg(X) = number of distinct queries from P that X contains
Use pdeg(X) as an estimate for deg(X)
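Computing pdeg(X) from fetched content can be sketched as follows, assuming the length-3 phrase pool from the working example and a whitespace tokenizer. Both the function name and the tokenization are hypothetical.

```python
def pdeg(document_text, pool):
    """Number of distinct queries from the pool that occur in the document text."""
    words = document_text.lower().split()  # assumed tokenization
    phrases = {" ".join(words[i:i + 3]) for i in range(len(words) - 2)}
    return len(phrases & pool)


# Hypothetical query pool; only three of its phrases occur in the text below.
pool = {"to be or", "be or not", "not to be", "search engine index"}
print(pdeg("To be or not to be", pool))
```

If the search engine tokenizes the page differently from this parser, pdeg(X) and deg(X) will disagree.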
The Degree Mismatch Problem (DMP)
In reality, pdeg(X) may be different from deg(X)
Neighbor recall problem:
There may be q ∈ neighbors(x) that do not occur in x
q occurs as “anchor text” in a page linking to x
q occurs in x, but our parser failed to find it
Neighbor precision problem:
There may be q that occur in x, but q ∉ neighbors(x)
q “overflows” (it matches more than k documents, so x is cut from the top k results)
q occurs in x, but the search engine’s parser failed to find it
Implications of DMP
Can only approximate document degrees
Bias of importance sampling estimator may become significant
In one of our experiments, relative bias was 75%
Eliminating the Neighbor Recall Problem
The predicted search engine graph:
pneighbors(x) = queries that occur in x
pneighbors(q) = documents in whose text q occurs
An edge (q,x) is “valid” if it occurs both in the search engine graph and the predicted search engine graph
The valid search engine graph:
vneighbors(x) = neighbors(x) ∩ pneighbors(x)
vneighbors(q) = neighbors(q) ∩ pneighbors(q)
Eliminating the Neighbor Recall Problem (cont.)
We use the valid search engine graph rather than the real search engine graph:
vdeg(q) = |vneighbors(q)|
vdeg(x) = |vneighbors(x)|
P+ = queries q in P with vdeg(q) > 0
D+ = documents x in D with vdeg(x) > 0
Assuming D+ = D, then E(IS(Q,X)) = |D|
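The valid graph is the edge-wise intersection of the real and predicted graphs. A minimal sketch on hypothetical data (all query and document names are invented for illustration):

```python
# real[q]      = documents the engine returns for q   (neighbors(q))
# predicted[q] = documents whose text contains q      (pneighbors(q))
real = {
    "q1": {"a", "b"},
    "q2": {"b", "c"},
    "q3": {"a"},          # e.g. matched through anchor text only
}
predicted = {
    "q1": {"a", "b", "c"},  # "c" contains q1 but the engine does not return it
    "q2": {"b", "c"},
    "q3": set(),
}

# vneighbors(q) = neighbors(q) ∩ pneighbors(q)
valid = {q: real[q] & predicted[q] for q in real}

vdeg_query = {q: len(docs) for q, docs in valid.items()}
vdeg_doc = {}
for q, docs in valid.items():
    for x in docs:
        vdeg_doc[x] = vdeg_doc.get(x, 0) + 1

p_plus = {q for q, d in vdeg_query.items() if d > 0}   # P+
d_plus = set(vdeg_doc)                                 # D+
```

Here q3 drops out of P+ because its only edge is not valid, while every document still has a valid edge, so D+ = D in this toy case.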
Approximate Importance Sampling (AIS)
We need to compute IS(Q,X) = |P+| · vdeg(Q) / vdeg(X)
vdeg(Q): Easy
vdeg(X): Hard
|P+|: Hard
We therefore approximate |P+| and vdeg(X):
AIS(Q,X) = |P| · vdeg(Q) · IVD(X) / pdeg(X)
IVD(X) = unbiased probabilistic estimator for pdeg(X)/vdeg(X)
Estimating pdeg(x)/vdeg(x)
Given: A document x
Want: Estimate pdeg(x) / vdeg(x)
Geometric estimation:
n = 1
forever do
    Choose a random phrase Q that occurs in content(x)
    Send Q to the search engine
    If x ∈ neighbors(Q), return n
    n ← n + 1
Probability to hit a “valid” query: vdeg(x) / pdeg(x)
So, expected number of iterations: pdeg(x) / vdeg(x)
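The geometric estimation loop can be simulated without a live search engine. In this sketch the `is_valid` callback stands in for "send Q to the search engine and check whether x is returned"; the callback, the phrase names, and the counts are all hypothetical.

```python
import random


def geometric_estimate(candidate_queries, is_valid, rng):
    """Count trials until a random query from pneighbors(x) turns out to be valid.

    Loops forever if no candidate is valid, so the caller must ensure vdeg(x) > 0.
    """
    n = 1
    while True:
        q = rng.choice(candidate_queries)
        if is_valid(q):
            return n
        n += 1


# Hypothetical document: 6 pool phrases occur in it (pdeg = 6), 2 are valid (vdeg = 2).
candidates = ["p1", "p2", "p3", "p4", "p5", "p6"]
valid_queries = {"p1", "p4"}

rng = random.Random(0)
runs = [geometric_estimate(candidates, valid_queries.__contains__, rng)
        for _ in range(20000)]
mean_estimate = sum(runs) / len(runs)
print(mean_estimate)  # should be close to pdeg/vdeg = 3
```

Each call costs a random number of search engine queries, with expectation pdeg(x)/vdeg(x), which is part of the estimator's query cost.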
Approximate Importance Sampling: Bias Analysis
Lemma: Multiplicative bias of AIS(Q,X) is |P| / |P+|
Approximate Importance Sampling: Bias Elimination
How to eliminate the bias in AIS?
Estimate the bias |P|/|P+|
Divide AIS by this estimate
Well, this doesn’t quite work
Expected ratio ≠ ratio of expectations
So, use a standard trick in estimation of ratio statistics:
SizeEstimator = ( (1/n) · Σ_i AIS(Q_i,X_i) ) / ( (1/n) · Σ_i BE_i )
BE = estimator of |P|/|P+|
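Why dividing each sample by its own bias estimate fails can be seen numerically. Assuming BE is a geometric random variable with success probability p = |P+|/|P| (the value 0.6 is hypothetical), the expectation of 1/BE differs from 1/E(BE):

```python
import math

p = 0.6  # hypothetical |P+| / |P|

# E(BE) for a geometric variable with success probability p.
expected_be = 1 / p

# E(1/BE) = sum over n >= 1 of P(BE = n) / n, truncated where terms are negligible.
expected_inverse_be = sum((1 - p) ** (n - 1) * p / n for n in range(1, 200))

# Closed form of the same series: -p * ln(p) / (1 - p).
closed_form = -p * math.log(p) / (1 - p)

print(1 / expected_be, expected_inverse_be)
```

Averaging numerator and denominator separately before dividing avoids this gap as the number of samples grows, which is the idea behind ratio estimation.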

Bias Analysis
Theorem: [statement not preserved in the extracted slide text]
Estimating |P|/|P+|
Also by geometric estimation:
n = 1
forever do
    Choose a random query Q from P
    Send Q to the search engine
    If vdeg(Q) > 0, return n
    n ← n + 1
Probability to hit a “valid” query: |P+|/|P|
So, expected number of iterations: |P|/|P+|
Recap
1. Sample valid edges (Q_1,X_1),…,(Q_n,X_n) from p
2. Compute vdeg(Q_i) for each query Q_i
3. Compute pdeg(X_i) for each document X_i
4. Estimate IVD(X_i) = pdeg(X_i)/vdeg(X_i) for each X_i
5. Compute AIS(Q_i,X_i)
6. Estimate the expected bias BE_i = |P|/|P+|
7. Output SizeEstimator = Σ_i AIS(Q_i,X_i) / Σ_i BE_i
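The recap can be sketched end to end on a toy instance. Everything here is an assumption made for illustration: the real and predicted graphs are hypothetical dictionaries standing in for the search engine and the fetched pages, and the formulas AIS = |P| · vdeg(Q) · IVD(X) / pdeg(X) and SizeEstimator = Σ AIS / Σ BE are reconstructed from the surrounding definitions rather than copied from the slides.

```python
import random

# real[q] = neighbors(q); predicted[q] = pneighbors(q). Hypothetical toy data.
real = {"q1": {"a", "b"}, "q2": {"b", "c"}, "q3": {"c", "d"}, "q4": set(), "q5": {"a"}}
predicted = {"q1": {"a", "b", "d"}, "q2": {"b", "c"}, "q3": {"c", "d", "a"},
             "q4": {"a"}, "q5": set()}


def transpose(graph):
    """Map each document to the set of queries adjacent to it."""
    out = {}
    for q, docs in graph.items():
        for x in docs:
            out.setdefault(x, set()).add(q)
    return out


valid = {q: real[q] & predicted[q] for q in real}   # vneighbors(q)
pool = sorted(real)                                 # P
p_of_doc = transpose(predicted)                     # pneighbors(x)
v_of_doc = transpose(valid)                         # vneighbors(x)


def geometric(trial):
    """Number of independent trials until trial() first succeeds."""
    n = 1
    while not trial():
        n += 1
    return n


def sample_valid_edge(rng):
    """Step 1: uniform query with vdeg > 0, then a uniform valid result."""
    while True:
        q = rng.choice(pool)
        if valid[q]:
            return q, rng.choice(sorted(valid[q]))


def ais(q, x, rng):
    """Steps 2-5: |P| * vdeg(q) * IVD(x) / pdeg(x), with IVD by geometric estimation."""
    candidates = sorted(p_of_doc[x])
    ivd = geometric(lambda: rng.choice(candidates) in v_of_doc[x])
    return len(pool) * len(valid[q]) * ivd / len(candidates)


def bias_estimate(rng):
    """Step 6: geometric estimate of |P| / |P+|."""
    return geometric(lambda: bool(valid[rng.choice(pool)]))


def size_estimator(n, rng):
    """Step 7: ratio of the summed AIS values to the summed bias estimates."""
    numerator = denominator = 0.0
    for _ in range(n):
        q, x = sample_valid_edge(rng)
        numerator += ais(q, x, rng)
        denominator += bias_estimate(rng)
    return numerator / denominator


estimate = size_estimator(20000, random.Random(0))
print(estimate)  # the toy corpus has 4 documents, all with a valid edge
```

In this toy instance |P| = 5 and |P+| = 3, so the uncorrected AIS average would overshoot by a factor of about 5/3; the division by the bias estimates is what brings the output back toward 4.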
Rao-Blackwellization
Question: We currently use only one (random) result for each query submitted to the search engine. Can we also use the rest?
Rao & Blackwell: Sure! Use them as additional samples. It can only help!
The Rao-Blackwellized AIS estimator:
AIS_RB(Q) = (1/vdeg(Q)) · Σ_{x ∈ vneighbors(Q)} AIS(Q,x)
Recall: AIS(Q,X) = |P| · vdeg(Q) · IVD(X) / pdeg(X)
RB-AIS: Analysis
The Rao-Blackwell Theorem:
AIS_RB has exactly the same bias as AIS
The variance of AIS_RB can only be lower
Variance reduces if query results are sufficiently “variable”
Now, use AIS_RB instead of AIS in SizeEstimator
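The effect of Rao-Blackwellization can be checked exactly on a toy graph. For simplicity this sketch uses the exact importance sampling value IS(q,x) = |P| · deg(q) / deg(x) in place of AIS, which is an assumption made to keep the example deterministic; the graph is hypothetical.

```python
from fractions import Fraction

# Hypothetical toy graph: query -> results.
graph = {
    "news": ["cnn", "foxnews", "news.google", "news.bbc"],
    "bbc": ["news.bbc", "www.bbc", "wiki/BBC"],
    "google": ["news.google", "www.google", "maps.google"],
}
deg_doc = {}
for results in graph.values():
    for x in results:
        deg_doc[x] = deg_doc.get(x, 0) + 1


def is_value(q, x):
    """IS(q,x) = |P| * deg(q) / deg(x)."""
    return Fraction(len(graph) * len(graph[q]), deg_doc[x])


def moments(weighted_values):
    """Exact mean and variance of a finite distribution given as (prob, value) pairs."""
    mean = sum(p * v for p, v in weighted_values)
    variance = sum(p * (v - mean) ** 2 for p, v in weighted_values)
    return mean, variance


# Plain estimator: one random result per query, p(q,x) = 1 / (|P| * deg(q)).
plain = [(Fraction(1, len(graph) * len(graph[q])), is_value(q, x))
         for q in graph for x in graph[q]]

# Rao-Blackwellized estimator: average over all results of the sampled query.
rao_blackwell = [(Fraction(1, len(graph)),
                  sum(is_value(q, x) for x in graph[q]) / len(graph[q]))
                 for q in graph]

plain_mean, plain_var = moments(plain)
rb_mean, rb_var = moments(rao_blackwell)
print(plain_mean, plain_var, rb_mean, rb_var)
```

Both estimators have the same mean, and averaging over all results of a query removes the within-query part of the variance at no extra query cost.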
Corpus Size, Bias Comparison
[Chart: relative bias (0% to 80%) versus result set size limit k (5, 20, 100, 200), comparing Broder et al with SizeEstimator.]
Corpus Size, Query Cost Comparison
[Chart: amortized query cost (0 to 1.2E+15) versus result set size limit k (5, 20, 100, 200), comparing SizeEstimator without RB and SizeEstimator with RB.]
Corpus Size Estimations for 3 Major Search Engines
[Chart: absolute corpus size in billions (0 to 10) for SE1, SE2, and SE3, comparing Broder et al with SizeEstimator.]
Thank You
Average Metric, Bias Comparison
[Chart: relative bias (0% to 70%) versus result set size limit k (5, 20, 100, 200), comparing BarYossefGurevich06 with AvgEstimator.]
Average Metric, Query Cost Comparison
[Chart: square root of amortized query cost (0 to 18) versus result set size limit k (5, 20, 100, 200), comparing BarYossefGurevich06, AvgEstimator without RB, and AvgEstimator with RB.]