Efficient Search Engine Measurements


Maxim Gurevich (Technion)

Ziv Bar-Yossef (Technion and Google)

Search Engine Benchmarks


State of the art:


No objective benchmarks for search engines


Need to rely on “anecdotal” studies or on subjective search
engine reports


Users, advertisers, partners cannot compare search engines


Our goal:


Design search engine benchmarking techniques


Accurate


Efficient


Objective


Transparent


Search Engine Corpus
Evaluation


Corpus size


How many pages are indexed?


Search engine overlap


What fraction of the pages indexed by search engine A are
also indexed by search engine B?


Freshness


How old are the pages in the index?


Spam resilience


What fraction of the pages in the index are spam?


Duplicates


How many unique pages are there in the index?

Search Engine Corpus Metrics

[Diagram: the web; a search engine whose index D of indexed documents is accessible only through its public interface]

Target functions (the focus of this talk):

Corpus size

Number of unique pages

Overlap

Average age of a page

Search Engine Estimators

[Diagram: the estimator interacts with the search engine only through its public interface, submitting queries and receiving the top k results, and outputs an estimate of |D|, the size of the set D of indexed documents]

Success Criteria

Estimation accuracy:


Bias: E(Estimate - |D|)


Amortized cost (cost times variance):


Amortized query cost


Amortized fetch cost


Amortized function cost

Previous Work

Average metrics:


Anecdotal queries


[SearchEngineWatch, Google, BradlowSchmittlein00]


Queries from user query logs


[LawrenceGiles98, DobraFeinberg04]


Random queries


[BharatBroder98, CheneyPerry05, GulliSignorini05,
BarYossefGurevich06, Broder et al 06]


Random sampling from the web


[Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

Sum metrics:


Random queries


[Broder et al 06]

Our Contributions


A new search engine estimator


Applicable to both sum metrics and average metrics


Arbitrary target functions


Arbitrary target distributions (measures)


Less bias than the Broder et al estimator


In one experiment, empirical relative bias was reduced from 75% to 0.01%


More efficient than the BarYossefGurevich06 estimator


In one experiment, query cost was reduced by a factor of 375.


Techniques


Approximate ratio importance sampling


Rao-Blackwellization

Roadmap


Recast the Broder et al corpus size estimator
as an importance sampling estimator.


Describe the “degree mismatch problem”
(DMP)


Show how to overcome DMP using
approximate ratio importance sampling


Discuss Rao-Blackwellization


Gloss over some experimental results

Query Pools

Pre-processing step: create a query pool P from a training corpus C of web documents.

[Diagram: a query pool P = {q1, q2, …} extracted from a training corpus C of web documents]

Working example: P = all length-3 phrases that occur in C

If "to be or not to be" occurs in C, P contains:
"to be or", "be or not", "or not to", "not to be"

Choose P that "covers" most documents in D
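As a concrete illustration, here is a minimal Python sketch of this pre-processing step, assuming whitespace/word tokenization and a small in-memory corpus (a real pool would also normalize text and prune very rare or very frequent phrases):

```python
import re

def build_query_pool(corpus):
    """Build a query pool P of all length-3 phrases (word 3-grams) occurring
    in a training corpus C of documents (a list of strings)."""
    pool = set()
    for doc in corpus:
        words = re.findall(r"\w+", doc.lower())
        for i in range(len(words) - 2):
            pool.add(" ".join(words[i:i + 3]))
    return pool

# The working example from the slide:
C = ["to be or not to be"]
print(sorted(build_query_pool(C)))
# ['be or not', 'not to be', 'or not to', 'to be or']
```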

The Search Engine Graph


P = query pool

neighbors(q) = { documents returned on query q }

deg(q) = |neighbors(q)|

neighbors(x) = { queries that return x as a result }

deg(x) = |neighbors(x)|

[Diagram: bipartite query-document graph over the queries "news", "bbc", "google", "maps" and the documents www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC]

Example: deg("news") = 4, deg("bbc") = 3

deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
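A tiny Python sketch of this bipartite graph, with hypothetical edges chosen to match the degree examples above (the real graph is defined by the live search engine, not by a fixed dictionary):

```python
from collections import defaultdict

# neighbors(q): documents returned on query q (illustrative edges only)
neighbors_q = {
    "news":   ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc":    ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps":   ["maps.google.com", "www.mapquest.com", "maps.yahoo.com"],
}

# Invert to get neighbors(x): the queries that return document x
neighbors_x = defaultdict(list)
for q, docs in neighbors_q.items():
    for x in docs:
        neighbors_x[x].append(q)

deg_q = {q: len(docs) for q, docs in neighbors_q.items()}
deg_x = {x: len(qs) for x, qs in neighbors_x.items()}

print(deg_q["news"], deg_q["bbc"])                    # 4 3
print(deg_x["www.cnn.com"], deg_x["news.bbc.co.uk"])  # 1 2
```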

Corpus Size as an Integral

E = edges in the queries-documents graph

Lemma: |D| = Σ_{(q,x) ∈ E} 1/deg(x)

Proof:

Contribution of edge (q,x) to the sum: 1/deg(x)

Total contribution of edges incident to x: 1

Total contribution of all edges: |D|
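The lemma is easy to check on a toy graph; a self-contained sketch (the edges are made up, any graph works):

```python
# Each query maps to the documents it returns (arbitrary toy edges).
neighbors_q = {
    "q1": ["a.com", "b.com", "c.com"],
    "q2": ["b.com", "d.com"],
    "q3": ["d.com"],
}

# deg(x): number of queries that return document x
deg_x = {}
for docs in neighbors_q.values():
    for x in docs:
        deg_x[x] = deg_x.get(x, 0) + 1

# Sum 1/deg(x) over all edges (q, x): recovers the number of documents |D|.
lemma_sum = sum(1.0 / deg_x[x] for docs in neighbors_q.values() for x in docs)
print(lemma_sum, len(deg_x))   # 4.0 4
```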


Corpus Size as an Integral


Express corpus size as an integral:

Target measure: π(q,x) = 1/deg(x)

Target function: f(q,x) = 1

|D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)

Monte Carlo Estimation


Monte Carlo estimation of the integral:

Sample (Q,X) according to π

Output f(Q,X)

Works only if:

π is a proper distribution

We can easily sample from π

BUT, in our case:

π is not a distribution

Even if it were, sampling from π(q,x) = 1/deg(x) may not be easy

So instead, we sample (Q,X) from an easy "trial distribution" p

Sampling Edges, Easily

[Diagram: the sampler picks a random query Q from the pool P, submits it to the search engine, and takes X, a random result among the top k results, yielding the edge (Q,X)]

Sample an edge (q,x) with probability:

p(q,x) = 1/(|P| · deg(q))
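A minimal sketch of this sampler in Python; `search(q, k)` is a hypothetical wrapper around the engine's public interface returning at most the top k results of q (queries with no results yield no edge and are simply resampled):

```python
import random

def sample_edge(pool, search, k=100):
    """Sample an edge (Q, X): a uniform query Q from the pool P (probability
    1/|P|), then a uniform result X of Q (probability 1/deg(Q)), giving
    p(q, x) = 1 / (|P| * deg(q)). Only the top k results are visible, so
    deg(Q) here means the number of returned results."""
    while True:
        q = random.choice(pool)
        results = search(q, k)
        if results:
            return q, random.choice(results), len(results)  # (Q, X, deg(Q))
```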

Importance Sampling (IS)
[Marshall56]

We have: a sample (Q,X) from p

We need: to estimate the integral |D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)

So we cannot use simple Monte Carlo estimation.

Importance sampling comes to the rescue…

Compute an "importance weight" for (Q,X):

w(Q,X) = π(Q,X) / p(Q,X)

Importance sampling estimator:

IS(Q,X) = f(Q,X) · w(Q,X)
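For the corpus-size setting (f ≡ 1, π(q,x) = 1/deg(x), p(q,x) = 1/(|P| · deg(q))), the weight collapses to a simple expression; a one-function sketch:

```python
def is_estimate(pool_size, deg_q, deg_x):
    """Importance sampling estimate from a single sampled edge (Q, X):
    IS(Q, X) = f * pi / p = (1/deg(X)) / (1/(|P| * deg(Q)))
             = |P| * deg(Q) / deg(X).
    Averaging this value over many sampled edges estimates |D|."""
    return pool_size * deg_q / deg_x
```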



IS: Bias Analysis

E(IS(Q,X)) = Σ_{(q,x)} p(q,x) · f(q,x) · π(q,x)/p(q,x) = Σ_{(q,x)} f(q,x) · π(q,x) = |D|

The importance sampling estimator is therefore unbiased.

Computing the Importance
Sampling Estimator


We need to compute:

IS(Q,X) = π(Q,X)/p(Q,X) · f(Q,X) = |P| · deg(Q) / deg(X)

Computing |P| is easy, since we know P

How to compute deg(Q) = |neighbors(Q)|?

Since Q was submitted to the search engine, we know deg(Q)

How to compute deg(X) = |neighbors(X)|?

Fetch content of X from the web

pdeg(X) = number of distinct queries from P that X contains

Use pdeg(X) as an estimate for deg(X)

The Degree Mismatch Problem
(DMP)


In reality, pdeg(X) may be different from deg(X)



Neighbor recall problem:
There may be q ∈ neighbors(x) that do not occur in x

q occurs as "anchor text" in a page linking to x

q occurs in x, but our parser failed to find it

Neighbor precision problem:
There may be q that occur in x, but q ∉ neighbors(x)


q “overflows”


q occurs in x, but the search engine’s parser failed to find it

Implications of DMP


Can only approximate document degrees


Bias of importance sampling estimator may
become significant


In one of our experiments, relative bias was 75%

Eliminating the Neighbor
Recall Problem


The
predicted search engine graph
:


pneighbors(x) = queries that occur in x


pneighbors(q) = documents in whose text q occurs




An edge (q,x) is “
valid
”, if it occurs both in the search
engine graph and the predicted search engine graph



The
valid search engine graph
:


vneighbors(x) = neighbors(x) ∩ pneighbors(x)


vneighbors(q) = neighbors(q) ∩ pneighbors(q)


Eliminating the Neighbor
Recall Problem (cont.)


We use the valid search engine graph rather
than the real search engine graph:



vdeg(q) = |vneighbors(q)|

vdeg(x) = |vneighbors(x)|

P+ = queries q in P with vdeg(q) > 0

D+ = documents x in D with vdeg(x) > 0

The estimator is now computed on the valid graph: IS(Q,X) = |P+| · vdeg(Q) / vdeg(X)

Assuming D+ = D, then E(IS(Q,X)) = |D|
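A small sketch of how the valid graph and P+ could be derived from dict-of-list representations of the two graphs (a hypothetical helper for intuition only; in practice the estimator never materializes these sets):

```python
def valid_graph(neighbors_q, pneighbors_q):
    """Intersect the search engine graph (query -> returned documents) with
    the predicted graph (query -> documents containing the query)."""
    vneighbors_q = {
        q: [x for x in docs if x in set(pneighbors_q.get(q, []))]
        for q, docs in neighbors_q.items()
    }
    vdeg_q = {q: len(docs) for q, docs in vneighbors_q.items()}
    p_plus = [q for q, d in vdeg_q.items() if d > 0]   # queries with vdeg(q) > 0
    return vneighbors_q, vdeg_q, p_plus
```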

Approximate Importance
Sampling (AIS)


We need to compute:

vdeg(Q): easy

vdeg(X): hard

|P+|: hard

We therefore approximate |P+| and vdeg(X):

AIS(Q,X) = |P| · vdeg(Q) · IVD(X) / pdeg(X)

IVD(X) = unbiased probabilistic estimator for pdeg(X)/vdeg(X)
(so IVD(X)/pdeg(X) plays the role of 1/vdeg(X), and |P| stands in for |P+|)

Estimating pdeg(x)/vdeg(x)


Given: a document x

Want: estimate pdeg(x)/vdeg(x)

Geometric estimation:

n = 1
forever do
  Choose a random phrase Q that occurs in content(x)
  Send Q to the search engine
  If x ∈ neighbors(Q), return n
  n ← n + 1

Probability to hit a "valid" query: vdeg(x) / pdeg(x)

So, expected number of iterations: pdeg(x) / vdeg(x)
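A Python sketch of this geometric estimator; `x_queries` is the list of pool phrases occurring in X's content (its predicted neighbors) and `search(q, k)` is the hypothetical engine wrapper from before:

```python
import random

def estimate_ivd(x_url, x_queries, search, k=100):
    """Geometric estimator IVD(X) for pdeg(X)/vdeg(X): keep picking a random
    phrase occurring in X until one of them returns X among its results.
    The success probability per trial is vdeg(X)/pdeg(X), so the expected
    number of trials is pdeg(X)/vdeg(X), making the trial count unbiased."""
    n = 1
    while True:
        q = random.choice(x_queries)       # uniform over pneighbors(X)
        if x_url in search(q, k):          # valid: X is actually returned for q
            return n
        n += 1
```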

Approximate Importance
Sampling: Bias Analysis


Lemma: The multiplicative bias of AIS(Q,X) is |P|/|P+|

Approximate Importance
Sampling: Bias Elimination


How to eliminate the bias in AIS?


Estimate the bias |P|/|P
+
|


Divide AIS by this estimate


Well, this doesn’t quite work


Expected ratio ≠ ratio of expectations


So, use a standard trick in estimation of ratio statistics: divide the sample average of AIS by the sample average of BE:

SizeEstimator = ( (1/n) Σ_i AIS(Q_i,X_i) ) / ( (1/n) Σ_i BE_i )

BE = estimator of |P|/|P+|

Bias Analysis



Theorem: The bias of the ratio estimator SizeEstimator vanishes as the number of samples n grows.

Estimating |P|/|P+|

Also by geometric estimation:

n = 1
forever do
  Choose a random query Q from P
  Send Q to the search engine
  If vdeg(Q) > 0, return n
  n ← n + 1

Probability to hit a "valid" query: |P+|/|P|

So, expected number of iterations: |P|/|P+|
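The analogous sketch for BE; `contains(x, q)` is a hypothetical fetch-and-check that tests whether phrase q occurs in document x's content:

```python
import random

def estimate_be(pool, search, contains, k=100):
    """Geometric estimator BE for |P|/|P+|: draw uniform queries from the pool
    until one is valid, i.e. vdeg(Q) > 0 (some returned result really contains
    the phrase Q). The expected number of draws is |P|/|P+|."""
    n = 1
    while True:
        q = random.choice(pool)
        if any(contains(x, q) for x in search(q, k)):   # vdeg(Q) > 0
            return n
        n += 1
```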

Recap

1. Sample valid edges (Q1,X1), …, (Qn,Xn) from p

2. Compute vdeg(Qi) for each query Qi

3. Compute pdeg(Xi) for each document Xi

4. Estimate IVD(Xi) = pdeg(Xi)/vdeg(Xi) for each Xi

5. Compute AIS(Qi,Xi) = |P| · vdeg(Qi) · IVD(Xi) / pdeg(Xi)

6. Estimate the expected bias BEi = |P|/|P+|

7. Output SizeEstimator = ( Σ_i AIS(Qi,Xi) ) / ( Σ_i BEi )
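Putting the steps together, a sketch of the full estimator; every callable is a hypothetical stand-in for one of the building blocks above (valid-edge sampling, degree computation, and the two geometric estimators):

```python
def size_estimator(n, pool_size, sample_valid_edge, vdeg, pdeg, ivd, be):
    """Estimate |D| from n sampled valid edges, following the recap:
    average AIS over the samples and divide by the average bias estimate BE."""
    ais_sum, be_sum = 0.0, 0.0
    for _ in range(n):
        q, x = sample_valid_edge()                          # step 1
        ais_sum += pool_size * vdeg(q) * ivd(x) / pdeg(x)   # steps 2-5
        be_sum += be()                                      # step 6
    return ais_sum / be_sum                                 # step 7 (ratio trick)
```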

Rao-Blackwellization

Question: We currently use only one (random) result for each query submitted to the search engine. Can we also use the rest?

Rao & Blackwell: Sure! Use them as additional samples. It can only help!

The Rao-Blackwellized AIS estimator:

Recall: AIS(Q,X) = |P| · vdeg(Q) · IVD(X) / pdeg(X)

AIS_RB(Q) = (1/vdeg(Q)) · Σ_{x ∈ vneighbors(Q)} AIS(Q,x)

RB-AIS: Analysis

The Rao-Blackwell Theorem:

AIS_RB has exactly the same bias as AIS

The variance of AIS_RB can only be lower

Variance reduces if query results are sufficiently "variable"

Now, use AIS_RB instead of AIS in SizeEstimator:

SizeEstimator = ( Σ_i AIS_RB(Qi) ) / ( Σ_i BEi )

Corpus Size, Bias Comparison

[Chart: relative bias (0%-80%) vs. result set size limit k ∈ {5, 20, 100, 200}, comparing Broder et al with SizeEstimator]
Corpus Size, Query Cost
Comparison

[Chart: amortized query cost (0 to 1.2E+15) vs. result set size limit k ∈ {5, 20, 100, 200}, comparing SizeEstimator without RB and with RB]
Corpus Size Estimations for 3
Major Search Engines

[Chart: absolute corpus size in billions (0-10) estimated for three major search engines SE1, SE2, SE3 by Broder et al and by SizeEstimator]
Thank You

Average Metric, Bias
Comparison

[Chart: relative bias (0%-70%) vs. result set size limit k ∈ {5, 20, 100, 200}, comparing BarYossefGurevich06 with AvgEstimator]
Average Metric, Query Cost
Comparison

[Chart: square root of amortized query cost (0-18) vs. result set size limit k ∈ {5, 20, 100, 200}, comparing BarYossefGurevich06, AvgEstimator without RB, and AvgEstimator with RB]