Sreenivas Gollapudi
Search Labs, Microsoft Research
Joint work with Aneesh Sharma (Stanford),
Samuel Ieong, Alan Halverson, and Rakesh
Agrawal (Microsoft Research)
WINE 2009
Intuitive definition
◦ Represent a variety of relevant meanings for a given query
Mathematical definitions:
◦ Minimizing query abandonment
Want to represent different user categories
◦ Trade-off between relevance and novelty
Query and document similarities
◦ Maximal Marginal Relevance [CG98]
◦ Personalized re-ranking of results [RD06]
Probability Ranking Principle not optimal [CK06]
◦ Query abandonment
Topical diversification [Z+05, AGHI09]
◦ Needs topical (categorical) information
Loss minimization framework [Z02, ZL06]
◦ “Diminishing returns” for docs w/ the same intent is a specific loss function [AGHI09]
Express diversity requirements in terms of
desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
Inspired by similar approaches for
◦ Recommendation systems [Andersen et al. ’08]
◦ Ranking [Altman, Tennenholtz ’07]
◦ Clustering [Kleinberg ’02]
Map the space of functions – a “basis vector”
Input:
◦ Candidate documents: U = {u_1, u_2, …, u_n}, query q
◦ Relevance function: w_q(u_i)
◦ Distance function: d_q(u_i, u_j) (symmetric, non-metric)
◦ Size k of output result set
[Figure: documents u_1, …, u_6 drawn as points, annotated with a relevance value w_q(u_5) and a pairwise distance d_q(u_2, u_4)]
Output
◦ Diversified set S* of documents (|S*| = k)
◦ Diversification function: f : S × w_q × d_q → R+
S* = argmax_{|S| = k} f(S)
[Figure: the same six documents; with k = 3 the selected set is S* = {u_1, u_2, u_6}]
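To make this input/output contract concrete, here is a tiny brute-force sketch in Python (an illustration under assumed interfaces, not an algorithm from the talk): it enumerates all size-k subsets and returns the maximizer of a caller-supplied f built from w_q and d_q.

from itertools import combinations

def diversify_brute_force(U, k, f):
    # Exhaustive argmax over all size-k subsets; exponential in |U|,
    # so this only illustrates the contract -- the talk's point is
    # that efficient (approximation) algorithms are needed instead.
    return max(combinations(U, k), key=lambda S: f(list(S)))

def make_max_sum_like(w, d, lam=1.0):
    # Illustrative objective only: total relevance plus lam * total
    # pairwise distance, built from a relevance function w(u) and a
    # distance function d(u, v).
    def f(S):
        rel = sum(w(u) for u in S)
        div = sum(d(u, v) for i, u in enumerate(S) for v in S[i + 1:])
        return rel + lam * div
    return f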
1. Scale-invariance
2. Consistency
3. Richness
4. Strength of
a) Relevance
b) Diversity
5. Stability
6. Two technical properties
S* = argmax_S f(S, w(∙), d(∙, ∙)) = argmax_S f(S, w′(∙), d′(∙, ∙)), where
◦ w′(u_i) = α ∙ w(u_i)
◦ d′(u_i, u_j) = α ∙ d(u_i, u_j)
• No built-in scale for f!
[Figure: S*(3) is unchanged under the rescaling]
S* = argmax_S f(S, w(∙), d(∙, ∙)) = argmax_S f(S, w′(∙), d′(∙, ∙)), where
◦ w′(u_i) = w(u_i) + a_i for u_i ∈ S*
◦ d′(u_i, u_j) = d(u_i, u_j) + b_i for u_i and/or u_j ∈ S*
• Increasing relevance/diversity doesn’t hurt!
[Figure: S*(3) is unchanged under these increases]
S*(k) = argmax_S f(S, w(∙), d(∙, ∙), k)
◦ S*(k) ⊆ S*(k+1) for all k
• Output set shouldn’t oscillate (change arbitrarily) with size
[Figure: S*(3) grows into S*(4)]
Theorem: No function f can satisfy all the axioms simultaneously:
Scale-invariance, Consistency, Richness, Strength of Relevance/Diversity, Stability, and the two technical properties.
◦ Proof via a constructive argument
Baseline for what is possible
Mathematical criteria for choosing f
Modular approach: f is independent of the specific w_q(∙) and d_q(∙, ∙)!
Express diversity requirements in terms of
desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
Max-sum (avg) objective: a weighted combination of total relevance and total pairwise distance
[Figure: with k = 3, S* = {u_1, u_2, u_6}; with k = 4, S* = {u_1, u_3, u_5, u_6}, which does not contain the k = 3 set]
Violates stability!
Max-min objective: combines the minimum relevance and the minimum pairwise distance
[Figure: with k = 3, S* = {u_1, u_2, u_6}; in the modified instance, S* = {u_1, u_5, u_6}]
Violates consistency and stability!
A taxonomy-based diversification objective
◦ Uses the analogy of marginal utility to determine whether to include more results from an already covered category
◦ Violates stability and one of the technical axioms
Express diversity requirements in terms of
desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
Recast as facility dispersion
◦ Max-sum (MaxSumDispersion): maximize Σ_{u,v ∈ S} d′(u, v), for a modified distance d′ that folds relevance into d
◦ Max-min (MaxMinDispersion): maximize min_{u,v ∈ S} d′(u, v)
Known approximation algorithms
Lower bounds
Lots of other facility dispersion objectives and algorithms
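As a sketch of how such dispersion problems are solved approximately, here is the classic greedy 2-approximation for MaxSumDispersion, which repeatedly adds the farthest-apart remaining pair. This is the standard algorithm from the facility-dispersion literature (Hassin, Rubinstein, Tamir ’97), written against an assumed d_prime(u, v) interface; it is not code from the paper.

from itertools import combinations

def greedy_max_sum_dispersion(U, k, d_prime):
    # Repeatedly add the remaining pair with the largest modified
    # distance d'(u, v); a 2-approximation for MaxSumDispersion.
    S, remaining = [], list(U)
    while len(S) < k - 1:
        u, v = max(combinations(remaining, 2), key=lambda p: d_prime(*p))
        S.extend([u, v])
        remaining.remove(u)
        remaining.remove(v)
    if len(S) < k:
        S.append(remaining.pop())  # k odd: one arbitrary leftover
    return S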
S ← ∅
∀ c ∈ C, U(c | q) ← P(c | q)
while |S| < k do
    for d ∈ D do
        g(d | q, c) ← Σ_c U(c | q) · V(d | q, c)
    end for
    d* ← argmax_d g(d | q, c)
    S ← S ∪ {d*}
    ∀ c ∈ C, U(c | q) ← (1 − V(d* | q, c)) · U(c | q)
    D ← D \ {d*}
end while
P(c | q): conditional prob. of intent c given query q
g(d | q, c): current prob. of d satisfying q, c
The ∀ c ∈ C step at the end of the loop updates the utility of each category
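Below is a minimal runnable Python rendering of this greedy procedure (published as IA-Select in [AGHI09]). The dict-based P and V data structures are my assumptions; the update rule follows the pseudocode. The demo is wired to the worked example on the next slide, under one plausible reading of which intent each V value belongs to.

def ia_select(D, C, P, V, k):
    # Greedy selection following the pseudocode above.
    # P[c] = P(c|q); V[(d, c)] = prob. that d satisfies intent c.
    # Returns an ordered list of k documents.
    S = []
    U = {c: P[c] for c in C}                     # U(c|q) <- P(c|q)
    remaining = list(D)
    while len(S) < k and remaining:
        # g(d|q,c) <- sum_c U(c|q) * V(d|q,c)
        g = {d: sum(U[c] * V[(d, c)] for c in C) for d in remaining}
        d_star = max(g, key=g.get)               # d* <- argmax g
        S.append(d_star)
        for c in C:                              # discount covered intents
            U[c] *= 1.0 - V[(d_star, c)]
        remaining.remove(d_star)
    return S

# Demo matching the next slide's numbers (assumed intent assignment):
P = {"R": 0.8, "B": 0.2}
docs = ["d1", "d2", "d3", "d4", "d5"]
V = {("d1", "R"): 0.0, ("d1", "B"): 0.4,
     ("d2", "R"): 0.9, ("d2", "B"): 0.0,
     ("d3", "R"): 0.5, ("d3", "B"): 0.0,
     ("d4", "R"): 0.4, ("d4", "B"): 0.0,
     ("d5", "R"): 0.0, ("d5", "B"): 0.4}
print(ia_select(docs, ["R", "B"], P, V, k=3))    # d2 first (g = 0.72)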
Intent distribution: P(R | q) = 0.8, P(B | q) = 0.2
[Worked example: five documents with V(d | q, c) = 0.4, 0.9, 0.5, 0.4, 0.4. Initially g(d | q, c) = P(c | q) × V(d | q, c), giving 0.08, 0.72, 0.40, 0.32, 0.08; the g = 0.72 document (0.9 × 0.8) is selected first. Then U(R | q) ← (1 − 0.9) × 0.8 = 0.08 while U(B | q) stays 0.2, so the remaining R-intent documents drop to g = 0.04 and 0.03, and the B-intent documents (g = 0.08) move up in S.]
• Actually produces an ordered set of results
• Results not proportional to intent distribution
• Results not according to (raw) quality
• Better results ⇒ fewer need to be shown
Express diversity requirements in terms of
desired properties
Define objectives that satisfy these properties
Develop efficient algorithms
Metrics and evaluation methodologies
Approach
◦ Represent real queries
◦ Scale beyond a few user studies
Problem: hard to define ground truth
Use disambiguated information sources on the web as the ground truth
Incorporate intent into human judgments
◦ Can exploit the user distribution (need to be careful)
Query = Wikipedia disambiguation page title
Large-scale ground truth set
Open source
Growing in size
Novelty
◦ Coverage of Wikipedia topics
Relevance
◦ Coverage of top Wikipedia results
Relevance function:
◦ 1/position
◦ Can use the search engine score
◦ Maybe use query category information
Distance function:
◦ Compute TF-IDF distances
◦ Jaccard similarity score for two documents A and B (see the sketch below):
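The slide’s Jaccard formula did not survive extraction; the standard definition is J(A, B) = |A ∩ B| / |A ∪ B|, turned into a distance as 1 − J. A minimal sketch over token sets (whitespace tokenization is an assumed detail):

def jaccard_distance(doc_a, doc_b):
    # d(A, B) = 1 - |A ∩ B| / |A ∪ B| over the documents' token sets
    A, B = set(doc_a.lower().split()), set(doc_b.lower().split())
    if not A and not B:
        return 0.0
    return 1.0 - len(A & B) / len(A | B)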
Topics/categories = list of disambiguation topics
Given a set S_k of results:
◦ For each result, compute a distribution over topics (using our d(∙, ∙))
◦ Sum confidence over all topics
◦ Threshold to get # topics represented
jaguar.com → Jaguar cat (0.1), Jaguar car (0.9)
wikipedia.org/jaguar → Jaguar cat (0.8), Jaguar car (0.2)
Category confidence
• Jaguar cat: 0.1 + 0.8 = 0.9
• Jaguar car: 0.9 + 0.2 = 1.1
Threshold = 1.0
• Jaguar cat: 0
• Jaguar car: 1
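A small sketch of this novelty computation (the function and variable names are mine): sum each topic’s confidence across the result set, then count the topics whose total clears the threshold.

def topics_represented(results, threshold=1.0):
    # results: one dict per document mapping topic -> confidence.
    totals = {}
    for topic_dist in results:
        for topic, conf in topic_dist.items():
            totals[topic] = totals.get(topic, 0.0) + conf
    return sum(1 for t in totals.values() if t >= threshold)

S = [{"Jaguar cat": 0.1, "Jaguar car": 0.9},   # jaguar.com
     {"Jaguar cat": 0.8, "Jaguar car": 0.2}]   # wikipedia.org/jaguar
print(topics_represented(S))                    # 1 (only Jaguar car)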
Query – get ranking of search restricted to Wikipedia pages
a(i) = position of Wikipedia topic i in this list
b(i) = position of Wikipedia topic i in the list being tested
Relevance is measured in terms of reciprocal ranks:
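The formula itself was lost in extraction. One natural reciprocal-rank instantiation consistent with the definitions above (an assumption, not necessarily the slide’s exact formula) scores a tested list by how well the positions b(i) track the ground-truth positions a(i), e.g. rel = Σ_i 1 / (a(i) · b(i)).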
Take expectation over distribution of intents
◦ Interpretation: how will the average user feel?
Consider NDCG@k
◦ Classic: NDCG(S_k; c) = DCG(S_k; c) / DCG_ideal(S_k; c)
◦ NDCG-IA depends on intent distribution and intent-specific NDCG:
NDCG-IA(S_k; q) = Σ_c P(c | q) · NDCG(S_k; c)
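A sketch of the intent-aware aggregation (the per-intent ndcg_at_k callable is an assumed interface; only the weighted sum is the slide’s formula):

def ndcg_ia_at_k(S, q, intents, P, ndcg_at_k, k):
    # NDCG-IA(S_k; q) = sum_c P(c|q) * NDCG(S_k; c)
    return sum(P[c] * ndcg_at_k(S, q, c, k) for c in intents)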
Created two types of HITs on Mechanical Turk
◦ Query classification: workers are asked to choose among three interpretations
◦ Document rating (under the given interpretation)
Two additional evaluations
◦ MT classification + current ratings
◦ MT classification + MT document ratings
When is it right to diversify?
◦ Users have certain expectations about the workings of a search engine
What is the best way to diversify?
◦ Evaluate approaches beyond diversifying the retrieved results
Metrics that capture both relevance and diversity
◦ Some preliminary work suggests that there will be certain trade-offs to make
Otherwise, need to encode an explicit user model in the metric
◦ Selection only needs k (which is 10)
Later, can rank the set according to relevance
◦ Personalize based on clicks
Alternative to stability:
◦ Select sets repeatedly (this loses information)
◦ Could refine the selection online, based on user clicks
[Figure: “Novelty difference over 650 ambiguous queries” – histogram (frequency count, 0–180) of the normalized difference in novelty between diversified and original results (0–1), for Max-sum and Max-min]
[Figure: “Relevance difference over 650 ambiguous queries” – histogram (frequency count, 0–600) of the normalized difference in relevance between diversified and original results (−1 to 1); legend truncated in the source (“Max…”)]
• Results for query cd player
• Relevance: popularity
• Distance: from product hierarchy
[Figure: “Novelty for Max-sum as a function of thresholds and lambda” – fractional difference in novelty (0–0.35) vs. thresholds for measuring novelty (0–1), for λ ∈ {0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8}]
[Figure: MAP-IA@{3, 5, 10} (MAP-IA value, 0.00–0.60) for Diverse vs. Engine 1, Engine 2, Engine 3]
[Figure: NDCG-IA@{3, 5, 10} (NDCG-IA value, 0.00–0.25) for Diverse vs. Engine 1, Engine 2, Engine 3]
[Figure: MRR-IA@{3, 5, 10} (0.00–0.60) for Diverse vs. Engine 1, Engine 2, Engine 3]
Many metrics for relevance
◦ Normalized discounted cumulative gain at k (NDCG@k)
◦ Mean average precision at k (MAP@k)
◦ Mean reciprocal rank (MRR)
Some metrics for diversity
◦ Maximal marginal relevance (MMR) [CG98]
◦ Nugget-based instantiation of NDCG [C+08]
Want a metric that can take into account both relevance and diversity [JK00]
DIVERSIFY(k)
Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | q, c), and integer k
Find a set of docs S ⊆ D with |S| = k that maximizes
P(S | q) = Σ_c P(c | q) · (1 − Π_{d ∈ S} (1 − V(d | q, c)))
interpreted as the probability that the set S is relevant to the query over all possible intentions c
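A direct transcription of this objective in Python (the dict-based set representation is assumed, as in the IA-Select sketch above):

from math import prod

def p_set_relevant(S, intents, P, V):
    # P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c)))
    return sum(P[c] * (1.0 - prod(1.0 - V[(d, c)] for d in S))
               for c in intents)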
Find at least one relevant doc
Multiple intents
Makes explicit use of taxonomy
◦ In contrast, similarity-based: [CG98], [CK06], [RKJ08]
Captures both diversification and doc relevance
◦ In contrast, coverage-based: [Z+05], [C+08], [V+08]
Specific form of “loss minimization” [Z02], [ZL06]
“Diminishing returns” for docs w/ the same intent
Objective is order-independent
◦ Assumes that all users read k results
◦ May want to optimize Σ_k P(k) · P(S_k | q)