
Sreenivas Gollapudi

Search Labs, Microsoft Research

Joint work with Aneesh Sharma (Stanford), Samuel Ieong, Alan Halverson, and Rakesh Agrawal (Microsoft Research)

WINE 2009



Intuitive definition


Represent a variety of relevant meanings for a given
query



Mathematical definitions:


Minimizing query abandonment


Want to represent different user categories


Trade-off between relevance and novelty


Query and document similarities


Maximal Marginal Relevance [CG98]


Personalized re-ranking of results [RD06]


Probability Ranking Principle not optimal [CK06]


Query abandonment


Topical diversification [Z+05, AGHI09]


Needs topical (categorical) information


Loss minimization framework [Z02, ZL06]



“Diminishing returns” for docs w/ the same intent is a specific loss function [AGHI09]




Express diversity requirements in terms of
desired properties



Define objectives that satisfy these properties



Develop efficient algorithms



Metrics and evaluation methodologies



Inspired by similar approaches for


Recommendation systems [Andersen et al ’08]


Ranking [Altman, Tennenholtz ’07]


Clustering [Kleinberg ’02]



Map the space of functions


a “basis vector”


Input:

Candidate documents U = {u_1, u_2, …, u_n}, query q

Relevance function w_q(u_i)

Distance function d_q(u_i, u_j) (symmetric, non-metric)

Size k of output result set

[Figure: candidate documents u_1, …, u_6 with an example relevance value w_q(u_5) and an example pairwise distance d_q(u_2, u_4)]


Output:

Diversified set S* of documents (|S*| = k)

Diversification function f : S × w_q × d_q → R⁺

S* = argmax_{|S| = k} f(S)

[Figure: with k = 3 the selected set is S* = {u_1, u_2, u_6}]
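To make the setup concrete, here is a minimal brute-force sketch (mine, not from the talk): it enumerates all k-subsets and returns the argmax of a supplied diversification function. The helper make_max_sum_objective is a hypothetical max-sum style f assembled from user-supplied w_q and d_q.

```python
from itertools import combinations

def diversify_bruteforce(candidates, k, f):
    """Return S* = argmax over all k-subsets of f(S), by enumeration.
    Exponential in the input size; a reference implementation for
    tiny instances only."""
    return max(combinations(candidates, k), key=f)

def make_max_sum_objective(w, d, lam=1.0):
    """Hypothetical max-sum style objective: total relevance of S plus
    lam times the total pairwise distance within S."""
    def f(S):
        rel = sum(w(u) for u in S)
        div = sum(d(u, v) for i, u in enumerate(S) for v in S[i + 1:])
        return rel + lam * div
    return f
```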

1. Scale-invariance
2. Consistency
3. Richness
4. Strength
   a) Relevance
   b) Diversity
5. Stability
6. Two technical properties



Scale-invariance:

S* = argmax_S f(S, w(∙), d(∙, ∙))
   = argmax_S f(S, w′(∙), d′(∙, ∙))

where w′(u_i) = α ∙ w(u_i) and d′(u_i, u_j) = α ∙ d(u_i, u_j)

No built-in scale for f!


Consistency:

S* = argmax_S f(S, w(∙), d(∙, ∙))
   = argmax_S f(S, w′(∙), d′(∙, ∙))

where w′(u_i) = w(u_i) + a_i for u_i ∈ S*
and d′(u_i, u_j) = d(u_i, u_j) + b_i for u_i and/or u_j ∈ S*

Increasing relevance/diversity doesn’t hurt!


Stability:

S*(k) = argmax_S f(S, w(∙), d(∙, ∙), k)

S*(k) ⊆ S*(k+1) for all k

Output set shouldn’t oscillate (change arbitrarily) with size
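These properties are mechanically checkable on small instances. Below is a sketch (mine) of a stability check built on the brute-force selector from earlier; f_for_k is a hypothetical factory returning the objective used at each output size.

```python
def violates_stability(candidates, f_for_k, k):
    """True if S*(k) is not contained in S*(k+1) on this instance.
    f_for_k(k) returns the objective function used at output size k."""
    s_k = set(diversify_bruteforce(candidates, k, f_for_k(k)))
    s_k1 = set(diversify_bruteforce(candidates, k + 1, f_for_k(k + 1)))
    return not s_k <= s_k1
```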


Theorem: No function f can satisfy all the axioms simultaneously
(scale-invariance, consistency, richness, strength of relevance/diversity, stability, and the two technical properties).

Proof via constructive argument.




Baseline for what is possible



Mathematical criteria for choosing f



Modular approach: f is independent of the specific w_q(∙) and d_q(∙, ∙)!



Express diversity requirements in terms of
desired properties



Define objectives that satisfy these properties



Develop efficient algorithms



Metrics and evaluation methodologies


Max-sum (avg) objective:

f(S) = (k − 1) ∙ Σ_{u ∈ S} w(u) + 2λ ∙ Σ_{u,v ∈ S} d(u, v)
[Figure: with k = 3 the optimum is S* = {u_1, u_2, u_6}, but with k = 4 it is S* = {u_1, u_3, u_5, u_6}, which does not contain the k = 3 set]

Violates stability!


Max-min objective:

f(S) = min_{u ∈ S} w(u) + λ ∙ min_{u,v ∈ S} d(u, v)
[Figure: with k = 3 the optimum is S* = {u_1, u_2, u_6}; after changes that should only reinforce this set, the optimum shifts to S* = {u_1, u_5, u_6}]

Violates consistency and stability!



A taxonomy-based diversification objective

Uses the analogy of marginal utility to determine whether to include more results from an already covered category

Violates stability and one of the technical axioms



Express diversity requirements in terms of
desired properties



Define objectives that satisfy these properties



Develop efficient algorithms



Metrics and evaluation methodologies


Recast as facility dispersion (folding relevance into a modified distance d′)

Max-sum (MaxSumDispersion): maximize the sum of pairwise distances Σ_{u,v ∈ S} d′(u, v)

Max-min (MaxMinDispersion): maximize the minimum pairwise distance min_{u,v ∈ S} d′(u, v)

Known approximation algorithms

Lower bounds

Lots of other facility dispersion objectives and algorithms
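The talk doesn't spell the algorithms out; as one concrete example, here is a sketch of the classic greedy pair heuristic for MaxSumDispersion, known to be a 2-approximation when the distance is a metric (the talk's d_q is non-metric, so treat this purely as a heuristic).

```python
import itertools

def max_sum_dispersion_greedy(points, k, d):
    """Repeatedly add the farthest-apart remaining pair until k points
    are selected; if k is odd, top up with an arbitrary remaining point."""
    remaining = set(points)
    chosen = []
    while len(chosen) < k - 1:
        u, v = max(itertools.combinations(remaining, 2),
                   key=lambda pair: d(pair[0], pair[1]))
        chosen += [u, v]
        remaining -= {u, v}
    if len(chosen) < k:
        chosen.append(remaining.pop())
    return chosen
```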


S ← ∅
∀c ∈ C: U(c | q) ← P(c | q)
while |S| < k do
  for d ∈ D do
    g(d | q, c) ← Σ_c U(c | q) ∙ V(d | q, c)
  end for
  d* ← argmax_d g(d | q, c)
  S ← S ∪ {d*}
  ∀c ∈ C: U(c | q) ← (1 − V(d* | q, c)) ∙ U(c | q)
  D ← D \ {d*}
end while

P(c | q): conditional probability of intent c given query q
g(d | q, c): current probability of d satisfying q, c
The update step discounts the utility of an already covered category.


Intent distribution: P(R | q) = 0.8, P(B | q) = 0.2.

[Worked example: five documents with V(d | q, c) = 0.9, 0.5, 0.4 for intent R and 0.4, 0.4 for intent B. The initial scores g(d | q, c) = U(c | q) ∙ V(d | q, c) are 0.72, 0.40, 0.32, 0.08, 0.08, so the V = 0.9 document is picked first and U(R | q) drops to (1 − 0.9) × 0.8 = 0.08. The B documents (g = 0.08) now beat the remaining R documents (g = 0.04 and 0.03), so a B document is picked next, after which U(B | q) = (1 − 0.4) × 0.2 = 0.12 and the last B document scores 0.4 × 0.12 = 0.05.]
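A minimal executable transcription of the greedy procedure above (my own code, not from the talk), run on the worked example's numbers:

```python
def greedy_diversify(docs, intents, P, V, k):
    """Repeatedly pick the doc with the highest marginal probability of
    satisfying a still-unsatisfied intent, then discount U(c|q).
    P[c] = P(c|q); V[d][c] = probability that d satisfies (q, c)."""
    U = dict(P)                      # U(c|q) starts at P(c|q)
    remaining, S = list(docs), []
    while len(S) < k and remaining:
        d_star = max(remaining,      # d* = argmax_d sum_c U(c|q) * V(d|q,c)
                     key=lambda d: sum(U[c] * V[d].get(c, 0.0)
                                       for c in intents))
        S.append(d_star)
        remaining.remove(d_star)
        for c in intents:            # discount covered intents
            U[c] *= 1.0 - V[d_star].get(c, 0.0)
    return S

# Worked example: P(R|q) = 0.8, P(B|q) = 0.2
P = {"R": 0.8, "B": 0.2}
V = {"d1": {"R": 0.9}, "d2": {"R": 0.5}, "d3": {"R": 0.4},
     "d4": {"B": 0.4}, "d5": {"B": 0.4}}
print(greedy_diversify(list(V), ["R", "B"], P, V, k=3))
# -> ['d1', 'd4', 'd5']: neither proportional to the intent
#    distribution nor ordered by raw quality
```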


Actually produces an ordered set of results

Results not proportional to intent distribution

Results not according to (raw) quality

Better results → fewer need to be shown



Express diversity requirements in terms of
desired properties



Define objectives that satisfy these properties



Develop efficient algorithms



Metrics and evaluation methodologies


Approach


Represent real queries


Scale beyond a few user studies


Problem: Hard to define ground truth



Use disambiguated information sources on
the web as the ground truth


Incorporate intent into human judgments


Can exploit the user distribution
(need to be careful)


Query = Wikipedia disambiguation page title


Large-scale ground truth set


Open source


Growing in size



Novelty


Coverage of Wikipedia topics



Relevance


Coverage of top Wikipedia results



Relevance function:

1/position

Can use the search engine score

Maybe use query category information

Distance function:

Compute TF-IDF distances

Jaccard similarity score for two documents A and B:

J(A, B) = |A ∩ B| / |A ∪ B|



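A one-function sketch of the corresponding distance over word sets (the talk's exact featurization, e.g. TF-IDF term sets, isn't specified, so this tokenization is an assumption):

```python
def jaccard_distance(doc_a: str, doc_b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lowercase word sets (symmetric)."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0
```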

Topics/categories = list of disambiguation topics

Given a set S_k of results:

For each result, compute a distribution over topics (using our d(∙, ∙))

Sum confidence over all topics

Threshold to get # topics represented

Example:
  jaguar.com → Jaguar cat (0.1), Jaguar car (0.9)
  wikipedia.org/jaguar → Jaguar cat (0.8), Jaguar car (0.2)

Category confidence:
  Jaguar cat: 0.1 + 0.8 = 0.9
  Jaguar car: 0.9 + 0.2 = 1.1

With threshold = 1.0:
  Jaguar cat: 0 (not represented)
  Jaguar car: 1 (represented)
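A small sketch of this novelty computation (function and variable names are mine):

```python
from collections import defaultdict

def topics_represented(result_topic_dists, threshold=1.0):
    """Sum per-result topic confidences; count topics whose total
    confidence clears the threshold."""
    total = defaultdict(float)
    for dist in result_topic_dists:          # one {topic: conf} per result
        for topic, conf in dist.items():
            total[topic] += conf
    return sum(1 for conf in total.values() if conf >= threshold)

results = [{"Jaguar cat": 0.1, "Jaguar car": 0.9},   # jaguar.com
           {"Jaguar cat": 0.8, "Jaguar car": 0.2}]   # wikipedia.org/jaguar
print(topics_represented(results))  # -> 1 (only "Jaguar car" reaches 1.0)
```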


Query → get ranking of search restricted to Wikipedia pages

a(i) = position of Wikipedia topic i in this list

b(i) = position of Wikipedia topic i in the list being tested

Relevance is measured in terms of the reciprocal ranks 1/a(i) and 1/b(i)




Take expectation over distribution of intents


Interpretation: how will the average user feel?


Consider NDCG@k

Classic: NDCG(S; k | c) = DCG(S; k | c) / DCG_ideal(S; k | c)

Intent-aware: NDCG-IA(S; k) = Σ_c P(c | q) ∙ NDCG(S; k | c)

NDCG-IA depends on the intent distribution and the intent-specific NDCG
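A sketch of the intent-aware aggregation (per-intent gain vectors are assumed given; log2 discounting is the standard choice, not stated on the slide):

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_ia(per_intent_gains, P, k):
    """NDCG-IA(S; k) = sum_c P(c|q) * DCG(S; k|c) / DCG_ideal(S; k|c).
    per_intent_gains[c] lists the gains of the ranked results under c."""
    total = 0.0
    for c, gains in per_intent_gains.items():
        ideal = dcg(sorted(gains, reverse=True)[:k])
        if ideal > 0:
            total += P[c] * dcg(gains[:k]) / ideal
    return total
```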


Created two types of HITs on Mechanical Turk

Query classification: workers are asked to choose among three interpretations

Document rating (under the given interpretation)

Two additional evaluations

MT classification + current ratings

MT classification + MT document ratings



When is it right to diversify?


Users have certain expectations about the workings
of a search engine


What is the best way to diversify?


Evaluate approaches beyond diversifying the retrieved results


Metrics that capture both relevance and
diversity


Some preliminary work suggests that there will be
certain trade-offs to make



Otherwise, need to encode explicit user
model in the metric


Selection only needs k (which is 10)



Later, can rank set according to relevance


Personalize based on clicks



Alternative to stability:


Select sets repeatedly (this loses information)


Could refine selection online, based on user clicks

[Figure: histogram (frequency count vs. normalized novelty difference) of the novelty difference between diversified and original results over 650 ambiguous queries, for Max-sum and Max-min]

[Figure: histogram (frequency count vs. normalized relevance difference) of the relevance difference between diversified and original results over the same 650 ambiguous queries]


Results for query “cd player”

Relevance: popularity

Distance: from product hierarchy

[Figure: novelty for Max-sum as a function of the novelty-measuring threshold, for λ ∈ {0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8}; y-axis: fractional difference in novelty]
[Figure: bar charts of MAP-IA@{3, 5, 10}, NDCG-IA@{3, 5, 10}, and MRR-IA@{3, 5, 10}, comparing the diversified ranking (“Diverse”) against Engines 1–3]

Many metrics for relevance

Normalized discounted cumulative gain at k (NDCG@k)

Mean average precision at k (MAP@k)

Mean reciprocal rank (MRR)

Some metrics for diversity

Maximal marginal relevance (MMR) [CG98]

Nugget-based instantiation of NDCG [C+08]

Want a metric that can take into account both relevance and diversity [JK00]

DIVERSIFY(k)

Given a query q, a set of documents D, distribution P(c | q), quality estimates V(d | c, q), and integer k

Find a set of docs S ⊆ D with |S| = k that maximizes

P(S | q) = Σ_c P(c | q) ∙ (1 − Π_{d ∈ S} (1 − V(d | q, c)))

interpreted as the probability that the set S is relevant to the query over all possible intents: the sum covers the multiple intents, and the inner term is the probability of finding at least one relevant doc for intent c
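A direct transcription of this objective (my code; names are mine):

```python
def set_relevance(S, intents, P, V):
    """P(S|q) = sum_c P(c|q) * (1 - prod_{d in S} (1 - V(d|q,c))).
    Expected probability that S contains at least one doc relevant
    to the user's intent."""
    total = 0.0
    for c in intents:
        miss = 1.0                   # prob. that no doc in S satisfies c
        for d in S:
            miss *= 1.0 - V[d].get(c, 0.0)
        total += P[c] * (1.0 - miss)
    return total
```

The greedy procedure shown earlier adds, at each step, the document with the largest marginal gain in this objective.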


Makes explicit use of taxonomy


In contrast, similarity-based: [CG98], [CK06], [RKJ08]


Captures both diversification and doc relevance


In contrast, coverage-based: [Z+05], [C+08], [V+08]


Specific form of “loss minimization”
[Z02], [ZL06]



“Diminishing returns” for docs w/ the same intent


Objective is order-independent


Assumes that all users read k results


May want to optimize Σ_k P(k) ∙ P(S_k | q), where P(k) is the probability that a user reads the top k results