Rank Aggregation Methods II Experiments

Dec 4, 2013

CS728

Lecture 12


Recall the Rank Aggregation Problem

m candidates (a.k.a. “alternatives”)
M = {1,…,m}: set of candidates

n voters (a.k.a. “agents” or “judges”)
N = {1,…,n}: set of voters

Each voter i has a ranking π_i on M
π_i(a) < π_i(b) means the i-th voter prefers a to b
Ranking may be a total or partial order

The rank aggregation problem:
Combine π_1,…,π_n into a single ranking π on M, which represents the “social choice” of the voters.

Rank aggregation function: f(π_1,…,π_n) = π
π may be a total or partial order



Experiments: Distance Measures

Goal: Quantitatively compare different rank aggregation methods.

Performance Measures (S is the set of ranked elements):

(1) Spearman footrule distance is the sum of pointwise distances, i.e. the absolute differences between each element's positions in the two lists. It is normalized by dividing by the maximum value $(1/2)|S|^2$, giving a value between 0 and 1.

(2) Kendall tau distance counts the number of pairwise disagreements. Dividing by the maximum possible value $(1/2)|S|(|S|-1)$ gives a normalized version, with a value between 0 and 1.

(3) The induced footrule distance is obtained by projecting a full list s onto the elements of each partial list and taking the footrule distance between the projection and that partial list. The induced Kendall tau distance is defined in the same manner.

(4) The scaled footrule distance weights contributions of elements based on the length of the lists they are present in. If s is a full list and t is a partial list, then:

$SF(s, t) = \sum_{i \in t} \left| \frac{s(i)}{|s|} - \frac{t(i)}{|t|} \right|$

Normalize SF by dividing by $|t|/2$.
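
The four measures above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, lists are assumed to be Python sequences ordered best first, positions are 1-based, and each partial list is assumed to contain only items of the full list (with at least two items for the Kendall tau normalizer).

```python
from itertools import combinations


def positions(ranking):
    """Map each item to its 1-based position in the ranking."""
    return {item: rank for rank, item in enumerate(ranking, start=1)}


def footrule(s, t):
    """Normalized Spearman footrule between two full lists over the same items."""
    ps, pt = positions(s), positions(t)
    total = sum(abs(ps[x] - pt[x]) for x in ps)
    return total / (0.5 * len(s) ** 2)


def kendall_tau(s, t):
    """Normalized Kendall tau distance: fraction of pairs ordered differently."""
    ps, pt = positions(s), positions(t)
    disagreements = sum(
        1 for a, b in combinations(s, 2)
        if (ps[a] - ps[b]) * (pt[a] - pt[b]) < 0
    )
    n = len(s)
    return disagreements / (0.5 * n * (n - 1))


def induced_footrule(s, t):
    """Footrule between partial list t and the projection of full list s onto t."""
    members = set(t)
    projection = [x for x in s if x in members]
    return footrule(projection, t)


def scaled_footrule(s, t):
    """Normalized scaled footrule between full list s and partial list t."""
    ps, pt = positions(s), positions(t)
    total = sum(abs(ps[x] / len(s) - pt[x] / len(t)) for x in t)
    return total / (len(t) / 2)


full = ["a", "b", "c", "d"]
print(footrule(full, full[::-1]))        # full list against its reversal
print(kendall_tau(full, full[::-1]))
print(induced_footrule(full, ["d", "a"]))
print(scaled_footrule(full, ["b", "a"]))
```

An induced Kendall tau distance follows the same pattern as `induced_footrule`, with `kendall_tau` applied to the projection.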

Experiments: Distance Measures

So for each aggregation method and each distance measure we get a vector of values, each component representing the distance from the aggregation to one voter list

Simplest is to take the average (or 1-norm)

Other norms are interesting

Mean square distance (2-norm)

Max distance (∞-norm)
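
As a small sketch of these summaries, the snippet below reduces a vector of per-voter distances to the three quantities named above. The function name and the sample values are hypothetical; the mean of squares is used for the 2-norm variant, which orders aggregation methods the same way as the 2-norm itself when the number of voters is fixed.

```python
def summarize(distances):
    """Reduce a vector of per-voter distances to three summary values."""
    n = len(distances)
    return {
        "average": sum(distances) / n,                      # 1-norm divided by n
        "mean_square": sum(d * d for d in distances) / n,   # squared 2-norm divided by n
        "max": max(distances),                              # infinity-norm
    }


# Hypothetical distances from one aggregated list to three voter lists.
summary = summarize([0.0, 0.5, 1.0])
print(summary)
```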


Experiments: Minimizing Average

Search engines: Altavista (AV), Alltheweb (AW), Excite (EX), Google (GG), Hotbot (HB), Lycos (LY), and Northernlight (NL)

K = Kendall distance
SF = scaled footrule distance
IF = induced footrule distance
LK = Local Kemenization

Experiments in Spam Filtering

Define spam to be web pages that are ranked low by majority opinion (machine and human; a simplifying assumption), although they may be ranked highly by some search engines

Intuition: if a page spams most search engines for a particular query, then no combination of these search engines can filter the spam: garbage in, garbage out.

Spam pages are the Condorcet losers, and will occupy the bottom of any ranking that satisfies the extended Condorcet criterion

Similarly, good pages will be among the Condorcet winners, and will rank above the losers.


Condorcet Criterion

A candidate of M that beats every other candidate in pairwise simple majority voting should be ranked first.

Extended Condorcet Criterion (XCC):

Version 1: If most voters prefer candidate a to candidate b (i.e., the number of i such that π_i(a) < π_i(b) is more than n/2), then π should also prefer a to b (i.e., π(a) < π(b)).

Version 2: If there is a partition (W, L) of M such that for any x in W and y in L the majority prefers x to y, then x must be ranked above y. W is called the Condorcet winners and L the Condorcet losers
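
The two criteria can be checked mechanically. The following is a sketch under stated assumptions: all function names are hypothetical, lists are ordered best first, a "majority" means that more lists rank a ahead of b than the other way round, and a list that omits either candidate abstains on that pair.

```python
def prefers(lists, a, b):
    """Number of lists that rank a ahead of b (lists missing either one abstain)."""
    return sum(1 for t in lists if a in t and b in t and t.index(a) < t.index(b))


def majority_prefers(lists, a, b):
    """True if more lists rank a ahead of b than b ahead of a."""
    return prefers(lists, a, b) > prefers(lists, b, a)


def condorcet_winner(lists, candidates):
    """Return the candidate that beats every other one pairwise, or None."""
    for c in candidates:
        if all(majority_prefers(lists, c, o) for o in candidates if o != c):
            return c
    return None


def violates_xcc2(ranking, lists, winners, losers):
    """True if (winners, losers) is a Condorcet partition yet some loser outranks a winner."""
    is_partition = all(majority_prefers(lists, x, y) for x in winners for y in losers)
    pos = {x: r for r, x in enumerate(ranking)}
    return is_partition and any(pos[y] < pos[x] for x in winners for y in losers)


# Hypothetical three-voter profiles over candidates a, b, c.
voters = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
cycle = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]

print(condorcet_winner(voters, ["a", "b", "c"]))
print(condorcet_winner(cycle, ["a", "b", "c"]))
print(violates_xcc2(["c", "a", "b"], voters, ["a", "b"], ["c"]))
```

In the cyclic profile every candidate loses one pairwise contest, so no Condorcet winner exists.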




Condorcet Criteria

XCC(2) and SPAM Filtering


Note that XCC(1) => XCC(2), so Version 1 is stronger

But XCC(1) is not always realizable

As we will see, XCC(2) is always realizable via Local Kemenization

Hence using rank aggregation with XCC(2) should assist in SPAM filtering, since Condorcet losers will be ranked lowest

Let us look at where spam pages (human determined) are ranked by good aggregation methods.

Experiments: Filtering SPAM

Table 3: Ranks of "spam" pages for the queries: Feng Shui, organic vegetables and gardening. (- = no rank listed)

url                          AV   AW   GG   HB   LY   NL   SFO   MC4
www.lucky-bamboo.com          4   43    -    -   41    -   144    63
www.cambriumcrystals.com      -    9   51    -    5    -    31    59
www.luckycat.com             11   14   26    -   13    -    49    36
www.davesorganics.com        84   19    1    -   17    -    77    93
www.frozen.ch                 -    9    -   63   11    -    49   121
www.eonseed.com               -   18    -    6   16    -    23    66
www.augusthome.com           26   16    -   27   12   16    57    54
www.taunton.com               -   25    -    -   21    -    78    67
www.egroups.com               -   34    -    -   29    -   108   101



Experiment: Word association

Different search engines and portals have different (default) semantics for handling a multi-word query.

Some use OR semantics (documents contain at least one of the given query terms) while Google uses AND semantics (all the query words must appear). Both are inconvenient in many situations.

Consider searching for a software engineering job in an on-line job database. The user lists a number of skills and a number of potential keywords in the job description, for example, "Silicon Valley C++ Java CORBA TCP-IP algorithms start-up pre-IPO stock options". It is clear that the "AND" rule might produce no document or SPAM, and the "OR" rule is equally disastrous.

Experiment with rank aggregation using multiple queries based on small subsets of terms.
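
One way to generate such sub-queries is sketched below. The slides do not say how the subsets were chosen or how large they were, so the function name and the subset size of 2 are assumptions for illustration only. Each sub-query would then be sent to a search engine, and the result lists aggregated.

```python
from itertools import combinations


def subset_queries(terms, size):
    """Build one query string per subset of the given size (an assumed scheme)."""
    return [" ".join(subset) for subset in combinations(terms, size)]


terms = ["madras", "madurai", "coimbatore", "vellore"]
queries = subset_queries(terms, 2)
for q in queries:
    print(q)
```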



Results for query: madras madurai coimbatore vellore
(cities in the state of Tamil Nadu, India)

Google
www.mssrf.org/Fris9809/location-tamilnadu.html
www.indiaplus.com/Info/schools.html
www.focustamilnadu.com/tamilnadu/Policy%20Note...Forests.html
www.tn.gov.in/policy/environ.htm
www.indiacolleges.com/Tamil_Nadu.htm

SFO with LK
www.madurai.com
www.ozemail.com.au/clday/locations.htm
www.utoledo.edu/homepages/speelam/coimbatore.html
www.ozemail.com.au/clday/madras.htm
www.madurai.com/around.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html

MC4 with LK
www.madurai.com
www.surfindia.com/omsakthi/tourism.htm
www.indiatraveltimes.com/tamilnadu/tamil1.html
www.indiatraveltimes.com/tamilnadu/tamil2.html
www.indiatravels.com/forts/vellore_fort.htm
www.india-tourism.de/english/south/tamil_nadu.html









Locally Kemeny optimal aggregation and XCC(2)

Many existing aggregation methods do not satisfy XCC(1) or XCC(2).

It is possible to use your favorite aggregation method to obtain a full list, then apply local Kemenization to realize XCC(2), which filters Condorcet losers.

Locally Kemeny optimal

Recall that computing a Kemeny optimal aggregation is NP-hard

Definition of locally optimal

A permutation p is a locally Kemeny optimal aggregation of partial lists t1, t2, ..., tk if there is no permutation p' that can be obtained from p by performing a single transposition of an adjacent pair of elements and for which the Kendall distance satisfies

K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk).

In other words, it is impossible to reduce the total distance to the t's by flipping an adjacent pair.

Example of LKO but not KO

Example 1

t1 = (1,2), t2 = (2,3), t3 = t4 = t5 = (3,1).

p = (1,2,3)

Here p satisfies the definition of LKO with K(p, t1, t2, ..., t5) = 3, since either adjacent transposition raises the sum to 4. But p is not Kemeny optimal: transposing 1 and 3 decreases the sum to 2.
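
Example 1 can be verified with a short script. This is a sketch with hypothetical function names. It assumes the Kendall distance to a partial list counts only the pairs that the partial list ranks, and that every item of a partial list appears in p.

```python
def kendall_to_partial(p, t):
    """Number of pairs ranked by partial list t that p orders the other way."""
    pos = {x: r for r, x in enumerate(p)}
    return sum(
        1
        for i in range(len(t))
        for j in range(i + 1, len(t))
        if pos[t[i]] > pos[t[j]]
    )


def total_kendall(p, lists):
    """Sum of Kendall distances from p to every partial list."""
    return sum(kendall_to_partial(p, t) for t in lists)


def is_locally_kemeny_optimal(p, lists):
    """True if no single adjacent transposition lowers the total distance."""
    base = total_kendall(p, lists)
    for i in range(len(p) - 1):
        q = list(p)
        q[i], q[i + 1] = q[i + 1], q[i]
        if total_kendall(q, lists) < base:
            return False
    return True


lists = [(1, 2), (2, 3), (3, 1), (3, 1), (3, 1)]
p = (1, 2, 3)

print(total_kendall(p, lists))              # total distance of p
print(is_locally_kemeny_optimal(p, lists))  # adjacent swaps do not help
print(total_kendall((3, 2, 1), lists))      # after transposing 1 and 3
```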


LKO satisfies XCC(2)

Proof by contradiction

If the result is false then there exist partial lists t1, t2, ..., tk, an LKO aggregation p, and a partition (W, L) for which p violates XCC(2); that is, there is some pair c in W and d in L such that p(d) < p(c). Let (d, c) be the closest such pair in p.

Consider the immediate successor of d in p, call it e. If e = c then c is adjacent to d in p. Since a majority prefers c to d, transposing this adjacent pair of alternatives produces a p' such that K(p', t1, t2, ..., tk) < K(p, t1, t2, ..., tk), contradicting the assumption on p.

If e does not equal c, then either e is in W, in which case the pair (d, e) is a closer pair in p than (d, c) and also violates XCC(2), or e is in L, in which case (e, c) is a closer pair than (d, c) that violates XCC(2). Both cases contradict the choice of (d, c).


A local Kemenization of a full list μ with respect to preference lists t1, t2, ..., tk is a procedure that computes a locally Kemeny optimal aggregation of the t's that is maximally consistent with μ.

This approach:

(1) preserves the strengths of the initial aggregation

(2) ranks non-spam above spam.

(3) gives a result that disagrees with μ on any pair (i, j) only if a majority endorses this disagreement.

(4) for every d, 1 ≤ d ≤ |μ|, the restriction of the output to the top d elements of μ is a local Kemenization of the top d elements of μ

Local Kemenization procedure



A simple inductive construction.

Assume inductively that we have constructed p, a local Kemenization of the projection of the t's onto the elements μ(1), ..., μ(l-1).

Insert the next element x = μ(l) into the lowest-ranked "permissible" position in p: just below the lowest-ranked element y in p such that

(a) no majority among the (original) t's prefers x to y, and

(b) for all successors z of y in p there is a majority that prefers x to z.

In other words, we try to insert x at the end (bottom) of the list p; we bubble it up toward the top of the list as long as a majority of the t's insists that we do.

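Example local kemenization procedure

Preference lists (best first):
t1 = A B F E C D
t2 = B C A E F D
t3 = A C F D E B
t4 = B F D C A E
t5 = C A B F E D

Initial list μ = B A D C E F

Insertion steps:
B
B A  ->  A B        (the lists disagree: A>B: 3, A<B: 2, so A moves above B)
A B D               (B>D: 4, B<D: 1, so D stays below B)
A B D C  ->  A B C D
A B C F E D         (final list, after inserting E and F)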
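
The bubbling procedure and the example above can be reproduced with the following sketch. It reads the six letter rows of the example as five preference lists followed by the initial list, an interpretation that matches the vote counts shown (A>B: 3, B>D: 4). The function names are hypothetical, and "majority" is taken to mean that more lists rank x ahead of y than the reverse, with lists that omit either element abstaining.

```python
def majority_prefers(lists, x, y):
    """True if more lists rank x ahead of y than y ahead of x."""
    for_x = sum(1 for t in lists if x in t and y in t and t.index(x) < t.index(y))
    for_y = sum(1 for t in lists if x in t and y in t and t.index(y) < t.index(x))
    return for_x > for_y


def local_kemenization(mu, lists):
    """Insert the elements of mu one by one, bubbling each up while a majority insists."""
    p = []
    for x in mu:
        pos = len(p)  # start at the bottom of the current list
        while pos > 0 and majority_prefers(lists, x, p[pos - 1]):
            pos -= 1  # a majority prefers x to the element just above it
        p.insert(pos, x)
    return p


voters = ["ABFECD", "BCAEFD", "ACFDEB", "BFDCAE", "CABFED"]
mu = "BADCEF"

result = "".join(local_kemenization(mu, voters))
print(result)  # ABCFED
```

Each insertion compares x only with the element directly above it, so the procedure needs at most one majority test per position passed.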

RA and Searching Workplace Web

Axiom 1: Intranet documents are not spam

Axiom 2: Queries usually have unique answers (not broad topic based)

Axiom 3: Intranet docs are not search engine friendly (docs are accessed through portals and database queries)

Rank aggregation allows us to combine a number of heuristic alternatives: static and dynamic, query dependent and independent