Weighted Clustering


Margareta Ackerman







Work with Shai Ben-David, Simina Branzei, and David Loker



Clustering is one of the most widely used tools for exploratory data analysis.

Social Sciences
Biology
Astronomy
Computer Science
...

All apply clustering to gain a first understanding of the structure of large data sets.


The Theory-Practice Gap

"While the interest in and application of cluster analysis has been rising rapidly, the abstract nature of the tool is still poorly understood" (Wright, 1973)

"There has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model" (Kleinberg, 2002)

Both statements still apply today.

The Theory-Practice Gap

Clustering aims to assign data into groups of similar items.

Beyond that, there is very little consensus on the definition of clustering.

Inherent Obstacles: Clustering is ill-defined

Clustering is inherently ambiguous:
There may be multiple reasonable clusterings.
There is usually no ground truth.

There are many clustering algorithms with different (often implicit) objective functions.

Different algorithms have radically different input-output behaviour.


Differences in Input/Output Behavior of Clustering Algorithms

There are a wide variety of clustering algorithms, which can produce very different clusterings.

Clustering Algorithm Selection

How should a user decide which algorithm to use for a given application?

Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc.

There is inadequate emphasis on input-output behaviour.

Clustering Algorithm Selection

We propose a framework that lets a user utilize prior knowledge to select an algorithm.

Identify properties that distinguish between different input-output behaviours of clustering paradigms.

The properties should be:
1) Intuitive and "user-friendly"
2) Useful for distinguishing clustering algorithms

Our Framework for Algorithm Selection

In essence, our goal is to understand fundamental differences between clustering methods, and convey them formally, clearly, and as simply as possible.

Previous Work

Axiomatic perspective
Impossibility Result: Kleinberg (NIPS, 2003)
Consistent axioms for quality measures: Ackerman & Ben-David (NIPS, 2009)
Axioms in the weighted setting: Wright (Pattern Recognition, 1973)

Previous Work

Characterizations of Single-Linkage
Partitional Setting: Bosagh Zadeh and Ben-David (UAI, 2009)
Hierarchical Setting: Jardine and Sibson (Mathematical Taxonomy, 1971) and Carlsson and Memoli (JMLR, 2010)

Characterizations of Linkage-Based Clustering
Partitional Setting: Ackerman, Ben-David, and Loker (COLT, 2010)
Hierarchical Setting: Ackerman & Ben-David (IJCAI, 2011)

Previous Work

Classifications of clustering methods
Fisher and Van Ness (Biometrika, 1971)
Ackerman, Ben-David, and Loker (NIPS, 2010)

What's Left To Be Done?

Despite much work on clustering properties, some basic questions remain unanswered.

Consider some of the most popular clustering methods: k-means, single-linkage, average-linkage, etc.

What are the advantages of k-means over other methods?

Previous classifications are missing key properties.

Our Contributions (at a high level)

We identify 3 fundamental categories that clearly delineate some essential differences between common clustering methods.

The strength of these categories lies in their simplicity.

We hope this gives insight into core differences between popular clustering methods.

To define these categories, we first present the weighted clustering setting.

Outline

Formal framework
Categories and classification
A result from each category
Conclusions and future work




Weighted Clustering

Every element is associated with a real-valued weight, representing its mass or importance.

This generalizes the notion of element duplication.

Algorithm design, particularly the design of approximation algorithms, is often done in this framework.
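To make the duplication analogy concrete, here is a minimal sketch (my own, not from the slides): giving a point an integer weight m affects a weighted statistic, such as the weighted mean, exactly as repeating that point m times would.

```python
# Sketch: an integer weight m on a point behaves like m duplicates of that point.
points  = [0.0, 1.0, 5.0]
weights = [1, 1, 3]                      # the point 5.0 carries weight 3

weighted_mean = sum(w * x for x, w in zip(points, weights)) / sum(weights)

duplicated = [0.0, 1.0, 5.0, 5.0, 5.0]   # same data with 5.0 repeated 3 times
plain_mean = sum(duplicated) / len(duplicated)

print(weighted_mean, plain_mean)         # both print 3.2
```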




Other Reasons to Add Weight: An Example

Apply clustering to facility allocation, such as the placement of police stations in a new district.

The distribution of stations should enable quick access to most areas in the district.

The accessibility of different institutions to a station may have varying importance.

The weighted setting enables a convenient method for prioritizing certain landmarks.








Algorithms in the Weighted Clustering Setting

Traditional clustering algorithms can be readily translated into the weighted setting by considering their behavior on data containing element duplicates.



Formal Setting

For a finite domain set X, a weight function

    w : X → R+

defines the weight of every element.

For a finite domain set X, a distance function

    d : X × X → R+ ∪ {0}

defines the distance between the domain points.

(X, d) denotes unweighted data.
(w[X], d) denotes weighted data.

Formal Setting: Partitional Clustering Algorithm

A partitional algorithm maps
Input: (w[X], d, k)
to
Output: a k-partition (k-clustering) of X.

Formal Setting: Hierarchical Clustering Algorithm

A hierarchical algorithm maps
Input: (w[X], d)
to
Output: a dendrogram of X.

A dendrogram of (X, d) is a strictly binary tree whose leaves correspond to the elements of X.

A clustering C appears in A(w[X], d) if its clusters are in the dendrogram.
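To illustrate the "appears in" relation, here is a small toy sketch (my own encoding, not from the slides): a dendrogram is written as nested pairs of subtrees, and every subtree contributes the cluster of its leaves.

```python
def clusters_of(dendrogram):
    """Every subtree of a dendrogram induces a cluster: the set of its leaves."""
    if not isinstance(dendrogram, tuple):            # a leaf is a single element
        return {frozenset([dendrogram])}
    left, right = dendrogram
    below = clusters_of(left) | clusters_of(right)
    return below | {frozenset().union(*below)}       # add the cluster at this node

def appears_in(clustering, dendrogram):
    """A clustering appears in the dendrogram if each of its clusters is a subtree."""
    return all(frozenset(c) in clusters_of(dendrogram) for c in clustering)

D = (('a', 'b'), ('c', 'd'))                         # a strictly binary dendrogram
print(appears_in([{'a', 'b'}, {'c', 'd'}], D))       # True
print(appears_in([{'a', 'c'}, {'b', 'd'}], D))       # False
```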

Our Contributions

We utilize the weighted framework to identify 3 fundamental categories describing how algorithms respond to weight.

We classify traditional algorithms according to these categories.

We fully characterize when different algorithms react to weight.

Towards Basic Categories
Range(X, d)

PARTITIONAL:
Range(A(X, d, k)) = {C | ∃w s.t. C = A(w[X], d, k)}

The set of clusterings that A outputs on (X, d) over all possible weight functions.

HIERARCHICAL:
Range(A(X, d)) = {D | ∃w s.t. D = A(w[X], d)}

The set of dendrograms that A outputs on (X, d) over all possible weight functions.

Outline

Formal framework
Categories and classification
A result from each category
Conclusions and future work

Categories: Weight Robust

A is weight-robust if for all (X, d), |Range(X, d)| = 1.

A never responds to weight.

Categories: Weight Sensitive

A is weight-sensitive if for all (X, d), |Range(X, d)| > 1.

A always responds to weight.

Categories: Weight Considering

An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.

A responds to weight on some data sets, but not others.


Summary of Categories

Partitional: Range(A(X, d, k)) = {C | ∃w such that A(w[X], d, k) = C}
Hierarchical: Range(A(X, d)) = {D | ∃w such that A(w[X], d) = D}

Weight-robust: for all (X, d), |Range(X, d)| = 1.

Weight-sensitive: for all (X, d), |Range(X, d)| > 1.

Weight-considering:
1) ∃(X, d) where |Range(X, d)| = 1.
2) ∃(X, d) where |Range(X, d)| > 1.


Connecting To Applications

In the facility allocation example above, a weight-sensitive algorithm may be preferred.

In phylogeny, where sampling procedures can be highly biased, some degree of weight robustness may be desired.

The desired category depends on the application.

Classification

Weight Robust
  Partitional: Min Diameter, K-center
  Hierarchical: Single Linkage, Complete Linkage
Weight Sensitive
  Partitional: K-means, k-medoids, k-median, min-sum
  Hierarchical: Ward's Method, Bisecting K-means
Weight Considering
  Partitional: Ratio Cut
  Hierarchical: Average Linkage

For the weight-considering algorithms, we fully characterize when they are sensitive to weight.

Outline

Formal framework
Categories and classification
A result from each category
Classification of heuristics
Conclusions and future work

Classification

Weight Robust
  Partitional: Min Diameter, K-center
  Hierarchical: Single Linkage, Complete Linkage
Weight Sensitive
  Partitional: K-means, k-medoids, k-median, min-sum
  Hierarchical: Ward's Method, Bisecting K-means
Weight Considering
  Partitional: Ratio Cut
  Hierarchical: Average Linkage






Zooming Into: Weight Sensitive Algorithms

We show that k-means is weight-sensitive.

A is weight-separable if for any data set (X, d) and any subset S of X with at most k points, there exists w so that A(w[X], d, k) separates all points of S.

Fact: Every algorithm that is weight-separable is also weight-sensitive.

K-means is Weight-Sensitive

Theorem: k-means is weight-sensitive.

Proof:
Show that k-means is weight-separable.
Consider any (X, d) on at least k points and any subset S of at most k points.
Increase the weight of the points in S until each belongs to a distinct cluster.
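A brute-force illustration of the proof idea (my own sketch, not taken from the paper): under the weighted k-means objective, where each cluster centre sits at the weighted mean of its members, making two points heavy forces them into distinct clusters on a toy 1-D data set, so the output changes with the weights.

```python
from itertools import product

def weighted_kmeans_cost(points, weights, labels, k):
    """Weighted k-means cost: sum of w(x) * (x - centre)^2, with each cluster
    centre placed at the weighted mean of its members."""
    cost = 0.0
    for j in range(k):
        members = [(x, w) for x, w, l in zip(points, weights, labels) if l == j]
        if not members:
            return float("inf")                          # forbid empty clusters
        total = sum(w for _, w in members)
        centre = sum(w * x for x, w in members) / total
        cost += sum(w * (x - centre) ** 2 for x, w in members)
    return cost

def optimal_clustering(points, weights, k=2):
    """Exhaustive search over all labelings (fine for tiny examples only)."""
    return min(product(range(k), repeat=len(points)),
               key=lambda labels: weighted_kmeans_cost(points, weights, labels, k))

X = [0.0, 1.0, 5.0, 6.0]
print(optimal_clustering(X, [1, 1, 1, 1]))        # (0, 0, 1, 1): {0, 1} vs {5, 6}
print(optimal_clustering(X, [1, 1, 100, 100]))    # (0, 0, 0, 1): heavy 5 and 6 are split
```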





Zooming Into: Weight Considering Algorithms

We show that Average-Linkage is weight-considering.

We characterize the precise conditions under which it is sensitive to weight.

Recall: An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.

Average Linkage

Average-Linkage is a hierarchical algorithm.

It starts by creating a leaf for every element.

It then repeatedly merges the "closest" clusters using the following linkage function: the average weighted distance between clusters.
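The formula itself did not survive extraction. A standard way to write a weighted "average distance between clusters", consistent with the caption above (my assumption, not a verbatim copy of the paper's definition), is:

```python
def average_linkage(A, B, w, d):
    """Weighted average linkage between clusters A and B: the mean between-cluster
    distance, with each pair (a, b) weighted by w[a] * w[b]."""
    numerator = sum(w[a] * w[b] * d(a, b) for a in A for b in B)
    denominator = sum(w[a] for a in A) * sum(w[b] for b in B)
    return numerator / denominator
```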








Average Linkage is Weight Considering

(X, d) where |Range(X, d)| = 1:

The same dendrogram is output for every weight function.

[Figure: four points A, B, C, D and the single dendrogram over A, B, C, D.]

Average Linkage is Weight Considering

(X, d) where |Range(X, d)| > 1:

[Figure: five points A, B, C, D, E with distances 1, 1+ϵ, and 2+2ϵ; different weight functions yield different dendrograms over A, B, C, D, E.]
41

When is Average Linkage

Sensitive to Weight?






We showed that Average
-
Linkage is weight
-
considering.


Can we show
when

it is sensitive to weight?


We provide a complete characterization of
when Average
-
Linkage is sensitive to
weight, and when it is not.












42






Nice Clustering

A clustering is nice if every point is closer to all points within its cluster than to all other points.

[Figures: two example clusterings that are nice, and one that is not nice.]
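The definition translates directly into a small check (my own sketch, assuming `clusters` is a list of point collections and `d` a symmetric distance function):

```python
def is_nice(clusters, d):
    """A clustering is nice if every point is closer to all points in its own
    cluster than to all points in any other cluster."""
    for C in clusters:
        outside = [z for D in clusters if D is not C for z in D]
        for x in C:
            within = [d(x, y) for y in C if y != x]
            if within and outside and max(within) >= min(d(x, z) for z in outside):
                return False
    return True
```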






Characterizing When Average Linkage is Sensitive to Weight

Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.

A dendrogram is nice if all of its clusterings are nice.
Characterizing When Average Linkage is Sensitive to Weight: Proof

Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.

Proof: Show that:
1) If there is a nice dendrogram for (X, d), then Average-Linkage outputs it.
2) If a clustering that is not nice appears in the dendrogram AL(w[X], d) for some w, then |Range(AL(X, d))| > 1.


















Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)

Lemma: If there is a nice dendrogram for (X, d), then Average-Linkage outputs it.

Proof Sketch:
1) Assume that (w[X], d) has a nice dendrogram.
2) Main idea: Show that every nice clustering of the data appears in AL(w[X], d).
3) For that, we show that each cluster in a nice clustering is formed by the algorithm.






Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)

Given a nice clustering C, it can be shown that for any clusters C_i and C_j of C, any disjoint subsets Y and Z of C_i, and any subset W of C_j, Y and Z are closer (under the linkage function) than Y and W.

This implies that C appears in the dendrogram.









Characterizing When Average Linkage Responds to Weight: Proof (cont.)

Lemma: If a clustering C that is not nice appears in AL(w[X], d), then |Range(AL(X, d))| > 1.

Proof:
Since C is not nice, there exist points x, y, and z such that
x and y belong to the same cluster in C,
x and z belong to different clusters,
yet d(x, z) < d(x, y).

If x, y, and z are sufficiently heavier than all other points, then x and z will be merged before x and y, so C will not be formed.

Characterizing When Average Linkage is Sensitive to Weight

Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.

Average Linkage is robust to weight whenever there is a dendrogram of (X, d) consisting of only nice clusterings, and it is sensitive to weight otherwise.

Zooming Into: Weight Robust Algorithms

These algorithms are invariant to element duplication.

Example: Min-Diameter returns a clustering that minimizes the length of the longest within-cluster edge.

As this quantity is not affected by the number of points (or weight) at any location, Min-Diameter is weight-robust.
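A sketch of why this holds (mine, not from the slides): the Min-Diameter cost never mentions weights, so re-weighting or duplicating points cannot change which clustering is optimal.

```python
def diameter_cost(clusters, d):
    """Cost minimized by Min-Diameter: the longest within-cluster distance.
    Weights do not appear anywhere in the formula, so the optimal clustering
    is the same for every weight function: the algorithm is weight-robust."""
    return max((d(x, y) for C in clusters for x in C for y in C if x != y),
               default=0.0)
```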



Outline

Introduce framework
Present categories and classification
Show several results from different categories
Conclusions and future work

Conclusions

We introduced three basic categories describing how algorithms respond to weights.
We characterized the precise conditions under which algorithms respond to weights.
The same results apply in the non-weighted setting for data duplicates.
This classification can be used to help select clustering algorithms for specific applications.

Future Directions

Capture differences between objective functions similar to k-means (e.g. k-median, k-medoids, min-sum).
Show bounds on the size of the Range of weight-considering and weight-sensitive methods.
Analyze clustering algorithms for categorical data.
Analyze clustering algorithms with a noise bucket.
Identify properties that are significant for specific clustering applications (some previous work in this direction by Ackerman, Brown, and Loker (ICCABS, 2012)).

Supplementary Material

Characterizing When Ratio Cut is Weight Responsive

Ratio-cut is a similarity-based clustering function.

A clustering is perfect if all within-cluster distances are shorter than all between-cluster distances.

A clustering is separation-uniform if all between-cluster distances are equal.
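The two conditions translate into direct checks (my own sketch; `clusters` is a list of point lists and `d` a distance function):

```python
def is_perfect(clusters, d):
    """Perfect: every within-cluster distance is shorter than every between-cluster distance."""
    within = [d(x, y) for C in clusters for x in C for y in C if x != y]
    between = [d(x, y) for i, C in enumerate(clusters)
               for D in clusters[i + 1:] for x in C for y in D]
    return not within or not between or max(within) < min(between)

def is_separation_uniform(clusters, d):
    """Separation-uniform: all between-cluster distances are equal."""
    between = {d(x, y) for i, C in enumerate(clusters)
               for D in clusters[i + 1:] for x in C for y in D}
    return len(between) <= 1
```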

Characterizing When Ratio Cut is Weight Responsive

Theorem: Given a clustering C of (X, s) where every cluster has more than one element, ratio-cut is weight-responsive on C if and only if either
C is not perfect, or
C is not separation-uniform.

What about heuristics?

Our analysis for k-means and similar methods is for their corresponding objective functions.

Unfortunately, optimal partitions are NP-hard to find. In practice, heuristics such as the Lloyd method are used.

We analyze several variations of the Lloyd method for k-means, as well as the Partitioning Around Medoids (PAM) algorithm for k-medoids.
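For concreteness, here is one plausible way to run Lloyd's method on weighted data (my own sketch, assuming the standard weighted k-means objective): the assignment step ignores weights, while each centre moves to the weighted mean of its cluster.

```python
import numpy as np

def weighted_lloyd(points, weights, centres, iters=20):
    """Lloyd iterations on weighted data: assign each point to its nearest centre,
    then move each centre to the weighted mean of the points assigned to it."""
    points = np.asarray(points, dtype=float)     # shape (n, dims)
    weights = np.asarray(weights, dtype=float)   # shape (n,)
    centres = np.asarray(centres, dtype=float)   # shape (k, dims)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centres)):
            mask = labels == j
            if mask.any():
                centres[j] = np.average(points[mask], axis=0, weights=weights[mask])
    return labels, centres
```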



Heuristics Classification

Weight Sensitive
  Partitional: Lloyd with random initialization, K-means++, PAM
Weight Considering
  Partitional: The Lloyd Method with Furthest Centroid Initialization

Note that the more popular heuristics respond to weights in the same way as the k-means and k-medoids objective functions.

Heuristics Classification

Weight Sensitive
  Partitional: Lloyd with random initialization, K-means++, PAM
Weight Considering
  Partitional: The Lloyd Method with Furthest Centroid Initialization

Just like Average-Linkage, the Lloyd Method with Furthest Centroid initialization responds to weight only on clusterings that are not nice.



Classification

Weight Robust
  Partitional: Min Diameter, K-center
  Hierarchical: Single Linkage, Complete Linkage
Weight Sensitive
  Partitional: K-means, k-medoids, k-median, Min-Sum, Randomized Lloyd, k-means++
  Hierarchical: Ward's Method, Bisecting K-means
Weight Considering
  Partitional: Ratio Cut, Lloyd with Furthest Centroid
  Hierarchical: Average Linkage

Wright's Axioms

In 1973, Wright proposed axioms for clustering in this setting, requiring that:

Points with zero mass can be treated as non-existent.

Multiple points with mass at the same location are equivalent to one point whose weight is the sum of these masses.
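The two axioms suggest a "normal form" for weighted data (my own sketch, assuming points are hashable, e.g. tuples): any algorithm satisfying them should behave identically before and after this normalization.

```python
def wright_normal_form(points, weights):
    """Drop zero-mass points and merge co-located points into a single point
    whose mass is the sum of the merged masses (Wright's two axioms)."""
    merged = {}
    for x, w in zip(points, weights):
        if w > 0:
            merged[x] = merged.get(x, 0.0) + w
    return list(merged), list(merged.values())

# Example: three unit-mass copies of (1, 2) collapse into one point of mass 3,
# and the zero-mass point (5, 5) disappears.
print(wright_normal_form([(0, 0), (1, 2), (1, 2), (1, 2), (5, 5)],
                         [1, 1, 1, 1, 0]))
# ([(0, 0), (1, 2)], [1.0, 3.0])
```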


















Characterizing When Average Linkage Responds to Weight: Proof (cont.)

Lemma: Consider any data set (w[X], d). If a clustering C of X is nice, then C appears in the dendrogram Average-Linkage(w[X], d).

We use this lemma to show that if there is a nice dendrogram, then it is output by Average-Linkage.

Logic: Every nice clustering is output by AL. So if there is a nice dendrogram, consider the set of clusters it has. All of them must appear in any dendrogram output by AL. Because a strictly binary tree with a fixed number of leaves always has the same number of nodes, and every node corresponds to a unique cluster, every dendrogram produces the same clusters, which means it has the same clusterings as the nice dendrogram. Show by induction that every level in a level-condensed dendrogram has the same clusters.
