Weighted Clustering
Margareta Ackerman
Work with Shai Ben-David, Simina Branzei, and David Loker
Clustering is one of the most widely used tools for exploratory data analysis.
Social Sciences, Biology, Astronomy, Computer Science, ...
All apply clustering to gain a first understanding of the structure of large data sets.
The Theory-Practice Gap
“While the interest in and application of
cluster analysis has been rising rapidly,
the abstract nature of the tool is still
poorly understood” (Wright, 1973)
“There has been relatively little work aimed at
reasoning about clustering independently of
any particular algorithm, objective function,
or generative data model” (Kleinberg, 2002)
Both statements still apply today.
The Theory-Practice Gap
Clustering aims to assign data into groups of similar items.
Beyond that, there is very little consensus on the definition of clustering.
Inherent Obstacles: Clustering is ill-defined
• Clustering is inherently ambiguous
  – There may be multiple reasonable clusterings
  – There is usually no ground truth
• There are many clustering algorithms with different (often implicit) objective functions
• Different algorithms have radically different input-output behaviour
Differences in Input/Output Behavior of Clustering Algorithms
There are a wide variety of clustering algorithms, which
can produce very different clusterings.
How should a user decide which algorithm to use for a given application?
Clustering Algorithm Selection
Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc.
There is inadequate emphasis on input-output behaviour.
Clustering Algorithm Selection
We propose a framework that lets a user utilize prior knowledge to select an algorithm
• Identify properties that distinguish between the input-output behaviour of different clustering paradigms
• The properties should be:
  1) Intuitive and “user-friendly”
  2) Useful for distinguishing clustering algorithms
Our Framework for Algorithm Selection
In essence, our goal is to understand fundamental differences between clustering methods, and convey them formally, clearly, and as simply as possible.
Previous Work
• Axiomatic perspective
  • Impossibility Result: Kleinberg (NIPS, 2003)
  • Consistent axioms for quality measures: Ackerman & Ben-David (NIPS, 2009)
  • Axioms in the weighted setting: Wright (Pattern Recognition, 1973)
Previous Work
• Characterizations of Single-Linkage
  • Partitional Setting: Bosagh Zadeh and Ben-David (UAI, 2009)
  • Hierarchical Setting: Jardine and Sibson (Mathematical Taxonomy, 1971) and Carlsson and Memoli (JMLR, 2010).
• Characterizations of Linkage-Based Clustering
  • Partitional Setting: Ackerman, Ben-David, and Loker (COLT, 2010).
  • Hierarchical Setting: Ackerman & Ben-David (IJCAI, 2011).
Previous Work
• Classifications of clustering methods
  • Fisher and Van Ness (Biometrika, 1971)
  • Ackerman, Ben-David, and Loker (NIPS, 2010)
What’s Left To Be Done?
Despite much work on clustering properties, some basic questions remain unanswered.
Consider some of the most popular clustering methods: k-means, single-linkage, average-linkage, etc.
• What are the advantages of k-means over other methods?
• Previous classifications are missing key properties.
Our Contributions (at a high level)
We identify 3 fundamental categories that clearly delineate some essential differences between common clustering methods.
The strength of these categories lies in their simplicity.
We hope this gives insight into core differences between popular clustering methods.
To define these categories, we first present the weighted clustering setting.
Outline
• Formal framework
• Categories and classification
• A result from each category
• Conclusions and future work
Weighted Clustering
Every element is associated with a real-valued weight, representing its mass or importance.
Generalizes the notion of element duplication.
Algorithm design, particularly design of approximation algorithms, is often done in this framework.
Other Reasons to Add Weight: An Example
• Apply clustering to facility allocation, such as the placement of police stations in a new district.
• The distribution of stations should enable quick access to most areas in the district.
• Accessibility of different institutions to a station may have varying importance.
• The weighted setting enables a convenient method for prioritizing certain landmarks.
Algorithms in the Weighted Clustering Setting
Traditional clustering algorithms can be readily translated into the weighted setting by considering their behavior on data containing element duplicates.
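The link between weights and duplicates can be checked numerically. The sketch below is illustrative only: it assumes one-dimensional points, integer weights, and a k-means-style cost for a single cluster (weighted squared distance to the weighted mean). The names `weighted_cost` and `duplicated` are hypothetical and do not come from the talk.

```python
def weighted_cost(points, weights):
    """Weighted squared distance of 1-D points to their weighted mean (assumed cost)."""
    mass = sum(weights)
    center = sum(w * p for p, w in zip(points, weights)) / mass
    return sum(w * (p - center) ** 2 for p, w in zip(points, weights))


def duplicated(points, counts):
    """Expand each point into `count` identical copies."""
    return [p for p, c in zip(points, counts) for _ in range(c)]


points = [0.0, 1.0, 3.0]
weights = [2, 1, 3]

# Cost on the weighted data versus cost on unit-weight duplicated data.
cost_weighted = weighted_cost(points, weights)
copies = duplicated(points, weights)
cost_duplicated = weighted_cost(copies, [1] * len(copies))
print(cost_weighted, cost_duplicated)
```

Under these assumptions the two costs agree up to floating-point rounding, which is the sense in which integer weights behave like element duplication.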
Formal Setting
• For a finite domain set X, a weight function w: X → R+ defines the weight of every element.
• For a finite domain set X, a distance function d: X × X → R+ ∪ {0} defines the distance between the domain points.
(X, d) denotes unweighted data
(w[X], d) denotes weighted data
Formal Setting: Partitional Clustering Algorithm
A Partitional Algorithm maps
Input: (w[X], d, k)
to
Output: a k-partition (k-clustering) of X
Formal Setting: Hierarchical Clustering Algorithm
A Hierarchical Algorithm maps
Input: (w[X], d)
to
Output: a dendrogram of X
A dendrogram of (X, d) is a strictly binary tree whose leaves correspond to elements of X.
A clustering C appears in A(w[X], d) if its clusters are in the dendrogram.
Our Contributions
• We utilize the weighted framework to identify 3 fundamental categories, describing how algorithms respond to weight.
• Classify traditional algorithms according to these categories
• Fully characterize when different algorithms react to weight
Towards Basic Categories: Range(X, d)
PARTITIONAL: Range(A(X, d, k)) = {C | ∃ w such that C = A(w[X], d, k)}
The set of clusterings that A outputs on (X, d) over all possible weight functions.
HIERARCHICAL: Range(A(X, d)) = {D | ∃ w such that D = A(w[X], d)}
The set of dendrograms that A outputs on (X, d) over all possible weight functions.
Categories: Weight Robust
A is weight-robust if for all (X, d), |Range(X, d)| = 1.
A never responds to weight.
Categories: Weight Sensitive
A is weight-sensitive if for all (X, d), |Range(X, d)| > 1.
A always responds to weight.
Categories: Weight Considering
An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.
A responds to weight on some data sets, but not others.
Summary of Categories
Range(A(X, d)) = {C | ∃ w such that A(w[X], d) = C}
Range(A(X, d)) = {D | ∃ w such that A(w[X], d) = D}
Weight-robust: for all (X, d), |Range(X, d)| = 1.
Weight-sensitive: for all (X, d), |Range(X, d)| > 1.
Weight-considering:
1) ∃ (X, d) where |Range(X, d)| = 1.
2) ∃ (X, d) where |Range(X, d)| > 1.
Connecting To Applications
In the facility allocation example above, a weight-sensitive algorithm may be preferred.
In phylogeny, where sampling procedures can be highly biased, some degree of weight robustness may be desired.
The desired category depends on the application.
Classification
                     Partitional                              Hierarchical
Weight Robust        Min Diameter, K-center                   Single Linkage, Complete Linkage
Weight Sensitive     K-means, k-medoids, k-median, min-sum    Ward’s Method, Bisecting K-means
Weight Considering   Ratio Cut                                Average Linkage
For the weight considering algorithms, we fully characterize when they are sensitive to weight.
Outline
• Formal framework
• Categories and classification
• A result from each category
• Classification of heuristics
• Conclusions and future work
Zooming Into: Weight Sensitive Algorithms
We show that k-means is weight-sensitive.
A is weight-separable if for any data set (X, d) and subset S of X with at most k points, ∃ w so that A(w[X], d, k) separates all points of S.
Fact: Every algorithm that is weight-separable is also weight-sensitive.
K-means is Weight-Sensitive
Theorem: k-means is weight-sensitive.
Proof:
• Show that k-means is weight-separable
• Consider any (X, d) and S ⊂ X with at most k points
• Increase weight of points in S until each belongs to a distinct cluster.
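The effect of increasing weights can be seen on a tiny instance. The following is a minimal sketch, not the construction from the proof: it assumes one-dimensional Euclidean data, takes the k-means objective to be the weighted squared distance to weighted cluster means, and finds the optimum by brute force over all labelings. The function names are hypothetical.

```python
from itertools import product


def weighted_kmeans_cost(points, weights, labels, k):
    """Assumed weighted k-means objective for 1-D points; inf if a cluster is empty."""
    cost = 0.0
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            return float("inf")
        mass = sum(weights[i] for i in members)
        center = sum(weights[i] * points[i] for i in members) / mass
        cost += sum(weights[i] * (points[i] - center) ** 2 for i in members)
    return cost


def optimal_weighted_kmeans(points, weights, k):
    """Brute-force optimum, returned as a set of clusters of point indices."""
    best = min(
        product(range(k), repeat=len(points)),
        key=lambda labels: weighted_kmeans_cost(points, weights, labels, k),
    )
    return {frozenset(i for i, lab in enumerate(best) if lab == c) for c in range(k)}


points = [0.0, 1.0, 3.0]
unweighted = optimal_weighted_kmeans(points, [1, 1, 1], k=2)
heavy_pair = optimal_weighted_kmeans(points, [100, 100, 1], k=2)
print(unweighted)  # points 0 and 1 share a cluster
print(heavy_pair)  # points 0 and 1 are separated
```

With S taken to be the first two points, making them heavy forces the optimal 2-clustering to separate them, so this data set has more than one clustering in its range.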
Zooming Into: Weight Considering Algorithms
• We show that Average-Linkage is Weight Considering.
• Characterize the precise conditions under which it is sensitive to weight.
Recall: An algorithm A is weight-considering if
1) There exists (X, d) where |Range(X, d)| = 1.
2) There exists (X, d) where |Range(X, d)| > 1.
Average Linkage
• Average-Linkage is a hierarchical algorithm.
• It starts by creating a leaf for every element.
• It then repeatedly merges the “closest” clusters using the following linkage function: the average weighted distance between clusters.
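The merging procedure described above can be sketched as follows. The slide does not spell out the linkage formula, so this sketch assumes the average weighted distance between clusters A and B is the sum of w(a)·w(b)·d(a, b) over all pairs, divided by w(A)·w(B). The function name, the point positions, and the weights are all hypothetical choices for illustration.

```python
def weighted_average_linkage(n, weights, dist):
    """Return the clusters formed by successive merges over points 0..n-1."""
    clusters = [frozenset([i]) for i in range(n)]  # one leaf per element
    merges = []

    def mass(cluster):
        return sum(weights[p] for p in cluster)

    def linkage(a, b):
        # Assumed linkage: weighted average of pairwise distances.
        total = sum(weights[x] * weights[y] * dist(x, y) for x in a for y in b)
        return total / (mass(a) * mass(b))

    while len(clusters) > 1:
        pairs = ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:])
        a, b = min(pairs, key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges


pos = [0.0, 1.0, 2.2, 4.0]  # hypothetical points on a line


def dist(i, j):
    return abs(pos[i] - pos[j])


equal = weighted_average_linkage(4, [1, 1, 1, 1], dist)
skewed = weighted_average_linkage(4, [10, 1, 1, 1], dist)
print(equal)
print(skewed)
```

On this data the two weightings produce different merge sequences: with equal weights point 2 joins {0, 1} before point 3 is merged, while a heavy point 0 pushes {0, 1} away from point 2 so that {2, 3} forms first.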
Average Linkage is Weight Considering
(X, d) where |Range(X, d)| = 1: The same dendrogram is output for every weight function.
[Figure: points A, B, C, D and the dendrogram over them]
Average Linkage is Weight Considering
(X, d) where |Range(X, d)| > 1:
[Figure: points A, B, C, D, E with edge lengths 1, 1, 1+ϵ, and 2+2ϵ, shown with their dendrograms]
When is Average Linkage Sensitive to Weight?
We showed that Average-Linkage is weight-considering.
Can we show when it is sensitive to weight?
We provide a complete characterization of when Average-Linkage is sensitive to weight, and when it is not.
Nice Clustering
A clustering is nice if every point is closer to all points within its cluster than to all other points.
[Figures: two example clusterings labelled “Nice” and one labelled “Not nice”]
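The niceness condition translates directly into a check over triples of points. This is a minimal sketch with a hypothetical function name; it assumes "closer" is strict, so a tie between a within-cluster and an outside distance counts as not nice. The example positions are made up for illustration.

```python
def is_nice(clusters, dist):
    """True if every point is strictly closer to its own cluster than to any outside point."""
    for cluster in clusters:
        outside = [z for other in clusters if other is not cluster for z in other]
        for x in cluster:
            for y in cluster:
                if y == x:
                    continue
                # Violation: some outside point z is at least as close to x as y is.
                if any(dist(x, z) <= dist(x, y) for z in outside):
                    return False
    return True


pos = [0.0, 1.0, 2.2, 4.0]  # hypothetical points on a line


def dist(i, j):
    return abs(pos[i] - pos[j])


nice_example = is_nice([{0, 1}, {2}, {3}], dist)
not_nice_example = is_nice([{0, 1, 2}, {3}], dist)
print(nice_example, not_nice_example)
```

In the second clustering, point 2 is closer to point 3 (distance 1.8) than to its cluster-mate point 0 (distance 2.2), which is exactly the kind of violation the definition rules out.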
Characterizing When Average Linkage is Sensitive to Weight
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.
A dendrogram is nice if all of its clusterings are nice.
Characterizing When Average Linkage is Sensitive to Weight: Proof
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.
Proof: Show that:
1) If there is a nice dendrogram for (X, d), then Average-Linkage outputs it.
2) If a clustering that is not nice appears in dendrogram AL(w[X], d) for some w, then |Range(AL(X, d))| > 1.
Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)
Lemma: If there is a nice dendrogram for (X, d), then Average-Linkage outputs it.
Proof Sketch:
1) Assume that (w[X], d) has a nice dendrogram.
2) Main idea: Show that every nice clustering of the data appears in AL(w[X], d).
3) For that, we show that each cluster in a nice clustering is formed by the algorithm.
Characterizing When Average Linkage is Sensitive to Weight: Proof (cont.)
Given a nice clustering C, it can be shown that for any clusters C_i and C_j of C, any disjoint subsets Y and Z of C_i, and any subset W of C_j, Y and Z are closer than Y and W.
This implies that C appears in the dendrogram.
Characterizing When Average Linkage Responds to Weight: Proof (cont.)
Lemma: If a clustering C that is not nice appears in AL(w[X], d), then |Range(X, d)| > 1.
Proof:
• Since C is not nice, there exist points x, y, and z, so that
  • x and y belong to the same cluster in C
  • x and z belong to different clusters
  • yet d(x, z) < d(x, y)
• If x, y and z are sufficiently heavier than all other points, then x and z will be merged before x and y, so C will not be formed.
Characterizing When Average Linkage is Sensitive to Weight
Theorem: |Range(AL(X, d))| = 1 if and only if (X, d) has a nice dendrogram.
Average Linkage is robust to weight whenever there is a dendrogram of (X, d) consisting of only nice clusterings, and it is sensitive to weight otherwise.
Zooming Into: Weight Robust Algorithms
These algorithms are invariant to element duplication.
Ex. Min-Diameter returns a clustering that minimizes the length of the longest within-cluster edge.
As this quantity is not affected by the number of points (or weight) at any location, Min-Diameter is weight robust.
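The invariance can be illustrated by evaluating the min-diameter objective on a clustering before and after duplicating points. This is a small sketch with a hypothetical function name; it only computes the cost of a given clustering and does not search for the optimal one.

```python
from itertools import combinations


def diameter_cost(clusters, dist):
    """Length of the longest within-cluster edge (0.0 if every cluster is a singleton)."""
    return max(
        (dist(x, y) for cluster in clusters for x, y in combinations(cluster, 2)),
        default=0.0,
    )


def dist(x, y):
    return abs(x - y)


original = [[0.0, 1.0], [3.0]]
# Same locations, but with extra copies of some points placed in the same clusters.
with_duplicates = [[0.0, 0.0, 0.0, 1.0], [3.0, 3.0]]

cost_original = diameter_cost(original, dist)
cost_duplicates = diameter_cost(with_duplicates, dist)
print(cost_original, cost_duplicates)
```

Copies of a point add only zero-length edges and repeats of existing edges, so the longest within-cluster edge stays the same.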
Outline
• Introduce framework
• Present categories and classification
• Show several results from different categories
• Conclusions and future work
Conclusions
• We introduced three basic categories describing how algorithms respond to weights
• We characterize the precise conditions under which algorithms respond to weights
• The same results apply in the non-weighted setting for data duplicates
• This classification can be used to help select clustering algorithms for specific applications
Future Directions
• Capture differences between objective functions similar to k-means (e.g. k-medians, k-medoids, min-sum)
• Show bounds on the size of the Range of weight considering and weight sensitive methods
• Analyze clustering algorithms for categorical data
• Analyze clustering algorithms with a noise bucket
• Identify properties that are significant for specific clustering applications (some previous work in this direction by Ackerman, Brown, and Loker (ICCABS, 2012)).
Supplementary Material
Characterizing When Ratio Cut is Weight Responsive
Ratio-cut is a similarity-based clustering function.
A clustering is perfect if all within-cluster distances are shorter than all between-cluster distances.
A clustering is separation uniform if all between-cluster distances are equal.
Characterizing When Ratio Cut is Weight Responsive
Theorem: Given a clustering C of (X, s) where every cluster has more than one element, ratio-cut is weight-responsive on C if and only if either
• C is not perfect, or
• C is not separation-uniform.
What about heuristics?
• Our analysis for k-means and similar methods is for their corresponding objective functions.
• Unfortunately, optimal partitions are NP-hard to find. In practice, heuristics such as the Lloyd method are used.
• We analyze several variations of the Lloyd method for k-means, as well as the Partitioning Around Medoids (PAM) algorithm for k-medoids.
Heuristics Classification (Partitional)
Weight Sensitive
• Lloyd with random initialization
• K-means++
• PAM
Weight Considering
• The Lloyd Method with Furthest Centroid Initialization
Note that the more popular heuristics respond to weights in the same way as the k-means and k-medoids objective functions.
Just like Average-Linkage, the Lloyd Method with Furthest Centroid initialization responds to weight only on clusterings that are not nice.
Classification
                     Partitional                                   Hierarchical
Weight Robust        Min Diameter, K-center                        Single Linkage, Complete Linkage
Weight Sensitive     K-means, k-medoids, k-median, Min-Sum,        Ward’s Method, Bisecting K-means
                     Randomized Lloyd, k-means++
Weight Considering   Ratio Cut, Lloyd with Furthest Centroid       Average Linkage
Wright’s Axioms
In 1973, Wright proposed axioms for this setting, requiring that:
• Points with zero mass can be treated as non-existent
• Multiple points with mass at the same location are equivalent to one point whose weight is the sum of these masses
Characterizing When Average Linkage Responds to Weight: Proof (cont.)
Lemma: Consider any data set (w[X], d). If a clustering C of X is nice, then C appears in the dendrogram Average-Linkage(w[X], d).
We use this lemma to show that if there is a nice dendrogram, then it is output by Average-Linkage.
Logic: Every nice clustering is output by AL. So if there is a nice dendrogram, consider the set of clusters it has. All of them have to appear in any dendrogram output by AL. Because a strictly binary tree over the same leaves always has the same number of nodes, and every node corresponds to a unique cluster, every dendrogram output by AL produces the same clusters, which means it has the same clusterings as the nice dendrogram. Show by induction that every level in a level-condensed dendrogram has the same clusters.