Clustering
Usman Roshan
CS 698
Clustering
• Suppose we want to cluster $n$ vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.
• Our objective is to find $C_1$ and $C_2$ that minimize
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
where $m_i$ is the mean of class $C_i$.
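As a minimal sketch of the two-cluster objective, the function below sums the squared distances from each point to its cluster mean. The name `two_means_objective`, the 0/1 label encoding for $C_1$/$C_2$, and the toy data are assumptions made for illustration; they are not part of the slides.

```python
import numpy as np

def two_means_objective(X, labels):
    """Sum of squared distances from each point to the mean of its cluster.

    X      : (n, d) array of data points
    labels : length-n sequence with entries 0 (for C_1) or 1 (for C_2)
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for i in (0, 1):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing
        m_i = members.mean(axis=0)          # mean of cluster i
        total += np.sum((members - m_i) ** 2)
    return total

# Toy data (assumed): two tight pairs of points far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = two_means_objective(X, [0, 0, 1, 1])  # groups the nearby points
bad = two_means_objective(X, [0, 1, 0, 1])   # mixes the two pairs
```

Grouping the nearby points gives a much smaller objective than mixing them, which is what the minimization rewards.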
Clustering
• NP-hard even for 2-means
• NP-hard even in the plane
• K-means heuristic
– Popular and hard to beat
– Introduced in the 1950s and 1960s
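To give a feel for why exact minimization is expensive, here is a hedged sketch of exhaustive search over all bipartitions. It tries on the order of $2^{n-1}$ labelings, so it is only usable for very small $n$. The function name `brute_force_two_means` and the toy data are assumptions for illustration.

```python
from itertools import product
import numpy as np

def brute_force_two_means(X):
    """Try every assignment of points to two clusters; return the best one."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    best_cost, best_labels = float("inf"), None
    # Fix the first point in cluster 0 to skip mirror-image labelings.
    for rest in product((0, 1), repeat=n - 1):
        labels = np.array((0,) + rest)
        cost = 0.0
        for i in (0, 1):
            members = X[labels == i]
            if len(members) > 0:
                cost += np.sum((members - members.mean(axis=0)) ** 2)
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels

# Toy data (assumed): 4 points, so only 8 labelings are examined.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
best_cost, best_labels = brute_force_two_means(X)
```

The number of labelings doubles with every added point, which is why a heuristic is used in practice.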
K-means algorithm for two clusters
Input: $x_i \in \mathbb{R}^d$, $i = 1 \ldots n$
Algorithm:
1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:
$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$
2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$
3. Recompute means $m_1$ and $m_2$
4. Compute objective
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
5. Compute objective of new clustering. If the difference from the previous objective is smaller than a chosen threshold then stop, otherwise go to step 2.
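The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the lecture's reference implementation: the names `two_means`, `tol`, `max_iter`, and `seed` are hypothetical, the stopping threshold value is arbitrary, and the handling of an empty cluster (redraw at initialization, keep the old mean later) is a choice the slides do not specify. It assumes at least two data points.

```python
import numpy as np

def objective(X, labels, means):
    """Sum of squared distances of each point to its assigned mean."""
    return sum(np.sum((X[labels == i] - means[i]) ** 2) for i in (0, 1))

def two_means(X, tol=1e-9, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: assign each point to cluster 0 or 1 with equal probability.
    labels = rng.integers(0, 2, size=n)
    # Assumption: redraw if one cluster is empty so both means are defined.
    while labels.min() == labels.max():
        labels = rng.integers(0, 2, size=n)
    means = np.array([X[labels == i].mean(axis=0) for i in (0, 1)])
    prev = objective(X, labels, means)
    history = [prev]
    for _ in range(max_iter):
        # Step 2: assign each point to the closer mean (ties go to cluster 1).
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.where(d[:, 0] < d[:, 1], 0, 1)
        # Step 3: recompute means; an emptied cluster keeps its old mean.
        for i in (0, 1):
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
        # Steps 4-5: compare the new objective with the previous one.
        cur = objective(X, labels, means)
        history.append(cur)
        if prev - cur < tol:
            break
        prev = cur
    return labels, means, history

# Toy data (assumed for illustration).
X = np.array([[0.0, 0.0], [0.5, 1.0], [6.0, 5.0], [5.0, 6.5], [0.2, 0.4]])
labels, means, history = two_means(X)
```

The returned `history` records the objective after each pass, which makes it easy to check that the value never goes up.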
K-means
• Is it guaranteed to find the clustering which optimizes the objective?
• No: it is only guaranteed to find a local optimum
• We can prove that the objective never increases from one iteration to the next
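Proof sketch of convergence of k-means
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i^*\|^2$$
Here $C_i^*$ are the clusters after reassignment and $m_i^*$ are their recomputed means.
Justification of first inequality: by assigning $x_j$ to the closest mean the objective decreases or stays the same
Justification of second inequality: for a given cluster its mean minimizes squared error loss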
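The second inequality rests on a standard identity, written out here as a short worked step. The symbols $C$ (a fixed cluster), $m^*$ (its mean), and $m$ (an arbitrary candidate centre) are notation introduced for this sketch. Writing $x_j - m = (x_j - m^*) + (m^* - m)$ and expanding the square:

```latex
\begin{aligned}
\sum_{x_j \in C} \|x_j - m\|^2
  &= \sum_{x_j \in C} \|x_j - m^*\|^2
     + 2\,(m^* - m)^{\top} \sum_{x_j \in C} (x_j - m^*)
     + |C|\,\|m^* - m\|^2 \\
  &= \sum_{x_j \in C} \|x_j - m^*\|^2 + |C|\,\|m^* - m\|^2
  \;\ge\; \sum_{x_j \in C} \|x_j - m^*\|^2 .
\end{aligned}
```

The cross term vanishes because $\sum_{x_j \in C} (x_j - m^*) = 0$ by the definition of the mean, so replacing any centre by the cluster mean cannot increase the squared error.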
Other clustering algorithms
• Hierarchical clustering
– Initialize n clusters where each datapoint is in its own cluster
– Merge two nearest clusters into one
– Update distances of new cluster to existing ones
– Repeat the merge and update steps until k clusters are formed.
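The merge loop above can be sketched as follows. The slides do not say how the distance between two clusters is defined, so this sketch assumes single linkage (the minimum distance between any pair of members); the function name `agglomerative` and the toy data are also assumptions. It expects $1 \le k \le n$.

```python
import numpy as np

def agglomerative(X, k):
    """Merge the two nearest clusters until k remain (single linkage assumed)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Start with every point in its own cluster, keyed by point index.
    clusters = {i: [i] for i in range(n)}
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Cluster-to-cluster distances, keyed by (a, b) with a < b.
    dist = {(a, b): D[a, b] for a in range(n) for b in range(a + 1, n)}
    while len(clusters) > k:
        # Merge the two nearest clusters: b is absorbed into a.
        a, b = min(dist, key=dist.get)
        clusters[a].extend(clusters.pop(b))
        # Update distances from the merged cluster to the existing ones.
        for c in clusters:
            if c == a:
                continue
            key_ac = (min(a, c), max(a, c))
            key_bc = (min(b, c), max(b, c))
            dist[key_ac] = min(dist[key_ac], dist[key_bc])
        # Drop every entry that still refers to the absorbed cluster b.
        dist = {key: v for key, v in dist.items() if b not in key}
    return [sorted(members) for members in clusters.values()]

# Toy data (assumed): two close pairs and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
result = agglomerative(X, k=3)
```

Other linkage rules (complete, average) would only change the line that combines `dist[key_ac]` and `dist[key_bc]`.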
Other clustering algorithms
• Graph clustering (Spectral clustering)
– Find cut of minimum cost in a bipartition of the data
– NP-hard if we look for a minimum cut such that the sizes of the two clusters are close
– Relaxation leads to spectral clustering
– Based on calculating the Laplacian of a graph and the eigenvalues and eigenvectors of the similarity matrix
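One common way to realize this relaxation for a bipartition is sketched below. Several choices here are assumptions not stated in the slides: a Gaussian similarity with bandwidth `sigma`, the unnormalized Laplacian $L = D - W$, and splitting by the sign of the eigenvector for the second-smallest eigenvalue. The function name `spectral_bipartition` is hypothetical.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split points into two groups using the second eigenvector of L = D - W."""
    X = np.asarray(X, dtype=float)
    # Similarity matrix W (assumed Gaussian kernel), no self-loops.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian: degree matrix minus similarity matrix.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(L)
    second = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    # Threshold at zero to obtain the two clusters.
    return (second > 0).astype(int)

# Toy data (assumed): two tight pairs of points far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = spectral_bipartition(X, sigma=2.0)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector is only determined up to sign.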