# Clustering

25 Nov 2013

Usman Roshan

CS 698

Clustering

Suppose we want to cluster $n$ vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.

Our objective is to find $C_1$ and $C_2$ that minimize

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$

where $m_i$ is the mean of class $C_i$.
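To make the objective concrete, here is a minimal sketch that evaluates it for a given assignment of points to the two groups. The function name `two_cluster_objective` and the 0/1 label encoding (0 for $C_1$, 1 for $C_2$) are assumptions made for this illustration, not part of the slides.

```python
import numpy as np

def two_cluster_objective(X, labels):
    """Sum over both clusters of squared distances to the cluster mean.

    X      : (n, d) array, one vector per row
    labels : length-n sequence of 0/1 (assumed encoding: 0 = C_1, 1 = C_2)
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for i in (0, 1):
        members = X[labels == i]
        if len(members) == 0:
            continue  # an empty cluster contributes nothing to the sum
        m_i = members.mean(axis=0)             # mean of class C_i
        total += np.sum((members - m_i) ** 2)  # squared distances to m_i
    return float(total)

# Four points forming two tight pairs that are far apart.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = two_cluster_objective(X, [0, 0, 1, 1])  # groups the nearby points
bad = two_cluster_objective(X, [0, 1, 0, 1])   # mixes the two pairs
print(good, bad)
```

Grouping the nearby points gives an objective of 4, while mixing the pairs gives 100, so the first clustering is preferred.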

Clustering

NP-hard even for 2-means

NP-hard even in the plane

K-means heuristic

Popular and hard to beat

Introduced in the 1950s and 1960s

K-means algorithm for two clusters

Input: $x_i \in \mathbb{R}^d$, $i = 1 \ldots n$

Algorithm:

1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:

$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$

2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$

3. Recompute means $m_1$ and $m_2$

4. Compute objective

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$

5. Compute objective of new clustering. If the difference is smaller than a chosen threshold then stop, otherwise go to step 2.
 

K-means

Is it guaranteed to find the clustering which optimizes the objective?

It is only guaranteed to find a local optimum

We can prove that the objective never increases in subsequent iterations

Proof sketch of convergence of k-means

$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i^*\|^2$$

Here $C_i^*$ are the clusters after reassignment and $m_i^*$ are their recomputed means.

Justification of first inequality: by assigning $x_j$ to the closest mean the objective decreases or stays the same

Justification of second inequality: for a given cluster its mean minimizes squared error loss
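The second justification can be spelled out with a short derivation. The symbols $C$ (any fixed cluster), $m$ (its mean), and $c$ (an arbitrary candidate center) are introduced here for illustration only.

```latex
% For a fixed cluster C with mean m = (1/|C|) \sum_{x_j \in C} x_j
% and an arbitrary point c, write x_j - c = (x_j - m) + (m - c):
\begin{align*}
\sum_{x_j \in C} \|x_j - c\|^2
  &= \sum_{x_j \in C} \|x_j - m\|^2
   + 2\,(m - c)^{\top} \sum_{x_j \in C} (x_j - m)
   + |C|\,\|m - c\|^2 \\
  &= \sum_{x_j \in C} \|x_j - m\|^2 + |C|\,\|m - c\|^2
   \;\ge\; \sum_{x_j \in C} \|x_j - m\|^2
\end{align*}
% The cross term vanishes because \sum_{x_j \in C} (x_j - m) = 0
% by the definition of the mean.
```

Taking $c$ to be the old mean and $m$ the recomputed mean of the reassigned cluster gives the second inequality, with equality only when the mean did not move.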

Other clustering algorithms

Hierarchical clustering

1. Initialize n clusters where each datapoint is in its own cluster

2. Merge the two nearest clusters into one

3. Update distances of the new cluster to the existing ones

4. Repeat steps 2 and 3 until k clusters are formed.
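A naive sketch of this merging procedure is shown below. The slides do not say how the distance between two clusters is measured, so this example assumes single linkage (the distance between the closest pair of members). The function name `agglomerative` is hypothetical, and the code favors readability over efficiency.

```python
import numpy as np

def agglomerative(X, k):
    """Merge clusters bottom-up until k remain; returns lists of row indices."""
    X = np.asarray(X, dtype=float)
    # Step 1: every datapoint starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Pairwise point distances, computed once.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def linkage(a, b):
        # Assumption: single linkage, i.e. distance of the closest members.
        return D[np.ix_(a, b)].min()

    while len(clusters) > k:
        # Step 2: find the two nearest clusters and merge them.
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        merged = clusters[p] + clusters[q]
        # Step 3 is implicit: distances to the merged cluster are
        # recomputed from D on the next pass instead of being cached.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]

X = np.array([[0.0], [1.0], [10.0], [11.0], [50.0]])
result = agglomerative(X, k=3)
print(sorted(result))
```

On this one-dimensional example the two close pairs are merged first and the distant point stays alone, giving the clusters `[0, 1]`, `[2, 3]`, and `[4]`.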

Other clustering algorithms

Graph clustering (Spectral clustering)

Find a cut of minimum cost in a bipartition of the data

NP-hard if we look for a minimum cut such that the sizes of the two clusters are close
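Because the balanced cut problem is NP-hard, spectral methods typically solve a relaxation instead. The sketch below shows one common variant, which may differ from what the course uses: it assumes a complete graph with Gaussian similarity weights, builds the unnormalized Laplacian $L = D - W$, and splits the points by the sign of the eigenvector belonging to the second-smallest eigenvalue. The name `spectral_bipartition` and the parameter `sigma` are hypothetical.

```python
import numpy as np

def spectral_bipartition(X, sigma=1.0):
    """Split points into two groups using the second Laplacian eigenvector."""
    X = np.asarray(X, dtype=float)
    # Assumption: Gaussian similarities as edge weights of a complete graph.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    # Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue.
    _, vecs = np.linalg.eigh(L)
    second = vecs[:, 1]
    # The sign of each entry decides the side of the cut.
    return (second > 0).astype(int)

X = np.array([[0.0], [1.0], [2.0], [6.0], [7.0], [8.0]])
labels = spectral_bipartition(X, sigma=2.0)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.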