
Hierarchical Stability Based Model
Selection for Data Clustering

Bing Yin


Advisor: Greg Hamerly

November 8, 2013

Roadmap

- What is clustering?
- What is model selection for clustering algorithms?
- Stability-based model selection: proposals and problems
- Hierarchical stability-based model selection
  - Algorithm
  - Unimodality test
- Experiments
- Future work

Main Contributions

- Extended the concept of stability to hierarchical stability.
- Solved the symmetric data sets problem.
- Made stability a competitive tool for model selection.


What is clustering?

- Given:
  - a data set of "objects"
  - some relations between those objects: similarities, distances, neighborhoods, connections, ...
- Goal: find meaningful groups of objects such that
  - objects in the same group are "similar"
  - objects in different groups are "dissimilar"
- Clustering is:
  - a form of unsupervised learning
  - a method of data exploration


What is clustering? An Example

- Image segmentation
- Microarray analysis: serum stimulation of human fibroblasts (Eisen, Spellman, PNAS, 1998)
  - 9800 spots representing 8600 genes
  - 12 samples taken over a 24-hour period
  - Clusters can be roughly categorized as genes involved in:
    A: cholesterol biosynthesis
    B: the cell cycle
    C: the immediate-early response
    D: signaling and angiogenesis
    E: wound healing and tissue remodeling
- Document clustering, post-search grouping
- Data mining
- Social network analysis
- Gene family grouping


What is clustering? An Algorithm

K-means algorithm (Lloyd, 1957)

Given: data points X_1, ..., X_n ∈ R^d, and the number K of clusters to find.

1. Randomly initialize the centers m_1^0, ..., m_K^0.
2. Iterate until convergence:
   2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters C_1^(i+1), ..., C_K^(i+1) by

       X_s ∈ C_k^(i+1)  where  ‖X_s − m_k^i‖² < ‖X_s − m_l^i‖²  for all l = 1, ..., K, l ≠ k.

   2.2 Compute the new cluster centers by

       m_k^(i+1) = ( Σ_{X_s ∈ C_k^(i+1)} X_s ) / |C_k^(i+1)|

What is optimized? K-means minimizes the within-cluster distances:

       Σ_k Σ_{X_s ∈ C_k} ‖X_s − m_k‖²
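To make the procedure concrete, here is a minimal runnable sketch of Lloyd's algorithm, assuming numpy is available; the data-point initialization, fixed seed, and convergence check are illustrative choices, not details from the slides.

```python
# A minimal sketch of Lloyd's k-means; seeding and convergence test are assumptions.
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the centers by picking K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2.1 Assign each point to its closest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 Recompute each center as the mean of its assigned points
        # (keep the old center if a cluster becomes empty).
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):  # converged
            return new_centers, labels
        centers = new_centers
    return centers, labels
```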



What is model selection?

- Clustering algorithms need to know K before running.
- The correct K for a given data set is unknown.
- So we need a principled way to find this K, and also the positions of the K centers.
- This can intuitively be called model selection for clustering algorithms.
- Existing model selection methods:
  - Bayesian Information Criterion
  - Gap statistic
  - Projection test
  - Stability-based approach





Stability Based Model Selection

- The basic idea: scientific truth should be reproducible in experiments.
- Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings:
  - If K is the correct model, the clusterings should be similar to each other.
  - If K is a wrong model, the clusterings may be quite different from each other.
- This property is referred to as the stability of K (Ulrike von Luxburg, 2007).





Stability Based Model Selection (2)

Example on toy data:

[figure omitted: repeated clusterings of a toy data set under different values of K]

If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data.


Define the Stability

Variation of Information (VI)

- Given clustering C_1: X_1, ..., X_k and clustering C_2: X'_1, ..., X'_k on data X.
- The probability that a point p belongs to X_i is:

      P(i) = |X_i| / |X|

- The entropy of C_1:

      H(C_1) = − Σ_i P(i) log P(i)

- The joint probability that p is in X_i and X'_j is P(i, j) = |X_i ∩ X'_j| / |X|, with joint entropy:

      H(C_1, C_2) = − Σ_{i,j} P(i, j) log P(i, j)

- The VI is defined as:

      VI(C_1, C_2) = 2 H(C_1, C_2) − H(C_1) − H(C_2)

- VI is a distance between two clusterings.
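A hedged sketch of computing VI from two label vectors; the integer label encoding and the natural-log convention below are assumptions of this sketch.

```python
# Variation of Information between two flat clusterings given as label vectors.
import numpy as np

def variation_of_information(labels1, labels2):
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    n = len(labels1)
    # Joint distribution P(i, j) over pairs of cluster indices.
    joint = np.zeros((labels1.max() + 1, labels2.max() + 1))
    for a, b in zip(labels1, labels2):
        joint[a, b] += 1.0
    joint /= n
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    h1, h2, h12 = entropy(p1), entropy(p2), entropy(joint.ravel())
    # VI = 2 H(C1, C2) - H(C1) - H(C2)
    return 2 * h12 - h1 - h2
```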





Define the Stability (2)

Calculate the VI score for a single K:

- Cluster the data into K clusters using K-means; run M times.
- Calculate the pairwise VI of these M clusterings.
- Average the VI values and use the result as the VI score for K.

The calculated VI score for K indicates the instability of K.

- Try this over different values of K.
- The K with the lowest VI score (instability) is chosen as the correct model; see the sketch below.
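A sketch of this scoring loop, reusing the kmeans and variation_of_information helpers sketched earlier; M = 10 runs and the candidate range are illustrative choices.

```python
# Instability score for one K: average pairwise VI over M repeated runs.
from itertools import combinations
import numpy as np

def instability(X, K, M=10):
    # M clusterings of the same data from different random initializations.
    runs = [kmeans(X, K, seed=m)[1] for m in range(M)]
    # Average pairwise VI over all M*(M-1)/2 pairs of clusterings.
    return np.mean([variation_of_information(a, b)
                    for a, b in combinations(runs, 2)])

# The K with the lowest instability is chosen as the correct model:
# best_K = min(range(2, 11), key=lambda k: instability(X, k))
```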



Define the Stability (3)

A good example of stability:

[figure omitted]

A bad example of stability: symmetric data

[figure omitted]

Why? Because clustering the data into 9 clusters admits many more equally good grouping choices than clustering it into 3.


Hierarchical Stability

Problems with the concept of stability introduced above:

- Symmetric data sets
- Only local optimization: stability favors the smaller K

Proposed solution:

- Analyze the stability in a hierarchical manner
- Use a unimodality test to detect the termination of the recursion


Hierarchical Stability

Given: data set X

HS-means:

1. Test whether X is a unimodal cluster.
2. If it is not, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, so return.
3. Partition X into K subsets.
4. For each subset, recursively run this algorithm from step 1.
5. Merge the answers from the subsets as the answer for the current data (see the sketch below).
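A minimal sketch of this recursion, reusing the helpers sketched earlier; is_unimodal stands in for the unimodality tests on the following slides, and the k_max cap on candidate K is an assumption.

```python
# HS-means: recursively split the data while the unimodality test rejects.
def hs_means(X, k_max=10):
    # Steps 1-2: if X passes the unimodality test, it is a single cluster.
    if is_unimodal(X):
        return [X]
    # Step 2: choose the most stable K for this subset of the data.
    K = min(range(2, k_max + 1), key=lambda k: instability(X, k))
    # Step 3: partition X into K subsets.
    _, labels = kmeans(X, K)
    # Steps 4-5: recurse into each subset and merge the answers.
    clusters = []
    for k in range(K):
        clusters.extend(hs_means(X[labels == k], k_max))
    return clusters
```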


Unimodality Test: χ² Unimodality Test

Fact: a sum of squared standard Gaussians follows a χ² distribution.

- If x_1, ..., x_d are d independent standard Gaussian variables, then

      S = x_1² + ... + x_d²

  follows a χ² distribution with d degrees of freedom.
- For a given data set X, calculate S_i = X_i1² + ... + X_id².
- If X is a single Gaussian cluster, then S follows a χ² distribution of degree d.
- Otherwise, S does not follow a χ² distribution.
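A hedged sketch of this test; the per-coordinate whitening step and the Kolmogorov-Smirnov comparison with a 0.05 threshold are my assumptions about how the fit to χ²(d) is checked, not details from the slides.

```python
# chi-squared unimodality test: does S = sum of squared coordinates look chi2(d)?
import numpy as np
from scipy import stats

def is_unimodal(X, alpha=0.05):
    # Center and scale so a single Gaussian cluster looks standard normal
    # (assumes roughly independent coordinates; an assumption of this sketch).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    d = Z.shape[1]
    # S_i follows chi2 with d degrees of freedom if X is one Gaussian cluster.
    S = (Z ** 2).sum(axis=1)
    # Kolmogorov-Smirnov test of S against the chi2(d) distribution.
    _, p_value = stats.kstest(S, stats.chi2(d).cdf)
    return p_value > alpha
```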


Unimodality Test: Gap Test

Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani, 2000).

Given: data set X, candidate k

- Cluster X into k clusters and get the within-cluster dispersion W_k.
- Generate uniform reference data sets, cluster each into k clusters, and calculate W*_k (averaged over the reference sets).
- gap(k) = W*_k − W_k
- Select the smallest k such that gap(k) > gap(k+1).

We use it in another way: just ask whether k = 1 (see the sketch below).
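A sketch of the gap computation used as a "k = 1?" question, reusing the kmeans sketch above; the number of reference sets B and the bounding-box uniform sampling are assumptions of this sketch.

```python
# Gap test: compare within-cluster dispersion against uniform reference data.
import numpy as np

def within_dispersion(X, k):
    centers, labels = kmeans(X, k)
    # W_k: total squared distance of points to their cluster centers.
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))

def gap(X, k, B=10, seed=0):
    rng = np.random.default_rng(seed)
    # W*_k: average dispersion of uniform data over the bounding box of X.
    ref = np.mean([within_dispersion(rng.uniform(X.min(0), X.max(0), X.shape), k)
                   for _ in range(B)])
    return ref - within_dispersion(X, k)

# "Just ask k = 1?": treat X as a single cluster if gap(1) > gap(2).
```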


Experiments

Synthetic data:

- Both Gaussian and uniform distributions
- Dimensions from 2 up to 20
- c-separation between each cluster center and its nearest neighboring center is 4
- 200 points in each cluster, 10 clusters in total

Handwritten digits:

- U.S. Postal Service handwritten digits
- 9298 instances in 256 dimensions
- 10 true clusters (maybe!)

KDD control curves:

- 600 instances in 60 dimensions
- 6 true clusters, each with 100 instances

Estimated number of clusters (mean ± deviation):

                   Synthetic Gaussian   Synthetic Uniform    Handwritten Digits   KDD Control Curves
                   (10 true clusters)   (10 true clusters)   (10 true clusters)   (6 true clusters)
HS-means           10 ± 1               10 ± 1               6 ± 0                6.5 ± 0.5
Lange Stability    6.5 ± 1.5            7 ± 1                2 ± 0                3 ± 0
PG-means           10 ± 1               19.5 ± 1.5           20 ± 1               17 ± 1


Experiments: Symmetric Data

HS-means: [figure omitted]

Lange Stability: [figure omitted]



Future Work

- A better unimodality testing approach.
- More detailed comparison of performance against existing methods, such as within-cluster distance, the VI metric, and so on.
- Improve the speed of the algorithm.






Questions and Comments






Thank you!