# Hierarchical Stability Based Model Selection for Data Clustering


11/8/2013


Hierarchical Stability Based Model Selection for Data Clustering

Bing Yin


What is clustering?

What is model selection for clustering algorithms?

Stability Based Model Selection: Proposals and Problems

Hierarchical Stability Based Model Selection

Algorithm

Unimodality Test

Experiments

Future work

Main Contribution

Extended the concept of stability to hierarchical stability.

Solved the symmetric data set problem.

Made stability a competitive tool for model selection.


What is clustering?

Given:

a data set of “objects”

some relations between those objects: similarities, distances, neighborhoods, connections, …

Goal: find meaningful groups of objects such that

objects in the same group are “similar”

objects in different groups are “dissimilar”

Clustering is:

a form of unsupervised learning

a method of data exploration


What is clustering? An Example

Image Segmentation. Microarray Analysis:

Serum Stimulation of Human Fibroblasts

(Eisen, Spellman, PNAS, 1998)

9800 spots representing 8600 genes

12 samples taken over 24 hour period

Clusters can be roughly categorized by the genes involved in:

A: cholesterol biosynthesis

B: the cell cycle

C: the immediate-early response

D: signaling and angiogenesis

E: wound healing and tissue remodeling

Document Clustering

Post-search Grouping

Data Mining

Social Network Analysis

Gene Family Grouping


What is clustering? An Algorithm

K-Means algorithm (Lloyd, 1957)

Given: data points $X_1, \dots, X_n \in \mathbb{R}^d$ and the number K of clusters to find.

1. Randomly initialize the centers $m_1^{(0)}, \dots, m_K^{(0)}$.

2. Iterate until convergence:

2.1 Assign each point to the closest center according to Euclidean distance, i.e., define clusters $C_1^{(i+1)}, \dots, C_K^{(i+1)}$ by $X_s \in C_k^{(i+1)}$ where $\|X_s - m_k^{(i)}\|^2 < \|X_s - m_l^{(i)}\|^2$ for all $l = 1, \dots, K$.

2.2 Compute the new cluster centers by $m_k^{(i+1)} = \frac{1}{|C_k^{(i+1)}|} \sum_{X_s \in C_k^{(i+1)}} X_s$.

What is optimized? Minimizing the within-cluster distances: $\sum_{k=1}^{K} \sum_{X_s \in C_k} \|X_s - m_k\|^2$.
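The steps above can be sketched in Python. This is a minimal illustration using NumPy; the function and variable names are mine, not from the talk:

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    """Lloyd's K-Means, as outlined above.

    X: (n, d) array of points; K: number of clusters to find.
    Returns (labels, centers).
    """
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the centers m_1, ..., m_K from the data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 2.1 Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2.2 Recompute each center as the mean of its assigned points
        #     (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                if (labels == k).any() else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```

Each iteration can only decrease the within-cluster sum of squared distances, which is exactly the quantity being minimized.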


What is model selection?

Clustering algorithms need to know K before running.

The correct value of K for a given data set is unknown,

so we need a principled way to find this K, and also the positions of the K centers.

This can intuitively be called model selection for clustering algorithms.

Existing model selection methods:

Bayesian Information Criterion

Gap statistic

Projection test

Stability-based approaches


Stability Based Model Selection

The basic idea:

scientific truth should be reproducible in experiments.

Repeatedly run a clustering algorithm on the same data with parameter K and get a collection of clusterings:

If K is the correct model, the clusterings should be similar to each other.

If K is a wrong model, the clusterings may be quite different from each other.

This property is referred to as the stability of K (Ulrike von Luxburg, 2007).


Stability Based Model Selection (2)

Example on the toy data:

If we can mathematically define this stability score for K, then stability can be used to find the correct model for the given data.


Define the Stability

Variation of Information (VI)

Let $C_1: X_1, \dots, X_k$ and $C_2: X'_1, \dots, X'_k$ be two clusterings of the same data X with n points.

The probability that a point p belongs to $X_i$ is $P(i) = |X_i| / n$.

The entropy of $C_1$ is $H(C_1) = -\sum_{i=1}^{k} P(i) \log P(i)$.

The joint probability that p is in $X_i$ and $X'_j$ is $P(i, j) = |X_i \cap X'_j| / n$, with entropy $H(C_1, C_2) = -\sum_{i,j} P(i, j) \log P(i, j)$.

The VI is defined as $VI(C_1, C_2) = 2 H(C_1, C_2) - H(C_1) - H(C_2)$, which equals $H(C_1) + H(C_2) - 2 I(C_1, C_2)$, where $I(C_1, C_2) = \sum_{i,j} P(i, j) \log \frac{P(i, j)}{P(i) P(j)}$ is the mutual information.

VI indicates a distance (in fact a metric) between two clusterings.
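Under these definitions, VI can be computed directly from two label vectors. A small pure-Python sketch; the names are mine:

```python
from collections import Counter
from math import log

def variation_of_information(a, b):
    """VI between two clusterings of the same n points, given as label lists.

    VI(C1, C2) = H(C1) + H(C2) - 2 I(C1, C2), computed from the cluster
    sizes |X_i|, |X'_j| and the overlap sizes |X_i ∩ X'_j|.
    """
    n = len(a)
    assert n == len(b)
    pa, pb = Counter(a), Counter(b)   # cluster sizes in each clustering
    joint = Counter(zip(a, b))        # overlap sizes |X_i ∩ X'_j|
    h = lambda counts: -sum(c / n * log(c / n) for c in counts.values())
    mi = sum(c / n * log((c / n) / ((pa[i] / n) * (pb[j] / n)))
             for (i, j), c in joint.items())
    return h(pa) + h(pb) - 2 * mi
```

Note that VI is zero exactly when the two clusterings agree up to a relabeling of the clusters, which is what makes it usable as a distance.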


Define the Stability (2)

Calculate the VI score for a single K:

Cluster the data using K-Means into K clusters; run this M times.

Calculate the pairwise VI of these M clusterings.

Average the VIs and use the result as the VI score for K.

The calculated VI score for K indicates the instability of K.

Try this over different values of K:

The K with the lowest VI score (instability) is chosen as the correct model.
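The whole procedure can be sketched end to end. This is a self-contained illustration: the tiny K-Means and VI helpers inside are only for demonstration, and all names are mine:

```python
import numpy as np
from itertools import combinations
from collections import Counter
from math import log

def vi(a, b):
    # Variation of Information between two label sequences of the same data.
    n = len(a)
    pa, pb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    h = lambda counts: -sum(c / n * log(c / n) for c in counts.values())
    mi = sum(c / n * log((c / n) / ((pa[i] / n) * (pb[j] / n)))
             for (i, j), c in joint.items())
    return h(pa) + h(pb) - 2 * mi

def kmeans_labels(X, K, rng, n_iter=50):
    # One K-Means run from a random initialization; returns the label vector.
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        centers = np.array([X[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return tuple(labels)

def instability(X, K, M=10, seed=0):
    """Average pairwise VI over M K-Means runs: the VI score for K."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_labels(X, K, rng) for _ in range(M)]
    pairs = list(combinations(runs, 2))
    return sum(vi(a, b) for a, b in pairs) / len(pairs)
```

On well-separated data the runs for the correct K agree, so the score is near zero; a wrong K yields varying partitions and a larger score.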


Define the Stability (3)

A good example of stability. A bad example of stability: symmetric data.

Why? Because clustering the data into 9 clusters apparently has more equally good grouping choices than clustering them into 3.


Hierarchical Stability

Problems with the concept of stability introduced above:

Symmetric data sets

Only local optimization (a bias toward the smaller K)

Proposed solution:

Analyze the stability in a hierarchical manner.

Do a unimodality test to detect the termination of the recursion.


Hierarchical Stability

Given: data set X

HS-means:

1. Test whether X is a unimodal cluster.

2. If it is not, find the optimal K for X by analyzing stability; otherwise, X is a single cluster, return.

3. Partition X into K subsets.

4. For each subset, recursively perform this algorithm from step 1.
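The recursion above can be sketched with the three components passed in as callbacks, since any unimodality test, stability-based K selection, and clustering routine could be plugged in. This is a structural sketch only; the callback names are mine:

```python
def hs_means(X, is_unimodal, best_k, partition):
    """HS-means recursion skeleton.

    is_unimodal(X) -> bool   : the unimodality test (steps 1-2)
    best_k(X)      -> int    : stability-based choice of K (step 2)
    partition(X,K) -> subsets: e.g. a K-Means split (step 3)

    Returns a nested structure: a leaf is a subset that passed the
    unimodality test; an internal node is a list of sub-hierarchies.
    Termination relies on the unimodality test eventually accepting
    every small-enough subset.
    """
    # Steps 1-2: if X looks unimodal, it is a single cluster.
    if is_unimodal(X):
        return X
    # Step 2: otherwise choose K by analyzing stability.
    K = best_k(X)
    # Steps 3-4: partition X into K subsets and recurse on each.
    return [hs_means(S, is_unimodal, best_k, partition) for S in partition(X, K)]
```

With toy callbacks on a 1-D list, `hs_means([0, 0.5, 10, 10.5], ...)` splits once and returns the two leaf groups.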


Unimodality Test: χ² Unimodality Test

Fact: a sum of squared standard Gaussians follows a χ² distribution.

If $x_1, \dots, x_d$ are $d$ independent standard Gaussian variables, then $S = x_1^2 + \dots + x_d^2$ follows a χ² distribution with $d$ degrees of freedom.

For a given data set X, calculate $S_i = X_{i1}^2 + \dots + X_{id}^2$.

If X is a single Gaussian, then S follows a χ² distribution of degree d; otherwise, S does not follow a χ² distribution.
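One way to turn this fact into a concrete test is to standardize each coordinate and compare the empirical distribution of S against χ²_d with a Kolmogorov–Smirnov test. This is a sketch of the idea, not the talk's exact procedure; the per-coordinate standardization and the KS comparison are my assumptions, and the function name is mine:

```python
import numpy as np
from scipy import stats

def chi2_unimodality_pvalue(X):
    """Chi-square unimodality check (a sketch).

    Standardize each coordinate (assumption: a single Gaussian cluster
    becomes roughly standard normal per coordinate), form
    S_i = sum_j Z_ij^2, and KS-test S against chi^2 with d degrees of
    freedom. A small p-value suggests X is not a single Gaussian.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    S = (Z ** 2).sum(axis=1)
    d = X.shape[1]
    return stats.kstest(S, "chi2", args=(d,)).pvalue
```

For a single Gaussian the p-value stays large; for a well-separated mixture, S is far too concentrated to be χ²_d and the test rejects.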


Unimodality Test: Gap Test

Fact: the within-cluster dispersion drops most sharply at the correct K (Tibshirani, 2000).

Given: data set X and a candidate k:

Cluster X into k clusters and get the within-cluster dispersion $W_k$.

Generate uniform reference data sets, cluster each into k clusters, and calculate $W^*_k$ (averaged).

$gap(k) = W^*_k - W_k$

Select the smallest k such that $gap(k) > gap(k+1)$.

We use it in another way: just ask, is k = 1?
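The "just ask k = 1?" usage can be sketched as follows. This is an illustration only: I use the logged dispersions from Tibshirani's formulation of the gap statistic, the inlined K-Means is for self-containment, and all names are mine:

```python
import numpy as np

def _kmeans_W(X, k, rng, n_iter=50):
    # Within-cluster dispersion W_k after one K-Means run.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))

def gap(X, k, B=10, seed=0):
    """Gap statistic for one k: mean log W*_k on B uniform reference sets
    drawn from X's bounding box, minus log W_k on X itself."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = [np.log(_kmeans_W(rng.uniform(lo, hi, X.shape), k, rng))
           for _ in range(B)]
    return float(np.mean(ref) - np.log(_kmeans_W(X, k, rng)))

def looks_unimodal(X, seed=0):
    # The slide's shortcut: accept k = 1 when gap(1) is already the winner.
    return gap(X, 1, seed=seed) >= gap(X, 2, seed=seed)
```

When X contains two well-separated clusters, W_2 collapses while the uniform references' W*_2 does not, so gap(2) dominates gap(1) and the test rejects k = 1.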


Experiments

Synthetic data:

Both Gaussian and uniform distributions.

Dimensions from 2 up to 20.

The c-separation between each cluster center and its nearest neighbor is 4.

200 points in each cluster, 10 clusters in total.

Handwritten Digits:

U.S. Postal Service handwritten digits.

9298 instances in 256 dimensions.

10 true clusters (maybe!).

KDDD Control Curves:

600 instances in 60 dimensions.

6 true clusters, each with 100 instances.

Estimated number of clusters (mean ± deviation over runs):

| Method | Synthetic Gaussian (10 true clusters) | Synthetic Uniform (10 true clusters) | Handwritten Digits (10 true clusters) | KDDD Control Curves (6 true clusters) |
|---|---|---|---|---|
| HS-means | 10 ± 1 | 10 ± 1 | 6 ± 0 | 6.5 ± 0.5 |
| Lange Stability | 6.5 ± 1.5 | 7 ± 1 | 2 ± 0 | 3 ± 0 |
| PG-means | 10 ± 1 | 19.5 ± 1.5 | 20 ± 1 | 17 ± 1 |


Experiments: symmetric data

(Figures: HS-means vs. Lange Stability on the symmetric data set.)


Future Work

A better unimodality testing approach.

A more detailed performance comparison with existing methods, such as within-cluster distance, the VI metric, and so on.

Improve the speed of the algorithm.
