Lecture 9 - Ryan A. Rossi


Main Clustering Algorithms


K-Means

Hierarchical

SOM

K-Means


MacQueen, 1967

Clusters defined by means/centroids

Many clustering algorithms are derivatives of K-Means

Widespread use in industry and academia, despite its many problems
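To make the centroid idea above concrete, here is a minimal K-Means sketch in Python/NumPy. The two-dimensional toy data, k=3 and the fixed iteration count are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means: alternate assignment and centroid (mean) update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy data: three blobs in the plane (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
labels, centroids = kmeans(X, k=3)
```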



K-Means Example

Hierarchical Clustering

Starts by treating each point as its own cluster

Iteratively merges the most similar pair of clusters

A user-defined threshold parameter determines the output clusters (see the sketch below)
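A minimal sketch of this bottom-up procedure using SciPy; the toy data and the distance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(2, 0.2, (20, 2))])

# Start with every point as its own cluster and iteratively merge the
# most similar pair ("single" linkage = nearest-neighbour distance).
Z = linkage(X, method="single", metric="euclidean")

# A user-defined threshold on the merge distance determines the output clusters.
labels = fcluster(Z, t=0.5, criterion="distance")
print(np.unique(labels))
```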

Hierarchical Clustering Variants in Minitab© (a SciPy counterpart is sketched after the lists below)
Linkage Methods


Average


Centroid


Complete


McQuitty


Median


Single


Ward

Distance Measures


Euclidean


Manhattan


Pearson


Squared Euclidean


Squared Pearson
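Most of these options have counterparts outside Minitab. The sketch below shows how the same choices might be expressed with SciPy; note that McQuitty corresponds to SciPy's "weighted" linkage, "cityblock" is Manhattan, "correlation" stands in for the Pearson distance, and "Squared Pearson" has no built-in SciPy metric, so it is omitted. The data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(30, 4))  # illustrative data

# Linkage methods roughly matching the Minitab list
# (McQuitty is called "weighted" in SciPy).
methods = ["average", "centroid", "complete", "weighted", "median", "single", "ward"]

# Distance measures: Euclidean, Manhattan, Pearson-style correlation,
# and squared Euclidean.
metrics = ["euclidean", "cityblock", "correlation", "sqeuclidean"]

# Build one tree per linkage method (Euclidean distances throughout,
# since centroid/median/ward require them).
trees = {m: linkage(X, method=m, metric="euclidean") for m in methods}
```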



Hierarchical Clustering Example

Results

Still There are Problems

Clustering Documents

"Bag of words" representation

D_i: a vector of word frequencies of length l

Distance between D_i and D_j: <D_i, D_j>

Document-word matrix M (rows = documents, columns = words):

       W_1   W_2   W_3   ...   W_i   ...   W_j   ...   W_n
D_1:   f_11  f_21  f_31  ...   f_i1  ...   f_j1  ...   f_n1
D_2:   f_12  f_22  f_32  ...   f_i2  ...   f_j2  ...   f_n2
...
D_m:   f_1m  f_2m  f_3m  ...   f_im  ...   f_jm  ...   f_nm

Cluster Centroid

Cluster defined by distance to the centroid C

C = (1/m) Σ_i D_i, where m is the # of vectors

Elevations

Elevation of D: El(D) = <C, D>

Problem / Would like: (figures omitted)
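To make the centroid and elevation definitions concrete, here is a small NumPy sketch; the toy term-frequency matrix is an illustrative assumption.

```python
import numpy as np

# Toy document-word frequency matrix M: rows are documents D_i,
# columns are words W_1..W_n (illustrative values).
M = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 0., 3., 1.],
])
m = M.shape[0]

# Cluster centroid: C = (1/m) * sum_i D_i
C = M.sum(axis=0) / m

# Elevation of a document D: El(D) = <C, D>
def elevation(D, C=C):
    return float(np.dot(C, D))

print([elevation(D) for D in M])
```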


Mapping to a Higher Dimension


Utilizing a kernel function K(X, Y)

K(X, Y) = <Φ(X), Φ(Y)>,

where X, Y are vectors in R^n, and Φ is a mapping into R^d, d >> n

Key element in Support Vector Machines

Data needs to appear as dot products only: <D_i, D_j>

Kernel Function Examples


Polynomial:

K(X, Y) = (<X, Y> + 1)^n

Feedforward Neural Network Classifier:

K(X, Y) = tanh(β <X, Y> + b)

Radial Basis:

K(X, Y) = e^(−||X − Y||^2 / 2σ^2)
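The three kernels above, written out in NumPy; β, b, σ and the polynomial degree n are free parameters, and the default values below are only illustrative choices.

```python
import numpy as np

def polynomial_kernel(X, Y, n=3):
    # K(X, Y) = (<X, Y> + 1)^n
    return (np.dot(X, Y) + 1.0) ** n

def neural_network_kernel(X, Y, beta=0.1, b=0.0):
    # K(X, Y) = tanh(beta * <X, Y> + b)
    return np.tanh(beta * np.dot(X, Y) + b)

def radial_basis_kernel(X, Y, sigma=1.0):
    # K(X, Y) = exp(-||X - Y||^2 / (2 * sigma^2))
    diff = X - Y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```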


First Step: Penalizing Outliers



C_k = (1/m) Σ_i <D_i, N(C_{k-1})> D_i    (1)

Convergence:

C = principal eigenvector of M^T M, where M is the matrix of the D_i's

C = lim_{L→∞} (M^T M)^L U    (2)

Both (1) and (2) are efficient methods of computing C
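A sketch of both computations in NumPy. The normalization N(·) is taken here to mean rescaling to unit length, which is an assumption about the lecture's notation, and the toy matrix is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((50, 10))        # rows are the document vectors D_i (illustrative data)
m = M.shape[0]

def normalize(v):
    # Assumed meaning of N(.): rescale to unit length.
    return v / np.linalg.norm(v)

# (1) Outlier-penalizing centroid update:
#     C_k = (1/m) * sum_i <D_i, N(C_{k-1})> * D_i
C = M.mean(axis=0)
for _ in range(100):
    weights = M @ normalize(C)            # <D_i, N(C_{k-1})> for every i
    C = (weights[:, None] * M).sum(axis=0) / m

# (2) Power iteration: the direction of C is the principal eigenvector of M^T M.
U = rng.random(M.shape[1])
for _ in range(100):
    U = normalize(M.T @ (M @ U))

print(float(np.dot(normalize(C), U)))     # close to 1.0: both give the same direction
```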



Cannot do this with:

Φ_k = (1/m) Σ_i <Φ(D_i), N(Φ_{k-1})> Φ(D_i)

Or by using (2):

M = the matrix with rows Φ(D_1), Φ(D_2), ...

M^T M has unmanageable (eventually infinite) dimension

So instead we use:

a_i^k = <Φ(D_i), N(Φ_{k-1})> = (1/m) Σ_j a_j^{k-1} <Φ(D_i), Φ(D_j)>    (3)


Using Kernels to replace Φ

Theorem:

Φ = Σ_i a_i^* Φ(D_i),

a_i^* = lim { a_i^n = (1/m) Σ_j a_j^{n-1} K(D_i, D_j) }

El(D): elevation of vector D = Σ_i a_i^* K(D_i, D)

where the limit is taken as n → ∞
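A sketch of recursion (3) and the resulting elevation function, using only kernel evaluations. The kernel choice, the fixed iteration count, and the per-step rescaling of the coefficients (standing in for N(·)) are illustrative assumptions.

```python
import numpy as np

def poly_kernel(x, y, n=2):
    return (np.dot(x, y) + 1.0) ** n

rng = np.random.default_rng(0)
D = rng.random((40, 5))                 # document vectors (illustrative data)
m = len(D)

# Gram matrix K_ij = K(D_i, D_j): the data enter only through dot products.
K = np.array([[poly_kernel(D[i], D[j]) for j in range(m)] for i in range(m)])

# Recursion (3): a_i^k = (1/m) * sum_j a_j^{k-1} K(D_i, D_j).
# The coefficient vector is rescaled each step (an assumption standing in for N(.)).
a = np.full(m, 1.0 / m)
for _ in range(100):
    a = K @ a / m
    a = a / np.linalg.norm(a)

def elevation(x):
    # El(x) = sum_i a_i^* K(D_i, x)
    return float(sum(a[i] * poly_kernel(D[i], x) for i in range(m)))

print(elevation(D[0]))
```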



Zoomed Clusters

Clusters are defined through peaks

Peaks: all vectors which are the highest in their vicinity:

PEAKS = { D_j | El(D_j) ≥ El(D_i) <D_i, D_j>^S for all i }

S: sharpening/smoothing parameter

Cluster: the set of vectors which are in the vicinity of a peak
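A sketch of the peak test above; it assumes ≥ for the comparison (the symbol is garbled in the source) and uses unit-length toy vectors so that the self-comparison is harmless. Reading El(D_j) as <C, D_j> with C the centroid, the elevations are just column means of the similarity matrix.

```python
import numpy as np

def find_peaks(elevations, similarity, S=1.0):
    """Peaks: vectors that are the highest in their own vicinity.

    elevations: length-m array of El(D_i)
    similarity: m x m array of <D_i, D_j> (or kernel values)
    S: sharpening/smoothing parameter
    """
    m = len(elevations)
    peaks = []
    for j in range(m):
        # D_j is a peak if El(D_j) >= El(D_i) * <D_i, D_j>^S for all i.
        if all(elevations[j] >= elevations[i] * similarity[i, j] ** S for i in range(m)):
            peaks.append(j)
    return peaks

# Illustrative use with random unit-length vectors.
rng = np.random.default_rng(0)
X = rng.random((20, 3))
X = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = X @ X.T                 # <D_i, D_j>
elev = sim.mean(axis=0)       # El(D_j) = <C, D_j> for the linear kernel
print(find_peaks(elev, sim, S=1.0))
```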


Clustering Example

[Figure: two clusters C1 and C2; Kernel: Linear, S: Default (1)]

Zooming Example

[Figures: clusters C1, C2, C3 shown with Kernel: Linear, S: Default (1) and with Kernel: Polynomial Degree 2, S: 16]
Zoomed Clusters Results

[Figures: clusters C1, C2, C3 shown with Kernel: Polynomial Degree 8000, S: 1.5 and with Kernel: Polynomial Degree 8000, S: Default (1)]

Clustering MicroArray Data

[Figure: gene-expression matrix; rows = Genes, columns = Experiments; entry (i, j) = expression level of Gene i during Experiment j]

MicroArrays As Time Series

Clustering Time Series

Reveals groups of genes which have similar reactions to the experiments

Functionally related genes should cluster


Simulated Time Series


Simulated 180 time series, with 3 clusters and 9 sub-clusters (20 series per sub-cluster)

Each time series is a vector with 1000 components

Each component is the expression level at a given time
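A sketch of how such a data set might be simulated; the cluster shapes (sinusoids with cluster- and sub-cluster-specific offsets) and the noise level are my own illustrative assumptions, not the lecture's recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000                          # components per time series
series, labels = [], []

for cluster in range(3):                 # 3 main clusters
    for sub in range(3):                 # 3 sub-clusters each -> 9 sub-clusters
        base = np.sin(np.linspace(0, 2 * np.pi, n_points) * (cluster + 1)) + 0.3 * sub
        for _ in range(20):              # 20 series per sub-cluster -> 180 total
            series.append(base + rng.normal(0, 0.1, n_points))
            labels.append((cluster, sub))

X = np.vstack(series)                    # shape (180, 1000)
print(X.shape)
```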

Results

[Result figures for three settings:]

Kernel: Polynomial Degree 3, S: 6

Kernel: Polynomial Degree 3, S: 7

Kernel: Polynomial Degree 6, S: 15

HMM Parameter Estimation

Standard approach:

Sequential K-Means → Initial HMM Model → Baum-Welch Algorithm / Viterbi Algorithm → Refinement of HMM Model → Final HMM Model

Parameter Estimation with Zoomed Clusters:

Zoomed Clusters → Initial HMM Model → Baum-Welch Algorithm / Viterbi Algorithm → Refinement of HMM Model → Final HMM Model

Advantages:

Flexibility with the number of states

Initial model is closer to the final one

Consequences:

Higher accuracy and faster convergence for either Baum-Welch or Viterbi

Example: Coins

HHHHHTTTTTTTHHHHHHHTHTHTHTHTHTTTTTTTT

HHHHH  TTTTTTT  HHHHHHH  THTHTHTHTH  TTTTTTTT

Coin 1: 100% Heads

Coin 2: 100% Tails

Coin 3: 50% Heads, 50% Tails

Regions with a similar statistical distribution of Heads and Tails represent the states in the initial HMM model

Use elevation functions, separately for Heads and Tails, to represent these distributions

Step 1: Separating Letters

HHHHH HHHHHHH H H H H H ...        (Heads)
TTTTTTT T T T T T TTTTTTTT         (Tails)

Step 2: Calculating the Elevation Function for each letter

Step 3: For each position i in the sequence of throws, get the elevation functions for Heads and Tails and create a point D_i in R^2 whose components are the elevations:

Point D_i = [E_h, E_t]

Step 4: Cluster all the points obtained from each position
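A deliberately simplified sketch of Steps 1-4. Here a smoothed local frequency of Heads/Tails around each position stands in for the per-letter elevation functions; that substitution, the window size, and the reuse of a generic clustering step are my own simplifications, not the lecture's exact construction.

```python
import numpy as np

throws = "HHHHHTTTTTTTHHHHHHHTHTHTHTHTHTTTTTTTT"

# Step 1: separate letters into indicator sequences for Heads and Tails.
heads = np.array([c == "H" for c in throws], dtype=float)
tails = 1.0 - heads

# Step 2/3: a smoothed local frequency stands in for the elevation
# functions E_h and E_t (an illustrative simplification).
def local_elevation(indicator, window=5):
    kernel = np.ones(window) / window
    return np.convolve(indicator, kernel, mode="same")

E_h = local_elevation(heads)
E_t = local_elevation(tails)

# Step 3: one point D_i = [E_h(i), E_t(i)] per position in the sequence.
points = np.column_stack([E_h, E_t])

# Step 4: cluster the points; each cluster becomes a state of the initial HMM.
# (Any clustering method could be used here; see the K-Means sketch earlier.)
print(points.shape)
```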

What Clustering Achieves

Each cluster defines regions with a similar distribution of heads and tails

Each cluster is a state in the initial HMM model

State transition and emission probabilities are estimated from the clusters (a sketch follows)
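A hedged sketch of how emission and transition probabilities might be read off the cluster labels; the label sequence below and the add-one smoothing are illustrative assumptions.

```python
import numpy as np

throws = "HHHHHTTTTTTTHHHHHHHTHTHTHTHTHTTTTTTTT"
# Suppose clustering assigned one of 3 states to each position (illustrative labels).
states = np.array([0]*5 + [1]*7 + [0]*7 + [2]*10 + [1]*8)

n_states = 3
symbols = {"H": 0, "T": 1}

# Emission probabilities: frequency of H/T within each state's positions.
emission = np.ones((n_states, 2))            # add-one smoothing (assumption)
for s, c in zip(states, throws):
    emission[s, symbols[c]] += 1
emission /= emission.sum(axis=1, keepdims=True)

# Transition probabilities: frequency of state s -> state t between neighbours.
transition = np.ones((n_states, n_states))   # add-one smoothing (assumption)
for s, t in zip(states[:-1], states[1:]):
    transition[s, t] += 1
transition /= transition.sum(axis=1, keepdims=True)

print(emission.round(2))
print(transition.round(2))
```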



References

MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Pp. 281-297 in: L. M. Le Cam & J. Neyman [eds.], Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley.

Jain, A. K., Murty, M. N., and Flynn, P. J. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, September 1999.

http://www.gene-chips.com/ by Leming Shi, Ph.D.