Clustering With EM and K-Means

Neil Alldrin
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
nalldrin@cs.ucsd.edu

Andrew Smith
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
atsmith@cs.ucsd.edu

Doug Turnbull
Department of Computer Science
University of California, San Diego
La Jolla, CA 92037
dturnbul@cs.ucsd.edu

Abstract

Two standard algorithms for data clustering are expectation maximization (EM) and K-Means. We run these algorithms on various data sets to evaluate how well they work. For high-dimensional data we use random projection and principal component analysis (PCA) to reduce the dimensionality.

1 Introduction

The K-Means algorithm finds clusters by choosing K data points at random as initial cluster centers. Each data point is then assigned to the cluster whose center is closest to that point. Each cluster center is then replaced by the mean of all the data points that have been assigned to that cluster. This process is iterated until no data point is reassigned to a different cluster.
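The loop above can be sketched in a few lines of NumPy (an illustrative sketch, not the project's Matlab implementation; the function name and interface are hypothetical):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    """Basic K-Means: K random data points as initial centers; iterate
    until no point is reassigned. X is (n, d); returns (centers, labels)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    while True:
        # Assign each point to the cluster with the closest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            return centers, labels
        labels = new_labels
        # Replace each center by the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
```

Drawing the initial centers from the data itself guarantees that every center starts in a populated region of the space.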

EM finds clusters by determining a mixture of Gaussians that fits a given data set. Each Gaussian has an associated mean and covariance matrix. However, since we use spherical Gaussians, a variance scalar is used in place of the covariance matrix. The prior probability for each Gaussian is the fraction of points in the cluster defined by that Gaussian. These parameters can be initialized by randomly selecting the means of the Gaussians, or by using the output of K-Means as the initial centers. The algorithm converges on a locally optimal solution by iteratively updating the values of the means and variances.
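A corresponding sketch for EM with spherical Gaussians, keeping one scalar variance per component as described above (illustrative rather than the project's implementation; `n_iter` and the initial variance are arbitrary choices):

```python
import numpy as np

def em_spherical(X, k, means, n_iter=50):
    """EM for a mixture of k spherical Gaussians: each component has a
    mean, a scalar variance, and a prior equal to the fraction of points
    it explains. `means` holds initial centers (random points or the
    output of K-Means)."""
    n, d = X.shape
    means = means.astype(float)
    var = np.full(k, X.var())            # scalar variance per Gaussian
    prior = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point,
        # computed in log space for numerical stability.
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(prior)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and scalar variances.
        nk = r.sum(axis=0)
        prior = nk / n
        means = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (nk * d)
    return means, var, prior
```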

2 Low Dimensional Data Clustering

For the first part of our project, we implemented the EM and K-Means algorithms. Our implementations were tested on two sets of two-dimensional data: a distribution generated by two Gaussians and an annulus-shaped distribution.

2.1 K-Means on Two-Dimensional, Two-Gaussian Data

The K-Means algorithm works very well on this data set, effectively converging in three or four iterations (see figure 1).

Figure 1: The progress of the K-Means algorithm with random initialization on the two-Gaussian data set (note: some data points omitted for clarity). Panels show iterations 1, 2, and 4.

2.2 EM on Two-Dimensional, Two-Gaussian Data

The EM algorithm also performs well, typically converging within five iterations (see figure 2).

Figure 2: The progress of the EM algorithm with random initialization on the two-Gaussian data set (note: some data points omitted for clarity). The radius of the circle around each Gaussian is set to its variance. Panels show iterations 1, 4, and 7.

2.3 K-Means on Two-Dimensional, Annulus Data

On the annulus data, K-Means also works well, with the centers converging to points evenly distributed around the annulus in four or five iterations (see figure 3).

2.4 EM on Two-Dimensional, Annulus Data

On the annulus data set, the EM algorithm also performs well, converging within 10 iterations (see figure 4).

Figure 3: The progress of the K-Means algorithm with random initialization on the annulus data set (note: some data points omitted for clarity). Panels show iterations 1 through 6.

We verify that our code for EM progressively finds a better fit for the data by checking that the negative log likelihood never increases after an iteration. As can be seen in figure 5, this value decreases after each iteration.
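This check can be made explicit: the quantity below should never increase between EM iterations. The helper is hypothetical, written to match the spherical mixture described in the introduction:

```python
import numpy as np

def neg_log_likelihood(X, means, var, prior):
    """Negative log likelihood of the data under a mixture of spherical
    Gaussians; a correct EM implementation never increases this value
    from one iteration to the next."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    p = prior * np.exp(-0.5 * sq / var) / (2 * np.pi * var) ** (d / 2)
    return -np.log(p.sum(axis=1)).sum()
```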

3 High Dimensional Data Clustering

Most real-world data sets are very high-dimensional. However, the performance of clustering algorithms tends to scale poorly as the dimension of the data grows. For this reason, the dimensionality of data sets is often reduced by various techniques before clustering.

Our data set is very high-dimensional, since each data point is a 240 x 292 image with 256 shades of gray. Treating each pixel as a dimension yields a 70080-dimensional data set, which makes clustering difficult given our computing resources. To reduce the dimensionality of our data set, we experimented with random projections and principal component analysis (PCA). Random projections have the desirable property that highly eccentric, high-dimensional Gaussians become more spherical when projected down to a small random basis.
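A random projection can be as simple as multiplying by a Gaussian random matrix (a sketch; the report does not specify the projection used, and the 1/sqrt(target_dim) scaling, which approximately preserves pairwise distances per the Johnson-Lindenstrauss lemma, is an assumption):

```python
import numpy as np

def random_projection(X, target_dim, rng=np.random.default_rng(0)):
    """Project (n, D) data onto a random target_dim-dimensional basis.
    Scaling by 1/sqrt(target_dim) approximately preserves pairwise
    distances and norms."""
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R
```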

Our data set is a collection of images of the faces of 14 different people expressing different emotions. Each person was instructed to make a happy, sad, surprised, afraid, disgusted, and angry face. Our primary goal is to classify the facial expression of a given image by clustering our data set into six clusters, one for each emotion, and then calculating which cluster is most likely to contain that image. We are also interested in clustering our data set

Figure 4: The progress of the EM algorithm with random initialization on the annulus data set (note: some data points omitted for clarity). The radius of the circle around each Gaussian is set to its variance. Panels show iterations 1, 2, 4, 6, 7, and 9.

Figure 5: The negative log likelihood ("Error Function", ranging from about 3000 to 4200) of the EM algorithm on the annulus data set with random initialization, over 20 iterations.

into 14 clusters, one for each person, and conducting the analogous experiment of classifying images by person. We were concerned that clustering images to distinguish between emotions would find clusters of different people, rather than different facial expressions. To avoid this, we make use of an image of each person making a "neutral" face. We add a "difference image," defined as the difference between each image of a person expressing an emotion and that person's neutral face, to the data set. This set is used to cluster based on facial expressions, whereas we use the raw images to classify particular people. Our intuition was that clustering by person would be more successful than clustering by facial expression.

We downsample all images by a factor of 64 (8 in the x and y dimensions) to reduce the effects of noise. Intuitively, the downsampling does not remove information crucial to clustering, since a human can still identify people and their facial expressions at this resolution.
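The preprocessing steps, downsampling and forming difference images, might look like this (a sketch; the report does not say how downsampling was implemented, so block averaging is an assumption):

```python
import numpy as np

def downsample(img, factor=8):
    """Block-average an image by `factor` in each dimension; at factor 8
    this shrinks the pixel count by 64. Edge rows/columns that do not
    fill a whole block are cropped."""
    h = img.shape[0] - img.shape[0] % factor
    w = img.shape[1] - img.shape[1] % factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def difference_image(img, neutral):
    """Difference image: an expression image minus the same person's
    neutral face, used to cluster by expression rather than identity."""
    return img.astype(float) - neutral.astype(float)
```

Block averaging acts as a crude low-pass filter, which is consistent with the intuition that it mostly averages out noise.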

3.1 Classifying People

Supervised Clustering

Figure 6: Supervised clustering, 1080-dimensional data (no dimension reduction). Axes: image number (lines separate images of the same person) vs. cluster number.

Note about figures 6 and 7: The graphs show how each image was classified, by cluster number on the vertical axis, with the images on the horizontal axis. The images are sorted by known category, where each category is separated by vertical dash-dot lines. A square indicates a point classified by finding the cluster center (mean) to which that point is closest. A dot indicates a point classified by finding the Gaussian with the highest probability at that point.

As an initial test, we use all the images of a particular person to calculate a maximum-likelihood mean and variance for that person, and then use these 14 Gaussians to classify each image in the data set. We use two classification methods: one classifies a data point by finding the mean to which that point is closest; the other finds the Gaussian with the highest probability at that point. This supervised clustering test represents an upper bound on how well we can expect unsupervised clustering algorithms to perform.
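The two classification methods can be sketched as follows (function names are hypothetical; evaluating the Gaussian in log space sidesteps some, though not all, of the precision problems noted in the text):

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Maximum-likelihood spherical Gaussian per class: the mean and the
    scalar variance of that class's points."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([((X[y == c] - m) ** 2).mean()
                    for c, m in zip(classes, means)])
    return classes, means, var

def classify_nearest_mean(X, classes, means):
    """Method 1: assign each point to the class with the closest mean."""
    dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

def classify_max_probability(X, classes, means, var):
    """Method 2: assign each point to the Gaussian with the highest
    density at that point, evaluated in log space."""
    d = X.shape[1]
    sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
    return classes[log_p.argmax(axis=1)]
```

When all class variances are equal, the two methods coincide; they differ once the per-class variances do.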

Figure 6 is the result of fitting 14 Gaussians to the raw image data and then trying to classify each image. Evaluating Gaussians on such high-dimensional data causes numerical precision problems, so we are unable to use the probability method to classify images.

Figure 7: Supervised clustering, 15-dimensional data (random projection). Axes: image number (lines separate images of the same person) vs. cluster number.

Notice that only three (of 110) images are misclassified, indicating that the raw data are well separated.

Figure 7 is the same as figure 6, except the data were projected down to a random 15-dimensional basis. These low-dimensional data are still well separated, but there are a few more misclassifications than in the high-dimensional data. Reducing the number of dimensions with PCA yields comparable results.
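A PCA projection can be sketched via the SVD of the mean-centered data matrix (illustrative; the report does not say how its PCA was computed):

```python
import numpy as np

def pca_project(X, target_dim):
    """Project data onto its top `target_dim` principal components,
    computed from the SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:target_dim].T
```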

Unsupervised Clustering

Figure 8: Unsupervised clustering, 1080-dimensional data (no dimension reduction). Axes: image number (lines separate images of the same person) vs. cluster number.

Figure 9: Unsupervised clustering, 15-dimensional data (random projection). Axes: image number (lines separate images of the same person) vs. cluster number.

Figure 10: Unsupervised clustering, 15-dimensional data (PCA projection). Axes: image number (lines separate images of the same person) vs. cluster number.

Note about figures 8 through 10: The squares indicate the K-Means results and the dots indicate the EM results. Some EM results are not present due to numerical precision problems.

Figure 8 is the result of running K-Means (EM failed due to numerical precision problems) on the entire high-dimensional data set, looking for 14 clusters, and classifying all the data points according to the clusters found. K-Means found about ten clusters that correlate well with distinct people.

Figure 9 is the same as figure 8, except the data have been projected down to a 20-dimensional random basis and EM results are included. Both algorithms found about five clusters that correlated well with distinct people.

Figure 10 shows the clusters found by K-Means and EM on the data set projected down to 15 principal components. EM clusters marginally better than K-Means in this case. It also appears that the PCA basis worked better than the random basis.

3.2 Classifying Facial Expressions

Supervised Clustering

Figure 11: Supervised clustering, 1080-dimensional data (no dimension reduction). Axes: image number (lines separate images of the same expression) vs. cluster number.

Figure 12: Supervised clustering, 15-dimensional data (random projection). Axes: image number (lines separate images of the same expression) vs. cluster number.

To see if disjoint clusters of facial expressions exist in our data set, we use all the difference images of each facial expression to calculate a maximum-likelihood mean and variance for that expression, and then use these six Gaussians to classify the images. We use the same two classification methods as in section 3.1: classifying a data point by finding the mean to which that point is closest, and finding the Gaussian with the highest probability at that point. This supervised clustering test represents an upper bound on how well we can expect unsupervised clustering algorithms to perform.

Figure 11 shows that most data points can be identified with the correct cluster, indicating that the raw data can be partitioned into distinct clusters of facial expressions.

Figure 12 is the same as figure 11, except the data have been projected down to a 15-dimensional random basis. About a quarter of the data are misclassified, indicating the clusters in the projected data are less distinct.

Unsupervised Clustering

Figure 13: Unsupervised clustering, 1080-dimensional data (no dimension reduction). Axes: image number (lines separate images of the same expression) vs. cluster number.

Figure 13 is the result of running K-Means (EM failed due to numerical precision problems) on the entire high-dimensional data set, looking for 6 clusters, and classifying all the data points according to the clusters found. The only cluster that correlates well with a particular facial expression is cluster 1, corresponding to the happy expressions.

Figure 14 is the same as figure 13, except the data have been projected down to a 20-dimensional random basis and EM results are included. Again, the only cluster that correlates well with a particular facial expression is the cluster associated with happy expressions, but this cluster is not as disjoint (from other clusters) as in the high-dimensional data set, since there are more false positives and misses.

Figure 15 shows the clusters found by K-Means and EM on the data set projected down to 20 principal components. As in the previous two figures, only happiness correlates well with a particular cluster. There are slightly fewer misses than with the high-dimensional data; however, there are many more false positives.

Figure 14: Unsupervised clustering, 20-dimensional data (random projection). Axes: image number (lines separate images of the same expression) vs. cluster number.

Figure 15: Unsupervised clustering, 20-dimensional data (PCA projection). Axes: image number (lines separate images of the same expression) vs. cluster number.

4 Conclusion

Down-sampling the images greatly improved clustering in all cases. We suspect this is because most of the noise was averaged out.

We tried two techniques to reduce the dimensionality of our data set: projecting it down to a low-dimensional random basis, and principal component analysis. Both techniques degraded the quality of the clustering, but made the expectation maximization algorithm feasible. Our observation is that PCA was only marginally better, if at all, than a random projection, despite its computational intensity.

In general, we noticed that K-Means performs comparably to EM; however, EM fails on high-dimensional data sets due to numerical precision problems. Another problem with EM is that Gaussians often collapsed to delta functions. Our technique to prevent this was to reset the variance of collapsed Gaussians to a more reasonable value, and to set the mean of those Gaussians to random data points.
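The reset described above might look like this (a sketch; the variance floor and the choice of "reasonable value" are assumptions, since the report does not specify them):

```python
import numpy as np

def reset_collapsed(means, var, X, floor=1e-4, rng=np.random.default_rng(0)):
    """Reset any Gaussian whose variance fell below `floor`: its variance
    becomes the overall data variance and its mean moves to a random
    data point."""
    collapsed = var < floor
    var = np.where(collapsed, X.var(), var)
    means = means.copy()
    means[collapsed] = X[rng.choice(len(X), size=collapsed.sum())]
    return means, var
```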

We would have liked to run our clustering algorithms on our data sets and then validate the results by classifying novel data; however, when we reserved a portion of our data for validation, the clusters the algorithms found did not correspond at all to the classes we were trying to find. We strongly suspect this is due to a lack of enough sample points to define accurate Gaussians.

5 Future Work

The technique we use to convert images to feature vectors is simply to list all of the pixels in the image. Clearly, this is a naive approach because it ignores the correlations between neighboring pixels. Our first attempt was to create feature vectors that were the Gabor wavelet transforms of each image, using 4 scales and 6 rotations. This took a long time and produced a feature vector that was far too large for our computational resources. Future work should explore the potential of this approach by using a more efficient wavelet transform procedure.

In addition to K-Means and EM, K-Harmonic-Means is another clustering algorithm that could be used to classify images. For each data point, K-Harmonic-Means takes the harmonic average of the distances to all cluster centers, and every cluster center is updated accordingly. This algorithm is less sensitive to initial cluster centers than K-Means, and does not have the problem of collapsing Gaussians exhibited by EM. For these reasons, K-Harmonic-Means might find better clusters in high-dimensional data.
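One update step of K-Harmonic-Means [4] could be sketched as follows; the exponent `p` and the epsilon guard against zero distances are illustrative choices:

```python
import numpy as np

def khm_step(X, centers, p=3.5, eps=1e-8):
    """One K-Harmonic-Means update: every point pulls on every center,
    with weights derived from inverse distances, so the result depends
    far less on the initial centers than a K-Means update does."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    q = d ** (-p - 2)                                 # per-center attraction
    w = q.sum(axis=1) / (d ** -p).sum(axis=1) ** 2    # per-point weight
    m = q / q.sum(axis=1, keepdims=True)              # soft membership
    coef = m * w[:, None]
    return (coef.T @ X) / coef.sum(axis=0)[:, None]
```

Iterating this step from even a poor initialization typically pulls the centers apart toward distinct clusters, because nearby points dominate the inverse-distance weights.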

Individual Contributions

Neil Alldrin:

My primary contribution was writing the EM code (most of the em* files). This included a lot of effort devoted to handling collapsing Gaussians and preventing divide-by-zeros caused by lack of numerical precision (which was only a problem on high-dimensional data). I also generated the graphs for the 2-dimensional data and helped write the LaTeX document you are now reading. I had lots of fun.

Andrew Smith:

I learned Matlab. I wrote the script to load the images (loadFaces.m). I wrote the function to downsample images. I wrote K-Means. I wrote the scripts to generate the high-dimensional data graphs (section 3) in this paper. I wrote a function to generate a Gabor filter bank, using code from [2] to evaluate Gabor functions. I experimented with using Gabor wavelet transforms as feature vectors for images, but it was too slow. I had lots of fun.

Doug Turnbull:

My primary focus in this assignment was to design and implement various experiments for high-dimensional data using K-Means and EM. These tests included implementing random projection and PCA precomputation algorithms, creating scripts to run tests for various data sets, and collecting results for analysis. Developing these tests was often a nontrivial task due to the large number of parameters (projection matrices, clustering algorithm, data sets, etc.) that greatly affect the quality of the results. I had a little bit of fun.

References

[1] Bishop, C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press, New York.

[2] Manjunath, B.S. & Ma, W.Y. (1996) Texture Features for Browsing and Retrieval of Image Data. IEEE PAMI, vol. 18, no. 8, pp. 837-842.

[3] Dasgupta, S. (1999) Learning Mixtures of Gaussians. IEEE Symposium on Foundations of Computer Science (FOCS).

[4] Zhang, B. (2000) Generalized K-Harmonic Means. Hewlett-Packard Laboratories Technical Report.
