CS181 Lecture 9: Clustering Algorithms and an Introduction to Probabilistic Methods

Avi Pfeffer; Revised by David Parkes and Ryan Adams

March 1, 2013

We turn now to the increasingly important topic of unsupervised learning. The first problem we study is that of clustering. One challenge will be to decide on a criterion by which to judge the quality of a hypothesis. We first discuss a parametric method, the K-means clustering algorithm. Then we look at hierarchical agglomerative clustering (HAC), which is a non-parametric method. We will examine how methods such as HAC can fail on high-dimensional data due to the "curse of dimensionality". Looking ahead, we will then formulate clustering as a probabilistic model using a mixture of Gaussians model. An approach to training within this probabilistic framework is delayed until next class.

1 The Clustering Problem

The clustering problem is defined as follows: given a set of examples $D = \{x_1, \dots, x_N\}$, $x_n \in \mathcal{X}$, learn a hypothesis that assigns each example to one of $K$ clusters:

$$h : D \to \{1, \dots, K\}. \tag{1}$$

We will assume here that $\mathcal{X} \subseteq \mathbb{R}^M$, so that an example is a vector in $M$-dimensional real space. Features may be discrete or continuous. Note that we are not necessarily interested in good performance on new examples. Rather, the task at hand for clustering is to identify structure in existing data. (This said, the K-means and probabilistic approaches will also immediately assign a cluster to new examples.)

This is an example of unsupervised learning, where the training set $D$ includes descriptions of examples but no labels. For example, think about being given a collection of feature vectors describing different animals and being asked to identify informative clusters of the different animals. Loosely speaking, the intuitive task is to find some "hidden structure," or patterns, in the data.

One way to think about clustering is as partitioning the examples so that each example is closer (under some distance metric) to the other examples in its own cluster than to examples in other clusters.

Another way to think about clustering, and unsupervised learning more generally, is to find a probabilistic model of the data, such that the observed data are generated with high probability according to the model. The parameters of such a probabilistic model can be viewed as providing a description of the data, representing hidden structure. This will prove to be a general and powerful approach and represents much of the current research frontier in machine learning.


A general challenge for the problem of unsupervised learning is to find a good measure of the performance of a learning algorithm. In supervised learning, we measure performance by running an algorithm on the training set to produce a classifier or regressor, and then measuring the accuracy on a test set. This method does not work for unsupervised learning, because we do not have target labels with which to judge generalization performance.

For clustering, we want to ask whether a learning algorithm does a good job of grouping similar examples together. One possible way to judge a clustering algorithm is to manually examine the clusters produced and see if they are intuitively coherent. But this is clearly not a systematic or formalized method.

We would prefer some objective way of telling whether an unsupervised learning algorithm produces a good model of our data (e.g., a clustering that captures meaningful structure, or a parameterized probabilistic model that nicely explains the data). Probabilistic approaches will be more successful in providing such a quantified analysis than other approaches. Still, we consider non-probabilistic approaches first, and in particular two very well-known algorithms: K-means clustering and hierarchical agglomerative clustering.

2 K-means Clustering

Intuitively, we would like to find clusters such that the distances between examples in a cluster are small compared with the distances to examples in other clusters.[1] We formalize this in K-means clustering by introducing a set of prototypes $\{\mu_1, \dots, \mu_K\}$, with $\mu_k \in \mathcal{X}$, to represent the center of each of $K$ clusters. The idea is to find prototypes that minimize the total squared distance (for some distance metric) from each example to its closest prototype. The examples for which the same prototype is the closest form a cluster, and in the minimum error solution each prototype will be positioned at the center (appropriately defined) of its cluster.

K-means is a parametric method because the learned hypothesis is represented by a small number of parameters: in this case the set of prototype vectors.

Formally, we associate binary indicator variables $r_{nk} \in \{0, 1\}$ (responsibilities) with each example $x_n$ and cluster $k$, such that $\sum_k r_{nk} = 1$ for every $n$, and with $r_{nk} = 1$ to indicate that example $x_n$ is associated with prototype (and thus cluster) $k$. We write $r_n$ for the vector $(r_{n1}, \dots, r_{nK})$.

Given this, the clustering problem can be viewed as finding prototypes and assignments of examples to prototypes to minimize the loss function:

$$L(\{\mu_k\}_{k=1}^{K}, \{r_n\}_{n=1}^{N}) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|x_n - \mu_k\|^2 \tag{2}$$

where $\|x - \mu\|^2 = \sum_{m=1}^{M} (x_m - \mu_m)^2$ is the squared Euclidean norm, defined on vectors $x, \mu \in \mathbb{R}^M$.

The K-means algorithm minimizes the error by repeatedly applying two successive steps, corresponding to successive optimizations of the error with respect to responsibilities $\{r_{nk}\}$ and then prototypes $\{\mu_k\}$.

It first chooses a random initial assignment of prototypes. Then, to minimize error with respect to $\{r_{nk}\}$, it assigns each example to its closest prototype. Then, to minimize error with respect to $\{\mu_k\}$, it assigns each prototype to the mean of the examples with which it is associated. And so on, until at some iteration no examples are reassigned and we have convergence.

[1] This section is based in part on Bishop (2007).


Figure 1: Local minima with K-means. (The figure marks a point labeled A.)

K-means($\{x_1, \dots, x_N\}$, $K$) =

  For each $k$, set $\mu_k$ to a random vector (e.g., one of the $x_n$)

  Repeat until convergence:

    For each $n$, set $r_{nk} = 1$ for $k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2$, and $r_{nk} = 0$ otherwise.

    For each $k$, set $\mu_k = \dfrac{\sum_{n=1}^{N} r_{nk} \, x_n}{\sum_{n=1}^{N} r_{nk}}$.
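
The two alternating steps above can be sketched in plain Python. This is only an illustrative sketch, not a reference implementation: the function names, the `seed` and `max_iters` arguments, and the choice to leave a prototype where it is when its cluster becomes empty are all assumptions made for the example.

```python
import random

def squared_distance(x, mu):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xm - mum) ** 2 for xm, mum in zip(x, mu))

def k_means(examples, K, seed=0, max_iters=100):
    """Return (prototypes, assignments) for a list of M-dimensional points."""
    rng = random.Random(seed)
    # Initialise each prototype to a distinct, randomly chosen example.
    prototypes = [list(x) for x in rng.sample(examples, K)]
    assignments = None
    for _ in range(max_iters):
        # Step 1: assign each example to its closest prototype (sets r_nk).
        new_assignments = [
            min(range(K), key=lambda k: squared_distance(x, prototypes[k]))
            for x in examples
        ]
        if new_assignments == assignments:
            break  # no example was reassigned, so we have converged
        assignments = new_assignments
        # Step 2: move each prototype to the mean of its assigned examples.
        for k in range(K):
            members = [x for x, a in zip(examples, assignments) if a == k]
            if members:  # assumption: an empty cluster keeps its old prototype
                prototypes[k] = [sum(col) / len(members) for col in zip(*members)]
    return prototypes, assignments

def loss(examples, prototypes, assignments):
    """The K-means loss of Eq. (2) for a hard assignment."""
    return sum(squared_distance(x, prototypes[a])
               for x, a in zip(examples, assignments))
```

Calling `k_means(points, K)` on a list of tuples returns the learned prototypes together with one cluster index per example; `loss` can be used to compare the results of different random restarts.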

This is such a simple algorithm that it is hard to believe it works, but it does. Each step can be shown to (weakly) decrease the error, and since there are only finitely many ways to assign examples to clusters, the algorithm is sure to converge. Still, convergence may be to a local rather than global minimum, so K-means is often rerun with restarts from other sets of random prototypes.

Note: The K-means algorithm can be understood as a special case of the EM algorithm, which is introduced next lecture and is a general approach for training probabilistic models.[2] As such, the K-means approach also admits a probabilistic interpretation.

The K-means algorithm has a strong inductive bias. In particular, it makes assumptions about the shapes of the clusters, namely that they are symmetric around the prototype vectors which form the center of each cluster. As a consequence, the decision boundary in determining whether to assign an example to cluster $k$ or $k'$ is linear. (It is a good idea to think about and understand this claim.)
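
One way to check the linearity claim, sketched here under the assumption of the squared Euclidean distance used in Eq. (2): an example $x$ lies on the boundary between clusters $k$ and $k'$ exactly when it is equidistant from the two prototypes. Expanding both squared norms, the quadratic term $\|x\|^2$ cancels, leaving an equation that is linear in $x$.

```latex
\begin{align*}
\|x - \mu_k\|^2 &= \|x - \mu_{k'}\|^2 \\
\|x\|^2 - 2\mu_k^{T} x + \|\mu_k\|^2 &= \|x\|^2 - 2\mu_{k'}^{T} x + \|\mu_{k'}\|^2 \\
2(\mu_{k'} - \mu_k)^{T} x &= \|\mu_{k'}\|^2 - \|\mu_k\|^2
\end{align*}
```

The last line describes a hyperplane: the perpendicular bisector of the segment joining $\mu_k$ and $\mu_{k'}$.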

The K-means algorithm works well, even with fairly small amounts of data, because of this strong inductive bias. The algorithm is fast, requiring $O(KMN)$ steps per iteration, since the work to compute the distance from each example to each prototype is $O(KMN)$, and the work to determine the new prototype positions is $O(MN)$. There will be some number of iterations $T$ before convergence, and $T$ tends (empirically) to be considerably less than $N$.

Sometimes K-means clustering performs very poorly. In particular, this occurs when the inductive hypothesis fails to hold. Consider again the example of the disc inside the ring from the lecture. The ring has the "wrong shape", and K-means is incapable of identifying it as a cluster and will produce completely the wrong result. (Would a transformation of the data help?)

As with most greedy hill-climbing algorithms, local minima can be a problem for K-means. Figure 1 shows a situation that could be problematic. We can recognize that there are two clusters, as well as an outlying example. Indeed, this solution does provide a minimum error clustering. However, there is another clustering that is a local minimum. The outlying example is assigned to one cluster on its own, and all other examples are assigned to the other cluster, with a mean at A.

[2] The first step inside the loop of K-means, which assigns each example to a cluster prototype, is an "E" step. The second step, which reassigns the cluster prototypes, is an "M" step.


3 Hierarchical Agglomerative Clustering

The second clustering algorithm we study is hierarchical agglomerative clustering (HAC).[3] HAC is an example of an instance-based or non-parametric algorithm; the clusters are defined directly by the examples they contain. The hypothesis is not defined by a small number of parameters that instantiate a class of models, but rather by an arbitrary (hierarchical) clustering of the data.

That HAC is non-parametric brings benefits and disadvantages: on the plus side, it does not restrict possible clusters to those that fit a parametric model and avoids the strong inductive hypothesis of K-means. On the negative side, it depends on inter-point distances and so can fail miserably with high-dimensional data (the so-called "curse of dimensionality"), and is also more computationally expensive.

As its name suggests, HAC generates a hierarchy, which can be useful and natural in some applications. This is achieved by "glomming" (or merging) clusters together to create larger and larger clusters. The algorithm begins with singleton clusters containing individual examples. At each iteration, the algorithm chooses two clusters to merge, and replaces them with their union. The algorithm is as follows:

HAC($\{x_1, \dots, x_N\}$, $K$) =

  $E = \{\{x_1\}, \dots, \{x_N\}\}$

  Repeat until $|E| = K$:

    Let $A, B$ be the two closest clusters in $E$

    Remove $A$ and $B$ from $E$

    Insert $A \cup B$ into $E$

In order to complete the specification of the algorithm, we need to supply a distance function (or metric) between clusters, so that we can determine which pair is the closest. There are several possible ways to measure the distance between two clusters $A$ and $B$. The choice of distance function greatly affects the behavior of the HAC algorithm. Possibilities include the following:

min: $d_{\min}(A, B) = \min_{a \in A, b \in B} \|a - b\|$

max: $d_{\max}(A, B) = \max_{a \in A, b \in B} \|a - b\|$

mean: $d_{\text{mean}}(A, B) = \frac{1}{|A||B|} \sum_{a \in A, b \in B} \|a - b\|$

centroid: $d_{\text{cent}}(A, B) = \left\| \frac{1}{|A|} \sum_{a \in A} a - \frac{1}{|B|} \sum_{b \in B} b \right\|$
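
The HAC loop and the four cluster distances can be sketched together as follows. This is a deliberately naive illustration (it recomputes distances in every round), and the function names as well as the default choice of the Euclidean distance between examples are assumptions made for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two individual examples."""
    return math.sqrt(sum((am - bm) ** 2 for am, bm in zip(a, b)))

def d_min(A, B, dist=euclidean):
    return min(dist(a, b) for a in A for b in B)

def d_max(A, B, dist=euclidean):
    return max(dist(a, b) for a in A for b in B)

def d_mean(A, B, dist=euclidean):
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def d_centroid(A, B, dist=euclidean):
    def centroid(C):
        return [sum(col) / len(C) for col in zip(*C)]
    return dist(centroid(A), centroid(B))

def hac(examples, K, linkage=d_min):
    """Naive HAC: merge the two closest clusters until K clusters remain."""
    clusters = [[x] for x in examples]  # start from singleton clusters
    while len(clusters) > K:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        # Delete j first (j > i) so that index i remains valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters
```

Passing a different `linkage` function (for instance `hac(points, 2, linkage=d_max)`) changes which clusters are considered closest, without changing the merging loop itself.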

All of these measures rely in turn on some underlying distance metric between individual examples; by varying the underlying metric, one can get many more possibilities. For example, a simple and common metric is based on the $\ell_1$ (or Manhattan distance) norm,

$$\|a - b\|_1 = \sum_{m=1}^{M} |a_m - b_m|,$$

and another common alternative is based on the $\ell_2$ (or Euclidean) norm,

$$\|a - b\|_2 = \sqrt{\sum_{m=1}^{M} (a_m - b_m)^2}.$$

[3] This section is based in part on Duda et al., 2001.


The HAC algorithm needs to do some bookkeeping to maintain the distances between pairs of clusters in $E$; the same distance should never have to be computed more than once. In addition, it may be a good idea to store all the distances between pairs of examples. For example, one algorithmic approach is to determine the pairwise distances between all examples once and for all, requiring $O(N^2 M)$ work. Given this, for the min distance metric between clusters, in each round we can identify the pair of examples in two different clusters with the minimal distance, which is an additional $O(N^2)$ work. For $K$ clusters we need to run for $N - K$ rounds, and so for $O(N^3 + N^2 M)$ steps altogether, since $K$ is typically a small constant; this is $O(N^3)$ when $M$ is at most on the order of $N$. This is a naive implementation and you should be able to think of things to do to speed up the computation. But we can certainly see that the algorithm is scaling at least as $O(N^2 M)$, compared to $O(KMNT)$ for K-means with $T$ rounds, and can be considerably slower than K-means since $T$ is typically much less than $N$.

What are the main properties of the HAC algorithm? Since it is a non-parametric method, it makes limited prior assumptions about the actual shape of the clusters and is able to learn clusters with complex shape. The behavior of HAC depends on the metric. For example, a min metric will tend to grow clusters that "snake" from example to example, as it seeks to connect clusters that are close to each other, while the centroid and max metrics will tend to grow convex clusters. HAC with the min metric tends to chain together clusters, and this "chaining effect" is sometimes considered to be a defect. HAC with the max metric tends to prefer clusters that are more compact and sphere-like. The mean and centroid approaches provide compromises between the min and max approaches.

Consider for example the disc and ring example from the lecture: a two-dimensional data set that consists of a disc with a surrounding ring. HAC with the min metric is capable of learning that the disc and ring are two clusters, despite the fact that they have very different shapes. For clusters that are compact and well separated, the various distance metrics tend to generate very similar results. However, if a good clustering includes clusters that are close to one another or have shapes that are not basically hyper-spherical, then different results can be obtained.

The main limitation of HAC is that it suffers from what is known as the curse of dimensionality. It tends to perform poorly in high-dimensional feature spaces. The problem is that all examples, even those that should belong to the same cluster, tend to be far apart. Consider, for example, a data set consisting of $M$ binary attributes. Suppose that the first $M'$ attributes are all correlated (either all true or all false), while the remainder of the attributes are uniformly distributed. You would like the algorithm to discover two clusters based on the first $M'$ attributes. Unfortunately, however, if $M$ is much larger than $M'$, the average distance between two examples of the same cluster will be almost as high as that between two examples of different clusters. Given examples $x$, $y$ and $z$, where $x$ and $y$ are in the same cluster and $z$ is in the other cluster, the probability that $x$ is closer to $y$ than to $z$ converges to $1/2$ as $M$ goes to infinity!
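
To make the effect concrete, here is a small sketch that computes expected Hamming distances for this example. It assumes that the $M'$ correlated attributes are identical within a cluster and all differ across clusters, and that each of the remaining $M - M'$ attributes is an independent fair coin flip; the function name is hypothetical.

```python
def expected_hamming(M, M_prime):
    """Expected Hamming distances as (same cluster, different clusters)."""
    # Each uniform attribute differs between two examples with probability 1/2.
    noise = (M - M_prime) / 2
    same = noise                  # the M' correlated attributes all agree
    different = M_prime + noise   # the M' correlated attributes all differ
    return same, different

for M in (20, 200, 2000):
    same, different = expected_hamming(M, M_prime=10)
    print(M, same, different, same / different)
```

With $M'$ held fixed, the ratio of the two expected distances approaches 1 as $M$ grows, which is why distance-based methods struggle to separate the clusters.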

4 A Probabilistic Approach

The probabilistic approach provides a coherent, mathematically sound framework for accomplishing many machine learning tasks.[4] We will visit this for the first time here, in the context of clustering. Probabilistic methods have become fundamental to modern machine learning and AI.

[4] This section is based in part on Bishop (2008) and Jaakkola (2009).


The basic approach is to propose a parametric, probabilistic model of the domain and then fit parameters to best explain the data (perhaps also considering an explicit prior, e.g., as a way to avoid over-fitting or to capture other knowledge about the domain). Describing a parametric probabilistic model is essentially the same as specifying the hypothesis space of a learning algorithm. The parametric form (i.e., the particular probability distribution model adopted) encodes a number of assumptions about the domain that may or may not be true. These assumptions provide the inductive bias for probabilistic methods.

For clustering, we will consider two particularly simple parametric forms. The first is the mixture of Gaussian (MoG) model, which is well suited for continuous domains. This will assume that the examples are generated from a mixture of multivariate Gaussian distributions. The second is a Naive Bayes model, and it is well suited to discrete attributes. This assumes that all attributes are conditionally independent of each other given that the example falls into a particular cluster.

4.1 Primer:Probability Theory

In this primer we assume the random variables $A$ and $B$ are discrete. If the domain is continuous then integrals take the role of summations, and probability mass functions (PMFs) become probability density functions (PDFs). When it is clear from context what we mean, we will use the notation $p(\cdot)$ to refer to both PDFs and PMFs.[5]

If the set of possible values of $A$ is $\mathcal{A}$, then we have $\sum_{a \in \mathcal{A}} p(A = a) = 1$ (the sum rule), and $p(A = a) \geq 0$ for all possible values $a$. The joint probability $p(A, B)$ is the probability that $A$ and $B$ take on particular values simultaneously. We have $p(A, B) = p(B, A)$, which means that $p(A = a, B = b) = p(B = b, A = a)$ for all $a \in \mathcal{A}$ and $b \in \mathcal{B}$, where $\mathcal{B}$ is the domain of values that can be adopted by $B$. $p(A \mid B)$ is the conditional probability of $A$ given $B$, defined as $p(A \mid B) = p(A, B) / p(B)$. From this, we have the product rule

$$p(A, B) = p(A) \, p(B \mid A) = p(B) \, p(A \mid B). \tag{3}$$

We also obtain Bayes' rule,

$$p(B \mid A) = \frac{p(A \mid B) \, p(B)}{p(A)}. \tag{4}$$

The above rules continue to hold when conditioning; for example, when conditioning on $B$ we have $\sum_a p(A = a \mid B) = 1$. Similarly, when introducing random variable $C$ we have

$$p(A, B \mid C) = p(A \mid C) \, p(B \mid A, C). \tag{5}$$

Two random variables are independent if $p(A, B) = p(A) \, p(B)$, from which it follows that $p(A \mid B) = p(A)$. Another useful concept is marginalization, which holds that $p(B) = \sum_a p(A = a, B)$ and gives the unconditional (marginal) probability $p(B = b)$, for some value $b$, without any knowledge about the value of $A$. For example, the denominator in Bayes' rule, $p(A)$, can be determined as $\sum_b p(A, B = b) = \sum_b p(A \mid B = b) \, p(B = b)$.
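
These rules are easy to check numerically on a small joint distribution. The table below is made up purely for illustration, and the function names are hypothetical; the sketch computes marginals and conditionals, and evaluates Bayes' rule as in Eq. (4).

```python
# Hypothetical joint distribution p(A = a, B = b) over binary A and B.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_A(a):
    """p(A = a), obtained by summing the joint over all values of B."""
    return sum(p for (a_, _), p in joint.items() if a_ == a)

def marginal_B(b):
    """p(B = b), obtained by summing the joint over all values of A."""
    return sum(p for (_, b_), p in joint.items() if b_ == b)

def cond_A_given_B(a, b):
    """p(A = a | B = b) = p(A = a, B = b) / p(B = b)."""
    return joint[(a, b)] / marginal_B(b)

def cond_B_given_A(b, a):
    """p(B = b | A = a) = p(A = a, B = b) / p(A = a)."""
    return joint[(a, b)] / marginal_A(a)

def bayes_B_given_A(b, a):
    """p(B = b | A = a) computed via Bayes' rule, Eq. (4)."""
    return cond_A_given_B(a, b) * marginal_B(b) / marginal_A(a)
```

For any pair of values, `bayes_B_given_A(b, a)` agrees with the direct definition `cond_B_given_A(b, a)` up to floating-point rounding.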

[5] This notation is somewhat unfortunate, although conventional in the machine learning literature.


4.2 Using Probabilistic Models

In the case of unsupervised learning, the basic probabilistic approach is to consider a parameterized probabilistic model,

$$p(X = x \mid \theta), \tag{6}$$

parameterized by a vector of parameters $\theta$, where random variable $X$ adopts values in feature space $\mathcal{X}$.

For a discrete feature space, $p(X = x \mid \theta)$ is the probability that random variable $X$ adopts value $x$. For a continuous feature space, $p(x \mid \theta)$ denotes the probability density function.[6]

The result of learning is, in the simplest case, a vector of learned parameters $\theta$. This is the case for the maximum likelihood method introduced today and also the more advanced maximum a posteriori method. A more sophisticated, full Bayesian approach instead reasons directly about a distribution on possible parameters. We leave this to one side for now.

Given such a parameterization, many interesting inference problems are possible. We dig more into this next class. For now, let us consider the very special cases of classification and clustering.

4.2.1 Classiﬁcation

In the case of supervised learning, in the probabilistic method we consider a probabilistic model,

$$p(X = x, Y = y \mid \theta) \tag{7}$$

where random variable $Y$ adopts values drawn from possible labels $\mathcal{Y}$.

Given learned parameters $\theta$, then for classification, we can label a new example $x$ by finding

\begin{align}
y^\star &= \arg\max_{y \in \mathcal{Y}} \; p(Y = y \mid X = x, \theta) \tag{8} \\
&= \arg\max_{y \in \mathcal{Y}} \; \frac{p(Y = y, X = x \mid \theta)}{p(X = x \mid \theta)} \tag{9} \\
&= \arg\max_{y \in \mathcal{Y}} \; p(Y = y, X = x \mid \theta) \tag{10}
\end{align}

where the idea is to find the class label for which the conditional probability is maximized, given example $x$. The first equality follows from the definition of conditional probability, and the second by noting that the denominator is constant for all choices of $y$.

4.2.2 Clustering

For clustering, the idea is to associate a so-called "latent" variable with each example. The latent variable is hidden, in that it is not present in the data, and is interpreted as the cluster associated with an example. Let $Y$ denote this latent variable, where the random variable $Y$ takes on values in $\mathcal{Y} = \{1, \dots, K\}$ if we are looking for $K$ clusters.

[6] The distinction here is that the probability of a single point in a continuous space with a density is zero, so we must talk about the probability of sets of such points that have volume, e.g., $p(x \mid \theta) \, dx = \Pr(X \in dx \mid \theta)$. A rigorous treatment of this topic is outside the scope of this course. For further reading on this topic, see Rosenthal's A First Look at Rigorous Probability Theory.


Given this, and assuming that the data are used to learn a parameterized model (note that it is not clear yet how this can be done, since one of the variables is unobservable after all!), the cluster assigned to an example is

\begin{align}
y^\star &= \arg\max_{y \in \mathcal{Y}} \; p(Y = y \mid X = x, \theta) \tag{11} \\
&= \arg\max_{y \in \mathcal{Y}} \; \frac{p(Y = y, X = x \mid \theta)}{p(X = x \mid \theta)} \tag{12} \\
&= \arg\max_{y \in \mathcal{Y}} \; p(Y = y, X = x \mid \theta), \tag{13}
\end{align}

that is, the value of the latent variable that is most likely given the example and the parameterized model.

4.3 The Maximum Likelihood Method

Now that we have an idea of how to use a probabilistic model for some interesting machine learning tasks, how can we learn? That is, how can we select a good parameterization of a probabilistic model given data? There are other questions too, but we will not address them for now. For example, we should also ask where the probabilistic model comes from and how we can determine the best model to use. This is analogous to the question of how many hidden units to use in a neural network model.

We first introduce the maximum likelihood method, which is very simple and quite typical in practice. Other methods of interest are the maximum a posteriori and full Bayesian approaches.

Under the assumption that the examples in data $D$ are independently and identically sampled according to $p(X = x \mid \theta)$, the joint probability of the data given parameters $\theta$ is

$$p(D \mid \theta) = \prod_{n=1}^{N} p(X_n = x_n \mid \theta) = L(\theta). \tag{14}$$

This function is the likelihood function and we think of it as a function of the parameters $\theta$.

Definition 1 (maximum likelihood) Given data $D$ and probabilistic model $p(X = x \mid \theta)$ with parameters $\theta$, the maximum likelihood approach trains a model by selecting parameters $\theta$ to maximize $L(\theta)$.

The parameters that maximize the likelihood represent a learned model and are the result of training. These are the parameters that assign the highest probability to the data, and preferring such parameters is referred to as the maximum likelihood principle. The resulting value of $\theta$ is the maximum likelihood estimate of the model parameters.

In practice, likelihoods often involve a product over many terms (as above, where there is a factor for each of the $N$ data), and so it is convenient for analysis and for computational stability (e.g., avoiding underflow and overflow) to perform computations in log space. That is, to compute with the natural logarithm of the likelihood, or log likelihood:

$$\ln L(\theta) = \sum_{n=1}^{N} \ln p(X_n = x_n \mid \theta). \tag{15}$$
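
The underflow problem is easy to reproduce. In the sketch below, the per-example likelihood values are made up for illustration; multiplying them directly underflows to zero in double-precision floating point, while summing their logarithms as in Eq. (15) stays well within range.

```python
import math

# Hypothetical per-example likelihoods p(x_n | theta) for N = 100 examples.
probs = [1e-5] * 100

# Direct product as in Eq. (14): the true value is 1e-500, which is
# smaller than the smallest positive double, so the result underflows.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Log-space sum as in Eq. (15): a perfectly ordinary negative number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)
print(log_likelihood)
```

Because the logarithm is monotonically increasing, the parameters that maximize the log likelihood are the same as those that maximize the likelihood itself.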

Finding the $\theta$ that maximizes the likelihood is an optimization problem. As in other optimization problems, it is typical to compute the gradient and either perform gradient ascent on the log likelihood (equivalently, gradient descent on its negative), or set the gradient to zero and analytically solve for $\theta$.


Figure 2: Univariate Gaussian distribution with $\mu = 0$ and $\sigma^2 = 1$. (Plot of the density $f(x)$ for $x$ from $-10$ to $10$.)

4.3.1 Example:Univariate Gaussian

First, the univariate (single variable) Gaussian (Normal) distribution has density function:

$$p(x \mid \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). \tag{16}$$

The parameters are $\theta = (\mu, \sigma^2)$, where $\mu = E[X]$ and $\sigma^2$ is the variance $E[(X - E[X])^2] = E[X^2] - \mu^2$. Here, expectation is denoted by $E[\cdot]$.

The form of the univariate Gaussian is shown in Figure 2 and should be quite familiar. It is symmetric about the mean. It has a single mode, at the mean. The density decays the further away you get from the mean. The lower the variance, the faster the density decays, and the sharper the mode.

Given i.i.d. data $D = \{x_1, \dots, x_N\}$, with $x_n \in \mathbb{R}$, we can now solve for the parameters $\theta_{ML}$ that maximize the likelihood function, i.e., maximize the probability of the data under a Gaussian distribution. First, we take the log, and obtain for a single instance

$$\ln p(x_n \mid \theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(x_n - \mu)^2. \tag{17}$$

We sum this across all of the $N$ data:

$$\ln p(D \mid \theta) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n - \mu)^2 - \frac{N}{2}\ln(\sigma^2) - \frac{N}{2}\ln(2\pi). \tag{18}$$

We now take the partial derivative with respect to $\mu$, and equate it to zero:

$$\frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n - \mu) = 0 \tag{19}$$

and so

$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n. \tag{20}$$

Similarly, taking the partial derivative with respect to parameter $\sigma^2$ and setting to zero, we obtain

$$\frac{1}{2(\sigma^2)^2}\sum_{n=1}^{N}(x_n - \mu_{ML})^2 - \frac{N}{2\sigma^2} = 0, \tag{21}$$

which solves to

$$\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{ML})^2. \tag{22}$$

Both expressions are just what we might expect. The maximum likelihood estimate of the mean is the sample mean, and the maximum likelihood estimate of the variance is the variance of the sample considered as the population.[7]
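
As a quick numerical check of Eqs. (18), (20) and (22), the following sketch computes the maximum likelihood estimates for a small made-up data set and evaluates the log likelihood. The function names are hypothetical.

```python
import math

def gaussian_mle(xs):
    """Maximum likelihood estimates (mu_ML, sigma2_ML) from Eqs. (20) and (22)."""
    N = len(xs)
    mu = sum(xs) / N
    var = sum((x - mu) ** 2 for x in xs) / N  # note: divides by N, not N - 1
    return mu, var

def log_likelihood(xs, mu, var):
    """Log likelihood of the data under a univariate Gaussian, Eq. (18)."""
    N = len(xs)
    return (-sum((x - mu) ** 2 for x in xs) / (2 * var)
            - N / 2 * math.log(var)
            - N / 2 * math.log(2 * math.pi))

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative data only
mu_ml, var_ml = gaussian_mle(data)
```

Evaluating `log_likelihood` at `(mu_ml, var_ml)` and at nearby parameter values is a simple way to confirm that the closed-form estimates sit at a maximum.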

4.4 The Mixture of Gaussian Model

We can now consider the problem of clustering. Consider a model with $K$ clusters. We will assume that the data are generated by first assigning a cluster to an example (this is the latent variable) and then, conditioned on the cluster, assigning a feature vector according to a multivariate Gaussian distribution.

The result is the well-known mixture of Gaussian (MoG) model:

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{23}$$

where parameters $\{\pi_1, \dots, \pi_K\}$ define the probability for generating an example from the $k$th cluster, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the density function associated with a multivariate Gaussian with mean $\mu_k$ and covariance $\Sigma_k$, so that the remaining parameters are $\{\mu_k, \Sigma_k\}_{k=1}^{K}$. Each of the $K$ multivariate Gaussians is referred to as a component. Such weighted sums of component distributions (where the weights are themselves a distribution) are referred to generally as mixture models and they are a very powerful way to build complex distributions from simple ones. As mentioned in lecture, they are generative models, which means that they naturally lead to algorithms that can draw data from the distribution. In the case of mixture models, one first draws from the discrete distribution given by $\pi$ to choose one of the $K$ components in the mixture. The actual observed data are then drawn from that component. This gives a hint as to how we will need to perform learning: we will need to infer the latent variables of these unknown draws from $\pi$ in order to fit the overall parameters. It also reveals how we can think of this as probabilistic clustering: each datum is literally generated by one of the $K$ component distributions.
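
The two-stage generative process can be sketched as follows. To keep the example free of matrix code it uses univariate Gaussian components, which is a simplification of the multivariate model in Eq. (23); the function name, the `seed` argument and the parameter values in the usage note are assumptions.

```python
import random

def sample_mixture(pis, mus, sigmas, n, seed=0):
    """Draw n (component, value) pairs from a univariate Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # First draw the latent component from the discrete distribution pi.
        y = rng.choices(range(len(pis)), weights=pis)[0]
        # Then draw the observed value from the chosen Gaussian component.
        x = rng.gauss(mus[y], sigmas[y])
        samples.append((y, x))
    return samples
```

For example, `sample_mixture([0.3, 0.7], [0.0, 5.0], [1.0, 1.0], n=100)` returns 100 pairs. In a clustering problem only the second element of each pair would be observed; the first is the latent variable that learning has to infer.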

[7] Digging a bit more deeply, this estimator of variance turns out to be biased towards zero, and the correction of making the denominator $N - 1$ instead of $N$ (known as Bessel's correction) can be introduced to fix this problem if necessary. The difference becomes negligible for large enough $N$.


We can make this generative view explicit and introduce a new random variable $Y$ that takes values in $\mathcal{Y} = \{1, \dots, K\}$. This results in a factorized distribution:

$$p(x, y \mid \theta) = p(Y = y \mid \pi) \, p(x \mid \mu_y, \Sigma_y). \tag{24}$$

Given this, and given maximum likelihood parameters $\theta_{ML}$, we would assign to example $x$ the cluster that maximizes

\begin{align}
y^\star &= \arg\max_{y \in \mathcal{Y}} \; p(Y = y \mid x, \theta_{ML}) \tag{25} \\
&= \arg\max_{y \in \mathcal{Y}} \; p(Y = y, x \mid \theta_{ML}) \tag{26} \\
&= \arg\max_{y \in \mathcal{Y}} \; p(Y = y \mid \theta_{ML}) \, p(x \mid Y = y, \theta_{ML}). \tag{27}
\end{align}

This can be easily evaluated. The first term is simply $\pi_y$ (in $\theta_{ML}$) associated with cluster $y$. The second term is given by the multivariate Gaussian density, parameterized according to $\theta_{ML}$.
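
The assignment rule of Eq. (27) can be sketched in a few lines. For simplicity the example uses univariate Gaussian components rather than the multivariate density, works in log space, and assumes every mixing weight is strictly positive; the function names and parameter values are hypothetical.

```python
import math

def log_gaussian(x, mu, var):
    """Log density of a univariate Gaussian, as in Eq. (17)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def assign_cluster(x, pis, mus, variances):
    """Return the index y maximizing pi_y * N(x | mu_y, var_y), as in Eq. (27)."""
    # Assumes all pis are strictly positive, so that log(pi) is defined.
    scores = [
        math.log(pi) + log_gaussian(x, mu, var)
        for pi, mu, var in zip(pis, mus, variances)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With two equally weighted unit-variance components centered at 0 and 5, a point near 0 is assigned to the first component and a point near 5 to the second. Unequal weights or variances shift the boundary between them.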

Before continuing, we can review the multivariate Gaussian distribution. This is defined on a random variable $X$ taking values in $\mathbb{R}^M$, and thus in $M$ dimensions, and has density function

$$p(x \mid \theta) = \frac{1}{(2\pi)^{M/2} \, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right), \tag{28}$$

where $\mu \in \mathbb{R}^M$ is the mean, $\Sigma$ is the $M \times M$ covariance matrix and $|\Sigma|$ is the determinant of the covariance matrix.

The covariance matrix is symmetric. Entry $\Sigma_{m,m'}$ (row $m$, column $m'$) in $\Sigma$ represents the covariance between features $m$ and $m'$, and thus a measure of whether or not they are correlated. Entries on the diagonal, i.e., entry $\Sigma_{m,m}$, represent the variance of the values adopted by individual features. Recall that the covariance of two real-valued random variables $A$ and $B$ is

$$\mathrm{Cov}(A, B) = E[(A - E[A])(B - E[B])] \tag{29}$$

so that $\mathrm{Cov}(A, A) = \mathrm{Var}(A)$.

How many independent parameters does an $M$-dimensional multivariate Gaussian have? There are $M(M+1)/2$ for $\Sigma$ (because it is symmetric), and $M$ for $\mu$, for a total of $M(M+3)/2$, i.e., on the order of $M^2$. The number of free parameters is a rough indication of how much data will be required to learn a model accurately. We see here that the amount of data required for the multivariate Gaussian grows quadratically with the dimension. Thus Gaussian models do not suffer as much from the curse of dimensionality suffered by non-parametric methods such as HAC.

Interesting special cases of multivariate Gaussians are

Independence (axis-aligned): Fix all off-diagonal elements of $\Sigma$, which indicate the covariance between variables, to 0. Given this, there are only $2M$ parameters.

Spherical multivariate Gaussian: Further assume that $\Sigma = \sigma^2 I$, where $I$ is the identity matrix with 1 entries on the diagonal and 0 entries everywhere else, and $\sigma^2$ is a scalar indicating the variance, the same in each dimension. This model includes only $M + 1$ parameters.

It can be common to insist on independence across dimensions, with diagonal covariance matrices, to reduce the number of parameters to be linear in the number of dimensions of the feature space.


4.5 Application to Clustering

Back to clustering, the probabilistic model (assuming $K$ clusters) is a MoG model,

$$p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{30}$$

and includes $(K - 1) + MK + M(M + 1)K/2$ parameters.

In principle, we can now find values for these parameters that maximize the probability of the data, i.e., $p(D \mid \theta)$. Given this we could then cluster the data using the approach explained above in Eq. (27). Next time we will see a powerful algorithm (the EM algorithm) to learn these parameters. The key challenge is that the cluster assignments are a latent variable (and not seen in the data). Without this complication it is a fairly simple matter to fit maximum likelihood parameters for a mixture of Gaussians model.
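
The parameter counts quoted in this section can be tabulated with a small helper. The function names and the string labels for the covariance structure are hypothetical, chosen only for this sketch.

```python
def gaussian_params(M, covariance="full"):
    """Number of free parameters of one M-dimensional Gaussian."""
    if covariance == "full":
        cov = M * (M + 1) // 2   # symmetric M x M covariance matrix
    elif covariance == "diagonal":
        cov = M                  # one variance per dimension
    elif covariance == "spherical":
        cov = 1                  # a single shared variance
    else:
        raise ValueError("unknown covariance structure: " + covariance)
    return M + cov               # M parameters for the mean

def mog_params(M, K, covariance="full"):
    """Free parameters of a K-component mixture: K - 1 weights plus the components."""
    return (K - 1) + K * gaussian_params(M, covariance)
```

For instance, `mog_params(2, 3)` gives $(3 - 1) + 2 \cdot 3 + 2 \cdot 3 \cdot 3 / 2 = 17$, matching the formula above. Only $K - 1$ mixing weights are counted because the weights must sum to one.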
