CS181 Lecture 9 — Clustering Algorithms and an Introduction to Probabilistic Methods
Avi Pfeffer; revised by David Parkes and Ryan Adams
March 1, 2013
We turn now to the increasingly important topic of unsupervised learning. The first problem we study is that of clustering. One challenge will be to decide on a criterion by which to judge the quality of a hypothesis. We first discuss a parametric method, the K-means clustering algorithm. Then we look at hierarchical agglomerative clustering (HAC), which is a non-parametric method. We will examine how methods such as HAC can fail on high-dimensional data due to the "curse of dimensionality". Looking ahead, we will then formulate clustering as a probabilistic model using a mixture of Gaussians model. An approach to training within this probabilistic framework is delayed until next class.
1 The Clustering Problem
The clustering problem is defined as follows: given a set of examples $D = \{x_1, \ldots, x_N\}$, $x_n \in \mathcal{X}$, learn a hypothesis that assigns each example to one of $K$ clusters:
$$h : D \to \{1, \ldots, K\}. \qquad (1)$$
We will assume here that $\mathcal{X} \subseteq \mathbb{R}^M$, so that an example is a vector in the $M$-dimensional real space. Features may be discrete or continuous. Note that we are not necessarily interested in good performance on new examples. Rather, the task at hand for clustering is to identify structure in existing data. (This said, the K-means and probabilistic approaches will also immediately assign a cluster to new examples.)
This is an example of unsupervised learning, where the training set $D$ includes descriptions of examples but no labels. For example, think about being given a collection of feature vectors describing different animals and being asked to identify informative clusters of the different animals. Loosely speaking, the intuitive task is to find some "hidden structure," or patterns, in the data.

One way to think about the clustering problem is as the task of partitioning examples into clusters so that each example is closer (under some distance metric) to the other examples in its own cluster than to examples in other clusters.

Another way to think about clustering, and unsupervised learning more generally, is to find a probabilistic model of the data, such that the observed data are generated with high probability according to the model. The parameters of such a probabilistic model can be viewed as providing a description of the data, representing hidden structure. This will prove to be a general and powerful approach and represents much of the current research frontier in machine learning.
A general challenge for the problem of unsupervised learning is to find a good measure of the performance of a learning algorithm. In supervised learning, we measure performance by running an algorithm on the training set to produce a classifier or regressor, and then measuring the accuracy on a test set. This method does not work for unsupervised learning, because we do not have target labels with which to judge generalization performance.

For clustering, we want to ask whether a learning algorithm does a good job of grouping similar examples together. One possible way to judge a clustering algorithm is to manually examine the clusters produced and see if they are intuitively coherent. But this is clearly not a systematic or formalized method.

We would prefer some objective way of telling whether an unsupervised learning algorithm produces a good model of our data (e.g., a clustering that captures meaningful structure, or a parameterized probabilistic model that nicely explains the data). Probabilistic approaches will be more successful in providing such a quantified analysis than other approaches. Still, we consider non-probabilistic approaches first, and in particular two very well known algorithms: K-means clustering and hierarchical agglomerative clustering.
2 K-means Clustering
Intuitively, we would like to find clusters such that the inter-example distances between examples in a cluster are small compared with the distances to examples in other clusters.[1] We formalize this in K-means clustering by introducing a set of prototypes $\{\mu_1, \ldots, \mu_K\}$, with $\mu_k \in \mathcal{X}$, to represent the center of each of the $K$ clusters. The idea is to find prototypes that minimize the total squared distance (for some distance metric) from each example to its closest prototype. The examples for which the same prototype is the closest form a cluster, and in the minimum error solution each prototype will be positioned at the center (appropriately defined) of each cluster.

K-means is a parametric method because the learned hypothesis is represented by a small number of parameters: in this case the set of prototype vectors.

Formally, we associate a binary indicator $r_{nk} \in \{0, 1\}$ (the responsibilities) with each example $x_n$ and cluster $k$, such that $\sum_k r_{nk} = 1$ for every $n$, and with $r_{nk} = 1$ indicating that example $x_n$ is associated with prototype (and thus cluster) $k$.
Given this, the clustering problem can be viewed as finding prototypes and assignments of examples to prototypes that minimize the loss function:
$$L(\{\mu_k\}_{k=1}^K, \{r_n\}_{n=1}^N) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \, \|x_n - \mu_k\|^2 \qquad (2)$$
where $\|x - \mu\|^2 = \sum_{m=1}^M (x_m - \mu_m)^2$ is the squared Euclidean norm, defined on vectors $x, \mu \in \mathbb{R}^M$.
The K-means algorithm minimizes the error by repeatedly applying two successive steps, corresponding to successive optimizations of the error with respect to the responsibilities $\{r_{nk}\}$ and then the prototypes $\{\mu_k\}$.

It first chooses a random initial assignment of prototypes. Then, to minimize error with respect to $\{r_{nk}\}$, it assigns each example to its closest prototype. Then, to minimize error with respect to $\{\mu_k\}$, it assigns each prototype to the mean of the examples with which it is associated. And so on, until at some iteration no examples are reassigned and we have convergence.

[1] This section is based in part on Bishop (2007).
Figure 1: Local minima with K-means.
K-means($\{x_1, \ldots, x_N\}$, $K$) =
    For each $k$, set $\mu_k$ to a random vector (e.g., one of the $x_n$)
    Repeat until convergence:
        For each $n$, set $r_{nk} = 1$ for $k = \arg\min_{k'} \|x_n - \mu_{k'}\|^2$, and $r_{nk} = 0$ otherwise
        For each $k$, set $\mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}$
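The two alternating steps are short enough to state directly in code. The following is a minimal NumPy sketch of the same procedure, not a reference implementation from the course; the function name kmeans and the handling of empty clusters are choices made here for illustration.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-means sketch: X is an (N, M) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialize prototypes to K distinct randomly chosen examples.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    assign = None
    for _ in range(max_iters):
        # Responsibility step: assign each example to its closest prototype.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # no example was reassigned: converged
        assign = new_assign
        # Prototype step: move each prototype to the mean of the examples assigned to it.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:  # leave an empty cluster's prototype where it is
                mu[k] = members.mean(axis=0)
    return mu, assign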
This is such a simple algorithm that it is hard to believe it works, but it does. Each step can be shown to (weakly) decrease the error, and therefore the algorithm is sure to converge. Still, convergence may be to a local rather than global minimum, and K-means can be restarted with some other set of random prototypes.

Note: The K-means algorithm can be understood as a special case of the EM algorithm, which is introduced next lecture and is a general approach for training probabilistic models.[2] As such, the K-means approach also admits a probabilistic interpretation.

The K-means algorithm has a strong inductive bias. In particular, it makes assumptions about the shapes of the clusters, namely that they are symmetric around the prototype vectors which form the center of each cluster. In particular, the decision boundary in determining whether to assign an example to cluster $k$ or $k'$ is linear. (It is a good idea to think about and understand this claim.)

The K-means algorithm works well, even with fairly small amounts of data, because of this strong inductive bias. The algorithm is fast, requiring $O(KMN)$ steps per iteration, since the work to compute the distance from each example to each prototype is $O(KMN)$, and the work to determine the new prototype positions is $O(MN)$. There will be some number of iterations $T$ before convergence, and $T$ tends (empirically) to be considerably less than $N$.

Sometimes K-means clustering performs very poorly. In particular, this occurs when the inductive hypothesis fails to hold. Consider again the example of the disc inside the ring from the lecture. The ring has the "wrong shape", and K-means is incapable of identifying it as a cluster and will produce completely the wrong result. (Would a transformation of the data help?)

As with most greedy hill-climbing algorithms, local minima can be a problem for K-means. Figure 1 shows a situation that could be problematic. We can recognize that there are two clusters, as well as an outlying example. Indeed, this solution does provide a minimum error clustering. However, there is another clustering that is a local minimum: the outlying example is assigned to one cluster on its own, and all other examples are assigned to the other cluster, with a mean at A.
[2] The first step inside the loop of K-means, which assigns each example to a cluster prototype, is an "E" step. The second step, which reassigns the cluster prototypes, is an "M" step.
3 Hierarchical Agglomerative Clustering
The second clustering algorithm we study is hierarchical agglomerative clustering (HAC).[3] HAC is an example of an instance-based or non-parametric algorithm; the clusters are defined directly by the examples they contain. The hypothesis is not defined by a small number of parameters that instantiate a class of models, but rather by an arbitrary (hierarchical) clustering of the data.

That HAC is non-parametric brings benefits and disadvantages: on the plus side, it does not restrict possible clusters to those that fit a parametric model and avoids the strong inductive hypothesis of K-means. On the negative side, it depends on inter-point distances and so can fail miserably with high-dimensional data (the so-called "curse of dimensionality"), and is also more computationally expensive.

As its name suggests, HAC generates a hierarchy, which can be useful and natural in some applications. This is achieved by "glomming" (or merging) clusters together to create larger and larger clusters. The algorithm begins with singleton clusters containing individual examples. At each iteration, the algorithm chooses two clusters to merge, and replaces them with their union. The algorithm is as follows:
HAC($\{x_1, \ldots, x_N\}$, $K$) =
    $E = \{\{x_1\}, \ldots, \{x_N\}\}$
    Repeat until $|E| = K$:
        Let $A, B$ be the two closest clusters in $E$
        Remove $A$ and $B$ from $E$
        Insert $A \cup B$ into $E$
In order to complete the specification of the algorithm, we need to supply a distance function (or metric) between clusters, so that we can determine which pair is the closest. There are several possible ways to measure the distance between two clusters $A$ and $B$. The choice of distance function greatly affects the behavior of the HAC algorithm. Possibilities include the following:

min: $d_{\min}(A, B) = \min_{a \in A, b \in B} \|a - b\|$

max: $d_{\max}(A, B) = \max_{a \in A, b \in B} \|a - b\|$

mean: $d_{\text{mean}}(A, B) = \frac{1}{|A||B|} \sum_{a \in A, b \in B} \|a - b\|$

centroid: $d_{\text{cent}}(A, B) = \left\| \frac{1}{|A|} \sum_{a \in A} a - \frac{1}{|B|} \sum_{b \in B} b \right\|$
All of these measures rely in turn on some underlying distance metric between individual examples; by varying the underlying metric, one can get many more possibilities. For example, a simple and common metric is based on the $\ell_1$ (or Manhattan distance) norm,
$$\|a - b\|_1 = \sum_{m=1}^M |a_m - b_m|,$$
and another common alternative is based on the $\ell_2$ (or Euclidean) norm,
$$\|a - b\|_2 = \sqrt{\sum_{m=1}^M (a_m - b_m)^2}.$$
[3] This section is based in part on Duda et al. (2001).
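As a concrete (and deliberately naive) illustration, the sketch below implements HAC with the min linkage under the Euclidean norm. The function name hac_min_linkage and the index-list representation of clusters are choices made here; the max or mean linkages would replace the .min() with .max() or .mean(). This is the O(N^3) naive version discussed in the next paragraph, not an optimized implementation.

import numpy as np

def hac_min_linkage(X, K):
    """Naive HAC sketch with the 'min' (single-link) cluster distance.

    X is an (N, M) array; returns a list of K clusters, each a list of example indices.
    """
    N = len(X)
    # Pairwise Euclidean distances between all examples, computed once.
    diffs = X[:, None, :] - X[None, :, :]
    pair_dist = np.sqrt((diffs ** 2).sum(axis=2))
    clusters = [[n] for n in range(N)]           # start with singleton clusters
    while len(clusters) > K:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # min linkage: distance between the closest pair of members
                d = pair_dist[np.ix_(clusters[i], clusters[j])].min()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters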
The HAC algorithm needs to do some bookkeeping to maintain the distances between pairs of clusters in $E$ – the same distance should never have to be computed more than once. In addition, it may be a good idea to store all the distances between pairs of examples. For example, one algorithmic approach is to determine the pairwise distances between all examples once and for all, requiring $O(N^2 M)$ work. Given this, for the min distance metric between clusters, in each round we can identify the pair of examples in two different clusters with the minimal distance, which is an additional $O(N^2)$ work. For $K$ clusters we need to run for $N - K$ rounds, and so $O(N^3 + N^2 M) = O(N^3)$ steps altogether, since $K$ is typically a small constant. This is a naive implementation and you should be able to think of things to do to speed up the computation. But we can certainly see that the algorithm scales at least as $O(N^2 M)$, compared to $O(KMNT)$ for K-means with $T$ rounds, and can be considerably slower than K-means since $T$ is typically much less than $N$.
What are the main properties of the HAC algorithm? Since it is a non-parametric method, it makes limited prior assumptions about the actual shape of the clusters and is able to learn clusters with complex shapes. The behavior of HAC depends on the metric. For example, a min metric will tend to grow clusters that "snake" from example to example, as it seeks to connect clusters that are close to each other, while the centroid and max metrics will tend to grow convex clusters. HAC with the min metric tends to chain together clusters, and this "chaining effect" is sometimes considered to be a defect. HAC with the max metric tends to prefer clusters that are more compact and spherical. The mean and centroid approaches provide compromises between the min and max approaches.

Consider for example the disc and ring example from the lecture: a two-dimensional data set that consists of a disc with a surrounding ring. HAC with the min metric is capable of learning that the disc and ring are two clusters, despite the fact that they have very different shapes. For clusters that are compact and well separated, the various distance metrics tend to generate very similar results. However, if a good clustering includes clusters that are close to one another or have shapes that are not basically hyper-spherical, then different results can be obtained.

The main limitation of HAC is that it suffers from what is known as the curse of dimensionality. It tends to fail and have poor performance in high-dimensional feature spaces. The problem is that all examples, even those that should belong to the same cluster, tend to be far apart. Consider, for example, a data set consisting of $M$ binary attributes. Suppose that the first $M'$ attributes are all correlated (either all true or all false), while the remainder of the attributes are uniformly distributed. You would like the algorithm to discover two clusters based on the first $M'$ attributes. Unfortunately, however, if $M$ is much larger than $M'$, the average distance between two examples of the same cluster will be almost as high as that between two examples of different clusters. Given examples $x$, $y$ and $z$, where $x$ and $y$ are in the same cluster and $z$ is in the other cluster, the probability that $y$ is closer to $x$ than $z$ is converges to $1/2$ as $M$ goes to infinity!
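A quick simulation makes this concrete. The sketch below is not from the original notes; it assumes M0 informative bits out of M total and uses Hamming distance, drawing x and y from one cluster and z from the other and estimating how often y is the strictly closer point. With M0 fixed, the estimate drifts toward 1/2 as M grows.

import numpy as np

def prob_same_cluster_closer(M, M0=5, trials=20000, seed=0):
    """Estimate P(y closer to x than z) for M binary attributes, of which the first M0
    are informative (all 1 in cluster A, all 0 in cluster B) and the rest are uniform noise."""
    rng = np.random.default_rng(seed)
    noise = rng.integers(0, 2, size=(3, trials, M - M0))
    x = np.concatenate([np.ones((trials, M0), dtype=int), noise[0]], axis=1)   # cluster A
    y = np.concatenate([np.ones((trials, M0), dtype=int), noise[1]], axis=1)   # cluster A
    z = np.concatenate([np.zeros((trials, M0), dtype=int), noise[2]], axis=1)  # cluster B
    d_xy = (x != y).sum(axis=1)   # Hamming distance within the cluster
    d_xz = (x != z).sum(axis=1)   # Hamming distance across clusters
    return (d_xy < d_xz).mean()   # ties count as "not closer"

for M in (10, 50, 500, 5000):
    print(M, prob_same_cluster_closer(M))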
4 A Probabilistic Approach
The probabilistic approach provides a coherent, mathematically sound framework for accomplishing many machine learning tasks.[4] We will visit it for the first time here, in the context of clustering. Probabilistic methods have become fundamental to modern machine learning and AI.

[4] This section is based in part on Bishop (2008) and Jaakkola (2009).
The basic approach is to propose a parametric, probabilistic model of the domain and then fit parameters to best explain the data (perhaps also considering an explicit prior, e.g., as a way to avoid over-fitting or to capture other knowledge about the domain). Describing a parametric probabilistic model is essentially the same as specifying the hypothesis space of a learning algorithm. The parametric form (i.e., the particular probability distribution model adopted) encodes a number of assumptions about the domain that may or may not be true. These assumptions provide the inductive bias for probabilistic methods.

For clustering, we will consider two particularly simple parametric forms. The first is the mixture of Gaussians (MoG) model, which is well suited to continuous domains. This assumes that the examples are generated from a mixture of multivariate Gaussian distributions. The second is a Naive Bayes model, which is well suited to discrete attributes. It assumes that all attributes are conditionally independent of each other given that the example falls into a particular cluster.
4.1 Primer: Probability Theory
In this primer we assume the random variables $A$ and $B$ are discrete. If the domain is continuous then integrals take the role of summations, and probability mass functions (PMFs) become probability density functions (PDFs). When it is clear from context what we mean, we will use the notation $p(\cdot)$ to refer to both PDFs and PMFs.[5]

If the set of possible values of $A$ is $\mathcal{A}$, then we have $\sum_{a \in \mathcal{A}} p(A = a) = 1$ (the sum rule), and $p(A = a) \geq 0$ for all possible values $a$. The joint probability $p(A, B)$ is the probability that $A$ and $B$ take on particular values simultaneously. We have $p(A, B) = p(B, A)$, which means that $p(A = a, B = b) = p(B = b, A = a)$ for all $a \in \mathcal{A}$ and $b \in \mathcal{B}$, where $\mathcal{B}$ is the domain of values that can be adopted by $B$. $p(A \mid B)$ is the conditional probability of $A$ given $B$, defined as $p(A \mid B) = p(A, B)/p(B)$. From this, we have the product rule
$$p(A, B) = p(A)\, p(B \mid A) = p(B)\, p(A \mid B). \qquad (3)$$
We also obtain Bayes' rule,
$$p(B \mid A) = \frac{p(A \mid B)\, p(B)}{p(A)}. \qquad (4)$$
The above rules continue to hold when conditioning; for example, when conditioning on $B$ we have $\sum_a p(A = a \mid B) = 1$. Similarly, when introducing random variable $C$ we have
$$p(A, B \mid C) = p(A \mid C)\, p(B \mid A, C). \qquad (5)$$
Two random variables are independent if $p(A, B) = p(A)\, p(B)$, from which it follows that $p(A \mid B) = p(A)$. Another useful concept is marginalization, which holds that $p(B) = \sum_a p(A = a, B)$ and gives the unconditional (marginal) probability of $p(B = b)$, for some value $b$, without any knowledge about the value of $A$. For example, the denominator in Bayes' rule, $p(A)$, can be determined as $\sum_b p(A, B = b) = \sum_b p(A \mid B = b)\, p(B = b)$.

[5] This notation is somewhat unfortunate, although conventional in the machine learning literature.
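These rules are easy to sanity-check numerically. The sketch below builds a small joint distribution with made-up numbers (they only need to be nonnegative and sum to one) and verifies marginalization, the conditional definition, and Bayes' rule.

import numpy as np

# A small joint distribution p(A, B) over A in {0, 1} (rows) and B in {0, 1, 2} (columns).
p_AB = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

p_A = p_AB.sum(axis=1)             # marginalization: p(A) = sum_b p(A, B=b)
p_B = p_AB.sum(axis=0)             # p(B) = sum_a p(A=a, B)
p_B_given_A = p_AB / p_A[:, None]  # conditional: p(B | A) = p(A, B) / p(A)
p_A_given_B = p_AB / p_B[None, :]  # p(A | B)

# Bayes' rule: p(B | A) should equal p(A | B) p(B) / p(A).
bayes_rhs = (p_A_given_B * p_B[None, :]) / p_A[:, None]
assert np.allclose(p_B_given_A, bayes_rhs)
print(p_B_given_A)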
4.2 Using Probabilistic Models
In the case of unsupervised learning, the basic probabilistic approach is to consider a parameterized probabilistic model,
$$p(X = x \mid \theta), \qquad (6)$$
parameterized by a vector of parameters $\theta$, where random variable $X$ adopts values in feature space $\mathcal{X}$.

For a discrete feature space, $p(X = x \mid \theta)$ is the probability that random variable $X$ adopts value $x$. For a continuous feature space, $p(x \mid \theta)$ denotes the probability density function.[6]

The result of learning is, in the simplest case, a vector of learned parameters $\theta$. This is the case for the maximum likelihood method introduced today and also the more advanced maximum a posteriori method. A more sophisticated, full Bayesian approach instead reasons directly about a distribution over possible parameters. We leave this to one side for now.

Given such a parameterization $\theta$, many interesting inference problems are possible. We dig more into this next class. For now, let us consider the very special cases of classification and clustering.
4.2.1 Classification
In the case of supervised learning, in the probabilistic method we consider a probabilistic model,
$$p(X = x, Y = y \mid \theta) \qquad (7)$$
where random variable $Y$ adopts values drawn from the possible labels $\mathcal{Y}$.

Given learned parameters $\theta$, then for classification we can label a new example $x$ by finding
$$y^\star = \arg\max_{y \in \mathcal{Y}} p(Y = y \mid X = x, \theta) \qquad (8)$$
$$= \arg\max_{y \in \mathcal{Y}} \frac{p(Y = y, X = x \mid \theta)}{p(X = x \mid \theta)} \qquad (9)$$
$$= \arg\max_{y \in \mathcal{Y}} p(Y = y, X = x \mid \theta) \qquad (10)$$
where the idea is to find the class label for which the conditional probability is maximized, given example $x$. The first equality follows from the definition of conditional probability, and the second by noting that the denominator is constant for all choices of $y$.
4.2.2 Clustering
For clustering, the idea is to associate a so-called "latent" variable with each example. The latent variable is hidden, in that it is not present in the data, and is interpreted as the cluster associated with an example. Let $Y$ denote this latent variable, where the random variable $Y$ takes on values in $\mathcal{Y} = \{1, \ldots, K\}$ if we are looking for $K$ clusters.

[6] The distinction here is that the probability of a single point in a continuous space with a density is zero, so we must talk about the probability of sets of such points that have volume, e.g., $p(x \mid \theta)\, dx = \Pr(X \in dx \mid \theta)$. A rigorous treatment of this topic is outside the scope of this course. For further reading on this topic, see Rosenthal's A First Look at Rigorous Probability Theory.
Given this, and assuming that the data are used to learn a parameterized model (note that it is not clear yet how this can be done – one of the variables is unobservable after all!), the cluster assigned to an example is
$$y^\star = \arg\max_{y \in \mathcal{Y}} p(Y = y \mid X = x, \theta) \qquad (11)$$
$$= \arg\max_{y \in \mathcal{Y}} \frac{p(Y = y, X = x \mid \theta)}{p(X = x \mid \theta)} \qquad (12)$$
$$= \arg\max_{y \in \mathcal{Y}} p(Y = y, X = x \mid \theta), \qquad (13)$$
that is, the value of the latent variable that is most likely given the example and the parameterized model.
4.3 The Maximum Likelihood Method
Now that we have an idea of how to use a probabilistic model for some interesting machine learning tasks, how can we learn? That is, how can we select a good parameterization of a probabilistic model given data? There are other questions too, but we will not address them for now. For example, we should also ask where the probabilistic model comes from and how we can determine the best model to use. This is analogous to the question of how many hidden units to use in a neural network model.

We first introduce the maximum likelihood method, which is very simple and quite typical in practice. Other methods of interest are the maximum a posteriori and full Bayesian approaches.

Under the assumption that the examples in data $D$ are identically and independently sampled according to $p(X = x \mid \theta)$, the joint probability of the data given parameters $\theta$ is
$$p(D \mid \theta) = \prod_{n=1}^N p(X_n = x_n \mid \theta) = L(\theta). \qquad (14)$$
This function is the likelihood function and we think of it as a function of the parameters $\theta$.

Definition 1 (maximum likelihood) Given data $D$ and probabilistic model $p(X = x \mid \theta)$ with parameters $\theta$, the maximum likelihood approach trains a model by selecting parameters $\theta$ to maximize $L(\theta)$.

The parameters that maximize the likelihood represent a learned model and are the result of training. These are the parameters that assign the highest probability to the data, and preferring such parameters is referred to as the maximum likelihood principle. The resulting value of $\theta$ is the maximum likelihood estimate of the model parameters.

In practice, likelihoods often involve a product over many terms (as above, where there is a factor for each of the $N$ data), and so it is convenient for analysis and for computational stability (e.g., avoiding underflow and overflow) to perform computations in log space. That is, to compute with the natural logarithm of the likelihood, or log likelihood:
$$\ln L(\theta) = \sum_{n=1}^N \ln p(X_n = x_n \mid \theta). \qquad (15)$$
Finding the $\theta$ that maximizes the likelihood is an optimization problem. As in other optimization problems, it is typical to compute the gradient and either perform gradient ascent (equivalently, gradient descent on the negative log likelihood), or set the gradient to zero and analytically solve for $\theta$.
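As a sketch of the numerical route, one can hand the negative log likelihood to a generic optimizer. The example below does this for a univariate Gaussian on synthetic data; the parameterization by log variance is a choice made here to keep the variance positive, and the next subsection derives the same answer in closed form.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # synthetic data for illustration

def neg_log_likelihood(params, x):
    """Negative log likelihood of a univariate Gaussian with params = (mu, log_sigma2)."""
    mu, log_sigma2 = params
    sigma2 = np.exp(log_sigma2)                    # log-variance parameterization keeps sigma2 > 0
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma2_hat)   # should be close to the sample mean and the ML variance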
Figure 2: Univariate Gaussian distribution with $\mu = 0$ and $\sigma^2 = 1$.
4.3.1 Example: Univariate Gaussian
First, the univariate (single-variable) Gaussian (Normal) distribution has density function:
$$p(x \mid \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \qquad (16)$$
The parameters are $\theta = (\mu, \sigma^2)$, where $\mu = E[X]$ and $\sigma^2$ is the variance $E[(X - E[X])^2] = E[X^2] - \mu^2$. Here, expectation is denoted by $E[\cdot]$.

The form of the univariate Gaussian is shown in Figure 2 and should be quite familiar. It is symmetric about the mean. It has a single mode, at the mean. The density decays the further away you get from the mean. The lower the variance, the faster the density decays, and the sharper the mode.
Given i.i.d. data $D = \{x_1, \ldots, x_N\}$, with $x_n \in \mathbb{R}$, we can now solve for the parameters $\theta_{ML}$ that maximize the likelihood function, i.e., maximize the probability of the data under a Gaussian distribution. First, we take the log, and obtain for a single instance
$$\ln p(x_n \mid \theta) = -\frac{1}{2} \ln(2\pi) - \frac{1}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} (x_n - \mu)^2. \qquad (17)$$
We sum this across all of the $N$ data:
$$\ln p(D \mid \theta) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2} \ln(\sigma^2) - \frac{N}{2} \ln(2\pi). \qquad (18)$$
We now take the partial derivative with respect to $\mu$, and equate it to zero:
$$\frac{1}{\sigma^2} \sum_{n=1}^N (x_n - \mu) = 0 \qquad (19)$$
and so
$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n. \qquad (20)$$
Similarly, taking the partial derivative with respect to the parameter $\sigma^2$ and setting it to zero, we obtain
$$\frac{1}{2(\sigma^2)^2} \sum_{n=1}^N (x_n - \mu_{ML})^2 - \frac{N}{2\sigma^2} = 0, \qquad (21)$$
which solves to
$$\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})^2. \qquad (22)$$
Both expressions are just what we might expect. The maximum likelihood estimate of the mean is the sample mean, and the maximum likelihood estimate of the variance is the variance of the sample considered as the population.[7]
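As a quick numerical check (a sketch, not part of the original notes), the closed-form estimates can be compared against NumPy's built-in sample statistics:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=-3.0, scale=2.0, size=1000)    # synthetic i.i.d. data

mu_ml = x.sum() / len(x)                          # Eq. (20): the sample mean
sigma2_ml = ((x - mu_ml) ** 2).sum() / len(x)     # Eq. (22): divide by N, not N - 1

# np.var uses ddof=0 by default, i.e., exactly the ML (divide-by-N) estimator.
assert np.isclose(mu_ml, x.mean())
assert np.isclose(sigma2_ml, np.var(x))
print(mu_ml, sigma2_ml)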
4.4 The Mixture of Gaussian Model
We can now consider the problem of clustering. Consider a model with $K$ clusters. We will assume that the data are generated by first assigning a cluster to an example (this is the latent variable) and then, conditioned on the cluster, assigning a feature vector according to a multivariate Gaussian distribution.

The result is the well known mixture of Gaussians (MoG) model:
$$p(x \mid \theta) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad (23)$$
where the parameters $\{\pi_1, \ldots, \pi_K\}$ define the probability of generating an example from the $k$th cluster, and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the density function associated with a multivariate Gaussian with mean $\mu_k$ and covariance $\Sigma_k$, so that the remaining parameters are $\{\mu_k, \Sigma_k\}_{k=1}^K$. Each of the $K$ multivariate Gaussians is referred to as a component. Such weighted sums of component distributions (where the weights are themselves a distribution) are referred to generally as mixture models and they are a very powerful way to build complex distributions from simple ones. As mentioned in lecture, they are generative models, which means that they naturally lead to algorithms that can draw data from the distribution. In the case of mixture models, one first draws from the discrete distribution given by $\pi$ to choose one of the $K$ components in the mixture. The actual observed data are then drawn from that component. This gives a hint as to how we will need to perform learning: we will need to infer the latent variables of these unknown draws from $\pi$ in order to fit the overall parameters. It also reveals how we can think of this as probabilistic clustering: each datum is literally generated by one of the $K$ component distributions.
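To make the generative story concrete, a short sampling sketch might look like the following. The parameters here (two components in two dimensions) are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters for a 2-component mixture in 2 dimensions.
pi = np.array([0.3, 0.7])                                   # mixing weights
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]          # component means
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]    # component covariances

def sample_mog(n):
    """Draw n points: first a component y ~ pi, then x ~ N(mu_y, Sigma_y)."""
    ys = rng.choice(len(pi), size=n, p=pi)                   # latent cluster assignments
    xs = np.array([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])
    return xs, ys

X, Y = sample_mog(500)
print(X.shape, np.bincount(Y))   # the component frequencies should be roughly pi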
[7] Digging a bit more deeply, this estimator of the variance turns out to be biased towards zero, and the so-called Bessel's correction of making the denominator $N - 1$ instead of $N$ can be introduced to fix this problem if necessary. The difference becomes negligible for large enough $N$.
We can make this generative view explicit and introduce a new random variable $Y$ that takes values in $\mathcal{Y} = \{1, \ldots, K\}$. This results in a factorized distribution:
$$p(x, y \mid \theta) = p(Y = y \mid \theta)\, p(x \mid \mu_y, \Sigma_y). \qquad (24)$$
Given this, and given maximum likelihood parameters $\theta_{ML}$, we would assign to example $x$ the cluster that maximizes
$$y^\star = \arg\max_{y \in \mathcal{Y}} p(Y = y \mid x, \theta_{ML}) \qquad (25)$$
$$= \arg\max_{y \in \mathcal{Y}} p(Y = y, x \mid \theta_{ML}) \qquad (26)$$
$$= \arg\max_{y \in \mathcal{Y}} p(Y = y \mid \theta_{ML})\, p(x \mid Y = y, \theta_{ML}). \qquad (27)$$
This can be easily evaluated. The first term is simply the $\pi_y$ (in $\theta_{ML}$) associated with cluster $y$. The second term is given by the multivariate Gaussian density, parameterized according to $\theta_{ML}$.
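Given fitted parameters (here simply assumed, since fitting is next lecture's topic), Eq. (27) amounts to one weighted density evaluation per component. A sketch using scipy.stats, reusing the made-up two-component parameters from above:

import numpy as np
from scipy.stats import multivariate_normal

# Assumed, already-learned parameters for K = 2 clusters in 2 dimensions.
pi = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def assign_cluster(x):
    """Return argmax_y pi_y * N(x | mu_y, Sigma_y), as in Eq. (27)."""
    scores = [pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
              for k in range(len(pi))]
    return int(np.argmax(scores))

print(assign_cluster(np.array([0.5, -0.2])))   # likely cluster 0
print(assign_cluster(np.array([4.8, 5.3])))    # likely cluster 1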
Before continuing, we can review the multivariate Gaussian distribution. This is defined on a random variable $X$ taking values in $\mathbb{R}^M$, and thus $M$ dimensions, and has density function
$$p(x \mid \theta) = \frac{1}{(2\pi)^{M/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \qquad (28)$$
where $\mu \in \mathbb{R}^M$ is the mean, $\Sigma$ is the $M \times M$ covariance matrix and $|\Sigma|$ is the determinant of the covariance matrix.

The covariance matrix is symmetric. The entry $\Sigma_{m,m'}$ (row $m$, column $m'$) in $\Sigma$ represents the covariance between features $m$ and $m'$, and thus a measure of whether or not they are correlated. Entries on the diagonal, i.e., entry $\Sigma_{m,m}$, represent the variance of the values adopted by individual features. Recall that the covariance of two real-valued random variables $A$ and $B$ is
$$\mathrm{Cov}(A, B) = E[(A - E[A])(B - E[B])] \qquad (29)$$
so that $\mathrm{Cov}(A, A) = \mathrm{Var}(A)$.

How many independent parameters does an $M$-dimensional multivariate Gaussian have? There are $M(M+1)/2$ for $\Sigma$ (because it is symmetric), and $M$ for $\mu$, for a total of $M(M+3)/2$ – i.e., on the order of $M^2$. The number of free parameters is a rough indication of how much data will be required to learn a model accurately. We see here that the amount of data required for the multivariate Gaussian grows quadratically with the dimension. Thus Gaussian models do not suffer as much from the curse of dimensionality as non-parametric methods such as HAC.

Interesting special cases of multivariate Gaussians are:

- Independence (axis-aligned): Fix all off-diagonal elements of $\Sigma$, indicating the covariance between variables, to 0. Given this, there are only $2M$ parameters.

- Spherical multivariate Gaussian: Further assume that $\Sigma = \sigma^2 I$, where $I$ is the identity matrix with 1 entries on the diagonal and 0 entries everywhere else, and $\sigma^2$ is a scalar indicating the variance, the same in each dimension. This model includes only $M + 1$ parameters.

It can be common to insist on independence across dimensions, with diagonal covariance matrices, to reduce the number of parameters to be linear in the number of dimensions of the feature space.
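A small sketch of the parameter counts just described (purely illustrative arithmetic, not part of the original notes):

def gaussian_param_counts(M):
    """Free parameters of an M-dimensional Gaussian under three covariance structures."""
    full = M + M * (M + 1) // 2   # mean plus symmetric covariance: M(M+3)/2
    diagonal = M + M              # mean plus one variance per dimension: 2M
    spherical = M + 1             # mean plus a single shared variance
    return full, diagonal, spherical

for M in (2, 10, 100):
    print(M, gaussian_param_counts(M))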
4.5 Application to Clustering
Back to clustering, the probabilistic model (assuming $K$ clusters) is a MoG model,
$$p(x \mid \theta) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad (30)$$
and includes $(K - 1) + MK + M(M+1)K/2$ parameters.

In principle, we can now find values for these parameters that maximize the probability of the data, i.e., $p(D \mid \theta)$. Given this we could then cluster the data using the approach explained above in Eq. (27). Next time we will see a powerful algorithm (the EM algorithm) to learn these parameters. The key challenge is that the cluster assignments are a latent variable (and not seen in the data). Without this complication it would be a fairly simple matter to fit maximum likelihood parameters for a mixture of Gaussians model.