
Algebraic-Geometric Methods for Learning Gaussian Mixture Models

Mikhail Belkin
Dept. of Computer Science and Engineering, Dept. of Statistics
Ohio State University / ISTA

Joint work with Kaushik Sinha

From crabs to Gaussians

First considered by Pearson, 1894.

Analyzed 1000 crabs from Naples. Concluded (erroneously?) that there were two distinct populations.

The Problem: Learning Gaussian Mixture Models

Classical problem in statistics:
- goes back to the classical work of Pearson (1894).

Widely used model for scientific/engineering tasks.

Application areas include:
- Speech Recognition
- Computer Vision
- Bioinformatics
- Astronomy
- Medicine
- ...






Gaussian Mixtures

Problem: identifying the parameters of a Gaussian mixture distribution from a finite sample.


The Problem

Mixture of Gaussians in $\mathbb{R}^n$.

What does learning such a mixture mean?
- estimating the parameters of the mixture within a pre-specified accuracy from a sample.
- the parameters are the means, covariance matrices and mixing weights of the component Gaussian distributions.
- the number of parameters grows with the dimension and the number of components (see below).
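In standard notation (written out here for reference, since it is assumed rather than spelled out elsewhere in this text), a k-component Gaussian mixture in $\mathbb{R}^n$ has the density below, and the parameter count follows by counting means, covariances and weights.

```latex
% Density of a k-component Gaussian mixture in R^n:
\[
  p(x) \;=\; \sum_{i=1}^{k} w_i \, \mathcal{N}\!\left(x \mid \mu_i, \Sigma_i\right),
  \qquad w_i \ge 0, \quad \sum_{i=1}^{k} w_i = 1 .
\]
% Parameter count: k*n entries for the means, k*n(n+1)/2 entries for the
% symmetric covariance matrices, and k - 1 independent mixing weights.
```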






Most popular method: Expectation Maximization

EM is by far the most popular method for mixture fitting.
- iterative procedure to find the parameters (similar to k-means clustering).
- simple to implement.
- guaranteed to converge.
- converges to the true values if initialized close to the true values.

However:
- sensitive to initialization; numerous local maxima.
- does not detect the number of components.
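To make the iterative procedure concrete, here is a minimal EM sketch for a two-component, one-dimensional Gaussian mixture (NumPy only); the initialization, iteration count and variable names are illustrative choices, not prescribed by the talk.

```python
import numpy as np

def em_gmm_1d(x, n_iter=200, seed=0):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = np.array([0.5, 0.5])                    # mixing weights
    mu = rng.choice(x, size=2, replace=False)   # crude initialization of the means
    var = np.array([x.var(), x.var()])          # component variances

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.stack([
            w[j] * np.exp(-(x - mu[j]) ** 2 / (2 * var[j])) / np.sqrt(2 * np.pi * var[j])
            for j in range(2)
        ])                                      # shape (2, n)
        resp = dens / dens.sum(axis=0, keepdims=True)

        # M-step: re-estimate weights, means and variances from the responsibilities.
        nk = resp.sum(axis=1)
        w = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk
    return w, mu, var

# Well-separated components are recovered easily; overlapping components or a
# poor initialization can leave EM in a local maximum, as discussed above.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 0.5, 500)])
print(em_gmm_1d(sample))
```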

How EM fails: a simple example

From Tao, Belkin, Yu, Annals of Statistics, 2010


Some Recent Progress

Understanding the computational aspects of Gaussian mixture learning:
- is it possible to learn a Gaussian mixture in time, and using a sample of size, polynomial in the dimension?

Dasgupta (1999) showed that learning a mixture of Gaussians in $\mathbb{R}^n$ using a sample size polynomial in n is possible.
- the result was surprising because the complexity of many problems scales exponentially with the dimension (curse of dimensionality).

Learning in high dimension: something as simple as the volume of a convex body cannot be estimated using a number of samples polynomial in the dimension (Barany, Furedi, 88).

Dasgupta's Result, 1999

It is possible to learn a mixture of Gaussians in $\mathbb{R}^n$ using a sample size polynomial in n, provided the component separation is of an order that grows with the dimension.
- the component separation is the minimum distance between the component means.






Partial Summary of Results on Gaussian Mixture Learning

Author                                 Description
[Dasgupta], 1999                       Gaussian mixtures, mild assumptions
[Dasgupta-Schulman], 2000              Spherical Gaussian mixtures
[Arora-Kannan], 2001                   Gaussian mixtures
[Vempala-Wang], 2002                   Spherical Gaussian mixtures
[Kannan-Salmasian-Vempala], 2005       Gaussian mixtures, log-concave distributions
[Achlioptas-McSherry], 2005            Gaussian mixtures
[Feldman-O'Donnell-Servedio], 2006     Axis-aligned Gaussians, no parameter estimation
[Belkin-Sinha], 2009                   Identical spherical Gaussian mixtures
[Kalai-Moitra-Valiant], 2010           Gaussian mixtures with 2 components
[Belkin-Sinha], 2010                   Gaussian mixtures (our result: solves the general problem)
[Moitra-Valiant], 2010                 Gaussian mixtures

For the results up to [Achlioptas-McSherry], 2005, the minimum required separation is an increasing function of the dimension and/or the number of components; for [Belkin-Sinha], 2009 and the later results it is independent of both.

Obstacle in Learning: Identifiability

Different values of the parameters could give rise to the same distribution.
- Example: the parameters of the following distribution family cannot be learned from sampled data.
- If two parameter values are close to a pair of parameters giving identical probability distributions, then it is hard to distinguish them from sampled data, even when the sample size is large.

Need for a quantification of how hard it is to statistically learn the parameters from data.


Radius of Identifiability

Introduce the radius of identifiability:
- it is the radius of the largest open ball around a parameter value such that any two different parameters from this ball give rise to different probability density functions.
- if no such ball exists, i.e. the radius is zero, then the parameters cannot be identified uniquely, given any amount of data.
- the complexity of learning scales with this radius.

We give an explicit formula for the radius of identifiability of a Gaussian mixture.

Main Result

We show that the parameters of a Gaussian mixture in $\mathbb{R}^n$ with positive radius of identifiability can be learned (up to permutation of the components) within a pre-specified precision and with a pre-specified confidence, using a sample size bounded in terms of these quantities and of the radius of the bounding ball of the parameters.
- the minimum separation can even be zero, i.e. two Gaussian components can have the same means but different covariance matrices.
- a polynomial dependence of the sample size on these quantities is necessary.









Overview of Our Proof

1. Reduction to fixed dimension
- we show that learning a Gaussian mixture in n dimensions can be reduced to a collection of parameter estimation problems in low, fixed dimension (more on this later).

2. Learning in fixed dimension
- we introduce the general notion of a "polynomial family" (more on this soon).
- we show that the parameters of a polynomial family can be learned within accuracy ε, with confidence at least 1 - δ, using a sample of size polynomial in 1/ε and 1/δ.
- in addition to the Gaussian distribution, almost all standard parametric probability distributions, as well as their mixtures and products, form polynomial families (more on this soon).

Learning in Fixed Dimension: Polynomial Family

Definition
- a family of probability distributions parameterized by θ forms a polynomial family if each (raw) moment of the distribution exists and can be represented as a polynomial of the parameters θ.

Examples of Polynomial Families

- Gaussian (the moments are given by Hermite polynomials in the parameters)
- Gamma
- Binomial
- Exponential

Examples include almost all standard parametric families, as well as their mixtures and products. Hence Gaussian mixtures are also a polynomial family.
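As a quick sanity check of the definition, the sketch below uses SymPy to confirm that the first few raw moments of a univariate Gaussian are polynomials in the parameters (mu, sigma); this is an illustration added here, not code from the talk.

```python
import sympy as sp
from sympy.stats import Normal, E

# Parameters of a univariate Gaussian, treated symbolically.
mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)
X = Normal('X', mu, sigma)

# The first few raw moments E[X^k]: each expands to a polynomial in mu and sigma,
# which is exactly what membership in a "polynomial family" requires.
for k in range(1, 5):
    print(k, sp.expand(E(X**k)))
# 1  mu
# 2  mu**2 + sigma**2
# 3  mu**3 + 3*mu*sigma**2
# 4  mu**4 + 6*mu**2*sigma**2 + 3*sigma**4
```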


Proof Sketch for Learning in Fixed Dimension

Main result for polynomial families
- there is an algorithm which, given samples from an identifiable polynomial family whose parameters lie in a ball of known radius, outputs an estimate within the prescribed accuracy of the true parameter with probability at least the prescribed confidence, using a number of sample points polynomial in the inverse accuracy and the inverse confidence.

Proof Sketch
1. given a polynomial family, find a finite set of moments that completely characterizes a distribution (identifiability).
2. reformulate the problem of learning the parameters in terms of this set of moments, using algebraic inequalities.
3. reduce the problem of learning the parameters to 1 dimension, using techniques from algebraic geometry, specifically the Tarski-Seidenberg theorem (elimination of quantifiers).


Identifiability and Finite Set of Moments

Identifiability
- a family is identifiable if distinct parameter values give rise to distinct distributions.

We will prove that when the family is identifiable, a finite number of moments is sufficient to uniquely identify the parameter (next slide).
- requires an application of the Hilbert Basis Theorem.

Hilbert Basis Theorem: every ideal in a ring of polynomials is finitely generated.


Finite Set of Moments Fully Characterizes a Polynomial Family

For a polynomial family, each moment is a polynomial of the parameters.
- the difference of a given moment between two parameter values is therefore also a polynomial in the parameter variables.
- let I_N be the ideal, in the corresponding ring of polynomials, generated by the first N of these moment-difference polynomials; the ideals I_1 ⊆ I_2 ⊆ ... form an increasing sequence.
- the Hilbert Basis Theorem ensures that the union of this sequence is finitely generated, hence there exists some N large enough that I_k = I_N for any k ≥ N.
- identifiability then implies that the first N moments define the distribution uniquely.
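To make the ideal-chain argument tangible, the SymPy computation below (an added illustration; the moment polynomials are written out by hand) checks for a univariate Gaussian that the differences of the third and fourth moments already lie in the ideal generated by the differences of the first two, so the ascending chain stabilizes very early in this toy case.

```python
import sympy as sp

# Two copies of the parameters of a univariate Gaussian N(mu, s^2).
mu1, s1, mu2, s2 = sp.symbols('mu1 s1 mu2 s2')

def raw_moments(mu, s):
    # First four raw moments of N(mu, s^2), written as polynomials in (mu, s).
    return [mu,
            mu**2 + s**2,
            mu**3 + 3*mu*s**2,
            mu**4 + 6*mu**2*s**2 + 3*s**4]

# q_j = m_j(theta) - m_j(theta'): parameter pairs with equal first moments
# are the common zeros of these polynomials.
q = [a - b for a, b in zip(raw_moments(mu1, s1), raw_moments(mu2, s2))]

# Groebner basis of the ideal I_2 generated by the first two moment differences.
G = sp.groebner(q[:2], mu1, s1, mu2, s2, order='lex')

# The third and fourth moment differences reduce to zero modulo I_2,
# i.e. they already belong to the ideal generated by the first two.
for j in (2, 3):
    _, remainder = G.reduce(q[j])
    print(f"moment difference {j + 1} lies in I_2:", sp.simplify(remainder) == 0)
```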




What Next?

If the first N moments are known precisely, then the problem of learning the parameters is almost solved:
- the only remaining task is to solve a finite set of polynomial equations.
- this can be done algorithmically.

However, the moments need to be estimated from sample data:
- uncertainty in moment estimation introduces uncertainty in parameter estimation.
- how do we deal with it?

A powerful result from mathematics, the Tarski-Seidenberg Theorem, helps us prove that
- the required moment estimation accuracy depends only polynomially on the target parameter estimation accuracy.


Tarski-Seidenberg Theorem

Semi-algebraic Set
- a semi-algebraic set in $\mathbb{R}^n$ is a finite union of sets defined by a finite number of polynomial equations and inequalities.

Tarski-Seidenberg Theorem
- let $\pi$ be a coordinate projection map. If $A$ is a semi-algebraic set in $\mathbb{R}^{n+k}$ for some $k$, then $\pi(A)$ is a semi-algebraic set in $\mathbb{R}^n$.
- this is equivalent to elimination of quantifiers for semi-algebraic sets, in particular of the existential quantifier.
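A standard one-variable illustration of quantifier elimination, added here for concreteness (it is not one of the slides): projecting the solution set of a real quadratic yields a set again cut out by a polynomial inequality.

```latex
% The set \{(b, c, x) : x^2 + bx + c = 0\} is semi-algebraic in R^3; by
% Tarski-Seidenberg its projection onto the (b, c)-plane is semi-algebraic in R^2,
% and the existential quantifier can be eliminated explicitly:
\[
  \exists x \in \mathbb{R}\; \bigl( x^{2} + b x + c = 0 \bigr)
  \;\Longleftrightarrow\;
  b^{2} - 4c \,\ge\, 0 .
\]
```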


Characterization of Uncertainty

Suppose the first N moments completely characterize the distribution.
- for any two parameter values, define the largest discrepancy among their first N moments.
- this discrepancy is zero iff the two parameter values give the same distribution.

Fix a tolerance and consider the set of parameter pairs whose moment discrepancy is within this tolerance.
- this set can be viewed as a neighborhood of the parameter that takes the probability distribution into account.
- since logical statements can be expressed as algebraic conditions, by the Tarski-Seidenberg Theorem this set is a semi-algebraic subset of the parameter space.
- eliminating the quantifiers reduces the problem to 1 dimension, where it can be shown that the moment tolerance depends polynomially on the supremum of the parameter discrepancy over this set.



A Simple Example

Consider a univariate Gaussian with zero mean:
- the second moment uniquely defines this distribution.

For any two such parameter values, consider the set of pairs whose second moments differ by at most a fixed tolerance:
- by the Tarski-Seidenberg Theorem this is a semi-algebraic set, and in this case we can see exactly what it is (geometric interpretation on the next slide).

For a fixed tolerance, the supremum of the parameter difference over this set represents the allowable parameter estimation error for a fixed moment estimation error:
- eliminating the quantified variables leads to a relation between the moment tolerance and the parameter tolerance.
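As a small numerical companion to this example (added for illustration; the sample sizes and the true parameter value are arbitrary): for the zero-mean univariate Gaussian the parameter is recovered from the empirical second moment, and the parameter error is controlled polynomially (here, through a square root) by the moment error.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true = 1.5

for n in (100, 10_000, 1_000_000):
    x = rng.normal(0.0, sigma_true, size=n)
    m2_hat = np.mean(x ** 2)         # empirical second (raw) moment
    sigma_hat = np.sqrt(m2_hat)      # method-of-moments estimate of sigma
    moment_err = abs(m2_hat - sigma_true ** 2)
    param_err = abs(sigma_hat - sigma_true)
    # |sigma_hat - sigma| = |m2_hat - sigma^2| / (sigma_hat + sigma), so the
    # parameter error shrinks polynomially as the moment error shrinks.
    print(f"n={n:>9}  moment error={moment_err:.6f}  parameter error={param_err:.6f}")
```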

Reduction to Fixed Dimension

Results on learning polynomial families in a fixed dimension cannot be applied directly to mixtures of high-dimensional Gaussians:
- the number of parameters to be estimated increases with the dimension.
- how do we deal with it?

We show that it is possible to estimate the parameters in high dimension by solving parameter estimation problems in appropriate low dimensions.
- why does it work?

Low-dimensional Projection of Gaussian Mixture

Some Good News
- projecting a Gaussian mixture onto a lower-dimensional coordinate plane yields a low-dimensional Gaussian mixture (see the sketch after this slide) where:
  - the mixing coefficients remain the same.
  - the new component means are the projections of the original component means.
  - the new component covariance matrices are the restrictions of the original component covariance matrices.
- results for polynomial families can be used to learn the parameters of this low-dimensional Gaussian mixture.
- hopefully, the parameters of a high-dimensional Gaussian mixture can be learned by learning the parameters of several low-dimensional Gaussian mixtures.

Some Difficulties
- the radius of identifiability of the projected Gaussian mixture may become zero (not learnable!).
- the parameters of a high-dimensional Gaussian mixture may not be uniquely learned from the parameters of several low-dimensional Gaussian mixtures.
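The good news above is straightforward to state in code. The sketch below (an added illustration; the function name and the example parameters are made up) projects the parameters of a d-dimensional Gaussian mixture onto a chosen set of coordinates: the weights are unchanged, the means are indexed, and the covariances are restricted to the corresponding submatrices.

```python
import numpy as np

def project_gmm(weights, means, covs, coords):
    """Project a Gaussian mixture onto the coordinate plane indexed by `coords`.

    weights: (k,) mixing coefficients
    means:   (k, d) component means
    covs:    (k, d, d) component covariance matrices
    coords:  list of coordinate indices defining the low-dimensional plane
    """
    coords = np.asarray(coords)
    proj_means = means[:, coords]                           # project the means
    proj_covs = covs[:, coords[:, None], coords[None, :]]   # restrict the covariances
    return weights, proj_means, proj_covs                   # weights are unchanged

# Usage: a 2-component mixture in 3 dimensions projected onto coordinates (0, 2).
w = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0, 2.0], [3.0, -1.0, 0.5]])
Sigma = np.stack([np.eye(3), np.diag([2.0, 1.0, 0.5])])
print(project_gmm(w, mu, Sigma, [0, 2]))
```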

Sketch of the Algorithm

Step 1
- identify a low-dimensional (fixed) coordinate plane on which the radius of identifiability is reduced only by a fixed amount.
- this can be done deterministically by checking a limited number of candidate coordinate planes.
- project the high-dimensional Gaussian mixture onto this coordinate plane and learn the parameters of the resulting low-dimensional Gaussian mixture, using the results on learning polynomial families in a fixed dimension.

Step 2
- the parameters along each remaining coordinate can be estimated separately, by adding one coordinate at a time and aligning the estimates obtained from two parameter estimation problems in overlapping fixed dimensions (a code-level schematic follows below).
- the total number of such low-dimensional parameter estimation problems grows at most polynomially with the dimension.
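Purely as a schematic of the two steps above (the helpers `estimate_low_dim`, `align` and `identifiability_radius_estimate` are placeholders standing in for the fixed-dimension learner and the alignment rule; this is not the authors' actual procedure), the reduction can be organized as follows.

```python
import itertools
import numpy as np

def identifiability_radius_estimate(sample, plane):
    # Placeholder proxy for "how identifiable the projected mixture is" on this plane.
    return np.linalg.det(np.cov(sample[:, list(plane)], rowvar=False))

def reduce_to_fixed_dimension(sample, base_dim, estimate_low_dim, align):
    """Schematic reduction: `estimate_low_dim` solves the fixed-dimensional problem
    (e.g. via the polynomial-family result) and `align` glues together two estimates
    that share a set of coordinates."""
    n = sample.shape[1]

    # Step 1: deterministically pick a fixed-size coordinate plane on which the
    # projected mixture stays identifiable, and solve the problem there.
    best_plane = max(itertools.combinations(range(n), base_dim),
                     key=lambda plane: identifiability_radius_estimate(sample, plane))
    params = estimate_low_dim(sample[:, list(best_plane)])

    # Step 2: add the remaining coordinates one at a time; each added coordinate
    # yields one more fixed-dimensional estimation problem overlapping the base
    # plane, and the overlap lets `align` attach the new coordinate's parameters.
    for c in sorted(set(range(n)) - set(best_plane)):
        extended = estimate_low_dim(sample[:, list(best_plane) + [c]])
        params = align(params, extended, new_coord=c)
    return params
```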










Reduction

Sketch of the idea: two components in three dimensions (a sequence of illustrative figures).

Conclusion

Resolves the general problem of polynomial learning of Gaussian mixture distributions.
- completes an active line of research in theoretical computer science.

The proof brings together the techniques of algebraic geometry and the classical method of moments.

A step toward understanding algorithmic issues of Gaussian mixture modelling.