Machine Learning, Lecture 5
Support Vector Machines and Approximate Inference

Thomas Schön, Henrik Ohlsson and Umut Orguner
Division of Automatic Control
Linköping University
Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 013 - 281373,
www.control.isy.liu.se/~schon/

Outline Lecture 5

1. Summary of lecture 4
2. Support Vector Machines
3. Variational Bayesian Inference
   • General Derivation
   • Example – Identification of a Linear State-Space Model
   • Example – Gaussian Mixtures
4. Expectation Propagation
   • General Derivation
   • Example – State Estimation

(Chapter 7.1, 10)


Summary of Lecture 4 (I/III)

A neural network is a nonlinear function (expressed as a function expansion) from a set of input variables {x_i} to a set of output variables {y_k}, controlled by adjustable parameters w.

This function expansion is found by formulating the problem as usual, which results in a (non-convex) optimization problem. The problem is solved using numerical methods.

Backpropagation refers to a way of computing the gradients by making use of the chain rule, combined with clever reuse of information that is needed for more than one gradient.

Summary of Lecture 4 (II/III)

A kernel function k(x, z) is defined as an inner product

    k(x, z) = \phi(x)^T \phi(z),

where \phi(x) is a fixed mapping.

Introduced the kernel trick (a.k.a. kernel substitution): in an algorithm where the input data x enters only in the form of scalar products, we can replace this scalar product with another choice of kernel.

The use of kernels allows us to implicitly use basis functions of high, even infinite, dimension (M → ∞).

Summary of Lecture 4 (III/III)

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

By assuming that the considered system is a Gaussian process, predictions can be made by computing the conditional distribution p(y(x*) | all the observations), where y(x*) is the output for which we seek a prediction. This regression approach is referred to as Gaussian process regression.

Support Vector Machines (SVM)

A very popular classifier that is

• non-probabilistic,
• discriminative,
• trained by convex optimization,
• sparse,
• also usable for regression (then called support vector regression, SVR).


SVM for Classification

Assume that {(t_n, x_n)}_{n=1}^N, with x_n ∈ R^{n_x} and t_n ∈ {−1, 1}, is a given training data set (linearly separable).

Task: Given x*, what is the corresponding label?

SVM is a discriminative classifier, i.e. it provides a decision boundary. The decision boundary is given by {x | w^T \phi(x) + b = 0}.

Goal: Find the decision boundary that maximizes the margin! The margin is the distance from the decision boundary to the closest data point.

SVM for Classification Cont'd

The decision boundary that maximizes the margin is given as the solution to the quadratic program (QP)

    \min_{w,b} \; \frac{1}{2} \|w\|^2
    \text{s.t. } t_n (w^T \phi(x_n) + b) - 1 \ge 0, \quad n = 1, \dots, N.

To make it possible to let the dimension of the feature space (the dimension of \phi(x_n)) go to infinity, we have to derive the dual.
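To see where this QP comes from (a standard argument, e.g. Bishop Sec. 7.1, compressed on the slide): the distance from a correctly classified point x_n to the boundary is t_n (w^T \phi(x_n) + b) / \|w\|. Since (w, b) can be rescaled without moving the boundary, we may fix t_n (w^T \phi(x_n) + b) = 1 for the closest points; the margin then equals 1/\|w\|, and maximizing 1/\|w\| is equivalent to minimizing \frac{1}{2}\|w\|^2 subject to the constraints above.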


SVM for Classification Cont'd

First, the Lagrangian is

    L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left( t_n (w^T \phi(x_n) + b) - 1 \right),

and minimizing with respect to w and b we obtain the dual g(a). Taking the derivatives with respect to b and w and setting them to zero,

    \frac{dL(w, b, a)}{db} = \sum_{n=1}^{N} a_n t_n = 0, \qquad
    \frac{dL(w, b, a)}{dw} = w - \sum_{n=1}^{N} a_n t_n \phi(x_n) = 0.

This gives

    g(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} a_n a_m t_n t_m \phi(x_m)^T \phi(x_n).

SVM for Classification Cont'd

Let k(x_i, x_j) = \phi(x_i)^T \phi(x_j). The dual objective then becomes

    g(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} a_n a_m t_n t_m k(x_m, x_n),

which we maximize with respect to a subject to

    a_n \ge 0, \qquad \sum_{n=1}^{N} a_n t_n = 0.

The maximizing a, say â, gives, using w^T \phi(x*) = \left( \sum_{n=1}^{N} \hat{a}_n t_n \phi(x_n) \right)^T \phi(x*), that

    y(x*) = \sum_{n=1}^{N} \hat{a}_n t_n k(x_n, x*) + \hat{b}.

Many â_n will be zero ⇒ a computational remedy: only the points with â_n > 0 (the support vectors) enter the sum.
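The slide leaves the computation of b̂ implicit; the standard recipe (e.g. Bishop eq. (7.18)) uses the fact that every support vector satisfies t_s (w^T \phi(x_s) + b) = 1, and averages over the support set S = {n | â_n > 0} for numerical stability:

    \hat{b} = \frac{1}{|S|} \sum_{s \in S} \left( t_s - \sum_{n \in S} \hat{a}_n t_n k(x_n, x_s) \right).

This is exactly what the CVX code a few slides ahead does; there 1/y(ind(i)) appears in place of t_s, which is the same thing for labels in {−1, 1}.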


SVM for Classification – Non-Separable Classes

If points are on the right side of the decision boundary, then t_n (w^T \phi(x_n) + b) ≥ 1. To allow for some violations, we introduce slack variables ζ_n, n = 1, ..., N. The modified optimization problem becomes

    \min_{w, b, \zeta} \; \frac{1}{2} \|w\|^2 + C \sum_{n} \zeta_n
    \text{s.t. } t_n (w^T \phi(x_n) + b) + \zeta_n - 1 \ge 0, \quad n = 1, \dots, N,
    \zeta_n \ge 0, \quad n = 1, \dots, N.

Example – CVX to Compute SVM

Linearly separable data:

cvx_begin
    variables w(nx,1) b
    minimize( 0.5*w'*w )
    subject to
        y.*(w'*x + b*ones(1,N)) - ones(1,N) >= 0
cvx_end

Non-separable data:

cvx_begin
    variables w(nx,1) b zeta(1,N)
    minimize( 0.5*w'*w + C*ones(1,N)*zeta' )
    subject to
        y.*(w'*x + b*ones(1,N)) - ones(1,N) + zeta >= 0
        zeta >= 0
cvx_end

(Figures: the resulting maximum-margin decision boundaries for a linearly separable and a non-separable two-dimensional data set.)


Example – CVX to Compute SVM Cont'd

SVM – solving the dual (here with a Gaussian kernel):

k = @(x1,x2) exp(-sum((x1*ones(1,size(x2,2)) - x2).^2)/0.5)';
for t = 1:N; for s = t:N
    K(t,s) = k(x(:,t), x(:,s)); K(s,t) = K(t,s);   % Gram matrix
end; end

cvx_begin
    variables a(N,1)
    minimize( 1/2*(a.*y')'*K*(a.*y') - ones(1,N)*a )   % = -g(a)
    subject to
        ones(1,N)*(a.*y') == 0
        a >= 0
cvx_end

ind  = find(a > 0.01);                             % support vectors
wphi = @(xstar) ones(1,N)*(a.*y'.*k(xstar,x));     % w'*phi(xstar)
b = 0;
for i = 1:length(ind)
    b = b + 1/y(ind(i)) - wphi(x(:,ind(i)));
end
b = b/length(ind);                                 % average over support vectors
ystar = @(xstar) wphi(xstar) + b;                  % classify via sign(ystar(xstar))

(Figure: the resulting nonlinear decision boundary obtained with the Gaussian kernel.)

Further Reading and Code

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

YALMIP can be downloaded from http://users.isy.liu.se/johanl/yalmip/

CVX can be downloaded from http://cvxr.com/cvx/
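For readers who want to run the CVX slides as-is, the following setup is a minimal sketch of the variables they expect (the data layout is our assumption, not from the slides): x is an nx × N matrix with one sample per column, y a 1 × N row vector of ±1 labels, and C the slack penalty.

% Hypothetical toy data for the CVX snippets above.
N = 40; nx = 2; C = 10;
x = [randn(nx,N/2) - 3, randn(nx,N/2) + 3];   % two well-separated clouds
y = [-ones(1,N/2), ones(1,N/2)];              % labels in {-1,+1}
% Now run one of the cvx_begin ... cvx_end blocks; for the primal version a
% test point xstar is classified by sign(w'*xstar + b), for the dual version
% by sign(ystar(xstar)).

With the clouds this far apart the data are almost surely linearly separable; move them closer (or flip a few labels) and the hard-margin program becomes infeasible, which is precisely when the slack formulation is needed.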


Bayesian Framework Reminder

Let X = {x_1, ..., x_N} be the measurements and let Z = {z_1, ..., z_N} be the latent variables, as in the EM framework.

The Bayesian framework is then interested in the posterior density p(Z | X), given by Bayes' rule as

    p(Z | X) = \frac{p(X | Z) \, p(Z)}{p(X)}.

For quite many instances, the posterior can be found exactly using the concept of conjugate pairs:

• the Gaussian case,
• more generally, the exponential family.

What happens when there is no exact solution?

Variational Methods (I/II)

Classic calculus involves functions and defines derivatives to optimize them.

The so-called calculus of variations investigates functions of functions, which are called functionals.

    Example: the entropy H[p(·)] = -\int p(x) \log p(x) \, dx.

The derivatives of functionals are called variations.

Calculus of variations has its origins in the 18th century, and its most important result is probably the so-called Euler-Lagrange equation: a function q_* that makes the functional

    C(q) \triangleq \int L(t, q(t), \dot{q}(t)) \, dt, \quad \text{with } L = L(t, x, v),

stationary satisfies

    L_x(t, q_*, \dot{q}_*) - \frac{d}{dt} L_v(t, q_*, \dot{q}_*) = 0,

which is at the core of optimal control theory.
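A classical sanity check (our example, not from the slides): the arc-length functional C(q) = \int \sqrt{1 + \dot{q}(t)^2} \, dt has L(t, x, v) = \sqrt{1 + v^2}, so

    L_x = 0, \qquad L_v = \frac{v}{\sqrt{1 + v^2}},

and the Euler-Lagrange equation reduces to \frac{d}{dt}\left[ \dot{q} / \sqrt{1 + \dot{q}^2} \right] = 0, i.e. \dot{q} is constant: the shortest path between two points is a straight line.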


Variational Methods (II/II)

In general variational methods, one assumes a predetermined, possibly parametric, form of the argument function, e.g.

    Quadratic: q(x) = x^T A x + b^T x + c,

or

    Basis functions: q(x) = \sum_{i=1}^{M} w_i \phi_i(x).

In the case of probabilistic inference, the variational approximation takes the form

    q(Z) = \prod_{i=1}^{M} q_i(Z_i),

where Z = {Z_1, ..., Z_M} is a partitioning of the unknown variables.

Variational Inference

Algorithm (Variational Iteration)

Solve the problem iteratively:

1. For j = 1, ..., M:
   • Fix {q_i(Z_i)}_{i≠j} to their last estimated values {q̂_i(Z_i)}_{i≠j}.
   • Find the solution of q̂_j(Z_j) = argmax_{q_j} L(q).
2. Repeat 1 until convergence.
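For completeness, the maximization in step 1 has a closed-form solution (Bishop, eq. (10.9)); it is this formula that the state-space and mixture examples below instantiate:

    \log \hat{q}_j(Z_j) = E_{\prod_{i \neq j} \hat{q}_i} \left[ \log p(X, Z) \right] + \text{const.},

i.e. the optimal factor is obtained by averaging the log of the joint density over all the other factors and normalizing.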


VB Example 1 – Linear System Identification

Consider the following linear scalar state-space model:

    x_{k+1} = \theta x_k + v_k,
    y_k = x_k + e_k,
    \begin{pmatrix} v_k \\ e_k \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_e^2 \end{pmatrix} \right).

The initial state is x_0 ∼ N(x_0; x̄_0, σ_0²), and θ has the prior distribution θ ∼ N(θ; 0, σ_θ²).

The identification problem is now to determine the posterior p(θ | y_{0:N}) using the VB framework. We still have some latent variables, x_{0:N} ≜ {x_0, ..., x_N}.

Note the difference in notation compared to Bishop! The observations are denoted y and the latent variables are given by x.

VB Example 1 – Linear System Identification Cont'd

With latent variables,

    p(\theta | y_{0:N}) = \int p(\theta, x_{0:N} | y_{0:N}) \, dx_{0:N}.

There is still no exact form for the joint density p(θ, x_{0:N} | y_{0:N}).

Variational Approximation

Approximate the posterior p(θ, x_{0:N} | y_{0:N}) as

    p(\theta, x_{0:N} | y_{0:N}) \approx q_\theta(\theta) \, q_x(x_{0:N}).

Find q_θ(θ) and q_x(x_{0:N}) using

    \log q_\theta(\theta) = E_{q_x}[ \log p(y_{0:N}, x_{0:N}, \theta) ] + \text{const.},
    \log q_x(x_{0:N}) = E_{q_\theta}[ \log p(y_{0:N}, x_{0:N}, \theta) ] + \text{const.}


VB Example 1 – Linear System Identification Cont'd

The variational Bayes formulas are

    \log q_\theta(\theta) = E_{q_x}[ \log p(y_{0:N}, x_{0:N}, \theta) ] + \text{const.},
    \log q_x(x_{0:N}) = E_{q_\theta}[ \log p(y_{0:N}, x_{0:N}, \theta) ] + \text{const.}

We have the joint density p(y_{0:N}, x_{0:N}, θ) as

    p(y_{0:N}, x_{0:N}, \theta) = p(y_{0:N} | x_{0:N}) \, p(x_{1:N} | x_{0:N-1}, \theta) \, p(x_0) \, p(\theta)
                                = \prod_{i=0}^{N} p(y_i | x_i) \prod_{i=1}^{N} p(x_i | x_{i-1}, \theta) \, p(x_0) \, p(\theta).

Taking the logarithm and separating out the constant terms,

    \log p(y_{0:N}, x_{0:N}, \theta) = - \sum_{i=0}^{N} \frac{0.5}{\sigma_e^2} (y_i - x_i)^2 - \sum_{i=1}^{N} \frac{0.5}{\sigma_v^2} (x_i - \theta x_{i-1})^2
                                       - \frac{0.5}{\sigma_0^2} (x_0 - \bar{x}_0)^2 - \frac{0.5}{\sigma_\theta^2} \theta^2 + \text{const.}

VB Example 2 – Gaussian Mixture Identification

Back to Bishop's notation: x now denotes a measurement.

Suppose we have x_{1:N} i.i.d. and distributed as

    x_i \sim p(x | \pi_{1:K}, \mu_{1:K}, \Lambda_{1:K}) = \sum_{k=1}^{K} \pi_k N(x; \mu_k, \Lambda_k^{-1}).

In the Bayesian framework, all the unknowns {π_{1:K}, μ_{1:K}, Λ_{1:K}} are random:

    \pi_{1:K} \sim \text{Dir}(\pi_{1:K} | \alpha_0) \propto \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1},

    \mu_{1:K}, \Lambda_{1:K} \sim p(\mu_{1:K}, \Lambda_{1:K}) \triangleq \prod_{k=1}^{K} N(\mu_k; m_0, (\beta_0 \Lambda_k)^{-1}) \, W(\Lambda_k | W_0, \nu_0).
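Spelling out the first update, which the slides leave implicit: only the middle sum and the prior term depend on θ, so taking the expectation over q_x and completing the square gives a Gaussian,

    q_\theta(\theta) = N(\theta; \hat{m}, \hat{s}^2), \quad
    \hat{s}^2 = \left( \frac{1}{\sigma_\theta^2} + \frac{1}{\sigma_v^2} \sum_{i=1}^{N} E_{q_x}[x_{i-1}^2] \right)^{-1}, \quad
    \hat{m} = \hat{s}^2 \, \frac{1}{\sigma_v^2} \sum_{i=1}^{N} E_{q_x}[x_i x_{i-1}].

The required moments come from q_x(x_{0:N}), which by the symmetric formula is itself Gaussian and can be computed with Kalman-smoother-type recursions; the two updates are then alternated until convergence.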


VB Example 2 – Gaussian Mixture Identification Cont'd

Define the latent variables z_i ≜ [z_{i1}, ..., z_{iK}]^T as in EM. Then

    p(x_{1:N}, z_{1:N}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{ik}} \, N(x_i; \mu_k, \Lambda_k^{-1})^{z_{ik}}.

The Bayesian framework then asks for the posterior density p(z_{1:N}, π_{1:K}, μ_{1:K}, Λ_{1:K} | x_{1:N}).

Variational Approximation

Approximate the posterior as

    p(z_{1:N}, \pi_{1:K}, \mu_{1:K}, \Lambda_{1:K} | x_{1:N}) \approx q_z(z_{1:N}) \, q_{\pi,\mu,\Lambda}(\pi_{1:K}, \mu_{1:K}, \Lambda_{1:K}).

Find q_z(z_{1:N}) and q_{π,μ,Λ}(π_{1:K}, μ_{1:K}, Λ_{1:K}) iteratively.

VB Example 2 – Sparsity with Bayesian Methods

Symmetric Dirichlet distribution for K = 3:

    \pi_{1:3} \sim \text{Dir}(\pi_{1:3} | \alpha_0) \propto \prod_{k=1}^{3} \pi_k^{\alpha_0 - 1} = \left( \pi_1 \pi_2 (1 - \pi_1 - \pi_2) \right)^{\alpha_0 - 1}.

(Figure: the symmetric Dirichlet density over the probability simplex for different values of α_0.)

For α_0 < 1 the prior mass concentrates on the edges and corners of the simplex, i.e. on mixtures where some weights π_k are (nearly) zero; unnecessary mixture components are thereby pruned automatically, which is a Bayesian route to sparsity.
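For reference, the q_z update works out to the familiar responsibility form (Bishop, Sec. 10.2); here D is the dimension of x_i:

    \log \rho_{ik} = E[\log \pi_k] + \frac{1}{2} E[\log |\Lambda_k|] - \frac{D}{2} \log 2\pi - \frac{1}{2} E_{\mu_k, \Lambda_k}\left[ (x_i - \mu_k)^T \Lambda_k (x_i - \mu_k) \right],
    \qquad r_{ik} = \frac{\rho_{ik}}{\sum_{j=1}^{K} \rho_{ij}},

and q_z(z_{1:N}) = \prod_i \prod_k r_{ik}^{z_{ik}}. The expectations are taken under the current q_{π,μ,Λ}, which mirrors how the E-step of EM uses the current point estimates.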


Minimization of KL-divergence (I/III)

Suppose we have

    p(x) = 0.2 \, N(x; 5, 1) + 0.8 \, N(x; -5, 2^2),

and let q_{μ,σ}(x) ≜ N(x; μ, σ²). Consider the two problems

    \min_{\mu,\sigma} \text{KL}(q_{\mu,\sigma} \| p), \quad \text{where } \text{KL}(q_{\mu,\sigma} \| p) \triangleq \int q_{\mu,\sigma}(x) \log \frac{q_{\mu,\sigma}(x)}{p(x)} \, dx,

    \min_{\mu,\sigma} \text{KL}(p \| q_{\mu,\sigma}), \quad \text{where } \text{KL}(p \| q_{\mu,\sigma}) \triangleq \int p(x) \log \frac{p(x)}{q_{\mu,\sigma}(x)} \, dx.

Minimization of KL-divergence (II/III)

(Figures: the bimodal density p(x) plotted together with the optimal Gaussian approximations q(x). Minimizing KL(q_{μ,σ} || p) is zero-forcing: the optimum locks onto one of the two modes, giving two local optima q_1(x) and q_2(x). Minimizing KL(p || q_{μ,σ}) is non-zero-forcing: the single optimum spreads its mass over both modes.)


Minimization of KL-divergence (III/III)

The second form of optimization,

    \text{KL}(p \| q_{\mu,\sigma}) \triangleq \int p(x) \log \frac{p(x)}{q_{\mu,\sigma}(x)} \, dx,

has the following attractive property: at the minimum, the moments match,

    \mu = E_q(x) = E_p(x),
    \sigma^2 = E_q[(x - E_q(x))^2] = E_p[(x - E_p(x))^2].

Similar properties hold for the entire exponential family.

A variational method using this type of KL-divergence minimization, and hence the expectation equations above, is Expectation Propagation.

Expectation Propagation (I/II)

Suppose we have a posterior distribution in the form

    p(X | Y) \propto \prod_{i=1}^{I} f_i(X),

which is intractable or too computationally costly to compute. EP then approximates the posterior as

    p(X | Y) \approx q(X), \quad q(X) = \prod_{i=1}^{I} q_i(X), \quad q_i(X) = N(X; \mu_i, \Sigma_i).

Ideally, we would like to minimize the KL divergence between the true posterior and the approximation,

    \hat{q}(X) = \arg\min_q \text{KL}\left( \frac{1}{Z} \prod_{i=1}^{I} f_i(X) \,\Big\|\, \prod_{i=1}^{I} q_i(X) \right).
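As a concrete check of the moment-matching property, the sketch below (our own illustration; none of the variable names come from the slides) fits q_{μ,σ} to the mixture p(x) above both analytically, by matching the first two moments, and numerically, by minimizing KL(p || q) with brute-force quadrature and fminsearch. Only base MATLAB is used.

% p(x) = 0.2*N(x;5,1) + 0.8*N(x;-5,2^2), as on the slides
gauss = @(x,m,sd) exp(-(x-m).^2./(2*sd.^2))./(sd*sqrt(2*pi));
w = [0.2 0.8]; m = [5 -5]; s2 = [1 4];

% Analytic minimizer of KL(p||q) over Gaussians: match mean and variance.
mu     = w*m';                        % E_p(x)   = -3
sigma2 = w*(s2 + m.^2)' - mu^2;       % Var_p(x) = 19.4

% Numerical check: KL(p||q) = const - int p(x) log q(x) dx, so it suffices
% to maximize the cross-entropy term; parameterize sigma = exp(th(2)) > 0.
xx = linspace(-30,30,5e4); dx = xx(2) - xx(1);
px = w(1)*gauss(xx,m(1),1) + w(2)*gauss(xx,m(2),2);
negCE = @(th) -sum(px.*log(gauss(xx,th(1),exp(th(2)))))*dx;
th = fminsearch(negCE, [0 0]);
disp([mu sigma2; th(1) exp(2*th(2))])  % the two rows should agree

Note how the matched Gaussian (mean −3, variance 19.4) covers both modes of p(x): exactly the non-zero-forcing behavior shown two slides back.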


Expectation Propagation (II/II)

Solving this is intractable, so we make the approximation of minimizing the KL divergence between pairs of factors f_j(X) and q_j(X). The terms q_j(X) are estimated iteratively, as in VB, by keeping the last estimates {q̂_i}_{i≠j}:

    \hat{q}_j(X) = \arg\min_{q_j} \text{KL}\left( f_j(X) \prod_{i \neq j} \hat{q}_i(X) \,\Big\|\, q_j(X) \prod_{i \neq j} \hat{q}_i(X) \right).

In the Gaussian case this is obtained by solving the equations

    E_{q_j \prod_{i \neq j} \hat{q}_i}(X) = E_{f_j \prod_{i \neq j} \hat{q}_i}(X),
    E_{q_j \prod_{i \neq j} \hat{q}_i}(X X^T) = E_{f_j \prod_{i \neq j} \hat{q}_i}(X X^T)

for the mean μ_j and the covariance Σ_j of q_j(·).

EP Example – Smoothing under GM Noise

Consider the following linear scalar state-space model:

    x_{k+1} = x_k + v_k, \quad x_0 = 0 \text{ is known},
    y_k = x_k + e_k, \quad v_k \sim N(v_k; 0, \sigma_v^2),
    e_k \sim p_e(e_k) \triangleq 0.9 \, N(e_k; 0, \sigma_e^2) + 0.1 \, N(e_k; 0, (10\sigma_e)^2).

The problem is to obtain the posterior density p(x_{1:N} | y_{1:N}). The true posterior factorizes as

    p(x_{1:N} | y_{1:N}) \propto \prod_{i=1}^{N} p(y_i | x_i) \, p(x_i | x_{i-1}).

The true posterior in this case is a Gaussian mixture with 2^N components, which is not feasible to compute.


EP Example – Smoothing under GM Noise Cont'd

Make the variational approximation

    p(x_{1:N} | y_{1:N}) \approx q(x_{1:N}) \triangleq \prod_{i=1}^{N} N(x_i; \mu_i, \sigma_i^2).

Consider the density for x_j given as

    \bar{p}(x_j) \propto \iint p(y_j | x_j) \, p(x_{j+1} | x_j) \, p(x_j | x_{j-1})
                  \times N(x_{j+1}; \mu_{j+1}, \sigma_{j+1}^2) \, N(x_{j-1}; \mu_{j-1}, \sigma_{j-1}^2) \, dx_{j+1} \, dx_{j-1},

which can be calculated as

    \bar{p}(x_j) = w_1(\mu_{j\pm1}, \sigma_{j\pm1}) \, N\left( x_j; \eta_1(\mu_{j\pm1}, \sigma_{j\pm1}), \rho_1^2(\mu_{j\pm1}, \sigma_{j\pm1}) \right)
                 + w_2(\mu_{j\pm1}, \sigma_{j\pm1}) \, N\left( x_j; \eta_2(\mu_{j\pm1}, \sigma_{j\pm1}), \rho_2^2(\mu_{j\pm1}, \sigma_{j\pm1}) \right).

EP Example – Smoothing under GM Noise Cont'd

The parameters w_{1,2}, η_{1,2} and ρ_{1,2} are

    \eta_1 = \rho_1^2 \left( \frac{\bar{\eta}}{\bar{\rho}^2} + \frac{y_j}{\sigma_e^2} \right), \qquad
    \eta_2 = \rho_2^2 \left( \frac{\bar{\eta}}{\bar{\rho}^2} + \frac{y_j}{(10\sigma_e)^2} \right),

    \rho_1^2 = \left( \frac{1}{\bar{\rho}^2} + \frac{1}{\sigma_e^2} \right)^{-1}, \qquad
    \rho_2^2 = \left( \frac{1}{\bar{\rho}^2} + \frac{1}{(10\sigma_e)^2} \right)^{-1},

    w_1 \propto 0.9 \, N(y_j; \bar{\eta}, \bar{\rho}^2 + \sigma_e^2), \qquad
    w_2 \propto 0.1 \, N(y_j; \bar{\eta}, \bar{\rho}^2 + (10\sigma_e)^2),

with w_1 + w_2 = 1 after normalization, where

    \bar{\eta} = \bar{\rho}^2 \left( \frac{\mu_{j-1}}{\sigma_{j-1}^2 + \sigma_v^2} + \frac{\mu_{j+1}}{\sigma_{j+1}^2 + \sigma_v^2} \right), \qquad
    \bar{\rho}^2 = \left( \frac{1}{\sigma_{j-1}^2 + \sigma_v^2} + \frac{1}{\sigma_{j+1}^2 + \sigma_v^2} \right)^{-1}.


EP Example – Smoothing under GM Noise Cont'd

With

    \bar{p}(x_j) = w_1(\mu_{j\pm1}, \sigma_{j\pm1}) \, N\left( x_j; \eta_1(\mu_{j\pm1}, \sigma_{j\pm1}), \rho_1^2(\mu_{j\pm1}, \sigma_{j\pm1}) \right)
                 + w_2(\mu_{j\pm1}, \sigma_{j\pm1}) \, N\left( x_j; \eta_2(\mu_{j\pm1}, \sigma_{j\pm1}), \rho_2^2(\mu_{j\pm1}, \sigma_{j\pm1}) \right),

the EP solution for q_j(x_j) = N(x_j; μ_j, σ_j²) is obtained by matching (propagating) expectations between q_j(·) and p̄(x_j):

    \mu_j = w_1 \eta_1 + w_2 \eta_2,
    \sigma_j^2 = w_1 \left( \rho_1^2 + (\eta_1 - \mu_j)^2 \right) + w_2 \left( \rho_2^2 + (\eta_2 - \mu_j)^2 \right).
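Put together, the recursions on the last three slides give a complete smoother. The sketch below is our own illustration, not code from the slides: all variable names are assumptions, and so is the treatment of the boundaries (at j = 1 the known x_0 = 0 replaces the left-hand neighbor message; at j = N there is no right-hand message). It simulates data from the model and sweeps the moment-matching update back and forth over the chain.

% EP smoothing under Gaussian-mixture measurement noise (illustrative sketch).
rng(0); N = 100; sv = 1; se = 0.5;              % sigma_v and sigma_e
x = cumsum(sv*randn(1,N));                      % true states (x_0 = 0)
e = se*randn(1,N);                              % nominal measurement noise
out = rand(1,N) < 0.1;                          % 10% outliers...
e(out) = 10*se*randn(1,sum(out));               % ...from the wide component
y = x + e;

gauss = @(z,m,s2) exp(-(z-m).^2./(2*s2))./sqrt(2*pi*s2);
mu = y; s2q = se^2*ones(1,N);                   % initialize q(x_j)
for sweep = 1:10
  for j = [1:N, N-1:-1:1]                       % forward, then backward
    % Neighbor messages fused into the Gaussian "prior" N(etab, rhob2):
    if j == 1, pl = 1/sv^2;            ml = 0;  % x_0 = 0 is known
    else,      pl = 1/(s2q(j-1)+sv^2); ml = mu(j-1); end
    if j == N, pr = 0;                 mr = 0;  % no right-hand neighbor
    else,      pr = 1/(s2q(j+1)+sv^2); mr = mu(j+1); end
    rhob2 = 1/(pl + pr); etab = rhob2*(pl*ml + pr*mr);
    % Combine with the two measurement-noise components (slide formulas):
    r1 = 1/(1/rhob2 + 1/se^2);      e1 = r1*(etab/rhob2 + y(j)/se^2);
    r2 = 1/(1/rhob2 + 1/(10*se)^2); e2 = r2*(etab/rhob2 + y(j)/(10*se)^2);
    w1 = 0.9*gauss(y(j), etab, rhob2 + se^2);
    w2 = 0.1*gauss(y(j), etab, rhob2 + (10*se)^2);
    w1 = w1/(w1 + w2); w2 = 1 - w1;
    % Moment matching, exactly as on the slide above:
    mu(j)  = w1*e1 + w2*e2;
    s2q(j) = w1*(r1 + (e1 - mu(j))^2) + w2*(r2 + (e2 - mu(j))^2);
  end
end
plot(1:N, x, 'k-', 1:N, y, 'r.', 1:N, mu, 'b-') % truth, data, EP mean

The update follows the slides' recipe of matching q_j directly to p̄(x_j) built from the current neighbor marginals, rather than the cavity-division form found in some other EP presentations.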

References

Tzikas, D. G., Likas, A. C. and Galatsanos, N. P., "The variational approximation for Bayesian inference," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 131-146, November 2008.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4644060&isnumber=4644043

Seeger, M. W. and Wipf, D. P., "Variational Bayesian inference techniques," IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 81-91, November 2010.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5563102&isnumber=5563096

Beal, M. J., Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University College London, UK, 2003.
http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf

Minka, T., A Family of Algorithms for Approximate Bayesian Inference, PhD Thesis, Massachusetts Institute of Technology, 2001.
http://research.microsoft.com/en-us/um/people/minka/papers/ep/minka-thesis.pdf


A Few Concepts to Summarize Lecture 5

Support vector machines: a discriminative classifier that gives the maximum-margin decision boundary.

Variational inference: approximate Bayesian inference where factorial approximations are made on the form of the posteriors.

Kullback-Leibler (KL) divergence: a cost function, used in two different forms, for finding optimal approximations of the posteriors.

Variational Bayes: a form of variational inference where KL(q||p) is used for the optimization.

Expectation propagation: a form of variational inference where KL(p||q) is used for the optimization.
