Machine Learning, Lecture 5
Support Vector Machines and Approximate Inference

Thomas Schön, Henrik Ohlsson and Umut Orguner
Division of Automatic Control
Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 013 - 281373,
www.control.isy.liu.se/~schon/

Outline Lecture 5 2(35)

1. Summary of lecture 4
2. Support Vector Machines
3. Variational Bayesian Inference
   • General Derivation
   • Example – Identification of a Linear State-Space Model
   • Example – Gaussian Mixtures
4. Expectation Propagation
   • General Derivation
   • Example – State Estimation

(Chapter 7.1, 10)
Summary of Lecture 4 (I/III) 3(35)

A neural network is a nonlinear function (as a function expansion) from a set of input variables {x_i} to a set of output variables {y_k}, controlled by adjustable parameters w.

This function expansion is found by formulating the problem as usual, which results in a (non-convex) optimization problem. This problem is solved using numerical methods.

Backpropagation refers to a way of computing the gradients by making use of the chain rule, combined with clever reuse of information that is needed for more than one gradient.

Summary of Lecture 4 (II/III) 4(35)

A kernel function k(x, z) is defined as an inner product

  k(x, z) = φ(x)^T φ(z),

where φ(x) is a fixed mapping.

Introduced the kernel trick (a.k.a. kernel substitution): in an algorithm where the input data x enters only in the form of scalar products, we can replace this scalar product with another choice of kernel.

The use of kernels allows us to implicitly use basis functions of high, even infinite, dimension (M → ∞).
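As a small illustration of the kernel trick (this sketch is my own and not part of the lecture), the second-order polynomial kernel in two dimensions, k(x, z) = (x^T z)², can be evaluated either implicitly or through its explicit feature map φ(x) = [x_1², √2 x_1 x_2, x_2²]^T:

% Kernel trick sketch: k(x,z) = (x'*z)^2 equals phi(x)'*phi(z)
% with the explicit feature map phi(x) = [x1^2; sqrt(2)*x1*x2; x2^2].
x = [1; 2]; z = [3; -1];
phi = @(u) [u(1)^2; sqrt(2)*u(1)*u(2); u(2)^2];
k_implicit = (x'*z)^2;        % kernel evaluation, no feature map needed
k_explicit = phi(x)'*phi(z);  % same value, computed via the explicit mapping

Both evaluate to the same number; with a Gaussian kernel the corresponding φ is infinite-dimensional, which is exactly why the dual formulation of the SVM below is useful.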
Summary of Lecture 4 (III/III) 5(35)

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

By assuming that the considered system is a Gaussian process, predictions can be made by computing the conditional distribution p(y(x∗) | all the observations), y(x∗) being the output for which we seek a prediction. This regression approach is referred to as Gaussian process regression.

Support Vector Machines (SVM) 6(35)

Very popular classifier.

• Non-probabilistic
• Discriminative
• Can also be used for regression (then called support vector regression, SVR)
• Convex optimization
• Sparse
SVM for Classification 7(35)

Assume: {(t_n, x_n)}_{n=1}^N, with x_n ∈ R^{n_x} and t_n ∈ {−1, 1}, is a given training data set (linearly separable).

Task: Given x̂, what is the corresponding label?

SVM is a discriminative classifier, i.e. it provides a decision boundary. The decision boundary is given by {x | w^T φ(x) + b = 0}.

Goal: Find the decision boundary that maximizes the margin! The margin is the distance from the decision boundary to the closest data point.

SVM for Classification Cont’d 8(35)

The decision boundary that maximizes the margin is given as the solution to the quadratic program (QP)

  min_{w,b}  (1/2) ‖w‖²
  s.t.  t_n (w^T φ(x_n) + b) − 1 ≥ 0,  n = 1, ..., N

To make it possible to let the dimension of the feature space (the dimension of φ(x_n)) go to infinity, we have to derive the dual.
SVM for Classification Cont’d 9(35)

First, the Lagrangian is

  L(w, b, a) = (1/2) ‖w‖² − ∑_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }

and minimizing w.r.t. w, b we obtain the dual g(a). Taking the derivatives w.r.t. w, b and setting them to zero,

  dL(w, b, a)/db = ∑_{n=1}^N a_n t_n = 0,    dL(w, b, a)/dw = w − ∑_{n=1}^N a_n t_n φ(x_n) = 0.

This gives

  g(a) = ∑_{n=1}^N a_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N a_n a_m t_n t_m φ(x_m)^T φ(x_n)

SVM for Classification Cont’d 10(35)

Let k(x_i, x_j) = φ(x_i)^T φ(x_j). The dual objective then becomes

  g(a) = ∑_{n=1}^N a_n − (1/2) ∑_{m=1}^N ∑_{n=1}^N a_n a_m t_n t_m k(x_m, x_n)

which we can maximize w.r.t. a, subject to

  a_n ≥ 0,    ∑_{n=1}^N a_n t_n = 0.

The maximizing a, let us say â, gives, using w^T φ(x∗) = ( ∑_{n=1}^N â_n t_n φ(x_n) )^T φ(x∗), that

  y(x∗) = ∑_{n=1}^N â_n t_n k(x∗, x_n) + b̂.

Many â_n's will be zero ⇒ computational remedy.
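The dual is itself a QP, so any QP solver can be used. As a minimal sketch (my own, not the lecture's code, and assuming the Optimization Toolbox is available), it can be solved with quadprog given the kernel matrix K and the label vector t:

% Dual SVM as a standard QP: min_a 0.5*a'*H*a - sum(a)
% s.t. t'*a = 0 and a >= 0, where H(n,m) = t_n*t_m*k(x_n,x_m).
% K (N-by-N) and t (N-by-1, entries in {-1,+1}) are assumed given.
H   = (t*t').*K;
f   = -ones(N,1);
Aeq = t'; beq = 0;                 % equality constraint sum_n a_n t_n = 0
lb  = zeros(N,1);                  % a_n >= 0
a   = quadprog(H, f, [], [], Aeq, beq, lb, []);
sv  = find(a > 1e-4);              % indices of the support vectors

The CVX solution of the same dual is shown a few slides below.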
SVM for Classification – Non-Separable Classes 11(35)

If points are on the right side of the decision boundary, then t_n (w^T φ(x_n) + b) ≥ 1. To allow for some violations, we introduce slack variables ζ_n, n = 1, ..., N. The modified optimization problem becomes

  min_{w,b,ζ}  (1/2) ‖w‖² + C ∑_{n=1}^N ζ_n
  s.t.  t_n (w^T φ(x_n) + b) + ζ_n − 1 ≥ 0,  n = 1, ..., N,
        ζ_n ≥ 0,  n = 1, ..., N.

Example – CVX to Compute SVM 12(35)

Linearly separable data:

cvx_begin
    variables w(nx,1) b
    minimize( 0.5*w'*w )
    subject to
        y.*(w'*x + b*ones(1,N)) - ones(1,N) >= 0
cvx_end

(Figure: the resulting maximum-margin decision boundary on a linearly separable two-class data set.)

Non-separable data:

cvx_begin
    variables w(nx,1) b zeta(1,N)
    minimize( 0.5*w'*w + C*ones(1,N)*zeta' )
    subject to
        y.*(w'*x + b*ones(1,N)) - ones(1,N) + zeta >= 0
        zeta >= 0
cvx_end

(Figure: the decision boundary obtained with slack variables on an overlapping two-class data set.)
Example – CVX to Compute SVM 13(35)

SVM – Solving the dual:

k = @(x1,x2) exp(-sum((x1*ones(1,size(x2,2)) - x2).^2)/0.5)';   % Gaussian kernel
for t = 1:N; for s = t:N
    K(t,s) = k(x(:,t), x(:,s)); K(s,t) = K(t,s);
end; end

cvx_begin
    variables a(N,1)
    minimize( 1/2*(a.*y')'*K*(a.*y') - ones(1,N)*a )
    subject to
        ones(1,N)*(a.*y') == 0
        a >= 0
cvx_end

ind = find(a > 0.01);                              % support vectors
wphi = @(xstar) ones(1,N)*(a.*y'.*k(xstar,x));
b = 0;
for i = 1:length(ind)
    b = b + 1/y(ind(i)) - wphi(x(:,ind(i)));
end
b = b/length(ind);
ystar = @(xstar) wphi(xstar) + b;

(Figure: classification result using the Gaussian kernel; the support vectors determine the decision boundary.)

Further Reading and Code 14(35)

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

Yalmip can be downloaded from http://users.isy.liu.se/johanl/yalmip/

CVX can be downloaded from http://cvxr.com/cvx/
Bayesian Framework Reminder 15(35)

Let X = {x_1, ..., x_N} be the measurements.

Let Z = {z_1, ..., z_N} be the latent variables, as in the EM framework.

Then, the Bayesian framework is interested in the posterior density p(Z|X), given by Bayes' rule as

  p(Z|X) = p(X|Z) p(Z) / p(X)

For quite many instances, the posterior can be found exactly using the concept of conjugate pairs (a small worked sketch follows after the next slide).

• Gaussian case
• More generally, the exponential family.

What happens when there is no exact solution?

Variational Methods (I/II) 16(35)

Classic calculus involves functions and defines derivatives to optimize them.

The so-called calculus of variations investigates functions of functions, which are called functionals.

  Example: Entropy  H[p(·)] = − ∫ p(x) log p(x) dx

The derivatives of functionals are called variations.

Calculus of variations has its origins in the 18th century, and its most important result is probably the so-called Euler-Lagrange equation: for the functional

  C(q) ≜ ∫ L(t, q(t), q'(t)) dt,  with L = L(t, x, v),

a stationary q∗ satisfies

  L_x(t, q∗(t), q∗'(t)) − (d/dt) L_v(t, q∗(t), q∗'(t)) = 0,

which is the core of Optimal Control Theory.
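To make the conjugate-pair remark above concrete, here is a minimal sketch (my own numbers and variable names, not from the lecture) of the Gaussian case: with likelihood x_i ∼ N(z, σ²) and prior z ∼ N(m_0, v_0), the posterior is again Gaussian and available in closed form, so no approximation is needed.

% Gaussian-Gaussian conjugate pair: closed-form posterior p(z|X).
X  = [1.2, 0.7, 1.9, 1.1];        % assumed measurements
s2 = 0.5^2; m0 = 0; v0 = 2^2;     % assumed likelihood variance and prior
N  = numel(X);
vN = 1/(1/v0 + N/s2);             % posterior variance
mN = vN*(m0/v0 + sum(X)/s2);      % posterior mean
% p(z|X) = N(z; mN, vN)

The variational machinery below is needed precisely when such a closed form is not available.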

Variational Methods (II/II) 17(35)

In general, variational methods assume a predetermined form of the argument function, possibly parametric, e.g.

  Quadratic:        q(x) = x^T A x + b^T x + c

or

  Basis functions:  q(x) = ∑_{i=1}^M w_i φ_i(x)

Variational Inference

In the case of probabilistic inference, the variational approximation takes the form

  q(Z) = ∏_{i=1}^M q_i(Z_i)

where Z = {Z_1, ..., Z_M} is a partitioning of the unknown variables.

Variational Inference 18(35)

Algorithm (Variational Iteration)

Solve the problem iteratively:

1. For j = 1, ..., M
   • Fix {q_i(Z_i)}_{i≠j} to their last estimated values {q̄_i(Z_i)}_{i≠j}.
   • Find the solution of
       q̄_j(Z_j) = argmax_{q_j} L(q)
     where L(q) is the lower bound on log p(X).
2. Repeat 1 until convergence.
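As a minimal concrete instance of this iteration (my own sketch, in the spirit of the classic univariate Gaussian example treated in Bishop, Ch. 10, not an example from the lecture), consider data x_n ∼ N(μ, τ^{−1}) with priors μ ∼ N(μ_0, (λ_0 τ)^{−1}) and τ ∼ Gamma(a_0, b_0), and the factorization q(μ, τ) = q(μ) q(τ):

% Coordinate-ascent variational inference for (mu, tau) of scalar Gaussian data.
x  = randn(100,1) + 2;                 % synthetic data, true mean 2, unit variance
N  = numel(x); xbar = mean(x);
mu0 = 0; lam0 = 1; a0 = 1e-3; b0 = 1e-3;
Etau = 1;                              % initial guess for E[tau]
for iter = 1:50
    % update q(mu) = N(mu; muN, 1/lamN), holding q(tau) fixed
    muN  = (lam0*mu0 + N*xbar)/(lam0 + N);
    lamN = (lam0 + N)*Etau;
    % update q(tau) = Gamma(tau; aN, bN), holding q(mu) fixed
    aN = a0 + (N + 1)/2;
    bN = b0 + 0.5*( sum((x - muN).^2) + N/lamN ...
                  + lam0*((muN - mu0)^2 + 1/lamN) );
    Etau = aN/bN;                      % expectation used in the next sweep
end
% q(mu) q(tau) now approximates p(mu, tau | x); point estimates: muN, Etau

Because each conditional update is conjugate, every step is available in closed form; only the coupling between the two factors requires iteration.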
VB Example 1 – Linear System Identification 19(35)

Consider the following linear scalar state-space model

  x_{k+1} = θ x_k + v_k,
  y_k     = x_k + e_k,

  [v_k; e_k] ∼ N( [0; 0], [σ_v², 0; 0, σ_e²] ).

The initial state: x_0 ∼ N(x_0; x̄_0, σ_0²).

θ has the prior distribution θ ∼ N(θ; 0, σ_θ²).

The identification problem is now to determine the posterior p(θ | y_{0:N}) using the VB framework.

We still have some latent variables x_{0:N} ≜ {x_0, ..., x_N}.

Note the difference in notation compared to Bishop! The observations are denoted y and the latent variables are given by x.

VB Example 1 – Linear System Identification 20(35)

With latent variables,

  p(θ | y_{0:N}) = ∫ p(θ, x_{0:N} | y_{0:N}) dx_{0:N}

There is still no exact form for the joint density p(θ, x_{0:N} | y_{0:N}).

Variational Approximation

Approximate the posterior p(θ, x_{0:N} | y_{0:N}) as

  p(θ, x_{0:N} | y_{0:N}) ≈ q_θ(θ) q_x(x_{0:N})

Find q_θ(θ) and q_x(x_{0:N}) using

  log q_θ(θ)       = E_{q_x}[ log p(y_{0:N}, x_{0:N}, θ) ] + const.
  log q_x(x_{0:N}) = E_{q_θ}[ log p(y_{0:N}, x_{0:N}, θ) ] + const.
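For reference, a minimal sketch (the parameter values are my own assumptions) of how the data y_{0:N} entering the VB identification could be generated from this model:

% Simulate the scalar state-space model x_{k+1} = theta*x_k + v_k, y_k = x_k + e_k.
N = 200; theta_true = 0.9;            % assumed true parameter
sv = 0.1; se = 0.5;                   % assumed sigma_v and sigma_e
x = zeros(N+1,1);
x(1) = 0.5;                           % x_0 (drawn from N(xbar_0, sigma_0^2) in general)
for k = 1:N
    x(k+1) = theta_true*x(k) + sv*randn;
end
y = x + se*randn(N+1,1);              % y_0, ..., y_N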
VB Example 1 – Linear System Identification 21(35)

The Variational Bayes formulas are

  log q_θ(θ)       = E_{q_x}[ log p(y_{0:N}, x_{0:N}, θ) ] + const.
  log q_x(x_{0:N}) = E_{q_θ}[ log p(y_{0:N}, x_{0:N}, θ) ] + const.

We have the joint density p(y_{0:N}, x_{0:N}, θ) as

  p(y_{0:N}, x_{0:N}, θ) = p(y_{0:N} | x_{0:N}) p(x_{1:N} | x_{0:N−1}, θ) p(x_0) p(θ)
                         = ∏_{i=0}^N p(y_i | x_i) ∏_{i=1}^N p(x_i | x_{i−1}, θ) p(x_0) p(θ)

Taking the logarithm and separating out the constant terms,

  log p(y_{0:N}, x_{0:N}, θ) = − (0.5/σ_e²) ∑_{i=0}^N (y_i − x_i)² − (0.5/σ_v²) ∑_{i=1}^N (x_i − θ x_{i−1})²
                               − (0.5/σ_0²) (x_0 − x̄_0)² − (0.5/σ_θ²) θ² + const.

VB Example 2 – Gaussian Mixture Identification 22(35)

Back to Bishop's notation: x now denotes a measurement.

Suppose we have x_{1:N} i.i.d. and distributed as

  x_i ∼ p(x | π_{1:K}, μ_{1:K}, Λ_{1:K}) = ∑_{k=1}^K π_k N(x; μ_k, Λ_k^{−1})

In the Bayesian framework, all the unknowns {π_{1:K}, μ_{1:K}, Λ_{1:K}} are random:

  π_{1:K} ∼ Dir(π_{1:K} | α_0) ∝ ∏_{k=1}^K π_k^{α_0 − 1}

  μ_{1:K}, Λ_{1:K} ∼ p(μ_{1:K}, Λ_{1:K}) ≜ ∏_{k=1}^K N(μ_k; m_0, (β_0 Λ_k)^{−1}) W(Λ_k | W_0, ν_0)
VB Example 2 – Gaussian Mixture Identification 23(35)

Define the latent variables z_i ≜ [z_{i1}, ..., z_{iK}]^T as in EM. Then

  p(x_{1:N}, z_{1:N} | π_{1:K}, μ_{1:K}, Λ_{1:K}) = ∏_{i=1}^N ∏_{k=1}^K π_k^{z_{ik}} N(x_i; μ_k, Λ_k^{−1})^{z_{ik}}

The Bayesian framework then asks for the posterior density p(z_{1:N}, π_{1:K}, μ_{1:K}, Λ_{1:K} | x_{1:N}).

Variational Approximation

Approximate the posterior as

  p(z_{1:N}, π_{1:K}, μ_{1:K}, Λ_{1:K} | x_{1:N}) ≈ q_z(z_{1:N}) q_{π,μ,Λ}(π_{1:K}, μ_{1:K}, Λ_{1:K})

Find q_z(z_{1:N}) and q_{π,μ,Λ}(π_{1:K}, μ_{1:K}, Λ_{1:K}) iteratively (a sketch of the q_z step follows below).

VB Example 2 – Sparsity with Bayesian Methods 24(35)

Symmetric Dirichlet distribution for K = 3:

  π_{1:3} ∼ Dir(π_{1:3} | α_0) ∝ ∏_{k=1}^3 π_k^{α_0 − 1} = ( π_1 π_2 (1 − π_1 − π_2) )^{α_0 − 1}

(Figure: the symmetric Dirichlet density over the probability simplex for different values of α_0; small α_0 pushes the mass towards the corners of the simplex, i.e. favours sparse mixing weights.)
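As a minimal sketch of one half of that iteration (my own code and variable names, not the lecture's), the q_z step computes the responsibilities r(n,k) = q(z_{nk} = 1) from the current hyperparameters α, β, m, W, ν of q_{π,μ,Λ}, using the standard expectations E[ln π_k], E[ln |Λ_k|] and E[(x_n − μ_k)^T Λ_k (x_n − μ_k)]:

function r = vb_responsibilities(X, alpha, beta, m, W, nu)
% X is D-by-N data; alpha, beta, nu are 1-by-K; m is D-by-K; W is D-by-D-by-K.
[D, N] = size(X); K = numel(alpha);
lnrho = zeros(N, K);
for k = 1:K
    Elnpi  = psi(alpha(k)) - psi(sum(alpha));                 % E[ln pi_k]
    ElnLam = sum(psi((nu(k) + 1 - (1:D))/2)) ...
             + D*log(2) + log(det(W(:,:,k)));                 % E[ln det Lambda_k]
    d  = X - m(:,k);                                          % residuals, D-by-N
    Eq = D/beta(k) + nu(k)*sum(d.*(W(:,:,k)*d), 1);           % E[(x-mu)' Lambda (x-mu)]
    lnrho(:,k) = Elnpi + 0.5*ElnLam - 0.5*D*log(2*pi) - 0.5*Eq';
end
lnrho = lnrho - max(lnrho, [], 2);                            % avoid overflow
r = exp(lnrho)./sum(exp(lnrho), 2);                           % normalize over k
end

The other half of the iteration updates α, β, m, W, ν from the responsibility-weighted statistics of the data; both steps are in closed form thanks to the conjugate priors above.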
Minimization of KL-divergence (I/III) 25(35)

Suppose we have

  p(x) = 0.2 N(x; 5, 1) + 0.8 N(x; −5, 2²)

Let q_{μ,σ}(x) ≜ N(x; μ, σ²).

Find min_{μ,σ} KL(q_{μ,σ} || p)        Find min_{μ,σ} KL(p || q_{μ,σ})

where

  KL(q_{μ,σ} || p) ≜ ∫ q_{μ,σ}(x) log( q_{μ,σ}(x) / p(x) ) dx
  KL(p || q_{μ,σ}) ≜ ∫ p(x) log( p(x) / q_{μ,σ}(x) ) dx

(Figure: the bimodal mixture p(x) together with candidate Gaussian approximations q(x).)

Minimization of KL-divergence (II/III) 26(35)

Find min_{μ,σ} KL(q_{μ,σ} || p): the solution is zero-forcing, i.e. the optimal q avoids regions where p is (close to) zero and locks onto one of the modes.

Find min_{μ,σ} KL(p || q_{μ,σ}): the solution is non-zero-forcing, i.e. the optimal q must put mass wherever p has mass and therefore covers both modes.

(Figure: p(x) together with the resulting Gaussian fits q(x) for the two criteria.)
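A minimal numerical sketch of the two fits (my own code, grid-based and not part of the lecture): KL(p||q) is minimized in closed form by matching the mean and variance of p, while KL(q||p) is minimized with a generic optimizer.

% Fit q(x) = N(x; mu, sigma^2) to the mixture p(x) under both KL criteria.
gauss = @(x, m, s) exp(-(x - m).^2./(2*s.^2))./(sqrt(2*pi)*s);
p  = @(x) 0.2*gauss(x, 5, 1) + 0.8*gauss(x, -5, 2);
xg = linspace(-15, 15, 3001); dx = xg(2) - xg(1);   % integration grid

% min KL(p||q): moment matching, mu = E_p[x], sigma^2 = Var_p[x]
mu_pq  = sum(xg.*p(xg))*dx;
var_pq = sum((xg - mu_pq).^2.*p(xg))*dx;

% min KL(q||p): numerical minimization (the zero-forcing solution)
kl_qp = @(th) sum( gauss(xg, th(1), exp(th(2))) .* ...
        log( max(gauss(xg, th(1), exp(th(2))), realmin)./p(xg) ) )*dx;
th = fminsearch(kl_qp, [-5; log(2)]);               % start near the dominant mode
mu_qp = th(1); sig_qp = exp(th(2));

The moment-matched fit ends up between the two modes with a large variance, while the KL(q||p) fit stays on the dominant mode, which is exactly the zero-forcing versus non-zero-forcing behaviour illustrated on the slide.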
Minimization of KL-divergence (III/III) 27(35)

The second form of optimization,

  KL(p || q_{μ,σ}) ≜ ∫ p(x) log( p(x) / q_{μ,σ}(x) ) dx,

has the following attractive property:

  μ  = E_{q_{μ,σ}}[x] = E_p[x]
  σ² = E_{q_{μ,σ}}[ (x − E_{q_{μ,σ}}[x])² ] = E_p[ (x − E_p[x])² ]

i.e. the optimal Gaussian matches the mean and variance (the moments) of p. Similar properties hold for the entire exponential family.

(Figure: p(x) together with the moment-matched Gaussian q(x).)

Expectation Propagation (I/II) 28(35)

Suppose we have a posterior distribution in the form of

  p(X|Y) ∝ ∏_{i=1}^I f_i(X)

which is intractable or too computationally costly to compute.

Then EP approximates the posterior as

  p(X|Y) ≈ q(X),    q(X) = ∏_{i=1}^I q_i(X),    q_i(X) = N(X; μ_i, Σ_i)

Ideally we want to minimize the KL divergence between the true posterior and the approximation,

  q̂(X) = argmin_q KL( (1/Z) ∏_{i=1}^I f_i(X) || ∏_{i=1}^I q_i(X) )

A variational method using this type of KL-divergence minimization, and hence the expectation equations above, is Expectation Propagation.
Expectation Propagation (II/II) 29(35)

Solving this is intractable, so make the approximation that we minimize the KL divergence between pairs of factors f_j(X) and q_j(X).

The terms q_j(X) are estimated iteratively, as in VB, by keeping the last estimates of {q̂_i}_{i≠j}:

  q̂_j(X) = argmin_{q_j} KL( f_j(X) ∏_{i≠j} q̂_i(X)  ||  q_j(X) ∏_{i≠j} q̂_i(X) )

In the Gaussian case this is obtained by solving the equations

  E_{q_j ∏_{i≠j} q̂_i}[X]     = E_{f_j ∏_{i≠j} q̂_i}[X]
  E_{q_j ∏_{i≠j} q̂_i}[X X^T] = E_{f_j ∏_{i≠j} q̂_i}[X X^T]

for the mean μ_j and the covariance Σ_j of q_j(·).

EP Example – Smoothing under GM noise 30(35)

Consider the following linear scalar state-space model

  x_{k+1} = x_k + v_k,    v_k ∼ N(v_k; 0, σ_v²)
  y_k     = x_k + e_k,    e_k ∼ p_e(e_k) ≜ 0.9 N(e_k; 0, σ_e²) + 0.1 N(e_k; 0, (10σ_e)²)

where x_0 = 0 is known.

The problem is to obtain the posterior density p(x_{1:N} | y_{1:N}).

The true posterior factorizes as

  p(x_{1:N} | y_{1:N}) ∝ ∏_{i=1}^N p(y_i | x_i) p(x_i | x_{i−1})

The true posterior in this case is a Gaussian mixture with 2^N components, which is not feasible to compute.
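For concreteness, a minimal simulation sketch of this model (the values of σ_v and σ_e are my own assumptions); the measurement noise is drawn from the mixture by first sampling which component is active:

% Simulate the random-walk state and the Gaussian-mixture measurement noise.
N = 100; sv = 0.2; se = 0.1;                 % assumed sigma_v, sigma_e
x = zeros(N+1,1);                            % x(1) corresponds to x_0 = 0 (known)
for k = 1:N
    x(k+1) = x(k) + sv*randn;
end
outlier = rand(N,1) < 0.1;                   % 10% of samples use the wide component
e = se*randn(N,1);
e(outlier) = 10*se*randn(nnz(outlier),1);
y = x(2:end) + e;                            % y_1, ..., y_N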
EP Example – Smoothing under GM noise 31(35)

Make the variational approximation

  p(x_{1:N} | y_{1:N}) ≈ ∏_{i=1}^N q_i(x_i),    q_i(x_i) ≜ N(x_i; μ_i, σ_i²)

Consider the density for x_j given as

  p̂_j(x_j) ∝ ∫∫ p(y_j | x_j) p(x_{j+1} | x_j) p(x_j | x_{j−1})
              × N(x_{j+1}; μ_{j+1}, σ_{j+1}²) N(x_{j−1}; μ_{j−1}, σ_{j−1}²) dx_{j+1} dx_{j−1}

which can be calculated as

  p̂_j(x_j) = w̄_1(μ_{j±1}, σ_{j±1}) N( x_j; η_1(μ_{j±1}, σ_{j±1}), ρ_1²(μ_{j±1}, σ_{j±1}) )
            + w̄_2(μ_{j±1}, σ_{j±1}) N( x_j; η_2(μ_{j±1}, σ_{j±1}), ρ_2²(μ_{j±1}, σ_{j±1}) )

EP Example – Smoothing under GM noise 32(35)

where the parameters w̄_{1,2}, η_{1,2} and ρ_{1,2} are

  η_1 = ρ_1² ( η̄/ρ̄² + y_j/σ_e² ),            η_2 = ρ_2² ( η̄/ρ̄² + y_j/(10σ_e)² )

  ρ_1² = ( 1/ρ̄² + 1/σ_e² )^{−1},             ρ_2² = ( 1/ρ̄² + 1/(10σ_e)² )^{−1}

  w̄_1 ∝ 0.9 N( y_j; η̄, ρ̄² + σ_e² ),          w̄_2 ∝ 0.1 N( y_j; η̄, ρ̄² + (10σ_e)² )

(normalized so that w̄_1 + w̄_2 = 1), and η̄, ρ̄² are given by

  η̄  = ρ̄² ( μ_{j−1}/(σ_{j−1}² + σ_v²) + μ_{j+1}/(σ_{j+1}² + σ_v²) )

  ρ̄² = ( 1/(σ_{j−1}² + σ_v²) + 1/(σ_{j+1}² + σ_v²) )^{−1}
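A minimal sketch (my own helper function and variable names; boundary sites with only one neighbour would need the obvious modification) that evaluates these expressions for a single site j, given the current Gaussian approximations of the two neighbours, and also performs the moment matching used on the next slide:

function [mu_j, s2_j] = ep_site(yj, mu_m, s2_m, mu_p, s2_p, s2_v, s2_e)
% (mu_m, s2_m) = (mu_{j-1}, sigma_{j-1}^2), (mu_p, s2_p) = (mu_{j+1}, sigma_{j+1}^2),
% s2_v = sigma_v^2, s2_e = sigma_e^2. Returns the matched mean/variance of q_j.
gauss  = @(y, m, s2) exp(-(y - m).^2./(2*s2))./sqrt(2*pi*s2);
rho2b  = 1/( 1/(s2_m + s2_v) + 1/(s2_p + s2_v) );             % rho_bar^2
etab   = rho2b*( mu_m/(s2_m + s2_v) + mu_p/(s2_p + s2_v) );   % eta_bar
rho2_1 = 1/( 1/rho2b + 1/s2_e );                              % rho_1^2
rho2_2 = 1/( 1/rho2b + 1/(100*s2_e) );                        % rho_2^2, (10*sigma_e)^2 = 100*s2_e
eta1   = rho2_1*( etab/rho2b + yj/s2_e );
eta2   = rho2_2*( etab/rho2b + yj/(100*s2_e) );
w1     = 0.9*gauss(yj, etab, rho2b + s2_e);                   % unnormalized weights
w2     = 0.1*gauss(yj, etab, rho2b + 100*s2_e);
w1     = w1/(w1 + w2); w2 = 1 - w1;
mu_j   = w1*eta1 + w2*eta2;                                   % moment matching (next slide)
s2_j   = w1*(rho2_1 + (eta1 - mu_j)^2) + w2*(rho2_2 + (eta2 - mu_j)^2);
end

Sweeping this update over j = 1, ..., N until the site means and variances stop changing gives the EP smoothing approximation.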
EP Example – Smoothing under GM noise 33(35)

  p̂_j(x_j) = w̄_1(μ_{j±1}, σ_{j±1}) N( x_j; η_1(μ_{j±1}, σ_{j±1}), ρ_1²(μ_{j±1}, σ_{j±1}) )
            + w̄_2(μ_{j±1}, σ_{j±1}) N( x_j; η_2(μ_{j±1}, σ_{j±1}), ρ_2²(μ_{j±1}, σ_{j±1}) )

The EP solution for q̂_j(x_j) = N(x_j; μ̂_j, σ̂_j²) is obtained by matching (propagating) expectations between q̂_j(·) and p̂_j(x_j):

  μ̂_j  = w̄_1 η_1 + w̄_2 η_2
  σ̂_j² = w̄_1 ( ρ_1² + (η_1 − μ̂_j)² ) + w̄_2 ( ρ_2² + (η_2 − μ̂_j)² )

References 34(35)

Tzikas, D.G., Likas, A.C. and Galatsanos, N.P., "The variational approximation for Bayesian inference," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 131-146, November 2008.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4644060&isnumber=4644043

Seeger, M.W. and Wipf, D.P., "Variational Bayesian Inference Techniques," IEEE Signal Processing Magazine, vol. 27, no. 6, pp. 81-91, November 2010.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5563102&isnumber=5563096

Beal, M.J., Variational Algorithms for Approximate Bayesian Inference, PhD Thesis, University College London, UK, 2003.
http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf

Minka, T., A Family of Algorithms for Approximate Bayesian Inference, PhD Thesis, Massachusetts Institute of Technology, 2001.
http://research.microsoft.com/en-us/um/people/minka/papers/ep/minka-thesis.pdf
A Few Concepts to Summarize Lecture 5 35(35)

Support vector machines: A discriminative classifier that gives the maximum-margin decision boundary.

Variational Inference: Approximate Bayesian inference where factorial approximations are made on the form of the posteriors.

Kullback-Leibler (KL) Divergence: A cost function used to find optimal approximations of the posteriors, in two different forms.

Variational Bayes: A form of variational inference where KL(q||p) is used for the optimization.

Expectation Propagation: A form of variational inference where KL(p||q) is used for the optimization.