# Machine Learning, Lecture 2: Linear Regression

Artificial Intelligence and Robotics

14 Oct 2013
## Outline lecture 2 (2/30)

"it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones"

1. Summary of lecture 1
2. Linear basis function models
3. Maximum likelihood and least squares
5. Shrinkage methods
   - Ridge regression
   - LASSO
6. Bayesian linear regression
7. Motivation of kernel methods

Thomas Schön
Division of Automatic Control
Email: schon@isy.liu.se
Phone: 013 - 281373
Office: House B, Entrance 27
AUTOMATIC CONTROL, REGLERTEKNIK
Machine Learning
T. Schön
## Summary of lecture 1 (I/III) (3/30)

The exponential family of distributions over $x$, parameterized by $\eta$,

$$p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^\mathsf{T} u(x)\right).$$

One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters $\theta$ should be chosen in such a way that the measurements $\{x_i\}_{i=1}^N$ are as likely as possible, i.e.,

$$\widehat{\theta} = \underset{\theta}{\arg\max}\; p(x_1, \dots, x_N \mid \theta).$$

## Summary of lecture 1 (II/III) (4/30)

The three basic steps of Bayesian modeling (where all variables are modeled as stochastic):

1. Assign prior distributions $p(\theta)$ to all unknown parameters $\theta$.
2. Write down the likelihood $p(x_1, \dots, x_N \mid \theta)$ of the data $x_1, \dots, x_N$ given the parameters $\theta$.
3. Determine the posterior distribution of the parameters given the data,

$$p(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto p(x_1, \dots, x_N \mid \theta)\, p(\theta).$$

If the posterior $p(\theta \mid x_1, \dots, x_N)$ and the prior $p(\theta)$ distributions are of the same functional form, they are conjugate distributions and the prior is said to be a conjugate prior for the likelihood.

## Summary of lecture 1 (III/III) (5/30)

Modeling "heavy tails" using the Student's t-distribution:

$$\mathrm{St}(x \mid \mu, \lambda, \nu) = \int \mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}(\eta \mid \nu/2, \nu/2)\, \mathrm{d}\eta = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda(x - \mu)^2}{\nu}\right)^{-\nu/2 - 1/2},$$

which, according to the first expression, can be interpreted as an infinite mixture of Gaussians with the same mean but different variances.

[Figure: $-\log$ Student and $-\log$ Gaussian.] Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

## Commonly used basis functions (6/30)

When nonlinear basis functions are used, $y(x, w)$ can be a nonlinear function in the input variable $x$ (still linear in $w$).

- Global (in the sense that a small change in $x$ affects all basis functions) basis functions:
  1. Polynomial (see illustrative example in Section 1.1; e.g., the identity $\varphi(x) = x$)
- Local (in the sense that a small change in $x$ only affects the nearby basis functions) basis functions:
  1. Gaussian
  2. Sigmoidal
## Linear regression model in matrix form (7/30)

It is often convenient to write the linear regression model

$$t_n = w^\mathsf{T} \varphi(x_n) + e_n, \qquad n = 1, \dots, N,$$

where $w = \begin{pmatrix} w_0 & w_1 & \dots & w_{M-1} \end{pmatrix}^\mathsf{T}$ and $\varphi_n = \begin{pmatrix} 1 & \varphi_1(x_n) & \dots & \varphi_{M-1}(x_n) \end{pmatrix}^\mathsf{T}$, in matrix form,

$$T = \Phi w + E,$$

where

$$T = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}, \qquad \Phi = \begin{pmatrix} \varphi_0(x_1) & \varphi_1(x_1) & \dots & \varphi_{M-1}(x_1) \\ \varphi_0(x_2) & \varphi_1(x_2) & \dots & \varphi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_0(x_N) & \varphi_1(x_N) & \dots & \varphi_{M-1}(x_N) \end{pmatrix}, \qquad E = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{pmatrix}.$$

## Maximum likelihood and least squares (I/IV) (8/30)

In our linear regression model,

$$t_n = w^\mathsf{T} \varphi(x_n) + e_n,$$

assume that $e_n \sim \mathcal{N}(0, \beta^{-1})$ (i.i.d.). This results in the following likelihood function:

$$p(t_n \mid w, \beta) = \mathcal{N}\left(t_n \mid w^\mathsf{T} \varphi(x_n), \beta^{-1}\right).$$

Note that this is a slight abuse of notation; $p_{w,\beta}(t_n)$ or $p(t_n; w, \beta)$ would have been better, since $w$ and $\beta$ are both considered deterministic parameters in ML.
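The stacking above is easy to carry out numerically. A minimal sketch in Python (the polynomial basis and the example inputs are illustrative choices, not from the slides):

```python
import numpy as np

# Build the N x M design matrix Phi for T = Phi w + E, using a polynomial
# basis phi_j(x) = x^j, so that phi_0(x) = 1 gives the constant column.

def design_matrix(x, M):
    """Stack phi_n^T = (1, x_n, ..., x_n^(M-1)) row by row."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M, increasing=True)  # column j holds x**j

x = np.array([0.0, 0.5, 1.0])
Phi = design_matrix(x, M=3)
# Each row is (1, x_n, x_n^2):
# [[1.   0.   0.  ]
#  [1.   0.5  0.25]
#  [1.   1.   1.  ]]
```

Each row of `Phi` is one regressor vector $\varphi_n^\mathsf{T}$, so `Phi @ w` evaluates the model for all $N$ samples at once.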

## Maximum likelihood and least squares (II/IV) (9/30)

The available training data consists of $N$ input variables $X = \{x_i\}_{i=1}^N$ and the corresponding target variables $T = \{t_i\}_{i=1}^N$. According to our assumption on the noise, the likelihood function is given by

$$p(T \mid w, \beta) = \prod_{n=1}^{N} p(t_n \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid w^\mathsf{T} \varphi(x_n), \beta^{-1}\right),$$

which results in the following log-likelihood function:

$$L(w, \beta) \triangleq \ln p(t_1, \dots, t_N \mid w, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\left(t_n \mid w^\mathsf{T} \varphi(x_n), \beta^{-1}\right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \frac{\beta}{2} \sum_{n=1}^{N} \left(t_n - w^\mathsf{T} \varphi(x_n)\right)^2.$$

## Maximum likelihood and least squares (III/IV) (10/30)

The maximum likelihood problem now amounts to solving

$$\underset{w, \beta}{\arg\max}\; L(w, \beta).$$

Setting the derivative $\frac{\partial L}{\partial w} = \beta \sum_{n=1}^{N} \left(t_n - w^\mathsf{T} \varphi(x_n)\right) \varphi(x_n)^\mathsf{T}$ equal to $0$ gives the following ML estimate for $w$:

$$\widehat{w}^{\mathrm{ML}} = \left(\Phi^\mathsf{T} \Phi\right)^{-1} \Phi^\mathsf{T} T, \qquad \Phi = \begin{pmatrix} \varphi_0(x_1) & \varphi_1(x_1) & \dots & \varphi_{M-1}(x_1) \\ \varphi_0(x_2) & \varphi_1(x_2) & \dots & \varphi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_0(x_N) & \varphi_1(x_N) & \dots & \varphi_{M-1}(x_N) \end{pmatrix}.$$

Note that if $\Phi^\mathsf{T} \Phi$ is singular (or close to singular) we can fix this by adding $\lambda I$, i.e.,

$$\widehat{w}^{\mathrm{RR}} = \left(\Phi^\mathsf{T} \Phi + \lambda I\right)^{-1} \Phi^\mathsf{T} T.$$
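The normal-equation solution and its ridge-regularized fix can be verified on synthetic data; a minimal sketch (the data, the true weights, and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

# ML estimate w_ML = (Phi^T Phi)^(-1) Phi^T T, and the ridge-regularized
# w_RR = (Phi^T Phi + lambda I)^(-1) Phi^T T used when Phi^T Phi is
# singular or ill-conditioned.

rng = np.random.default_rng(0)
N, M = 50, 3
x = rng.uniform(-1, 1, N)
Phi = np.vander(x, M, increasing=True)          # polynomial basis
w_true = np.array([1.0, -2.0, 0.5])             # hypothetical true weights
t = Phi @ w_true + rng.normal(0, 0.1, N)        # Gaussian noise, beta^(-1/2) = 0.1

# Least squares via lstsq (numerically preferable to forming Phi^T Phi)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Ridge regression: add lambda*I before inverting
lam = 0.1
w_rr = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
```

Note that `np.linalg.lstsq` solves the same problem as the explicit normal equations but via a factorization, which remains stable when $\Phi^\mathsf{T}\Phi$ is poorly conditioned.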
## Maximum likelihood and least squares (IV/IV) (11/30)

Maximizing the log-likelihood function $L(w, \beta)$ w.r.t. $\beta$ results in the following estimate for $\beta$:

$$\frac{1}{\widehat{\beta}^{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \left(t_n - \left(\widehat{w}^{\mathrm{ML}}\right)^\mathsf{T} \varphi(x_n)\right)^2.$$

Finally, note that if we are only interested in $w$, the log-likelihood function is proportional to

$$\sum_{n=1}^{N} \left(t_n - w^\mathsf{T} \varphi(x_n)\right)^2,$$

which clearly shows that assuming a Gaussian noise model and making use of Maximum Likelihood (ML) corresponds to a Least Squares (LS) problem.

## Interpretation of the Gauss-Markov theorem (12/30)

The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.

"the restriction to unbiased estimates is not necessarily a wise one" [HTF, page 51]

Two classes of potentially biased estimators: 1. subset selection methods and 2. shrinkage methods. This is intimately connected to the bias-variance trade-off.

- We will give a system identification example related to ridge regression to illustrate the bias-variance trade-off.
- See Section 3.2 for a slightly more abstract (but very informative) account of the bias-variance trade-off. (This is a perfect topic for discussions during the exercise sessions!)
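The noise-precision estimate $1/\widehat{\beta}^{\mathrm{ML}} = \frac{1}{N}\sum_n (t_n - \widehat{w}^\mathsf{T}\varphi(x_n))^2$ can also be checked numerically; a sketch under illustrative assumptions (the straight-line model and $\beta = 25$ anticipate the Bayesian example later in the lecture):

```python
import numpy as np

# ML estimate of the noise precision beta: fit w by least squares, then
# 1/beta_ML is the mean squared residual.

rng = np.random.default_rng(1)
N = 2000
x = rng.uniform(-1, 1, N)
Phi = np.column_stack([np.ones(N), x])           # phi(x) = (1, x)^T
w_true = np.array([-0.3, 0.5])
beta_true = 25.0                                 # noise std 0.2
t = Phi @ w_true + rng.normal(0, beta_true ** -0.5, N)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)          # should be close to 25
```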

## Interpretation of RR using the SVD of Φ (13/30)

By studying the SVD of $\Phi$ it can be shown that ridge regression projects the measurements onto the principal components of $\Phi$ and then shrinks the coefficients of low-variance components more than the coefficients of high-variance components. (See Section 3.4.1 in HTF for details.)

## Bias-variance tradeoff – example (I/IV) (14/30)

(Ex. 2.3 in Henrik Ohlsson's PhD thesis) Consider a SISO system

$$y_t = \sum_{k=1}^{n} g_k^0 u_{t-k} + e_t, \qquad (1)$$

where $u_t$ denotes the input, $y_t$ denotes the output, $e_t$ denotes white noise ($\mathrm{E}(e_t) = 0$ and $\mathrm{E}(e_t e_s) = \sigma^2 \delta(t-s)$), and $\{g_k^0\}_{k=1}^n$ denotes the impulse response of the system.

Recall that the impulse response is the output $y_t$ when $u_t = \delta(t)$ is used in (1), which results in

$$y_t = \begin{cases} g_t^0 + e_t, & t = 1, \dots, n, \\ e_t, & t > n. \end{cases}$$
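The impulse-response measurement described by the case expression is straightforward to simulate; a sketch in which the true impulse response, the noise level, and the record length are all illustrative assumptions:

```python
import numpy as np

# Simulate the response to u_t = delta(t): y_t = g_t^0 + e_t for t = 1..n,
# and y_t = e_t (pure noise) for t > n.

rng = np.random.default_rng(2)
n = 5
g0 = 0.8 ** np.arange(1, n + 1)                  # hypothetical true impulse response
sigma = 0.01                                     # white-noise standard deviation
T_len = 10                                       # total record length

y = np.zeros(T_len)
y[:n] = g0 + rng.normal(0, sigma, n)             # t = 1..n: g_t^0 + e_t
y[n:] = rng.normal(0, sigma, T_len - n)          # t > n: e_t only
```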
## Bias-variance tradeoff – example (II/IV) (15/30)

The task is now to estimate the impulse response using an $n$:th order FIR model,

$$y_t = w^\mathsf{T} \varphi_t + e_t,$$

where

$$\varphi_t = \begin{pmatrix} u_{t-1} & \dots & u_{t-n} \end{pmatrix}^\mathsf{T}, \qquad w \in \mathbb{R}^n.$$

Let us use Ridge Regression (RR),

$$\widehat{w}^{\mathrm{RR}}_\lambda = \underset{w}{\arg\min}\; \|Y - \Phi w\|_2^2 + \lambda w^\mathsf{T} w,$$

to find the parameters $w$.

## Bias-variance tradeoff – example (III/IV) (16/30)

[Figure: squared bias $\left(\mathrm{E}_{\widehat{w}}\left(\widehat{w}^\mathsf{T} \varphi_*\right) - w_0^\mathsf{T} \varphi_*\right)^2$ (gray line), variance $\mathrm{E}_{\widehat{w}}\left(\widehat{w}^\mathsf{T} \varphi_* - \mathrm{E}_{\widehat{w}}\left(\widehat{w}^\mathsf{T} \varphi_*\right)\right)^2$ (dashed line), and MSE (black line), plotted as functions of $\lambda$.]

$$\mathrm{MSE} = (\mathrm{bias})^2 + \mathrm{variance}$$
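The decomposition MSE = (bias)² + variance for the ridge predictor can be reproduced by Monte Carlo simulation; a sketch where the FIR system, the test regressor $\varphi_*$, the noise level, and the $\lambda$ grid are all illustrative assumptions:

```python
import numpy as np

# Monte Carlo estimate of (bias)^2, variance, and MSE of the ridge
# prediction w_RR^T phi_* as lambda grows: bias increases, variance shrinks.

rng = np.random.default_rng(3)
n, N, runs = 4, 30, 400
w0 = np.array([0.9, 0.5, 0.2, 0.1])              # hypothetical true FIR coefficients
phi_star = np.array([1.0, -1.0, 0.5, 0.0])       # fixed test regressor
sigma = 0.5                                      # noise standard deviation

lambdas = [0.0, 1.0, 10.0]
bias2, var, mse = [], [], []
for lam in lambdas:
    preds = np.empty(runs)
    for r in range(runs):
        Phi = rng.normal(0, 1, (N, n))           # fresh random input regressors
        Y = Phi @ w0 + rng.normal(0, sigma, N)
        w_rr = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ Y)
        preds[r] = w_rr @ phi_star
    b2 = (preds.mean() - w0 @ phi_star) ** 2     # squared bias
    v = preds.var()                              # variance
    bias2.append(b2); var.append(v); mse.append(b2 + v)
```

Sweeping a finer grid of $\lambda$ values and plotting `bias2`, `var`, and `mse` reproduces the gray, dashed, and black curves of the figure.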

## Bias-variance tradeoff – example (IV/IV) (17/30)

"Flexible" models will have low bias and high variance, and more "restricted" models will have high bias and low variance. The model with the best predictive capabilities is the one which strikes the best tradeoff between bias and variance.

Recent contributions on impulse response identification using regularization:

- Gianluigi Pillonetto and Giuseppe De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, January 2010.
- Tianshi Chen, Henrik Ohlsson and Lennart Ljung. On the estimation of transfer functions, regularizations and Gaussian processes – Revisited. Automatica, 48(8):1525–1535, August 2012.

## Lasso (18/30)

The Lasso was introduced during lecture 1 as the MAP estimate when a Laplacian prior is assigned to the parameters. Alternatively, we can motivate the Lasso as the solution to

$$\begin{aligned} \min_{w} \quad & \sum_{n=1}^{N} \left(t_n - w^\mathsf{T} \varphi(x_n)\right)^2 \\ \text{s.t.} \quad & \sum_{j=0}^{M-1} |w_j| \leq \eta, \end{aligned}$$

which using a Lagrange multiplier $\lambda$ can be stated as

$$\min_{w}\; \sum_{n=1}^{N} \left(t_n - w^\mathsf{T} \varphi(x_n)\right)^2 + \lambda \sum_{j=0}^{M-1} |w_j|.$$

The difference to ridge regression is simply that the Lasso makes use of the $\ell_1$-norm $\sum_{j=0}^{M-1} |w_j|$, rather than the $\ell_2$-norm $\sum_{j=0}^{M-1} w_j^2$ used in ridge regression, in shrinking the parameters.
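Beyond generic convex solvers, the Lagrangian form of the Lasso can be solved directly with proximal gradient descent (ISTA), where the $\ell_1$ term turns into a soft-thresholding step; a sketch with synthetic data (the sparse true weights, noise level, and $\lambda$ are illustrative assumptions):

```python
import numpy as np

# ISTA for min_w ||T - Phi w||^2 + lambda ||w||_1: gradient step on the
# smooth quadratic, then soft-thresholding, which sets small coefficients
# exactly to zero (the sparsity the Lasso is known for).

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(Phi, T, lam, iters=2000):
    L = 2 * np.linalg.norm(Phi, 2) ** 2          # Lipschitz const. of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = 2 * Phi.T @ (Phi @ w - T)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(4)
N, M = 100, 10
Phi = rng.normal(0, 1, (N, M))
w_true = np.zeros(M)
w_true[[0, 3]] = [2.0, -1.5]                     # sparse ground truth
T = Phi @ w_true + rng.normal(0, 0.1, N)

w_lasso = lasso_ista(Phi, T, lam=5.0)
# Most coefficients are driven exactly to zero, unlike with ridge regression.
```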
## Graphical illustration of Lasso and RR (19/30)

[Figure: contours of the least squares cost function (LS estimate in the middle) for the Lasso and for Ridge Regression (RR). The constraint regions are shown in gray: $|w_0| + |w_1| \leq \eta$ (Lasso) and $w_0^2 + w_1^2 \leq \eta$ (RR).] The shape of the constraints motivates why the Lasso often leads to sparseness.

## Implementing Lasso (20/30)

The $\ell_1$-regularized least squares problem (lasso)

$$\min_{w}\; \|T - \Phi w\|_2^2 + \lambda \|w\|_1 \qquad (2)$$

can be solved using YALMIP:

```matlab
w = sdpvar(M, 1);
ops = sdpsettings('verbose', 0);
solvesdp([], (T - Phi*w)'*(T - Phi*w) + lambda*norm(w, 1), ops)
```

or using CVX:

```matlab
cvx_begin
    variable w(M)
    minimize((T - Phi*w)'*(T - Phi*w) + lambda*norm(w, 1))
cvx_end
```

There is also a MATLAB package dedicated to $\ell_1$-regularized least squares.

## Bayesian linear regression – example (I/VI) (21/30)

Consider the problem of fitting a straight line to noisy measurements. Let the model be ($t_n \in \mathbb{R}$, $x_n \in \mathbb{R}$)

$$t_n = \underbrace{w_0 + w_1 x_n}_{y(x_n, w)} + e_n, \qquad n = 1, \dots, N, \qquad (3)$$

where

$$e_n \sim \mathcal{N}(0, 0.2^2), \qquad \beta = \frac{1}{0.2^2} = 25.$$

According to (3), the following identity basis function is used: $\varphi_0(x_n) = 1$, $\varphi_1(x_n) = x_n$. The example lives in two dimensions, allowing us to plot the distributions when illustrating the inference.

## Bayesian linear regression – example (II/VI) (22/30)

Let the true values for $w$ be $w^\star = \begin{pmatrix} -0.3 & 0.5 \end{pmatrix}^\mathsf{T}$ (plotted using a white circle below). Generate synthetic measurements by

$$t_n = w_0^\star + w_1^\star x_n + e_n, \qquad e_n \sim \mathcal{N}(0, 0.2^2),$$

where $x_n \sim \mathcal{U}(-1, 1)$. Furthermore, let the prior be

$$p(w) = \mathcal{N}\left(w \,\middle|\, \begin{pmatrix} 0 & 0 \end{pmatrix}^\mathsf{T}, \alpha^{-1} I\right), \qquad \alpha = 2.$$
## Bayesian linear regression – example (III/VI) (23/30)

Plot of the situation before any data arrives. [Figure: the prior, and a few realizations from the posterior.]

Prior:

$$p(w) = \mathcal{N}\left(w \,\middle|\, \begin{pmatrix} 0 & 0 \end{pmatrix}^\mathsf{T}, \frac{1}{2} I\right).$$

## Bayesian linear regression – example (IV/VI) (24/30)

Plot of the situation after one measurement has arrived. [Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the first measurement (black circle).]

$$p(t_1 \mid w) = \mathcal{N}\left(t_1 \mid w_0 + w_1 x_1, \beta^{-1}\right),$$

$$p(w \mid t_1) = \mathcal{N}(w \mid m_1, S_1), \qquad m_1 = \beta S_1 \Phi^\mathsf{T} t_1, \qquad S_1 = \left(\alpha I + \beta \Phi^\mathsf{T} \Phi\right)^{-1}.$$
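The posterior update for this straight-line example is a few lines of linear algebra; a sketch reusing the slide's values $\alpha = 2$, $\beta = 25$, $w^\star = (-0.3, 0.5)^\mathsf{T}$ (the random seed and sample size are illustrative):

```python
import numpy as np

# Posterior for Bayesian linear regression: p(w | T) = N(w | m_N, S_N) with
# S_N = (alpha I + beta Phi^T Phi)^(-1) and m_N = beta S_N Phi^T T.

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
w_star = np.array([-0.3, 0.5])                   # true weights from the slides

N = 30
x = rng.uniform(-1, 1, N)                        # x_n ~ U(-1, 1)
Phi = np.column_stack([np.ones(N), x])           # phi(x) = (1, x)^T
T = Phi @ w_star + rng.normal(0, beta ** -0.5, N)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ T                     # posterior mean
```

Running this with `N = 1`, `N = 2`, and `N = 30` reproduces the sequence of posteriors shown on the slides: the posterior concentrates around $w^\star$ as measurements accumulate.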
## Bayesian linear regression – example (V/VI) (25/30)

Plot of the situation after two measurements have arrived. [Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

$$p(t_2 \mid w) = \mathcal{N}\left(t_2 \mid w_0 + w_1 x_2, \beta^{-1}\right),$$

$$p(w \mid T) = \mathcal{N}(w \mid m_2, S_2), \qquad m_2 = \beta S_2 \Phi^\mathsf{T} T, \qquad S_2 = \left(\alpha I + \beta \Phi^\mathsf{T} \Phi\right)^{-1}.$$

## Bayesian linear regression – example (VI/VI) (26/30)

Plot of the situation after 30 measurements have arrived. [Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

$$p(t_{30} \mid w) = \mathcal{N}\left(t_{30} \mid w_0 + w_1 x_{30}, \beta^{-1}\right),$$

$$p(w \mid T) = \mathcal{N}(w \mid m_{30}, S_{30}), \qquad m_{30} = \beta S_{30} \Phi^\mathsf{T} T, \qquad S_{30} = \left(\alpha I + \beta \Phi^\mathsf{T} \Phi\right)^{-1}.$$
## Empirical Bayes (27/30)

Important question: How do we decide on suitable values for the hyperparameters $\eta$?

Idea: Estimate the hyperparameters from the data by selecting them such that they maximize the marginal likelihood function,

$$p(T \mid \eta) = \int p(T \mid w, \eta)\, p(w \mid \eta)\, \mathrm{d}w,$$

where $\eta$ denotes the hyperparameters to be estimated.

This approach travels under many names: besides empirical Bayes, it is also referred to as type 2 maximum likelihood, generalized maximum likelihood, and evidence approximation.

Empirical Bayes combines the two statistical philosophies: frequentist ideas are used to estimate the hyperparameters, which are then used within the Bayesian inference.

## Predictive distribution – example (28/30)

Investigating the predictive distribution for the example above. [Figure: three panels for $N = 2$, $N = 5$, and $N = 200$ observations, each showing]

- the true system ($y(x) = -0.3 + 0.5x$) generating the data (red line),
- the mean of the predictive distribution (blue line),
- one standard deviation of the predictive distribution (gray shaded area); note that this is the point-wise predictive standard deviation as a function of $x$,
- the observations (black circles).
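For the linear-Gaussian model the marginal likelihood is available in closed form, so the empirical Bayes idea can be sketched as a grid search over the prior precision $\alpha$ (with $\beta$ assumed known); the data, grid, and seed are illustrative assumptions:

```python
import numpy as np

# Empirical Bayes: choose alpha by maximizing the log marginal likelihood
# ln p(T | alpha, beta) of the linear-Gaussian regression model,
# ln p(T | alpha, beta) = (M/2) ln alpha + (N/2) ln beta - E(m)
#                         - (1/2) ln det A - (N/2) ln 2 pi,
# with A = alpha I + beta Phi^T Phi, m = beta A^(-1) Phi^T T, and
# E(m) = (beta/2) ||T - Phi m||^2 + (alpha/2) m^T m.

def log_evidence(Phi, T, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(A, Phi.T @ T)
    E = beta / 2 * np.sum((T - Phi @ m) ** 2) + alpha / 2 * m @ m
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))

rng = np.random.default_rng(6)
beta, w_star = 25.0, np.array([-0.3, 0.5])
x = rng.uniform(-1, 1, 40)
Phi = np.column_stack([np.ones_like(x), x])
T = Phi @ w_star + rng.normal(0, beta ** -0.5, 40)

grid = np.logspace(-3, 3, 61)
alpha_hat = grid[np.argmax([log_evidence(Phi, T, a, beta) for a in grid])]
```

In practice one would also treat $\beta$ as a hyperparameter and maximize over both, or use the fixed-point iterations for the evidence maximum instead of a grid.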
## Posterior distribution (29/30)

Recall that the posterior distribution is given by

$$p(w \mid T) = \mathcal{N}(w \mid m_N, S_N),$$

where

$$m_N = \beta S_N \Phi^\mathsf{T} T, \qquad S_N = \left(\alpha I + \beta \Phi^\mathsf{T} \Phi\right)^{-1}.$$

Let us now investigate the posterior mean solution $m_N$, which has an interpretation that leads directly to the kernel methods (lecture 5), including Gaussian processes.

## A few concepts to summarize lecture 2 (30/30)

Linear regression: Models the relationship between a continuous target variable $t$ and a possibly nonlinear function $\varphi(x)$ of the input variables.

Hyperparameter: A parameter of the prior distribution that controls the distribution of the parameters of the model.

Maximum a Posteriori (MAP): A point estimate obtained by maximizing the posterior distribution. Corresponds to a mode of the posterior distribution.

Gauss-Markov theorem: States that in a linear regression model, the best (in the sense of minimum MSE) linear unbiased estimate (BLUE) is given by the least squares estimate.

Ridge regression: An $\ell_2$-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. A.k.a. Tikhonov regularization.

Lasso: An $\ell_1$-regularized least squares problem used to solve the linear regression problem, resulting in potentially biased estimates. The Lasso typically produces sparse estimates.
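The kernel interpretation of the posterior mean can be made concrete: the prediction $\varphi(x)^\mathsf{T} m_N$ equals a weighted sum of the training targets, $\sum_n k(x, x_n) t_n$ with $k(x, x_n) = \beta\, \varphi(x)^\mathsf{T} S_N \varphi(x_n)$, the so-called equivalent kernel. A sketch on synthetic straight-line data (seed and sample size are illustrative):

```python
import numpy as np

# Equivalent-kernel view of the posterior mean: phi(x)^T m_N equals
# sum_n k(x, x_n) t_n with k(x, x_n) = beta phi(x)^T S_N phi(x_n).

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 20)
Phi = np.column_stack([np.ones_like(x), x])      # phi(x) = (1, x)^T
T = Phi @ np.array([-0.3, 0.5]) + rng.normal(0, beta ** -0.5, 20)

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ T                     # posterior mean

x_new = 0.3
phi_new = np.array([1.0, x_new])
k = beta * (phi_new @ S_N) @ Phi.T               # k(x_new, x_n) for all n
y_direct = phi_new @ m_N                         # parametric prediction
y_kernel = k @ T                                 # same value, kernel form
```

The prediction is expressed entirely through inner products with the training points, which is exactly the step that leads to kernel methods and Gaussian processes in lecture 5.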