Machine Learning, Lecture 2
Linear Regression

Thomas Schön
Division of Automatic Control
Linköping University, Linköping, Sweden
Email: schon@isy.liu.se, Phone: 013 - 281373, Office: House B, Entrance 27

"it is our firm belief that an understanding of linear models
is essential for understanding nonlinear ones"

Outline

1. Summary of lecture 1
2. Linear basis function models
3. Maximum likelihood and least squares
4. Bias-variance trade-off
5. Shrinkage methods
   • Ridge regression
   • LASSO
6. Bayesian linear regression
7. Motivation of kernel methods

AUTOMATIC CONTROL
Machine Learning
REGLERTEKNIK
LINKÖPINGS UNIVERSITET
T. Schön

Summary of lecture 1 (I/III)

The exponential family of distributions over x, parameterized by η,

    p(x | η) = h(x) g(η) exp(η^T u(x)).

One important member is the Gaussian density, which is commonly
used as a building block in more sophisticated models. Important
basic properties were provided.

The idea underlying maximum likelihood is that the parameters θ
should be chosen in such a way that the measurements {x_i}_{i=1}^N are
as likely as possible, i.e.,

    θ̂ = argmax_θ p(x_1, ..., x_N | θ).

Summary of lecture 1 (II/III)

The three basic steps of Bayesian modeling (where all variables are
modeled as stochastic):

1. Assign prior distributions p(θ) to all unknown parameters θ.
2. Write down the likelihood p(x_1, ..., x_N | θ) of the data
   x_1, ..., x_N given the parameters θ.
3. Determine the posterior distribution of the parameters given
   the data,

       p(θ | x_1, ..., x_N) = p(x_1, ..., x_N | θ) p(θ) / p(x_1, ..., x_N)
                            ∝ p(x_1, ..., x_N | θ) p(θ).

If the posterior p(θ | x_1, ..., x_N) and the prior p(θ) distributions are
of the same functional form they are conjugate distributions and
the prior is said to be a conjugate prior for the likelihood.

Summary of lecture 1 (III/III)

Modeling "heavy tails" using the Student's t-distribution

    St(x | μ, λ, ν) = ∫ N(x | μ, (ηλ)^{-1}) Gam(η | ν/2, ν/2) dη

                    = [Γ(ν/2 + 1/2) / Γ(ν/2)] (λ/(πν))^{1/2} [1 + λ(x−μ)²/ν]^{−ν/2 − 1/2},

which, according to the first expression, can be interpreted as an
infinite mixture of Gaussians with the same mean but different variances.

(Figure: −log Student and −log Gaussian plotted for x ∈ [−15, 15].)

Poor robustness is due to an unrealistic model; the ML estimator is
inherently robust, provided we have the correct model.

Commonly used basis functions

Using nonlinear basis functions, y(x, w) can be a nonlinear
function of the input variable x (while still being linear in w).

• Global basis functions (in the sense that a small change in x
  affects all basis functions):
  1. Polynomial (see the illustrative example in Section 1.1), e.g.
     the identity φ(x) = x.

• Local basis functions (in the sense that a small change in x only
  affects the nearby basis functions):
  1. Gaussian
  2. Sigmoidal
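The global/local distinction can be made concrete in code. Below is a minimal numpy sketch (in Python rather than the course's MATLAB) that builds a design matrix Φ from local Gaussian basis functions plus a bias column; the centers and width are illustrative choices, not values from the lecture.

```python
import numpy as np

def gaussian_design_matrix(x, centers, s):
    """Build Phi with Phi[n, 0] = 1 (bias column) and
    Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)) for the remaining columns."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # N x 1
    mu = np.asarray(centers, dtype=float).reshape(1, -1)   # 1 x (M-1)
    phi = np.exp(-(x - mu) ** 2 / (2.0 * s ** 2))          # local basis functions
    return np.hstack([np.ones((x.shape[0], 1)), phi])      # prepend bias column

x = np.linspace(-1, 1, 5)
Phi = gaussian_design_matrix(x, centers=[-0.5, 0.0, 0.5], s=0.3)
print(Phi.shape)  # (5, 4): N = 5 rows, M = 4 basis functions
```

Each row of Φ is the vector φ(x_n) used throughout the lecture; a small change in x only changes the entries of the columns whose centers are nearby, which is what "local" means here.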

Linear regression model on matrix form

It is often convenient to write the linear regression model

    t_n = w^T φ(x_n) + e_n,    n = 1, ..., N,

where w = (w_0  w_1  ...  w_{M−1})^T and
φ_n = (1  φ_1(x_n)  ...  φ_{M−1}(x_n))^T, on matrix form

    T = Φw + E,

where

    T = (t_1  t_2  ...  t_N)^T,    E = (e_1  e_2  ...  e_N)^T,

and

        ( φ_0(x_1)  φ_1(x_1)  ...  φ_{M−1}(x_1) )
    Φ = ( φ_0(x_2)  φ_1(x_2)  ...  φ_{M−1}(x_2) )
        (    ...       ...    ...       ...     )
        ( φ_0(x_N)  φ_1(x_N)  ...  φ_{M−1}(x_N) )

Maximum likelihood and least squares (I/IV)

In our linear regression model,

    t_n = w^T φ(x_n) + e_n,

assume that e_n ∼ N(0, β^{-1}) (i.i.d.). This results in the following
likelihood function

    p(t_n | w, β) = N(t_n | w^T φ(x_n), β^{-1}).

Note that this is a slight abuse of notation; p_{w,β}(t_n) or p(t_n; w, β)
would have been better, since w and β are both considered deterministic
parameters in ML.

Maximum likelihood and least squares (II/IV)

The available training data consists of N input variables
X = {x_i}_{i=1}^N and the corresponding target variables T = {t_i}_{i=1}^N.

According to our assumption on the noise, the likelihood function is
given by

    p(T | w, β) = ∏_{n=1}^N p(t_n | w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{-1}),

which results in the following log-likelihood function

    L(w, β) ≜ ln p(t_1, ..., t_N | w, β) = ∑_{n=1}^N ln N(t_n | w^T φ(x_n), β^{-1})

            = (N/2) ln β − (N/2) ln(2π) − (β/2) ∑_{n=1}^N (t_n − w^T φ(x_n))².

Maximum likelihood and least squares (III/IV)

The maximum likelihood problem now amounts to solving

    argmax_{w,β} L(w, β).

Setting the derivative ∂L/∂w = β ∑_{n=1}^N (t_n − w^T φ(x_n)) φ(x_n)^T
equal to 0 gives the following ML estimate for w,

    ŵ_ML = (Φ^T Φ)^{-1} Φ^T T,

where

        ( φ_0(x_1)  φ_1(x_1)  ...  φ_{M−1}(x_1) )
    Φ = ( φ_0(x_2)  φ_1(x_2)  ...  φ_{M−1}(x_2) )
        (    ...       ...    ...       ...     )
        ( φ_0(x_N)  φ_1(x_N)  ...  φ_{M−1}(x_N) )

Note that if Φ^T Φ is singular (or close to singular) we can fix this
by adding λI, i.e.,

    ŵ_RR = (Φ^T Φ + λI)^{-1} Φ^T T.
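The closed-form estimates above are two lines of numpy. This is a hedged sketch on synthetic data (the true parameters and noise level are borrowed from the Bayesian example later in the lecture, purely for illustration); it also computes the ML noise precision derived on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model t = w0 + w1*x + noise
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
w_true = np.array([-0.3, 0.5])
beta = 25.0                                    # noise precision, std = 0.2
t = w_true[0] + w_true[1] * x + rng.normal(0.0, beta ** -0.5, size=N)

Phi = np.column_stack([np.ones(N), x])         # identity basis: phi = (1, x)

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T T
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Plug w_ML back in: 1/beta_ML is the mean squared residual
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
print(w_ml, beta_ml)
```

With enough data, w_ml recovers (−0.3, 0.5) and beta_ml lands near the true precision 25. In practice `np.linalg.lstsq(Phi, t)` is preferred over forming Φ^TΦ explicitly, for exactly the near-singularity reason noted above.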

Maximum likelihood and least squares (IV/IV)

Maximizing the log-likelihood function L(w, β) w.r.t. β results in the
following estimate for β,

    1/β̂_ML = (1/N) ∑_{n=1}^N (t_n − ŵ_ML^T φ(x_n))².

Finally, note that if we are only interested in w, the log-likelihood
function is, up to constants, proportional to

    −∑_{n=1}^N (t_n − w^T φ(x_n))²,

which clearly shows that assuming a Gaussian noise model and making use
of Maximum Likelihood (ML) corresponds to a Least Squares (LS) problem.

Interpretation of the Gauss-Markov theorem

The least squares estimator has the smallest mean square error (MSE)
of all linear estimators with no bias, BUT there may exist a biased
estimator with lower MSE.

"the restriction to unbiased estimates is not necessarily a wise one"
[HTF, page 51]

Two classes of potentially biased estimators: 1. subset selection
methods and 2. shrinkage methods.

This is intimately connected to the bias-variance trade-off.

• We will give a system identification example related to ridge
  regression to illustrate the bias-variance trade-off.
• See Section 3.2 for a slightly more abstract (but very informative)
  account of the bias-variance trade-off. (This is a perfect topic for
  discussions during the exercise sessions!)

Interpretation of RR using the SVD of Φ

By studying the SVD of Φ it can be shown that ridge regression
projects the measurements onto the principal components of Φ and
then shrinks the coefficients of low-variance components more than
the coefficients of high-variance components.

(See Section 3.4.1 in HTF for details.)

Bias-variance tradeoff – example (I/IV)

(Ex. 2.3 in Henrik Ohlsson's PhD thesis) Consider a SISO system

    y_t = ∑_{k=1}^{n_0} g_k^0 u_{t−k} + e_t,                       (1)

where u_t denotes the input, y_t denotes the output, e_t denotes white
noise (E(e_t) = 0 and E(e_t e_s) = σ² δ(t−s)) and {g_k^0}_{k=1}^{n_0}
denotes the impulse response of the system.

Recall that the impulse response is the output y_t when u_t = δ(t) is
used in (1), which results in

    y_t = { g_t^0 + e_t,   t = 1, ..., n_0,
          { e_t,           t > n_0.

Bias-variance tradeoff – example (II/IV)

The task is now to estimate the impulse response using an n-th order
FIR model,

    y_t = w^T φ_t + e_t,

where

    φ_t = (u_{t−1}  ...  u_{t−n})^T,    w ∈ R^n.

Let us use Ridge Regression (RR),

    ŵ_RR = argmin_w ||Y − Φw||₂² + λ w^T w,

to find the parameters w.

Bias-variance tradeoff – example (III/IV)

(Figure: squared bias (gray line), variance (dashed line) and MSE
(black line), plotted as functions of λ ∈ [0, 10].)

    Squared bias:  (E_ŵ(ŵ^T φ_*) − w_0^T φ_*)²

    Variance:      E_ŵ((ŵ^T φ_* − E_ŵ(ŵ^T φ_*))²)

    MSE = (bias)² + variance
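The qualitative behavior in the figure can be reproduced with a small Monte Carlo experiment: for each λ, refit the ridge estimator on many fresh noise realizations and estimate the squared bias and variance of ŵ^T φ_* directly. A sketch under illustrative assumptions (the dimensions, noise level, true parameters and the two λ values are arbitrary, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

n, N, sigma = 5, 40, 1.0
w0 = np.ones(n)                          # "true" parameters (illustrative)
phi_star = np.ones(n)                    # fixed test regressor
U = rng.normal(size=(N, n))              # fixed regressor matrix (past inputs)

def mc_bias_variance(lam, reps=2000):
    """Monte Carlo estimates of squared bias and variance of w_hat^T phi_star."""
    A = np.linalg.solve(U.T @ U + lam * np.eye(n), U.T)   # RR estimator matrix
    preds = np.empty(reps)
    for r in range(reps):
        y = U @ w0 + sigma * rng.normal(size=N)           # fresh noise each rep
        preds[r] = (A @ y) @ phi_star                     # prediction at phi_star
    target = w0 @ phi_star
    return (preds.mean() - target) ** 2, preds.var()

bias2_ls, var_ls = mc_bias_variance(0.0)   # least squares: unbiased
bias2_rr, var_rr = mc_bias_variance(9.0)   # heavy shrinkage: biased, lower variance
print(bias2_ls, var_ls, bias2_rr, var_rr)
```

As in the figure, increasing λ trades squared bias (up) against variance (down); the λ minimizing their sum is strictly positive here.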

Bias-variance tradeoff – example (IV/IV)

"Flexible" models will have low bias and high variance, and more
"restricted" models will have high bias and low variance.

The model with the best predictive capabilities is the one which
strikes the best tradeoff between bias and variance.

Recent contributions on impulse response identification using
regularization:

• Gianluigi Pillonetto and Giuseppe De Nicolao. A new kernel-based
  approach for linear system identification. Automatica, 46(1):81–93,
  January 2010.
• Tianshi Chen, Henrik Ohlsson and Lennart Ljung. On the estimation of
  transfer functions, regularizations and Gaussian processes –
  Revisited. Automatica, 48(8):1525–1535, August 2012.

Lasso

The Lasso was introduced during lecture 1 as the MAP estimate when a
Laplacian prior is assigned to the parameters. Alternatively, we can
motivate the Lasso as the solution to

    min_w  ∑_{n=1}^N (t_n − w^T φ(x_n))²
    s.t.   ∑_{j=0}^{M−1} |w_j| ≤ η,

which using a Lagrange multiplier λ can be stated as

    min_w  ∑_{n=1}^N (t_n − w^T φ(x_n))² + λ ∑_{j=0}^{M−1} |w_j|.

The difference to ridge regression is simply that the Lasso makes use
of the ℓ1-norm ∑_{j=0}^{M−1} |w_j|, rather than the ℓ2-norm
∑_{j=0}^{M−1} w_j² used in ridge regression, to shrink the parameters.

Graphical illustration of Lasso and RR

(Figure: the Lasso constraint region (left) and the Ridge Regression
(RR) constraint region (right), in the (w_0, w_1) plane.)

The circles are contours of the least squares cost function (LS
estimate in the middle). The constraint regions are shown in gray:
|w_0| + |w_1| ≤ η (Lasso) and w_0² + w_1² ≤ η (RR). The shape of the
constraints motivates why Lasso often leads to sparseness.

Implementing Lasso

The ℓ1-regularized least squares problem (lasso)

    min_w ||T − Φw||₂² + λ ||w||₁                                  (2)

YALMIP code solving (2). Download: http://users.isy.liu.se/johanl/yalmip/

    w = sdpvar(M,1);
    ops = sdpsettings('verbose',0);
    solvesdp([], (T-Phi*w)'*(T-Phi*w) + lambda*norm(w,1), ops)

CVX code solving (2). Download: http://cvxr.com/cvx/

    cvx_begin
        variable w(M)
        minimize( (T-Phi*w)'*(T-Phi*w) + lambda*norm(w,1) )
    cvx_end

A MATLAB package dedicated to ℓ1-regularized least squares problems is
l1_ls. Download: http://www.stanford.edu/~boyd/l1_ls/
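If no convex-optimization toolbox is at hand, (2) can also be solved with a few lines of proximal gradient descent (ISTA: a gradient step on the quadratic term followed by soft thresholding). A hedged numpy sketch, not one of the packages referenced above; the data and λ are illustrative.

```python
import numpy as np

def lasso_ista(Phi, T, lam, iters=5000):
    """Minimize ||T - Phi w||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2       # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = 2.0 * Phi.T @ (Phi @ w - T)      # gradient of the quadratic term
        z = w - grad / L                        # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft thresholding
    return w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -3.0]                    # sparse ground truth
T = Phi @ w_true + 0.01 * rng.normal(size=100)
w_hat = lasso_ista(Phi, T, lam=2.0)
print(np.round(w_hat, 2))                       # most entries (close to) zero
```

The soft-thresholding step is what produces exact zeros, which is the sparseness property motivated geometrically above.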

Bayesian linear regression – example (I/VI)

Consider the problem of fitting a straight line to noisy measurements.
Let the model be (t_n ∈ R, x_n ∈ R)

    t_n = w_0 + w_1 x_n + e_n,    n = 1, ..., N,                   (3)

where the deterministic part w_0 + w_1 x_n is denoted y(x_n, w), and

    e_n ∼ N(0, 0.2²),    β = 1/0.2² = 25.

According to (3), the following identity basis function is used:
φ_0(x_n) = 1, φ_1(x_n) = x_n.

The example lives in two dimensions, allowing us to plot the
distributions to illustrate the inference.

Bayesian linear regression – example (II/VI)

Let the true values for w be w* = (−0.3  0.5)^T (plotted using a
white circle below).

Generate synthetic measurements by

    t_n = w_0* + w_1* x_n + e_n,    e_n ∼ N(0, 0.2²),

where x_n ∼ U(−1, 1).

Furthermore, let the prior be

    p(w) = N(w | (0  0)^T, α^{-1} I),

where α = 2.

Bayesian linear regression – example (III/VI)

Plot of the situation before any data arrives.

(Figure: the prior over w (left) and a few realizations from the
posterior, which before any data equals the prior (right).)

Prior:

    p(w) = N(w | (0  0)^T, (1/2) I)

Bayesian linear regression – example (IV/VI)

Plot of the situation after one measurement has arrived.

(Figure: the likelihood plotted as a function of w (left), the
posterior/prior (middle), and a few realizations from the posterior
together with the first measurement, shown as a black circle (right).)

Likelihood:

    p(t_1 | w) = N(t_1 | w_0 + w_1 x_1, β^{-1})

Posterior:

    p(w | t_1) = N(w | m_1, S_1),
    m_1 = β S_1 Φ^T t_1,
    S_1 = (αI + β Φ^T Φ)^{-1}.

Bayesian linear regression – example (V/VI)

Plot of the situation after two measurements have arrived.

(Figure: the likelihood plotted as a function of w, the posterior/prior,
and a few realizations from the posterior together with the
measurements, shown as black circles.)

    p(t_2 | w) = N(t_2 | w_0 + w_1 x_2, β^{-1})

    p(w | T) = N(w | m_2, S_2),
    m_2 = β S_2 Φ^T T,
    S_2 = (αI + β Φ^T Φ)^{-1}.

Bayesian linear regression – example (VI/VI)

Plot of the situation after 30 measurements have arrived.

(Figure: the likelihood plotted as a function of w, the posterior/prior,
and a few realizations from the posterior together with the
measurements, shown as black circles.)

    p(t_30 | w) = N(t_30 | w_0 + w_1 x_30, β^{-1})

    p(w | T) = N(w | m_30, S_30),
    m_30 = β S_30 Φ^T T,
    S_30 = (αI + β Φ^T Φ)^{-1}.
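The posterior updates in this example are two lines of linear algebra. A numpy sketch reproducing the setup (α = 2, β = 25, true w = (−0.3, 0.5), x_n ∼ U(−1, 1)); the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, beta = 2.0, 25.0                  # prior precision, noise precision
w_true = np.array([-0.3, 0.5])

N = 30
x = rng.uniform(-1.0, 1.0, size=N)
t = w_true[0] + w_true[1] * x + rng.normal(0.0, beta ** -0.5, size=N)
Phi = np.column_stack([np.ones(N), x])   # phi_0(x) = 1, phi_1(x) = x

# Posterior p(w | T) = N(w | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
print(m_N)           # concentrates near (-0.3, 0.5) after 30 points
print(np.diag(S_N))  # posterior variances shrink as N grows
```

Running the same two lines with the first 1, 2, ... measurements reproduces the sequence of posteriors plotted above.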

Empirical Bayes

Important question: How do we decide on suitable values for the
hyperparameters η?

Idea: Estimate the hyperparameters from the data by selecting them
such that they maximize the marginal likelihood function,

    p(T | η) = ∫ p(T | w, η) p(w | η) dw,

where η denotes the hyperparameters to be estimated.

This travels under many names; besides empirical Bayes, it is also
referred to as type 2 maximum likelihood, generalized maximum
likelihood, and evidence approximation.

Empirical Bayes combines the two statistical philosophies:
frequentist ideas are used to estimate the hyperparameters, which are
then used within the Bayesian inference.

Predictive distribution – example

Investigating the predictive distribution for the example above.

(Figure: three panels showing the predictive distribution for N = 2,
N = 5 and N = 200 observations.)

• True system (y(x) = −0.3 + 0.5x) generating the data (red line)
• Mean of the predictive distribution (blue line)
• One standard deviation of the predictive distribution (gray shaded
  area). Note that this is the point-wise predictive standard
  deviation as a function of x.
• Observations (black circles)
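For this linear-Gaussian model the marginal likelihood is available in closed form: integrating out w in T = Φw + E gives p(T | α) = N(T | 0, β^{-1} I + α^{-1} Φ Φ^T). A hedged sketch that evaluates the log evidence on a grid of α values and picks the maximizer; the grid, seed and N are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)

beta = 25.0
w_true = np.array([-0.3, 0.5])
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
t = w_true[0] + w_true[1] * x + rng.normal(0.0, beta ** -0.5, size=N)
Phi = np.column_stack([np.ones(N), x])

def log_evidence(alpha):
    """log p(T | alpha, beta): marginalizing w in T = Phi w + E gives a
    zero-mean Gaussian with covariance C = beta^{-1} I + alpha^{-1} Phi Phi^T."""
    C = np.eye(N) / beta + Phi @ Phi.T / alpha
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

alphas = np.logspace(-2, 3, 60)
best = alphas[np.argmax([log_evidence(a) for a in alphas])]
print(best)  # evidence-maximizing alpha (the type 2 ML estimate)
```

In practice one maximizes over α (and β) with a gradient method or the fixed-point re-estimation equations rather than a grid, but the grid makes the "maximize the marginal likelihood" idea explicit.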

Posterior distribution

Recall that the posterior distribution is given by

    p(w | T) = N(w | m_N, S_N),

where

    m_N = β S_N Φ^T T,
    S_N = (αI + β Φ^T Φ)^{-1}.

Let us now investigate the posterior mean solution m_N, which has an
interpretation that directly leads to the kernel methods (lecture 5),
including Gaussian processes.

A few concepts to summarize lecture 2

Linear regression: Models the relationship between a continuous target
variable t and a possibly nonlinear function φ(x) of the input variables.

Hyperparameter: A parameter of the prior distribution that controls the
distribution of the parameters of the model.

Maximum a Posteriori (MAP): A point estimate obtained by maximizing the
posterior distribution. Corresponds to a mode of the posterior
distribution.

Gauss-Markov theorem: States that in a linear regression model, the best
(in the sense of minimum MSE) linear unbiased estimate (BLUE) is given
by the least squares estimate.

Ridge regression: An ℓ2-regularized least squares problem used to solve
the linear regression problem, resulting in potentially biased
estimates. A.k.a. Tikhonov regularization.

Lasso: An ℓ1-regularized least squares problem used to solve the linear
regression problem, resulting in potentially biased estimates. The Lasso
typically produces sparse estimates.
