Machine Learning

7. Support Vector Machines (SVMs)

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL)

University of Hildesheim, Germany

http://www.ismll.uni-hildesheim.de

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany,

Course on Machine Learning, winter term 2006 1/56

Machine Learning

1. Separating Hyperplanes

2. Perceptron

3. Maximum Margin Separating Hyperplanes

4. Digression: Quadratic Optimization

5. Non-separable Problems

6. Support Vectors and Kernels

7. Support Vector Regression

Machine Learning / 1. Separating Hyperplanes

Separating Hyperplanes

Logistic Regression:

Linear Discriminant Analysis (LDA):


Machine Learning / 1. Separating Hyperplanes

Hyperplanes

Hyperplanes can be modeled explicitly as

$H_{\beta,\beta_0} := \{ x \mid \langle\beta, x\rangle = -\beta_0 \}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

We will write $H_\beta$ shortly for $H_{\beta,\beta_0}$ (although $\beta_0$ is very relevant!).

For any two points $x, x' \in H_\beta$ we have

$\langle\beta, x - x'\rangle = \langle\beta, x\rangle - \langle\beta, x'\rangle = -\beta_0 + \beta_0 = 0$

thus $\beta$ is orthogonal to all translation vectors in $H_\beta$, and thus $\beta/\|\beta\|$ is the normal vector of $H_\beta$.

Machine Learning / 1. Separating Hyperplanes

Hyperplanes

The projection of a point $x \in \mathbb{R}^p$ onto $H_\beta$, i.e., the closest point on $H_\beta$ to $x$, is given by

$\pi_{H_\beta}(x) := x - \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\beta$

Proof:
(i) $\pi x := \pi_{H_\beta}(x) \in H_\beta$:

$\langle\beta, \pi_{H_\beta}(x)\rangle = \Big\langle\beta,\ x - \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\beta\Big\rangle = \langle\beta, x\rangle - \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\langle\beta, \beta\rangle = -\beta_0$

(ii) $\pi_{H_\beta}(x)$ is the closest such point to $x$: for any other point $x' \in H_\beta$:

$\|x - x'\|^2 = \langle x - x', x - x'\rangle = \langle x - \pi x + \pi x - x',\ x - \pi x + \pi x - x'\rangle$
$\qquad = \langle x - \pi x, x - \pi x\rangle + 2\langle x - \pi x, \pi x - x'\rangle + \langle\pi x - x', \pi x - x'\rangle$
$\qquad = \|x - \pi x\|^2 + 0 + \|\pi x - x'\|^2$

as $x - \pi x$ is proportional to $\beta$ and $\pi x$ and $x'$ are on $H_\beta$.


Machine Learning / 1. Separating Hyperplanes

Hyperplanes

The signed distance of a point $x \in \mathbb{R}^p$ to $H_\beta$ is given by

$\frac{\langle\beta, x\rangle + \beta_0}{\|\beta\|}$

Proof:

$x - \pi x = \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\beta$

Therefore

$\|x - \pi x\|^2 = \Big\langle \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\beta,\ \frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\,\beta \Big\rangle = \Big(\frac{\langle\beta, x\rangle + \beta_0}{\langle\beta, \beta\rangle}\Big)^2 \langle\beta, \beta\rangle$

$\|x - \pi x\| = \frac{|\langle\beta, x\rangle + \beta_0|}{\|\beta\|}$
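The projection and signed-distance formulas above can be checked numerically. A minimal sketch in NumPy; the hyperplane parameters $\beta$, $\beta_0$ and the point $x$ below are arbitrary illustration values, not taken from the slides:

```python
import numpy as np

def project(beta, beta0, x):
    """Project x onto the hyperplane H = {x | <beta, x> = -beta0}."""
    return x - (beta @ x + beta0) / (beta @ beta) * beta

def signed_distance(beta, beta0, x):
    """Signed distance of x to the hyperplane H_beta."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

beta = np.array([0.5, 0.5])
beta0 = -2.0
x = np.array([4.0, 3.0])

px = project(beta, beta0, x)
# The projection lies on the hyperplane: <beta, px> = -beta0.
assert np.isclose(beta @ px, -beta0)
# The distance from x to its projection equals |signed distance|.
assert np.isclose(np.linalg.norm(x - px), abs(signed_distance(beta, beta0, x)))
```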

Machine Learning / 1. Separating Hyperplanes

Separating Hyperplanes

For given data

$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

with a binary class label $Y \in \{-1, +1\}$, a hyperplane $H_\beta$ is called separating if

$y_i\, h(x_i) > 0, \quad i = 1,\ldots,n, \qquad \text{with } h(x) := \langle\beta, x\rangle + \beta_0$


Machine Learning / 1. Separating Hyperplanes

Linearly Separable Data

The data is called linearly separable if there exists such a separating hyperplane.

In general, if there is one, there are many.

If there is a choice, we need a criterion to narrow down which one we want / which one is best.


Machine Learning / 2. Perceptron

Perceptron as Linear Model

Perceptron is another name for a linear binary classification model (Rosenblatt 1958):

$Y(X) = \operatorname{sign} h(X)$, with $\operatorname{sign} x = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$

$h(X) = \beta_0 + \langle\beta, X\rangle$

that is very similar to the logistic regression model

$Y(X) = \operatorname{argmax}_y\, p(Y = y \mid X)$

$p(Y = +1 \mid X) = \operatorname{logistic}(\langle X, \beta\rangle) = \frac{e^{\sum_{i=1}^n \beta_i X_i}}{1 + e^{\sum_{i=1}^n \beta_i X_i}}$

$p(Y = -1 \mid X) = 1 - p(Y = +1 \mid X)$

as well as to linear discriminant analysis (LDA).

The perceptron provides just class labels $\hat{y}(x)$ and unscaled certainty factors $\hat{h}(x)$, but no class probabilities $\hat{p}(Y \mid X)$.

Machine Learning / 2. Perceptron

Perceptron as Linear Model

The perceptron provides just class labels $\hat{y}(x)$ and unscaled certainty factors $\hat{h}(x)$, but no class probabilities $\hat{p}(Y \mid X)$. Therefore, probabilistic fit/error criteria such as maximum likelihood cannot be applied.

For perceptrons, the sum of the certainty factors of the misclassified points is used as error criterion:

$q(\beta, \beta_0) := \sum_{i=1,\ \hat{y}_i \neq y_i}^n |h_\beta(x_i)| = -\sum_{i=1,\ \hat{y}_i \neq y_i}^n y_i\, h_\beta(x_i)$


Machine Learning / 2. Perceptron

Perceptron as Linear Model

For learning, gradient descent is used:

$\frac{\partial q(\beta, \beta_0)}{\partial\beta} = -\sum_{i=1,\ \hat{y}_i \neq y_i}^n y_i x_i$

$\frac{\partial q(\beta, \beta_0)}{\partial\beta_0} = -\sum_{i=1,\ \hat{y}_i \neq y_i}^n y_i$

Instead of looking at all points at the same time, stochastic gradient descent is applied, where all points are looked at sequentially (in a random sequence). The update for a single point $(x_i, y_i)$ then is

$\hat\beta^{(k+1)} := \hat\beta^{(k)} + \alpha\, y_i x_i$

$\hat\beta_0^{(k+1)} := \hat\beta_0^{(k)} + \alpha\, y_i$

with a step length $\alpha$ (often called learning rate).

Machine Learning / 2. Perceptron

Perceptron Learning Algorithm

learn-perceptron(training data X, step length α):
    β̂ := a random vector
    β̂₀ := a random value
    do
        errors := 0
        for (x, y) ∈ X (in random order) do
            if y(β̂₀ + ⟨β̂, x⟩) ≤ 0
                errors := errors + 1
                β̂ := β̂ + α y x
                β̂₀ := β̂₀ + α y
            fi
        od
    while errors > 0
    return (β̂, β̂₀)
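The pseudocode above translates almost line by line into NumPy. A minimal sketch; the `max_epochs` guard is an addition not in the slides (the original loop would never terminate on non-separable data), and the three-point dataset is just an illustration:

```python
import numpy as np

def learn_perceptron(X, y, alpha=1.0, seed=None, max_epochs=1000):
    """Perceptron learning: cycle through the data in random order and
    update (beta, beta0) for every misclassified point."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = rng.normal(size=p)      # random starting values
    beta0 = rng.normal()
    for _ in range(max_epochs):    # guard against non-separable data
        errors = 0
        for i in rng.permutation(n):
            if y[i] * (beta0 + beta @ X[i]) <= 0:
                errors += 1
                beta = beta + alpha * y[i] * X[i]
                beta0 = beta0 + alpha * y[i]
        if errors == 0:
            break
    return beta, beta0

X = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0])
beta, beta0 = learn_perceptron(X, y, seed=0)
assert np.all(y * (beta0 + X @ beta) > 0)   # a separating hyperplane was found
```

Which separating hyperplane comes out depends on the random starting values and the visit order, exactly as the next slide discusses.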


Machine Learning / 2. Perceptron

Perceptron Learning Algorithm: Properties

For linearly separable data the perceptron learning algorithm can be shown to converge: it finds a separating hyperplane in a finite number of steps.

But there are many problems with this simple algorithm:

• If there are several separating hyperplanes, there is no control over which one is found (it depends on the starting values).

• If the gap between the classes is narrow, it may take many steps until convergence.

• If the data are not separable, the learning algorithm does not converge at all.


Machine Learning / 3. Maximum Margin Separating Hyperplanes

Maximum Margin Separating Hyperplanes

Many of the problems of perceptrons can be overcome by designing a better fit/error criterion.

Maximum margin separating hyperplanes use the width of the margin, i.e., the distance of the closest points to the hyperplane, as criterion:

maximize $\quad C$
w.r.t. $\quad y_i\,\frac{\beta_0 + \langle\beta, x_i\rangle}{\|\beta\|} \ge C, \quad i = 1,\ldots,n$
$\qquad \beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

Machine Learning / 3. Maximum Margin Separating Hyperplanes

Maximum Margin Separating Hyperplanes

As for any solution $\beta, \beta_0$ all positive scalar multiples fulfil the constraints as well, we can arbitrarily set

$\|\beta\| = \frac{1}{C}$

Then the problem can be reformulated as

minimize $\quad \frac{1}{2}\|\beta\|^2$
w.r.t. $\quad y_i(\beta_0 + \langle\beta, x_i\rangle) \ge 1, \quad i = 1,\ldots,n$
$\qquad \beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

This problem is a convex optimization problem (quadratic target function with linear inequality constraints).


Machine Learning / 3. Maximum Margin Separating Hyperplanes

Quadratic Optimization

Machine Learning / 3. Maximum Margin Separating Hyperplanes

To get rid of the linear inequality constraints, one usually applies Lagrange multipliers. The Lagrange (primal) function of this problem is

$L := \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \alpha_i\,\big(y_i(\beta_0 + \langle\beta, x_i\rangle) - 1\big), \qquad \alpha_i \ge 0$

For an extremum it is required that

$\frac{\partial L}{\partial\beta} = \beta - \sum_{i=1}^n \alpha_i y_i x_i \overset{!}{=} 0 \quad\Rightarrow\quad \beta = \sum_{i=1}^n \alpha_i y_i x_i$

and

$\frac{\partial L}{\partial\beta_0} = -\sum_{i=1}^n \alpha_i y_i \overset{!}{=} 0$


Machine Learning / 3. Maximum Margin Separating Hyperplanes

Quadratic Optimization

Inserting

$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0$

into

$L := \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \alpha_i\,\big(y_i(\beta_0 + \langle\beta, x_i\rangle) - 1\big)$

yields the dual problem

$L = \frac{1}{2}\Big\langle\sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j\Big\rangle - \sum_{i=1}^n \alpha_i\Big(y_i\Big(\beta_0 + \Big\langle\sum_{j=1}^n \alpha_j y_j x_j,\ x_i\Big\rangle\Big) - 1\Big)$

$\quad = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i - \sum_{i=1}^n \alpha_i y_i\beta_0 - \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle$

$\quad = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i$

Machine Learning / 3. Maximum Margin Separating Hyperplanes

Quadratic Optimization

The dual problem is

maximize $\quad L = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad \alpha_i \ge 0$

with much simpler constraints.
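This dual can be handed directly to a generic constrained optimizer. A hedged sketch using `scipy.optimize.minimize` with SLSQP (minimizing $-L$); the three-point dataset is the same one used in the worked example later in this chapter, so the recovered $\beta$ can be checked:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)    # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(a):
    # -L(alpha) = 0.5 sum_ij a_i a_j y_i y_j <x_i,x_j> - sum_i a_i
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.zeros(3), method="SLSQP",
               bounds=[(0.0, None)] * 3,                       # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
beta = (alpha * y) @ X                       # beta = sum_i alpha_i y_i x_i
assert np.allclose(beta, [0.5, 0.5], atol=1e-2)
```

A general-purpose solver like SLSQP is fine at this toy scale; for real SVM training, specialized quadratic-programming solvers are used instead.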

Machine Learning / 4. Digression: Quadratic Optimization

Unconstrained Problem

The unconstrained quadratic optimization problem is

minimize $\quad f(x) := \frac{1}{2}\langle x, Cx\rangle - \langle c, x\rangle$
w.r.t. $\quad x \in \mathbb{R}^n$

(with $C \in \mathbb{R}^{n\times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$).

The solution of the unconstrained quadratic optimization problem coincides with the solution of the linear system of equations

$Cx = c$

that can be solved by Gaussian elimination, Cholesky decomposition, QR decomposition, etc.

Proof:

$\frac{\partial f(x)}{\partial x} = x^T C - c^T \overset{!}{=} 0 \quad\Leftrightarrow\quad Cx = c$
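The equivalence is easy to demonstrate numerically. A minimal sketch with an arbitrary symmetric positive definite $C$ and right-hand side $c$ (illustration values only):

```python
import numpy as np

# Minimizing f(x) = 0.5 <x, Cx> - <c, x> for symmetric positive definite C
# reduces to solving the linear system C x = c.
C = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric, positive definite
c = np.array([1.0, 2.0])

x = np.linalg.solve(C, c)       # LU-based solve; Cholesky would also do

f = lambda v: 0.5 * v @ C @ v - c @ v

# At the minimizer the gradient Cx - c vanishes ...
assert np.allclose(C @ x - c, 0.0)
# ... and perturbing x can only increase f.
assert f(x) <= f(x + np.array([0.1, -0.2]))
```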


Machine Learning / 4. Digression: Quadratic Optimization

Equality Constraints

The quadratic optimization problem with equality constraints is

minimize $\quad f(x) := \frac{1}{2}\langle x, Cx\rangle - \langle c, x\rangle$
w.r.t. $\quad g(x) := Ax - b = 0$
$\qquad x \in \mathbb{R}^n$

(with $C \in \mathbb{R}^{n\times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$).

Machine Learning / 4. Digression: Quadratic Optimization

Lagrange Function

Definition 1. Consider the optimization problem

minimize $\quad f(x)$
subject to $\quad g(x) \le 0$
$\qquad h(x) = 0$
$\qquad x \in \mathbb{R}^n$

with $f: \mathbb{R}^n \to \mathbb{R}$, $g: \mathbb{R}^n \to \mathbb{R}^m$ and $h: \mathbb{R}^n \to \mathbb{R}^p$.

The Lagrange function of this problem is defined as

$L(x, \lambda, \nu) := f(x) + \langle\lambda, g(x)\rangle + \langle\nu, h(x)\rangle$

$\lambda$ and $\nu$ are called Lagrange multipliers.

The dual problem is defined as

maximize $\quad \bar{f}(\lambda, \nu) := \inf_x L(x, \lambda, \nu)$
subject to $\quad \lambda \ge 0$
$\qquad \lambda \in \mathbb{R}^m,\ \nu \in \mathbb{R}^p$


Machine Learning / 4. Digression: Quadratic Optimization

Lower Bounds Lemma

Lemma 1. The dual function yields lower bounds for the optimal value of the problem, i.e.,

$\bar{f}(\lambda, \nu) \le f(x^*), \qquad \forall \lambda \ge 0,\ \nu$

Proof:
For feasible $x$, i.e., $g(x) \le 0$ and $h(x) = 0$:

$L(x, \lambda, \nu) = f(x) + \langle\lambda, g(x)\rangle + \langle\nu, h(x)\rangle \le f(x)$

Hence

$\bar{f}(\lambda, \nu) = \inf_{x'} L(x', \lambda, \nu) \le f(x)$

and especially for $x = x^*$.

Machine Learning / 4. Digression: Quadratic Optimization

Karush-Kuhn-Tucker Conditions

Theorem 1 (Karush-Kuhn-Tucker Conditions). If
(i) $x$ is optimal for the problem,
(ii) $\lambda, \nu$ are optimal for the dual problem, and
(iii) $f(x) = \bar{f}(\lambda, \nu)$,
then the following conditions hold:

$g(x) \le 0$
$h(x) = 0$
$\lambda \ge 0$
$\lambda_i g_i(x) = 0$
$\frac{\partial f(x)}{\partial x} + \Big\langle\lambda, \frac{\partial g(x)}{\partial x}\Big\rangle + \Big\langle\nu, \frac{\partial h(x)}{\partial x}\Big\rangle = 0$

If $f$ is convex and $h$ is affine, then the KKT conditions are also sufficient.


Machine Learning / 4. Digression: Quadratic Optimization

Karush-Kuhn-Tucker Conditions

Proof: "⇒"

$f(x) = \bar{f}(\lambda, \nu) = \inf_{x'}\ f(x') + \langle\lambda, g(x')\rangle + \langle\nu, h(x')\rangle \le f(x) + \langle\lambda, g(x)\rangle + \langle\nu, h(x)\rangle \le f(x)$

and therefore equality holds, thus

$\langle\lambda, g(x)\rangle = \sum_{i=1}^m \lambda_i g_i(x) = 0$

and as all terms are non-positive: $\lambda_i g_i(x) = 0$.

Since $x$ minimizes $L(x', \lambda, \nu)$ over $x'$, the derivative must vanish:

$\frac{\partial L(x, \lambda, \nu)}{\partial x} = \frac{\partial f(x)}{\partial x} + \Big\langle\lambda, \frac{\partial g(x)}{\partial x}\Big\rangle + \Big\langle\nu, \frac{\partial h(x)}{\partial x}\Big\rangle = 0$

Machine Learning / 4. Digression: Quadratic Optimization

Karush-Kuhn-Tucker Conditions

Proof (ctd.): "⇐"

Now let $f$ be convex. Since $\lambda \ge 0$, $L(x', \lambda, \nu)$ is convex in $x'$. As its first derivative vanishes at $x$, $x$ minimizes $L(x', \lambda, \nu)$ over $x'$, and thus:

$\bar{f}(\lambda, \nu) = L(x, \lambda, \nu) = f(x) + \langle\lambda, g(x)\rangle + \langle\nu, h(x)\rangle = f(x)$

Therefore $x$ is optimal for the problem and $\lambda, \nu$ are optimal for the dual problem.


Machine Learning / 4. Digression: Quadratic Optimization

Equality Constraints

The quadratic optimization problem with equality constraints is

minimize $\quad f(x) := \frac{1}{2}\langle x, Cx\rangle - \langle c, x\rangle$
w.r.t. $\quad h(x) := Ax - b = 0$
$\qquad x \in \mathbb{R}^n$

(with $C \in \mathbb{R}^{n\times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$).

The KKT conditions for the optimal solution $x^*, \nu^*$ are:

$h(x^*) = Ax^* - b = 0$

$\frac{\partial f(x^*)}{\partial x} + \Big\langle\nu^*, \frac{\partial h(x^*)}{\partial x}\Big\rangle = Cx^* - c + A^T\nu^* = 0$

which can be written as a single system of linear equations:

$\begin{pmatrix} C & A^T \\ A & 0 \end{pmatrix}\begin{pmatrix} x^* \\ \nu^* \end{pmatrix} = \begin{pmatrix} c \\ b \end{pmatrix}$
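Assembling and solving this KKT block system is a few lines of NumPy. A minimal sketch; the matrices $C$, $A$ and vectors $c$, $b$ below are arbitrary illustration values:

```python
import numpy as np

# Equality-constrained QP: minimize 0.5 <x,Cx> - <c,x>  s.t.  Ax = b.
C = np.array([[2.0, 0.0],
              [0.0, 2.0]])
c = np.array([2.0, 0.0])
A = np.array([[1.0, 1.0]])          # one constraint: x1 + x2 = 1
b = np.array([1.0])

n, m = C.shape[0], A.shape[0]
# KKT block system [[C, A^T], [A, 0]] [x; nu] = [c; b].
K = np.block([[C, A.T],
              [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([c, b]))
x_star, nu_star = sol[:n], sol[n:]

assert np.allclose(A @ x_star, b)                         # feasible
assert np.allclose(C @ x_star - c + A.T @ nu_star, 0.0)   # stationary
```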

Machine Learning / 4. Digression: Quadratic Optimization

Inequality Constraints

The quadratic optimization problem with inequality constraints is

minimize $\quad f(x) := \frac{1}{2}\langle x, Cx\rangle - \langle c, x\rangle$
w.r.t. $\quad g(x) := Ax - b \le 0$
$\qquad x \in \mathbb{R}^n$

(with $C \in \mathbb{R}^{n\times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m\times n}$, $b \in \mathbb{R}^m$).

Inequality constraints are more complex to solve. But they can be reduced to a sequence of problems with equality constraints.


Machine Learning / 4. Digression: Quadratic Optimization

Inequality Constraints

At each point $x \in \mathbb{R}^n$ one distinguishes between active constraints $g_i$ with $g_i(x) = 0$ and inactive constraints $g_i$ with $g_i(x) < 0$.

Active set:

$I_0(x) := \{ i \in \{1,\ldots,m\} \mid g_i(x) = 0 \}$

Inactive constraints stay inactive in a neighborhood of $x$ and can be neglected there. Active constraints are equality constraints that identify points at the border of the feasible area.

We can restrict our attention to just the points at the actual border, i.e., use the equality constraints

$h_i(x) := g_i(x), \quad i \in I_0.$

Machine Learning / 4. Digression: Quadratic Optimization

Inequality Constraints

If an optimal point $x^*$ is found with optimal Lagrange multiplier $\nu^* \ge 0$:

$\frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\,\frac{\partial h_i(x^*)}{\partial x} = 0$

then $x^*$ with

$\lambda_i^* := \begin{cases} \nu_i^*, & i \in I_0 \\ 0, & \text{else} \end{cases}$

fulfils the KKT conditions of the original problem:

$\lambda_i^* g_i(x^*) = \begin{cases} \nu_i^* h_i(x^*) = 0, & i \in I_0 \\ 0 \cdot g_i(x^*) = 0, & \text{else} \end{cases}$

and

$\frac{\partial f(x^*)}{\partial x} + \Big\langle\lambda^*, \frac{\partial g(x^*)}{\partial x}\Big\rangle = \frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\,\frac{\partial h_i(x^*)}{\partial x} = 0$


Machine Learning / 4. Digression: Quadratic Optimization

Inequality Constraints

If the optimal point $x^*$ on the border has an optimal Lagrange multiplier $\nu^*$ with $\nu_i^* < 0$ for some $i \in I_0$,

$\frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\,\frac{\partial h_i(x^*)}{\partial x} = 0$

then $f$ decreases along $h_i := g_i$, thus we can decrease $f$ by moving away from the border, i.e., by dropping the constraint $i$.

Machine Learning / 4. Digression: Quadratic Optimization

Inequality Constraints

minimize-submanifold(target function f, inequality constraint function g):
    x := a random vector with g(x) ≤ 0
    I₀ := I₀(x) := { i | gᵢ(x) = 0 }
    do
        x* := argminₓ f(x) subject to gᵢ(x) = 0, i ∈ I₀
        while f(x*) < f(x) do
            α := max{ α ∈ [0,1] | g(x + α(x* − x)) ≤ 0 }
            x := x + α(x* − x)
            I₀ := I₀(x)
            x* := argminₓ f(x) subject to gᵢ(x) = 0, i ∈ I₀
        od
        let ν* be the optimal Lagrange multiplier for x*
        if ν* ≥ 0 break fi
        choose i ∈ I₀ with νᵢ* < 0
        I₀ := I₀ \ { i }
    while true
    return x


Machine Learning / 4. Digression: Quadratic Optimization

The dual problem for the maximum margin separating hyperplane is such a constrained quadratic optimization problem:

maximize $\quad L = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad \alpha_i \ge 0$

Set

$f := -L$
$C_{i,j} := y_i y_j\langle x_i, x_j\rangle$
$c_i := 1$
$x_i := \alpha_i$
$A_i := (0, 0, \ldots, 0, -1, 0, \ldots, 0)$ (with the $-1$ at column $i$), $\quad i = 1,\ldots,n$
$b_i := 0$
$h(x) := \sum_{i=1}^n \alpha_i y_i$

Machine Learning / 4. Digression: Quadratic Optimization

Example

Find a maximum margin separating hyperplane for the following data:

$x_1$   $x_2$   $y$
 1       1      −1
 3       3      +1
 4       3      +1

[Figure: the three data points plotted in the plane.]


Machine Learning / 4. Digression: Quadratic Optimization

Example

$C = (y_i y_j\langle x_i, x_j\rangle)_{i,j} = \begin{pmatrix} 2 & -6 & -7 \\ -6 & 18 & 21 \\ -7 & 21 & 25 \end{pmatrix}, \quad c = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},$

$A = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \quad b = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$

$h(\alpha) = \langle\alpha, y\rangle = -\alpha_1 + \alpha_2 + \alpha_3$

As the equality constraint $h$ always needs to be met, it can be added to $C$:

$C' = \begin{pmatrix} C & y \\ y^T & 0 \end{pmatrix} = \begin{pmatrix} 2 & -6 & -7 & -1 \\ -6 & 18 & 21 & 1 \\ -7 & 21 & 25 & 1 \\ -1 & 1 & 1 & 0 \end{pmatrix}$

Machine Learning / 4. Digression: Quadratic Optimization

Example

Let us start with a random

$x = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}$

that meets both constraints:

$g(x) = Ax - b = \begin{pmatrix} -2 \\ -1 \\ -1 \end{pmatrix} \le 0$

$h(x) = \langle y, x\rangle = -2 + 1 + 1 = 0$

As none of the inequality constraints is active: $I_0(x) = \emptyset$.

Step 1: We have to solve

$C'\begin{pmatrix} x \\ \mu \end{pmatrix} = \begin{pmatrix} c \\ 0 \end{pmatrix}$


Machine Learning / 4. Digression: Quadratic Optimization

Example

This yields

$x^* = \begin{pmatrix} 0.5 \\ 1.5 \\ -1.0 \end{pmatrix}$

which does not fulfil the (inactive) inequality constraint $x_3 \ge 0$. So we look for the

$x + \alpha(x^* - x) = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix} + \alpha\begin{pmatrix} -1.5 \\ 0.5 \\ -2 \end{pmatrix} \ge 0$

that fulfils all inequality constraints and has the largest step size $\alpha$. Obviously, $\alpha = 0.5$ is best and yields

$x := x + \alpha(x^* - x) = \begin{pmatrix} 1.25 \\ 1.25 \\ 0 \end{pmatrix}$

Machine Learning / 4. Digression: Quadratic Optimization

Example

Step 2: Now the third inequality constraint is active: $I_0(x) = \{3\}$.

$C'' = \begin{pmatrix} C & y & -e_3 \\ y^T & 0 & 0 \\ -e_3^T & 0 & 0 \end{pmatrix} = \begin{pmatrix} 2 & -6 & -7 & -1 & 0 \\ -6 & 18 & 21 & 1 & 0 \\ -7 & 21 & 25 & 1 & -1 \\ -1 & 1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 \end{pmatrix},$

and we have to solve

$C''\begin{pmatrix} x \\ \mu \\ \nu^* \end{pmatrix} = \begin{pmatrix} c \\ 0 \\ 0 \end{pmatrix}$

which yields

$x^* = \begin{pmatrix} 0.25 \\ 0.25 \\ 0 \end{pmatrix}, \qquad \nu^* = 0.5$


Machine Learning / 4. Digression: Quadratic Optimization

Example

As $x^*$ fulfils all constraints, it becomes the next $x$ (step size $\alpha = 1$):

$x := x^*$

As the Lagrange multiplier $\nu^* \ge 0$, the algorithm stops: $x$ is optimal.

So we found the optimal

$\alpha = \begin{pmatrix} 0.25 \\ 0.25 \\ 0 \end{pmatrix}$ (called $x$ in the algorithm!)

and can compute

$\beta = \sum_{i=1}^n \alpha_i y_i x_i = 0.25 \cdot (-1) \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} + 0.25 \cdot (+1) \cdot \begin{pmatrix} 3 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$

$\beta_0$ can be computed from the original constraints of the points with $\alpha_i > 0$, which have to be sharp, i.e.,

$y_1(\beta_0 + \langle\beta, x_1\rangle) = 1 \quad\Rightarrow\quad \beta_0 = y_1 - \langle\beta, x_1\rangle = -1 - \Big\langle\begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix}\Big\rangle = -2$
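The whole worked example can be reproduced numerically. A sketch that solves the final bordered system $C''$ from Step 2 and recovers $\alpha$, $\beta$ and $\beta_0$:

```python
import numpy as np

# Data of the example: x1=(1,1) with y=-1; x2=(3,3), x3=(4,3) with y=+1.
X = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0])
C = (y[:, None] * y[None, :]) * (X @ X.T)    # C_ij = y_i y_j <x_i, x_j>
c = np.ones(3)

# Final active-set step: equality constraint <y, alpha> = 0 plus the
# active inequality alpha_3 = 0, written as the bordered system C''.
e3 = np.array([0.0, 0.0, 1.0])
K = np.block([[C, y[:, None], -e3[:, None]],
              [y[None, :], np.zeros((1, 2))],
              [-e3[None, :], np.zeros((1, 2))]])
sol = np.linalg.solve(K, np.concatenate([c, [0.0, 0.0]]))
alpha, nu = sol[:3], sol[4]

beta = (alpha * y) @ X                  # beta = sum_i alpha_i y_i x_i
beta0 = y[0] - beta @ X[0]              # a support vector's constraint is sharp

assert np.allclose(alpha, [0.25, 0.25, 0.0])
assert nu >= 0                          # KKT: the active constraint may stay
assert np.allclose(beta, [0.5, 0.5]) and np.isclose(beta0, -2.0)
```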


Machine Learning / 5. Non-separable Problems

Optimal Hyperplane

Inseparable problems can be modeled by allowing some points to be on the wrong side of the hyperplane. Hyperplanes are better

(i) the fewer points are on the wrong side, and
(ii) the closer these points are to the hyperplane

(modeled by slack variables $\xi_i$).

minimize $\quad \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i$
w.r.t. $\quad y_i(\beta_0 + \langle\beta, x_i\rangle) \ge 1 - \xi_i, \quad i = 1,\ldots,n$
$\qquad \xi_i \ge 0$
$\qquad \beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

This problem also is a convex optimization problem (quadratic target function with linear inequality constraints).

Machine Learning / 5. Non-separable Problems

Dual Problem

Compute again the dual problem:

$L := \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\,\big(y_i(\beta_0 + \langle\beta, x_i\rangle) - (1 - \xi_i)\big) - \sum_{i=1}^n \mu_i\xi_i$
w.r.t. $\quad \alpha_i \ge 0,\ \mu_i \ge 0$

For an extremum it is required that

$\frac{\partial L}{\partial\beta} = \beta - \sum_{i=1}^n \alpha_i y_i x_i \overset{!}{=} 0 \quad\Rightarrow\quad \beta = \sum_{i=1}^n \alpha_i y_i x_i$

and

$\frac{\partial L}{\partial\beta_0} = -\sum_{i=1}^n \alpha_i y_i \overset{!}{=} 0$

and

$\frac{\partial L}{\partial\xi_i} = \gamma - \alpha_i - \mu_i \overset{!}{=} 0 \quad\Rightarrow\quad \alpha_i = \gamma - \mu_i$


Machine Learning / 5. Non-separable Problems

Dual Problem

Inserting

$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i = \gamma - \mu_i$

into

$L := \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\,\big(y_i(\beta_0 + \langle\beta, x_i\rangle) - (1 - \xi_i)\big) - \sum_{i=1}^n \mu_i\xi_i$

yields the dual problem

$L = \frac{1}{2}\Big\langle\sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j\Big\rangle - \sum_{i=1}^n \alpha_i\Big(y_i\Big(\beta_0 + \Big\langle\sum_{j=1}^n \alpha_j y_j x_j,\ x_i\Big\rangle\Big) - (1 - \xi_i)\Big) + \gamma\sum_{i=1}^n \xi_i - \sum_{i=1}^n \mu_i\xi_i$

Machine Learning / 5. Non-separable Problems

Dual Problem

$L = \frac{1}{2}\Big\langle\sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j\Big\rangle - \sum_{i=1}^n \alpha_i\Big(y_i\Big(\beta_0 + \Big\langle\sum_{j=1}^n \alpha_j y_j x_j,\ x_i\Big\rangle\Big) - (1 - \xi_i)\Big) + \gamma\sum_{i=1}^n \xi_i - \sum_{i=1}^n \mu_i\xi_i$

$\quad = \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i - \sum_{i=1}^n \alpha_i y_i\beta_0 - \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle - \sum_{i=1}^n \alpha_i\xi_i + \sum_{i=1}^n \alpha_i\xi_i$

$\quad = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i$

(the slack terms cancel since $(\gamma - \mu_i)\xi_i = \alpha_i\xi_i$, and the $\beta_0$ term vanishes due to $\sum_i \alpha_i y_i = 0$).


Machine Learning / 5. Non-separable Problems

Dual Problem

The dual problem is

maximize $\quad L = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j\langle x_i, x_j\rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad \alpha_i \le \gamma$
$\qquad \alpha_i \ge 0$

with much simpler constraints.


Machine Learning / 6. Support Vectors and Kernels

Support Vectors / Separable Case

For points on the right side of the hyperplane (i.e., if a constraint holds),

$y_i(\beta_0 + \langle\beta, x_i\rangle) > 1,$

$L$ is maximized by $\alpha_i = 0$: $x_i$ is irrelevant.

For points on the wrong side of the hyperplane (i.e., if a constraint is violated),

$y_i(\beta_0 + \langle\beta, x_i\rangle) < 1,$

$L$ is maximized for $\alpha_i \to \infty$. For separable data, $\beta$ and $\beta_0$ need to be changed to make the constraint hold.

For points on the margin, i.e.,

$y_i(\beta_0 + \langle\beta, x_i\rangle) = 1,$

$\alpha_i$ is some finite value.

Machine Learning / 6. Support Vectors and Kernels

Support Vectors / Inseparable Case

For points on the right side of the hyperplane,

$y_i(\beta_0 + \langle\beta, x_i\rangle) > 1, \quad \xi_i = 0,$

$L$ is maximized by $\alpha_i = 0$: $x_i$ is irrelevant.

For points in the margin as well as on the wrong side of the hyperplane,

$y_i(\beta_0 + \langle\beta, x_i\rangle) = 1 - \xi_i, \quad \xi_i > 0,$

$\alpha_i$ is some finite value.

For points on the margin, i.e.,

$y_i(\beta_0 + \langle\beta, x_i\rangle) = 1, \quad \xi_i = 0,$

$\alpha_i$ is some finite value.

The data points $x_i$ with $\alpha_i > 0$ are called support vectors.


Machine Learning / 6. Support Vectors and Kernels

Decision Function

Due to

$\hat\beta = \sum_{i=1}^n \hat\alpha_i y_i x_i,$

the decision function

$\hat{y}(x) = \operatorname{sign}\big(\hat\beta_0 + \langle\hat\beta, x\rangle\big)$

can be expressed using the training data:

$\hat{y}(x) = \operatorname{sign}\Big(\hat\beta_0 + \sum_{i=1}^n \hat\alpha_i y_i\langle x_i, x\rangle\Big)$

Only the support vectors are required, as only for them $\hat\alpha_i \neq 0$.

Both the learning problem and the decision function can be expressed using an inner product / a similarity measure / a kernel $\langle x, x'\rangle$.
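This support-vector expansion of the decision function can be sketched directly. The numbers below are the dual solution of the worked example in the previous chapter ($\alpha = (0.25, 0.25, 0)$, $\beta_0 = -2$), used here with a linear kernel:

```python
import numpy as np

def decision(x, X, y, alpha, beta0, kernel):
    """SVM decision function expressed through the training data;
    only support vectors (alpha_i > 0) contribute to the sum."""
    return np.sign(beta0 + sum(a * yi * kernel(xi, x)
                               for a, yi, xi in zip(alpha, y, X) if a > 0))

linear = lambda u, v: u @ v

X = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0])
alpha = np.array([0.25, 0.25, 0.0])
beta0 = -2.0

preds = [decision(x, X, y, alpha, beta0, linear) for x in X]
assert preds == [-1.0, 1.0, 1.0]    # all training points classified correctly
```

Swapping `linear` for any other kernel changes the classifier without touching the rest of the code, which is the point of the kernel formulation.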

Machine Learning / 6. Support Vectors and Kernels

High-Dimensional Embeddings / The “kernel trick”

Example: we map points from $\mathbb{R}^2$ into the higher-dimensional space $\mathbb{R}^6$ via

$h: \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} 1 \\ \sqrt{2}\,x_1 \\ \sqrt{2}\,x_2 \\ x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{pmatrix}$

Then the inner product

$\Big\langle h\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, h\begin{pmatrix} x_1' \\ x_2' \end{pmatrix}\Big\rangle = 1 + 2x_1x_1' + 2x_2x_2' + x_1^2x_1'^2 + x_2^2x_2'^2 + 2x_1x_2x_1'x_2' = (1 + x_1x_1' + x_2x_2')^2$

can be computed without having to compute $h$ explicitly!
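The identity is easy to verify numerically for random points:

```python
import numpy as np

def h(x):
    """Explicit embedding of R^2 into R^6 from the example above."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

# Inner product in the embedded space == degree-2 polynomial kernel in R^2.
assert np.isclose(h(x) @ h(xp), (1.0 + x @ xp) ** 2)
```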


Machine Learning / 6. Support Vectors and Kernels

Popular Kernels

Some popular kernels are:

linear kernel:

$K(x, x') := \langle x, x'\rangle = \sum_{i=1}^n x_i x_i'$

polynomial kernel of degree $d$:

$K(x, x') := (1 + \langle x, x'\rangle)^d$

radial basis kernel / Gaussian kernel:

$K(x, x') := e^{-\frac{\|x - x'\|^2}{c}}$

neural network kernel / sigmoid kernel:

$K(x, x') := \tanh(a\langle x, x'\rangle + b)$
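A straightforward sketch of these four kernels; the parameter names `a`, `b`, `c`, `d` follow the slide, and the default values are arbitrary:

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, d=2):
    return (1.0 + x @ xp) ** d

def rbf_kernel(x, xp, c=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / c)

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    return np.tanh(a * (x @ xp) + b)

x = np.array([1.0, 2.0])
# A point has distance zero to itself, so the RBF kernel equals 1 there.
assert np.isclose(rbf_kernel(x, x), 1.0)
# Degree-1 polynomial kernel = 1 + linear kernel.
assert np.isclose(polynomial_kernel(x, x, d=1), 1.0 + linear_kernel(x, x))
```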


Machine Learning / 7. Support Vector Regression

Optimal Hyperplanes as Regularization

Optimal separating hyperplanes

minimize $\quad \frac{1}{2}\|\beta\|^2 + \gamma\sum_{i=1}^n \xi_i$
w.r.t. $\quad y_i(\beta_0 + \langle\beta, x_i\rangle) \ge 1 - \xi_i, \quad i = 1,\ldots,n$
$\qquad \xi_i \ge 0$
$\qquad \beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

can also be understood as regularization "error + complexity":

minimize $\quad \gamma\sum_{i=1}^n \big[1 - y_i(\beta_0 + \langle\beta, x_i\rangle)\big]_+ + \frac{1}{2}\|\beta\|^2$
w.r.t. $\quad \beta \in \mathbb{R}^p,\ \beta_0 \in \mathbb{R}$

where the positive part is defined as

$[x]_+ := \begin{cases} x, & \text{if } x \ge 0 \\ 0, & \text{else} \end{cases}$
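The regularized hinge-loss objective can be written down in a few lines. A sketch that evaluates it at the maximum margin solution of the earlier worked example, where every margin is at least 1 and the hinge term vanishes:

```python
import numpy as np

def svm_objective(beta, beta0, X, y, gamma):
    """gamma * sum_i [1 - y_i (beta0 + <beta, x_i>)]_+  +  0.5 ||beta||^2"""
    margins = y * (beta0 + X @ beta)
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # positive part [.]_+
    return gamma * hinge + 0.5 * beta @ beta

X = np.array([[1.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0])

# For the solution beta=(0.5, 0.5), beta0=-2 the hinge loss is zero,
# so only the complexity term 0.5 * ||(0.5, 0.5)||^2 = 0.25 remains.
beta, beta0 = np.array([0.5, 0.5]), -2.0
assert np.isclose(svm_objective(beta, beta0, X, y, gamma=1.0), 0.25)
```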

Machine Learning / 7. Support Vector Regression

Optimal Hyperplanes as Regularization / Error functions

Specific for the SVM model then is the error function (also often called loss function):

model                 error function                                 minimizing function
logistic regression   negative binomial log-likelihood               $f(x) = \log\frac{p(y=+1\mid x)}{p(y=-1\mid x)}$
                      $\log(1 + e^{-y f(x)})$
LDA                   squared error $(y - f(x))^2$                   $f(x) = p(y=+1\mid x) - p(y=-1\mid x)$
SVM                   $[1 - y f(x)]_+$                               $f(x) = +1$ if $p(y=+1\mid x) \ge \frac{1}{2}$, $-1$ else


Machine Learning / 7. Support Vector Regression

Optimal Hyperplanes as Regularization / Error functions

[Figure: the three error functions (negative binomial log-likelihood, squared error, SVM hinge) plotted against $y f(x)$.]

Machine Learning / 7. Support Vector Regression

SVM Regression / Error functions

In regression, the squared error is sometimes dominated by outliers, i.e., points with a large residual, due to the quadratic dependency:

\text{err}(y, \hat y) := (y - \hat y)^2

Therefore, robust error functions such as the Huber error have been developed that keep the quadratic form near zero, but are linear for larger values:

\text{err}_c(y, \hat y) := \begin{cases} \frac{(y - \hat y)^2}{2}, & \text{if } |y - \hat y| < c \\ c\,|y - \hat y| - \frac{c^2}{2}, & \text{else} \end{cases}

SVM regression uses the \epsilon-insensitive error:

\text{err}_\epsilon(y, \hat y) := \begin{cases} 0, & \text{if } |y - \hat y| < \epsilon \\ |y - \hat y| - \epsilon, & \text{else} \end{cases}
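Both robust error functions are easy to state in code; a NumPy sketch (function names and default parameter values are illustrative):

```python
import numpy as np

def huber_error(y, y_hat, c=2.0):
    # quadratic near zero, linear for residuals of size >= c
    r = np.abs(y - y_hat)
    return np.where(r < c, 0.5 * r**2, c * r - 0.5 * c**2)

def eps_insensitive_error(y, y_hat, eps=2.0):
    # zero inside the eps-tube, linear outside
    return np.maximum(0.0, np.abs(y - y_hat) - eps)
```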

Machine Learning / 7. Support Vector Regression

SVM Regression / Error functions

[Figure: error as a function of y - f(x) for squared error, the SVM \epsilon-insensitive error (\epsilon = 2), and the Huber error (c = 2).]

Machine Learning / 7. Support Vector Regression

SVM Regression

Any of these error functions can be used to find optimal parameters for the linear regression model

f(X) := \beta_0 + \langle \beta, X \rangle + \epsilon

by solving the optimization problem

\min_{\hat\beta_0, \hat\beta} \quad \sum_{i=1}^n \text{err}(y_i, \hat\beta_0 + \langle \hat\beta, x_i \rangle) + \frac{\lambda}{2} \|\hat\beta\|^2
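As a sketch, the primal problem with the \epsilon-insensitive error can be handed to a generic derivative-free optimizer (this assumes SciPy is available; a proper SVR solver would work on the dual quadratic program shown below instead):

```python
import numpy as np
from scipy.optimize import minimize

def svr_objective(params, X, y, lam, eps):
    # sum_i err_eps(y_i, beta0 + <beta, x_i>) + (lam / 2) * ||beta||^2
    beta0, beta = params[0], params[1:]
    r = np.abs(y - (beta0 + X @ beta))
    return np.maximum(0.0, r - eps).sum() + 0.5 * lam * beta @ beta

# toy data: y = 2 x + 1, noise-free
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 2.0 * X[:, 0] + 1.0

res = minimize(svr_objective, x0=np.zeros(2), args=(X, y, 0.1, 0.1),
               method="Nelder-Mead",
               options={"maxiter": 5000, "maxfev": 5000,
                        "xatol": 1e-8, "fatol": 1e-8})
beta0_hat, beta_hat = res.x[0], res.x[1]
```

On this data the fitted coefficients should land near (1, 2); the regularizer shrinks \beta slightly until a residual touches the \epsilon-tube.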


Machine Learning / 7. Support Vector Regression

SVM Regression

For the \epsilon-insensitive error, the solution can be shown to have the form

\hat\beta = \sum_{i=1}^n (\hat\alpha_i^* - \hat\alpha_i)\, x_i

\hat f(x) = \hat\beta_0 + \sum_{i=1}^n (\hat\alpha_i^* - \hat\alpha_i)\, \langle x_i, x \rangle

where \hat\alpha_i and \hat\alpha_i^* are the solutions of the quadratic problem

\min \quad \epsilon \sum_{i=1}^n (\alpha_i^* + \alpha_i) - \sum_{i=1}^n y_i\,(\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, \langle x_i, x_j \rangle

\text{s.t.} \quad 0 \le \alpha_i, \alpha_i^* \le \frac{1}{\lambda}

\sum_{i=1}^n (\alpha_i^* - \alpha_i) = 0

\alpha_i\, \alpha_i^* = 0
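Given solutions \hat\alpha_i, \hat\alpha_i^*, the prediction function is just a weighted sum of inner products with the training points; a NumPy sketch (the function name and the \alpha values in the usage below are made up for illustration):

```python
import numpy as np

def svr_predict(x, X, alpha, alpha_star, beta0):
    # f(x) = beta0 + sum_i (alpha*_i - alpha_i) * <x_i, x>
    return beta0 + (alpha_star - alpha) @ (X @ x)
```

Only points with \hat\alpha_i \ne \hat\alpha_i^* (the support vectors) contribute to the sum.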

Machine Learning / 7. Support Vector Regression

Summary (1/2)

• Binary classiﬁcation problems with linear decision boundaries can be

rephrased as ﬁnding a separating hyperplane.

• In the linearly separable case, there are simple algorithms such as perceptron learning for finding such a separating hyperplane.

• If one requires the additional property that the hyperplane should have

maximal margin, i.e., maximal distance to the closest points of both

classes, then a quadratic optimization problem with inequality

constraints arises.

• Quadratic optimization problems without constraints as well as with

equality constraints can be solved by linear systems of equations.

Quadratic optimization problems with inequality constraints require

some more complex methods such as submanifold optimization (a

sequence of linear systems of equations).


Machine Learning / 7. Support Vector Regression

Summary (2/2)

• Optimal hyperplanes can also be formulated for the inseparable case by allowing some points to lie on the wrong side of the margin while penalizing their distance from the margin. This, too, can be formulated as a quadratic optimization problem with inequality constraints.

• The final decision function can be computed in terms of inner products of the query point with some of the data points (called support vectors), which allows one to bypass the explicit computation of high-dimensional embeddings.
