Machine Learning 7. Support Vector Machines (SVMs) - ISMLL

Machine Learning
7. Support Vector Machines (SVMs)
Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
University of Hildesheim, Germany
http://www.ismll.uni-hildesheim.de
Machine Learning
1. Separating Hyperplanes
2. Perceptron
3. Maximum Margin Separating Hyperplanes
4. Digression: Quadratic Optimization
5. Non-separable Problems
6. Support Vectors and Kernels
7. Support Vector Regression
Machine Learning / 1. Separating Hyperplanes
Separating Hyperplanes
[Figures: linear class boundaries as obtained by Logistic Regression and by Linear Discriminant Analysis (LDA)]
Machine Learning / 1. Separating Hyperplanes
Hyperplanes
Hyperplanes can be modeled explicitly as
$H_{\beta,\beta_0} := \{\, x \mid \langle \beta, x \rangle = -\beta_0 \,\}, \quad \beta = (\beta_1, \beta_2, \ldots, \beta_p)^T \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}.$
We will write $H_\beta$ shortly for $H_{\beta,\beta_0}$ (although $\beta_0$ is very relevant!).
For any two points $x, x' \in H_\beta$ we have
$\langle \beta, x - x' \rangle = \langle \beta, x \rangle - \langle \beta, x' \rangle = -\beta_0 + \beta_0 = 0,$
thus $\beta$ is orthogonal to all translation vectors in $H_\beta$,
and thus $\beta / \|\beta\|$ is the normal vector of $H_\beta$.
Machine Learning / 1. Separating Hyperplanes
Hyperplanes
The projection of a point $x \in \mathbb{R}^p$ onto $H_\beta$, i.e., the closest point on $H_\beta$ to $x$, is given by
$\pi_{H_\beta}(x) := x - \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \beta$
Proof:
(i) $\pi x := \pi_{H_\beta}(x) \in H_\beta$:
$\langle \beta, \pi_{H_\beta}(x) \rangle = \Big\langle \beta,\ x - \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \beta \Big\rangle = \langle \beta, x \rangle - \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \langle \beta, \beta \rangle = -\beta_0$
(ii) $\pi_{H_\beta}(x)$ is the closest such point to $x$:
For any other point $x' \in H_\beta$:
$\|x - x'\|^2 = \langle x - x', x - x' \rangle = \langle x - \pi x + \pi x - x',\ x - \pi x + \pi x - x' \rangle$
$= \langle x - \pi x, x - \pi x \rangle + 2\,\langle x - \pi x, \pi x - x' \rangle + \langle \pi x - x', \pi x - x' \rangle$
$= \|x - \pi x\|^2 + 0 + \|\pi x - x'\|^2$
as $x - \pi x$ is proportional to $\beta$ and $\pi x$ and $x'$ are on $H_\beta$.
Machine Learning / 1. Separating Hyperplanes
Hyperplanes
The signed distance of a point $x \in \mathbb{R}^p$ to $H_\beta$ is given by
$\frac{\langle \beta, x \rangle + \beta_0}{\|\beta\|}$
Proof:
$x - \pi x = \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \beta$
Therefore
$\|x - \pi x\|^2 = \Big\langle \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \beta,\ \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle}\, \beta \Big\rangle = \Big( \frac{\langle \beta, x \rangle + \beta_0}{\langle \beta, \beta \rangle} \Big)^2 \langle \beta, \beta \rangle$
$\|x - \pi x\| = \frac{|\langle \beta, x \rangle + \beta_0|}{\|\beta\|}$
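As an illustration added here (not part of the original slides), both formulas are one-liners in numpy; `beta`, `beta0`, and `x` below are arbitrary placeholder values.

```python
import numpy as np

def project_onto_hyperplane(beta, beta0, x):
    """Closest point on H_beta = {x | <beta, x> = -beta0} to x."""
    beta, x = np.asarray(beta, float), np.asarray(x, float)
    return x - (beta @ x + beta0) / (beta @ beta) * beta

def signed_distance(beta, beta0, x):
    """Signed distance of x to H_beta (positive on the side beta points to)."""
    beta = np.asarray(beta, float)
    return (beta @ np.asarray(x, float) + beta0) / np.linalg.norm(beta)

beta, beta0, x = np.array([0.5, 0.5]), -2.0, np.array([3.0, 3.0])
pi_x = project_onto_hyperplane(beta, beta0, x)
print(pi_x, beta @ pi_x + beta0)        # projected point satisfies <beta, pi_x> + beta0 = 0
print(signed_distance(beta, beta0, x))  # about 1.414
```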
Machine Learning / 1. Separating Hyperplanes
Separating Hyperplanes
For given data
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
with a binary class label $Y \in \{-1, +1\}$,
a hyperplane $H_\beta$ is called separating if
$y_i\, h(x_i) > 0, \quad i = 1, \ldots, n, \quad \text{with } h(x) := \langle \beta, x \rangle + \beta_0$
Machine Learning / 1. Separating Hyperplanes
Linearly Separable Data
The data is called linearly separable if there exists such a separating hyperplane.
In general, if there is one, there are many.
If there is a choice, we need a criterion to narrow down which one we want / which one is the best.
Machine Learning / 2. Perceptron
Perceptron as Linear Model
The perceptron is another name for a linear binary classification model (Rosenblatt 1958):
$\hat{Y}(X) = \operatorname{sign} h(X), \quad \text{with } \operatorname{sign} x = \begin{cases} +1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$
$h(X) = \beta_0 + \langle \beta, X \rangle$
that is very similar to the logistic regression model
$\hat{Y}(X) = \operatorname{argmax}_y\, p(Y = y \mid X)$
$p(Y = +1 \mid X) = \operatorname{logistic}(\langle X, \beta \rangle) = \frac{e^{\sum_{j=1}^p \beta_j X_j}}{1 + e^{\sum_{j=1}^p \beta_j X_j}}$
$p(Y = -1 \mid X) = 1 - p(Y = +1 \mid X)$
as well as to linear discriminant analysis (LDA).
The perceptron provides only class labels $\hat{y}(x)$ and unscaled certainty factors $\hat{h}(x)$, but no class probabilities $\hat{p}(Y \mid X)$.
Machine Learning / 2. Perceptron
Perceptron as Linear Model
The perceptron provides only class labels $\hat{y}(x)$ and unscaled certainty factors $\hat{h}(x)$, but no class probabilities $\hat{p}(Y \mid X)$.
Therefore, probabilistic fit/error criteria such as maximum likelihood cannot be applied.
For perceptrons, the sum of the certainty factors of the misclassified points is used as error criterion:
$q(\beta, \beta_0) := \sum_{i:\, \hat{y}_i \neq y_i} |h_\beta(x_i)| = -\sum_{i:\, \hat{y}_i \neq y_i} y_i\, h_\beta(x_i)$
Machine Learning / 2. Perceptron
Perceptron as Linear Model
For learning, gradient descent is used:
$\frac{\partial q(\beta, \beta_0)}{\partial \beta} = -\sum_{i:\, \hat{y}_i \neq y_i} y_i x_i$
$\frac{\partial q(\beta, \beta_0)}{\partial \beta_0} = -\sum_{i:\, \hat{y}_i \neq y_i} y_i$
Instead of looking at all points at the same time,
stochastic gradient descent is applied, where all points are looked at sequentially (in a random sequence).
The update for a single point $(x_i, y_i)$ then is
$\hat{\beta}^{(k+1)} := \hat{\beta}^{(k)} + \alpha\, y_i x_i$
$\hat{\beta}_0^{(k+1)} := \hat{\beta}_0^{(k)} + \alpha\, y_i$
with a step length $\alpha$ (often called learning rate).
Machine Learning / 2. Perceptron
Perceptron Learning Algorithm
learn-perceptron(training data X, step length α):
    β̂  := a random vector
    β̂₀ := a random value
    do
        errors := 0
        for (x, y) ∈ X (in random order) do
            if y(β̂₀ + ⟨β̂, x⟩) ≤ 0
                errors := errors + 1
                β̂  := β̂ + α y x
                β̂₀ := β̂₀ + α y
            fi
        od
    while errors > 0
    return (β̂, β̂₀)
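For illustration (an addition, not from the slides), a minimal Python version of this algorithm might look as follows; the data format (a list of (x, y) pairs with y ∈ {−1, +1}) and the max_epochs safeguard against non-separable data are assumptions.

```python
import random
import numpy as np

def learn_perceptron(data, alpha=1.0, max_epochs=1000, seed=0):
    """data: list of (x, y) pairs with y in {-1, +1}. Returns (beta, beta0).
    Terminates with a separating hyperplane only if the data are linearly separable."""
    rng = random.Random(seed)
    p = len(data[0][0])
    beta = np.array([rng.uniform(-1, 1) for _ in range(p)])
    beta0 = rng.uniform(-1, 1)
    for _ in range(max_epochs):
        errors = 0
        rng.shuffle(data)
        for x, y in data:
            if y * (beta0 + beta @ x) <= 0:   # misclassified (or exactly on the plane)
                errors += 1
                beta = beta + alpha * y * x   # stochastic gradient step
                beta0 = beta0 + alpha * y
        if errors == 0:
            break
    return beta, beta0

data = [(np.array([1., 1.]), -1), (np.array([3., 3.]), +1), (np.array([4., 3.]), +1)]
print(learn_perceptron(data))
```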
Machine Learning / 2. Perceptron
Perceptron Learning Algorithm: Properties
For linearly separable data the perceptron learning algorithm can be shown to converge: it finds a separating hyperplane in a finite number of steps.
But there are several problems with this simple algorithm:
• If there are several separating hyperplanes, there is no control over which one is found (it depends on the starting values).
• If the gap between the classes is narrow, it may take many steps until convergence.
• If the data are not separable, the learning algorithm does not converge at all.
Machine Learning / 3. Maximum Margin Separating Hyperplanes
Maximum Margin Separating Hyperplanes
Many of the problems of perceptrons can be overcome by designing a better fit/error criterion.
Maximum margin separating hyperplanes use the width of the margin, i.e., the distance of the closest points to the hyperplane, as criterion:
maximize $\quad C$
w.r.t. $\quad y_i\, \frac{\beta_0 + \langle \beta, x_i \rangle}{\|\beta\|} \geq C, \quad i = 1, \ldots, n$
$\qquad\quad\ \beta \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}$
Machine Learning / 3. Maximum Margin Separating Hyperplanes
Maximum Margin Separating Hyperplanes
As all positive scalar multiples of a solution $\beta, \beta_0$ fulfill the constraints as well, we can arbitrarily set
$\|\beta\| = \frac{1}{C}$
Then the problem can be reformulated as
minimize $\quad \frac{1}{2}\|\beta\|^2$
w.r.t. $\quad y_i (\beta_0 + \langle \beta, x_i \rangle) \geq 1, \quad i = 1, \ldots, n$
$\qquad\quad\ \beta \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}$
This problem is a convex optimization problem (quadratic target function with linear inequality constraints).
Machine Learning / 3. Maximum Margin Separating Hyperplanes
Quadratic Optimization
To get rid of the linear inequality constraints, one usually applies Lagrange multipliers.
The Lagrange (primal) function of this problem is
$L := \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (\beta_0 + \langle \beta, x_i \rangle) - 1 \big)$
w.r.t. $\quad \alpha_i \geq 0$
For an extremum it is required that
$\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n \alpha_i y_i x_i \overset{!}{=} 0 \quad\Rightarrow\quad \beta = \sum_{i=1}^n \alpha_i y_i x_i$
and
$\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n \alpha_i y_i \overset{!}{=} 0$
Machine Learning / 3. Maximum Margin Separating Hyperplanes
Quadratic Optimization
Inserting
$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0$
into
$L := \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^n \alpha_i \big( y_i (\beta_0 + \langle \beta, x_i \rangle) - 1 \big)$
yields the dual problem
$L = \frac{1}{2} \Big\langle \sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j \Big\rangle - \sum_{i=1}^n \alpha_i \Big( y_i \big( \beta_0 + \big\langle \sum_{j=1}^n \alpha_j y_j x_j,\ x_i \big\rangle \big) - 1 \Big)$
$= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
$= -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i$
Machine Learning / 3. Maximum Margin Separating Hyperplanes
Quadratic Optimization
The dual problem is
maximize $\quad L = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad\quad\ \alpha_i \big( y_i (\beta_0 + \langle \beta, x_i \rangle) - 1 \big) = 0$
$\qquad\quad\ \alpha_i \geq 0$
with much simpler constraints.
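As a sketch added here (not from the slides), this dual can be handed to a general-purpose solver; scipy's SLSQP method is used purely for illustration, and the toy data are the three points of the worked example later in this chapter.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1., 1.], [3., 3.], [4., 3.]])
y = np.array([-1., 1., 1.])
Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j <x_i, x_j>

neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize the negative of L
res = minimize(neg_dual, x0=np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                         # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
               method="SLSQP")

alpha = res.x
beta = (alpha * y) @ X                           # beta = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                       # some point with alpha_i > 0
beta0 = y[sv] - beta @ X[sv]
print(np.round(alpha, 3), beta, beta0)           # approx. (0.25, 0.25, 0), (0.5, 0.5), -2
```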
Machine Learning / 4. Digression: Quadratic Optimization
Unconstrained Problem
The unconstrained quadratic optimization problem is
minimize $\quad f(x) := \frac{1}{2} \langle x, Cx \rangle - \langle c, x \rangle$
w.r.t. $\quad x \in \mathbb{R}^n$
(with $C \in \mathbb{R}^{n \times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$).
The solution of the unconstrained quadratic optimization problem coincides with the solution of the linear system of equations
$Cx = c$
that can be solved by Gaussian elimination, Cholesky decomposition, QR decomposition, etc.
Proof:
$\frac{\partial f(x)}{\partial x} = x^T C - c^T \overset{!}{=} 0 \quad\Leftrightarrow\quad Cx = c$
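As a brief illustration (an addition; C and c are made-up toy values), such a system is typically solved via a Cholesky factorization:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

C = np.array([[4., 1.],
              [1., 3.]])          # symmetric, positive definite
c = np.array([1., 2.])

x = cho_solve(cho_factor(C), c)   # minimizer of 0.5 <x, Cx> - <c, x>
print(x, np.allclose(C @ x, c))   # the gradient Cx - c vanishes at the minimum
```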
Machine Learning / 4. Digression: Quadratic Optimization
Equality Constraints
The quadratic optimization problem with equality constraints is
minimize $\quad f(x) := \frac{1}{2} \langle x, Cx \rangle - \langle c, x \rangle$
w.r.t. $\quad g(x) := Ax - b = 0$
$\qquad\quad\ x \in \mathbb{R}^n$
(with $C \in \mathbb{R}^{n \times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$).
Machine Learning / 4. Digression: Quadratic Optimization
Lagrange Function
Definition 1. Consider the optimization problem
minimize $\quad f(x)$
subject to $\quad g(x) \leq 0$
$\qquad\qquad\ h(x) = 0$
$\qquad\qquad\ x \in \mathbb{R}^n$
with $f: \mathbb{R}^n \to \mathbb{R}$, $g: \mathbb{R}^n \to \mathbb{R}^m$ and $h: \mathbb{R}^n \to \mathbb{R}^p$.
The Lagrange function of this problem is defined as
$L(x, \lambda, \nu) := f(x) + \langle \lambda, g(x) \rangle + \langle \nu, h(x) \rangle$
$\lambda$ and $\nu$ are called Lagrange multipliers.
The dual problem is defined as
maximize $\quad \bar{f}(\lambda, \nu) := \inf_x L(x, \lambda, \nu)$
subject to $\quad \lambda \geq 0$
$\qquad\qquad\ \lambda \in \mathbb{R}^m, \ \nu \in \mathbb{R}^p$
Machine Learning / 4. Digression: Quadratic Optimization
Lower Bounds Lemma
Lemma 1. The dual function yields lower bounds for the optimal value of the problem, i.e.,
$\bar{f}(\lambda, \nu) \leq f(x^*), \quad \forall \lambda \geq 0,\ \nu$
Proof:
For feasible $x$, i.e., $g(x) \leq 0$ and $h(x) = 0$:
$L(x, \lambda, \nu) = f(x) + \langle \lambda, g(x) \rangle + \langle \nu, h(x) \rangle \leq f(x)$
Hence
$\bar{f}(\lambda, \nu) = \inf_x L(x, \lambda, \nu) \leq f(x)$
and especially for $x = x^*$.
Machine Learning / 4. Digression: Quadratic Optimization
Karush-Kuhn-Tucker Conditions
Theorem 1 (Karush-Kuhn-Tucker Conditions). If
(i) $x$ is optimal for the problem,
(ii) $\lambda, \nu$ are optimal for the dual problem and
(iii) $f(x) = \bar{f}(\lambda, \nu)$,
then the following conditions hold:
$g(x) \leq 0$
$h(x) = 0$
$\lambda \geq 0$
$\lambda_i\, g_i(x) = 0$
$\frac{\partial f(x)}{\partial x} + \Big\langle \lambda, \frac{\partial g(x)}{\partial x} \Big\rangle + \Big\langle \nu, \frac{\partial h(x)}{\partial x} \Big\rangle = 0$
If $f$ is convex and $h$ is affine, then the KKT conditions are also sufficient.
Machine Learning / 4. Digression: Quadratic Optimization
Karush-Kuhn-Tucker Conditions
Proof: “⇒”
$f(x) = \bar{f}(\lambda, \nu) = \inf_{x'} \big( f(x') + \langle \lambda, g(x') \rangle + \langle \nu, h(x') \rangle \big) \leq f(x) + \langle \lambda, g(x) \rangle + \langle \nu, h(x) \rangle \leq f(x)$
and therefore equality holds, thus
$\langle \lambda, g(x) \rangle = \sum_{i=1}^m \lambda_i\, g_i(x) = 0$
and as all terms are non-positive: $\lambda_i\, g_i(x) = 0$.
Since $x$ minimizes $L(\cdot, \lambda, \nu)$, the derivative must vanish:
$\frac{\partial L(x, \lambda, \nu)}{\partial x} = \frac{\partial f(x)}{\partial x} + \Big\langle \lambda, \frac{\partial g(x)}{\partial x} \Big\rangle + \Big\langle \nu, \frac{\partial h(x)}{\partial x} \Big\rangle = 0$
Machine Learning / 4. Digression: Quadratic Optimization
Karush-Kuhn-Tucker Conditions
Proof (ctd.): “⇐”
Now let $f$ be convex. Since $\lambda \geq 0$, $L(\cdot, \lambda, \nu)$ is convex.
As its first derivative vanishes at $x$, $x$ minimizes $L(\cdot, \lambda, \nu)$, and thus:
$\bar{f}(\lambda, \nu) = L(x, \lambda, \nu) = f(x) + \langle \lambda, g(x) \rangle + \langle \nu, h(x) \rangle = f(x)$
Therefore $x$ is optimal for the problem and $\lambda, \nu$ are optimal for the dual problem.
Machine Learning / 4. Digression: Quadratic Optimization
Equality Constraints
The quadratic optimization problem with equality constraints is
minimize $\quad f(x) := \frac{1}{2} \langle x, Cx \rangle - \langle c, x \rangle$
w.r.t. $\quad h(x) := Ax - b = 0$
$\qquad\quad\ x \in \mathbb{R}^n$
(with $C \in \mathbb{R}^{n \times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$).
The KKT conditions for the optimal solution $x^*, \nu^*$ are:
$h(x^*) = Ax^* - b = 0$
$\frac{\partial f(x^*)}{\partial x} + \Big\langle \nu^*, \frac{\partial h(x^*)}{\partial x} \Big\rangle = Cx^* - c + A^T \nu^* = 0$
which can be written as a single system of linear equations
$\begin{pmatrix} C & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x^* \\ \nu^* \end{pmatrix} = \begin{pmatrix} c \\ b \end{pmatrix}$
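A small numerical sketch of this block system (added here; the matrices are made-up toy values):

```python
import numpy as np

C = np.array([[4., 1.], [1., 3.]])     # symmetric positive definite
c = np.array([1., 2.])
A = np.array([[1., 1.]])               # single equality constraint x1 + x2 = 1
b = np.array([1.])

n, m = C.shape[0], A.shape[0]
KKT = np.block([[C, A.T],
                [A, np.zeros((m, m))]])
sol = np.linalg.solve(KKT, np.concatenate([c, b]))
x_star, nu_star = sol[:n], sol[n:]
print(x_star, nu_star, A @ x_star - b)  # constraint residual is (numerically) zero
```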
Machine Learning / 4. Digression: Quadratic Optimization
Inequality Constraints
The quadratic optimization problem with inequality constraints is
minimize $\quad f(x) := \frac{1}{2} \langle x, Cx \rangle - \langle c, x \rangle$
w.r.t. $\quad g(x) := Ax - b \leq 0$
$\qquad\quad\ x \in \mathbb{R}^n$
(with $C \in \mathbb{R}^{n \times n}$ symmetric and positive definite, $c \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$).
Inequality constraints are more complex to solve.
But they can be reduced to a sequence of equality-constrained problems.
Machine Learning / 4. Digression: Quadratic Optimization
Inequality Constraints
At each point $x \in \mathbb{R}^n$ one distinguishes between
active constraints $g_i$ with $g_i(x) = 0$ and
inactive constraints $g_i$ with $g_i(x) < 0$.
Active set:
$I_0(x) := \{\, i \in \{1, \ldots, m\} \mid g_i(x) = 0 \,\}$
Inactive constraints stay inactive in a neighborhood of $x$ and can be neglected there.
Active constraints are equality constraints that identify points at the border of the feasible area.
We can restrict our attention to just the points at the actual border, i.e., use the equality constraints
$h_i(x) := g_i(x), \quad i \in I_0.$
Machine Learning / 4. Digression: Quadratic Optimization
Inequality Constraints
If an optimal point $x^*$ is found with optimal Lagrange multiplier $\nu^* \geq 0$:
$\frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\, \frac{\partial h_i(x^*)}{\partial x} = 0$
then $x^*$ with
$\lambda_i^* := \begin{cases} \nu_i^*, & i \in I_0 \\ 0, & \text{else} \end{cases}$
fulfills the KKT conditions of the original problem:
$\lambda_i^*\, g_i(x^*) = \begin{cases} \nu_i^*\, h_i(x^*) = 0, & i \in I_0 \\ 0 \cdot g_i(x^*) = 0, & \text{else} \end{cases}$
and
$\frac{\partial f(x^*)}{\partial x} + \Big\langle \lambda^*, \frac{\partial g(x^*)}{\partial x} \Big\rangle = \frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\, \frac{\partial h_i(x^*)}{\partial x} = 0$
Machine Learning / 4. Digression: Quadratic Optimization
Inequality Constraints
If the optimal point $x^*$ on the border has an optimal Lagrange multiplier $\nu^*$ with $\nu_i^* < 0$ for some $i \in I_0$,
$\frac{\partial f(x^*)}{\partial x} + \sum_{i \in I_0} \nu_i^*\, \frac{\partial h_i(x^*)}{\partial x} = 0$
then $f$ decreases along $h_i := g_i$, thus we can decrease $f$ by moving away from the border, i.e., by dropping constraint $i$.
Machine Learning / 4. Digression: Quadratic Optimization
Inequality Constraints
minimize-submanifold(target function f, inequality constraint function g):
    x := a random vector with g(x) ≤ 0
    I₀ := I₀(x) := { i | g_i(x) = 0 }
    do
        x* := argmin_x f(x) subject to g_i(x) = 0, i ∈ I₀
        while f(x*) < f(x) do
            α := max{ α ∈ [0, 1] | g(x + α(x* − x)) ≤ 0 }
            x := x + α(x* − x)
            I₀ := I₀(x)
            x* := argmin_x f(x) subject to g_i(x) = 0, i ∈ I₀
        od
        let ν* be the optimal Lagrange multiplier for x*
        if ν* ≥ 0 break fi
        choose i ∈ I₀ with ν*_i < 0
        I₀ := I₀ \ { i }
    while true
    return x
Machine Learning / 4. Digression: Quadratic Optimization
The dual problem for the maximum margin separating hyperplane is such a constrained quadratic optimization problem:
maximize $\quad L = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad\quad\ \alpha_i \geq 0$
Set
$f := -L$
$C_{i,j} := y_i y_j \langle x_i, x_j \rangle$
$c_i := 1$
$x_i := \alpha_i$
$A_i := (0, 0, \ldots, 0, -1, 0, \ldots, 0)$ (with the $-1$ at column $i$), $i = 1, \ldots, n$
$b_i := 0$
$h(x) := \sum_{i=1}^n \alpha_i y_i$
Machine Learning / 4. Digression: Quadratic Optimization
Example
Find a maximum margin separating hyperplane for the following data:
x_1  x_2    y
 1    1    −1
 3    3    +1
 4    3    +1
[Figure: the three data points plotted in the $(x_1, x_2)$ plane]
Machine Learning / 4. Digression: Quadratic Optimization
Example
$C = (y_i y_j \langle x_i, x_j \rangle)_{i,j} = \begin{pmatrix} 2 & -6 & -7 \\ -6 & 18 & 21 \\ -7 & 21 & 25 \end{pmatrix}, \quad c = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},$
$A = \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix}, \quad b = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$
$h(\alpha) = \langle \alpha, y \rangle = -\alpha_1 + \alpha_2 + \alpha_3$
As the equality constraint $h$ always needs to be met, it can be added to $C$:
$C' = \begin{pmatrix} C & y \\ y^T & 0 \end{pmatrix} = \begin{pmatrix} 2 & -6 & -7 & -1 \\ -6 & 18 & 21 & 1 \\ -7 & 21 & 25 & 1 \\ -1 & 1 & 1 & 0 \end{pmatrix},$
Machine Learning / 4. Digression: Quadratic Optimization
Example
Let us start with a random
$x = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix}$
that meets both constraints:
$g(x) = Ax - b = \begin{pmatrix} -2 \\ -1 \\ -1 \end{pmatrix} \leq 0$
$h(x) = \langle y, x \rangle = -2 + 1 + 1 = 0$
As none of the inequality constraints is active: $I_0(x) = \emptyset$.
Step 1: We have to solve
$C' \begin{pmatrix} x \\ \mu \end{pmatrix} = \begin{pmatrix} c \\ 0 \end{pmatrix}$
Machine Learning / 4. Digression: Quadratic Optimization
Example
This yields
$x^* = \begin{pmatrix} 0.5 \\ 1.5 \\ -1.0 \end{pmatrix}$
which does not fulfill the (inactive) inequality constraint $x_3 \geq 0$.
So we look for
$x + \alpha (x^* - x) = \begin{pmatrix} 2 \\ 1 \\ 1 \end{pmatrix} + \alpha \begin{pmatrix} -1.5 \\ 0.5 \\ -2 \end{pmatrix} \geq 0$
that fulfills all inequality constraints and has the largest step size $\alpha$. Obviously, $\alpha = 0.5$ is best and yields
$x := x + \alpha (x^* - x) = \begin{pmatrix} 1.25 \\ 1.25 \\ 0 \end{pmatrix}$
Machine Learning / 4. Digression: Quadratic Optimization
Example
Step 2: Now the third inequality constraint is active: $I_0(x) = \{3\}$.
$C'' = \begin{pmatrix} C & y & -e_3 \\ y^T & 0 & 0 \\ -e_3^T & 0 & 0 \end{pmatrix} = \begin{pmatrix} 2 & -6 & -7 & -1 & 0 \\ -6 & 18 & 21 & 1 & 0 \\ -7 & 21 & 25 & 1 & -1 \\ -1 & 1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 \end{pmatrix},$
and we have to solve
$C'' \begin{pmatrix} x \\ \mu \\ \nu \end{pmatrix} = \begin{pmatrix} c \\ 0 \\ 0 \end{pmatrix}$
which yields
$x^* = \begin{pmatrix} 0.25 \\ 0.25 \\ 0 \end{pmatrix}, \quad \nu^* = 0.5$
Machine Learning / 4. Digression: Quadratic Optimization
Example
As $x^*$ fulfills all constraints, it becomes the next $x$ (step size $\alpha = 1$):
$x := x^*$
As the Lagrange multiplier $\nu^* \geq 0$, the algorithm stops: $x$ is optimal.
So we found the optimal
$\alpha = \begin{pmatrix} 0.25 \\ 0.25 \\ 0 \end{pmatrix}$ (called $x$ in the algorithm!)
and can compute
$\beta = \sum_{i=1}^n \alpha_i y_i x_i = 0.25 \cdot (-1) \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} + 0.25 \cdot (+1) \cdot \begin{pmatrix} 3 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}$
$\beta_0$ can be computed from the original constraints of the points with $\alpha_i > 0$, which have to be sharp, i.e.,
$y_1 (\beta_0 + \langle \beta, x_1 \rangle) = 1 \quad\Rightarrow\quad \beta_0 = y_1 - \langle \beta, x_1 \rangle = -1 - \Big\langle \begin{pmatrix} 0.5 \\ 0.5 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix} \Big\rangle = -2$
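As a quick numerical cross-check added here (not part of the slides):

```python
import numpy as np

X = np.array([[1., 1.], [3., 3.], [4., 3.]])
y = np.array([-1., 1., 1.])
alpha = np.array([0.25, 0.25, 0.0])   # solution found by the active-set steps above

beta = (alpha * y) @ X                # -> (0.5, 0.5)
beta0 = y[0] - beta @ X[0]            # sharp constraint of a support vector -> -2
print(beta, beta0, y * (beta0 + X @ beta))   # margins >= 1, equal to 1 for the support vectors
```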
Machine Learning / 5. Non-separable Problems
Optimal Hyperplane
Inseparable problems can be modeled by allowing some points to be on the wrong side of the hyperplane.
Hyperplanes are considered better
(i) the fewer points are on the wrong side and
(ii) the closer these points are to the hyperplane
(modeled by slack variables $\xi_i$).
minimize $\quad \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^n \xi_i$
w.r.t. $\quad y_i (\beta_0 + \langle \beta, x_i \rangle) \geq 1 - \xi_i, \quad i = 1, \ldots, n$
$\qquad\quad\ \xi_i \geq 0$
$\qquad\quad\ \beta \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}$
This problem also is a convex optimization problem (quadratic target function with linear inequality constraints).
Machine Learning / 5. Non-separable Problems
Dual Problem
Compute again the dual problem:
$L := \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big( y_i (\beta_0 + \langle \beta, x_i \rangle) - (1 - \xi_i) \big) - \sum_{i=1}^n \mu_i \xi_i$
w.r.t. $\quad \alpha_i \geq 0$
$\qquad\quad\ \mu_i \geq 0$
For an extremum it is required that
$\frac{\partial L}{\partial \beta} = \beta - \sum_{i=1}^n \alpha_i y_i x_i \overset{!}{=} 0 \quad\Rightarrow\quad \beta = \sum_{i=1}^n \alpha_i y_i x_i$
and
$\frac{\partial L}{\partial \beta_0} = -\sum_{i=1}^n \alpha_i y_i \overset{!}{=} 0$
and
$\frac{\partial L}{\partial \xi_i} = \gamma - \alpha_i - \mu_i \overset{!}{=} 0 \quad\Rightarrow\quad \alpha_i = \gamma - \mu_i$
Machine Learning / 5. Non-separable Problems
Dual Problem
Inserting
$\beta = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \sum_{i=1}^n \alpha_i y_i = 0, \qquad \alpha_i = \gamma - \mu_i$
into
$L := \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big( y_i (\beta_0 + \langle \beta, x_i \rangle) - (1 - \xi_i) \big) - \sum_{i=1}^n \mu_i \xi_i$
yields the dual problem
$L = \frac{1}{2} \Big\langle \sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j \Big\rangle - \sum_{i=1}^n \alpha_i \Big( y_i \big( \beta_0 + \big\langle \sum_{j=1}^n \alpha_j y_j x_j,\ x_i \big\rangle \big) - (1 - \xi_i) \Big) + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \mu_i \xi_i$
Machine Learning / 5. Non-separable Problems
Dual Problem
$L = \frac{1}{2} \Big\langle \sum_{i=1}^n \alpha_i y_i x_i,\ \sum_{j=1}^n \alpha_j y_j x_j \Big\rangle - \sum_{i=1}^n \alpha_i \Big( y_i \big( \beta_0 + \big\langle \sum_{j=1}^n \alpha_j y_j x_j,\ x_i \big\rangle \big) - (1 - \xi_i) \Big) + \gamma \sum_{i=1}^n \xi_i - \sum_{i=1}^n \mu_i \xi_i$
$= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i - \sum_{i=1}^n \alpha_i y_i \beta_0 - \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^n \alpha_i \xi_i + \sum_{i=1}^n \alpha_i \xi_i$
$= -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i$
Machine Learning / 5. Non-separable Problems
Dual Problem
The dual problem is
maximize $\quad L = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^n \alpha_i$
w.r.t. $\quad \sum_{i=1}^n \alpha_i y_i = 0$
$\qquad\quad\ \alpha_i \leq \gamma$
$\qquad\quad\ \alpha_i \geq 0$
with much simpler constraints.
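Compared with the separable case, only the additional box constraint $0 \leq \alpha_i \leq \gamma$ changes. In the scipy sketch shown earlier this amounts to replacing the bounds (again just an illustration, with made-up data containing one overlapping point):

```python
import numpy as np
from scipy.optimize import minimize

def soft_margin_dual(X, y, gamma):
    """Same dual as before, except for the box constraint 0 <= alpha_i <= gamma."""
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
                   x0=np.zeros(len(y)),
                   bounds=[(0.0, gamma)] * len(y),
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    return res.x

X = np.array([[1., 1.], [3., 3.], [4., 3.], [2.5, 2.5]])   # last point lies on the wrong side
y = np.array([-1., 1., 1., -1.])
print(np.round(soft_margin_dual(X, y, gamma=1.0), 3))      # its alpha typically hits the bound gamma
```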
Machine Learning / 6. Support Vectors and Kernels
Support Vectors / Separable Case
For points on the right side of the hyperplane (i.e., if a constraint holds),
$y_i (\beta_0 + \langle \beta, x_i \rangle) > 1,$
$L$ is maximized by $\alpha_i = 0$: $x_i$ is irrelevant.
For points on the wrong side of the hyperplane (i.e., if a constraint is violated),
$y_i (\beta_0 + \langle \beta, x_i \rangle) < 1,$
$L$ is maximized for $\alpha_i \to \infty$.
For separable data, $\beta$ and $\beta_0$ need to be changed to make the constraint hold.
For points on the margin, i.e.,
$y_i (\beta_0 + \langle \beta, x_i \rangle) = 1,$
$\alpha_i$ is some finite value.
Machine Learning / 6. Support Vectors and Kernels
Support Vectors / Inseparable Case
For points on the right side of the hyperplane,
$y_i (\beta_0 + \langle \beta, x_i \rangle) > 1, \quad \xi_i = 0,$
$L$ is maximized by $\alpha_i = 0$: $x_i$ is irrelevant.
For points in the margin as well as on the wrong side of the hyperplane,
$y_i (\beta_0 + \langle \beta, x_i \rangle) = 1 - \xi_i, \quad \xi_i > 0,$
$\alpha_i$ is some finite value.
For points on the margin, i.e.,
$y_i (\beta_0 + \langle \beta, x_i \rangle) = 1, \quad \xi_i = 0,$
$\alpha_i$ is some finite value.
The data points $x_i$ with $\alpha_i > 0$ are called support vectors.
Machine Learning / 6. Support Vectors and Kernels
Decision Function
Due to
$\hat{\beta} = \sum_{i=1}^n \hat{\alpha}_i y_i x_i,$
the decision function
$\hat{y}(x) = \operatorname{sign}\big( \hat{\beta}_0 + \langle \hat{\beta}, x \rangle \big)$
can be expressed using the training data:
$\hat{y}(x) = \operatorname{sign}\Big( \hat{\beta}_0 + \sum_{i=1}^n \hat{\alpha}_i y_i \langle x_i, x \rangle \Big)$
Only support vectors are required, as only for them $\hat{\alpha}_i \neq 0$.
Both the learning problem and the decision function can be expressed using an inner product / a similarity measure / a kernel $\langle x, x' \rangle$.
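A compact sketch of such a kernelized decision function (added for illustration; the kernel, the support vectors, their coefficients, and the offset are assumed to be given, here taken from the worked example):

```python
import numpy as np

def decision_function(x, support_vectors, alphas, ys, beta0, kernel):
    """sign( beta0 + sum_i alpha_i y_i K(x_i, x) ), summed over support vectors only."""
    value = beta0 + sum(a * y_i * kernel(x_i, x)
                        for a, y_i, x_i in zip(alphas, ys, support_vectors))
    return np.sign(value)

linear = lambda u, v: float(np.dot(u, v))
svs = [np.array([1., 1.]), np.array([3., 3.])]   # support vectors of the worked example
print(decision_function(np.array([0., 0.]), svs, [0.25, 0.25], [-1., 1.], -2.0, linear))  # -1.0
print(decision_function(np.array([4., 4.]), svs, [0.25, 0.25], [-1., 1.], -2.0, linear))  # +1.0
```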
Machine Learning / 6. Support Vectors and Kernels
High-Dimensional Embeddings / The “kernel trick”
Example:
we map points from $\mathbb{R}^2$ into the higher dimensional space $\mathbb{R}^6$ via
$h: \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} 1 \\ \sqrt{2}\, x_1 \\ \sqrt{2}\, x_2 \\ x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix}$
Then the inner product
$\Big\langle h\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, h\begin{pmatrix} x_1' \\ x_2' \end{pmatrix} \Big\rangle = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' = (1 + x_1 x_1' + x_2 x_2')^2$
can be computed without having to compute $h$ explicitly!
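A tiny numerical check of this identity (added here; any two points in $\mathbb{R}^2$ will do):

```python
import numpy as np

def embed(x):
    """Explicit feature map h: R^2 -> R^6 for the degree-2 polynomial kernel."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = embed(x) @ embed(xp)             # inner product in the embedding space
rhs = (1.0 + x @ xp) ** 2              # kernel evaluated directly in R^2
print(lhs, rhs, np.isclose(lhs, rhs))  # identical up to rounding
```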
Machine Learning / 6. Support Vectors and Kernels
Popular Kernels
Some popular kernels are:
linear kernel:
$K(x, x') := \langle x, x' \rangle := \sum_{i=1}^n x_i x_i'$
polynomial kernel of degree $d$:
$K(x, x') := (1 + \langle x, x' \rangle)^d$
radial basis kernel / Gaussian kernel:
$K(x, x') := e^{-\frac{\|x - x'\|^2}{c}}$
neural network kernel / sigmoid kernel:
$K(x, x') := \tanh(a \langle x, x' \rangle + b)$
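These four kernels are straightforward to write down (an illustrative sketch added here; the parameter names d, c, a, b mirror the formulas above):

```python
import numpy as np

def linear_kernel(x, xp):
    return float(np.dot(x, xp))

def polynomial_kernel(x, xp, d=2):
    return (1.0 + np.dot(x, xp)) ** d

def rbf_kernel(x, xp, c=1.0):
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(xp)) ** 2) / c))

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    return float(np.tanh(a * np.dot(x, xp) + b))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp, d=3),
      rbf_kernel(x, xp, c=2.0), sigmoid_kernel(x, xp))
```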
Machine Learning / 7. Support Vector Regression
Optimal Hyperplanes as Regularization
Optimal separating hyperplanes
minimize $\quad \frac{1}{2}\|\beta\|^2 + \gamma \sum_{i=1}^n \xi_i$
w.r.t. $\quad y_i (\beta_0 + \langle \beta, x_i \rangle) \geq 1 - \xi_i, \quad i = 1, \ldots, n$
$\qquad\quad\ \xi_i \geq 0$
$\qquad\quad\ \beta \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}$
can also be understood as regularization “error + complexity”:
minimize $\quad \gamma \sum_{i=1}^n \big[ 1 - y_i (\beta_0 + \langle \beta, x_i \rangle) \big]_+ + \frac{1}{2}\|\beta\|^2$
w.r.t. $\quad \beta \in \mathbb{R}^p, \ \beta_0 \in \mathbb{R}$
where the positive part is defined as
$[x]_+ := \begin{cases} x, & \text{if } x \geq 0 \\ 0, & \text{else} \end{cases}$
Machine Learning / 7. Support Vector Regression
Optimal Hyperplanes as Regularization / Error functions
Specific to the SVM model is the error function (also often called loss function):
model | error function | minimizing function
logistic regression | negative binomial log-likelihood: $\log(1 + e^{-y f(x)})$ | $f(x) = \log \frac{p(y = +1 \mid x)}{p(y = -1 \mid x)}$
LDA | squared error: $(y - f(x))^2$ | $f(x) = p(y = +1 \mid x) - p(y = -1 \mid x)$
SVM | hinge loss: $[1 - y f(x)]_+$ | $f(x) = +1$ if $p(y = +1 \mid x) \geq \frac{1}{2}$, $-1$ else
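For illustration (an addition, not from the slides), the three error functions can be written as plain functions of y and f(x):

```python
import numpy as np

def binomial_deviance(y, f):
    """Negative binomial log-likelihood, the error criterion of logistic regression."""
    return np.log1p(np.exp(-y * f))

def squared_error(y, f):
    return (y - f) ** 2

def hinge_loss(y, f):
    """The SVM error [1 - y f(x)]_+ ."""
    return np.maximum(0.0, 1.0 - y * f)

yf = np.linspace(-3.0, 3.0, 7)   # values of y * f(x), as on the plot below
print(binomial_deviance(1.0, yf))
print(squared_error(1.0, yf))
print(hinge_loss(1.0, yf))
```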
Machine Learning / 7. Support Vector Regression
Optimal Hyperplanes as Regularization / Error functions
[Figure: the three error functions (negative binomial log-likelihood, squared error, SVM hinge loss) plotted as functions of $y f(x)$]
Machine Learning / 7. Support Vector Regression
SVM Regression / Error functions
In regression, the squared error is sometimes dominated by outliers, i.e., points with a large residual, due to the quadratic dependency:
$\operatorname{err}(y, \hat{y}) := (y - \hat{y})^2$
Therefore, robust error functions such as the Huber error have been developed that keep the quadratic form near zero, but are linear for larger values:
$\operatorname{err}_c(y, \hat{y}) := \begin{cases} \frac{(y - \hat{y})^2}{2}, & \text{if } |y - \hat{y}| < c \\ c\, |y - \hat{y}| - \frac{c^2}{2}, & \text{else} \end{cases}$
SVM regression uses the $\epsilon$-insensitive error:
$\operatorname{err}_\epsilon(y, \hat{y}) := \begin{cases} 0, & \text{if } |y - \hat{y}| < \epsilon \\ |y - \hat{y}| - \epsilon, & \text{else} \end{cases}$
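Both robust error functions are easy to express directly (an illustrative addition; c and eps correspond to the constants in the formulas):

```python
import numpy as np

def huber_error(y, y_hat, c=2.0):
    r = np.abs(y - y_hat)
    return np.where(r < c, 0.5 * r ** 2, c * r - 0.5 * c ** 2)

def eps_insensitive_error(y, y_hat, eps=2.0):
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

y_hat = np.array([-4.0, -1.0, 0.0, 1.5, 5.0])   # predictions for a true value y = 0
print(huber_error(0.0, y_hat))
print(eps_insensitive_error(0.0, y_hat))
```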
Machine Learning / 7. Support Vector Regression
SVM Regression / Error functions
[Figure: squared error, $\epsilon$-insensitive error ($\epsilon = 2$), and Huber error ($c = 2$) plotted against the residual $y - f(x)$]
Machine Learning / 7. Support Vector Regression
SVM Regression
Any of these error functions can be used to find optimal parameters for the linear regression model
$f(X) := \beta_0 + \langle \beta, X \rangle$
by solving the optimization problem
minimize $\quad \sum_{i=1}^n \operatorname{err}(y_i, \hat{\beta}_0 + \langle \hat{\beta}, x_i \rangle) + \frac{\lambda}{2}\|\hat{\beta}\|^2$
Machine Learning / 7. Support Vector Regression
SVM Regression
For the $\epsilon$-insensitive error, the solution can be shown to have the form
$\hat{\beta} = \sum_{i=1}^n (\hat{\alpha}_i^* - \hat{\alpha}_i)\, x_i$
$\hat{f}(x) = \hat{\beta}_0 + \sum_{i=1}^n (\hat{\alpha}_i^* - \hat{\alpha}_i)\, \langle x_i, x \rangle$
where $\hat{\alpha}_i$ and $\hat{\alpha}_i^*$ are the solutions of the quadratic problem
min $\quad \epsilon \sum_{i=1}^n (\alpha_i^* + \alpha_i) - \sum_{i=1}^n y_i (\alpha_i^* - \alpha_i) + \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) \langle x_i, x_j \rangle$
s.t. $\quad \alpha_i, \alpha_i^* \geq 0$
$\qquad\ \alpha_i, \alpha_i^* \leq \frac{1}{\lambda}$
$\qquad\ \sum_{i=1}^n (\alpha_i^* - \alpha_i) = 0$
$\qquad\ \alpha_i\, \alpha_i^* = 0$
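In practice this dual is rarely solved by hand. As an illustration added here (assuming scikit-learn is available), sklearn.svm.SVR exposes exactly these quantities, with C playing roughly the role of $1/\lambda$ and epsilon that of the insensitivity threshold $\epsilon$:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0.0, 4.0, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

model = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
print(model.dual_coef_)         # the differences alpha*_i - alpha_i (support vectors only)
print(model.support_vectors_)   # the x_i that enter the solution with nonzero coefficient
print(model.predict(X[:3]))
```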
Machine Learning / 7. Support Vector Regression
Summary (1/2)
• Binary classification problems with linear decision boundaries can be
rephrased as finding a separating hyperplane.
• In the linearly separable case, there are simple algorithms like perceptron learning to find such a separating hyperplane.
• If one requires the additional property that the hyperplane should have
maximal margin, i.e., maximal distance to the closest points of both
classes, then a quadratic optimization problem with inequality
constraints arises.
• Quadratic optimization problems without constraints as well as with
equality constraints can be solved by linear systems of equations.
Quadratic optimization problems with inequality constraints require
some more complex methods such as submanifold optimization (a
sequence of linear systems of equations).
Machine Learning / 7. Support Vector Regression
Summary (2/2)
• Optimal hyperplanes can also be formulated for the inseparable case
by allowing some points to be on the wrong side of the margin, but
penalize for their distance from the margin. This also can be
formulated as a quadratic optimization problem with inequality
constraints.
• The final decision function can be computed in terms of inner products of the query points with some of the data points (called support vectors), which makes it possible to bypass the explicit computation of high-dimensional embeddings.