# Lecture 11

Oct 16, 2013

CS434a/541a: Pattern Recognition
Prof. Olga Veksler
Lecture 11
Today

Support Vector Machines (SVM)

Introduction

Linear Discriminant

Linearly Separable Case

Linearly Non Separable Case

Kernel Trick

Non Linear Discriminant
SVM

Said to start in 1979 with Vladimir Vapnik's paper

Major developments throughout the 1990s

Elegant theory

Has good generalization properties

Has been applied to diverse problems very successfully in the last 10-15 years

One of the most important developments in pattern recognition in the last 10 years
Linear Discriminant Functions

A discriminant function is linear if it can be written as
$$g(x) = w^t x + w_0$$

$$g(x) > 0 \Rightarrow x \in \text{class 1}, \qquad g(x) < 0 \Rightarrow x \in \text{class 2}$$

[Figure: samples of two classes on axes x(1) and x(2)]
which separating hyperplane should we choose?
Linear Discriminant Functions

Training data is just a subset of all possible data

Suppose the hyperplane is close to sample $x_i$

If we see a new sample close to sample $i$, it is likely to be on the wrong side of the hyperplane

Poor generalization (performance on unseen data)

[Figure: hyperplane passing close to sample $x_i$, axes x(1) and x(2)]
Linear Discriminant Functions

Hyperplane as far as possible from any sample

New samples close to the old samples will be classified correctly

Good generalization

[Figure: hyperplane far from every sample, including $x_i$, axes x(1) and x(2)]
SVM

Idea: maximize distance to the closest example
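[Figure: two panels on axes x(1) and x(2), each marking the closest sample $x_i$; one hyperplane at a smaller distance from $x_i$, the other at a larger distance]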

For the optimal hyperplane

distance to the closest negative example = distance to
the closest positive example
SVM: Linearly Separable Case

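SVM: maximize the margin

The margin is twice the absolute value of the distance $b$ of the closest example to the separating hyperplane

Better generalization (performance on test data), both in practice and in theory

[Figure: separating hyperplane on axes x(1) and x(2) with the closest examples at distance $b$ on each side; the margin is $2b$]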
SVM: Linearly Separable Case

Support vectors are the samples closest to the separating hyperplane

They are the most difficult patterns to classify

The optimal hyperplane is completely defined by the support vectors

Of course, we do not know which samples are support vectors without finding the optimal hyperplane

[Figure: support vectors at distance $b$ from the separating hyperplane, axes x(1) and x(2)]
SVM: Formula for the Margin

$$g(x) = w^t x + w_0$$

The absolute distance between $x$ and the boundary $g(x) = 0$ is
$$\frac{|w^t x + w_0|}{\|w\|}$$

[Figure: sample $x$ at distance $g(x)/\|w\|$ from the boundary, axes x(1) and x(2)]

The distance is unchanged for the hyperplane $g_1(x) = \alpha g(x)$:
$$\frac{|\alpha w^t x + \alpha w_0|}{\|\alpha w\|} = \frac{|w^t x + w_0|}{\|w\|}$$

Let $x_i$ be an example closest to the boundary. Set
$$|w^t x_i + w_0| = 1$$

Now the largest margin hyperplane is unique
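The distance formula and its invariance under rescaling can be checked numerically. The sketch below is only an illustration: the hyperplane `w`, `w0`, the sample `x` and the helper names are hypothetical values chosen for this example, not taken from the lecture.

```python
import math

def g(w, w0, x):
    """Linear discriminant g(x) = w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def distance_to_boundary(w, w0, x):
    """Absolute distance |g(x)| / ||w|| from x to the hyperplane g(x) = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(g(w, w0, x)) / norm_w

# Hypothetical hyperplane and sample, chosen only for illustration.
w, w0 = [3.0, 4.0], -5.0
x = [3.0, 4.0]

d = distance_to_boundary(w, w0, x)  # |9 + 16 - 5| / 5

# Rescaling (w, w0) by a constant alpha leaves the distance unchanged.
alpha = 2.5
d_scaled = distance_to_boundary([alpha * wi for wi in w], alpha * w0, x)

# Normalise (w, w0) so that |g(x)| = 1 at x, treated here as the closest sample.
scale = 1.0 / abs(g(w, w0, x))
w_n, w0_n = [scale * wi for wi in w], scale * w0

print(d, d_scaled, abs(g(w_n, w0_n, x)))
```

After the normalisation, the distance from `x` to the boundary equals one over the norm of `w_n`, which is why fixing $|w^t x_i + w_0| = 1$ removes the scale ambiguity.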
Today

Continue Support Vector Machines (SVM)

Linear Discriminant

Linearly Separable Case

Linearly Non Separable Case

Kernel Trick

Non Linear Discriminant
SVM: Formula for the Margin

For uniqueness, set $|w^t x_i + w_0| = 1$ for any example $x_i$ closest to the boundary

Now the distance from the closest sample $x_i$ to $g(x) = 0$ is
$$\frac{|w^t x_i + w_0|}{\|w\|} = \frac{1}{\|w\|}$$

Thus the margin is
$$m = \frac{2}{\|w\|}$$

[Figure: closest samples at distance $1/\|w\|$ on each side of the hyperplane, total margin $2/\|w\|$, axes x(1) and x(2)]
SVM: Optimal Hyperplane

Maximize margin
$$m = \frac{2}{\|w\|}$$

subject to constraints
$$\begin{cases} w^t x_i + w_0 \ge 1 & \text{if } x_i \text{ is a positive example} \\ w^t x_i + w_0 \le -1 & \text{if } x_i \text{ is a negative example} \end{cases}$$

Let
$$\begin{cases} z_i = 1 & \text{if } x_i \text{ is a positive example} \\ z_i = -1 & \text{if } x_i \text{ is a negative example} \end{cases}$$

Can convert our problem to

minimize
$$J(w) = \frac{1}{2}\|w\|^2$$
constrained to
$$z_i (w^t x_i + w_0) \ge 1 \quad \forall i$$

$J(w)$ is a quadratic function, thus there is a single global minimum
SVM: Optimal Hyperplane

Use Kuhn-Tucker theorem to convert our problem to:

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j x_i^t x_j$$
constrained to
$$\alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$

$\alpha = \{\alpha_1, \dots, \alpha_n\}$ are new variables, one for each sample

Can rewrite $L_D(\alpha)$ using $n$ by $n$ matrix $H$:
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \begin{bmatrix} \alpha_1 & \cdots & \alpha_n \end{bmatrix} H \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}$$

where the value in the $i$th row and $j$th column of $H$ is
$$H_{ij} = z_i z_j x_i^t x_j$$
SVM: Optimal Hyperplane

Use Kuhn-Tucker theorem to convert our problem to:

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j x_i^t x_j$$
constrained to
$$\alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$

$\alpha = \{\alpha_1, \dots, \alpha_n\}$ are new variables, one for each sample

$L_D(\alpha)$ can be optimized by quadratic programming

$L_D(\alpha)$ is formulated in terms of $\alpha$; it depends on $w$ and $w_0$ only indirectly
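The step from the constrained minimization of $J(w)$ to the maximization of $L_D(\alpha)$ can be sketched with the standard Lagrangian argument. The symbol $L_P$ for the primal Lagrangian is notation assumed here for illustration; it does not appear in the lecture.

```latex
\begin{align*}
L_P(w, w_0, \alpha) &= \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ z_i \left( w^t x_i + w_0 \right) - 1 \right],
  \qquad \alpha_i \ge 0 \\
\frac{\partial L_P}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i z_i x_i \\
\frac{\partial L_P}{\partial w_0} = 0 \;&\Rightarrow\; \sum_{i=1}^{n} \alpha_i z_i = 0 \\
L_D(\alpha) &= \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j \, x_i^t x_j
\end{align*}
```

Substituting the two stationarity conditions into $L_P$ removes $w$ and $w_0$: the $w_0$ term vanishes because $\sum_i \alpha_i z_i = 0$, and the remaining terms combine to $\sum_i \alpha_i - \tfrac{1}{2}\|w\|^2$, which is the last line.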
SVM: Optimal Hyperplane

After finding the optimal $\alpha = \{\alpha_1, \dots, \alpha_n\}$

For every sample $i$, one of the following must hold

$\alpha_i = 0$ (sample $i$ is not a support vector)

$\alpha_i \ne 0$ and $z_i (w^t x_i + w_0) - 1 = 0$ (sample $i$ is a support vector)

Can find $w$ using
$$w = \sum_{i=1}^{n} \alpha_i z_i x_i$$

Can solve for $w_0$ using any $\alpha_i > 0$ and
$$\alpha_i \left[ z_i (w^t x_i + w_0) - 1 \right] = 0 \quad \Rightarrow \quad w_0 = \frac{1}{z_i} - w^t x_i$$

Final discriminant function:
$$g(x) = \left( \sum_{x_i \in S} \alpha_i z_i x_i \right)^t x + w_0$$

where $S$ is the set of support vectors
$$S = \{ x_i \mid \alpha_i \ne 0 \}$$
SVM: Optimal Hyperplane

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j x_i^t x_j$$
constrained to
$$\alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$

$L_D(\alpha)$ depends on the number of samples, not on the dimension of samples

Samples appear only through the dot products $x_i^t x_j$

This will become important when looking for a nonlinear discriminant function, as we will see soon
SVM: Example using Matlab

Class 1: [1,6], [1,10], [4,11]

Class 2: [5,2], [7,6], [10,4]

Let's pile all data into array $X$
$$X = \begin{bmatrix} 1 & 6 \\ 1 & 10 \\ 4 & 11 \\ 5 & 2 \\ 7 & 6 \\ 10 & 4 \end{bmatrix}$$

Pile the $z_i$'s into vector
$$Z = \begin{bmatrix} 1 & 1 & 1 & -1 & -1 & -1 \end{bmatrix}^t$$

Matrix $H$ with $H_{ij} = z_i z_j x_i^t x_j$; in Matlab use `H = (x*x').*(z*z')`
$$H = \begin{bmatrix} 37 & 61 & 70 & -17 & -43 & -34 \\ 61 & 101 & 114 & -25 & -67 & -50 \\ 70 & 114 & 137 & -42 & -94 & -84 \\ -17 & -25 & -42 & 29 & 47 & 58 \\ -43 & -67 & -94 & 47 & 85 & 94 \\ -34 & -50 & -84 & 58 & 94 & 116 \end{bmatrix}$$
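The construction of $H$ can be reproduced without Matlab. The following is a minimal sketch in plain Python (the language and the helper name `dot` are choices made for this illustration, not part of the lecture), using the six samples and labels from the slide.

```python
# Data from the slide: class 1 samples first (z = +1), then class 2 (z = -1).
X = [[1, 6], [1, 10], [4, 11], [5, 2], [7, 6], [10, 4]]
z = [1, 1, 1, -1, -1, -1]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# H[i][j] = z_i * z_j * (x_i . x_j), i.e. the elementwise product (x*x') .* (z*z').
n = len(X)
H = [[z[i] * z[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]

for row in H:
    print(row)
```

Entries that pair two samples of the same class are positive dot products, and entries that pair samples of different classes carry a minus sign from $z_i z_j$.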
SVM: Example using Matlab

Matlab expects quadratic programming to be stated in the canonical (standard) form, which is

minimize
$$L_D(\alpha) = 0.5\, \alpha^t H \alpha + f^t \alpha$$
constrained to
$$A \alpha \le a \quad \text{and} \quad B \alpha = b$$

where $A$, $B$, $H$ are $n$ by $n$ matrices and $f$, $a$, $b$ are vectors

Need to convert our optimization problem to canonical form

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \begin{bmatrix} \alpha_1 & \cdots & \alpha_n \end{bmatrix} H \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix}$$
constrained to
$$\alpha_i \ge 0 \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$
SVM: Example using Matlab

Multiply by -1 to convert to minimization:

minimize
$$L_D(\alpha) = -\sum_{i=1}^{n} \alpha_i + \frac{1}{2} \alpha^t H \alpha$$

Let $f = \begin{bmatrix} -1 & \cdots & -1 \end{bmatrix}^t$ = `-ones(6,1)`, then can write

minimize
$$L_D(\alpha) = f^t \alpha + \frac{1}{2} \alpha^t H \alpha$$

First constraint is $\alpha_i \ge 0 \;\; \forall i$

Let
$$A = \begin{bmatrix} -1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & -1 \end{bmatrix}$$
= `-eye(6)`, and $a = \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix}^t$ = `zeros(6,1)`

Rewrite the first constraint in canonical form:
$$A \alpha \le a$$
SVM: Example using Matlab

Our second constraint is
$$\sum_{i=1}^{n} \alpha_i z_i = 0$$

Let
$$B = \begin{bmatrix} z_1 & \cdots & z_6 \\ 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix}$$
= `[z'; zeros(5,6)]` (with $z$ stored as a column vector), and $b = \begin{bmatrix} 0 & \cdots & 0 \end{bmatrix}^t$ = `zeros(6,1)`

Second constraint in canonical form is:
$$B \alpha = b$$

Thus our problem is in canonical form and can be solved by Matlab:

minimize
$$L_D(\alpha) = 0.5\, \alpha^t H \alpha + f^t \alpha$$
constrained to
$$A \alpha \le a \quad \text{and} \quad B \alpha = b$$
SVM: Example using Matlab

`alpha = quadprog(H+eye(6)*0.001, f, A, a, B, b)`

The term `eye(6)*0.001` is added for stability

Solution
$$\alpha = \begin{bmatrix} 0.036 & 0 & 0.039 & 0 & 0.076 & 0 \end{bmatrix}^t$$

The samples with nonzero $\alpha_i$ are the support vectors

Find $w$ using
$$w = \sum_{i=1}^{n} \alpha_i z_i x_i = (\alpha \,.\!*\, z)^t x = \begin{bmatrix} -0.33 \\ 0.20 \end{bmatrix}$$

Since $\alpha_1 > 0$, can find $w_0$ using
$$w_0 = \frac{1}{z_1} - w^t x_1 = 0.13$$
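The recovery of $w$ and $w_0$ from the multipliers can be sketched in plain Python (language and helper names are choices made for this illustration). The multipliers below are the rounded values printed on the slide, so the recovered numbers only approximately match the slide's $w$ and $w_0$, which were presumably computed from unrounded values.

```python
# Data and labels from the Matlab example.
X = [[1, 6], [1, 10], [4, 11], [5, 2], [7, 6], [10, 4]]
z = [1, 1, 1, -1, -1, -1]
# Rounded multipliers as printed on the slide.
alpha = [0.036, 0.0, 0.039, 0.0, 0.076, 0.0]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# w = sum_i alpha_i * z_i * x_i
w = [sum(alpha[i] * z[i] * X[i][k] for i in range(len(X))) for k in range(2)]

# w0 from any support vector (alpha_i > 0): w0 = 1/z_i - w^t x_i. Here i = 0.
w0 = 1.0 / z[0] - dot(w, X[0])

def g(x):
    """Linear discriminant g(x) = w^t x + w0."""
    return dot(w, x) + w0

# Decide class 1 (+1) if g(x) > 0, otherwise class 2 (-1).
predicted = [1 if g(x) > 0 else -1 for x in X]
print(w, w0, predicted)
```

Even with the rounded multipliers, every training sample lands on the correct side of the recovered hyperplane.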
SVM: Non Separable Case

Data is most likely not linearly separable, but a linear classifier may still be appropriate

[Figure: two classes on axes x(1) and x(2) with a few outliers]

Can apply SVM in the non linearly separable case

Data should be "almost" linearly separable for good performance
SVM: Non Separable Case

Use slack variables $\xi_1, \dots, \xi_n$ (one for each sample)

Change constraints from
$$z_i (w^t x_i + w_0) \ge 1 \quad \forall i$$
to
$$z_i (w^t x_i + w_0) \ge 1 - \xi_i \quad \forall i$$

$\xi_i$ is a measure of deviation from the ideal for sample $i$

$\xi_i > 1$: sample $i$ is on the wrong side of the separating hyperplane

$0 < \xi_i < 1$: sample $i$ is on the right side of the separating hyperplane but within the region of maximum margin

$\xi_i \le 0$ is the ideal case for sample $i$

[Figure: samples on axes x(1) and x(2); one sample with $\xi_i > 1$ on the wrong side of the hyperplane and one with $0 < \xi_i < 1$ inside the margin]
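The three cases for $\xi_i$ can be made concrete with a small sketch. It assumes the constraint $\xi_i \ge 0$, so the slack of a sample is the smallest non-negative value that satisfies $z_i(w^t x_i + w_0) \ge 1 - \xi_i$. The hyperplane and the samples are hypothetical values chosen for illustration, and the function names are not from the lecture.

```python
def slack(w, w0, x, z):
    """Smallest xi >= 0 such that z * (w^t x + w0) >= 1 - xi."""
    margin = z * (sum(wi * xi for wi, xi in zip(w, x)) + w0)
    return max(0.0, 1.0 - margin)

def describe(xi):
    """Map a slack value to the three cases listed on the slide."""
    if xi > 1:
        return "wrong side of the hyperplane"
    if xi > 0:
        # xi == 1 (exactly on the hyperplane) is grouped here by choice of this sketch.
        return "right side, but inside the margin"
    return "ideal"

# Hypothetical hyperplane g(x) = x(1), and hypothetical labelled samples.
w, w0 = [1.0, 0.0], 0.0
samples = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], 1), ([-3.0, 1.0], -1)]

slacks = [slack(w, w0, x, z) for x, z in samples]
for (x, z), xi in zip(samples, slacks):
    print(x, z, xi, describe(xi))
```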
SVM: Non Separable Case

Would like to minimize
$$J(w, \xi_1, \dots, \xi_n) = \frac{1}{2}\|w\|^2 + \beta \sum_{i=1}^{n} I(\xi_i > 0)$$
where the sum is the # of samples not in ideal location and
$$I(\xi_i > 0) = \begin{cases} 1 & \text{if } \xi_i > 0 \\ 0 & \text{if } \xi_i \le 0 \end{cases}$$

constrained to $z_i (w^t x_i + w_0) \ge 1 - \xi_i$ and $\xi_i \ge 0 \;\; \forall i$

$\beta$ is a constant which measures the relative weight of the first and second terms

If $\beta$ is small, we allow a lot of samples not in ideal position

If $\beta$ is large, we want to have very few samples not in ideal position
SVM: Non Separable Case

$$J(w, \xi_1, \dots, \xi_n) = \frac{1}{2}\|w\|^2 + \beta \sum_{i=1}^{n} I(\xi_i > 0)$$
where the sum is the # of examples not in ideal location

[Figure, left panel, axes x(1) and x(2): large $\beta$, few samples not in ideal position]

[Figure, right panel, axes x(1) and x(2): small $\beta$, a lot of samples not in ideal position]
SVM: Non Separable Case

$$J(w, \xi_1, \dots, \xi_n) = \frac{1}{2}\|w\|^2 + \beta \sum_{i=1}^{n} I(\xi_i > 0)$$
where the sum is the # of examples not in ideal location and
$$I(\xi_i > 0) = \begin{cases} 1 & \text{if } \xi_i > 0 \\ 0 & \text{if } \xi_i \le 0 \end{cases}$$

constrained to $z_i (w^t x_i + w_0) \ge 1 - \xi_i$ and $\xi_i \ge 0 \;\; \forall i$

Unfortunately this minimization problem is NP-hard due to the discontinuity of the functions $I(\xi_i)$
SVM: Non Separable Case

Instead, minimize
$$J(w, \xi_1, \dots, \xi_n) = \frac{1}{2}\|w\|^2 + \beta \sum_{i=1}^{n} \xi_i$$
where the sum is a measure of the # of misclassified examples

constrained to
$$\begin{cases} z_i (w^t x_i + w_0) \ge 1 - \xi_i & \forall i \\ \xi_i \ge 0 & \forall i \end{cases}$$

Can use Kuhn-Tucker theorem to convert to

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j x_i^t x_j$$
constrained to
$$0 \le \alpha_i \le \beta \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$

Find $w$ using
$$w = \sum_{i=1}^{n} \alpha_i z_i x_i$$

Solve for $w_0$ using any $0 < \alpha_i < \beta$ and
$$\alpha_i \left[ z_i (w^t x_i + w_0) - 1 \right] = 0$$
Non Linear Mapping

Cover’s theorem:

“pattern-classification problem cast in a high dimensional
space non-linearly is more likely to be linearly separable
than in a low-dimensional space”

One dimensional space, not linearly separable

Lift to two dimensional space with $\varphi(x) = (x, x^2)$

[Figure: samples on a one dimensional axis with ticks -3, -2, 0, 1, 2, 3, 5]
Non Linear Mapping

To solve a non linear classification problem with a linear classifier
1. Project data $x$ to high dimension using function $\varphi(x)$
2. Find a linear discriminant function for transformed data $\varphi(x)$
3. Final nonlinear discriminant function is $g(x) = w^t \varphi(x) + w_0$

[Figure: the one dimensional samples (axis ticks -3, -2, 0, 1, 2, 3, 5) lifted with $\varphi(x) = (x, x^2)$; decision regions R1 and R2 marked on the original axis]

In 2D, discriminant function is linear
$$g\left( \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix} \right) = \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} x^{(1)} \\ x^{(2)} \end{bmatrix} + w_0$$

In 1D, discriminant function is not linear
$$g(x) = w_1 x + w_2 x^2 + w_0$$
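The three steps above can be sketched in a few lines. The class labels, the weights `w` and the threshold `w0` below are hypothetical values picked for illustration (the slide does not give them); only the mapping $\varphi(x) = (x, x^2)$ comes from the lecture.

```python
def phi(x):
    """Lift a 1D sample to 2D: phi(x) = (x, x^2)."""
    return (x, x * x)

# Hypothetical linear discriminant in the lifted 2D space.
w, w0 = (2.0, -1.0), 6.0

def g(x):
    """g(x) = w^t phi(x) + w0; linear in phi(x), quadratic in x."""
    f = phi(x)
    return w[0] * f[0] + w[1] * f[1] + w0

# Hypothetical labelling: class 1 in the middle, class 2 on both sides.
class1 = [0, 1, 2, 3]
class2 = [-3, -2, 5]

for x in class1 + class2:
    print(x, phi(x), g(x), "class 1" if g(x) > 0 else "class 2")
```

No single threshold on the original axis separates the two hypothetical classes, yet one line in the lifted space does, which is the point of the mapping.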
Non Linear Mapping: Another Example
Non Linear SVM

Can use any linear classifier after lifting data into a higher dimensional space. However we will have to deal with the "curse of dimensionality"
1. poor generalization to test data
2. computationally expensive

SVM avoids the "curse of dimensionality" problems by
1. enforcing the largest margin, which permits good generalization. It can be shown that generalization in SVM is a function of the margin, independent of the dimensionality
2. performing computation in the higher dimensional case only implicitly, through the use of kernel functions
Non Linear SVM: Kernels

Recall SVM optimization

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j x_i^t x_j$$

Note this optimization depends on samples $x_i$ only through the dot product $x_i^t x_j$

If we lift $x_i$ to high dimension using $\varphi(x)$, need to compute the high dimensional product $\varphi(x_i)^t \varphi(x_j)$

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j \varphi(x_i)^t \varphi(x_j)$$

Idea: find kernel function $K(x_i, x_j)$ s.t. $K(x_i, x_j) = \varphi(x_i)^t \varphi(x_j)$
Non Linear SVM: Kernels

Then we only need to compute $K(x_i, x_j)$ instead of $\varphi(x_i)^t \varphi(x_j)$

"Kernel trick": do not need to perform operations in high dimensional space explicitly

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j \underbrace{\varphi(x_i)^t \varphi(x_j)}_{K(x_i, x_j)}$$
Non Linear SVM: Kernels

Suppose we have 2 features and $K(x, y) = (x^t y)^2$

Which mapping $\varphi(x)$ does it correspond to?

$$\begin{aligned} K(x, y) = (x^t y)^2 &= \left( \begin{bmatrix} x^{(1)} & x^{(2)} \end{bmatrix} \begin{bmatrix} y^{(1)} \\ y^{(2)} \end{bmatrix} \right)^2 \\ &= \left( x^{(1)} y^{(1)} + x^{(2)} y^{(2)} \right)^2 \\ &= \left( x^{(1)} y^{(1)} \right)^2 + 2 \left( x^{(1)} y^{(1)} \right) \left( x^{(2)} y^{(2)} \right) + \left( x^{(2)} y^{(2)} \right)^2 \\ &= \begin{bmatrix} (x^{(1)})^2 & \sqrt{2}\, x^{(1)} x^{(2)} & (x^{(2)})^2 \end{bmatrix} \begin{bmatrix} (y^{(1)})^2 & \sqrt{2}\, y^{(1)} y^{(2)} & (y^{(2)})^2 \end{bmatrix}^t \end{aligned}$$

Thus
$$\varphi(x) = \begin{bmatrix} (x^{(1)})^2 & \sqrt{2}\, x^{(1)} x^{(2)} & (x^{(2)})^2 \end{bmatrix}$$
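The identity $K(x, y) = \varphi(x)^t \varphi(y)$ derived above can be checked numerically. This is a minimal sketch; the two test vectors are arbitrary values chosen for illustration.

```python
import math

def kernel(x, y):
    """K(x, y) = (x^t y)^2 for two-feature samples."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit mapping phi(x) = [x1^2, sqrt(2) x1 x2, x2^2]."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

# Arbitrary test vectors (not from the lecture).
x, y = [1.0, 2.0], [3.0, -1.0]

lhs = kernel(x, y)                                  # computed in the original 2D space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))    # computed in the lifted 3D space
print(lhs, rhs)
```

The kernel needs one 2D dot product and a square, while the explicit route builds two 3D vectors first; the gap grows quickly with the number of features and the degree.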
Non Linear SVM: Kernels

How to choose kernel function $K(x_i, x_j)$?

$K(x_i, x_j)$ should correspond to the product $\varphi(x_i)^t \varphi(x_j)$ in a higher dimensional space

Mercer's condition tells us which kernel functions can be expressed as a dot product of two vectors

Some common choices:

Polynomial kernel
$$K(x_i, x_j) = \left( x_i^t x_j + 1 \right)^p$$

Gaussian radial basis kernel (data is lifted to infinite dimension)
$$K(x_i, x_j) = \exp\left( -\frac{1}{2\sigma^2} \| x_i - x_j \|^2 \right)$$
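Both kernels are short to write down in code. The following is a minimal sketch in plain Python; the function names and argument order are choices made for this illustration.

```python
import math

def polynomial_kernel(xi, xj, p):
    """K(xi, xj) = (xi^t xj + 1)^p."""
    return (sum(a * b for a, b in zip(xi, xj)) + 1) ** p

def gaussian_rbf_kernel(xi, xj, sigma):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

print(polynomial_kernel([1, -1], [1, -1], 2))
print(gaussian_rbf_kernel([1.0, 0.0], [0.0, 0.0], 1.0))
```

The Gaussian kernel equals 1 when the two samples coincide and decays towards 0 as they move apart, with $\sigma$ controlling how fast.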
Non Linear SVM

Search for separating hyperplane in high dimension
$$w \varphi(x) + w_0 = 0$$

Choose $\varphi(x)$ so that the first ("0"th) dimension is the augmented dimension with feature value fixed to 1
$$\varphi(x) = \begin{bmatrix} 1 & x^{(1)} & x^{(2)} & x^{(1)} x^{(2)} \end{bmatrix}^t$$

Threshold parameter $w_0$ gets folded into the weight vector $w$
$$\begin{bmatrix} w_0 & w \end{bmatrix} \begin{bmatrix} 1 \\ * \end{bmatrix} = 0$$
Non Linear SVM

Will not use notation $a = [w_0 \;\; w]$; we'll use the old notation $w$ and seek a hyperplane through the origin
$$w \varphi(x) = 0$$

If the first component of $\varphi(x)$ is not 1, the above is equivalent to saying that the hyperplane has to go through the origin in high dimension

This removes only one degree of freedom

But we have introduced many new degrees of freedom when we lifted the data in high dimension
Non Linear SVM Recipe

Start with data $x_1, \dots, x_n$ which lives in feature space of dimension $d$

Choose kernel $K(x_i, x_j)$ or function $\varphi(x_i)$ which takes sample $x_i$ to a higher dimensional space

Find the largest margin linear discriminant function in the higher dimensional space by using a quadratic programming package to solve:

maximize
$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j z_i z_j K(x_i, x_j)$$
constrained to
$$0 \le \alpha_i \le \beta \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i z_i = 0$$
Non Linear SVM Recipe

Weight vector $w$ in the high dimensional space:
$$w = \sum_{x_i \in S} \alpha_i z_i \varphi(x_i)$$
where $S$ is the set of support vectors
$$S = \{ x_i \mid \alpha_i \ne 0 \}$$

Linear discriminant function of largest margin in the high dimensional space:
$$g(\varphi(x)) = w^t \varphi(x) = \left( \sum_{x_i \in S} \alpha_i z_i \varphi(x_i) \right)^t \varphi(x)$$

Non linear discriminant function in the original space:
$$g(x) = \left( \sum_{x_i \in S} \alpha_i z_i \varphi(x_i) \right)^t \varphi(x) = \sum_{x_i \in S} \alpha_i z_i \varphi^t(x_i) \varphi(x) = \sum_{x_i \in S} \alpha_i z_i K(x_i, x)$$

Decide class 1 if $g(x) > 0$, otherwise decide class 2
Non Linear SVM

Nonlinear discriminant function
$$g(x) = \sum_{x_i \in S} \alpha_i z_i K(x_i, x)$$

The sum runs over the most important training samples, i.e. the support vectors

$\alpha_i$ is the weight of support vector $x_i$

$z_i = \mp 1$

$K(x_i, x)$ acts as an "inverse distance" from $x$ to support vector $x_i$, for example
$$K(x_i, x) = \exp\left( -\frac{1}{2\sigma^2} \| x_i - x \|^2 \right)$$
SVM Example: XOR Problem

Class 1: $x_1 = [1,-1]$, $x_2 = [-1,1]$

Class 2: $x_3 = [1,1]$, $x_4 = [-1,-1]$

Use polynomial kernel of degree 2:
$$K(x_i, x_j) = \left( x_i^t x_j + 1 \right)^2$$

This kernel corresponds to mapping
$$\varphi(x) = \begin{bmatrix} 1 & \sqrt{2}\, x^{(1)} & \sqrt{2}\, x^{(2)} & \sqrt{2}\, x^{(1)} x^{(2)} & (x^{(1)})^2 & (x^{(2)})^2 \end{bmatrix}^t$$

Need to maximize
$$L_D(\alpha) = \sum_{i=1}^{4} \alpha_i - \frac{1}{2} \sum_{i=1}^{4} \sum_{j=1}^{4} \alpha_i \alpha_j z_i z_j \left( x_i^t x_j + 1 \right)^2$$
constrained to
$$0 \le \alpha_i \;\; \forall i \quad \text{and} \quad \alpha_1 + \alpha_2 - \alpha_3 - \alpha_4 = 0$$
SVM Example: XOR Problem

Can rewrite
$$L_D(\alpha) = \sum_{i=1}^{4} \alpha_i - \frac{1}{2} \alpha^t H \alpha$$

where $\alpha = \begin{bmatrix} \alpha_1 & \alpha_2 & \alpha_3 & \alpha_4 \end{bmatrix}^t$ and
$$H = \begin{bmatrix} 9 & 1 & -1 & -1 \\ 1 & 9 & -1 & -1 \\ -1 & -1 & 9 & 1 \\ -1 & -1 & 1 & 9 \end{bmatrix}$$

Take derivative with respect to $\alpha$ and set it to 0
$$\frac{d}{d\alpha} L_D(\alpha) = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 9 & 1 & -1 & -1 \\ 1 & 9 & -1 & -1 \\ -1 & -1 & 9 & 1 \\ -1 & -1 & 1 & 9 \end{bmatrix} \alpha = 0$$

Solution to the above is $\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.125$, since every row of the matrix sums to 8

All samples are support vectors

The solution satisfies the constraints $0 \le \alpha_i \;\; \forall i$ and $\alpha_1 + \alpha_2 - \alpha_3 - \alpha_4 = 0$
SVM Example: XOR Problem

Weight vector $w$ is:
$$w = \sum_{i=1}^{4} \alpha_i z_i \varphi(x_i) = 0.125 \left( \varphi(x_1) + \varphi(x_2) - \varphi(x_3) - \varphi(x_4) \right) = \begin{bmatrix} 0 & 0 & 0 & -\tfrac{1}{\sqrt{2}} & 0 & 0 \end{bmatrix}$$
with
$$\varphi(x) = \begin{bmatrix} 1 & \sqrt{2}\, x^{(1)} & \sqrt{2}\, x^{(2)} & \sqrt{2}\, x^{(1)} x^{(2)} & (x^{(1)})^2 & (x^{(2)})^2 \end{bmatrix}^t$$

Thus the nonlinear discriminant function is:
$$g(x) = w \varphi(x) = \sum_{i=1}^{6} w_i \varphi_i(x) = -\tfrac{1}{\sqrt{2}} \left( \sqrt{2}\, x^{(1)} x^{(2)} \right) = -x^{(1)} x^{(2)}$$

Each training sample then satisfies $z_i\, g(x_i) = 1$, as required for a support vector
SVM Example: XOR Problem

$$g(x) = -x^{(1)} x^{(2)}$$

[Figure, left panel: the four samples in the original space, axes $x^{(1)}$ and $x^{(2)}$ with ticks at -1 and 1; decision boundaries are nonlinear]

[Figure, right panel: the samples in the feature space, axes $\sqrt{2}\, x^{(1)}$ and $\sqrt{2}\, x^{(1)} x^{(2)}$ with ticks at -2, -1, 1, 2; decision boundary is linear]
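The XOR example is small enough to check end to end. The sketch below (plain Python; helper names are choices made for this illustration) builds $H$ from the degree 2 polynomial kernel, solves the stationarity condition $H\alpha = 1$ using the fact that every row of $H$ has the same sum, and evaluates the kernel form of the discriminant, $g(x) = \sum_i \alpha_i z_i K(x_i, x)$.

```python
# XOR data from the slide: class 1 (z = +1) first, then class 2 (z = -1).
X = [[1, -1], [-1, 1], [1, 1], [-1, -1]]
z = [1, 1, -1, -1]

def K(u, v):
    """Polynomial kernel of degree 2: (u^t v + 1)^2."""
    return (sum(a * b for a, b in zip(u, v)) + 1) ** 2

n = len(X)
H = [[z[i] * z[j] * K(X[i], X[j]) for j in range(n)] for i in range(n)]

# Every row of H sums to the same value, so a constant vector solves H alpha = 1.
row_sums = [sum(row) for row in H]
alpha = [1.0 / row_sums[0]] * n

# Gradient of L_D at alpha: 1 - H alpha (should vanish at the stationary point).
grad = [1 - sum(H[i][j] * alpha[j] for j in range(n)) for i in range(n)]

def g(x):
    """Kernel form of the discriminant: sum_i alpha_i z_i K(x_i, x)."""
    return sum(alpha[i] * z[i] * K(X[i], x) for i in range(n))

print(H)
print(alpha, grad)
print([g(x) for x in X])
```

Evaluating `g` at any point returns minus the product of its two coordinates, so the sign of $g$ reproduces the XOR labelling without ever forming $\varphi(x)$ explicitly.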
SVM Summary

Based on nice theory

excellent generalization properties

objective function has no local minima

can be used to find non linear discriminant functions

Complexity of the classifier is characterized by the number
of support vectors rather than the dimensionality of the
transformed space