Support Vector Machine (SVM) Classification

Greg Grudic
Last Class
•Linear separating hyperplanes for binary classification
•Rosenblatt's Perceptron Algorithm
–Based on Gradient Descent
–Convergence theoretically guaranteed if data is linearly separable
•Infinite number of solutions
•For nonlinear data:
–Mapping data into a nonlinear space where it is linearly separable (or almost)
–However, convergence still not guaranteed…
Questions?
Today’s Lecture Goals
•Support Vector Machine (SVM) Classification
–Another algorithm for linear separating hyperplanes
A good text on SVMs: Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Support Vector Machine (SVM)
Classification
•Classification as a problem of finding optimal (canonical) linear hyperplanes.
•Optimal Linear Separating Hyperplanes:
–In Input Space
–In Kernel Space
•Can be non-linear
Linear Separating Hyper-Planes
How many lines can separate these points? Which line should we use?
[Figure: a two-class dataset with several candidate separating lines.]
Initial Assumption:
Linearly Separable Data
Linear Separating Hyper-Planes
[Figure: data in the $(x_1, x_2)$ plane. The separating hyperplane $\mathbf{w}\cdot\mathbf{x} + b = 0$ divides the half-space $\mathbf{w}\cdot\mathbf{x} + b > 0$ (labeled $y = +1$) from the half-space $\mathbf{w}\cdot\mathbf{x} + b < 0$ (labeled $y = -1$).]
Linear Separating Hyper-Planes
•Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
•Finding a separating hyperplane can be posed as a constraint satisfaction problem (CSP):
$$\forall i \in \{1, \ldots, N\}, \text{ find } \mathbf{w} \text{ and } b \text{ such that: } \quad \mathbf{w}\cdot\mathbf{x}_i + b \ge +1 \text{ if } y_i = +1, \qquad \mathbf{w}\cdot\mathbf{x}_i + b \le -1 \text{ if } y_i = -1$$
•Or, equivalently:
$$y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) - 1 \ge 0, \quad \forall i$$
•If data is linearly separable, there are an infinite number of hyperplanes that satisfy this CSP
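As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch that checks whether a hypothetical candidate $(\mathbf{w}, b)$ satisfies these constraints on made-up toy data:

```python
import numpy as np

# Made-up, linearly separable toy data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A hypothetical candidate hyperplane; many (w, b) satisfy the CSP
w = np.array([1.0, 1.0])
b = -1.0

# CSP constraints: y_i * (w . x_i + b) - 1 >= 0 for every i
constraint_values = y * (X @ w + b) - 1
print(constraint_values)               # per-example constraint slack
print(np.all(constraint_values >= 0))  # True => (w, b) separates the data
```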
The Margin of a Classifier
•Take any hyper-plane (P0) that separates
the data
•Put a parallel hyper-plane (P1) on a point in
class 1 closest to P0
•Put a second parallel hyper-plane (P2) on a
point in class -1 closest to P0
•The margin (M) is the perpendicular
distance between P1 and P2
Calculating the Margin of a Classifier
[Figure: hyperplanes P0, P1, and P2 in the $(x_1, x_2)$ plane.]
•P0: Any separating hyperplane
•P1: Parallel to P0, passing through the closest point in one class
•P2: Parallel to P0, passing through the closest point in the opposite class
•Margin (M): distance measured along a line perpendicular to P1 and P2
SVM Constraints on the Model Parameters
Model parameters $(\mathbf{w}, b)$ must be chosen such that, for $\mathbf{x}_1$ on P1 and $\mathbf{x}_2$ on P2:
$$\text{P1: } \mathbf{w}\cdot\mathbf{x}_1 + b = -1 \qquad \text{P2: } \mathbf{w}\cdot\mathbf{x}_2 + b = +1$$
For any P0, these constraints are always attainable.
Given the above, the linear separating boundary lies halfway between P1 and P2 and is given by:
$$\mathbf{w}\cdot\mathbf{x} + b = 0$$
Resulting Classifier:
$$\hat{y} = \mathrm{sgn}\left(\mathbf{w}\cdot\mathbf{x} + b\right)$$
Remember the signed distance from a point to a hyperplane:
Hyperplane defined by $(c, \mathbf{w})$:
$$d\left(\mathbf{x}_i, \text{hyperplane}\right) = \frac{c + \mathbf{w}\cdot\mathbf{x}_i}{\|\mathbf{w}\|}$$
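For instance (a made-up numeric example, not from the slides), with $c = -5$, $\mathbf{w} = (3, 4)$, and $\mathbf{x}_i = (3, 3)$:
$$d\left(\mathbf{x}_i, \text{hyperplane}\right) = \frac{-5 + (3)(3) + (4)(3)}{\sqrt{3^2 + 4^2}} = \frac{16}{5} = 3.2$$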
Calculating the Margin
For $\mathbf{x}_1$ on P1 and $\mathbf{x}_2$ on P2, the constraints give $\mathbf{w}\cdot\mathbf{x}_1 + b = -1$ and $\mathbf{w}\cdot\mathbf{x}_2 + b = +1$. Therefore, using the signed distance:
$$M = d\left(\mathbf{x}_2, \text{P1}\right) = \frac{\left(\mathbf{w}\cdot\mathbf{x}_2 + b\right) - \left(\mathbf{w}\cdot\mathbf{x}_1 + b\right)}{\|\mathbf{w}\|} = \frac{(+1) - (-1)}{\|\mathbf{w}\|}$$
Take the absolute value to get the unsigned margin:
$$M = \frac{2}{\|\mathbf{w}\|}$$
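A quick numerical sanity check of $M = 2/\|\mathbf{w}\|$ (a NumPy sketch with made-up values, not from the slides):

```python
import numpy as np

# Hypothetical hyperplane parameters (illustration only)
w = np.array([3.0, 4.0])
b = -5.0

# Margin from the formula
print(2.0 / np.linalg.norm(w))   # 0.4

# Cross-check: a point x1 on P1 (w.x + b = -1) and a point x2 on P2 (w.x + b = +1)
x1 = np.array([0.0, 1.0])        # 4 - 5 = -1
x2 = np.array([0.0, 1.5])        # 6 - 5 = +1
# Their separation measured along the unit normal w/||w|| equals the margin
print((x2 - x1) @ (w / np.linalg.norm(w)))  # 0.4
```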
Different P0's have Different Margins
[Figure: a sequence of different choices of P0, each with its parallel hyperplanes P1 and P2 and a different resulting margin.]
•P0: Any separating hyperplane
•P1: Parallel to P0, passing through the closest point in one class
•P2: Parallel to P0, passing through the closest point in the opposite class
•Margin (M): distance measured along a line perpendicular to P1 and P2
How Do SVMs Choose the Optimal Separating
Hyperplane (boundary)?
[Figure: the maximum-margin separating hyperplane with its parallel hyperplanes P1 and P2.]
•Find the $\mathbf{w}$ that maximizes the margin!
•Margin (M): distance measured along a line perpendicular to P1 and P2
$$\text{margin } (M) = \frac{2}{\|\mathbf{w}\|}$$
SVM: Constraint Optimization
Problem
•Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
•Minimize $\|\mathbf{w}\|^2$ subject to:
$$y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) - 1 \ge 0, \quad \forall i = 1, \ldots, N$$
The Lagrange Function Formulation is used to solve this
Minimization Problem
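Before the Lagrangian machinery, it may help to see this optimization solved directly by a generic convex solver. The sketch below uses CVXPY, an outside library not mentioned in the slides, on made-up toy data:

```python
import cvxpy as cp
import numpy as np

# Made-up, linearly separable toy data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin SVM primal: minimize ||w||^2 subject to y_i (w.x_i + b) >= 1
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)
print("margin =", 2.0 / np.linalg.norm(w.value))
```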
The Lagrange Function Formulation
For every constraint we introduce a Lagrange multiplier: $\alpha_i \ge 0$
The Lagrangian is then defined by:
$$L\left(\mathbf{w}, b, \boldsymbol{\alpha}\right) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i \left(\mathbf{w}\cdot\mathbf{x}_i + b\right) - 1 \right]$$
Where the primal variables are $(\mathbf{w}, b)$ and the dual variables are $(\alpha_1, \ldots, \alpha_N)$.
Goal: Minimize the Lagrangian w.r.t. the primal variables, and Maximize it w.r.t. the dual variables.
Derivation of the Dual Problem
•At the saddle point (extremum w.r.t. the primal variables):
$$\frac{\partial}{\partial \mathbf{w}} L\left(\mathbf{w}, b, \boldsymbol{\alpha}\right) = 0, \qquad \frac{\partial}{\partial b} L\left(\mathbf{w}, b, \boldsymbol{\alpha}\right) = 0$$
•This gives the conditions:
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
•Substitute these into $L\left(\mathbf{w}, b, \boldsymbol{\alpha}\right)$ to get the dual problem
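The substitution itself is not spelled out on the slides; carrying out the standard algebra with $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$ gives:
$$\begin{aligned}
L\left(\mathbf{w}, b, \boldsymbol{\alpha}\right)
&= \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i\cdot\mathbf{x}_j\right) - \sum_{i,j} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i\cdot\mathbf{x}_j\right) - b \sum_{i} \alpha_i y_i + \sum_{i} \alpha_i \\
&= \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i\cdot\mathbf{x}_j\right)
\end{aligned}$$
which is exactly the dual objective $W\left(\boldsymbol{\alpha}\right)$ on the next slide.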
Using the Lagrange Function Formulation,
we get the Dual Problem
•Maximize
$$W\left(\boldsymbol{\alpha}\right) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i \cdot \mathbf{x}_j\right)$$
•Subject to
$$\alpha_i \ge 0, \ i = 1, \ldots, N \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Properties of the Dual Problem
•Solving the Dual gives a solution to the
original constraint optimization problem
•For SVMs, the Dual problem is a Quadratic
Optimization Problem which has a globally
optimal solution
•Gives insights into the NON-Linear
formulation for SVMs
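Because the dual is a quadratic program, it can also be handed to a generic solver. A minimal CVXPY sketch (an outside library, with made-up toy data; a tiny ridge term is added only to keep the numerical positive-semidefinite check happy):

```python
import cvxpy as cp
import numpy as np

# Made-up, linearly separable toy data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j (x_i . x_j), plus a tiny ridge for numerical stability
Yx = y[:, None] * X
Q = Yx @ Yx.T + 1e-9 * np.eye(N)

a = cp.Variable(N)
objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, Q))
constraints = [a >= 0, y @ a == 0]
cp.Problem(objective, constraints).solve()

alpha = a.value
print(alpha)                                   # nonzero alpha_i mark support vectors
print(((alpha * y)[:, None] * X).sum(axis=0))  # w = sum_i alpha_i y_i x_i
```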
Support Vector Expansion (1)
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$
$$y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) > 1 \ \Rightarrow \ \alpha_i = 0 \ \rightarrow \ \mathbf{x}_i \text{ irrelevant}$$
OR
$$y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) = 1 \text{ (on margin)} \ \Rightarrow \ \mathbf{x}_i \text{ is a Support Vector}$$
$b$ is also computed from the optimal dual variables $\alpha_i$
Support Vector Expansion (2)
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$
$$f\left(\mathbf{x}\right) = \mathrm{sgn}\left(\mathbf{w}\cdot\mathbf{x} + b\right)$$
Substitute:
$$f\left(\mathbf{x}\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i \left(\mathbf{x}_i \cdot \mathbf{x}\right) + b\right)$$
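This expansion is exactly what trained SVM implementations expose. A sketch using scikit-learn (an outside library, not named in the slides) on made-up data; `dual_coef_` stores $\alpha_i y_i$ for the support vectors, `support_vectors_` stores the corresponding $\mathbf{x}_i$, and `intercept_` stores $b$:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data (illustration only)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.support_vectors_)  # the x_i on the margin
print(clf.dual_coef_)        # alpha_i * y_i for each support vector
print(clf.intercept_)        # b

# Rebuild f(x) = sgn( sum_i alpha_i y_i (x_i . x) + b ) for a new point
x_new = np.array([1.0, 1.0])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(np.sign(score), clf.predict([x_new]))  # the two should agree
```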
What are the Support Vectors?
[Figure: the maximized-margin separating hyperplane; the support vectors are the data points lying on the margin hyperplanes.]
Why do we want a model with only
a few SVs?
•Leaving out an example that does not become an SV gives the same solution!
•Theorem (Vapnik and Chervonenkis, 1974): Let $N_{SV}$ be the number of SVs obtained by training on $N$ examples randomly drawn from $P(X, Y)$, and $E$ be an expectation. Then
$$E\left[\text{Prob(test error)}\right] \le \frac{E\left[N_{SV}\right]}{N}$$
What Happens When Data is Not
Separable: Soft Margin SVM
Add a Slack Variable $\xi_i$:
$$\xi_i = \begin{cases} 0 & \text{if } \mathbf{x}_i \text{ is correctly classified} \\ \text{distance to margin} & \text{otherwise} \end{cases}$$
Soft Margin SVM: Constraint
Optimization Problem
•Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
•Minimize
$$\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N} \xi_i$$
subject to:
$$y_i\left(\mathbf{w}\cdot\mathbf{x}_i + b\right) \ge 1 - \xi_i, \quad \forall i = 1, \ldots, N$$
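The constant $C$ controls how heavily slack is penalized. A scikit-learn sketch (an outside library, with made-up overlapping data) showing how the number of support vectors changes with $C$:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping two-class data, so some slack variables must be nonzero
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors:", len(clf.support_),
          "training accuracy:", clf.score(X, y))
```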
Dual Problem (Non-separable data)
•Maximize
$$W\left(\boldsymbol{\alpha}\right) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i \cdot \mathbf{x}_j\right)$$
•Subject to
$$0 \le \alpha_i \le C, \ i = 1, \ldots, N \qquad \text{and} \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Same Decision Boundary
$$f\left(\mathbf{x}\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i \left(\mathbf{x}_i \cdot \mathbf{x}\right) + b\right)$$
Mapping into Nonlinear Space
Nonlinear Data?
Nonlinear SVMs
•KEY IDEA:
Note that both the decision boundary and dual
optimization formulation use dot products in input space
only!
$$f\left(\mathbf{x}\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i \left(\mathbf{x}_i \cdot \mathbf{x}\right) + b\right)$$
$$W\left(\boldsymbol{\alpha}\right) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \left(\mathbf{x}_i \cdot \mathbf{x}_j\right)$$
Both depend on the data only through the dot products $\left(\mathbf{x}_i \cdot \mathbf{x}\right)$ and $\left(\mathbf{x}_i \cdot \mathbf{x}_j\right)$.
Kernel Trick
Replace the inner product $\left(\mathbf{x}_i \cdot \mathbf{x}_j\right)$ with
$$K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \Phi\left(\mathbf{x}_i\right) \cdot \Phi\left(\mathbf{x}_j\right)$$
Can use the same algorithms in nonlinear kernel space!
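A tiny NumPy demonstration of the trick (an illustration, not from the slides): for the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$ on 2-D inputs, the explicit feature map is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, and the kernel value equals the dot product in that feature space without ever forming $\Phi$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.z)^2 in 2-D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    """Kernel evaluation: just a dot product in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(K(x, z))          # 1.0
print(phi(x) @ phi(z))  # same value, computed through the explicit map
```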
Nonlinear SVMs
Boundary:
$$f\left(\mathbf{x}\right) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K\left(\mathbf{x}_i, \mathbf{x}\right) + b\right)$$
Maximize:
$$W\left(\boldsymbol{\alpha}\right) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K\left(\mathbf{x}_i, \mathbf{x}_j\right)$$
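A quick scikit-learn sketch (an outside library, with synthetic data) of what the kernelized boundary buys: concentric circles cannot be separated by any linear hyperplane in input space, but an RBF-kernel SVM separates them easily:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in input space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor
print("RBF kernel accuracy:", rbf_svm.score(X, y))        # close to 1.0
```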
Need Mercer Kernels
$$K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \Phi\left(\mathbf{x}_i\right) \cdot \Phi\left(\mathbf{x}_j\right) = \Phi\left(\mathbf{x}_j\right) \cdot \Phi\left(\mathbf{x}_i\right) = K\left(\mathbf{x}_j, \mathbf{x}_i\right)$$
Gram (Kernel) Matrix
Training Data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
$$K = \begin{pmatrix} K\left(\mathbf{x}_1, \mathbf{x}_1\right) & \cdots & K\left(\mathbf{x}_1, \mathbf{x}_N\right) \\ \vdots & \ddots & \vdots \\ K\left(\mathbf{x}_N, \mathbf{x}_1\right) & \cdots & K\left(\mathbf{x}_N, \mathbf{x}_N\right) \end{pmatrix}$$
Properties:
•Positive Definite Matrix
•Symmetric
•Positive on diagonal
•N by N
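A NumPy sketch (an illustration with made-up data) that builds a Gaussian-kernel Gram matrix and checks these properties numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # N = 5 made-up points in 3-D

# Gaussian kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sigma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

print(K.shape)                                 # (5, 5): N by N
print(np.allclose(K, K.T))                     # symmetric
print(np.all(np.diag(K) > 0))                  # positive on the diagonal
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # no negative eigenvalues
```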
Commonly Used Mercer Kernels
•Polynomial:
$$K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \left(\mathbf{x}_i \cdot \mathbf{x}_j + c\right)^d$$
•Sigmoid:
$$K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \tanh\left(\kappa\left(\mathbf{x}_i \cdot \mathbf{x}_j\right) + \theta\right)$$
•Gaussian:
$$K\left(\mathbf{x}_i, \mathbf{x}_j\right) = \exp\left(-\frac{\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2}{2\sigma^2}\right)$$
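These three kernels are straightforward to implement directly; a short NumPy sketch (parameter defaults are arbitrary placeholders, not values from the slides):

```python
import numpy as np

def polynomial_kernel(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d

def sigmoid_kernel(xi, xj, kappa=0.01, theta=0.0):
    return np.tanh(kappa * (xi @ xj) + theta)

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(xi, xj), sigmoid_kernel(xi, xj), gaussian_kernel(xi, xj))
```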
MNIST: A SVM Success Story
•Handwritten character benchmark
–60,000 training and 10,000 testing examples
–Dimension d = 28 x 28
Results on Test Data
SVM used a polynomial kernel of degree 9.
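A rough scikit-learn sketch of this kind of experiment (an assumption-laden illustration, not the original benchmark setup: the dataset is fetched from OpenML, only a small subsample is used so training finishes quickly, and no preprocessing beyond pixel scaling is applied):

```python
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

# MNIST: 70,000 images of 28 x 28 = 784 pixels (download may take a while)
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# Small subsample so the quadratic-programming training stays tractable here
X_train, y_train = X[:5000], y[:5000]
X_test, y_test = X[60000:61000], y[60000:61000]

clf = SVC(kernel="poly", degree=9, coef0=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```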
SVM (Kernel) Model Structure