
Support Vector Machine (SVM) Classification

Greg Grudic

Last Class

• Linear separating hyperplanes for binary classification
• Rosenblatt's Perceptron Algorithm (a minimal sketch follows this list)
  – Based on gradient descent
  – Convergence theoretically guaranteed if the data is linearly separable
• Infinite number of solutions
• For nonlinear data:
  – Map the data into a nonlinear space where it is linearly separable (or almost)
  – However, convergence is still not guaranteed…
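For reference, a minimal numpy sketch of the perceptron update rule recapped above (the function name, learning rate, and epoch cap are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Rosenblatt's perceptron: update (w, b) on every misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):            # labels yi in {-1, +1}
            if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
                w += lr * yi * xi           # gradient-descent-style update
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                   # converged: all points separated
            break
    return w, b
```

If the data is linearly separable this loop terminates, but it stops at whichever separating hyperplane it finds first; nothing makes that hyperplane unique, which is the gap SVMs address.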


Questions?


Today's Lecture Goals

• Support Vector Machine (SVM) Classification
  – Another algorithm for linear separating hyperplanes

A good text on SVMs: Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.


Support Vector Machine (SVM) Classification

• Classification as a problem of finding optimal (canonical) linear hyperplanes.
• Optimal linear separating hyperplanes:
  – In input space
  – In kernel space (can be non-linear)


Linear Separating Hyper-Planes

How many lines can separate these points? Which line should we use?

[Figure: two classes of points with several candidate separating lines; one line that fails to separate them is marked "NO!"]


Initial Assumption: Linearly Separable Data


Linear Separating Hyper-Planes

[Figure: axes $x_1$ and $x_2$; the hyperplane $\mathbf{w} \cdot \mathbf{x} + b = 0$ separates the half-space $\mathbf{w} \cdot \mathbf{x} + b > 0$ (labeled $y = +1$) from the half-space $\mathbf{w} \cdot \mathbf{x} + b < 0$ (labeled $y = -1$).]


Linear Separating Hyper-Planes

• Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
• Finding a separating hyperplane can be posed as a constraint satisfaction problem (CSP): find $\mathbf{w}$ and $b$ such that, $\forall i \in \{1, \ldots, N\}$:
$$\mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \ \text{ if } y_i = +1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \ \text{ if } y_i = -1$$
• Or, equivalently (a numerical check of this condition is sketched below):
$$y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \quad \forall i$$
• If the data is linearly separable, there are an infinite number of hyperplanes that satisfy this CSP
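The equivalent condition is easy to check numerically. A minimal numpy sketch (the toy points and candidate $(\mathbf{w}, b)$ values are made up for illustration):

```python
import numpy as np

def satisfies_csp(w, b, X, y):
    """True iff y_i * (w . x_i + b) - 1 >= 0 holds for every training point."""
    return bool(np.all(y * (X @ w + b) - 1 >= 0))

X = np.array([[2.0, 2.0], [-2.0, -2.0]])   # two trivially separable points
y = np.array([+1, -1])
print(satisfies_csp(np.array([1.0, 1.0]), 0.0, X, y))   # True
print(satisfies_csp(np.array([0.1, 0.1]), 0.0, X, y))   # False: inside the unit margin
```

Note that any feasible $(\mathbf{w}, b)$ can be scaled up and remain feasible, which is one way to see that the CSP has infinitely many solutions.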


The Margin of a Classifier

• Take any hyper-plane (P0) that separates the data
• Put a parallel hyper-plane (P1) on a point in class 1 closest to P0
• Put a second parallel hyper-plane (P2) on a point in class -1 closest to P0
• The margin (M) is the perpendicular distance between P1 and P2


Calculating the Margin of a Classifier

• P0: any separating hyperplane
• P1: parallel to P0, passing through the closest point in one class
• P2: parallel to P0, passing through the closest point in the opposite class
• Margin (M): distance measured along a line perpendicular to P1 and P2

SVM Constraints on the Model Parameters

Model parameters $(\mathbf{w}, b)$ must be chosen such that, for $\mathbf{x}_1$ on P1 and $\mathbf{x}_2$ on P2:


$$\text{P1: } \mathbf{w} \cdot \mathbf{x}_1 + b = -1, \qquad \text{P2: } \mathbf{w} \cdot \mathbf{x}_2 + b = +1$$

For any P0, these constraints are always attainable.

Given the above, the linear separating boundary lies halfway between P1 and P2 and is given by:
$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

Resulting classifier:
$$\hat{y} = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$$

Remember the signed distance from a point to a hyperplane:


For a hyperplane defined by $(c, \mathbf{w})$:
$$d(\mathbf{x}_i, \text{hyperplane}) = \frac{c + \mathbf{w} \cdot \mathbf{x}_i}{\|\mathbf{w}\|}, \qquad \|\mathbf{w}\| = \sqrt{\sum_{k=1}^{d} w_k^2}$$

Calculating the Margin


For $\mathbf{x}_1$ on P1 and $\mathbf{x}_2$ on P2, the signed distances to the boundary are:
$$d(\mathbf{x}_1) = \frac{\mathbf{w} \cdot \mathbf{x}_1 + b}{\|\mathbf{w}\|} = \frac{-1}{\|\mathbf{w}\|}, \qquad d(\mathbf{x}_2) = \frac{\mathbf{w} \cdot \mathbf{x}_2 + b}{\|\mathbf{w}\|} = \frac{+1}{\|\mathbf{w}\|}$$

Therefore:
$$M = d(\mathbf{x}_2) - d(\mathbf{x}_1) = \frac{(+1) - (-1)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$

Take the absolute value to get the unsigned margin:
$$M = \frac{2}{\|\mathbf{w}\|}$$
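A quick worked example with made-up numbers: for $\mathbf{w} = (3, 4)$,

$$\|\mathbf{w}\| = \sqrt{3^2 + 4^2} = 5, \qquad M = \frac{2}{\|\mathbf{w}\|} = \frac{2}{5} = 0.4,$$

so shrinking $\|\mathbf{w}\|$ widens the margin.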


Different P0's have Different Margins

• P0: any separating hyperplane
• P1: parallel to P0, passing through the closest point in one class
• P2: parallel to P0, passing through the closest point in the opposite class
• Margin (M): distance measured along a line perpendicular to P1 and P2

[Figure sequence: three different choices of P0, each yielding a different margin.]


How Do SVMs Choose the Optimal Separating Hyperplane (Boundary)?

• Find the $\mathbf{w}$ that maximizes the margin!
$$\text{margin } (M) = \frac{2}{\|\mathbf{w}\|}$$


SVM: Constraint Optimization Problem

• Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
• Minimize $\|\mathbf{w}\|^2$ subject to:
$$y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \quad \forall i = 1, \ldots, N$$
(Maximizing the margin $M = 2/\|\mathbf{w}\|$ is equivalent to minimizing $\|\mathbf{w}\|^2$.)

The Lagrange Function Formulation is used to solve this minimization problem.


The Lagrange Function Formulation

For every constraint we introduce a Lagrange multiplier: $\alpha_i \ge 0$.

The Lagrangian is then defined by:
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$$

where the primal variables are $(\mathbf{w}, b)$ and the dual variables are $(\alpha_1, \ldots, \alpha_N)$.

Goal: minimize the Lagrangian w.r.t. the primal variables, and maximize it w.r.t. the dual variables.


Derivation of the Dual Problem

• At the saddle point (extremum w.r.t. the primal variables):
$$\frac{\partial}{\partial b} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0$$
• This gives the conditions:
$$\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$
• Substitute these into $L(\mathbf{w}, b, \boldsymbol{\alpha})$ to get the dual problem


Using the Lagrange Function Formulation, we get the Dual Problem

• Maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$
• Subject to
$$\alpha_i \ge 0, \quad i = 1, \ldots, N, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Properties of the Dual Problem

• Solving the Dual gives a solution to the original constraint optimization problem
• For SVMs, the Dual problem is a Quadratic Optimization Problem, which has a globally optimal solution (a solver sketch follows this list)
• Gives insights into the NON-linear formulation for SVMs
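As an illustration of that last point about quadratic optimization, the dual can be handed to an off-the-shelf QP solver. A minimal sketch using cvxopt (an assumption: cvxopt is installed; it minimizes $\frac{1}{2}\boldsymbol{\alpha}^\top P \boldsymbol{\alpha} + q^\top \boldsymbol{\alpha}$, so the maximization is negated):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_dual(X, y):
    """Solve max W(alpha) s.t. alpha_i >= 0, sum_i alpha_i y_i = 0 as a QP."""
    N = X.shape[0]
    Yx = y[:, None] * X                      # rows y_i * x_i
    P = matrix(Yx @ Yx.T)                    # P_ij = y_i y_j (x_i . x_j)
    q = matrix(-np.ones(N))                  # negated, since cvxopt minimizes
    G = matrix(-np.eye(N))                   # -alpha_i <= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))   # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                # the optimal dual variables
```

Because the objective is quadratic and the constraints are linear, the solver returns the globally optimal $\boldsymbol{\alpha}$, matching the property above.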


Support Vector Expansion (1)

$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$

$$y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) > 1 \;\Rightarrow\; \alpha_i = 0 \;\rightarrow\; \mathbf{x}_i \text{ irrelevant}$$
OR
$$y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) = 1 \text{ (on margin)} \;\rightarrow\; \mathbf{x}_i \text{ is a Support Vector}$$

$b$ is also computed from the optimal dual variables $\alpha_i$.
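A minimal numpy sketch of this expansion, given dual variables from a solver such as the QP sketch above (the tolerance for treating $\alpha_i$ as zero is a numerical assumption):

```python
import numpy as np

def expand(alphas, X, y, tol=1e-6):
    """w = sum_i alpha_i y_i x_i; b averaged over on-margin support vectors."""
    sv = alphas > tol                        # alpha_i ~ 0  ->  x_i is irrelevant
    w = (alphas[sv] * y[sv]) @ X[sv]         # only support vectors contribute
    b = float(np.mean(y[sv] - X[sv] @ w))    # from y_i (w . x_i + b) = 1 on margin
    return w, b, X[sv]                       # X[sv] are the support vectors
```

Averaging $b$ over all support vectors is a common numerical-stability choice; any single on-margin point would do in exact arithmetic.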


Support Vector Expansion (2)

Substitute
$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$
into
$$f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$$
to get:
$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$$


What are the Support Vectors?

[Figure: the maximized-margin boundary; the support vectors are the training points lying on the margin hyperplanes.]


Why do we want a model with only a few SVs?

• Leaving out an example that does not become an SV gives the same solution!
• Theorem (Vapnik and Chervonenkis, 1974): Let $N_{SV}$ be the number of SVs obtained by training on $N$ examples randomly drawn from $P(X, Y)$, and let $E$ be the expectation. Then
$$E[\text{Prob(test error)}] \le \frac{E[N_{SV}]}{N}$$
For example, if training on $N = 1000$ examples typically yields $E[N_{SV}] = 50$ support vectors, the expected test error is bounded by 5%.


What Happens When Data is Not Separable: Soft Margin SVM

Add a slack variable $\xi_i$ for each training point $\mathbf{x}_i$:
$$\xi_i = \begin{cases} 0 & \text{if correctly classified} \\ \text{distance to margin} & \text{otherwise} \end{cases}$$


Soft Margin SVM: Constraint Optimization Problem

• Given data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$
• Minimize
$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$$
subject to:
$$y_i \, (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \forall i = 1, \ldots, N$$
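In practice this problem is usually handed to a library. A minimal sketch using scikit-learn (an assumption: scikit-learn is installed; its C is the same slack penalty as above, and the toy data is made up):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 3.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel='linear', C=1.0)    # larger C penalizes slack more heavily
clf.fit(X, y)
print(clf.support_vectors_)          # the support vectors found
print(clf.predict([[2.0, 2.0]]))     # predicted class label
```

Small C tolerates more margin violations (wider margin, more slack); large C approaches the hard-margin behavior.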


Dual Problem (Non-Separable Data)

• Maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$
• Subject to (only the box constraint changes; see the sketch below):
$$0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
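Relative to the hard-margin QP sketched earlier, only the inequality block changes: $0 \le \alpha_i \le C$ becomes two stacked linear inequalities. A minimal sketch of just that change (N and C are illustrative values):

```python
import numpy as np
from cvxopt import matrix

N, C = 4, 1.0
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -alpha_i <= 0  and  alpha_i <= C
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))  # stacked right-hand sides
```

Everything else in the dual (objective and equality constraint) is unchanged.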


Same Decision Boundary

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$$


Mapping into Nonlinear Space


Nonlinear Data?


Nonlinear SVMs

• KEY IDEA: Both the decision boundary and the dual optimization formulation use dot products in input space only!

Boundary:
$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) + b \right)$$

Dual objective:
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$$

In both, the data enter only through the inner product $(\mathbf{x}_i \cdot \mathbf{x})$.


Kernel Trick

Replace the inner product
$$(\mathbf{x}_i \cdot \mathbf{x}_j)$$
with
$$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$$
Can use the same algorithms in nonlinear kernel space!


Nonlinear SVMs

Maximize:
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

Boundary:
$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$$
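A minimal numpy sketch of this kernelized boundary, assuming $\boldsymbol{\alpha}$ and $b$ have been obtained from the dual (the kernel is passed in as a function):

```python
import numpy as np

def predict(x, alphas, X, y, b, kernel):
    """f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b )."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y, X))
    return int(np.sign(s + b))

# e.g. with a Gaussian kernel (sigma = 1):
# predict(x_new, alphas, X_train, y_train, b,
#         kernel=lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0))
```

Note that $\Phi$ never appears explicitly: only kernel evaluations against the training points are needed.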


Need Mercer Kernels

$$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle = \langle \Phi(\mathbf{x}_j), \Phi(\mathbf{x}_i) \rangle = K(\mathbf{x}_j, \mathbf{x}_i)$$


Gram (Kernel) Matrix

Training data: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$

$$K = \begin{pmatrix} K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N) \end{pmatrix}$$

Properties:
• Positive definite matrix
• Symmetric
• Positive on diagonal
• N by N
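These properties are easy to verify numerically. A minimal numpy sketch (the random data and the degree-2 polynomial kernel are illustrative; Gram matrices from Mercer kernels have non-negative eigenvalues up to round-off):

```python
import numpy as np

def gram_matrix(X, kernel):
    """N x N matrix of pairwise kernel evaluations K(x_i, x_j)."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.randn(5, 2)
K = gram_matrix(X, lambda u, v: (u @ v + 1.0) ** 2)   # polynomial kernel, d=2, c=1
print(K.shape)                                        # (5, 5): N by N
print(np.allclose(K, K.T))                            # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))         # no negative eigenvalues
```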


Commonly Used Mercer Kernels

• Polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = \left( (\mathbf{x}_i \cdot \mathbf{x}_j) + c \right)^d$
• Sigmoid: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left( \kappa \, (\mathbf{x}_i \cdot \mathbf{x}_j) + \theta \right)$
• Gaussian: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{1}{2\sigma^2} \left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2 \right)$
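A minimal numpy sketch of these three kernels (the default hyperparameter values are illustrative; note the sigmoid/tanh kernel satisfies Mercer's condition only for some choices of $(\kappa, \theta)$):

```python
import numpy as np

def polynomial(xi, xj, c=1.0, d=3):
    return (xi @ xj + c) ** d

def sigmoid(xi, xj, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (xi @ xj) + theta)

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))
```

Any of these can be dropped into the Gram-matrix and prediction sketches above.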


MNIST: A SVM Success Story

• Handwritten character benchmark
  – 60,000 training and 10,000 testing examples
  – Dimension d = 28 × 28


Results on Test Data

SVM used a polynomial kernel of degree 9.


SVM (Kernel) Model Structure
