# Support Vector Machines (SVMs)

CS 8751 ML & KDD, Nov 7, 2013

- Learning mechanism based on linear programming
- Chooses a separating plane based on maximizing the notion of a margin
- Based on PAC learning
- Has mechanisms for:
  - Noise
  - Non-linear separating surfaces (kernel functions)

Notes based on those of Prof. Jude Shavlik.

## Support Vector Machines

[Figure: positive (A+) and negative (A−) examples in feature space, with several candidate separating planes.]

Find the best separating plane in feature space: there are many possibilities to choose from, so which is the best choice?

## SVMs: The General Idea

How do we pick the best separating plane?

Idea:
- Define a set of inequalities we want to satisfy
- Use advanced optimization methods (e.g., linear programming) to find satisfying solutions

Key issues:
- Dealing with noise
- What if there is no good linear separating surface?

## Linear Programming

A subset of math programming. A problem has the following form:

- a function f(x₁, x₂, x₃, …, xₙ) to be maximized
- subject to a set of constraints of the form g(x₁, x₂, x₃, …, xₙ) > b

Math programming: find a set of values for the variables x₁, x₂, x₃, …, xₙ that meets all of the constraints and maximizes the function f.

Linear programming: solving math programs where the constraint functions and the function to be maximized use only linear combinations of the variables. It is:
- Generally easier than the general math programming problem
- A well-studied problem
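As a concrete sketch, the small linear program below (the objective, constraints, and numbers are made-up illustrations, not from the lecture) is solved with off-the-shelf software, here `scipy.optimize.linprog`. Since `linprog` minimizes, the objective is negated to maximize:

```python
from scipy.optimize import linprog

# Maximize f(x1, x2) = 3*x1 + 2*x2 subject to
#   x1 + x2 <= 4,   x1 <= 2,   x1, x2 >= 0.
# linprog minimizes, so we pass the negated objective coefficients.
res = linprog(c=[-3.0, -2.0],
              A_ub=[[1.0, 1.0], [1.0, 0.0]],
              b_ub=[4.0, 2.0],
              bounds=[(0, None), (0, None)])

print(res.x)     # optimum at x1 = 2, x2 = 2
print(-res.fun)  # maximum value f = 10
```

The optimum sits at a vertex of the feasible region, which is why LP solvers are fast and well understood.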

## Maximizing the Margin

[Figure: positive (A+) and negative (A−) examples on either side of the decision boundary, with the margin between the two categories highlighted.]

The margin is the distance between the categories; we want this distance to be maximal. (We'll assume the data are linearly separable for now.)

## PAC Learning

PAC = Probably Approximately Correct learning.

PAC theorems can be used to define bounds for the risk (error) of a family of learning functions. Basic formula, which holds with probability (1 − η):

R(α) ≤ R_emp(α) + √[ ( h (log(2N/h) + 1) − log(η/4) ) / N ]

where R is the risk function, α is the parameters chosen by the learner, N is the number of data points, and h is the VC dimension (something like an estimate of the complexity of the class of functions).

## Margins and PAC Learning

Theorems connect PAC theory to the size of the margin: basically, the *larger* the margin, the better the expected accuracy.

See, for example, Chapter 4 of *Support Vector Machines* by Cristianini and Shawe-Taylor, Cambridge University Press, 2002.

## Some Equations

Separating plane:  w · x = γ   (w = weights, x = input features, γ = threshold)

For all positive examples:  w · x_pos ≥ γ + 1

For all negative examples:  w · x_neg ≤ γ − 1

(The 1s result from dividing through by a constant, for convenience.)

Distance between the blue and red planes (the margin):

margin = 2 / ‖w‖₂

where ‖w‖₂ is the Euclidean length ("2-norm") of the weight vector.
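To make the formula concrete, a tiny numeric check (the weight vector, threshold, and points are made-up values, not from the slides):

```python
import numpy as np

# Illustrative 2-D separating plane w.x = gamma
w = np.array([3.0, 4.0])    # weights
gamma = 2.0                 # threshold

# Margin between the planes w.x = gamma + 1 and w.x = gamma - 1:
margin = 2.0 / np.linalg.norm(w, ord=2)   # 2 / ||w||_2
print(margin)  # ||w||_2 = 5, so the margin is 0.4

# Points satisfying the inequalities fall on the correct sides:
x_pos = np.array([1.0, 1.0])   # w.x = 7 >= gamma + 1 = 3
x_neg = np.array([0.0, 0.0])   # w.x = 0 <= gamma - 1 = 1
print(w @ x_pos >= gamma + 1, w @ x_neg <= gamma - 1)  # True True
```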

## What the Equations Mean

[Figure: positive (A+) and negative (A−) examples with the two parallel planes x · w = γ + 1 and x · w = γ − 1. The examples lying on those planes are the support vectors; the distance between the planes, 2 / ‖w‖₂, is the margin.]

## Choosing a Separating Plane

[Figure: positive (A+) and negative (A−) examples with a candidate separating plane marked "?".]

## Our “Mathematical Program” (so far)

min_{w,γ}  ‖w‖₂   (for technical reasons, this is easier to work with than maximizing 2 / ‖w‖₂ directly)

such that:

- w · x_pos ≥ γ + 1   (for positive examples)
- w · x_neg ≤ γ − 1   (for negative examples)

Note: w and γ are our parameters (we could, of course, use the ANN "trick" and move γ to the left side of our inequalities).

We can now use existing math programming optimization software to find a solution to the above (a global optimal solution).

## Dealing with Non-Separable Data

We can add what is called a “slack” variable to each example. This variable can be viewed as:

- 0, if the example is correctly separated
- otherwise, the “distance” we need to move the example to make it correct (i.e., its distance from its surface)

## “Slack” Variables

[Figure: positive (A+) and negative (A−) examples with the two margin planes and their support vectors; a misclassified example lies at slack distance y from its plane.]

## The Math Program with Slack Variables

min_{w,γ,s}  ‖w‖₁ + C Σ_k s_k

such that:

- w · x_i ≥ γ + 1 − s_i   (for each positive example i)
- w · x_j ≤ γ − 1 + s_j   (for each negative example j)
- s_k ≥ 0                 (for each example k)

where:
- ‖w‖₁ is the "one-norm": the sum of the components of w (all positive)
- C is a scaling constant
- there is one slack s_k for each example, and one weight for each input feature

This is the Support Vector Machine.
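Because the one-norm objective is linear, the slack program above is itself a linear program. A minimal sketch (the data, the value of C, and the u − v split used to encode ‖w‖₁ are illustrative assumptions), using `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Tiny made-up data set: two positive and two negative examples.
X = np.array([[2.0, 2.0], [3.0, 3.0],     # positives (y = +1)
              [0.0, 0.0], [1.0, 0.0]])    # negatives (y = -1)
y = np.array([1, 1, -1, -1])
n, d = X.shape
C = 10.0                                   # scaling constant on the slacks

# Encode w = u - v with u, v >= 0 so that ||w||_1 = sum(u) + sum(v).
# Variable layout: [u (d), v (d), gamma (1), s (n)]
c = np.concatenate([np.ones(d), np.ones(d), [0.0], C * np.ones(n)])

# Constraint y_i * ((u - v).x_i - gamma) + s_i >= 1, as A_ub z <= b_ub:
A = np.zeros((n, 2 * d + 1 + n))
for i in range(n):
    A[i, :d] = -y[i] * X[i]            # -y_i * x_i on u
    A[i, d:2 * d] = y[i] * X[i]        # +y_i * x_i on v
    A[i, 2 * d] = y[i]                 # +y_i on gamma
    A[i, 2 * d + 1 + i] = -1.0         # -1 on s_i
b = -np.ones(n)

bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)

w = res.x[:d] - res.x[d:2 * d]
gamma = res.x[2 * d]
print(np.sign(X @ w - gamma))  # matches y on this separable data
```

With C large relative to ‖w‖₁, the solver keeps all slacks at zero here; shrinking C trades margin violations against a smaller weight vector.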

## Why the Word “Support”?

All the examples *on*, or on the *wrong side* of, the two separating planes are the support vectors.

We'd get the same answer if we deleted all the non-support vectors! That is, the “support vectors” (examples) support the solution.

## PAC and the Number of Support Vectors

The fewer the support vectors, the better the generalization will be. Recall that non-support vectors are:

- Correctly classified
- Don't change the learned model if left out of the training set

So:

leave-one-out error rate ≤ (# support vectors) / (# training examples)

## Finding Non-Linear Separating Surfaces

Map the inputs into a new space.

Example: features (x₁, x₂) = (5, 4)

Mapped: (x₁, x₂, x₁², x₂², x₁·x₂) = (5, 4, 25, 16, 20)

Then solve the SVM program in this new space.

- Computationally complex if there are many features
- But a clever trick exists

## The Kernel Trick

Optimization problems often/always have a “primal” and a “dual” representation. So far we've looked at the *primal* formulation; the *dual* formulation is better for the case of a non-linear separating surface.

## Perceptrons Re-Visited

In perceptrons, if T = +1 and F = −1, we add y_i x_i to the weights whenever example i is currently misclassified. So:

w_final = w_initial + Σ_{i=1}^{#examples} α_i y_i x_i

where α_i is some number of times we get example i wrong and change the weights. Assuming w_initial = 0 (all zero):

w = Σ_{i=1}^{#examples} α_i y_i x_i
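A minimal sketch of this update rule (the data set, epoch count, and all names are illustrative assumptions, not from the lecture); labels are +1 / −1 and the bias is folded in as a constant-1 feature:

```python
def train_perceptron(X, y, epochs=100):
    n_features = len(X[0])
    w = [0.0] * n_features           # w_initial = 0 (all zeros)
    alpha = [0] * len(X)             # times each example was misclassified
    for _ in range(epochs):
        for i, (x, target) in enumerate(zip(X, y)):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1
            if pred != target:       # misclassified: w <- w + y_i * x_i
                alpha[i] += 1
                w = [wj + target * xj for wj, xj in zip(w, x)]
    return w, alpha

# AND-like separable data, bias folded in as a constant 1 feature:
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [-1, -1, -1, 1]
w, alpha = train_perceptron(X, y)

# w equals sum_i alpha_i * y_i * x_i, as the slide states:
recon = [sum(a * t * x[j] for a, t, x in zip(alpha, y, X)) for j in range(3)]
print(w == [float(r) for r in recon])  # True
```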

## Dual Form of the Perceptron Learning Rule

Output of the perceptron:

h(x) = sgn(w · x),   where sgn(z) = 1 if z > 0 and −1 otherwise

So, substituting w = Σ α_i y_i x_i:

h(x) = sgn( Σ_{i=1}^{#examples} α_i y_i (x_i · x) )

New (i.e., dual) perceptron algorithm: for each example i, if

sgn( Σ_{j=1}^{#examples} α_j y_j (x_j · x_i) ) ≠ y_i   (i.e., predicted ≠ teacher)

then α_i ← α_i + 1 (counts errors).
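The dual rule keeps one error count per training example instead of a weight per feature. A minimal sketch (data and names are illustrative assumptions):

```python
def dot(a, b):
    return sum(x * z for x, z in zip(a, b))

def train_dual_perceptron(X, y, epochs=100):
    alpha = [0] * len(X)             # one error count per training example
    for _ in range(epochs):
        for i in range(len(X)):
            # h(x_i) = sgn( sum_j alpha_j * y_j * (x_j . x_i) )
            score = sum(alpha[j] * y[j] * dot(X[j], X[i])
                        for j in range(len(X)))
            pred = 1 if score > 0 else -1
            if pred != y[i]:
                alpha[i] += 1        # counts errors on example i
    return alpha

X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]   # constant-1 bias feature
y = [-1, -1, -1, 1]
alpha = train_dual_perceptron(X, y)

# Recover the primal weights w = sum_i alpha_i * y_i * x_i and check agreement:
w = [sum(a * t * x[j] for a, t, x in zip(alpha, y, X)) for j in range(3)]
preds = [1 if dot(w, x) > 0 else -1 for x in X]
print(preds == y)  # True
```

Note the data enter only through dot products x_j · x_i, which is exactly what makes the kernel substitution possible later.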

## Primal versus Dual Space

Primal ("weight space"): weight the *features* to make the output decision:

h(x_new) = sgn(w · x_new)

Dual ("training-examples space"): weight the distance (which is based on the features) to the *training examples*:

h(x_new) = sgn( Σ_{j=1}^{#examples} α_j y_j (x_j · x_new) )

## The Dual SVM

Let w = Σ_{i=1}^{n} α_i y_i x_i, where n = #training examples.

min_α  ½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j) − Σ_{i=1}^{n} α_i

such that α_i ≥ 0.

Can convert back to the primal form:

min_w  ‖w‖₂² / 2   such that  y_i (w · x_i) ≥ 1

## Non-Zero α_i's

Recall w = Σ_{i=1}^{n} α_i y_i x_i: only the support vectors (i.e., examples with α_i > 0) contribute to the weights.

Those examples with α_i > 0 are the support vectors.

## Generalizing the Dot Product

We can generalize Dot_Product(x_i, x_j) to other "kernel functions", e.g., K(x_i, x_j).

An acceptable kernel (usually non-linear):

- maps the original features into a new space implicitly
- in this new space we're computing a dot product, but we don't need to explicitly know the features in the new space
- is usually more efficient than directly converting to the new space

## The New Space for a Sample Kernel

Let K(x, z) = (x · z)² and let #features = 2. Then:

(x · z)² = (x₁z₁ + x₂z₂)²
         = x₁x₁z₁z₁ + x₁x₂z₁z₂ + x₂x₁z₂z₁ + x₂x₂z₂z₂
         = (x₁x₁, x₁x₂, x₂x₁, x₂x₂) · (z₁z₁, z₁z₂, z₂z₁, z₂z₂)

This is our new feature space (with 4 dimensions); we're doing a dot product in it.
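A quick numeric check that the kernel computes the derived-space dot product without ever forming the 4-dimensional vectors for both arguments up front (the example vectors are illustrative):

```python
def kernel(x, z):
    # K(x, z) = (x . z)**2, computed entirely in the original 2-D space
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def g(x):
    # The implicit feature transformation into the 4-D derived space
    return (x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1])

x, z = (5.0, 4.0), (2.0, 3.0)
lhs = kernel(x, z)                              # (10 + 12)**2 = 484
rhs = sum(a * b for a, b in zip(g(x), g(z)))    # dot product in new space
print(lhs == rhs)  # True
```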

## Visualizing the Kernel

[Figure: left, the input space (original space) with + and − examples and a separating plane that is non-linear here but linear in the derived space; right, the derived feature space (new space) containing the transformed points g(+) and g(−), linearly separated.]

g() is the feature transformation function. The process is similar to what hidden units do in ANNs, but the kernel is user-chosen.

## More Sample Kernels

1) Polynomial:  K(x, z) = (x · z + const)^d

2) Gaussian:  K(x, z) = e^(−‖x − z‖₂² / c)   (related to RBF networks)

3) tanh kernel   (related to the sigmoid of ANNs)

Plus many more, including many designed for specific data (text, DNA, etc.).
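Sketches of the first two kernels (the parameter names and default values for `const`, `d`, and `c` are assumptions; they are chosen per problem):

```python
import math

def polynomial_kernel(x, z, const=1.0, d=2):
    # (x . z + const)^d
    return (sum(a * b for a, b in zip(x, z)) + const) ** d

def gaussian_kernel(x, z, c=2.0):
    # e^(-||x - z||_2^2 / c)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / c)

x, z = (1.0, 2.0), (2.0, 0.0)
print(polynomial_kernel(x, z))   # (2 + 0 + 1)**2 = 9.0
print(gaussian_kernel(x, x))     # distance 0 gives the maximum value, 1.0
```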

## What Makes a Kernel

Mercer's theorem characterizes when a function f(x, z) is a kernel.

If K₁() and K₂() are kernels, then so are:

1) K₁() + K₂()
2) c · K₁(), where c is a positive constant
3) K₁() * K₂()
4) f(x) · f(z), where f() returns a real

## Key SVM Ideas

- Maximize the *margin* between positive and negative examples (connects to PAC theory)
- Penalize errors in the non-separable case
- Only the *support vectors* contribute to the solution
- Kernels map examples into a new, usually non-linear space
- We implicitly do dot products in this new space (in the "dual" form of the SVM program)