CS 8751 ML & KDD
Support Vector Machines
Support Vector Machines (SVMs)
• Learning mechanism based on linear programming
• Chooses a separating plane based on maximizing the notion of a margin
  – Based on PAC learning
• Has mechanisms for
  – Noise
  – Non-linear separating surfaces (kernel functions)
• Notes based on those of Prof. Jude Shavlik
Support Vector Machines
[Figure: points from two classes, A+ and A−, in feature space]

Find the best separating plane in feature space: many possibilities to choose from, so which is the best choice?
SVMs – The General Idea
• How to pick the best separating plane?
• Idea:
  – Define a set of inequalities we want to satisfy
  – Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
• Key issues:
  – Dealing with noise
  – What if there is no good linear separating surface?
Linear Programming
• Subset of Math Programming
• Problem has the following form:
  maximize a function $f(x_1, x_2, x_3, \ldots, x_n)$
  subject to a set of constraints of the form: $g(x_1, x_2, x_3, \ldots, x_n) > b$
• Math programming: find a set of values for the variables $x_1, x_2, x_3, \ldots, x_n$ that meets all of the constraints and maximizes the function $f$
• Linear programming: solving math programs where the constraint functions and the function to be maximized use only linear combinations of the variables
  – Generally easier than the general Math Programming problem
  – A well-studied problem
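
To make the form concrete, here is a minimal Python sketch that solves a small linear program with SciPy's linprog (assumed available); the objective and constraints are made-up illustrations, and since linprog minimizes, we negate the objective to maximize:

from scipy.optimize import linprog

# Maximize f(x1, x2) = 3*x1 + 2*x2
# subject to x1 + x2 <= 4, x1 + 3*x2 <= 6, x1 >= 0, x2 >= 0.
c = [-3, -2]                      # linprog minimizes, so negate to maximize
A_ub = [[1, 1], [1, 3]]           # left-hand sides of the <= constraints
b_ub = [4, 6]                     # right-hand sides
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)            # optimal point and maximal value of f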
Maximizing the Margin
[Figure: A+ and A− points on either side of the decision boundary]

The margin between categories: we want this distance to be maximal (we'll assume linearly separable for now).
PAC Learning
• PAC = Probably Approximately Correct learning
• Theorems that can be used to define bounds for the risk (error) of a family of learning functions
• Basic formula, with probability $(1 - \eta)$:

$$R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\,(\log(2N/h) + 1) - \log(\eta/4)}{N}}$$

• $R$ is the risk function, $\alpha$ is the parameters chosen by the learner, $N$ is the number of data points, and $h$ is the VC dimension (something like an estimate of the complexity of the class of functions)
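
As a sketch, the bound's complexity term can be evaluated numerically; the values of N, h, and eta below are arbitrary illustrations:

import math

def vc_bound_term(N, h, eta):
    """Complexity term added to the empirical risk in the PAC/VC bound."""
    return math.sqrt((h * (math.log(2 * N / h) + 1) - math.log(eta / 4)) / N)

# With probability 1 - eta = 0.95, risk <= empirical risk + this term.
print(vc_bound_term(N=10000, h=50, eta=0.05))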
Margins and PAC Learning
• Theorems connect PAC theory to the size of the margin
• Basically, the larger the margin, the better the expected accuracy
• See, for example, Chapter 4 of An Introduction to Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000
Some Equations
Separating plane: $w \cdot x = \gamma$, where $w$ = weights, $x$ = input features, $\gamma$ = threshold.

For all positive examples: $w \cdot x_{pos} \ge \gamma + 1$
For all negative examples: $w \cdot x_{neg} \le \gamma - 1$

(The 1s result from dividing through by a constant, for convenience.)

Distance between the blue and red planes (the margin):

$$\text{margin} = \frac{2}{\|w\|_2}$$

where $\|w\|_2$ is the Euclidean length ("2-norm") of the weight vector.
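
A quick numeric illustration of these equations in Python; the weight vector, threshold, and example points are made up:

import numpy as np

w = np.array([2.0, 1.0])               # weights
gamma = 1.0                            # threshold
x_pos = np.array([1.5, 0.5])           # a positive example
x_neg = np.array([-0.5, 0.0])          # a negative example

print(w @ x_pos >= gamma + 1)          # w . x_pos >= gamma + 1
print(w @ x_neg <= gamma - 1)          # w . x_neg <= gamma - 1
print(2 / np.linalg.norm(w))           # margin = 2 / ||w||_2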
What the Equations Mean
[Figure: A+ and A− points; the support vectors lie on the two bounding planes $x \cdot w = \gamma + 1$ and $x \cdot w = \gamma - 1$, which are separated by the margin $2 / \|w\|_2$]
Choosing a Separating Plane
[Figure: A+ and A− points with several candidate separating planes; which one to choose?]
Our “Mathematical Program” (so far)
$$\min_{w,\gamma} \|w\|_2$$
such that
$$w \cdot x_{pos} \ge \gamma + 1 \quad \text{(for + examples)}$$
$$w \cdot x_{neg} \le \gamma - 1 \quad \text{(for − examples)}$$

Note: $w$ and $\gamma$ are our adjustable parameters (we could, of course, use the ANN "trick" and move $\gamma$ to the left side of our inequalities).

We can now use existing math programming optimization software to find a solution to the above (a global optimal solution).

(For technical reasons it is easier to optimize the equivalent "quadratic program", minimizing $\|w\|_2^2$.)
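
A minimal sketch of this program on a tiny, linearly separable, made-up dataset, using SciPy's general-purpose SLSQP optimizer (a dedicated QP solver would be the usual choice in practice):

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # toy examples
y = np.array([1, 1, -1, -1])                                    # +1 / -1 labels

def objective(p):
    w = p[:2]                          # p = (w1, w2, gamma)
    return w @ w                       # minimize ||w||^2 (the quadratic program)

def constraints(p):
    w, gamma = p[:2], p[2]
    # y_i * (w . x_i - gamma) - 1 >= 0 encodes both inequalities at once.
    return y * (X @ w - gamma) - 1

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w, gamma = res.x[:2], res.x[2]
print(w, gamma, 2 / np.linalg.norm(w))  # plane and resulting margin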
Dealing with Non-Separable Data
We can add what is called a "slack" variable to each example. This variable can be viewed as:
  0 if the example is correctly separated
  y = the "distance" we need to move the example to make it correct (i.e., the distance from its surface)
“Slack” Variables
[Figure: A+ and A− points with the two bounding planes; an example on the wrong side must be moved a slack distance y; the support vectors are marked]
The Math Program with Slack Variables
$$\min_{w,\gamma,s} \;\|w\|_2 + \mu \|s\|_1$$
such that
$$w \cdot x_i^{pos} \ge \gamma + 1 - s_i, \qquad w \cdot x_j^{neg} \le \gamma - 1 + s_j, \qquad s_k \ge 0 \;\;\forall k$$

where $w$ has one component for each input feature, $s$ has one component for each example, $\mu$ is a scaling constant, and $\|s\|_1$ (the "one norm") is the sum of the components of $s$ (all positive).

This is the "traditional" Support Vector Machine.
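
In practice this program is rarely written out by hand; here is a sketch using scikit-learn's SVC (assumed installed), whose parameter C plays the role of the scaling constant on the slack penalty. The data, with one deliberately noisy point, is made up:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])       # the last point sits on the "wrong" side

clf = SVC(kernel="linear", C=1.0)      # C scales the slack penalty
clf.fit(X, y)
print(clf.coef_, clf.intercept_)       # the learned separating plane
print(clf.support_)                    # indices of the support vectors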
Why the word “Support”?
• All those examples on or on the wrong side of the two separating planes are the support vectors
  – We'd get the same answer if we deleted all the non-support vectors!
  – i.e., the "support vectors [examples]" support the solution
PAC and the Number of Support Vectors
• The fewer the support vectors, the better the generalization will be
• Recall, non-support vectors are
  – Correctly classified
  – Don't change the learned model if left out of the training set
• So:

$$\text{leave-one-out error rate} \le \frac{\#\,\text{support vectors}}{\#\,\text{training examples}}$$
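
Continuing the hypothetical scikit-learn sketch above, this bound can be read straight off a fitted model:

# After clf.fit(X, y) as in the earlier sketch:
loo_bound = len(clf.support_) / len(X)
print(f"leave-one-out error rate <= {loo_bound:.2f}")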
Finding Non-Linear Separating Surfaces
• Map inputs into a new space

  Example: features x1, x2                    → 5, 4
  Example: features x1, x2, x1², x2², x1·x2   → 5, 4, 25, 16, 20

• Solve the SVM program in this new space
  – Computationally complex if many features
  – But a clever trick exists
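
A tiny sketch of this explicit feature mapping, checking the numbers above:

def to_new_space(x1, x2):
    """Map (x1, x2) into the quadratic feature space used in the example."""
    return (x1, x2, x1 ** 2, x2 ** 2, x1 * x2)

print(to_new_space(5, 4))   # (5, 4, 25, 16, 20)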
The Kernel Trick
• Optimization problems often/always have a "primal" and a "dual" representation
  – So far we've looked at the primal formulation
  – The dual formulation is better for the case of a non-linear separating surface
Perceptrons Re-Visited
In perceptrons, $y_i = 1$ if T and $y_i = -1$ if F. If example $x_i$ is currently misclassified, the weights change:

$$w_{k+1} = w_k + y_i x_i$$

So

$$w_{final} = w_{initial} + \sum_{i=1}^{\#examples} \alpha_i y_i x_i$$

where $\alpha_i$ is some number of times we get example $i$ wrong and change the weights. This assumes $w_{initial} = 0$ (all zeros).
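
A sketch of this update rule that also records each alpha_i, so the identity above can be checked directly; the toy data is made up:

import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)                        # w_initial = 0 (all zeros)
alpha = np.zeros(len(X))               # alpha_i counts mistakes on example i
for _ in range(20):                    # a few passes over the data
    for i in range(len(X)):
        if y[i] * (w @ X[i]) <= 0:     # example i currently misclassified
            w += y[i] * X[i]
            alpha[i] += 1

print(np.allclose(w, (alpha * y) @ X)) # w_final = sum_i alpha_i y_i x_i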
Dual Form of the Perceptron Learning Rule
Output of the perceptron:

$$h(x) = \text{sgn}(w \cdot x), \qquad \text{sgn}(z) = \begin{cases} 1 & \text{if } z > 0 \\ -1 & \text{otherwise} \end{cases}$$

So

$$h(x) = \text{sgn}\!\left(\left(\sum_{i=1}^{\#examples} \alpha_i y_i x_i\right) \cdot x\right) = \text{sgn}\!\left(\sum_{i=1}^{\#examples} \alpha_i y_i \,(x_i \cdot x)\right)$$

New (i.e., dual) perceptron algorithm: for each example $i$, if

$$y_i \sum_{j=1}^{\#examples} \alpha_j y_j \,(x_j \cdot x_i) \le 0 \qquad \text{(i.e., predicted} \ne \text{teacher)}$$

then $\alpha_i \leftarrow \alpha_i + 1$ (counts errors).
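
A sketch of the dual algorithm on the same made-up data; note that only dot products between examples are ever used:

import numpy as np

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

G = X @ X.T                            # all pairwise dot products x_j . x_i
alpha = np.zeros(len(X))
for _ in range(20):
    for i in range(len(X)):
        if y[i] * np.sum(alpha * y * G[:, i]) <= 0:  # predicted != teacher
            alpha[i] += 1              # counts errors on example i

def h(x_new):
    """Classify using only dot products with the training examples."""
    return np.sign(np.sum(alpha * y * (X @ x_new)))

print(alpha, h(np.array([1.5, 1.5])))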
Primal versus Dual Space
• Primal: "weight space"
  – Weight the features to make the output decision

  $$h(x_{new}) = \text{sgn}(w \cdot x_{new})$$

• Dual: "training-examples space"
  – Weight the distance (which is based on the features) to the training examples

  $$h(x_{new}) = \text{sgn}\!\left(\sum_{j=1}^{\#examples} \alpha_j y_j \,(x_j \cdot x_{new})\right)$$
The Dual SVM
Let $n = \#\text{training examples}$ and

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

$$\min_{\alpha}\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_i y_j \,\alpha_i \alpha_j \,(x_i \cdot x_j) \;-\; \sum_{i=1}^{n} \alpha_i \qquad \text{such that } \alpha_i \ge 0$$

Can convert back to primal form:

$$\gamma = \frac{1}{2}\left(\min_{i:\, y_i = +1} w \cdot x_i \;+\; \max_{i:\, y_i = -1} w \cdot x_i\right)$$
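
A sketch solving this dual numerically on the earlier made-up separable data, using SciPy's bound-constrained minimizer for the alpha_i >= 0 constraints (small numerical tolerances are expected):

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
G = (y[:, None] * X) @ (y[:, None] * X).T   # entries y_i y_j (x_i . x_j)

def dual_objective(alpha):
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(dual_objective, x0=np.zeros(len(X)),
               bounds=[(0, None)] * len(X))
alpha = res.x
w = (alpha * y) @ X                          # convert back to primal w
gamma = 0.5 * (min(X[y == 1] @ w) + max(X[y == -1] @ w))
print(alpha, w, gamma)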
Non-Zero $\alpha_i$'s
Recall: only the support vectors (i.e., those with $\alpha_i \ne 0$) contribute to the weights:

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i$$

Those examples with $\alpha_i \ne 0$ are the support vectors.
Generalizing the Dot Product
We can generalize $Dot\_Product(x_i, x_j)$ to other "kernel functions", e.g., $K(x_i, x_j)$.

An acceptable kernel (usually non-linear) maps the original features into a new space:
– implicitly, in this new space, we're computing a dot product
– we don't need to explicitly know the features in the new space
– usually more efficient than directly converting to the new space
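
A sketch of the substitution: the dual classifier from before, with every dot product replaced by a kernel K (here a made-up quadratic kernel):

import numpy as np

def K(a, b):
    """Kernel in place of the plain dot product; here (a . b)^2."""
    return (a @ b) ** 2

def h_kernel(x_new, X, y, alpha):
    """Dual classifier with x_i . x_new replaced by K(x_i, x_new)."""
    return np.sign(sum(a * t * K(xi, x_new)
                       for a, t, xi in zip(alpha, y, X)))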
The New Space for a Sample Kernel
Let $K(x, z) = (x \cdot z)^2$ and let $\#\text{features} = 2$:

$$(x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1 z_1 x_1 z_1 + x_1 z_1 x_2 z_2 + x_2 z_2 x_1 z_1 + x_2 z_2 x_2 z_2$$
$$= \langle x_1 x_1,\, x_1 x_2,\, x_2 x_1,\, x_2 x_2\rangle \cdot \langle z_1 z_1,\, z_1 z_2,\, z_2 z_1,\, z_2 z_2\rangle$$

Our new feature space (with 4 dimensions): we're doing a dot product in it.
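
A quick numeric check that the kernel equals a dot product in the 4-dimensional derived space; the two vectors are arbitrary:

import numpy as np

def phi(v):
    """Explicit map into the 4-dimensional derived space."""
    return np.array([v[0] * v[0], v[0] * v[1], v[1] * v[0], v[1] * v[1]])

x = np.array([5.0, 4.0])
z = np.array([1.0, 2.0])
print((x @ z) ** 2, phi(x) @ phi(z))   # both print 169.0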
Visualizing the Kernel

[Figure: + and − points in the input (original) space, where the separating plane is non-linear, mapped to points g(+) and g(−) in the derived feature space (new space), where the separating plane is linear]

g() is the feature transformation function. The process is similar to what hidden units do in ANNs, but the kernel is user chosen.
More Sample Kernels
1) $K(x, z) = (x \cdot z + const)^d$
2) $K(x, z) = e^{-\|x - z\|_2^2}$ (Gaussian kernel; leads to an RBF network)
3) $K(x, z) = \tanh(x \cdot z - c)$ (related to the sigmoid of ANNs)

Plus many more, including many designed for specific tasks (text, DNA, etc.)
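
Sketches of these kernels in Python; const, d, and c are free parameters chosen here only for illustration:

import numpy as np

def poly_kernel(x, z, const=1.0, d=2):
    return (x @ z + const) ** d

def gaussian_kernel(x, z):
    return np.exp(-np.sum((x - z) ** 2))   # e^(-||x - z||_2^2)

def tanh_kernel(x, z, c=1.0):
    return np.tanh(x @ z - c)

x, z = np.array([5.0, 4.0]), np.array([1.0, 2.0])
print(poly_kernel(x, z), gaussian_kernel(x, z), tanh_kernel(x, z))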
What Makes a Kernel
Mercer's theorem characterizes when a function is a kernel.

If $K_1()$ and $K_2()$ are kernels, then so are:
1) $K_1() + K_2()$
2) $c \cdot K_1()$, where $c$ is a positive constant
3) $K_1() * K_2()$
4) $K(x, z) = f(x) \cdot f(z)$, where $f()$ returns a real ...
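
A numeric sanity check of the first closure property: the Gram matrix of a sum of two kernels stays positive semi-definite (the sample points and kernels are made up; a tiny tolerance absorbs round-off):

import numpy as np

pts = np.random.rand(10, 2)             # arbitrary sample points

def gram(kernel):
    return np.array([[kernel(a, b) for b in pts] for a in pts])

k1 = lambda a, b: (a @ b + 1) ** 2                     # polynomial kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2))        # Gaussian kernel
G = gram(k1) + gram(k2)                                # Gram matrix of K1 + K2
print(np.linalg.eigvalsh(G).min() >= -1e-9)            # no negative eigenvalues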
Key SVM Ideas
• Maximize the margin between positive and negative examples (connects to PAC theory)
• Penalize errors in the non-separable case
• Only the support vectors contribute to the solution
• Kernels map examples into a new, usually non-linear space
  – We implicitly do dot products in this new space (in the "dual" form of the SVM program)