Support Vector Machines (SVMs)


CS 8751 ML & KDD


- Learning mechanism based on linear programming
- Chooses a separating plane based on maximizing the notion of a margin
- Based on PAC learning
- Has mechanisms for:
  - Noise
  - Non-linear separating surfaces (kernel functions)

Notes based on those of Prof. Jude Shavlik


Support Vector Machines

[Figure: positive (A+) and negative (A-) examples in feature space, with several candidate separating planes]

Find the best separating plane in feature space - there are many possibilities to choose from; which is the best choice?


SVMs - The General Idea

- How to pick the best separating plane?
- Idea:
  - Define a set of inequalities we want to satisfy
  - Use advanced optimization methods (e.g., linear programming) to find satisfying solutions
- Key issues:
  - Dealing with noise
  - What if there is no good linear separating surface?


Linear Programming

- Subset of Math Programming
- Problem has the following form:

    maximize the function f(x1, x2, x3, ..., xn)
    subject to a set of constraints of the form: g(x1, x2, x3, ..., xn) > b

- Math programming - find a set of values for the variables x1, x2, x3, ..., xn that meets all of the constraints and maximizes the function f
- Linear programming - solving math programs where the constraint functions and the function to be maximized are linear combinations of the variables
  - Generally easier than the general math programming problem
  - A well-studied problem
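
As a concrete illustration (not from the slides), here is a minimal sketch of solving a small linear program with SciPy. The objective, constraints, and the choice of scipy.optimize.linprog are all illustrative assumptions; linprog minimizes and uses "<=" constraints, so the signs are flipped accordingly.

```python
# Maximize   f(x1, x2) = 3*x1 + 2*x2
# subject to x1 + x2 <= 4,  x1 <= 2,  x1 >= 0,  x2 >= 0
from scipy.optimize import linprog

c = [-3.0, -2.0]                 # negate the objective to turn maximization into minimization
A_ub = [[1.0, 1.0], [1.0, 0.0]]  # left-hand sides of the "<=" constraints
b_ub = [4.0, 2.0]                # right-hand sides
result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)     # optimal variables and the maximized objective value
```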


Maximizing the Margin

[Figure: positive (A+) and negative (A-) examples on either side of the decision boundary, with the margin between the two classes]

The margin between the categories:
- we want this distance to be maximal
- (we'll assume linearly separable data for now)


PAC Learning

- PAC = Probably Approximately Correct learning
- Theorems that can be used to define bounds for the risk (error) of a family of learning functions
- Basic formula, with probability (1 - η):

    R(α) ≤ R_emp(α) + sqrt( ( h·(log(2N/h) + 1) - log(η/4) ) / N )

- R is the risk function, α is the parameters chosen by the learner, N is the number of data points, and h is the VC dimension (something like an estimate of the complexity of the class of functions)
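
As a quick numeric illustration (not part of the original slides), a minimal sketch that plugs assumed values into the bound above:

```python
import math

def vc_bound(r_emp, h, n, eta):
    """Upper bound on the true risk R(alpha), holding with probability 1 - eta,
    given empirical risk r_emp, VC dimension h, and n training examples
    (the formula from the slide above)."""
    return r_emp + math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# Illustrative numbers (not from the slides): empirical error 5%, VC dimension 10,
# 1000 training examples, 95% confidence.
print(vc_bound(r_emp=0.05, h=10, n=1000, eta=0.05))
```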







Margins and PAC Learning

- Theorems connect PAC theory to the size of the margin
- Basically, the larger the margin, the better the expected accuracy
- See, for example, Chapter 4 of Support Vector Machines by Cristianini and Shawe-Taylor, Cambridge University Press, 2000


Some Equations

Separating plane:

  x · w = γ        (w - the weights, x - the input features, γ - the threshold)

For all positive examples:

  x_pos · w ≥ γ + 1

For all negative examples:

  x_neg · w ≤ γ - 1

(The 1s result from dividing through by a constant, for convenience.)

Distance between the blue and red planes (the margin):

  margin = 2 / ||w||_2

where ||w||_2 is the Euclidean length ("2 norm") of the weight vector.


What the Equations Mean

[Figure: positive (A+) and negative (A-) examples. The plane x · w = γ + 1 passes through the closest positive examples and the plane x · w = γ - 1 through the closest negative examples; these examples are the support vectors, and the margin between the two planes is 2 / ||w||_2.]


Choosing a Separating Plane

[Figure: positive (A+) and negative (A-) examples; which separating plane to choose?]


Our “Mathematical Program” (so far)

  min_{w, γ}  ||w||_2

  such that
    x_pos · w ≥ γ + 1    (for + examples)
    x_neg · w ≤ γ - 1    (for - examples)

Note: w and γ are our adjustable parameters (we could, of course, use the ANN "trick" and move γ to the left side of our inequalities).

We can now use existing math programming optimization software to find a solution to the above (a globally optimal solution).

For technical reasons it is easier to optimize this as a "quadratic program", i.e., to minimize ||w||_2^2 rather than ||w||_2; a minimal optimization sketch follows below.
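
As one concrete illustration of handing this program to off-the-shelf optimization software, here is a minimal sketch using the cvxpy modeling library on made-up toy data. The data, variable names, and the choice of cvxpy are illustrative assumptions, not part of the slides.

```python
import cvxpy as cp
import numpy as np

X_pos = np.array([[2.0, 2.0], [3.0, 3.0]])   # toy positive examples
X_neg = np.array([[0.0, 0.0], [0.0, 1.0]])   # toy negative examples

n_features = X_pos.shape[1]
w = cp.Variable(n_features)                  # the weights
gamma = cp.Variable()                        # the threshold

constraints = [X_pos @ w >= gamma + 1,       # for + examples
               X_neg @ w <= gamma - 1]       # for - examples
# Minimize the squared 2-norm of w (the "quadratic program" version).
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()
print(w.value, gamma.value)
```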


Dealing with Non-Separable Data

We can add what is called a "slack" variable to each example.

This variable can be viewed as:
- 0 if the example is correctly separated
- y = the "distance" we need to move the example to make it correct (i.e., the distance from its surface)


"Slack" Variables

[Figure: positive (A+) and negative (A-) examples; an example on the wrong side of its plane is at slack distance y from it, and the support vectors are marked]


The Math Program with Slack Variables

  min_{w, γ, s}  ||w||_2 + μ·||s||_1

  such that
    x_pos_i · w ≥ γ + 1 - s_i    (for each positive example i)
    x_neg_j · w ≤ γ - 1 + s_j    (for each negative example j)
    s ≥ 0

where:
- w has one component for each input feature
- s has one slack component for each example
- μ is a scaling constant
- ||s||_1 is the "one norm" of s: the sum of its components (all positive)

This is the "traditional" Support Vector Machine.
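
A minimal sketch (not from the slides) of fitting this traditional soft-margin SVM with scikit-learn on made-up toy data; scikit-learn's C parameter plays the role of the scaling constant on the slack penalty.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [0.0, 1.0], [1.5, 1.5]])
y = np.array([+1, +1, -1, -1, -1])           # the last example makes the data noisy

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)             # the learned weights and threshold
print(clf.support_vectors_)                  # the examples on or inside the margin
```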


Why the word "Support"?

- All the examples on, or on the wrong side of, the two separating planes are the support vectors
- We'd get the same answer if we deleted all the non-support vectors!
- i.e., the "support vectors" (examples) support the solution


PAC and the Number of Support Vectors

- The fewer the support vectors, the better the generalization will be
- Recall, non-support vectors are
  - correctly classified
  - don't change the learned model if left out of the training set
- So

    leave-one-out error rate ≤ (# support vectors) / (# training examples)




Finding Non-Linear Separating Surfaces

- Map inputs into a new space
  - Example: features x1, x2 with values 5, 4
  - becomes features x1, x2, x1^2, x2^2, x1*x2 with values 5, 4, 25, 16, 20 (sketched below)
- Solve the SVM program in this new space
- Computationally complex if there are many features
- But a clever trick exists
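
A minimal sketch of this explicit feature expansion in NumPy (an illustration, not from the slides):

```python
import numpy as np

def expand(X):
    """Map each row (x1, x2) to (x1, x2, x1**2, x2**2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

print(expand(np.array([[5.0, 4.0]])))   # -> [[ 5.  4. 25. 16. 20.]]
```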




The Kernel Trick

- Optimization problems often/always have a "primal" and a "dual" representation
- So far we've looked at the primal formulation
- The dual formulation is better for the case of a non-linear separating surface


Perceptrons Re-Visited

In perceptrons, if T ≡ +1 and F ≡ -1, the weights change whenever an example x_i is currently misclassified:

  w ← w + y_i x_i

So

  w_final = w_initial + Σ_{i ∈ examples} α_i y_i x_i

where α_i is some number of times we get example i wrong and change the weights.

This assumes w_initial = 0 (all zeros).




Dual Form of the Perceptron Learning Rule

Output of the perceptron:

  h(x) = sgn(w · x) = sgn(z),   where sgn(z) = +1 if z ≥ 0 and -1 otherwise

So

  h(x) = sgn( Σ_{j=1}^{#examples} α_j y_j (x_j · x) )

New (i.e., dual) perceptron algorithm:
  For each example i: if y_i · Σ_{j=1}^{#examples} α_j y_j (x_j · x_i) ≤ 0 (i.e., predicted ≠ teacher), then α_i ← α_i + 1 (counts errors)
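
A minimal sketch of this dual perceptron in NumPy on made-up toy data (an illustration, not from the slides):

```python
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """X: (n, d) array of examples, y: (n,) array of +1/-1 labels.
    Keeps one mistake counter alpha_i per training example instead of a weight vector."""
    n = len(X)
    alpha = np.zeros(n)                    # alpha_i = number of mistakes on example i
    gram = X @ X.T                         # all pairwise dot products x_j . x_i
    for _ in range(epochs):
        for i in range(n):
            z = np.sum(alpha * y * gram[:, i])
            if y[i] * z <= 0:              # predicted != teacher
                alpha[i] += 1              # count the error
    return alpha

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([+1, +1, -1, -1])
alpha = dual_perceptron(X, y)
w = (alpha * y) @ X                        # recover the primal weights if desired
print(alpha, w)
```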











Primal versus Dual Space

- Primal - "weight space"
  - Weight the features to make the output decision:

      h(x_new) = sgn( w · x_new )

- Dual - "training-examples space"
  - Weight the distance (which is based on the features) to the training examples:

      h(x_new) = sgn( Σ_{j=1}^{#examples} α_j y_j (x_j · x_new) )

The Dual SVM

Let n = # training examples. The dual program is

  min_α  (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j)  -  Σ_{i=1}^{n} α_i

  such that  Σ_{i=1}^{n} y_i α_i = 0  and  α_i ≥ 0

Can convert back to the primal form:

  w = Σ_{i=1}^{n} α_i y_i x_i
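
A minimal sketch (not from the slides) of solving this dual program with the cvxpy modeling library on made-up toy data, then converting back to the primal weights; the data and the choice of cvxpy are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
n = len(X)

alpha = cp.Variable(n)
# The double sum over alpha_i alpha_j y_i y_j (x_i . x_j) equals ||sum_i alpha_i y_i x_i||^2.
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)) - cp.sum(alpha))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = (alpha.value * y) @ X                    # convert back to the primal form
print(alpha.value, w)
```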










































Non-Zero α_i's

Recall

  w = Σ_{i=1}^{n} α_i y_i x_i

- Only the support vectors (i.e., examples with α_i ≠ 0) contribute to the weights
- Those examples with α_i ≠ 0 are the support vectors






Generalizing the Dot Product

- We can generalize Dot_Product(x_i, x_j) to other "kernel functions", e.g., K(x_i, x_j)
- An acceptable (usually non-linear) kernel maps the original features into a new space
  - implicitly - in this new space we're computing a dot product
  - we don't need to explicitly know the features in the new space
  - usually more efficient than directly converting to the new space













The New Space for a Sample Kernel

Let K(x, z) = (x · z)^2 and let #features = 2. Then

  K(x, z) = (x · z)^2
          = (x1 z1 + x2 z2)^2
          = x1 x1 z1 z1 + x1 x2 z1 z2 + x2 x1 z2 z1 + x2 x2 z2 z2
          = (x1 x1, x1 x2, x2 x1, x2 x2) · (z1 z1, z1 z2, z2 z1, z2 z2)

Our new feature space (with 4 dimensions) - we're doing a dot product in it.
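
A minimal sketch (not from the slides) checking this identity numerically: (x · z)^2 equals a plain dot product in the 4-dimensional derived space.

```python
import numpy as np

def phi(v):
    """Map (v1, v2) to the derived 4-dimensional space (v1*v1, v1*v2, v2*v1, v2*v2)."""
    v1, v2 = v
    return np.array([v1 * v1, v1 * v2, v2 * v1, v2 * v2])

x = np.array([5.0, 4.0])
z = np.array([1.0, 3.0])
print(np.dot(x, z) ** 2)            # kernel value K(x, z) = (x . z)^2
print(np.dot(phi(x), phi(z)))       # the same number, computed in the derived space
```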


Visualizing the Kernel

[Figure: two panels. Left - the Input Space (Original Space) with + and - examples and a separating plane that is non-linear here but linear in the derived space. Right - the Derived Feature Space (New Space) with the transformed examples g(+) and g(-).]

- g() is the feature transformation function
- The process is similar to what hidden units do in ANNs, but the kernel is user-chosen


More Sample Kernels

1) K(x, z) = (x · z + const)^d
2) K(x, z) = e^(-||x - z||_2^2 / c)  - Gaussian kernel; leads to an RBF network
3) a tanh-based kernel - related to the sigmoid of ANNs

plus many more, including many designed for specific tasks (text, DNA, etc.)
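
A minimal sketch (not from the slides) of trying these kernel families through scikit-learn's SVC on made-up toy data; the parameter values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])  # XOR-like toy data
y = np.array([-1, -1, +1, +1])                                   # not linearly separable

for params in ({"kernel": "poly", "degree": 2, "coef0": 1.0},    # (x.z + const)^d
               {"kernel": "rbf", "gamma": 1.0},                  # Gaussian kernel
               {"kernel": "sigmoid", "coef0": 0.0}):             # tanh-based kernel
    clf = SVC(C=10.0, **params).fit(X, y)
    print(params["kernel"], clf.predict(X))
```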
























What Makes a Kernel

- Mercer's theorem characterizes when a function f(x, z) is a kernel
- If K1() and K2() are kernels, then so are:
  1) K1() + K2()
  2) c*K1(), where c is a constant
  3) K1() * K2()
  4) f(x)*f(z), where f() returns a real value
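
A minimal sketch (not from the slides) of the practical side of Mercer's condition: for a valid kernel, the Gram matrix built on any finite set of points is symmetric positive semi-definite (all eigenvalues ≥ 0).

```python
import numpy as np

def gram(kernel, X):
    """Build the Gram matrix of kernel() over the rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

X = np.random.default_rng(0).normal(size=(5, 2))
poly = lambda a, b: (np.dot(a, b) + 1.0) ** 2            # a polynomial kernel
print(np.linalg.eigvalsh(gram(poly, X)).min() >= -1e-9)  # True: PSD, consistent with a valid kernel
```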







Key SVM Ideas

- Maximize the margin between positive and negative examples (connects to PAC theory)
- Penalize errors in the non-separable case
- Only the support vectors contribute to the solution
- Kernels map examples into a new, usually non-linear space
- We implicitly do dot products in this new space (in the "dual" form of the SVM program)