Support Vector Machines (SVM)


Copyright © 1999-2000 by Yu Hen Hu

Outline


Linear pattern classifiers and the optimal hyperplane

Optimization problem formulation

Statistical properties of the optimal hyperplane

The case of non-separable patterns

Applications to general pattern classification

Mercer's Theorem



Linear Hyperplane Classifier

Given: {(x_i, d_i); i = 1 to N, d_i ∈ {+1, −1}}.

A linear hyperplane classifier is a hyperplane consisting of the points x such that

H = {x | g(x) = w^T x + b = 0}

g(x): the discriminant function.

For x on the "o" side of H:  w^T x + b ≥ 0,  d = +1;
For x on the other side:     w^T x + b < 0,  d = −1.

Distance from x to H:

r = w^T x/|w| − (−b/|w|) = g(x)/|w|


[Figure: hyperplane H in the (x_1, x_2) plane, with normal vector w, offset −b/|w| from the origin, and a sample x at distance r from H.]


Distance from a Point to a Hyperplane

The hyperplane H is characterized by

(*)  w^T x + b = 0

w: normal vector perpendicular to H.

(*) says that any vector x on H, projected onto w, will have a length of OA = −b/|w|.

Consider a special point C corresponding to the vector x*. Its projection onto the vector w is w^T x*/|w| = OA + BC, or equivalently, w^T x*/|w| = r − b/|w|.

Hence r = (w^T x* + b)/|w| = g(x*)/|w|.

[Figure: point x* at signed distance r from the hyperplane H, with its projection onto the normal vector w passing through the points O, A, B, C.]
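As a quick numerical check of r = g(x*)/|w|, the sketch below evaluates the discriminant and the signed distance for a made-up 2-D hyperplane and test point (the values are illustrative, not from the slides):

% Signed distance from a point to the hyperplane w'*x + b = 0
w = [3; 4];           % normal vector (made-up values)
b = -5;               % bias (made-up value)
xstar = [2; 1];       % a test point x*
g = w'*xstar + b;     % discriminant g(x*) = 3*2 + 4*1 - 5 = 5
r = g/norm(w);        % signed distance r = g(x*)/|w| = 5/5 = 1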


Optimal Hyperplane: Linearly Separable Case

For d_i = +1:  g(x_i) = w^T x_i + b ≥ ρ|w|   ⇒   w_o^T x_i + b_o ≥ +1
For d_i = −1:  g(x_i) = w^T x_i + b ≤ −ρ|w|  ⇒   w_o^T x_i + b_o ≤ −1


[Figure: two classes of samples in the (x_1, x_2) plane, separated by a gap.]

The optimal hyperplane should be in the center of the gap.

Supporting vectors: samples on the boundaries. The supporting vectors alone can determine the optimal hyperplane.

Question: how to find the optimal hyperplane?


Separation Gap

For x_i being a supporting vector:

For d_i = +1:  g(x_i) = w^T x_i + b = ρ|w|   ⇒   w_o^T x_i + b_o = 1
For d_i = −1:  g(x_i) = w^T x_i + b = −ρ|w|  ⇒   w_o^T x_i + b_o = −1

Hence w_o = w/(ρ|w|), b_o = b/(ρ|w|). But the distance from x_i to the hyperplane is ρ = g(x_i)/|w|. Thus w_o = w/g(x_i), and ρ = 1/|w_o|.

The maximum distance between the two classes is 2ρ = 2/|w_o|.

Hence the objective is to find w_o, b_o that minimize |w_o| (so that ρ is maximized), subject to the constraints

w_o^T x_i + b_o ≥ +1 for d_i = +1;  and  w_o^T x_i + b_o ≤ −1 for d_i = −1.

Combining these constraints, one has:  d_i (w_o^T x_i + b_o) ≥ 1.
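These relations are easy to check numerically. A minimal sketch, using the 1-D solution w_o = 2, b_o = -3 that is worked out in the numerical example a few slides below:

% Margin and separation gap for the 1-D example solved later (w_o = 2, b_o = -3)
wo = 2;  bo = -3;                 % optimal hyperplane: 2x - 3 = 0
x  = [1; 2; 3];  d = [-1; 1; 1];  % training samples (x_i, d_i)
dist = d .* (wo*x + bo);          % d_i*(wo*x_i + bo) = [1; 1; 3]; equals 1 at the support vectors
rho  = 1/abs(wo);                 % rho = 0.5
gap  = 2/abs(wo);                 % maximum distance between the two classes = 1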


Quadratic Optimization Problem Formulation

Given {(x_i, d_i); i = 1 to N}, find w and b such that

Φ(w) = w^T w/2

is minimized subject to the N constraints

d_i (w^T x_i + b) − 1 ≥ 0;   1 ≤ i ≤ N.

Method of Lagrange Multipliers

J(w, b, α) = Φ(w) − Σ_{i=1}^{N} α_i [d_i (w^T x_i + b) − 1]

Set ∂J(w, b, α)/∂w = 0   ⇒   w = Σ_{i=1}^{N} α_i d_i x_i

Set ∂J(w, b, α)/∂b = 0   ⇒   Σ_{i=1}^{N} α_i d_i = 0


Optimization (continued)

The solution of the Lagrange multiplier problem is at a saddle point, where the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. α_i.

Kuhn-Tucker condition: at the saddle point,

α_i [d_i (w^T x_i + b) − 1] = 0   for 1 ≤ i ≤ N.

If x_i is NOT a support vector, the corresponding α_i = 0!

Hence, only the support vectors will affect the result of the optimization!





A Numerical Example



3 inequalities:  1·w + b ≤ −1;  2·w + b ≥ +1;  3·w + b ≥ +1

J = w^2/2 − α_1(−w − b − 1) − α_2(2w + b − 1) − α_3(3w + b − 1)

∂J/∂w = 0   ⇒   w = −α_1 + 2α_2 + 3α_3
∂J/∂b = 0   ⇒   0 = α_1 − α_2 − α_3

Solve: (a) −w − b − 1 = 0;  (b) 2w + b − 1 = 0;  (c) 3w + b − 1 = 0.

(b) and (c) conflict with each other. Solving (a) and (b) yields w = 2, b = −3. From the Kuhn-Tucker condition, α_3 = 0. Thus α_1 = α_2 = 2. Hence the decision boundary is 2x − 3 = 0, or x = 1.5! This is shown as the dashed line in the figure.

[Figure: training samples (1, −1), (2, +1), (3, +1) on the x-axis, with the decision boundary x = 1.5 drawn as a dashed line.]
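The closed-form answer can be verified with a few lines of MATLAB; this sketch simply re-solves the two active constraints and the stationarity conditions stated above (no toolbox required):

% Verify the 1-D example: samples x = [1 2 3]', labels d = [-1 1 1]'
x = [1 2 3]';  d = [-1 1 1]';
% Active constraints (a) and (b): d_i*(w*x_i + b) = 1 at the support vectors x = 1 and x = 2
%   [-1 -1; 2 1] * [w; b] = [1; 1]
sol = [-1 -1; 2 1] \ [1; 1];   % -> w = 2, b = -3
w = sol(1);  b = sol(2);
% Multipliers: w = -a1 + 2*a2 + 3*a3 and a1 - a2 - a3 = 0, with a3 = 0
%   [-1 2; -1 1] * [a1; a2] = [w; 0]
a = [-1 2; -1 1] \ [w; 0];     % -> a1 = a2 = 2
margins = d .* (w*x + b);      % = [1; 1; 3], so all constraints d_i*(w*x_i + b) >= 1 hold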


Primal/Dual Problem Formulation

Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem with the Lagrange multipliers providing the solution can be formulated.

Duality Theorem (Bertsekas, 1995)

(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.

(b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and

Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o)


Formulating the Dual Problem


With w = Σ_{i=1}^{N} α_i d_i x_i and Σ_{i=1}^{N} α_i d_i = 0, these lead to a Dual Problem:

Maximize   Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j

Subject to:   Σ_{i=1}^{N} α_i d_i = 0   and   α_i ≥ 0 for i = 1, 2, …, N.

Note that Q(α) depends on the training inputs only through the inner products x_i^T x_j.



Numerical Example (cont’d)


For this example,

Q(α) = α_1 + α_2 + α_3 − [0.5α_1^2 + 2α_2^2 + 4.5α_3^2 − 2α_1α_2 − 3α_1α_3 + 6α_2α_3]

subject to the constraints

−α_1 + α_2 + α_3 = 0,   and   α_1 ≥ 0, α_2 ≥ 0, α_3 ≥ 0.

Use the MATLAB Optimization Toolbox command:

x = fmincon('qalpha', X0, A, B, Aeq, Beq)

The solution is [α_1 α_2 α_3] = [2 2 0], as expected.
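For reference, the same three-variable dual can also be posed directly as a quadratic program. A sketch, assuming the Optimization Toolbox function quadprog is available (it minimizes 0.5*a'*H*a + f'*a, i.e. -Q(a)):

% Dual QP for the 1-D example, solved with quadprog
x = [1 2 3]';  d = [-1 1 1]';
H = (d*d') .* (x*x');          % H(i,j) = d_i*d_j*x_i*x_j
f = -ones(3,1);                % so that 0.5*a'*H*a + f'*a = -Q(a)
Aeq = d';  beq = 0;            % equality constraint: sum_i alpha_i*d_i = 0
lb  = zeros(3,1);              % alpha_i >= 0
alpha = quadprog(H, f, [], [], Aeq, beq, lb, []);   % -> [2; 2; 0]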


Implication of Minimizing ||w||

Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x_1, x_2, …, x_N}. The set of optimal hyper-planes described by the equation

w_o^T x + b_o = 0

has a VC-dimension h bounded from above as

h ≤ min{D^2/ρ^2, m_0} + 1

where m_0 is the dimension of the input vectors, and ρ = 2/||w_o|| is the margin of separation of the hyper-planes.

The VC-dimension determines the complexity of the classifier structure, and usually the smaller the better.



Non-separable Cases

Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint

d_i (w^T x_i + b) ≥ 1,   i = 1, 2, …, N.

If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:

d_i (w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, …, N.

{ξ_i; 1 ≤ i ≤ N} are known as slack variables. If ξ_i > 1, then the corresponding (x_i, d_i) will be mis-classified.

The minimum error classifier would minimize the number of samples with ξ_i > 1, but that count is non-convex w.r.t. w. Hence an approximation is to minimize

Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i
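A small sketch of how the slack variables behave for a fixed (w, b); the 2-D samples and hyperplane below are made-up values, used only to illustrate the smallest feasible ξ_i = max(0, 1 - d_i*(w'*x_i + b)) and the resulting soft objective:

% Slack variables for a fixed hyperplane (w, b)
X = [0 0; 1 0; 2 1; 0.4 0.2];    % made-up 2-D samples (one per row)
d = [-1; -1; 1; 1];              % labels; the last sample lies on the wrong side of H
w = [1; 1];  b = -1.5;  C = 10;  % made-up hyperplane and penalty weight
xi = max(0, 1 - d .* (X*w + b)); % smallest slack with d_i*(w'*x_i + b) >= 1 - xi_i and xi_i >= 0
misclassified = xi > 1;          % xi_i > 1  <=>  (x_i, d_i) is mis-classified
obj = 0.5*(w'*w) + C*sum(xi);    % the soft-margin objective Phi(w, xi)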



Primal and Dual Problem Formulation

Primal Optimization Problem

Given {(x_i, d_i); 1 ≤ i ≤ N}. Find w and b such that

Φ(w, ξ) = (1/2) w^T w + C Σ_{i=1}^{N} ξ_i

is minimized subject to the constraints (i) ξ_i ≥ 0, and (ii) d_i (w^T x_i + b) ≥ 1 − ξ_i, for i = 1, 2, …, N.

Dual Optimization Problem

Given {(x_i, d_i); 1 ≤ i ≤ N}. Find the Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j

is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (a user-specified positive number) and (ii) Σ_{i=1}^{N} α_i d_i = 0.


Solution to the Dual Problem

The optimal solution to the dual problem is

w_o = Σ_{i=1}^{N_s} α_{o,i} d_i x_i      (N_s: # of support vectors)

The Kuhn-Tucker condition implies, for i = 1, 2, …, N:

(i)  α_i [d_i (w^T x_i + b) − 1 + ξ_i] = 0      (*)
(ii) μ_i ξ_i = 0

{μ_i; 1 ≤ i ≤ N} are Lagrange multipliers to enforce the condition ξ_i ≥ 0. At the optimal point of the primal problem, ∂J/∂ξ_i = 0. One may deduce that ξ_i = 0 if α_i < C. Solving (*), we have

b_o = d_i − w_o^T x_i    for any support vector x_i with 0 < α_{o,i} < C.





Matlab Implementation

% svm1.m: basic support vector machine
% X: N by m matrix. i-th row is x_i
% d: N by 1 vector. i-th element is d_i
% X, d should be loaded from file or read from input.
% call MATLAB Optimization Toolbox function fmincon
a0=eps*ones(N,1);
C = 1;
a=fmincon('qfun',a0,[],[],d',0,zeros(N,1),C*ones(N,1),...
    [],[],X,d)
wo=X'*(a.*d)
bo=sum(diag(a)*(X*wo-d))/sum([a > 10*eps])

function y=qfun(a,X,d);
% the Q(a) function. Note that it is actually -Q(a),
% because we call fmincon to minimize: minimizing -Q(a)
% is the same as maximizing Q(a)
[N,m]=size(X);
y=-ones(1,N)*a+0.5*a'*diag(d)*X*X'*diag(d)*a;



Inner Product Kernels

In general, if the input is first transformed via a set of nonlinear functions {φ_j(x)} and then subjected to the hyperplane classifier

y = Σ_{j=1}^{p} w_j φ_j(x) + b

define the inner product kernel as

K(x, x') = φ^T(x) φ(x') = Σ_j φ_j(x) φ_j(x')

One may then obtain a dual optimization problem formulation as:

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(x_i, x_j)

Often, dim of φ (= p + 1) >> dim of x.
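In the kernelized dual, the data enter only through the Gram matrix K(x_i, x_j). A minimal sketch of building it and of the corresponding -Q(alpha), using the degree-2 polynomial kernel and the XOR inputs that appear in the later example:

% Gram (kernel) matrix and dual objective for a degree-2 polynomial kernel
X = [-1 -1; -1 1; 1 -1; 1 1];    % N x m input matrix (rows are x_i'); XOR inputs
d = [-1; 1; 1; -1];              % class labels
K = (1 + X*X').^2;               % K(i,j) = (1 + x_i'*x_j)^2
negQ = @(a) -sum(a) + 0.5*a'*(diag(d)*K*diag(d))*a;   % -Q(alpha), to be minimized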


General Pattern Recognition with SVM

By careful selection of the nonlinear transformation {φ_j(x); 1 ≤ j ≤ p}, any pattern recognition problem can be solved.

[Figure: network view of the SVM classifier. The input x_i = (x_1, …, x_m) feeds the nonlinear functions φ_1(x), φ_2(x), …, φ_p(x); their outputs are weighted by w_1, w_2, …, w_p and summed together with the bias b to produce the output d_i for the input x_i.]


Polynomial Kernel

Consider a polynomial kernel

K(x, y) = (1 + x^T y)^2

Let K(x, y) = φ^T(x) φ(y). Then

φ(x) = [1, x_1^2, …, x_m^2, √2·x_1, …, √2·x_m, √2·x_1x_2, …, √2·x_1x_m, √2·x_2x_3, …, √2·x_2x_m, …, √2·x_{m−1}x_m]
     = [1, φ_1(x), …, φ_p(x)]

where p = 1 + m + m + (m − 1) + (m − 2) + … + 1 = (m + 2)(m + 1)/2.

Hence, using a kernel, a low-dimensional pattern classification problem (with dimension m) is solved in a higher-dimensional space (dimension p + 1). But only the φ_j(x) corresponding to support vectors are used for pattern classification!
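A quick numerical check of K(x, y) = φ^T(x)φ(y) for the degree-2 kernel with m = 2 (the ordering of φ follows the slide; the test vectors are arbitrary):

% Check (1 + x'*y)^2 == phi(x)'*phi(y) for m = 2
phi = @(x) [1; x(1)^2; x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(1)*x(2)];
x = [2; -1];  y = [0.5; 3];      % arbitrary test vectors
k_direct  = (1 + x'*y)^2;        % kernel evaluated directly            -> 1
k_feature = phi(x)'*phi(y);      % same value through the feature map   -> 1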


Numerical Example: XOR Problem

Training samples (x_1, x_2; d):  (−1, −1; −1), (−1, 1; +1), (1, −1; +1), (1, 1; −1)

x = [x_1, x_2]^T. Using K(x, x_i) = (1 + x^T x_i)^2, one has

φ(x) = [1, x_1^2, x_2^2, √2·x_1, √2·x_2, √2·x_1x_2]^T

Note dim[φ(x)] = 6 > dim[x] = 2!


XOR Problem (Continued)

Note that K(x_i, x_j) can be calculated directly without using φ!

E.g., K(x_1, x_1) = (1 + x_1^T x_1)^2 = (1 + 2)^2 = 9.

The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T.

w = Σ_i α_i d_i φ(x_i)
  = (1/8)(−1)φ(x_1) + (1/8)(1)φ(x_2) + (1/8)(1)φ(x_3) + (1/8)(−1)φ(x_4)
  = [0 0 0 0 0 −1/√2]^T

Hence the hyperplane is:  y = w^T φ(x) = −x_1 x_2


[Figure: the four XOR points (−1, −1), (−1, +1), (+1, −1), (+1, +1) in the (x_1, x_2) plane, with the decision function y = −x_1 x_2.]
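The algebra above is easy to verify numerically: the sketch below recomputes w from the stated multipliers α_i = 1/8 and checks the decision values on the four XOR points.

% Verify w = sum_i alpha_i*d_i*phi(x_i) and the decision y = -x1*x2 for XOR
phi = @(x) [1; x(1)^2; x(2)^2; sqrt(2)*x(1); sqrt(2)*x(2); sqrt(2)*x(1)*x(2)];
X = [-1 -1; -1 1; 1 -1; 1 1];  d = [-1; 1; 1; -1];
alpha = (1/8)*ones(4,1);
Phi = [phi(X(1,:)), phi(X(2,:)), phi(X(3,:)), phi(X(4,:))];  % 6 x 4 matrix of feature vectors
w = Phi*(alpha.*d);            % -> [0 0 0 0 0 -1/sqrt(2)]'
y = w'*Phi;                    % decision values w'*phi(x_i) = [-1 1 1 -1] = d'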


Other Types of Kernels


Type of SVM                  K(x, y)                        Comments
Polynomial learning machine  (x^T y + 1)^p                  p: selected a priori
Radial basis function        exp(−||x − y||^2 / (2σ^2))     σ^2: selected a priori
Two-layer perceptron         tanh(β_o x^T y + β_1)          only some β_o and β_1 values are feasible

What kernel is feasible? It must satisfy Mercer's theorem!
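For concreteness, the three kernels in the table can be written as one-line functions; the hyperparameter values below (p, sigma2, beta_o, beta_1) are placeholders for the a-priori choices:

% The three kernel types from the table, for column vectors x and y
p = 3;  sigma2 = 1;  beta_o = 1;  beta_1 = -1;     % example hyperparameters
k_poly = @(x,y) (x'*y + 1)^p;                      % polynomial learning machine
k_rbf  = @(x,y) exp(-(norm(x - y)^2)/(2*sigma2));  % radial basis function
k_mlp  = @(x,y) tanh(beta_o*(x'*y) + beta_1);      % two-layer perceptron (only some
                                                   % beta_o, beta_1 choices satisfy Mercer's theorem)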



Mercer's Theorem

Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigen-function expansion

K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y)

with λ_i > 0 for each i. This expansion converges absolutely and uniformly if and only if

∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0

for all ψ(x) such that ∫_a^b ψ^2(x) dx < ∞.



Testing with Kernels

For many types of kernels, φ(x) cannot be explicitly represented or even found. However, with

w = Σ_{i=1}^{N} α_i d_i φ(x_i) = [φ(x_1) … φ(x_N)] f,   where f_i = α_i d_i,

y(x) = w^T φ(x) = f^T [φ(x_1) … φ(x_N)]^T φ(x) = Σ_{i=1}^{N} f_i K(x_i, x).

Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[−1 +1 +1 −1]^T. Suppose that x = (−1, +1); then

y(x) = (1/8)[−K(x_1, x) + K(x_2, x) + K(x_3, x) − K(x_4, x)] = (1/8)(−1 + 9 + 1 − 1) = 1 > 0,

so x is classified as d = +1, consistent with y = −x_1 x_2 = +1.
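The same test-time computation, carried out purely through the kernel (never forming φ), for the XOR example:

% Classify a test point using kernels only: y(x) = sum_i f_i * K(x_i, x)
X = [-1 -1; -1 1; 1 -1; 1 1];  % training inputs (rows)
f = (1/8)*[-1; 1; 1; -1];      % f_i = alpha_i * d_i
x = [-1; 1];                   % test point
k = (1 + X*x).^2;              % K(x_i, x) for all i: [1; 9; 1; 1]
y = f'*k;                      % = 1, so x is classified as d = +1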