# Support Vector Machines (SVM)

AI and Robotics

Oct 16, 2013

© 2000 by Yu Hen Hu

## Outline

- Linear pattern classifiers and optimal hyperplane
- Optimization problem formulation
- Statistical properties of the optimal hyperplane
- The case of non-separable patterns
- Applications to general pattern classification
- Mercer's Theorem

## Linear Hyperplane Classifier

Given: {(xᵢ, dᵢ); i = 1 to N, dᵢ ∈ {+1, −1}}.

A linear hyperplane classifier is a hyperplane consisting of the points x such that

H = {x | g(x) = wᵀx + b = 0}

where g(x) is the discriminant function.

For x on the positive side of H: wᵀx + b > 0, so d = +1.
For x on the negative side of H: wᵀx + b < 0, so d = −1.

Distance from x to H:

r = wᵀx/|w| − (−b/|w|) = g(x)/|w|

(Figure: a hyperplane H in the (x₁, x₂) plane with normal vector w, at offset −b/|w| from the origin, and a point x at distance r from H.)
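The distance formula r = g(x)/|w| is easy to check numerically. A minimal NumPy sketch (Python here, rather than the MATLAB used later in the slides), with an arbitrary example hyperplane of my choosing:

```python
import numpy as np

# Hypothetical hyperplane g(x) = w'x + b with w = (3, 4), b = -5.
w = np.array([3.0, 4.0])
b = -5.0

def signed_distance(x):
    """r = g(x)/|w|: signed distance from x to the hyperplane H."""
    return (w @ x + b) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
print(signed_distance(x))  # g(x) = 3 + 4 - 5 = 2, |w| = 5, so r = 0.4
```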


## Distance from a Point to a Hyper-plane

The hyperplane H is characterized by

(*) wᵀx + b = 0

where w is the normal vector perpendicular to H.

(*) says that any vector x on H, when projected onto w, has length OA = −b/|w|.

Consider a special point C corresponding to a vector x*. Its projection onto the vector w is wᵀx*/|w| = OA + BC, or equivalently, wᵀx*/|w| = r − b/|w|.

Hence r = (wᵀx* + b)/|w| = g(x*)/|w|.

(Figure: the hyperplane H, its normal w through the origin O, the foot A of the offset, and the point x* at distance r = BC from H.)


## Optimal Hyperplane: Linearly Separable Case

For dᵢ = +1: g(xᵢ) = wᵀxᵢ + b ≥ ρ|w|, i.e. w_oᵀxᵢ + b_o ≥ 1

For dᵢ = −1: g(xᵢ) = wᵀxᵢ + b ≤ −ρ|w|, i.e. w_oᵀxᵢ + b_o ≤ −1

(Figure: the two classes in the (x₁, x₂) plane separated by the optimal hyperplane.)


## Separation Gap

For xᵢ being a support vector:

For dᵢ = +1: g(xᵢ) = wᵀxᵢ + b = ρ|w|, i.e. w_oᵀxᵢ + b_o = 1

For dᵢ = −1: g(xᵢ) = wᵀxᵢ + b = −ρ|w|, i.e. w_oᵀxᵢ + b_o = −1

Hence w_o = w/(ρ|w|) and b_o = b/(ρ|w|). But the distance from xᵢ to the hyperplane is ρ = g(xᵢ)/|w|. Thus w_o = w/g(xᵢ), and ρ = 1/|w_o|.

The maximum distance between the two classes is 2ρ = 2/|w_o|.

Hence the objective is to find w_o, b_o that minimize |w_o| (so that ρ is maximized) subject to the constraints

w_oᵀxᵢ + b_o ≥ 1 for dᵢ = +1; and w_oᵀxᵢ + b_o ≤ −1 for dᵢ = −1.

Combining these constraints, one has: dᵢ(w_oᵀxᵢ + b_o) ≥ 1
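As a sanity check on ρ = 1/|w_o|, here is a small NumPy sketch with an illustrative hyperplane of my own choosing (not from the slides): two support vectors sit at signed distance ±1/|w_o| from the hyperplane, so the gap between the classes is 2/|w_o|.

```python
import numpy as np

w_o = np.array([1.0, 1.0])
b_o = -3.0
norm = np.linalg.norm(w_o)        # |w_o| = sqrt(2)

x_plus = np.array([2.0, 2.0])     # support vector with w_o.x + b_o = +1
x_minus = np.array([1.0, 1.0])    # support vector with w_o.x + b_o = -1

rho = 1.0 / norm                  # half-margin from the derivation above
assert np.isclose((w_o @ x_plus + b_o) / norm, rho)    # distance = +rho
assert np.isclose((w_o @ x_minus + b_o) / norm, -rho)  # distance = -rho

# gap between the two support vectors, measured along w_o:
gap = (x_plus - x_minus) @ w_o / norm
print(gap, 2 / norm)  # both equal sqrt(2)
```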


## Quadratic Optimization Problem Formulation

Given {(xᵢ, dᵢ); i = 1 to N}, find w and b such that

Φ(w) = wᵀw/2

is minimized subject to the N constraints

dᵢ(wᵀxᵢ + b) − 1 ≥ 0;  1 ≤ i ≤ N.

Method of Lagrange multipliers:

J(w, b, α) = Φ(w) − Σᵢ αᵢ[dᵢ(wᵀxᵢ + b) − 1]

Set ∂J(w, b, α)/∂w = 0  ⟹  w = Σᵢ αᵢdᵢxᵢ

Set ∂J(w, b, α)/∂b = 0  ⟹  Σᵢ αᵢdᵢ = 0


## Optimization (continued)

The solution of the Lagrange multiplier problem is at a saddle point, where the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. αᵢ.

Kuhn-Tucker condition: at the saddle point,

αᵢ[dᵢ(wᵀxᵢ + b) − 1] = 0 for 1 ≤ i ≤ N.

If xᵢ is NOT a support vector, the corresponding αᵢ = 0!

Hence, only the support vectors will affect the result of the optimization!


## A Numerical Example

Training samples (x, d): (1, −1), (2, +1), (3, +1).

3 inequalities: 1·w + b ≤ −1;  2·w + b ≥ +1;  3·w + b ≥ +1

J = w²/2 − α₁(−w − b − 1) − α₂(2w + b − 1) − α₃(3w + b − 1)

∂J/∂w = 0  ⟹  w = −α₁ + 2α₂ + 3α₃

∂J/∂b = 0  ⟹  0 = α₁ − α₂ − α₃

Solve: (a) −w − b − 1 = 0; (b) 2w + b − 1 = 0; (c) 3w + b − 1 = 0.

(b) and (c) conflict with each other. Solving (a) and (b) yields w = 2, b = −3. From the Kuhn-Tucker condition, α₃ = 0. Thus, α₁ = α₂ = 2. Hence the solution for the decision boundary is: 2x − 3 = 0, or x = 1.5! This is shown as the dashed line in the figure above.

(Figure: the three training samples (1, −1), (2, +1), (3, +1) on a line, with the decision boundary x = 1.5 shown dashed.)
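The closed-form solution above is easy to verify numerically. A short Python sketch (plain NumPy, independent of the MATLAB code later in the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
d = np.array([-1.0, 1.0, 1.0])
w, b = 2.0, -3.0
alpha = np.array([2.0, 2.0, 0.0])

margins = d * (w * x + b)
assert np.all(margins >= 1 - 1e-12)          # all constraints d_i(w x_i + b) >= 1 hold
assert np.isclose(w, np.sum(alpha * d * x))  # stationarity: w = sum_i alpha_i d_i x_i
assert np.isclose(np.sum(alpha * d), 0)      # sum_i alpha_i d_i = 0
# Kuhn-Tucker: alpha_i (margin_i - 1) = 0 for every i
assert np.allclose(alpha * (margins - 1), 0)
print("decision boundary at x =", -b / w)    # 1.5
```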


## Primal/Dual Problem Formulation

Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem with the Lagrange multipliers providing the solution can be formulated.

Duality Theorem (Bertsekas 1995):

(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.

(b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and

Φ(w_o) = J(w_o, b_o, α_o) = min_w J(w, b_o, α_o)


## Formulating the Dual Problem

Substituting w = Σᵢ αᵢdᵢxᵢ and Σᵢ αᵢdᵢ = 0 into J(w, b, α) leads to a Dual Problem:

Maximize  Q(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼdᵢdⱼxᵢᵀxⱼ

Subject to:  Σᵢ αᵢdᵢ = 0, and αᵢ ≥ 0 for i = 1, 2, …, N.

Note that Q(α) depends on the training samples only through the inner products xᵢᵀxⱼ.


## Numerical Example (cont'd)

For the example above,

Q(α) = α₁ + α₂ + α₃ − [0.5α₁² + 2α₂² + 4.5α₃² − 2α₁α₂ − 3α₁α₃ + 6α₂α₃]

subject to the constraints:

−α₁ + α₂ + α₃ = 0, and α₁ ≥ 0, α₂ ≥ 0, and α₃ ≥ 0.

Use the Matlab Optimization Toolbox command:

x = fmincon('qalpha', X0, A, B, Aeq, Beq)

The solution is [α₁ α₂ α₃] = [2 2 0], as expected.
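The same dual can be solved in Python with SciPy in place of MATLAB's fmincon. A sketch with an explicit Q(α) for this example (the solver choice and starting point are mine, not from the slides):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0])
d = np.array([-1.0, 1.0, 1.0])
H = np.outer(d, d) * np.outer(x, x)   # H_ij = d_i d_j x_i x_j

def neg_q(a):
    # minimize -Q(a), which is the same as maximizing Q(a) = sum(a) - 0.5 a'Ha
    return -(np.sum(a) - 0.5 * a @ H @ a)

res = minimize(neg_q, x0=np.full(3, 0.1),
               constraints=[{"type": "eq", "fun": lambda a: a @ d}],
               bounds=[(0, None)] * 3, method="SLSQP")
print(np.round(res.x, 4))  # approximately [2, 2, 0]
```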


## Implication of Minimizing ||w||

Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x₁, x₂, …, x_N}. The set of optimal hyper-planes described by the equation

w_oᵀx + b_o = 0

has a VC-dimension h bounded from above as

h ≤ min{⌈D²/ρ²⌉, m₀} + 1

where m₀ is the dimension of the input vectors, and ρ = 2/||w_o|| is the margin of separation of the hyper-planes.

The VC-dimension determines the complexity of the classifier structure, and usually the smaller the better.


## Non-separable Cases

Recall that in the linearly separable case, each training sample pair (xᵢ, dᵢ) represents a linear inequality constraint

dᵢ(wᵀxᵢ + b) ≥ 1,  i = 1, 2, …, N

If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:

dᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ,  i = 1, 2, …, N

{ξᵢ; 1 ≤ i ≤ N} are known as slack variables. If ξᵢ > 1, then the corresponding (xᵢ, dᵢ) will be mis-classified.

The minimum error classifier would minimize the number of samples with ξᵢ > 1, but that count is non-convex w.r.t. w. Hence an approximation is to minimize

Φ(w, ξ) = wᵀw/2 + C Σᵢ ξᵢ
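Once (w, b) is fixed, the slack variables have a closed form: ξᵢ = max(0, 1 − dᵢ(wᵀxᵢ + b)). A small sketch with made-up 1-D data (the data and C = 1 are arbitrary choices for illustration):

```python
import numpy as np

x = np.array([-2.0, -0.5, 2.0])
d = np.array([-1.0, -1.0, 1.0])
w, b, C = 1.0, 0.0, 1.0

# slack: how far each sample falls short of the margin requirement
xi = np.maximum(0.0, 1.0 - d * (w * x + b))
print(xi)                       # [0.  0.5 0. ]: the middle sample violates the margin
objective = 0.5 * w * w + C * np.sum(xi)
print(objective)                # 1.0
```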


## Primal and Dual Problem Formulation

Primal optimization problem: Given {(xᵢ, dᵢ); 1 ≤ i ≤ N}, find w, b such that

Φ(w, ξ) = wᵀw/2 + C Σᵢ ξᵢ

is minimized subject to the constraints (i) ξᵢ ≥ 0, and (ii) dᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ for i = 1, 2, …, N.

Dual optimization problem: Given {(xᵢ, dᵢ); 1 ≤ i ≤ N}, find the Lagrange multipliers {αᵢ; 1 ≤ i ≤ N} such that

Q(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼdᵢdⱼxᵢᵀxⱼ

is maximized subject to the constraints (i) 0 ≤ αᵢ ≤ C (a user-specified positive number) and (ii) Σᵢ αᵢdᵢ = 0.


## Solution to the Dual Problem

The optimal solution to the dual problem is

w_o = Σᵢ αᵢdᵢxᵢ (sum over the support vectors, i.e. those with αᵢ > 0),  N_s: # of support vectors.

The Kuhn-Tucker condition implies that for i = 1, 2, …, N,

(i) αᵢ[dᵢ(wᵀxᵢ + b) − 1 + ξᵢ] = 0   (*)

(ii) μᵢξᵢ = 0

{μᵢ; 1 ≤ i ≤ N} are the Lagrange multipliers that enforce the condition ξᵢ ≥ 0. At the optimal point of the primal problem, ∂J/∂ξᵢ = 0. One may deduce that ξᵢ = 0 if αᵢ < C. Solving (*) with any support vector xᵢ for which 0 < αᵢ < C (so that ξᵢ = 0), we have b_o = dᵢ − w_oᵀxᵢ.


## Matlab Implementation

```matlab
% svm1.m: basic support vector machine
% X: N by m matrix. i-th row is x_i
% d: N by 1 vector. i-th element is d_i
% X, d should be loaded from file or read from input.
% call MATLAB optimization tool box function fmincon
a0=eps*ones(N,1);
C = 1;
a=fmincon('qfun',a0,[],[],d',0,zeros(N,1),C*ones(N,1),...
    [],[],X,d)
wo=X'*(a.*d)
bo=sum(diag(a)*(X*wo-d))/sum([a > 10*eps])

function y=qfun(a,X,d);
% the Q(a) function. Note that it is actually -Q(a),
% because we call fmincon to minimize; minimizing -Q(a)
% is the same as maximizing Q(a)
[N,m]=size(X);
y=-ones(1,N)*a+0.5*a'*diag(d)*X*X'*diag(d)*a;
```


## Inner Product Kernels

In general, the input is first transformed via a set of nonlinear functions {φᵢ(x)} and then subjected to the hyperplane classifier

y = Σⱼ wⱼφⱼ(x) + b

Define the inner product kernel as

K(x, y) = Σⱼ φⱼ(x)φⱼ(y) = φᵀ(x)φ(y)

Then one may obtain a dual optimization problem formulation as:

Maximize  Q(α) = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼdᵢdⱼK(xᵢ, xⱼ)

Often, dim of φ (= p + 1) ≫ dim of x (= m).


## General Pattern Recognition with SVM

By careful selection of the nonlinear transformation {φⱼ(x); 1 ≤ j ≤ p}, any pattern recognition problem can be solved.

(Figure: a network diagram. The input xᵢ = (x₁, …, x_m) feeds the nonlinear units φ₁(x), φ₂(x), …, φ_p(x); their outputs are weighted by w₁, w₂, …, w_p and summed with the bias b to produce dᵢ.)


## Polynomial Kernel

Consider a polynomial kernel

K(x, y) = (1 + xᵀy)²

Let K(x, y) = φᵀ(x)φ(y); then

φ(x) = [1, x₁², …, x_m², √2 x₁, …, √2 x_m, √2 x₁x₂, …, √2 x₁x_m, √2 x₂x₃, …, √2 x₂x_m, …, √2 x_{m−1}x_m]ᵀ
     = [1, φ₁(x), …, φ_p(x)]ᵀ

where p = 1 + m + m + (m − 1) + (m − 2) + … + 1 = (m + 2)(m + 1)/2

Hence, using a kernel, a low-dimensional pattern classification problem (with dimension m) is solved in a higher-dimensional space (dimension p + 1). But only the φⱼ(x) corresponding to support vectors are used for pattern classification!
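The identity K(x, y) = φᵀ(x)φ(y) can be verified directly. A small NumPy sketch for the m = 2 case of the expansion above:

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for m = 2 inputs
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, x1**2, x2**2, r2*x1, r2*x2, r2*x1*x2])

def K(x, y):
    return (1.0 + x @ y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K(x, y), phi(x) @ phi(y)))  # True
```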


## Numerical Example: XOR Problem

Training samples: (−1, −1; −1), (−1, 1; +1), (1, −1; +1), (1, 1; −1)

x = [x₁, x₂]ᵀ. Use K(x, xᵢ) = (1 + xᵀxᵢ)²; one has

φ(x) = [1, x₁², x₂², √2 x₁, √2 x₂, √2 x₁x₂]ᵀ

The resulting kernel (Gram) matrix has K(xᵢ, xⱼ) = 9 for i = j and K(xᵢ, xⱼ) = 1 for i ≠ j.

Note dim[φ(x)] = 6 > dim[x] = 2!


## XOR Problem (Continued)

Note that K(xᵢ, xⱼ) can be calculated directly without using φ! E.g. K(x₁, x₁) = (1 + x₁ᵀx₁)² = (1 + 2)² = 9.

The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]ᵀ.

w = Σᵢ αᵢdᵢφ(xᵢ) = (1/8)(−1)φ(x₁) + (1/8)(1)φ(x₂) + (1/8)(1)φ(x₃) + (1/8)(−1)φ(x₄)
  = [0 0 0 0 0 −1/√2]ᵀ

Hence the hyperplane is: y = wᵀφ(x) = −x₁x₂

(Figure: the decision surface y = −x₁x₂ over the (x₁, x₂) plane, with the four training points (−1, −1), (−1, +1), (+1, −1), (+1, +1).)


## Other Types of Kernels

| Type of SVM | K(x, y) | Comments |
| --- | --- | --- |
| Polynomial learning machine | (xᵀy + 1)^p | p: selected a priori |
| Radial-basis function network | exp(−‖x − y‖²/(2σ²)) | σ²: selected a priori |
| Two-layer perceptron | tanh(β₀xᵀy + β₁) | only some β₀ and β₁ values are feasible |

What kernel is feasible? It must satisfy the "Mercer's theorem"!
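For a finite sample, Mercer's condition amounts to the kernel's Gram matrix being positive semi-definite. A sketch checking this numerically for the RBF kernel on random points (the sample size and σ² are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sigma2 = 1.0

# RBF Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2.0 * sigma2))

eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-10)  # True: K is positive semi-definite
```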


## Mercer's Theorem

Let K(x, y) be a continuous, symmetric kernel defined on a ≤ x, y ≤ b. K(x, y) admits an eigen-function expansion

K(x, y) = Σᵢ λᵢφᵢ(x)φᵢ(y)

with λᵢ > 0 for each i. This expansion converges absolutely and uniformly if and only if

∫∫ K(x, y)ψ(x)ψ(y) dx dy ≥ 0

for all ψ(x) such that ∫ ψ²(x) dx < ∞.


## Testing with Kernels

For many types of kernels, φ(x) cannot be explicitly represented or even found. However, with

w = Σⱼ fⱼφ(xⱼ), where fⱼ = αⱼdⱼ,

the classifier output is

y(x) = wᵀφ(x) = Σⱼ fⱼφᵀ(xⱼ)φ(x) = Σⱼ fⱼK(xⱼ, x)

Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[−1 +1 +1 −1]ᵀ. Suppose that x = (−1, +1); then

y(x) = (1/8)[(−1)K(x₁, x) + K(x₂, x) + K(x₃, x) + (−1)K(x₄, x)] = (1/8)(−1 + 9 + 1 − 1) = +1
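The kernel-only evaluation can be run directly. A NumPy sketch of the XOR computation just described:

```python
import numpy as np

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
f = np.array([-1.0, 1.0, 1.0, -1.0]) / 8.0   # f_j = alpha_j d_j

def y(x):
    # y(x) = sum_j f_j K(x_j, x); no explicit phi is needed
    return f @ (1.0 + X @ x) ** 2

x = np.array([-1.0, 1.0])
print(y(x))                # 1.0: correctly classified as +1

# the classifier reproduces XOR (y = -x1 x2) on all four corners:
print([y(p) for p in X])   # [-1.0, 1.0, 1.0, -1.0]
```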