Copyright © 1999–2000 by Yu Hen Hu

Support Vector Machines (SVM)
Outline
- Linear pattern classifiers and optimal hyperplane
- Optimization problem formulation
- Statistical properties of optimal hyperplane
- The case of non-separable patterns
- Applications to general pattern classification
- Mercer's Theorem
Linear Hyperplane Classifier
Given: {(x_i, d_i); i = 1 to N, d_i ∈ {+1, −1}}.
A linear hyperplane classifier is a hyperplane consisting of points x such that
  H = {x | g(x) = w^T x + b = 0}
where g(x) is the discriminant function.
For x on the positive side of H: w^T x + b > 0, d = +1.
For x on the negative side of H: w^T x + b < 0, d = −1.
Distance from x to H:
  r = w^T x/||w|| − (−b/||w||) = g(x)/||w||
[Figure: hyperplane in the (x_1, x_2) plane with normal vector w, offset −b/||w|| from the origin, and distance r from a point x to H.]
Distance from a Point to a Hyperplane
The hyperplane H is characterized by
  (*) w^T x + b = 0
where w is the normal vector perpendicular to H.
(*) says any vector x on H, projected onto w, will have a length of OA = −b/||w||.
Consider a special point C corresponding to vector x*. Its projection onto vector w is
  w^T x*/||w|| = OA + BC, or equivalently, w^T x*/||w|| = r − b/||w||.
Hence r = (w^T x* + b)/||w|| = g(x*)/||w||.
[Figure: hyperplane H, origin O, foot of perpendicular A, projection point B, and point C at x*, at distance r from H.]
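As a quick numerical check of r = g(x)/||w|| (a Python sketch; the hyperplane w = (3, 4), b = −5 and the test point are made-up values, not from the slides):

```python
import numpy as np

# Assumed example hyperplane g(x) = w^T x + b, with w = (3, 4), b = -5
w = np.array([3.0, 4.0])
b = -5.0

def signed_distance(x):
    """Signed distance r = g(x)/||w|| from x to the hyperplane H."""
    return (w @ x + b) / np.linalg.norm(w)

x_star = np.array([3.0, 4.0])       # the point C at x* in the slide's notation
r = signed_distance(x_star)         # (9 + 16 - 5)/5 = 4.0
foot = x_star - r * w / np.linalg.norm(w)   # perpendicular foot lies on H
```

Moving x* back along the unit normal by r lands exactly on H, which is what the OA + BC decomposition expresses.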
Optimal Hyperplane: Linearly Separable Case
For d_i = +1, g(x_i) = w^T x_i + b ≥ ρ||w||  ⇒  w_o^T x_i + b_o ≥ 1
For d_i = −1, g(x_i) = w^T x_i + b ≤ −ρ||w||  ⇒  w_o^T x_i + b_o ≤ −1
The optimal hyperplane should be in the center of the gap.
Supporting vectors: samples on the boundaries. Supporting vectors alone can determine the optimal hyperplane.
Question: how to find the optimal hyperplane?
[Figure: two classes of samples in the (x_1, x_2) plane separated by a gap, with the optimal hyperplane centered in the gap and supporting vectors on the class boundaries.]
Separation Gap
For x_i being a supporting vector:
For d_i = +1, g(x_i) = w^T x_i + b = ρ||w||  ⇒  w_o^T x_i + b_o = 1
For d_i = −1, g(x_i) = w^T x_i + b = −ρ||w||  ⇒  w_o^T x_i + b_o = −1
Hence w_o = w/(ρ||w||) and b_o = b/(ρ||w||). But the distance from x_i to the hyperplane is ρ = g(x_i)/||w||. Thus w_o = w/g(x_i), and ρ = 1/||w_o||.
The maximum distance between the two classes is 2ρ = 2/||w_o||.
Hence the objective is to find w_o, b_o to minimize ||w_o|| (so that ρ is maximized) subject to the constraints
  w_o^T x_i + b_o ≥ 1 for d_i = +1; and w_o^T x_i + b_o ≤ −1 for d_i = −1.
Combining these constraints, one has: d_i(w_o^T x_i + b_o) ≥ 1.
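The rescaling from (w, b) to (w_o, b_o) can be sketched numerically (Python; the two support vectors and the un-normalized (w, b) below are invented for illustration):

```python
import numpy as np

# Invented support vectors, one per class, and an un-normalized (w, b)
x_pos = np.array([2.0, 0.0])    # d = +1 supporting vector
x_neg = np.array([0.0, 0.0])    # d = -1 supporting vector
w = np.array([2.0, 0.0])
b = -2.0

# Canonical scaling: divide by g(x_i) at a d = +1 supporting vector,
# so that w_o^T x_i + b_o = +/-1 on the supporting vectors
g_pos = w @ x_pos + b
wo, bo = w / g_pos, b / g_pos

margin = 2.0 / np.linalg.norm(wo)   # maximum separation 2*rho = 2/||w_o||
```

Here the two supporting vectors are distance 2 apart along the normal, and the computed margin 2/||w_o|| reproduces exactly that.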
Quadratic Optimization Problem Formulation
Given {(x_i, d_i); i = 1 to N}, find w and b such that
  Φ(w) = w^T w/2
is minimized subject to N constraints
  d_i(w^T x_i + b) − 1 ≥ 0; 1 ≤ i ≤ N.
Method of Lagrange Multiplier:
  J(w, b, α) = Φ(w) − Σ_{i=1}^{N} α_i [d_i(w^T x_i + b) − 1]
Set
  ∂J(w, b, α)/∂w = 0  ⇒  w = Σ_{i=1}^{N} α_i d_i x_i
  ∂J(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^{N} α_i d_i = 0
Optimization (continued)
The solution of the Lagrange multiplier problem is at a saddle point, where the minimum is sought w.r.t. w and b, while the maximum is sought w.r.t. α_i.
Kuhn-Tucker condition: at the saddle point,
  α_i [d_i(w^T x_i + b) − 1] = 0 for 1 ≤ i ≤ N.
If x_i is NOT a support vector, the corresponding α_i = 0!
Hence, only support vectors will affect the result of the optimization!
A Numerical Example
Training samples: (1, −1), (2, +1), (3, +1).
3 inequalities: 1·w + b ≤ −1; 2·w + b ≥ +1; 3·w + b ≥ +1
  J = w^2/2 − α_1(−w − b − 1) − α_2(2w + b − 1) − α_3(3w + b − 1)
  ∂J/∂w = 0  ⇒  w = −α_1 + 2α_2 + 3α_3
  ∂J/∂b = 0  ⇒  0 = α_1 − α_2 − α_3
Solve: (a) −w − b − 1 = 0; (b) 2w + b − 1 = 0; (c) 3w + b − 1 = 0
(b) and (c) conflict with each other. Solving (a) and (b) yields w = 2, b = −3. From the Kuhn-Tucker condition, α_3 = 0. Thus, α_1 = α_2 = 2. Hence the solution of the decision boundary is 2x − 3 = 0, or x = 1.5! This is shown as the dashed line in the figure.
[Figure: the three samples (1, −1), (2, +1), (3, +1) on the x-axis, with the decision boundary x = 1.5 shown as a dashed line.]
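The stated solution can be checked against the stationarity and Kuhn-Tucker conditions directly (a Python verification of the slide's answer, not a solver):

```python
# Training set from the slide: x = 1 (d = -1); x = 2, 3 (d = +1)
x = [1.0, 2.0, 3.0]
d = [-1.0, 1.0, 1.0]
w, b = 2.0, -3.0                 # claimed optimal hyperplane: 2x - 3 = 0
alpha = [2.0, 2.0, 0.0]          # claimed Lagrange multipliers

# Stationarity: w = sum_i alpha_i d_i x_i and sum_i alpha_i d_i = 0
assert w == sum(a * di * xi for a, di, xi in zip(alpha, d, x))
assert sum(a * di for a, di in zip(alpha, d)) == 0.0

# Feasibility and the Kuhn-Tucker condition for each sample
for a, di, xi in zip(alpha, d, x):
    gi = di * (w * xi + b)
    assert gi >= 1.0              # d_i (w x_i + b) >= 1
    assert a * (gi - 1.0) == 0.0  # alpha_i [d_i (w x_i + b) - 1] = 0
```

Note that the non-support sample x = 3 satisfies its constraint strictly (value 3 > 1), consistent with α_3 = 0.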
Primal/Dual Problem Formulation
Given a constrained optimization problem with a convex cost function and linear constraints, a dual problem with the Lagrange multipliers providing the solution can be formulated.
Duality Theorem (Bertsekas 1995)
(a) If the primal problem has an optimal solution, then the dual problem has an optimal solution with the same optimal value.
(b) In order for w_o to be an optimal primal solution and α_o to be an optimal dual solution, it is necessary and sufficient that w_o is feasible for the primal problem and
  Φ(w_o) = J(w_o, b_o, α_o) = Min_w J(w, b_o, α_o)
Formulating the Dual Problem
With w = Σ_{i=1}^{N} α_i d_i x_i and Σ_{i=1}^{N} α_i d_i = 0, these lead to a Dual Problem:
Maximize
  Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
Subject to:
  Σ_{i=1}^{N} α_i d_i = 0 and α_i ≥ 0 for i = 1, 2, …, N.
Note that Q(α) involves the training vectors only through the inner products x_i^T x_j.
Numerical Example (cont’d)
or Q(α) = α_1 + α_2 + α_3 − [0.5α_1^2 + 2α_2^2 + 4.5α_3^2 − 2α_1α_2 − 3α_1α_3 + 6α_2α_3]
subject to constraints:
  −α_1 + α_2 + α_3 = 0, and α_1 ≥ 0, α_2 ≥ 0, and α_3 ≥ 0.
Use the Matlab Optimization toolbox command:
  x = fmincon('qalpha', X0, A, B, Aeq, Beq)
The solution is [α_1 α_2 α_3] = [2 2 0] as expected.
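The same dual can be solved in Python (a sketch with scipy standing in for fmincon; the negated objective plays the role of the slide's 'qalpha' function):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0])
d = np.array([-1.0, 1.0, 1.0])
H = np.outer(d * x, d * x)            # H_ij = d_i d_j x_i x_j

def neg_Q(a):
    # maximize Q(a) = sum(a) - 0.5 a^T H a  <=>  minimize -Q(a)
    return -(a.sum() - 0.5 * a @ H @ a)

res = minimize(neg_Q, np.ones(3), method="SLSQP",
               bounds=[(0.0, None)] * 3,
               constraints=[{"type": "eq", "fun": lambda a: a @ d}])
alpha = res.x                          # approximately [2, 2, 0]
```

The equality constraint a @ d enforces Σ α_i d_i = 0, and the bounds enforce α_i ≥ 0, mirroring the Aeq/Beq and lower-bound arguments of fmincon.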
Implication of Minimizing ||w||
Let D denote the diameter of the smallest hyper-ball that encloses all the input training vectors {x_1, x_2, …, x_N}. The set of optimal hyperplanes described by the equation
  w_o^T x + b_o = 0
has a VC-dimension h bounded from above as
  h ≤ min{D^2/ρ^2, m_0} + 1
where m_0 is the dimension of the input vectors, and ρ = 2/||w_o|| is the margin of the separation of the hyperplanes.
VC-dimension determines the complexity of the classifier structure, and usually the smaller the better.
Non-separable Cases
Recall that in the linearly separable case, each training sample pair (x_i, d_i) represents a linear inequality constraint
  d_i(w^T x_i + b) ≥ 1, i = 1, 2, …, N
If the training samples are not linearly separable, the constraint can be modified to yield a soft constraint:
  d_i(w^T x_i + b) ≥ 1 − ξ_i, i = 1, 2, …, N
{ξ_i; 1 ≤ i ≤ N} are known as slack variables. If ξ_i > 1, then the corresponding (x_i, d_i) will be misclassified.
The minimum error classifier would minimize the number of samples with ξ_i > 1, but it is non-convex w.r.t. w. Hence an approximation is to minimize
  Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^{N} ξ_i
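The slack variables can be made concrete with a short sketch (Python; the 1-D samples and the fixed hyperplane w = 2, b = −3 are invented, extending the earlier numerical example with one inseparable point):

```python
# 1-D toy data: the last sample (x = 1, d = +1) sits deep on the wrong
# side of the assumed hyperplane w = 2, b = -3
x = [1.0, 2.0, 3.0, 1.0]
d = [-1.0, 1.0, 1.0, 1.0]
w, b = 2.0, -3.0

# The soft constraint d_i (w x_i + b) >= 1 - xi_i always holds with
# xi_i = max(0, 1 - d_i (w x_i + b))
slack = [max(0.0, 1.0 - di * (w * xi + b)) for xi, di in zip(x, d)]
# slack == [0.0, 0.0, 0.0, 2.0]: the last sample has xi > 1, i.e. it is
# misclassified by this hyperplane
```

The three separable samples need no slack, while the inseparable one needs ξ = 2 > 1, flagging a misclassification.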
Primal and Dual Problem Formulation
Primal Optimization Problem
Given {(x_i, d_i); 1 ≤ i ≤ N}, find w, b such that
  Φ(w, ξ) = w^T w/2 + C Σ_{i=1}^{N} ξ_i
is minimized subject to the constraints (i) ξ_i ≥ 0, and (ii) d_i(w^T x_i + b) ≥ 1 − ξ_i for i = 1, 2, …, N.
Dual Optimization Problem
Given {(x_i, d_i); 1 ≤ i ≤ N}, find Lagrange multipliers {α_i; 1 ≤ i ≤ N} such that
  Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j
is maximized subject to the constraints (i) 0 ≤ α_i ≤ C (a user-specified positive number) and (ii) Σ_{i=1}^{N} α_i d_i = 0.
Solution to the Dual Problem
Optimal solution to the dual problem is:
  w_o = Σ_{i=1}^{N_s} α_i d_i x_i     N_s: # of support vectors.
Kuhn-Tucker condition implies for i = 1, 2, …, N,
  (i) α_i [d_i(w^T x_i + b) − 1 + ξ_i] = 0    (*)
  (ii) μ_i ξ_i = 0
{μ_i; 1 ≤ i ≤ N} are Lagrange multipliers to enforce the condition ξ_i ≥ 0. At the optimal point of the primal problem, ∂Φ/∂ξ_i = 0. One may deduce that ξ_i = 0 if α_i < C. Solving (*), we have
  b_o = d_i − w_o^T x_i for any support vector x_i with 0 < α_i < C.
Matlab Implementation

% svm1.m: basic support vector machine
% X: N by m matrix. i-th row is x_i
% d: N by 1 vector. i-th element is d_i
% X, d should be loaded from file or read from input.
% call MATLAB optimization toolbox function fmincon
a0=eps*ones(N,1);
C = 1;
a=fmincon('qfun',a0,[],[],d',0,zeros(N,1),C*ones(N,1),...
          [],[],X,d)
wo=X'*(a.*d)
bo=sum(diag(a)*(X*wo-d))/sum([a > 10*eps])

function y=qfun(a,X,d);
% the Q(a) function. Note that it is actually -Q(a),
% because we call fmincon: to minimize -Q(a) is
% the same as to maximize Q(a)
[N,m]=size(X);
y=-ones(1,N)*a+0.5*a'*diag(d)*X*X'*diag(d)*a;
Inner Product Kernels
In general, if the input is first transformed via a set of nonlinear functions {φ_i(x)} and then subjected to the hyperplane classifier
  g(x) = Σ_j w_j φ_j(x) + b = w^T φ(x) + b
Define the inner product kernel as
  K(x, y) = φ^T(x) φ(y)
one may obtain a dual optimization problem formulation as:
  Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j K(x_i, x_j)
Often, dim of φ (= p+1) >> dim of x (= m).
General Pattern Recognition with SVM
By careful selection of the nonlinear transformation {φ_j(x); 1 ≤ j ≤ p}, any pattern recognition problem can be solved.
[Figure: a network diagram: inputs x_1, …, x_m feed nonlinear units φ_1(x), φ_2(x), …, φ_p(x), whose outputs are weighted by w_1, w_2, …, w_p and summed with bias b to produce the output d_i for input x_i.]
Polynomial Kernel
Consider a polynomial kernel
  K(x, y) = (1 + x^T y)^2
Let K(x, y) = φ^T(x)φ(y); then
  φ(x) = [1, x_1^2, …, x_m^2, √2 x_1, …, √2 x_m, √2 x_1x_2, …, √2 x_1x_m, √2 x_2x_3, …, √2 x_2x_m, …, √2 x_{m−1}x_m]
       = [1, φ_1(x), …, φ_p(x)]
where p = 1 + m + m + (m−1) + (m−2) + … + 1 = (m+2)(m+1)/2
Hence, using a kernel, a low-dimensional pattern classification problem (with dimension m) is solved in a higher-dimensional space (dimension p+1). But only those φ_j(x) corresponding to support vectors are used for pattern classification!
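The identity K(x, y) = φ^T(x)φ(y) and the dimension count can be verified for m = 2 (Python sketch; the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Feature map of the degree-2 polynomial kernel for m = 2 inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, x2**2, s*x1, s*x2, s*x1*x2])

def K(x, y):
    return (1.0 + x @ y) ** 2

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])
assert np.isclose(K(x, y), phi(x) @ phi(y))   # kernel = inner product in feature space

m = 2
assert len(phi(x)) == (m + 2) * (m + 1) // 2  # feature space is 6-dimensional
```

Expanding (1 + x^T y)^2 = 1 + 2 x^T y + (x^T y)^2 and matching terms reproduces exactly the constant, √2-linear, square, and √2-cross-product components of φ.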
Numerical Example: XOR Problem
Training samples: (−1 −1; −1), (−1 1; +1), (1 −1; +1), (1 1; −1)
x = [x_1, x_2]^T. Using K(x, x_i) = (1 + x^T x_i)^2, one has
  φ(x) = [1, x_1^2, x_2^2, √2 x_1, √2 x_2, √2 x_1x_2]^T
Note dim[φ(x)] = 6 > dim[x] = 2!
XOR Problem (Continued)
Note that K(x_i, x_j) can be calculated directly without using φ!
E.g. K(x_1, x_2) = (1 + x_1^T x_2)^2 = (1 + 0)^2 = 1.
The corresponding Lagrange multipliers are α = (1/8)[1 1 1 1]^T.
w = Σ_i α_i d_i φ(x_i)
  = (1/8)(−1)φ(x_1) + (1/8)(1)φ(x_2) + (1/8)(1)φ(x_3) + (1/8)(−1)φ(x_4)
  = [0 0 0 0 0 −1/√2]^T
Hence the hyperplane is: y = w^T φ(x) = −x_1 x_2
[Figure: the four samples in the (x_1, x_2) plane: (−1, −1) and (+1, +1) labeled −1; (−1, +1) and (+1, −1) labeled +1; decision function y = −x_1 x_2.]
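The expansion of w above can be reproduced with a short sketch (Python, using the φ(x) defined on the previous slide):

```python
import numpy as np

s = np.sqrt(2.0)
def phi(x):
    """Degree-2 polynomial feature map for m = 2."""
    x1, x2 = x
    return np.array([1.0, x1**2, x2**2, s*x1, s*x2, s*x1*x2])

# The four XOR samples and their labels, in the slide's order
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
d = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = np.full(4, 1.0 / 8.0)

# w = sum_i alpha_i d_i phi(x_i) = [0 0 0 0 0 -1/sqrt(2)]^T
w = sum(a * di * phi(xi) for a, di, xi in zip(alpha, d, X))

# y(x) = w^T phi(x) = -x1*x2 reproduces all four XOR labels
y = np.array([w @ phi(xi) for xi in X])
```

All components of w except the √2·x_1x_2 coordinate cancel, which is why the decision function collapses to −x_1x_2.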
Other Types of Kernels

Type of SVM                  K(x, y)                       Comments
Polynomial learning machine  (x^T y + 1)^p                 p: selected a priori
Radial-basis function        exp(−||x − y||^2 / (2σ^2))    σ^2: selected a priori
Two-layer perceptron         tanh(β_o x^T y + β_1)         only some β_o and β_1 values are feasible.
What kernel is feasible? It must satisfy the "Mercer's theorem"!
Mercer's Theorem
Let K(x, y) be a continuous, symmetric kernel, defined on a ≤ x, y ≤ b. K(x, y) admits an eigenfunction expansion
  K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y)
with λ_i > 0 for each i. This expansion converges absolutely and uniformly if and only if
  ∫∫ K(x, y) ψ(x) ψ(y) dx dy ≥ 0
for all ψ(x) such that ∫ ψ^2(x) dx < ∞.
Testing with Kernels
For many types of kernels, φ(x) cannot be explicitly represented or even found. However, with f_j = α_j d_j,
  w = Σ_{j=1}^{N} α_j d_j φ(x_j)
  y(x) = w^T φ(x) = Σ_{j=1}^{N} f_j φ^T(x_j) φ(x) = Σ_{j=1}^{N} f_j K(x_j, x)
Hence there is no need to know φ(x) explicitly! For example, in the XOR problem, f = (1/8)[−1 +1 +1 −1]^T. Suppose that x = (−1, +1); then
  y(x) = (1/8)[−(1 + x_1^T x)^2 + (1 + x_2^T x)^2 + (1 + x_3^T x)^2 − (1 + x_4^T x)^2]
       = (1/8)(−1 + 9 + 1 − 1) = 1
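The kernel-only evaluation can be checked numerically (Python sketch; no φ(x) is constructed anywhere):

```python
import numpy as np

# XOR samples in the slide's order and f_j = alpha_j d_j
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
f = np.array([-1.0, 1.0, 1.0, -1.0]) / 8.0

def y(x):
    """y(x) = sum_j f_j K(x_j, x), with K(x_j, x) = (1 + x_j^T x)^2."""
    return sum(fj * (1.0 + xj @ x) ** 2 for fj, xj in zip(f, X))

# At x = (-1, +1) the four kernel values are (1, 9, 1, 1),
# so y = (-1 + 9 + 1 - 1)/8 = 1, matching the label d = +1
print(y(np.array([-1.0, 1.0])))
```

Evaluating the other three training points the same way returns their labels too, with the kernel trick replacing any explicit use of the 6-dimensional feature map.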