CISC667, F05, Lec22, Liao
CISC 667 Intro to Bioinformatics (Fall 2005)
Support Vector Machines I
Terminologies
• An object x is represented by a set of m attributes x_i, 1 ≤ i ≤ m.
• A set of n training examples S = {(x_1, y_1), …, (x_n, y_n)}, where y_i is the classification (or label) of instance x_i.
– For binary classification, y_i ∈ {−1, +1}; for k-class classification, y_i ∈ {1, 2, …, k}.
– Without loss of generality, we focus on binary classification.
• The task is to learn the mapping x_i → y_i.
• A machine is a learned function/mapping/hypothesis h: x_i → h(x_i, α), where α stands for parameters to be fixed during training.
• Performance is measured as
  E = (1/2n) Σ_{i=1 to n} |y_i − h(x_i, α)|
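For labels in {−1, +1}, E is just the fraction of misclassified examples, since each mistake contributes |y_i − h(x_i, α)| = 2 and each correct prediction contributes 0. A quick sketch with made-up labels and predictions:

```python
def error_rate(ys, hs):
    """E = (1/2n) * sum |y_i - h_i| for labels in {-1, +1}.
    Each mistake contributes |y - h| = 2, so E is the fraction misclassified."""
    n = len(ys)
    return sum(abs(y - h) for y, h in zip(ys, hs)) / (2 * n)

ys = [+1, -1, +1, +1]   # true labels (made up)
hs = [+1, -1, -1, +1]   # machine's outputs: one mistake out of four
print(error_rate(ys, hs))  # 0.25
```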
Linear SVMs: find a hyperplane (specified by normal vector w and offset b; the perpendicular distance from the hyperplane to the origin is |b|/||w||) that separates the positive and negative examples with the largest margin.

[Figure: the separating hyperplane (w, b), its margin, and the origin.]

w · x_i + b > 0 if y_i = +1
w · x_i + b < 0 if y_i = −1

An unknown x is classified as sign(w · x + b).
Rosenblatt’s Algorithm (1956)

η  // the learning rate
w_0 = 0; b_0 = 0; k = 0
R = max_{1 ≤ i ≤ n} ||x_i||
error = 1;  // flag for misclassification/mistake
while (error) {  // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if ( y_i (<w_k · x_i> + b_k) ≤ 0 ) {  // misclassification
            w_{k+1} = w_k + η y_i x_i  // update the weight
            b_{k+1} = b_k + η y_i R^2  // update the bias
            k = k + 1
            error = 1;
        }
    }
}
return (w_k, b_k)  // hyperplane that separates the data, where k is the number of modifications
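A direct transcription of the pseudocode into Python may help; the toy data and learning rate below are made-up illustrations, not part of the lecture:

```python
def perceptron(xs, ys, eta=1.0):
    """Rosenblatt's algorithm; returns (w, b, k), where k counts updates."""
    m = len(xs[0])
    w = [0.0] * m
    b = 0.0
    k = 0
    R = max(sum(c * c for c in x) ** 0.5 for x in xs)  # R = max ||x_i||
    error = True
    while error:                      # loop until a full pass makes no update
        error = False
        for x, y in zip(xs, ys):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:  # mistake
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]  # weight update
                b = b + eta * y * R * R                          # bias update
                k += 1
                error = True
    return w, b, k

# Toy linearly separable data (made up): positives have x1 + x2 > 1.5
xs = [(2.0, 2.0), (1.5, 1.0), (0.0, 0.5), (-1.0, 0.0)]
ys = [+1, +1, -1, -1]
w, b, k = perceptron(xs, ys)
# the returned hyperplane separates the training data
assert all(y * (w[0]*x[0] + w[1]*x[1] + b) > 0 for x, y in zip(xs, ys))
```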
Questions w.r.t. Rosenblatt’s algorithm
– Is the algorithm guaranteed to converge?
– How quickly does it converge?

Novikoff Theorem: Let S be a training set of size n and R = max_{1 ≤ i ≤ n} ||x_i||. If there exists a vector w* such that ||w*|| = 1 and
  y_i (w* · x_i) ≥ γ, for 1 ≤ i ≤ n,
then the number of modifications before convergence is at most (R/γ)^2.
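The bound is easy to check numerically. A minimal sketch, assuming made-up data labeled by a unit vector w* and a separator through the origin (the theorem's setting, so no bias is used):

```python
def perceptron_no_bias(xs, ys, eta=1.0):
    """Perceptron through the origin; returns (w, t) with t the update count."""
    w = [0.0] * len(xs[0])
    t = 0
    changed = True
    while changed:
        changed = False
        for x, y in zip(xs, ys):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                t += 1
                changed = True
    return w, t

w_star = (0.6, 0.8)                       # a unit vector (assumption)
xs = [(1.0, 1.0), (2.0, 0.5), (-1.0, -0.5), (-0.5, -2.0)]  # made-up points
ys = [+1 if w_star[0]*x[0] + w_star[1]*x[1] > 0 else -1 for x in xs]
R = max(sum(c * c for c in x) ** 0.5 for x in xs)
gamma = min(y * (w_star[0]*x[0] + w_star[1]*x[1]) for x, y in zip(xs, ys))
_, t = perceptron_no_bias(xs, ys)
assert t <= (R / gamma) ** 2              # Novikoff: at most (R/gamma)^2 updates
```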
Proof:
1.  w_t · w* = w_{t−1} · w* + η y_i x_i · w* ≥ w_{t−1} · w* + η γ, hence w_t · w* ≥ t η γ.
2.  ||w_t||^2 = ||w_{t−1}||^2 + 2 η y_i x_i · w_{t−1} + η^2 ||x_i||^2
             ≤ ||w_{t−1}||^2 + η^2 ||x_i||^2   (since y_i x_i · w_{t−1} ≤ 0 on a mistake)
             ≤ ||w_{t−1}||^2 + η^2 R^2,
    hence ||w_t||^2 ≤ t η^2 R^2.
3.  Since ||w*|| = 1, √t η R ≥ ||w_t|| = ||w_t|| ||w*|| ≥ w_t · w* ≥ t η γ, and therefore t ≤ (R/γ)^2.

Note:
– Without loss of generality, the separating plane is assumed to pass through the origin, i.e., no bias b is necessary.
– The learning rate η has no bearing on this upper bound.
– What if the training data is not linearly separable, i.e., w* does not exist?
Dual form
• The final hypothesis w is a linear combination of the training points:
  w = Σ_{i=1 to n} α_i y_i x_i
where the α_i are non-negative values proportional to the number of times misclassification of x_i has caused the weight to be updated.
• Vector α can be considered as an alternative representation of the hypothesis; α_i can be regarded as an indication of the information content of the example x_i.
• The decision function can be rewritten as
  h(x) = sign(w · x + b)
       = sign( (Σ_{j=1 to n} α_j y_j x_j) · x + b )
       = sign( Σ_{j=1 to n} α_j y_j (x_j · x) + b )
Rosenblatt’s Algorithm in dual form

α = 0; b = 0
R = max_{1 ≤ i ≤ n} ||x_i||
error = 1;  // flag for misclassification
while (error) {  // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if ( y_i ( Σ_{j=1 to n} α_j y_j (x_j · x_i) + b ) ≤ 0 ) {  // x_i is misclassified
            α_i = α_i + 1  // update the weight
            b = b + y_i R^2  // update the bias
            error = 1;
        }
    }
}
return (α, b)  // hyperplane that separates the data

Notes:
– The training examples enter the algorithm only as dot products (x_j · x_i).
– α_i is a measure of information content; the x_i with non-zero information content (α_i > 0) are called support vectors, as they are located on the boundaries.
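The dual form also translates directly to Python; the toy data are made up, and the learning rate is absorbed into the unit increments of α_i, as in the pseudocode:

```python
def dual_perceptron(xs, ys):
    """Dual-form perceptron: learns alpha (one count per example) and bias b.
    The training data enter only through the dot products x_j . x_i."""
    n = len(xs)
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    alpha = [0] * n
    b = 0.0
    R = max(dot(x, x) ** 0.5 for x in xs)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            g = sum(alpha[j] * ys[j] * dot(xs[j], xs[i]) for j in range(n)) + b
            if ys[i] * g <= 0:           # x_i is misclassified
                alpha[i] += 1            # update the weight
                b += ys[i] * R * R       # update the bias
                changed = True
    return alpha, b

xs = [(2.0, 2.0), (1.5, 1.0), (0.0, 0.5), (-1.0, 0.0)]  # made-up data
ys = [+1, +1, -1, -1]
alpha, b = dual_perceptron(xs, ys)
# recover the equivalent primal weight: w = sum_i alpha_i y_i x_i
w = [sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs)) for d in range(2)]
assert all(y * (w[0]*x[0] + w[1]*x[1] + b) > 0 for x, y in zip(xs, ys))
```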
[Figure: the separating hyperplane (w, b) with its margin and the origin; w · x + b > 0 on the positive side and w · x + b = 0 on the hyperplane.]
Larger margin is preferred:
• converge more quickly
• generalize better
w · x_+ + b = +1
w · x_− + b = −1

Subtracting:
2 = (x_+ · w) − (x_− · w) = (x_+ − x_−) · w = ||x_+ − x_−|| ||w||
(the last step holds because x_+ − x_− is parallel to the normal w).

Therefore, maximizing the geometric margin ||x_+ − x_−|| = 2/||w|| is equivalent to minimizing ½ ||w||^2, under linear constraints:
  y_i (w · x_i + b) ≥ 1 for i = 1, …, n.

Min_{w,b} ½ <w · w>
subject to y_i (<w · x_i> + b) ≥ 1 for i = 1, …, n
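The equivalence can be sanity-checked by computing the geometric margin directly; the helper function, toy hyperplane, and points below are illustrative assumptions:

```python
import math

def geometric_margin(w, b, xs, ys):
    """Worst-case signed distance of the examples to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(c * c for c in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(xs, ys))

# Made-up canonical hyperplane: w = (2, 0), b = 0; closest points at x1 = +/-0.5
xs = [(0.5, 0.0), (1.0, 1.0), (-0.5, 0.0), (-2.0, 1.0)]
ys = [+1, +1, -1, -1]
m = geometric_margin((2.0, 0.0), 0.0, xs, ys)
assert abs(m - 0.5) < 1e-12          # per-class margin is 1/||w||
assert abs(2 * m - 2 / 2.0) < 1e-12  # full margin ||x_+ - x_-|| = 2/||w||
```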
Optimization with constraints

Min_{w,b} ½ <w · w>
subject to y_i (<w · x_i> + b) ≥ 1 for i = 1, …, n

Lagrangian Theory: introducing a Lagrange multiplier α_i for each constraint,
  L(w, b, α) = ½ ||w||^2 − Σ_i α_i ( y_i (w · x_i + b) − 1 ),
and then calculating
  ∂L/∂w = 0,  ∂L/∂b = 0,  ∂L/∂α = 0.

This optimization problem can be solved as Quadratic Programming, and it is guaranteed to converge to the global minimum because the objective is convex (a convex quadratic under linear constraints).
Note: this is an advantage over artificial neural nets.
The optimal w* and b* can be found by solving the dual problem for α, to maximize:
  L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i · x_j
under the constraints:
  α_i ≥ 0, and Σ_i α_i y_i = 0.

Once α is solved,
  w* = Σ_i α_i y_i x_i
  b* = −½ ( max_{y=−1} w* · x_i + min_{y=+1} w* · x_i )

And an unknown x is classified as
  sign(w* · x + b*) = sign( Σ_i α_i y_i x_i · x + b* )

Notes:
1. Only dot products of vectors are needed.
2. Many α_i are equal to zero, and those that are not zero correspond to x_i on the boundaries – support vectors!
3. In practice, instead of the sign function, the actual value of w* · x + b* is used when its absolute value is less than or equal to one; such a value is called a discriminant.
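These formulas can be checked on a two-point toy problem (the data and the hand-derived α = (½, ½) are assumptions for illustration): for x_1 = (1, 0), y_1 = +1 and x_2 = (−1, 0), y_2 = −1, the constraint α_1 − α_2 = 0 forces α_1 = α_2 = a, so L(α) = 2a − 2a^2, which is maximized at a = ½.

```python
xs = [(1.0, 0.0), (-1.0, 0.0)]   # toy training set (assumption)
ys = [+1, -1]
alpha = [0.5, 0.5]               # hand-derived maximizer of L(alpha)

# constraints: alpha_i >= 0 and sum_i alpha_i y_i = 0
assert all(a >= 0 for a in alpha)
assert sum(a * y for a, y in zip(alpha, ys)) == 0

# w* = sum_i alpha_i y_i x_i
w = [sum(a * y * x[d] for a, y, x in zip(alpha, ys, xs)) for d in range(2)]
dot = lambda u, v: sum(p * q for p, q in zip(u, v))

# bias chosen midway between the classes, so that y_i (w*.x_i + b*) = 1
# on the support vectors
b = -0.5 * (max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
            + min(dot(w, x) for x, y in zip(xs, ys) if y == +1))

assert w == [1.0, 0.0] and b == 0.0
# both points sit exactly on the margin
assert all(y * (dot(w, x) + b) == 1 for x, y in zip(xs, ys))
```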
Non-linear mapping to a feature space

[Figure: a map Φ(·) sends input points x_i, x_j to their feature-space images Φ(x_i), Φ(x_j).]

L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i) · Φ(x_j)
Nonlinear SVMs

[Figure: an input X = (x_1, x_2) in input space is mapped to a feature space where the classes become linearly separable.]
Kernel function for mapping
• For input X = (x_1, x_2), define the map Φ(X) = (x_1 x_1, √2 x_1 x_2, x_2 x_2).
• Define the kernel function as K(X, Y) = (X · Y)^2.
• Then K(X, Y) = Φ(X) · Φ(Y).
• We can compute the dot product in feature space without computing Φ:
  K(X, Y) = Φ(X) · Φ(Y)
          = (x_1 x_1, √2 x_1 x_2, x_2 x_2) · (y_1 y_1, √2 y_1 y_2, y_2 y_2)
          = x_1 x_1 y_1 y_1 + 2 x_1 x_2 y_1 y_2 + x_2 x_2 y_2 y_2
          = (x_1 y_1 + x_2 y_2)(x_1 y_1 + x_2 y_2)
          = ((x_1, x_2) · (y_1, y_2))^2
          = (X · Y)^2
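The identity above can also be verified numerically; the input vectors below are arbitrary choices:

```python
import math

def phi(X):
    """Feature map for the degree-2 kernel: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = X
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def K(X, Y):
    """Kernel: squared dot product in input space."""
    return (X[0] * Y[0] + X[1] * Y[1]) ** 2

X, Y = (1.0, 2.0), (3.0, -1.0)    # made-up input vectors
lhs = K(X, Y)                     # computed entirely in input space
rhs = sum(p * q for p, q in zip(phi(X), phi(Y)))  # explicit feature space
assert abs(lhs - rhs) < 1e-9      # K(X, Y) = phi(X) . phi(Y)
```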
Kernels
Given a mapping Φ(·) from the space of input vectors to some higher-dimensional feature space, the kernel K of two vectors x_i, x_j is the inner product of their images in the feature space, namely,
  K(x_i, x_j) = Φ(x_i) · Φ(x_j).
Since we just need the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we can use the kernel in place of the mapping Φ(·).
Because the inner product of two vectors reflects their lengths and the angle between them, a kernel function actually defines the geometry of the feature space (lengths and angles), and implicitly provides a similarity measure for the objects to be classified.
Mercer’s condition
Since kernel functions play such an important role, it is important to know whether a given kernel computes dot products in some higher-dimensional space.
For a kernel K(x, y), if for every g(x) such that ∫ g(x)^2 dx is finite we have
  ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0,
then there exists a mapping Φ such that K(x, y) = Φ(x) · Φ(y).
Notes:
1. Mercer’s condition does not tell how to actually find Φ.
2. Mercer’s condition may be hard to check since it must hold for every g(x).
More kernel functions
Some commonly used generic kernel functions:
– Polynomial kernel: K(x, y) = (1 + x · y)^p
– Radial (or Gaussian) kernel: K(x, y) = exp( −||x − y||^2 / (2σ^2) )
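Both generic kernels are straightforward to evaluate; a short sketch (the vectors and parameter defaults are made up):

```python
import math

def poly_kernel(x, y, p=2):
    """Polynomial kernel K(x, y) = (1 + x.y)^p."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """Radial (Gaussian) kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma * sigma))

x, y = (1.0, 0.0), (0.0, 1.0)      # made-up vectors
assert poly_kernel(x, y) == 1.0     # (1 + 0)^2
assert abs(gaussian_kernel(x, x) - 1.0) < 1e-12  # identical points give 1
assert 0 < gaussian_kernel(x, y) < 1             # similarity decays with distance
```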
Questions: By introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?
Answer: Use the maximum-margin hyperplane (Vapnik theory).
The optimal w* and b* can be found by solving the dual problem for α, to maximize:
  L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
under the constraints:
  α_i ≥ 0, and Σ_i α_i y_i = 0.

Once α is solved,
  w* = Σ_i α_i y_i Φ(x_i)   (a feature-space vector, so it need not be computed explicitly)
  b* = −½ ( max_{y=−1} Σ_j α_j y_j K(x_j, x_i) + min_{y=+1} Σ_j α_j y_j K(x_j, x_i) )

And an unknown x is classified as
  sign( Σ_i α_i y_i K(x_i, x) + b* )
References and resources
• Cristianini & Shawe-Taylor, “An Introduction to Support Vector Machines”, Cambridge University Press, 2000.
• www.kernel-machines.org
– SVMLight
– Chris Burges, A tutorial
• J.-P. Vert, A 3-day tutorial
• W. Noble, “Support vector machine applications in computational biology”, in Kernel Methods in Computational Biology, B. Schoelkopf, K. Tsuda and J.-P. Vert, eds., MIT Press, 2004.