# Bioinformatics (1)



CISC 667 Intro to Bioinformatics (Fall 2005)

Support Vector Machines I


## Terminologies

An object $\mathbf{x}$ is represented by a set of $m$ attributes $x_i$, $1 \le i \le m$.

A set of $n$ training examples $S = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, where $y_i$ is the classification (or label) of instance $\mathbf{x}_i$.

For binary classification, $y_i \in \{-1, +1\}$; for $k$-class classification, $y_i \in \{1, 2, \dots, k\}$. Without loss of generality, we focus on binary classification.

The task is to learn the mapping $\mathbf{x}_i \mapsto y_i$.

A machine is a learned function/mapping/hypothesis $h$: $\mathbf{x}_i \mapsto h(\mathbf{x}_i, \alpha)$, where $\alpha$ stands for the parameters to be fixed during training.

Performance is measured as

$$E = \frac{1}{2n} \sum_{i=1}^{n} \left| y_i - h(\mathbf{x}_i, \alpha) \right|$$
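As a quick illustration (not part of the original slides), the error measure can be computed directly once some hypothesis h is available; the data and the placeholder hypothesis below are assumptions.

```python
# Minimal sketch of E = (1/2n) * sum_i |y_i - h(x_i)|, assuming labels in {-1, +1}.
import numpy as np

def empirical_error(h, X, y):
    predictions = np.array([h(x) for x in X])
    return np.abs(y - predictions).sum() / (2 * len(y))

h = lambda x: np.sign(x[0])                    # placeholder hypothesis: sign of the first attribute
X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -3.0]])
y = np.array([+1, +1, -1])
print(empirical_error(h, X, y))                # 0.666...: two of the three examples are misclassified
```

Note that with labels in {-1, +1}, E is exactly the fraction of misclassified examples.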


## Linear SVMs

Find a hyperplane (specified by its normal vector $\mathbf{w}$ and bias $b$, so that its perpendicular distance to the origin is $|b|/\|\mathbf{w}\|$) that separates the positive and negative examples with the largest margin.

[Figure: a separating hyperplane (w, b), its margin, and its distance from the origin]
x
i

+ b
>
0 if y
i

= +1

w

x
i

+ b
<
0 if y
i

=

1

+

An unknown x is classified as
sign(w

x

+ b)
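As a small sketch (not from the slides), the decision rule with assumed values for w and b:

```python
# Linear decision rule sign(w.x + b); w, b, and x are assumed values for illustration.
import numpy as np

w = np.array([1.0, -2.0])          # assumed normal vector
b = 0.5                            # assumed bias
x = np.array([3.0, 1.0])           # unknown point to classify

print(np.sign(np.dot(w, x) + b))   # +1.0: x falls on the positive side of the hyperplane
```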


## Rosenblatt's Algorithm (1956)

    given: learning rate η
    w_0 = 0; b_0 = 0; k = 0
    R = max_{1 ≤ i ≤ n} ||x_i||
    error = 1;                                // flag for misclassification/mistake
    while (error) {                           // as long as a modification is made in the for-loop
        error = 0;
        for (i = 1 to n) {
            if (y_i (<w_k · x_i> + b_k) ≤ 0) {    // misclassification
                w_{k+1} = w_k + η y_i x_i         // update the weight
                b_{k+1} = b_k + η y_i R^2         // update the bias
                k = k + 1
                error = 1;
            }
        }
    }
    return (w_k, b_k)   // a hyperplane that separates the data, where k is the number of modifications
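A runnable Python sketch of the algorithm above; the learning rate value and the toy data are assumptions for illustration:

```python
# Rosenblatt's perceptron as given above; eta and the toy data are assumptions.
import numpy as np

def perceptron(X, y, eta=1.0):
    w, b, k = np.zeros(X.shape[1]), 0.0, 0
    R = max(np.linalg.norm(x) for x in X)
    error = True
    while error:                                # loop until a full pass makes no modification
        error = False
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassification
                w = w + eta * yi * xi           # update the weight
                b = b + eta * yi * R**2         # update the bias
                k += 1
                error = True
    return w, b, k                              # k = number of modifications

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
print(perceptron(X, y))
```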


## Questions w.r.t. Rosenblatt's algorithm

- Is the algorithm guaranteed to converge?
- How quickly does it converge?

**Novikoff Theorem:** Let $S$ be a training set of size $n$ and $R = \max_{1 \le i \le n} \|\mathbf{x}_i\|$. If there exists a vector $\mathbf{w}^*$ such that $\|\mathbf{w}^*\| = 1$ and $y_i (\mathbf{w}^* \cdot \mathbf{x}_i) \ge \gamma$ for $1 \le i \le n$, then the number of modifications before convergence is at most $(R/\gamma)^2$.


Proof:

1. $\mathbf{w}_t \cdot \mathbf{w}^* = \mathbf{w}_{t-1} \cdot \mathbf{w}^* + \eta\, y_i\, \mathbf{x}_i \cdot \mathbf{w}^* \ \ge\ \mathbf{w}_{t-1} \cdot \mathbf{w}^* + \eta\gamma$, so by induction $\mathbf{w}_t \cdot \mathbf{w}^* \ge t\,\eta\gamma$.

2. $\|\mathbf{w}_t\|^2 = \|\mathbf{w}_{t-1}\|^2 + 2\eta\, y_i\, \mathbf{x}_i \cdot \mathbf{w}_{t-1} + \eta^2 \|\mathbf{x}_i\|^2 \ \le\ \|\mathbf{w}_{t-1}\|^2 + \eta^2 \|\mathbf{x}_i\|^2 \ \le\ \|\mathbf{w}_{t-1}\|^2 + \eta^2 R^2$ (the middle term is non-positive because the update is made only on a misclassified example), so $\|\mathbf{w}_t\|^2 \le t\,\eta^2 R^2$.

3. Combining the two, $\sqrt{t}\,\eta R\,\|\mathbf{w}^*\| \ \ge\ \|\mathbf{w}_t\|\,\|\mathbf{w}^*\| \ \ge\ \mathbf{w}_t \cdot \mathbf{w}^* \ \ge\ t\,\eta\gamma$, and since $\|\mathbf{w}^*\| = 1$ this gives $t \le (R/\gamma)^2$.

Notes:

- Without loss of generality, the separating hyperplane is assumed to pass through the origin, i.e., no bias $b$ is necessary.
- The learning rate $\eta$ seems to have no bearing on this upper bound.
- What if the training data are not linearly separable, i.e., $\mathbf{w}^*$ does not exist?


## Dual form

The final hypothesis $\mathbf{w}$ is a linear combination of the training points:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i\, y_i\, \mathbf{x}_i$$

where the $\alpha_i$ are positive values proportional to the number of times misclassification of $\mathbf{x}_i$ has caused the weight to be updated.

The vector $\boldsymbol{\alpha}$ is thus an alternative representation of the hypothesis; $\alpha_i$ can be regarded as an indication of the information content of the example $\mathbf{x}_i$.

The decision function can be rewritten as

$$h(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \mathrm{sign}\Big(\big(\textstyle\sum_{j=1}^{n} \alpha_j y_j \mathbf{x}_j\big) \cdot \mathbf{x} + b\Big) = \mathrm{sign}\Big(\textstyle\sum_{j=1}^{n} \alpha_j y_j\, (\mathbf{x}_j \cdot \mathbf{x}) + b\Big)$$
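A short sketch (not from the slides) of this dual-form decision function; the alpha, b, and data values are assumptions:

```python
# Dual decision function h(x) = sign(sum_j alpha_j y_j (x_j . x) + b); values are assumed.
import numpy as np

def dual_decision(x, X_train, y_train, alpha, b):
    s = sum(a * yj * np.dot(xj, x) for a, yj, xj in zip(alpha, y_train, X_train))
    return np.sign(s + b)

X_train = np.array([[2.0, 1.0], [-1.0, -1.5]])
y_train = np.array([+1, -1])
alpha   = np.array([1.0, 1.0])       # assumed update counts
b       = 0.0
print(dual_decision(np.array([1.0, 1.0]), X_train, y_train, alpha, b))   # +1.0
```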


## Rosenblatt's Algorithm in dual form

    α = 0; b = 0
    R = max_{1 ≤ i ≤ n} ||x_i||
    error = 1;                                // flag for misclassification
    while (error) {                           // as long as a modification is made in the for-loop
        error = 0;
        for (i = 1 to n) {
            if (y_i (Σ_{j=1 to n} α_j y_j (x_j · x_i) + b) ≤ 0) {   // x_i is misclassified
                α_i = α_i + 1                                       // update the weight
                b   = b + y_i R^2                                   // update the bias
                error = 1;
            }
        }
    }
    return (α, b)   // a hyperplane that separates the data

The training examples enter the algorithm only as dot products $(\mathbf{x}_j \cdot \mathbf{x}_i)$.

$\alpha_i$ is a measure of information content; the $\mathbf{x}_i$ with non-zero information content ($\alpha_i > 0$) are called support vectors, as they are located on the boundaries.
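A runnable Python sketch of the dual-form algorithm above; the toy data are an assumption:

```python
# Dual-form perceptron as given above; the toy data are assumed for illustration.
import numpy as np

def dual_perceptron(X, y):
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    R = max(np.linalg.norm(x) for x in X)
    G = X @ X.T                                                # Gram matrix of dot products (x_j . x_i)
    error = True
    while error:
        error = False
        for i in range(n):
            if y[i] * (np.dot(alpha * y, G[:, i]) + b) <= 0:   # x_i is misclassified
                alpha[i] += 1                                  # update the weight for x_i
                b += y[i] * R**2                               # update the bias
                error = True
    return alpha, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])
alpha, b = dual_perceptron(X, y)
print(alpha, b)     # non-zero alpha_i mark the informative examples
```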


[Figure: a separating hyperplane (w, b) with its margin; points on the positive side satisfy w·x + b > 0, and the hyperplane itself satisfies w·x + b = 0]


## Larger margin is preferred

- the algorithm converges more quickly
- the resulting classifier generalizes better


Let $\mathbf{x}^+$ and $\mathbf{x}^-$ be the closest positive and negative examples, chosen so that $\mathbf{x}^+ - \mathbf{x}^-$ is parallel to $\mathbf{w}$ (perpendicular to the hyperplane). They lie on the two margin hyperplanes:

$$\mathbf{w} \cdot \mathbf{x}^+ + b = +1$$
$$\mathbf{w} \cdot \mathbf{x}^- + b = -1$$

Subtracting,

$$2 = (\mathbf{x}^+ \cdot \mathbf{w}) - (\mathbf{x}^- \cdot \mathbf{w}) = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{w} = \|\mathbf{x}^+ - \mathbf{x}^-\|\, \|\mathbf{w}\|$$

Therefore, maximizing the geometric margin $\|\mathbf{x}^+ - \mathbf{x}^-\| = 2/\|\mathbf{w}\|$ is equivalent to minimizing $\tfrac{1}{2}\|\mathbf{w}\|^2$ under the linear constraints $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for $i = 1, \dots, n$:

$$\min_{\mathbf{w}, b}\ \tfrac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle \quad \text{subject to} \quad y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \ge 1 \ \text{ for } i = 1, \dots, n$$


## Optimization with constraints

$$\min_{\mathbf{w}, b}\ \tfrac{1}{2}\langle \mathbf{w} \cdot \mathbf{w} \rangle \quad \text{subject to} \quad y_i(\langle \mathbf{w} \cdot \mathbf{x}_i \rangle + b) \ge 1 \ \text{ for } i = 1, \dots, n$$

Lagrangian theory: introducing a Lagrange multiplier $\alpha_i$ for each constraint,

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \big( y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \big)$$

and then setting the partial derivatives to zero,

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \ \Rightarrow\ \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_i \alpha_i y_i = 0.$$

This optimization problem can be solved as a Quadratic Program, and it is guaranteed to converge to the global minimum because the objective is convex.

Note: this is an advantage over artificial neural networks, whose training may get stuck in local minima.


The optimal $\mathbf{w}^*$ and $b^*$ can be found by solving the dual problem for $\boldsymbol{\alpha}$, maximizing

$$L(\boldsymbol{\alpha}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (\mathbf{x}_i \cdot \mathbf{x}_j)$$

under the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Once $\boldsymbol{\alpha}$ is solved,

$$\mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i$$
$$b^* = -\tfrac{1}{2}\Big( \max_{y_i = -1} \mathbf{w}^* \cdot \mathbf{x}_i \ +\ \min_{y_i = +1} \mathbf{w}^* \cdot \mathbf{x}_i \Big)$$

and an unknown $\mathbf{x}$ is classified as

$$\mathrm{sign}(\mathbf{w}^* \cdot \mathbf{x} + b^*) = \mathrm{sign}\Big( \sum_i \alpha_i y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b^* \Big)$$

Notes:

1. Only the dot product of vectors is needed.
2. Many $\alpha_i$ are equal to zero; those that are not zero correspond to the $\mathbf{x}_i$ on the boundaries, i.e., the support vectors!
3. In practice, instead of the sign function, the actual value of $\mathbf{w}^* \cdot \mathbf{x} + b^*$ is used when its absolute value is less than or equal to one. Such a value is called a discriminant.
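As a concrete illustration (not part of the lecture), the dual solution and its support vectors can be inspected with scikit-learn's SVC; the toy data, the large C (to approximate the hard-margin problem above), and the attribute names are scikit-learn conventions, not the lecture's notation:

```python
# Minimal scikit-learn sketch: fit a linear SVM and recover w*, b*, and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],         # positive examples
              [-1.0, -1.0], [-2.0, -0.5], [-0.5, -2.0]])  # negative examples
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # a very large C approximates the hard-margin SVM
clf.fit(X, y)

# w* = sum_i alpha_i y_i x_i ; dual_coef_ stores alpha_i * y_i for the support vectors only
w_star = clf.dual_coef_ @ clf.support_vectors_
print("support vectors:", clf.support_vectors_)
print("w* =", w_star, " b* =", clf.intercept_)
print("discriminants:", clf.decision_function(X))   # values of w*.x_i + b*
```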


## Non-linear mapping to a feature space

[Figure: a mapping Φ(·) sends the input points x_i, x_j to their images Φ(x_i), Φ(x_j) in the feature space]

In the feature space, the dual objective becomes

$$L(\boldsymbol{\alpha}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$$


## Nonlinear SVMs

[Figure: an input space X = (x_1, x_2) and its image in the feature space]


## Kernel function for mapping

For input $X = (x_1, x_2)$, define the map $\Phi(X) = (x_1 x_1,\ \sqrt{2}\, x_1 x_2,\ x_2 x_2)$.

Define the kernel function as $K(X, Y) = (X \cdot Y)^2$.

Then $K(X, Y) = \Phi(X) \cdot \Phi(Y)$: we can compute the dot product in the feature space without computing $\Phi$.

$$
\begin{aligned}
K(X, Y) &= \Phi(X) \cdot \Phi(Y) \\
        &= (x_1 x_1,\ \sqrt{2}\, x_1 x_2,\ x_2 x_2) \cdot (y_1 y_1,\ \sqrt{2}\, y_1 y_2,\ y_2 y_2) \\
        &= x_1 x_1 y_1 y_1 + 2 x_1 x_2 y_1 y_2 + x_2 x_2 y_2 y_2 \\
        &= (x_1 y_1 + x_2 y_2)(x_1 y_1 + x_2 y_2) \\
        &= \big( (x_1, x_2) \cdot (y_1, y_2) \big)^2 \\
        &= (X \cdot Y)^2
\end{aligned}
$$
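A quick numeric check (not from the slides) that the explicit map and the kernel agree:

```python
# Verifies phi(X).phi(Y) == (X.Y)**2 for phi(X) = (x1*x1, sqrt(2)*x1*x2, x2*x2).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def k(X, Y):
    return float(np.dot(X, Y)) ** 2              # K(X, Y) = (X . Y)^2

X, Y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(X), phi(Y)))    # 1.0
print(k(X, Y))                   # 1.0 -- identical, as the algebra above shows
```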


## Kernels

Given a mapping $\Phi(\cdot)$ from the space of input vectors to some higher-dimensional feature space, the kernel $K$ of two vectors $\mathbf{x}_i$, $\mathbf{x}_j$ is the inner product of their images in the feature space, namely

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j).$$

Since we just need the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we can use the kernel in place of the mapping $\Phi(\cdot)$.

Because the inner product of two vectors is a measure of the distance between them, a kernel function actually defines the geometry of the feature space (lengths and angles), and implicitly provides a similarity measure for the objects to be classified.


## Mercer's condition

Since kernel functions play such an important role, it is important to know whether a given kernel computes dot products in some higher-dimensional space.

For a kernel $K(\mathbf{x}, \mathbf{y})$: if for every $g(\mathbf{x})$ such that $\int g(\mathbf{x})^2\, d\mathbf{x}$ is finite we have

$$\int\!\!\int K(\mathbf{x}, \mathbf{y})\, g(\mathbf{x})\, g(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \ \ge\ 0,$$

then there exists a mapping $\Phi$ such that

$$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}).$$

Notes:

1. Mercer's condition does not tell us how to actually find $\Phi$.
2. Mercer's condition may be hard to check, since it must hold for every $g(\mathbf{x})$.


## More kernel functions

Some commonly used generic kernel functions (sketched in code below):

- Polynomial kernel: $K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^p$
- Radial (or Gaussian) kernel: $K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2)$
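A minimal sketch (not from the slides) of these two kernels; the parameter names p and sigma follow the slide's notation, and the test vectors are assumptions:

```python
# The two generic kernels above, written as plain functions.
import numpy as np

def polynomial_kernel(x, y, p=2):
    """K(x, y) = (1 + x.y)^p"""
    return (1.0 + np.dot(x, y)) ** p

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))"""
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, y, p=3), gaussian_kernel(x, y, sigma=0.5))
```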

Questions: By introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?

Answer: Use the maximum-margin hyperplane (Vapnik's theory).


With a kernel, the optimal $\mathbf{w}^*$ and $b^*$ are again found by solving the dual problem for $\boldsymbol{\alpha}$, maximizing

$$L(\boldsymbol{\alpha}) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$$

under the constraints $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

Once $\boldsymbol{\alpha}$ is solved, $\mathbf{w}^* = \sum_i \alpha_i y_i\, \Phi(\mathbf{x}_i)$ lives in the feature space and need not be computed explicitly; the bias is obtained through the kernel,

$$b^* = -\tfrac{1}{2}\Big( \max_{y_i = -1} \sum_j \alpha_j y_j\, K(\mathbf{x}_j, \mathbf{x}_i) \ +\ \min_{y_i = +1} \sum_j \alpha_j y_j\, K(\mathbf{x}_j, \mathbf{x}_i) \Big)$$

and an unknown $\mathbf{x}$ is classified as

$$\mathrm{sign}\Big( \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b^* \Big)$$
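As a sketch (assuming the alpha_i have already been obtained from some QP solver), the kernelized bias and decision rule above could be computed as follows; the helper names are hypothetical:

```python
# Given training data (X, y), dual coefficients alpha, and a kernel function,
# compute the bias b* and the kernelized decision value from this slide.
import numpy as np

def decision_value(x, X, y, alpha, kernel, b=0.0):
    """sum_i alpha_i y_i K(x_i, x) + b"""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

def bias(X, y, alpha, kernel):
    """b* = -1/2 (max over y_i=-1 of f(x_i) + min over y_i=+1 of f(x_i)), with f taken without bias."""
    f = np.array([decision_value(xi, X, y, alpha, kernel) for xi in X])
    y = np.asarray(y)
    return -0.5 * (f[y == -1].max() + f[y == +1].min())

# usage: b_star = bias(X, y, alpha, kernel)
#        label  = np.sign(decision_value(x_new, X, y, alpha, kernel, b_star))
```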


## References and resources

- Cristianini & Shawe-Taylor, "An Introduction to Support Vector Machines", Cambridge University Press, 2000.
- www.kernel-machines.org
- SVMLight
- Chris Burges, a tutorial.
- J.-P. Vert, a 3-day tutorial.
- W. Noble, "Support vector machine applications in computational biology", in Kernel Methods in Computational Biology, B. Schoelkopf, K. Tsuda and J.-P. Vert, eds., MIT Press, 2004.