

CISC 667 Intro to Bioinformatics (Fall 2005)

Support Vector Machines I


Terminology

•  An object x is represented by a set of m attributes x_i, 1 ≤ i ≤ m.

•  A set of n training examples S = { (x_1, y_1), …, (x_n, y_n) }, where y_i is the classification (or label) of instance x_i.

•  For binary classification, y_i ∈ {−1, +1}; for k-class classification, y_i ∈ {1, 2, …, k}.

•  Without loss of generality, we focus on binary classification.

•  The task is to learn the mapping x_i → y_i.

•  A machine is a learned function/mapping/hypothesis h: x_i → h(x_i, α), where α stands for the parameters to be fixed during training.

•  Performance is measured as

   E = (1/2n) Σ_{i=1 to n} | y_i − h(x_i, α) |
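A minimal illustration of this error measure (my own sketch, not part of the original slides), in Python: for labels in {−1, +1}, |y_i − h(x_i, α)| is 0 for a correct prediction and 2 for a mistake, so E is exactly the fraction of misclassified examples.

import numpy as np

def empirical_error(y, predictions):
    # E = (1/2n) * sum_i |y_i - h(x_i)|, for labels in {-1, +1}
    y = np.asarray(y)
    predictions = np.asarray(predictions)
    return np.abs(y - predictions).sum() / (2 * len(y))

# One mistake out of four examples -> E = 0.25
print(empirical_error([+1, -1, +1, -1], [+1, -1, -1, -1]))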


Linear SVMs: find a hyperplane (specified by a normal vector w and an offset b; the perpendicular distance from the hyperplane to the origin is |b|/||w||) that separates the positive and negative examples with the largest margin.

[Figure: the separating hyperplane (w, b) and its margin, with the origin, the normal vector w, and the offset b marked; + and − denote positive and negative examples.]

w ∙ x_i + b > 0 if y_i = +1
w ∙ x_i + b < 0 if y_i = −1

An unknown x is classified as sign(w ∙ x + b).
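A minimal sketch of this decision rule in Python (mine, not the slides'), assuming a weight vector w and offset b have already been obtained by some training procedure:

import numpy as np

def classify(x, w, b):
    # sign(w . x + b); the boundary case w . x + b = 0 is mapped to +1 here
    return 1 if np.dot(w, x) + b >= 0 else -1

# Hypothetical hyperplane w = (1, -1), b = 0.5
print(classify(np.array([2.0, 1.0]), np.array([1.0, -1.0]), 0.5))   # prints 1
print(classify(np.array([0.0, 3.0]), np.array([1.0, -1.0]), 0.5))   # prints -1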


Rosenblatt’s Algorithm (1956)

Given a learning rate η > 0;            // η is the learning rate
w_0 = 0; b_0 = 0; k = 0
R = max_{1 ≤ i ≤ n} || x_i ||
error = 1;                              // flag for misclassification/mistake
while (error) {                         // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if ( y_i (⟨w_k ∙ x_i⟩ + b_k) ≤ 0 ) {     // misclassification
            w_{k+1} = w_k + η y_i x_i            // update the weight
            b_{k+1} = b_k + η y_i R²             // update the bias
            k = k + 1
            error = 1;
        }
    }
}
return (w_k, b_k)   // hyperplane that separates the data, where k is the number of modifications
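The pseudocode above translates directly into Python. The sketch below is mine; the max_epochs cap is an addition so the loop also terminates when the data are not linearly separable (in which case the returned hyperplane does not separate them).

import numpy as np

def perceptron_primal(X, y, eta=1.0, max_epochs=1000):
    # Rosenblatt's algorithm, primal form. X: (n, m) array of examples, y: labels in {-1, +1}.
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    k = 0                                          # number of modifications
    for _ in range(max_epochs):
        error = False
        for i in range(n):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:  # misclassification
                w = w + eta * y[i] * X[i]          # update the weight
                b = b + eta * y[i] * R ** 2        # update the bias
                k += 1
                error = True
        if not error:                              # a full pass with no mistakes: converged
            break
    return w, b, k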


Questions w.r.t. Rosenblatt’s algorithm

•  Is the algorithm guaranteed to converge?
•  How quickly does it converge?

Novikoff Theorem:
Let S be a training set of size n and R = max_{1 ≤ i ≤ n} || x_i ||. If there exists a vector w* such that ||w*|| = 1 and

    y_i (w* ∙ x_i) ≥ γ    for 1 ≤ i ≤ n,

then the number of modifications before convergence is at most (R/γ)².



Proof:

1.  w_t ∙ w* = w_{t−1} ∙ w* + η y_i (x_i ∙ w*) ≥ w_{t−1} ∙ w* + η γ
    ⇒  w_t ∙ w* ≥ t η γ

2.  ||w_t||² = ||w_{t−1}||² + 2 η y_i (x_i ∙ w_{t−1}) + η² ||x_i||²
             ≤ ||w_{t−1}||² + η² ||x_i||²       (the update is made only when y_i (x_i ∙ w_{t−1}) ≤ 0)
             ≤ ||w_{t−1}||² + η² R²
    ⇒  ||w_t||² ≤ t η² R²

3.  √t η R ||w*|| ≥ ||w_t|| ≥ w_t ∙ w* ≥ t η γ   (using ||w*|| = 1 and the Cauchy-Schwarz inequality)
    ⇒  t ≤ (R/γ)².

Note:

•  Without loss of generality, the separating plane is assumed to pass through the origin, i.e., no bias b is necessary.

•  The learning rate η seems to have no bearing on this upper bound.

•  What if the training data are not linearly separable, i.e., w* does not exist?


Dual form

•  The final hypothesis w is a linear combination of the training points:

   w = Σ_{i=1 to n} α_i y_i x_i

   where the α_i are positive values proportional to the number of times misclassification of x_i has caused the weight to be updated.

•  The vector α can be considered an alternative representation of the hypothesis; α_i can be regarded as an indication of the information content of the example x_i.

•  The decision function can be rewritten as

   h(x) = sign( w ∙ x + b )
        = sign( (Σ_{j=1 to n} α_j y_j x_j) ∙ x + b )
        = sign( Σ_{j=1 to n} α_j y_j (x_j ∙ x) + b )
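In code (my notation): with alpha and y of shape (n,) and X of shape (n, m) as NumPy arrays, the two formulas above are one-liners.

import numpy as np

def weights_from_dual(alpha, y, X):
    # w = sum_j alpha_j * y_j * x_j
    return (alpha * y) @ X

def decide_dual(x, alpha, y, X, b):
    # h(x) = sign( sum_j alpha_j y_j (x_j . x) + b ); np.sign returns 0 exactly on the hyperplane
    return np.sign((alpha * y) @ (X @ x) + b)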



Rosenblatt’s Algorithm in dual form

α = 0; b = 0
R = max_{1 ≤ i ≤ n} || x_i ||
error = 1;                              // flag for misclassification
while (error) {                         // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if ( y_i (Σ_{j=1 to n} α_j y_j (x_j ∙ x_i) + b) ≤ 0 ) {   // x_i is misclassified
            α_i = α_i + 1               // update the weight
            b = b + y_i R²              // update the bias
            error = 1;
        }
    }
}
return (α, b)   // hyperplane that separates the data; Σ_i α_i is the total number of modifications

Notes:

•  The training examples enter the algorithm only through the dot products (x_j ∙ x_i).

•  α_i is a measure of information content; the x_i with non-zero information content (α_i > 0) are called support vectors, as they are located on the boundaries.
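A Python sketch of the dual-form algorithm (mine; as before, a max_epochs cap is added so the loop terminates even when no separating hyperplane exists). Precomputing the Gram matrix of dot products (x_j ∙ x_i) makes the kernel substitution introduced in the later slides a one-line change.

import numpy as np

def perceptron_dual(X, y, max_epochs=1000):
    # Rosenblatt's algorithm, dual form. Returns (alpha, b).
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))
    G = X @ X.T                                    # Gram matrix: G[j, i] = x_j . x_i
    for _ in range(max_epochs):
        error = False
        for i in range(n):
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:   # x_i is misclassified
                alpha[i] += 1                      # update alpha_i
                b += y[i] * R ** 2                 # update the bias
                error = True
        if not error:
            break
    return alpha, b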


[Figure: the separating hyperplane (w, b) and its margin, with the origin, the normal vector w, and the offset b marked; on the hyperplane w ∙ x + b = 0, and on the positive (+) side w ∙ x + b > 0.]


Larger margin is preferred:

•  converges more quickly
•  generalizes better



w ∙ x⁺ + b = +1
w ∙ x⁻ + b = −1

Subtracting, and choosing x⁺ on the positive margin hyperplane and x⁻ as its perpendicular projection onto the negative one (so that x⁺ − x⁻ is parallel to w):

2 = [ (x⁺ ∙ w) − (x⁻ ∙ w) ] = (x⁺ − x⁻) ∙ w = || x⁺ − x⁻ || ||w||

Therefore, maximizing the geometric margin || x⁺ − x⁻ || = 2/||w|| is equivalent to minimizing ½ ||w||², under the linear constraints y_i (w ∙ x_i + b) ≥ 1 for i = 1, …, n:

    min_{w,b}  ⟨w ∙ w⟩
    subject to  y_i (⟨w ∙ x_i⟩ + b) ≥ 1   for i = 1, …, n



Optimization with constraints

    min_{w,b}  ⟨w ∙ w⟩
    subject to  y_i (⟨w ∙ x_i⟩ + b) ≥ 1   for i = 1, …, n

•  Lagrangian theory: introduce a Lagrange multiplier α_i for each constraint,

   L(w, b, α) = ½ ||w||² − Σ_i α_i ( y_i (w ∙ x_i + b) − 1 ),

   and then set the partial derivatives to zero:

   ∂L/∂w = 0,   ∂L/∂b = 0,

   which give w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0; substituting these back into L yields the dual problem below.

•  This optimization problem can be solved by quadratic programming.

•  It is guaranteed to converge to the global minimum because the problem is convex.

   Note: this is an advantage over artificial neural networks.


The optimal w* and b* can be found by solving the dual problem: find α to maximize

    L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j (x_i ∙ x_j)

under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.

Once α is solved,

    w* = Σ_i α_i y_i x_i
    b* = −½ ( max_{y_i = −1} w* ∙ x_i  +  min_{y_i = +1} w* ∙ x_i )

and an unknown x is classified as

    sign( w* ∙ x + b* ) = sign( Σ_i α_i y_i (x_i ∙ x) + b* )

Notes:

1.  Only the dot products between vectors are needed.

2.  Many α_i are equal to zero; those that are not zero correspond to x_i on the boundaries: the support vectors!

3.  In practice, instead of the sign function, the actual value of w* ∙ x + b* is used when its absolute value is less than or equal to one. Such a value is called a discriminant.
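As a hedged sketch of how this dual can be solved in practice, the quadratic program can be handed to an off-the-shelf QP solver; the choice of CVXOPT below is my own, not something the slides prescribe, and the code assumes the hard-margin (linearly separable) case.

import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    # Maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
    # subject to a_i >= 0 and sum_i a_i y_i = 0, posed as a CVXOPT minimization.
    n = X.shape[0]
    y = y.astype(float)
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                 # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, n))
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                    # w* = sum_i a_i y_i x_i
    b_star = -0.5 * (np.max(X[y == -1] @ w) + np.min(X[y == +1] @ w))
    return alpha, w, b_star

The solver returns tiny positive values of alpha for non-support vectors; thresholding them (say, alpha > 1e-6) recovers the support vectors of Note 2.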



Non-linear mapping to a feature space

[Figure: input points x_i, x_j are mapped by Φ(∙) to Φ(x_i), Φ(x_j) in the feature space.]

    L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j Φ(x_i) ∙ Φ(x_j)


Nonlinear SVMs

[Figure: an input X = (x_1, x_2) in the input space is mapped to the feature space, where the examples become linearly separable.]


Kernel function for mapping

•  For input X = (x_1, x_2), define the map Φ(X) = (x_1 x_1, √2 x_1 x_2, x_2 x_2).

•  Define the kernel function as K(X, Y) = (X ∙ Y)².

•  It holds that K(X, Y) = Φ(X) ∙ Φ(Y).

•  We can compute the dot product in feature space without computing Φ:

    K(X, Y) = Φ(X) ∙ Φ(Y)
            = (x_1 x_1, √2 x_1 x_2, x_2 x_2) ∙ (y_1 y_1, √2 y_1 y_2, y_2 y_2)
            = x_1 x_1 y_1 y_1 + 2 x_1 x_2 y_1 y_2 + x_2 x_2 y_2 y_2
            = (x_1 y_1 + x_2 y_2)(x_1 y_1 + x_2 y_2)
            = ((x_1, x_2) ∙ (y_1, y_2))²
            = (X ∙ Y)²
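A quick numerical check of this identity (my own illustration), with Φ and K exactly as defined above:

import numpy as np

def phi(v):
    # Phi(X) = (x1*x1, sqrt(2)*x1*x2, x2*x2)
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def K(u, v):
    # K(X, Y) = (X . Y)^2
    return np.dot(u, v) ** 2

X = np.array([1.0, 2.0])
Y = np.array([3.0, -1.0])
print(K(X, Y), np.dot(phi(X), phi(Y)))   # both print 1.0, since (1*3 + 2*(-1))^2 = 1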


Kernels

•  Given a mapping Φ(∙) from the space of input vectors to some higher-dimensional feature space, the kernel K of two vectors x_i, x_j is the inner product of their images in the feature space, namely K(x_i, x_j) = Φ(x_i) ∙ Φ(x_j).

•  Since we just need the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we use the kernel in place of the mapping Φ(∙).

•  Because the inner product of two vectors is a measure of the distance between the vectors, a kernel function actually defines the geometry of the feature space (lengths and angles), and implicitly provides a similarity measure for the objects to be classified.



Mercer’s condition

Since kernel functions play an important role, it is important to know whether a kernel gives dot products (in some higher-dimensional space).

For a kernel K(x, y), if for any g(x) such that ∫ g(x)² dx is finite we have

    ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0,

then there exists a mapping Φ such that

    K(x, y) = Φ(x) ∙ Φ(y).

Notes:

1.  Mercer’s condition does not tell us how to actually find Φ.

2.  Mercer’s condition may be hard to check, since it must hold for every g(x).


More kernel functions

Some commonly used generic kernel functions (both are sketched in code below):

•  Polynomial kernel: K(x, y) = (1 + x ∙ y)^p

•  Radial (or Gaussian) kernel: K(x, y) = exp( −||x − y||² / 2σ² )
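The same two kernels written out in Python (a sketch; the parameters p and sigma are the exponent and width from the formulas above):

import numpy as np

def polynomial_kernel(x, y, p=3):
    # K(x, y) = (1 + x . y)^p
    return (1.0 + np.dot(x, y)) ** p

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))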


Question: By introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?

Answer: Use the maximum-margin hyperplane (Vapnik’s theory).



With a kernel, the optimal w* and b* can be found by solving the dual problem: find α to maximize

    L(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

under the constraints α_i ≥ 0 and Σ_i α_i y_i = 0.

Once α is solved,

    w* = Σ_i α_i y_i Φ(x_i)     (a vector in the feature space, which never needs to be computed explicitly)
    b* = −½ ( max_{y_i = −1} f(x_i)  +  min_{y_i = +1} f(x_i) ),   where f(x) = Σ_j α_j y_j K(x_j, x)

and an unknown x is classified as

    sign( w* ∙ Φ(x) + b* ) = sign( Σ_i α_i y_i K(x_i, x) + b* )
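Putting the dual solution and a kernel together, a sketch of the resulting classifier (function and argument names are mine; any Mercer kernel, such as the Gaussian kernel above, can be passed in):

import numpy as np

def svm_decision(x, support_X, support_y, support_alpha, b_star, kernel):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b*; the predicted class is sign(f(x)),
    # and f(x) itself is the discriminant value mentioned earlier.
    f = sum(a * yi * kernel(xi, x)
            for a, yi, xi in zip(support_alpha, support_y, support_X)) + b_star
    return np.sign(f), f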



References and resources

•  Cristianini & Shawe-Taylor, “An Introduction to Support Vector Machines”, Cambridge University Press, 2000.

•  www.kernel-machines.org

•  SVMLight

•  Chris Burges, a tutorial

•  J.-P. Vert, a 3-day tutorial

•  W. Noble, “Support vector machine applications in computational biology”, in Kernel Methods in Computational Biology, B. Schoelkopf, K. Tsuda and J.-P. Vert, eds., MIT Press, 2004.