SUPPORT VECTOR MACHINES

Jianping Fan
Dept of Computer Science
UNC-Charlotte

PERCEPTRON REVISITED: LINEAR SEPARATORS



Binary classification can be viewed as the task of separating classes in feature space:

    w^T x + b = 0   (the separating hyperplane)
    w^T x + b > 0   (one side)
    w^T x + b < 0   (the other side)

    f(x) = sign(w^T x + b)
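For concreteness, here is a minimal NumPy sketch of this decision rule (an illustration, not part of the original slides; the separator w, b and the points are made up):

    import numpy as np

    def linear_decision(X, w, b):
        """Return f(x) = sign(w^T x + b) for each row x of X."""
        return np.sign(X @ w + b)

    w = np.array([1.0, -2.0])                  # hypothetical weight vector
    b = 0.5                                    # hypothetical bias
    X = np.array([[3.0, 1.0], [-1.0, 2.0]])    # two test points
    print(linear_decision(X, w, b))            # [ 1. -1.]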

LINEAR SEPARATORS


Which of the linear separators is optimal?

CLASSIFICATION MARGIN


Distance from example x_i to the separator is

    r = y_i (w^T x_i + b) / ||w||

Examples closest to the hyperplane are support vectors.

The margin ρ of the separator is the distance between support vectors.

MAXIMUM MARGIN CLASSIFICATION


Maximizing the margin is good according to intuition and PAC (probably approximately correct) learning theory.


Implies that only support vectors matter; other
training examples are ignorable.

LINEAR SVM MATHEMATICALLY


Let the training set {(x_i, y_i)}, i = 1..n, with x_i ∈ R^d and y_i ∈ {-1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (x_i, y_i):

    w^T x_i + b ≤ -ρ/2   if y_i = -1
    w^T x_i + b ≥  ρ/2   if y_i =  1

or equivalently:   y_i (w^T x_i + b) ≥ ρ/2

For every support vector x_s the above inequality is an equality. After rescaling w and b by 2/ρ in the equality (so that y_s (w^T x_s + b) = 1), we obtain that the distance between each x_s and the hyperplane is

    r = y_s (w^T x_s + b) / ||w|| = 1 / ||w||

Then the margin can be expressed through the (rescaled) w and b as:

    ρ = 2r = 2 / ||w||
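A small NumPy check of these quantities (illustrative only; the separator and points below are made up). For a canonically rescaled separator, the closest points lie at distance 1/||w|| and the margin is 2/||w||:

    import numpy as np

    w, b = np.array([2.0, 0.0]), -1.0           # hypothetical separator, rescaled so min_i y_i (w^T x_i + b) = 1
    X = np.array([[1.0, 0.0], [0.0, 0.5]])      # one point on each side, both support vectors
    y = np.array([1, -1])

    dist = y * (X @ w + b) / np.linalg.norm(w)  # distance of each point to the hyperplane
    print(dist)                                 # [0.5 0.5]  -> r = 1/||w||
    print(2 / np.linalg.norm(w))                # 1.0        -> margin rho = 2/||w||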



LINEAR SVMS MATHEMATICALLY (CONT.)


Then we can formulate the quadratic optimization problem:

    Find w and b such that
        ρ = 2 / ||w|| is maximized
    and for all (x_i, y_i), i = 1..n:   y_i (w^T x_i + b) ≥ 1

Which can be reformulated as:

    Find w and b such that
        Φ(w) = ||w||² = w^T w is minimized
    and for all (x_i, y_i), i = 1..n:   y_i (w^T x_i + b) ≥ 1
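A minimal sketch of this primal problem using the cvxpy modeling library (my own choice of tool; the slides do not prescribe a solver, and the toy data are made up):

    import cvxpy as cp
    import numpy as np

    # hypothetical linearly separable data
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)
    b = cp.Variable()
    # minimize ||w||^2 subject to y_i (w^T x_i + b) >= 1
    primal = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                        [cp.multiply(y, X @ w + b) >= 1])
    primal.solve()
    print(w.value, b.value)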

SOLVING THE OPTIMIZATION PROBLEM





Need to optimize a quadratic function subject to linear constraints.

Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.

    Find w and b such that
        Φ(w) = w^T w is minimized
    and for all (x_i, y_i), i = 1..n:   y_i (w^T x_i + b) ≥ 1

SOLVING THE OPTIMIZATION PROBLEM


The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every inequality constraint in the primal (original) problem:

    Find α_1 … α_n such that
        Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and
        (1) Σ α_i y_i = 0
        (2) α_i ≥ 0 for all α_i
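A corresponding sketch of the dual, again with cvxpy on the same made-up data (illustrative; it uses the identity ½ ΣΣ α_i α_j y_i y_j x_i^T x_j = ½ ||Σ α_i y_i x_i||²):

    import cvxpy as cp
    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    Xy = y[:, None] * X                  # rows are y_i * x_i
    a = cp.Variable(len(y))              # the Lagrange multipliers alpha_i
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))
    dual = cp.Problem(objective, [a >= 0, y @ a == 0])
    dual.solve()
    print(a.value)                       # nonzero alphas mark the support vectors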

THE OPTIMIZATION PROBLEM SOLUTION


Given a solution α_1 … α_n to the dual problem, the solution to the primal is:

    w = Σ α_i y_i x_i
    b = y_k - Σ α_i y_i x_i^T x_k   for any α_k > 0

Each non-zero α_i indicates that the corresponding x_i is a support vector.

Then the classifying function is (note that we don’t need w explicitly):

    f(x) = Σ α_i y_i x_i^T x + b

Notice that it relies on an inner product between the test point x and the support vectors x_i; we will return to this later.

Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all training points.
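Continuing that illustrative sketch (same made-up data and the variables a, Xy, X, y defined above), w, b, and the classifying function can be recovered from the dual solution:

    alpha = a.value                               # dual solution from the sketch above
    w = Xy.T @ alpha                              # w = sum_i alpha_i y_i x_i
    k = int(np.argmax(alpha))                     # an index with alpha_k > 0 (a support vector)
    b = y[k] - X[k] @ w                           # b = y_k - sum_i alpha_i y_i x_i^T x_k
    f = lambda x: np.sign(np.array(x) @ w + b)    # f(x) = sign(sum_i alpha_i y_i x_i^T x + b)
    print(f([2.5, 2.5]), f([-2.5, -1.5]))         # 1.0 -1.0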

SOFT MARGIN CLASSIFICATION



What if the training set is not linearly separable?


Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

SOFT MARGIN CLASSIFICATION MATHEMATICALLY


The old formulation:

    Find w and b such that
        Φ(w) = w^T w is minimized
    and for all (x_i, y_i), i = 1..n:   y_i (w^T x_i + b) ≥ 1

The modified formulation incorporates slack variables:

    Find w and b such that
        Φ(w) = w^T w + C Σ ξ_i is minimized
    and for all (x_i, y_i), i = 1..n:   y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0

Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.
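In practice this trade-off is exposed as the C parameter of SVM libraries. A brief, illustrative scikit-learn sketch on made-up noisy data (scikit-learn is my choice here, not something the slides mention):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=1.5, size=(50, 2)),
                   rng.normal(loc=-1.5, size=(50, 2))])
    y = np.array([1] * 50 + [-1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # small C: wide (soft) margin, many support vectors; large C: fits the training data harder
        print(C, clf.n_support_, clf.score(X, y))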


SOFT MARGIN CLASSIFICATION: SOLUTION


The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables, C Σ ξ_i², were used in the primal objective; then we would need additional Lagrange multipliers for the slack variables):

    Find α_1 … α_N such that
        Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and
        (1) Σ α_i y_i = 0
        (2) 0 ≤ α_i ≤ C for all α_i

Again, x_i with non-zero α_i will be support vectors.

SOFT MARGIN CLASSIFICATION: SOLUTION



The solution to the dual problem is:

    w = Σ α_i y_i x_i
    b = y_k (1 - ξ_k) - Σ α_i y_i x_i^T x_k   for any k s.t. α_k > 0

Again, we don’t need to compute w explicitly for classification:

    f(x) = Σ α_i y_i x_i^T x + b

THEORETICAL JUSTIFICATION FOR MAXIMUM MARGINS


Vapnik has proved the following:

The class of optimal linear separators has VC dimension h bounded from above as

    h ≤ min( ⌈D² / ρ²⌉, m_0 ) + 1

where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m_0 is the dimensionality.

Intuitively, this implies that regardless of the dimensionality m_0 we can minimize the VC dimension by maximizing the margin ρ.

Thus, the complexity of the classifier is kept small regardless of dimensionality.

LINEAR SVMS: OVERVIEW


The classifier is a
separating hyperplane.



Most “important” training points are support vectors;
they define the hyperplane.



Quadratic optimization algorithms can identify which training points x_i are support vectors with non-zero Lagrangian multipliers α_i.




LINEAR SVMS: OVERVIEW



Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

    Find α_1 … α_N such that
        Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and
        (1) Σ α_i y_i = 0
        (2) 0 ≤ α_i ≤ C for all α_i

    f(x) = Σ α_i y_i x_i^T x + b

NON-LINEAR SVMS

Datasets that are linearly separable with some noise
work out great:




But what are we going to do if the dataset is just too
hard?



How about… mapping data to a higher-dimensional space:

[Figure: 1-D examples plotted on the x axis; after the mapping x → (x, x²) the hard dataset becomes linearly separable.]

NON-LINEAR SVMS: FEATURE SPACES


General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

    Φ: x → φ(x)

THE “KERNEL TRICK”



The linear classifier relies on an inner product between vectors, K(x_i, x_j) = x_i^T x_j.

If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

    K(x_i, x_j) = φ(x_i)^T φ(x_j)

A kernel function is a function that is equivalent to an inner product in some feature space.

Example:

2-dimensional vectors x = [x_1  x_2]; let K(x_i, x_j) = (1 + x_i^T x_j)².

Need to show that K(x_i, x_j) = φ(x_i)^T φ(x_j):

    K(x_i, x_j) = (1 + x_i^T x_j)²
                = 1 + x_i1² x_j1² + 2 x_i1 x_j1 x_i2 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2
                = [1  x_i1²  √2 x_i1 x_i2  x_i2²  √2 x_i1  √2 x_i2]^T [1  x_j1²  √2 x_j1 x_j2  x_j2²  √2 x_j1  √2 x_j2]
                = φ(x_i)^T φ(x_j),   where φ(x) = [1  x_1²  √2 x_1 x_2  x_2²  √2 x_1  √2 x_2]

Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
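A quick NumPy check of this identity on arbitrary test vectors (illustrative only):

    import numpy as np

    def phi(x):
        """phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]"""
        x1, x2 = x
        s = np.sqrt(2.0)
        return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

    xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.7])
    lhs = (1.0 + xi @ xj) ** 2    # kernel evaluated directly
    rhs = phi(xi) @ phi(xj)       # inner product in the mapped feature space
    print(np.isclose(lhs, rhs))   # True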

WHAT FUNCTIONS ARE KERNELS?


For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome.

Mercer’s theorem: every positive semi-definite symmetric function is a kernel.

Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

    K = | K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  …  K(x_1,x_n) |
        | K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  …  K(x_2,x_n) |
        | …           …           …           …  …          |
        | K(x_n,x_1)  K(x_n,x_2)  K(x_n,x_3)  …  K(x_n,x_n) |

EXAMPLES OF KERNEL FUNCTIONS


Linear: K(x_i, x_j) = x_i^T x_j
Mapping Φ: x → φ(x), where φ(x) is x itself.

Polynomial of power p: K(x_i, x_j) = (1 + x_i^T x_j)^p
Mapping Φ: x → φ(x), where φ(x) has C(d+p, p) dimensions.

Gaussian (radial-basis function): K(x_i, x_j) = exp(-||x_i - x_j||² / (2σ²))
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of functions for support vectors is the separator.

The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
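For reference, plain NumPy versions of these three kernels (an illustrative sketch; parameter names such as sigma are my own):

    import numpy as np

    def linear_kernel(xi, xj):
        return xi @ xj

    def polynomial_kernel(xi, xj, p=2):
        return (1.0 + xi @ xj) ** p

    def gaussian_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))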
NON-LINEAR SVMS MATHEMATICALLY


Dual problem formulation:

    Find α_1 … α_n such that
        Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j K(x_i, x_j) is maximized and
        (1) Σ α_i y_i = 0
        (2) α_i ≥ 0 for all α_i

The solution is:

    f(x) = Σ α_i y_i K(x_i, x) + b

Optimization techniques for finding the α_i’s remain the same!
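As a practical illustration (not from the slides), swapping kernels in an off-the-shelf solver such as scikit-learn's SVC shows the effect on a dataset that is not linearly separable in the original space:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # two concentric circles: not separable by any line in 2-D
    X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

    for kernel in ("linear", "poly", "rbf"):
        clf = SVC(kernel=kernel, degree=2, gamma="scale", C=1.0).fit(X, y)
        print(kernel, clf.score(X, y))   # the non-linear kernels separate the circles; the linear one cannot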

SVM APPLICATIONS


SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.

SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.

SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.

SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.

Most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of α_i’s at a time, e.g. SMO [Platt ’99] and [Joachims ’99].

Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a trial-and-error manner.


SOURCE CODE FOR SVM

1. SVM Light from Cornell University: Prof. Thorsten Joachims
   http://svmlight.joachims.org/

2. Multi-Class SVM from Cornell: Prof. Thorsten Joachims
   http://svmlight.joachims.org/svm_multiclass.html

3. LIBSVM from National Taiwan University: Prof. Chih-Jen Lin
   http://www.csie.ntu.edu.tw/~cjlin/libsvm/