# Support Vector Machines

Artificial Intelligence and Robotics

7 Nov 2013

Jianping Fan
Dept of Computer Science
UNC-Charlotte

## Perceptron Revisited: Linear Separators

Binary classification can be viewed as the task of separating classes in feature space:

$$w^T x + b = 0 \quad \text{(the separating hyperplane)}$$
$$w^T x + b < 0 \qquad\qquad w^T x + b > 0 \quad \text{(the two half-spaces)}$$

$$f(x) = \operatorname{sign}(w^T x + b)$$
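The decision rule above can be sketched in a few lines of numpy; the weights and points below are hypothetical, chosen only to illustrate the sign rule.

```python
import numpy as np

def linear_classify(X, w, b):
    """Classify rows of X with the linear decision rule f(x) = sign(w^T x + b)."""
    return np.sign(X @ w + b)

# Hypothetical 2-D separator: x1 + x2 - 1 = 0, i.e. w = [1, 1], b = -1
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[2.0, 2.0],   # w^T x + b = 3  -> +1
              [0.0, 0.0]])  # w^T x + b = -1 -> -1
print(linear_classify(X, w, b))  # [ 1. -1.]
```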

## Linear Separators

Which of the linear separators is optimal?

## Classification Margin

The distance from an example $x_i$ to the separator is

$$r = \frac{|w^T x_i + b|}{\|w\|}$$

Examples closest to the hyperplane are **support vectors**.

The **margin** $\rho$ of the separator is the distance between support vectors.
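The distance formula can be checked directly in numpy; the separator and point below are hypothetical.

```python
import numpy as np

def distance_to_separator(x, w, b):
    """Geometric distance |w^T x + b| / ||w|| from point x to the hyperplane."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Hypothetical separator x1 + x2 - 1 = 0 (w = [1, 1], b = -1)
w = np.array([1.0, 1.0])
b = -1.0
print(distance_to_separator(np.array([2.0, 2.0]), w, b))  # 3/sqrt(2) ≈ 2.121
```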

## Maximum Margin Classification

Maximizing the margin is good according to intuition and PAC (probably approximately correct learning) theory.

It implies that only support vectors matter; other training examples are ignorable.

## Linear SVM Mathematically

Let the training set $\{(x_i, y_i)\}_{i=1..n}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, 1\}$, be separated by a hyperplane with margin $\rho$. Then for each training example $(x_i, y_i)$:

$$w^T x_i + b \le -\rho/2 \quad \text{if } y_i = -1$$
$$w^T x_i + b \ge \rho/2 \quad \text{if } y_i = 1$$

or equivalently, $y_i(w^T x_i + b) \ge \rho/2$.

For every support vector $x_s$ the above inequality is an equality. After rescaling $w$ and $b$ by $\rho/2$ in the equality, we obtain that the distance between each $x_s$ and the hyperplane is

$$r = \frac{y_s(w^T x_s + b)}{\|w\|} = \frac{1}{\|w\|}$$

Then the margin can be expressed through the (rescaled) $w$ and $b$ as:

$$\rho = 2r = \frac{2}{\|w\|}$$

## Linear SVMs Mathematically (cont.)

Then we can formulate the optimization problem:

Find $w$ and $b$ such that $\rho = \dfrac{2}{\|w\|}$ is maximized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \ge 1$.

Which can be reformulated as:

Find $w$ and $b$ such that $\Phi(w) = \|w\|^2 = w^T w$ is minimized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \ge 1$.

## Solving the Optimization Problem

We need to optimize a quadratic function subject to linear constraints.

Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.

Find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \ge 1$.

## Solving the Optimization Problem

The solution involves constructing a **dual problem** where a **Lagrange multiplier** $\alpha_i$ is associated with every inequality constraint in the primal (original) problem:

Find $\alpha_1 \dots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

(1) $\sum_i \alpha_i y_i = 0$
(2) $\alpha_i \ge 0$ for all $\alpha_i$

## The Optimization Problem Solution

Given a solution $\alpha_1 \dots \alpha_n$ to the dual problem, the solution to the primal is:

$$w = \sum_i \alpha_i y_i x_i \qquad b = y_k - \sum_i \alpha_i y_i x_i^T x_k \quad \text{for any } \alpha_k > 0$$

Each non-zero $\alpha_i$ indicates that the corresponding $x_i$ is a support vector.

Then the classifying function is (note that we don't need $w$ explicitly):

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

Notice that it relies on an **inner product** between the test point $x$ and the support vectors $x_i$. Also keep in mind that solving the optimization problem involved computing the inner products $x_i^T x_j$ between all training points.
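The recovery of $w$ and $b$ from the dual multipliers can be sketched on a hypothetical toy problem where the dual optimum is known in closed form: two 1-D points $x = \pm 1$ with labels $y = \pm 1$ give $\alpha = [0.5, 0.5]$.

```python
import numpy as np

# Hypothetical toy problem: x = 1 (y = +1) and x = -1 (y = -1).
# The dual optimum here is alpha = [0.5, 0.5]; both points are support vectors.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

# Recover the primal solution from the dual multipliers:
w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
k = int(np.argmax(alpha > 0))              # any index with alpha_k > 0
b = y[k] - (alpha * y * (X @ X[k])).sum()  # b = y_k - sum_i alpha_i y_i x_i^T x_k

def f(x):
    """Classify via the support-vector expansion, without using w explicitly."""
    return np.sign((alpha * y * (X @ x)).sum() + b)

print(w, b, f(np.array([2.0])))  # [1.] 0.0 1.0
```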

## Soft Margin Classification

What if the training set is not linearly separable?

**Slack variables** $\xi_i$ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called **soft**.

## Soft Margin Classification Mathematically

The old formulation:

Find $w$ and $b$ such that $\Phi(w) = w^T w$ is minimized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \ge 1$.

The modified formulation incorporates slack variables:

Find $w$ and $b$ such that $\Phi(w) = w^T w + C\sum_i \xi_i$ is minimized, and for all $(x_i, y_i)$, $i = 1..n$: $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$.

Parameter $C$ can be viewed as a way to control overfitting: it "trades off" the relative importance of maximizing the margin and fitting the training data.

## Soft Margin Classification Solution

The dual problem is identical to the separable case (it would *not* be identical if the 2-norm penalty for slack variables, $C\sum_i \xi_i^2$, were used in the primal objective; then we would need additional Lagrange multipliers for the slack variables):

Find $\alpha_1 \dots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

(1) $\sum_i \alpha_i y_i = 0$
(2) $0 \le \alpha_i \le C$ for all $\alpha_i$

Again, $x_i$ with non-zero $\alpha_i$ will be support vectors.
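One way to get a feel for this dual is a minimal sketch (not a production solver): projected gradient ascent on $Q(\alpha)$ with the box constraint $0 \le \alpha_i \le C$. To keep it short, the bias $b$ is fixed at 0, so the equality constraint $\sum_i \alpha_i y_i = 0$ drops out; the data and $C$ below are hypothetical.

```python
import numpy as np

def dual_ascent(X, y, C, lr=0.01, steps=2000):
    """Projected gradient ascent on the (bias-free) soft-margin dual."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - Q @ alpha                       # gradient of sum(alpha) - 1/2 a^T Q a
        alpha = np.clip(alpha + lr * grad, 0.0, C)   # project onto the box [0, C]
    return alpha

X = np.array([[2.0], [1.5], [-1.5], [-2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = dual_ascent(X, y, C=10.0)
w = (alpha * y) @ X
print(np.sign(X @ w))  # matches y: [ 1.  1. -1. -1.]
```

Real solvers such as SMO handle the equality constraint by updating pairs of multipliers at a time; this sketch only illustrates the box constraint and the quadratic objective.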

## Soft Margin Classification Solution

The solution to the dual problem is:

$$w = \sum_i \alpha_i y_i x_i \qquad b = y_k(1 - \xi_k) - \sum_i \alpha_i y_i x_i^T x_k \quad \text{for any } k \text{ s.t. } \alpha_k > 0$$

Again, we don't need to compute $w$ explicitly for classification:

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

## Theoretical Justification for Maximum Margins

Vapnik has proved the following:

The class of optimal linear separators has VC dimension $h$ bounded from above as

$$h \le \min\left(\left\lceil \frac{D^2}{\rho^2} \right\rceil,\; m_0\right) + 1$$

where $\rho$ is the margin, $D$ is the diameter of the smallest sphere that can enclose all of the training examples, and $m_0$ is the dimensionality.

Intuitively, this implies that regardless of dimensionality $m_0$ we can minimize the VC dimension by maximizing the margin $\rho$.

Thus, the complexity of the classifier is kept small regardless of dimensionality.

## Linear SVMs: Overview

The classifier is a separating hyperplane.

The most "important" training points are support vectors; they define the hyperplane.

Quadratic optimization algorithms can identify which training points $x_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.

## Linear SVMs: Overview

Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:

Find $\alpha_1 \dots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j$$

is maximized and

(1) $\sum_i \alpha_i y_i = 0$
(2) $0 \le \alpha_i \le C$ for all $\alpha_i$

$$f(x) = \sum_i \alpha_i y_i x_i^T x + b$$

## Non-Linear SVMs

Datasets that are linearly separable with some noise work out great.

But what are we going to do if the dataset is just too hard?

How about mapping the data to a higher-dimensional space? (Figure: a 1-D dataset that is not separable in $x$ becomes separable after the mapping $x \mapsto (x, x^2)$.)

## Non-Linear SVMs: Feature Spaces

General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

$$\Phi: x \mapsto \varphi(x)$$

## The "Kernel Trick"

The linear classifier relies on an inner product between vectors, $K(x_i, x_j) = x_i^T x_j$.

If every datapoint is mapped into a high-dimensional space via some transformation $\Phi: x \mapsto \varphi(x)$, the inner product becomes:

$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$

A **kernel function** is a function that is equivalent to an inner product in some feature space.

Example:

2-dimensional vectors $x = [x_1 \; x_2]$; let $K(x_i, x_j) = (1 + x_i^T x_j)^2$.

We need to show that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$:

$$K(x_i, x_j) = (1 + x_i^T x_j)^2 = 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$$
$$= [1 \;\; x_{i1}^2 \;\; \sqrt{2}\, x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2}\, x_{i1} \;\; \sqrt{2}\, x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2}\, x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2}\, x_{j1} \;\; \sqrt{2}\, x_{j2}]$$
$$= \varphi(x_i)^T \varphi(x_j), \quad \text{where } \varphi(x) = [1 \;\; x_1^2 \;\; \sqrt{2}\, x_1 x_2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 \;\; \sqrt{2}\, x_2]$$

Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each $\varphi(x)$ explicitly).
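The identity in the worked example can be verified numerically; the two test vectors below are arbitrary.

```python
import numpy as np

# Check that the quadratic kernel K(xi, xj) = (1 + xi^T xj)^2 equals the
# inner product phi(xi)^T phi(xj) with the 6-dimensional feature map phi.
def K(xi, xj):
    return (1.0 + xi @ xj) ** 2

def phi(x):
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])
print(K(xi, xj), phi(xi) @ phi(xj))  # both equal 4 (up to floating-point rounding)
```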

## What Functions Are Kernels?

For some functions $K(x_i, x_j)$, checking that $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ can be cumbersome.

Mercer's theorem: every semi-positive definite symmetric function is a kernel.

Semi-positive definite symmetric functions correspond to a semi-positive definite symmetric Gram matrix:

$$K = \begin{bmatrix}
K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_n) \\
K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_n) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
K(x_n, x_1) & K(x_n, x_2) & K(x_n, x_3) & \cdots & K(x_n, x_n)
\end{bmatrix}$$
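In practice the Mercer condition can be probed numerically: build the Gram matrix of a candidate kernel on sample points and check that it is symmetric positive semi-definite. The sample points below are hypothetical, and a finite check like this can only refute, never prove, that a function is a kernel.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix K_ab = kernel(x_a, x_b) over a list of sample points."""
    return np.array([[kernel(a, b) for b in points] for a in points])

def is_psd(G, tol=1e-10):
    """Symmetric and all eigenvalues >= 0 (up to numerical tolerance)."""
    return np.allclose(G, G.T) and np.linalg.eigvalsh(G).min() >= -tol

pts = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([2.0, -1.0])]
quad = lambda a, b: (1.0 + a @ b) ** 2           # the quadratic kernel from above
print(is_psd(gram_matrix(quad, pts)))            # True
print(is_psd(gram_matrix(lambda a, b: -(a @ b), pts)))  # False: not a kernel
```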

## Examples of Kernel Functions

Linear: $K(x_i, x_j) = x_i^T x_j$

- Mapping $\Phi: x \mapsto \varphi(x)$, where $\varphi(x)$ is $x$ itself.

Polynomial of power $p$: $K(x_i, x_j) = (1 + x_i^T x_j)^p$

- Mapping $\Phi: x \mapsto \varphi(x)$, where $\varphi(x)$ has $\binom{d+p}{p}$ dimensions.

Gaussian (radial-basis function): $K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$

- Mapping $\Phi: x \mapsto \varphi(x)$, where $\varphi(x)$ is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of functions for support vectors is the separator.

The higher-dimensional space still has *intrinsic* dimensionality $d$ (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
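The Gaussian (RBF) kernel from the list above is a one-liner in numpy; the points and bandwidth below are arbitrary.

```python
import numpy as np

def rbf(xi, xj, sigma=1.0):
    """Gaussian kernel K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])     # ||a - b||^2 = 25
print(rbf(a, a))             # 1.0 (identical points)
print(rbf(a, b, sigma=5.0))  # exp(-25/50) = exp(-0.5) ≈ 0.6065
```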
## Non-Linear SVMs Mathematically

Dual problem formulation:

Find $\alpha_1 \dots \alpha_n$ such that

$$Q(\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

is maximized and

(1) $\sum_i \alpha_i y_i = 0$
(2) $\alpha_i \ge 0$ for all $\alpha_i$

The solution is:

$$f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$$

Optimization techniques for finding the $\alpha_i$'s remain the same!
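The kernelized decision function can be sketched as follows; the support vectors, multipliers, and bias below are hypothetical values for illustration, not the output of a real solver.

```python
import numpy as np

# Sketch of f(x) = sum_i alpha_i y_i K(x_i, x) + b with an RBF kernel.
def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

sv = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]  # hypothetical support vectors
alpha_y = [1.0, -1.0]                                # hypothetical alpha_i * y_i
b = 0.0

def f(x):
    return sum(ay * rbf(s, x) for ay, s in zip(alpha_y, sv)) + b

print(np.sign(f(np.array([0.9, 1.2]))))   # 1.0  (closer to the positive SV)
print(np.sign(f(np.array([-1.1, -0.8])))) # -1.0
```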

## SVM Applications

SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.

SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.

SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.

SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. '97], principal component analysis [Schölkopf et al. '99], etc.

The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of $\alpha_i$'s at a time, e.g. SMO [Platt '99] and [Joachims '99].

Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.

## Source Code for SVM

1. SVM Light from Cornell University: Prof. Thorsten Joachims
   http://svmlight.joachims.org/

2. Multi-Class SVM from Cornell: Prof. Thorsten Joachims
   http://svmlight.joachims.org/svm_multiclass.html

3. LIBSVM from National Taiwan University: Prof. Chih-Jen Lin
   http://www.csie.ntu.edu.tw/~cjlin/libsvm/