# Lecture 7

Artificial Intelligence and Robotics

16 Oct 2013


Support Vector Machines (SVM)

Used mostly for classification (they can also be modified for regression and even for
unsupervised learning applications). They achieve accuracy comparable to, and in some
cases better than, neural networks.

Assume training data $D = \{(x_i, y_i),\ i = 1, \dots, N\}$ with $y_i \in \{-1, +1\}$ that is
linearly separable (separable by a hyperplane).

Question: What is the best linear classifier of the type $f(x) = w^T x + b$?

While there can be an infinite number of hyperplanes that achieve 100% accuracy on the training
data, the question is which hyperplane is optimal with respect to the accuracy on test data.

Common sense solution: we want to increase the gap (margin) between positive
and negative cases as much as possible. The best linear classifier is the hyperplane
in the middle of the gap.

Given $f(x)$, the classification is obtained as $\hat{y} = \operatorname{sign}(f(x))$.

Important note: Different $w$ and $b$ can result in identical classification. For example,
we can apply any scalar $a > 0$ such that $\operatorname{sign}(a(w^T x + b)) = \operatorname{sign}(w^T x + b)$.
Therefore there are many identical solutions.
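The scale invariance of the classification can be checked numerically. A minimal numpy sketch (the hyperplane and data values are illustrative, not from the lecture):

```python
import numpy as np

# Toy separating hyperplane f(x) = w^T x + b (values are illustrative).
w = np.array([2.0, -1.0])
b = 0.5

X = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0], [3.0, 1.0]])

def classify(w, b, X):
    """Return sign(w^T x + b) for each row x of X."""
    return np.sign(X @ w + b)

a = 7.3  # any positive scalar
print(classify(w, b, X))          # labels for (w, b)
print(classify(a * w, a * b, X))  # identical labels after rescaling by a
```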

Definitions of SVM and Margin

To prevent problems caused by multiple identical solutions, we add the following
requirement:

Find $w$ and $b$ with maximal margin, such that for the points closest to the
separating hyperplane $y_i(w^T x_i + b) = 1$ (these points are also called
the support vectors), and for all other points $y_i(w^T x_i + b) > 1$.
Illustration:

Question: How can we calculate the length of the margin as a function of $w$?
The following diagram shows a point $x$ and its projection $x_p$ onto the separating
hyperplane, where $r$ is defined as the distance between the data point $x$ and the
hyperplane.

[Figure: a separating hyperplane with normal vector $w$; everything above it is classified
positive, everything below is negative. The margin depends on the closest points, the
SUPPORT VECTORS.]

Note that $w$ is a vector perpendicular to the hyperplane, so we can write
$x = x_p + r \frac{w}{\|w\|}$. Then:

$f(x) = w^T x + b = w^T x_p + b + r \frac{w^T w}{\|w\|} = r\|w\|$ (since $f(x_p) = 0$)

Therefore: $r = \frac{f(x)}{\|w\|}$

Now, solve for the margin length $\rho$: the support vectors satisfy $f(x) = +1$ on one side
and $f(x) = -1$ on the other, so each lies at distance $1/\|w\|$ from the hyperplane and

$\rho = \frac{2}{\|w\|}$

Conclusion:

Maximizing the margin is equivalent to minimizing $\|w\|^2$ (since we can ignore the
constant 2 above).
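The distance formula and the margin length can be checked on a small example. A minimal numpy sketch, assuming a canonical hyperplane (values are illustrative):

```python
import numpy as np

# Hypothetical canonical hyperplane: w, b scaled so the closest points give |f(x)| = 1.
w = np.array([1.0, 1.0])
b = -1.0

def f(x):
    return w @ x + b

x_sv = np.array([1.0, 1.0])          # a point with f(x) = +1 (a support vector)
r = f(x_sv) / np.linalg.norm(w)      # distance from x_sv to the hyperplane
rho = 2.0 / np.linalg.norm(w)        # margin length

# Each support vector sits at half the margin from the hyperplane:
assert np.isclose(r, rho / 2)
```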

Support Vector Machines: Learning Problem

Assuming a linearly separable dataset, the task of learning the coefficients $w$ and $b$ of a
support vector machine reduces to solving the following constrained optimization problem:

find $w$ and $b$ that minimize: $\frac{1}{2}\|w\|^2$

subject to: $y_i(w^T x_i + b) \ge 1,\ i = 1, \dots, N$

[Figure: the maximal-margin hyperplane; the margin $\rho = 2/\|w\|$ is determined by the
SUPPORT VECTORS.]

This is a quadratic optimization problem with linear constraints. In general, it could be
solved in $O(M^3)$ time. This optimization problem can be solved by using the Lagrangian
function defined as:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$, such that $\alpha_i \ge 0$

where $\alpha_1, \alpha_2, \dots, \alpha_N$ are Lagrange multipliers and
$\alpha = [\alpha_1, \alpha_2, \dots, \alpha_N]^T$.

The solution of the original constrained optimization problem is determined by the
saddle point of $L(w, b, \alpha)$, which has to be minimized with respect to $w$ and $b$ and
maximized with respect to $\alpha$.

Comments about the Lagrange multipliers:

If $y_i(w^T x_i + b) - 1 > 0$, the value of $\alpha_i$ that maximizes $L(w, b, \alpha)$ is
$\alpha_i = 0$.

If $y_i(w^T x_i + b) - 1 < 0$, the value of $\alpha_i$ that maximizes $L(w, b, \alpha)$ is
$\alpha_i = +\infty$. However, since $w$ and $b$ are trying to minimize $L(w, b, \alpha)$, they
will be changed so that $y_i(w^T x_i + b)$ becomes at least equal to 1.

From this brief discussion, the so-called Karush-Kuhn-Tucker (KKT) conditions follow:

$\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0,\ i = 1, \dots, N$

For the data points satisfying $y_i(w^T x_i + b) = 1$ it follows that $\alpha_i > 0$. These
data points are called the support vectors.

Optimality conditions:

The necessary conditions for the saddle point of $L(w, b, \alpha)$ are:

$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i y_i x_i$   (***)

$\frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i y_i = 0$

By replacing (***) into $L(w, b, \alpha)$, a dual optimization problem is constructed as:

Find $\alpha$ that maximizes $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to $\alpha_i \ge 0,\ i = 1, \dots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

This is a convex quadratic programming problem, so there is a global minimum. There
are a number of optimization routines capable of solving this optimization problem. The
optimization can be solved in $O(N^3)$ time (cubic in the size of the training data) and in
linear time in the number of attributes. (Compare this to neural networks, which are
trained in $O(N)$ time.)
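As a concrete illustration, the dual above can be solved numerically for a tiny linearly separable toy set. A sketch using scipy's SLSQP solver (the data set, starting point, and tolerances are illustrative, not from the lecture):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (illustrative).
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)  # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative of the dual objective: maximize sum(a) - 0.5 a^T G a
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

cons = ({'type': 'eq', 'fun': lambda a: a @ y},)  # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * N                          # alpha_i >= 0

res = minimize(neg_dual, np.zeros(N), bounds=bnds, constraints=cons,
               method='SLSQP')
alpha = res.x

w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i   (equation ***)
sv = alpha > 1e-6                # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)   # from y_i (w^T x_i + b) = 1

print(np.sign(X @ w + b))        # reproduces the training labels y
```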

Given the values of $\alpha_1, \alpha_2, \dots, \alpha_N$ obtained by solving the dual problem,
the value of $b$ can be calculated by remembering that all support vectors satisfy
$y_i(w^T x_i + b) = 1$. By replacing equation (***) into this equation we get
$b = y_i - \sum_{j=1}^{N} \alpha_j y_j x_j^T x_i$ for any support vector $x_i$.

Support Vector Machine: Final Predictor

Given the values $\alpha_1, \alpha_2, \dots, \alpha_N$ and $b$ obtained by solving the dual
problem, the final SVM predictor can be expressed from (***) as

$f(x) = \sum_{i=1}^{N} \alpha_i y_i x_i^T x + b$

Important comments:

- To train the SVM, all data points from the training data are consulted.
- Since $\alpha_i \ne 0$ only for the support vectors, only the support vectors are used in
  giving a prediction.
- Note that $x_i^T x$ is a scalar.
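The comments above can be observed with an off-the-shelf solver. A sketch using scikit-learn's `SVC` (assuming scikit-learn is available; the data set and the large-C trick to approximate a hard margin are illustrative choices):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable set (illustrative).
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1e6)  # very large C approximates the hard-margin SVM
clf.fit(X, y)

# Only the support vectors carry nonzero alpha_i; they fully determine the predictor.
print(clf.support_vectors_)  # the support vectors
print(clf.dual_coef_)        # alpha_i * y_i for the support vectors
print(clf.predict(X))        # reproduces the training labels here
```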

Support Vector Machines on Noisy Data

So far, we discussed the construction of support vector machines on linearly separable
training data. This is a very strong assumption that is unrealistic in most real-life
applications.

Question: What should we do if the training data set is not linearly separable?

Solution: Introduce the slack variables $\xi_i,\ i = 1, 2, \dots, N$, to relax the constraint
$y_i(w^T x_i + b) \ge 1$ to $y_i(w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. Ideally, one would
prefer all slack variables to be zero, which would correspond to the linearly separable case.
We introduce a penalty if for some $i$, $\xi_i > 0$. Therefore, the optimization problem for
construction of an SVM on linearly nonseparable data is defined as:

find $w$ and $b$ that minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$

subject to: $y_i(w^T x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, N$

where $C > 0$ is an appropriately selected parameter (the so-called slack parameter). The
additional term $C \sum_i \xi_i$ enforces all slack variables to be as close to zero as possible.

Dual problem:

As in the linearly separable problem, this optimization problem can be converted to its dual
problem:

find $\alpha$ that maximizes $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^T x_j$

subject to $0 \le \alpha_i \le C,\ i = 1, \dots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

NOTE: The consequence of introducing the parameter $C$ is in constraining the range of
acceptable values of the Lagrange multipliers $\alpha_i$. The most appropriate choice of $C$
will depend on the specific data set available.
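The box constraint $0 \le \alpha_i \le C$ can be verified on overlapping data. A sketch with scikit-learn's `SVC` (assuming scikit-learn is available; the data set and the two values of C are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Overlapping (not linearly separable) toy data -- illustrative.
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.5, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

for C in (0.1, 10.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    alphas = np.abs(clf.dual_coef_).ravel()  # |alpha_i * y_i| = alpha_i
    # The dual box constraint 0 <= alpha_i <= C holds for every multiplier.
    assert alphas.max() <= C + 1e-8
    print(C, len(clf.support_vectors_))
```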

Support Vector Machines for Nonlinear Classification

Problem: Support vector machines represented with a linear function $f(x)$ (i.e., a
separating hyperplane) have very limited representational power. As such, they could not
be very useful in practical classification problems.

Good News: With a slight modification, SVM can solve highly nonlinear classification
problems!!

Justification: Cover's Theorem

Suppose that a data set D is nonlinearly separable in the original attribute space. The
attribute space can be transformed into a new attribute space where D is linearly
separable!

Caveat: Cover's Theorem only proves the existence of a transformed attribute space that
could solve the nonlinear problem. It does not provide a guideline for the construction of
the attribute transformation!

Example 1. The XOR problem

By constructing a new attribute $X_3 = X_1 X_2$, the XOR problem becomes linearly
separable in the new attribute space $(X_1, X_2, X_3)$.
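This can be checked directly. A minimal numpy sketch (the separating weights below are hand-picked for illustration, not part of the lecture):

```python
import numpy as np

# XOR: not linearly separable in (x1, x2) alone.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])   # XOR labels in +/-1 coding

# New attribute x3 = x1 * x2:
x3 = X[:, 0] * X[:, 1]

# A linear function of (x1, x2, x3) now separates the classes,
# e.g. f = x1 + x2 - 2*x3 - 0.5 (hand-picked illustrative weights).
f = X[:, 0] + X[:, 1] - 2 * x3 - 0.5
print(np.sign(f))   # matches the XOR labels y
```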

Example 2. Second-order monomials derived from the original two-dimensional
attribute space: $z = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$.

Example 3. Fifth-order monomials derived from the original 256-dimensional attribute
space. There are $\binom{260}{5} \approx 10^{10}$ such monomials, which is an extremely
high-dimensional attribute space!!

SVM and the curse of dimensionality:

If the original attribute space is transformed into a very high-dimensional space, the
likelihood of being able to solve the nonlinear classification problem increases. However,
one is likely to quickly encounter the curse-of-dimensionality problem.

The strength of SVM lies in the theoretical justification that margin maximization is an
effective mechanism for alleviating the curse-of-dimensionality problem (i.e., the SVM is
the simplest classifier that solves the given classification problem). Therefore, SVMs are
able to successfully solve classification problems with extremely high attribute
dimensionality!!

SVM solution

Denote $\Phi: \mathbb{R}^M \to F$ as a mapping from the original M-dimensional attribute
space to the high-dimensional attribute space $F$. The SVM is obtained by solving the
following dual problem:

find $\alpha$ that maximizes $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \Phi(x_i)^T \Phi(x_j)$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

Practical Problem: Although SVMs are successful in dealing with high-dimensional
attribute spaces, the fact that SVM training scales linearly with the number of attributes,
together with limited memory space, could largely limit the choice of mapping $\Phi$.

Solution: the Kernel Trick

For a certain class of mappings $\Phi$ it is possible to compute the scalar products
$\Phi(x_i)^T \Phi(x_j)$ in the original attribute space. For example, in some cases we can
replace $\Phi(x_i)^T \Phi(x_j)$ by a kernel function $K$ where
$\Phi(x_i)^T \Phi(x_j) = K(\|x_i - x_j\|)$, meaning that the scalar product depends only on
the distance between the original points $x_i$ and $x_j$.

Examples of kernel functions:

- Linear kernel: $K(x_i, x_j) = x_i^T x_j$
- Gaussian kernel: $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / A)$, where A is a constant
- Polynomial kernel: $K(x_i, x_j) = (x_i^T x_j + 1)^B$, where B is a constant

Getting back to Example 2 (second-order monomials derived from the original
two-dimensional attribute space), it can be shown that the scalar product between two $z$
vectors satisfies:

$z_i^T z_j = (x_i^T x_j)^2$
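This identity is easy to verify numerically. A minimal numpy sketch using the monomial mapping from Example 2 (the random test points are illustrative):

```python
import numpy as np

def z(x):
    """Second-order monomial features of a 2-D point (with the sqrt(2) scaling)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
xi, xj = rng.normal(size=2), rng.normal(size=2)

lhs = z(xi) @ z(xj)    # scalar product in the transformed space
rhs = (xi @ xj) ** 2   # the same value computed in the original space
assert np.isclose(lhs, rhs)
```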

Reformulation of the SVM Problem

The dual problem: find $\alpha$ that maximizes
$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.

The resulting SVM is: $f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b$

Therefore, both the optimization problem and the SVM prediction depend only on the kernel
distances between points. This gives rise to applications where data points are objects that
are not represented by a vector of attribute values. Objects can be as complex as images or
text documents. As long as there is a way to calculate the kernel distance between the
objects, the SVM approach can be used.

Kernel Choice

A necessary and sufficient condition for a function $K(x, y)$ to be a valid kernel is that
the Gram matrix $K$ with elements $\{K(x_i, x_j)\}$ is positive semidefinite for all possible
choices of a data set $\{x_i,\ i = 1 \dots N\}$. Section 6.2 in the textbook gives an overview
of the rules for the construction of valid kernels.
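The positive-semidefiniteness condition can be spot-checked numerically for a given kernel. A sketch for the Gaussian kernel (the random data set and the constant A are illustrative):

```python
import numpy as np

def gaussian_kernel(xi, xj, A=1.0):
    # Gaussian kernel from the notes; A is the width constant.
    return np.exp(-np.sum((xi - xj) ** 2) / A)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))   # illustrative data set

K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# A valid kernel must yield a positive semidefinite Gram matrix:
eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric Gram matrix
assert eigvals.min() > -1e-10
```

This is only a necessary check on one data set, of course; validity requires positive semidefiniteness for every possible data set.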

Some practical issues with SVM

Modeling choices: When using one of the available SVM software packages or toolboxes, a
user should choose (1) the kernel function (e.g., a Gaussian kernel) and its parameter(s)
(e.g., the constant A), and (2) the constant $C$ related to the slack variables. Several
choices should be examined using a validation set in order to find the best SVM.

SVM training does not scale well with the size of the training data (i.e., scaling as
$O(N^3)$). There are several solutions that offer a speed-up of the original SVM algorithm:

- chunking: start with a subset of D, build an SVM, apply it on all data, add
  "problematic" data points into the training data, remove "nice" points, repeat.
- decomposition: similar to chunking, but the size of the subset is kept constant.
- sequential minimal optimization (SMO): an extreme version of chunking; only 2 data
  points are used in each iteration.

SVM-based solutions exist for problems outside binary classification:

- multi-class classification problems
- SVM for regression
- kernel PCA
- kernel Fisher discriminant
- clustering