Support Vector Machines (SVM)
Used mostly for classification (they can also be modified for regression and even for unsupervised learning applications). SVMs achieve accuracy comparable to, and in some cases better than, neural networks.
Assume the training data D = {(x_i, y_i), i = 1, 2, …, N}, with x_i ∈ R^M and y_i ∈ {−1, +1}, is linearly separable (separable by a hyperplane).
Question: What is the best linear classifier of the type

    f(x) = w^T x + b ?

While there can be an infinite number of hyperplanes that achieve 100% accuracy on the training data, the question is which hyperplane is optimal with respect to the accuracy on test data.
Common-sense solution: we want to increase the gap (margin) between the positive and negative cases as much as possible. The best linear classifier is the hyperplane in the middle of the gap.
Given f(x), the classification is obtained as

    y = sign(f(x))
Important note: Different w and b can result in identical classification. For example, we can apply any scalar a > 0 such that

    w → a·w,  b → a·b,  since  sign(a·(w^T x + b)) = sign(w^T x + b).

Therefore there are many identical solutions.
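As a quick numerical check (a minimal sketch; the vectors and the scale factor a = 7.3 are arbitrary), rescaling w and b leaves every predicted label unchanged:

```python
# Sketch: rescaling (w, b) by any a > 0 leaves the classification unchanged,
# since sign(a*(w^T x + b)) = sign(w^T x + b).
import numpy as np

w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 3.0]])   # three arbitrary points

labels = np.sign(X @ w + b)
labels_scaled = np.sign(X @ (7.3 * w) + 7.3 * b)       # a = 7.3
```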
Definitions of SVM and Margin
To prevent problems caused by multiple identical solutions, we add the following requirement: find (w, b) with maximal margin, such that for the points closest to the separating hyperplane (also called the support vectors),

    |w^T x + b| = 1,

and for all other points,

    |w^T x + b| > 1.
Illustration: [figure omitted]
Question: How can we calculate the length of the margin as a function of w?
The following diagram shows a point x and its projection x_p onto the separating hyperplane, where r is defined as the distance between the data point x and the hyperplane.
[Figure: everything above the hyperplane is positive, everything below is negative; w is perpendicular to the hyperplane and r is the distance from x to it; the margin depends on the closest points, the SUPPORT VECTORS.]
Note that w is a vector perpendicular to the hyperplane, so we have:

    x = x_p + r · w/||w||

and therefore

    f(x) = w^T x + b = w^T x_p + b + r·||w|| = r·||w||    (since f(x_p) = w^T x_p + b = 0)

Therefore:

    r = f(x) / ||w||

Now, solve for the margin length ρ: the closest points on the two sides satisfy f(x) = +1 and f(x) = −1, so

    ρ = 2 / ||w||

Conclusion: Maximizing the margin is equivalent to minimizing ||w|| (since we can ignore the constant 2 above).
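A small numerical illustration of the derivation above (the hyperplane w = (3, 4), b = −1 and the chosen points are arbitrary): the signed distance of a point to the hyperplane is f(x)/||w||, and the two margin hyperplanes f(x) = ±1 are 2/||w|| apart.

```python
# Sketch: distance to a hyperplane and the margin rho = 2/||w||.
import numpy as np

w = np.array([3.0, 4.0])          # ||w|| = 5
b = -1.0

def f(x):
    return w @ x + b

x_on_plus = np.array([3.0, -1.75])   # f(x) = +1: lies on the positive margin
x_on_minus = np.array([0.0, 0.0])    # f(x) = -1: lies on the negative margin

r_plus = f(x_on_plus) / np.linalg.norm(w)    # signed distance, +1/||w||
r_minus = f(x_on_minus) / np.linalg.norm(w)  # signed distance, -1/||w||
rho = r_plus - r_minus                       # margin width
```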
Support Vector Machines: Learning Problem
Assuming a linearly separable dataset, the task of learning the coefficients w and b of a support vector machine reduces to solving the following constrained optimization problem:

find w and b that minimize:

    ½ ||w||²

subject to:

    y_i (w^T x_i + b) ≥ 1,  i = 1, 2, …, N
[Figure: the two classes separated by the maximal-margin hyperplane; the margin ρ = 2/||w|| is determined by the SUPPORT VECTORS.]
This is a quadratic optimization problem with linear constraints. In general, it could be solved in O(M³) time.
This optimization problem can be solved by using the Lagrangian function defined as:

    L(w, b, α) = ½ ||w||² − Σ_{i=1}^N α_i [ y_i (w^T x_i + b) − 1 ],  such that α_i ≥ 0,

where α_1, α_2, …, α_N are Lagrange multipliers and α = [α_1, α_2, …, α_N]^T.
The solution of the original constrained optimization problem is determined by the saddle point of L(w, b, α), which has to be minimized with respect to w and b and maximized with respect to α.
Comments about Lagrange multipliers:

If y_i (w^T x_i + b) − 1 > 0, the value of α_i that maximizes L(w, b, α) is α_i = 0.

If y_i (w^T x_i + b) − 1 < 0, the value of α_i that maximizes L(w, b, α) is α_i = +∞. However, since w and b are trying to minimize L(w, b, α), they will be changed in such a way as to make y_i (w^T x_i + b) at least equal to 1.

From this brief discussion, the so-called Kuhn–Tucker conditions follow:

    α_i [ y_i (w^T x_i + b) − 1 ] = 0,  i = 1, 2, …, N
For the data points satisfying y_i (w^T x_i + b) = 1 it follows that α_i > 0. These data points are called the support vectors.
Optimality conditions:
The necessary conditions for the saddle point of L(w, b, α) are:

    ∂L/∂w = 0  and  ∂L/∂b = 0

Solving for the necessary conditions results in:

    w = Σ_{i=1}^N α_i y_i x_i        (***)
    Σ_{i=1}^N α_i y_i = 0

By replacing (***) into the Lagrangian function and by using Σ_{i=1}^N α_i y_i = 0 as a new constraint, the dual optimization problem is constructed as:

Find α that maximizes

    D(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j

subject to

    α_i ≥ 0, i = 1, 2, …, N,  and  Σ_{i=1}^N α_i y_i = 0
This is a convex quadratic programming problem, so there is a single global optimum. There are a number of optimization routines capable of solving this optimization problem. The optimization can be solved in O(N³) time (cubic in the size of the training data) and in linear time in the number of attributes. (Compare this to neural networks, which are trained in O(N) time.)
Given the values of α_1, α_2, …, α_N obtained by solution of the dual problem, the value of b can be calculated by remembering that every support vector x_s satisfies y_s (w^T x_s + b) = 1. By replacing equation (***) into this equation we get

    b = y_s − Σ_{i=1}^N α_i y_i x_i^T x_s
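The dual problem above can be solved directly with an off-the-shelf optimizer. A minimal sketch using SciPy's SLSQP routine on a tiny, hand-made separable dataset (the data and all variable names are illustrative); w is then recovered from equation (***) and b from a support vector:

```python
# Sketch: solving the SVM dual max D(alpha) s.t. alpha >= 0, sum(alpha_i y_i) = 0
# for a tiny linearly separable 2-D dataset, using scipy's SLSQP solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

Yx = y[:, None] * X
Q = Yx @ Yx.T                      # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):
    # negative of D(alpha) = sum(alpha) - 1/2 alpha^T Q alpha (we minimize)
    return -(a.sum() - 0.5 * a @ Q @ a)

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum alpha_i y_i = 0
bnds = [(0.0, None)] * N                        # alpha_i >= 0
res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=bnds, constraints=cons)

alpha = res.x
w = (alpha * y) @ X                # equation (***)
sv = int(np.argmax(alpha))         # index of one support vector
b = y[sv] - w @ X[sv]              # b from y_s (w^T x_s + b) = 1
```

All training points should end up on the correct side with functional margin at least 1, which is exactly the primal constraint.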
Support Vector Machine: Final Predictor
Given the values α_1, α_2, …, α_N and b obtained by solution of the dual problem, the final SVM predictor can be expressed from (***) as

    f(x) = Σ_{i=1}^N α_i y_i x_i^T x + b
Important comments:
o To train the SVM, all data points from the training data are consulted.
o Since α_i ≠ 0 only for the support vectors, only the support vectors are used in giving a prediction.
o Note that x_i^T x is a scalar.
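This can be seen concretely in scikit-learn's SVC, whose fitted model stores only the support vectors and their coefficients α_i·y_i (in dual_coef_); a sketch on a toy dataset (the data, the test point, and the large C used to approximate the hard-margin case are illustrative):

```python
# Sketch: the SVM prediction f(x) = sum_i alpha_i y_i x_i^T x + b uses only
# the support vectors, as stored by scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

x_new = np.array([1.0, 0.5])
# dual_coef_ holds alpha_i * y_i for the support vectors only
f_manual = (clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_)[0]
f_sklearn = clf.decision_function([x_new])[0]
```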
Support Vector Machines on Noisy Data
So far, we have discussed the construction of support vector machines on linearly separable training data. This is a very strong assumption that is unrealistic in most real-life applications.
Question: What should we do if the training data set is not linearly separable?
Solution: Introduce the slack variables ξ_i, i = 1, 2, …, N, to relax the constraint y_i (w^T x_i + b) ≥ 1 to

    y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0.

Ideally, one would prefer all slack variables to be zero, which would correspond to the linearly separable case. We introduce a penalty if for some i, ξ_i > 0.
Therefore, the optimization problem for construction of an SVM on linearly nonseparable data is defined as:

find w and b that minimize:

    ½ ||w||² + C Σ_{i=1}^N ξ_i

subject to:

    y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, 2, …, N,

where C > 0 is an appropriately selected parameter (the so-called slack parameter). The additional term C Σ_{i=1}^N ξ_i enforces all slack variables to be as close to zero as possible.
Dual problem:
As in the linearly separable problem, this optimization problem can be converted to its dual problem:

find α that maximizes

    D(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j

subject to

    0 ≤ α_i ≤ C, i = 1, 2, …, N,  and  Σ_{i=1}^N α_i y_i = 0
NOTE: The consequence of introducing the parameter C is in constraining the range of acceptable values of the Lagrange multipliers α_i to 0 ≤ α_i ≤ C. The most appropriate choice for C will depend on the specific data set available.
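A quick empirical sketch of the role of C, using scikit-learn's SVC on synthetic data with 10% label noise (the dataset and the two C values are illustrative): a small C tolerates large slacks and yields a wide margin with many support vectors, while a large C penalizes slack heavily and typically keeps fewer.

```python
# Sketch: effect of the slack parameter C on noisy (non-separable) data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linear concept...
flip = rng.random(200) < 0.1                 # ...with 10% label noise
y[flip] *= -1

n_sv = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv[C] = clf.n_support_.sum()           # number of support vectors
```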
Support Vector Machines for Nonlinear Classification
Problem: Support vector machines represented with a linear function f(x) (i.e., a separating hyperplane) have very limited representational power. As such, they could not be very useful in practical classification problems.
Good News: With a slight modification, SVMs can solve highly nonlinear classification problems!!
Justification: Cover's Theorem
Suppose that data set D is nonlinearly separable in the original attribute space. The attribute space can be transformed into a new attribute space where D is linearly separable!
Caveat: Cover's Theorem only proves the existence of the transformed attribute space that could solve the nonlinear problem. It does not provide the guideline for the construction of the attribute transformation!
Example 1. XOR problem
By constructing a new attribute X1' = X1·X2, the XOR problem becomes linearly separable by the new attribute X1'.
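A minimal numeric check of Example 1 (the encoding of the XOR inputs and outputs in {−1, +1} is illustrative): a simple threshold on the single constructed attribute X1' = X1·X2 reproduces the XOR labels, which no threshold on X1 or X2 alone can do.

```python
# Sketch: XOR becomes separable by a threshold on the new attribute X1' = X1*X2.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])     # XOR-style labels in {-1, +1}

x1p = X[:, 0] * X[:, 1]              # constructed attribute X1'
pred = np.where(x1p < 0, 1, -1)      # a single threshold on X1' alone
```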
Example 2. Second-order monomials derived from the original two-dimensional attribute space:

    z = Φ(x) = (x1², √2·x1·x2, x2²)
Example 3. Fifth-order monomials derived from the original 256-dimensional attribute space
There are C(256+5−1, 5) = C(260, 5) ≈ 10^10 such monomials, which is an extremely high-dimensional attribute space!!
SVM and curse-of-dimensionality:
If the original attribute space is transformed into a very high-dimensional space, the likelihood of being able to solve the nonlinear classification increases. However, one is likely to quickly encounter the curse-of-dimensionality problem.
The strength of SVM lies in the theoretical justification that margin maximization is an effective mechanism for alleviating the curse-of-dimensionality problem (i.e., the SVM is the simplest classifier that solves the given classification problem). Therefore, SVMs are able to successfully solve classification problems with extremely high attribute dimensionality!!
SVM solution
Denote by Φ: R^M → F a mapping from the original M-dimensional attribute space to the high-dimensional attribute space F.
By solving the following dual problem:

find α that maximizes

    D(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j Φ(x_i)^T Φ(x_j)

subject to

    0 ≤ α_i ≤ C, i = 1, 2, …, N,  and  Σ_{i=1}^N α_i y_i = 0,

the resulting SVM is of the form

    f(x) = Σ_{i=1}^N α_i y_i Φ(x_i)^T Φ(x) + b
Practical Problem: Although SVMs are successful in dealing with highly dimensional attribute spaces, the fact that SVM training scales linearly with the number of attributes, together with limited memory space, could largely limit the choice of mapping Φ.
Solution: Kernel Trick
For a certain class of mappings Φ it is possible to compute the scalar products Φ(x_i)^T Φ(x_j) in the original attribute space. For example, in some cases we could replace Φ(x_i)^T Φ(x_j) by a kernel function K where

    K(x_i, x_j) = K(||x_i − x_j||),

meaning that the scalar product depends only on the distance between the original points x_i and x_j.
Examples of kernel functions:
o Linear kernel: K(x_i, x_j) = x_i^T x_j
o Gaussian kernel: K(x_i, x_j) = exp(−||x_i − x_j||² / A), where A is a constant
o Polynomial kernel: K(x_i, x_j) = (x_i^T x_j + 1)^B, where B is a constant
Get back to Example 2. For the second-order monomials derived from the original two-dimensional attribute space, it can be shown that the scalar product between two z vectors satisfies the following:

    z_i^T z_j = (x_i^T x_j)²
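This identity is easy to verify numerically (a minimal sketch; the two sample points are arbitrary): with Φ(x) = (x1², √2·x1·x2, x2²), the scalar product in the new space equals the squared scalar product in the original space.

```python
# Sketch: phi(x)^T phi(y) = (x^T y)^2 for the second-order monomial map.
import numpy as np

def phi(x):
    # second-order monomial features of a 2-D point
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.5, -0.5])
y = np.array([0.3, 2.0])

lhs = phi(x) @ phi(y)     # scalar product in the transformed space
rhs = (x @ y) ** 2        # kernel evaluated in the original space
```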
Reformulation of the SVM Problem
The dual problem:

find α that maximizes

    D(α) = Σ_{i=1}^N α_i − ½ Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j)

subject to

    0 ≤ α_i ≤ C, i = 1, 2, …, N,  and  Σ_{i=1}^N α_i y_i = 0

The resulting SVM is:

    f(x) = Σ_{i=1}^N α_i y_i K(x_i, x) + b

Therefore, both the optimization problem and the SVM prediction depend only on the kernel distances between points. This gives rise to applications where data points are objects that are not represented by a vector of attribute values. Objects could be as complex as images or text documents. As long as there is a way to calculate the kernel distance between the objects, the SVM approach can be used.
Kernel Choice
A necessary and sufficient condition for a function K(x, y) to be a valid kernel is that the Gram matrix K with elements {K(x_i, x_j)} is positive semidefinite for all possible choices of a data set {x_i, i = 1…N}. Section 6.2 in the textbook gives an overview of the rules for construction of valid kernels.
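An empirical sketch of this condition for the Gaussian kernel (the random dataset and the constant A are arbitrary): the smallest eigenvalue of the Gram matrix should be non-negative up to floating-point error.

```python
# Sketch: checking positive semidefiniteness of a Gaussian-kernel Gram matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
A = 2.0

# Pairwise squared distances, then K_ij = exp(-||x_i - x_j||^2 / A)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / A)

min_eig = np.linalg.eigvalsh(K).min()   # should be >= 0 (up to rounding)
```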
Some practical issues with SVM
Modeling choices: When using some of the available SVM software packages or toolboxes, a user should choose (1) the kernel function (e.g., the Gaussian kernel) and its parameter(s) (e.g., the constant A), and (2) the constant C related to the slack variables. Several choices should be examined using a validation set in order to find the best SVM.
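A minimal sketch of such a search with scikit-learn (the nonlinear toy concept, the candidate grids, and the use of gamma as the Gaussian-kernel parameter, which plays the role of 1/A, are all illustrative):

```python
# Sketch: picking kernel parameter and C by accuracy on a held-out validation set.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5, 1, -1)   # nonlinear concept

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Evaluate each (gamma, C) candidate on the validation set; keep the best.
best = max(
    ((SVC(kernel="rbf", gamma=g, C=C).fit(X_tr, y_tr).score(X_val, y_val), g, C)
     for g in (0.1, 1.0) for C in (0.1, 10.0)),
    key=lambda t: t[0],
)
best_acc = best[0]
```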
SVM training does not scale well with the size of the training data (i.e., scaling as O(N³)). There are several solutions that offer a speed-up of the original SVM algorithm:
o chunking: start with a subset of D, build an SVM, apply it to all data, add "problematic" data points into the training data, remove "nice" points, repeat.
o decomposition: similar to chunking, but the size of the subset is kept constant.
o sequential minimal optimization: an extreme version of chunking; only 2 data points are used in each iteration.
SVM-based solutions exist for problems outside binary classification:
o multi-class classification problems
o SVM for regression
o kernel PCA
o kernel Fisher discriminant
o clustering