
Jordan Smith

MUMT 611

Written summary of classifiers

18 February 2008



A Review of Support Vector Machines


Abstract

A support vector machine (SVM) is a learning machine that can be used for classification problems (Cortes 1995) as well as for regression and novelty detection (Bennett 2000). SVMs look for the hyperplane that optimally separates two classes of data. Important features of SVMs are the absence of local minima, the well-controlled capacity of the solution (Cristianini 2000), and the ability to handle high-dimensional input data efficiently (Cortes 1995). The SVM is conceptually simple but powerful: despite its relative youth, it has performed well against other popular classifiers (Meyer 2002, 2003) and has been applied to problems in several fields, including music information retrieval.


1. Introduction

1.1 History


The support vector machine was developed quite recently, emerging only in the early 1990s. However, it is also the product of decades of research in computational learning theory by the Russian mathematicians Vladimir Vapnik and Alexey Chervonenkis. Their resulting theory, summarized in Vapnik's 1982 book Estimation of Dependences Based on Empirical Data, has been called Vapnik-Chervonenkis or VC theory (Vapnik 2006). That book describes the implementation of a support vector machine for linearly separable training data (Cortes 1995). Beginning in the early 1990s, researchers at Bell Labs made a number of important extensions to the SVM: in 1992, Boser, Guyon, and Vapnik proposed using Aizerman's kernel trick to classify data that may only be separable by polynomial or radial basis functions; in 1995, Cortes and Vapnik extended the theory to handle non-separable training data by using a cost function; finally, a method of support vector regression was developed in 1996 (Drucker 1996).


1.2 Summary

This very brief introduction to SVMs will first describe, in historical order: the use of an SVM to classify linearly separable data; the use of a kernel function to perform non-linear classification; and the use of a cost function to allow for non-separable data. In section 3, a number of studies using SVMs are described, including several related to music information retrieval. Finally, some studies evaluating the performance of SVMs are summarized.


2. Support Vector Machines

2.1 Linear, separable data

The basic problem that an SVM learns to solve is a two-category classification problem. Following Bennett's discussion (2000), suppose we have a set of l observations.

Figure 1. Two data sets, represented by squares and circles, are separated by two parallel hyperplanes subtended by support vectors (circled). The distance between these planes, the margin, is the quantity maximized by an SVM. The solid line is the optimal separating hyperplane.

Figure 2. Visualization of the kernel trick. Input data are mapped into a higher-dimensional feature space using a kernel function, resulting in linearly separable training data. Source: Holbrey, R. Dimension Reduction Algorithms for Data Mining and Visualization. <http://www.comp.leeds.ac.uk/richardh/astro/index.html> Accessed 12 February 2008.

Each observation can be represented by a pair {x_i, y_i}, where x_i ∈ R^N and y_i ∈ {−1, 1}. That is, each observation contains an N-dimensional vector x and a class assignment y.

Our goal is to
find the optimal separating hyperplane;
that is,
th
e

flat (N
-
1)
-
dimensional
surface that
best
separates the data
.

For the time being we assume
that a separating hyperplane exists
, and
is defined by the normal vector
w
.

On
either side of this plane we construct
a

pair of parallel

plane
s

such that:

w∙x_i − b ≥ +1   for y_i = +1
w∙x_i − b ≤ −1   for y_i = −1

where b indicates the offset of the plane from the origin. This situation is pictured in Figure 1, where the separating plane is the solid line and the two parallel planes are the dashed lines. The dashed lines ‘push up’ against some of the training data points: these points are called ‘support vectors,’ and in fact they completely determine the solution. The gap between these planes is called the margin, and we wish to maximize its size. The width of the margin is 2/||w||, so maximizing it is equivalent to minimizing:

½||w||^2

subject to the constraint:

y_i (w∙x_i − b) ≥ 1

The solution can be obtained using Lagrange multipliers (Burges 1998).
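To make the preceding formulation concrete, here is a minimal sketch (an illustration, not taken from any of the cited papers) of fitting a linear SVM to separable two-dimensional data. It uses scikit-learn's SVC class, which wraps the LIBSVM library cited below (Chang 2007); the toy data, the large value of C used to approximate the hard-margin case, and the variable names are all assumptions made for the example.

```python
# Minimal sketch (assumed toy data): a hard-margin-style linear SVM.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters: class -1 near (0, 0), class +1 near (3, 3).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 3.0])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin case described above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# Recover w and b for the hyperplane w.x - b = 0 used in the text
# (scikit-learn stores the decision function as w.x + intercept_).
w = clf.coef_[0]
b = -clf.intercept_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin width 2/||w||:", 2.0 / np.linalg.norm(w))
```

As the text notes, only the support vectors reported by the fit actually determine w and b; the remaining training points could be removed without changing the solution.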




2.2 Kernel functions

Often, a non-linear decision surface is required to separate the data. Repeating the above steps to maximize the separation between two classes with a non-linear boundary can be computationally expensive. Instead, the kernel trick is used: input data are mapped into a higher-dimensional feature space via a specified kernel function, and the data become linearly separable in that higher-dimensional space (see Figure 2). Furthermore, if a good kernel function is selected, the dot product is preserved in the feature space (Cortes 1995), so that the mathematical approach outlined in section 2.1 is still applicable. The kernel functions that have been used and whose properties have been studied most extensively are linear and polynomial functions, the radial-basis function, and the sigmoid function (Sherrod 2008).
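To illustrate the kernel trick in practice, the following sketch (with an assumed toy data set; none of this comes from the cited papers) compares the four kernels named above on concentric-circle data that no hyperplane in the original two-dimensional input space can separate:

```python
# Sketch: comparing kernels on data that are not linearly separable
# in the input space (two concentric circles).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```

The radial-basis-function kernel typically separates the two circles cleanly, while the linear kernel cannot, which is exactly the situation Figure 2 depicts.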


2.3 Non-separable data

A method of accommodating errors and outliers in the input data was developed in 1995 (Cortes 1995). It can be implemented simply by allowing an error of up to ξ_i for each observation (resulting in a ‘fuzzy margin’) and adding a cost term, weighted by a parameter C, to the optimization (Burges 1998). We then want to minimize:

½||w||^2 + C (Σ_i ξ_i)

subject to the constraints:

y_i (w∙x_i − b) + ξ_i ≥ 1,   ξ_i ≥ 0

(Bennett 2000). This is substantially harder to solve than the separable case. In Chang and Lin's LIBSVM manual, the minimization conditions, constraints, and resulting decision functions are defined for each type of classification, along with the algorithms for solving the required quadratic programming problems (2007).
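The effect of the cost parameter C can be seen in a short sketch like the following (the overlapping toy data and the particular C values are assumptions for illustration, not taken from the cited sources): a small C tolerates more margin violations, while a large C penalizes them heavily.

```python
# Sketch: soft-margin linear SVMs with different cost parameters C
# on overlapping (non-separable) data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Two overlapping Gaussian clusters: no hyperplane separates them perfectly.
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 1.5])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

Smaller values of C generally give a wider, ‘fuzzier’ margin with more support vectors inside it; choosing C (and the kernel parameters) is the model-selection problem discussed in Hsu, Chang, and Lin's practical guide (2007).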


3. Studies using SVMs

3.1 Applications


Throughout his early papers, Vapnik often used optical character recognition as an experimental example application (Boser 1992, Cortes 1995, Schölkopf 1996). (See also Sebastiani 1999, Joachims 1997.) Since then, many authors have used SVMs to develop classifiers in other disciplines: see, for instance, the work on face detection by Osuna et al. (1997b) or on gene expression data by Brown et al. (2000). In the field of music information retrieval, Dhanaraj and Logan used SVMs in their automatic identification of hit songs based on lyrics and acoustic features (2005); Laurier and Herrera submitted a second-place mood classifier to MIREX 2007 that relied on SVMs and acoustic features; and Meng used SVMs at multiple stages in his dissertation, first to perform temporal feature integration and second to perform automatic genre identification based on these features (2006). Both Mandel (2005, 2006) and Xu (2003) have studied musical genre classification using SVMs based on acoustic features.

The free software package LIBSVM is a library of tools for implementing various types of SVMs (Chang 2007), while DTREG can implement a number of predictive models, from SVMs to various types of neural nets and decision trees (Sherrod 2008).


3.2 Performance

According to Vapnik, his SVM handwritten-digit classifier easily outperforms state-of-the-art classifiers based on other learning routines. However, since their rise in popularity in the 1990s, SVMs have been the object of closer scrutiny: a study by Meyer concluded that although SVMs performed very well in classification and regression tasks, other methods were just as competitive (2002).

The two-category classification problem is the classic problem to study analytically, but in practice more categories must often be distinguished. Hsu (2002) compared the performance of various methods of combining binary classifiers, concluding that one-against-one and the ‘directed acyclic graph SVM’ were better than one-against-all.


Bibliography



Bennett, K., and C. Campbell. 2000. Support vector machines: Hype or hallelujah? Special Interest Group on Knowledge Discovery and Data Mining Explorations 2 (2): 1–13.

Boser, B., I. Guyon, and V. Vapnik. 1992. A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual Workshop on Computational Learning Theory, 144–52.

Brown, M., W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares Jr., and D. Haussler. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences 97: 262–267.

Burges, C. 1998. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2 (2): 955–74.

Chang, C., and C. Lin. 2007. LIBSVM: A library for support vector machines. Manual for software available online: <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>

Cortes, C., and V. Vapnik. 1995. Support-vector networks. Machine Learning 20 (3): 273–297.

Cristianini, N., and J. Shawe-Taylor. 2000. Chapter 6: Further reading and advanced topics. In An Introduction to Support Vector Machines. Cambridge: Cambridge University Press. <http://www.support-vector.net/chapter_6.html>

Dhanaraj, R., and B. Logan. 2005. Automatic prediction of hit songs. International Conference on Music Information Retrieval, London, UK, 488–91.

Drucker, H., C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1996. Support vector regression machines. Advances in Neural Information Processing Systems 9: 155–61. Cambridge: MIT Press.


Hsu, C., and C. Lin. 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2): 415–425.

Hsu, C., C. Chang, and C. Lin. 2007. A practical guide to support vector classification. <http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf>

Joachims, T. 1997. Text categorization with support vector machines: Learning with many relevant features. Springer Lecture Notes in Computer Science 1398: 137–42.

Laurier, C., and P. Herrera. 2007. Audio music mood classification using support vector machine. Proceedings of the 8th International Conference on Music Information Retrieval.

Mandel, M., and D. Ellis. 2005. Song-level features and support vector machines for music classification. Proceedings of the 6th International Conference on Music Information Retrieval, 594–599.

Mandel, M., G. Poliner, and D. Ellis. 2006. Support vector machine active learning for music retrieval. Multimedia Systems 12 (1): 3–13.

Meng, A. 2006. Temporal feature integration for music organization. PhD diss., Technical University of Denmark.

Meyer, D., F. Leisch, and K. Hornik. 2002. Benchmarking support vector machines. Report Series SFB, Adaptive Information Systems and Modelling in Economics and Management Science, 78.

Meyer, D., F. Leisch, and K. Hornik. 2003. The support vector machine under test. Neurocomputing 55: 169–86.

Osuna, E., R. Freund, and F. Girosi. 1997a. An improved training algorithm for support vector machines. Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, 276–85.

Osuna, E., R. Freund, and F. Girosi. 1997b. Training support vector machines: An application to face detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 130–7.

Schölkopf, B., C. Burges, and V. Vapnik. 1996. Incorporating invariances in support vector learning machines. Springer Lecture Notes in Computer Science 1112: 47–52.


Sebastiani, F. 1999. Machine learning in automated text categorization. Technical Report, Consiglio Nazionale delle Ricerche, Pisa, Italy. 1–59.

Sherrod, P. 2008. DTREG Predictive Modeling Software. Manual for software available online: <www.dtreg.com>

Smola, A., and B. Schölkopf. 1998. A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030. Holloway College, London.

Vapnik, V. 2006. Empirical Inference Science. Afterword to the 2006 reprint of Estimation of Dependences Based on Empirical Data (1982).

Xu, C., N. Maddage, X. Shao, F. Cao, and Q. Tian. 2003. Musical genre classification using support vector machines. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 5: 429–32.