Smith
1
Jordan Smith
MUMT 611
Written summary of classifiers
18 February 2008
A Review of Support Vector Machines
Abstract
A support vector machine (SVM) is a learning machine that can be used for classification problems (Cortes 1995) as well as for regression and novelty detection (Bennett 2000). SVMs look for the hyperplane that optimally separates two classes of data. Important features of SVMs are the absence of local minima, the well-controlled capacity of the solution (Cristianini 2000), and the ability to handle high-dimensional input data efficiently (Cortes 1995). The SVM is conceptually quite simple but also very powerful: even in its infancy, it has performed well against other popular classifiers (Meyer 2002, 2003), and it has been applied to problems in several fields, including music information retrieval.
1. Introduction
1.1 History
The support vector machine was developed quite recently, emerging only in the early 1990s. However, it is also the product of decades of research in computational learning theory by Russian mathematicians Vladimir Vapnik and Alexey Chervonenkis. Their resulting theory, summarized in Vapnik’s 1982 book Estimation of Dependences Based on Empirical Data, has been called Vapnik-Chervonenkis or VC theory (Vapnik 2006). That book describes the implementation of a support vector machine for linearly separable training data (Cortes 1995).
Beginning in the early 1990s, researchers at Bell Labs made a number of important extensions to the SVM: in 1992, Boser, Guyon and Vapnik proposed using Aizerman's kernel trick to classify data that may be separable only by polynomial or radial basis functions; in 1995, Vapnik and Cortes extended the theory to handle non-separable training data by using a cost function; finally, a method of support vector regression was developed in 1996 (Drucker 1996).
1.2 Summary
This very brief introduction to SVMs first describes, in historical order, three cases: using an SVM to classify linearly separable data; using a kernel function to make a non-linear classification; and using a cost function to allow for non-separable data. In section 3, a number of studies using SVMs are described, including several related to music information retrieval. Finally, some studies evaluating the performance of SVMs are summarized.
2. Support Vector Machines
2.1 Linear, separable data
The basic problem that an SVM learns and solves is a two-category classification problem. Following Bennett’s discussion (2000), suppose we have a set of $l$ observations.
Figure 1. Two data sets, represented by squares and circles, are separated by two parallel hyperplanes subtended by support vectors (circled). The distance between these planes, the margin, is the quantity maximized by an SVM. The solid line is the optimal separating hyperplane.
Figure 2. Visualization of the kernel trick. Input data are mapped into a higher dimensional feature space using a kernel function, resulting in linearly separable training data. Source: Holbrey, R. “Dimension Reduction Algorithms for Data Mining and Visualization.” <http://www.comp.leeds.ac.uk/richardh/astro/index.html> Accessed 12 February 2008.
Each observation can be represented by a pair $\{\mathbf{x}_i, y_i\}$, where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i \in \{-1, 1\}$. That is, each observation contains an $N$-dimensional vector $\mathbf{x}_i$ and a class assignment $y_i$. Our goal is to find the optimal separating hyperplane; that is, the flat $(N-1)$-dimensional surface that best separates the data. For the time being we assume that a separating hyperplane exists, and is defined by the normal vector $\mathbf{w}$. On either side of this plane we construct a pair of parallel planes such that:
$$\mathbf{w} \cdot \mathbf{x}_i \geq b + 1 \quad \text{for } y_i = 1$$
$$\mathbf{w} \cdot \mathbf{x}_i \leq b - 1 \quad \text{for } y_i = -1$$
where $b$ indicates the offset of the plane from the origin. This situation is pictured in Figure 1, where the separating plane is the solid line and the two parallel planes are the dashed lines. The dashed lines ‘push up’ against some of the training data points: these points are called ‘support vectors,’ and in fact they completely determine the solution. The gap between these lines is called the margin, and we wish to maximize the size of this gap. In terms of $\mathbf{w}$, the margin has width $2/\|\mathbf{w}\|$, so maximizing it is equivalent to minimizing:
$$\tfrac{1}{2}\|\mathbf{w}\|^2$$
subject to the constraint:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1$$
The solution can be obtained using Lagrange multipliers (Burges 1998).
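As a rough illustration of the separable case, the sketch below checks the constraint $y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1$ for a candidate plane and reports the margin width $2/\|\mathbf{w}\|$. The toy data, the candidate values of `w` and `b`, and all function names are assumptions made for this example; the code does not solve the optimization problem itself.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def satisfies_constraints(w, b, data):
    """True if y * (w . x - b) >= 1 for every observation (x, y)."""
    return all(y * (dot(w, x) - b) >= 1 for x, y in data)

def margin_width(w):
    """Distance between the planes w . x = b + 1 and w . x = b - 1."""
    return 2.0 / math.sqrt(dot(w, w))

def support_vectors(w, b, data, tol=1e-9):
    """Observations lying exactly on one of the two parallel planes."""
    return [x for x, y in data if abs(y * (dot(w, x) - b) - 1) <= tol]

# Hypothetical two-dimensional training set: (vector, class) pairs.
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Hypothetical candidate plane, chosen by hand rather than learned.
w, b = (0.5, 0.5), 1.0

print(satisfies_constraints(w, b, data))
print(margin_width(w))
print(support_vectors(w, b, data))
```

For this hand-picked plane, the two points that touch the parallel planes are (2, 2) and (0, 0), and the margin width equals the distance between them, which is what one would expect if those two points alone determine the solution.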
2.2 Kernel functions
Often, a non-linear decision surface is required to separate the data. Repeating the above steps to maximize the separation between two non-linear functions can be computationally expensive. Instead, the kernel trick is used: input data are mapped into a higher dimensional feature space via a specified kernel function, chosen so that the data are linearly separable in the higher dimensional space. Furthermore, if a suitable kernel function is selected, dot products in the feature space can be computed directly from the input vectors (Cortes 1995), so that the mathematical approach outlined in section 2.1 is still applicable. The kernel functions that have been used most often, and whose properties have been studied most extensively, are linear and polynomial functions, the radial-basis function, and the sigmoid function (Sherrod 2008).
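The following sketch illustrates the idea behind the kernel trick for one simple case. It assumes a degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ on two-dimensional inputs, for which an explicit feature map `phi` into three dimensions is known, and compares the kernel value with the dot product taken in the feature space. The function names, the choice of kernel and the `gamma` parameter of the radial-basis kernel are illustrative assumptions, not taken from the cited sources.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the input space."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map for poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=1.0):
    """Radial-basis kernel exp(-gamma * ||x - z||^2)."""
    diff = [a - b for a, b in zip(x, z)]
    return math.exp(-gamma * dot(diff, diff))

x, z = (1.0, 2.0), (3.0, 4.0)

# The two values agree up to rounding: the kernel evaluates the
# feature-space dot product without ever building phi(x) or phi(z).
print(poly2_kernel(x, z))
print(dot(phi(x), phi(z)))
print(rbf_kernel(x, x))
```

The practical point is that an algorithm written purely in terms of dot products can swap in the kernel function and never needs the feature map, which matters for kernels such as the radial-basis function whose feature space is not finite-dimensional.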
2.3 Non-separable data
A method of accommodating errors and outliers in the input data was developed in 1995 (Cortes 1995), and can be implemented simply by allowing each observation an error of up to $\xi_i$ (resulting in a ‘fuzzy margin’) and adding a cost on these errors, weighted by a constant $C$, to the optimization equation (Burges 1998). We then want to minimize:
$$\tfrac{1}{2}\|\mathbf{w}\|^2 + C \cdot \Big(\sum_i \xi_i\Big)$$
subject to the constraints:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) + \xi_i \geq 1, \qquad \xi_i \geq 0$$
(Bennett 2000). This is substantially harder to solve than the separable case. In Chang and Lin’s LIBSVM manual, the minimization conditions, constraints, and resulting decision functions are defined for each type of classification, along with algorithms for solving the required quadratic programming problems (2007).
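To make the role of the errors $\xi_i$ and the constant $C$ concrete, the sketch below evaluates the quantity to be minimized for a fixed, hand-picked plane. It assumes that each $\xi_i$ is non-negative and takes the smallest value that satisfies the constraint, that is, $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b))$. The data, the plane and the function names are hypothetical, and no optimization is performed.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, data):
    """Smallest non-negative error for each observation (x, y)."""
    return [max(0.0, 1.0 - y * (dot(w, x) - b)) for x, y in data]

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * (sum of the errors)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))

# Hypothetical data: four well-separated points plus one outlier of
# class +1 that sits on the wrong side of the separating plane.
data = [
    ((2.0, 2.0), 1), ((3.0, 3.0), 1),
    ((0.0, 0.0), -1), ((-1.0, 0.0), -1),
    ((0.5, 0.5), 1),  # outlier
]
w, b = (0.5, 0.5), 1.0  # hand-picked plane, not a learned solution

print(slacks(w, b, data))
print(soft_margin_objective(w, b, data, C=1.0))
print(soft_margin_objective(w, b, data, C=10.0))
```

Only the outlier receives a non-zero error here, and raising `C` makes that single violation dominate the objective, which is the sense in which `C` trades margin width against tolerance for misclassified points.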
3. Studies using SVMs
3.1 Applications
Throughout his early papers, Vapnik often used optical text recognition as an experimental example application (Boser 1992, Cortes 1995, Schölkopf 1996). (See also Sebastiani 1999, Joachims 1997.) Since then, many authors have used SVMs to develop classifiers in other disciplines: see, for instance, the work on face detection by Osuna et al. (1997b) or on gene expression data by Brown et al. (2000). In the field of music information retrieval, Dhanaraj and Logan used SVMs in their automatic identification of hit songs based on lyrics and acoustic features (2005), Laurier and Herrera submitted a second-place mood classifier to MIREX 2007 that relied on SVMs and acoustic features, and Meng used SVMs at multiple stages in his dissertation: first to perform temporal feature integration and second to perform automatic genre identification based on these features (2006). Both Mandel (2005, 2006) and Xu (2003) have studied musical genre classification using SVMs based on acoustic features.
The free software package LIBSVM is a library of tools for implementing various types of SVMs (Chang 2007), while DTREG can implement a number of predictive models, from SVMs to various types of neural nets and decision trees (Sherrod 2008).
3.2 Performance
According to Vapnik, his SVM handwritten digit classifier easily outperforms state-of-the-art classifiers based on other learning routines. However, since their rise in popularity in the 1990s, SVMs have been the object of closer scrutiny: a study by Meyer concluded that although SVMs performed very well in classification and regression tasks, other methods were just as competitive (2002).
The two-category classification problem is the classic problem to study analytically, but in practice more categories must often be distinguished. Hsu (2002) compared the performance of various methods of combining binary classifiers, concluding that one-against-one and ‘directed acyclic graph SVM’ were better than one-against-all.
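The one-against-one scheme can be sketched as a vote among pairwise binary classifiers: one classifier is built for every pair of classes, and the class that wins the most pairwise decisions is returned. In the sketch below the pairwise classifiers are hypothetical threshold rules on a single number, standing in for trained two-category SVMs; the class labels, thresholds and function names are all assumptions made for illustration, and ties are resolved arbitrarily.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(x, classes, pairwise):
    """Majority vote over all k * (k - 1) / 2 pairwise classifiers.

    pairwise maps each class pair (a, b) to a function that returns
    either a or b for the input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]

classes = ["low", "mid", "high"]

# Hypothetical stand-ins for three trained binary SVMs.
pairwise = {
    ("low", "mid"): lambda x: "low" if x < 1.0 else "mid",
    ("low", "high"): lambda x: "low" if x < 2.0 else "high",
    ("mid", "high"): lambda x: "mid" if x < 3.0 else "high",
}

for value in (0.5, 2.0, 3.5):
    print(value, one_against_one_predict(value, classes, pairwise))
```

With $k$ classes this scheme needs $k(k-1)/2$ binary classifiers rather than the $k$ needed for one-against-all, but each one is trained on only the observations belonging to its two classes.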
Bibliography
Bennett, K., and C. Campbell. 2000. “Support vector machines: Hype or hallelujah?” Special Interest Group on Knowledge Discovery and Data Mining Explorations. 2(2): 1–13.
Boser, B., I. Guyon, and V. Vapnik. 1992. “A training algorithm for optimal margin classifiers.” Proceedings of the 5th Annual Workshop on Computational Learning Theory. 144–52.
Brown, M., W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares Jr., and D. Haussler. 2000. “Knowledge-based analysis of microarray gene expression data by using support vector machines.” Proceedings of the National Academy of Science. 97: 262–267.
Burges, C. 1998. “A tutorial on support vector machines for pattern recognition.” Data Mining and Knowledge Discovery. 2(2): 955–74.
Chang, C., and C. Lin. 2007. “LIBSVM: a library for support vector machines.” Manual for software available online: <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>
Cristianini, N., and J. Shawe-Taylor. 2000. “Chapter 6: Further reading and advanced topics.” In An Introduction to Support Vector Machines. Cambridge: Cambridge University Press. <http://www.support-vector.net/chapter_6.html>
Cortes, C., and V. Vapnik. 1995. “Support-vector networks.” Machine Learning. 20(3): 273–297.
Dhanaraj, R., and B. Logan. 2005. “Automatic prediction of hit songs.” International Conference on Music Information Retrieval, London UK. 488–91.
Drucker, H., C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1996. “Support vector regression machine.” Advances in Neural Information Processing Systems. Cambridge: MIT Press 9(9): 155–61.
Hsu, C., and C. Lin. 2002. “A comparison of methods for multiclass support vector machines.” IEEE Transactions on Neural Networks. 13(2): 415–425.
Hsu, C., C. Chang, and C. Lin. 2007. “A practical guide to support vector classification.” <http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf>
Joachims, T. 1997. “Text categorization with support vector machines: Learning with many relevant features.” Springer Lecture Notes in Computer Science. 1398: 137–42.
Laurier, C., and P. Herrera. 2007. “Audio music mood classification using support vector machine.” Proceedings of 8th International Conference on Music Information Retrieval.
Mandel, M., and D. Ellis. 2005. “Song-level features and support vector machines for music classification.” Proceedings of the 6th International Conference on Music Information Retrieval. 594–599.
Mandel, M., G. Poliner, and D. Ellis. 2006. “Support vector machine active learning for music retrieval.” Multimedia Systems. 12(1): 3–13.
Meng, A. 2006. Temporal feature integration for music organization. PhD diss., Technical University of Denmark.
Meyer, D., F. Leisch, and K. Hornik. 2002. “Benchmarking support vector machines.” Report Series SFB, Adaptive Information Systems and Modelling in Economics and Management Science. 78.
Meyer, D., F. Leisch, and K. Hornik. 2003. “The support vector machine under test.” Neurocomputing. 55: 169–86.
Osuna, E., R. Freund, and F. Girosi. 1997a. “An improved training algorithm for support vector machines.” Proceedings of the IEEE Workshop on Neural Networks for Signal Processing. 276–85.
Osuna, E., R. Freund, and F. Girosi. 1997b. “Training support vector machines: an application to face detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 130–7.
Schölkopf, B., C. Burges, and V. Vapnik. 1996. “Incorporating invariances in support vector learning machines.” Springer Lecture Notes in Computer Science. 1112: 47–52.
Sebastiani, F. 1999. “Machine learning in automated text categorization.” Technical Report, Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.
Sherrod, P. 2008. “DTREG Predictive Modeling Software.” Manual for software available online: <www.dtreg.com>
Smola, A., and B. Schölkopf. 1998. “A tutorial on support vector regression.” NeuroCOLT2 Technical Report NC2-TR-1998-030. Holloway College, London.
Vapnik, V. 2006. Empirical Inference Science. Afterword in 1982 reprint of Estimation of Dependences Based on Empirical Data.
Xu, C., N. Maddage, X. Shao, F. Cao, and Q. Tian. 2003. “Musical genre classification using support vector machines.” Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing. 5: 429–32.