Introduction to Support Vector Machines

Starting from slides drawn by Ming-Hsuan Yang and Antoine Cornuéjols
SVM Bibliography

C. Burges, "A tutorial on support vector machines for pattern recognition". Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
http://citeseer.nj.nec.com/burges98tutorial.html

N. Cristianini, J. Shawe-Taylor, "Support Vector Machines and other kernel-based learning methods". Cambridge University Press, 2000.

C. Cortes, V. Vapnik, "Support-vector networks". Machine Learning, 20, 1995.

V. Vapnik, "The nature of statistical learning theory". Springer Verlag, 1995.
SVM — The Main Idea

Given a set of data points which belong to either of two classes, find an optimal separating hyperplane:
- maximizing the distance (from the closest points) of either class to the separating hyperplane, and
- minimizing the risk of misclassifying the training samples and the unseen test samples.

Approach: formulate a constraint-based optimisation problem, then solve it using quadratic programming (QP).
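Illustration (not part of the original slides): a minimal sketch of this idea using scikit-learn, which is assumed to be available; the toy data and the large value of C are invented for the example and approximate the hard-margin SVM formalised on the next slides.

```python
# A minimal sketch of the main idea with scikit-learn (assumed installed).
# The toy data is invented for illustration only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],         # class +1
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, -2.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin SVM described here.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("separating hyperplane: w =", w, " w0 =", w0)
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("closest points (support vectors):\n", clf.support_vectors_)
```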
The General Schema for Machine Learning

[Diagram: training data is fed to a machine learning algorithm, which produces a model; the model is then applied to test/generalization data to output a predicted classification.]
Plan

1. Linear SVMs
   The primal form and the dual form of linear SVMs
   Linear SVMs with soft margin
2. Non-Linear SVMs
   Kernel functions for SVMs
   An example of non-linear SVM
1. Linear SVMs: Formalisation

Let $S$ be a set of points $x_i \in \mathbb{R}^d$ with $i = 1, \ldots, m$. Each point $x_i$ belongs to either of two classes, with label $y_i \in \{-1, +1\}$.

The set $S$ is linearly separable if there are $w \in \mathbb{R}^d$ and $w_0 \in \mathbb{R}$ such that
\[
y_i (w \cdot x_i + w_0) \ge 1, \quad i = 1, \ldots, m.
\]
The pair $(w, w_0)$ defines the hyperplane of equation $w \cdot x + w_0 = 0$, named the separating hyperplane.

The signed distance $d_i$ of a point $x_i$ to the separating hyperplane $(w, w_0)$ is given by
\[
d_i = \frac{w \cdot x_i + w_0}{\|w\|}.
\]
It follows that $y_i d_i \ge \frac{1}{\|w\|}$, therefore $\frac{1}{\|w\|}$ is a lower bound on the distance between the points $x_i$ and the separating hyperplane $(w, w_0)$.
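A short numeric check of these definitions (a sketch; the vector w, the offset w_0 and the sample points below are made up for illustration):

```python
# Check the signed distance d_i = (w·x_i + w0)/||w|| and the bound y_i d_i >= 1/||w||.
# All values below are invented for illustration.
import numpy as np

w, w0 = np.array([2.0, 1.0]), -1.0
X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, -1, -1])

assert np.all(y * (X @ w + w0) >= 1)          # the separability constraints hold

d = (X @ w + w0) / np.linalg.norm(w)          # signed distances to the hyperplane
print("signed distances:", d)
print("y_i * d_i >= 1/||w|| ?", np.all(y * d >= 1.0 / np.linalg.norm(w) - 1e-12))
```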
Optimal Separating Hyperplane

Given a linearly separable set $S$, the optimal separating hyperplane is the separating hyperplane for which the distance to the closest (either positive or negative) points in $S$ is maximum, therefore it maximizes $\frac{1}{\|w\|}$.
[Figure: the optimal separating hyperplane $D(x) = w \cdot x + w_0 = 0$, the two margin hyperplanes $D(x) = \pm 1$, the maximal margin $\frac{1}{\|w\|}$ on each side, and the support vectors lying on the margin hyperplanes.]
Linear SVMs: The Primal Form

\[
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^2 \\
\text{subject to} \quad & y_i (w \cdot x_i + w_0) \ge 1 \ \text{ for } i = 1, \ldots, m
\end{aligned}
\]

This is a constrained quadratic problem (QP) with $d+1$ parameters ($w \in \mathbb{R}^d$ and $w_0 \in \mathbb{R}$). It can be solved by quadratic optimisation methods if $d$ is not very big (of the order of $10^3$).

For large values of $d$ (of the order of $10^5$): due to the Kuhn-Tucker theorem, because the above objective function and the associated constraints are convex, we can use the method of Lagrange multipliers ($\alpha_i \ge 0$, $i = 1, \ldots, m$) to put the above problem into an equivalent "dual" form.

Note: In the dual form, the variables ($\alpha_i$) will be subject to much simpler constraints than the variables ($w, w_0$) in the primal form.
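As a sketch (not from the slides), the primal QP can be handed directly to a generic quadratic programming solver. Here cvxopt is assumed to be installed, the toy data is invented, and the variable vector is z = (w, w_0):

```python
# Solving  min (1/2)||w||^2  s.t.  y_i(w·x_i + w0) >= 1  with cvxopt's QP solver.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m, d = X.shape

# Variable z = (w, w0) in R^(d+1); only w enters the objective.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P += 1e-8 * np.eye(d + 1)                 # tiny ridge to keep the KKT system non-singular
q = np.zeros((d + 1, 1))

# y_i(w·x_i + w0) >= 1  rewritten as  -y_i(w·x_i + w0) <= -1.
G = -y[:, None] * np.hstack([X, np.ones((m, 1))])
h = -np.ones((m, 1))

solvers.options["show_progress"] = False
z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))["x"]).ravel()
w, w0 = z[:d], z[d]
print("w =", w, " w0 =", w0)
```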
Linear SVMs: Getting the Dual Form

The Lagrangian function associated to the primal form of the given QP is
\[
L_P(w, w_0, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \big( y_i (w \cdot x_i + w_0) - 1 \big)
\]
with $\alpha_i \ge 0$, $i = 1, \ldots, m$. Finding the minimum of $L_P$ implies
\[
\frac{\partial L_P}{\partial w_0} = -\sum_{i=1}^{m} y_i \alpha_i = 0
\qquad
\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{m} y_i \alpha_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} y_i \alpha_i x_i
\]
where $\frac{\partial L_P}{\partial w} = \left( \frac{\partial L_P}{\partial w_1}, \ldots, \frac{\partial L_P}{\partial w_d} \right)$.

By substituting these constraints into $L_P$ we get its dual form
\[
L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j
\]
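For completeness (this step is only implicit in the slide above, and no new notation is introduced), substituting $w = \sum_{j} y_j \alpha_j x_j$ and using $\sum_{i} y_i \alpha_i = 0$ gives
\[
\frac{1}{2}\|w\|^2 = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,
\qquad
\sum_{i=1}^{m} \alpha_i y_i \, w \cdot x_i = \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,
\qquad
\sum_{i=1}^{m} \alpha_i y_i w_0 = 0,
\]
hence $L_P$ reduces to $L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$.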
Linear SVMs: The Dual Form

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\
\text{subject to} \quad & \sum_{i=1}^{m} y_i \alpha_i = 0 \\
& \alpha_i \ge 0, \quad i = 1, \ldots, m
\end{aligned}
\]

The link between the primal and the dual form:
The optimal solution $(\bar{w}, \bar{w}_0)$ of the primal QP problem is given by
\[
\bar{w} = \sum_{i=1}^{m} \bar{\alpha}_i y_i x_i
\qquad
\bar{\alpha}_i \big( y_i (\bar{w} \cdot x_i + \bar{w}_0) - 1 \big) = 0 \ \text{ for any } i = 1, \ldots, m
\]
where $\bar{\alpha}_i$ are the optimal solutions of the above (dual form) optimisation problem.
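Again as a sketch (not part of the slides), the dual can be solved with the same generic QP solver; cvxopt and the toy data are assumptions of the example:

```python
# Dual form:  min (1/2) a^T Q a - 1^T a   s.t.  y^T a = 0,  a >= 0,
# with Q_ij = y_i y_j x_i·x_j.  (cvxopt assumed installed; toy data invented.)
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

Q = np.outer(y, y) * (X @ X.T)
P = matrix(Q + 1e-8 * np.eye(m))          # tiny ridge for numerical stability
q = matrix(-np.ones((m, 1)))
G = matrix(-np.eye(m))                    # -a_i <= 0, i.e. a_i >= 0
h = matrix(np.zeros((m, 1)))
A = matrix(y.reshape(1, -1))              # equality constraint y^T a = 0
b = matrix(np.zeros((1, 1)))

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
sv = np.argmax(alpha)                      # a point with alpha_i > 0 lies on the margin
w0 = y[sv] - w @ X[sv]                     # from y_i(w·x_i + w0) = 1
print("alpha =", np.round(alpha, 4), "\nw =", w, " w0 =", w0)
```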
Support Vectors

The only $\bar{\alpha}_i$ (solutions of the dual form of our QP problem) that can be nonzero are those for which the constraints $y_i (w \cdot x_i + w_0) \ge 1$, $i = 1, \ldots, m$, in the primal form of the QP are satisfied with the equality sign.

Because most $\bar{\alpha}_i$ are null, the vector $\bar{w}$ is a linear combination of a relatively small percentage of the points $x_i$. These points are called support vectors because they are the closest points to the optimal separating hyperplane (OSH) and the only points of $S$ needed to determine the OSH.

The problem of classifying a new data point $x$ is now simply solved by looking at $\mathrm{sign}(\bar{w} \cdot x + \bar{w}_0)$.
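The same facts can be read off a fitted linear SVM in scikit-learn (assumed available; toy data invented): only the support vectors carry nonzero coefficients, $\bar{w}$ is their linear combination, and new points are classified by $\mathrm{sign}(\bar{w} \cdot x + \bar{w}_0)$.

```python
# Support vectors in practice (scikit-learn assumed; toy data invented).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)       # large C ~ hard margin

# dual_coef_ stores alpha_i * y_i, but only for the support vectors.
print("support vector indices:", clf.support_)
w = clf.dual_coef_[0] @ clf.support_vectors_      # w = sum_i alpha_i y_i x_i
w0 = clf.intercept_[0]
print("w from the SV expansion:", w, " matches coef_:", np.allclose(w, clf.coef_[0]))

x_new = np.array([1.0, 1.5])
print("classification:", np.sign(w @ x_new + w0))  # same as clf.predict([x_new])
```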
Linear SVMs with Soft Margin

If the set $S$ is not linearly separable (or if one simply ignores whether or not $S$ is linearly separable), the previous analysis can be generalised by introducing $m$ non-negative ("slack") variables $\xi_i$, $i = 1, \ldots, m$, such that
\[
y_i (w \cdot x_i + w_0) \ge 1 - \xi_i, \quad i = 1, \ldots, m.
\]
Purpose: to allow for a small number of misclassified points, for better generalisation or computational efficiency.
Generalised OSH

The generalised OSH is then viewed as the solution to the problem:
\[
\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (w \cdot x_i + w_0) \ge 1 - \xi_i \ \text{ for } i = 1, \ldots, m \\
& \xi_i \ge 0 \ \text{ for } i = 1, \ldots, m
\end{aligned}
\]
The associated dual form:
\[
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \\
\text{subject to} \quad & \sum_{i=1}^{m} y_i \alpha_i = 0 \\
& 0 \le \alpha_i \le C, \quad i = 1, \ldots, m
\end{aligned}
\]
As before:
\[
\bar{w} = \sum_{i=1}^{m} \bar{\alpha}_i y_i x_i
\qquad
\bar{\alpha}_i \big( y_i (\bar{w} \cdot x_i + \bar{w}_0) - 1 + \bar{\xi}_i \big) = 0
\qquad
(C - \bar{\alpha}_i)\, \bar{\xi}_i = 0
\]
The role of C: it acts as a regularizing parameter:
• large C ⇒ minimize the number of misclassified points
• small C ⇒ maximize the minimum distance $\frac{1}{\|w\|}$
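A small sketch of this trade-off (scikit-learn assumed; the two clusters and the single outlier are invented): a large C penalizes training errors heavily, while a small C prefers a wider margin, i.e. a smaller ||w||.

```python
# Effect of the regularizing parameter C (scikit-learn assumed; toy data invented).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],        # class +1
               rng.randn(20, 2) - [2, 2]])       # class -1
y = np.r_[np.ones(20), -np.ones(20)]
X = np.vstack([X, [[-1.5, -1.5]]])               # one +1 outlier inside the -1 cluster
y = np.r_[y, 1]

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])
    errors = int((clf.predict(X) != y).sum())
    print(f"C = {C}: margin 1/||w|| = {margin:.3f}, training errors = {errors}")
```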
2. Nonlinear Support Vector Machines

• Note that the only way the data points appear in (the dual form of) the training problem is in the form of dot products $x_i \cdot x_j$.
• In a higher dimensional space, it is very likely that a linear separator can be constructed.
• We map the data points from the input space $\mathbb{R}^d$ into some space of higher dimension $\mathbb{R}^n$ ($n > d$) using a function $\Phi : \mathbb{R}^d \to \mathbb{R}^n$.
• Then the training algorithm would depend only on dot products of the form $\Phi(x_i) \cdot \Phi(x_j)$.
• Constructing (via $\Phi$) a separating hyperplane with maximum margin in the higher-dimensional space yields a nonlinear decision boundary in the input space.
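A sketch of this mapping idea (numpy and scikit-learn assumed; the map below is the degree-2 feature map used in the xor exercise later in these slides): the xor points are not linearly separable in the input space, but they are after the map.

```python
# xor is not linearly separable in R^2, but becomes linearly separable in R^6
# after the degree-2 map Phi.  (numpy and scikit-learn assumed.)
import numpy as np
from sklearn.svm import SVC

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1])

Z = np.array([phi(x) for x in X])                 # points mapped into R^6
clf = SVC(kernel="linear", C=1e6).fit(Z, y)       # linear separation in the mapped space
print("predictions in the mapped space:", clf.predict(Z))   # matches y
```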
General Schema for Nonlinear SVMs

[Diagram: an input $x$ from the input space is mapped into the internal redescription space, where a hypothesis $h$ is applied to produce the output $y$ in the output space.]
Introducing Kernel Functions

• But the dot product is computationally expensive...
• If there were a "kernel function" $K$ such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, we would only use $K$ in the training algorithm.
• All the previous derivations in the model of linear SVMs hold (substituting the dot product with the kernel function), since we are still doing a linear separation, but in a different space.

Important remark: By the use of the kernel function, it is possible to compute the separating hyperplane without explicitly carrying out the map into the higher-dimensional space.
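A quick numeric check (numpy assumed; the test points are random and invented) that the degree-2 polynomial kernel $K(x, x') = (x \cdot x' + 1)^2$ equals $\Phi(x) \cdot \Phi(x')$ for the degree-2 map used in the xor exercise later in these slides, so training never needs $\Phi$ explicitly:

```python
# Verify K(x, x') = (x·x' + 1)^2  ==  Phi(x)·Phi(x') for the degree-2 map.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def K(x, xp):
    return (np.dot(x, xp) + 1.0) ** 2

rng = np.random.RandomState(0)
for _ in range(5):
    x, xp = rng.randn(2), rng.randn(2)
    assert np.isclose(K(x, xp), phi(x) @ phi(xp))
print("K(x, x') == Phi(x)·Phi(x') on all test pairs")
```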
Some Classes of Kernel Functions for SVMs

• Polynomial: $K(x, x') = (x \cdot x' + c)^q$
• RBF (radial basis function): $K(x, x') = e^{-\|x - x'\|^2 / \sigma^2}$
• Sigmoid: $K(x, x') = \tanh(\alpha \, x \cdot x' - b)$
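These kernel families can be written directly as functions; the parameter values below (c, q, sigma, alpha, b) are placeholders chosen for illustration, not recommendations, and a $2\sigma^2$ denominator is an equally common convention for the RBF kernel.

```python
# The kernel families above as plain functions (numpy assumed).
import numpy as np

def poly_kernel(x, xp, c=1.0, q=2):
    return (np.dot(x, xp) + c) ** q

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

def sigmoid_kernel(x, xp, alpha=0.5, b=0.0):
    return np.tanh(alpha * np.dot(x, xp) - b)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xp), rbf_kernel(x, xp), sigmoid_kernel(x, xp))
```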
An Illustration

[Figure: decision surfaces obtained (a) by a polynomial classifier, and (b) by an RBF. Support vectors are indicated in dark fill.]
Important Remark

The kernel functions require calculations in $x$ ($\in \mathbb{R}^d$), therefore they are not difficult to compute.

It remains to determine which kernel function $K$ can be associated with a given (redescription space) function $\Phi$. In practice, one proceeds the other way around: we test kernel functions about which we know that they correspond to the dot product in a certain space (which will work as the redescription space, never made explicit). Therefore, the user operates by "trial and error"...

Advantage: the only parameters when training an SVM are the kernel function $K$ and the "tradeoff" parameter $C$.
Mercer's Theorem:
A Characterisation of Kernel Functions for SVMs

Theorem: Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric function. $K$ represents a dot product, i.e. there is a function $\Phi : \mathbb{R}^d \to \mathbb{R}^n$ such that $K(x, x') = \Phi(x) \cdot \Phi(x')$, if and only if
\[
\int K(x, x') f(x) f(x') \, dx \, dx' \ge 0
\]
for any function $f$ such that $\int f^2(x) \, dx$ is finite.

Remark: The theorem doesn't say how to construct $\Phi$.
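Mercer's condition is hard to verify directly, but on any finite sample a Mercer kernel must produce a positive semidefinite Gram matrix, so checking the eigenvalues is a useful necessary-condition sanity test (a sketch; numpy assumed, sample points invented):

```python
# Finite-sample check: a Mercer kernel gives a PSD Gram matrix K_ij = K(x_i, x_j).
# (Necessary, not sufficient.)
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)

rng = np.random.RandomState(0)
X = rng.randn(30, 4)                                   # invented sample
G = np.array([[rbf_kernel(a, b) for b in X] for a in X])

print("smallest eigenvalue:", np.linalg.eigvalsh(G).min())   # non-negative up to round-off
```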
Some simple rules for building (Mercer) kernels

If $K_1$ and $K_2$ are kernels over $X \times X$, with $X \subseteq \mathbb{R}^n$, then
• $K(x, y) = K_1(x, y) + K_2(x, y)$
• $K(x, y) = a K_1(x, y)$, with $a \in \mathbb{R}^+$
• $K(x, y) = K_1(x, y) \, K_2(x, y)$
are also kernels.
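These closure rules mean new kernels can be composed from known ones; a brief sketch (numpy assumed; sample invented) building a scaled sum and a product of the polynomial and RBF kernels and repeating the finite-sample PSD check:

```python
# Combining known kernels by the rules above (numpy assumed).
import numpy as np

poly = lambda x, xp: (np.dot(x, xp) + 1.0) ** 2
rbf  = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))

# sum, scaling by a > 0, and product of kernels: still a kernel
combined = lambda x, xp: 0.5 * poly(x, xp) + poly(x, xp) * rbf(x, xp)

rng = np.random.RandomState(1)
X = rng.randn(25, 3)
G = np.array([[combined(a, b) for b in X] for a in X])
print("min eigenvalue of the combined Gram matrix:", np.linalg.eigvalsh(G).min())
```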
Illustrating the General Architecture of SVMs
for the problem of hand-written character recognition

[Figure: a network-style diagram. Input: $x$. Comparison units compute $K(x_i, x)$ for the support vectors $x_1, x_2, x_3, \ldots$; their results are weighted and combined into the output $\mathrm{sign}\big(\sum_i \alpha_i y_i K(x_i, x) + w_0\big)$.]
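The output formula in the diagram translates directly into code; in this sketch (numpy assumed) the support vectors, their coefficients $\alpha_i y_i$ and $w_0$ are taken as given, and for concreteness they are the xor solution derived later in these slides:

```python
# SVM decision rule from the diagram:  sign( sum_i alpha_i y_i K(x_i, x) + w0 ).
import numpy as np

def decide(x, support_vectors, alpha_y, w0, K):
    return np.sign(sum(ay * K(sv, x) for sv, ay in zip(support_vectors, alpha_y)) + w0)

# Using the xor solution (alpha_i = 1/8, w0 = 0) with K(x, x') = (x·x' + 1)^2:
K = lambda a, b: (np.dot(a, b) + 1.0) ** 2
SV = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
alpha_y = 0.125 * np.array([-1, 1, 1, -1])

print(decide(np.array([0.5, -2.0]), SV, alpha_y, 0.0, K))   # +1, since -x1*x2 > 0 there
```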
An Exercise: xor

[Figure: the four xor points at $(\pm 1, \pm 1)$ in the $(x_1, x_2)$ plane, labelled with their classes.]

Note: use $K(x, x') = (x \cdot x' + 1)^2$.

It can be easily shown that $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1, \sqrt{2}\, x_2, 1) \in \mathbb{R}^6$ for $x = (x_1, x_2) \in \mathbb{R}^2$.
 i   x_i        y_i   Φ(x_i)
 1   (1, 1)     −1    (1, 1,  √2,  √2,  √2, 1)
 2   (1, −1)    +1    (1, 1, −√2,  √2, −√2, 1)
 3   (−1, 1)    +1    (1, 1, −√2, −√2,  √2, 1)
 4   (−1, −1)   −1    (1, 1,  √2, −√2, −√2, 1)

\[
\begin{aligned}
L_D(\alpha) &= \sum_{i=1}^{4} \alpha_i - \frac{1}{2} \sum_{i=1}^{4} \sum_{j=1}^{4} \alpha_i \alpha_j y_i y_j \, \Phi(x_i) \cdot \Phi(x_j) \\
&= \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2} \big( 9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2 \big)
\end{aligned}
\]
subject to:
\[
-\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0
\]
\[
\begin{aligned}
\frac{\partial L_D(\alpha)}{\partial \alpha_1} = 0 &\Leftrightarrow 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1 \\
\frac{\partial L_D(\alpha)}{\partial \alpha_2} = 0 &\Leftrightarrow \alpha_1 - 9\alpha_2 - \alpha_3 + \alpha_4 = -1 \\
\frac{\partial L_D(\alpha)}{\partial \alpha_3} = 0 &\Leftrightarrow \alpha_1 - \alpha_2 - 9\alpha_3 + \alpha_4 = -1 \\
\frac{\partial L_D(\alpha)}{\partial \alpha_4} = 0 &\Leftrightarrow \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1
\end{aligned}
\]
\[
\bar{\alpha}_1 = \bar{\alpha}_2 = \bar{\alpha}_3 = \bar{\alpha}_4 = \frac{1}{8}
\]
\[
\bar{w} = \frac{1}{8} \big( -\Phi(x_1) + \Phi(x_2) + \Phi(x_3) - \Phi(x_4) \big) = \frac{1}{8} (0, 0, -4\sqrt{2}, 0, 0, 0)
\]
\[
\bar{w} \cdot \Phi(x_i) + \bar{w}_0 = y_i \;\Rightarrow\; \bar{w}_0 = 0
\]
The optimal separating hyperplane: $\bar{w} \cdot \Phi(x) + \bar{w}_0 = 0 \Leftrightarrow -x_1 x_2 = 0$

Test: $\mathrm{sign}(-x_1 x_2)$
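The hand calculation above can be checked numerically (a sketch; numpy assumed): solving the 4×4 stationarity system reproduces $\alpha_i = 1/8$, and the resulting decision function agrees with $-x_1 x_2$.

```python
# Numerical check of the xor solution.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = (X @ X.T + 1.0) ** 2                       # K(x_i, x_j) = (x_i·x_j + 1)^2
Q = np.outer(y, y) * K

alpha = np.linalg.solve(Q, np.ones(4))         # stationarity conditions: Q alpha = 1
print("alpha =", alpha)                        # [0.125 0.125 0.125 0.125]
print("constraint sum_i y_i alpha_i =", y @ alpha)   # 0

def D(x):                                      # decision function via the kernel expansion
    return sum(a * yi * (xi @ x + 1.0) ** 2 for a, yi, xi in zip(alpha, y, X))

for x in [np.array([0.7, 0.3]), np.array([-1.2, 0.4])]:
    print(x, D(x), -x[0] * x[1])               # D(x) equals -x1*x2
```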
The xor Exercise: Result

[Figure: in the input space $(x_1, x_2)$, the decision function is $D(x_1, x_2) = -x_1 x_2$, shown with the curves $D = 0$ and $D = \pm 1$; in the feature space (along the $\sqrt{2}\, x_1 x_2$ coordinate) the separation is linear, with maximum margin $1/\|\bar{w}\| = \sqrt{2}$.]
Concluding Remarks: SVM — Pros and Cons

Pros:
• Find the optimal separating hyperplane.
• Can deal with very high dimensional data.
• Some kernels have infinite Vapnik-Chervonenkis dimension (see Computational learning theory, ch. 7 in Tom Mitchell's book), which means that they can learn very elaborate concepts.
• Usually work very well.

Cons:
• Require both positive and negative examples.
• Need to select a good kernel function.
• Require lots of memory and CPU time.
• There are some numerical stability problems in solving the constrained QP.
Multi-class Classification with SVM

SVMs can only do binary classification. For M classes, one can use the one-against-the-rest approach: construct a hyperplane between class k and the M−1 other classes. ⇒ M SVMs.

To predict the output of a new instance, just predict with each of these M SVMs, and then find out which one puts the prediction furthest into the positive region of the instance space.
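A sketch of one-against-the-rest (scikit-learn assumed; the three-class toy data is invented): each of the M binary SVMs is trained on "class k vs. the rest", and the class whose SVM pushes the new instance furthest into its positive region wins.

```python
# One-against-the-rest multi-class classification with M binary SVMs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = np.vstack([rng.randn(20, 2) + c for c in centers])
y = np.repeat([0, 1, 2], 20)

# One SVM per class: class k (+1) against all the others (-1).
machines = [SVC(kernel="linear").fit(X, np.where(y == k, 1, -1)) for k in range(3)]

def predict(x):
    # The class whose SVM gives the largest decision value wins.
    scores = [m.decision_function([x])[0] for m in machines]
    return int(np.argmax(scores))

print([predict(x) for x in ([0.2, -0.1], [4.1, 0.3], [-0.5, 4.2])])   # expected: [0, 1, 2]
```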
SVM Implementations

• SVMlight
• LIBSVM
• mySVM
• Matlab
• Huller
• ...