Support Vector Machine
(and Statistical Learning Theory)
Tutorial

Jason Weston
NEC Labs America
4 Independence Way, Princeton, USA.
jasonw@nec-labs.com
1 Support Vector Machines: history

SVMs introduced in COLT-92 by Boser, Guyon & Vapnik. Became rather popular since.

Theoretically well motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s.

Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, ...)
2 Support Vector Machines: history II

Centralized website: www.kernel-machines.org.

Several textbooks, e.g. "An Introduction to Support Vector Machines" by Cristianini and Shawe-Taylor is one.

A large and diverse community works on them: from machine learning, optimization, statistics, neural networks, functional analysis, etc.
3 Support Vector Machines: basics   [Boser, Guyon, Vapnik '92], [Cortes & Vapnik '95]

[Figure: two classes of points (+ and -) separated by a hyperplane, with the margin marked on both sides.]

Nice properties: convex, theoretically motivated, nonlinear with kernels..
4 Preliminaries

Machine learning is about learning structure from data.

Although the class of algorithms called SVMs can do more, in this talk we focus on pattern recognition.

So we want to learn the mapping: X → Y, where x ∈ X is some object and y ∈ Y is a class label.

Let's take the simplest case: 2-class classification. So: x ∈ R^n, y ∈ {±1}.
5 Example

Suppose we have 50 photographs of elephants and 50 photos of tigers. We digitize them into 100 x 100 pixel images, so we have x ∈ R^n where n = 10,000.

Now, given a new (different) photograph we want to answer the question: is it an elephant or a tiger? [We assume it is one or the other.]
6 Training sets and prediction models

input/output sets X, Y

training set (x_1, y_1), ..., (x_m, y_m)

generalization: given a previously unseen x ∈ X, find a suitable y ∈ Y.

i.e., want to learn a classifier: y = f(x, α), where α are the parameters of the function.

For example, if we are choosing our model from the set of hyperplanes in R^n, then we have:

f(x, {w, b}) = sign(w · x + b).
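As a minimal sketch, the hyperplane decision rule above can be written directly in Python (the weight vector and bias below are hypothetical, chosen only to illustrate the sign rule):

```python
def f(x, w, b):
    """Hyperplane decision function: sign(<w, x> + b)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

# Hypothetical 2-D parameters: the line x1 = x2.
w, b = [1.0, -1.0], 0.0
print(f([2.0, 0.5], w, b))  # 1   (point below the line x1 = x2)
print(f([0.5, 2.0], w, b))  # -1  (point above it)
```

Everything that follows, up to kernels, is about how to choose w and b well.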
7 Empirical Risk and the true Risk

We can try to learn f(x, α) by choosing a function that performs well on training data:

R_emp(α) = (1/m) Σ_{i=1}^m ℓ(f(x_i, α), y_i) = Training Error

where ℓ is the zero-one loss function: ℓ(y, ŷ) = 1 if y ≠ ŷ, and 0 otherwise. R_emp is called the empirical risk.

By doing this we are trying to minimize the overall risk:

R(α) = ∫ ℓ(f(x, α), y) dP(x, y) = Test Error

where P(x, y) is the (unknown) joint distribution function of x and y.
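The empirical risk is straightforward to compute; a short sketch on a toy 1-D classifier and hypothetical data (both invented here purely for illustration):

```python
def zero_one_loss(y, y_hat):
    """Zero-one loss: 1 if the labels disagree, 0 otherwise."""
    return 0 if y == y_hat else 1

def empirical_risk(f, data):
    """R_emp = (1/m) * sum of zero-one losses over the training set."""
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)

# Toy classifier (sign of x) and toy training set.
clf = lambda x: 1 if x >= 0 else -1
data = [(2.0, 1), (-1.0, -1), (0.5, -1), (-3.0, -1)]
print(empirical_risk(clf, data))  # 0.25: 1 of 4 points misclassified
```

The true risk R(α) cannot be computed this way, since P(x, y) is unknown; only the empirical risk is observable.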
8 Choosing the set of functions

What about f(x, α) allowing all functions from X to {±1}?

Training set (x_1, y_1), ..., (x_m, y_m) ∈ X × {±1}
Test set x̄_1, ..., x̄_m ∈ X,
such that the two sets do not intersect.

For any f there exists f*:

1. f*(x_i) = f(x_i) for all i
2. f*(x̄_j) ≠ f(x̄_j) for all j

Based on the training data alone, there is no means of choosing which function is better. On the test set however they give different results. So generalization is not guaranteed.

⇒ a restriction must be placed on the functions that we allow.
9 Empirical Risk and the true Risk

Vapnik & Chervonenkis showed that an upper bound on the true risk can be given by the empirical risk + an additional term (holding with probability 1 − η):

R(α) ≤ R_emp(α) + √( [h(log(2m/h) + 1) − log(η/4)] / m )

where h is the VC dimension of the set of functions parameterized by α.

The VC dimension of a set of functions is a measure of their capacity or complexity.

If you can describe a lot of different phenomena with a set of functions then the value of h is large.

[VC dim = the maximum number of points that can be separated in all possible ways by that set of functions.]
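A sketch of the capacity term of the bound above, to show its qualitative behaviour (η = 0.05 is an arbitrary choice for illustration):

```python
import math

def vc_confidence(h, m, eta=0.05):
    """Capacity term of the VC bound:
    sqrt((h*(log(2m/h) + 1) - log(eta/4)) / m)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

# A simpler model class (smaller h) tightens the bound...
print(vc_confidence(h=10, m=1000) < vc_confidence(h=100, m=1000))   # True
# ...and so does more training data (larger m).
print(vc_confidence(h=10, m=10000) < vc_confidence(h=10, m=1000))   # True
```

This is exactly the trade-off the next slides discuss: low training error pulls toward high-capacity models, but the capacity term pulls the other way.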
10 VC dimension

The VC dimension of a set of functions is the maximum number of points that can be separated in all possible ways by that set of functions. For hyperplanes in R^n, the VC dimension can be shown to be n + 1.

[Figure: three points in the plane shattered in all possible ways by lines.]
11 VC dimension and capacity of functions

Simplification of bound:

Test Error ≤ Training Error + Complexity of set of Models

Actually, a lot of bounds of this form have been proved (with different measures of capacity). The complexity function is often called a regularizer.

If you take a high capacity set of functions (explain a lot) you get low training error. But you might overfit.

If you take a very simple set of models, you have low complexity, but won't get low training error.
12 Capacity of a set of functions (classification)
13 Capacity of a set of functions (regression)

[Figure: regression example comparing a sine-curve fit and a hyperplane fit against the true function.]
14 Controlling the risk: model complexity

[Figure: structural risk minimization. Nested sets of functions S_1 ⊂ ... ⊂ S* ⊂ ... ⊂ S_n with VC dimensions h_1 < ... < h* < ... < h_n. The bound on the risk is the sum of the empirical risk (training error) and the confidence interval, and is minimized at h*.]
15 Capacity of hyperplanes

Vapnik & Chervonenkis also showed the following:

Consider hyperplanes (w · x) = 0 where w is normalized w.r.t. a set of points X* such that: min_i |w · x_i| = 1.

The set of decision functions f_w(x) = sign(w · x) defined on X* such that ||w|| ≤ A has a VC dimension satisfying

h ≤ R²A²

where R is the radius of the smallest sphere around the origin containing X*.

⇒ minimize ||w||² and have low capacity
⇒ minimizing ||w||² is equivalent to obtaining a large margin classifier

[Figure: hyperplane {x | ⟨w, x⟩ + b = 0} separating points with ⟨w, x⟩ + b > 0 from points with ⟨w, x⟩ + b < 0; margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = +1}, with the y_i = −1 points on one side and the y_i = +1 points on the other.]

Note: for points x_1, x_2 on the two margin hyperplanes,

⟨w, x_1⟩ + b = +1
⟨w, x_2⟩ + b = −1
⇒ ⟨w, (x_1 − x_2)⟩ = 2
⇒ ⟨w/||w||, (x_1 − x_2)⟩ = 2/||w||

so the margin width is 2/||w||.
16 Linear Support Vector Machines (at last!)

So, we would like to find the function which minimizes an objective like:

Training Error + Complexity term

We write that as:

(1/m) Σ_{i=1}^m ℓ(f(x_i, α), y_i) + Complexity term

For now we will choose the set of hyperplanes (we will extend this later), so f(x) = (w · x) + b:

(1/m) Σ_{i=1}^m ℓ(w · x_i + b, y_i) + ||w||²

subject to min_i |w · x_i| = 1.
17 Linear Support Vector Machines II

That function before was a little difficult to minimize because of the step function in ℓ(y, ŷ) (either 1 or 0).

Let's assume we can separate the data perfectly. Then we can optimize the following:

Minimize ||w||², subject to:

(w · x_i + b) ≥ 1, if y_i = 1
(w · x_i + b) ≤ −1, if y_i = −1

The last two constraints can be compacted to:

y_i(w · x_i + b) ≥ 1

This is a quadratic program.
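A quick sketch of the constraint check (the data and candidate weight vectors below are hypothetical, invented for illustration). Among all feasible (w, b), the QP picks the one with smallest ||w||², i.e. the largest margin:

```python
def satisfies_margin(w, b, data):
    """Check the hard-margin constraints y_i * (<w, x_i> + b) >= 1."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in data
    )

# Toy separable data.
data = [([2.0, 2.0], 1), ([-2.0, -2.0], -1)]
print(satisfies_margin([0.5, 0.5], 0.0, data))  # True: feasible
print(satisfies_margin([0.1, 0.1], 0.0, data))  # False: ||w|| too small,
                                                # constraints violated
```

Shrinking ||w|| widens the margin 2/||w||, until the constraints y_i(w · x_i + b) ≥ 1 become tight at the support vectors.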
18 SVMs: non-separable case

To deal with the non-separable case, one can rewrite the problem as:

Minimize:

||w||² + C Σ_{i=1}^m ξ_i

subject to:

y_i(w · x_i + b) ≥ 1 − ξ_i,   ξ_i ≥ 0

This is just the same as the original objective:

(1/m) Σ_{i=1}^m ℓ(w · x_i + b, y_i) + ||w||²

except ℓ is no longer the zero-one loss, but is called the hinge loss: ℓ(y, ŷ) = max(0, 1 − y·ŷ). This is still a quadratic program!

[Figure: two classes separated with a margin; a point on the wrong side of its margin hyperplane, at distance ξ_i from it.]
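At the optimum, each slack variable equals the hinge loss of its point, ξ_i = max(0, 1 − y_i f(x_i)); a minimal sketch:

```python
def hinge(y, score):
    """Hinge loss max(0, 1 - y*score); equals the optimal slack xi_i."""
    return max(0.0, 1.0 - y * score)

print(hinge(1, 2.0))   # 0.0  correctly classified, outside the margin
print(hinge(1, 0.5))   # 0.5  correctly classified, but inside the margin
print(hinge(1, -1.0))  # 2.0  misclassified
```

So C trades off margin width against the total hinge loss of the training points.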
19 Support Vector Machines: Primal

Decision function:

f(x) = w · x + b

Primal formulation:

min P(w, b) = (1/2)||w||²  +  C Σ_i H_1[y_i f(x_i)]
              (maximize margin)  (minimize training error)

Ideally H_1 would count the number of errors; approximate with:

Hinge loss H_1(z) = max(0, 1 − z)

[Figure: plot of the hinge loss H_1(z): zero for z ≥ 1, increasing linearly as z decreases below 1.]
20 SVMs: nonlinear case

Linear classifiers aren't complex enough sometimes. SVM solution: map data into a richer feature space including nonlinear features, then construct a hyperplane in that space so all other equations are the same!

Formally, preprocess the data with:

x ↦ Φ(x)

and then learn the map from Φ(x) to y:

f(x) = w · Φ(x) + b.
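A minimal sketch of this idea, using a hypothetical 1-D map x ↦ (x, x²) (not one of the maps used later in the tutorial): the data is not linearly separable on the line, but becomes separable in feature space.

```python
def phi(x):
    """Hypothetical feature map R -> R^2: x -> (x, x^2)."""
    return (x, x * x)

def f(x, w=(0.0, 1.0), b=-1.0):
    """Linear classifier in feature space: <w, phi(x)> + b."""
    z = phi(x)
    return w[0] * z[0] + w[1] * z[1] + b

# Points -2 and +2 (class +1) surround point 0 (class -1): no threshold on
# the line separates them, but the feature-space hyperplane x^2 = 1 does.
print(f(-2.0) > 0, f(2.0) > 0, f(0.0) < 0)  # True True True
```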
21 SVMs: polynomial mapping

Φ : R² → R³
(x_1, x_2) ↦ (z_1, z_2, z_3) := (x_1², √2 x_1 x_2, x_2²)

[Figure: data in the (x_1, x_2) plane mapped into the (z_1, z_2, z_3) space, where it becomes linearly separable.]
22 SVMs: nonlinear case II

For example MNIST handwriting recognition.

60,000 training examples, 10,000 test examples, 28 x 28 pixel images.

Linear SVM has around 8.5% test error.
Polynomial SVM has around 1% test error.

[Figure: sample MNIST digit images with their labels.]
23 SVMs: full MNIST results

Classifier                  Test Error
linear                      8.4%
3-nearest-neighbor          2.4%
RBF-SVM                     1.4%
Tangent distance            1.1%
LeNet                       1.1%
Boosted LeNet               0.7%
Translation invariant SVM   0.56%

Choosing a good mapping Φ(·) (encoding prior knowledge + getting the right complexity of function class) for your problem improves results.
24 SVMs: the kernel trick

Problem: the dimensionality of Φ(x) can be very large, making w hard to represent explicitly in memory, and hard for the QP to solve.

The Representer theorem (Kimeldorf & Wahba, 1971) shows that (for SVMs as a special case):

w = Σ_{i=1}^m α_i Φ(x_i)

for some variables α. Instead of optimizing w directly we can thus optimize α.

The decision rule is now:

f(x) = Σ_{i=1}^m α_i Φ(x_i) · Φ(x) + b

We call K(x_i, x) = Φ(x_i) · Φ(x) the kernel function.
25 Support Vector Machines: kernel trick II

We can rewrite all the SVM equations we saw before, but with the w = Σ_{i=1}^m α_i Φ(x_i) equation:

Decision function:

f(x) = Σ_i α_i Φ(x_i) · Φ(x) + b = Σ_i α_i K(x_i, x) + b

Dual formulation:

min P(w, b) = (1/2)||Σ_{i=1}^m α_i Φ(x_i)||²  +  C Σ_i H_1[y_i f(x_i)]
              (maximize margin)                 (minimize training error)
26 Support Vector Machines: Dual

But people normally write it like this:

Dual formulation:

min_α D(α) = (1/2) Σ_{i,j} α_i α_j Φ(x_i) · Φ(x_j) − Σ_i y_i α_i

s.t.  Σ_i α_i = 0,   0 ≤ y_i α_i ≤ C

Dual decision function:

f(x) = Σ_i α_i K(x_i, x) + b

The kernel function K(·, ·) is used to make an (implicit) nonlinear feature map, e.g.

Polynomial kernel: K(x, x') = (x · x' + 1)^d.
RBF kernel: K(x, x') = exp(−γ||x − x'||²).
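Both kernels are one-liners; a sketch with arbitrary illustrative defaults (d = 2, γ = 0.5):

```python
import math

def k_poly(x, xp, d=2):
    """Polynomial kernel K(x, x') = (x . x' + 1)^d."""
    return (sum(a * b for a, b in zip(x, xp)) + 1) ** d

def k_rbf(x, xp, gamma=0.5):
    """RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

print(k_poly([1.0, 2.0], [3.0, -1.0]))  # 4.0: (1*3 + 2*(-1) + 1)^2
print(k_rbf([0.0, 0.0], [0.0, 0.0]))    # 1.0: maximal at zero distance
```

Either function can be passed wherever K(x_i, x) appears in the dual decision function above.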
27 Polynomial SVMs

The kernel K(x, x') = (x · x')^d gives the same result as the explicit mapping + dot product that we described before:

Φ : R² → R³
(x_1, x_2) ↦ (z_1, z_2, z_3) := (x_1², √2 x_1 x_2, x_2²)

Φ(x_1, x_2) · Φ(x'_1, x'_2) = (x_1², √2 x_1 x_2, x_2²) · (x'_1², √2 x'_1 x'_2, x'_2²)
= x_1² x'_1² + 2 x_1 x'_1 x_2 x'_2 + x_2² x'_2²

is the same as:

K(x, x') = (x · x')² = ((x_1, x_2) · (x'_1, x'_2))² = (x_1 x'_1 + x_2 x'_2)²
= x_1² x'_1² + x_2² x'_2² + 2 x_1 x'_1 x_2 x'_2

Interestingly, if d is large the kernel still only requires n multiplications to compute, whereas the explicit representation may not fit in memory!
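The identity is easy to verify numerically; a sketch of the degree-2 case above (the test points are arbitrary):

```python
import math

def phi(x):
    """Explicit degree-2 map R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k_poly2(x, xp):
    """Homogeneous polynomial kernel (x . x')^2."""
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, xp = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(xp)), k_poly2(x, xp))  # 1.0 1.0 -- identical
```

The kernel side costs 2 multiplications and a square; the explicit side builds two 3-D vectors first. For degree d in R^n, that gap becomes O(n) versus O(n^d).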
28 RBF SVMs

The RBF kernel K(x, x') = exp(−γ||x − x'||²) is one of the most popular kernel functions. It adds a "bump" around each data point:

f(x) = Σ_{i=1}^m α_i exp(−γ||x_i − x||²) + b

[Figure: two points x and x' on the input line, mapped to bumps Φ(x) and Φ(x') in feature space.]
29 SVMs: more results

There is much more in the field of SVMs/kernel machines than we could cover here, including:

Regression, clustering, semi-supervised learning and other domains.

Lots of other kernels, e.g. string kernels to handle text.

Lots of research in modifications, e.g. to improve generalization ability, or tailoring to a particular task.

Lots of research in speeding up training.

Please see textbooks such as the ones by Cristianini & Shawe-Taylor or by Schoelkopf and Smola.
30 SVMs: software

Lots of SVM software:

LibSVM (C++)
SVMLight (C)

As well as complete machine learning toolboxes that include SVMs:

Torch (C++)
Spider (Matlab)
Weka (Java)

All available through www.kernel-machines.org.