SVM, May 2007, DOE-PI, Dianne P. O'Leary, (c) 2007
Speeding the Training of Support Vector Machines
and Solution of Quadratic Programs

Dianne P. O'Leary
Computer Science Dept. and
Institute for Advanced Computer Studies
University of Maryland

Jin Hyuk Jung
Computer Science Dept.

Andre Tits
Dept. of Electrical and Computer Engineering and
Institute for Systems Research

Work supported by the U.S. Department of Energy.
The Plan

- Introduction to SVMs
- Our algorithm
- Convergence results
- Examples
The problem: training an SVM

Given:
- A set of sample data points $a_i$ in sample space $S$, with labels $d_i = \pm 1$, $i = 1, \ldots, m$.

Find:
- A hyperplane $\{x : \langle w, x \rangle - \gamma = 0\}$ such that
  $\mathrm{sign}(\langle w, a_i \rangle - \gamma) = d_i$,
  or, ideally,
  $d_i (\langle w, a_i \rangle - \gamma) \ge 1$.
Which hyperplane is best?

We want to maximize the separation margin $1/\|w\|$.
Generalization 1

We might map a more general separator to a hyperplane through some transformation. For simplicity, we will assume that this mapping has already been done.
Generalization 2

If there is no separating hyperplane, we might want to balance maximizing the separation margin with a penalty for misclassifying data by putting it on the wrong side of the hyperplane. This is the soft-margin SVM.

We introduce slack variables $y \ge 0$ and relax the constraints

$$
d_i(\langle w, a_i \rangle - \gamma) \ge 1
\quad\text{to}\quad
d_i(\langle w, a_i \rangle - \gamma) \ge 1 - y_i.
$$

Instead of minimizing $\|w\|$, we solve

$$
\min_{w,\gamma,y} \; \tfrac{1}{2}\|w\|_2^2 + \tau\, e^T y
$$

for some $\tau > 0$, subject to the relaxed constraints.
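For orientation only (this is not the authors' method, which is the interior point approach developed below), the soft-margin problem can be handed to a general-purpose modeling package. A minimal sketch using cvxpy on made-up data, with the penalty parameter written as tau:

    import numpy as np
    import cvxpy as cp

    # Made-up data: m patterns in R^2 with labels +/-1 (purely illustrative).
    rng = np.random.default_rng(0)
    A = np.vstack([rng.normal(+1.0, 1.0, (20, 2)), rng.normal(-1.0, 1.0, (20, 2))])
    d = np.concatenate([np.ones(20), -np.ones(20)])
    m, n = A.shape
    tau = 1.0  # penalty parameter, tau > 0

    w = cp.Variable(n)
    gamma = cp.Variable()
    y = cp.Variable(m)

    # min 1/2 ||w||^2 + tau e^T y   s.t.  d_i(<w, a_i> - gamma) >= 1 - y_i,  y >= 0
    constraints = [cp.multiply(d, A @ w - gamma) >= 1 - y, y >= 0]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + tau * cp.sum(y)),
                         constraints)
    problem.solve()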
Jargon Summary

The process of determining $w$ and $\gamma$ is called training the machine.

After training, given a new data point $x$, we simply calculate $\mathrm{sign}(\langle w, x \rangle - \gamma)$ to classify it as in either the positive or negative group.

This process is thought of as a machine, called the support vector machine (SVM).

We will see that training the machine involves solving a convex quadratic programming problem whose number of variables is the dimension $n$ of the sample space and whose number of constraints is the number $m$ of sample points, typically very large.
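As a tiny illustration of the classification rule (a sketch; the names are ours):

    import numpy as np

    def classify(w, gamma, x):
        """Assign a new point x to the positive or negative group: sign(<w, x> - gamma)."""
        return 1 if np.dot(w, x) - gamma >= 0 else -1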
Primal and dual

Primal problem:

$$
\min_{w,\gamma,y} \; \tfrac{1}{2}\|w\|_2^2 + \tau\, e^T y
\quad \text{s.t.} \quad D(Aw - \gamma e) + y \ge e, \;\; y \ge 0.
$$

Dual problem:

$$
\max_{v} \; -\tfrac{1}{2} v^T H v + e^T v
\quad \text{s.t.} \quad e^T D v = 0, \;\; 0 \le v \le \tau e,
$$

where $H = D A A^T D \in \mathbb{R}^{m \times m}$ is a symmetric and positive semidefinite matrix with $h_{ij} = d_i d_j \langle a_i, a_j \rangle$.
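A minimal numpy sketch of assembling the dual matrix $H = DAA^TD$ (the pattern matrix A and label vector d here are placeholder data):

    import numpy as np

    # Placeholder data: m patterns (rows of A) and labels d in {+1, -1}.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 5))
    d = rng.choice([-1.0, 1.0], size=100)

    D = np.diag(d)                 # D = diag(d)
    H = D @ (A @ A.T) @ D          # h_ij = d_i d_j <a_i, a_j>
    # H is m x m, symmetric positive semidefinite, with rank at most n.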
Support vectors

Support vectors (SVs) are the patterns that contribute to defining the classifier. They are associated with nonzero $v_i$.

                          $v_i$           $s_i$            $y_i$
    Support vector        $(0, \tau]$     $0$              $[0, \infty)$
    On-boundary SV        $(0, \tau)$     $0$              $0$
    Off-boundary SV       $\tau$          $0$              $(0, \infty)$
    Nonsupport vector     $0$             $(0, \infty)$    $0$

$v_i$: dual variable (Lagrange multiplier for relaxed constraints).
$s_i$: slack variable for primal inequality constraints.
$y_i$: slack variable in relaxed constraints.
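A small sketch of sorting patterns into these categories after a solve (a hypothetical helper; exact zeros are replaced by a numerical tolerance):

    import numpy as np

    def sv_categories(v, s, y, tau, tol=1e-8):
        """Sort patterns into the classes of the table above, up to a tolerance."""
        on_boundary  = (v > tol) & (v < tau - tol) & (s < tol) & (y < tol)
        off_boundary = (np.abs(v - tau) < tol) & (s < tol) & (y > tol)
        nonsupport   = (v < tol) & (s > tol) & (y < tol)
        return on_boundary, off_boundary, nonsupport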
Solving the SVM problem

Apply standard optimization machinery:

- Write down the optimality conditions for the primal/dual formulation using the Lagrange multipliers. This is a system of nonlinear equations.
- Apply a (Mehrotra-style predictor-corrector) interior point method (IPM) to solve the nonlinear equations by tracing out a path from a given starting point to the solution.
The optimality conditions

$$
\begin{aligned}
w - A^T D v &= 0,\\
d^T v &= 0,\\
\tau e - v - u &= 0,\\
DAw - \gamma d + y - e - s &= 0,\\
Sv &= \mu e,\\
Yu &= \mu e,\\
s, u, v, y &\ge 0,
\end{aligned}
$$

for $\mu = 0$.

Interior point method (IPM): Let $0 < \sigma < 1$ be a centering parameter and let

$$
\mu = \frac{s^T v + y^T u}{2m}.
$$

Follow the path traced as $\mu \to 0$, $\sigma = 1$.
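A small numpy sketch (our own notation, not the authors' code) that evaluates the residuals of these conditions together with the duality measure $\mu$:

    import numpy as np

    def kkt_residuals(A, d, tau, w, gamma, v, u, s, y):
        """Residuals of the optimality conditions above and the duality measure mu."""
        e = np.ones_like(v)
        r1 = w - A.T @ (d * v)                      # w - A^T D v
        r2 = d @ v                                  # d^T v
        r3 = tau * e - v - u                        # tau*e - v - u
        r4 = d * (A @ w) - gamma * d + y - e - s    # DAw - gamma*d + y - e - s
        mu = (s @ v + y @ u) / (2 * len(v))         # duality measure
        return r1, r2, r3, r4, mu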
A step of the IPM

At each step of the IPM, the next point on the path is computed using a variant of Newton's method applied to the system of equations. This gives a linear system for the search direction:

$$
\begin{bmatrix}
I & -A^T D & & & & \\
  & d^T & & & & \\
  & -I & -I & & & \\
DA & & & -I & I & -d \\
  & S & & V & & \\
  & & Y & & U &
\end{bmatrix}
\begin{bmatrix}
\Delta w \\ \Delta v \\ \Delta u \\ \Delta s \\ \Delta y \\ \Delta \gamma
\end{bmatrix}
=
\begin{bmatrix}
-(w - A^T D v) \\
-d^T v \\
-(\tau e - v - u) \\
-(DAw - \gamma d + y - e - s) \\
-(Sv - \sigma\mu e) \\
-(Yu - \sigma\mu e)
\end{bmatrix}
$$

By block elimination, we can reduce this system to $M \Delta w = \text{some vector}$ (sometimes called the normal equations).
The matrix for the normal equations

$M \Delta w = \text{some vector}$

$$
M = I + A^T D\,\Omega^{-1} D A - \frac{\bar{d}\,\bar{d}^T}{d^T \Omega^{-1} d}.
$$

Here, $D$ and $\Omega$ are diagonal, $\bar{d} = A^T D\,\Omega^{-1} d$, and

$$
\omega_i^{-1} = \frac{v_i u_i}{s_i u_i + y_i v_i}.
$$

Given $\Delta w$, we can easily find the other components of the direction.
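A numpy sketch of assembling M from these formulas (our own illustration; the authors' implementation is in Matlab):

    import numpy as np

    def form_normal_matrix(A, d, v, u, s, y):
        """Assemble M = I + A^T D Omega^{-1} D A - (dbar dbar^T) / (d^T Omega^{-1} d)."""
        n = A.shape[1]
        omega_inv = (v * u) / (s * u + y * v)    # omega_i^{-1}; large for on-boundary SVs
        DA = d[:, None] * A                      # rows d_i * a_i^T, i.e. the matrix D A
        M = np.eye(n) + DA.T @ (omega_inv[:, None] * DA)
        dbar = A.T @ (d * omega_inv * d)         # dbar = A^T D Omega^{-1} d
        M -= np.outer(dbar, dbar) / (d @ (omega_inv * d))
        return M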
TwovariantsoftheIPMalgorithm
In
anescalingalgorithms
,=0.Newton'smethodthenaimstosolve
theoptimalityconditionsfortheoptimizationproblem.
Disadvantage:
Wemightnotbeinthedomainoffastconvergencefor
Newton.
In
predictor-correctoralgorithms
,theanescalingstepisusedto
computeacorrectorstepwith>0.Thisstepdrawsusbacktoward
thepath.
Advantage:
Superlinearconvergencecanbeproved.
19
Examining

$$
M = I + A^T D\,\Omega^{-1} D A - \frac{\bar{d}\,\bar{d}^T}{d^T \Omega^{-1} d}:
$$

Most expensive calculation: Forming M.

Our approach is to modify Newton's method by using an approximation to the middle and last terms. We'll discuss the middle one; the third is similar.

The middle term is

$$
A^T D\,\Omega^{-1} D A = \sum_{i=1}^m \frac{1}{\omega_i}\, a_i a_i^T,
\qquad
\omega_i^{-1} = \frac{v_i u_i}{s_i u_i + y_i v_i}.
$$

Note that $\omega_i^{-1}$ is well-behaved except for on-boundary support vectors, since in that case $s_i \to 0$ and $y_i \to 0$.
Our idea:

We only include terms corresponding to constraints we hypothesize are active at the optimal solution.

We could
- ignore the value of $d_i$, or
- balance the number of positive and negative patterns included.

The number of terms is constrained to be
- at least $q_L$, which is the minimum of $n$ and the number of $\omega^{-1}$ values greater than the square root of some parameter, and
- at most $q_U$, which is a fixed fraction of $m$.
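One possible realization of this selection rule, as a simplified sketch (the threshold argument, the count rule, and the tie-breaking are placeholders, not the authors' exact choices):

    import numpy as np

    def select_patterns(omega_inv, d, q_lower, q_upper, threshold):
        """Choose indices of the largest omega^{-1} values, balanced between the two
        classes, with the count clipped to [q_lower, q_upper] (simplified sketch)."""
        q = int(np.clip(np.count_nonzero(omega_inv > threshold), q_lower, q_upper))
        pos = np.flatnonzero(d > 0)
        neg = np.flatnonzero(d < 0)
        # Take roughly q/2 patterns from each class, ranked by omega^{-1}.
        take_pos = pos[np.argsort(omega_inv[pos])[::-1][: (q + 1) // 2]]
        take_neg = neg[np.argsort(omega_inv[neg])[::-1][: q // 2]]
        return np.concatenate([take_pos, take_neg])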
Some related work

- Use of approximations to M in LP-IPMs dates back to Karmarkar (1984), and adaptive inclusion of terms was studied, for example, by Wang and O'Leary (2000).
- Osuna, Freund, and Girosi (1997) proposed solving a sequence of CQPs, building up patterns as new candidates for support vectors are identified.
- Joachims (1998) and Platt (1999) used variants related to Osuna et al.
- Ferris and Munson (2002) focused on efficient solution of normal equations.
- Gertz and Griffin (2005) used preconditioned CG, with a preconditioner based on neglecting terms in M.
Convergence analysis for the Predictor-Corrector Algorithm

All of our convergence results are derived from a general convergence analysis for adaptive constraint reduction for convex quadratic programming (since our problem is a special case).
Assumptions:

- $N(A_Q) \cap N(H) = \{0\}$ for all index sets $Q$ with $|Q| \ge$ the number of active constraints at the solution. (Automatic for SVM training.)
- We start at a point that satisfies the inequality constraints in the primal.
- The solution set is nonempty and bounded.
- The gradients of the active constraints at any primal feasible point are linearly independent.

Theorem: The algorithm converges to a solution point.
Additional Assumptions:

- The solution set contains just one point.
- Strict complementarity holds at the solution, meaning that $s^* + v^* > 0$ and $y^* + u^* > 0$.

Theorem: The convergence rate is q-quadratic.
Test problems

Provided by Josh Griffin (SANDIA)

    Problem        n     Patterns (+,-)   SVs (+,-)      In-bound SVs (+,-)
    mushroom       276   (4208, 3916)     (1146, 1139)   (31, 21)
    isolet         617   (300, 7497)      (74, 112)      (74, 112)
    waveform       861   (1692, 3308)     (633, 638)     (110, 118)
    letter-recog   153   (789, 19211)     (266, 277)     (10, 30)
Algorithm variants

We used the reduced predictor-corrector algorithm, with constraints chosen in one of the following ways:

- One-sided distance: Use all points on the wrong side of the boundary planes and all points close to the boundary planes.
- Distance: Use all points close to the boundary planes.
- $\omega^{-1}$: Use all points with large values of $\omega_i^{-1}$.

We used a balanced selection, choosing approximately equal numbers of positive and negative examples.

We compared with no reduction, using all of the terms in forming M.
[Figure: training time (sec) on the mushroom, isolet, waveform, and letter problems, comparing no reduction with the one-sided distance and distance selection rules.]
[Figure: letter / distance: time (sec) and iteration count as a function of $q_U/m$, for adaptive balanced, adaptive, nonadaptive balanced, and nonadaptive selection.]
[Figure: mushroom / Adaptive Balanced CR / distance: number of patterns used per iteration, and convergence history, for $q_U$ = 100%, 80%, 60%, 45% and for no reduction.]
Comparison with other software

    Problem    Type               LibSVM   SVMLight   Matlab   Ours
    mushroom   Polynomial         5.8      52.2       128      0.7
    mushroom   Mapping (Linear)   30.7     60.2       710.1    4.2
    isolet     Linear             6.5      30.8       323.9    20.1
    waveform   Polynomial         2.9      23.5       840      4.1
    waveform   Mapping (Linear)   33.0     85.8       1361.8   16.2
    letter     Polynomial         2.8      55.8       283      1.2
    letter     Mapping (Linear)   11.6     45.9       287.4    13.5

- LibSVM, by Chih-Chung Chang and Chih-Jen Lin, uses a variant of SMO (by Platt), implemented in C.
- SVMLight, by Joachims, is implemented in C.
- Matlab's program is a variant of SMO.
- Our program is implemented in Matlab, so a speed-up may be possible if converted to C.
How our algorithm works

To visualize the iteration, we constructed a toy problem with
- n = 2,
- a mapping corresponding to an ellipsoidal separator.

We now show snapshots of the patterns that contribute to M as the IPM iteration proceeds.
[Snapshots of the patterns contributing to M:]

    Iteration: 2,   # of obs: 1727
    Iteration: 5,   # of obs: 1440
    Iteration: 8,   # of obs: 1026
    Iteration: 11,  # of obs: 376
    Iteration: 14,  # of obs: 170
    Iteration: 17,  # of obs: 42
    Iteration: 20,  # of obs: 4
    Iteration: 23,  # of obs: 4
An extension

Recall that the matrix in the dual problem is $H = DAA^TD$, and $A$ is $m \times n$ with $m \gg n$.

The elements of $K \equiv AA^T$ are

$$
k_{ij} = a_i^T a_j,
$$

and our adaptive reduction is efficient because we approximate the term $A^T \Omega^{-1} A$ in the formation of M, which is only $n \times n$.

Often we want a kernel function more general than the inner product, so that

$$
k_{ij} = k(a_i, a_j).
$$

This would make M $m \times m$.

Efficiency is retained if we can approximate $K \approx LL^T$ where $L$ has rank much less than $m$. This is often the case, and pivoted Cholesky algorithms have been used in the literature to compute $L$.
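A minimal sketch of a greedy pivoted Cholesky factorization producing such a low-rank $L$ with $K \approx LL^T$ (our own illustration; the variants used in the literature add blocking and more careful stopping rules):

    import numpy as np

    def pivoted_cholesky(K, max_rank, tol=1e-10):
        """Greedy pivoted Cholesky: returns L with K ~= L @ L.T and rank <= max_rank."""
        m = K.shape[0]
        diag = np.diag(K).astype(float).copy()
        L = np.zeros((m, max_rank))
        for r in range(max_rank):
            i = int(np.argmax(diag))           # pivot on the largest residual diagonal
            if diag[i] <= tol:                 # remaining residual is negligible
                return L[:, :r]
            L[:, r] = (K[:, i] - L @ L[i, :]) / np.sqrt(diag[i])
            diag -= L[:, r] ** 2
        return L

With such an L in hand, products with K can be carried out through L, so the kernel case can retain costs closer to those of the linear case.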
Conclusions

- We have succeeded in significantly improving the training of SVMs that have large numbers of training points.
- Similar techniques apply to general CQP problems with a large number of constraints.
- The savings are primarily in later iterations. Future work will focus on using clustering of patterns (e.g., Boley and Cao (2004)) to reduce work in early iterations.