A Bahadur Representation of the Linear Support Vector Machine

Yoonkyung Lee
Department of Statistics
The Ohio State University
October 7, 2008
Data Mining and Statistical Learning Study Group
Outline

- Support Vector Machines
- Statistical Properties of SVM
- Main Results (asymptotic analysis of the linear SVM)
- An Illustrative Example
- Discussion
Applications

- Handwritten digit recognition
- Cancer diagnosis with microarray data
- Text categorization
Classication

x=(x
1
,...,x
d
)∈R
d

y∈Y={1,...,k}

Learnaruleφ:R
d
→Yfromthetrainingdata
{(x
i
,y
i
),i=1,...,n},where(x
i
,y
i
)arei.i.d.withP(X,Y).

The0-1lossfunction:
L(y,φ(x))=I(y6=φ(x))
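As a quick illustration (my own sketch, not part of the slides), the 0-1 loss turns into an empirical misclassification rate for any rule φ; the rule and data below are hypothetical placeholders.

```python
import numpy as np

def zero_one_risk(phi, X, y):
    """Empirical 0-1 risk: the fraction of points with phi(x_i) != y_i."""
    preds = np.array([phi(x) for x in X])
    return np.mean(preds != y)

# hypothetical rule: class 1 if the first coordinate is positive, else class 2
phi = lambda x: 1 if x[0] > 0 else 2
X = np.array([[0.2, 1.0], [-0.5, 0.3], [1.2, -0.7]])
y = np.array([1, 2, 2])
print(zero_one_risk(phi, X, y))   # one of the three points is misclassified
```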
Methods of Regularization (Penalization)

Find f(x) ∈ F minimizing

  (1/n) Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f).

- Empirical risk + penalty
- F: a class of candidate functions
- J(f): complexity of the model f
- λ > 0: a regularization parameter
- Without the penalty J(f), the problem is ill-posed.
Maximum Margin Hyperplane

[Figure: separating hyperplane wᵀx + b = 0 with margin boundaries wᵀx + b = -1 and wᵀx + b = 1; margin = 2/‖w‖]
Support Vector Machines

Boser, Guyon, & Vapnik (1992); Vapnik (1995), The Nature of Statistical Learning Theory; a discussion paper (2006) in Statistical Science

- y_i ∈ {−1, 1}: class labels in the binary case
- Find f ∈ F = {f(x) = wᵀx + b | w ∈ R^d and b ∈ R} minimizing

  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖w‖²,

  where λ is a regularization parameter.
- Classification rule: φ(x) = sign(f(x))
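As a minimal sketch of this optimization (my own illustration; the subgradient method, step size, and iteration count are arbitrary choices, not anything prescribed by the slides):

```python
import numpy as np

def linear_svm(X, y, lam=0.1, n_iter=2000, lr=0.01):
    """Minimize (1/n) * sum_i (1 - y_i (w'x_i + b))_+ + lam * ||w||^2
    by subgradient descent; y must take values in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margin = y * (X @ w + b)
        active = margin < 1                     # points with positive hinge loss
        grad_w = -(X[active] * y[active][:, None]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# classification rule: phi(x) = sign(w'x + b)
```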
Hinge Loss

[Figure: the hinge loss (1 − t)_+ and the 0-1 loss [−t]_* plotted against t = yf]

(1 − yf(x))_+ is an upper bound on the misclassification loss:

  I(y ≠ φ(x)) = [−yf(x)]_* ≤ (1 − yf(x))_+,

where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}.
SVM in General

Find f(x) = b + h(x) with h ∈ H_K (an RKHS) minimizing

  (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖h‖²_{H_K}.

- Linear SVM: H_K = {h(x) = wᵀx | w ∈ R^d} with
  i) K(x, x′) = xᵀx′
  ii) ‖h‖²_{H_K} = ‖wᵀx‖²_{H_K} = ‖w‖²
- Nonlinear SVM: K(x, x′) = (1 + xᵀx′)^d, exp(−‖x − x′‖²/(2σ²)), ...
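For concreteness, the kernels listed above can be written out directly; a small sketch of my own (the degree and bandwidth defaults are arbitrary):

```python
import numpy as np

def linear_kernel(x, xp):
    """K(x, x') = x'x, the kernel of the linear SVM."""
    return x @ xp

def polynomial_kernel(x, xp, degree=2):
    """K(x, x') = (1 + x'x)^degree."""
    return (1.0 + x @ xp) ** degree

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
```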
Statistical Properties of SVM

- Fisher consistency (Lin, DM&KD 2002):

  argmin_f E[(1 − Yf(X))_+ | X = x] = sign(p(x) − 1/2),

  where p(x) = P(Y = 1 | X = x).
- The SVM approximates the Bayes decision rule φ_B(x) = sign(p(x) − 1/2).
- Bayes risk consistency (Zhang, AOS 2004; Steinwart, IEEE IT 2005; Bartlett et al., JASA 2006):

  R(f̂_SVM) → R(φ_B) in probability,

  under a universal approximation condition on H_K.
- Rate of convergence (Steinwart et al., AOS 2007)
Main Questions

- Recursive Feature Elimination (Guyon et al., ML 2002): backward elimination of variables based on the fitted coefficients of the linear SVM
- What is the statistical behavior of the coefficients?
- What determines their variances?
- Study the asymptotic properties of the coefficients of the linear SVM.
Something New, Old, and Borrowed

- The hinge loss: not everywhere differentiable, no closed-form expression for the solution
- Useful link: sign(p(x) − 1/2), the population minimizer w.r.t. the hinge loss, is the median of Y at x.
- Asymptotics for least absolute deviation (LAD) estimators (Pollard, ET 1991)
- Convexity of the loss
Preliminaries

- (X, Y): a pair of random variables with X ∈ X ⊂ R^d and Y ∈ {1, −1}
- P(Y = 1) = π_+ and P(Y = −1) = π_−
- Let f and g be the densities of X given Y = 1 and Y = −1.
- With x̃ = (1, x_1, ..., x_d)ᵀ and β = (b, wᵀ)ᵀ,

  h(x; β) = wᵀx + b = x̃ᵀβ

  β̂_{λ,n} = argmin_β (1/n) Σ_{i=1}^n (1 − y_i h(x_i; β))_+ + λ‖w‖²
Population Version

- L(β) = E[(1 − Yh(X; β))_+]
- β* = argmin_β L(β)
- The gradient of L(β):

  S(β) = −E[ψ(1 − Yh(X; β)) Y X̃], where ψ(t) = I(t ≥ 0)

- The Hessian matrix of L(β):

  H(β) = E[δ(1 − Yh(X; β)) X̃ X̃ᵀ], where δ is the Dirac delta function.
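Since S(β) involves only the indicator ψ, it admits a direct plug-in (sample-average) estimate, whereas H(β) involves the Dirac delta and needs smoothing. A rough sketch under these assumptions (replacing δ with a Gaussian kernel of small bandwidth is my own choice, not an estimator proposed in the talk):

```python
import numpy as np

def augment(X):
    """Form x_tilde = (1, x_1, ..., x_d) for each row of X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def S_hat(beta, X, y):
    """Plug-in estimate of S(beta) = -E[psi(1 - Y h(X; beta)) Y X_tilde]."""
    Xt = augment(X)
    psi = ((1 - y * (Xt @ beta)) >= 0).astype(float)
    return -((psi * y)[:, None] * Xt).mean(axis=0)

def H_hat(beta, X, y, bandwidth=0.1):
    """Smoothed estimate of H(beta) = E[delta(1 - Y h(X; beta)) X_tilde X_tilde'],
    with the Dirac delta replaced by a Gaussian kernel (an assumption of this sketch)."""
    Xt = augment(X)
    t = 1 - y * (Xt @ beta)
    k = np.exp(-0.5 * (t / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return (Xt * k[:, None]).T @ Xt / len(y)
```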
More on H(β)

- The Hessian matrix of L(β):

  H(β) = E[δ(1 − Yh(X; β)) X̃ X̃ᵀ]

  H_{j,k}(β) = E[δ(1 − Yh(X; β)) X_j X_k] for 0 ≤ j, k ≤ d
             = π_+ ∫_X δ(1 − b − wᵀx) x_j x_k f(x) dx + π_− ∫_X δ(1 + b + wᵀx) x_j x_k g(x) dx.

- For a function s on X, define the Radon transform Rs of s, for p ∈ R and ξ ∈ R^d, as

  (Rs)(p, ξ) = ∫_X δ(p − ξᵀx) s(x) dx

  (the integral of s over the hyperplane ξᵀx = p).
- H_{j,k}(β) = π_+ (Rf_{j,k})(1 − b, w) + π_− (Rg_{j,k})(1 + b, −w),
  where f_{j,k}(x) = x_j x_k f(x) and g_{j,k}(x) = x_j x_k g(x).
Regularity Conditions

(A1) The densities f and g are continuous and have finite second moments.

(A2) There exists a ball B(x_0, δ_0) such that f(x) > C_1 and g(x) > C_1 for every x ∈ B(x_0, δ_0).

(A3) For some 1 ≤ i* ≤ d,

  ∫_X I{x_{i*} ≥ G⁻_{i*}} x_{i*} g(x) dx < ∫_X I{x_{i*} ≤ F⁺_{i*}} x_{i*} f(x) dx

  or

  ∫_X I{x_{i*} ≤ G⁺_{i*}} x_{i*} g(x) dx > ∫_X I{x_{i*} ≥ F⁻_{i*}} x_{i*} f(x) dx.

  (When π_+ = π_−, this says that the class means are different.)

(A4) Let M_+ = {x ∈ X | x̃ᵀβ* = 1} and M_− = {x ∈ X | x̃ᵀβ* = −1}. There exist subsets of M_+ and M_− on which the class densities f and g are bounded away from zero.
Bahadur Representation

- Bahadur (1966), A Note on Quantiles in Large Samples
- A statistical estimator is approximated by a sum of independent variables plus a higher-order remainder.
- Let ξ = F⁻¹(p) be the p-th quantile of the distribution F. For X_1, ..., X_n iid from F, the sample p-th quantile is

  ξ + [Σ_{i=1}^n I(X_i > ξ) − n(1 − p)] / (n f(ξ)) + R_n,

  where f(x) = F′(x).
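A quick numerical check of this representation for the sample median (p = 1/2) of a standard normal sample; a small sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 0.5
xi = 0.0                          # true median of N(0, 1)
f_xi = 1.0 / np.sqrt(2 * np.pi)   # standard normal density at the median

X = rng.standard_normal(n)
sample_quantile = np.quantile(X, p)
bahadur_approx = xi + (np.sum(X > xi) - n * (1 - p)) / (n * f_xi)

print(sample_quantile, bahadur_approx)   # the two agree up to the remainder R_n
```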
Bahadur-type Representation of the Linear SVM

Theorem. Suppose that (A1)-(A4) are met. For λ = o(n^{−1/2}), we have

  √n (β̂_{λ,n} − β*) = −(1/√n) H(β*)⁻¹ Σ_{i=1}^n ψ(1 − Y_i h(X_i; β*)) Y_i X̃_i + o_P(1).

Recall that H(β*) = E[δ(1 − Yh(X; β*)) X̃ X̃ᵀ] and ψ(t) = I(t ≥ 0).
Asymptotic Normality of β̂_{λ,n}

Theorem. Suppose (A1)-(A4) are satisfied. For λ = o(n^{−1/2}),

  √n (β̂_{λ,n} − β*) → N(0, H(β*)⁻¹ G(β*) H(β*)⁻¹)

in distribution, where G(β) = E[ψ(1 − Yh(X; β)) X̃ X̃ᵀ].

Corollary. Under the same conditions as in the Theorem,

  √n (h(x; β̂_{λ,n}) − h(x; β*)) → N(0, x̃ᵀ H(β*)⁻¹ G(β*) H(β*)⁻¹ x̃)

in distribution.
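Combining the pieces, the sandwich covariance H(β*)⁻¹ G(β*) H(β*)⁻¹ can be approximated with plug-in estimates; a rough sketch reusing `augment` and the smoothed `H_hat` from the earlier sketch (all of this is my own illustration, not the estimators studied in the paper):

```python
import numpy as np

def G_hat(beta, X, y):
    """Plug-in estimate of G(beta) = E[psi(1 - Y h(X; beta)) X_tilde X_tilde']."""
    Xt = augment(X)
    psi = ((1 - y * (Xt @ beta)) >= 0).astype(float)
    return (Xt * psi[:, None]).T @ Xt / len(y)

def sandwich_cov(beta_hat, X, y, bandwidth=0.1):
    """Approximate the asymptotic covariance of sqrt(n) * (beta_hat - beta*)."""
    G = G_hat(beta_hat, X, y)
    H = H_hat(beta_hat, X, y, bandwidth)
    H_inv = np.linalg.inv(H)
    return H_inv @ G @ H_inv

# pointwise variance of h(x; beta_hat) is approximately x_tilde' H^{-1} G H^{-1} x_tilde / n
```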
An Illustrative Example

- Two multivariate normal distributions in R^d with mean vectors μ_f and μ_g and a common covariance matrix Σ
- π_+ = π_− = 1/2.
- What is the relation between the Bayes decision boundary and the optimal hyperplane of the SVM, h(x; β*) = 0?

[Figure: the two class densities f and g]
Example

- The Bayes decision boundary (Fisher's LDA):

  {Σ⁻¹(μ_f − μ_g)}ᵀ {x − (1/2)(μ_f + μ_g)} = 0.

- The hyperplane determined by the SVM: x̃ᵀβ* = 0 with S(β*) = 0.
- β* balances the two classes within the margin:

  E[ψ(1 − Yh(X; β*)) Y X̃] = 0

  P(h(X; β*) ≤ 1 | Y = 1) = P(h(X; β*) ≥ −1 | Y = −1)

  E[I{h(X; β*) ≤ 1} X_j | Y = 1] = E[I{h(X; β*) ≥ −1} X_j | Y = −1]
Example

- Direct calculation shows that

  β* = C(d_Σ(μ_f, μ_g)) ( −(1/2)(μ_f + μ_g)ᵀ ; I_d ) Σ⁻¹(μ_f − μ_g),

  where the (d+1)×d matrix stacks the row −(1/2)(μ_f + μ_g)ᵀ on top of I_d, and d_Σ(μ_f, μ_g) is the Mahalanobis distance between the two distributions.
- The linear SVM is equivalent to Fisher's LDA.
- The assumptions (A1)-(A4) are satisfied, so the main theorem applies.
- Consider d = 1, μ_f + μ_g = 0, σ = 1, and d_Σ(μ_f, μ_g) = |μ_f − μ_g|.
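Because β* is proportional to the Fisher/LDA coefficients, the equivalence can be checked numerically by comparing the fitted SVM direction with Σ⁻¹(μ_f − μ_g); a small sketch reusing the `linear_svm` function from the earlier sketch (the sample size and λ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
mu_f, mu_g, Sigma = np.array([1.0, 1.0]), np.array([-1.0, -1.0]), np.eye(2)

X = np.vstack([rng.multivariate_normal(mu_f, Sigma, n // 2),
               rng.multivariate_normal(mu_g, Sigma, n // 2)])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

w_hat, b_hat = linear_svm(X, y, lam=1e-3)       # from the earlier sketch
lda_dir = np.linalg.solve(Sigma, mu_f - mu_g)   # Fisher's LDA direction

print(w_hat / np.linalg.norm(w_hat))            # should be close to ...
print(lda_dir / np.linalg.norm(lda_dir))        # ... the normalized LDA direction
```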
Distance and Margins

[Figure: class densities and the SVM margins for Mahalanobis distance d = 0.5, 1, 3, and 6]
Asymptotic Variance

Figure: The asymptotic variabilities of (a) the intercept and (b) the slope for the optimal hyperplane as a function of the Mahalanobis distance.
Bivariate Normal Example

- μ_f = (1, 1)ᵀ, μ_g = (−1, −1)ᵀ, and Σ = diag(1, 1)
- d_Σ(μ_f, μ_g) = 2√2 and the Bayes error rate is 0.07865.
- Find (β̂_0, β̂_1, β̂_2) = argmin_β (1/n) Σ_{i=1}^n (1 − y_i x̃_iᵀβ)_+

  Estimate   n = 100   n = 200   n = 500   Optimal coefficients
  β̂_0        0.0006    -0.0013   0.0022    0
  β̂_1        0.7709    0.7450    0.7254    0.7169
  β̂_2        0.7749    0.7459    0.7283    0.7169

Table: Averages of estimated optimal coefficients over 1000 replicates.
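A rough sketch of the Monte Carlo experiment behind this table (my own reconstruction; the optimizer, λ, the fixed 50/50 class split, and the reduced number of replicates are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_f, mu_g, Sigma = np.array([1.0, 1.0]), np.array([-1.0, -1.0]), np.eye(2)

def one_replicate(n):
    X = np.vstack([rng.multivariate_normal(mu_f, Sigma, n // 2),
                   rng.multivariate_normal(mu_g, Sigma, n // 2)])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    w, b = linear_svm(X, y, lam=1e-3)           # from the earlier sketch
    return np.array([b, w[0], w[1]])

for n in (100, 200, 500):
    avg = np.mean([one_replicate(n) for _ in range(200)], axis=0)
    print(n, avg)    # compare with the averages reported in the table above
```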
Sampling Distributions of β̂_0 and β̂_1

Figure: Estimated sampling distributions of (a) β̂_0 and (b) β̂_1, with the asymptotic densities overlaid.
Type I Error Rates

Figure: The median values of the type I error rates in variable selection, as a function of n, for d = 6, 12, 18, and 24, when μ_f = (1_{d/2}, 0_{d/2})ᵀ, μ_g = 0_d, and Σ = I_d.
Concluding Remarks

- Examine the asymptotic properties of the coefficients of variables in the linear SVM.
- Establish a Bahadur-type representation of the coefficients.
- The margins of the optimal hyperplane and the underlying probability distribution characterize the coefficients' statistical behavior.
- Variable selection for the SVM in the framework of hypothesis testing
- For practical applications, consistent estimators of G(β*) and H(β*) are needed.
- Extension of the SVM asymptotics to the nonlinear case
- Explore a different scenario where d also grows with n.
Reference

Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A Bahadur Representation of the Linear Support Vector Machine. Journal of Machine Learning Research. Available at www.stat.osu.edu/~yklee or http://www.jmlr.org/.