A Bahadur Representation of the Linear Support Vector Machine


Yoonkyung Lee
Department of Statistics
The Ohio State University
October 7, 2008
Data Mining and Statistical Learning Study Group
Outline
- Support Vector Machines
- Statistical Properties of SVM
- Main Results (asymptotic analysis of the linear SVM)
- An Illustrative Example
- Discussion
Applications
- Handwritten digit recognition
- Cancer diagnosis with microarray data
- Text categorization
Classication

x=(x
1
,...,x
d
)∈R
d

y∈Y={1,...,k}

Learnaruleφ:R
d
→Yfromthetrainingdata
{(x
i
,y
i
),i=1,...,n},where(x
i
,y
i
)arei.i.d.withP(X,Y).

The0-1lossfunction:
L(y,φ(x))=I(y6=φ(x))
Methods of Regularization (Penalization)
Find $f(x) \in \mathcal{F}$ minimizing
$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda J(f).$$
- Empirical risk + penalty
- $\mathcal{F}$: a class of candidate functions
- $J(f)$: the complexity of the model $f$
- $\lambda > 0$: a regularization parameter
- Without the penalty $J(f)$, the problem is ill-posed.
Maximum Margin Hyperplane
[Figure: a separating hyperplane $wx + b = 0$ with margin boundaries $wx + b = -1$ and $wx + b = 1$; margin $= 2/\|w\|$.]
Support Vector Machines
Boser, Guyon, & Vapnik (1992)
Vapnik (1995), The Nature of Statistical Learning Theory
A discussion paper (2006) in Statistical Science
- $y_i \in \{-1, 1\}$, class labels in the binary case
- Find $f \in \mathcal{F} = \{f(x) = w^\top x + b \mid w \in \mathbb{R}^d \text{ and } b \in \mathbb{R}\}$ minimizing
$$\frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \lambda \|w\|^2,$$
where $\lambda$ is a regularization parameter.
- Classification rule: $\phi(x) = \mathrm{sign}(f(x))$
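To make the objective concrete, here is a minimal sketch of this minimization by subgradient descent on the hinge loss plus the ridge penalty. The routine name, step size, and epoch count are illustrative choices, not the solver used in the talk:

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, n_epochs=200, lr=0.1):
    """Minimize (1/n) * sum_i (1 - y_i (w'x_i + b))_+ + lam * ||w||^2
    by subgradient descent; y must take values in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1              # points inside or beyond the margin
        # subgradient of the empirical hinge risk, plus the ridge term
        grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```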
Hinge Loss
[Figure: the 0-1 loss $[-t]_*$ and the hinge loss $(1-t)_+$ as functions of $t = yf$.]
$(1 - y f(x))_+$ is an upper bound of the misclassification loss function:
$$I(y \neq \phi(x)) = [-y f(x)]_* \leq (1 - y f(x))_+,$$
where $[t]_* = I(t \geq 0)$ and $(t)_+ = \max\{t, 0\}$.
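A quick numerical check of this bound (illustrative only):

```python
import numpy as np

t = np.linspace(-2, 2, 401)               # t = y f(x)
zero_one = (t <= 0).astype(float)         # [-t]_* = I(-t >= 0)
hinge = np.maximum(1 - t, 0)              # (1 - t)_+
assert np.all(hinge >= zero_one)          # the hinge loss dominates the 0-1 loss
```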
SVM in General
Find $f(x) = b + h(x)$ with $h \in \mathcal{H}_K$ (an RKHS) minimizing
$$\frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \lambda \|h\|^2_{\mathcal{H}_K}.$$
- Linear SVM: $\mathcal{H}_K = \{h(x) = w^\top x \mid w \in \mathbb{R}^d\}$ with
  i) $K(x, x') = x^\top x'$
  ii) $\|h\|^2_{\mathcal{H}_K} = \|w^\top x\|^2_{\mathcal{H}_K} = \|w\|^2$
- Nonlinear SVM: $K(x, x') = (1 + x^\top x')^d$, $\exp(-\|x - x'\|^2 / 2\sigma^2)$, ...
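For concreteness, the two kernels named above written as plain functions; a sketch, with hypothetical function names:

```python
import numpy as np

def polynomial_kernel(x, z, degree=3):
    """K(x, x') = (1 + x'x)^d"""
    return (1.0 + x @ z) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
```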
Statistical Properties of SVM
- Fisher consistency (Lin, DM&KD 2002):
$$\arg\min_f E\left[(1 - Y f(X))_+ \mid X = x\right] = \mathrm{sign}(p(x) - 1/2),$$
where $p(x) = P(Y = 1 \mid X = x)$.
- The SVM approximates the Bayes decision rule $\phi_B(x) = \mathrm{sign}(p(x) - 1/2)$.
- Bayes risk consistency (Zhang, AOS 2004; Steinwart, IEEE IT 2005; Bartlett et al., JASA 2006): $R(\hat{f}_{\mathrm{SVM}}) \to R(\phi_B)$ in probability, under a universal approximation condition on $\mathcal{H}_K$.
- Rate of convergence (Steinwart et al., AOS 2007)
Main Questions
- Recursive Feature Elimination (Guyon et al., ML 2002): backward elimination of variables based on the fitted coefficients of the linear SVM
- What is the statistical behavior of the coefficients?
- What determines their variances?
- Study the asymptotic properties of the coefficients of the linear SVM.
Something New, Old, and Borrowed
- The hinge loss: not everywhere differentiable, and no closed-form expression for the solution
- Useful link: $\mathrm{sign}(p(x) - 1/2)$, the population minimizer w.r.t. the hinge loss, is the median of $Y$ at $x$.
- Asymptotics for least absolute deviation (LAD) estimators (Pollard, ET 1991)
- Convexity of the loss
Preliminaries
- $(X, Y)$: a pair of random variables with $X \in \mathcal{X} \subset \mathbb{R}^d$ and $Y \in \{1, -1\}$
- $P(Y = 1) = \pi_+$ and $P(Y = -1) = \pi_-$
- Let $f$ and $g$ be the densities of $X$ given $Y = 1$ and $Y = -1$, respectively.
- With $\tilde{x} = (1, x_1, \ldots, x_d)^\top$ and $\beta = (b, w^\top)^\top$,
$$h(x; \beta) = w^\top x + b = \tilde{x}^\top \beta$$
$$\hat{\beta}_{\lambda,n} = \arg\min_\beta \frac{1}{n} \sum_{i=1}^{n} (1 - y_i h(x_i; \beta))_+ + \lambda \|w\|^2$$
Population Version
- $L(\beta) = E\left[(1 - Y h(X; \beta))_+\right]$
- $\beta^* = \arg\min_\beta L(\beta)$
- The gradient of $L(\beta)$:
$$S(\beta) = -E\left[\psi(1 - Y h(X; \beta))\, Y \tilde{X}\right], \quad \text{where } \psi(t) = I(t \geq 0)$$
- The Hessian matrix of $L(\beta)$:
$$H(\beta) = E\left[\delta(1 - Y h(X; \beta))\, \tilde{X} \tilde{X}^\top\right],$$
where $\delta$ is the Dirac delta function.
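These population quantities have direct Monte Carlo analogues. A sketch (the helper name is hypothetical), which can be used to verify $S(\beta^*) \approx 0$ on a large simulated sample:

```python
import numpy as np

def mc_loss_and_gradient(beta, X, y):
    """Monte Carlo estimates of L(beta) = E(1 - Y h(X; beta))_+ and of its
    gradient S(beta) = -E[psi(1 - Y h(X; beta)) Y X-tilde], from a large
    sample (X, y) drawn from P(X, Y); beta = (b, w)."""
    Xt = np.hstack([np.ones((X.shape[0], 1)), X])    # tilde-x = (1, x)
    t = 1 - y * (Xt @ beta)                          # 1 - y h(x; beta)
    L = np.mean(np.maximum(t, 0.0))
    psi = (t >= 0).astype(float)                     # psi(t) = I(t >= 0)
    S = -np.mean((psi * y)[:, None] * Xt, axis=0)
    return L, S
```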
More on $H(\beta)$
- The Hessian matrix of $L(\beta)$, entrywise for $0 \leq j, k \leq d$:
$$H_{j,k}(\beta) = E\left[\delta(1 - Y h(X; \beta))\, X_j X_k\right] = \pi_+ \int_{\mathcal{X}} \delta(1 - b - w^\top x)\, x_j x_k f(x)\, dx + \pi_- \int_{\mathcal{X}} \delta(1 + b + w^\top x)\, x_j x_k g(x)\, dx.$$
- For a function $s$ on $\mathcal{X}$, define the Radon transform $\mathcal{R}s$ of $s$ for $p \in \mathbb{R}$ and $\xi \in \mathbb{R}^d$ as
$$(\mathcal{R}s)(p, \xi) = \int_{\mathcal{X}} \delta(p - \xi^\top x)\, s(x)\, dx$$
(the integral of $s$ over the hyperplane $\xi^\top x = p$).
- Then
$$H_{j,k}(\beta) = \pi_+ (\mathcal{R}f_{j,k})(1 - b, w) + \pi_- (\mathcal{R}g_{j,k})(1 + b, -w),$$
where $f_{j,k}(x) = x_j x_k f(x)$ and $g_{j,k}(x) = x_j x_k g(x)$.
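When a class density is Gaussian, the Radon transform of the density itself is available in closed form: $\int \delta(p - \xi^\top x) f(x)\,dx$ is the density of $\xi^\top X$ at $p$, a univariate normal. A sketch (the helper name is hypothetical; the moment-weighted $f_{j,k}$ additionally involve the conditional moments of $X$ given $\xi^\top X = p$):

```python
import numpy as np

def radon_gaussian(p, xi, mu, Sigma):
    """(R f)(p, xi) for f = N(mu, Sigma): the density of xi'X at p,
    i.e. the N(xi'mu, xi' Sigma xi) density evaluated at p."""
    m = xi @ mu
    v = xi @ Sigma @ xi
    return np.exp(-0.5 * (p - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
```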
Regularity Conditions
(A1) The densities $f$ and $g$ are continuous and have finite second moments.
(A2) There exists a ball $B(x_0, \delta_0)$ such that $f(x) > C_1$ and $g(x) > C_1$ for every $x \in B(x_0, \delta_0)$.
(A3) For some $1 \leq i^* \leq d$,
$$\int_{\mathcal{X}} I\{x_{i^*} \geq G^-_{i^*}\}\, x_{i^*}\, g(x)\, dx < \int_{\mathcal{X}} I\{x_{i^*} \leq F^+_{i^*}\}\, x_{i^*}\, f(x)\, dx$$
or
$$\int_{\mathcal{X}} I\{x_{i^*} \leq G^+_{i^*}\}\, x_{i^*}\, g(x)\, dx > \int_{\mathcal{X}} I\{x_{i^*} \geq F^-_{i^*}\}\, x_{i^*}\, f(x)\, dx.$$
(When $\pi_+ = \pi_-$, this says that the class means are different.)
(A4) Let $M_+ = \{x \in \mathcal{X} \mid \tilde{x}^\top \beta^* = 1\}$ and $M_- = \{x \in \mathcal{X} \mid \tilde{x}^\top \beta^* = -1\}$. There exist two subsets of $M_+$ and $M_-$ on which the class densities $f$ and $g$ are bounded away from zero.
Bahadur Representation
- Bahadur (1966), "A Note on Quantiles in Large Samples"
- A statistical estimator is approximated by a sum of independent variables with a higher-order remainder.
- Let $\xi = F^{-1}(p)$ be the $p$th quantile of the distribution $F$. For $X_1, \ldots, X_n \sim$ i.i.d. $F$, the sample $p$th quantile is
$$\xi + \frac{\sum_{i=1}^{n} I(X_i > \xi) - n(1 - p)}{n f(\xi)} + R_n,$$
where $f(x) = F'(x)$.
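A quick simulation of this representation for the median of a standard normal; the sample sizes are illustrative, and the remainder $R_n$ should vanish at the $\sqrt{n}$ scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 0.5
xi = 0.0                              # true median of N(0, 1)
f_xi = 1 / np.sqrt(2 * np.pi)         # N(0, 1) density at the median

remainders = []
for _ in range(500):
    x = rng.standard_normal(n)
    sample_q = np.quantile(x, p)
    linear = xi + (np.sum(x > xi) - n * (1 - p)) / (n * f_xi)
    remainders.append(sample_q - linear)     # the remainder R_n

print(np.sqrt(n) * np.std(remainders))       # small: sqrt(n) * R_n -> 0
```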
Bahadur-type Representation of the Linear SVM
Theorem. Suppose that (A1)-(A4) are met. For $\lambda = o(n^{-1/2})$, we have
$$\sqrt{n}\,(\hat{\beta}_{\lambda,n} - \beta^*) = -\frac{1}{\sqrt{n}}\, H(\beta^*)^{-1} \sum_{i=1}^{n} \psi(1 - Y_i h(X_i; \beta^*))\, Y_i \tilde{X}_i + o_P(1).$$
Recall that $H(\beta^*) = E\left[\delta(1 - Y h(X; \beta^*))\, \tilde{X} \tilde{X}^\top\right]$ and $\psi(t) = I(t \geq 0)$.
Asymptotic Normality of $\hat{\beta}_{\lambda,n}$
Theorem. Suppose (A1)-(A4) are satisfied. For $\lambda = o(n^{-1/2})$,
$$\sqrt{n}\,(\hat{\beta}_{\lambda,n} - \beta^*) \to N\left(0,\; H(\beta^*)^{-1} G(\beta^*) H(\beta^*)^{-1}\right)$$
in distribution, where $G(\beta) = E\left[\psi(1 - Y h(X; \beta))\, \tilde{X} \tilde{X}^\top\right]$.
Corollary. Under the same conditions as in the Theorem,
$$\sqrt{n}\left(h(x; \hat{\beta}_{\lambda,n}) - h(x; \beta^*)\right) \to N\left(0,\; \tilde{x}^\top H(\beta^*)^{-1} G(\beta^*) H(\beta^*)^{-1} \tilde{x}\right)$$
in distribution.
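The sandwich form suggests a plug-in covariance estimate at a fitted $\hat{\beta}$. A sketch under one simple set of assumptions: $G$ is estimated by a sample average, and the Dirac delta in $H$ is smoothed with a Gaussian kernel whose bandwidth is a hypothetical tuning choice (the concluding remarks note that consistent estimation of $G(\beta^*)$ and $H(\beta^*)$ remains a practical issue):

```python
import numpy as np

def sandwich_cov(X, y, beta, bandwidth=0.1):
    """Plug-in estimate of H^{-1} G H^{-1} / n for the linear SVM.
    X: (n, d) inputs, y in {-1, +1}, beta = (b, w); bandwidth is a
    tuning choice for smoothing the Dirac delta in H."""
    n = X.shape[0]
    Xt = np.hstack([np.ones((n, 1)), X])             # tilde-x = (1, x)
    t = 1 - y * (Xt @ beta)                          # 1 - y h(x; beta)
    psi = (t >= 0).astype(float)                     # psi(t) = I(t >= 0)
    G = (Xt * psi[:, None]).T @ Xt / n
    delta = np.exp(-0.5 * (t / bandwidth) ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    H = (Xt * delta[:, None]).T @ Xt / n
    Hinv = np.linalg.inv(H)
    return Hinv @ G @ Hinv / n                       # approximate Cov(beta-hat)
```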
An Illustrative Example
- Two multivariate normal distributions in $\mathbb{R}^d$ with mean vectors $\mu_f$ and $\mu_g$ and a common covariance matrix $\Sigma$
- $\pi_+ = \pi_- = 1/2$.
- What is the relation between the Bayes decision boundary and the optimal hyperplane of the SVM, $h(x; \beta^*) = 0$?
[Figure: contours of the two class densities $f$ and $g$.]
Example
- The Bayes decision boundary (Fisher's LDA):
$$\left(\Sigma^{-1}(\mu_f - \mu_g)\right)^\top \left(x - \frac{1}{2}(\mu_f + \mu_g)\right) = 0.$$
- The hyperplane determined by the SVM: $\tilde{x}^\top \beta^* = 0$ with $S(\beta^*) = 0$.
- $\beta^*$ balances the two classes within the margin:
$$E\left[\psi(1 - Y h(X; \beta^*))\, Y \tilde{X}\right] = 0$$
$$P(h(X; \beta^*) \leq 1 \mid Y = 1) = P(h(X; \beta^*) \geq -1 \mid Y = -1)$$
$$E\left[I\{h(X; \beta^*) \leq 1\}\, X_j \mid Y = 1\right] = E\left[I\{h(X; \beta^*) \geq -1\}\, X_j \mid Y = -1\right]$$
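These balance conditions can be inspected by Monte Carlo at any candidate $\beta$; a sketch with a hypothetical helper name:

```python
import numpy as np

def balance_checks(beta, Xp, Xm):
    """Monte Carlo versions of the balance conditions at beta = (b, w).
    Xp: sample of X given Y = +1; Xm: sample of X given Y = -1."""
    hp = beta[0] + Xp @ beta[1:]                     # h(x; beta) on class +1
    hm = beta[0] + Xm @ beta[1:]                     # h(x; beta) on class -1
    # P(h <= 1 | Y = 1) vs P(h >= -1 | Y = -1)
    print(np.mean(hp <= 1), np.mean(hm >= -1))
    # E[I{h <= 1} X_j | Y = 1] vs E[I{h >= -1} X_j | Y = -1]
    print((Xp * (hp <= 1)[:, None]).mean(axis=0),
          (Xm * (hm >= -1)[:, None]).mean(axis=0))
```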
Example
- Direct calculation shows that
$$\beta^* = C(d_\Sigma(\mu_f, \mu_g)) \begin{pmatrix} -\frac{1}{2}(\mu_f + \mu_g)^\top \\ I_d \end{pmatrix} \Sigma^{-1}(\mu_f - \mu_g),$$
where $d_\Sigma(\mu_f, \mu_g)$ is the Mahalanobis distance between the two distributions.
- The linear SVM is equivalent to Fisher's LDA.
- The assumptions (A1)-(A4) are satisfied, so the main theorem applies.
- Consider $d = 1$, $\mu_f + \mu_g = 0$, $\sigma = 1$, and $d_\Sigma(\mu_f, \mu_g) = |\mu_f - \mu_g|$.
Distance and Margins
[Figure: class densities and the margins of the optimal hyperplane for $d_\Sigma = 0.5, 1, 3, 6$.]
Asymptotic Variance
[Figure: (a) intercept and (b) slope. The asymptotic variabilities of the intercept and the slope for the optimal hyperplane as a function of the Mahalanobis distance.]
Bivariate Normal Example
- $\mu_f = (1, 1)^\top$, $\mu_g = (-1, -1)^\top$, and $\Sigma = \mathrm{diag}(1, 1)$
- $d_\Sigma(\mu_f, \mu_g) = 2\sqrt{2}$ and the Bayes error rate is 0.07865.
- Find
$$(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) = \arg\min_\beta \frac{1}{n} \sum_{i=1}^{n} (1 - y_i \tilde{x}_i^\top \beta)_+$$

Estimates          n = 100    n = 200    n = 500    Optimal coefficients
$\hat{\beta}_0$     0.0006    -0.0013     0.0022    0
$\hat{\beta}_1$     0.7709     0.7450     0.7254    0.7169
$\hat{\beta}_2$     0.7749     0.7459     0.7283    0.7169

Table: Averages of estimated optimal coefficients over 1000 replicates.
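A sketch of the replication experiment behind this table (and the sampling distributions on the next slide), reusing the fit_linear_svm routine sketched earlier; the solver and its settings are illustrative choices, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
mu_f, mu_g = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
n, n_rep = 500, 1000

betas = []
for _ in range(n_rep):
    # balanced classes, matching pi_+ = pi_- = 1/2
    X = np.vstack([rng.normal(mu_f, 1.0, (n // 2, 2)),
                   rng.normal(mu_g, 1.0, (n // 2, 2))])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    w, b = fit_linear_svm(X, y, lam=1e-3)    # hinge-loss fit, sketched earlier
    betas.append([b, w[0], w[1]])

print(np.mean(betas, axis=0))    # compare with the averages in the table
```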
Sampling Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$
[Figure: estimated sampling distributions (density estimates vs. the asymptotic densities) of (a) $\hat{\beta}_0$ and (b) $\hat{\beta}_1$.]
Type I Error Rates
[Figure: the median values of the type I error rates in variable selection as a function of $n$, for $d = 6, 12, 18, 24$, when $\mu_f = (1_{d/2}, 0_{d/2})^\top$, $\mu_g = 0_d$, and $\Sigma = I_d$.]
Concluding Remarks
- Examine the asymptotic properties of the coefficients of variables in the linear SVM.
- Establish a Bahadur-type representation of the coefficients.
- Show how the margins of the optimal hyperplane and the underlying probability distribution characterize their statistical behavior.
- Variable selection for the SVM in the framework of hypothesis testing
- For practical applications, consistent estimators of $G(\beta^*)$ and $H(\beta^*)$ are needed.
- Extension of the SVM asymptotics to the nonlinear case
- Explore a different scenario where $d$ also grows with $n$.
Reference
- Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008). A Bahadur Representation of the Linear Support Vector Machine. Journal of Machine Learning Research.
  Available at www.stat.osu.edu/~yklee or http://www.jmlr.org/.