COMP9444: Neural Networks


Vapnik-Chervonenkis Dimension, PAC Learning and Structural Risk Minimization
How good a classifier does a learner produce?

Training error is the percentage of incorrectly classified data among the training data.

True error is the probability of misclassifying a randomly drawn data point from the domain (according to the underlying probability distribution).

How does the training error relate to the true error?
How good a classifier does a learner produce?

We can estimate the true error by using randomly drawn test data from the domain that was not shown to the classifier during training.

Test error is the percentage of incorrectly classified data among the test data.

Validation error is the percentage of incorrectly classified data among the validation data.
Calibrating MLPs for the learning task at hand

Split the original training data set into two subsets: a training data set and a validation data set.

Use the training data set for resampling to estimate which network configuration (e.g. how many hidden units to use) seems best.

Once a network configuration is selected, train the network on the training set and estimate its true error using the validation set, which was not shown to the network at any time before.
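A minimal sketch of the split step in Python (NumPy only; the function name and its arguments are our own illustration, not from any particular library):

```python
import numpy as np

def split_train_validation(X, y, val_fraction=0.2, seed=0):
    """Randomly split data into a training set and a validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # shuffle the indices
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```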
Resampling methods

Resampling tries to use the same data multiple times to reduce estimation errors on the true error of a classifier being learned with a particular learning method.

Frequently used resampling methods are

n-fold cross validation (with n being 5 or 10)

and bootstrapping.
Cross Validation

Split the available training data into n subsets s_1, ..., s_n.

For each i, set apart subset s_i for testing. Train the learner on all remaining n−1 subsets and test the result on s_i.

Average the test errors over all n learning runs.
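A sketch of this procedure, assuming hypothetical callables train_fn(X, y) -> classifier and error_fn(clf, X, y) -> error rate that stand in for an arbitrary learner:

```python
import numpy as np

def cross_validation_error(X, y, train_fn, error_fn, n=10, seed=0):
    """Average test error over n folds s_1, ..., s_n."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n)
    errors = []
    for i, s_i in enumerate(folds):
        # Train on the n-1 remaining subsets, test on the held-out fold s_i.
        rest = np.concatenate([f for j, f in enumerate(folds) if j != i])
        clf = train_fn(X[rest], y[rest])
        errors.append(error_fn(clf, X[s_i], y[s_i]))
    return float(np.mean(errors))
```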
Bootstrapping

Assume the available training data represents the actual distribution of data in the underlying domain, i.e. for k data points, assume each data point occurs with probability 1/k.

Draw a set of t training data points by randomly selecting a data point from the original collection with probability 1/k for each data point (if there are multiple occurrences of the same data point, the probability of that data point occurring is deemed to be proportionally higher).

Often t is chosen to be equal to k. This will usually result in having only about 63% of the original sample, while the remaining 37% (≈ 1/e) are made up of duplicates.
Bootstrapping

Train the learner on the randomly drawn training sample.

Test the learned classifier on the items from the original data set which were not in the training set.

Repeat the random selection, training and testing a number of times and average the test error over all runs (a typical number of runs is 30 to 50).
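A corresponding bootstrap sketch (same hypothetical train_fn/error_fn as in the cross-validation sketch above); with t = k, roughly 37% of the points never enter the sample and serve as the test set:

```python
import numpy as np

def bootstrap_error(X, y, train_fn, error_fn, runs=30, seed=0):
    """Average test error over bootstrap runs."""
    rng = np.random.default_rng(seed)
    k = len(X)
    errors = []
    for _ in range(runs):
        sample = rng.integers(0, k, size=k)              # t = k, with replacement
        out_of_bag = np.setdiff1d(np.arange(k), sample)  # the ~37% left out
        clf = train_fn(X[sample], y[sample])
        errors.append(error_fn(clf, X[out_of_bag], y[out_of_bag]))
    return float(np.mean(errors))
```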
Factors Determining the Difficulty of Learning

The representation of domain objects, i.e. which attributes and how many.

The number of training data points.

The learner, i.e. the class of functions it can potentially learn.
PAC Learning

We know a number of heuristics to ensure that learning on the training set will carry over to new data, such as using a minimally complex model and training on a representative data set with sufficient samples.

How can we address this issue in a principled way?

Probably Approximately Correct learning, and Structural Risk Minimisation.
PAC Learning

We want to ensure the probability of poor learning is low:

P[ |E_out − E_in| > ε ] ≤ δ

i.e. with probability at least 1 − δ the learner is approximately correct. If |E_out − E_in| is low, our learner is "approximately correct"; the inequalities below bound the probability that this fails to be the case.

Hoeffding inequality: the probability that the average of random variables deviates from its expected value:

P[ |ν − µ| > ε ] ≤ 2e^(−2ε²N)

For a given hypothesis h, we can say

P[ |E_out(h) − E_in(h)| > ε ] ≤ 2e^(−2ε²N)
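Solving 2e^(−2ε²N) ≤ δ for N gives N ≥ log(2/δ) / (2ε²); a small sketch of the arithmetic:

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest N with 2*exp(-2*eps**2*N) <= delta (single hypothesis h)."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(hoeffding_sample_size(0.05, 0.05))   # 738 examples for eps = delta = 5%
```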
PAC Learning

For a chosen hypothesis g, drawn from the set of hypotheses H:

P[ |E_out(g) − E_in(g)| > ε ] ≤ 2M e^(−2ε²N)

M represents the number of hypotheses, which is ∞ for a perceptron.

This is not very useful. Alternatively, we can look at how the number of distinct labellings grows with the number of samples:

m_H(N) = max over x_1, ..., x_N of |H(x_1, ..., x_N)|

The growth function represents the largest number of dichotomies that H implements over all possible samples of size N:

m_H(N) ≤ 2^N
Vapnik-Chervonenkis dimension

H shatters N points if there exists an arrangement of N points such that, for every arrangement of labels on the points, there is a hypothesis h ∈ H that captures the labels.

The Vapnik-Chervonenkis dimension d_VC(H) is the largest N that H can shatter (and hence the largest N for which m_H(N) = 2^N).

If N ≤ d_VC, H may be able to shatter the data. If N > d_VC, H cannot shatter the data.

Bounds: it can be proved that m_H(N) ≤ N^d_VC + 1, and

P[ |E_out − E_in| > ε ] ≤ 4 m_H(2N) e^(−ε²N/8)
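A sketch evaluating this bound with the polynomial growth-function bound plugged in; because e^(−ε²N/8) decays faster than (2N)^d_VC grows, the bound goes to zero as N increases (it is vacuous, i.e. greater than 1, for small N):

```python
import math

def vc_bound(eps, n, d_vc):
    """Bound on P[|E_out - E_in| > eps] using m_H(N) <= N**d_vc + 1."""
    m_h_2n = (2 * n) ** d_vc + 1                 # growth function at 2N
    return 4 * m_h_2n * math.exp(-eps**2 * n / 8)

for n in (10_000, 100_000, 1_000_000):
    print(n, vc_bound(0.1, n, d_vc=3))           # shrinks towards 0
```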
Vapnik-Chervonenkis dimension

Definition.

Various function sets, relevant to learning systems, and their VC-dimension.

Bounds from probably approximately correct learning (PAC-learning).

General bounds on the risk.
The Vapnik-Chervonenkis dimension

In the following, we consider a boolean classification function f on the input space X to be equivalent to the subset of X on which f(x) = 1.

The VC-dimension is a useful combinatorial parameter on sets of subsets of the input space, e.g. on function classes H on the input space X.

Definition: We say a set S ⊆ X is shattered by H if for every subset s ⊆ S there is at least one function h ∈ H such that s = h ∩ S.

Definition: The Vapnik-Chervonenkis dimension of H, d_VC(H), is the cardinality of the largest set S ⊆ X shattered by H.
The Vapnik-Chervonenkis dimension

[Figure: point configurations in the plane illustrating shattering by linear decision functions]

The VC-dimension of the set of linear decision functions in 2-dimensional Euclidean space is 3.
The Vapnik-Chervonenkis dimension

The VC-dimension of the set of linear decision functions in n-dimensional Euclidean space is equal to n + 1.
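A quick check of the 2-dimensional case (our own illustration): for three affinely independent points, every labelling can be matched exactly by a linear function, because the 3×3 system below is invertible.

```python
import numpy as np
from itertools import product

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # affinely independent
A = np.hstack([pts, np.ones((3, 1))])                   # rows are [x1, x2, 1]

for labels in product([-1.0, 1.0], repeat=3):
    y = np.array(labels)
    w_b = np.linalg.solve(A, y)           # solve w.x + b = y exactly
    assert np.all(np.sign(A @ w_b) == y)  # the signs match the labelling

print("All 8 labellings of 3 points are realised by linear decision functions.")
```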
VC-dimension and PAC-Learning

Theorem (Upper Bound, Blumer et al., 1989)

Let L be a learning algorithm that uses H consistently, i.e. that finds an h ∈ H that is consistent with all the data. For any 0 < ε, δ < 1, given

N = ( 4 log(2/δ) + 8 d_VC(H) log(13/ε) ) / ε

random examples, L will with probability of at least 1 − δ

either produce a function h ∈ H with error ≤ ε,

or indicate correctly that the target function is not in H.
VC-dimension and PAC-Learning

Theorem (Lower Bound, Ehrenfeucht et al., 1992)

Let L be a learning algorithm that uses H consistently. For any 0 < ε < 1/8, 0 < δ < 1/100, given fewer than

N = ( d_VC(H) − 1 ) / (32ε)

random examples, there is some probability distribution for which L will not produce a function h ∈ H with error(h) ≤ ε with probability 1 − δ.
VC-dimension bounds

ε      δ      d_VC   lower bound   upper bound
5%     5%     10     6             9192
10%    5%     10     3             4040
5%     5%     4      2             3860
10%    5%     4      1             1707
10%    10%    4      1             1677
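A sketch reproducing this table from the two theorems above (using natural logarithms; the printed values agree with the table up to rounding):

```python
import math

def upper_bound(eps, delta, d_vc):
    """Blumer et al. (1989): sufficient number of random examples."""
    return (4 * math.log(2 / delta) + 8 * d_vc * math.log(13 / eps)) / eps

def lower_bound(eps, d_vc):
    """Ehrenfeucht et al. (1992): necessary number of random examples."""
    return (d_vc - 1) / (32 * eps)

for eps, delta, d_vc in [(0.05, 0.05, 10), (0.10, 0.05, 10),
                         (0.05, 0.05, 4), (0.10, 0.05, 4), (0.10, 0.10, 4)]:
    print(f"{eps:.0%}  {delta:.0%}  {d_vc:>2}  "
          f"{lower_bound(eps, d_vc):>4.0f}  {upper_bound(eps, delta, d_vc):>6.0f}")
```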
The Vapnik-Chervonenkis dimension

Let C be the set of all squares in the plane (parallel to the axes).
VC-dim = ...
The Vapnik-Chervonenkis dimension

Let C be the set of all rectangles in the plane (parallel to the axes).
VC-dim = ...
The Vapnik-Chervonenkis dimension

Let C be the set of all circles in the plane.
VC-dim = ...
The Vapnik-Chervonenkis dimension

Let C be the set of all triangles in the plane, allowing rotation.
VC-dim = ...
The Vapnik-Chervonenkis dimension

[Figure: plots of f(x, 1) and f(x, 2.5) over x ∈ [−10, 10]]

The VC-dimension of the set of all functions of the following form is infinite!

f(x, α) = sin(αx), α ∈ ℝ.
The Vapnik-Chervonenkis dimension

The following points on the line

x_1 = 10^(−1), ..., x_n = 10^(−n)

can be shattered by functions from this set.

To separate these data into two classes determined by the sequence

δ_1, ..., δ_n ∈ {0, 1}

it is sufficient to choose the value of the parameter

α = π ( Σ_{i=1}^{n} (1 − δ_i) 10^i + 1 )
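A sketch verifying the construction for n = 5 (all 2⁵ labellings; δ_i = 1 is decoded as sin(αx_i) > 0):

```python
import math
from itertools import product

n = 5
xs = [10.0 ** -(i + 1) for i in range(n)]   # x_1 = 10^-1, ..., x_n = 10^-n

for deltas in product([0, 1], repeat=n):    # every possible labelling
    alpha = math.pi * (sum((1 - d) * 10 ** (i + 1)
                           for i, d in enumerate(deltas)) + 1)
    decoded = [1 if math.sin(alpha * x) > 0 else 0 for x in xs]
    assert decoded == list(deltas)

print(f"sin(alpha x) realises all {2**n} labellings of the {n} points.")
```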
VC-Dimensions of Neural Networks

As a good heuristic (Bartlett): VC-dimension ≈ number of parameters.

For linear threshold functions on ℝ^n, d_VC = n + 1 (the number of parameters is n + 1).

For linear threshold networks, and fixed-depth networks with piecewise polynomial squashing functions:

c_1 |W| ≤ d_VC ≤ c_2 |W| log |W|

where |W| is the number of weights in the network.

Some threshold networks have d_VC ≥ c |W| log |W|.

d_VC(sigmoid net) ≤ c |W|^4
VC-Dimensions of Neural Networks

Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, / on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(kt).

(See work of Peter Bartlett.)
VC-Dimensions of Neural Networks

Any function class H that can be computed by a program that takes a real input vector x and k real parameters and involves no more than t of the following operations:

• +, −, ×, /, e^α on real numbers
• >, ≥, =, ≤, <, ≠ on real numbers
• output value y ∈ {−1, +1}

has VC-dimension of O(k²t²).

This includes sigmoid networks, RBF networks, mixtures of experts, etc.

(See work of Peter Bartlett.)
Structural Risk Minimisation (SRM)

The complexity (or capacity) of the function class from which the learner chooses a function that minimises the empirical risk (i.e. the error on the training data) determines the convergence rate of the learner to the optimal function.

For a given number of independently and identically distributed training examples, there is a trade-off between the degree to which the empirical risk can be minimised and the degree to which the empirical risk will deviate from the true risk (i.e. the true error: the error on unseen data according to the underlying distribution).
Structural Risk Minimisation (SRM)

The general SRM principle:

Choose a complexity parameter d, e.g. the number of hidden units in an MLP or the size of a decision tree, and a function g ∈ H such that the right-hand side of the following bound is minimised:

E_out(g) ≤ E_in(g) + c √( d_VC(H) / N )

where N is the number of training examples.

The higher the VC-dimension, the more likely the empirical error will be low.

Structural risk minimisation seeks the right balance.
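A sketch of the selection rule with purely illustrative numbers (the error rates, VC-dimensions and the constant c below are invented for the example):

```python
import math

def srm_select(candidates, n, c=0.5):
    """Pick the model minimising the bound E_in + c*sqrt(d_vc / N)."""
    return min(candidates, key=lambda m: m[1] + c * math.sqrt(m[2] / n))

# (name, training error E_in, d_VC); capacity lowers E_in but raises the penalty.
models = [("5 hidden units", 0.25, 41),
          ("20 hidden units", 0.08, 161),
          ("80 hidden units", 0.06, 641)]
print(srm_select(models, n=1000))   # the middle model balances the two terms
```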
Structural Risk Minimisation (SRM)

[Figure: error versus VC-dimension — the in-error falls and the complexity term rises with d_VC, so the out-error bound is minimised at an intermediate value d_VC*]

With probability at least 1 − δ:

E_out ≤ E_in + √( (8/N) log( 4 m_H(2N) / δ ) )

With m_H(N) ≤ N^(d_VC(H)) this yields

E_out ≤ E_in + c √( d_VC(H) / N )
VC-dimension Heuristic

For neural networks and decision trees:

VC-dimension ≈ size.

Hence, the order of the misclassification probability is no more than

training error + √( size / m )

where m is the number of training examples.

This suggests that the number of training examples should grow roughly linearly with the size of the hypothesis to be produced.

If the function to be produced is too complex for the amount of data available, it is likely that the learned function is not a near-optimal one.

(See work of Peter Bartlett.)
Summary

The VC-dimension is a useful combinatorial parameter of sets of functions.

It can be used to estimate the true risk on the basis of the empirical risk and the number of independently and identically distributed training examples.

It can also be used to determine a sufficient number of training examples to learn probably approximately correctly.

It can be applied to choose the most suitable subset of functions for a given number of independently and identically distributed examples.

It trades empirical risk against confidence in the estimate.
Summary (cont.)

Limitations of the VC Learning Theory

The probabilistic bounds on the required number of examples are worst-case analyses.

No preference relation on the functions of the learner is modelled.

In practice, learning examples are not necessarily drawn from the same probability distribution as the test examples.