
Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)
- Can the representation be adapted to something other than a neural-network classifier?
- In this paper, it is adapted to a K-nearest-neighbours (KNN) classifier.

Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)
[Figure 3 diagram: Pretraining (left): a stack of RBMs with weights W_1...W_4 learned layer by layer (500, 500 and 2000 hidden units, then a 30-dimensional code). Fine-tuning (right): the stack is unrolled into an encoder (W_1...W_4) and a mirrored decoder with transposed weights, and the NCA objective is applied at the 30-dimensional code layer.]
Figure 3: Left panel: Pretraining consists of learning a stack of RBM's in which the feature activations of one RBM are treated as data by the next RBM. Right panel: After pretraining, the RBM's are "unrolled". Setting λ = 1 results in nonlinear NCA, setting λ = 0 results in a deep multi-layer autoencoder. For 0 < λ < 1, the NCA objective is combined with the autoencoder reconstruction error to create regularized nonlinear NCA. The network is fine-tuned by backpropagation.
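A minimal sketch of the "unrolling" described in the caption, under stated assumptions: the layer sizes follow the 784-500-500-2000-30 architecture mentioned later in the text, the weights are random stand-ins for pretrained RBM parameters, and the names (W, b_enc, b_dec, encode, decode) are my own, not the authors' code.

```python
# Unroll a stack of "pretrained" RBM weights into an encoder/decoder pair (cf. Figure 3).
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 500, 500, 2000, 30]          # visible -> ... -> 30-D code layer

# Stand-ins for pretrained RBM parameters (one weight matrix + two biases per layer).
W = [rng.normal(0, 0.01, (sizes[i], sizes[i + 1])) for i in range(len(sizes) - 1)]
b_enc = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]   # hidden biases
b_dec = [np.zeros(sizes[i]) for i in range(len(sizes) - 1)]       # visible biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x):
    """Deterministic encoder: logistic hidden layers, linear 30-D code units."""
    h = x
    for i, (w, b) in enumerate(zip(W, b_enc)):
        pre = h @ w + b
        h = pre if i == len(W) - 1 else sigmoid(pre)   # last layer (the code) is linear
    return h

def decode(code):
    """Decoder uses the transposed ("unrolled") weights in reverse order."""
    h = code
    for w, b in zip(reversed(W), reversed(b_dec)):
        h = sigmoid(h @ w.T + b)
    return h

x = rng.random((5, 784))          # a few fake image vectors
recon = decode(encode(x))
print(recon.shape)                # (5, 784)
```

Fine-tuning with λ = 1, λ = 0, or a value in between would backpropagate the corresponding objective through encode (and, when reconstruction is used, decode).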
where ε is the learning rate, <·>_data denotes an expectation with respect to the data distribution and <·>_model is an expectation with respect to the distribution defined by the model. To circumvent the difficulty of computing <·>_model, we use 1-step Contrastive Divergence [11]:

    Δw_ij = ε( <x_i h_j>_data − <x_i h_j>_recon )                               (11)

The expectation <x_i h_j>_data defines the frequency with which input i and feature j are on together when the features are being driven by the observed data from the training set using Eq. 7. After stochastically activating the features, Eq. 8 is used to "reconstruct" binary data. Then Eq. 7 is used again to activate the features and <x_i h_j>_recon is the corresponding frequency when the features are being driven by the reconstructed data. The learning rule for the biases is just a simplified version of Eq. 11.

3.2 Modeling Real-valued Data

Welling et al. [19] introduced a class of two-layer undirected graphical models that generalize Restricted Boltzmann Machines (RBM's) to exponential family distributions. This allows them to model images with real-valued pixels by using visible units that have a Gaussian distribution whose mean is determined by the hidden units:

    p(x_i = x | h) = N( x | b_i + σ_i Σ_j h_j w_ij,  σ_i² )                     (12)

    p(h_j = 1 | x) = g( b_j + Σ_i w_ij x_i / σ_i )                              (13)

The marginal distribution over visible units is given by Eq. 9 with an energy term:

    E(x, h) = Σ_i (x_i − b_i)² / (2σ_i²) − Σ_j b_j h_j − Σ_{i,j} h_j w_ij x_i / σ_i   (14)

The gradient of the log-likelihood function can be derived in the same way; if we set variances σ_i² = 1 for all visible units i, the parameter updates are the same as defined in Eq. 11.

3.3 Greedy Recursive Pretraining

After learning the first layer of hidden features we have an undirected model that defines p(x, h) via a consistent pair of conditional probabilities, p(h|x) and p(x|h). A different way to express what has been learned is p(x|h) and p(h). Unlike a standard directed model, this p(h) does not have its own separate parameters. It is a complicated, non-factorial prior on h that is defined implicitly by the weights. This peculiar decomposition into p(h) and p(x|h) suggests a recursive algorithm: keep the learned p(x|h) but replace p(h) by a better prior over h. For any approximating distribution Q(h|x) we can write a variational lower bound on the log probability of the data, which justifies this greedy, layer-by-layer procedure.
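The 1-step Contrastive Divergence update just described is easy to state in code. The sketch below is an illustration only (not the authors' implementation), assuming numpy, a binary-binary RBM, and plain gradient steps; the function and variable names are mine.

```python
# 1-step Contrastive Divergence (CD-1) for a binary RBM, as in Eqs. 7, 8 and 11.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(x_data, W, b_vis, b_hid, lr=0.05):
    # Positive phase: p(h_j = 1 | x) driven by the observed data (Eq. 7).
    p_h_data = sigmoid(x_data @ W + b_hid)
    h_sample = (rng.random(p_h_data.shape) < p_h_data).astype(float)

    # "Reconstruct" the binary data from the stochastically activated features (Eq. 8).
    p_x_recon = sigmoid(h_sample @ W.T + b_vis)
    # Negative phase: features driven by the reconstruction.
    p_h_recon = sigmoid(p_x_recon @ W + b_hid)

    n = x_data.shape[0]
    dW = (x_data.T @ p_h_data - p_x_recon.T @ p_h_recon) / n     # <x h>_data - <x h>_recon
    db_vis = (x_data - p_x_recon).mean(axis=0)                   # simplified rule for the biases
    db_hid = (p_h_data - p_h_recon).mean(axis=0)
    return W + lr * dW, b_vis + lr * db_vis, b_hid + lr * db_hid

# Toy usage: one CD-1 update on a random binary batch.
n_vis, n_hid = 20, 10
W = rng.normal(0, 0.01, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((8, n_vis)) < 0.3).astype(float)
W, b_vis, b_hid = cd1_step(batch, W, b_vis, b_hid)
```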
Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)

NCA (Neighbourhood Component Analysis):
- probability that point a chooses point b as its neighbour (Eq. 3), computed on the neural-network representations f(x|W)
- probability that this neighbour belongs to the same class (Eq. 4)
- Objective: maximize this probability (Eq. 5)

… perform pretraining to extract useful features from binary or real-valued data. In section 4 we show that nonlinear NCA significantly outperforms linear methods on the MNIST dataset of handwritten digits. In section 5 we show how nonlinear NCA can be regularized by adding an extra term to the objective function. The extra term is the error in reconstructing the data from the code. Using this regularizing term, we show how nonlinear NCA can benefit from additional, unlabeled data. We further demonstrate the superiority of regularized nonlinear NCA when only a small fraction of the images are labeled.

[Figure 2 diagram: binary hidden features h connected by weights W to binary visible data x.]

Figure 2: A Restricted Boltzmann Machine. The top layer represents a vector of stochastic binary features h and the bottom layer represents a vector of stochastic binary "visible" variables x. When modeling real-valued visible variables, the bottom layer is composed of linear units with Gaussian noise.

2 Learning Nonlinear NCA

We are given a set of labeled training cases (x_a, c_a), a = 1, ..., N, where x_a ∈ R^D and c_a ∈ {1, ..., C}. For each training vector x_a, define the probability that point a selects one of its neighbours b (as in [9, 13]) in the transformed feature space as:

    p_ab = exp(−d_ab²) / Σ_{z ≠ a} exp(−d_az²)                                  (3)

We focus on the Euclidean distance metric

    d_ab = || f(x_a|W) − f(x_b|W) ||,

and f(·|W) is a multi-layer neural network parametrized by the weight vector W (see fig. 1). The probability that point a belongs to class k depends on the relative proximity of all other data points that belong to class k:

    p(c_a = k) = Σ_{b : c_b = k} p_ab                                           (4)

The NCA objective (as in [9]) is to maximize the expected number of correctly classified points on the training data:

    O_NCA = Σ_{a=1}^{N} Σ_{b : c_b = c_a} p_ab                                  (5)

One could alternatively maximize the sum of the log probabilities of correct classification:

    Σ_{a=1}^{N} log( Σ_{b : c_b = c_a} p_ab )                                   (6)

When f(x|W) is constrained to be a linear transformation, we get linear NCA. When f(x|W) is defined by a multilayer, non-linear neural network, we can explore a much richer class of transformations by backpropagating the derivatives of the objective functions in Eq. 5 or 6 with respect to the parameter vector W through the layers of the encoder network. In our experiments, the NCA objective O_NCA of Eq. 5 worked slightly better than the log-probability objective of Eq. 6. We suspect that this is because Eq. 5 is more robust to handling outliers; Eq. 6, on the other hand, would strongly penalize configurations where a point in the feature space does not lie close to any other member of its class. The derivatives of Eq. 5 are given in the appendix.

3 Pretraining

In this section we describe an unsupervised way to learn an adaptive, multi-layer, non-linear "encoder" network that transforms the input data vector x into its low-dimensional feature representation f(x|W). This learning is treated as a pretraining stage that discovers good low-dimensional representations. Subsequent fine-tuning of the weight vector W is carried out using the objective function in Eq. 5.

3.1 Modeling Binary Data

We model binary data (e.g. the MNIST digits) using a Restricted Boltzmann Machine [6, 16, 11] (see fig. 2). The "visible" stochastic binary input vector x and "hidden" stochastic binary feature vector h are modeled by products of conditional Bernoulli distributions:

    p(h_j = 1 | x) = g( b_j + Σ_i w_ij x_i )                                    (7)

    p(x_i = 1 | h) = g( b_i + Σ_j w_ij h_j )                                    (8)

where g(x) = 1/(1 + exp(−x)) is the logistic function, w_ij is a symmetric interaction term between input i and feature j, and b_i, b_j are biases. The biases are part of the overall parameter vector W. The marginal distribution over visible vector x is:

    p(x) = Σ_h exp(−E(x, h)) / Σ_{u, g} exp(−E(u, g))                           (9)

where E(x, h) is an energy term (i.e. a negative log probability + an unknown constant offset) given by:

    E(x, h) = − Σ_i b_i x_i − Σ_j b_j h_j − Σ_{i,j} x_i h_j w_ij                (10)

The parameter updates required to perform gradient ascent in the log-likelihood can be obtained from Eq. 9; the 1-step Contrastive Divergence approximation of this update is Eq. 11 in the excerpt above.
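To make Eqs. 3-5 concrete, here is a small numpy sketch (my own illustration, not the paper's code) that computes p_ab, the probability of correct classification for each point, and the NCA objective on points that have already been encoded; the helper name nca_objective is an assumption.

```python
import numpy as np

def nca_objective(codes, labels):
    """codes: (N, d) low-dimensional features f(x|W); labels: (N,) integer class ids."""
    d2 = ((codes[:, None, :] - codes[None, :, :]) ** 2).sum(-1)   # squared distances d_ab^2
    logits = -d2
    np.fill_diagonal(logits, -np.inf)                 # a point never picks itself
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                 # p_ab (Eq. 3)
    same_class = labels[:, None] == labels[None, :]
    p_a = (p * same_class).sum(axis=1)                # probability of a correct pick (Eq. 4)
    return p_a.sum()                                  # O_NCA (Eq. 5); np.log(p_a).sum() gives Eq. 6

# Toy check: tight, well-separated same-class clusters give O_NCA close to N.
rng = np.random.default_rng(0)
codes = np.concatenate([rng.normal(0, .1, (10, 2)), rng.normal(5, .1, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
print(nca_objective(codes, labels))    # close to 20
```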
Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)

- NCA: the cost function is the NCA objective of Eq. 5, the expected number of correctly classified training points.
- The parameters W of f(x|W) are optimized by gradient descent, backpropagating the derivatives of the objective through the encoder network.

Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)
Results: classification (MNIST, λ = 1)
[Figure 4 plots: test error (%) on the MNIST test set versus number of nearest neighbours (1, 3, 5, 7) for Nonlinear NCA 30D, Linear NCA 30D, Autoencoder 30D and PCA 30D, together with 2-dimensional codes produced by nonlinear NCA, Linear NCA, LDA and PCA.]
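The evaluation behind these plots is plain K-nearest-neighbours classification in the learned code space. The sketch below is an illustration with random stand-in codes; knn_error and the toy data are my own, not the authors' code.

```python
import numpy as np

def knn_error(train_codes, train_labels, test_codes, test_labels, k):
    d2 = ((test_codes[:, None, :] - train_codes[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]                  # indices of the k closest codes
    votes = train_labels[nearest]                            # (n_test, k) neighbour labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float((pred != test_labels).mean())

# Toy usage with random stand-in 30-D codes.
rng = np.random.default_rng(0)
train_codes = rng.normal(size=(100, 30)); train_labels = rng.integers(0, 10, 100)
test_codes = rng.normal(size=(20, 30));  test_labels = rng.integers(0, 10, 20)
for k in (1, 3, 5, 7):
    print(k, knn_error(train_codes, train_labels, test_codes, test_labels, k))
```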
Figure 4: The top left panel shows KNN results on the MNIST test set. The top right panel shows the 2-dimensional codes produced by nonlinear NCA on the test data using a 784-500-500-2000-2 encoder. The bottom panels show the 2-dimensional codes produced by linear NCA, Linear Discriminant Analysis, and PCA.

The MNIST dataset contains 60,000 training and 10,000 test images of handwritten digits. Out of 60,000 training images, 10,000 were used for validation. The original pixel intensities were normalized to lie in the interval [0, 1] and had a preponderance of extreme values. We used a 28×28-500-500-2000-30 architecture, as shown in fig. 3, similar to one used in [12]. The 30 code units were linear and the remaining hidden units were logistic. Figure 4 shows that nonlinear NCA, after 50 epochs of training, achieves an error rate of 1.08%, 1.00%, 1.03%, and 1.01% using 1, 3, 5, and 7 nearest neighbours. This is compared to the best reported error rates (without using any domain-specific knowledge) of 1.6% for randomly initialized backpropagation and 1.4% for Support Vector Machines [5]. Linear methods such as linear NCA or PCA are much worse than nonlinear NCA. Figure 4 (right panel) shows the 2-dimensional codes produced by nonlinear NCA compared to linear NCA, Linear Discriminant Analysis, and PCA.

5 Regularized Nonlinear NCA

In many application domains, a large supply of unlabeled data is readily available but the amount of labeled data, which can be expensive to obtain, is very limited, so nonlinear NCA may suffer from overfitting.

After the pretraining stage, the individual RBM's at each level can be "unrolled" as shown in figure 3 to create a deep autoencoder. If the stochastic activities of the binary features are replaced by deterministic, real-valued probabilities, we can then backpropagate through the entire network to fine-tune the weights for optimal reconstruction of the data. Training such deep autoencoders, which does not require any labeled data, produces low-dimensional codes that are good at reconstructing the input data vectors, and tend to preserve class neighbourhood structure [14].

The NCA objective, which encourages codes to lie close to other codes belonging to the same class, can be combined with the autoencoder objective function (see fig. 3) to maximize:

    C = λ O_NCA − (1 − λ) E                                                     (16)

where O_NCA is defined in Eq. 5, E is the reconstruction error, and λ is a trade-off parameter. When the derivative of the reconstruction error is backpropagated through the autoencoder, it is combined, at the code level, with the derivatives of O_NCA.
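A sketch of how the two terms could be combined as in Eq. 16 (whose exact form above is my reconstruction): the NCA objective is computed on the codes, a cross-entropy reconstruction error on the outputs, and the two are mixed with the trade-off λ. All names and the toy data are assumptions.

```python
import numpy as np

def regularized_nca_objective(codes, labels, inputs, reconstructions, lam=0.99):
    # O_NCA: expected number of correctly classified points (Eq. 5), computed on the codes.
    d2 = ((codes[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
    logits = -d2
    np.fill_diagonal(logits, -np.inf)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    o_nca = (p * (labels[:, None] == labels[None, :])).sum()

    # E: cross-entropy reconstruction error between inputs and their reconstructions.
    eps = 1e-7
    r = np.clip(reconstructions, eps, 1 - eps)
    recon_err = -(inputs * np.log(r) + (1 - inputs) * np.log(1 - r)).sum()

    return lam * o_nca - (1 - lam) * recon_err   # quantity to maximize (cf. Eq. 16)

# lam = 1 recovers nonlinear NCA, lam = 0 a pure autoencoder.
rng = np.random.default_rng(0)
x = (rng.random((6, 8)) < 0.5).astype(float)
codes = rng.normal(size=(6, 3))
print(regularized_nca_objective(codes, np.array([0, 0, 0, 1, 1, 1]), x, np.clip(x, 0.1, 0.9)))
```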
Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)

Results: classification (MNIST, λ varies)
[Figure 5 plots: test error (%) versus number of nearest neighbours (1, 3, 5, 7) with 1%, 5% and 10% of the labels, comparing Regularized NCA (λ = 0.99 or 0.999), Nonlinear NCA 30D (λ = 1), Linear NCA 30D, Autoencoder 30D (λ = 0) and KNN in pixel space.]
Figure 5: KNN on the MNIST test set when only a small fraction of class labels is available. Linear NCA and KNN in pixel space do not take advantage of the unlabeled data.
[Figure 6 diagram: a 50-dimensional code whose first 30 units feed the NCA objective and whose last 20 units are free.]

Figure 6: Left panel: The NCA objective function is only applied to the first 30 code units, but all 50 units are used for image reconstruction. Right panel: The top row shows the reconstructed images as we vary the activation of code unit 25 from 1 to -23 with a step size of 4. The bottom row shows the reconstructed images as we vary code unit 42 from 1 to -23.
This setting is particularly useful for semi-supervised learning tasks. Consider having a set of labeled training data (x_l, c_l), where as before x_l ∈ R^D and c_l ∈ {1, ..., C}, and a set of unlabeled training data x_u. Let N be the total number of labeled and unlabeled training cases. The overall objective to maximize can be written as in Eq. 17: the NCA objective computed on the labeled cases, weighted by λ, combined with the (negative) reconstruction error summed over all N training cases, where E^n is the reconstruction error for the input data vector x^n. For the MNIST dataset we use the cross-entropy error:

    E^n = − Σ_i [ x_i^n log x̂_i^n + (1 − x_i^n) log(1 − x̂_i^n) ]               (18)

where x_i^n is the intensity of pixel i for the training example n, and x̂_i^n is the intensity of its reconstruction.

When the number of labeled examples is small, regularized nonlinear NCA performs better than nonlinear NCA (λ = 1), which uses the unlabeled data for pretraining but ignores it during the fine-tuning. It also performs better than an autoencoder (λ = 0), which ignores the labeled set. To test the effect of the regularization when most of the data is unlabeled, we randomly sampled 1%, 5% and 10% of the handwritten digits in each class and treated them as labeled data. The remaining digits were treated as unlabeled data. Figure 5 reveals that regularized nonlinear NCA (λ = 0.99) outperforms both nonlinear NCA (λ = 1) and an autoencoder (λ = 0). Even when the entire training set is labeled, regularized NCA still performs slightly better.⁴

5.1 Splitting codes into class-relevant and class-irrelevant parts

To allow accurate reconstruction of a digit image, the code must contain information about aspects of the image such as its orientation, slant, size and stroke thickness that are not relevant to its classification. These irrelevant aspects inevitably contribute to the Euclidean distance between codes and harm classification. To diminish this unwanted effect, we used 50-dimensional codes but only used the first 30 dimensions in the NCA objective function. The remaining 20 dimensions were free to code all those aspects of an image that do not affect its class label but are important for reconstruction.

Figure 6 shows how the reconstruction is affected by changing the activity level of a single code unit. Changing a unit among the first 30 changes the class; changing a unit among the last 20 does not. With this split, the codes achieve an error rate of 1.00%, 0.97%, 0.98%, 0.97% (for 1, 3, 5 and 7 nearest neighbours).

⁴ The parameter λ was selected, using cross-validation, from among a fixed set of values.
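The code-splitting idea of section 5.1 amounts to evaluating the NCA objective on a slice of the code vector. A minimal sketch, assuming a 50-dimensional code whose first 30 units are the class-relevant part; split_code_nca and the toy data are my own names, not the authors' code.

```python
import numpy as np

def split_code_nca(codes50, labels, n_class_relevant=30):
    codes = codes50[:, :n_class_relevant]        # NCA only sees the first 30 dimensions
    d2 = ((codes[:, None, :] - codes[None, :, :]) ** 2).sum(-1)
    logits = -d2
    np.fill_diagonal(logits, -np.inf)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return (p * (labels[:, None] == labels[None, :])).sum()

# The decoder (not shown) would still receive the full 50-dimensional code.
rng = np.random.default_rng(0)
print(split_code_nca(rng.normal(size=(12, 50)), rng.integers(0, 3, 12)))
```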
Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure
(Salakhutdinov and Hinton, AISTATS 2007)

Results: separating the class information (splitting codes into class-relevant and class-irrelevant parts)
Using Deep Belief Nets to Learn Covariance
Kernels for Gaussian Processes
(Salakhutdinov and Hinton, NIPS 2008)
- Can the representation be adapted to a regression task?
- In this paper, it is adapted to a Gaussian-process (GP) regressor.

Using Deep Belief Nets to Learn Covariance
Kernels for Gaussian Processes
(Salakhutdinov and Hinton, NIPS 2008)
- In place of a neural-network output layer, a GP is placed on top of the learned feature representation.

[Figure 1 diagram: a stack of three 1000-unit RBMs (weights W_1, W_2, W_3) on Gaussian visible inputs x; the feature representation F(X|W) feeds a GP on the target y.]
Figure 1: Left panel: Markov random field of the generalized RBM. The top layer represents stochastic binary
hidden features h and the bottom layer is composed of linear visible units x with Gaussian noise. When
using a Constrained Poisson Model, the top layer represents stochastic binary latent topic features h and the
bottom layer represents the Poisson visible word-count vector x. Middle panel: Pretraining consists of learning
a stack of RBM’s. Right panel: After pretraining, the RBM’s are used to initialize a covariance function of the
Gaussian process, which is then fine-tuned by backpropagation.
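A hedged sketch of the pipeline in Figure 1: a (here, single-layer) logistic feature map stands in for the pretrained DBN F(X|W), and the GP covariance is computed on those features with a spherical Gaussian kernel. The exact kernel form and all names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def features(X, W, b):
    return sigmoid(X @ W + b)                      # F(X|W): deterministic feature pass

def kernel(F1, F2, alpha=1.0, beta=10.0):
    d2 = ((F1[:, None, :] - F2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-d2 / beta)              # spherical Gaussian kernel on the features

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 64))                      # toy inputs
W, b = rng.normal(0, 0.1, (64, 32)), np.zeros(32)  # stand-in pretrained weights
K = kernel(features(X, W, b), features(X, W, b))
print(K.shape)                                     # (15, 15) covariance matrix
```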
where g(x) = 1/(1 + exp(−x)) is the logistic function, w_ij is a symmetric interaction term between input i and feature j, σ_i² is the variance of input i, and b_i, b_j are biases. The marginal distribution over visible vector x is:

    p(x) = Σ_h exp(−E(x, h)) / ∫_u Σ_g exp(−E(u, g)) du                         (9)

where E(x, h) is an energy term:

    E(x, h) = Σ_i (x_i − b_i)² / (2σ_i²) − Σ_j b_j h_j − Σ_{i,j} h_j w_ij x_i / σ_i

The parameter updates required to perform gradient ascent in the log-likelihood are obtained from Eq. 9:

    Δw_ij = ε ∂log p(x)/∂w_ij = ε( <z_i h_j>_data − <z_i h_j>_model )           (10)

where ε is the learning rate, z_i = x_i / σ_i, <·>_data denotes an expectation with respect to the data distribution and <·>_model is an expectation with respect to the distribution defined by the model. To circumvent the difficulty of computing <·>_model, we use 1-step Contrastive Divergence [5]:

    Δw_ij = ε( <z_i h_j>_data − <z_i h_j>_recon )                               (11)

The expectation <z_i h_j>_data defines the expected sufficient statistics of the data distribution and is computed as z_i p(h_j = 1 | x) when the features are being driven by the observed data from the training set using Eq. 8. After stochastically activating the features, Eq. 7 is used to "reconstruct" real-valued data. Then Eq. 8 is used again to activate the features and compute <z_i h_j>_recon when the features are being driven by the reconstructed data. Throughout our experiments we set variances σ_i² = 1 for all visible units i, which facilitates learning. The learning rule for the biases is just a simplified version of Eq. 11.
3.2 Modeling Count Data with the Constrained Poisson Model
We use a conditional “constrained” Poisson distribution for modeling observed “visible” word count
data x and a conditional Bernoulli distribution for modeling “hidden” topic features h:
    p(x_i = n | h) = Pois( n,  exp(λ_i + Σ_j h_j w_ij) / Σ_k exp(λ_k + Σ_j h_j w_kj) × N ),
    p(h_j = 1 | x) = g( b_j + Σ_i w_ij x_i )                                    (12)

where Pois(n, λ) = e^{−λ} λ^n / n!, w_ij is a symmetric interaction term between word i and feature j, N = Σ_i x_i is the total length of the document, λ_i is the bias of the conditional Poisson model for word i, and b_j is the bias of feature j. The Poisson rate, whose log is shifted by the weighted
combination of the feature activations, is normalized and scaled up by N. We call this the “Con-
strained Poisson Model” since it ensures that the mean Poisson rates across all words sum up to the
length of the document. This normalization is significant because it makes learning stable and it
deals appropriately with documents of different lengths.
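A minimal sketch of the Constrained Poisson conditionals in Eq. 12, assuming numpy and my own variable names: the unnormalized rates are normalized across the vocabulary and scaled by the document length N, so the mean rates sum to N, while the hidden topic features stay conditionally Bernoulli.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def constrained_poisson_rates(h, W, lam, N):
    """h: (n_topics,) binary topic features; returns one Poisson rate per word."""
    unnorm = np.exp(lam + h @ W.T)          # exp(lambda_i + sum_j h_j w_ij)
    return N * unnorm / unnorm.sum()        # rates sum to the document length N

rng = np.random.default_rng(0)
n_words, n_topics = 2000, 50
W = rng.normal(0, 0.01, (n_words, n_topics))
lam, b = np.zeros(n_words), np.zeros(n_topics)

x = rng.integers(0, 3, n_words).astype(float)      # a toy word-count vector
N = x.sum()
p_h = sigmoid(b + x @ W)                           # p(h_j = 1 | x), right-hand part of Eq. 12
h = (rng.random(n_topics) < p_h).astype(float)
rates = constrained_poisson_rates(h, W, lam, N)
x_recon = rng.poisson(rates)                       # stochastic "reconstruction" of the counts
print(rates.sum(), N)                              # both equal the document length
```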
3maximizingthelogprobabilityofthelabelswithrespectto W. Inthefinalmodelmostofthein-
maximizingthelogprobabilityofthelabelswithrespectto W. Inthefinalmodelmostofthein-
formationforlearningacovariancekernelwillhavecomefrommodelingtheinputdata. Thevery
formationforlearningacovariancekernelwillhavecomefrommodelingtheinputdata. Thevery
limitedinformationinthelabelswillbeusedonlytoslightlyadjustthelayersoffeaturesalready
limitedinformationinthelabelswillbeusedonlytoslightlyadjustthelayersoffeaturesalready
maximizingthelogprobabilityofthelabelswithrespectto W. Inthefinalmodelmostofthein-
maximizingthelogprobabilityofthelabelswithrespectto W. Inthefinalmodelmostofthein-
discoveredbytheD diBscNov.eredbytheDBN.
formationforlearningacovariancekernelwillhavecomefrommodelingtheinputdata. Thevery
formationforlearningacovariancekernelwillhavecomefrommodelingtheinputdata. Thevery
maximizingthelogprobabilityofthelabelswithrespectto W. Inthefinalmodelmostofthein-
limitedinformationinthelabelswillbeusedonlytoslightlyadjustthelayersoffeaturesalready
formationforlearningacovariancekernelwillhavecomefrommodelingtheinputdata. Thevery
limitedinformationinthelabelswillbeusedonlytoslightlyadjustthelayersoffeaturesalready
2 GaussianProcessesforRegressionandBinaryClassification
2 GaussianProcessesforRegressionandBinaryClassification
discoveredbytheDBN.
limitedinformationinthelabelswillbeusedonlytoslightlyadjustthelayersoffeaturesalready
discoveredbytheDBN.
N
N
Foraregressiontask,wearegivenadatasetDof i.i.d.labeledinputvectors X = {x } and
discoveredbytheDBN.
Foraregressiontask,wearegivenadatasetDof i.i.d.labeledinputvectors X = {x } and
l n
l n n=1
n=1
N
N
their correspondingtarget labels {y } ∈ R. We are interested in the following probabilistic
their correspondingtarget labels {y } ∈ R. Wenare interested in the following probabilistic
2 GaussianProcessesforRegressionandBinaryClassification
n=1
n
2 GaussianProcessesforRegressionandBinaryClassification
n=1
Using Deep Belief Nets to Learn Covariance
2 GaussianProcessesforRegressionandBinaryClassification
regressionmodel:
regressionmodel:
N
N
2
2
Foraregressiontask,wearegivenadatasetDof i.i.d.labeledinputvectors X = {x } and
Foraregressiontask,wearegivenada Kta ernels f setDo or Gaussian Pr f i.i.d.labeledocesses inputvectors X = {x } and
y = f(x ) +!, !∼N(!N|0,lσ ) n (1)
n n l n
n=1
y = f(x ) +!, !∼N(!|0,σ ) (1)
n=1
Foraregressiontask,wearegivenadatasetDof i.i.d.labeledinputvectors X = {x } and
n n
l n
n=1
N
N
(Salakhutdinov and Hinton, NIPS 2008)
N
their correspondingtarget labels {y } ∈ R. We are interested in the following probabilistic
their correspondingtarget labels {y } ∈ R. We are interested in the following probabilistic
n
n
n=1
their corresponding target labels {y_n}_{n=1}^{N_l} ∈ R. We are interested in the following probabilistic regression model:

    y_n = f(x_n) + ε,    ε ∼ N(ε|0, σ²)                                                    (1)

(Very brief) reminder of GPs:

A Gaussian process regression places a zero-mean GP prior over the underlying latent function f we are modeling, so that a priori p(f|X_l) = N(f|0, K), where f = [f(x_1), ..., f(x_n)]^T and K is the covariance matrix, whose entries are specified by the covariance function K_ij = K(x_i, x_j). The covariance function encodes our prior notion of the smoothness of f, or the prior assumption that if two input vectors are similar according to some distance measure, their labels should be highly correlated. In this paper we will use the spherical Gaussian kernel, parameterized by θ = {α, β}:

    K_ij = α exp( −(1/β) (x_i − x_j)^T (x_i − x_j) )                                       (2)

Integrating out the function values f, the marginal log-likelihood (the quantity to be maximized) takes the form:

    L = log p(y|X_l) = −(N/2) log 2π − (1/2) log|K + σ²I| − (1/2) y^T (K + σ²I)^{−1} y      (3)

which can then be maximized with respect to the parameters θ and σ. Given a new test point x_*, a prediction is obtained by conditioning on the observed data and θ. The distribution of the predicted value y_* at x_* takes the form:

    p(y_*|x_*, D, θ, σ²) = N( y_* | k_*^T (K + σ²I)^{−1} y,  k_** − k_*^T (K + σ²I)^{−1} k_* + σ² )    (4)

where k_* = K(x_*, X_l) and k_** = K(x_*, x_*).
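To make Eqs. (1)–(4) concrete, here is a minimal NumPy sketch of the marginal log-likelihood and the predictive distribution for the spherical Gaussian kernel. This is an illustration rather than the authors' code: the function names are mine, and the Cholesky-based solve is just one reasonable way to handle the matrix inverse.

import numpy as np

def spherical_gaussian_kernel(X1, X2, alpha, beta):
    # Eq. (2): K_ij = alpha * exp(-(1/beta) * ||x_i - x_j||^2)
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-sq_dists / beta)

def gp_marginal_log_likelihood(X, y, alpha, beta, sigma2):
    # Eq. (3), evaluated with a Cholesky factorization of K_y = K + sigma^2 I.
    N = len(y)
    Ky = spherical_gaussian_kernel(X, X, alpha, beta) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(Ky)                              # Ky = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))         # a = Ky^{-1} y
    return (-0.5 * N * np.log(2 * np.pi)
            - np.log(np.diag(L)).sum()                      # = -1/2 log|Ky|
            - 0.5 * y @ a)

def gp_predict(X, y, x_star, alpha, beta, sigma2):
    # Eq. (4): predictive mean and variance of y_* at a single test point x_*.
    N = len(y)
    Ky = spherical_gaussian_kernel(X, X, alpha, beta) + sigma2 * np.eye(N)
    k_star = spherical_gaussian_kernel(x_star[None, :], X, alpha, beta).ravel()
    k_ss = alpha                                            # K(x_*, x_*) = alpha for this kernel
    mean = k_star @ np.linalg.solve(Ky, y)
    var = k_ss - k_star @ np.linalg.solve(Ky, k_star) + sigma2
    return mean, var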
For a binary classification task, we similarly place a zero-mean GP prior over the underlying latent function f, which is then passed through the logistic function g(x) = 1/(1 + exp(−x)) to define a prior p(y_n = 1|x_n) = g(f(x_n)). Given a new test point x_*, inference is done by first obtaining the distribution over the latent function f_* = f(x_*):

    p(f_*|x_*, D) = ∫ p(f_*|x_*, X_l, f) p(f|X_l, y) df                                    (5)

which is then used to produce a probabilistic prediction:

    p(y_* = 1|x_*, D) = ∫ g(f_*) p(f_*|x_*, D) df_*                                        (6)

The non-Gaussian likelihood makes the integral in Eq. 5 analytically intractable. In our experiments, we approximate the non-Gaussian posterior p(f|X_l, y) with a Gaussian one using expectation propagation [12]. For more thorough reviews and implementation details refer to [13, 16].
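Once the posterior has been approximated by a Gaussian, p(f_*|x_*, D) is itself Gaussian, say N(f_* | μ_*, s_*²), and the prediction in Eq. (6) reduces to a one-dimensional integral. The sketch below is my own illustration (not part of the paper) of how that integral might be evaluated with Gauss-Hermite quadrature:

import numpy as np

def logistic(f):
    return 1.0 / (1.0 + np.exp(-f))

def predictive_class_probability(mu_star, var_star, n_points=32):
    # Approximates Eq. (6) given a Gaussian approximation N(f_* | mu_star, var_star):
    #   p(y_* = 1 | x_*, D) ~= integral of g(f) N(f | mu_star, var_star) df
    # via Gauss-Hermite quadrature with the substitution f = mu + sqrt(2 var) t.
    t, w = np.polynomial.hermite.hermgauss(n_points)
    f = mu_star + np.sqrt(2.0 * var_star) * t
    return (w * logistic(f)).sum() / np.sqrt(np.pi)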
3 Learning Deep Belief Networks (DBN's)

In this section we describe an unsupervised way of learning a DBN model of the input data X = [X_l, X_u], that contains both labeled and unlabeled datasets. A DBN can be trained efficiently by using a Restricted Boltzmann Machine (RBM) to learn one layer of hidden features at a time [7].

Welling et. al. [18] introduced a class of two-layer undirected graphical models that generalize RBM's to exponential family distributions. This framework will allow us to model real-valued images of face patches and word-count vectors of documents.

3.1 Modeling Real-valued Data

We use a conditional Gaussian distribution for modeling observed "visible" pixel values x (e.g. images of faces) and a conditional Bernoulli distribution for modeling "hidden" features h (Fig. 1):

    p(x_i = x|h) = (1 / (√(2π) σ_i)) exp( −(x − b_i − σ_i Σ_j h_j w_ij)² / (2σ_i²) )       (7)

    p(h_j = 1|x) = g( b_j + Σ_i w_ij x_i / σ_i )                                           (8)
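As a quick illustration of how these two conditionals are used, here is a sketch under my own assumptions (not the paper's code): the standard deviations σ_i are treated as fixed, as is common when the data are standardized, and the variable names are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(x, W, b_hid, sigma):
    # Eq. (8): p(h_j = 1 | x) = g(b_j + sum_i w_ij x_i / sigma_i)
    p_h = 1.0 / (1.0 + np.exp(-(b_hid + (x / sigma) @ W)))
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b_vis, sigma):
    # Eq. (7): x_i | h ~ N(b_i + sigma_i * sum_j h_j w_ij, sigma_i^2)
    mean = b_vis + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)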
The marginal distribution over visible count vectors x is given in Eq. 9 with an "energy" given by

    E(x, h) = −Σ_i λ_i x_i + Σ_i log(x_i!) − Σ_j b_j h_j − Σ_{i,j} x_i h_j w_ij            (13)

The gradient of the log-likelihood function is:

    Δw_ij = ε ∂log p(v)/∂w_ij = ε ( <x_i h_j>_data − <x_i h_j>_model )                     (14)
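As in the binary case, the model expectation in Eq. (14) is intractable and is typically approximated with one step of contrastive divergence (CD-1). The following sketch is illustrative only; it reuses the hypothetical sample_hidden/sample_visible helpers above and assumes unit visible variances, and it is not the authors' implementation.

import numpy as np

def cd1_update(x_data, W, b_vis, b_hid, lr=1e-3):
    # One-step contrastive divergence approximation to Eq. (14), with sigma_i = 1.
    sigma = np.ones_like(x_data)
    h_sample, p_h_data = sample_hidden(x_data, W, b_hid, sigma)   # features driven by data
    x_recon = sample_visible(h_sample, W, b_vis, sigma)           # one-step "reconstruction"
    _, p_h_recon = sample_hidden(x_recon, W, b_hid, sigma)        # features driven by reconstruction

    # <x_i h_j>_data approximated by data-driven statistics,
    # <x_i h_j>_model by reconstruction-driven statistics.
    W += lr * (np.outer(x_data, p_h_data) - np.outer(x_recon, p_h_recon))
    b_vis += lr * (x_data - x_recon)
    b_hid += lr * (p_h_data - p_h_recon)
    return W, b_vis, b_hid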
3.3 Greedy Recursive Learning of Deep Belief Nets

A single layer of binary features is not the best way to capture the structure in the input data. We now describe an efficient way to learn additional layers of binary features.

After learning the first layer of hidden features we have an undirected model that defines p(v, h) by defining a consistent pair of conditional probabilities, p(h|v) and p(v|h), which can be used to sample from the model distribution. A different way to express what has been learned is p(v|h) and p(h). Unlike a standard, directed model, this p(h) does not have its own separate parameters. It is a complicated, non-factorial prior on h that is defined implicitly by p(h|v) and p(v|h). This peculiar decomposition into p(h) and p(v|h) suggests a recursive algorithm: keep the learned p(v|h) but replace p(h) by a better prior over h, i.e. a prior that is closer to the average, over all the data vectors, of the conditional posterior over h. So after learning an undirected model, the part we keep is part of a multilayer directed model.

We can sample from this average conditional posterior by simply using p(h|v) on the training data and these samples are then the "data" that is used for training the next layer of features. The only difference from learning the first layer of features is that the "visible" units of the second-level RBM are also binary [6, 3]. The learning rule provided in the previous section remains the same [5]. We could initialize the new RBM model by simply using the existing learned model but with the roles of the hidden and visible units reversed. This ensures that p(v) in our new model starts out being exactly the same as p(h) in our old one. Provided the number of features per layer does not decrease, [7] show that each extra layer increases a variational lower bound on the log probability of the data. To suppress noise in the learning signal, we use the real-valued activation probabilities for the visible units of every RBM, but to prevent hidden units from transmitting more than one bit of information from the data to its reconstruction, the pretraining always uses stochastic binary values for the hidden units.

The greedy, layer-by-layer training can be repeated several times to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below.
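The stacking loop itself is simple enough to sketch. The code below is a schematic of the greedy procedure (my own illustration; train_rbm and hidden_probabilities are assumed helpers, e.g. CD-1 training as above and the conditional in Eq. (8)):

import numpy as np

rng = np.random.default_rng(0)

def pretrain_dbn(X, layer_sizes, train_rbm, hidden_probabilities):
    """Greedy, layer-by-layer pretraining of a DBN (sketch).

    train_rbm(data, num_hidden) -> (W, b_vis, b_hid), e.g. via CD-1 updates;
    hidden_probabilities(data, W, b_hid) -> p(h = 1 | v) as in Eq. (8).
    """
    stack = []
    data = X
    for num_hidden in layer_sizes:
        # Train one RBM on the current "data".
        W, b_vis, b_hid = train_rbm(data, num_hidden)
        stack.append((W, b_vis, b_hid))
        # Stochastic binary hidden samples become the data for the next-level RBM.
        p_h = hidden_probabilities(data, W, b_hid)
        data = (rng.random(p_h.shape) < p_h).astype(float)
    return stack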
Using Deep Belief Nets to Learn Covariance
Kernels for Gaussian Processes
(Salakhutdinov and Hinton, NIPS 2008)

Learning the covariance matrix

4 Learning the Covariance Kernel for a Gaussian Process

After pretraining, the stochastic activities of the binary features in each layer are replaced by deterministic, real-valued probabilities and the DBN is used to initialize a multi-layer, non-linear mapping F(x|W) as shown in figure 1. We define a Gaussian covariance function, parameterized by θ = {α, β} and W as:

    K_ij = α exp( −(1/β) ||F(x_i|W) − F(x_j|W)||² )                                        (15)

Note that this covariance function is initialized in an entirely unsupervised way. We can now maximize the log-likelihood of Eq. 3 with respect to the parameters of the covariance function using the labeled training data [9], proceeding by gradient descent. The derivative of the log-likelihood with respect to the kernel function is:

    ∂L/∂K_y = (1/2) ( K_y^{−1} y y^T K_y^{−1} − K_y^{−1} )                                 (16)

where K_y = K + σ²I is the covariance matrix. Using the chain rule we readily obtain the necessary gradients:

    ∂L/∂θ = (∂L/∂K_y) (∂K_y/∂θ)    and    ∂L/∂W = (∂L/∂K_y) (∂K_y/∂F(x|W)) (∂F(x|W)/∂W)    (17)
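To illustrate Eqs. (15) and (16), here is a minimal NumPy sketch (my own, not the paper's code; dbn_forward is a hypothetical function standing in for the multi-layer mapping F(x|W)):

import numpy as np

def dbn_covariance(X, dbn_forward, alpha, beta):
    # Eq. (15): K_ij = alpha * exp(-(1/beta) * ||F(x_i|W) - F(x_j|W)||^2)
    F = dbn_forward(X)                                   # (N, feature_dim) top-level features
    sq_dists = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-sq_dists / beta)

def dL_dKy(Ky, y):
    # Eq. (16): dL/dK_y = 1/2 (K_y^{-1} y y^T K_y^{-1} - K_y^{-1})
    Ky_inv = np.linalg.inv(Ky)
    a = Ky_inv @ y
    return 0.5 * (np.outer(a, a) - Ky_inv)

Chaining this kernel-level gradient through ∂K_y/∂F(x|W) and ∂F(x|W)/∂W, as in Eq. (17), is then ordinary backpropagation through the DBN.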
Using Deep Belief Nets to Learn Covariance
Kernels for Gaussian Processes
(Salakhutdinov and Hinton, NIPS 2008)

Results: regression

Figure 2: Top panel A: Randomly sampled examples of the training and test data. Bottom panel B: The same sample of the training and test images but with rectangular occlusions.

Training     GPstandard         GP-DBNgreedy       GP-DBNfine         GPpca
labels       Sph.     ARD       Sph.     ARD       Sph.     ARD       Sph.        ARD
A   100      22.24    28.57     17.94    18.37     15.28    15.01     18.13 (10)  16.47 (10)
    500      17.25    18.16     12.71     8.96      7.25     6.84     14.75 (20)  10.53 (80)
    1000     16.33    16.36     11.22     8.77      6.42     6.31     14.86 (20)  10.00 (160)
B   100      26.94    28.32     23.15    19.42     19.75    18.59     25.91 (10)  19.27 (20)
    500      20.20    21.06     15.16    11.01     10.56    10.12     17.67 (10)  14.11 (20)
    1000     19.20    17.98     14.15    10.43      9.13     9.23     16.26 (10)  11.55 (80)

Table 1: Performance results on the face-orientation regression task. The root mean squared error (RMSE) on the test set is shown for each method using a spherical Gaussian kernel and a Gaussian kernel with ARD hyper-parameters. By row: A) Non-occluded face data, B) Occluded face data. For the GPpca model, the number of principal components that performs best on the test data is shown in parenthesis.
where ∂F(x|W)/∂W is computed using standard backpropagation. We also optimize the observation noise σ². It is necessary to compute the inverse of K_y, so each gradient evaluation has O(N³) complexity, where N is the number of labeled training cases. When learning the restricted Boltzmann machines that are composed to form the initial DBN, however, each gradient evaluation scales linearly in time and space with the number of unlabeled training cases. So the pretraining stage can make efficient use of very large sets of unlabeled data to create sensible, high-level features even when the amount of labeled data is small. The very limited amount of information in the labels can then be used to slightly refine those features rather than to create them.

5 Experimental Results
In this section we present experimental results for several regression and classification tasks that involve high-dimensional, highly-structured data. The first regression task is to extract the orientation of a face from a gray-level image of a large patch of the face. The second regression task is to map images of handwritten digits to a single real-value that is as close as possible to the integer represented by the digit in the image. The first classification task is to discriminate between images of odd digits and images of even digits. The second classification task is to discriminate between two different classes of news story based on the vector of word counts in each story.

5.1 Extracting the Orientation of a Face Patch

The Olivetti face data set contains ten 64×64 images of each of forty different people. We constructed a data set of 13,000 28×28 images by randomly rotating (−90° to +90°), cropping, and subsampling the original 400 images. The data set was then subdivided into 12,000 training images, which contained the first 30 people, and 1,000 test images, which contained the remaining 10 people. 1,000 randomly sampled face patches from the training set were assigned an orientation label. The remaining 11,000 training images were used as unlabeled data. We also made a more difficult version of the task by occluding part of each face patch with randomly chosen rectangles. Panel A of figure 2 shows randomly sampled examples from the training and test data.

For training on the Olivetti face patches we used the 784-1000-1000-1000 architecture shown in figure 1. The entire training set of 12,000 unlabeled images was used for greedy, layer-by-layer training of a DBN model. The 2.8 million parameters of the DBN model may seem excessive for 12,000 training cases, but each training case involves modeling 625 real-values rather than just a single real-valued label. Also, we only train each layer of features for a few passes through the training data and we penalize the squared weights.
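Purely as an illustration of this preprocessing step, one patch could be generated along the following lines. None of this is the authors' code: the crop window, interpolation settings, and use of scipy.ndimage are all my own assumptions.

import numpy as np
from scipy.ndimage import rotate, zoom

rng = np.random.default_rng(0)

def random_face_patch(face_64x64, out_size=28):
    """Rotate, crop, and subsample one 64x64 face image (illustrative sketch)."""
    angle = rng.uniform(-90.0, 90.0)                       # orientation label, in degrees
    rotated = rotate(face_64x64, angle, reshape=False, mode="nearest")
    cropped = rotated[16:48, 16:48]                        # central 32x32 crop (assumed)
    patch = zoom(cropped, out_size / cropped.shape[0])     # subsample to out_size x out_size
    return patch, angle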
Using Deep Belief Nets to Learn Covariance
Kernels for Gaussian Processes
(Salakhutdinov and Hinton, NIPS 2008)

Results: visualization

[Figure 3: panels labeled "Input Pixel Space" and "Feature Space" showing histograms over log β, and a feature-space scatter plot (one axis is Feature 992); see the caption below.]

Figure 3: Left panel shows a scatter plot of the two most relevant features, with each point replaced by the corresponding input test image. For better visualization, overlapped images are not shown. Right panel displays the histogram plots of the learned ARD hyper-parameters log β.
After the DBN has been pretrained on the unlabeled data, a GP model was fitted to the labeled data using the top-level features of the DBN model as inputs. We call this model GP-DBNgreedy. GP-DBNgreedy can be significantly improved by slightly altering the weights in the DBN. The GP model gives error derivatives for its input vectors which are the top-level features of the DBN. These derivatives can be backpropagated through the DBN to allow discriminative fine-tuning of the weights. Each time the weights in the DBN are updated, the GP model is also refitted. We call