Learning Deep Energy Models
Author: Jiquan Ngiam et al., 2011
Presenter: Dibyendu Sengupta
The Deep Learning Problem
Vectorized pixel intensities
Slightly higher level representation
...
Output or the highest level representation: That’s the CRAZY FROG!!!!!
Learning and modeling the different layers is a very challenging problem
Outline
Placing DEM in context of other models
Energy Based Models
Description of Deep Energy Models
Learning in Deep Energy Models
Experiments using DEMs
State of the art methods
• Deep Belief Network (DBN): RBMs are stacked and trained in a “greedy” manner to form a DBN [1]; each RBM models the posterior distribution of the previous layer.
• Deep Boltzmann Machine (DBM): A DBM [2] has undirected connections between the layers of a network initialized by RBMs. Joint training is done on the layers.
• Deep Energy Model (DEM): A DEM consists of a feedforward NN that deterministically transforms the input; the output of the feedforward network is modeled with a layer of stochastic hidden units.
[Figure: layer diagrams, panels labeled “DBN Layers”, “DBM Layers”, “DBM Layers”]
[1] Hinton et al., A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 2006
[2] Salakhutdinov et al., Deep Boltzmann Machines, AISTATS, 2009
Deep Belief Network (DBN)
DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They model the joint distribution of the observed data x and the l hidden layers h^1, ..., h^l.
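For reference, the joint distribution in the standard DBN formulation (Hinton et al., 2006) is usually written as below. The superscript indexing and the convention x = h^0 are assumptions chosen to match the l hidden layers named on the slide.

```latex
P(x, h^1, \ldots, h^l)
  = \left( \prod_{k=0}^{l-2} P\!\left(h^k \mid h^{k+1}\right) \right) P\!\left(h^{l-1}, h^l\right),
  \qquad x = h^0
```

Each factor P(h^k | h^{k+1}) is the visible-given-hidden conditional of the RBM at level k, and P(h^{l-1}, h^l) is the joint distribution of the top-level RBM.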
Algorithm:
1. Train the first layer as an RBM that models the input x as its visible layer.
2. Use the first layer to obtain a representation of the input that serves as data for the second layer. This representation is chosen either as the mean activations of the first-layer hidden units or as samples drawn from them.
3. Iterate for the desired number of layers, each time propagating upward either samples or mean values.
4. Fine-tune all the parameters of this deep architecture with respect to the log-likelihood or with respect to a supervised training criterion.
Deep Boltzmann Machine (DBM)
• DBM is similar to DBN; the primary contrasting feature is undirected connections in all the layers
• A layerwise training algorithm is used to initialize the layers using RBMs
• Joint training is performed on all the layers
Motivation of Deep Energy Models (DEM)
• Both DBN and DBM have multiple stochastic hidden layers
• Computing the conditional posterior over the stochastic hidden layers is intractable
• Learning and inference are more tractable in a single-layer RBM, but it suffers from a lack of representational power
• To overcome both defects, DEMs combine several deterministic hidden layers with a single layer of stochastic hidden units
Energy Based Models (EBMs)
x – Visible units, h – Hidden units, Z – Partition Function, F(x) – Free Energy
We would like the desired configurations to be at low energy
F(x) is independent of h (the hidden units are summed out)
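The symbols listed above fit together as follows in the usual energy-based formulation. This is a reference sketch using the slide's notation, where E(x, h) is assumed to denote the energy of a joint configuration.

```latex
p(x, h) = \frac{e^{-E(x, h)}}{Z}, \qquad
p(x) = \sum_{h} p(x, h) = \frac{e^{-F(x)}}{Z}, \qquad
F(x) = -\log \sum_{h} e^{-E(x, h)}, \qquad
Z = \sum_{x} e^{-F(x)}
```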
General Learning Strategy for EBMs
Gradient-based methods are applied to this functional formulation to learn the parameters θ
In general, the “positive” term is easy to compute but the “negative” term is often intractable and has to be estimated by sampling.
Expectations are computed for both these terms to estimate their values
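The two terms mentioned above come from differentiating log p(x) = -F(x) - log Z. In the slide's notation, with x̃ assumed here as the name of the summation variable over model configurations:

```latex
-\frac{\partial \log p(x)}{\partial \theta}
  = \underbrace{\frac{\partial F(x)}{\partial \theta}}_{\text{positive term}}
  \; - \;
  \underbrace{\sum_{\tilde{x}} p(\tilde{x}) \, \frac{\partial F(\tilde{x})}{\partial \theta}}_{\text{negative term}}
```

The positive term only needs the observed x, while the negative term is an expectation under the model distribution, which is why it usually has to be estimated from samples.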
Energy based models of RBM
W: weights connecting visible (v) and hidden (h) units
b and c: offsets of visible and hidden units
[Figure: RBM representation]
Exploiting the structure of the RBM, the free energy and the conditional distributions factorize over units
In particular, for RBMs with binary units the conditionals are sigmoids
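For reference, the standard RBM expressions in the notation above are sketched below. The convention that W has one row W_i per hidden unit is an assumption of this sketch.

```latex
E(v, h) = -b^{\top} v - c^{\top} h - h^{\top} W v

F(v) = -b^{\top} v - \sum_{i} \log \sum_{h_i} e^{\,h_i (c_i + W_i v)}

\text{Binary units:} \quad
P(h_i = 1 \mid v) = \operatorname{sigm}(c_i + W_i v), \quad
P(v_j = 1 \mid h) = \operatorname{sigm}(b_j + W_{\cdot j}^{\top} h), \quad
F(v) = -b^{\top} v - \sum_{i} \log\!\left(1 + e^{\,c_i + W_i v}\right)
```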
Sigmoid Deep Energy Model
g_θ(v) represents the feedforward output of the neural network g_θ
Similar to RBMs, an energy function defines the connections between g_θ(v) and the hidden units h (assumed binary)
The conditional posteriors of the hidden variables are easy to compute
Representational power of the model can be increased by adding more layers to the feedforward NN
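A minimal sketch of why the hidden posterior is easy to compute. It assumes an energy of the form E(v, h) = v·v / (2σ²) - b·v - c·h - g_θ(v)ᵀ W h; the quadratic term and its scaling, the two-layer sigmoid network, and all names (`layers`, `W`, `b`, `c`) are illustrative assumptions, not necessarily the authors' exact parameterization. Because h enters the energy linearly, the posterior factorizes into sigmoids, and the free energy has a closed form that the code checks against brute-force enumeration.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_theta(v, layers):
    """Deterministic feedforward pass; `layers` is a list of (A, a) pairs (assumed form)."""
    out = v
    for A, a in layers:
        out = sigmoid(A @ out + a)
    return out

def energy(v, h, layers, W, b, c, sigma=1.0):
    """Assumed energy: v.v / (2 sigma^2) - b.v - c.h - g(v)^T W h."""
    g = g_theta(v, layers)
    return v @ v / (2 * sigma**2) - b @ v - c @ h - g @ W @ h

def hidden_posterior(v, layers, W, c):
    """p(h_j = 1 | v): factorizes over j because h enters the energy linearly."""
    return sigmoid(c + W.T @ g_theta(v, layers))

def free_energy(v, layers, W, b, c, sigma=1.0):
    """F(v) = -log sum_h exp(-E(v, h)), with the sum over binary h done analytically."""
    pre = c + W.T @ g_theta(v, layers)
    return v @ v / (2 * sigma**2) - b @ v - np.sum(np.logaddexp(0.0, pre))

# Tiny random model (sizes are arbitrary illustration values).
rng = np.random.default_rng(0)
n_v, n_mid, n_g, n_h = 4, 5, 3, 3
layers = [(rng.normal(size=(n_mid, n_v)), rng.normal(size=n_mid)),
          (rng.normal(size=(n_g, n_mid)), rng.normal(size=n_g))]
W = rng.normal(size=(n_g, n_h))
b = rng.normal(size=n_v)
c = rng.normal(size=n_h)
v = rng.normal(size=n_v)

# Brute-force check: enumerate all 2^n_h binary hidden configurations.
all_h = np.array(list(itertools.product([0.0, 1.0], repeat=n_h)))
energies = np.array([energy(v, h, layers, W, b, c) for h in all_h])
log_norm = np.logaddexp.reduce(-energies)
brute_force_F = -log_norm
brute_force_posterior = np.exp(-energies - log_norm) @ all_h
```

Adding more (A, a) pairs to `layers` deepens the deterministic network without changing how the posterior or the free energy is computed.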
Generalized Deep Energy Models
Generalized Free Energy Function
Sigmoid DEM with g_θ as the feedforward NN
Different models for H(v) enable DEMs with multiple layers of nonlinearities
PoT Distribution
Covariance RBM Distribution
Examples of 2-layered networks: the first layer computes squared responses, followed by a soft-rectification.
There can also be linear combinations of models, like the mean-covariance RBM, which is a linear combination of an RBM and a cRBM
An alternative deep version of DEM
• PoT and cRBM use shallow feedback in the energy landscape
• “Stacked PoT” or SPoT is chosen as a deeper version of PoT by stacking several PoT layers
• This creates a more expressive deeper model
Learning Parameters in DEMs
• Models were trained by maximizing the log-likelihood
• Stochastic gradient ascent was used to learn the parameters θ
• The resulting update rules are similar to the generalized Energy Based Model (EBM) updates
1st Term: Expectation over the model distribution – harder to compute, often intractable, and approximated by sampling
2nd Term: Expectation over the data – can be easily computed
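A small sketch of the update rule above, under a loudly stated simplification: the visible units are binary and few, so the expectation over the model distribution can be enumerated exactly instead of being sampled. The free energy F(v) = -b·v - Σ_i softplus(c_i + W_i v) and all names are illustrative assumptions. The gradient of the average log-likelihood is the model expectation of ∂F/∂W minus the data expectation of ∂F/∂W.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def free_energy(V, W, b, c):
    """F(v) = -b.v - sum_i softplus(c_i + W_i v), for each row v of V."""
    return -V @ b - np.sum(np.logaddexp(0.0, V @ W.T + c), axis=1)

def dF_dW(V, W, c):
    """Per-sample gradient dF/dW, shape (n_samples, n_hidden, n_visible)."""
    H = sigmoid(V @ W.T + c)
    return -H[:, :, None] * V[:, None, :]

def avg_log_likelihood(data, W, b, c, all_v):
    """Mean of log p(v) = -F(v) - log Z over the data, with Z enumerated exactly."""
    log_Z = np.logaddexp.reduce(-free_energy(all_v, W, b, c))
    return np.mean(-free_energy(data, W, b, c)) - log_Z

def log_likelihood_grad_W(data, W, b, c, all_v):
    """1st term: expectation over the model; 2nd term: expectation over the data."""
    F_all = free_energy(all_v, W, b, c)
    p_model = np.exp(-F_all - np.logaddexp.reduce(-F_all))
    model_term = np.einsum("n,nij->ij", p_model, dF_dW(all_v, W, c))
    data_term = dF_dW(data, W, c).mean(axis=0)
    return model_term - data_term

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.5, size=(n_h, n_v))
b = rng.normal(scale=0.5, size=n_v)
c = rng.normal(scale=0.5, size=n_h)
all_v = np.array(list(itertools.product([0.0, 1.0], repeat=n_v)))
data = all_v[rng.integers(0, len(all_v), size=20)]  # toy "training set"

grad = log_likelihood_grad_W(data, W, b, c, all_v)
learning_rate = 0.1                      # arbitrary illustration value
W_new = W + learning_rate * grad         # one gradient ascent step
```

In a DEM with real-valued inputs the enumeration over `all_v` is not available, which is where the sampler on the next slide comes in.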
Hybrid Monte Carlo (HMC) Sampler
• Model samples are obtained by simulating a physical system
• Particles are subjected to potential and kinetic energies
• Velocities are sampled from a univariate Gaussian to obtain state_1
• The state of the particles follows conservation of the Hamiltonian H(s,φ)
• n steps of the Leap-Frog algorithm are applied to state_1 (s,φ) to obtain state_2
• Acceptance is performed based on P_acc(state_1, state_2)
Hamiltonian Dynamics
Neal RM, Probabilistic inference using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993
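The steps above can be sketched as a generic HMC sampler. This is a textbook-style sketch, not the authors' implementation: the potential energy here is a standard Gaussian, E(s) = s·s / 2, chosen only so the result is easy to check, and `step_size` and `n_steps` are arbitrary illustration values. In a DEM the potential would be the free energy of the model.

```python
import numpy as np

def leapfrog(s, phi, grad_energy, step_size, n_steps):
    """n leap-frog steps of Hamiltonian dynamics from state (s, phi)."""
    phi = phi - 0.5 * step_size * grad_energy(s)       # initial half step for velocity
    for i in range(n_steps):
        s = s + step_size * phi                        # full step for position
        if i < n_steps - 1:
            phi = phi - step_size * grad_energy(s)     # full step for velocity
    phi = phi - 0.5 * step_size * grad_energy(s)       # final half step for velocity
    return s, phi

def hamiltonian(s, phi, energy):
    """H(s, phi) = potential energy + kinetic energy."""
    return energy(s) + 0.5 * phi @ phi

def hmc_step(s, energy, grad_energy, step_size, n_steps, rng):
    """One HMC transition: sample velocities, simulate, then accept or reject."""
    phi = rng.normal(size=s.shape)                     # each velocity ~ N(0, 1)
    s_new, phi_new = leapfrog(s, phi, grad_energy, step_size, n_steps)
    delta_h = hamiltonian(s_new, phi_new, energy) - hamiltonian(s, phi, energy)
    # Accept with probability min(1, exp(-delta_h)).
    if np.log(rng.uniform()) < -delta_h:
        return s_new, True
    return s, False

def energy(s):
    return 0.5 * s @ s        # toy potential: standard Gaussian

def grad_energy(s):
    return s

rng = np.random.default_rng(0)
s = np.zeros(2)
samples, accepted = [], 0
for _ in range(2000):
    s, ok = hmc_step(s, energy, grad_energy, step_size=0.2, n_steps=10, rng=rng)
    samples.append(s)
    accepted += ok
samples = np.array(samples)
accept_rate = accepted / len(samples)
```

For this toy target the samples should have mean near 0 and variance near 1 in each dimension, and most proposals should be accepted because the leap-frog integrator nearly conserves H.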
Greedy Layerwise and Joint Training
• First, greedy layerwise training is performed by
– Training successive layers to optimize data likelihood
– Freezing the parameters of the earlier layers
– Keeping the learning objective (i.e. data likelihood) the same throughout training of the deep model
• Joint training is subsequently performed on all layers by
– Unfreezing the weights
– Optimizing the same objective function
– The computational cost is comparable to layerwise training
• Training in DEM is computationally much cheaper than in DBN and DBM, since only the top layer needs sampling and all intermediate layers are deterministic hidden units
Discriminative Deep Energy Models
a_l: Activations of the l-th layer in g_θ, used to learn a linear classifier of image labels y via weights U
Training: Done with a hybrid generative-discriminative objective
Gradient of the generative cost: Can be computed by the previously discussed method
Gradient of the discriminative cost: Can be computed by considering the model to be a short-circuited feedforward NN with softmax classification
Experiments: Natural Images
Remarks:
• Experiments were done with sigmoid and SPoT models, with log-likelihoods estimated by Annealed Importance Sampling (AIS)
• M1 and M2: Training using greedy layerwise stacking of 1 and 2 layers respectively
• M1-M2-M12: Greedy layerwise training for 2 layers followed by joint training of the two layers
• Joint training results in a performance improvement over pure greedy layerwise training, but the convergence of the log-likelihood is not evident from the plots
• M1-M2-M12 seems to require a significantly larger number of iterations
• Adding multiple layers in SPoT significantly improves model performance
[Plot annotation] Convergence: Questionable!
Experiments: Object Recognition
[Figure: Samples from the SPoT M12 model trained on the NORB dataset]
Remarks:
• Models were trained on the NORB dataset
• Hybrid discriminative-generative deep models (SPoT) performed better than the fully discriminative model
• The fully discriminative model suffers from overfitting
• The optimal α parameter that weighs the discriminative and generative costs is obtained by cross-validation on a subset of the training data
• Iteration counts for convergence of the models were not reported
Conclusions
• It is often difficult to scale the SPoT model to realistic datasets because of the slowness of HMC
• Jointly training all layers yields significantly better models than greedy layerwise training
Single layer Sigmoid DEM: Trained by greedy layerwise training
Two layer Sigmoid DEM: Trained by greedy and joint training
Filters appear blob-like
Filters appear Gabor-like
What is the best Multi-Stage Architecture for Object Recognition?
Author: Kevin Jarrett et al., 2009
Presenter: Sreeparna Mukherjee
Coming back to the starting problem
Vectorized pixel intensities
Feature Extraction: Stage 1
...
Output or the highest level feature extraction: That’s the CRAZY FROG!!!!!
Can it be done more efficiently with multiple feature extraction stages instead of just one?
Outline
Different Feature Extraction Models
Multi-Stage Feature Extraction Architecture
Learning Protocols
Experiments
Existing Feature Extraction Models
• There are several single-stage feature extraction systems inspired by the mammalian visual cortex
– Scale Invariant Feature Transform (SIFT)
– Histogram of Oriented Gradients (HOG)
– Geometric Blur
• There are also models with two or more successive stages of feature extraction
– Convolutional networks trained in supervised or unsupervised mode
– Multistage systems using non-linear MAX or HMAX models
Contrasts among different models
The feature extraction models primarily differ in the following aspects
– Number of stages of feature extraction
– Type of non-linearity used after the filter bank
– Type of filter used
– Type of classifier used
Questions to Address
• How do the non-linearities following filter banks influence recognition accuracy?
• Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters?
• Is there any benefit of using a 2-stage feature extractor as opposed to a single-stage feature extractor?
Intuitions in Feature Extraction Architecture
• Supervised training on a dataset with a small number of labeled samples (e.g. Caltech-101) will fail
• Filters need to be carefully handpicked for good performance
• Non-linearities should not be a significant factor
What do you think?
These intuitions are wrong – We’ll see how!
General Model Architecture
Vectorized
pixel intensities
Filter Bank Layer
Output or the highest level feature extraction: That’s the CRAZY FROG!!!!!
Non Linear Transformation Layers
Pooling Layer: Local Averaging to remove small perturbations
Filter Bank Layer (F_CSG)
Input (x): n_1 2D feature maps of size n_2 × n_3; x_ijk is each component in each feature map x_i
Output (y): m_1 feature maps of size m_2 × m_3
Filter (k): k_ij is a filter in the filter bank of size l_1 × l_2 mapping x_i to y_j
Non-Linear Transformations in F_CSG
F_CSG comprises Convolution Filters (C), Sigmoid/tanh non-linearity (S) and gain (G) coefficients g_j
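A minimal sketch of the F_CSG layer as described above, assuming the usual form y_j = g_j · tanh(Σ_i k_ij * x_i) with “valid” 2D convolution, so that m_2 = n_2 - l_1 + 1 and m_3 = n_3 - l_2 + 1. The array layout and function names are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, k):
    """'Valid' 2D convolution of one map x (n2, n3) with one kernel k (l1, l2)."""
    windows = sliding_window_view(x, k.shape)            # (m2, m3, l1, l2)
    return np.einsum("abpq,pq->ab", windows, k[::-1, ::-1])

def filter_bank_csg(x, kernels, gains):
    """x: (n1, n2, n3); kernels: (n1, m1, l1, l2), one filter k_ij per map pair; gains: (m1,)."""
    n1, m1 = kernels.shape[:2]
    maps = []
    for j in range(m1):
        # C: sum the convolutions of every input map x_i with its filter k_ij
        total = sum(conv2d_valid(x[i], kernels[i, j]) for i in range(n1))
        # S and G: tanh non-linearity followed by the gain coefficient g_j
        maps.append(gains[j] * np.tanh(total))
    return np.stack(maps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 6))             # n1 = 2 input maps of size 6 x 6
kernels = rng.normal(size=(2, 3, 3, 3))    # m1 = 3 output maps, 3 x 3 filters
gains = rng.normal(size=3)
y = filter_bank_csg(x, kernels, gains)     # shape (3, 4, 4)
```

Because tanh is bounded by 1 in magnitude, each output map is bounded by |g_j|, which is what the trainable gain controls.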
Rectification Layer (R_abs)
• This layer returns the absolute value of its input
• Other rectifying non-linearities produced results similar to R_abs
Local Contrast Normalization Layer (N)
This layer performs local
– Subtractive Normalization
– Divisive Normalization
w_pq is a normalized Gaussian weighting window
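A sketch of the two normalization steps, assuming the common form v_ijk = x_ijk - Σ_ipq w_pq x_i,j+p,k+q followed by y_ijk = v_ijk / max(c, σ_jk), with σ_jk = sqrt(Σ_ipq w_pq v²_i,j+p,k+q) and c = mean(σ_jk). The window size, the Gaussian width and the reflect padding at the borders are assumptions made for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gaussian_window(size, sigma, n_maps):
    """Gaussian weights w_pq, normalized to sum to 1 over p, q and all feature maps."""
    ax = np.arange(size) - size // 2
    w = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma**2))
    return w / (w.sum() * n_maps)

def weighted_local_sum(x, w):
    """sum_{i,p,q} w_pq * x[i, j+p, k+q] for every position (j, k); reflect-padded borders."""
    r = w.shape[0] // 2
    padded = np.pad(x, ((0, 0), (r, r), (r, r)), mode="reflect")
    windows = sliding_window_view(padded, w.shape, axis=(1, 2))  # (n, H, W, size, size)
    return np.einsum("ijkpq,pq->jk", windows, w)

def local_contrast_normalize(x, size=9, sigma=2.0):
    """x: (n_maps, H, W); `size` must be odd."""
    w = gaussian_window(size, sigma, x.shape[0])
    v = x - weighted_local_sum(x, w)                  # subtractive normalization
    sigma_jk = np.sqrt(weighted_local_sum(v**2, w))   # local weighted standard deviation
    c = sigma_jk.mean()
    return v / np.maximum(c, sigma_jk)                # divisive normalization

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 12, 12))
y = local_contrast_normalize(x)
```

With these definitions the output does not change when a constant is added to the input or when the input is multiplied by a positive constant, which is the contrast invariance the layer is meant to provide.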
Average Pooling and Subsampling Layer (P_A)
Averaging: This creates robustness to small distortions; w_pq is a uniform weighting window
Subsampling: Spatial resolution is reduced by down-sampling with a ratio S in both spatial directions
Max Pooling and Subsampling Layer (P_M)
The average operation is replaced by a max operation
The subsampling procedure stays the same
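Both pooling layers can be sketched with one helper: a window slides over each feature map, the values inside are averaged (P_A) or maximized (P_M), and the result is kept only every S positions. The function name and argument names are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool(x, window, stride, mode="average"):
    """x: (n_maps, H, W). Pool over window x window patches, then subsample by `stride`."""
    patches = sliding_window_view(x, (window, window), axis=(1, 2))
    patches = patches[:, ::stride, ::stride]       # subsampling with ratio S = stride
    if mode == "average":
        return patches.mean(axis=(-2, -1))         # P_A: uniform weighting window
    if mode == "max":
        return patches.max(axis=(-2, -1))          # P_M: max replaces the average
    raise ValueError("mode must be 'average' or 'max'")

x = np.arange(16.0).reshape(1, 4, 4)
avg = pool(x, window=2, stride=2, mode="average")  # [[[2.5, 4.5], [10.5, 12.5]]]
mx = pool(x, window=2, stride=2, mode="max")       # [[[5., 7.], [13., 15.]]]

# Average pooling of signed responses can cancel; rectifying first (R_abs) avoids that.
signed = np.array([[[1.0, -1.0], [-1.0, 1.0]]])
cancelled = pool(signed, window=2, stride=2)           # [[[0.]]]
rectified = pool(np.abs(signed), window=2, stride=2)   # [[[1.]]]
```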
Hierarchy among the Layers
Layers can be combined in various hierarchical
ways to obtain different architectures
F_CSG – P_A
F_CSG – R_abs – P_A
F_CSG – R_abs – N – P_A
F_CSG – P_M
A typical multistage architecture: F_CSG – R_abs – N – P_A
Unsupervised Training Protocols
Input: X (vectorized patch or stack of patches)
Dictionary: W – to be learnt
Feature Vector: Z* – obtained by minimizing the energy function
Learning procedure: Olshausen-Field
The energy function to be minimized
λ – Sparsity hyper-parameter
Learning W: Done by minimizing the loss function L_OF(W) using stochastic gradient descent
Obtaining Z* from E_OF via “basis pursuit” is an expensive optimization problem
Learning procedure: PSD
• E_PSD optimization is faster as it has the predictor term
• The goal of the algorithm is to make the regressor C(X,K,G) as close to Z as possible
• After training is complete, Z* = C(X,K,G) for an input X, i.e. the method is a fast feedforward pass
Regressor function mapping X → Y
Loss Function
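For reference, the energies and losses named on these two slides are commonly written as below for sparse coding and Predictive Sparse Decomposition. This is a sketch in the slides' notation; the training set symbol T and the exact form of the regressor are assumptions here.

```latex
E_{OF}(X, Z, W) = \lVert X - W Z \rVert_2^2 + \lambda \lVert Z \rVert_1,
\qquad
Z^{*} = \arg\min_{Z} E_{OF}(X, Z, W)

L_{OF}(W) = \frac{1}{|T|} \sum_{X \in T} \min_{Z} E_{OF}(X, Z, W)

E_{PSD}(X, Z, W, K, G) = \lVert X - W Z \rVert_2^2 + \lambda \lVert Z \rVert_1
  + \lVert Z - C(X, K, G) \rVert_2^2
```

The last term of E_PSD is the predictor term: it pulls the optimal code Z toward the output of the feedforward regressor, so that after training the regressor alone can stand in for the expensive minimization.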
Experiment: Caltech 101 Dataset
R and RR – Random Features and Supervised Classifier
U and UU – Unsupervised Features, Supervised Classifier
U+ and U+U+ – Unsupervised Features, Global Supervised Refinement
G – Gabor Functions
Remarks:
• Random filters with no filter learning achieve decent performance
• Both rectification and supervised fine-tuning improved performance
• Two-stage systems are better than single-stage models
• Unsupervised training does not significantly improve performance if both rectification and normalization are used
• The performance of Gabor filters was worse than that of random filters
Experiment: NORB Dataset
Remarks:
• Rectification and normalization make a significant improvement when the number of labeled samples is small
• As the number of samples increases, the improvement from rectification and normalization becomes insignificant
• Random filters perform much worse with a large number of labeled samples
Experiment: MNIST Dataset
• A two-stage feature extraction architecture was used
• The parameters are first trained using PSD
• The classifier is initialized randomly and the whole system is fine-tuned in supervised mode
• A test error rate of 0.53% was observed – the best known error rate on MNIST without distortions or preprocessing at the time
Coming back to the same Questions!
• How do the non-linearities following filter banks influence recognition accuracy?
They have a strong influence
– Rectification improves performance, possibly because i) non-polar features improve recognition or ii) it prevents cancellations between neighbors in the pooling layer
– Normalization also improves performance and makes supervised learning faster through contrast enhancement
• Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters?
Yes
– Random filters show good performance in the limit of small training set sizes, where the optimal stimuli for random filters are similar to those of trained filters
– Global supervised learning of filters yields good results if proper non-linearities are used
• Is there any benefit of using a 2-stage feature extractor as opposed to a single-stage feature extractor?
Yes
– The experiments show that a 2-stage feature extractor performs much better than single-stage feature extractor models.
Questions
Extra Slides
Hybrid Monte Carlo (HMC) Sampler
• Model samples are obtained by simulating a physical system
• Particles are subjected to potential and kinetic energies
• Velocities are sampled from a univariate Gaussian to obtain state_1
• The state of the particles follows conservation of the Hamiltonian H(s,φ)
• n steps of the Leap-Frog algorithm are applied to state_1 (s,φ) to obtain state_2
• Acceptance is performed based on P_acc(state_1, state_2)
Hamiltonian Dynamics
Leap-Frog Discretization
Neal RM, Probabilistic inference using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993