Learning Deep Energy Models

AI and Robotics

Oct 19, 2013

Author: Jiquan Ngiam et al., 2011

Presenter: Dibyendu Sengupta

1

The Deep Learning Problem

Vectorized pixel intensities
→ slightly higher level representation
→ ...
→ output, the highest level representation: That's the CRAZY FROG!!!!!

Learning and modeling the different layers is a very challenging problem.

2

Outline

Placing DEM in context of other models


Energy Based Models


Description of Deep Energy Models


Learning in Deep Energy Models


Experiments using DEMs




3

State of the art methods

Deep Belief Network (DBN): RBMs are stacked and trained in a "greedy" manner to form a DBN¹, each layer modeling the posterior distribution of the previous layer.

Deep Boltzmann Machine (DBM): A DBM² has undirected connections between the layers of networks initialized by RBMs. Joint training is done on all the layers.

Deep Energy Model (DEM): A DEM consists of a feedforward NN that deterministically transforms the input; the output of the feedforward network is modeled with a layer of stochastic hidden units.

[Figure: DBN, DBM and DEM layer diagrams]

¹ Hinton et al., A Fast Learning Algorithm for Deep Belief Nets, Neural Computation, 2006
² Salakhutdinov et al., Deep Boltzmann Machines, AISTATS, 2009


4

Deep Belief Network (DBN)

DBNs are graphical models which learn to extract a deep hierarchical representation of the training data by modeling the observed data x and l hidden layers h with the joint distribution

    P(x, h^1, ..., h^l) = ( prod_{k=0}^{l-2} P(h^k | h^{k+1}) ) P(h^{l-1}, h^l),   with h^0 = x

Algorithm:

1. Train the first layer as an RBM that models the input x as its visible layer.

2. Use the first layer's representation as input data for the second layer, chosen either as mean activations of, or samples from, the first layer's hidden units.

3. Iterate for the desired number of layers, each time propagating upward either samples or mean values.

4. Fine-tune all the parameters of this deep architecture with respect to the log-likelihood, or with respect to a supervised training criterion.
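The greedy stacking in steps 1-3 can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it uses CD-1 for the per-layer RBM training, propagates mean activations upward, omits the fine-tuning step, and the function names (`train_rbm`, `greedy_stack`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    # One RBM trained with CD-1 (contrastive divergence); W couples the
    # visible and hidden units, b and c are their offsets.
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b, c = np.zeros(n_vis), np.zeros(n_hidden)
    n = len(data)
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + c)                     # p(h|v) on the data
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                     # one-step reconstruction
        ph1 = sigmoid(pv1 @ W + c)
        W += lr * (data.T @ ph0 - pv1.T @ ph1) / n
        b += lr * (data - pv1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

def greedy_stack(data, layer_sizes):
    # Steps 1-3: train each RBM on the previous layer's mean activations.
    params, x = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(x, n_hidden)
        params.append((W, b, c))
        x = sigmoid(x @ W + c)    # propagate mean activations upward
    return params, x

layers, top = greedy_stack(rng.random((50, 20)), [16, 8])
```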


5

Deep Boltzmann Machine (DBM)


A DBM is similar to a DBN: the primary contrasting feature is undirected connections between all the layers.

A layerwise training algorithm is used to initialize the layers using RBMs.

Joint training is then performed on all the layers.

6

Motivation of Deep Energy Models (DEM)


Both DBN and DBM have multiple stochastic hidden layers.

Computing the conditional posterior over the stochastic hidden layers is intractable.

Learning and inference are more tractable in a single-layer RBM, but it suffers from a lack of representational power.

To overcome both defects, DEMs combine multiple deterministic hidden layers with a single layer of stochastic hidden units.

7

Outline

Placing DEM in context of other models


Energy Based Models


Description of Deep Energy Models


Learning in Deep Energy Models


Experiments using DEMs



8

Energy Based Models (EBMs)

An EBM assigns a scalar energy E(x, h) to every configuration and defines

    p(x) = Σ_h e^{-E(x, h)} / Z,   where Z = Σ_{x, h} e^{-E(x, h)}

x: visible units, h: hidden units, Z: partition function, F(x): free energy

Introducing the free energy F(x) = -log Σ_h e^{-E(x, h)} gives

    p(x) = e^{-F(x)} / Z,   Z = Σ_x e^{-F(x)}   (independent of h)

We would like the observed configurations to be at low energy.

9

General Learning Strategy for EBMs

Gradient-based methods on this functional formulation are used to learn the parameters θ:

    ∂ log p(x) / ∂θ = -∂F(x)/∂θ + Σ_{x̃} p(x̃) ∂F(x̃)/∂θ

In general, the "positive" (data-dependent) term is easy to compute, but the "negative" (model-dependent) term is often intractable and sampling needs to be done.

Expectations are computed for both terms to estimate their values.
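To make the positive/negative split concrete, here is a toy EBM (hypothetical, not from the paper) over three binary units with a linear energy. The state space is small enough to enumerate, so the usually intractable model expectation is exact here:

```python
import numpy as np
from itertools import product

# Toy EBM over 3 binary units with linear energy E(x) = -theta.x;
# the 8 states are enumerable, so the "negative" model term is exact.
theta = np.array([0.5, -1.0, 0.2])
states = np.array(list(product([0, 1], repeat=3)), dtype=float)

logits = states @ theta                          # -E(x) for every state
probs = np.exp(logits) / np.exp(logits).sum()    # p(x) = e^{-E(x)} / Z

data = np.array([[1, 0, 1], [1, 1, 0]], dtype=float)
# For this linear energy, d log p / d theta = E_data[x] - E_model[x]:
grad = data.mean(axis=0) - probs @ states        # positive minus negative term
```

In realistic models the `probs @ states` expectation cannot be enumerated and is replaced by samples, e.g. from the HMC sampler discussed later.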


10

Energy based models of RBM

RBM energy: E(v, h) = -b'v - c'h - h'Wv

W: weights connecting visible (v) and hidden (h) units
b and c: offsets of the visible and hidden units

[Figure: RBM representation]

Exploiting the bipartite structure of the RBM, the sum over h factorizes and we obtain a closed-form free energy. In particular, for RBMs with binary units:

    F(v) = -b'v - Σ_j log(1 + e^{c_j + (W'v)_j})
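Assuming the standard binary-unit RBM energy E(v, h) = -b·v - c·h - v·Wh, the closed-form free energy can be checked numerically against brute-force marginalization over the hidden units; a short NumPy sketch:

```python
import numpy as np
from itertools import product

def free_energy(v, W, b, c):
    """Binary-hidden-unit RBM free energy:
    F(v) = -b.v - sum_j log(1 + exp(c_j + (v W)_j))."""
    return -v @ b - np.logaddexp(0.0, v @ W + c).sum(axis=-1)

# Verify exp(-F(v)) = sum_h exp(-E(v, h)) on a tiny RBM,
# with E(v, h) = -b.v - c.h - v W h.
rng = np.random.default_rng(1)
W, b, c = rng.standard_normal((3, 2)), rng.standard_normal(3), rng.standard_normal(2)
v = np.array([1.0, 0.0, 1.0])
brute = sum(np.exp(v @ b + h @ c + v @ W @ h)
            for h in (np.array(t) for t in product([0.0, 1.0], repeat=2)))
```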

11

Outline

Placing DEM in context of other models


Energy Based Models


Description of Deep Energy Models


Learning in Deep Energy Models


Experiments using DEMs


12

Sigmoid Deep Energy Model

g_θ(v) represents the feedforward output of the neural network g_θ.

Similar to RBMs, an energy function defines the connections between g_θ(v) and the hidden units h (assumed binary).

The conditional posteriors of the hidden variables are easy to compute.

The representational power of the model can be increased by adding more layers to the feedforward NN.
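A minimal sketch of this factorial posterior, with an illustrative two-layer sigmoid network standing in for g_θ (the names, shapes, and single hidden layer are assumptions for the example, not the paper's architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def g_theta(v, weights, biases):
    # Deterministic feedforward transform; each additional (W, b) pair is
    # one more layer, i.e. more representational power.
    a = v
    for W, b in zip(weights, biases):
        a = sigmoid(a @ W + b)
    return a

def hidden_posterior(v, weights, biases, W_top, c):
    # p(h_j = 1 | v) = sigmoid((g_theta(v) W_top)_j + c_j); the posterior
    # factorizes over the binary hidden units, so it is cheap to evaluate.
    return sigmoid(g_theta(v, weights, biases) @ W_top + c)

rng = np.random.default_rng(0)
v = rng.random((4, 6))                               # batch of 4 inputs
weights, biases = [rng.standard_normal((6, 5))], [np.zeros(5)]
W_top, c = rng.standard_normal((5, 3)), np.zeros(3)
p = hidden_posterior(v, weights, biases, W_top, c)   # shape (4, 3)
```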

13

Generalized Deep Energy Models

Generalized Free Energy Function: the sigmoid DEM is recovered by taking g_θ as the feedforward NN.

Different models for H(v) enable DEMs with multiple layers of nonlinearities:

- PoT distribution
- Covariance RBM (cRBM) distribution

Both are examples of a 2-layered network: the first layer computes squared responses, followed by a soft-rectification.

There can also be linear combinations of models, like the mean-covariance RBM, which is a linear combination of an RBM and a cRBM.

14

An alternative deep version of DEM

PoT and cRBM use shallow feedback in the energy landscape.

"Stacked PoT", or SPoT, is chosen as a deeper version of PoT, built by stacking a number of PoT layers.

This creates a more expressive, deeper model.

15

Outline

Placing DEM in context of other models


Energy Based Models


Description of Deep Energy Models


Learning in Deep Energy Models


Experiments using DEMs

16

Learning Parameters in DEMs

Models were trained by maximizing the log-likelihood.

Stochastic gradient ascent was used to learn the parameters θ.

The update rules are similar to the generalized Energy Based Model (EBM) updates:

- 2nd term, expectation over the data: can be easily computed.
- 1st term, expectation over the model distribution: harder to compute, often intractable, and approximated by sampling.

17

Hybrid Monte Carlo (HMC) Sampler

- Model samples are obtained by simulating a physical system.
- Particles are subjected to potential and kinetic energies.
- Velocities are sampled from a univariate Gaussian to obtain state_1.
- The state of the particles follows conservation of the Hamiltonian H(s, φ).
- n steps of the Leap-Frog algorithm are applied to state_1 (s, φ) to obtain state_2.
- Acceptance is performed based on P_acc(state_1, state_2).

Neal RM, Probabilistic inference using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993

[Figure: Hamiltonian dynamics]
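The steps above can be sketched as a single generic HMC transition. This is an illustration, not the paper's tuned sampler: the step size, step count, and the standard-Gaussian toy energy are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(s, energy, grad, eps=0.2, n_steps=10):
    # One HMC transition: draw velocities, run n leap-frog steps under the
    # (approximately conserved) Hamiltonian, then Metropolis accept/reject.
    phi = rng.standard_normal(s.shape)                 # velocities -> state_1
    h0 = energy(s) + 0.5 * np.sum(phi ** 2)            # Hamiltonian H(s, phi)
    s_new = s.copy()
    phi_new = phi - 0.5 * eps * grad(s_new)            # opening half step
    for i in range(n_steps):
        s_new = s_new + eps * phi_new                  # full step in position
        if i < n_steps - 1:
            phi_new = phi_new - eps * grad(s_new)      # full step in velocity
    phi_new = phi_new - 0.5 * eps * grad(s_new)        # closing half step
    h1 = energy(s_new) + 0.5 * np.sum(phi_new ** 2)    # Hamiltonian at state_2
    # P_acc(state_1, state_2) = min(1, exp(H0 - H1))
    return s_new if rng.random() < np.exp(min(0.0, h0 - h1)) else s

# Toy free energy F(s) = 0.5 ||s||^2, i.e. a standard Gaussian target.
energy = lambda s: 0.5 * np.sum(s ** 2)
grad = lambda s: s

s, samples = np.zeros(3), []
for _ in range(500):
    s = hmc_step(s, energy, grad)
    samples.append(s)
samples = np.array(samples)
```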

18

Greedy Layerwise and Joint Training

First, greedy layerwise training is performed by:

- Training successive layers to optimize the data likelihood
- Freezing the parameters of the earlier layers
- Keeping the learning objective (i.e. data likelihood) the same throughout training of the deep model

Joint training is subsequently performed on all layers by:

- Unfreezing the weights
- Optimizing the same objective function
- Computational cost is comparable to layerwise training

Training a DEM is computationally much cheaper than a DBN or DBM, since only the top layer needs sampling; all intermediate layers are deterministic hidden units.

19

Discriminative Deep Energy Models

a_l: activations of the l-th layer of g_θ, used to learn a linear classifier of image labels y via weights U.

Training: done with a hybrid generative-discriminative objective.

- Gradient of the generative cost: can be computed by the previously discussed method.
- Gradient of the discriminative cost: can be computed by considering the model to be a short-circuited feedforward NN with softmax classification.
classification

20

Outline

Placing DEM in context of other models


Energy Based Models


Description of Deep Energy Models


Learning in Deep Energy Models


Experiments using
DEMs

21

Experiments: Natural Images

Remarks:

- Experiments were done with sigmoid and SPoT models, evaluated with Annealed Importance Sampling (AIS).
- M1 and M2: training using greedy layerwise stacking of 1 and 2 layers respectively.
- M1-M2-M12: greedy layerwise training for 2 layers followed by joint training of the two layers.
- Joint training improves performance over pure greedy layerwise training, but the convergence of the log-likelihood is not evident from the plots.
- M1-M2-M12 seems to require a significantly larger number of iterations.
- Adding multiple layers in SPoT significantly improves model performance.

Convergence: Questionable!

22

Experiments: Object Recognition

[Figure: samples from the SPoT M12 model trained on the NORB dataset]

Remarks:

- Models were trained on the NORB dataset.
- Hybrid discriminative-generative deep models (SPoT) performed better than the fully discriminative model.
- The fully discriminative model suffers from overfitting.
- The optimal α parameter, which weighs the discriminative-generative cost, is obtained by cross-validation on a subset of the training data.
- Iteration counts for convergence of the models were not reported.


23

Conclusions

- It is often difficult to scale the SPoT model to realistic datasets because of the slowness of HMC.
- Jointly training all layers yields significantly better models than greedy layerwise training.

[Figure: learned filters. Single-layer sigmoid DEM, trained greedy layerwise: filters appear blob-like. Two-layer sigmoid DEM, trained with greedy and joint training: filters appear Gabor-like.]

24

What is the best Multi-Stage Architecture for Object Recognition?

Author: Kevin Jarrett et al., 2009

Presenter: Sreeparna Mukherjee

25

Coming back to the starting problem

Vectorized pixel intensities
→ Feature Extraction: Stage 1
→ ...
→ output, the highest level feature extraction: That's the CRAZY FROG!!!!!

Can it be done more efficiently with multiple feature extraction stages instead of just one?

26

Outline

Different Feature Extraction Models


Multi-Stage Feature Extraction Architecture


Learning Protocols


Experiments




27

Existing Feature Extraction Models

There are several single-stage feature extraction systems inspired by the mammalian visual cortex:

- Scale Invariant Feature Transform (SIFT)
- Histogram of Oriented Gradients (HOG)
- Geometric Blur

There are also models with two or more successive stages of feature extraction:

- Convolutional networks trained in supervised or unsupervised mode
- Multistage systems using non-linear MAX or HMAX models


28

Contrasts among different models

The feature extraction models primarily differ in the following aspects:

- Number of stages of feature extraction
- Type of non-linearity used after the filter bank
- Type of filter used
- Type of classifier used



29

Questions to Address


- How do the non-linearities following the filter banks influence recognition accuracy?

- Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters?

- Is there any benefit of using a 2-stage feature extractor as opposed to a single-stage feature extractor?

30

Intuitions in Feature Extraction Architecture

- Supervised training on a small labeled dataset (e.g. Caltech-101) will fail.
- Filters need to be carefully handpicked for good performance.
- Non-linearities should not be a significant factor.

What do you think?

These intuitions are wrong. We'll see how!!!!

31

Outline

Different Feature Extraction Models


Multi-Stage Feature Extraction Architecture


Learning Protocols


Experiments




32

General Model Architecture

Vectorized pixel intensities
→ Filter Bank Layer
→ Non Linear Transformation Layers
→ Pooling Layer: local averaging to remove small perturbations
→ output, the highest level feature extraction: That's the CRAZY FROG!!!!!

33

Filter Bank Layer (F_CSG)

Input (x): n_1 2D feature maps of size n_2 × n_3; x_ijk is each component of feature map x_i.

Output (y): m_1 feature maps of size m_2 × m_3.

Filter (k): k_ij is a filter in the filter bank, of size l_1 × l_2, mapping x_i to y_j.





34

Non-Linear Transformations in F_CSG

F_CSG comprises Convolution filters (C), a Sigmoid/tanh non-linearity (S), and Gain coefficients (G) g_j:

    y_j = g_j tanh( Σ_i k_ij * x_i )
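A direct, loop-based NumPy sketch of this layer, assuming "valid" convolutions and tiny dimensions; `conv2d_valid` and `f_csg` are illustrative names, not the paper's code:

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2D convolution of one feature map x with one filter k."""
    n2, n3 = x.shape
    l1, l2 = k.shape
    out = np.zeros((n2 - l1 + 1, n3 - l2 + 1))
    kf = k[::-1, ::-1]                       # flip the kernel for true convolution
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(x[p:p + l1, q:q + l2] * kf)
    return out

def f_csg(x, k, g):
    """y_j = g_j * tanh(sum_i k_ij * x_i).
    x: (n1, n2, n3) input maps, k: (n1, m1, l1, l2) filter bank, g: (m1,) gains."""
    n1, m1 = k.shape[0], k.shape[1]
    return np.stack([
        g[j] * np.tanh(sum(conv2d_valid(x[i], k[i, j]) for i in range(n1)))
        for j in range(m1)
    ])

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 8))           # n1 = 2 maps of size 8 x 8
k = 0.1 * rng.standard_normal((2, 3, 3, 3))  # m1 = 3 filters of size 3 x 3
g = np.array([1.0, 2.0, 0.5])
y = f_csg(x, k, g)                           # 3 output maps of size 6 x 6
```

Because tanh is bounded by 1, each output map y_j is bounded by its gain |g_j|.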

35

Rectification Layer (R_abs)

This layer returns the absolute value of its input.

Other rectifying non-linearities produced results similar to R_abs.


36

Local Contrast Normalization Layer (N)

This layer performs local

- Subtractive normalization: v_ijk = x_ijk - Σ_{ipq} w_pq · x_{i, j+p, k+q}
- Divisive normalization: y_ijk = v_ijk / max(c, σ_jk), with σ_jk = ( Σ_{ipq} w_pq · v²_{i, j+p, k+q} )^{1/2}

w_pq is a normalized Gaussian weighting window.
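A single-feature-map sketch of both operations, assuming a Gaussian window, zero padding, and a divisive denominator max(c, σ) with c the mean of σ over the map; the real layer also sums across feature maps, so this is only an illustration:

```python
import numpy as np

def gaussian_window(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    w = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return w / w.sum()                          # normalized Gaussian window w_pq

def local_weighted(x, w):
    """Weighted sum of x over the window w at every location (zero padding)."""
    size = w.shape[0]
    pad = size // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for p in range(size):
        for q in range(size):
            out += w[p, q] * xp[p:p + x.shape[0], q:q + x.shape[1]]
    return out

def local_contrast_norm(x, size=9, sigma=2.0):
    w = gaussian_window(size, sigma)
    v = x - local_weighted(x, w)                # subtractive normalization
    sig = np.sqrt(local_weighted(v ** 2, w))    # local weighted std
    c = sig.mean()                              # divisive: divide by max(c, sigma)
    return v / np.maximum(sig, c)

rng = np.random.default_rng(0)
out = local_contrast_norm(rng.standard_normal((12, 12)), size=5, sigma=1.0)
```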


37

Average Pooling and Subsampling Layer (P_A)

Averaging: creates robustness to small distortions; w_pq is a uniform weighting window.

Subsampling: spatial resolution is reduced by down-sampling with a ratio S in both spatial directions.

Max Pooling and Subsampling Layer (P_M)

The average operation is replaced by a max operation; the subsampling procedure stays the same.
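Both layers can be sketched as one function over non-overlapping S × S windows (a minimal illustration; `pool` is an illustrative name, and pooling with stride S implements the averaging/max and the down-sampling in one step):

```python
import numpy as np

def pool(x, S, mode="avg"):
    """Pool non-overlapping S x S windows (uniform weighting w_pq) and
    down-sample by the ratio S in both spatial directions."""
    h, w = (x.shape[0] // S) * S, (x.shape[1] // S) * S
    blocks = x[:h, :w].reshape(h // S, S, w // S, S)
    return blocks.mean(axis=(1, 3)) if mode == "avg" else blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pa = pool(x, 2, "avg")   # P_A: [[2.5, 4.5], [10.5, 12.5]]
pm = pool(x, 2, "max")   # P_M: [[5, 7], [13, 15]]
```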

38

Hierarchy among the Layers

Layers can be combined in various hierarchical ways to obtain different architectures:

- F_CSG → P_A
- F_CSG → R_abs → P_A
- F_CSG → R_abs → N → P_A
- F_CSG → P_M

A typical multistage architecture: F_CSG → R_abs → N → P_A

39

Outline

Different Feature Extraction Models


Multi-Stage Feature Extraction Architecture


Learning Protocols


Experiments




40

Unsupervised Training Protocols

Input: X (vectorized patch or stack of patches)

Dictionary: W (to be learnt)

Feature vector: Z* (obtained by minimizing the energy function)

41

Learning procedure: Olshausen-Field

The energy function to be minimized:

    E_OF(X, Z; W) = ||X - W Z||² + λ |Z|_1

λ: sparsity hyper-parameter

Obtaining Z* from E_OF via "basis pursuit" is an expensive optimization problem.

Learning W: done by minimizing the loss function L_OF(W) using stochastic gradient descent.

42
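One standard way to carry out the sparse-coding step (a stand-in for the "basis pursuit" solver, which the slide does not specify) is iterative soft-thresholding; a minimal NumPy sketch:

```python
import numpy as np

def ista(X, W, lam=0.1, n_iter=100):
    """Minimize E_OF(Z) = ||X - W Z||^2 + lam * |Z|_1 over Z by iterative
    soft-thresholding. Each input X needs its own optimization run, which is
    what makes inference under E_OF expensive."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2 + 1e-12      # step bound from the smooth term
    Z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        Z = Z + (2.0 / L) * (W.T @ (X - W @ Z))      # gradient step on ||X - W Z||^2
        Z = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)  # soft threshold
    return Z

# With W = I, the solution is soft-thresholding of X itself: small
# coefficients are driven exactly to zero (sparsity).
Z_star = ista(np.array([1.0, 0.05, -2.0]), np.eye(3), lam=0.2)
```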


Learning procedure: PSD

Loss function:

    E_PSD = ||X - W Z||² + λ |Z|_1 + α ||Z - C(X; K, G)||²

C(X; K, G): regressor function mapping X to the code Z

- E_PSD optimization is faster, as it has the predictor term.
- The goal of the algorithm is to make the regressor C(X; K, G) as close to Z as possible.
- After training completion, Z* = C(X; K, G) for input X, i.e. the method is a fast feedforward pass.

43
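The PSD loss can be sketched directly, assuming the common predictor form C(X; K, G) = G · tanh(K X) (the exact predictor parameterization is an assumption here, and λ, α are illustrative values):

```python
import numpy as np

def predictor(X, K, G):
    """Feed-forward regressor C(X; K, G) = G * tanh(K X) approximating Z*."""
    return G * np.tanh(K @ X)

def e_psd(X, Z, W, K, G, lam=0.1, alpha=1.0):
    """E_PSD = ||X - W Z||^2 + lam |Z|_1 + alpha ||Z - C(X; K, G)||^2."""
    return (np.sum((X - W @ Z) ** 2)            # reconstruction term
            + lam * np.sum(np.abs(Z))           # sparsity term
            + alpha * np.sum((Z - predictor(X, K, G)) ** 2))  # predictor term

# Tiny 1-D example: perfect reconstruction (0) + sparsity (0.1) +
# an untrained zero predictor missing Z = 1 (penalty 1.0).
val = e_psd(X=np.array([1.0]), Z=np.array([1.0]), W=np.eye(1),
            K=np.zeros((1, 1)), G=np.array([1.0]))
```

Minimizing the extra predictor term is what lets inference collapse to a single feedforward pass after training.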

Outline

Different Feature Extraction Models


Multi-Stage Feature Extraction Architecture


Learning Protocols


Experiments




44

Experiment: Caltech 101 Dataset

- R and RR: random features and supervised classifier
- U and UU: unsupervised features, supervised classifier
- U+ and U+U+: unsupervised features, global supervised refinement
- G: Gabor functions

Remarks:

- Random filters, with no filter learning, achieve decent performance.
- Both rectification and supervised fine-tuning improved performance.
- Two-stage systems are better than single-stage models.
- Unsupervised training does not significantly improve performance if both rectification and normalization are used.
- The performance of Gabor filters was worse than that of random filters.

45

Experiment: NORB Dataset

Remarks:

- Rectification and normalization make a significant improvement when the number of labeled samples is low.
- As the number of samples increases, the improvement from rectification and normalization becomes insignificant.
- Random filters perform much worse with a large number of labeled samples.


46

Experiment: MNIST Dataset

- A two-stage feature extraction architecture was used.
- The parameters are first trained using PSD.
- The classifier is initialized randomly and the whole system is fine-tuned in supervised mode.
- A test error rate of 0.53% was observed: the best known error rate on MNIST without distortions or preprocessing.

47

Coming back to the same Questions!

How do the non-linearities following the filter banks influence recognition accuracy? Significantly.

- Rectification improves performance, possibly because i) non-polar features improve recognition, or ii) it prevents cancellations between neighbors in the pooling layer.
- Normalization also improves performance and makes supervised learning faster through contrast enhancement.

Does unsupervised or supervised learning of filter banks improve performance over hard-wired or random filters? Yes.

- Random filters show good performance in the limit of small training set sizes, where the optimal stimuli for random filters are similar to those of trained filters.
- Global supervised learning of filters yields good results if proper non-linearities are used.

Is there any benefit of using a 2-stage feature extractor as opposed to a single-stage feature extractor? Yes.

- The experiments show that a 2-stage feature extractor performs much better than single-stage feature extractor models.

48

Questions

49

Extra Slides

50

Hybrid Monte Carlo (HMC) Sampler

- Model samples are obtained by simulating a physical system.
- Particles are subjected to potential and kinetic energies.
- Velocities are sampled from a univariate Gaussian to obtain state_1.
- The state of the particles follows conservation of the Hamiltonian H(s, φ).
- n steps of the Leap-Frog algorithm are applied to state_1 (s, φ) to obtain state_2.
- Acceptance is performed based on P_acc(state_1, state_2).

Neal RM, Probabilistic inference using Markov Chain Monte Carlo Methods, Technical Report, U Toronto, 1993

[Figures: Hamiltonian dynamics; Leap-Frog discretization]

51