Deep Belief Nets and Restricted Boltzmann Machines

guineanhill, Artificial Intelligence and Robotics

20 Oct 2013

Erte Pan

Wireless Eng. Group


Advisor: Dr. Han


Department of Electrical and Computer Engineering

University of Houston, Houston, TX.



Graphical Model

[Figure: two network diagrams, each with a hidden layer and a visible layer; units are labeled i and j]



Undirected graphical model:

- links have no directional significance
- inference (inferring the states of unobserved variables) is easy
- learning (adjusting the weights between variables so that the network is more likely to generate the observed data) and generation are tricky

Directed graphical model:

- links have a particular directionality, indicated by arrows
- inference is difficult
- learning and generation are simple

Generative model: a graphical model that captures the causal process by which the observed data were generated is also called a generative model.

Boltzmann Machine Model



Boltzmann Machine:

- one input (visible) layer and one hidden layer
- typically binary states for every unit
- stochastic (vs. deterministic)
- recurrent (vs. feed-forward)
- generative (vs. discriminative): it estimates the distribution of the observations, e.g. p(image), whereas traditional discriminative networks only estimate the labels, e.g. p(label|image)
- the model is defined by the Energy of the network and the Probability of a unit's state, where the scalar T is referred to as the "temperature"
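A standard way to write these two definitions, assuming symmetric weights w_{ij}, biases b_i, and binary states s_i (symbols chosen here for illustration, not taken from the slide), is:

```latex
E = -\sum_{i<j} w_{ij}\, s_i s_j - \sum_i b_i s_i
\\
P(s_i = 1) = \frac{1}{1 + \exp(-\Delta E_i / T)},
\qquad
\Delta E_i = \sum_j w_{ij} s_j + b_i
```

Here \Delta E_i is the energy gap between unit i being off and being on. A large T pushes every unit toward a coin flip, while T close to 0 makes the update nearly deterministic.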

[Figure: Boltzmann Machine]

Restricted Boltzmann Machine Model

[Figure: Restricted Boltzmann Machine]



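Restricted Boltzmann Machine:

- a bipartite graph: no intralayer connections; the only links are the symmetric ones between the visible layer and the hidden layer
- the RBM has no temperature factor T (equivalently, T = 1); everything else is the same as in the BM
- one important feature of the RBM is that the hidden units are conditionally independent given the visible units, and the visible units are conditionally independent given the hidden units, which leads to a beautiful result later on:

  P(h|v) = \prod_j P(h_j|v),   P(v|h) = \prod_i P(v_i|h)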



Stochastic Search



Why BM?

- Traditional networks and the BM use different optimization criteria:
  - Traditional: an error criterion. Back-propagation (BP) strictly follows the gradient-descent direction, and any step that increases the error is NOT accepted, so it easily gets stuck in local minima.
  - BM: the network is associated with an "Energy". Simulated Annealing allows the energy to increase with a certain probability, which lets the search escape local minima.

Simulated Annealing



Simulated Annealing for BM:

1. Create an initial solution S (the global state of the network)
   Initialize the temperature T >> 1
2. Repeat until T = T-lower-bound
     Repeat until thermal equilibrium is reached at the current T
       Generate a random transition from S to S'
       Let ΔE = E(S') - E(S)
       if ΔE < 0 then S = S'
       else if exp[-ΔE/T] > rand(0,1) then S = S'
     Reduce the temperature T according to the cooling schedule
3. Return S

The exp[-ΔE/T] term allows "thermal disturbance", which facilitates finding the global minimum.
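The annealing loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slide author's implementation: the energy uses symmetric pairwise weights and biases, a random transition is a single-unit flip, "thermal equilibrium" is approximated by a fixed number of proposals per temperature, and the cooling schedule is geometric. The parameter names (t_start, t_min, cooling, steps_per_t) are hypothetical.

```python
import math
import random


def energy(state, weights, biases):
    """Energy of a binary state; only the upper triangle of `weights` is used."""
    n = len(state)
    e = -sum(biases[i] * state[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            e -= weights[i][j] * state[i] * state[j]
    return e


def simulated_annealing(weights, biases, t_start=10.0, t_min=0.05,
                        cooling=0.9, steps_per_t=50, rng=None):
    rng = rng or random.Random(0)
    n = len(biases)
    state = [rng.randint(0, 1) for _ in range(n)]  # step 1: initial solution S
    t = t_start                                    # step 1: T >> 1
    while t > t_min:                               # step 2: until T hits its lower bound
        # Assumption: a fixed number of proposals stands in for
        # "until thermal equilibrium is reached".
        for _ in range(steps_per_t):
            i = rng.randrange(n)
            candidate = state[:]
            candidate[i] = 1 - candidate[i]        # random transition S -> S'
            delta = (energy(candidate, weights, biases)
                     - energy(state, weights, biases))
            # Always accept downhill moves; accept uphill moves with
            # probability exp(-delta / T) ("thermal disturbance").
            if delta < 0 or math.exp(-delta / t) > rng.random():
                state = candidate
        t *= cooling                               # assumed geometric cooling schedule
    return state                                   # step 3
```

With three units, positive weights between every pair, and positive biases, the lowest-energy state is all units on, and the sketch is expected to settle there once the temperature is low.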

Restricted Boltzmann Machine



Two things define a Restricted Boltzmann Machine:

- the states of all the units: obtained through a probability distribution
- the weights of the network: obtained through training (Contrastive Divergence)

As mentioned before, the objective of the RBM is to estimate the distribution of the input data. Given the input, this estimate is fully determined by the weights.

Energy defined for the RBM:

Restricted Boltzmann Machine



Distribution of the visible layer of the RBM (Boltzmann distribution):

- Z is the partition function, defined as the sum over all possible configurations of {v,h}

Probability that unit i is on (binary state 1):

- σ(.) is the logistic/sigmoid function
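In the usual RBM notation (an assumption here: visible biases a_i, hidden biases b_j, and weights w_{ij}), the energy, the Boltzmann distribution of the visible layer, and the unit-wise conditional probabilities named above are commonly written as:

```latex
E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j
\\
P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)}, \qquad Z = \sum_{v,h} e^{-E(v,h)}
\\
P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad
P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)
```

The conditionals take the sigmoid form because, with no intralayer connections, each unit's energy contribution depends only on the units of the other layer.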

Restricted Boltzmann Machine



Training for RBM: Maximum Likelihood learning

- the probability over a vector x with parameters W (the weights) is defined through the energy
- given i.i.d. samples, the objective is to maximize the average log-likelihood
- the gradient of this objective is then computed as a difference of two averages:
  - <.>_0 denotes an average w.r.t. the data distribution
  - <.>_∞ denotes an average w.r.t. the model distribution
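Assuming the notation commonly used in the contrastive-divergence literature listed in the references (energy E(x;W), partition function Z(W), N training samples x^{(n)}), the three quantities on this slide can be written as:

```latex
p(x; W) = \frac{e^{-E(x; W)}}{Z(W)}, \qquad Z(W) = \sum_x e^{-E(x; W)}
\\
L(W) = \frac{1}{N} \sum_{n=1}^{N} \log p\big(x^{(n)}; W\big)
     = \big\langle \log p(x; W) \big\rangle_0
\\
\frac{\partial L}{\partial W}
  = -\Big\langle \frac{\partial E}{\partial W} \Big\rangle_0
    + \Big\langle \frac{\partial E}{\partial W} \Big\rangle_\infty
```

The second term comes from differentiating \log Z(W), which turns into an expectation under the model's own distribution; that expectation is the part that is expensive to estimate.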

Restricted Boltzmann Machine



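Then the weights W are updated along this gradient:

- the <.>_0 term can be computed from the input samples
- the <.>_∞ term can be estimated by MCMC, but this is very slow and the estimated gradient suffers from large variance

Solution: Contrastive Divergence

- maximizing the log probability of the data is the same as minimizing the KL divergence KL(p_0 \| p_\infty) between the data distribution p_0 and the model's equilibrium distribution p_\infty
- define CD to be CD_n = KL(p_0 \| p_\infty) - KL(p_n \| p_\infty), where n is the small number of steps for which we run the Markov chain and p_n is the distribution after n steps of the chain started at the data
- use the approximate negative gradient of CD_n, multiplied by the learning rate, as the update of the weights

Note: this update direction is NOT the gradient of ANY function, yet it is successful in applications.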

Restricted Boltzmann Machine



Summarized algorithm for training an RBM:

1) Take a training sample v, compute the probabilities of the hidden units, and sample a hidden activation vector h from this probability distribution.
2) Compute the outer product of v and h (an estimate of the expectation of vh) and call this the positive gradient (clamped phase, or positive phase).
3) From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this.
4) Compute the outer product of v' and h' (an estimate of the expectation of v'h') and call this the negative gradient (free phase, or negative phase).
5) Let the weight update ΔW_ij be the positive gradient minus the negative gradient, times some learning rate.

In the RBM, the previous equations then become (for a particular weight between two units i and j, with learning rate \epsilon):

  \Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_\infty)    (maximum likelihood)
  \Delta w_{ij} = \epsilon (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_n)         (CD_n)

These two equations are obtained by substituting the energy function into the learning rule.
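The five steps above can be sketched as a CD-1 update in plain Python. This is an illustrative sketch, not the slide author's code: it assumes binary units, uses the hidden probabilities rather than sampled hidden states when accumulating the statistics (a common variance-reduction choice), and also updates the biases, which the slide does not mention. All function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_bernoulli(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W, b_hid):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]


def visible_probs(h, W, b_vis):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(b_vis[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_vis))]


def cd1_update(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 step on a single training vector; updates parameters in place."""
    ph0 = hidden_probs(v0, W, b_hid)                      # step 1: hidden probabilities
    h0 = sample_bernoulli(ph0, rng)                       # step 1: sample h
    v1 = sample_bernoulli(visible_probs(h0, W, b_vis), rng)  # step 3: reconstruction v'
    ph1 = hidden_probs(v1, W, b_hid)                      # step 3: hidden probs for v'
    for i in range(len(v0)):
        for j in range(len(ph0)):
            # steps 2, 4, 5: positive minus negative statistics, times learning rate
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(v0)):
        b_vis[i] += lr * (v0[i] - v1[i])                  # assumed bias update
    for j in range(len(ph0)):
        b_hid[j] += lr * (ph0[j] - ph1[j])                # assumed bias update


def train_rbm(data, n_hidden, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b_vis = [0.0] * n_visible
    b_hid = [0.0] * n_hidden
    for _ in range(epochs):
        for v in data:
            cd1_update(v, W, b_vis, b_hid, lr, rng)
    return W, b_vis, b_hid
```

A call such as train_rbm([[1, 1, 0, 0], [0, 0, 1, 1]], n_hidden=2) returns the weight matrix and the two bias vectors; running the chain for more than one step per update would give CD_n with n > 1.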

General Deep Belief Nets



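Problem with DBNs:

Since DBNs are directed graphical models, the posterior over the hidden units given the input data is intractable, due to the "explaining away" effect.

Solution: Complementary Priors, which ensure that the posterior over the hidden units factorizes (the hidden units are independent given the data).

[Figure: explaining-away example. Two hidden causes, "truck hits house" and "earthquake" (bias -10 each), are each connected with weight 20 to the observed effect "house jumps" (bias -20).]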

General Deep Belief Nets

Explaining Away Effect

Posterior over the two hidden causes, given that the house jumps:

p(1,1)=.0001
p(1,0)=.4999
p(0,1)=.4999
p(0,0)=.0001

Explaining Away Effect



Brief summary of the explaining away effect:

- Given the observations, the posteriors of the associated hidden variables are NOT independent (whether one hidden variable is on or off influences the probable states of the others), even though the hidden variables are assumed to be independent in their priors.
- The reason is the non-independence in the likelihood term:

  Posterior (non-indep.) ∝ prior (indep.) * likelihood (non-indep.)
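The effect can be checked numerically on the truck/earthquake example. The sketch below assumes a sigmoid belief net in which each cause has bias -10, each cause-to-effect weight is 20, and the effect has bias -20; this assignment of the figure's numbers to the graph is an assumption. It multiplies the independent prior by the likelihood of "house jumps = 1" and normalizes.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def posterior_over_causes(bias_cause=-10.0, weight=20.0, bias_effect=-20.0):
    """Posterior P(truck, quake | house jumps = 1) in a tiny sigmoid belief net."""
    p_cause = sigmoid(bias_cause)          # independent prior for each cause
    joint = {}
    for truck, quake in product((1, 0), repeat=2):
        prior = 1.0
        for c in (truck, quake):
            prior *= p_cause if c else (1.0 - p_cause)
        # Likelihood of the observed effect depends on both causes jointly.
        likelihood = sigmoid(bias_effect + weight * (truck + quake))
        joint[(truck, quake)] = prior * likelihood
    z = sum(joint.values())
    return {state: p / z for state, p in joint.items()}
```

Under these assumed parameters almost all of the posterior mass falls on the two states in which exactly one cause is active, so the posterior is far from the product of its marginals, which is the pattern the slide's table shows.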



Eliminate Explaining Away by Complementary Priors

- Add extra hidden layers to create a complementary prior (CP) whose correlations are exactly opposite to those in the likelihood term, so that when the likelihood is multiplied by the prior, the posterior becomes factorial.

[Figure: a directed stack of layers v0, h0, v1, h1, v2, h2, etc.]

Complementary Priors



Definition of Complementary Priors:

- Consider observations x and hidden variables y. For a given likelihood function P(x|y), the prior P(y) is called a complementary prior of P(x|y) provided that P(x,y) = P(x|y) P(y) leads to a posterior P(y|x) that exactly factorises.

Infinite directed model with tied weights, Complementary Priors, and Gibbs sampling:

- Recall that RBMs have the property P(h|v) = \prod_j P(h_j|v) and P(v|h) = \prod_i P(v_i|h).
- The definition of the RBM's energy function makes it a proper model with two sets of conditional independencies (complementary priors for both v and h).
- Since we need to estimate the distribution of the data, P(v), we can perform alternating Gibbs sampling from P(v,h) for infinitely many steps. This procedure is analogous to unrolling the single RBM into an infinite directed stack of RBMs with tied weights (due to the "complementary priors"), where each RBM takes its input from the hidden layer of the RBM below it.
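The claim that alternating Gibbs sampling recovers P(v) can be illustrated on an RBM small enough to enumerate. The sketch below is only an illustration under assumptions: a binary RBM with hypothetical parameters W, b_vis, b_hid, a finite chain in place of the infinite one, and a fixed burn-in. It compares the empirical frequencies of visible states along the chain with the exact marginal obtained by summing exp(-E(v,h)) over all hidden states.

```python
import math
import random
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rbm_energy(v, h, W, b_vis, b_hid):
    e = -sum(b_vis[i] * v[i] for i in range(len(v)))
    e -= sum(b_hid[j] * h[j] for j in range(len(h)))
    e -= sum(v[i] * W[i][j] * h[j]
             for i in range(len(v)) for j in range(len(h)))
    return e


def exact_visible_marginal(W, b_vis, b_hid):
    """P(v) by brute-force enumeration (only feasible for tiny RBMs)."""
    unnorm = {}
    for v in product((0, 1), repeat=len(b_vis)):
        unnorm[v] = sum(math.exp(-rbm_energy(v, h, W, b_vis, b_hid))
                        for h in product((0, 1), repeat=len(b_hid)))
    z = sum(unnorm.values())
    return {v: p / z for v, p in unnorm.items()}


def gibbs_visible_frequencies(W, b_vis, b_hid, n_steps=20000,
                              burn_in=500, seed=0):
    """Empirical frequencies of v along an alternating Gibbs chain."""
    rng = random.Random(seed)
    n_v, n_h = len(b_vis), len(b_hid)
    v = [rng.randint(0, 1) for _ in range(n_v)]
    counts = {}
    for step in range(burn_in + n_steps):
        # Sample all hidden units given v, then all visible units given h.
        h = [1 if rng.random() < sigmoid(
                b_hid[j] + sum(v[i] * W[i][j] for i in range(n_v))) else 0
             for j in range(n_h)]
        v = [1 if rng.random() < sigmoid(
                b_vis[i] + sum(W[i][j] * h[j] for j in range(n_h))) else 0
             for i in range(n_v)]
        if step >= burn_in:
            counts[tuple(v)] = counts.get(tuple(v), 0) + 1
    return {state: c / n_steps for state, c in counts.items()}
```

For small weights the chain mixes quickly, so the two distributions should agree closely; with large weights far more steps would be needed, which is the practical motivation for the shortcut taken by Contrastive Divergence.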

DBNs based on RBMs



DBNs based on stacks of RBMs:

- The top two hidden layers form an undirected associative memory (regarded as shorthand for the infinite stack), and the remaining hidden layers form a directed acyclic graph.
- The red arrows are NOT part of the generative model. They are used only for inference.

[Figure: a stack of three RBMs over the layers data, h1, h2, h3]

Training Deep Belief Nets



- The previous discussion gives the intuition for training stacks of RBMs one layer at a time.
- Hinton shows that this greedy learning algorithm is justified in the sense of a variational bound on the log probability of the data.
- First, learn with all the weights tied.

[Figure: learn as a single RBM]

Training Deep Belief Nets



- Then freeze the bottom layer and relearn all the other layers.
- Then freeze the bottom two layers and relearn all the other layers.

[Figure: at each stage, the layers being relearned are learned as a single RBM]
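The greedy layer-by-layer procedure can be sketched as follows. This is a simplified illustration, not Hinton's implementation: each layer is trained as an ordinary RBM with CD-1, a mean-field reconstruction is used in the negative phase, and the hidden probabilities of a frozen layer serve as the training data for the next one. The names train_rbm, train_dbn, and layer_sizes are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def up(v, W, b_hid):
    """Mean hidden activations P(h_j = 1 | v)."""
    return [sigmoid(b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hid))]


def down(h, W, b_vis):
    """Mean visible activations P(v_i = 1 | h)."""
    return [sigmoid(b_vis[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_vis))]


def train_rbm(data, n_hidden, epochs, lr, rng):
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    b_vis, b_hid = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = up(v0, W, b_hid)
            h0 = [1 if rng.random() < p else 0 for p in ph0]
            v1 = down(h0, W, b_vis)        # assumed mean-field reconstruction
            ph1 = up(v1, W, b_hid)
            for i in range(n_visible):
                b_vis[i] += lr * (v0[i] - v1[i])
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            for j in range(n_hidden):
                b_hid[j] += lr * (ph0[j] - ph1[j])
    return W, b_vis, b_hid


def train_dbn(data, layer_sizes, epochs=20, lr=0.1, seed=0):
    """Train one RBM per entry of layer_sizes, bottom to top."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        layers.append((W, b_vis, b_hid))   # this layer is frozen from here on
        # The frozen layer's hidden probabilities become the next layer's data.
        layer_input = [up(v, W, b_hid) for v in layer_input]
    return layers
```

Calling train_dbn(data, [3, 2]) on 4-dimensional binary vectors yields a 4-3 RBM followed by a 3-2 RBM; only the topmost pair of layers would remain undirected in the resulting DBN.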

Fine-tuning Deep Belief Nets



- Each time we learn a new layer, the inference at the lower layers becomes incorrect, but the variational bound on the log probability of the data improves, as proved by Hinton.
- Since the inference at the lower layers becomes incorrect, Hinton uses a fine-tuning procedure, called the wake-sleep algorithm, to adjust the weights.

Wake-sleep algorithm:

- wake phase: do a bottom-up pass, sampling h with the recognition weights from the input v for each RBM, and then adjust the generative weights by the RBM learning rule.
- sleep phase: do a top-down pass, starting from a random state of h at the top layer and generating v. Then the recognition weights are modified.

[Figure: a stack of three RBMs over the layers data, h1, h2, h3]

Deep Belief Nets



Analogy for the wake-sleep algorithm:

- wake phase: if reality differs from what is imagined, modify the generative weights to make what is imagined as close as possible to reality.
- sleep phase: if the illusions produced by the concepts learned during the wake phase differ from those concepts, modify the recognition weights to make the illusions as close as possible to the concepts.

Questions on DBNs:

- training vector vs. training set (patch training)
- How can unsupervised classification be performed?

Performances from DBNs



[Figure, panels A and B]

- A: 2-D coded representation of the MNIST handwritten-digit database by PCA
- B: 2-D coded representation of MNIST by DBNs

Results produced by Hinton et al.

Performances from DBNs

[Figure, panels A and B]

- A: 2-D coded representation of document retrieval data by LSA
- B: 2-D coded representation of the same data by DBNs

Results produced by Hinton et al.

Convolutional DBNs



Limitations of DBNs:

- unable to process high-dimensional data well (DBNs flatten 2D images into vectors before feeding them to the network, so some spatial information is lost)
- even with vector input, DBNs do not scale properly to realistic image sizes; they are only suitable for small images
- directly extending DBNs to high-dimensional data is computationally inefficient (millions of weights to estimate)

Advantages of CDBNs:

- feature detectors are shared across all locations in an image, so they form convolution kernels and reduce computation
- max-pooling: shrinks the representation, making it translation-invariant and reducing computation
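The two ingredients named above, a shared kernel applied at every location and max-pooling over small blocks, can be sketched in plain Python. This is a deterministic illustration only: CDBNs use a probabilistic variant of max-pooling, and the "valid" cross-correlation used here ignores kernel flipping and biases. The function names are hypothetical.

```python
def convolve_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + a][c + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size):
    """Keep the maximum of each non-overlapping size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + a][c * size + b]
                 for a in range(size) for b in range(size))
             for c in range(cols)]
            for r in range(rows)]


# A 5x5 binary image containing a short diagonal stroke.
image = [[0, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
diagonal_kernel = [[1, 0],
                   [0, 1]]          # 4 shared weights, reused at all 16 positions

feature_map = convolve_valid(image, diagonal_kernel)   # 4x4 detection map
pooled = max_pool(feature_map, 2)                      # 2x2 pooled map
```

The detector needs only the 4 kernel weights regardless of image size, whereas a fully connected layer from 25 pixels to 16 detection units would need 400; pooling then halves each spatial dimension.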

Architecture of CDBNs



- The Energy term and the Probability are defined similarly to the RBM.
- All units are 2D binary images. Within one unit of the detection layer the weights (convolution kernels) are shared, which leads to the convolution operation.

CDBNs



- Training of CDBNs is done by optimizing the network's energy with sparsity regularization (imposed by max-pooling).
- This yields an updating strategy for the weights and biases similar to Contrastive Divergence.
- The sparsity constraints also give rise to simple inference in the network.
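One way to picture the simple inference mentioned above is a softmax over each pooling block with one extra "all off" option, the form commonly given for probabilistic max-pooling. The sketch below assumes that form and is not taken from the slide: block_inputs holds the bottom-up inputs to the detection units of one block, at most one of them may turn on, and the pooling unit is off exactly when none does. The function name is hypothetical.

```python
import math


def probabilistic_max_pool(block_inputs):
    """Return (P(each detection unit on), P(pooling unit off)) for one block."""
    # Shift by the largest exponent (including the implicit 0 of the
    # "all off" state) for numerical stability.
    m = max(0.0, max(block_inputs))
    exps = [math.exp(x - m) for x in block_inputs]
    off = math.exp(-m)                 # weight of the "all units off" state
    denom = off + sum(exps)
    p_on = [e / denom for e in exps]
    p_pool_off = off / denom
    return p_on, p_pool_off
```

Because the block has only len(block_inputs) + 1 mutually exclusive states, the probabilities always sum to one, and sampling a block costs no more than sampling from a small multinomial.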

Performance of CDBNs

[Figure: hierarchical representations of the Caltech-101 object classification database learned by CDBNs. Top: first-layer CDBN output. Bottom: second-layer CDBN output.]

Results produced by Andrew Y. Ng et al.

References



Review:

- Learning deep architectures for AI, Y. Bengio, 2009

Foundations:

- A fast learning algorithm for deep belief nets, Hinton, 2006
- Reducing the dimensionality of data with neural networks, Hinton, 2006
- A practical guide to training restricted Boltzmann machines, Hinton, 2010
- On contrastive divergence learning, Hinton, 2005
- On the convergence property of contrastive divergence, Tieleman, 2010
- Training products of experts by minimizing contrastive divergence, Hinton, 2002
- Learning multiple layers of representation, Hinton, 2007

Applications:

- Sparse deep belief net model for visual area V2, H. Lee, 2008
- Convolutional deep belief network for scalable unsupervised learning of hierarchical representations, H. Lee, 2009
- Unsupervised learning of invariant feature hierarchies with applications to object recognition, Y. LeCun, 2007