AN INTRODUCTION TO DEEP
LEARNING
Presenter: User
Introduction
•
In statistical machine learning, a major
issue is the selection of an appropriate
feature space where input instances have
desired properties for solving a particular
problem.
•
Rather than being specified explicitly by hand, this intermediate feature space can be learned automatically with deep architectures.
Theoretical Limitations of Shallow
Architectures
•
Some highly nonlinear functions can be represented much more compactly, in terms of the number of parameters, with deep architectures than with shallow ones (e.g., SVMs).
•
For example, consider the parity function for n-bit inputs: a feed-forward neural network with O(log n) hidden layers and O(n) neurons can represent it, whereas a feed-forward neural network with only one hidden layer needs an exponential number of the same neurons. (A minimal illustration of the function itself follows below.)
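•
A minimal sketch of the parity function itself, not of the network constructions from the cited complexity result: a log-depth XOR tree evaluates n-bit parity in O(log n) passes with O(n) gates in total. The function name and the reduction layout are illustrative choices.

```python
import numpy as np

def parity_xor_tree(bits):
    """Compute n-bit parity by pairwise XOR reduction.

    Each while-loop pass halves the number of values, mirroring one
    'layer' of a log-depth circuit: O(log n) passes, O(n) XOR gates total.
    """
    values = list(bits)
    while len(values) > 1:
        # XOR adjacent pairs; a leftover odd element is carried forward.
        values = [values[i] ^ values[i + 1]
                  for i in range(0, len(values) - 1, 2)] \
                 + (values[-1:] if len(values) % 2 else [])
    return values[0]

x = np.random.randint(0, 2, size=8)
assert parity_xor_tree(x) == int(x.sum() % 2)
```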
•
Deep architectures address this issue with the use of
distributed representations and as such may constitute a
tractable alternative.
Adding layers
•
Although shallow architectures are not as effective as deep architectures when dealing with highly varying functions, adding layers does not necessarily lead to better solutions. For example, the more layers a neural network has, the smaller the impact of back-propagation on the first layers; gradient descent then tends to get stuck in local minima or plateaus (see the sketch below).
•
This issue has been addressed by introducing unsupervised layer-wise pre-training of deep architectures. More precisely, in a deep learning scheme each layer is treated separately and successively trained in a greedy manner.
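•
A minimal NumPy sketch of the diminishing-gradient claim. The layer count, width, and random initialization are arbitrary choices for illustration, not values from the text; backpropagating through a stack of sigmoid layers and printing per-layer gradient norms typically shows the earliest layers receiving the smallest signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 8, 32
Ws = [rng.normal(0, 0.5, (width, width)) for _ in range(n_layers)]

# Forward pass through a deep stack of sigmoid layers.
a = rng.normal(size=(width, 1))
activations = [a]
for W in Ws:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass for a dummy loss L = sum(output); the error signal is
# multiplied by W^T and the sigmoid derivative at every layer going down.
delta = np.ones_like(a)
for i in reversed(range(n_layers)):
    out = activations[i + 1]
    delta = delta * out * (1 - out)        # through the sigmoid
    grad_W = delta @ activations[i].T      # gradient for this layer's weights
    print(f"layer {i}: ||dL/dW|| = {np.linalg.norm(grad_W):.2e}")
    delta = Ws[i].T @ delta                # pass the signal to the layer below
```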
Restricted Boltzmann Machines (RBMs)
•
Learning can be made much more
efficient in a restricted Boltzmann
machine (RBM), which has no
connections between hidden units.
•
Multiple hidden layers can be learned by treating the hidden activities of one RBM as the data for training a higher-level RBM.
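•
A sketch of this greedy stacking idea. Here `train_rbm` and `hidden_probabilities` are hypothetical helpers (standing in for a CD-based trainer like the one sketched at the end of this section); they are not functions given in the text.

```python
def train_stack(data, layer_sizes):
    """Greedily train a stack of RBMs: each RBM is fit on the hidden
    activities of the RBM below it (hypothetical helper functions)."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = train_rbm(layer_input, n_hidden)                # fit one layer
        layer_input = hidden_probabilities(rbm, layer_input)  # h's become next data
        rbms.append(rbm)
    return rbms
```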
•
RBMs are a special case of energy-based models. Their probability distribution is defined by an energy function E, which is usually defined over couples (v, h) of binary vectors by:
E(v, h) = -\sum_{i,j} w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j c_j h_j
•
w_{ij}: the weights
•
b_i, c_j: the biases
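•
A direct NumPy transcription of this energy function. The shapes are assumptions for illustration: `v` of length n, `h` of length m, `W` an n-by-m matrix, `b` and `c` the visible and hidden biases.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -v^T W h - b^T v - c^T h for binary vectors v, h."""
    return -(v @ W @ h) - (b @ v) - (c @ h)
```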
•
An RBM defines a joint probability distribution on both the visible and hidden units:
p(v, h) = \frac{1}{Z} e^{-E(v, h)}
•
v: visible units
h: hidden units
•
Z = \sum_{v, h} e^{-E(v, h)}: the partition function
•
What we consider most important is the probability distribution on data vectors v, obtained by summing out the hidden units:
p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}
•
We need to calculate Z to determine the distribution above, and this needs exponential time:
O(2^{n+m}) (n is the number of visible units and m is the number of hidden units)
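•
A brute-force sketch that makes the exponential cost concrete: enumerating all 2^(n+m) binary configurations to compute Z and p(v). This is only feasible for a toy RBM and reuses the `energy` function sketched above.

```python
from itertools import product
import numpy as np

def partition_function(W, b, c):
    """Z = sum over all 2^(n+m) binary (v, h) pairs -- exponential time."""
    n, m = W.shape
    return sum(np.exp(-energy(np.array(v), np.array(h), W, b, c))
               for v in product([0, 1], repeat=n)
               for h in product([0, 1], repeat=m))

def p_v(v, W, b, c):
    """p(v) = (1/Z) * sum_h exp(-E(v, h))."""
    m = W.shape[1]
    unnorm = sum(np.exp(-energy(v, np.array(h), W, b, c))
                 for h in product([0, 1], repeat=m))
    return unnorm / partition_function(W, b, c)
```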
•
The defining property of an RBM is that there are no hidden-to-hidden and no visible-to-visible connections, so the hidden units are conditionally independent of one another given the visible units (and vice versa). A unit's activation probability is:
p(h_j = 1 | v) = \sigma(c_j + \sum_i w_{ij} v_i)
p(v_i = 1 | h) = \sigma(b_i + \sum_j w_{ij} h_j)
where \sigma(x) = 1 / (1 + e^{-x}) is the logistic sigmoid.
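•
The two conditionals in NumPy, under the same shape assumptions as above; thanks to the conditional independence, each returns the whole vector of Bernoulli probabilities at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i w_ij v_i), for all j at once."""
    return sigmoid(c + v @ W)

def p_v_given_h(h, W, b):
    """p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j), for all i at once."""
    return sigmoid(b + W @ h)
```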
Contrastive Divergence Algorithm
•
The aim of learning an RBM is to estimate the parameters \theta = \{W, b, c\}. We can obtain them by maximizing the log-likelihood function on a training set of T samples:
L(\theta) = \sum_{t=1}^{T} \log p(v^{(t)})
•
In order to get the best parameters \theta, we can use stochastic gradient ascent to maximize L(\theta) as follows:
\theta \leftarrow \theta + \eta \, \frac{\partial \log p(v)}{\partial \theta}
•
Taking \theta to be a parameter of the energy function E, the partial derivative works out to:
\frac{\partial \log p(v)}{\partial \theta} = -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(h|v)} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(v, h)}
•
\langle \cdot \rangle_p: mathematical expectation with respect to p
•
We use "data" and "model" to denote p(h|v) and p(v, h) respectively. So, for the weights:
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}
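•
For a toy RBM both expectations can be computed exactly by enumeration, which makes the two phases of the gradient explicit. This brute force is an illustrative assumption (it is exactly what CD, below, avoids) and reuses the `energy` and `p_h_given_v` helpers sketched above.

```python
from itertools import product
import numpy as np

def exact_weight_gradient(v_data, W, b, c):
    """d log p(v)/dW = <v h^T>_data - <v h^T>_model for one sample v_data."""
    n, m = W.shape
    # Positive ("data") phase: given v, the h_j are independent,
    # so <v_i h_j> = v_i * p(h_j = 1 | v).
    positive = np.outer(v_data, p_h_given_v(v_data, W, c))
    # Negative ("model") phase: exact expectation over all (v, h) pairs.
    Z, negative = 0.0, np.zeros_like(W, dtype=float)
    for v_cfg in product([0, 1], repeat=n):
        for h_cfg in product([0, 1], repeat=m):
            v, h = np.array(v_cfg), np.array(h_cfg)
            w = np.exp(-energy(v, h, W, b, c))
            Z += w
            negative += w * np.outer(v, h)
    return positive - negative / Z
```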
Gibbs Sampling
•
The second term is an expectation of v_i h_j under the model distribution p(v, h), which is intractable. It can be approximated with a Markov chain Monte Carlo algorithm such as Gibbs sampling, alternately drawing:
h^{(k)} \sim p(h \mid v^{(k)}), \quad v^{(k+1)} \sim p(v \mid h^{(k)})
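•
One alternating Gibbs step as a sketch, using the conditional-probability helpers above; sampling each Bernoulli unit by comparing a uniform draw against its probability is a standard convention assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_step(v, W, b, c):
    """One alternation: sample h ~ p(h|v), then v' ~ p(v|h)."""
    h = (rng.random(W.shape[1]) < p_h_given_v(v, W, c)).astype(float)
    v_new = (rng.random(W.shape[0]) < p_v_given_h(h, W, b)).astype(float)
    return v_new, h
```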
Contrastive Divergence (CD)
•
Hinton put forward a fast learning algorithm: Contrastive Divergence (CD).
•
In CD, we need only k steps of Gibbs sampling (in general, k = 1), starting from a training example, to obtain an approximation of the model distribution.
Algorithm 1: CD-k update step (sketched below)
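•
A minimal CD-k parameter update consistent with the description above, reusing the conditional helpers sketched earlier. The learning rate, the use of probabilities rather than binary samples in the final statistics, and the update of all three parameter groups are common conventions, assumed here rather than taken from the original figure.

```python
import numpy as np

def cd_k_update(v0, W, b, c, k=1, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-k step: start the chain at a data vector v0, run k Gibbs
    alternations, and move parameters toward <vh>_data - <vh>_recon."""
    # Positive phase statistics from the data.
    ph0 = p_h_given_v(v0, W, c)
    # Run k steps of alternating Gibbs sampling from v0.
    v = v0
    for _ in range(k):
        h = (rng.random(W.shape[1]) < p_h_given_v(v, W, c)).astype(float)
        v = (rng.random(W.shape[0]) < p_v_given_h(h, W, b)).astype(float)
    phk = p_h_given_v(v, W, c)
    # Gradient-ascent updates (approximate log-likelihood gradient).
    W += lr * (np.outer(v0, ph0) - np.outer(v, phk))
    b += lr * (v0 - v)
    c += lr * (ph0 - phk)
    return W, b, c
```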