AN INTRODUCTION TO DEEP LEARNING


Presenter: User

Introduction


In statistical machine learning, a major
issue is the selection of an appropriate
feature space where input instances have
desired properties for solving a particular
problem.


Rather than being specified explicitly by hand, this intermediate feature space can be learned automatically with deep architectures.

Theoretical Limitations of Shallow
Architectures


Some highly nonlinear functions can be represented much more compactly, in terms of the number of parameters, with deep architectures than with shallow ones (e.g., SVMs).


For example: the parity function for n-bit inputs:


a feed-forward neural network with O(log n) hidden layers and O(n) neurons, vs.


a feed-forward neural network with only one hidden layer, which needs an exponential number of the same neurons.



Deep architectures address this issue with the use of
distributed representations and as such may constitute a
tractable alternative.
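
To make this compactness claim concrete, the following sketch (assuming NumPy; the XOR-tree construction and all names are illustrative, not from the slides) computes n-bit parity with a tree of depth O(log n) built from tiny threshold-unit XOR gates, using O(n) neurons in total:

import numpy as np

def step(x):
    """Heaviside threshold unit."""
    return (x > 0).astype(int)

def xor_unit(a, b):
    """XOR of two bit vectors using 2 hidden threshold neurons + 1 output neuron."""
    h_or = step(a + b - 0.5)    # fires if at least one input is 1
    h_and = step(a + b - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)

def parity_deep(bits):
    """n-bit parity via a balanced XOR tree: depth O(log n), O(n) neurons in total."""
    layer = np.array(bits)
    while len(layer) > 1:
        if len(layer) % 2:                 # pad odd-length layers with a constant 0
            layer = np.append(layer, 0)
        layer = xor_unit(layer[0::2], layer[1::2])
    return int(layer[0])

# Per the claim above, a comparable network with a single hidden layer of the
# same neurons would need exponentially many hidden units.
print(parity_deep([1, 0, 1, 1]))     # -> 1 (three ones: odd parity)
print(parity_deep([1, 0, 1, 1, 1]))  # -> 0 (four ones: even parity)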

Adding layers


Although shallow architectures are not as effective as deep architectures when dealing with highly varying functions, adding layers does not necessarily lead to better solutions.


For example, the greater the number of layers in a neural network, the smaller the impact of back-propagation on the first layers: gradient descent then tends to get stuck in local minima or plateaus.


This issue has been addressed by introducing an unsupervised layer-wise pre-training of deep architectures. More precisely, in a deep learning scheme each layer is treated separately and successively trained in a greedy manner (a sketch of this procedure is given below).
Restricted Boltzmann Machines (RBMs)


Learning can be made much more
efficient in a restricted Boltzmann
machine (RBM), which has no
connections between hidden units.


Multiple hidden layers can be learned by
treating the hidden activities of one RBM
as the data for training a higher-level RBM.



RBMs are a special case of energy-based
models. Their probability distribution is defined
by an energy function E, which is usually
defined over couples (v, h) of binary vectors
by:


E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j


w_{ij}: the weights


a_i, b_j: the biases
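
A direct transcription of this energy function (a sketch assuming NumPy and binary vectors; the variable names are mine, not from the slides):

import numpy as np

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h for binary vectors v (visible) and h (hidden)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Tiny example: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # weights w_ij
a = np.zeros(3)                          # visible biases a_i
b = np.zeros(2)                          # hidden biases b_j
print(energy(np.array([1., 0., 1.]), np.array([1., 1.]), W, a, b))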



An RBM defines a joint probability distribution
on both the visible and hidden units:


p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-E(v, h)}


v: visible units


h: hidden units


Z: the partition function




What we consider the most important is the
probability distribution on data vectors v,
obtained by summing over the hidden units:


p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}


We need to calculate Z to determine the
distribution above, and this needs exponential time:


O(2^{n+m})

(n is the number of visible units and m is
the number of hidden units)
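
As a brute-force check of these formulas (a sketch assuming NumPy; names are illustrative), Z and p(v) can be computed exactly for a toy RBM by enumerating every binary configuration, which is precisely the 2^(n+m) cost that makes this infeasible at realistic sizes:

import itertools
import numpy as np

def energy(v, h, W, a, b):
    return -(a @ v) - (b @ h) - (v @ W @ h)

def partition_function(W, a, b):
    """Z = sum over all 2^(n+m) binary configurations (v, h) of exp(-E(v, h))."""
    n, m = W.shape
    return sum(np.exp(-energy(np.array(v, float), np.array(h, float), W, a, b))
               for v in itertools.product([0, 1], repeat=n)
               for h in itertools.product([0, 1], repeat=m))

def p_v(v, W, a, b, Z):
    """Marginal p(v) = (1/Z) * sum over h of exp(-E(v, h))."""
    m = W.shape[1]
    return sum(np.exp(-energy(v, np.array(h, float), W, a, b))
               for h in itertools.product([0, 1], repeat=m)) / Z

rng = np.random.default_rng(0)
W, a, b = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3), np.zeros(2)
Z = partition_function(W, a, b)
print(p_v(np.array([1., 0., 1.]), W, a, b, Z))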


The characteristic property of an RBM is that there are no
hidden-to-hidden and no visible-to-visible
connections, so the hidden units are conditionally
independent given the visible units (and vice versa).
Each unit's activation probability is:


p(h_j = 1 \mid v) = \sigma\left(b_j + \sum_i v_i w_{ij}\right)


p(v_i = 1 \mid h) = \sigma\left(a_i + \sum_j h_j w_{ij}\right)


where \sigma(x) = 1 / (1 + e^{-x}) is the logistic sigmoid.
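
These conditionals translate directly into vectorized code (a sketch assuming NumPy, with W of shape (n_visible, n_hidden); names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij), for all j at once."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij), for all i at once."""
    return sigmoid(a + h @ W.T)

def sample(p, rng):
    """Independent Bernoulli draws given the activation probabilities."""
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
W, a, b = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3), np.zeros(2)
v = np.array([1., 0., 1.])
h = sample(p_h_given_v(v, W, b), rng)   # conditionally independent draws
print(h, p_v_given_h(h, W, a))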







Contrastive Divergence Algorithm


The aim of learning an RBM is to estimate the
parameters \theta = \{W, a, b\}. We can obtain them through
maximizing the log-likelihood function on the training
set (containing T samples):


L(\theta) = \sum_{t=1}^{T} \log p(v^{(t)})


In order to get the best parameters \theta^*, we can
use stochastic gradient ascent to get the maximum
of L(\theta):


\theta^* = \arg\max_{\theta} L(\theta)





We assume \theta is a parameter of the energy function E(v, h), so the
partial derivative of the log-likelihood is:


\frac{\partial \log p(v)}{\partial \theta} = -\left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(h \mid v)} + \left\langle \frac{\partial E(v, h)}{\partial \theta} \right\rangle_{p(v, h)}


\langle \cdot \rangle_p: mathematical expectation with respect to the distribution p


We use "data" and "model" to denote
p(h \mid v) and p(v, h) respectively. So, for the weights:


\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}



Gibbs Sampling


The second term is an expectation with respect to the model
distribution p(v, h), which includes the intractable
partition function Z. It can be approximated
with a Markov chain Monte Carlo algorithm
such as Gibbs sampling: starting from a visible vector,
alternately sample h \sim p(h \mid v) and v \sim p(v \mid h)
until the chain approaches its stationary distribution.
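
A sketch of this alternating chain for a toy RBM (assuming NumPy; names are illustrative); after many alternations the samples approximate the model distribution p(v, h):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, a, b, n_steps, rng):
    """Alternately sample h ~ p(h | v) and v ~ p(v | h), starting from v0."""
    v = v0
    for _ in range(n_steps):
        h = (rng.random(b.shape) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(a.shape) < sigmoid(a + h @ W.T)).astype(float)
    return v, h

rng = np.random.default_rng(0)
W, a, b = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3), np.zeros(2)
v_model, h_model = gibbs_chain(np.array([1., 0., 1.]), W, a, b, n_steps=1000, rng=rng)
print(v_model, h_model)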


Contrastive Divergence (CD)


Hinton put forward a fast learning algorithm:
Contrastive Divergence.


In CD, only k steps of Gibbs sampling (in general k = 1)
are needed to get an approximation of the model
distribution.


Algorithm 1
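
A minimal sketch of one CD-1 parameter update for a small NumPy RBM (names are illustrative, not the slides' pseudo-code verbatim): the positive phase uses the data, and a single Gibbs step provides the "reconstruction" that stands in for the intractable model expectation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, lr, rng):
    """One CD-1 step: positive phase from the data, negative phase from a single
    Gibbs step, then gradient ascent on the resulting log-likelihood estimate."""
    # Positive phase: hidden probabilities and a sample given the data.
    ph_data = sigmoid(b + v_data @ W)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: one step of Gibbs sampling (k = 1).
    pv_recon = sigmoid(a + h_data @ W.T)
    v_recon = (rng.random(pv_recon.shape) < pv_recon).astype(float)
    ph_recon = sigmoid(b + v_recon @ W)
    # <v_i h_j>_data - <v_i h_j>_model, with the model term approximated by the reconstruction.
    W += lr * (np.outer(v_data, ph_data) - np.outer(v_recon, ph_recon))
    a += lr * (v_data - v_recon)
    b += lr * (ph_data - ph_recon)
    return W, a, b

rng = np.random.default_rng(0)
W, a, b = rng.normal(scale=0.1, size=(6, 3)), np.zeros(6), np.zeros(3)
for v in (rng.random((200, 6)) > 0.5).astype(float):   # toy binary "training set"
    W, a, b = cd1_update(v, W, a, b, lr=0.05, rng=rng)

Using the hidden probabilities rather than sampled states in the update statistics, as done here, is a common variance-reduction choice; only the reconstruction step uses sampled hidden states.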