Learning optimal control and self-organization



Thomas Trappenberg

Learning: A modern review of anticipatory systems in brains and machines


Outline

Machine Learning / Computational Neuroscience

1. Supervised Learning / Synaptic Plasticity
2. Sparse Unsupervised Learning / Cortical Object Recognition
3. Reinforcement Learning / Basal Ganglia

Universal Learning machines

1961: Outline of a Theory of Thought-Processes and Thinking Machines




Neuronic & Mnemonic Equation



Reverberation



Oscillations



Reward learning




Eduardo Renato Caianiello (1921-1993)

But: NOT STOCHASTIC

(only small noise in weights)

Stochastic networks:


The Boltzmann machine


Hinton & Sejnowski 1983

Multilayer Perceptron (MLP)

Universal approximator (learner)

but:

Overfitting
Meaningful input
Unstructured learning
Only deterministic (just use chain rule)
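A minimal sketch of that deterministic, chain-rule training: a small two-layer MLP fitted to XOR-style toy data by backpropagation. The data, layer sizes, and learning rate are illustrative assumptions, not from the talk.

```python
# Minimal two-layer MLP trained by gradient descent (backpropagation).
# The XOR-like toy data, layer sizes and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)    # XOR targets

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)    # input -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)    # hidden -> output
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

lr = 1.0
for epoch in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: chain rule on the LMS loss 0.5 * (out - y)^2
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient descent updates
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(0)

print(np.round(out.ravel(), 2))   # should approach [0, 1, 1, 0]
```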

Linear large margin classifiers




Support Vector Machines (SVM)

MLP: Minimize training error (empirical risk) (here: threshold Perceptron)

SVM: Minimize generalization error


Linear in parameter learning

Linear in parameters

Thanks to Doug Tweet (UoT) for pointing out LIP

Linear hypothesis

Non-linear hypothesis

SVM in dual form


Kernel function
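A minimal sketch of a kernelized large-margin classifier. Using scikit-learn, an RBF kernel, and synthetic two-moons data is an assumption for illustration; the slides only name SVMs and kernel functions in general.

```python
# Kernelized large-margin classifier (SVM).  scikit-learn and the two-moons
# toy data are assumptions for illustration.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# SVC solves the dual problem; the kernel replaces inner products x_i . x_j
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("support vectors per class:", clf.n_support_,
      "train accuracy:", clf.score(X, y))
```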

Liquid/echo state machines

Extreme learning machines

Fundamental stochasticity

Irreducible indeterminacy

Epistemological limitations

Sources of fluctuations




Probabilistic framework

Goal of learning:

Make predictions!

Learning vs. memory

Goal of learning:

Plant equation for a robot:

Distance traveled when both motors are running at power 50

Hypothesis:

Learning: Choose parameters that make the training data most likely

The hard problem: How to come up with a useful hypothesis

Assume independence of the training examples and consider this as a function of the parameters (log likelihood)

Maximum Likelihood Estimation
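A minimal sketch of maximum likelihood estimation in the spirit of the robot plant example. The Gaussian noise model and the synthetic distance data are my assumptions; the parameters are chosen to maximize the summed log likelihood of independent training examples.

```python
# Maximum likelihood estimation for a simple plant model:
# distance ~ Normal(mu, sigma) when both motors run at power 50.
# The Gaussian hypothesis and the synthetic data are assumptions for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
distances = rng.normal(loc=12.0, scale=1.5, size=30)   # synthetic training data

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                # keep sigma positive
    # independence of training examples -> sum of per-example log likelihoods
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (distances - mu)**2 / (2 * sigma**2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```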

How about building more elaborate multivariate models?

Causal (graphical) models (Judea Pearl)

Parameters of the CPTs are usually learned from data, and allow arguing with only 10 parameters (rather than the 31 of the full joint distribution)

Hidden Markov Model (HMM) for localization



Integrating sensor information becomes trivial



Breakdown of point estimates in global localization (particle filters)
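A minimal sketch of HMM-style filtering for localization: a discrete Bayes filter on a small ring world, where integrating sensor information is a single multiply-and-normalize step. The map, motion model, and sensor model are illustrative assumptions.

```python
# Discrete Bayes filter (HMM forward algorithm) for 1-D localization.
# The ring world, motion model and sensor model are illustrative assumptions.
import numpy as np

n = 10                          # number of cells on a ring
belief = np.ones(n) / n         # uniform prior (global localization)
doors = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])   # map: where doors are

def predict(belief):
    # robot commands one cell to the right, with some slip
    return 0.8 * np.roll(belief, 1) + 0.1 * belief + 0.1 * np.roll(belief, 2)

def update(belief, z):
    # sensor reports "door" (z=1) or "no door" (z=0) with 0.9 accuracy
    likelihood = np.where(doors == z, 0.9, 0.1)
    belief = likelihood * belief
    return belief / belief.sum()        # integrate sensor information

for z in [1, 0, 0, 1]:                  # a short measurement sequence
    belief = update(predict(belief), z)
print(np.round(belief, 2))
```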

Synaptic Plasticity

Gradient descent rule for LMS loss function:

… with linear hypothesis:

Perceptron learning rule

Hebb rule
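A minimal sketch of the gradient-descent (delta) rule for the LMS loss with a linear hypothesis; dropping the error term leaves a plain Hebbian update. The synthetic data and learning rate are assumptions for illustration.

```python
# Gradient descent on the LMS loss with a linear hypothesis y = w . x
# gives the delta rule  w <- w + eta * (target - y) * x ;
# dropping the error term leaves the plain Hebbian update  w <- w + eta * y * x.
# Data and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true                       # targets from an unknown "teacher"

w, eta = np.zeros(3), 0.01
for epoch in range(50):
    for x, target in zip(X, t):
        y = w @ x                    # linear hypothesis
        w += eta * (target - y) * x  # delta / LMS update
print(w)                             # approaches w_true
```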

Donald O. Hebb (1904-1985)

The Organization of Behavior (1949)

See also Sigmund Freud, Law of association by simultaneity, 1888


Classical LTP/LTD

R. Enoki, Y. Hu, D. Hamilton, and A. Fine, Neuron 62 (2009)



D. Standage, S. Jalil and T. Trappenberg, Biological Cybernetics 96 (2007)

Data from G.Q. Bi and M.M. Poo, J Neurosci 18 (1998)
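A minimal sketch of the standard exponential STDP window often fitted to this kind of data. The amplitudes and time constants below are illustrative assumptions, not fits to Bi and Poo (1998).

```python
# Spike-timing-dependent plasticity (STDP) window:
# potentiation when the presynaptic spike precedes the postsynaptic spike
# (dt = t_post - t_pre > 0), depression otherwise.
# Amplitudes and time constants are illustrative, not fits to Bi & Poo (1998).
import numpy as np

A_plus, A_minus = 0.01, 0.012        # LTP / LTD amplitudes
tau_plus, tau_minus = 20.0, 20.0     # time constants in ms

def stdp(dt):
    """Weight change as a function of spike-time difference dt (ms)."""
    return np.where(dt > 0,
                    A_plus * np.exp(-dt / tau_plus),
                    -A_minus * np.exp(dt / tau_minus))

print(stdp(np.array([-40.0, -10.0, 10.0, 40.0])))
```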

Population argument of 'weight dependence'


Is Bi and Poo's weight-dependent STDP data an experimental artifact?

- Three sets of assumptions (B, C, D)
- Their data may reflect population effects


… with Dominic Standage (Queen's University)

2. Sparse Unsupervised Learning

Horace Barlow


Possible mechanisms underlying the transformations of sensory messages (1961)

"… reduction of redundancy is an important principle guiding the organization of sensory messages …"



Sparseness & Overcompleteness

The Ratio Club

Minimizing reconstruction error and sparsity

PCA
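A minimal sketch of sparse coding with an overcomplete dictionary: the code is found by minimizing reconstruction error plus an L1 sparsity penalty using iterative soft thresholding (ISTA). The random dictionary and input are illustrative assumptions.

```python
# Sparse coding: minimize ||x - D s||^2 + lam * ||s||_1 over the code s
# for a fixed, overcomplete dictionary D, via iterative soft thresholding (ISTA).
# The random dictionary and input are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_atoms = 16, 64                     # overcomplete: more atoms than inputs
D = rng.normal(size=(n_inputs, n_atoms))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
x = rng.normal(size=n_inputs)                  # an input "sensory message"

lam = 0.2
step = 1.0 / np.linalg.norm(D, 2) ** 2         # safe step size (1 / Lipschitz const.)
s = np.zeros(n_atoms)
for _ in range(200):
    grad = D.T @ (D @ s - x)                   # gradient of the reconstruction error
    s = s - step * grad
    s = np.sign(s) * np.maximum(np.abs(s) - step * lam, 0.0)   # soft threshold

print("nonzero coefficients:", np.count_nonzero(s), "of", n_atoms)
```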


Geoffrey E. Hinton

Deep belief networks: the stacked Restricted Boltzmann Machine
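A minimal sketch of one building block, a plain binary RBM trained with a single step of contrastive divergence (CD-1). The toy data and sizes are assumptions; the sparse and convolutional constraints used in the talk are not included.

```python
# Restricted Boltzmann Machine trained with one step of contrastive divergence (CD-1).
# Plain binary units on toy data; the sparse/convolutional variants add further
# constraints not shown here.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

# toy binary data: two repeating patterns
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(200):
    v0 = data
    p_h0 = sigmoid(v0 @ W + b_h)                      # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)                    # one Gibbs step down ...
    p_h1 = sigmoid(p_v1 @ W + b_h)                    # ... and back up (negative phase)
    n = len(data)
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

print(np.round(sigmoid(data[:2] @ W + b_h), 2))       # hidden features per pattern
```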

Sparse convolutional RBM

… with Paul Hollensen & Warren Connors

Truncated Cone

Side Scan Sonar

Synthetic Aperture Sonar

scRBM/SVM mine sensitivity: .983 ± .024, specificity: .954 ± .012

SIFT/SVM mine sensitivity: .970 ± .025, specificity: .944 ± .008

scRBM reconstruction

Sonar images

Sparse and topographic RBM (rtRBM)

… with Paul Hollensen

… with Pitoyo Hartono

Map Initialized Perceptron (MIP)

Free-Energy-Based Supervised Learning:

TD learning generalized to Boltzmann machines (Sallans & Hinton 2004)

Paul Hollensen:

Sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)

RBM features

3. Reinforcement Learning

[Grid world example with a reward of -0.1 in each non-terminal state (from Russell and Norvig)]

Markov Decision Process (MDP)

If we know all these factors, the problem is said to be fully observable, and we can just sit down and contemplate the problem before moving

Goal: maximize total expected payoff

Two important quantities

policy:

value function:

Optimal Control

Calculate value function (dynamic programming)

Richard Bellman (1920-1984)

Bellman equation for policy π

Deterministic policies to simplify notation

Solution: Analytic or Incremental

Value Iteration:

Bellman equation for optimal policy

Policy Iteration:
Choose one policy
calculate the corresponding value function
choose a better policy based on this value function

For each state evaluate all possible actions
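A minimal sketch of value iteration, repeatedly applying the Bellman optimality backup on a tiny deterministic chain MDP; the environment, rewards, and discount factor are illustrative assumptions.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a [ R(s, a) + gamma * V(next_state(s, a)) ]
# on a tiny deterministic chain MDP (an illustrative assumption, not from the talk).
import numpy as np

n_states, gamma = 5, 0.9
actions = (-1, +1)                               # move left or right along the chain

def step(s, a):
    s_next = min(max(s + a, 0), n_states - 1)
    reward = 1.0 if s_next == n_states - 1 else -0.1
    return s_next, reward

V = np.zeros(n_states)
for sweep in range(1000):
    V_new = np.array([max(step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions)
                      for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-6:         # converged
        break
    V = V_new

# greedy policy: for each state evaluate all possible actions
policy = [max(actions, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(n_states)]
print(np.round(V, 2), policy)                    # policy moves right towards the goal
```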

But:


Environment not known a priori

Observability of states

Curse of Dimensionality

Solution:



Online (TD)



POMDP



Model-based RL

Online value function estimation (TD learning)

If the environment is not known, use a Monte Carlo method with bootstrapping

Expected payoff before taking the step

Expected reward after taking the step = actual reward plus discounted expected payoff of the next step

= Temporal Difference (TD)
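A minimal sketch of tabular TD(0): the value estimate is nudged towards the actual reward plus the discounted estimate at the next state. The random-walk environment and parameters are illustrative assumptions.

```python
# TD(0): bootstrapped online estimate of the value function under a fixed policy.
# The TD error compares the payoff expected before the step with the reward actually
# received plus the discounted estimate at the next state.
# The random-walk environment is an illustrative assumption.
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

for episode in range(500):
    s = 2                                        # start in the middle
    while True:
        a = rng.choice([-1, +1])                 # fixed (random) policy
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        td_error = r + gamma * V[s_next] - V[s]  # temporal difference
        V[s] += alpha * td_error
        s = s_next
        if s in (0, n_states - 1):               # terminal states end the episode
            break

print(np.round(V, 2))
```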

What if the environment is not completely known?

This leads to the exploration-exploitation dilemma

Online optimal control: Exploitation versus Exploration

On-policy TD learning: Sarsa

Off-policy TD learning: Q-learning
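A minimal sketch of off-policy Q-learning with an epsilon-greedy behaviour policy, making the exploration-exploitation trade-off explicit. The chain environment and parameters are illustrative assumptions.

```python
# Off-policy TD control (Q-learning) with an epsilon-greedy behaviour policy.
# The chain environment and parameters are illustrative assumptions.
import numpy as np

n_states, n_actions = 5, 2                      # actions: 0 = left, 1 = right
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else -0.1
    return s_next, r

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # explore with probability epsilon, otherwise exploit the current estimate
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # off-policy target: greedy value of the next state
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))                            # greedy policy moves right
```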

Model-based RL: TD(λ)

Instead of tabular methods as mainly discussed before, use a function approximator with parameters θ and gradient descent with an exponential eligibility trace e, which weights updates with λ, for each step (Sutton 1988):
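A minimal sketch of TD(λ) with a linear function approximator and an accumulating eligibility trace; the one-hot features and the random-walk environment are illustrative assumptions.

```python
# TD(lambda) with a linear function approximator V(s) = theta . phi(s)
# and an accumulating eligibility trace e, which weights past gradients by lambda.
# One-hot features and the random-walk environment are illustrative assumptions.
import numpy as np

n_states, gamma, alpha, lam = 5, 0.9, 0.1, 0.8
theta = np.zeros(n_states)
phi = lambda s: np.eye(n_states)[s]              # feature vector of state s
rng = np.random.default_rng(0)

for episode in range(500):
    s, e = 2, np.zeros(n_states)                 # reset the eligibility trace
    while s not in (0, n_states - 1):
        s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)   # TD error
        e = gamma * lam * e + phi(s)             # decay and accumulate the trace
        theta += alpha * delta * e               # gradient-descent style update
        s = s_next

print(np.round(theta, 2))
```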

Free Energy-based reinforcement learning (Sallans & Hinton 2004)

… Paul Hollensen

Basal Ganglia

… work with Patrick Connor

Our questions



How do humans learn values that guide behaviour? (human behaviour)

How is this implemented in the brain? (anatomy and physiology)

How can we apply this knowledge? (medical interventions and robotics)

Ivan Pavlov (1849-1936)

Nobel Prize 1904

Classical Conditioning

Rescorla-Wagner Model (1972)

[Conditioning diagram: Stimulus B, Stimulus A, Reward; Stimulus A, No reward]
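A minimal sketch of the Rescorla-Wagner update, in which every stimulus present on a trial is adjusted by the same prediction error. The learning rates and trial sequence below are illustrative assumptions.

```python
# Rescorla-Wagner learning rule: each stimulus present on a trial changes its
# association strength in proportion to the prediction error,
#   dV_i = alpha_i * beta * (lambda_trial - sum_j V_j),
# where lambda_trial is the reward obtained.  Parameters are illustrative assumptions.
alpha, beta = 0.3, 1.0
V = {"A": 0.0, "B": 0.0}

def trial(stimuli, reward):
    prediction = sum(V[s] for s in stimuli)
    error = reward - prediction                  # prediction error (cf. dopamine signal)
    for s in stimuli:
        V[s] += alpha * beta * error

for _ in range(50):
    trial(["A", "B"], 1.0)   # compound A+B followed by reward
for _ in range(50):
    trial(["A"], 0.0)        # A alone, no reward (extinction of A)

print({k: round(v, 2) for k, v in V.items()})
```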

Wolfram Schultz

Reward Signals in the Brain

Maia & Frank 2011

Disorders with effects on the dopamine system:


Parkinson’s disease

Tourette's syndrome

ADHD

Drug addiction

Schizophrenia

Adding Biological Qualities to the Model

Input

Rescorla-Wagner Model

Rescorla and Wagner, 1972

Dopamine and Reward Prediction Error

Schultz, 1998

Striatum

Adding Biological Qualities to the Model

GPe

Striatum

Dual pathway model

e.g. M. Frank

Striatal interactions and plasticity

e.g. J. Wickens

SLIM: Striatal with Lateral Inhibition Model

Model Equations

Parameters to Vary

>Input Salience

>Tuning Curve Width

>Mean Input Weight

>Std Input Weight

>% Input Connection

>Input Learning Rate

>Mean Lateral Weight

>Std Lateral Weight

>% Lateral Connection

>Lateral Learning Rate

>Number of Neurons

>Activation Threshold

>Activation Slope

>Activation Exponent

>Number of Inputs

>Input Noise

>Reward Salience

Acquisition:

(Pavlov, 1927)


CS+ (Classical Conditioning)

V(S) = R (Reinforcement Learning)

Conditioning Paradigms 1

Extinction:


CS- (Classical Conditioning)

V(S) = 0 (Reinforcement Learning)

Conditioning Paradigms 2

Partial (probabilistic) Conditioning:


CS+/-


V(S) = P


Conditioning Paradigms 3

Negative Patterning:

(Woodbury, 1943)


CS_A+, CS_B+, CS_AB-

V(S_A) = R
V(S_B) = R
V(S_AB) = 0


Conditioning Paradigms 4
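A minimal sketch of why a purely elemental Rescorla-Wagner model cannot solve negative patterning: the linear sum V(A) + V(B) cannot be both R and 0, so training converges to a compromise. Parameters and trial schedule are illustrative assumptions.

```python
# Negative patterning (A+, B+, AB-) with a purely elemental Rescorla-Wagner model:
# the linear sum V(A) + V(B) cannot be both R and 0, so training settles on a
# compromise that fails the paradigm.  Parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
alpha, R = 0.2, 1.0
V = {"A": 0.0, "B": 0.0}

def trial(stimuli, reward):
    error = reward - sum(V[s] for s in stimuli)
    for s in stimuli:
        V[s] += alpha * error

for _ in range(2000):
    paradigm = rng.integers(3)
    if paradigm == 0:   trial(["A"], R)          # CS_A+
    elif paradigm == 1: trial(["B"], R)          # CS_B+
    else:               trial(["A", "B"], 0.0)   # CS_AB-

print({k: round(v, 2) for k, v in V.items()},
      "V(AB) =", round(V["A"] + V["B"], 2))      # neither R for A, B alone nor 0 for AB
```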

… with Patrick and Laurent Mattina

ENSTA-Bretagne, Brest, France

Simulating Hyperactivity in ADHD using Reinforcement Learning

Higher initial value → loses more interest → more switching

Works only when taking a switching cost into account

Conclusion and Outlook

Three basic categories of learning:


Supervised: Lots of progress through statistical learning theory; kernel machines, graphical models, etc.

Unsupervised: Hot research area with some progress; deep temporal learning

Reinforcement: Important topic in animal behavior; model-based RL


The Anticipating Brain

learn to model the world


… argue about it


… and act accordingly