Organizing Principles for Learning in the Brain

Jochen Triesch, UC San Diego, http://cogsci.ucsd.edu/~triesch


Organizing Principles for Learning
in the Brain

Associative Learning:

Hebb rule and variations, self-organizing maps


Adaptive Hedonism:

Brain seeks pleasure and avoids pain: conditioning and

reinforcement learning


Imitation:

Brain specially set up to learn from other brains?

Imitation learning approaches


Supervised Learning:

Of course the brain has no explicit teacher, but the timing of development may lead to some circuits being trained by others



Classical Conditioning and
Reinforcement Learning

Outline:


1. classical conditioning and its variations
2. Rescorla-Wagner rule
3. instrumental conditioning
4. Markov decision processes
5. reinforcement learning

Note: this presentation follows a chapter of “Theoretical Neuroscience” by Dayan&Abbott




Example of embodied models of reward-based learning:

Skinnerbots in Touretzky’s lab at CMU:
http://www-2.cs.cmu.edu/~dst/Skinnerbots/index.html




Project Goals

We are developing computational theories of operant conditioning.
While classical (Pavlovian) conditioning has a well-developed theory, implemented in the Rescorla-Wagner model and its descendants (work by Sutton & Barto, Grossberg, Klopf, Gallistel, and others), there is at present no comprehensive theory of operant conditioning. Our work has four components:

1. Develop computationally explicit models of operant conditioning that
reproduce classical animal learning experiments with rats, dogs, pigeons,
etc.

2. Demonstrate the workability of these models by implementing them on mobile robots, which then become trainable robots (Skinnerbots). We originally used Amelia, a B21 robot manufactured by Real World Interface, as our implementation platform. We are moving to the Sony AIBO.

3. Map our computational theories onto neuroanatomical structures known
to be involved in animal learning, such as the hippocampus, amygdala, and
striatum.

4. Explore issues in human-robot interaction that arise when non-scientists try to train robots as if they were animals.


also at: http://www-2.cs.cmu.edu/~dst/Skinnerbots/index.html



Classical Conditioning

Pavlov’s classic finding:

(classical conditioning)


Initially, sight of food leads to the dog salivating:

food → salivating
unconditioned stimulus (US) → unconditioned response (UR)
(reward)

Sound of a bell consistently precedes the food. Afterwards, the bell alone leads to salivating:

bell → salivating
conditioned stimulus (CS) → conditioned response (CR)
(expectation of reward)
(expectation of reward)



Variations of Conditioning 1

Extinction:

Stimulus (bell) repeatedly shown without reward (food):

conditioned response (salivating) reduced.


Partial reinforcement:

Stimulus only sometimes preceding reward:

conditioned response weaker than in classical case.


Blocking (2 stimuli):

First: stimulus S1 associated with reward: classical conditioning.

Then: stimulus S1 and S2 shown together followed by reward:

Association between S2 and reward is not learned.



Variations of Conditioning 2

Inhibitory Conditioning (2 stimuli):

Alternate 2 types of trials:

1. S1 followed by reward.

2. S1+S2 followed by absence of reward.

Result: S2 becomes predictor of absence of reward.


To show this, use for example one of the following two methods:

A. train animal to predict reward based on S2.

Result: learning slowed


B. train animal to predict reward based on S3, then show S2+S3.

Result: conditioned response weaker than for S3 alone.



Variations of Conditioning 3

Overshadowing (2 stimuli):

Repeatedly present S1+S2 followed by reward.

Result: often, reward prediction shared unequally between stimuli.


Example (made up):

red light + high pitch beep precede pigeon food.


Result: red light more effective in predicting the food than


high pitch beep.


Secondary Conditioning:

S1 preceding reward (classical case). Then, S2 preceding S1.

Result: S2 leads to prediction of reward.

But: if S1 following S2 is shown too often, extinction will occur.



Summary of Conditioning Findings

(incomplete; conditioning has been studied extensively for decades, and there are many books on the topic)

figure taken from Dayan&Abbott



Modeling Conditioning

The Rescorla-Wagner rule (1972):


Consider a stimulus variable u representing presence (u = 1) or absence (u = 0) of the stimulus. Correspondingly, a reward variable r represents presence or absence of reward.

The expected reward v is modeled as “stimulus × weight”:

v = w u

Learning is done by adjusting the weight to minimize the error between predicted reward and actual reward.



Rescorla Wagner Rule

Denote the prediction error by δ (delta):

δ = r - v

Learning rule:

w := w + ε δ u,

where ε is a learning rate.


Q: Why is this useful?

A: This rule does stochastic gradient descent to minimize the expected squared error (r - v)²; w converges to <r>. The R.-W. rule is a variant of the “delta rule” in neural networks.

Note: in psychological terms the learning rate is a measure of the associability of the stimulus with the reward.


∂/∂w (r - v)² = ∂/∂w (r - wu)² = -2 (r - wu) u = -2 δ u
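To make the update concrete, here is a minimal numerical sketch of the Rescorla-Wagner rule in Python (my own illustration, not from the slides; the trial counts and learning rate are arbitrary). It shows w converging toward <r> during acquisition and partial reinforcement, and decaying back toward 0 during extinction.

```python
import random

def rescorla_wagner(trials, w=0.0, eps=0.1):
    """Apply w := w + eps * delta * u over a sequence of (u, r) trials."""
    for u, r in trials:
        v = w * u              # predicted reward v = w*u
        delta = r - v          # prediction error delta = r - v
        w += eps * delta * u   # Rescorla-Wagner update
    return w

random.seed(0)

# Acquisition: stimulus always paired with reward -> w approaches <r> = 1
w = rescorla_wagner([(1.0, 1.0)] * 100)
print("after acquisition:", round(w, 2))        # ~1.0

# Extinction: stimulus shown without reward -> w decays back toward 0
w = rescorla_wagner([(1.0, 0.0)] * 100, w=w)
print("after extinction:", round(w, 2))         # ~0.0

# Partial reinforcement: reward on 50% of trials -> w hovers around <r> = 0.5
trials = [(1.0, float(random.random() < 0.5)) for _ in range(500)]
print("partial reinforcement:", round(rescorla_wagner(trials), 2))  # ~0.5
```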


Rescorla Wagner Rule Example

prediction error: δ = r - v;   learning rule: w := w + ε δ u

figure taken from Dayan&Abbott



Multiple Stimuli

Essentially the same idea/learning rule. In case of multiple stimuli:

v = w∙u   (predicted reward = dot product of stimulus vector and weight vector)

Prediction error: δ = r - v

Learning rule: w := w + ε δ u


∂/∂w_i (r - v)² = ∂/∂w_i (r - Σ_j w_j u_j)² = -2 (r - w∙u) u_i = -2 δ u_i


To what extent does the Rescorla-Wagner rule account for the variants of classical conditioning?

(prediction: v = w∙u;   error: δ = r - v;   learning: w := w + ε δ u)


figure taken from Dayan&Abbott



(prediction: v = w∙u;   error: δ = r - v;   learning: w := w + ε δ u)

Extinction, Partial Reinforcement: o.k., since w converges to <r>.

Blocking: during pre-training, w_1 converges to r. During training, v = w_1 u_1 + w_2 u_2 = r, hence δ = 0 and w_2 does not grow (see the numerical sketch below).

Inhibitory Conditioning: on S1-only trials, w_1 gets a positive value. On S1+S2 trials, v = w_1 + w_2 must converge to zero, hence w_2 becomes negative.

Overshadowing: v = w_1 + w_2 goes to r, but w_1 and w_2 may become different if there are different learning rates ε_i for them.

Secondary Conditioning: the R.-W. rule predicts a negative S2 weight!

The Rescorla-Wagner rule qualitatively accounts for a wide range of conditioning phenomena, but not for secondary conditioning.


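As a follow-up to the blocking analysis above, here is a small sketch (again my own, not from the slides) of the vector form v = w∙u: after pre-training on S1 alone, the prediction error is already zero during compound S1+S2 training, so w_2 never grows.

```python
def rw_vector(trials, w, eps=0.2):
    """Vector Rescorla-Wagner: v = w . u, delta = r - v, w_i += eps * delta * u_i."""
    for u, r in trials:
        v = sum(wi * ui for wi, ui in zip(w, u))
        delta = r - v
        w = [wi + eps * delta * ui for wi, ui in zip(w, u)]
    return w

# Pre-training: S1 alone predicts the reward -> w_1 converges to r = 1
w = rw_vector([((1.0, 0.0), 1.0)] * 100, w=[0.0, 0.0])

# Compound training: S1+S2 followed by the same reward.
# The prediction error is already ~0, so w_2 never grows (blocking).
w = rw_vector([((1.0, 1.0), 1.0)] * 100, w=w)
print("w1 = %.2f, w2 = %.2f" % (w[0], w[1]))   # ~1.00 and ~0.00
```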


Temporal Difference Learning

Motivation: need to keep track of time within a trial.

Idea (Sutton & Barto, 1990): try to predict the total future reward expected from time t onward to the time T of the end of the trial. Assume time is in discrete steps.




Predicted total future reward from time t (one stimulus case):

v(t) = Σ_{τ=0}^{t} w(τ) u(t-τ)

Actual total future reward from time t:

R(t) = Σ_{τ=0}^{T-t} r(t+τ)
Problem: how to adjust the weights? We would like to adjust w(τ) to make v(t) approximate the true total future reward R(t) (reward that is yet to come), but this is unknown since it lies in the future.



TD Learning cont’d.






R(t) = Σ_{τ=0}^{T-t} r(t+τ),    v(t) = Σ_{τ=0}^{t} w(τ) u(t-τ)



Solution
: (Temporal Difference Learning Rule)

w(τ) := w(τ) + ε δ(t) u(t-τ),   with   δ(t) = r(t) + v(t+1) - v(t)   (the temporal difference)

To see why this makes sense:













R(t) = Σ_{τ=0}^{T-t} r(t+τ) = r(t) + Σ_{τ=0}^{T-t-1} r(t+1+τ)




We want v(t) to approximate the left hand side, but also: v(t+1) should approximate the 2nd term of the right hand side. Hence:

v(t) ≈ r(t) + v(t+1),   or   r(t) + v(t+1) - v(t) ≈ 0.
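A minimal simulation sketch of the temporal difference rule above (my own; the trial length, stimulus time, and reward time are invented). A stimulus at step 5 is followed by a reward at step 15; after repeated trials, v(t) rises at stimulus onset and the prediction error at the time of reward shrinks toward zero.

```python
# Trial structure (invented for illustration): stimulus at step 5, reward at step 15.
T = 20                     # number of time steps per trial
t_stim, t_rew = 5, 15
eps = 0.2

u = [0.0] * T; u[t_stim] = 1.0    # stimulus trace: 1 only at stimulus onset
r = [0.0] * T; r[t_rew] = 1.0     # reward delivered at t_rew
w = [0.0] * T                     # one weight w(tau) per delay tau

def value(t):
    """v(t) = sum over tau of w(tau) * u(t - tau)."""
    return sum(w[tau] * u[t - tau] for tau in range(t + 1))

for trial in range(500):
    for t in range(T):
        v_next = value(t + 1) if t + 1 < T else 0.0
        delta = r[t] + v_next - value(t)          # temporal difference error
        for tau in range(t + 1):
            w[tau] += eps * delta * u[t - tau]    # w(tau) += eps * delta(t) * u(t - tau)

# After learning, v(t) rises to ~1 at stimulus onset and stays there until the
# reward arrives, i.e. the pending reward and its timing are predicted.
print([round(value(t), 2) for t in range(T)])
```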


TD Learning Rule Example

figure taken from Dayan&Abbott

w(τ) := w(τ) + ε δ(t) u(t-τ);   δ(t) = r(t) + v(t+1) - v(t)

Note: the temporal difference learning rule can also account for secondary conditioning (sorry, no example).

Reward and time course of reward correctly predicted!



Dopamine and Reward Prediction

figure taken from Dayan&Abbott

(VTA = ventral tegmental area (midbrain))

VTA neurons fire for unexpected reward: they seem to represent the prediction error δ.


Instrumental Conditioning

So far: only concerned with prediction of reward. We didn't consider the agent's actions. Reward usually depends on what you do! Skinner boxes, etc.

Distinguish two scenarios:

A. Rewards follow actions immediately (Static Action Choice)
   Example: n-armed bandit (slot machine)

B. Rewards may be delayed (Sequential Action Choice)
   Example: playing chess

Goal: choose actions to maximize rewards.




Static Action Choice

Consider bee foraging:


Bee can choose to fly to blue or yellow flowers,

wants to maximize nectar volume.


Bees learn to fly to “better” flower in single session (~40 flowers)



Simple model of bee foraging

When the bee chooses blue, its reward is drawn from p(r_b); when it chooses yellow, from p(r_y).

Assume the model bee has a stochastic policy: it chooses to fly to a blue or yellow flower with probability p(b) or p(y), respectively.

A “convenient” assumption: p(b), p(y) follow the softmax decision rule:

p(b) = exp(β m_b) / [exp(β m_b) + exp(β m_y)],    p(y) = exp(β m_y) / [exp(β m_b) + exp(β m_y)]

Equivalently, p(b) = σ(β (m_b - m_y)), where σ(x) = 1 / (1 + exp(-x)).

Notes: p(b) + p(y) = 1; m_b, m_y are action values to be adjusted; β is an inverse temperature: large β → nearly deterministic behavior.

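A quick numerical check (my own sketch; β and the action values are arbitrary) that the two-action softmax above is equivalent to the sigmoid form p(b) = σ(β(m_b - m_y)), and that a larger β makes the choice more deterministic.

```python
import math

def softmax_two(m_b, m_y, beta=1.0):
    """p(b), p(y) under the softmax decision rule with inverse temperature beta."""
    eb, ey = math.exp(beta * m_b), math.exp(beta * m_y)
    return eb / (eb + ey), ey / (eb + ey)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

m_b, m_y = 1.0, 0.4              # arbitrary illustrative action values
for beta in (0.5, 1.0, 5.0):     # larger beta -> choice becomes nearly deterministic
    pb, py = softmax_two(m_b, m_y, beta)
    assert abs(pb - sigmoid(beta * (m_b - m_y))) < 1e-12   # p(b) = sigma(beta*(m_b - m_y))
    print(f"beta={beta}: p(b)={pb:.3f}, p(y)={py:.3f}")
```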


Exploration-Exploitation Dilemma

Why use softmax action selection?

Idea: the bee could also choose the “better” action all the time.
But: the bee can't be sure that the better action really is the better action.


Bee needs to test and continuously verify which action leads

to higher rewards.


This is the famous exploration-exploitation dilemma of reinforcement learning:

Need to explore to know what's good.
Need to exploit what you know is good to maximize reward.


Generalization of softmax to many possible actions:

p(a) = exp(β m_a) / Σ_{a'=1}^{N_a} exp(β m_a')


The Indirect Actor

Question: how to adjust the action values m_a?

Idea: have the action values adapt to the average reward for that action:

m_b = <r_b> and m_y = <r_y>

This can be achieved with a simple delta rule:

m_b := m_b + ε δ,   where δ = r_b - m_b   (after a visit to blue; analogously for yellow)

This is the indirect actor because the action choice is mediated indirectly by the expected amounts of reward.

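Here is a minimal simulation sketch of the indirect actor (my own; the nectar volumes, learning rate, and β are invented and not taken from the figure): each action value tracks the average reward of its flower via the delta rule, and the softmax converts the values into choice probabilities. The nectar volumes are swapped halfway through, mimicking the reversal in the example on the next slide.

```python
import math
import random

random.seed(1)
beta, eps = 1.0, 0.1
m = {"b": 0.0, "y": 0.0}               # action values, estimates of <r_b>, <r_y>
mean_nectar = {"b": 2.0, "y": 5.0}     # invented average nectar volumes

def softmax(m, beta):
    z = {a: math.exp(beta * v) for a, v in m.items()}
    s = sum(z.values())
    return {a: z[a] / s for a in z}

for visit in range(200):
    if visit == 100:                   # nectar volumes reversed halfway through
        mean_nectar = {"b": 5.0, "y": 2.0}
    p = softmax(m, beta)
    a = "b" if random.random() < p["b"] else "y"
    r = random.random() * 2 * mean_nectar[a]   # noisy reward with mean mean_nectar[a]
    m[a] += eps * (r - m[a])                   # delta rule: m_a tracks <r_a>

# After the reversal the estimates swap: m_b drifts toward 5, m_y toward 2,
# and the softmax shifts the bee's choices back to the better flower.
print({a: round(v, 2) for a, v in m.items()})
```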


Indirect Actor Example

figure taken from Dayan&Abbott



The Direct Actor

figure taken from Dayan&Abbott

Idea: choose the action values directly to maximize the expected reward

<r> = p(b) <r_b> + p(y) <r_y>

Maximize this expected reward by stochastic gradient ascent:

∂<r>/∂m_b = β p(b) p(y) <r_b> - β p(y) p(b) <r_y> = β p(b) p(y) (<r_b> - <r_y>)

This leads to the following learning rule (applied after taking action a and receiving reward r_a):

m_b := m_b + ε (δ_ab - p(b)) (r_a - r̄),

where δ_ab is the “Kronecker delta” and r̄ is a parameter often chosen to be an estimate of the average reward per time.

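And a corresponding sketch of the direct actor update m_b := m_b + ε(δ_ab - p(b))(r_a - r̄) (my own; the mean rewards and the running-average estimate of r̄ are invented assumptions): the action values are pushed directly uphill on the expected reward, so the probability of the better flower grows.

```python
import math
import random

random.seed(2)
beta, eps = 1.0, 0.1
m = {"b": 0.0, "y": 0.0}             # action values, adjusted directly
mean_reward = {"b": 1.0, "y": 3.0}   # invented mean rewards; yellow is better
r_bar = 0.0                          # running estimate of the average reward r-bar

def softmax(m, beta):
    z = {a: math.exp(beta * v) for a, v in m.items()}
    s = sum(z.values())
    return {a: z[a] / s for a in z}

for visit in range(500):
    p = softmax(m, beta)
    a = "b" if random.random() < p["b"] else "y"
    r = random.random() * 2 * mean_reward[a]
    for b in m:                                    # direct actor update for every action
        kron = 1.0 if b == a else 0.0              # Kronecker delta delta_ab
        m[b] += eps * (kron - p[b]) * (r - r_bar)
    r_bar += 0.05 * (r - r_bar)                    # slow estimate of the average reward

print(round(softmax(m, beta)["y"], 2))   # probability of the better flower grows toward 1
```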


Direct Actor Example

figure taken from Dayan&Abbott

again: nectar volumes reversed after the first 100 visits



Sequential Action Choice

So far: immediate reward after each action (n-armed bandit problem)
Now: delayed rewards; the agent can be in different states


Example: Maze Task

figure taken from Dayan&Abbott

Amount of reward after decision at second intersection depends on

action taken at first intersection.



Policy Iteration

There is a big body of research on how to solve this and more complicated tasks, easily filling an entire course by itself. Here we just consider one example method: policy iteration.


Assumption: the state is fully observable (in contrast to only partially observable), i.e. the rat knows exactly where it is at any time.


Idea: maintain and improve a stochastic policy, determining actions at each decision point (A, B, C) using action values and softmax decisions.

Two elements:
critic: use temporal difference learning to predict future rewards from A, B, C if the current policy is followed
actor: maintain and improve the policy


figure taken from Dayan&Abbott



Policy Iteration cont’d.

How to formalize this idea?

Introduce a state variable u to describe whether the rat is at A, B, or C. Also introduce an action value vector m(u) describing the policy. (The softmax rule assigns the probability of action a based on the action values.)

Immediate reward for taking action a in state u: r_a(u)

Expected future reward for starting in state u and following the current policy: v(u) (the state value). The rat's estimate of this is denoted by w(u).

Policy Evaluation (critic): estimate w(u) using temporal difference learning.
Policy Improvement (actor): improve the action values m(u) based on the estimated state values.

figure taken from Dayan&Abbott



Policy Evaluation

Initially, assume all action values are 0, i.e.

left/right equally likely everywhere.


True value of each state can be found

by inspection:

v(B) = ½(5+0)=2.5; v(C) = ½(2+0)=1;

v(A) = ½(v(B)+v(C))=1.75.


These values can be learned with temporal difference learning rule:




w(u) := w(u) + ε δ,   with   δ = r_a(u) + v(u') - v(u),

where u' is the state that results from taking action a in state u.

figure taken from Dayan&Abbott

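A compact sketch of policy evaluation on the maze (my own; the transition table encodes the rewards 5, 0, 2, 0 used in the arithmetic above): TD learning under the random policy drives the estimates toward v(A) = 1.75, v(B) = 2.5, v(C) = 1.

```python
import random

random.seed(3)
eps = 0.1
# Maze from the slide: at A the rat moves to B or C (no immediate reward);
# at B the two arms pay 5 or 0; at C they pay 2 or 0.  None marks the end of a trial.
step = {("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
        ("B", "L"): (None, 5.0), ("B", "R"): (None, 0.0),
        ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0)}
w = {"A": 0.0, "B": 0.0, "C": 0.0}   # estimated state values

def v(u):
    return w[u] if u is not None else 0.0

for trial in range(2000):
    u = "A"
    while u is not None:
        a = random.choice("LR")          # random policy: left/right equally likely
        u_next, r = step[(u, a)]
        delta = r + v(u_next) - v(u)     # TD error: r_a(u) + v(u') - v(u)
        w[u] += eps * delta              # policy evaluation update
        u = u_next

# Estimates fluctuate around v(A)=1.75, v(B)=2.5, v(C)=1 (constant learning rate).
print({s: round(val, 2) for s, val in w.items()})
```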


Policy Evaluation Example




w(u) := w(u) + ε δ,   with   δ = r_a(u) + v(u') - v(u)




figures taken from Dayan&Abbott



Policy Improvement

How to adjust the action values?

m_a'(u) := m_a'(u) + ε (δ_aa' - p(a'; u)) δ,   where   δ = r_a(u) + v(u') - v(u)

and p(a'; u) is the softmax probability of choosing action a' in state u as determined by m_a'(u).

Example: consider starting out from the random policy and assume the state value estimates w(u) are accurate. Considering u = A leads to

δ = 0 + v(B) - v(A) = 0.75   for a left turn,
δ = 0 + v(C) - v(A) = -0.75   for a right turn,

so the rat will increase the probability of going left in A.

figures taken from Dayan&Abbott

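Finally, a sketch of the full actor-critic loop on the same maze (my own; the learning rates and β are invented): the critic runs the policy-evaluation update while the actor applies m_a'(u) := m_a'(u) + ε (δ_aa' - p(a'; u)) δ, so the probability of turning left (the better choice at A, B, and C) grows.

```python
import math
import random

random.seed(4)
eps_c, eps_a, beta = 0.1, 0.1, 1.0
step = {("A", "L"): ("B", 0.0), ("A", "R"): ("C", 0.0),
        ("B", "L"): (None, 5.0), ("B", "R"): (None, 0.0),
        ("C", "L"): (None, 2.0), ("C", "R"): (None, 0.0)}
w = {s: 0.0 for s in "ABC"}                      # critic: state value estimates w(u)
m = {s: {"L": 0.0, "R": 0.0} for s in "ABC"}     # actor: action values m(u)

def policy(u):
    z = {a: math.exp(beta * q) for a, q in m[u].items()}
    s = sum(z.values())
    return {a: z[a] / s for a in z}              # softmax over the action values

for trial in range(2000):
    u = "A"
    while u is not None:
        p = policy(u)
        a = "L" if random.random() < p["L"] else "R"
        u_next, r = step[(u, a)]
        v_next = w[u_next] if u_next is not None else 0.0
        delta = r + v_next - w[u]                # TD error
        w[u] += eps_c * delta                    # policy evaluation (critic)
        for a2 in ("L", "R"):                    # policy improvement (actor)
            kron = 1.0 if a2 == a else 0.0
            m[u][a2] += eps_a * (kron - p[a2]) * delta
        u = u_next

print({u: round(policy(u)["L"], 2) for u in "ABC"})   # p(left) grows toward 1 at A, B, C
```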


Policy Improvement Example

figures taken from Dayan&Abbott



Some Extensions

- Introduction of a state vector u

- Discounting of future rewards: put more emphasis on rewards in the near future than rewards that are far away.
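For example (my notation; the slides do not define a discount factor), discounting replaces the total future reward by R(t) = Σ_{τ≥0} γ^τ r(t+τ) with 0 ≤ γ < 1, and the TD error becomes δ(t) = r(t) + γ v(t+1) - v(t); smaller γ weights near-term rewards more heavily.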




Note: reinforcement learning is a big subfield of machine learning. There is a good introductory textbook by Sutton and Barto.




Questions to discuss/think about

1. Even at one level of abstraction there are many different “Hebbian” or reinforcement learning rules; is it important which one you use? What is the right one?

2. The applications we discussed in Hebbian and reinforcement learning considered networks passively receiving simple sensory input and learning to code it or behave “well”; how can we model learning through interaction with complex environments? Why might it be important to do so?

3. The problems we considered so far are very “low-level”, with no hint of “complex behaviors” yet. How can we bridge this huge divide? How can we “scale up”? Why is it difficult?