IROS04
(Japan, Sendai)
University of Tehran
Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak Najar Araabi
farahmand@ipm.ir, {mnili, araabi}@ut.ac.ir
Behavior Hierarchy Learning in a Behavior-based System using Reinforcement Learning
Department of Electrical and Computer Engineering
University of Tehran
Iran
Paper Outline
• Challenges and Requirements of Robotic Systems
• Behavior-based Approach to AI
• How should we design a Behavior-based System (BBS)?!
• Learning in BBS
• Structure Learning in BBS
• Value Function Decomposition
• Experiments: Multi-Robot Object Lifting
• Conclusions, Ongoing Research, and Future Work
Challenges and Requirements of Robotic Systems
Challenges
• Sensor and Effector Uncertainty
• Partial Observability
• Non-Stationarity
Requirements (among many others)
• Multi-goal
• Robustness
• Multiple Sensors
• Scalability
• Automatic design
• [Learning]
Behavior-based Approach to AI
• Behavior-based approach as a good candidate for low-level intelligence.
• Behavioral (activity) decomposition
  – against functional decomposition
• Behavior: Sensor -> Action (direct link between perception and action)
• Situatedness
  – Situatedness motto: The world is its own best model!
• Embodiment
• Intelligence as Emergence
  – (interaction of agent with environment)
Behavioral decomposition
[Figure: behavior layers connecting sensors to actuators: manipulate the world, build maps, explore, locomote, avoid obstacles]
Behavior-based System Design
• Hand Design
  – Common almost everywhere (just ask some people at IROS04)
  – Complicated: may be infeasible in complex problems
  – Even if it is possible to find a working system, it is probably not optimal.
• Evolution
  – Time consuming
  – Good solutions can be found
  – Biologically feasible
• Learning
  – Biologically feasible
  – Learning is essential for the life-time survival of the agent.
We focus on learning in this presentation.
The Importance of Learning
• Unknown environment/body
  – An [exact] model of the environment/body is not known
• Non-stationary environment/body
  – Changing environment (offices, houses, streets, and almost everywhere)
  – Aging
• The designer may not know how to benefit from every aspect of her agent/environment
  – Let the agent learn it by itself (learning as optimization)
• etc.
Learning in Behavior-based Systems
• There are a few works on behavior-based learning
  – Mataric, Mahadevan, Maes, and ...
• … but there has been no deep investigation of it (especially a mathematical formulation)!
Learning in Behavior-based Systems
There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning.
  – [Agent] Did I perform it correctly?!
  – [Tutor] Yes/No!
Learning in Behavior-based Systems
We have divided learning in BBS into these two parts:
• Structure Learning
  – How should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?
• Behavior Learning
  – How should each behavior behave? (We do not have the necessary toolbox.)
Structure Learning Assumptions
• Structure Learning in the Subsumption Architecture as a good sample for BBS
• Purely parallel case
• We know B1, B2, and … but we do not know how to arrange them in the architecture
  – we know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we don’t know which one is superior to the others.
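To make the arrangement problem concrete, here is a minimal Python sketch of priority-based arbitration in a purely parallel structure. It assumes that each behavior either proposes an action or stays silent (not activated), and that the highest activated behavior in the ordering becomes the controlling one. The behavior rules, the sensor name `obstacle_distance`, and the function `arbitrate` are hypothetical and only illustrate why the ordering matters.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sensors = Dict[str, float]
Behavior = Callable[[Sensors], Optional[str]]


def avoid_obstacles(s: Sensors) -> Optional[str]:
    # Hypothetical rule: activated only when an obstacle is close.
    return "turn" if s["obstacle_distance"] < 0.5 else None


def explore(s: Sensors) -> Optional[str]:
    # Hypothetical rule: always activated.
    return "move forward"


def arbitrate(structure: List[Tuple[str, Behavior]],
              s: Sensors) -> Tuple[Optional[str], Optional[str]]:
    """Return (name, action) of the controlling behavior.

    The structure is ordered from the highest layer to the lowest; the first
    activated behavior controls the agent and suppresses those below it.
    """
    for name, behavior in structure:
        action = behavior(s)
        if action is not None:
            return name, action
    return None, None


# Two candidate arrangements of the same toolbox.
avoid_on_top = [("avoid obstacles", avoid_obstacles), ("explore", explore)]
explore_on_top = [("explore", explore), ("avoid obstacles", avoid_obstacles)]

near_wall = {"obstacle_distance": 0.2}
print(arbitrate(avoid_on_top, near_wall))    # avoid obstacles is controlling
print(arbitrate(explore_on_top, near_wall))  # explore suppresses avoid obstacles
```

With `explore` on top, the agent keeps moving forward next to a wall, which is the kind of arrangement that structure learning should learn to avoid.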
Structure Learning
[Figure: Behavior Toolbox containing manipulate the world, build maps, explore, locomote, avoid obstacles]
The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).
Structure Learning
[Figure: Behavior Toolbox]
1- “explore” becomes the controlling behavior and suppresses “avoid obstacles”.
2- The agent hits a wall!
Structure Learning
[Figure: Behavior Toolbox]
The tutor (environment) gives “explore” a punishment for being in that place of the structure.
Structure Learning
[Figure: Behavior Toolbox]
“explore” is not a very good behavior for the highest position of the structure, so it is replaced by “avoid obstacles”.
Structure Learning Issues
• How should we represent structure?
  – Sufficient (the concept space should be covered by the hypothesis space)
  – Tractable (small hypothesis space)
  – Well-defined credit assignment
• How should we assign credits to the architecture?
  – If the agent receives a reward/punishment, how should we reward/punish the structure of the architecture?
Value Function Decomposition and Structure Learning
Each structure has a value, determined by the reinforcement signal it receives.
$$V_T = E\left[\, r_t \mid \text{the agent with structure } T \,\right]$$
• The objective is to find a structure T with a high value.
• We have decomposed the value function into simpler components that enable us to benefit from previous experiments.
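As a baseline for comparison, the value of a structure can be estimated without any decomposition by averaging the rewards observed while the agent runs with that structure. The Python sketch below assumes exactly that sample-mean estimator. The names `record` and `estimate_value` and the reward numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import DefaultDict, List, Sequence, Tuple

# Rewards observed under each complete structure (an ordering of behaviors).
rewards_by_structure: DefaultDict[Tuple[str, ...], List[float]] = defaultdict(list)


def record(structure: Sequence[str], reward: float) -> None:
    """Store one reward received while the agent used this structure."""
    rewards_by_structure[tuple(structure)].append(reward)


def estimate_value(structure: Sequence[str]) -> float:
    """Sample-mean estimate of the structure's value; 0.0 if never tried."""
    samples = rewards_by_structure.get(tuple(structure))
    return mean(samples) if samples else 0.0


# Hypothetical experience with two orderings of the same behaviors.
record(["avoid obstacles", "explore"], 1.0)
record(["avoid obstacles", "explore"], 0.0)
record(["explore", "avoid obstacles"], -1.0)

print(estimate_value(["avoid obstacles", "explore"]))
print(estimate_value(["explore", "avoid obstacles"]))
print(estimate_value(["explore", "locomote"]))  # never tried: nothing is known
```

Every ordering is a separate table entry in this baseline, so experience with one structure says nothing about another. Decomposing the value into shared components is what lets the agent reuse earlier experience.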
Value Function Decomposition
• It is possible to decompose the total system’s value into the value of each behavior in each layer.
• We call it the Zero-Order method.
$$V_{ZO}(i,j) = V_{ij} = E\left[\, r_t \mid B_j \text{ is the controlling behavior in the } i\text{-th layer} \,\right]$$
Value Function Decomposition
Zero Order Method
It stores the value of each behavior being in a specific layer.
[Figure: ZO Value Table in the agent’s mind, with a higher layer and a lower layer. Entries shown: avoid obstacles (0.8), avoid obstacles (0.6), explore (0.7), explore (0.9), locomote (0.4), locomote (0.4)]
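One way such a table could be turned into a structure is a greedy read-out: fill the layers from the highest to the lowest, and in each layer pick the not-yet-used behavior with the largest stored value. The Python sketch below illustrates that idea only; it is not necessarily the selection procedure used in the paper. The assignment of the numbers to the higher and lower layer is an assumption made for the example, and `greedy_structure` is a hypothetical name.

```python
from typing import Dict, List

# zo[i][b] is the stored value of behavior b being in layer i (0 = highest).
# ASSUMPTION: which number belongs to which layer is chosen for illustration.
zo: List[Dict[str, float]] = [
    {"avoid obstacles": 0.8, "explore": 0.7, "locomote": 0.4},  # higher layer
    {"avoid obstacles": 0.6, "explore": 0.9, "locomote": 0.4},  # lower layer
]


def greedy_structure(table: List[Dict[str, float]]) -> List[str]:
    """Pick, layer by layer, the unused behavior with the largest ZO value."""
    chosen: List[str] = []
    for layer_values in table:
        candidates = {b: v for b, v in layer_values.items() if b not in chosen}
        if not candidates:
            break  # fewer behaviors than layers
        chosen.append(max(candidates, key=candidates.get))
    return chosen


print(greedy_structure(zo))  # ordering from the highest layer to the lowest
```

Under the assumed numbers, "avoid obstacles" takes the higher layer and "explore" the lower one.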
Credit Assignment for Zero Order Method
• The controlling behavior is the only behavior responsible for the current reinforcement signal.
• An appropriate ZO value table updating method is available.
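The credit assignment rule above can be sketched in Python as follows: only the table entry of the controlling behavior, in the layer it currently occupies, is moved toward the received reward. The running-average update with step size `alpha` is an assumption standing in for the update rule of the paper, which is not given here. The function name `update_zo` and the numbers are hypothetical.

```python
from typing import Dict, List


def update_zo(zo: List[Dict[str, float]], layer: int, controlling: str,
              reward: float, alpha: float = 0.1) -> None:
    """Credit only the controlling behavior in its current layer.

    ASSUMPTION: an exponential running average is used as the update;
    every other entry of the table is left untouched.
    """
    old = zo[layer][controlling]
    zo[layer][controlling] = old + alpha * (reward - old)


# Hypothetical episode: "explore" controls from the highest layer (index 0),
# the agent hits a wall, and the tutor answers with a punishment of -1.
zo = [
    {"avoid obstacles": 0.8, "explore": 0.7},
    {"avoid obstacles": 0.6, "explore": 0.9},
]
update_zo(zo, layer=0, controlling="explore", reward=-1.0, alpha=0.5)
print(zo[0]["explore"])  # lowered toward -1; all other entries are unchanged
```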
Value Function Decomposition
Another Method (First Order)
It stores the value of the relative order of behaviors
  – How good/bad is it if “B1 is placed higher than B2”?!
    • V(avoid obstacles > explore) = 0.8
    • V(explore > avoid obstacles) = -0.3
• Sorry! Not that easy (and informative) to show graphically!!
• Credits are assigned to all (controlling, activated) pairs of behaviors.
  – The agent receives a reward while B1 is controlling and B3 and B5 are activated
    • (B1>B3): +
    • (B1>B5): +
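The B1/B3/B5 example can be sketched in Python: the reward is credited to every pair (controlling > activated), and a candidate ordering can then be scored by summing the stored values of all "higher than" pairs it contains. The running-average update and the additive score are assumptions for illustration, not the formulation of the paper. The names `update_fo` and `ordering_score` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import DefaultDict, Iterable, Sequence, Tuple

Pair = Tuple[str, str]  # (higher behavior, lower behavior)


def update_fo(fo: DefaultDict[Pair, float], controlling: str,
              activated: Iterable[str], reward: float,
              alpha: float = 0.1) -> None:
    """Credit every (controlling > activated) pair with the reward.

    ASSUMPTION: an exponential running average is used as the update.
    """
    for other in activated:
        if other == controlling:
            continue
        pair = (controlling, other)
        fo[pair] += alpha * (reward - fo[pair])


def ordering_score(fo: DefaultDict[Pair, float], order: Sequence[str]) -> float:
    """Sum the stored values of all pairs (a above b) in the ordering."""
    # combinations() keeps the input order, so a is always placed above b.
    return sum(fo.get((a, b), 0.0) for a, b in combinations(order, 2))


fo: DefaultDict[Pair, float] = defaultdict(float)
# Reward received while B1 is controlling and B3 and B5 are activated.
update_fo(fo, controlling="B1", activated=["B3", "B5"], reward=1.0, alpha=0.5)

print(fo[("B1", "B3")], fo[("B1", "B5")])
print(ordering_score(fo, ["B1", "B3", "B5"]))  # B1 on top matches the credits
print(ordering_score(fo, ["B3", "B5", "B1"]))  # B1 at the bottom earns nothing
```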
Structure Representation
Both of these methods are supported by a lot of probabilistic reasoning, which shows how to
  – decompose the total system value into simple components
  – assign credits
  – update the value table
Check the Proceedings for the mathematical formulation!
Example: Multi-Robot Object Lifting
• A group of three robots wants to lift an object using their own local sensors
  – No central control
  – No communication
  – Local sensors
• Objectives
  – Reaching the prescribed height
  – Keeping the tilt angle small
Example: Multi-Robot Object Lifting
[Figure: Behavior Toolbox with Stop, Push More, Hurry Up, Slow Down, Don’t Go Fast; the structure is marked “?!”]
Example: Multi-Robot Object Lifting
Sample shot of tilt angle of the object after sufficient learning
[Figure: x-axis Episodes (5 to 50), y-axis Average total reward per episode (-40 to 40); curves: Mean hand-designed performance, Zero order, First order]
Example: Multi-Robot Object Lifting
Sample shot of height of each robot after sufficient learning
[Figure: x-axis Steps (0 to 90), y-axis z of robots (0 to 3.5); curves: goal, robot 1, robot 2, robot 3]
Example: Multi-Robot Object Lifting
Sample shot of tilt angle of the object after sufficient learning
[Figure: x-axis Steps (0 to 90), y-axis Tilt angle in degrees (0 to 45)]
Conclusions, Ongoing Research, and Future Work
• We have devised two different methods for structure learning in behavior-based systems.
• Good results in two different tasks
  – Multi-robot Object Lifting
  – An Abstract Problem (not reported yet)
Conclusions, Ongoing Research, and Future Work
• … but where should we find the necessary behaviors?!
  – Behavior Learning
• We have devised some methods for behavior learning, which will be reported soon.
Conclusions, Ongoing Research, and Future Work
• However, many steps remain for fully automated agent design
  – How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
  – Problem of Reinforcement Signal Design
    • Designing a good reinforcement signal is not easy at all.