IROS04 - Webdocs Cs Ualberta


IROS04

(Japan, Sendai)

University of Tehran

Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak Najar Araabi

farahmand@ipm.ir, {mnili, araabi}@ut.ac.ir


Behavior Hierarchy Learning in a Behavior-based System using Reinforcement Learning

Department of Electrical and Computer Engineering

University of Tehran

Iran



Paper Outline


Challenges and Requirements of Robotic Systems


Behavior-based Approach to AI


How should we design a Behavior-based System (BBS)?!


Learning in BBS


Structure Learning in BBS


Value Function Decomposition


Experiments: Multi-Robot Object Lifting


Conclusions, Ongoing Research, and Future Work



Challenges and Requirements of Robotic Systems

Challenges


Sensor and Effector Uncertainty


Partial Observability


Non-Stationarity


Requirements (among many others)


Multi-goal


Robustness


Multiple Sensors


Scalability


Automatic design


[Learning]



Behavior-based Approach to AI


The behavior-based approach is a good candidate for low-level intelligence.


Behavioral (activity) decomposition


as opposed to functional decomposition


Behavior: Sensor -> Action (direct link between perception and action)


Situatedness


Situatedness motto:
The world is its own best model!


Embodiment


Intelligence as Emergence


(interaction of agent with environment)


Behavioral decomposition

[Diagram: parallel behavior layers between sensors and actuators: manipulate the world, build maps, explore, locomote, avoid obstacles]


Behavior-based System Design


Hand Design


Common almost everywhere (just ask some people at IROS04)


Complicated: may be infeasible in complex problems


Even if a working system can be found, it is probably not optimal.


Evolution


Time consuming


Good solutions can be found


Biologically feasible


Learning


Biologically feasible


Learning is essential for the lifetime survival of the agent.

We focus on learning in this presentation.


The Importance of Learning


Unknown environment/body


An [exact] model of the environment/body is not known


Non-stationary environment/body


Changing environment (offices, houses, streets, and almost everywhere)


Aging


The designer may not know how to benefit from every aspect of her agent/environment


Let the agent learn it by itself (learning as optimization)


etc.


Learning in Behavior-based Systems


There are a few works on behavior-based learning


Mataric, Mahadevan, Maes, and ...


… but there has been no deep investigation of it (especially of its mathematical formulation)!


Learning in Behavior-based Systems

There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning.


[Agent] Did I perform it correctly?!


[Tutor] Yes/No!




Learning in Behavior-based Systems

We have divided learning in BBS into these two parts:


Structure Learning


How should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?


Behavior Learning


How should each behavior behave? (we do not have the necessary toolbox)


Structure Learning Assumptions


Structure learning in the Subsumption Architecture as a good sample of BBS


Purely parallel case


We know B1, B2, and … but we do not know how to arrange them in the architecture


We know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we don’t know which one is superior to the others.


Structure Learning

Behavior Toolbox: manipulate the world, build maps, explore, locomote, avoid obstacles

The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).



Structure Learning

1- explore becomes the controlling behavior and suppresses avoid obstacles

2- The agent hits a wall!


Structure Learning

The tutor (environment) gives explore a punishment for being in that place of the structure.


Structure Learning

“explore” is not a very good behavior for the highest position of the structure, so it is replaced by “avoid obstacles”.
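The walk-through above can be sketched in Python. This is only a minimal illustration, assuming that a priority-ordered list stands in for the layered structure and that the first activated behavior in the list becomes the controlling one. The sensor key `front_distance`, the 0.5 threshold, and the action names are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Optional

# A behavior maps sensor readings to an action, or None when it is not activated.
Behavior = Callable[[Dict[str, float]], Optional[str]]


def avoid_obstacles(sensors: Dict[str, float]) -> Optional[str]:
    # Activated only when an obstacle is close (hypothetical threshold).
    return "turn" if sensors["front_distance"] < 0.5 else None


def explore(sensors: Dict[str, float]) -> Optional[str]:
    # Always activated: keeps the robot moving forward.
    return "move_forward"


def arbitrate(structure: List[Behavior], sensors: Dict[str, float]) -> Optional[str]:
    """Return the action of the highest-placed activated behavior.

    Index 0 is the highest layer; it suppresses everything below it.
    """
    for behavior in structure:
        action = behavior(sensors)
        if action is not None:
            return action
    return None


near_wall = {"front_distance": 0.2}

# "explore" on top suppresses "avoid obstacles": the agent drives into the wall.
bad_action = arbitrate([explore, avoid_obstacles], near_wall)

# "avoid obstacles" on top takes control whenever a wall is close.
good_action = arbitrate([avoid_obstacles, explore], near_wall)
```

With `explore` on top the agent keeps moving forward next to the wall, which is the situation the tutor punishes. Swapping the two behaviors lets `avoid_obstacles` take control only when it is activated.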


Structure Learning Issues


How should we represent structure?


Sufficient (Concept space should be covered by Hypothesis space)


Tractable (small Hypothesis space)


Well-defined credit assignment


How should we assign credits to the architecture?


If the agent receives a reward/punishment, how should we reward/punish the structure of the architecture?



Value Function Decomposition and Structure Learning

Each structure has a value defined by the reinforcement signal it receives:

$$V_T = E\left[\, r_t \mid \text{the agent with structure } T \,\right]$$


The objective is to find a structure T with a high value.


We have decomposed the value function into simpler components that enable us to benefit from previous experiments.
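As a rough illustration of the value of a structure (the expected reinforcement received by an agent with that structure), the expectation can be approximated by a sample mean over observed rewards. The sketch below assumes exactly that. The reward logs are made-up numbers for illustration only, and `estimate_structure_value` is a hypothetical helper, not the authors' method.

```python
from statistics import mean
from typing import Dict, List, Tuple

# A structure is written as a tuple of behavior names, highest layer first.
Structure = Tuple[str, ...]


def estimate_structure_value(rewards: List[float]) -> float:
    """Sample-mean estimate of the expected reward of one structure."""
    return mean(rewards)


# Made-up reward logs, one list per tried structure (illustration only).
reward_log: Dict[Structure, List[float]] = {
    ("avoid obstacles", "explore"): [1.0, 0.5, 1.0],
    ("explore", "avoid obstacles"): [-1.0, 0.5, -1.0],
}

values = {structure: estimate_structure_value(r) for structure, r in reward_log.items()}

# The objective is a structure with a high value.
best_structure = max(values, key=values.get)
```

Estimating every structure separately like this does not reuse experience across structures, which is the motivation for decomposing the value into simpler components.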


Value Function Decomposition


It is possible to decompose the total system’s value into the value of each behavior in each layer.


We call it the Zero-Order method.

$$V_{ZO}(i, j) = V_{ij} = E\left[\, r_t \mid B_j \text{ is the controlling behavior in the } i\text{-th layer} \,\right]$$



Value Function Decomposition

Zero Order Method

It stores the value of each behavior being in a specific layer.

[Figure: ZO value table in the agent’s mind, spanning a higher and a lower layer, with entries avoid obstacles (0.8), avoid obstacles (0.6), explore (0.7), explore (0.9), locomote (0.4), locomote (0.4)]


Credit Assignment for Zero Order Method


The controlling behavior is the only behavior responsible for the current reinforcement signal.


An appropriate method for updating the ZO value table is available.
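The slides do not give the update rule itself, so the following is only a sketch of what a zero-order table could look like. It assumes a simple running-average update toward the received reward, applied to the controlling behavior alone. The class name `ZeroOrderTable` and the `learning_rate` parameter are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class ZeroOrderTable:
    """Stores one value per (layer, behavior) pair."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        # (layer index, behavior name) -> running estimate of the expected reward
        self.values: DefaultDict[Tuple[int, str], float] = defaultdict(float)

    def update(self, layer: int, controlling_behavior: str, reward: float) -> None:
        # Only the controlling behavior is credited with the current reward.
        key = (layer, controlling_behavior)
        self.values[key] += self.learning_rate * (reward - self.values[key])

    def value(self, layer: int, behavior: str) -> float:
        return self.values[(layer, behavior)]


table = ZeroOrderTable(learning_rate=0.5)
table.update(layer=0, controlling_behavior="explore", reward=-1.0)        # hit a wall
table.update(layer=0, controlling_behavior="avoid obstacles", reward=1.0)
```

After these two updates the table prefers "avoid obstacles" over "explore" in layer 0, matching the walk-through earlier in the talk.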


Value Function Decomposition

Another Method (First Order)

It stores the value of the relative order of behaviors


How good/bad is it if “B1 is placed higher than B2”?!


V(avoid obstacles > explore) = 0.8


V(explore > avoid obstacles) = -0.3


Sorry! Not that easy (and informative) to show graphically!!


Credits are assigned to all (controlling, activated) pairs of behaviors.


The agent receives a reward while B1 is controlling and B3 and B5 are activated


(B1>B3): +


(B1>B5): +
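The pairwise credit assignment above can be sketched as follows. The exact update rule is in the proceedings, not in the slides, so this sketch assumes a simple running-average update for every (controlling, activated) pair. `FirstOrderTable` and `learning_rate` are hypothetical names.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Tuple


class FirstOrderTable:
    """Stores one value per ordered pair (higher behavior, lower behavior)."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        # (higher, lower) -> estimated value of "higher is placed above lower"
        self.values: DefaultDict[Tuple[str, str], float] = defaultdict(float)

    def update(self, controlling: str, activated: Iterable[str], reward: float) -> None:
        # Credit every (controlling, activated) pair with the current reward.
        for other in activated:
            if other == controlling:
                continue
            key = (controlling, other)
            self.values[key] += self.learning_rate * (reward - self.values[key])


table = FirstOrderTable(learning_rate=0.5)
# The agent receives a reward while B1 is controlling and B3 and B5 are activated.
table.update(controlling="B1", activated=["B3", "B5"], reward=1.0)
```

Both V(B1>B3) and V(B1>B5) move up, while the reverse pairs are left untouched.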



Structure Representation

Both of these methods are supported by a good deal of probabilistic reasoning, which shows how to


decompose the total system value into simple components


assign credits


update the value tables

Check the Proceedings for the mathematical formulation!


Example: Multi-Robot Object Lifting


A group of three robots wants to lift an object using their own local sensors


No central control


No communication


Local sensors


Objectives


Reaching prescribed height


Keeping tilt angle small


Example: Multi-Robot Object Lifting

Behavior Toolbox

Stop

Push More

Hurry Up

Slow Down

Don’t Go Fast

?!


Example: Multi-Robot Object Lifting

Sample shot of tilt angle of the object after sufficient learning

[Figure: average total reward per episode vs. episodes; curves: mean hand-designed performance, zero order, first order]


Example: Multi-Robot Object Lifting

Sample shot of height of each robot after sufficient learning

[Figure: z of robots vs. steps; legend: goal, 1, 2, 3]


Example: Multi-Robot Object Lifting

Sample shot of tilt angle of the object after sufficient learning

[Figure: tilt angle (in degrees) vs. steps]


Conclusions, Ongoing Research, and Future Work


We have devised two different methods for structure learning in behavior-based systems.


Good results in two different tasks


Multi-robot Object Lifting


An Abstract Problem (not reported yet)



Conclusions, Ongoing Research, and Future Work


… but where should we find the necessary behaviors?!


Behavior Learning


We have devised some methods for behavior learning, which will be reported soon.


Conclusions, Ongoing Research, and Future Work


However, many steps remain before fully automated agent design


How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?


Problem of Reinforcement Signal Design


Designing a good reinforcement signal is not easy at all.

