Reinforcement Learning Applications in Robotics


Reinforcement Learning
Applications in Robotics

Gerhard Neumann,

Seminar A,

SS 2006

Overview


Policy Gradient Algorithms
RL for Quadruped Locomotion
PEGASUS Algorithm
Autonomous Helicopter Flight
High Speed Obstacle Avoidance
RL for Biped Locomotion
Poincare-Map RL
Dynamic Planning
Hierarchical Approach
RL for Acquisition of Robot Stand-Up Behavior

RL for Quadruped Locomotion [Kohl04]

Simple policy-gradient example
Optimize the gait of the Sony Aibo robot
Use a parameterized policy
12 parameters:
Front and rear locus (height, x-pos, y-pos)
Height of the front and the rear of the body

Quadruped Locomotion

Policy: no notion of state => open-loop control!
Start with an initial policy
Generate t = 15 random policies R_i
Evaluate the value of each policy on the real robot
Estimate the gradient for each parameter
Update the policy in the direction of the gradient (see the sketch below)
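A minimal sketch of the finite-difference policy-gradient loop described above, in the spirit of [Kohl04]: perturb each of the 12 gait parameters randomly, score the perturbed policies, and step along the estimated gradient. The function evaluate_walking_speed is a hypothetical stand-in for the automated speed measurement on the Aibos; the perturbation size eps, step size eta and iteration counts are illustrative assumptions.

```python
import random

def finite_difference_policy_gradient(theta, evaluate_walking_speed,
                                      eps=0.05, eta=2.0,
                                      num_policies=15, iterations=20):
    """Finite-difference policy gradient over a parameterized gait.

    `theta` holds the 12 gait parameters; `evaluate_walking_speed(params)`
    stands in for the automated speed measurement on the real robot.
    """
    theta = list(theta)
    for _ in range(iterations):
        # Generate t = 15 random policies R_i: each parameter of the current
        # policy is perturbed by -eps, 0, or +eps.
        perturbations = [[random.choice((-eps, 0.0, eps)) for _ in theta]
                         for _ in range(num_policies)]
        scores = [evaluate_walking_speed([t + d for t, d in zip(theta, p)])
                  for p in perturbations]

        # Estimate the gradient for each parameter by comparing the average
        # score of the policies that perturbed it by +eps, 0 and -eps.
        gradient = []
        for j in range(len(theta)):
            avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
            plus = avg([s for s, p in zip(scores, perturbations) if p[j] > 0])
            zero = avg([s for s, p in zip(scores, perturbations) if p[j] == 0])
            minus = avg([s for s, p in zip(scores, perturbations) if p[j] < 0])
            # Leave the parameter alone if the unperturbed value looks best.
            gradient.append(0.0 if zero >= max(plus, minus) else plus - minus)

        # Normalize and take a fixed-size step in the gradient direction.
        norm = sum(g * g for g in gradient) ** 0.5
        if norm > 0.0:
            theta = [t + eta * g / norm for t, g in zip(theta, gradient)]
    return theta
```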

Quadruped Locomotion

Estimation of the walking speed of a policy
Automated evaluation process using the Aibos
Each policy is evaluated 3 times
One iteration (3 x 15 evaluations) takes 7.5 minutes

Quadruped Gait: Results

Better than the best known gait for the Aibo!

PEGASUS [Ng00]

Policy gradient algorithms:
Use a finite time horizon, evaluate the value
The value of a policy in a stochastic environment is hard to estimate
=> stochastic optimization process

PEGASUS:
Use a fixed set of start states (scenarios) for all policy evaluation trials
Use "fixed randomization" for policy evaluation
Only works in simulation!
The same conditions hold for each evaluation trial
=> deterministic optimization process!
Can be solved by any optimization method
Commonly used: gradient ascent, random hill climbing (see the sketch below)
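The core trick fits in a few lines: fix the scenarios and the random seeds so the Monte-Carlo value estimate becomes a deterministic function of the policy parameters, then hand that function to any optimizer. The sketch below assumes a hypothetical simulate_episode(policy, start_state, rng, horizon) simulator interface; it illustrates the idea, not the implementation from [Ng00].

```python
import random

def make_deterministic_value(simulate_episode, scenarios, seeds, horizon=100):
    """Turn a stochastic policy evaluation into a deterministic function of
    the policy by always using the same start states and random seeds."""
    def value(policy):
        total = 0.0
        for start_state, seed in zip(scenarios, seeds):
            rng = random.Random(seed)          # "fixed randomization"
            total += simulate_episode(policy, start_state, rng, horizon)
        return total / len(scenarios)
    return value

def random_hill_climbing(value, policy, step=0.1, iterations=1000):
    """Any optimizer works on the now-deterministic objective; this is the
    random hill climbing mentioned on the slide."""
    best = value(policy)
    for _ in range(iterations):
        candidate = [p + random.gauss(0.0, step) for p in policy]
        v = value(candidate)
        if v > best:
            policy, best = candidate, v
    return policy, best
```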

Autonomous Helicopter Flight [Ng04a, Ng04b]

Autonomously learn to fly an unmanned helicopter
The helicopter costs $70,000 => catastrophic exploration must be avoided!
Learn the dynamics from observations of a human pilot

Use PEGASUS to:
Learn to hover
Learn to fly complex maneuvers
Inverted helicopter flight

Helicopter Flight: Model Identification

12-dimensional state space
World coordinates (position + rotation) + velocities
4-dimensional actions:
2 rotor-plane pitch controls
Rotor blade tilt
Tail rotor tilt
Actions are selected every 20 ms

Helicopter Flight: Model Identification

A human pilot flies the helicopter; the data is logged
391 s of training data
State reduced to 8 dimensions (position can be estimated from velocities)
Learn the transition probabilities P(s_{t+1} | s_t, a_t)
Supervised learning with locally weighted linear regression (see the sketch below)
Model Gaussian noise for the stochastic model
Implemented a simulator for model validation
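As a rough illustration of the model-identification step, the sketch below predicts the next state with locally weighted linear regression on the logged (state, action) pairs and fits a Gaussian noise level to the residuals. The bandwidth, ridge term and the assumed shapes of X and Y are illustrative, not values from [Ng04b].

```python
import numpy as np

def lwlr_predict(query, X, Y, tau=1.0, ridge=1e-6):
    """Predict the next state for one (state, action) query with locally
    weighted linear regression over the logged flight data.

    X: (N, d) logged state-action vectors, Y: (N, m) observed next states.
    """
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))              # weight by distance to query

    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias feature
    A = Xb.T @ np.diag(w) @ Xb + ridge * np.eye(Xb.shape[1])
    B = Xb.T @ np.diag(w) @ Y
    beta = np.linalg.solve(A, B)                    # weighted least squares
    return np.append(query, 1.0) @ beta             # predicted next state (mean)

def fit_noise_std(X, Y, tau=1.0):
    """Gaussian noise model: per-dimension std of the prediction residuals."""
    residuals = np.array([Y[i] - lwlr_predict(X[i], X, Y, tau)
                          for i in range(len(X))])
    return residuals.std(axis=0)
```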



Helicopter Flight: Hover Control

Desired hovering position is specified
Very simple policy class
Edges of the policy network are obtained from human prior knowledge
The policy essentially learns linear gains of the controller
Quadratic reward function:
Punishes deviation from the desired position and orientation (see the sketch below)
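A hedged sketch of a quadratic reward of the kind described above: a weighted squared penalty on the deviation from the desired hover state. The state layout and the weights are illustrative assumptions, not the values used in [Ng04b].

```python
import numpy as np

def hover_reward(state, desired, weights=None):
    """Quadratic hover reward: negative weighted squared deviation from the
    desired position/orientation (and, optionally, velocities)."""
    error = np.asarray(state, dtype=float) - np.asarray(desired, dtype=float)
    if weights is None:
        weights = np.ones_like(error)          # equal weighting as a default
    return -float(np.sum(weights * error ** 2))
```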

Helicopter Flight: Hover Control

Results:
Better performance than the human expert pilot (shown in red in the original plot)

Helicopter Flight: Flying Maneuvers

Fly 3 maneuvers from the most difficult RC helicopter competition class
Trajectory following:
Punish the distance from the projected point on the trajectory
Additional reward for making progress along the trajectory



Helicopter Flight: Results

Videos: Video 1, Video 2

Helicopter Flight: Inverted Flight

Very difficult for human pilots
Unstable!
Recollect data for inverted flight
Use the same methods as before
Learned in 4 days, from data collection to flight experiment!
Stable inverted flight controller with sustained position

Video

High Speed Obstacle Avoidance [Michels05]

Obstacle avoidance with an RC car in unstructured environments
Estimate depth information from monocular cues
Learn a controller for obstacle avoidance with PEGASUS
Trained in a graphical simulation: does it work in the real environment?


Estimate depth information:
Supervised learning
Divide the image into 16 vertical stripes
Use the features of a stripe and its neighboring stripes as the input vector
Target values (shortest distance within a stripe) come either from the simulation or from laser range finders
Linear regression (see the sketch below)

Output of the vision system:
Angle of the stripe with the largest distance
Distance of that stripe
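The depth estimator and the resulting steering signal might look roughly like the sketch below: per-stripe features (a hypothetical placeholder here), linear regression to the shortest distance in each stripe, and the angle of the stripe with the largest predicted distance as output. The feature choice, log-distance targets and field-of-view mapping are assumptions, not the exact setup of [Michels05].

```python
import numpy as np

NUM_STRIPES = 16

def stripe_features(image, i):
    """Hypothetical per-stripe features; the paper uses texture statistics."""
    w = image.shape[1]
    stripe = image[:, i * w // NUM_STRIPES:(i + 1) * w // NUM_STRIPES]
    return np.array([stripe.mean(), stripe.std()])

def build_input(image, i):
    # Features of the stripe and its neighboring stripes form the input vector.
    idx = [max(0, i - 1), i, min(NUM_STRIPES - 1, i + 1)]
    return np.concatenate([stripe_features(image, j) for j in idx] + [[1.0]])

def fit_depth_model(images, distances):
    """Linear regression from stripe features to log shortest distance;
    `distances[k][i]` is the target for stripe i of image k (from the
    simulator or a laser range finder)."""
    X = np.array([build_input(img, i)
                  for img in images for i in range(NUM_STRIPES)])
    y = np.array([np.log(dist[i])
                  for dist in distances for i in range(NUM_STRIPES)])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def vision_output(image, w, fov_deg=60.0):
    """Angle of the stripe with the largest predicted distance + distance."""
    preds = [build_input(image, i) @ w for i in range(NUM_STRIPES)]
    best = int(np.argmax(preds))
    angle = (best + 0.5) / NUM_STRIPES * fov_deg - fov_deg / 2.0
    return angle, float(np.exp(preds[best]))
```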


Obstacle Avoidance: Control

Policy: 6 parameters
Again, a very simple policy is used

Reward:
Deviation from the desired speed, number of crashes

Obstacle Avoidance: Results

Using a graphical simulation to train the vision system also works for outdoor environments

Video

RL for Biped Robots

Often used only for simplified planar models
Poincare-Map based RL [Morimoto04]
Dynamic Planning [Stilman05]
Other examples of RL on real robots:
Strongly simplify the problem: [Zhou03]

Poincare Map-Based RL

Improve walking controllers with RL
Poincare map: intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane
Predict the state of the biped half a cycle ahead at the two section phases

Learn the mapping:
Input space: x = (d, d'), the distance between the stance foot and the body and its velocity
Action space: modulate the via-points of the joint trajectories
Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid (see the sketch below for collecting the section data)
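Collecting the training data for such a Poincare map can be sketched as follows: detect the crossings of a phase variable with the chosen section, interpolate the state at each crossing, and pair each section state (plus the via-point modulation applied there) with the state half a cycle later. The crossing test and linear interpolation are generic assumptions, not code from [Morimoto04].

```python
import numpy as np

def section_crossings(phases, states, section_phase):
    """Interpolate the state at every upward crossing of `section_phase`.

    `phases` is a 1-D array of the phase variable over time and `states`
    an array of state vectors sampled at the same instants.
    """
    crossings = []
    for k in range(len(phases) - 1):
        a, b = phases[k], phases[k + 1]
        if a < section_phase <= b:
            t = (section_phase - a) / (b - a)      # linear interpolation
            crossings.append((1 - t) * states[k] + t * states[k + 1])
    return np.array(crossings)

def build_poincare_dataset(phases, states, actions, section_phase):
    """Pair each section state (and the via-point modulation `actions[k]`
    applied there) with the state at the next crossing, half a cycle later."""
    xs = section_crossings(phases, states, section_phase)
    inputs = np.array([np.append(xs[k], actions[k]) for k in range(len(xs) - 1)])
    targets = np.array([xs[k + 1] for k in range(len(xs) - 1)])
    return inputs, targets
```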

Via Points

Nominal trajectories are taken from human walking patterns
The control output is used to modulate the via-points (marked with circles)
Hand-selected via-points
Via-points of one joint are incremented by the same amount

Learning the Value Function

Reward function:
+0.1 if the height of the robot is > 0.35 m
-0.1 otherwise
Standard semi-MDP update rules
Only the value function at the two section phases needs to be learned
Model-based actor-critic approach (A … actor)
Update rule:
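As a rough illustration of a semi-MDP value backup of the kind mentioned above, the following is a generic tabular TD(0) update over Poincare-section states; the interface and all parameter values are assumptions, and this is not the update rule from [Morimoto04].

```python
def smdp_td_update(V, cell, next_cell, reward_sum, tau, alpha=0.1, gamma=0.99):
    """One semi-MDP TD(0) backup for a tabular value function over grid
    cells of the section state x = (d, d').

    `reward_sum` is the reward accumulated over the half cycle of length
    `tau` between the two section crossings.
    """
    target = reward_sum + (gamma ** tau) * V.get(next_cell, 0.0)
    td_error = target - V.get(cell, 0.0)
    V[cell] = V.get(cell, 0.0) + alpha * td_error
    return td_error
```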


Results:
Stable walking performance after 80 trials
Figures: beginning of learning, end of learning

Dynamic Programming for Biped Locomotion [Stilman05]

4-link planar robot
Dynamic programming for reduced-dimensional spaces
Manual temporal decomposition of the problem into phases of single and double support
Use intuitive reductions of the state space for both phases

State-Increment Dynamic Programming

8-dimensional state space
Discretize the state space with a coarse grid
Use dynamic programming (see the sketch below)
The interval ε is defined as the minimum time interval required for any state index to change
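A hedged sketch of the dynamic-programming backup over such a coarse grid: from every cell and discretized action the model is advanced until a state index changes, and the value function is backed up over that transition. The step_model function (returning next cell, reward and elapsed time) is a hypothetical stand-in for the robot model and discretization used in [Stilman05].

```python
def value_iteration(cells, actions, step_model, gamma=0.99,
                    sweeps=100, tol=1e-4):
    """Dynamic programming over a coarse grid.

    `step_model(cell, action)` advances the model until a state index
    changes and returns (next_cell, reward, elapsed_time).
    """
    V = {c: 0.0 for c in cells}

    def q_value(c, a):
        next_cell, reward, dt = step_model(c, a)
        return reward + (gamma ** dt) * V[next_cell]

    for _ in range(sweeps):
        delta = 0.0
        for c in cells:
            new_v = max(q_value(c, a) for a in actions)
            delta = max(delta, abs(new_v - V[c]))
            V[c] = new_v
        if delta < tol:                      # stop when values have converged
            break

    # Greedy policy with respect to the converged value function.
    policy = {c: max(actions, key=lambda a: q_value(c, a)) for c in cells}
    return V, policy
```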

State Space Considerations

Decompose into 2 state space components (DS + SS)
There are important distinctions between the dynamics of DS and SS
Periodic system:
DP cannot be applied separately to the state space components
Establish a mapping between the components for the DS and SS transitions

State Space Reduction

Double support:
Constant step length (d_f)
Cannot change during DS
Can change after the robot completes SS
Equivalent to a 5-bar linkage model
The entire state space can be described by 2 DoF (use k_1 and k_2)
5-D state space
10x16x16x12x12 grid => 368640 states

State Space Reduction

Single support:
Compass 2-link model
Assume k_1 and k_2 are constant
The stance knee angle k_1 has a small range in human walking
The swing knee angle k_2 has a strong effect on d_f, but can be prescribed in accordance with h_2 with little effect on the robot's CoM
4-D state space
35x35x18x18 grid => 396900 states

State-Space Reduction

Phase transitions:
The DS to SS transition occurs when the rear foot leaves the ground
Mapping:
The SS to DS transition occurs when the swing leg makes contact
Mapping:

Action Space, Rewards

Use discretized torques
DS: the hip and both knee joints can accelerate the CoM
Fix the hip action to zero to gain better resolution for the knee joints
Discretize the 2-D action space from ±5.4 Nm into 7x7 intervals
SS: only the hip torque is chosen
17 intervals in the range of ±1.8 Nm
States x actions: 398640x49 + 396900x17 = 26280660 cells (!!)

Reward:




Results

11 hours of computation
The computed policy locates a limit cycle through the state space.

Performance under Error

Alter different properties of the robot in simulation
Do not relearn the policy
A wide range of disturbances is accepted
Even if the used model of the dynamics is incorrect!
The wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle

Learning of a Stand-Up Behavior [Morimoto00]

Learning to stand up with a 3-link planar robot
6-D state space: angles + velocities
Hierarchical reinforcement learning
Task decomposition by sub-goals
Decompose the task into:
A non-linear problem in a lower-dimensional space
A nearly-linear problem in a high-dimensional space

Upper-Level Learning

Coarse discretization of postures
No speed information in the state space (3-D state space)
Actions: select the next sub-goal posture

Upper-Level Learning

Reward function:
Reward the success of standing up
Reward also the success of reaching a sub-goal
Choosing sub-goals that are easier to reach from the current state is preferred
Use Q(λ)-learning to learn the sequence of sub-goals (see the sketch below)
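A hedged sketch of tabular Q(λ)-learning over the coarse posture discretization, with sub-goal postures as actions. The env interface (reset/step driven by the lower level) and all parameter values are illustrative assumptions, not taken from [Morimoto00].

```python
import random
from collections import defaultdict

def q_lambda(env, subgoals, episodes=500, alpha=0.2,
             gamma=0.95, lam=0.9, epsilon=0.1):
    """Tabular Q(lambda) over discretized postures, with sub-goal postures
    as actions. `env.reset()` returns a posture index; `env.step(subgoal)`
    lets the lower level try to reach the sub-goal and returns
    (next_posture, reward, done)."""
    Q = defaultdict(float)                     # Q[(posture, subgoal)]
    for _ in range(episodes):
        e = defaultdict(float)                 # eligibility traces
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy selection of the next sub-goal
            if random.random() < epsilon:
                a = random.choice(subgoals)
            else:
                a = max(subgoals, key=lambda g: Q[(s, g)])
            s2, r, done = env.step(a)

            best_next = max(Q[(s2, g)] for g in subgoals)
            delta = r + (0.0 if done else gamma * best_next) - Q[(s, a)]
            greedy = Q[(s, a)] == max(Q[(s, g)] for g in subgoals)
            e[(s, a)] += 1.0

            for key in list(e):
                Q[key] += alpha * delta * e[key]
                # Watkins-style Q(lambda): cut traces after exploratory steps.
                e[key] = gamma * lam * e[key] if greedy else 0.0
            s = s2
    return Q
```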

Lower-Level Learning

The lower level is free to choose at which speed to reach the sub-goal (desired posture)
6-D state space
Use Incremental Normalized Gaussian networks (ING-nets) as function approximators
RBF network with a rule for allocating new RBF centers
Action space: the torque vector

Lower-Level Learning

Reward:
-1.5 if the robot falls down
Continuous-time actor-critic learning [Doya99]
Actor and critic are learned with ING-nets (see the sketch below)
Control output: combination of a linear servo controller and a non-linear feedback controller
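A hedged sketch of an incremental normalized Gaussian network of the kind mentioned above: a normalized RBF network that allocates a new center whenever no existing basis function responds strongly enough to the input. The width, threshold and learning rate are assumptions. Both the actor and the critic can be represented by such a network, with the actor's output added to the linear servo controller as described on the slide.

```python
import numpy as np

class INGNet:
    """Normalized RBF network that allocates new centers on demand."""

    def __init__(self, width=0.3, activation_threshold=0.4, lr=0.1):
        self.centers, self.weights = [], []
        self.width, self.threshold, self.lr = width, activation_threshold, lr

    def _raw_activations(self, x):
        return np.array([np.exp(-np.sum((x - c) ** 2) / (2 * self.width ** 2))
                         for c in self.centers])

    def predict(self, x):
        raw = self._raw_activations(np.asarray(x, dtype=float))
        if raw.size == 0 or raw.sum() == 0.0:
            return 0.0
        return float((raw / raw.sum()) @ np.array(self.weights))

    def update(self, x, target):
        x = np.asarray(x, dtype=float)
        raw = self._raw_activations(x)
        # Allocate a new basis function if no existing one responds enough.
        if raw.size == 0 or raw.max() < self.threshold:
            self.centers.append(x.copy())
            self.weights.append(float(target))
            return
        a = raw / raw.sum()                        # normalized activations
        error = float(target) - float(a @ np.array(self.weights))
        for i in range(len(self.weights)):         # gradient step on weights
            self.weights[i] += self.lr * error * a[i]
```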



Results

Simulation results:
The hierarchical architecture learns about 2x faster than the plain architecture

Real robot:
Videos: before learning, during learning, after learning
Learned on average in 749 trials (7/10 learning runs succeeded)
Used 4.3 sub-goals on average


The End

For people who are interested in using RL:
RL-Toolbox: www.igi.tu-graz.ac.at/ril-toolbox

Thank you

Literature

[Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005

[Ng00] PEGASUS: A policy search method for large MDPs and POMDPs, A. Ng and M. Jordan, 2000

[Ng04a] Autonomous inverted helicopter flight via reinforcement learning, A. Ng et al., 2004

[Ng04b] Autonomous helicopter flight via reinforcement learning, A. Ng et al., 2004

[Michels05] High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005

[Morimoto04] A Simple Reinforcement Learning Algorithm For Biped Walking, J. Morimoto and C. Atkeson, 2004

Literature

[Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005

[Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000

[Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998

[Zhou03] Dynamic Balance of a Biped Robot using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003

[Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999