Project report


RL Project


By:

Eran Harel

Ran Tavori


Instructor:

Ishai Menache











Project taken at the software lab of E.E., Technion, winter semester 2002


Abstract

Reinforcement Learning (RL) is the process of iteratively learning how to achieve the best performance by the method of trial and error.

The general purpose of the project is to implement test-bed software for various RL algorithms in different domains. The next stage is then to implement a couple of RL algorithms and a couple of environments and test them all.

The following document describes the requirements of the project, presents
a comprehensive overview of RL algorithms, describes the design of the
implemented software and presents the outcome of the tests we have
made.


Table of Contents

Introduction
Requirements
RL Algorithms
    Value Functions
    TD
    Q-Learning
    Exploration vs. Exploitation: ε-Greedy
Function Approximation
    The Feature Space
    RBF
    In our case
Design
    The Three Main Interfaces
        ISimulation interface
        IAgent interface
        IEnvironment Interface
    Additional Interfaces
        ISensation Interface
        IAction Interface
    Package Overview
        Simulation Package
        Agent Package
        Environment Package
Implemented Environments
    The Maze Environment
    The PredatorAndPrayEnvironment
Experimental results
    Maze Environment Results
    Predator And Pray Environment Results
Operating manual
    Environment editor
    The simulation type chooser
    The simulation main window
Improvements that we have not implemented
List of Figures
Bibliography


Introduction

The following document describes the details of the software implemented in the project.

We first describe the requirements of the project.

Then we present a short theoretical introduction to RL and to function
approximation.

We then dive into the details of the software design, explaining every important aspect of the platform. In this part only abstract explanations are given, leaving the technical details to the documentation that is auto-generated from the code by Java.

After the design overview we present the outcome of the test runs that we made, including graphs that measure performance.

Then we present a user manual that details the various features of the software and how to use them.

Finally, we list a few of the interesting fields of investigation and open questions that we ran into, which we would have investigated further had time allowed.



Requirements

1. The main target of the project is to create a test bed for various agents, in different environments, that use reinforcement learning.

2. The project will include a platform for testing the agents in the different environments (henceforth: the platform):

    2.1. The platform must define a common interface for an environment.

    2.2. The platform must define a common interface for an agent.

    2.3. The platform must implement simulation software that gets an environment and an agent as input and runs a simulation based on that input.

3. The project will include test applications of agents and environments:

    3.1. Agents:

        3.1.1. Tabular Q-agent
        3.1.2. Tabular sarsa(λ) agent
        3.1.3. Tabular TD(λ) agent
        3.1.4. TD(λ) agent that uses function approximation

    3.2. Environments:

        3.2.1. A maze.
        3.2.2. Predator and prey


Definitions

Agent

An agent is an entity that resides in an environment and interacts with it. An example of this might be a chess player: the player is the agent whereas the board is the environment. In this example, the other player might be considered another agent, but it might also be considered part of the environment, since, as far as the first agent knows, the board and the other player are both external to it; hence they both might be considered the environment as a whole. In the software we wrote, this kind of abstraction is actually used.

Environment

The environment is the complement of the agent. An environment is considered to be everything that is external to the agent. Put another way, the environment is what the agent cannot control.

State

An agent has a state in the environment. For example, the state might be the location of the agent on a grid.

Action

For each state there is a set of actions that might be taken when the agent is in that state.

Goal


Usually the agent interacts with the environment in order to achieve a goal.
A goal might be, for example, winning a chess game.


Episode

An episode is a series of states, in which the first state is an initial state and the last state is a terminal state. The terminal state is usually a goal.

Policy

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

Reward Function

A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps perceived states (or state-action pairs) of the environment to a single number, a reward, indicating the intrinsic desirability of the state. A reinforcement-learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what the good and bad events are for the agent. The reward function must necessarily be fixed. It may, however, be used as a basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward functions may also be stochastic.




Note: the definitions in this section are taken from [1].


RL Algorithms

When speaking about artificial intelligence algorithms, one can divide the algorithms into two main classes:

• Domain-specific algorithms
• General algorithms

Surely this division is not accurate, but for the purpose of the discussion we will keep it.

One might expect domain-specific algorithms to have better performance in their domain than any general algorithm, simply because they were specially suited to fit the domain. On the other hand, they would not work in any other domain. Furthermore, in order to fit an algorithm to a certain domain, one must have some advance knowledge about this domain, and this knowledge is not always present.

An example of a domain-specific algorithm is the A* algorithm for finding a path in a maze. Without getting into the details of the algorithm, it is sufficient to say that A* has been formally proven to find the optimal route in the maze, with time complexity of O(n²) and space complexity of O(n) in the length of the path.


As for RL algorithms, they all belong to the second class, the general algorithms. They are not domain specific and they can be applied to any domain that satisfies a small set of requirements:

• When the algorithm does something "right" it gets a reward.
• When the algorithm does something "wrong" it gets a negative reward.


So the essence of all RL algorithms is the method of trial and error: let the algorithm try to solve the problem; if it succeeds, give it a reward so that it can sense that it has done something right; and if it does not succeed, give it a negative reward so that it will refrain from doing that in the future.

This kind of approach is obviously inefficient in many cases, but, on the other hand, it might be the only acceptable method in some cases. Where there is no prior knowledge of the domain, this is the only acceptable method.




Note: the overview of the RL algorithms is according to [1] and [3].



In this project we had to solve two problems: the maze problem and the predator and prey problem. We could have implemented domain-specific algorithms for solving these problems; however, we preferred to implement some general RL algorithms on top of them so that we would be able to learn and test the capabilities of such algorithms.


In the following discussion we will present a short introduction to some basic RL algorithms, including the algorithms that were implemented in the project.

Value Functions

Let us start by defining the term value function. A value function is a mapping of states to real values: V: S → R. Here S represents the set of possible states in the given domain. For example, if the domain were a grid of size m by n, then S would be a set of size m·n that includes all the states on the grid.

The essence of defining a value function is to be able to estimate the goodness of a state. The logic is pretty simple: if an agent is in a state that is worse than a neighboring state, it would move to the other state, so that it can make its situation better.

Of course, the real issue here is how we get this value function. Since we assume no prior knowledge of the domain, we must also assume that at the beginning we have no value function at all. So what we will have to do is build the value function step by step using the method of trial and error.
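As an illustration of what such a tabular value function could look like in code, here is a minimal Java sketch (written in modern Java for brevity; the class and method names are ours and not part of the platform described later): a map from state IDs to estimated values, with unvisited states defaulting to zero.

    import java.util.HashMap;
    import java.util.Map;

    // A minimal tabular value function: state ID -> estimated value.
    // States that were never visited default to a value of 0.
    public class TabularValueFunction {
        private final Map<String, Double> values = new HashMap<>();

        public double get(String stateId) {
            return values.getOrDefault(stateId, 0.0);
        }

        public void set(String stateId, double value) {
            values.put(stateId, value);
        }
    }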

Markov Decision Process

A process can be described as a sequence of states. The agent-environment interaction can be treated as such a process, in which, given the current state, the next state is determined by the agent's action, by the environment (that is, the current state and maybe some other factors) and, optionally, by some stochastic function of them all.

A process is said to be Markovian if it has the following property: the next state of the process is determined exclusively by the current state, the action taken, and optionally some stochastic function of them.

A simple example of an MDP (Markov Decision Process) would be an automaton. If the MDP is deterministic, then the automaton would be deterministic too, and if the MDP is stochastic, the automaton would be a nondeterministic automaton.



This definition of an MDP will serve us in the following text.

TD

One method of estimating the value function is the method of TD (Temporal
Difference).

The idea behind TD methods is:

• given the current estimate of V(s),
• and given the reward r, after taking an action a from s,
• and given the next state s', after taking an action a from s,

then V(s) gets updated using the following rule:



V(s) ← V(s) + α [ r + γ V(s') - V(s) ]

Equation 1: update rule for TD


where:

• α is a parameter called the learning rate (0 ≤ α ≤ 1)
• γ is a parameter called the discount factor (0 ≤ γ ≤ 1)


The main idea of this update rule is: the value of s equals the value of the reward given by taking an action from s plus the (discounted) value of the next state.

This rule is used iteratively when trying to estimate the value function of a given policy. A whole algorithm that uses this rule is presented in Figure 1.


    Initialize V(s) arbitrarily, and π to the policy to be evaluated
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of the episode):
            a ← action given by π for s
            Take action a; observe reward r and next state s'
            V(s) ← V(s) + α [ r + γ V(s') - V(s) ]
            s ← s'
        until s is terminal

Figure 1: TD algorithm for estimating Vπ



Here π is the policy to be evaluated. This algorithm is used to estimate V(s) given the policy π, which maps states to actions: π: S → A.


The algorithm estimates Vπ for a given π. So now what we have to do is make π the optimal policy. How do we make π the optimal policy? If we knew V, this would be a simple job: π is the policy that maps a state to the action that yields the best (expected) next state. And what is the best next state? This is given to us by V. So, if we knew V, this would be simple... but we do have a clue of what V is: this is exactly what the algorithm in Figure 1 calculates. So we can use V in order to improve our policy!

After understanding the last paragraph, what we can do is the following: estimate Vπ using the presented algorithm, and then use Vπ in order to improve π.

This method is known to actually work, and it even has a formal proof of convergence to the optimal policy!

There is a little more to TD than what is presented here, but for the purpose of this overview it should be enough.
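To make the update rule concrete, here is a minimal Java sketch of a single tabular TD(0) update, using the TabularValueFunction map from the earlier sketch; the method and parameter names (tdUpdate, alpha, gamma) are ours and not part of the platform interfaces described later.

    // One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    // 'values' is the tabular value function sketched earlier; a terminal
    // next state (nextStateId == null) is treated as having value 0.
    public static void tdUpdate(TabularValueFunction values,
                                String stateId, String nextStateId,
                                double reward, double alpha, double gamma) {
        double current = values.get(stateId);
        double next = (nextStateId == null) ? 0.0 : values.get(nextStateId);
        double tdError = reward + gamma * next - current;
        values.set(stateId, current + alpha * tdError);
    }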

Q-Learning

We have introduced the usage of the value function, which maps states to real values: V: S → R. In this section we will introduce a slightly different approach.

But before doing so, let us first understand what the problem with the value-function method is. In this method there is an implicit assumption that the agent has some model of the environment. What does that mean? It means that when the agent has to choose the action to be taken from state s, it uses the value function V in order to decide what the best action to take is. But how exactly does it use the value function in order to make this decision? This is where the implicit assumption of a model of the environment lies: the agent has to take an action that will result in the best reward and the best next state. But how does it know what the next state is going to be if it takes an action a from state s? For this it needs a model of the environment. A model of the environment is a mapping from the pair (state, action) to the next state: M: S × A → S. So, given the current state s, a model of the environment determines what the next state s' will be if we were to take an action a.

So, by using the method of value functions the agent not only has to build a value function, but it also has to have a model of the environment. Such a model is not always available. Sometimes the agent might have a model of the environment in advance; for example, if the environment is a simple grid, then the agent can tell that if it is now at position (3,4), by taking the action SOUTH it would get to position (3,5). But this is not always true: what if there was an obstacle at this position? And what if there was a wind that carried the agent to a different position? Well, sometimes the agent just has to build the model of its environment dynamically.

Building a model of the environment is not always simple either. We have introduced a model of an environment that is deterministic: M: S × A → S. But what if the environment is not deterministic? Then we would have to use a different, stochastic model: P(s' | s, a) = p. This means that the probability of getting to state s' by taking action a while being at state s is p.


To summarize: when using the value-function method it is not sufficient for the agent to learn the value function; it must also have a model of the environment, which is not always easily feasible. This is where the next method comes in handy.


We now introduce the method of Q-learning. This method does not require an explicit model of the environment; rather, it implicitly builds one.

Recall that a value function maps states to real values. As opposed to that, Q functions map a pair (state, action) to real values: Q: S × A → R.

What is the meaning of this mapping? Well, the meaning is quite intuitive and straightforward: Q(s,a) measures the goodness of taking action a while being in state s.

If the agent uses such a mapping, then all it has to do, if it is in state s, is take the action a that produces the maximal value of Q(s,a), hence:

π(s) = argmax_a Q(s,a)

Equation 2: usage of a policy π that is derived from Q


The update rule for Q is the following rule:

Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]

Equation 3: update rule for Q


where:

• α is the learning rate (0 ≤ α ≤ 1)
• γ is the discount factor (0 ≤ γ ≤ 1)


This rule resembles the update rule for V, presented in Equation 1, except that this rule applies to the Q function. For this reason it also uses a max, while the update rule for V does not.

An algorithm that uses the rule is presented in the next figure.


    Initialize Q(s,a) arbitrarily
    Repeat (for each episode):
        Initialize s
        Repeat (for each step of the episode):
            Choose a from s using a policy derived from Q (e.g. ε-greedy)
            Take action a, observe the reward r and the next state s'
            Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            s ← s'
        until s is terminal

Figure 2: Q-learning algorithm


There is one important issue about the algorithm that has not been introduced yet, and that is the ε-greedy issue. This will be covered in the following section, but in short, the line that says "choose a from s using a policy derived from Q (e.g. ε-greedy)" means that the agent should choose an action a that maximizes Q(s,a), just as presented in Equation 2, but with probability ε it must choose a random action.
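As an illustration, here is a minimal Java sketch of a single tabular Q-learning update (Equation 3). The QTable class, its key/get/update methods, and the string keys are our own constructions for this sketch, not the project's actual classes.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One tabular Q-learning update:
    // Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    public class QTable {
        private final Map<String, Double> q = new HashMap<>();

        private String key(String stateId, String actionId) {
            return stateId + "|" + actionId;
        }

        public double get(String stateId, String actionId) {
            return q.getOrDefault(key(stateId, actionId), 0.0);
        }

        public void update(String stateId, String actionId, double reward,
                           String nextStateId, List<String> nextActions,
                           double alpha, double gamma) {
            double best = 0.0;                      // a terminal next state contributes 0
            if (nextStateId != null) {
                best = Double.NEGATIVE_INFINITY;
                for (String a : nextActions) {      // max over the actions applicable in s'
                    best = Math.max(best, get(nextStateId, a));
                }
            }
            double current = get(stateId, actionId);
            q.put(key(stateId, actionId), current + alpha * (reward + gamma * best - current));
        }
    }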


At the beginning of this section we discussed the advantages of calculating a Q function over a V function. We said that in cases where the agent does not have a model of its environment, or where it is difficult to create one, it is possible to use Q-learning, which does not require one.

But what are the disadvantages of the Q-learning method?

The answer to that lies in the complexity of the state space. The size of the domain of a V function is the size of the state space, |S| (recall that V: S → R), whilst the size of the domain of a Q function is |S × A| (recall that Q: S × A → R).


This implies 2 main problems:

• Since the size of Q is greater than the size of V, in order to store a Q function we need more memory capacity; hence the memory complexity increases by a factor of |A|.

• In order to compute V(s) for a given state s, or Q(s,a) for a given state s and a given action a, one has to make an infinite number of visits to s, or to s taking action a, respectively. We will not present a formal proof of this statement, but it can be understood intuitively by seeing that the update rule for both Q and V is iterative, so in order for it to converge to the correct value one must make an infinite number of updates. By the same logic, in order for V and Q to converge to the correct values with arbitrary precision, both have to make an arbitrary number of updates for each element of the function. After seeing this it is easy to see that since V has fewer elements than Q, V will probably converge faster than Q.


To summarize, let us say that both the Q and V methods have their pros and cons. As a rule of thumb, one should use a value function when it is possible to construct a good-enough model of the environment. Of course, when it is impossible to obtain such a model, there is no other choice but to use a Q function. But in the twilight zone, where it is possible to construct a model of the environment but there is a chance that this model might be wrong, or it is very difficult to do so, one must try both methods and come up with the method that best suits the domain.

In the software that we wrote we used both methods.

There are many algorithms that use Q-learning as their base method. In the project we used 2 of them: the one presented in Figure 2 and an algorithm called sarsa, which is presented next.


    Initialize Q(s,a) arbitrarily
    Repeat (for each episode):
        Initialize s
        Choose a from s using a policy derived from Q (e.g. ε-greedy)
        Repeat (for each step of the episode):
            Take action a, observe r, s'
            Choose a' from s' using a policy derived from Q (e.g. ε-greedy)
            Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') - Q(s,a) ]
            s ← s'; a ← a'
        until s is terminal

Figure 3: sarsa algorithm

The main difference between the sarsa algorithm and the algorithm presented in Figure 2 is the update rule.
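For comparison, here is a sketch of the sarsa update, written as an extra method of the same hypothetical QTable class used above: instead of maximizing over the next state's actions, it uses the action a' that was actually chosen for s'.

    // One sarsa update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).
    // A terminal next state (nextStateId == null) contributes 0.
    public void sarsaUpdate(String stateId, String actionId, double reward,
                            String nextStateId, String nextActionId,
                            double alpha, double gamma) {
        double next = (nextStateId == null) ? 0.0 : get(nextStateId, nextActionId);
        double current = get(stateId, actionId);
        q.put(key(stateId, actionId), current + alpha * (reward + gamma * next - current));
    }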

Exploration vs. Exploitation: ε-Greedy

We have already mentioned the idea of choosing an action according to a policy that does not necessarily pick the best action: choosing a random action with probability ε and choosing the best action with probability 1-ε.

The question that needs to be asked is: what is it good for? Why not always choose the best action possible, and by that increase the reward? The answer lies in the fact that the policy, which we assume is using a Q function just for the sake of the discussion, does not necessarily have the correct Q function. Moreover, it almost always does not have the correct function. We expect it to have an estimated value of Q, and we use this value to choose the next action. What if this value is not correct?

Let us look at a simple example: suppose that we have a state s and an action a for which we have Q(s,a)=1, and some other action a', where a'≠a, for which Q(s,a')=0. Recall that Q is not necessarily the correct function and is just an estimate. Now suppose that by taking action a' we would get a higher reward than the reward gained by taking action a. But if we always choose the action that maximizes Q, we would always choose a. So how can we learn that taking action a' would lead us to a higher reward than the reward gained by taking action a? In order for us to learn that, we would have to take a' at least once. As a matter of fact, in order for us to know the true value of Q(s,a') we would have to take a' from s an infinite number of times. And this is where the ε-greedy approach comes in handy. Using this approach, we always choose a random action with probability ε; hence we will eventually end up having taken all actions from all states an infinite number of times, and by that we can surely get a proper estimation of Q(s,a) for each s and a.


Generally speaking, the ε-greedy approach introduces the dilemma of exploration vs. exploitation. What this means is that if we were to set ε=0, hence using a perfectly greedy policy, we would be exploiting the Q function (or the value function), and if we were to set ε=1 we would be exploring all the time and not collecting the rewards, even though we do know which actions are good and which are bad. So the dilemma here is when to explore, by taking a random action, and when to exploit, by taking the best action.

This dilemma is solved in the ε-greedy approach by taking a random action with probability ε and taking the best action with probability 1-ε. This method was used in the software we wrote.
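A minimal Java sketch of ε-greedy action selection over the hypothetical QTable introduced above (the method name and the use of java.util.Random are our own illustrative choices):

    import java.util.List;
    import java.util.Random;

    // With probability epsilon pick a random action, otherwise pick the
    // action with the highest Q(s,a) among the actions applicable in s.
    public static String epsilonGreedy(QTable q, String stateId,
                                       List<String> actions, double epsilon, Random rnd) {
        if (rnd.nextDouble() < epsilon) {
            return actions.get(rnd.nextInt(actions.size()));   // explore
        }
        String best = actions.get(0);                           // exploit
        for (String a : actions) {
            if (q.get(stateId, a) > q.get(stateId, best)) {
                best = a;
            }
        }
        return best;
    }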

The ε-greedy approach is also helpful in a situation where the environment is dynamically (but slowly) changing, for example a maze that has moving walls. In this case there is generally no Q function that is correct all the time: even if we were to learn a perfect Q function that applies to the current environment, once the environment changes the function is not perfect anymore. The means for the policy to adjust itself to the changing environment is to explore from time to time and to see that "it is not missing anything"; e.g. one acceptable approach for dealing with such cases is using an ε-greedy algorithm.

In our implementation ε had a constant value (usually 0.1). But it is also possible for an algorithm to have an ε that changes. The decision of how and when to change ε is related to the rate of getting rewards, of getting to the goal. It is a usual practice to have ε=ε(t), i.e. ε given as a function of time, usually a decreasing function of t, which means much exploration at the beginning and less exploration later.
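As an illustration of such a schedule (the constants are arbitrary and not taken from our implementation, which used a constant ε), a simple exponentially decaying ε(t) could look like this:

    // epsilon(t) = max(epsilonMin, epsilon0 * decay^t): starts at epsilon0
    // and decays towards epsilonMin as the step count t grows.
    public static double epsilonAt(long t, double epsilon0, double epsilonMin, double decay) {
        return Math.max(epsilonMin, epsilon0 * Math.pow(decay, t));
    }

    // Example: epsilonAt(t, 0.5, 0.05, 0.999) explores a lot early on
    // and settles near 0.05 after a few thousand steps.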


Function Approximation

Note: the overview of function approximation in this section is according to [1], [2], [4], [5] and [6].

We have so far assumed that our estimates of value functions are represented as a table with one entry for each state or for each state-action pair. This is a particularly clear and instructive case, but of course it is limited to tasks with small numbers of states and actions. The problem is not just the memory needed for large tables, but the time and data needed to fill them accurately. In other words, the key issue is that of generalization. How can experience with a limited subset of the state space be usefully generalized to produce a good approximation over a much larger subset?

This is a severe problem. In many tasks to which we would like to apply Reinforcement Learning, most states encountered will never have been experienced exactly before. This will almost always be the case when the state or action spaces include continuous variables or a large number of sensors, such as a visual image. The only way to learn anything at all in these tasks is to generalize from previously experienced states to ones that have never been seen.

To summarize, function approximation (of both Q and V functions) is used to confront 2 main issues:

• Memory capacity: the usage of function approximation usually consumes less memory.
• Convergence speed and generalization: if two (or more) similar states are visited, then the update of one results in the update of the other, due to the generalization achieved by the function approximation.


There are many methods for function approximation. To name a few: neural networks, decision and classification trees, nearest neighbor, k-nearest neighbor, radial basis functions, tile coding, CMAC, etc.

In the project we have implemented a linear function approximation called RBF (Radial Basis Functions). The following section describes this work.

The Feature Space

A notion that is common to all function approximations is the notion of the feature space.

A feature is a measurement of some property of the state. One can think of a feature as an abstraction of the state. An example feature would be the Manhattan distance on a grid. Yet another example would be the number of pieces in a Checkers game. A proper feature must be related to the goodness of the state. Moreover, if a feature is not related to the goodness of the state it is redundant.

The feature space is a multi-dimensional space (one dimension per feature). It might be discrete or continuous.

The feature space is usually much smaller than the state space of the original problem.

It is obvious that all features are domain specific. Moreover, one cannot select a meaningful feature without having a good understanding of the domain one is dealing with. As an understatement, we can say that choosing the correct features is crucial to the success of the approximation.

So why is this feature space so important? In order to understand this, let us recall a previously discussed subject: the value function. We will use value functions to explain the importance of the feature space, but it is important to note that Q functions have similar properties.

When we discussed value functions earlier, we said that a value function maps a state s to a real value v: V(s)=v. But what is a state? How is it represented? Of course, it is possible to represent a state in many different ways, such as an enumeration of values, a set of enumerations, etc. When speaking about value functions it is not important how the state is represented (as long as the representation is consistent), but when speaking about function approximation it is very important. Usually a function approximator is a mathematical function that gets a vector of real numbers as input. Thus, it cannot get a state as input; it must get a vector representation of it. This representation is given in the feature space, using a transformation from the state space to the feature space. Thus it is important to choose the correct transformation to the feature space.

RBF

The idea of RBF is the approximation of a function (in our case this would be the value function) by a linear combination of gaussians.

The objective of the function approximation is to find an approximation f of V, where f is the following linear sum:

f(x) = Σ_i w_i φ_i(x)

Equation 4: a linear sum of gaussians is used to approximate V

where

φ_i(x) = exp( -‖x - c_i‖² / (2σ_i²) )

Equation 5: the base functions are gaussians that are centered at c_i


The norm is the distance in the feature space; c_i is the location of the base function in the feature space.

The number of base functions, their positions, their standard deviations and the weight of each function are all parameters that the algorithm has to learn. The following section describes the implementation of RBF that we have made.

In our case

We used a method in which the number of base functions is not determined in advance; rather, they are added dynamically according to a threshold and according to the value of the gradient in the nearby surroundings. The location of each function (its center) is set once the function has been added. The standard deviation is set in advance and is constant for all base functions. What is fully controlled by the algorithm is the weight of each function.
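To illustrate, here is a minimal Java sketch of evaluating such an RBF approximation as a weighted sum of gaussians over a feature vector. The class name, fields, and the fixed sigma are our own illustration, not the project's actual classes.

    import java.util.ArrayList;
    import java.util.List;

    // V(x) ~= sum_i w_i * exp(-||x - c_i||^2 / (2 * sigma^2)),
    // where x is a point in the feature space and c_i are the centers.
    public class RbfApproximator {
        private final List<double[]> centers = new ArrayList<>();
        private final List<Double> weights = new ArrayList<>();
        private final double sigma;                 // constant for all base functions

        public RbfApproximator(double sigma) { this.sigma = sigma; }

        public void addBasisFunction(double[] center) {
            centers.add(center.clone());
            weights.add(0.0);                       // new functions start with weight 0
        }

        public double value(double[] features) {
            double sum = 0.0;
            for (int i = 0; i < centers.size(); i++) {
                double[] c = centers.get(i);
                double sq = 0.0;                    // squared distance to this center
                for (int d = 0; d < c.length; d++) {
                    double diff = features[d] - c[d];
                    sq += diff * diff;
                }
                sum += weights.get(i) * Math.exp(-sq / (2.0 * sigma * sigma));
            }
            return sum;
        }
    }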

The State Space

The usage of function approximation is sometimes the only way to go because of the great memory (space) complexity that is involved with regular, tabular solutions. In the test runs we made there were many cases where the agent ended up with no memory left.

The Algorithm We Used

In the project we have implemented a TD(λ) algorithm that uses RBF to approximate its value function. The following figure describes the algorithm:




    Initialize W arbitrarily and e ← 0
    Repeat (for each episode):
        s ← initial state of the episode
        Repeat (for each step of the episode):
            a ← Policy(s)
            Take action a, observe reward r and next state s'
            δ ← r + γ V(s') - V(s)
            e ← γλ e + ∇_W V(s)
            W ← W + α δ e
            s ← s'
        until s is terminal

Figure 4: TD(λ) algorithm that uses RBF as function approximation

In the figure, Policy is an ε-greedy policy, which means that with probability 1-ε it returns an action a such that taking the action a while being in state s would lead to a state s' such that V(s') is maximal:

Policy(s) = argmax_a V(s'), where s' = model(s, a)


V(s) is computed in the following way: V(s) = Σ_i w_i φ(d_i(s)); hence it is an RBF. Here d_i(s) is the distance in the feature space between the location of the i-th gaussian and the transformation of s to the feature space.

model is a model of the environment: model(s, a) is the next expected state when being at state s and taking action a. The agent must use a model of the environment because it is using a value function (and not a Q function).
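A sketch of how such a model-based greedy choice could look in Java, assuming a hypothetical EnvironmentModel with a nextState(s, a) method, a hypothetical FeatureExtractor that maps sensations to feature vectors, and the RbfApproximator sketched above (none of these names are the project's actual classes):

    import java.util.List;

    // Greedy part of the epsilon-greedy policy from Figure 4:
    // pick the action whose predicted next state has the highest V value.
    public static IAction greedyByModel(ISensation s, List<IAction> actions,
                                        EnvironmentModel model,      // hypothetical: (s,a) -> expected next sensation
                                        FeatureExtractor features,   // hypothetical: sensation -> double[]
                                        RbfApproximator v) {
        IAction best = actions.get(0);
        double bestValue = Double.NEGATIVE_INFINITY;
        for (IAction a : actions) {
            ISensation next = model.nextState(s, a);
            double value = v.value(features.of(next));
            if (value > bestValue) {
                bestValue = value;
                best = a;
            }
        }
        return best;
    }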

The Selected Features

It has already been mentioned that there is great importance in selecting the appropriate set of features. The features that we chose to use are the following. For each of the predators we use:

• The distance from the predator to the prey
• The angle between the predator and the prey
• The distance of the prey from the closest corner

It is not only important to choose the correct features, it is also important to give them the correct weights. We found that the most important feature is the distance, then the angle, and then the corner distance.
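A minimal Java sketch of computing these three features for one predator. The method name, the use of Euclidean distance, Math.atan2 for the angle, and Manhattan distance to the corners are our own illustrative choices; the project's actual feature code may differ (for example, it uses maze-aware distances as described later).

    // Features for one predator: distance to the prey, angle towards the prey,
    // and the prey's distance from the closest grid corner.
    public static double[] predatorFeatures(int px, int py,         // predator position
                                            int qx, int qy,         // prey position
                                            int width, int height)  // grid size
    {
        double dx = qx - px, dy = qy - py;
        double distance = Math.sqrt(dx * dx + dy * dy);
        double angle = Math.atan2(dy, dx);

        int[][] corners = { {0, 0}, {width - 1, 0}, {0, height - 1}, {width - 1, height - 1} };
        double corner = Double.MAX_VALUE;
        for (int[] c : corners) {
            corner = Math.min(corner, Math.abs(qx - c[0]) + Math.abs(qy - c[1]));
        }
        return new double[] { distance, angle, corner };
    }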


Design

The Three Main Interfaces

In order to satisfy requirements 1 and 2 (see Requirements) an interface is presented that consists of 3 main interface classes:

• ISimulation interface
• IAgent interface
• IEnvironment interface

The ISimulation interface is implemented as part of the platform. The IAgent and the IEnvironment interfaces are not implemented as part of the platform; rather, they remain as interfaces and will be implemented by the user of the platform in order to satisfy requirement 3.


Following is a text and diagrams that describe these 3 interfaces.

ISimulation interface

The purpose of the Simulation class is to manage the process of the agent
interacting with the environment.

A UML diagram of the interface is presented and discussed:

Figure 5: ISimulation interface

Discussion of the methods:

ISimulation.init(IEnvironment environment, Object[] envParams, IAgent agent, Object[] agentParams)

Initializes the simulation instance, the agent, and the environment.

The environment is an instance of the IEnvironment interface, and the array envParams holds the parameters that will be passed to the init method of that class. The same goes for agent and agentParams.


ISimulation.start()

Starts a new trial.

The function calls Environment.start() and Agent.start(). This way the first sensation (the starting state) of the environment is forwarded to the agent, the agent returns its first action, and the trial begins.

ISimulation.steps(int numOfSteps)

Runs the simulation for numOfSteps steps, starting from whatever state the environment is in. Note that the environment does not necessarily have to be in the initial state: steps() starts a simulation from whatever state the environment is in, and this state might be any state. If the terminal state is reached, the simulation is immediately prepared for a new trial by calling Simulation.start(). The switch from the terminal state to the new starting state does not count as a step. Thus, this function allows the user to control the execution of the simulation by providing the total number of steps directly.

ISimulation.trials(int numOfTrials, int maxStepsPerTrial)

Runs the simulation for numOfTrials trials, starting from whatever state the environment is in (just like steps()). Each trial can be no longer than maxStepsPerTrial steps. Each trial begins by calling Simulation.start() and ends when the terminal state is reached or when maxStepsPerTrial steps have been taken, whichever comes first. Thus, this function allows the user to control the execution of the simulation by providing the total number of trials directly.
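As a usage illustration, assembling and running a simulation with these methods might look roughly like this. MyMazeEnvironment and MyQAgent are hypothetical user implementations of IEnvironment and IAgent, and the parameter arrays are illustrative; the Simulation class implementing ISimulation is described in the Simulation Package section.

    // Assemble a simulation from a user-supplied environment and agent,
    // then run it for a fixed number of trials.
    IEnvironment environment = new MyMazeEnvironment();   // hypothetical IEnvironment implementation
    IAgent agent = new MyQAgent();                        // hypothetical IAgent implementation

    ISimulation simulation = new Simulation();
    simulation.init(environment, new Object[] { "maze1.txt" },   // envParams (illustrative)
                    agent, new Object[] { 0.1, 0.9 });           // agentParams (illustrative)

    simulation.start();                 // begin the first trial
    simulation.trials(100, 10000);      // run 100 trials, at most 10000 steps each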

IAgent interface

The purpose of the agent is to solve an abstract problem. The problem is defined as a reinforcement problem: an agent searching for a policy that will produce a maximal gain (maximal reinforcement) for it in some environment, not necessarily defined in advance.

The agent might be a complex class that incorporates many subclasses in order to achieve this goal. These might be classes such as a Policy class, a Function Approximation class, etc. However, the interface of the Agent class, as defined by IAgent, is simple.

The IAgent interface is presented in the following UML diagram and the following text:


Figure 6: IAgent interface

The interface as a whole is not implemented by the platform, but is implemented by the user of the platform (henceforth: the user).

Following is a discussion of the interface's methods.

IAgent.init(Object[] params)

In implementing this method the user can do whatever initialization its agent needs. This method is invoked by the Simulation.init() method at the beginning of each trial. Agent.init() should initialize the instance of the agent, creating any needed data structures. If the agent learns or changes in any way with experience, then this function should reset it to its original, naive condition.

The params array is the same array that was passed to the init method of the ISimulation class.

IAction IAgent.start(ISensation s)

This function is called at the beginning of each new trial. Agent.start() should perform any needed initialization of the agent to prepare it for beginning a new trial. As opposed to Agent.init(), this method should not delete the data structures (or the learned policy). It should only prepare the agent for the beginning of a new trial.

The input parameter, s, is the current state (or sensation) the environment is in at the present time (normally it should be the start state).

The method should return an action, which is the action that the agent selects given the sensation s.

IAction IAgent.step(ISensation prevS, IAction prevA, ISensation nextS, double reward)

This is the main function of the Agent class, where all the learning takes place. It will be called once by the simulation instance on each step of the simulation. This method informs the agent that, in response to the sensation prevS and its (previously chosen) action prevA, the environment returned the payoff in reward and the sensation nextS. This function returns an action to be taken in response to the sensation nextS.

If the trial has terminated, nextS will have the value null. In this situation the value returned by the method is ignored.

The sensation and action prevS and prevA on one call to Agent.step() are always the same as the sensation nextS and the returned action of the previous call. Thus, there is a sense in which these arguments are unnecessary and provided just as a convenience; they could simply be remembered by the agent from the previous call. This is permitted and often necessary for efficient agent code (to prevent redundant processing of sensations and actions). For this to work, Agent.step() must never be called directly by the user.

void IAgent.save(String toFileName)

This method is used to save the agent to disk.

What is actually saved might be only the policy that the agent holds.

In order to make saving possible, all the relevant interfaces extend the Serializable interface.

void IAgent.load(String fromFileName)

This method is used to load the agent from a disk file.

What is actually loaded might be only the policy that the agent holds.

In order to make loading possible, all the relevant interfaces extend the Serializable interface.
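To illustrate the shape of an IAgent implementation, here is a heavily simplified sketch of a tabular Q-learning agent built on the QTable class sketched in the RL Algorithms section. The class name, field names, parameter defaults, and the assumption that ISensation.getActions() returns a java.util.Collection of IAction are our own; the project's actual QAgent differs in detail.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // A minimal IAgent sketch: epsilon-greedy action selection over a tabular
    // Q function, updated with the Q-learning rule on every step.
    public class SketchQAgent implements IAgent {
        private QTable q;
        private final Random rnd = new Random();
        private double alpha = 0.1, gamma = 0.9, epsilon = 0.1;   // illustrative defaults

        public void init(Object[] params) {
            q = new QTable();                        // forget everything learned so far
        }

        public IAction start(ISensation s) {
            return select(s);                        // no learning on the very first step
        }

        public IAction step(ISensation prevS, IAction prevA, ISensation nextS, double reward) {
            q.update(prevS.getID(), prevA.getID(), reward,
                     nextS == null ? null : nextS.getID(),
                     nextS == null ? null : actionIds(nextS), alpha, gamma);
            return nextS == null ? null : select(nextS);
        }

        public void save(String toFileName) { /* serialization is discussed later */ }
        public void load(String fromFileName) { /* serialization is discussed later */ }

        // Epsilon-greedy selection over the actions applicable in sensation s.
        private IAction select(ISensation s) {
            List<IAction> actions = new ArrayList<IAction>(s.getActions());
            if (rnd.nextDouble() < epsilon) {
                return actions.get(rnd.nextInt(actions.size()));
            }
            IAction best = actions.get(0);
            for (IAction a : actions) {
                if (q.get(s.getID(), a.getID()) > q.get(s.getID(), best.getID())) best = a;
            }
            return best;
        }

        private List<String> actionIds(ISensation s) {
            List<String> ids = new ArrayList<String>();
            for (IAction a : s.getActions()) ids.add(a.getID());
            return ids;
        }
    }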

IEnvironment Interface

The purpose of the environment is to simulate a real-world environment.

This interface is implemented completely by the user of the platform.

There could be many kinds of environments: a maze, a tetris game, a chess game, a robot trying to get up, etc. However, the common interface of all the environments is simple and is presented in the IEnvironment class:

Figure 7: IEnvironment interface

Following is a discussion of the IEnvironment methods.

IEnvironment.init(Object[] params)

This method should initialize the instance of the environment, creating any needed data structures. If the environment changes in any way with experience, then this function should reset it to its original, naive condition. For example, if the environment is a tetris game, then init should reset the state of the game to the initial state.

Normally the method is called once, when the simulation is first assembled and initialized.

The params array is the same array that was passed to the init method of the ISimulation class.

ISensation IEnvironment.start()

The method is normally called at the beginning of each new trial. It should perform any needed initialization of the environment to prepare it for beginning a new trial. It should return a pointer to the first sensation of the trial.

double IEnvironment.step(IAction action)

This is the main function of the environment class. It will be called once by the simulation instance on each step of the simulation. This method causes the environment to undergo a transition from its current state to a next state that depends on the action action.

The function returns the payoff of the state transition as its return value.

Any data-structure and graphics actions should be done by the environment in this method.

ISensation IEnvironment.getSensation()

Returns the current state/sensation of the environment. If the last transition was into a terminal state, then the current sensation returned must have the special value null.
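For illustration, here is a heavily trimmed sketch of an IEnvironment implementation for a grid world. The GridWorld name, the reward values, the action ID strings, and the GridCellSensation class (a hypothetical ISensation implementation sketched in the ISensation section below) are ours; the project's actual MazeEnvironment is described in the Implemented Environments section.

    // A minimal IEnvironment sketch: a grid with a single goal cell.
    // step() moves the agent and returns -1 per move, +10 on reaching the goal.
    public class GridWorld implements IEnvironment {
        private int x, y;                 // current agent position
        private int goalX, goalY;
        private boolean terminal;

        public void init(Object[] params) {
            goalX = 5; goalY = 5;         // illustrative fixed goal
        }

        public ISensation start() {
            x = 0; y = 0; terminal = false;
            return getSensation();
        }

        public double step(IAction action) {
            // move according to the action's ID, e.g. "NORTH", "EAST", ... (illustrative IDs)
            switch (action.getID()) {
                case "NORTH": y--; break;
                case "SOUTH": y++; break;
                case "EAST":  x++; break;
                case "WEST":  x--; break;
            }
            terminal = (x == goalX && y == goalY);
            return terminal ? 10.0 : -1.0;
        }

        public ISensation getSensation() {
            return terminal ? null : new GridCellSensation(x, y, this);   // hypothetical ISensation
        }
    }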

Additional Interfaces

In order for the classes that inherit the 3 interfaces described above to communicate with each other in an orderly fashion, additional interfaces have to be defined.

These interfaces are:

• ISensation interface
• IAction interface

Following is a discussion of these interfaces.

ISensation Interface

A sensation, in the context of this application, is a generalization of a state. Meaning: the environment might be in a certain state; the agent might think of it as being in that same exact state, but it might also be wrong, thinking of it as being in a different state. The reason is that the agent does not always have perfect information about the environment. If the agent senses that the environment is in state s and the agent's sensors are perfect, then the environment really is in state s. But if the agent's sensors are not perfect (in the general case), then the environment might also be in some other state s'. For this reason we use the term sensation rather than the term state in order to represent the agent's point of view of the environment.

Previously we discussed MDPs. In that discussion we used an implicit assumption about the states: we assumed that the states are given to us perfectly, i.e. we have full information about the current state. Of course, in the general case this is not always true, and this is why we chose to use the term sensation rather than state. An MDP that is used in a non-perfect-information domain is referred to as a POMDP (Partially Observable MDP). The application implemented in the project is designed to deal with partially observed domains as well as fully observed domains.

A sensation is conceptually part of the environment.

The inner representation of a sensation might be a simple enumeration in the case of a discrete environment, or it might be an X-location and a Y-location in the case of a 2D environment (such as a 2D maze), and it might also be represented as several continuous variables (as is the case in some real-world problems).

But in order to make the interface clear and sharp, only a few methods are needed by the sensation interface:


Figure 8: ISensation interface

Following is a discussion of the methods.

collection of IAction ISensation.getActions()

Given a sensation, there is a collection of actions that are applicable to it. This is exactly what this method returns; it is a simple mapping: getActions: Sensations → 2^Actions. Meaning: given this sensation, the method returns the actions that are applicable to it.

String ISensation.getID()

An agent must be able to identify a sensation in order to manage a policy. This method returns the ID of the sensation.

The way in which the ID of a sensation is built is fully dependent on the environment. If, for example, the environment is a simple grid maze, then the ID might be the (X,Y) coordinates of the sensation.

The two things that are important about the ID are that it must be unique and consistent. Unique means that two different sensations must not map to the same ID string. Consistent means that there can be no case where a sensation maps to one string at an earlier time and to a different ID string at a later time. Consistency can also be viewed as a deterministic mapping from the domain of sensations to the domain of strings.

A string was chosen to represent the ID because:

1. it is simple
2. it has a built-in hashCode() method

One must keep in mind that if the agent wants to use a policy in which a function approximation is being used, the agent must first "get to know" the domain: it needs to know what kinds of states there are in the domain and how they are represented. That is, it must be aware of what kind of objects this method returns.
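A sketch of what such an ID could look like for a grid sensation. The GridCellSensation and GridMove classes are the same hypothetical ones used in the GridWorld sketch above, not the project's GridSensation, which enumerates cells by their matrix index as described later.

    import java.util.Arrays;
    import java.util.Collection;

    // A hypothetical grid sensation: the ID is the (x,y) coordinate pair,
    // which is both unique (one cell, one string) and consistent (deterministic).
    public class GridCellSensation implements ISensation {
        private final int x, y;
        private final GridWorld world;    // pointer to the enclosing environment

        public GridCellSensation(int x, int y, GridWorld world) {
            this.x = x; this.y = y; this.world = world;
        }

        public String getID() {
            return x + "," + y;
        }

        public Collection<IAction> getActions() {
            // a real environment would restrict this based on walls etc.;
            // here every cell allows the same four illustrative actions
            return Arrays.<IAction>asList(
                new GridMove("NORTH"), new GridMove("SOUTH"),
                new GridMove("EAST"), new GridMove("WEST"));
        }
    }

    // A hypothetical IAction whose ID is simply the direction name.
    class GridMove implements IAction {
        private final String direction;
        GridMove(String direction) { this.direction = direction; }
        public String getID() { return direction; }
    }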

IAction Interface

IAction describes the action that an agent takes and that might (normally) have an influence on the next sensation of the environment.

The interface is straightforward:



Figure 9: IAction interface

Method discussion:

String IAction.getID()

An agent must be able to identify an action in order to manage a policy (just as with ISensation). This method returns the ID of the action.

The way in which the ID of an action is built is fully dependent on the environment. If, for example, the environment is a simple grid maze, then the IDs might be {RIGHT, LEFT, UP, DOWN}.

The discussion of the method ISensation.getID() applies to this method too, and one must read it carefully.



Package Overview

After describing the interfaces, let us categorize them by dividing them into packages and define the interactions between the different interfaces.

Figure 10: The Packages (Simulation, Agent, Environment, GUI)

The platform is divided into 4 main packages:

• Simulation package
• Agent package
• Environment package
• GUI package

The simulation package uses (depends on) the two other packages. This is possible because they have a well-defined interface.

Following is a detailed discussion of each package.

Simulation Package

The simulation package consists of one interface and one class only:

• ISimulation interface
• Simulation class

The ISimulation interface has already been defined earlier in the document (see ISimulation interface).

The Simulation class simply implements the ISimulation interface as it has already been defined.


Agent Package

The Agent package consists of 3 interfaces and at least 3 classes (at least one implementation for each interface):

• IAgent interface
• IPolicy interface
• IFunctionApproximator interface

The IAgent interface has already been defined (see IAgent interface).

There could be many implementations of this interface, for example QAgent, TDAgent, etc.

It is the responsibility of the user of the platform to implement the agents, and the implementations will be discussed in another document.

The other 2 interfaces, IPolicy and IFunctionApproximator, are merely a suggestion for the implementation of an agent.

IPolicy is the interface of a policy that the agent learns and uses. This is actually the heart of an agent that uses reinforcement learning as its learning algorithm. The interface is presented below:


Figure 11: IPolicy interface

Method discussion:

IAction IPolicy.step(ISensation s)

The essence of a policy is: given a sensation, decide which action to take, based on that sensation. This is just what this method is supposed to do, a simple mapping: step: Sensations → Actions.

Of course, how the mapping is done depends on the policy and on how much "knowledge" it has been able to gain so far.

IPolicy.learn(ISensation prevS, IAction action, ISensation nextS, double reward)

In order for the step method to be able to generate a good action, the Policy class must be able to learn.

In this method the class learns: given the previous sensation prevS and the previously chosen action action, the environment has moved to the next sensation nextS and returned a reward reward.


IFunctionApproximator is a class whose essence is to provide an approximate mapping (as its name suggests) from one domain to another.

Of course, in an ideal world the mapping would not be an approximation but rather a precise mapping. In that situation the class implementing the interface would consist of a large table that maps objects.

The mapping would typically be a mapping of sensations to actions.

But in many cases holding a large mapping table is not practical. In those cases one must use a function approximation.

There are many ways to approximate functions: neural networks, decision trees, gradient descent and more. The implementer of the interface must choose one of these methods.


Figure 12: IFunctionApproximator interface

IAction IFunctionApproximator.map(ISensation s)

This method maps sensations to actions. If the class implements a tabular mapping, then this mapping is the mapping of visited states to the estimated values of those states, and if the class implements an approximate function then this mapping maps all states according to the approximation parameters learnt so far.

In order to understand the context of this class, refer to Figure 13.
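To show how these two interfaces could fit together, here is a rough sketch of an IPolicy implementation that delegates its decisions to an IFunctionApproximator. The class name and the placeholder learn() body are our own illustration; the actual classes in the agent package are richer.

    // A minimal policy sketch: acting is delegated to a function approximator,
    // and learning would update that approximator from the experience tuple.
    public class ApproximatedPolicy implements IPolicy {
        private final IFunctionApproximator approximator;

        public ApproximatedPolicy(IFunctionApproximator approximator) {
            this.approximator = approximator;
        }

        public IAction step(ISensation s) {
            return approximator.map(s);   // ask the approximator for the action to take
        }

        public void learn(ISensation prevS, IAction action, ISensation nextS, double reward) {
            // a concrete policy would update the approximator's parameters here,
            // e.g. with the TD or Q-learning rules from the RL Algorithms section
        }
    }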



Environment Package

The environment package encapsulates all the environment classes. Those classes may be classes that manage the environment logically, classes that display the environment graphically, etc.

The package exposes three interfaces to be used by the other packages. Those are:

• IEnvironment interface
• ISensation interface
• IAction interface

All three interfaces have been fully discussed in the preceding text (see IEnvironment interface, ISensation interface, IAction interface) and, as mentioned before, their implementation is left to the user of the platform.

To summarize, we present a diagram of all the main interfaces:


Figure 13: Interaction between the interfaces

The simulation instance has one instance of an environment (which implements IEnvironment) and one instance of an agent (which implements IAgent).

The simulation uses those instances in order to run a simulation.

The agent has a policy instance and uses it in order to decide what its next step will be. The policy, in turn, uses a function approximator in order to get the job done.

Serialization

Serialization is Java's way of saving objects to disk files or sending them through a network connection.


Java makes it very easy for the programmer to save objects. All you have to do is declare that the class of your object implements the Serializable interface, and that all of the objects referenced within the object also have classes implementing the Serializable interface, and that's all. It is then possible to save a whole object to disk and load it from disk. When an object is saved to disk, all of the other objects referenced by it are automatically saved too.

For this reason, in this application, where there is a need to save objects to disk (such as the policy of the agent, the environment and the environment's state), many of the interfaces discussed in this document extend the Serializable interface, which results in the classes themselves implementing the Serializable interface.

Following is a list of all classes/interfaces that implement the Serializable interface:

1. agent.IAgent
2. agent.IFunctionApproximator
3. agent.IPolicy
4. agent.QAgent.Q (an example of a class that an implementer of the IAgent interface uses, which also has to be saved to disk)
5. environment.IAction
6. environment.ISensation
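As an illustration of how IAgent.save() and IAgent.load() could be implemented on top of this, here is a minimal sketch using standard Java object serialization. The choice to serialize the whole agent object is ours; saving only the policy, as suggested earlier, would work the same way.

    import java.io.*;

    // Save: write the (Serializable) agent object to a file.
    public static void saveAgent(IAgent agent, String toFileName) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(toFileName))) {
            out.writeObject(agent);       // referenced objects (policy, Q table, ...) are saved too
        }
    }

    // Load: read the agent object back from the file.
    public static IAgent loadAgent(String fromFileName) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fromFileName))) {
            return (IAgent) in.readObject();
        }
    }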


Implemented Environments

To demonstrate the usage of the platform, we have implemented two environments:

• A Maze environment
• A Predator & Pray environment

The following section describes the above environments in detail.

The Maze Environment

The maze environment is a 2D maze on a rectangular grid where the actor (the agent) must find the exit (we will call it the goal state). On each episode the agent can get the current state ID and query the environment about the possible actions that can be taken. The possible actions can be one of {NORTH, EAST, SOUTH, WEST}, but some states may offer fewer options, depending on the wall locations. Each time the agent takes a step, the maze's state changes to reflect the agent's new coordinate. If the agent takes a step that leads into a wall, its location won't change.


The implementation of the maze environment is straightforward. A matrix was used, where each cell can contain a code representing the agent, the goal state, a wall, or an empty cell. This decision makes it easy to calculate the next state or the possible actions, and also makes it easy to create a graphic display. For simplicity, the maze environment also keeps pointers to MazeSensations representing the goal state, the agent's location, and the initial state.
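
The step logic implied above can be sketched as follows. The cell codes, array names and action indices here are illustrative assumptions, not the project's actual implementation.

// A minimal sketch of the maze step logic: moving into a wall (or off the
// grid) leaves the agent where it is. All names and codes are illustrative.
public class MazeStepSketch {
    static final int EMPTY = 0, WALL = 1, GOAL = 2;

    // Row/column offsets for NORTH, EAST, SOUTH, WEST.
    static final int[] D_ROW = {-1, 0, 1, 0};
    static final int[] D_COL = {0, 1, 0, -1};

    /** Returns the agent's position after taking an action. */
    static int[] step(int[][] grid, int row, int col, int action) {
        int newRow = row + D_ROW[action];
        int newCol = col + D_COL[action];
        boolean inside = newRow >= 0 && newRow < grid.length
                      && newCol >= 0 && newCol < grid[0].length;
        if (!inside || grid[newRow][newCol] == WALL) {
            return new int[] {row, col};   // blocked: stay in place
        }
        return new int[] {newRow, newCol};
    }
}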

The GridSensation class

The GridSensation class was created to allow a unified view of a grid location for environments that allow the above-mentioned four possible actions plus one additional action: STAY, which literally means stay in place.

The STAY option is only used in the PredatorAndPrayEnvironment, which is discussed later.

A GridSensation is a representation of a coordinate on a grid. Each GridSensation has a unique ID, which is an enumeration of the cell it represents; that is, its ID is the index of the cell in the environment's matrix when the matrix is flattened into a one-dimensional array.

A GridSensation keeps a pointer to its enclosing environment to allow easy implementation of the next two methods, which are used by the agents for function approximation and environment modeling.

boolean GridSensation.isGoal()
Tests if this sensation is a goal state.



int GridSensation.distance(GridSensation otherSensation)
Returns the actual distance between this state and the other state.
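
The ID scheme and the delegation to the enclosing environment can be sketched as follows. The field names and the GridEnvironment methods used here are assumptions made for illustration, not the project's actual API.

// A minimal sketch of the GridSensation idea: the ID is the cell's index in
// the flattened grid, while isGoal() and distance() are delegated to the
// enclosing environment. All names are illustrative assumptions.
public class GridSensationSketch {

    interface GridEnvironment {          // stand-in for the real environment
        int getNumCols();
        boolean isGoalCell(int row, int col);
        int distance(int fromRow, int fromCol, int toRow, int toCol);
    }

    private final GridEnvironment env;
    private final int row, col;

    public GridSensationSketch(GridEnvironment env, int row, int col) {
        this.env = env;
        this.row = row;
        this.col = col;
    }

    /** Unique ID: the index of this cell when the matrix is flattened. */
    public int getId() {
        return row * env.getNumCols() + col;
    }

    public boolean isGoal() {
        return env.isGoalCell(row, col);
    }

    public int distance(GridSensationSketch other) {
        return env.distance(row, col, other.row, other.col);
    }
}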

The MazeSensation class

The MazeSensation is a GridSensation that does not allow the STAY action.

The GridAction class

The GridAction is nothing more than an abstraction of the above idea of five possible grid actions: NORTH, EAST, SOUTH, WEST, and STAY. One can think of this class as an enumeration of these actions.
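
In modern Java this idea maps naturally onto an enum, as sketched below. The original project predates Java enums and would have modelled this as a class with constants, so the sketch shows the intent rather than the actual implementation; the dRow/dCol fields are our own addition.

// A sketch of GridAction as an enumeration of the five grid actions,
// each carrying the row/column movement it causes on the grid.
public enum GridActionSketch {
    NORTH(-1, 0), EAST(0, 1), SOUTH(1, 0), WEST(0, -1), STAY(0, 0);

    public final int dRow, dCol;

    GridActionSketch(int dRow, int dCol) {
        this.dRow = dRow;
        this.dCol = dCol;
    }
}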

The PredatorAndPrayEnvironment

The predator and pray environment is a GridEnvironment just like the maze environment. On the grid there is an arbitrary number of predators (monsters) that are trying to catch the prey (the pacman). In other words, this is a pacman game in which the agent plays the monsters' role and is trying to catch the pacman to achieve its goal. On each step the agent picks an action (a PredatorAndPrayAction, described in detail shortly), and the environment moves the monsters accordingly. At that point, if the pacman was not caught, it makes its move according to its deterministic getaway algorithm, encapsulated in the PacMan class, which will be described soon. Note that the environment's step actually encapsulates the monsters' step and the pacman's step, which are two separate steps. This makes the game a round-robin game.


As was previously stated, the agent's goal is achieved when at least one of the monsters is located on the same cell as the pacman. Each time the goal is achieved, the monsters are placed at a random initial location, and the chase starts once again. This allows a quicker improvement of the learned policy.


Apart from the common interface inherited from IEnvironment, the PredatorAndPrayEnvironment class implements one important method which allows the agent package to implement an IEnvironmentModel in an easy manner:

Map PredatorAndPrayEnvironment.getProbableNextSensations(IAction action)
The method calculates all probable next states for the chosen action, and returns a mapping from each possible next state to its probability of occurring.
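
As a sketch of how such a mapping might be consumed, a model-based agent can compute the expected value of an action as the probability-weighted sum of the values of the probable next states. The maps used below are assumptions about the data an agent would hold, not the project's actual types.

import java.util.Map;

// A minimal sketch: expected value of an action, given a map from probable
// next states to their probabilities and a map from states to their values.
public class ExpectedValueSketch {

    static double expectedValue(Map<Object, Double> nextStateProbabilities,
                                Map<Object, Double> stateValues) {
        double expected = 0.0;
        for (Map.Entry<Object, Double> entry : nextStateProbabilities.entrySet()) {
            double probability = entry.getValue();
            double value = stateValues.getOrDefault(entry.getKey(), 0.0);
            expected += probability * value;
        }
        return expected;
    }
}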


To allow the environment to calculate the real distances between two grid cells (remember that we are talking about a maze, and therefore the distances aren't exactly the Manhattan distances), a utility class was added to abstract this operation:

The FloydWarshall class

The FloydWarshall class implements the Floyd-Warshall algorithm to find the shortest distance between all pairs of nodes (cells) in a GridEnvironment. On creation it is given a pointer to its enclosing environment, and it calculates the shortest distances between every two cells. It also finds the maximal shortest distance on the grid for normalization purposes (this feature is important for the function approximation part). The FloydWarshall class gives the PredatorAndPrayEnvironment a great deal of improvement, since it can then use the real distances in the maze instead of relying on the Manhattan distance, which can be far from accurate.


The three main methods of the FloydWarshall class are:

int FloydWarshall.distance(GridSensation from, GridSensation to)
Returns the distance between two states (or cells) on the grid environment that created it.

int FloydWarshall.distance(int fromRow, int fromCol, int toRow, int toCol)
Returns the distance between the cells at the coordinates given in the parameters.

int FloydWarshall.getMaxMinDist()
Returns the maximal shortest distance between two points on the grid. In other words, this is the maximal distance d such that there exist two non-wall points p1, p2 on the grid with distance(p1, p2) == d, and no larger d applies (d is maximal). For example, on a maze without walls this would be the Manhattan distance between opposite corners.
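
The all-pairs computation itself can be sketched as follows: a minimal Floyd-Warshall over a grid, assuming a hypothetical boolean[][] wall map and unit-cost moves between adjacent free cells (the real class works directly on its enclosing GridEnvironment).

// A minimal sketch of the Floyd-Warshall all-pairs shortest path computation
// over a grid. A cell's index is row * cols + col, matching the flattened-ID
// scheme described earlier. All names here are illustrative.
public class FloydWarshallSketch {
    static final int INF = Integer.MAX_VALUE / 2;   // avoid overflow when adding

    static int[][] allPairsShortestPaths(boolean[][] wall) {
        int rows = wall.length, cols = wall[0].length, n = rows * cols;
        int[][] dist = new int[n][n];

        // Initialize: 0 to self, 1 to adjacent free neighbours, INF otherwise.
        for (int a = 0; a < n; a++) {
            java.util.Arrays.fill(dist[a], INF);
            dist[a][a] = 0;
        }
        int[] dr = {-1, 1, 0, 0}, dc = {0, 0, -1, 1};
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (wall[r][c]) continue;
                for (int k = 0; k < 4; k++) {
                    int nr = r + dr[k], nc = c + dc[k];
                    if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && !wall[nr][nc]) {
                        dist[r * cols + c][nr * cols + nc] = 1;
                    }
                }
            }
        }

        // Relax every pair through every intermediate cell k.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    if (dist[i][k] + dist[k][j] < dist[i][j]) {
                        dist[i][j] = dist[i][k] + dist[k][j];
                    }
                }
            }
        }
        return dist;
    }
}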

The PredatorAndPraySensation class

The PredatorAndPraySensation class can be thought of as a set of GridSensations. Each sensation is the coordinate of one of the actors on the grid; by actors we mean the predators and the pacman. A possible action in a PredatorAndPraySensation is then a set of actions, each virtually taken by one of the monsters (recall that there is actually only one agent, and therefore all of the monsters' actions are taken in one combined step).


The main methods in the PredatorAndPraySensation class, apart from the methods described in the interface, are:
GridSensation PredatorAndPraySensation.getPacManSensation()
Returns the GridSensation that represents the pacman's location.

GridSensation[] PredatorAndPraySensation.getPredatorsSensations()
Returns an array of GridSensations that represent the monsters' locations.

boolean PredatorAndPraySensation.isGoalSensation()
Tests if this is a goal state.
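
The goal test implied above, that the goal is reached when at least one predator occupies the same cell as the pacman, can be sketched as follows. The GridSensation accessors used here (getRow, getCol) are assumptions for illustration.

// A minimal sketch of the goal test: the chase ends when a monster and the
// pacman share a grid cell. The nested interface is a stand-in only.
public class GoalTestSketch {

    interface GridSensation {
        int getRow();
        int getCol();
    }

    static boolean isGoalSensation(GridSensation pacman, GridSensation[] predators) {
        for (GridSensation predator : predators) {
            if (predator.getRow() == pacman.getRow()
                    && predator.getCol() == pacman.getCol()) {
                return true;     // a monster caught the pacman
            }
        }
        return false;
    }
}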

The PredatorAndPrayAction class

A PredatorAndPrayAction is actually a set of GridActions taken by all the monsters in a single step. Each entry in the set is a GridAction that was chosen by one of the monsters.



Experimental results

The following section describes the execution results of the implemented agents on a few selected environments.

Maze Environment Results

The runs on the maze environment were quite impressive. The first maze we tested was an 11x11 maze.

The results are described in the graphs below. Note that the shortest path is 20 steps.

Figure 14: # of steps per episode for a Q agent



Figure 15: # of steps per episode for a Sarsa Lambda agent

Figure 16: # of steps per episode for a TD Lambda agent


The next maze we tested was a 20x20 maze. The shortest path on this maze was 45 steps.




And the results were:

Figure 17: # of steps per episode for a Q agent

The results were quite similar for the Sarsa Lambda agent.



Figure 18: # of steps per episode for a TD Lambda agent

Predator And Pray Environment Results

Since the results were quite good, we moved on to the Predator & Pray environment. First we tested a rather small 10x10 grid with 3 monsters.

Even for this small maze, the agents that did not use any form of function approximation failed due to lack of memory once the policy table grew too large. We therefore display the results we obtained using the RBF TD Lambda agent.




Figure 19: # of steps per episode for an RBF TD Lambda agent


As you can see, the RBF TD Lambda agent can be somewhat unstable. We believe this is due to the nondeterministic behavior in the first trial, which caused a large variance across different executions.


Next we tested much bigger mazes. The following maze is 29x29.




The heuristic agent achieved an average of 36 steps per episode in the good executions, while the RBF TD Lambda agent achieved an average of 49 steps after ~3600 steps. An interesting phenomenon is that the RBF TD Lambda agent achieves far better results on an empty grid than the heuristic agent.

We attribute this to the features we used. As you may recall, one of the features was the distance from the corners. We believe this is the cause of this kind of behavior. Had time allowed, we would probably have researched this phenomenon in greater detail.


Figure 20: # of steps per episode for an RBF TD Lambda agent



Operating manual

The following is a short description of the controls in the project's main windows, to help the user get familiar with the application.

The environment editor

We created the environment editor to be able to quickly create test cases for the platform. The editor can create both maze initialization files and predator & pray initialization files. All you have to do is select the desired environment in the first dialog.



To create a new environment, click File

new (Ctrl
-
N) and
select the desired size.



To save the environment, click File

save (Ctrl
-
S).



To open an existing environment file, click F
ile

open (Ctrl
-
O).



To draw walls click the draw button and drag or click the
mouse on the drawing panel. To erase, do the same with the
right mouse button.



To add monsters click the add monster button and click, on
the drawing panel in the desired places.

NOTE: in the maze
environment, only one monster will be selected as the initial
agent’s location.



To place the pacman in the predator and pray environment,
click the place pacman button and click on the drawing panel
where you want to place it.



To place t
he goal state in the maze environment, click the
place pacman button and click on the drawing panel where
you want to place it.




[Environment editor screenshot: draw mode button, add monster button, place pacman button, File menu, drawing panel]


The simulation type chooser

This dialog pops up when you run the platform's main application. It prompts the user to select an environment and an agent.




The simulation main window

This frame displays the environment’s state, and some statistics:






- The number of episodes (the number of times the agent achieved the goal state).
- The number of steps so far.
- The average number of steps per episode.
- Time since startup.

The application’s controls are:



- Learn button: freezes the graphic display to allow the agent to quickly learn a better policy.
- Display button: resumes the graphic display of the environment state.
- Delay slider: sets the delay time of the graphic display.




[Simulation main window screenshot: display panel, learning mode button, display mode button, # of simulation steps, # of complete episodes, average steps, simulation delay, total time]



Improvements that we have not implemented

During the development process we came up with many ideas for improving the implementation that we did not have the time to invest in. The list is long and quite interesting; the following are only a few of the main ideas that came up.



- Distributed implementation: implement the predators as 2 (or more) separate entities, each having its own policy. In this case the predators must learn to cooperate in order to catch the prey.
- Add features to the function approximation.
- Make the learning of the feature weights automatic.
- Use a data structure for holding the basis functions that has a better time complexity. In the current implementation, getting the value of the approximation and updating the weights is linear in the number of basis functions. With a data structure such as a KD-tree, this complexity would be logarithmic in the number of basis functions.
- Use a different mechanism for function approximation. For example, CMAC has been successfully applied to similar problems. Neural networks have also been used successfully in the past together with RL algorithms (an example is TD-Gammon).




List of Figures

Figure 1: TD algorithm for estimating Vπ
Figure 2: Q-learning algorithm
Figure 3: Sarsa algorithm
Figure 4: TD(λ) algorithm that uses RBF as function approximation
Figure 5: ISimulation interface
Figure 6: IAgent interface
Figure 7: IEnvironment interface
Figure 8: ISensation interface
Figure 9: IAction interface
Figure 10: The Packages
Figure 11: IPolicy interface
Figure 12: IFunctionApproximator interface
Figure 13: Interaction between the interfaces
Figure 14: # of steps per episode for a Q agent
Figure 15: # of steps per episode for a Sarsa Lambda agent
Figure 16: # of steps per episode for a TD Lambda agent
Figure 17: # of steps per episode for a Q agent
Figure 18: # of steps per episode for a TD Lambda agent
Figure 19: # of steps per episode for an RBF TD Lambda agent
Figure 20: # of steps per episode for an RBF TD Lambda agent



Bibliography

1. Sutton R. S. and Barto A. G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998
2. Mitchell T. M., Machine Learning, McGraw Hill, 1997
3. Kaelbling L. P. and Littman M. L., Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, pages 237-285, 1996
4. Unemy T., Learning not to Fail - Instance-Based Learning from Negative Reinforcement, 1992
5. Krose B. and Smagt P., An Introduction to Neural Networks, University of Amsterdam, 1996
6. Gurney G., Computers and Symbols versus Nets and Neurons, UCL Press, 2001