# COMP 424 - Artificial Intelligence

Artificial Intelligence and Robotics

17 Jul 2012

Lecture 20: Markov Decision Processes

Instructors:
- Joelle Pineau (jpineau@cs.mcgill.ca)
- Sylvie Ong (song@cs.mcgill.ca)

Class web page: www.cs.mcgill.ca/~jpineau/comp424
## Outline

- Markov chains
- Markov Decision Processes
- Policies and value functions
- Policy evaluation
- Policy improvement
- Computing optimal value functions for MDPs
- Policy iteration algorithm
- Value iteration algorithm
- Function approximation
## Sequential decision-making

- Utility theory provides a foundation for one-shot decisions.
- When more than one decision has to be taken, reasoning about all of them jointly is in general very expensive.
- Agents need to be able to make decisions in a repeated interaction with the environment over time.
- Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.
## Markov Chain Example

How the game works:

- Start at state 1.
- Roll a die, then move forward the number of squares shown.
- If you land on square 5, you are teleported to square 8.
- When you get to square 12, you win!
- Note that no skill or decision is involved (yet).
Viewed as a Markov chain:

- What are the states? What is the initial state?
- What are the actions?
- What is the goal?
- If the agent is in state s_t at time t, the state at time t+1, s_{t+1}, is determined only by the die roll at time t.
- Note that there is a discrete clock pacing the interaction of the agent with the environment: t = 0, 1, 2, …
## Key assumption

- The probability of the next state, s_{t+1}, does not depend on how the agent got to the current state s_t.
- This is called the Markov property.
- So our game is completely described by the probability distribution of the next state given the current state.
## Markov Chain Definition

- Set of states S.
- Transition probabilities: T: S x S → [0, 1], where T(s, s') = P(s_{t+1} = s' | s_t = s).
- Initial state distribution: P_0: S → [0, 1], where P_0(s) = P(s_0 = s).
## Things that can be computed

- What is the expected number of time steps (die rolls) to the finish line?
- What is the expected number of time steps until we reach a given state for the first time?
- What is the probability of being in a given state s at time t?
- After t time steps, what is the probability that we have ever been in a given state s?
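These quantities can be computed directly from the transition matrix. A minimal NumPy sketch for the die game above; note the rule for overshooting square 12 is my assumption (rolls past 12 count as reaching 12), since the slides don't specify it:

```python
import numpy as np

# Transition matrix for the 12-square die game (0-indexed: square s+1).
N = 12
T = np.zeros((N, N))
for s in range(N):
    if s == N - 1:
        T[s, s] = 1.0          # square 12 is absorbing (game over)
        continue
    for roll in range(1, 7):
        nxt = min(s + roll, N - 1)   # assumption: overshoot counts as winning
        if nxt == 4:                 # landing on square 5 teleports to 8
            nxt = 7
        T[s, nxt] += 1.0 / 6

# P(s_t = s): push the initial distribution through T for t steps.
p = np.zeros(N); p[0] = 1.0          # start at square 1
for _ in range(3):
    p = p @ T
print("P(on square 12 after 3 rolls) =", p[-1])

# Expected number of rolls to finish: solve E(s) = 1 + sum_s' T(s,s') E(s')
# over the non-terminal states, with E(12) = 0.
A = np.eye(N - 1) - T[:-1, :-1]
E = np.linalg.solve(A, np.ones(N - 1))
print("Expected rolls from square 1 =", E[0])
```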
## Example: Decision Making

- Suppose that we played the game with two dice.
- You roll both dice and then have a choice:
  - Take the roll from the first die.
  - Take the roll from the second die.
  - Take the sum of the two rolls.
- The goal is to finish the game as quickly as possible.
- What are we missing?
## Sequential Decision Problem

- At each time step t, the agent is in some state s_t.
- It chooses an action a_t; as a result, it receives a numerical reward r_{t+1} and observes the new state s_{t+1}.
- Similar to a Markov chain, but there are also actions and rewards.
## Markov Decision Processes (MDPs)

- Set of states S.
- Set of actions A.
- Reward function r: S x A → ℝ, where r(s, a) is the short-term utility of taking action a in state s.
- Transition model (dynamics): T: S x A x S → [0, 1], where T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a) is the probability of going from s to s' under action a.
- Discount factor γ (between 0 and 1, usually close to 1).
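To make the tuple (S, A, r, T, γ) concrete, here is a minimal sketch of a finite-MDP container in Python; the two-state "machine maintenance" example and all of its numbers are invented purely for illustration:

```python
# A minimal container for a finite MDP (S, A, r, T, gamma).
class MDP:
    def __init__(self, states, actions, rewards, transitions, gamma):
        self.S, self.A = list(states), list(actions)
        self.r = rewards          # r[(s, a)] -> float
        self.T = transitions      # T[(s, a)] -> {s': probability}
        self.gamma = gamma
        # Sanity check: every transition distribution must sum to 1.
        for (s, a), dist in transitions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

# Made-up example: keep using a machine (earn reward, risk breaking it)
# or pay to repair it.
m = MDP(
    states=["healthy", "broken"],
    actions=["use", "repair"],
    rewards={("healthy", "use"): 2, ("healthy", "repair"): -1,
             ("broken", "use"): 0, ("broken", "repair"): -1},
    transitions={
        ("healthy", "use"):    {"healthy": 0.9, "broken": 0.1},
        ("healthy", "repair"): {"healthy": 1.0},
        ("broken", "use"):     {"broken": 1.0},
        ("broken", "repair"):  {"healthy": 0.8, "broken": 0.2},
    },
    gamma=0.95,
)
```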
## The discount factor

Two interpretations:

- At each time step, there is a 1 − γ chance that the agent dies, and receives no further rewards.
- Inflation rate: an amount of money received a year from now is worth less than the same amount received today.
## Applications of MDPs

AI / Computer Science:

- Robotic control
- Air campaign planning
- Elevator control
- Computation scheduling
- Control and automation
- Spoken dialogue management
- Cellular channel allocation
- Football play selection
Economics / Operations Research:

- Inventory management
- Fleet maintenance
- Packet retransmission
- Nuclear plant management

Agriculture:

- Herd management
- Fish stock management
## Planning in MDPs

- The goal of an agent in an MDP is to be rational, i.e., to maximize its expected utility (the MEU principle).
- Maximizing the immediate utility (defined by the immediate reward) is not sufficient.
  - E.g., the agent might pick an action that gives instant gratification, even if it later makes it "die".
- The goal is to maximize long-term utility, also called return.
- The return is defined as an additive function of all rewards received by the agent.
## Returns

The return R_t for a trajectory, starting from step t, can be defined as:

- Episodic tasks (e.g., games, trips through a maze, etc.):

  R_t = r_{t+1} + r_{t+2} + … + r_T

  where T is the time when a terminal state is reached.

- Continuing tasks (e.g., tasks which may go on forever):

  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=1}^∞ γ^{k-1} r_{t+k}

  A discount factor γ < 1 ensures that the return is finite whenever rewards are bounded.
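The discounted return also satisfies R_t = r_{t+1} + γ R_{t+1}, which is reused in the Bellman equations below. A short sketch for a finite reward sequence (rewards after the episode ends are zero):

```python
# Discounted return R_t = sum_k gamma^(k-1) r_{t+k} for a finite reward list.
def discounted_return(rewards, gamma):
    R = 0.0
    for r in reversed(rewards):   # uses R_t = r_{t+1} + gamma * R_{t+1}
        R = r + gamma * R
    return R

assert discounted_return([1, 1, 1], 1.0) == 3.0  # undiscounted episodic case
print(discounted_return([0, 0, 10], 0.9))        # 0 + 0.9*0 + 0.9^2 * 10 -> 8.1
```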
## Example: Mountain-Car

- States: position and velocity of the car.
- Actions: accelerate forward, accelerate backward, coast.
- Goal: get the car to the top of the hill as quickly as possible.
- Reward: -1 for every time step, until the car reaches the top (then 0).
  (Alternatively: reward = 1 at the top, 0 otherwise, with γ < 1.)
## Policies

- The goal of the agent is to find a way of behaving, called a policy (similar to a universal plan, or a strategy), that maximizes the expected value of R_t, for all t.
- Two types of policies:
  - Deterministic policy: in each state the agent chooses a unique action: π: S → A, π(s) = a.
  - Stochastic policy: in the same state, the agent can "roll a die" and choose different actions: π: S x A → [0, 1], π(s, a) = P(a_t = a | s_t = s).
- Once a policy is fixed, the MDP becomes a Markov chain with rewards.
## Value Functions

- Because we want to find a policy which maximizes the expected return, it is a good idea to estimate the expected return. Why exactly?
- Because then we can search through the space of policies for one that is good.
- Value functions represent the expected return, for every state, given a certain policy.
- Computing value functions is an intermediate step towards computing good policies.
## State Value Function

- The value function of a policy π is a function V^π: S → ℝ.
- The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to policy π:

  V^π(s) = E_π[ R_t | s_t = s ]

- For a finite state space, we can represent this as an array, with one entry for every state.
- We will talk later about methods used for very large or continuous state spaces.
## Bellman Equations for Evaluating a Policy

Recall our definition of the return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
    = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + … )
    = r_{t+1} + γ R_{t+1}

Based on this observation, V^π(s) becomes:

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ R_{t+1} | s_t = s ]

By writing the expectation explicitly, we get:

- Deterministic policy:

  V^π(s) = r(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V^π(s')

- Stochastic policy:

  V^π(s) = Σ_{a∈A} π(s, a) ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V^π(s') )

This is a system of linear equations (one per state) with a unique solution V^π.
## Policy evaluation in matrix form

Bellman's equation in matrix form:

V^π = r^π + γ T^π V^π

What are V^π, r^π and T^π?

- V^π is a vector containing the value of each state under policy π.
- r^π is a vector containing the immediate reward at each state: r(s, π(s)).
- T^π is a matrix containing the transition probabilities between states: T(s, π(s), s').

In some cases, we can solve this exactly:

V^π = ( I − γ T^π )⁻¹ r^π

Can we do this iteratively?
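The exact matrix-form solution can be sketched in a few lines of NumPy; the 3-state fixed-policy chain and its rewards are made up for illustration:

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * T) V = r, i.e. V = (I - gamma*T)^-1 r.
gamma = 0.9
T = np.array([[0.5, 0.5, 0.0],     # T^pi: transition matrix under the fixed policy
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])    # state 3 is absorbing
r = np.array([1.0, 2.0, 0.0])      # r^pi: immediate reward in each state

# Solving the linear system is preferable to explicitly forming the inverse.
V = np.linalg.solve(np.eye(3) - gamma * T, r)
print(V)

# Check that the Bellman equation V = r + gamma * T V holds.
assert np.allclose(V, r + gamma * T @ V)
```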
## Iterative Policy Evaluation

Main idea: turn the Bellman equations into update rules.

1. Start with an initial guess V_0 (e.g., all zeros).
2. During every iteration k, update the value function for all states:

   V_{k+1}(s) ← r(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V_k(s')

3. Stop when the maximum change between two iterations is smaller than a desired threshold (i.e., the values stop changing).

This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.

This is a dynamic programming algorithm, and it is guaranteed to converge!
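The three steps above can be sketched as follows, on the same kind of made-up fixed-policy chain (all numbers illustrative):

```python
import numpy as np

gamma = 0.9
T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 2.0, 0.0])

V = np.zeros(3)                        # step 1: arbitrary initial guess V_0
for k in range(1000):
    V_new = r + gamma * T @ V          # step 2: update all states at once
    if np.max(np.abs(V_new - V)) < 1e-10:
        break                          # step 3: values stopped changing
    V = V_new
print(V, "after", k + 1, "iterations")
```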
## Convergence of Iterative Policy Evaluation

Consider the absolute error in our estimate V_{k+1}(s):

|V_{k+1}(s) − V^π(s)| = | γ Σ_{s'∈S} T(s, π(s), s') ( V_k(s') − V^π(s') ) |
                      ≤ γ Σ_{s'∈S} T(s, π(s), s') | V_k(s') − V^π(s') |
                      ≤ γ max_{s'} | V_k(s') − V^π(s') |

(Note: in these equations, r(s, a) denotes the reward, not the return.)
## Convergence of Iterative Policy Evaluation

Let ε_k be the worst error at iteration k:

ε_k = max_s | V_k(s) − V^π(s) |

From the previous calculation, we have:

ε_{k+1} ≤ γ ε_k

Because γ < 1, this means that ε_k → 0 as k → ∞.

We say that the error contracts, and the contraction factor is γ.
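The contraction can be checked numerically on a small made-up chain: each backup shrinks the worst-case (max-norm) error by at least a factor of γ:

```python
import numpy as np

gamma = 0.9
T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 2.0, 0.0])
V_true = np.linalg.solve(np.eye(3) - gamma * T, r)   # exact V^pi for comparison

V = np.zeros(3)
prev_err = np.max(np.abs(V - V_true))
for _ in range(20):
    V = r + gamma * T @ V                # one iterative-policy-evaluation backup
    err = np.max(np.abs(V - V_true))
    # eps_{k+1} <= gamma * eps_k (small slack for floating-point noise)
    assert err <= gamma * prev_err + 1e-12
    prev_err = err
```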
## Searching for a Good Policy

- We say that π ≥ π' if V^π(s) ≥ V^π'(s) for all s ∈ S.
  - This gives a partial ordering of policies.
  - If one policy is better at one state but worse at another state, the two policies are not comparable.
- Since we know how to compute values for policies, we can search through the space of policies.
  - Local search seems like a good fit.
## Policy Improvement

Recall Bellman's equation:

V^π(s) = Σ_{a∈A} π(s, a) ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V^π(s') )

Suppose that there is some action a* such that:

r(s, a*) + γ Σ_{s'∈S} T(s, a*, s') V^π(s') > V^π(s)

Then if we set π(s, a*) ← 1, the value of state s will increase:

- Because we replaced each element in the sum that defines V^π(s) with a bigger value.
- The values of states that can transition to s increase as well.
- The values of all other states stay the same.

So the new policy using a* is better than the initial policy π.
## Policy Iteration

More generally, we can change the policy π to a new policy π' which is greedy with respect to the computed values V^π:

π'(s) = argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V^π(s') )

- This gives us a local search through the space of policies.
- We stop when the values of two successive policies are identical.
- Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.
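The greedy-improvement step, combined with exact policy evaluation, gives the full policy iteration loop. A sketch on a small randomly generated MDP (the MDP itself is arbitrary, just to exercise the algorithm):

```python
import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(nS), size=(nS, nA))  # T[s, a, s'], rows sum to 1
r = rng.standard_normal((nS, nA))              # r[s, a]

pi = np.zeros(nS, dtype=int)                   # pi_0: arbitrary initial policy
for _ in range(100):                           # terminates long before this bound
    # Policy evaluation: solve V = r^pi + gamma * T^pi V exactly.
    T_pi = T[np.arange(nS), pi]                # nS x nS matrix under pi
    r_pi = r[np.arange(nS), pi]
    V = np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)
    # Greedy improvement w.r.t. the computed values.
    new_pi = (r + gamma * T @ V).argmax(axis=1)
    if np.array_equal(new_pi, pi):
        break                                  # greedy w.r.t. its own values: done
    pi = new_pi
print("policy:", pi, "values:", V)
```

At termination the policy is greedy with respect to its own value function, which is exactly the Bellman optimality condition.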
## Policy Iteration Algorithm

- Start with an initial policy π_0 (e.g., random).
- Repeat:
  - Compute V^π, using policy evaluation.
  - Compute a new policy π' that is greedy with respect to V^π.
- Terminate when V^π = V^π'.
## A 4x3 gridworld example

Problem description:

- 11 discrete states, 4 motion actions in each state; the agent starts in state S.
- Transitions are stochastic: the agent moves in the intended direction with probability 0.7, and in each of the other three directions with probability 0.1. If the direction is blocked, the agent stays in the same location.
- Reward is +1 in the top-right state, -10 in the state directly below it, 0 elsewhere.
- The episode terminates when the agent reaches the +1 or -10 state.
- Discount factor γ = 0.99.
## Policy Iteration (1)

State values after evaluating the initial policy (▓ marks the wall cell):

| -0.21 | -0.22 | -0.24 | **+1** |
| -0.22 | ▓ | -1.54 | **-10** |
| -0.35 | -1.27 | -2.30 | -8.93 |
## Policy Iteration (2)

| 0.37 | 0.41 | 0.76 | **+1** |
| 0.36 | ▓ | -0.51 | **-10** |
| 0.35 | 0.31 | 0.05 | -1.20 |
## Policy Iteration (3)

| 0.78 | 0.80 | 0.81 | **+1** |
| 0.77 | ▓ | -0.41 | **-10** |
| 0.75 | 0.70 | 0.39 | -0.90 |
## Change Rewards

With the penalty state changed from -10 to -500:

| -2.60 | -3.06 | -6.37 | **+1** |
| -2.61 | ▓ | -61.9 | **-500** |
| -2.85 | -4.82 | -19.1 | -78.2 |
## Optimal policies and optimal value functions

- The optimal value function V* is defined as the best value that can be achieved at any state:

  V*(s) = max_π V^π(s)

- In a finite MDP, there exists a unique optimal value function (shown by Bellman, 1957).
- Any policy that achieves the optimal value function is called an optimal policy.
## Optimal policies in the gridworld example

- Optimal state values give information about the shortest path to the goal.
- One of the deterministic optimal policies yields the values shown below.
- There can be an infinite number of optimal policies (think stochastic policies).

| 0.78 | 0.80 | 0.81 | **+1** |
| 0.77 | ▓ | -0.41 | **-10** |
| 0.75 | 0.70 | 0.39 | -0.90 |
## Complexity of policy iteration

Repeat two basic steps: compute V^π, then compute a new policy π'.

1. Compute V^π, using policy evaluation.
   Cost per iteration: O(|S|³)
2. Compute a new policy π' that is greedy with respect to V^π.
   Cost per iteration: O(|S|² |A|)

Repeat for how many iterations? At most |A|^|S|.

This can get very expensive when there are many states!
## Bellman Optimality Equation for V*

The value of a state under the optimal policy must be equal to the expected return for the best action in that state:

V*(s) = max_{a∈A} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
      = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') )

V* is the unique solution of this system of non-linear equations.

If we know V* (and r, T, γ), then we can compute π* easily:

π*(s) = argmax_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V*(s') )

One way to compute V* is through policy iteration. Can we compute V* directly (without computing π at every iteration)?
## Value Iteration Algorithm

Main idea: turn the Bellman optimality equation into an update rule (same as done in policy evaluation):

1. Start with an initial guess V_0 (e.g., all zeros).
2. On each iteration, update the value function estimate:

   V_k(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s, a, s') V_{k-1}(s') )

3. Stop when the maximum value change between iterations is below a threshold.

The algorithm converges (in the limit) to the true V*.
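The update rule above can be sketched as follows, on a small randomly generated MDP (illustrative numbers only):

```python
import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(nS), size=(nS, nA))  # T[s, a, s']
r = rng.standard_normal((nS, nA))              # r[s, a]

V = np.zeros(nS)                               # step 1: V_0 = 0
for k in range(10_000):
    V_new = (r + gamma * T @ V).max(axis=1)    # step 2: Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-10:      # step 3: Bellman residual is tiny
        break
    V = V_new

# Once V* is (approximately) known, extract a greedy optimal policy.
pi_star = (r + gamma * T @ V).argmax(axis=1)
print("V* ~", V, " pi* =", pi_star)
```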
## Value Iteration (1)

| 0 | 0 | 0 | **+1** |
| 0 | ▓ | 0 | **-10** |
| 0 | 0 | 0 | 0 |
41
COMP-424: Artificial intelligence
Value Iteration (2)
0
0
0.69
+1
0
-0.99
-10
0
0
0
-0.99
Bellman residual:
|V
2
(s) - V
1
(s)| = 0.99
## Value Iteration (5)

| 0.48 | 0.70 | 0.76 | **+1** |
| 0.23 | ▓ | -0.55 | **-10** |
| 0 | -0.20 | -0.23 | -1.40 |

Bellman residual: max_s |V_5(s) − V_4(s)| = 0.23
## Value Iteration (20)

| 0.78 | 0.80 | 0.81 | **+1** |
| 0.77 | ▓ | -0.44 | **-10** |
| 0.75 | 0.69 | 0.37 | -0.92 |

Bellman residual: max_s |V_20(s) − V_19(s)| = 0.008
## Compare VI and PI

Value iteration (20 iterations):

| 0.78 | 0.80 | 0.81 | **+1** |
| 0.77 | ▓ | -0.44 | **-10** |
| 0.75 | 0.69 | 0.37 | -0.92 |

Policy iteration (3 iterations):

| 0.78 | 0.80 | 0.81 | **+1** |
| 0.77 | ▓ | -0.41 | **-10** |
| 0.75 | 0.69 | 0.39 | -0.90 |
## Another example: Four Rooms

- Four actions, which fail 30% of the time.
- No rewards until the goal is reached; γ = 0.9.
- Values propagate backwards from the goal.
## Complexity analysis

- Policy iteration:
  - Cost per iteration: O(|S|³ + |S|² |A|)
  - Number of iterations: at most |A|^|S|
  - Typically needs fewer iterations.
- Value iteration:
  - Cost per iteration: O(|S|² |A|)
  - Number of iterations: polynomial in 1 / (1 − γ)
  - Faster per iteration.
## A more efficient VI algorithm

- Instead of updating all states on every iteration, focus on important states.
  - Here, we can define "important" as visited often.
  - E.g., board positions that occur in every game, rather than just once in 100 games.
- Asynchronous dynamic programming algorithm:
  - Generate trajectories through the MDP.
  - Update states whenever they appear on such a trajectory.
  - This focuses the updates on states that are actually possible.
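A rough sketch of such trajectory-based (asynchronous) backups; the MDP is randomly generated, and rolling out the current greedy action is just one simple choice of behavior for generating trajectories:

```python
import numpy as np

nS, nA, gamma = 5, 2, 0.9
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(nS), size=(nS, nA))  # T[s, a, s']
r = rng.standard_normal((nS, nA))              # r[s, a]

V = np.zeros(nS)
for _ in range(2000):                  # one sampled trajectory per outer loop
    s = rng.integers(nS)               # random start state
    for _ in range(20):
        q = r[s] + gamma * T[s] @ V    # back up ONLY the visited state
        V[s] = q.max()
        a = q.argmax()                 # follow the current greedy action
        s = rng.choice(nS, p=T[s, a])  # sample the next state
print(V)
```

Because every state here is reachable under any behavior, the visited-state backups still drive V toward V*; in a real problem the point is that unvisited (impossible or rare) states simply never get updated.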
## Limitations of MDPs

1. Finding an optimal policy is polynomial in the number of states.
   - But the number of states is often astronomical.
   - Dynamic programming can solve problems with up to ~10⁷ states.
2. The state is sometimes not observable.
   - Some states may "look the same".
   - Sensor data may be noisy.
3. Value iteration and policy iteration assume the model (transitions and rewards) is known in advance.
## What you should know

- The definition of the MDP framework.
- Differences/similarities between MDPs and other AI approaches (e.g., general search, game playing, STRIPS planning).
- Basic MDP algorithms (policy evaluation, policy iteration, value iteration) and their properties.
- Function approximation (why, what, how).