COMP 424: Artificial Intelligence
Lecture 20: Markov Decision Processes
Instructors:
Joelle Pineau (jpineau@cs.mcgill.ca)
Sylvie Ong (song@cs.mcgill.ca)
Class web page: www.cs.mcgill.ca/~jpineau/comp424
Joelle Pineau
2
COMP424: Artificial intelligence
Outline
• Markov chains
• Markov Decision Processes
• Policies and value functions
  – Policy evaluation
  – Policy improvement
• Computing optimal value functions for MDPs
  – Policy iteration algorithm
  – Value iteration algorithm
• Function approximation
Sequential decision-making
• Utility theory provides a foundation for one-shot decisions.
  – If more than one decision has to be taken, reasoning about all of them jointly is in general very expensive.
• Agents need to be able to make decisions in a repeated interaction with the environment over time.
• Markov Decision Processes (MDPs) provide a framework for modeling sequential decision-making.
Example: Simplified chutes and ladders
• How the game works:
  – Start at state 1.
  – Roll a die, then move a number of positions given by its value.
  – If you land on square 5, you are teleported to 8.
  – When you get to 12, you win!
• Note that there is no skill or decision involved (yet).
Markov Chain Example
• What are the states? What is the initial state?
• What are the actions?
• What is the goal?
• If the agent is in state s_t at time t, the state at time t+1, s_{t+1}, is determined only by the die roll at time t.
  – Note that there is a discrete clock pacing the interaction of the agent with the environment: t = 0, 1, 2, …
Key assumption
• The probability of the next state, s_{t+1}, does not depend on how the agent got to the current state s_t.
• This is called the Markov property.
• So our game is completely described by the probability distribution of the next state given the current state.
Markov Chain Definition
• Set of states S
• Transition probabilities:  T: S × S → [0, 1]
    T(s, s') = P(s_{t+1} = s' | s_t = s)
• Initial state distribution:  P_0: S → [0, 1]
    P_0(s) = P(s_0 = s)
Things that can be computed
• What is the expected number of time steps (die rolls) to the finish line?
• What is the expected number of time steps until we reach a given state for the first time?
• What is the probability of being in a given state s at time t?
• After t time steps, what is the probability that we have ever been in a given state s?
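The first of these quantities is easy to estimate by Monte Carlo simulation of the chain. A minimal sketch for the simplified chutes-and-ladders chain above (it assumes a standard six-sided die and that overshooting square 12 also counts as finishing, which the slides do not specify):

```python
import random

def play_once(rng):
    """Simulate one game of the simplified chutes-and-ladders chain."""
    state, rolls = 1, 0
    while state < 12:
        state += rng.randint(1, 6)   # the die roll alone determines the next state
        if state == 5:               # ladder: landing on square 5 teleports to 8
            state = 8
        state = min(state, 12)       # assumption: overshooting 12 still wins
        rolls += 1
    return rolls

rng = random.Random(0)
n = 100_000
avg = sum(play_once(rng) for _ in range(n)) / n
print(f"Estimated expected number of rolls: {avg:.2f}")
```

Because there are no actions, a single random walk per game is enough; averaging over many games estimates the expected number of rolls.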
Example: Decision Making
• Suppose that we played the game with two dice.
• You roll both dice and then have a choice:
  – Take the roll from the first die.
  – Take the roll from the second die.
  – Take the sum of the two rolls.
• The goal is to finish the game as quickly as possible.
• What are we missing?
Sequential Decision Problem
• At each time step t, the agent is in some state s_t.
• It chooses an action a_t, and as a result it receives a numerical reward r_{t+1} and observes the new state s_{t+1}.
• Similar to a Markov chain, but there are also actions and rewards.
Markov Decision Processes (MDPs)
• Set of states S
• Set of actions A
• Reward function  r: S × A → ℝ
  – r(s,a) is the short-term utility of the action.
• Transition model (dynamics):  T: S × A × S → [0, 1]
  – T(s,a,s') = P(s_{t+1} = s' | s_t = s, a_t = a) is the probability of going from s to s' under action a.
• Discount factor γ (between 0 and 1, usually close to 1).
The discount factor
• Two interpretations:
  – At each time step, there is a (1 − γ) chance that the agent dies, and does not receive rewards afterwards.
  – Inflation rate: receiving an amount of money in a year is worth less than receiving it today.
Applications of MDPs
• AI / Computer Science:
  – Robotic control
  – Air campaign planning
  – Elevator control
  – Computation scheduling
  – Control and automation
  – Spoken dialogue management
  – Cellular channel allocation
  – Football play selection
Applications of MDPs
• Economics / Operations Research:
  – Inventory management
  – Fleet maintenance
  – Road maintenance
  – Packet retransmission
  – Nuclear plant management
• Agriculture:
  – Herd management
  – Fish stock management
Planning in MDPs
• The goal of an agent in an MDP is to be rational: maximize its expected utility (i.e. respect the MEU principle).
• Maximizing the immediate utility (defined by the immediate reward) is not sufficient.
  – E.g. the agent might pick an action that gives instant gratification, even if it later makes it "die".
• The goal is to maximize long-term utility, also called return.
  – The return is defined as an additive function of all rewards received by the agent.
Returns
• The return R_t for a trajectory, starting from step t, can be defined as follows.
• Episodic tasks (e.g. games, trips through a maze, etc.):
    R_t = r_{t+1} + r_{t+2} + … + r_T
  where T is the time when a terminal state is reached.
• Continuing tasks (e.g. tasks which may go on forever):
    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=1}^{∞} γ^{k−1} r_{t+k}
  A discount factor γ < 1 ensures that the return is finite if rewards are bounded.
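The finiteness claim is easy to check numerically: for a constant reward of 1 per step, the geometric series above sums to 1 / (1 − γ). A quick sketch of the discounted-return formula:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k >= 1 of gamma^(k-1) * r_{t+k}, for a finite reward list."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

gamma = 0.9
rewards = [1.0] * 500            # constant reward of 1 per step, truncated at 500 steps
R = discounted_return(rewards, gamma)
print(R)                         # approaches 1 / (1 - gamma) = 10
```

With γ = 0.9 the tail beyond 500 steps is negligibly small, so the truncated sum is already essentially 10.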
Example: MountainCar
• States: position and velocity.
• Actions: accelerate forward, accelerate backward, coast.
• Goal: get the car to the top of the hill as quickly as possible.
• Reward: −1 for every time step, until the car reaches the top (then 0).
  (Alternatively: reward = 1 at the top, 0 otherwise, with γ < 1.)
Policies
• The goal of the agent is to find a way of behaving, called a policy (similar to a universal plan, or a strategy), that maximizes the expected value of R_t, ∀t.
• Two types of policies:
  – Deterministic policy: in each state the agent chooses a unique action.
      π: S → A,  π(s) = a
  – Stochastic policy: in the same state, the agent can "roll a die" and choose different actions.
      π: S × A → [0, 1],  π(s,a) = P(a_t = a | s_t = s)
• Once a policy is fixed, the MDP becomes a Markov chain with rewards.
Value Functions
• Because we want to find a policy which maximizes the expected return, it is a good idea to estimate the expected return. Why exactly?
• Because then we can search through the space of policies for one that is good.
• Value functions represent the expected return, for every state, given a certain policy.
• Computing value functions is an intermediate step towards computing good policies.
State Value Function
• The value function of a policy π is a function  V^π: S → ℝ.
• The value of state s under policy π is the expected return if the agent starts from state s and picks actions according to π:
    V^π(s) = E_π[ R_t | s_t = s ]
  – For a finite state space, we can represent this as an array, with one entry for every state.
  – We will talk later about methods used for very large or continuous state spaces.
Bellman Equations for Evaluating a Policy
• Recall our definition of the return:
    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
        = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + … )
        = r_{t+1} + γ R_{t+1}
• Based on this observation, V^π(s) becomes:
    V^π(s) = E_π[ R_t | s_t = s ] = E_π[ r_{t+1} + γ R_{t+1} | s_t = s ]
• By writing the expectation explicitly, we get:
  – Deterministic policy:
      V^π(s) = r(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V^π(s')
  – Stochastic policy:
      V^π(s) = Σ_{a∈A} π(s,a) ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') )
This is a system of linear equations (one per state) with a unique solution V^π.
Policy evaluation in matrix form
• Bellman's equation in matrix form:
    V^π = r^π + γ T^π V^π
• What are V^π, r^π and T^π?
  – V^π is a vector containing the value of each state under policy π.
  – r^π is a vector containing the immediate reward at each state: r(s, π(s)).
  – T^π is a matrix containing the transition probabilities between states: T(s, π(s), s').
• In some cases, we can solve this exactly:
    V^π = (I − γ T^π)^{−1} r^π
• Can we do this iteratively?
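The exact matrix solve is one line of NumPy. A sketch on an invented two-state chain (the entries of T_pi and r_pi are made-up numbers, not taken from any example in the lecture):

```python
import numpy as np

# Illustrative two-state chain induced by some fixed policy pi
T_pi = np.array([[0.9, 0.1],
                 [0.4, 0.6]])      # T_pi[s, s'] = T(s, pi(s), s')
r_pi = np.array([1.0, 0.0])        # r_pi[s]    = r(s, pi(s))
gamma = 0.95

# V_pi = (I - gamma * T_pi)^(-1) r_pi, solved without forming the inverse explicitly
V_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, r_pi)
print(V_pi)

# Sanity check: the solution satisfies Bellman's equation V = r + gamma * T V
assert np.allclose(V_pi, r_pi + gamma * T_pi @ V_pi)
```

Using `np.linalg.solve` rather than computing the inverse is the standard numerically stable choice; the O(|S|³) cost of this solve is what motivates the iterative alternative.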
Iterative Policy Evaluation
Main idea: turn the Bellman equations into update rules.
1. Start with some initial guess V_0.
2. During every iteration k, update the value function for all states:
     V_{k+1}(s) ← r(s, π(s)) + γ Σ_{s'∈S} T(s, π(s), s') V_k(s')
3. Stop when the maximum change between two iterations is smaller than a desired threshold (i.e. the values stop changing).
This is a bootstrapping idea: the value of one state is updated based on the current estimates of the values of successor states.
This is a dynamic programming algorithm. It's guaranteed to converge!
Convergence of Iterative Policy Evaluation
• Consider the absolute error in our estimate V_{k+1}(s):
    | V_{k+1}(s) − V^π(s) |
      = | [ r(s,π(s)) + γ Σ_{s'∈S} T(s,π(s),s') V_k(s') ] − [ r(s,π(s)) + γ Σ_{s'∈S} T(s,π(s),s') V^π(s') ] |
      = γ | Σ_{s'∈S} T(s,π(s),s') ( V_k(s') − V^π(s') ) |
      ≤ γ Σ_{s'∈S} T(s,π(s),s') | V_k(s') − V^π(s') |
      ≤ γ max_{s'} | V_k(s') − V^π(s') |
(*** Note: in these equations, r(s,a) denotes the reward, not the return.)
Convergence of Iterative Policy Evaluation
• Let ε_k be the worst error at iteration k:
    ε_k = max_s | V_k(s) − V^π(s) |
• From the previous calculation, we have:
    ε_{k+1} ≤ γ ε_k
• Because γ < 1, this means that ε_k ≤ γ^k ε_0, so the error goes to 0 as k → ∞.
• We say that the error contracts, and the contraction factor is γ.
Searching for a Good Policy
• We say that π ≥ π' if V^π(s) ≥ V^{π'}(s), ∀s ∈ S.
• This gives a partial ordering of policies.
  – If one policy is better at one state but worse at another state, the two policies are not comparable.
• Since we know how to compute values for policies, we can search through the space of policies.
• Local search seems like a good fit.
Policy Improvement
• Recall Bellman's equation:
    V^π(s) = Σ_{a∈A} π(s,a) ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') )
• Suppose that there is some action a*, such that:
    r(s,a*) + γ Σ_{s'∈S} T(s,a*,s') V^π(s') > V^π(s)
• Then if we set π(s,a*) ← 1, the value of state s will increase.
  – Because we replaced each element in the sum that defines V^π(s) with a bigger value.
• The values of states that can transition to s increase as well.
  – The values of all other states stay the same.
• So the new policy using a* is better than the initial policy π.
Policy Iteration
• More generally, we can change the policy π to a new policy π' which is greedy with respect to the computed values V^π:
    π'(s) = argmax_{a∈A} ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') )
• This gives us a local search through the space of policies.
• We stop when the values of two successive policies are identical.
• Because we only look for deterministic policies, and there is a finite number of them, the search is guaranteed to terminate.
Policy Iteration Algorithm
• Start with an initial policy π_0 (e.g. random).
• Repeat:
  – Compute V^π, using policy evaluation.
  – Compute a new policy π' that is greedy with respect to V^π.
• Terminate when V^π = V^{π'}.
A 4×3 gridworld example
• Problem description:
  – 11 discrete states (one square of the 4×3 grid is blocked), 4 motion actions in each state.
  – Transitions are stochastic: the agent moves in the intended direction with probability 0.7, and in each of the other three directions with probability 0.1. If the direction is blocked, the agent stays in the same location.
  – The agent starts in the square marked S.
  – Reward is +1 in the top-right state, −10 in the state directly below it, 0 elsewhere.
  – The episode terminates when the agent reaches the +1 or −10 state.
  – Discount factor γ = 0.99.
Policy Iteration (1)

Policy:
  ↑   ↑   ↑   +1
  ↑       ↑  −10
  ↑   ↑   ↑    ↑

State values:
  −0.21  −0.22  −0.24     +1
  −0.22         −1.54    −10
  −0.35  −1.27  −2.30  −8.93
Policy Iteration (2)

Policy:
  ↑   ←   →   +1
  ↑       ↑  −10
  ↑   ←   ←    ←

State values:
  0.37  0.41   0.76    +1
  0.36        −0.51   −10
  0.35  0.31  −0.05  −1.20
Policy Iteration (3)

Policy:
  →   →   →   +1
  ↑       ↑  −10
  ↑   ←   ←    ←

State values:
  0.78  0.80   0.81    +1
  0.77        −0.41   −10
  0.75  0.70   0.39  −0.90
Change Rewards

Policy:
  →   ←   →    +1
  →       ↑  −500
  ↑   ←   ←     ←

State values:
  −2.60   −3.06   −6.37    +1
  −2.61          −61.9   −500
  −2.85   −4.82  −19.1  −78.2
Optimal policies and optimal value functions
• The optimal value function V* is defined as the best value that can be achieved at any state:
    V*(s) = max_π V^π(s)
• In a finite MDP, there exists a unique optimal value function (shown by Bellman, 1957).
• Any policy that achieves the optimal value function is called an optimal policy.
Optimal policies in the gridworld example
• Optimal state values give information about the shortest path to the goal.
• One of the deterministic optimal policies is shown below.
• There can be an infinite number of optimal policies (think stochastic policies).

Policy:
  →   →   →   +1
  ↑       ↑  −10
  ↑   ←   ←    ←

State values:
  0.78  0.80   0.81    +1
  0.77        −0.41   −10
  0.75  0.70   0.39  −0.90
Complexity of policy iteration
Repeat two basic steps: compute V^π + compute a new policy π'.
1. Compute V^π, using policy evaluation.
     Per iteration: O(|S|³)
2. Compute a new policy π' that is greedy with respect to V^π.
     Per iteration: O(|S|²|A|)
Repeat for how many iterations? At most |A|^|S| (the number of deterministic policies).
Can get very expensive when there are many states!
Bellman Optimality Equation for V*
• The value of a state under the optimal policy must be equal to the expected return for the best action in the state:
    V*(s) = max_{a∈A} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
          = max_{a∈A} ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') )
  V* is the unique solution of this system of nonlinear equations.
• If we know V* (and r, T, γ), then we can compute π* easily:
    π*(s) = argmax_{a∈A} ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') )
• One way to compute V* is through policy iteration.
Can we compute V* directly (without computing π at every iteration)?
Value Iteration Algorithm
Main idea: turn the Bellman optimality equation into an update rule (same as done in policy evaluation):
1. Start with an arbitrary initial approximation V_0.
2. On each iteration, update the value function estimate:
     V_k(s) = max_{a∈A} ( r(s,a) + γ Σ_{s'∈S} T(s,a,s') V_{k−1}(s') )
3. Stop when the maximum value change between iterations is below a threshold.
The algorithm converges (in the limit) to the true V*.
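Turned into code, value iteration is a short loop. A sketch on an invented 3-state, 2-action MDP (all transition and reward numbers are illustrative, not from the gridworld example):

```python
import numpy as np

# Illustrative MDP: T[a, s, s'] = T(s, a, s'), r[s, a] = r(s, a)
T = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.1, 0.9, 0.0],
               [0.0, 0.1, 0.9],
               [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])      # reward only in state 2
gamma = 0.9

V = np.zeros(3)                                   # step 1: arbitrary initial approximation V_0
while True:
    Q = r + gamma * np.tensordot(T, V, axes=1).T  # Q[s, a] = r(s,a) + gamma * sum_s' T(s,a,s') V(s')
    V_new = Q.max(axis=1)                         # step 2: Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-10:         # step 3: stop when the Bellman residual is tiny
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)                        # a greedy policy extracted from V*
print(np.round(V, 3), pi_star)
```

Unlike policy iteration, no policy is represented during the loop; a greedy policy is read off only once at the end.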
Value Iteration (1)

State values:
  0      0      0     +1
  0             0    −10
  0      0      0      0
Value Iteration (2)

State values:
  0      0      0.69    +1
  0            −0.99   −10
  0      0      0     −0.99

Bellman residual:  max_s |V_2(s) − V_1(s)| = 0.99
Value Iteration (5)

State values:
  0.48   0.70   0.76    +1
  0.23         −0.55   −10
  0     −0.20  −0.23  −1.40

Bellman residual:  max_s |V_5(s) − V_4(s)| = 0.23
Value Iteration (20)

State values:
  0.78   0.80   0.81    +1
  0.77         −0.44   −10
  0.75   0.69   0.37  −0.92

Bellman residual:  max_s |V_20(s) − V_19(s)| = 0.008
Compare VI and PI

Value Iteration (20 iterations):
  0.78   0.80   0.81    +1
  0.77         −0.44   −10
  0.75   0.69   0.37  −0.92

Policy Iteration (3 iterations):
  0.78   0.80   0.81    +1
  0.77         −0.41   −10
  0.75   0.69   0.39  −0.90
Another example: Four Rooms
• Four actions, which fail 30% of the time.
• No rewards until the goal is reached; γ = 0.9.
• Values propagate backwards from the goal.
Complexity analysis
• Policy iteration:
  – Per iteration: O(|S|³ + |S|²|A|)
  – Number of iterations: at most |A|^|S|
  Fewer iterations.
• Value iteration:
  – Per iteration: O(|S|²|A|)
  – Number of iterations: polynomial in 1 / (1 − γ)
  Faster per iteration.
A more efficient VI algorithm
• Instead of updating all states on every iteration, focus on important states.
• Here, we can define important as visited often.
  E.g., board positions that occur in every game, rather than just once in 100 games.
• Asynchronous dynamic programming algorithm:
  – Generate trajectories through the MDP.
  – Update states whenever they appear on such a trajectory.
• This focuses the updates on states that are actually possible.
Limitations of MDPs
1. Finding an optimal policy is polynomial in the number of states.
  – The number of states is often astronomical.
  – Dynamic programming can solve problems with up to ~10⁷ states.
2. The state is sometimes not observable.
  – Some states may "look the same".
  – Sensor data may be noisy.
3. Value iteration and policy iteration assume the model (transitions and rewards) is known in advance.
What you should know
• Definition of the MDP framework.
• Differences/similarities between MDPs and other AI approaches (e.g. general search, game playing, STRIPS planning).
• Basic MDP algorithms (policy evaluation, policy iteration, value iteration) and their properties.
• Function approximation (why, what, how).