Symbolic Dynamic Programming
Alan Fern *
* Based in part on slides by Craig Boutilier
Planning in Large State Space MDPs
You have learned algorithms for computing optimal policies
Value Iteration
Policy Iteration
These algorithms explicitly enumerate the state space
Often this is impractical
Simulation-based planning and RL allowed for approximate planning in large MDPs
Did not utilize an explicit model of the MDP. Only used a strong or
weak simulator.
How can we get exact solutions to enormous MDPs?
Structured Representations
Policy iteration and value iteration treat states as atomic
entities with no internal structure.
In most cases, states actually do have internal structure
E.g. described by a set of state variables, or objects with properties
and relationships
Humans exploit this structure to plan effectively
What if we had a compact, structured representation for a
large MDP and could efficiently plan with it?
Would allow for exact solutions to very large MDPs
A Planning Problem
Logical or Feature-based Problems
For most AI problems, states are not viewed as atomic entities.
They contain structure. For example, they are described by a set of boolean propositions/variables; |S| is then exponential in the number of propositions.
Basic policy and value iteration do nothing to
exploit the structure of the MDP when it is
available
(figure: a state s ∈ S described by boolean variables X1, X2, ..., Xn)
Solution?
Require structured representations in terms of propositions
compactly represent transition function
compactly represent reward function
compactly represent value functions and policies
Require structured computation
perform steps of PI or VI directly on structured
representations
can avoid the need to enumerate state space
We start by representing the transition
structure as dynamic Bayesian networks
Propositional Representations
States decomposable into state variables (we will assume boolean variables)
Structured representations are the norm in AI
Decision diagrams, Bayesian networks, etc.
Describe how actions affect/depend on features
Natural, concise, can be exploited computationally
Same ideas can be used for MDPs
Robot Domain as Propositional MDP
Propositional variables for single user version
Loc (robot's location): Office, Entrance
T (lab is tidy): boolean
CR (coffee request outstanding): boolean
RHC (robot holding coffee): boolean
RHM (robot holding mail): boolean
M (mail waiting for pickup): boolean
Actions/Events
move to an adjacent location, pickup mail, get coffee, deliver
mail, deliver coffee, tidy lab
mail arrival, coffee request issued, lab gets messy
Rewards
rewarded for tidy lab, satisfying a coffee request, delivering mail
(or penalized for their negation)
State Space
State of MDP: assignment to these six
variables
64 states
grows exponentially with number of variables
Transition matrices
4032 parameters required per matrix (64 rows, each with 63 free entries)
one matrix per action (6 or 7 or more actions)
Reward function
64 reward values needed
Factored state and action descriptions will
break this exponential dependence (generally)
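To make the counting concrete, here is a small sketch (my own illustration, not from the slides) that enumerates the flat state space:

```python
from itertools import product

# Six boolean-valued variables (Loc has two values, so it counts as one bit).
variables = ["Loc", "T", "CR", "RHC", "RHM", "M"]
states = list(product([False, True], repeat=len(variables)))
print(len(states))   # 2**6 = 64 states
print(64 * 63)       # 4032 free parameters per 64x64 stochastic matrix
```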
Dynamic Bayesian Networks (DBNs)
Bayesian networks (BNs) a common
representation for probability distributions
A graph (DAG) represents conditional
independence
Conditional probability tables (CPTs) quantify local
probability distributions
Dynamic Bayes net action representation
one Bayes net for each action a, representing the set of conditional distributions Pr(S_{t+1} | A_t, S_t)
each state variable occurs at time t and t+1
dependence of t+1 variables on t variables
depicted by directed arcs
DBN Representation: deliver coffee

(figure: two-layer DBN; each of T, L, CR, RHC, RHM, M appears at times t and t+1, with arcs from the time-t parents of each time-(t+1) variable, e.g. L_t, CR_t, RHC_t → CR_{t+1} and T_t → T_{t+1})

Pr(CR_{t+1} | L_t, CR_t, RHC_t):

  L  CR  RHC | CR(t+1)=T  CR(t+1)=F
  O  T   T   |   0.2        0.8
  E  T   T   |   1.0        0.0
  O  F   T   |   0.1        0.9
  E  F   T   |   0.1        0.9
  O  T   F   |   1.0        0.0
  E  T   F   |   1.0        0.0
  O  F   F   |   0.1        0.9
  E  F   F   |   0.1        0.9

Pr(T_{t+1} | T_t):

  T | T(t+1)=T  T(t+1)=F
  T |   0.91      0.09
  F |   0.0       1.0

Pr(RHM_{t+1} | RHM_t):

  RHM | RHM(t+1)=T  RHM(t+1)=F
  T   |   1.0         0.0
  F   |   0.0         1.0
Benefits of DBN Representation

  Pr(S_{t+1} | S_t)
    = Pr(RHM_{t+1}, M_{t+1}, T_{t+1}, L_{t+1}, CR_{t+1}, RHC_{t+1} | RHM_t, M_t, T_t, L_t, CR_t, RHC_t)
    = Pr(RHM_{t+1} | RHM_t) · Pr(M_{t+1} | M_t) · Pr(T_{t+1} | T_t) · Pr(L_{t+1} | L_t) · Pr(CR_{t+1} | CR_t, RHC_t, L_t) · Pr(RHC_{t+1} | RHC_t, L_t)

• Only 20 parameters vs. 4032 for the full matrix
• Removes the global exponential dependence

(figure: the full 64 × 64 transition matrix over states s_1, ..., s_64, contrasted with the two-layer DBN above)
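As a sketch of how the factored form is used (function names are illustrative; the CPT numbers come from the tables above):

```python
def pr_cr_next_true(loc, cr, rhc):
    """Pr(CR'=true | L, CR, RHC), read off the CPT above."""
    if not cr:
        return 0.1                      # a new coffee request arrives w.p. 0.1
    return 0.2 if (rhc and loc == "O") else 1.0

def pr_t_next_true(t):
    """Pr(T'=true | T): a tidy lab stays tidy w.p. 0.91."""
    return 0.91 if t else 0.0

# Pr(s'|s) for a full state is just the product of the per-variable terms:
# Pr(RHM'|RHM) * Pr(M'|M) * Pr(T'|T) * Pr(L'|L) * Pr(CR'|L,CR,RHC) * Pr(RHC'|L,RHC)
print(pr_cr_next_true("O", True, True))   # 0.2
```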
Structure in CPTs
So far we have represented each CPT as a table
of size exponential in the number of parents
Notice that there’s regularity in CPTs
e.g., Pr(CR_{t+1} | L_t, CR_t, RHC_t) has many similar entries
Compact function representations for CPTs can
be used to great effect
decision trees
algebraic decision diagrams (ADDs/BDDs)
Here we show examples of decision trees (DTs)
Action Representation – DBN/DT

(figure: the deliver-coffee DBN, with the CPT for CR_{t+1} shown as a decision tree)

Decision Tree (DT) for Pr(CR_{t+1} = true | L_t, CR_t, RHC_t):

CR(t)?
  f: 0.1
  t: RHC(t)?
       f: 1.0
       t: L(t)?
            o: 0.2
            e: 1.0

Leaves of the DT give Pr(CR_{t+1} = true | L_t, CR_t, RHC_t)
DTs can often represent conditional probabilities much more compactly than a full conditional probability table
e.g. If CR(t) = true & RHC(t) = false then CR(t+1) = true with prob. 1
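As a concrete illustration, here is a minimal sketch (my own encoding, not from the slides) of a decision-tree CPT and a lookup routine; the node and label names are assumptions:

```python
# A decision-tree CPT as nested tuples: an internal node is
# (variable, {branch_value: subtree}); a leaf is Pr(CR'=true).
dt_cr = ("CR", {
    False: 0.1,                              # no outstanding request
    True: ("RHC", {
        False: 1.0,                          # request stays outstanding
        True: ("L", {"O": 0.2, "E": 1.0}),   # delivery only possible at the office
    }),
})

def dt_lookup(tree, state):
    """Walk the tree using the state's variable assignments."""
    while isinstance(tree, tuple):
        var, children = tree
        tree = children[state[var]]
    return tree

print(dt_lookup(dt_cr, {"CR": True, "RHC": True, "L": "O"}))  # 0.2
```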
Reward Representation

Rewards represented with DTs in a similar fashion
Would require a vector of size 2^n for explicit representation

CR?
  t: -100
  f: M?
       t: -10
       f: T?
            t: 1
            f: -1

Small reward for satisfying all of these conditions
High cost for unsatisfied coffee request
High, but lower, cost for undelivered mail
Cost for lab being untidy
Structured Computation
Given our compact decision tree (DBN)
representation, can we solve MDP without
explicit state space enumeration?
Can we avoid O(|S|) computations by exploiting regularities made explicit by the representation?
We will study a general approach for doing this
called structured dynamic programming
Structured Dynamic Programming
We now consider how to perform dynamic programming
techniques such as VI and PI using the problem structure
VI and PI are based on a few basic operations.
Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions
The approach is very general and can be applied to other
representations (e.g. algebraic decision diagrams, situation
calculus) and other problems after the main idea is
understood
We will focus on VI here, but the paper also describes a
version of modified policy iteration
Recall Tree-Based Representations

DBN for action A over variables X, Y, Z (each occurring at times t and t+1), with each CPT given as a decision tree whose leaves are the probability that the variable is true at t+1:

X':  X?
       t: 1.0
       f: 0.0

Y':  X?
       t: 0.9
       f: Y?
            t: 1.0
            f: 0.0

Z':  Z?
       t: 1.0
       f: Y?
            t: 0.9
            f: 0.0

Reward Function R:
     Z?
       t: 10
       f: 0

Note: we are leaving off time subscripts for readability, writing X, Y, Z for time t and X', Y', Z' for time t+1.
e.g. If X(t)=true then Y(t+1)=true w/ prob 0.9
e.g. If X(t)=false & Y(t)=true then Y(t+1)=true w/ prob 1
Recall that each action of the MDP has its own DBN.
Structured Dynamic Programming
Value functions and policies can also have tree
representations
Often much more compact representations than tables
Our Goal:
compute the tree representations of
policy and value function given the tree
representations of the transitions and rewards
Recall Value Iteration

Bellman Backup:

  $V_0(s) = 0$
  $V_{k+1}(s) = B[V_k](s)$

where

  $Q(s,a,V) = R(s) + \beta \sum_{s'} T(s,a,s') \, V(s')$
  $B[V](s) = \max_a Q(s,a,V)$

Suppose that V is compactly represented as a tree.
1. Show how to compute compact trees for Q(s,a_1,V), ..., Q(s,a_n,V)
2. Use a max operation on the Q-trees (returns a single tree)
The MAX Trees Operation

Two trees over the same state space (leaves give the value of each region):

Tree 1:  X?                 Tree 2:  X?
           t: 0.9                      t: 1.0
           f: Y?                       f: 0.0
                t: 1.0
                f: 0.0

A tree partitions the state space, assigning a value to each region; the state-space max of the two trees assigns to each state the larger of its two values.
In general, how can we compute the tree representing the max?

We can simply append one tree to the leaves of the other. The result makes all the distinctions that either tree makes and carries a pair of values at each leaf; the max operation is then taken at the leaves:

X?
  t: X?
       t: (0.9, 1.0) -> 1.0
       f: (0.9, 0.0) -> 0.9
  f: Y?
       t: X?
            t: (1.0, 1.0) -> 1.0
            f: (1.0, 0.0) -> 1.0
       f: X?
            t: (0.0, 1.0) -> 1.0
            f: (0.0, 0.0) -> 0.0

The resulting tree may have unreachable leaves (here, the inner X tests repeat the root's test, so one branch of each can never be taken). We can simplify the tree by removing such paths:

X?
  t: 1.0
  f: Y?
       t: 1.0
       f: 0.0
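The following is a minimal sketch of this operation, under my own tree encoding (a leaf is a float; an internal node is a tuple (variable, true_subtree, false_subtree)); it is not the paper's implementation. Passing a different leaf operation gives the other binary operations (e.g. addition) discussed next.

```python
# Combine two value trees leafwise with `op`, pruning branches that the
# path context has already decided (this performs append + simplify).
def combine(t1, t2, op, ctx=None):
    ctx = ctx or {}
    if isinstance(t1, tuple):                    # descend t1 first
        var, hi, lo = t1
        if var in ctx:                           # test already decided: prune
            return combine(hi if ctx[var] else lo, t2, op, ctx)
        hi_r = combine(hi, t2, op, {**ctx, var: True})
        lo_r = combine(lo, t2, op, {**ctx, var: False})
        return hi_r if hi_r == lo_r else (var, hi_r, lo_r)
    if isinstance(t2, tuple):                    # t1 is a leaf: append t2
        var, hi, lo = t2
        if var in ctx:
            return combine(t1, hi if ctx[var] else lo, op, ctx)
        hi_r = combine(t1, hi, op, {**ctx, var: True})
        lo_r = combine(t1, lo, op, {**ctx, var: False})
        return hi_r if hi_r == lo_r else (var, hi_r, lo_r)
    return op(t1, t2)                            # two leaves: apply max, +, ...

tree1 = ("X", 0.9, ("Y", 1.0, 0.0))
tree2 = ("X", 1.0, 0.0)
print(combine(tree1, tree2, max))   # ('X', 1.0, ('Y', 1.0, 0.0))
```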
BINARY OPERATIONS
(other binary operations, such as addition, work just like max: append one tree to the other and apply the operation at the leaves)
MARGINALIZATION

Compute the diagram representing $G(B) = \sum_{A} F(A,B)$
There are libraries for doing this.
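For intuition only, here is a tabular sketch of the same operation (factor values are illustrative; a real SDP system would sum out A directly on the decision diagram):

```python
# G(B) = sum over A of F(A, B), on an explicit table of a toy factor.
F = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
G = {}
for (a, b), value in F.items():
    G[b] = G.get(b, 0.0) + value   # sum out A
print(G)                           # {0: 0.5, 1: 0.5}
```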
Structured Bellman Backup

Structured Bellman Backup:

  $Q(s,a,V) = R(s) + \beta \sum_{s'} T(s,a,s') \, V(s')$
  $B[V](s) = \max_a Q(s,a,V)$

• So if we can compute the trees for Q(s,a_1,V), ..., Q(s,a_n,V), we can compute the tree representing B[V] using max
• This allows us to perform structured value iteration!
• The hope is that if the tree for V is small, then so is the tree for B[V].

How can we compute a tree for Q(s,a,V) for a given a?
Computing Q(s,a,V) Trees

  $Q(s,a,V) = R(s) + \beta \sum_{s'} T(s,a,s') \, V(s')$

R(s) is a tree provided by the problem definition; the sum is FVTree(s,a,V), the future value tree.

• Given a tree for V and trees for T(s,a,s') via the DBN, compute a tree FVTree(s,a,V) representing the expectation
• Given trees for R(s) and FVTree(s,a,V), we can perform an addition operation on trees to get the tree for Q(s,a,V)
• Addition on trees is almost identical to the MAX operation (i.e. instead of taking MAX at the leaves, simply add the numbers)

So all we need to do is compute FVTree(s,a,V). How?
Generic Computation of FVTree(s,a,V)

With s = (x,y,z) and s' = (x',y',z'):

  $\mathrm{FVTree}(s,a,V) = \sum_{s'} T(s,a,s') \cdot V(s')$
  $= \sum_{x'} \sum_{y'} \sum_{z'} \Pr(x' \mid s,a) \cdot \Pr(y' \mid s,a) \cdot \Pr(z' \mid s,a) \cdot V(s')$
  $= \sum_{x'} \Pr(x' \mid s,a) \sum_{y'} \Pr(y' \mid s,a) \sum_{z'} \Pr(z' \mid s,a) \cdot V(s')$
SDP-Specific Computation

  $\mathrm{FVTree}(s,a,V) = \sum_{s'} T(s,a,s') \cdot V(s')$

(figure: the FVTree partition on the left; one of its blocks transitions to the blocks of the V(s') partition, with values v1, v2, v3, with probabilities p1, p2, p3, so its future value is FV1 = p1·v1 + p2·v2 + p3·v3)

• If two states have the same probability of transitioning to each partition of V (under action a), then they will have the same future value
• So we want those states to be in the same partition of FVTree, labeled by that future value

So how do we compute this partition and the values?
Computing FVTree(s,a,V)

Procedure:
1. Let T1 be the tree over variables VARS representing V(s')
2. Construct a new tree T2 such that states belonging to the same leaf of T2 assign the same distribution over the leaves of T1 at t+1
   • I.e. each state at time t assigns a distribution over VARS at t+1, and this distribution implies a distribution over the leaves of T1 at t+1
   • Can do this by "composing" the trees for VARS
3. Assign a future value to each leaf of T2 (a small sketch of this expectation follows)
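For step 3, the expected future value at a leaf can be computed by recursing over V's own tree, splitting on the probability that each tested primed variable is true. This is a sketch under my encoding from earlier, and it assumes the DBN makes the primed variables independent given the time-t state (no synchronic arcs, as in these examples):

```python
# Expected future value of a time-t state: recurse over V's tree, weighting
# each branch by the probability that the tested primed variable is true.
def expected_value(v_tree, pr_primed):
    """pr_primed maps each primed variable name to Pr(var'=true | s, a)."""
    if not isinstance(v_tree, tuple):
        return v_tree                            # leaf: a concrete value
    var, hi, lo = v_tree
    p = pr_primed[var]
    return p * expected_value(hi, pr_primed) + (1 - p) * expected_value(lo, pr_primed)
```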
A Simple Example

DBN for action A, with the probability that each variable is true at t+1 given as a decision tree:

X':  X?  t: 1.0,  f: 0.0
Y':  X?  t: 0.9,  f: (Y?  t: 1.0,  f: 0.0)
Z':  Z?  t: 1.0,  f: (Y?  t: 0.9,  f: 0.0)

Reward Function R:  Z?  t: 10,  f: 0

Notation: Z' = Z(t+1), Z = Z(t), and "Z' : p" means that Z(t+1) = true with probability p.
(This corresponds to S' = (X', Y', Z') and S = (X, Y, Z) in the definition of the Q-function)
Example: FVTree(s,A,V) when V(s') = R(s')

V(s') = R(s'):  Z'?  t: 10,  f: 0
Z' is the only relevant variable for V (so VARS = {z'})

Probability tree for z' (only y and z at time t affect z at t+1):
  z       : Pr(z') = 1.0
  ~z, y   : Pr(z') = 0.9
  ~z, ~y  : Pr(z') = 0.0

The partition for FVTree is the same as the tree for z'. Computing the expected future value at each leaf:
  z       : FV = 1.0 · 10 = 10
  ~z, y   : FV = 0.9 · 10 = 9
  ~z, ~y  : FV = 0.0 · 10 = 0
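Checking the example with the expected_value sketch above (tree encodings are mine):

```python
# V(s') = R(s') tests only Z'; each FVTree region fixes Pr(Z'=true | s).
v_tree = ("Z", 10.0, 0.0)
for region, p_z in [("z", 1.0), ("~z,y", 0.9), ("~z,~y", 0.0)]:
    print(region, expected_value(v_tree, {"Z": p_z}))   # 10.0, 9.0, 0.0
```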
Computing Q(s,a,V) Trees

  $Q(s,a,V) = R(s) + \beta \sum_{s'} T(s,a,s') \, V(s')$

Resulting Q(s,a,V) tree for V = R, with β = 0.9, using tree addition of R and β · FVTree:

  z       : 10 + 0.9 · 10 = 19
  ~z, y   : 0 + 0.9 · 9 = 8.1
  ~z, ~y  : 0 + 0.9 · 0 = 0.0
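The same numbers fall out of the combine sketch from the MAX slides, using addition at the leaves (again my own encoding):

```python
import operator

r_tree  = ("Z", 10.0, 0.0)                    # reward tree
fv_tree = ("Z", 10.0, ("Y", 9.0, 0.0))        # FVTree from the previous slide
scaled  = combine(fv_tree, 0.0, lambda v, _: 0.9 * v)  # multiply leaves by beta
q_tree  = combine(r_tree, scaled, operator.add)        # tree addition
print(q_tree)                                 # ('Z', 19.0, ('Y', 8.1, 0.0))
```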
Example: More Complex V

Original V(s') depends on y' and z' (so VARS = {y', z'}); call this tree T1:
  y'        : 8.1
  ~y', z'   : 19.0
  ~y', ~z'  : 0.0

Probability tree for y' (only x and y at time t affect y'):
  x       : Pr(y') = 0.9
  ~x, y   : Pr(y') = 1.0
  ~x, ~y  : Pr(y') = 0.0

• If ~x and y, then we always go to the 8.1 leaf (so this can be a leaf of FVTree)
• Otherwise the probability of the leaf we go to also depends on z': the distribution over the leaves of T1 depends on z at these leaves
Example: More Complex V

Composing the probability trees for y' (only x and y affect y') and z' (only y and z affect z') gives T2; each leaf specifies a distribution over y' and z':

X?
  t: Z?
       t: {Y': 0.9, Z': 1.0}
       f: Y?
            t: {Y': 0.9, Z': 0.9}
            f: {Y': 0.9, Z': 0.0}
  f: Y?
       t: {Y': 1.0}
       f: Z?
            t: {Y': 0.0, Z': 1.0}
            f: Y?
                 t: {Y': 0.0, Z': 0.9}
                 f: {Y': 0.0, Z': 0.0}

Can simplify this tree (the innermost Y test is made under ~y, so only its false branch is reachable) . . .
Example: More Complex V

Simplified tree gives the structure of FVTree:

X?
  t: Z?
       t: {Y': 0.9, Z': 1.0}
       f: Y?
            t: {Y': 0.9, Z': 0.9}
            f: {Y': 0.9, Z': 0.0}
  f: Y?
       t: {Y': 1.0}
       f: Z?
            t: {Y': 0.0, Z': 1.0}
            f: {Y': 0.0, Z': 0.0}

• Each leaf of the resulting tree gives a distribution over the leaves of the tree for V(s') (i.e. gives a distribution over Y' and Z', which gives a distribution over future values)
• This serves as the structure for the FVTree
Example: More Complex V

• Form FVTree by storing the expected future value (FV) at each leaf: FV = Pr(y') · 8.1 + Pr(~y')Pr(z') · 19.0 + Pr(~y')Pr(~z') · 0

FVTree:
  x, z        : FV = 8.1·0.9 + 19.0·(0.1·1.0) + 0·0         = 9.19
  x, ~z, y    : FV = 8.1·0.9 + 19.0·(0.1·0.9) + 0·(0.1·0.1) = 9.00
  x, ~z, ~y   : FV = 8.1·0.9 + 19.0·(0.1·0)   + 0·(0.1·1.0) = 7.29
  ~x, y       : FV = 8.1·1.0 + 19.0·0 + 0·0                 = 8.1
  ~x, ~y, z   : FV = 8.1·0 + 19.0·(1.0·1.0) + 0·(1.0·0)     = 19.0
  ~x, ~y, ~z  : FV = 8.1·0 + 19.0·(1.0·0)   + 0·(1.0·1.0)   = 0.0
Recap: Value Iteration

Bellman Backup:

  $V_0(s) = 0$
  $V_{k+1}(s) = B[V_k](s)$

where

  $Q(s,a,V) = R(s) + \beta \sum_{s'} \Pr(s' \mid s,a) \, V(s')$
  $B[V](s) = \max_a Q(s,a,V)$

• So we can perform all the steps of value iteration by directly manipulating trees
• When the sequence of value functions has small tree representations, this gives a huge savings
SDP: Relative Merits

Adaptive, nonuniform, exact abstraction method
  provides an exact solution to the MDP
  much more efficient on certain problems (time/space)
  400 million state problems in a couple of hours
Can formulate a similar procedure for modified policy iteration
Some drawbacks
  produces piecewise constant VF
  some problems admit no compact solution representation
    so the sizes of the trees blow up after enough iterations
    approximation may be desirable or necessary
Approximate SDP

Easy to approximate the solution using SDP
Simple pruning of the value function
  Simply "merge" leaves that have similar values
  Can prune trees [BouDearden96] or ADDs [StaubinHoeyBou00]
  Gives regions of approximately the same value
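A minimal sketch of such pruning under the earlier tuple encoding (the midpoint rule and the tolerance handling are my assumptions, not the cited papers' exact procedure):

```python
# Merge sibling leaves whose values are within `tol`, bottom-up; each merged
# region's value is then approximate, with error at most tol/2.
def prune(tree, tol):
    if not isinstance(tree, tuple):
        return tree
    var, hi, lo = tree
    hi, lo = prune(hi, tol), prune(lo, tol)
    if isinstance(hi, float) and isinstance(lo, float) and abs(hi - lo) <= tol:
        return (hi + lo) / 2.0        # merge: one approximate leaf
    return (var, hi, lo)

print(prune(("Z", 19.0, ("Y", 8.1, 8.0)), tol=0.5))   # ('Z', 19.0, 8.05)
```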
A Pruned Value ADD

(figure: a value ADD over HCU, HCR, Loc, W, R, U with leaves 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, 5.19, pruned to an ADD whose leaves are the intervals [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], [5.19, 6.19])
Approximate SDP: Relative Merits

Relative merits of ASDP: fewer regions imply faster computation
  30–40 billion state problems in a couple of hours
  allows fine-grained control of time vs. solution quality with dynamic error bounds
  technical challenges: variable ordering, convergence, fixed vs. adaptive tolerance, etc.
Some drawbacks
  (still) produces piecewise constant VF
  doesn't exploit additive structure of the VF at all

Bottom line: When a problem matches the structural assumptions of SDP, we can gain much. But many problems do not match these assumptions.
Ongoing Work

Factored action spaces
  Sometimes the action space is large, but has structure.
  For example, cooperative multi-agent systems
Recent work (at OSU) has studied SDP for factored action spaces
  Include action variables in the DBNs
(figure: a DBN whose nodes are divided into action variables and state variables)