# Symbolic Dynamic Programming

AI and Robotics

Nov 7, 2013


Alan Fern *

* Based in part on slides by Craig Boutilier

## Planning in Large State Space MDPs

You have learned algorithms for computing optimal policies:

- Value Iteration
- Policy Iteration

These algorithms explicitly enumerate the state space; often this is impractical.

Simulation-based planning and RL allowed for approximate planning in large MDPs:

- They did not utilize an explicit model of the MDP, only a strong or weak simulator.

How can we get exact solutions to enormous MDPs?

## Structured Representations

Policy iteration and value iteration treat states as atomic entities with no internal structure.

In most cases, states actually do have internal structure:

- e.g., described by a set of state variables, or by objects with properties and relationships
- Humans exploit this structure to plan effectively

What if we had a compact, structured representation for a large MDP and could efficiently plan with it?

- This would allow for exact solutions to very large MDPs.

## A Planning Problem

## Logical or Feature-based Problems

For most AI problems, states are not viewed as atomic entities. They contain structure: for example, they are described by a set of boolean propositions/variables X1, ..., Xn, making |S| exponential in the number of propositions (|S| = 2^n).

Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available.

## Solution?

Require structured representations in terms of propositions:

- compactly represent the transition function
- compactly represent the reward function
- compactly represent value functions and policies

Require structured computation:

- perform the steps of PI or VI directly on structured representations
- can avoid the need to enumerate the state space

We start by representing the transition structure as dynamic Bayesian networks.

## Propositional Representations

States are decomposable into state variables (we will assume boolean variables).

Structured representations are the norm in AI:

- decision diagrams, Bayesian networks, etc.
- they describe how actions affect/depend on features
- natural, concise, and can be exploited computationally

The same ideas can be used for MDPs.

## Robot Domain as Propositional MDP

Propositional variables for the single-user version:

- Loc (robot's location): Office, Entrance
- T (lab is tidy): boolean
- CR (coffee request outstanding): boolean
- RHC (robot holding coffee): boolean
- RHM (robot holding mail): boolean
- M (mail waiting for pickup): boolean

Actions/Events:

- move to an adjacent location, pickup mail, get coffee, deliver mail, deliver coffee, tidy lab
- mail arrival, coffee request issued, lab gets messy

Rewards:

- rewarded for tidy lab, satisfying a coffee request, delivering mail (or penalized for their negation)

## State Space

State of the MDP: an assignment to these six variables

- 64 states
- grows exponentially with the number of variables

Transition matrices:

- 4032 parameters required per matrix
- one matrix per action (6 or 7 or more actions)

Reward function:

- 64 reward values needed

Factored state and action descriptions will (generally) break this exponential dependence.
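One plausible accounting for these counts (an assumption on our part: each matrix row is a distribution over the 64 states, so 63 entries per row are free):

```python
# Where the slide's counts come from: 2^6 states, and each transition
# matrix row is a distribution over 64 states with 63 free entries
# (the last is determined because the row sums to 1).
n_vars = 6
n_states = 2 ** n_vars                         # 64
params_per_matrix = n_states * (n_states - 1)  # 4032 free parameters
reward_entries = n_states                      # 64
print(n_states, params_per_matrix, reward_entries)  # 64 4032 64
```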

## Dynamic Bayesian Networks (DBNs)

Bayesian networks (BNs) are a common representation for probability distributions:

- a graph (DAG) represents conditional independence
- conditional probability tables (CPTs) quantify local probability distributions

Dynamic Bayes net action representation:

- one Bayes net for each action a, representing the set of conditional distributions Pr(S(t+1) | A(t), S(t))
- each state variable occurs at time t and t+1
- dependence of t+1 variables on t variables is depicted by directed arcs

## DBN Representation: deliver coffee

(Figure: a two-slice DBN with nodes T, L, CR, RHC, RHM, M at times t and t+1, with arcs for Pr(CR(t+1) | L(t), CR(t), RHC(t)), Pr(T(t+1) | T(t)), and Pr(RHM(t+1) | RHM(t)).)

CPT for Pr(CR(t+1) | L(t), CR(t), RHC(t)):

| L | CR | RHC | CR(t+1)=T | CR(t+1)=F |
|---|----|-----|-----------|-----------|
| O | T  | T   | 0.2       | 0.8       |
| E | T  | T   | 1.0       | 0.0       |
| O | F  | T   | 0.1       | 0.9       |
| E | F  | T   | 0.1       | 0.9       |
| O | T  | F   | 1.0       | 0.0       |
| E | T  | F   | 1.0       | 0.0       |
| O | F  | F   | 0.1       | 0.9       |
| E | F  | F   | 0.1       | 0.9       |

CPT for Pr(T(t+1) | T(t)):

| T | T(t+1)=T | T(t+1)=F |
|---|----------|----------|
| T | 0.91     | 0.09     |
| F | 0.0      | 1.0      |

CPT for Pr(RHM(t+1) | RHM(t)):

| RHM | RHM(t+1)=T | RHM(t+1)=F |
|-----|------------|------------|
| T   | 1.0        | 0.0        |
| F   | 0.0        | 1.0        |
## Benefits of DBN Representation

Pr(S(t+1) | S(t)) = Pr(RHM(t+1), M(t+1), T(t+1), L(t+1), CR(t+1), RHC(t+1) | RHM(t), M(t), T(t), L(t), CR(t), RHC(t))

= Pr(RHM(t+1) | RHM(t)) * Pr(M(t+1) | M(t)) * Pr(T(t+1) | T(t)) * Pr(L(t+1) | L(t)) * Pr(CR(t+1) | CR(t), RHC(t), L(t)) * Pr(RHC(t+1) | RHC(t), L(t))

- Only 20 parameters vs. 4032 for the full matrix
- Removes the global exponential dependence

(Figure: the two-slice DBN over RHM, M, T, L, CR, and RHC next to the corresponding full 64 x 64 transition matrix over states s1, s2, ..., s64.)
## Structure in CPTs

So far we have represented each CPT as a table of size exponential in the number of parents.

Notice that there is regularity in CPTs:

- e.g., Pr(CR(t+1) | L(t), CR(t), RHC(t)) has many similar entries

Compact function representations for CPTs can be used to great effect:

- decision trees

Here we show examples of decision trees (DTs).

## Action Representation (DBN/DT)

(Figure: the deliver-coffee DBN with the CPT for CR(t+1) drawn as a decision tree (DT). The root tests CR(t): if false, the leaf is 0.1; if true, test RHC(t): if false, the leaf is 1.0; if true, test L(t): 0.2 at the office, 1.0 at the entrance.)

The leaves of the DT give Pr(CR(t+1) = true | L(t), CR(t), RHC(t)).

DTs can often represent conditional probabilities much more compactly than a full conditional probability table:

- e.g., if CR(t) = true and RHC(t) = false, then CR(t+1) = true with probability 1
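A sketch of reading this decision tree as code (our own encoding of the slide's tree, not an API from the paper):

```python
# The decision tree for Pr(CR(t+1)=true | L(t), CR(t), RHC(t)) as nested
# conditionals. It needs only 4 leaves, versus 8 rows in the full CPT,
# because many CPT rows share the same value.

def pr_cr_next_true(loc_is_office, cr, rhc):
    if not cr:                  # no outstanding request:
        return 0.1              #   a new one may arrive with probability 0.1
    if not rhc:                 # request pending, robot holds no coffee:
        return 1.0              #   the request stays outstanding
    # request pending and robot holds coffee: outcome depends on location
    return 0.2 if loc_is_office else 1.0

print(pr_cr_next_true(True, True, True))  # 0.2, matching CPT row O, T, T
```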

## Reward Representation

Rewards are represented with DTs in a similar fashion; an explicit representation would require a vector of size 2^n.

(Figure: a reward decision tree. Test CR: if true, reward -100; if false, test M: if true, reward -10; if false, test T: reward 1 if the lab is tidy, -1 otherwise.)

- Small reward for satisfying all of these conditions
- High cost for an unsatisfied coffee request
- High, but lower, cost for undelivered mail
- Cost for the lab being untidy
## Structured Computation

Given our compact decision tree (DBN) representation, can we solve the MDP without explicit state space enumeration?

Can we avoid O(|S|) computations by exploiting this structure?

We will study a general approach for doing this called structured dynamic programming.

## Structured Dynamic Programming

We now consider how to perform dynamic programming techniques such as VI and PI using the problem structure.

VI and PI are based on a few basic operations:

- Here we will show how to perform these operations directly on tree representations of value functions, policies, and transition functions.

The approach is very general and can be applied to other representations (e.g., algebraic decision diagrams, situation calculus) and other problems once the main idea is understood.

We will focus on VI here, but the paper also describes a version of modified policy iteration.

## Recall Tree-Based Representations

(Figure: the DBN for action A over variables X, Y, Z, with each CPT drawn as a decision tree, alongside the reward function R: a tree testing Z with leaves 10 (Z true) and 0 (Z false).)

Note: we are leaving off time subscripts for readability, using X, Y, ..., instead of X(t), Y(t), ....

- e.g., if X(t) = false and Y(t) = true, then Y(t+1) = true with probability 1
- e.g., if X(t) = true, then Y(t+1) = true with probability 0.9

Recall that each action of the MDP has its own DBN.

## Structured Dynamic Programming

Value functions and policies can also have tree representations, often much more compact than tables.

Our goal: compute the tree representations of the policy and value function, given the tree representations of the transitions and rewards.

## Recall Value Iteration

V_0(s) = 0

V_{k+1}(s) = B[V_k](s)

Bellman backup:

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

B[V](s) = max_a Q(s, a, V)

Suppose that V is compactly represented as a tree.

1. Show how to compute compact trees for Q(s, a_1, V), ..., Q(s, a_n, V)
2. Use a max operation on the Q-trees (returns a single tree)
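For contrast, the flat backup that structured VI avoids can be sketched directly; the tiny two-state MDP below is our own illustration, not the robot domain:

```python
# Flat Bellman backup for reference: this is exactly the O(|S|) computation
# that the tree-based version avoids. T[a][s][s2] = Pr(s2 | s, a).

def bellman_backup(V, R, T, beta):
    states = range(len(V))
    actions = range(len(T))

    def Q(s, a):
        # Q(s, a, V) = R(s) + beta * sum_s' T(s, a, s') V(s')
        return R[s] + beta * sum(T[a][s][s2] * V[s2] for s2 in states)

    # B[V](s) = max_a Q(s, a, V)
    return [max(Q(s, a) for a in actions) for s in states]

T = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: swap states
R = [0.0, 1.0]
V1 = bellman_backup([0.0, 0.0], R, T, beta=0.9)
print(V1)  # [0.0, 1.0]
```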

## The MAX Tree Operation

(Figure: two value trees, one testing X then Y with leaves 0.9, 0.0, and 1.0, the other testing X with leaves 1.0 and 0.0. Each tree partitions the state space, assigning a value to each region; a third tree shows their state-space max.)

In general, how can we compute the tree representing the max?

## The MAX Tree Operation

We can simply append one tree to the leaves of the other. The result makes all the distinctions that either tree makes, and the max operation is taken at the leaves of the result.

(Figure: appending the second tree at each leaf of the first produces leaves holding pairs such as 1.0, 0.0 and 0.9, 0.0; taking the max at each pair yields the MAX tree.)
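A sketch of this append-and-max operation, using nested tuples for trees (our own encoding; the example trees are only illustrative):

```python
# A leaf is a number; an internal node is (var, true_subtree, false_subtree).
# To take the pointwise max of two trees, replace each leaf of the first
# tree with a copy of the second, taking max where two leaves meet.
# No simplification of unreachable tests is done here.

def tree_max(t1, t2):
    if not isinstance(t1, tuple):          # t1 is a leaf: append t2 below it
        if not isinstance(t2, tuple):
            return max(t1, t2)             # both leaves: take the max
        var, hi, lo = t2
        return (var, tree_max(t1, hi), tree_max(t1, lo))
    var, hi, lo = t1
    return (var, tree_max(hi, t2), tree_max(lo, t2))

def evaluate(tree, assignment):
    """Walk the tree under a truth assignment and return the leaf value."""
    while isinstance(tree, tuple):
        var, hi, lo = tree
        tree = hi if assignment[var] else lo
    return tree

t1 = ('X', ('Y', 0.9, 0.0), 1.0)
t2 = ('X', 1.0, 0.0)
m = tree_max(t1, t2)
print(evaluate(m, {'X': True, 'Y': False}))  # max(0.0, 1.0) = 1.0
```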

## The MAX Tree Operation

The resulting tree may have unreachable leaves. We can simplify the tree by removing such paths.

(Figure: the combined tree before and after simplification; a repeated test of X below its own branch makes one arm unreachable, and simplification removes it.)
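A sketch of the simplification step under the same (var, true_subtree, false_subtree) tuple encoding (our own illustration): track the outcomes of tests already made on the current path, and cut the branches those outcomes make unreachable:

```python
def simplify(tree, fixed=None):
    fixed = fixed or {}                   # variables decided higher up
    if not isinstance(tree, tuple):
        return tree
    var, hi, lo = tree
    if var in fixed:                      # outcome already determined above:
        return simplify(hi if fixed[var] else lo, fixed)
    hi_s = simplify(hi, {**fixed, var: True})
    lo_s = simplify(lo, {**fixed, var: False})
    if hi_s == lo_s:                      # test no longer distinguishes
        return hi_s
    return (var, hi_s, lo_s)

# The inner test of X is unreachable on its false arm, so it disappears:
print(simplify(('X', ('X', 1.0, 0.0), 0.5)))  # ('X', 1.0, 0.5)
```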

## Binary Operations

Other binary operations on trees (e.g., addition) are computed similarly to max.

## Marginalization

Compute the diagram representing G(B) = Σ_A F(A, B), i.e., marginalize variable A out of F.

There are libraries for doing this.

## Structured Bellman Backup

Structured Bellman backup:

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

B[V](s) = max_a Q(s, a, V)

So if we can compute the trees for Q(s, a_1, V), ..., Q(s, a_n, V), we can compute the tree representing B[V] using max.

This will allow us to perform structured value iteration! The hope is that if the tree for V is small, then so is the tree for B[V].

How can we compute a tree for Q(s, a, V) for a given a?

## Computing Q(s,a,V) Trees

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

Here R(s) is a tree provided by the problem definition, and the expectation term is the future value tree, FVTree(s, a, V).

Given a tree for V and trees for T(s, a, s') via the DBN, compute a tree FVTree(s, a, V) representing the expectation.

Given trees for R(s) and FVTree(s, a, V), we can perform an addition operation on trees to get the tree for Q(s, a, V):

- Addition on trees is almost identical to the MAX operation.

So all we need to do is compute FVTree(s, a, V). How?

## Generic Computation of FVTree(s,a,V)

FVTree(s, a, V) = Σ_{s'} T(s, a, s') V(s')

For s = (x, y, z) and s' = (x', y', z'), the DBN factors this as

= Σ_{x'} Pr(x' | s, a) Σ_{y'} Pr(y' | s, a) Σ_{z'} Pr(z' | s, a) V(s')

## SDP-Specific Computation

(Figure: the FVTree partition of the state space on the left; the V(s') partition with leaf values v1, v2, v3 on the right. A region of the FVTree with future value FV1 transitions to the three regions of V with probabilities p1, p2, p3.)

FVTree(s, a, V) = Σ_{s'} T(s, a, s') V(s')

If two states have the same probability of transitioning to each partition of V (under action a), then they will have the same future value. So we want those states to be in the same partition of the FVTree, labeled by that future value:

FV1 = p1*v1 + p2*v2 + p3*v3

So how do we compute this partition and its values?
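The leaf value is just an expectation: a dot product of the transition probabilities into each partition of V with that partition's value. A minimal sketch (the numbers are illustrative, matching the later worked example):

```python
def expected_future_value(probs, values):
    """FV = p1*v1 + p2*v2 + ... for one leaf of the FVTree."""
    assert abs(sum(probs) - 1.0) < 1e-9   # probabilities over V's partitions
    return sum(p * v for p, v in zip(probs, values))

print(expected_future_value([0.9, 0.1, 0.0], [8.1, 19.0, 0.0]))  # ≈ 9.19
```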

## Computing FVTree(s,a,V)

(Figure: the FVTree partition and the V(s') partition, as on the previous slide.)

Procedure:

1. Let T1 be the tree over variables VARS representing V(s').
2. Construct a new tree T2 such that states belonging to the same leaf of T2 assign the same distribution over the leaves of T1 at t+1.
   - I.e., each state at time t assigns a distribution over VARS at t+1, and this distribution implies a distribution over the leaves of T1 at t+1.
   - We can do this by "composing" the trees for VARS.
3. Assign a future value to each leaf of T2.
## A Simple Example

(Figure: DBN for action A over X, Y, Z, with decision-tree CPTs, and the reward function R:

- X': tests X; X' = true with probability 1.0 if X, else 0.0
- Y': tests X; if X, Y' = true with probability 0.9; else tests Y: 1.0 if Y, else 0.0
- Z': tests Z; if Z, Z' = true with probability 1.0; else tests Y: 0.9 if Y, else 0.0
- R: 10 if Z, else 0)

Notation: Z' = Z(t+1), Z = Z(t), and "Z' : p" means that Z(t+1) = true with probability p. (This corresponds to s' = (X', Y', Z') and s = (X, Y, Z) in the definition of the Q-function.)

## Example: FVTree(s,A,V) when V(s')=R

V(s') = R(s') is a tree testing Z': 10 if z', 0 if ~z'. So Z' is the only relevant variable for V (VARS = {z'}).

The probability tree for z' (only y and z at time t affect z at t+1) partitions states into:

- z: Pr(z') = 1
- ~z, y: Pr(z') = 0.9
- ~z, ~y: Pr(z') = 0

The partition for the FVTree is the same as the tree for z'. Computing the expected future value at each leaf gives the FVTree:

- z: FV = 10*1 = 10
- ~z, y: FV = 10*0.9 = 9
- ~z, ~y: FV = 10*0 = 0.0

## Computing Q(s,a,V) Trees

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

The resulting Q(s,a,V) tree for V = R, using tree addition with β = 0.9:

- FVTree: z → 10; ~z, y → 9; ~z, ~y → 0.0
- R: z → 10; ~z → 0
- Q: z → 19; ~z, y → 8.1; ~z, ~y → 0.0
## Example: More Complex V

The original V(s') tests Y': if y', the value is 8.1; otherwise it tests Z': 19.0 if z', 0.0 if ~z'. So V depends on y' and z' (VARS = {y', z'}), partitioning s' into:

- y': 8.1
- ~y', z': 19.0
- ~y', ~z': 0.0

The probability tree for y' (only y and x at time t affect y'):

- x: Pr(y') = 0.9
- ~x, y: Pr(y') = 1
- ~x, ~y: Pr(y') = 0

If ~x and y, then we always go to the 8.1 leaf, so this can be a leaf of the FVTree. Otherwise, the probability of the leaf we go to also depends on z': the distribution over the leaves of T1 (the tree for V(s')) depends on z at those leaves.

## Example: More Complex V

We compose the probability tree for y' (only x(t) and y(t) affect y') with the probability tree for z' (only y and z affect z'), since the original V(s') depends on y' and z'.

(Figure: the composed tree T2 over the time-t variables y, x, and z. Each leaf pairs a probability for Y' with one for Z', e.g. Y': 0.9 with Z': 1.0, Y': 0.9 with Z': 0.0, Y': 0.9 with Z': 0.9, Y': 0.0 with Z': 1.0, Y': 0.0 with Z': 0.0, Y': 0.0 with Z': 0.9; one branch reduces to Y': 1.0 alone.)

We can simplify this tree.

## Example: More Complex V

(Figure: the simplified composed tree, with leaves such as Y': 0.9 with Z': 1.0, Y': 0.0 with Z': 0.0, and Y': 1.0.)

Each leaf of the resulting tree gives a distribution over the leaves of the tree for V(s') (i.e., a distribution over Y' and Z', which gives a distribution over future values).

This simplified tree serves as the structure of the FVTree.

## Example: More Complex V

Form the FVTree by storing the expected future value (FV) at each leaf of the simplified tree. With V's leaf values 8.1 (y'), 19.0 (~y', z'), and 0.0 (~y', ~z'), the leaves evaluate to:

- FV = 8.1*0.9 + 19.0*(0.1*1) + 0*0
- FV = 8.1*0.9 + 19.0*(0.1*0.9) + 0*(0.1*0.1)
- FV = 8.1*0.9 + 19.0*(0.1*0) + 0*(0.1*1)
- FV = 8.1*1 + 19.0*(0) + 0*(0)
- FV = 8.1*0 + 19.0*(1*1) + 0*(1*0)
- FV = 8.1*0 + 19.0*(1*0) + 0*(1*1)
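As a quick arithmetic check of these leaf expectations (with V's leaves 8.1, 19.0, and 0.0):

```python
# Each leaf's FV is the dot product of its distribution over V's leaves
# with V's leaf values (8.1, 19.0, 0.0).
leaves = [
    8.1 * 0.9 + 19.0 * (0.1 * 1)   + 0 * 0,            # 9.19
    8.1 * 0.9 + 19.0 * (0.1 * 0.9) + 0 * (0.1 * 0.1),  # 9.0
    8.1 * 0.9 + 19.0 * (0.1 * 0)   + 0 * (0.1 * 1),    # 7.29
    8.1 * 1   + 19.0 * 0           + 0 * 0,            # 8.1
    8.1 * 0   + 19.0 * (1 * 1)     + 0 * (1 * 0),      # 19.0
    8.1 * 0   + 19.0 * (1 * 0)     + 0 * (1 * 1),      # 0.0
]
print([round(v, 2) for v in leaves])  # [9.19, 9.0, 7.29, 8.1, 19.0, 0.0]
```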

## Recap: Value Iteration

V_0(s) = 0

V_{k+1}(s) = B[V_k](s)

Bellman backup:

Q(s, a, V) = R(s) + β Σ_{s'} Pr(s, a, s') V(s')

B[V](s) = max_a Q(s, a, V)

So we can perform all the steps of value iteration by directly manipulating trees. When the sequence of value functions has small tree representations, this gives a huge savings.

## SDP: Relative Merits

- Provides an exact solution to the MDP
- Much more efficient on certain problems (time/space)
  - 400 million state problems in a couple of hours
- A similar procedure can be formulated for modified policy iteration

Some drawbacks:

- produces a piecewise constant VF
- some problems admit no compact solution representation, so the sizes of the trees blow up with enough iterations
- approximation may be desirable or necessary

## Approximate SDP

It is easy to approximate the solution using SDP via simple pruning of the value function:

- Simply "merge" leaves that have similar values
- Can prune trees [BouDearden96], [StaubinHoeyBou00]
- Gives regions of approximately the same value
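A minimal sketch of leaf merging (our own illustration, in the same (var, true_subtree, false_subtree) tuple encoding; the tolerance is a free parameter):

```python
# Merge sibling leaves whose values are within `tol` of each other,
# replacing them with their midpoint and dropping the test entirely.

def prune(tree, tol):
    if not isinstance(tree, tuple):
        return tree
    var, hi, lo = tree
    hi, lo = prune(hi, tol), prune(lo, tol)
    if not isinstance(hi, tuple) and not isinstance(lo, tuple) \
            and abs(hi - lo) <= tol:
        return (hi + lo) / 2.0            # leaves are close enough: merge
    return (var, hi, lo)

t = ('Loc', ('W', 8.45, 8.36), 7.45)      # values echo the figure below
print(prune(t, tol=0.1))  # merges 8.45 and 8.36 into one leaf near 8.405
```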

(Figure: pruning example. Left: a value tree over Loc, HCR, HCU, and W with leaves 8.36, 8.45, 7.45, 6.81, 7.64, 6.64, 5.62, 6.19, 5.19, 9.00, and 10.00. Right: the pruned tree, where close leaves are merged into the intervals [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], and [5.19, 6.19].)

## Approximate SDP: Relative Merits

Relative merits of ASDP: fewer regions implies faster computation.

- 30-40 billion state problems in a couple of hours
- allows fine-grained control of time vs. solution quality, with dynamic error bounds
- technical challenges: variable ordering, convergence, fixed points

Some drawbacks:

- (still) produces a piecewise constant VF
- doesn't exploit additive structure of the VF at all

Bottom line: when a problem matches the structural assumptions of SDP, we can gain much. But many problems do not match these assumptions.

## Ongoing Work

Factored action spaces:

- Sometimes the action space is large, but has structure.
- For example, cooperative multi-agent systems.

Recent work (at OSU) has studied SDP for factored action spaces:

- Include action variables in the DBNs, alongside the state variables.