Symbolic Dynamic Programming


1

Symbolic Dynamic Programming

Alan Fern *

* Based in part on slides by Craig Boutilier


2

Planning in Large State Space MDPs


You have learned algorithms for computing optimal policies


Value Iteration


Policy Iteration


These algorithms explicitly enumerate the state space


Often this is impractical



Simulation-based planning and RL allowed for approximate planning in large MDPs


Did not utilize an explicit model of the MDP. Only used a strong or
weak simulator.



How can we get exact solutions to enormous MDPs?

3

Structured Representations


Policy iteration and value iteration treat states as atomic
entities with no internal structure.



In most cases, states actually do have internal structure


E.g. described by a set of state variables, or objects with properties
and relationships


Humans exploit this structure to plan effectively



What if we had a compact, structured representation for a
large MDP and could efficiently plan with it?


Would allow for exact solutions to very large MDPs

4

A Planning Problem

5

Logical or Feature-based Problems


For most AI problems, states are not viewed as
atomic entities.


They contain structure. For example, they are described by a set of boolean propositions/variables




|S| is exponential in the number of propositions:

S = X_1 × X_2 × ... × X_n,  so |S| = 2^n for n boolean variables


Basic policy and value iteration do nothing to exploit the structure of the MDP when it is available

6

Solution?


Require structured representations in terms
of propositions


compactly represent transition function


compactly represent reward function


compactly represent value functions and policies


Require structured computation


perform steps of PI or VI directly on structured
representations


can avoid the need to enumerate state space


We start by representing the transition
structure as dynamic Bayesian networks

7

Propositional Representations


States decomposable into state variables (we will assume boolean variables)





Structured representations are the norm in AI


Decision diagrams, Bayesian networks, etc.


Describe
how actions affect/depend on features


Natural, concise, can be exploited computationally



Same ideas can be used for MDPs

S = X_1 × X_2 × ... × X_n

8

Robot Domain as Propositional
MDP


Propositional variables for single user version


Loc (robot's location): Office, Entrance


T (lab is tidy): boolean


CR (coffee request outstanding): boolean


RHC (robot holding coffee): boolean


RHM (robot holding mail): boolean


M (mail waiting for pickup): boolean


Actions/Events


move to an adjacent location, pickup mail, get coffee, deliver
mail, deliver coffee, tidy lab


mail arrival, coffee request issued, lab gets messy


Rewards


rewarded for tidy lab, satisfying a coffee request, delivering mail


(or penalized for their negation)


9

State Space


State of MDP: assignment to these six
variables


64 states


grows exponentially with number of variables


Transition matrices


4032 parameters required per matrix


one matrix per action (6 or 7 or more actions)


Reward function


64 reward values needed


Factored state and action descriptions will
break this exponential dependence (generally)
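
As a quick count (assuming one probability per matrix entry, with each row summing to 1): 2^6 = 64 states, and each 64 x 64 transition matrix then has 64 x 63 = 4032 free parameters, which is where the figures above come from.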

10

Dynamic Bayesian Networks (DBNs)


Bayesian networks (BNs) a common
representation for probability distributions


A graph (DAG) represents conditional
independence


Conditional probability tables (CPTs) quantify local
probability distributions


Dynamic Bayes net action representation


one Bayes net for each action a, representing the set of conditional distributions Pr(S(t+1) | A(t), S(t))


each state variable occurs at time t and t+1


dependence of t+1 variables on t variables
depicted by directed arcs


11

DBN Representation: deliver coffee

The DBN has one node for each state variable at time t (T(t), L(t), CR(t), RHC(t), RHM(t), M(t)) and one at time t+1, with directed arcs from time-t parents to time-(t+1) variables. Its CPTs include:

Pr(CR(t+1) | L(t), CR(t), RHC(t)):

  L  CR  RHC | CR(t+1)=T  CR(t+1)=F
  O  T   T   | 0.2        0.8
  E  T   T   | 1.0        0.0
  O  F   T   | 0.1        0.9
  E  F   T   | 0.1        0.9
  O  T   F   | 1.0        0.0
  E  T   F   | 1.0        0.0
  O  F   F   | 0.1        0.9
  E  F   F   | 0.1        0.9

Pr(T(t+1) | T(t)):

  T(t) | T(t+1)=T  T(t+1)=F
  T    | 0.91      0.09
  F    | 0.0       1.0

Pr(RHM(t+1) | RHM(t)):

  RHM(t) | RHM(t+1)=T  RHM(t+1)=F
  T      | 1.0         0.0
  F      | 0.0         1.0

12

Benefits of DBN Representation

Pr(S(t+1) | S(t))
  = Pr(RHM(t+1), M(t+1), T(t+1), L(t+1), CR(t+1), RHC(t+1) | RHM(t), M(t), T(t), L(t), CR(t), RHC(t))
  = Pr(RHM(t+1) | RHM(t)) * Pr(M(t+1) | M(t)) * Pr(T(t+1) | T(t))
    * Pr(L(t+1) | L(t)) * Pr(CR(t+1) | CR(t), RHC(t), L(t)) * Pr(RHC(t+1) | RHC(t), L(t))

- Only 20 parameters vs. 4032 for the full matrix

- Removes the global exponential dependence

Full matrix (for comparison):

       s1    s2    ...  s64
  s1   0.9   0.05  ...  0.0
  s2   0.0   0.20  ...  0.1
  ...
  s64  0.1   0.0   ...  0.0
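
As a quick sanity check on the 20-parameter count (assuming one free parameter per CPT row, since every variable here is binary): the four single-parent CPTs for RHM, M, T, and L contribute 2 rows each (8 parameters), the CR CPT has 2 x 2 x 2 = 8 rows (8 parameters), and the RHC CPT has 2 x 2 = 4 rows (4 parameters), for 8 + 8 + 4 = 20 in total, versus 64 x 63 = 4032 for the explicit matrix.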

13

Structure in CPTs


So far we have represented each CPT as a table
of size exponential in the number of parents


Notice that there’s regularity in CPTs


e.g., Pr(CR(t+1) | L(t), CR(t), RHC(t)) has many similar entries


Compact function representations for CPTs can
be used to great effect


decision trees


algebraic decision diagrams (ADDs/BDDs)


Here we show examples of decision trees (DTs)

14

Action Representation


DBN/DT

[Figure: the DBN for the action, with its CPT for CR(t+1) drawn as a decision tree rather than a table.]

Decision Tree (DT) for Pr(CR(t+1)=true | L(t), CR(t), RHC(t)):

  CR(t)
    f: 0.1
    t: RHC(t)
         f: 1.0
         t: L(t)
              o (Office):   0.2
              e (Entrance): 1.0

The leaves of the DT give Pr(CR(t+1)=true | L(t), CR(t), RHC(t)).

DTs can often represent conditional probabilities much more compactly than a full conditional probability table.

  e.g. If CR(t) = true & RHC(t) = false then CR(t+1) = true with prob. 1
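
To make the tree-structured CPT concrete, here is a minimal sketch (not from the slides) of one way such a tree could be stored and queried; the nested-tuple representation and the function names are illustrative assumptions, though the probabilities mirror the tree above.

    # A minimal sketch of a tree-structured CPT: internal nodes test a state
    # variable, leaves hold Pr(CR(t+1) = true).  Representation is illustrative only.
    def make_leaf(p):
        return ("leaf", p)

    def make_node(var, children):
        # children: dict mapping a value of `var` to a subtree
        return ("node", var, children)

    # Tree for Pr(CR(t+1)=true | L(t), CR(t), RHC(t)) from the slide
    cr_tree = make_node("CR", {
        False: make_leaf(0.1),
        True: make_node("RHC", {
            False: make_leaf(1.0),
            True: make_node("L", {
                "Office": make_leaf(0.2),
                "Entrance": make_leaf(1.0),
            }),
        }),
    })

    def evaluate(tree, state):
        """Walk the tree using the variable assignments in `state` (a dict)."""
        if tree[0] == "leaf":
            return tree[1]
        _, var, children = tree
        return evaluate(children[state[var]], state)

    # e.g. coffee request outstanding, robot not holding coffee -> prob 1.0
    print(evaluate(cr_tree, {"CR": True, "RHC": False, "L": "Office"}))  # 1.0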

15

Reward Representation


Rewards represented with DTs in a similar fashion

  Would require a vector of size 2^n for an explicit representation

Reward tree:

  CR
    t: -100   (high cost for an unsatisfied coffee request)
    f: M
         t: -10    (high, but lower, cost for undelivered mail)
         f: T
              t: 1     (small reward for satisfying all of these conditions)
              f: -1    (cost for the lab being untidy)

16

Structured Computation


Given our compact decision tree (DBN)
representation, can we solve MDP without
explicit state space enumeration?


Can we avoid O(|S|) computations by exploiting regularities made explicit by the representation?


We will study a general approach for doing this
called structured dynamic programming

17

Structured Dynamic Programming


We now consider how to perform dynamic programming
techniques such as VI and PI using the problem structure


VI and PI are based on a few basic operations.


Here we will show how to perform these operations directly on tree
representations of value functions, policies, and transition functions


The approach is very general and can be applied to other
representations (e.g. algebraic decision diagrams, situation
calculus) and other problems after the main idea is
understood


We will focus on VI here, but the paper also describes a
version of modified policy iteration

18

Recall Tree-Based Representations

DBN for Action A (state variables X, Y, Z at time t and t+1):

  Pr(X(t+1)=true):
    X
      t: 1.0
      f: 0.0

  Pr(Y(t+1)=true):
    X
      t: 0.9
      f: Y
           t: 1.0
           f: 0.0

  Pr(Z(t+1)=true):
    Z
      t: 1.0
      f: Y
           t: 0.9
           f: 0.0

Reward Function R:
    Z
      t: 10
      f: 0

Note: we are leaving off time subscripts for readability and using X(t), Y(t), ..., instead.

  e.g. If X(t)=false & Y(t)=true then Y(t+1)=true w/ prob 1
  e.g. If X(t)=true then Y(t+1)=true w/ prob 0.9

Recall that each action of the MDP has its own DBN.

19

Structured Dynamic Programming


Value functions and policies can also have tree
representations


Often much more compact representations than tables



Our Goal:

compute the tree representations of
policy and value function given the tree
representations of the transitions and rewards

20

Recall Value Iteration

V_0(s) = 0
V_{k+1}(s) = B[V_k](s)

Bellman Backup:

  Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')
  B[V](s) = max_a Q(s, a, V)

Suppose that V is compactly represented as a tree.

1. Show how to compute compact trees for Q(s,a_1,V), ..., Q(s,a_n,V)
2. Use a max operation on the Q-trees (returns a single tree)
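
For reference, a minimal sketch of this backup in its ordinary tabular form, i.e. the explicit O(|S|) computation that the structured version avoids; the data structures for T and R and the discount beta are assumed placeholders, not anything defined in the slides.

    # Plain tabular value iteration, shown only as the explicit-state baseline.
    # T[s][a] is a list of (next_state, probability) pairs, R[s] is the reward,
    # beta is the discount factor -- all assumed to be given.
    def value_iteration(states, actions, T, R, beta, iterations):
        V = {s: 0.0 for s in states}                      # V_0(s) = 0
        for _ in range(iterations):
            V_next = {}
            for s in states:                              # explicit O(|S|) sweep
                q_values = [R[s] + beta * sum(p * V[s2] for s2, p in T[s][a])
                            for a in actions]
                V_next[s] = max(q_values)                 # Bellman backup B[V](s)
            V = V_next
        return V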

21

The MAX Trees Operation

Tree 1:
  X
    t: 0.9
    f: Y
         t: 1.0
         f: 0.0

Tree 2:
  X
    t: 1.0
    f: 0.0

A tree partitions the state space, assigning a value to each region.

[Figure: the regions of the state space induced by each tree, labeled with their values, and the region-by-region max of the two.]

The state space max for the above trees is shown region by region in the figure. In general, how can we compute the tree representing the max?

22

The MAX Tree Operation

Can simply append one tree to the leaves of the other. This makes all the distinctions that either tree makes. The max operation is then taken at the leaves of the result.

[Figure: Tree 1 (from the previous slide) MAX Tree 2. Tree 2 is appended below every leaf of Tree 1, so each leaf of the combined tree holds a pair of values, one from each tree; taking the max of each pair gives the (not yet simplified) result tree, which still tests X again along branches where X is already fixed.]
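
A small sketch of the append-and-combine idea, reusing the illustrative tuple-tree representation from the earlier CPT sketch; passing max gives the MAX operation, while passing addition gives the tree addition used later for Q-trees. It deliberately does not remove unreachable branches (that is the simplification step on the next slide).

    # Append tree2 beneath every leaf of tree1 and apply `op` (e.g. max, or
    # addition) to each pair of leaf values.  The result keeps every distinction
    # either tree makes; unreachable branches are not yet removed.
    def combine(tree1, tree2, op):
        if tree1[0] == "leaf":
            if tree2[0] == "leaf":
                return ("leaf", op(tree1[1], tree2[1]))
            _, var, children = tree2
            return ("node", var,
                    {val: combine(tree1, sub, op) for val, sub in children.items()})
        _, var, children = tree1
        return ("node", var,
                {val: combine(sub, tree2, op) for val, sub in children.items()})

    tree1 = ("node", "X", {True: ("leaf", 0.9),
                           False: ("node", "Y", {True: ("leaf", 1.0),
                                                 False: ("leaf", 0.0)})})
    tree2 = ("node", "X", {True: ("leaf", 1.0), False: ("leaf", 0.0)})

    max_tree = combine(tree1, tree2, max)      # MAX of the two value trees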

23

The MAX Tree Operation

The resulting tree may have unreachable leaves (e.g. an X test appearing again below a branch where X is already fixed). We can simplify the tree by removing such paths.

After simplification, the max of the two trees from the previous slides is:

  X
    t: 1.0
    f: Y
         t: 1.0
         f: 0.0

24

BINARY OPERATIONS

(other binary operations similar to max)

25

MARGINALIZATION


Compute the diagram representing G(B) = Σ_A F(A, B), i.e. sum the function F(A, B) over the values of variable A.

There are libraries for doing this.
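
To make the operation concrete, a toy tabular version is sketched below; real SDP implementations perform it on decision diagrams with ADD library support, and nothing here (names or representation) comes from the slides.

    # Toy marginalization of a tabular factor F(A, B): G(B) = sum_A F(A, B).
    from collections import defaultdict

    def marginalize_out_A(F):
        G = defaultdict(float)
        for (a, b), value in F.items():
            G[b] += value
        return dict(G)

    # Example factor over two boolean variables A and B
    F = {(a, b): 0.25 for a in (True, False) for b in (True, False)}
    print(marginalize_out_A(F))   # {True: 0.5, False: 0.5}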

26

Structured Bellman Backup

Structured Bellman Backup:

  Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')
  B[V](s) = max_a Q(s, a, V)



So if we can compute the trees for Q(s,a_1,V), ..., Q(s,a_n,V), we can compute the tree representing B[V] using the max operation.

This will allow us to perform structured value iteration!

The hope is that if the tree for V is small, then the tree for B[V] will be small too.

How can we compute a tree for Q(s,a,V) for a given a?

27

Computing Q(s,a,V) Trees

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

Here R(s) is a tree provided by the problem definition, and Σ_{s'} T(s, a, s') V(s') is FVTree(s, a, V), the future value tree.

Given a tree for V and trees for T(s, a, s') via the DBN, compute a tree FVTree(s, a, V) representing this expectation.

Given trees for R(s) and FVTree(s, a, V), we can perform an addition operation on trees to get the tree for Q(s, a, V).

Addition on trees is almost identical to the MAX operation (i.e. instead of taking MAX at the leaves, simply add the numbers).

So all we need to do is compute FVTree(s, a, V). How?
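
Once the FVTree is available (its computation is the subject of the next slides), the remaining step can be sketched as follows, continuing the illustrative tuple-tree representation and the combine() helper from the MAX sketch; the explicit β scaling is an assumption made to match the Q definition above.

    # Q(s,a,V) tree = R tree + beta * FVTree, using tree addition at the leaves.
    def scale(tree, factor):
        if tree[0] == "leaf":
            return ("leaf", factor * tree[1])
        _, var, children = tree
        return ("node", var, {v: scale(sub, factor) for v, sub in children.items()})

    def q_tree(r_tree, fv_tree, beta):
        return combine(r_tree, scale(fv_tree, beta), lambda a, b: a + b)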

Generic Computation of FVTree(s,a,V)

FVTree(s, a, V) = Σ_{s'} T(s, a, s') V(s')
               = Σ_{x'} Σ_{y'} Σ_{z'} Pr(x' | s, a) Pr(y' | s, a) Pr(z' | s, a) V(s')
               = Σ_{x'} Pr(x' | s, a) Σ_{y'} Pr(y' | s, a) Σ_{z'} Pr(z' | s, a) V(s')

where s = (x, y, z) and s' = (x', y', z').


30

SDP-Specific Computation

FVTree(s, a, V) = Σ_{s'} T(s, a, s') V(s')

[Figure: one leaf FV1 of the FVTree partition, whose states transition with probabilities p1, p2, p3 into the three regions of the V(s') partition with values v1, v2, v3.]

If two states have the same probability of transitioning to each partition of V (under action a), then they will have the same future value.

So we want those states to be in the same partition of the FVTree, labeled by that future value:

  FV1 = p1*v1 + p2*v2 + p3*v3

So how do we compute this partition and these values?

31

Computing FVTree(s,a,V)

[Figure: the same FVTree-partition / V(s')-partition picture as on the previous slide.]

Procedure:

1. Let T1 be the tree involving variables VARS representing V(s')
2. Construct a new tree T2 such that states belonging to the same leaf of T2 assign the same distribution over leaves of T1 at t+1
   - I.e. each state at time t assigns a distribution over VARS at t+1, and this distribution implies a distribution over leaves of T1 at t+1
   - Can do this by "composing" trees for VARS
3. Assign a future value to each leaf of T2  (a sketch of this computation follows below)
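
A compact sketch of steps 2 and 3 under simplifying assumptions (boolean variables, no arcs among time-(t+1) variables, and the same illustrative tuple trees and evaluate() helper as before). Rather than building and simplifying a real T2, it enumerates only the current-state variables that the relevant CPT trees actually test, which is already far smaller than the full state space; the function names are hypothetical.

    from itertools import product

    def tested_vars(tree, acc=None):
        # Collect the variables tested by a tuple-tree's internal nodes.
        acc = set() if acc is None else acc
        if tree[0] == "node":
            _, var, children = tree
            acc.add(var)
            for sub in children.values():
                tested_vars(sub, acc)
        return acc

    def expected_value(v_tree, cpts, state):
        # Expected value of V over next-state variables, for one current-state
        # region; cpts[var] is the tree giving Pr(var = true at t+1 | state).
        if v_tree[0] == "leaf":
            return v_tree[1]
        _, var, children = v_tree
        p = evaluate(cpts[var], state)
        return (p * expected_value(children[True], cpts, state)
                + (1 - p) * expected_value(children[False], cpts, state))

    def fv_table(v_tree, cpts):
        parents = set()
        for nxt in tested_vars(v_tree):          # next-state vars V depends on
            tested_vars(cpts[nxt], parents)      # current-state vars their CPTs test
        parents = sorted(parents)
        return {combo: expected_value(v_tree, cpts, dict(zip(parents, combo)))
                for combo in product((True, False), repeat=len(parents))}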

32

A Simple Example

DBN for Action A:

  Pr(X'=true):
    X
      t: X': 1.0
      f: X': 0.0

  Pr(Y'=true):
    X
      t: Y': 0.9
      f: Y
           t: Y': 1.0
           f: Y': 0.0

  Pr(Z'=true):
    Z
      t: Z': 1.0
      f: Y
           t: Z': 0.9
           f: Z': 0.0

Reward Function R:
    Z
      t: 10
      f: 0

Notation: Z' = Z(t+1), Z = Z(t), and "Z' : p" means that Z(t+1) = true with probability p.
(This corresponds to S' = (X', Y', Z') and S = (X, Y, Z) in the definition of the Q-function.)


33

Example: FVTree(s,A,V) when V(s’)=R

V(s') = R(s'):
    Z'
      t: 10
      f: 0

Z' is the only relevant variable for V (so VARS = {z'}).

Probability tree for z' (only y and z at time t affect z at t+1), which is also the partition for the FVTree:

    z:       Pr(z') = 1.0
    ~z, y:   Pr(z') = 0.9
    ~z, ~y:  Pr(z') = 0.0

Computing the expected FV at each leaf gives the FVTree:

    z:       FV = 1.0 * 10 = 10
    ~z, y:   FV = 0.9 * 10 = 9
    ~z, ~y:  FV = 0.0 * 10 = 0

34

Computing Q(s,a,V) Trees

Q(s, a, V) = R(s) + β Σ_{s'} T(s, a, s') V(s')

Resulting Q(s,a,V) tree for V = R (using tree addition):

  FVTree:
    Z
      t: 10
      f: Y
           t: 9
           f: 0.0

  Reward tree R:
    Z
      t: 10
      f: 0

  Q(s,a,V) tree:
    Z
      t: 19
      f: Y
           t: 8.1
           f: 0.0
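
As a quick check of the addition (on the assumption, not stated explicitly on the slide, that the discount used here is β = 0.9): the Z=true leaf gets 10 + 0.9 * 10 = 19, the Z=false, Y=true leaf gets 0 + 0.9 * 9 = 8.1, and the remaining leaf gets 0 + 0.9 * 0 = 0.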

35

Example: More Complex V

Original V(s') depends on y' and z' (so VARS = {y', z'}).  Call this tree T1:

    Y'
      t: 8.1
      f: Z'
           t: 19.0
           f: 0.0

  (Its partition: y' -> 8.1,   ~y', z' -> 19.0,   ~y', ~z' -> 0.0)

Probability tree for y' (only x and y at time t affect y'):

    X
      t: Y': 0.9
      f: Y
           t: Y': 1.0
           f: Y': 0.0

- If ~x and y then we always go to the 8.1 leaf (so this can be a leaf of the FVTree)
- Otherwise the probability of the leaf we go to depends on z'
- The distribution over the leaves of T1 depends on z at these leaves

36

Example: More Complex V

Original V(s') (T1, over y' and z'):

    Y'
      t: 8.1
      f: Z'
           t: 19.0
           f: 0.0

Composing the probability tree for y' (branching on x and y) with the probability tree for z' (only y and z affect z') gives a tree T2 over the current-state variables; each leaf of T2 is labeled with the probabilities it assigns, e.g. (Y': 0.9, Z': 1.0) or (Y': 0.0, Z': 0.9).

[Figure: the composed tree T2, branching on X, Y, and Z, before simplification.]

Can simplify this tree . . .

37

Example: More Complex V

Original V(s') (T1):

    Y'
      t: 8.1
      f: Z'
           t: 19.0
           f: 0.0

[Figure: the simplified composed tree over X, Y, and Z. Its leaves carry the labels (Y': 0.9, Z': 1.0), (Y': 0.9, Z': 0.9), (Y': 0.9, Z': 0.0), (Y': 1.0), (Y': 0.0, Z': 1.0), and (Y': 0.0, Z': 0.0).]

- Each leaf of the resulting tree gives a distribution over the leaves of the tree for V(s') (i.e. a distribution over Y' and Z', which gives a distribution over future values)
- This serves as the structure for the FVTree

The simplified tree gives the structure of the FVTree.

38

Example: More Complex V

Original V (T1, depends on y' and z'):

    Y'
      t: 8.1
      f: Z'
           t: 19.0
           f: 0.0

Form the FVTree by storing the expected future value (FV) at each leaf of the simplified composed tree. Each leaf's FV is Pr(y')*8.1 + Pr(~y')*Pr(z')*19.0 + Pr(~y')*Pr(~z')*0:

  FV = 8.1*0.9 + 19.0*(0.1*1)   + 0*0          = 9.19
  FV = 8.1*0.9 + 19.0*(0.1*0.9) + 0*(0.1*0.1)  = 9.0
  FV = 8.1*0.9 + 19.0*(0.1*0)   + 0*(0.1*1)    = 7.29
  FV = 8.1*1   + 19.0*(0)       + 0*(0)        = 8.1
  FV = 8.1*0   + 19.0*(1*1)     + 0*(1*0)      = 19.0
  FV = 8.1*0   + 19.0*(1*0)     + 0*(1*1)      = 0

39

Recap: Value Iteration

V_0(s) = 0
V_{k+1}(s) = B[V_k](s)

Bellman Backup:

  Q(s, a, V) = R(s) + β Σ_{s'} Pr(s' | s, a) V(s')
  B[V](s) = max_a Q(s, a, V)



So we can perform all the steps of value iteration by directly manipulating trees.

When the sequence of value functions has small tree representations, this gives a huge savings.

40

SDP: Relative Merits


Adaptive, nonuniform, exact abstraction method


provides exact solution to MDP


much more efficient on certain problems (time/space)


400 million state problems in a couple hrs


Can formulate a similar procedure for modified policy
iteration


Some drawbacks


produces piecewise constant VF


some problems admit no compact solution representation


so the sizes of the trees blow up with enough iterations


approximation may be desirable or necessary

41

Approximate SDP


Easy to approximate solution using SDP


Simple pruning of the value function


Simply “merge” leaves that have similar values


Can prune trees
[BouDearden96]

or ADDs
[StaubinHoeyBou00]


Gives regions of approximately the same value (a toy sketch of the idea follows below)
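
A toy sketch of this leaf-merging idea (not the pruning algorithms of the cited papers): collapse any subtree whose leaf values lie within a tolerance, keeping their interval so the introduced error stays visible, much like the pruned ADD on the next slide.

    # Toy pruning of a tuple-tree value function (numeric leaves assumed):
    # merge a whole subtree into a single leaf whenever its leaf values span
    # at most `tol`, storing the [lo, hi] interval of the merged region.
    def leaf_values(tree):
        if tree[0] == "leaf":
            return [tree[1]]
        _, _, children = tree
        return [v for sub in children.values() for v in leaf_values(sub)]

    def prune(tree, tol):
        if tree[0] == "leaf":
            return tree
        vals = leaf_values(tree)
        lo, hi = min(vals), max(vals)
        if hi - lo <= tol:
            return ("leaf", (lo, hi))            # merged region keeps its value interval
        _, var, children = tree
        return ("node", var, {v: prune(sub, tol) for v, sub in children.items()})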


42

A Pruned Value ADD

[Figure: a value ADD over the variables HCU, HCR, W, U, R, and Loc, with leaf values 10.00, 9.00, 8.45, 8.36, 7.64, 7.45, 6.81, 6.64, 6.19, 5.62, and 5.19, shown next to its pruned version, whose merged leaves carry the value intervals [9.00, 10.00], [7.45, 8.45], [6.64, 7.64], and [5.19, 6.19].]

43

Approximate SDP: Relative Merits


Relative merits of ASDP: fewer regions implies faster computation

  30-40 billion state problems in a couple of hours

  allows fine-grained control of time vs. solution quality with dynamic error bounds


technical challenges: variable ordering, convergence, fixed
vs. adaptive tolerance, etc.


Some drawbacks


(still) produces piecewise constant VF


doesn’t exploit additive structure of VF at all


Bottom line: When a problem matches the structural assumptions of SDP then we can gain much. But many problems do not match these assumptions.


44

Ongoing Work


Factored action spaces


Sometimes the action space is large, but has structure.


For example, cooperative multi-agent systems


Recent work (at OSU) has studied SDP for
factored action spaces


Include action variables in the DBNs

[Figure: a DBN in which both action variables and state variables appear as nodes.]