Predictive State Representation


Predictive State Representation


Masoumeh Izadi

School of Computer Science

McGill University




UdeM-McGill Machine Learning Seminar


Outline



Predictive Representations


PSR model specifications


Learning PSR


Using PSRs in Control Problems


Conclusion


Future Directions

Motivation

In a dynamical system:



Knowing the exact state of the system is
mostly an unrealistic assumption.


Real world tasks exhibit uncertainty


POMDPs maintain a belief b = (p(s_0), ..., p(s_n)) over hidden states s_i as the state.


Beliefs are not verifiable!


POMDPs are hard to learn and to solve.






Motivation

Potential alternatives:


k-Markov models

not general!



Predictive Representations




Predictive

Representations


State representation is in terms of experience.


State is represented by the predictions that can be made from it.


Predictions
represent cause and effect.



Predictions are testable, maintainable, and
learnable.



No explicit notion of topological relationships.


Predictive State Representation


Test: a sequence of action-observation pairs


Prediction for a test given a history:

q = a_1 o_1 ... a_k o_k

p(q|h) = P(o_1 ... o_k | h, a_1 ... a_k)


Sufficient statistics: predictions for a set of core tests, Q


Core Tests

A set of tests Q is a core test set if its predictions form a sufficient statistic for the dynamical system.

p(Q|h) = [p(q_1|h) ... p(q_n|h)]


For any test t:

p(t|h) = f_t(p(Q|h))

Linear PSR Model

For any test q, there exists a projection vector m_q s.t.:

p(q|h) = p(Q|h)^T m_q

Given a new action-observation pair ao, the prediction vector for each q_i ∈ Q is updated by:

p(q_i|hao) = p(ao q_i|h) / p(ao|h) = p(Q|h)^T m_{ao q_i} / p(Q|h)^T m_{ao}

PSR Model Parameters


The set of core tests: Q = {q_1, ..., q_n}


Projection vectors for one-step tests: m_ao (for all ao pairs)


Projection vectors for one-step extensions of core tests: m_{ao q_i} (for all ao pairs)
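The linear PSR update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a learned model: the parameters m_ao and the columns m_{ao q_i} below are made-up toy numbers, and all variable names are assumptions for the example.

```python
import numpy as np

# Sketch of the linear PSR state update for a single (a, o) pair.
# State: a prediction vector p = p(Q|h) over |Q| core tests.
# A real model's m-vectors satisfy consistency constraints; these are toy values.

rng = np.random.default_rng(0)
n = 3                          # |Q|, number of core tests
p = np.full(n, 1.0 / n)        # current prediction vector p(Q|h)

m_ao = rng.uniform(0.1, 0.9, size=n)       # projection vector for the one-step test ao
M_ao = rng.uniform(0.1, 0.9, size=(n, n))  # column i is m_{ao q_i}

def psr_update(p, m_ao, M_ao):
    """p(q_i | h a o) = p(Q|h)^T m_{ao q_i} / p(Q|h)^T m_{ao}."""
    denom = p @ m_ao            # p(ao | h)
    return (p @ M_ao) / denom   # vector of p(q_i | h a o)

p_next = psr_update(p, m_ao, M_ao)
print(p_next.shape)  # (3,)
```

Note that the update is a single matrix-vector product followed by a scalar normalization, analogous to (but typically lower-dimensional than) the POMDP belief update.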

Linear PSR vs. POMDP


A linear PSR representation can be more compact than the POMDP representation.

A POMDP with n nominal states can represent a dynamical system of dimension at most n.

POMDP

Model


The model is an n-tuple {S, A, Ω, T, O, R}:


Sufficient statistic: belief state (probability distribution over S)

S = set of states

A = set of actions

Ω = set of observations

T = transition probability distribution for each action

O = observation probability distribution for each action

R = reward function for each action

Belief State

Posterior probability distribution over states

(Figure: belief simplex for |S| = 3; an action a and observation o map b to b'.)

b'(s') = O(s',a,o) Σ_s T(s,a,s') b(s) / Pr(o | a,b)

0 ≤ b(s) ≤ 1 for all s ∈ S, and Σ_{s∈S} b(s) = 1
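The belief update above is a standard Bayes filter step; a minimal sketch for a toy 3-state POMDP follows. The transition and observation numbers are invented for illustration.

```python
import numpy as np

# Belief update b'(s') = O(s',a,o) * sum_s T(s,a,s') b(s) / Pr(o | a, b).
# T_a is the |S|x|S| transition matrix for the chosen action,
# O_a the |S|x|Omega| observation matrix; all values are toy assumptions.

def belief_update(b, T_a, O_a, o):
    """One Bayes-filter step over the belief simplex."""
    pred = b @ T_a                # predicted next-state distribution
    unnorm = pred * O_a[:, o]     # weight by observation likelihood
    return unnorm / unnorm.sum()  # normalize by Pr(o | a, b)

T_a = np.array([[0.9, 0.1, 0.0],
                [0.0, 0.8, 0.2],
                [0.1, 0.0, 0.9]])
O_a = np.array([[0.7, 0.3],
                [0.5, 0.5],
                [0.2, 0.8]])
b = np.array([1/3, 1/3, 1/3])
b1 = belief_update(b, T_a, O_a, o=0)
print(b1.sum())  # 1.0 (up to floating point)
```

The normalizer `unnorm.sum()` is exactly Pr(o | a, b), so the result always stays on the simplex.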

Construct PSR from POMDP

Outcome function u(t): the predictions for test t from all POMDP states.





Definition: A test t is said to be independent of a set of tests T if its outcome vector is linearly independent of the outcome vectors of the tests in T.


State Prediction Matrix


Rank of the matrix determines the size of Q.


Core tests correspond to linearly independent columns.


Entries are computed using the POMDP model.

(Matrix: rows are states s_1 ... s_n, columns are all possible tests t_1, t_2, ..., t_j, ...; entry (i, j) is u_i(t_j).)
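Given a known POMDP, outcome vectors can be computed with a simple backward recursion, u(a o t)(s) = Σ_{s'} T(s,a,s') O(s',a,o) u(t)(s'), and core tests read off as linearly independent columns. The sketch below uses an invented 2-state, 1-action, 2-observation toy model.

```python
import numpy as np

# Computing outcome vectors u(t) from a POMDP model and bounding the
# linear dimension by the matrix rank. T[a], O[a] below are toy numbers.

def outcome_vector(test, T, O, n_states):
    """u(t)(s) = P(o_1..o_k | s, a_1..a_k) for test t = [(a1,o1),...,(ak,ok)]."""
    u = np.ones(n_states)              # u(empty test) = 1 from every state
    for a, o in reversed(test):        # fold the recursion right-to-left
        u = T[a] @ (O[a][:, o] * u)
    return u

T = [np.array([[0.9, 0.1], [0.3, 0.7]])]   # transition matrix for action 0
O = [np.array([[0.8, 0.2], [0.1, 0.9]])]   # observation matrix for action 0

tests = [[(0, 0)], [(0, 1)], [(0, 0), (0, 0)]]
U = np.column_stack([outcome_vector(t, T, O, 2) for t in tests])
print(np.linalg.matrix_rank(U))  # 2: |S| bounds the linear dimension
```

The two one-step outcome vectors sum to the all-ones vector (observation probabilities normalize), which is a quick sanity check on the recursion.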

Linearly Independent States

Definition: A linearly dependent state of an MDP is a state whose transition function, for every action, is a linear combination of the transition functions of the other states.



Having the same dynamical structure is a special
case of linear dependency.

Example

(Figure: a small MDP with transition probabilities 0.2/0.8 and 0.7/0.3, and observation pairs O1-O4 attached to its states.)

A linear PSR needs only two tests to represent the system.

e.g.: ao1, ao4 can predict any other tests.

State Space Compression

Theorem

For any controlled dynamical system:

linearly dependent states in the underlying MDP ⇒ a more compact PSR than the corresponding POMDP.

The reverse direction does not always hold, due to possible structure in the observations.

Exploiting Structure

PSR exploits linear independence structure in the
dynamics of a system.


PSR also exploits regularities in dynamics.


Lossless compression needs invariance of state
representation in terms of values as well as dynamics.


Including reward as part of observation makes linear PSR
similar to linear lossless compressions for POMDPs.


POMDP Example

States: 20 (direction, grid cell)

Actions: 3 (turn left, turn right, move)

Observations: 2 (wall, nothing)


Structure Captured by PSR

Aliased states (by immediate observation)

Predictive classes (by PSR core tests)

Generalization



Good generalization results when similar situations
have similar representations.



A good generalization makes it possible to learn with a small amount of experience.




Predictive representation:


generalizes the state space well.


makes the problem simpler and yet precise.


assists reinforcement learning algorithms.
[Rafols et al 2005]


Learning the PSR Model


The set of core tests: Q = {q_1, ..., q_|Q|}


Projection vectors for one-step tests: m_ao (for all ao pairs)


Projection vectors for one-step extensions of core tests: m_{ao q_i} (for all ao pairs)

System Dynamics Vector


Predictions of all possible future events can be generated from any precise model of the system.

(Vector: one entry p(t_i) per test t_1, t_2, ..., t_i, ...)

t_i = a_1 o_1 ... a_k o_k

p(t_i) = Pr(o_1 ... o_k | a_1 ... a_k)

System Dynamics Matrix


Linear dimension of a dynamical system is determined by the rank of the system dynamics matrix.

(Matrix: rows are histories h_1 = ε, h_2, ..., h_i, ...; columns are tests t_1, t_2, ..., t_j, ...; entry (i, j) is P(t_j|h_i).)

t_j = a_1 o_1 ... a_k o_k

h_i = a'_1 o'_1 ... a'_n o'_n

p(t_j|h_i) = Pr(o_{n+1} = o_1, ..., o_{n+k} = o_k | a'_1 o'_1 ... a'_n o'_n, a_1 ... a_k)

POMDP in System Dynamics Matrix


Any model must be able to generate the system dynamics matrix.

Core beliefs B = {b_1, b_2, ..., b_N}:


Span the reachable subspace of the continuous belief space;


Can be beneficial in POMDP solution methods [Izadi et al 2005];


Represent reduced state-space dimensions in structured domains.

(Matrix: rows are core beliefs b_1, b_2, ..., b_i; columns are tests t_1, t_2, ..., t_j; entry (i, j) is P(t_j|b_i).)

Core Test Discovery

Z_ij = P(t_j|h_i)


Extend tests and histories one step and estimate the entries of Z (by counting data samples).


Find the rank and keep the linearly independent tests and histories.


Keep extending until the rank doesn't change.

(Figure: a block of the system dynamics matrix over tests (T) and histories (H), with h_1 = ε.)

Extending all possible tests and histories requires processing a huge matrix in large domains.

Core Test Discovery

(Figure: one-step histories h_1, h_2, ... and one-step tests t_1, t_2, ... form the initial block.)

Repeat one-step extensions of Q_i until the rank doesn't change.


Millions of samples may be required even for a problem with only a few states.
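The rank step of the discovery loop above can be sketched directly: given an estimated block Z of the system dynamics matrix, keep one column per rank increment. The function name and the numbers in Z are assumptions for illustration (in practice the entries come from counting data samples and are noisy, hence the tolerance).

```python
import numpy as np

# Rank-based selection of candidate core tests from an estimated block
# Z (rows = histories, columns = tests) of the system dynamics matrix.

def independent_columns(Z, tol=1e-8):
    """Indices of columns that increase the rank, scanned left to right."""
    kept = []
    for j in range(Z.shape[1]):
        cand = Z[:, kept + [j]]
        if np.linalg.matrix_rank(cand, tol=tol) > len(kept):
            kept.append(j)
    return kept

Z = np.array([[0.73, 0.27, 0.5287],       # made-up estimates of P(t_j | h_i)
              [0.31, 0.69, 0.1969]])
core = independent_columns(Z)
print(core)  # [0, 1]: the first two tests are linearly independent
```

With noisy counts the tolerance effectively decides the model's dimension, which is one reason discovery needs so many samples.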

PSR Learning


Structure Learning:


which tests to choose for Q from data


Parameter Learning:


how to tune the m-vectors given the structure and experience data

Learning Parameters

PSR


Gradient algorithm [Singh et al. 2003]


Principal-component-based algorithm for TPSR (uncontrolled systems) [Rosencrantz et al. 2004]


Suffix-History algorithm [James et al. 2004]


POMDP


EM
Results on PSR Model Learning

Planning



States expressed in predictive form.


Planning and reasoning should be in terms of experience.


Rewards treated as part of observations.


Tests are of the form: t = a_1(o_1 r_1) ... a_n(o_n r_n).


General POMDP methods (e.g. dynamic programming) can be used.

Predictive Space

(Figure: prediction-vector simplex for |Q| = 3; an observation o maps P(Q|h) to P(Q|hao).)

P(q_i|hao) = p(Q|h)^T m_{ao q_i} / p(Q|h)^T m_{ao}

0 ≤ P(q_i) ≤ 1 for all i

Forward Search

(Figure: a search tree alternating action choices a_1, a_2 and observation branches o_1, o_2.)

Exponential complexity.

Compare alternative future experiences.

DP for Finite-Horizon POMDPs


The value function for a set of policy trees is always piecewise linear and convex (PWLC).

(Figure: three policy trees p_1, p_2, p_3 over actions a_1, a_2, a_3 and observations o_1, o_2, and the PWLC value function they induce over the belief space for states s_1, s_2.)

Value Iteration in POMDPs


Value iteration:


Initialize the value function:

V(b) = max_a Σ_s R(s,a) b(s)


This produces one alpha-vector per action.


Compute the value function at the next iteration using Bellman's equation:

V(b) = max_a [ Σ_s R(s,a) b(s) + γ Σ_z max_α Σ_s b(s) Σ_s' T(s,a,s') O(s',a,z) α(s') ]

DP for Finite-Horizon PSRs

Theorem: the value function for a finite horizon is still piecewise-linear and convex.

There is a scalar reward for each test:

R(h_t, a) = Σ_r r · Pr(r | h_t, a)


The value of a policy tree is a linear function of the prediction vector:

V_p(p(Q|h)) = p(Q|h)^T ( n_a + Σ_o M_ao w )

Value Iteration in PSRs


Value iteration works just as in POMDPs:


V(p(Q|h)) = max_α [ V_α(p(Q|h)) ]


Any finite-horizon solution can be represented by a finite set of alpha-vectors (policy trees).



Results on PSR Control

[James et al. 2004]

Results on PSR Control


Current PSR planning algorithms offer no advantage over POMDP planning ([Izadi & Precup 2003], [James et al. 2004]).


Planning requires a precise definition of the predictive space.


It is important to analyze the impact of PSR planning on structured domains.



Predictive
Representations


Linear PSR


EPSR: action sequence + last observation [Rudary and Singh 2004]


mPSR: augmented with history [James et al 2005]


TD Networks: temporal-difference learning with a network of interrelated predictions [Tanner and Sutton 2004]



A good state representation should be:


compact


useful for planning


efficiently learnable



Predictive state representations provide a lossless compression which reflects the underlying structure.


PSRs generalize the space and facilitate planning.


Summary

Limitations



Learning and Discovery in PSRs still lack efficient
algorithms.



Current algorithms need far too many data samples.



So far, experiments on many ideas can only be done on toy problems, due to model-learning limitations.




Future Work


Theory of PSR and possible extensions


Efficient algorithms for learning predictive models


More on combining temporal abstraction with PSR


More on planning algorithms for PSR and EPSR


Approximation methods are yet to be developed


PSR for continuous systems


Generalization across states in stochastic systems



Non-linear PSRs and exponential compression (?)