Machine Learning (Winter 2011) · Anush Sankaran · 15-02-2012
Reinforcement Learning

[Slide figure: a word-search letter grid]
Bandit Problem

A k-armed bandit offers k actions (arm pulls), each with an unknown reward distribution. The core difficulty is the exploration vs. exploitation trade-off: gather information about uncertain arms, or pull the arm that currently looks best.
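As a concrete illustration (not from the slides), here is a minimal ε-greedy strategy for a k-armed bandit with made-up Bernoulli arms: with probability ε the agent explores a random arm, otherwise it exploits the arm with the best estimate so far.

```python
import random

def epsilon_greedy_bandit(true_means, steps=10000, epsilon=0.1, seed=0):
    """Play a k-armed bandit with an epsilon-greedy strategy.

    true_means: hypothetical expected reward of each arm (unknown to the agent).
    Returns the per-arm sample-average reward estimates after `steps` pulls.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    estimates = [0.0] * k     # sample-average reward estimate per arm

    for _ in range(steps):
        if rng.random() < epsilon:                    # explore: random arm
            a = rng.randrange(k)
        else:                                         # exploit: best current estimate
            a = max(range(k), key=lambda i: estimates[i])
        # Bernoulli reward drawn with the arm's true mean
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean update

    return estimates

# With enough pulls, the agent should identify arm 2 (mean 0.8) as the best.
est = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

Raising ε trades short-term reward for faster discovery of the best arm; ε = 0 can lock onto a suboptimal arm forever.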
Formalize RL

[Figure: image from MIT OpenCourseWare SMA 5504, Fall 2002]
Example

RL Model


 A set of states, S
 A set of actions, A, available in each state
 Rules for transitions between states
 Rules for “rewards” attached to each state-action-state pair
 Rules for the observations available in the current state

All of these rules are, in general, “stochastic”.
[Diagram: a trajectory of states s0, s1, s2 with actions a0, a1, a2 and rewards r0, r1, r2]
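The components above can be sketched as plain data structures; the three-state deterministic MDP below is invented for illustration (in general the transition and reward rules may be stochastic).

```python
# A minimal deterministic MDP matching the slide's components:
# states S, per-state actions A(s), transition rules delta, reward rules r.
S = ["s0", "s1", "s2"]
A = {"s0": ["a0"], "s1": ["a1"], "s2": ["a2"]}

# delta[(s, a)] -> next state (the transition rule)
delta = {("s0", "a0"): "s1", ("s1", "a1"): "s2", ("s2", "a2"): "s0"}

# r[(s, a, s')] -> reward for each state-action-state pair
r = {("s0", "a0", "s1"): 0.0, ("s1", "a1", "s2"): 1.0, ("s2", "a2", "s0"): 0.0}

def step(s, a):
    """Apply the transition and reward rules (deterministic here;
    a stochastic MDP would sample s' from a distribution instead)."""
    s_next = delta[(s, a)]
    return s_next, r[(s, a, s_next)]

s_next, reward = step("s1", "a1")
```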

RL Goal

Goal: learn to “choose actions” that “maximize rewards”, i.e. maximize

    r0 + γ·r1 + γ²·r2 + … , where 0 ≤ γ < 1

(γ is kept strictly below 1 so that the infinite sum converges).

Note: the target function is a policy, Π : S → A.
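The discounted sum can be checked numerically; the reward sequence and γ below are arbitrary.

```python
def discounted_return(rewards, gamma):
    """Sum r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: rewards (1, 0, 2) with gamma = 0.5 give 1 + 0 + 0.25 * 2 = 1.5
g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```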
Difference from Supervised Learning

 Supervised learning: training examples have the form <s, Π(s)>; the correct action is given directly.

 Reinforcement learning: training examples have the form <<s, Π(s)>, r>; only a scalar reward for the chosen action is given.

RL is therefore a sequential decision-making problem.
Aspects of RL

1. Delayed reward
2. Exploration
3. Partially observable states
4. Lifelong learning
Mathematical Formulation

Discounted cumulative reward:

    V^Π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = ∑_{i=0}^{∞} γ^i · r_{t+i}

Other reward formulations (e.g., finite-horizon or average reward) are also used.
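One way to make V^Π concrete is to roll a fixed policy forward and accumulate discounted rewards. The two-state deterministic MDP below is hypothetical, and the infinite sum is truncated at a horizon where γ^i is negligible.

```python
def evaluate_policy(s, policy, delta, reward, gamma=0.9, horizon=200):
    """Approximate V^Pi(s) = sum_i gamma^i * r_{t+i} by simulating
    the (deterministic) MDP under policy Pi for `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += discount * reward[(s, a)]   # add gamma^i * r_{t+i}
        discount *= gamma
        s = delta[(s, a)]                    # follow the transition rule
    return total

# Two-state loop: s0 -> s1 (reward 0), s1 -> s0 (reward 1), repeating.
delta = {("s0", "go"): "s1", ("s1", "go"): "s0"}
reward = {("s0", "go"): 0.0, ("s1", "go"): 1.0}
policy = {"s0": "go", "s1": "go"}

# From s0: V = gamma + gamma^3 + gamma^5 + ... = gamma / (1 - gamma^2)
v = evaluate_policy("s0", policy, delta, reward, gamma=0.9)
```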
Learning Task

Optimal policy:

    Π* = argmax_Π V^Π(s), ∀s

Its value function V^{Π*}(s) is written V*(s): the maximum discounted cumulative reward obtainable from state s.
Illustration

Q Function

(Bellman’s optimality principle)

    Π*(s) = argmax_a [ r(s,a) + γ·V*(δ(s,a)) ]

where δ(s,a) is the state reached by taking action a in state s. Defining

    Q(s,a) ≡ r(s,a) + γ·V*(δ(s,a))

the optimal policy becomes

    Π*(s) = argmax_a Q(s,a)

Q Learning

Put simply,

    V*(s) = max_{a'} Q(s,a')

so Q can be written recursively, without reference to V*:

    Q(s,a) = r(s,a) + γ·max_{a'} Q(δ(s,a), a')
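When δ and r are known, this recursive definition can be iterated to a fixed point (Q-value iteration, a model-based sketch rather than Q-learning proper); the two-state MDP below is invented for illustration.

```python
def q_value_iteration(states, actions, delta, reward, gamma=0.9, sweeps=200):
    """Iterate Q(s,a) <- r(s,a) + gamma * max_a' Q(delta(s,a), a')
    over all (s,a) until (approximately) the fixed point."""
    Q = {(s, a): 0.0 for s in states for a in actions[s]}
    for _ in range(sweeps):
        Q = {(s, a): reward[(s, a)] + gamma * max(Q[(delta[(s, a)], a2)]
                                                  for a2 in actions[delta[(s, a)]])
             for s in states for a in actions[s]}
    return Q

# Hypothetical MDP: only staying in s1 pays; the best plan is to reach s1 and stay.
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"]}
delta = {("s0", "stay"): "s0", ("s0", "go"): "s1",
         ("s1", "stay"): "s1", ("s1", "go"): "s0"}
reward = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
          ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

Q = q_value_iteration(states, actions, delta, reward)
# Extract the greedy policy Pi*(s) = argmax_a Q(s,a).
pi_star = {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
```

Here V*(s1) = 1/(1−γ) = 10, so Q(s1, stay) converges to 1 + γ·10 = 10.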

Q Learning Algorithm

1. For every (s,a), initialize the table entry Q’(s,a) = 0
2. Observe the current state s
3. Repeat forever:
 Select an action a and execute it
 Receive the immediate reward r
 Observe the new state s’
 Update the table entry: Q’(s,a) ← r + γ·max_{a’} Q’(s’,a’)
 s ← s’
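The steps above translate directly into code. This is a sketch for the deterministic case the slides assume, so the update overwrites the table entry with no learning rate; the two-state environment and the random exploration policy are hypothetical choices.

```python
import random

def q_learning(states, actions, delta, reward, gamma=0.9,
               episodes=500, steps=20, seed=0):
    """Tabular Q-learning for a deterministic MDP:
    Q'(s,a) <- r + gamma * max_a' Q'(s',a')."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in states for a in actions[s]}   # step 1: init to 0
    for _ in range(episodes):
        s = rng.choice(states)                              # step 2: observe a state
        for _ in range(steps):                              # step 3: act and update
            a = rng.choice(actions[s])                      # (random exploration)
            s_next, r = delta[(s, a)], reward[(s, a)]
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions[s_next])
            s = s_next                                      # s <- s'
    return Q

# Same hypothetical two-state MDP: the best behaviour is to reach s1 and stay.
states = ["s0", "s1"]
actions = {s: ["stay", "go"] for s in states}
delta = {("s0", "stay"): "s0", ("s0", "go"): "s1",
         ("s1", "stay"): "s1", ("s1", "go"): "s0"}
reward = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
          ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

Q = q_learning(states, actions, delta, reward)
policy = {s: max(actions[s], key=lambda a: Q[(s, a)]) for s in states}
```

Note that the agent never consults δ or r directly; it only observes the s' and r returned after acting, which is what distinguishes this from the model-based iteration.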


Illustration

Convergence

For a deterministic MDP with bounded rewards, the estimates Q’ can be shown to converge to the true Q, provided every state-action pair is visited infinitely often.
Applications

• Checkers [Samuel 59]
• TD-Gammon [Tesauro 92]
• World’s best down-peak elevator dispatcher [Crites et al. ~95]
• Inventory management [Bertsekas et al. ~95]
  – 10-15% better than the industry standard
• Dynamic channel assignment [Singh & Bertsekas, Nie & Haykin ~95]
  – Outperforms the best heuristics in the literature
• Cart-pole [Michie & Chambers 68-] with bang-bang control
• Robotic manipulation [Grupen et al. 93-]
• Path planning
• Robot docking [Lin 93]
• Parking
• Football
• Tetris
• Multiagent RL [Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch 95-, lots of work since]
• Combinatorial optimization: maintenance & repair
  – Control of reasoning [Zhang & Dietterich IJCAI-95]

Thanks to Samarth for compiling.
Implementations

 RL-Glue: a standard interface for multiple languages
 Maja Machine Learning Framework (MMLF): Python
 PyBrain: Python
 Toolbox: for Matlab and Python
 TeachingBox: Java
 Open-source C++ implementations
 Orange: orngReinforcment, a data-mining tool