Reinforcement Learning


COMP2039 Artificial Intelligence:
Reinforcement Learning
Bob Damper
May 8, 2006
Supervised and Unsupervised Learning
Machine learning is traditionally divided into:
Supervised learning...
takes a set of pre-labelled examples, i.e., input-output pairs.
extracts a mapping allowing us to predict a label (classification) or a function value (regression) corresponding to an 'unseen' input.
Examples: decision tree (ID3), perceptron algorithm, support vector machine.
Unsupervised learning...
As for supervised learning, but the example data are unlabelled, i.e., no outputs.
Typically works by clustering.
On-Line and Off-Line Learning
An important distinction in AI.
Off-line learning requires the entire set of examples to be available before learning commences.
Supervised and unsupervised learning are (almost always) off-line.
But in AI, we frequently want to produce systems ('agents') that "learn as they go", i.e., on-line learning.
That is, the agent learns by interacting with its environment.
We call this "learning by doing", or reinforcement learning (RL).
For RL to be possible, the agent/learner must have some internal goals, or 'wants', that guide learning.
Learning What To Do
Learning what to do involves mapping situations to actions...
...so as to maximise the agent's reward (or minimise punishment, a negative reward).
By "mapping situations to actions", we mean deciding what to do next based on some measure(s) sensed from the current state of the environment.
The reward is some signal from the environment that changes dependent upon the agent's action.
The agent has to decide which actions from a selection of possible actions are likely to increase its rewards, when the function connecting these is unknown and/or uncertain.
Schematic of RL
[Figure: the learning agent selects actions in the environment and receives a reward/punishment signal r back from the environment, which acts as a 'critic'.]
The agent is a learning system.
Reward/punishment denoted r.
The environment can be considered as a 'critic' of the actions selected by the agent.
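The interaction in this schematic can be sketched as a simple loop. The Agent and Environment interfaces below are hypothetical (not from the slides), chosen only to show how states, actions and the reward signal r flow between the agent and its 'critic':

# Minimal sketch of the agent-environment loop; the Agent/Environment
# interfaces (reset, step, choose_action, learn) are assumptions for
# illustration, not part of the slides.
def run_episode(agent, env, max_steps=100):
    state = env.reset()                                 # initial state
    for _ in range(max_steps):
        action = agent.choose_action(state)             # map situation to action
        next_state, reward, done = env.step(action)     # environment acts as 'critic'
        agent.learn(state, action, reward, next_state)  # adjust behaviour using r
        state = next_state
        if done:
            break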
Characteristics of RL
Two cornerstones are:
1. Trial and error search... of the space of actions and associated rewards.
2. Delayed rewards, since (in most cases of interest) the action affects not only the immediate reward at the current state but also the next situation.
A key challenge is the exploitation-exploration dilemma.
Exploitation is profiting from what has already been learned... a conservative strategy ('policy'), which may miss high-reward actions.
Exploration is searching for good new actions among untried actions... a radical strategy, which may inadvertently try disastrous actions.
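One standard way to trade off exploitation against exploration (not given on the slide, but a common convention) is an epsilon-greedy rule: exploit the action currently believed best, except with small probability epsilon pick an action at random. A minimal sketch:

import random

# Epsilon-greedy action selection: estimated_values maps each action to the
# reward currently expected from it; epsilon is the exploration probability.
def epsilon_greedy(estimated_values, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                        # explore: random action
    return max(actions, key=lambda a: estimated_values[a])   # exploit: best-known action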
Example I
Chess playing is one example of an AI problem that can be attacked by reinforcement learning.
Here, the 'environment' is the chessboard.
The agent's goal is to win the game.
The selected action is to move one of the agent's chess pieces.
There is then an indirect change in the environment, indirect because the adversary makes a move consequent to the agent's move.
The reward/punishment is some measure of the agent's chances of winning given the new environment, such as the evaluation function used in minimax search.
Example II
Another good example would be a robot gripper learning to grasp and manipulate objects without dropping or crushing them, spilling their contents, etc.
Here, the environment is the workspace containing the objects to be manipulated.
The agent's goal might be, e.g., to transfer the contents of a jug into another receptacle.
The selected action (probably hierarchical) might change the environment by moving the robot arm to the target object, increasing the grip of the end effector, placing the object on the worktop, etc.
Reward/punishment will be appropriately related to the particular goal (e.g., object held without slip for x seconds).
RL and Robotics
As the previous example shows, RL is especially valuable in robotics.
If the environment is 'unstructured', there is little alternative to learning from interaction with the environment.
The robot/agent can be adaptive to environmental changes.
But the robot itself forms part of the environment.
So it can learn about its own capabilities and limitations... we do not need a mathematical model of the 'plant' as in classical robotics.
As the whole process is adaptive, the robot can learn to compensate for wear, component failure, etc.
The Four Main Aspects of RL
These are:
1. a policy;
2. a reward function;
3. a utility (or 'value') function; and
4. (optionally) a model of the environment.
Policy
The policy defines the agent's way of behaving in response to any given set of environmental conditions: policy, π : S ↦ A, i.e., a mapping from perceived states of the environment to actions.
Policies can be passive or active.
In passive RL, the agent's policy is fixed (e.g., simple look-up table; a set of stimulus-response rules).
In active RL, the agent must also learn the policy (using, e.g., a neural network; some kind of heuristic search).
Policies may be stochastic (to improve search for good policies).
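As a concrete (and entirely invented) illustration of a fixed, passive-RL policy, π can be a simple look-up table from perceived states to actions; the state and action names below are assumptions, not from the slides:

# A fixed policy pi : S -> A implemented as a look-up table.
# State and action names are invented purely for illustration.
fixed_policy = {
    "block_far":  "move_towards",
    "block_near": "close_gripper",
    "block_held": "lift",
}

def policy(state):
    return fixed_policy[state]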
Reward Function
This defines the 'benefit' to the learning agent of the current environmental state:
reward : S ↦ ℝ, r ∈ ℝ
That is, it is a mapping from perceived states of the environment to a scalar number – the reward (or punishment).
The agent's goal is to maximise the reward in the long term, i.e., short-term reduction of reward is acceptable if it leads to long-term gain.
Reward functions are generally fixed, since they define the agent's 'wants'.
Reward functions are used to modify the policy.
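A reward function is just such a mapping from states to scalars. The sketch below uses the terminal rewards of the grid-world example later in these slides (+1 at (c, 4), −1 at (b, 4), a small negative reward elsewhere); the tuple encoding of states is an assumption:

# reward : S -> R, illustrated with the grid-world values used later in the
# slides; states are (row, column) tuples, an assumed encoding.
def reward(state):
    if state == ("c", 4):
        return 1.0     # winning terminal state
    if state == ("b", 4):
        return -1.0    # losing terminal state
    return -0.05       # small negative 'reward' in non-terminal states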
Utility Function
We have said that short-term reduction of reward is acceptable if it leads to long-term gain.
Unlike the reward function, which is immediate, the utility function U^π(s) attempts to specify long-term 'benefit'.
It is the total reward that an agent can expect to accumulate in the future, starting from state s:
U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]
Utilities are used to make and evaluate decisions.
Rewards are easy to measure – given by the environment – but utilities are hard to determine.
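The discounted sum inside this definition is easy to compute for a single observed trial; the utility U^π(s) is then the expectation of this quantity over trials. A minimal sketch (γ = 0.9 is an arbitrary choice):

# Discounted return sum_t gamma^t * R(s_t) along one observed trial.
def discounted_return(rewards, gamma=0.9):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example: three non-terminal steps (reward -0.05 each), then the +1 goal.
print(discounted_return([-0.05, -0.05, -0.05, 1.0], gamma=0.9))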
Model of Environment
Not an essential part of RL – only starting to be used in recent developments.
Models are used for planning.
They mimic the behaviour of the environment and so can be used, for example, to predict the next state and next reward given the current state and selected action.
A model can be learned from interaction.
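As a toy illustration (all states, actions and values here are invented), a learned model can be as simple as a table from (state, action) pairs to predicted (next state, next reward) pairs:

# A toy environment model: predicts (next state, next reward) for each
# (state, action) pair. All entries are invented for illustration.
model = {
    ("block_far", "move_towards"):   ("block_near", -0.05),
    ("block_near", "close_gripper"): ("block_held",  1.0),
}

def predict(state, action):
    return model[(state, action)]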
Varieties of RL
There are very many varieties of RL, including:
utility-based
Q-learning
reflex
dynamic programming methods
temporal difference (TD) methods, including actor-critic schemes
etc.
We cannot hope to cover all of these, but will say some more about the first three.
Three Flavours of RL
A utility-based agent attempts to learn a utility function and then uses it to select actions that maximise the expected utility: U^π(s).
A Q-learning agent learns a so-called action-value function Q, giving the expected utility of taking a given action: Q^π(s, a).
A reflex agent simply learns a direct mapping from states to actions.
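The slides only name Q-learning; as a concrete illustration, the standard tabular update for the action-value function Q looks like the sketch below (the learning rate alpha and discount gamma are arbitrary choices, not from the slides):

from collections import defaultdict

# Tabular Q-learning update: nudge Q(s, a) towards r + gamma * max_a' Q(s', a').
Q = defaultdict(float)   # Q[(state, action)] -> estimated expected utility

def q_update(state, action, r, next_state, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])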
Example: A Simple Grid World
[Figure: a grid with rows a, b, c and columns 1-4; the terminal states at (c, 4) and (b, 4) are marked +1 and −1, and arrows show a fixed policy.]
Agent starts in some state s_0 = (x, n), x ∈ {a, b, c}, n ∈ {1, 2, 3, 4}.
There is a fixed (stochastic) policy as shown.
Actions are selected until reaching one of the goal states at (c, 4) or (b, 4)...
...with reward +1 or −1 respectively.
Small negative 'reward' in non-terminal states.
Grid World Rules and Policy
Arrows indicate the agent moves in that direction with a relatively high fixed probability, say p = 0.7...
...with a small probability of (1 − p)/2 = 0.15 of moving at right angles to this direction.
If a right-angle move is illegal, the agent 'bounces' off the wall to remain in its original state.
In passive RL, the agent's task is to learn the utility function without knowing the reward function for each state.
Difficult to do exactly! We have to work out the expectation of infinite sums!
In active RL, we have to find the optimal policy too.
The policy shown is optimal for a non-terminal reward R_NT ∼ −0.05.
Note the avoidance of the short-cut at (a, 3).
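Under stated assumptions about the grid's orientation (rows a to c from top to bottom, columns 1 to 4 left to right), the stochastic transition rule can be sketched as follows; everything here other than p = 0.7, the 0.15 right-angle probability, and the 'bounce' behaviour is an implementation assumption:

import random

# Sample one grid-world transition: the intended move with probability p,
# a right-angle move with probability (1 - p)/2 each; illegal moves 'bounce'
# off the wall, leaving the state unchanged. Grid orientation is assumed.
ROWS, COLS = ["a", "b", "c"], [1, 2, 3, 4]
DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
RIGHT_ANGLES = {"up": ("left", "right"), "down": ("left", "right"),
                "left": ("up", "down"), "right": ("up", "down")}

def step(state, intended, p=0.7):
    side_a, side_b = RIGHT_ANGLES[intended]
    direction = random.choices([intended, side_a, side_b],
                               weights=[p, (1 - p) / 2, (1 - p) / 2])[0]
    row, col = ROWS.index(state[0]), COLS.index(state[1])
    dr, dc = DELTAS[direction]
    new_row, new_col = row + dr, col + dc
    if 0 <= new_row < len(ROWS) and 0 <= new_col < len(COLS):
        return (ROWS[new_row], COLS[new_col])
    return state   # bounce off the wall: remain in the original state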
Direct Utility Estimation
In the passive case, the agent's task is to learn the utility function without knowing the reward function for each state.
Suppose we have a set of trials in the environment, i.e., pairs of states visited and rewards. We can use these to estimate the utility of each state directly.
This converts the RL problem to standard supervised learning (with the state as input and the reward-to-go as output).
But this direct utility estimation ignores a very important aspect of the problem:
The utility of each state equals its own reward plus the expected utility of its successor states.
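A minimal sketch of direct utility estimation from such trials, assuming each trial is recorded as a list of (state, reward) pairs and the estimate is the average observed reward-to-go for each visited state (the discount gamma is an assumption; gamma = 1 gives the plain reward-to-go):

from collections import defaultdict

# Estimate U(s) directly as the average reward-to-go observed from s.
# trials: list of trials, each a list of (state, reward) pairs.
def direct_utility_estimate(trials, gamma=1.0):
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        for state, r in reversed(trial):      # accumulate backwards from the end
            reward_to_go = r + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}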
Bellman Equations
For a transition function T:
U^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U^π(s')
Exploiting knowledge of the connections between states encoded in T allows us to reduce the state space of the search... perhaps dramatically.
But we have to learn T.
In principle, this can be done by observing trials and estimating state transition probabilities by counting.
We can then solve the above (Bellman) equations by dynamic programming.
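A minimal sketch of these two steps for a fixed policy π, where T_π(s, s') = T(s, π(s), s') is estimated by counting observed transitions and the Bellman equations are then solved by simple iteration (a basic dynamic-programming scheme); the trial format, R, and gamma are assumptions:

from collections import defaultdict

# (1) Estimate T_pi(s, s') by counting transitions observed in trials,
#     where each trial is the list of states visited under the fixed policy.
def estimate_T(trials):
    counts = defaultdict(lambda: defaultdict(int))
    for trial in trials:
        for s, s_next in zip(trial, trial[1:]):
            counts[s][s_next] += 1
    return {s: {s2: n / sum(nxt.values()) for s2, n in nxt.items()}
            for s, nxt in counts.items()}

# (2) Solve U(s) = R(s) + gamma * sum_{s'} T_pi(s, s') U(s') by repeated sweeps.
def solve_bellman(T, R, gamma=0.9, sweeps=100):
    U = {s: 0.0 for s in R}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U.get(s2, 0.0)
                                   for s2, p in T.get(s, {}).items())
             for s in R}
    return U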