Reinforcement Learning in Continuous Time and Space

Kenji Doya*
ATR Human Information Processing Research Laboratories
2-2 Hikaridai, Seika, Soraku, Kyoto 619-0288, Japan

January 20, 1999
to appear in Neural Computation

*Current address: Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation. 2-2 Hikaridai, Seika, Soraku, Kyoto 619-0288, Japan. Phone: +81-774-95-1210. Fax: +81-774-95-3001. E-mail: doya@erato.atr.co.jp

Abstract

This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The "advantage updating," a model-free algorithm derived previously, is also formulated in the HJB-based framework.

The performance of the proposed algorithms is first tested in a nonlinear control task of swinging up a pendulum with limited torque. It is shown in the simulations that 1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; 2) among the continuous policy update methods, the value-gradient based policy with a known or learned dynamic model performs several times better than the actor-critic method; and 3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task, i.e., cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient based policy with a learned dynamic model.

1 Introduction

The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review).

The progress of RL research so far, however, has been mostly constrained to the discrete formulation of the problem, in which discrete actions are taken in discrete time steps based on the observation of the discrete state of the system. Many interesting real-world control tasks, such as driving a car or riding a snowboard, require smooth continuous actions taken in response to high-dimensional, real-valued sensory input.

In applications of RL to continuous problems, the most common approach has been first to discretize time, state, and action and then to apply an RL algorithm for a discrete stochastic system. However, this discretization approach has the following drawbacks:

1. When a coarse discretization is used, the control output is not smooth, resulting in a poor performance.

2. When a fine discretization is used, the number of states and the number of iteration steps become huge, which necessitates not only large memory storage but also many learning trials (see the sketch after this list).

3. In order to keep the number of states manageable, an elaborate partitioning of the variables has to be found using prior knowledge.
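As a concrete illustration of this common discretization approach, the following minimal Python sketch grids a two-dimensional pendulum state and applies a standard tabular Q-learning backup; the state bounds, bin counts, and learning parameters are illustrative assumptions rather than values taken from this paper, and the comment on the table size shows how quickly the number of entries grows with resolution and dimensionality (drawback 2).

# A minimal sketch of the "discretize, then apply a discrete RL algorithm"
# approach described above. All state bounds, bin counts, and learning
# parameters are illustrative assumptions, not values from this paper.
import numpy as np

# Hypothetical two-dimensional continuous state: pendulum angle and velocity.
STATE_LOW = np.array([-np.pi, -8.0])
STATE_HIGH = np.array([np.pi, 8.0])
N_BINS = 30       # grid resolution per state dimension
N_ACTIONS = 3     # e.g., torque in {-u_max, 0, +u_max}

def discretize(x):
    """Map a continuous state vector to a single index into the Q-table."""
    ratios = (x - STATE_LOW) / (STATE_HIGH - STATE_LOW)
    bins = np.clip((ratios * N_BINS).astype(int), 0, N_BINS - 1)
    return bins[0] * N_BINS + bins[1]

# Tabular Q-function: already 30 * 30 * 3 = 2,700 entries for a 2-D state;
# a 4-D cart-pole state at the same resolution would need 30**4 * 3 =
# 2,430,000 entries, which is the memory/trial blow-up of drawback 2.
Q = np.zeros((N_BINS ** 2, N_ACTIONS))

def q_update(x, a, reward, x_next, alpha=0.1, gamma=0.99):
    """One conventional discrete-time Q-learning backup on the gridded state."""
    s, s_next = discretize(x), discretize(x_next)
    target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Single backup with made-up transition data:
q_update(np.array([0.1, 0.0]), a=2, reward=-0.05, x_next=np.array([0.12, 0.3]))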
Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple timescale methods (Sutton, 1995).

In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, state, and action. The continuous framework has the following possible advantages:

1. A smooth control performance can be achieved.

2. An efficient control policy can be derived using the gradient of the value function (Werbos, 1990).

3. There is no need to guess how to partition the state, action, and time: it is the task of the function approximation and numerical integration algorithms to find the right granularity.

There have been several attempts at extending RL algorithms to continuous cases. Bradtke (1993) showed convergence results for Q-learning algorithms for discrete-time, continuous-state systems with linear dynamics and quadratic costs. Bradtke and Duff (1995) derived a TD algorithm for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" method by extending Q-learning to be used for continuous-time, continuous-state problems.

When we consider optimization problems in continuous-time systems, the Hamilton-Jacobi-Bellman (HJB) equation, which is a continuous-time counterpart of the Bellman equation for discrete-time systems, provides a sound theoretical basis (see, e.g., Bertsekas
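For orientation on the formulation referred to here, the discounted, infinite-horizon setting on which such an HJB-based framework rests can be sketched as follows. The notation below (deterministic dynamics dx/dt = f(x, u), reward r(x, u), discount time constant τ, value function V) is assumed for illustration rather than quoted from the paper.

% Sketch under assumed notation: dynamics \dot{x} = f(x, u), reward r(x, u),
% discount time constant \tau.
\begin{align}
  % Infinite-horizon, discounted value of a policy \pi:
  V^{\pi}(x(t)) &= \int_{t}^{\infty} e^{-(s-t)/\tau}\, r(x(s), u(s))\, ds \\
  % Hamilton-Jacobi-Bellman equation for the optimal value function:
  \frac{1}{\tau}\, V^{*}(x) &= \max_{u}\left[\, r(x, u)
      + \frac{\partial V^{*}}{\partial x}\, f(x, u) \right] \\
  % Consistency condition along a trajectory, whose violation defines a
  % continuous-time TD error:
  \delta(t) &= r(t) - \frac{1}{\tau}\, V(t) + \dot{V}(t)
\end{align}

The last line gives one natural continuous-time form of the TD error mentioned in the abstract: when the value estimate satisfies (1/τ)V(t) = r(t) + dV(t)/dt along observed trajectories, δ(t) vanishes.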