Reinforcement Learning in Continuous Time and Space

Kenji Doya*
ATR Human Information Processing Research Laboratories
2-2 Hikaridai, Seika, Soraku, Kyoto 619-0288, Japan

January 20, 1999
to appear in Neural Computation

*Current address: Kawato Dynamic Brain Project, ERATO, Japan Science and Technology Corporation. 2-2 Hikaridai, Seika, Soraku, Kyoto 619-0288, Japan. Phone: +81-774-95-1210. Fax: +81-774-95-3001. E-mail: doya@erato.atr.co.jp

Abstract

This paper presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and for improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, namely, a continuous actor-critic method and a value-gradient based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The "advantage updating," a model-free algorithm derived previously, is also formulated in the HJB-based framework.

The performance of the proposed algorithms is first tested in a nonlinear control task of swinging up a pendulum with limited torque. It is shown in the simulations that 1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; 2) among the continuous policy update methods, the value-gradient based policy with a known or learned dynamic model performs several times better than the actor-critic method; and 3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task, i.e., cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient based policy with a learned dynamic model.

1 Introduction

The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto et al., 1983; Sutton, 1988; Sutton and Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites and Barto, 1996; Zhang and Dietterich, 1996; Singh and Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling et al. (1996) and Sutton and Barto (1998) for a review).

The progress of RL research so far, however, has been mostly constrained to the discrete formulation of the problem, in which discrete actions are taken in discrete time steps based on the observation of the discrete state of the system. Many interesting real-world control tasks, such as driving a car or riding a snowboard, require smooth continuous actions taken in response to high-dimensional, real-valued sensory input.

In applications of RL to continuous problems, the most common approach has been first to discretize time, state, and action and then to apply an RL algorithm for a discrete stochastic system. However, this discretization approach has the following drawbacks:

1. When a coarse discretization is used, the control output is not smooth, resulting in a poor performance.

2. When a fine discretization is used, the number of states and the number of iteration steps become huge, which necessitates not only large memory storage but also many learning trials (see the sketch after this list).

3. In order to keep the number of states manageable, an elaborate partitioning of the variables has to be found using prior knowledge.
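As a concrete illustration of this common discretization approach, the following minimal Python sketch grids a two-dimensional pendulum state and applies a standard tabular Q-learning backup; the state bounds, bin counts, and learning parameters are illustrative assumptions rather than values taken from this paper, and the comment on the table size shows how quickly the number of entries grows with resolution and dimensionality (drawback 2).

# A minimal sketch of the "discretize, then apply a discrete RL algorithm"
# approach described above. All state bounds, bin counts, and learning
# parameters are illustrative assumptions, not values from this paper.
import numpy as np

# Hypothetical two-dimensional continuous state: pendulum angle and velocity.
STATE_LOW = np.array([-np.pi, -8.0])
STATE_HIGH = np.array([np.pi, 8.0])
N_BINS = 30       # grid resolution per state dimension
N_ACTIONS = 3     # e.g., torque in {-u_max, 0, +u_max}

def discretize(x):
    """Map a continuous state vector to a single index into the Q-table."""
    ratios = (x - STATE_LOW) / (STATE_HIGH - STATE_LOW)
    bins = np.clip((ratios * N_BINS).astype(int), 0, N_BINS - 1)
    return bins[0] * N_BINS + bins[1]

# Tabular Q-function: already 30 * 30 * 3 = 2,700 entries for a 2-D state;
# a 4-D cart-pole state at the same resolution would need 30**4 * 3 =
# 2,430,000 entries, which is the memory/trial blow-up of drawback 2.
Q = np.zeros((N_BINS ** 2, N_ACTIONS))

def q_update(x, a, reward, x_next, alpha=0.1, gamma=0.99):
    """One conventional discrete-time Q-learning backup on the gridded state."""
    s, s_next = discretize(x), discretize(x_next)
    target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Single backup with made-up transition data:
q_update(np.array([0.1, 0.0]), a=2, reward=-0.05, x_next=np.array([0.12, 0.3]))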
Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis and Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh et al., 1995; Asada et al., 1996; Pareigis, 1998), and multiple timescale methods (Sutton, 1995).

In this paper, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, state, and action. The continuous framework has the following possible advantages:

1. A smooth control performance can be achieved.

2. An efficient control policy can be derived using the gradient of the value function (Werbos, 1990).

3. There is no need to guess how to partition the state, action, and time: it is the task of the function approximation and numerical integration algorithms to find the right granularity.

There have been several attempts at extending RL algorithms to continuous cases. Bradtke (1993) showed convergence results for Q-learning algorithms for discrete-time, continuous-state systems with linear dynamics and quadratic costs. Bradtke and Duff (1995) derived a TD algorithm for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" method by extending Q-learning to be used for continuous-time, continuous-state problems.

When we consider optimization problems in continuous-time systems, the Hamilton-Jacobi-Bellman (HJB) equation, which is a continuous-time counterpart of the Bellman equation for discrete-time systems, provides a sound theoretical basis (see, e.g., Bertsekas
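For orientation on the formulation referred to here, the discounted, infinite-horizon setting on which such an HJB-based framework rests can be sketched as follows. The notation below (deterministic dynamics dx/dt = f(x, u), reward r(x, u), discount time constant τ, value function V) is assumed for illustration rather than quoted from the paper.

% Sketch under assumed notation: dynamics \dot{x} = f(x, u), reward r(x, u),
% discount time constant \tau.
\begin{align}
  % Infinite-horizon, discounted value of a policy \pi:
  V^{\pi}(x(t)) &= \int_{t}^{\infty} e^{-(s-t)/\tau}\, r(x(s), u(s))\, ds \\
  % Hamilton-Jacobi-Bellman equation for the optimal value function:
  \frac{1}{\tau}\, V^{*}(x) &= \max_{u}\left[\, r(x, u)
      + \frac{\partial V^{*}}{\partial x}\, f(x, u) \right] \\
  % Consistency condition along a trajectory, whose violation defines a
  % continuous-time TD error:
  \delta(t) &= r(t) - \frac{1}{\tau}\, V(t) + \dot{V}(t)
\end{align}

The last line gives one natural continuous-time form of the TD error mentioned in the abstract: when the value estimate satisfies (1/τ)V(t) = r(t) + dV(t)/dt along observed trajectories, δ(t) vanishes.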