Model-Based Reinforcement Learning in Dynamic Environments

Technical Report UU-CS-2002-029

Marco A. Wiering
marco@cs.uu.nl
Intelligent Systems Group
Institute of Information and Computing Sciences
Utrecht University

Abstract

We study the use of reinforcement learning in particular dynamic environments. Our environments can contain many dynamic objects, which makes optimal planning hard. One way of using information about all dynamic objects is to expand the state description, but this results in a high-dimensional policy space. Our approach is to instantiate information about dynamic objects in the model of the environment and to replan using model-based reinforcement learning whenever this information changes. Furthermore, our approach can be combined with an a-priori model of the changing parts of the environment, which enables the agent to optimally plan a course of action. Results on a navigation task in a Wumpus-like environment with multiple dynamic hostile spider agents show that our system is able to learn good solutions minimizing the risk of hitting spider agents. Further experiments show that the time complexity of the algorithm scales well when more information is instantiated in the model.

Keywords: Reinforcement Learning, Dynamic Environments, Model-based RL, Instantiating Information, Replanning, POMDPs, Wumpus

1 Introduction

Reinforcement learning. Reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996) can be used to learn to control an agent by letting the agent interact with its environment and learn from the obtained feedback (reward signals). Using a trial-and-error process, a reinforcement-learning (RL) agent is able to learn a policy (or plan) which optimizes the cumulative reward intake of the agent over time. Reinforcement learning has been applied successfully in particular stationary environments such as checkers (Samuel, 1959), backgammon (Tesauro, 1992), and chess (Baxter et al., 1997). Reinforcement learning has also been applied to
find good solutions for difficult multi-agent problems such as elevator control (Crites and Barto, 1996), network routing (Littman and Boyan, 1993), and traffic light control (Wiering, 2000). RL has only been used a few times in single-agent non-stationary environments, however. Path-planning problems in non-stationary environments are in fact partially observable Markov decision problems (POMDPs) (Lovejoy, 1991), which are known to be hard to solve exactly. Dayan and Sejnowski (1996) concentrate on the dual control or exploration problem, in which the agent needs to detect changes in a changing environment while acting to gain as much reward as possible. Boyan and Littman (2001) use a temporal model to take changes of the environment into account when computing a policy. In this paper we are interested in applying RL to learn to control agents in dynamic environments.

Dynamic environments. Learning in dynamic environments is hard, since the agent needs to stay informed about the status of all dynamic objects in the environment. This can be done by augmenting the state space with a description of the status of all dynamic objects, but this may quickly cause a state space explosion. Furthermore, the agent may not exactly know the status of an object and therefore has to deal with uncertain information. Using uncertain information as part of the state space is hard, since it makes the state space continuous and high dimensional.

Instantiating information in the model. There exists another method for using knowledge about dynamic objects: instantiate the information about the dynamic objects in the world model and then use the revised world model to compute a new policy. E.g., if a door can be open or closed, and we know whether the door is closed, we can set new transition probabilities between states in the world model such that this information can be used by the agent. Once the model is updated using the currently available information, dynamic programming-like algorithms (Bellman, 1957; Moore and Atkeson, 1993) can be used to compute a new policy. In this way, we have an adaptive agent which takes currently known information into account for computing actions, and which replans once the dynamic information changes. This is hard to do with other planning methods, especially for closed-loop control in stochastic dynamic environments. Furthermore, the agent could also instantiate information received by communication, which can be useful for multi-agent reinforcement learning. Although sharing policies (Tan, 1993) is one way to achieve cooperative multi-agent learning, communication with instantiated information can also be used in non-cooperative or semi-cooperative environments.

Using prior knowledge. Often reinforcement learning is used to learn control knowledge from scratch, i.e., without using a-priori knowledge. We know, however, that the use of some kind of a-priori knowledge can be very beneficial. For example, if particular actions are heavily punished, we do not want to explore those actions, but rather reason about the consequences of these actions using an a-priori designed model. A-priori knowledge can also be used to model a dynamic environment so that this knowledge can be presented to the RL agent. This enables the agent to reason about the dynamics of the environment, which may be necessary to solve a particular problem, where problems may arise one after the other. As an example, think about an agent which is walking in a city and uses RL to learn a map of the city. After some time, the agent may have the desire to drink something in a bar. Once the agent enters some bar, it could use an a-priori model of bars to understand which dynamic entities, such as a barkeeper, other customers, tables, and chairs, etc. play a role in the bar setting. So it can use this model, fill in the actual situation using sensor data (e.g., vision), and compute a policy (or plan) to attain its current goal. If the agent discovers more information about particular (dynamic) entities, it can again instantiate this in the model of the current bar situation.
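The door example above can be sketched in a few lines: the observed door status is written directly into the transition model itself, rather than appended to the state description, and a dynamic-programming sweep (value iteration) recomputes the policy. The corridor-with-detour task, the state names, and the unit step cost below are illustrative assumptions for this sketch, not the paper's Wumpus environment or exact algorithm.

```python
# A sketch of "instantiating information in the model": a small
# deterministic MDP with a short route through a door and a longer
# detour. Observing the door's status edits the transition model,
# and value iteration (dynamic programming) replans.
# All names and costs here are illustrative, not from the paper.

GAMMA = 0.95
STATES = ["start", "door_cell", "detour1", "detour2", "goal"]
S, D, T1, T2, G = range(5)

def build_model(door_open):
    # trans[s][a] = successor state; the door's status is instantiated
    # in the model instead of being added to the state description.
    return {
        S:  {"to_door": D, "to_detour": T1},
        D:  {"through": G if door_open else D},  # a closed door blocks
        T1: {"ahead": T2},
        T2: {"ahead": G},
        G:  {},                                  # absorbing goal
    }

def value_iteration(trans, n_iter=100):
    # Each step costs -1; the goal's value stays 0.
    V = [0.0] * len(STATES)
    for _ in range(n_iter):
        for s in range(len(STATES)):
            if trans[s]:
                V[s] = max(-1.0 + GAMMA * V[s2] for s2 in trans[s].values())
    policy = {s: max(trans[s], key=lambda a: V[trans[s][a]])
              for s in range(len(STATES)) if trans[s]}
    return V, policy

_, plan_open = value_iteration(build_model(door_open=True))
_, plan_closed = value_iteration(build_model(door_open=False))
print(plan_open[S], plan_closed[S])  # replanning switches routes
```

When the door is open, the short route through the door cell is optimal; once "door closed" is instantiated in the model, the same value-iteration sweep makes the detour the greedy choice at the start state, without the state space ever growing.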