A Reinforcement Learning Scheme for a Partially-Observable Multi-Agent Game

SHIN ISHII, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0192 JAPAN; CREST, Japan Science and Technology Agency
HAJIME FUJITA and MASAOKI MITSUTAKE, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0192 JAPAN
TATSUYA YAMAZAKI, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika, Kyoto, 619-0289 JAPAN
JUN MATSUDA, Osaka Gakuin University, 2-36-1 Kishibeminami, Suita, JAPAN
YOICHIRO MATSUNO, Ricoh Co. Ltd., 1-1-17 Koishikawa, Tokyo, 112-0002 JAPAN

October 26, 2004

Abstract. We formulate an automatic strategy acquisition problem for the multi-agent card game "Hearts" as a reinforcement learning problem. The problem can approximately be dealt with in the framework of a partially observable Markov decision process (POMDP) for a single-agent system. Hearts is an example of imperfect information games, which are more difficult to deal with than perfect information games. A POMDP is a decision problem that includes a process for estimating unobservable state variables. By regarding missing information as unobservable state variables, an imperfect information game can be formulated as a POMDP. However, the game of Hearts is a realistic problem that has a huge number of possible states, even when it is approximated as a single-agent system. Therefore, further approximation is necessary to make the strategy acquisition problem tractable. This article presents an approximation method based on estimating unobservable state variables and predicting the actions of the other agents. Simulation results show that our reinforcement learning method is applicable to such a difficult multi-agent problem.

Keywords: reinforcement learning, POMDP, multi-agent system, card game, model-based

1. Introduction

Many card games are imperfect information games; for each player, there are unobservable state variables, e.g., cards in another player's hand or undealt cards. Since card games are well-defined as multi-agent systems, strategy acquisition problems for them have been widely studied. However, existing algorithms have not achieved the level of human experts (Ginsberg, 2001), although some algorithms for perfect information games like "Backgammon" can beat human champions (Tesauro, 1994). In order to deal with imperfect information games, it is important to estimate the missing information (Ginsberg, 2001).

A decision-making or optimal control problem in a stochastic but stationary environment is often formulated as a Markov decision process (MDP). If, on the other hand, the information about the environment is only partially observable, the problem can be formulated as a partially observable Markov decision process (POMDP). By regarding the missing information as an unobservable part of the environment, an imperfect information game can be formulated as a POMDP.

In many card games, coordination and competition among the players occur; such a situation is referred to as a multi-agent system. A decision-making or optimal control problem in a multi-agent system is highly difficult due to the interactions among the agents. Reinforcement learning (RL) (Sutton & Barto, 1998), a machine learning framework based on trial and error, has often been applied to problems within multi-agent systems (Crites, 1996; Crites & Barto, 1996; Littman, 1994; Hu & Wellman, 1998; Nagayuki, Ishii, & Doya, 2000; Salustowicz, Wiering, & Schmidhuber, 1998; Sandholm & Crites, 1995; Sen, Sekaran, & Hale, 1994; Tan, 1993) and has obtained successful results.

This article deals in particular with the card game "Hearts", which is an n-player (n ≥ 2) non-cooperative finite-state zero-sum imperfect-information game, and presents an automatic strategy-acquisition scheme for the game. By approximately assuming that there is a single learning agent, the environment can be regarded as stationary for that agent. The strategy acquisition problem can then be formulated as a POMDP, and the problem is solved by an RL method. Our RL method copes with the partial observability by estimating the card distribution in the other agents' hands and by predicting the actions of the other agents. We then apply our POMDP-RL method to a multi-agent problem, namely, an environment in which several agents learn concurrently.

In a POMDP, the state transition of the observable part of the environment, i.e., the observable state variables, does not necessarily have the Markov property. A POMDP can, however, be transformed into an MDP whose state space consists of belief states. A belief state is typically a probability distribution over the possible states. After each state transition of the observable state variables, the belief state maintains the probability of the unobservable part of the environment; namely, the belief state is estimated from observations of the actual state transition events.
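To make the belief-state update concrete, the following minimal sketch (our illustration, not the implementation used in this article) performs one Bayes-filter step for a discrete POMDP: the current belief is propagated through an assumed transition model p(s'|s,a) and then reweighted by the likelihood of the actual observation. The array names (`trans`, `obs_model`) and their dimensions are assumptions made for the example.

```python
import numpy as np

def belief_update(belief, action, obs, trans, obs_model):
    """One Bayes-filter step for a discrete POMDP.

    belief    : (S,) array, current probability over hidden states s
    trans     : (A, S, S) array, trans[a, s, s2] = p(s2 | s, a)
    obs_model : (S, O) array, obs_model[s2, o] = p(o | s2)
    Returns the posterior belief over the hidden states s2.
    """
    predicted = belief @ trans[action]          # prediction: p(s2 | b, a)
    posterior = predicted * obs_model[:, obs]   # correction: weight by p(o | s2)
    return posterior / posterior.sum()          # normalization (Bayes' rule)

# Toy usage: 3 hidden states, 2 actions, 2 observations, random models.
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(3), size=(2, 3))   # each row sums to 1 over s2
obs_model = rng.dirichlet(np.ones(2), size=3)
belief = np.full(3, 1.0 / 3.0)                   # uniform initial belief
belief = belief_update(belief, action=1, obs=0, trans=trans, obs_model=obs_model)
print(belief.sum())                              # sums to 1: still a distribution
```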
If the correct model of the environmental dynamics is available, the optimal control (i.e., "policy") for a POMDP can be obtained by a dynamic programming (DP) approach (Kaelbling, Littman, & Cassandra, 1998). In usual RL problems, however, the agent has no a priori knowledge of the environmental dynamics; hence, it is important for a POMDP-RL method to be able to estimate the environmental model. In the game of Hearts, the environmental model (the state transition) depends on the cards held by the opponent agents and on their strategies (actions). Therefore, a good estimation of the state transition probability requires approximating the card distribution and predicting the actions of the opponent agents.
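Concretely, such an estimate can be assembled by marginalizing over the two unknowns: p(s' | s, a) ≈ Σ_h p(h | history) Σ_{a_opp} p(a_opp | s, h) · 1[s' = f(s, a, a_opp)], where h ranges over candidate opponent hands and f denotes the deterministic rules of the game. The sketch below illustrates this factorization only; the function names and data layout are our own assumptions, not the representation used in this article.

```python
def estimated_transition(state, action, hand_belief, action_model, game_rule):
    """Approximate p(next_state | state, action) by marginalizing over
    estimated opponent hands and predicted opponent actions.

    hand_belief  : dict {hand: p(hand | observation history)}
    action_model : callable (state, hand) -> dict {opponent_actions: prob}
    game_rule    : deterministic rule s2 = f(state, action, opponent_actions)
    """
    probs = {}
    for hand, p_hand in hand_belief.items():
        for opp_actions, p_act in action_model(state, hand).items():
            s2 = game_rule(state, action, opp_actions)
            probs[s2] = probs.get(s2, 0.0) + p_hand * p_act
    return probs  # dict {next_state: estimated probability}
```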