A Reinforcement Learning Scheme for a Partially-Observable Multi-Agent Game

SHIN ISHII, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0192 JAPAN; CREST, Japan Science and Technology Agency
HAJIME FUJITA and MASAOKI MITSUTAKE, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0192 JAPAN
TATSUYA YAMAZAKI, National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika, Kyoto, 619-0289 JAPAN
JUN MATSUDA, Osaka Gakuin University, 2-36-1 Kishibeminami, Suita, JAPAN
YOICHIRO MATSUNO, Ricoh Co. Ltd., 1-1-17 Koishikawa, Tokyo, 112-0002 JAPAN

October 26, 2004

Abstract. We formulate an automatic strategy acquisition problem for the multi-agent card game "Hearts" as a reinforcement learning problem. The problem can approximately be dealt with in the framework of a partially observable Markov decision process (POMDP) for a single-agent system. Hearts is an example of imperfect information games, which are more difficult to deal with than perfect information games. A POMDP is a decision problem that includes a process for estimating unobservable state variables. By regarding missing information as unobservable state variables, an imperfect information game can be formulated as a POMDP. However, the game of Hearts is a realistic problem that has a huge number of possible states, even when it is approximated as a single-agent system. Therefore, further approximation is necessary to make the strategy acquisition problem tractable. This article presents an approximation method based on estimating unobservable state variables and predicting the actions of the other agents. Simulation results show that our reinforcement learning method is applicable to such a difficult multi-agent problem.

Keywords: reinforcement learning, POMDP, multi-agent system, card game, model-based

1. Introduction

Many card games are imperfect information games; for each player, there are unobservable state variables, e.g., cards in another player's hand or undealt cards. Since card games are well-defined as multi-agent systems, strategy acquisition problems for them have been widely studied. However, existing algorithms have not achieved the level of human experts (Ginsberg, 2001), although some algorithms for perfect information games like "Backgammon" can beat human champions (Tesauro, 1994). In order to deal with imperfect information games, it is important to estimate the missing information (Ginsberg, 2001).

A decision-making or optimal control problem in a stochastic but stationary environment is often formulated as a Markov decision process (MDP). If, on the other hand, the information about the environment is only partially observable, the problem can be formulated as a partially observable Markov decision process (POMDP). By regarding the missing information as an unobservable part of the environment, an imperfect information game can be formulated as a POMDP.

In many card games, coordination and competition among the players occur; such a situation is referred to as a multi-agent system. A decision-making or optimal control problem in a multi-agent system is highly difficult due to the interactions among the agents. Reinforcement learning (RL) (Sutton & Barto, 1998), a machine learning framework based on trial and error, has often been applied to problems within multi-agent systems (Crites, 1996; Crites & Barto, 1996; Littman, 1994; Hu & Wellman, 1998; Nagayuki, Ishii, & Doya, 2000; Salustowicz, Wiering, & Schmidhuber, 1998; Sandholm & Crites, 1995; Sen, Sekaran, & Hale, 1994; Tan, 1993) and has obtained successful results.

This article deals in particular with the card game "Hearts", which is an n-player (n ≥ 2) non-cooperative finite-state zero-sum imperfect-information game, and presents an automatic strategy-acquisition scheme for the game. By approximately assuming that there is a single learning agent, the environment can be regarded as stationary for that agent. The strategy acquisition problem can then be formulated as a POMDP, and the problem is solved by an RL method. Our RL method copes with the partial observability by estimating the card distribution in the other agents' hands and by predicting the actions of the other agents. We then apply our POMDP-RL method to a multi-agent problem, namely, an environment in which several agents learn concurrently.

In a POMDP, the state transition of the observable part of the environment, i.e., the observable state variables, does not necessarily have the Markov property. A POMDP can, however, be transformed into an MDP whose state space consists of belief states. A belief state is typically a probability distribution over the possible states. After each state transition of the observable state variables, the belief state maintains the probability of the unobservable part of the environment; namely, the belief state is estimated from observations of the actual state transition events.
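To make the belief-state update concrete, the following minimal sketch (our illustration, not the implementation used in this article) performs one Bayes-filter step for a discrete POMDP: the current belief is propagated through an assumed transition model p(s'|s,a) and then reweighted by the likelihood of the actual observation. The array names (`trans`, `obs_model`) and their dimensions are assumptions made for the example.

```python
import numpy as np

def belief_update(belief, action, obs, trans, obs_model):
    """One Bayes-filter step for a discrete POMDP.

    belief    : (S,) array, current probability over hidden states s
    trans     : (A, S, S) array, trans[a, s, s2] = p(s2 | s, a)
    obs_model : (S, O) array, obs_model[s2, o] = p(o | s2)
    Returns the posterior belief over the hidden states s2.
    """
    predicted = belief @ trans[action]          # prediction: p(s2 | b, a)
    posterior = predicted * obs_model[:, obs]   # correction: weight by p(o | s2)
    return posterior / posterior.sum()          # normalization (Bayes' rule)

# Toy usage: 3 hidden states, 2 actions, 2 observations, random models.
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(3), size=(2, 3))   # each row sums to 1 over s2
obs_model = rng.dirichlet(np.ones(2), size=3)
belief = np.full(3, 1.0 / 3.0)                   # uniform initial belief
belief = belief_update(belief, action=1, obs=0, trans=trans, obs_model=obs_model)
print(belief.sum())                              # sums to 1: still a distribution
```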
If the correct model of the environmental dynamics is available, the optimal control (i.e., "policy") for a POMDP can be obtained by a dynamic programming (DP) approach (Kaelbling, Littman, & Cassandra, 1998). In usual RL problems, however, the agent has no a priori knowledge of the environmental dynamics; hence, it is important for a POMDP-RL method to be able to estimate the environmental model. In the game of Hearts, the environmental model (the state transition) depends on the cards held by the opponent agents and on their strategies (actions). Therefore, a good estimation of the state transition probability requires approximating the card distribution and predicting the actions of the opponent agents.
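Concretely, such an estimate can be assembled by marginalizing over the two unknowns: p(s' | s, a) ≈ Σ_h p(h | history) Σ_{a_opp} p(a_opp | s, h) · 1[s' = f(s, a, a_opp)], where h ranges over candidate opponent hands and f denotes the deterministic rules of the game. The sketch below illustrates this factorization only; the function names and data layout are our own assumptions, not the representation used in this article.

```python
def estimated_transition(state, action, hand_belief, action_model, game_rule):
    """Approximate p(next_state | state, action) by marginalizing over
    estimated opponent hands and predicted opponent actions.

    hand_belief  : dict {hand: p(hand | observation history)}
    action_model : callable (state, hand) -> dict {opponent_actions: prob}
    game_rule    : deterministic rule s2 = f(state, action, opponent_actions)
    """
    probs = {}
    for hand, p_hand in hand_belief.items():
        for opp_actions, p_act in action_model(state, hand).items():
            s2 = game_rule(state, action, opp_actions)
            probs[s2] = probs.get(s2, 0.0) + p_hand * p_act
    return probs  # dict {next_state: estimated probability}
```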