Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Johannes Heinrich    J.HEINRICH@CS.UCL.AC.UK
David Silver    D.SILVER@CS.UCL.AC.UK
University College London, UK

Abstract

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without any prior knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Hold'em, a poker game of real-world scale, NFSP learnt a competitive strategy that approached the performance of human experts and state-of-the-art methods.

1. Introduction

Games have a tradition of encouraging advances in artificial intelligence and machine learning (Samuel, 1959; Tesauro, 1995; Campbell et al., 2002; Riedmiller et al., 2009; Gelly et al., 2012; Bowling et al., 2015). Game theory defines a game as a domain of conflict or cooperation between several entities (Myerson, 1991). One motivation of studying the simpler recreational games is to develop algorithms that will scale to more complex, real-world games such as airport and network security, financial and energy trading, traffic control and routing (Lambert III et al., 2005; Nevmyvaka et al., 2006; Bazzan, 2009; Tambe, 2011; Urieli & Stone, 2014; Durkota et al., 2015). Most of these real-world games involve decision making with imperfect information and high-dimensional information state spaces. Unfortunately, many machine learning methods that have been applied to classical games lack convergence guarantees for learning in imperfect-information games. On the other hand, many game-theoretic approaches lack the ability to extract relevant patterns and generalise from data. This results in limited scalability to large games, unless the domain is abstracted to a manageable size using human expert knowledge, heuristics or modelling. However, acquiring human expertise often requires expensive resources and time. In addition, humans can be easily fooled into irrational decisions or assumptions (Selten, 1990; Ariely & Jones, 2008). This motivates algorithms that learn useful strategies end-to-end.

In this paper we introduce NFSP, a deep reinforcement learning method for learning approximate Nash equilibria of imperfect-information games. NFSP agents learn by playing against themselves without explicit prior knowledge. Technically, NFSP extends and instantiates Fictitious Self-Play (FSP) (Heinrich et al., 2015) with neural network function approximation. An NFSP agent consists of two neural networks and two kinds of memory. Memorized experience of play against fellow agents is used by reinforcement learning to train a network that predicts the expected values of actions. Experience of the agent's own behaviour is stored in a separate memory, which is used by supervised learning to train a network that predicts the agent's own average behaviour. An NFSP agent acts cautiously by sampling its actions from a mixture of its average, routine strategy and its greedy strategy that maximizes its predicted expected value. NFSP approximates fictitious play, which is a popular game-theoretic model of learning in games that converges to Nash equilibria in some classes of games, e.g. two-player zero-sum and many-player potential games.
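To make the two-network, two-memory structure above concrete, here is a minimal sketch of an NFSP-style agent. It is an illustration under simplifying assumptions, not the paper's implementation: the deep networks are replaced by linear approximators, the reinforcement learning update is plain one-step Q-learning, the supervised learning update is a log-likelihood step towards the agent's own past actions, and the mixing probability eta between the greedy and the average strategy is a name introduced here for illustration.

    import numpy as np
    from collections import deque

    class NFSPAgentSketch:
        """Illustrative sketch of an NFSP-style agent: one value network trained by
        reinforcement learning, one policy network trained by supervised learning on
        the agent's own behaviour, and a memory feeding each of them."""

        def __init__(self, n_features, n_actions, eta=0.1, lr=0.01):
            self.n_actions = n_actions
            self.eta = eta        # assumed probability of acting greedily w.r.t. predicted values
            self.lr = lr
            # Linear stand-ins for the two neural networks described in the text.
            self.q_weights = np.zeros((n_actions, n_features))   # predicts expected action values
            self.pi_weights = np.zeros((n_actions, n_features))  # predicts the agent's average behaviour
            self.rl_memory = deque(maxlen=10000)  # experience of play against fellow agents
            self.sl_memory = deque(maxlen=10000)  # the agent's own (state, action) behaviour

        def average_policy(self, s):
            logits = self.pi_weights @ s
            exp = np.exp(logits - logits.max())
            return exp / exp.sum()

        def act(self, s):
            """Sample an action from a mixture of the greedy and the average strategy."""
            if np.random.rand() < self.eta:
                a = int(np.argmax(self.q_weights @ s))   # greedy w.r.t. predicted expected values
            else:
                a = int(np.random.choice(self.n_actions, p=self.average_policy(s)))
            self.sl_memory.append((s, a))                # record own behaviour for supervised learning
            return a

        def observe(self, s, a, r, s_next, done):
            self.rl_memory.append((s, a, r, s_next, done))

        def train_step(self, gamma=1.0):
            # Reinforcement learning update (one-step Q-learning) on a sampled transition.
            if self.rl_memory:
                s, a, r, s_next, done = self.rl_memory[np.random.randint(len(self.rl_memory))]
                target = r + (0.0 if done else gamma * np.max(self.q_weights @ s_next))
                td_error = target - self.q_weights[a] @ s
                self.q_weights[a] += self.lr * td_error * s
            # Supervised learning update towards the agent's own past actions.
            if self.sl_memory:
                s, a = self.sl_memory[np.random.randint(len(self.sl_memory))]
                grad = -self.average_policy(s)
                grad[a] += 1.0                           # gradient of the log-likelihood of the taken action
                self.pi_weights += self.lr * np.outer(grad, s)

In the paper's method the two approximators are deep neural networks learning from the agents' experience of play; the linear stand-ins above only serve to make the cycle of acting, storing experience, and training the two networks explicit.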
We empirically evaluate our method in two-player zero-sum computer poker games. In this domain, current game-theoretic approaches use heuristics of card strength to abstract the game to a tractable size (Zinkevich et al., 2007; Gilpin et al., 2007; Johanson et al., 2013). While Limit Texas Hold'em (LHE), a poker game of real-world scale, has come within reach of being solved with current computational resources (Bowling et al., 2015), most other poker and real-world games remain far out of scope without abstraction. Our approach does not rely on engineering such abstractions or any other prior knowledge. NFSP agents leverage deep reinforcement learning to learn directly from their experience of interacting in the game. When applied to Leduc poker, NFSP approached a Nash equilibrium, whereas common reinforcement learning methods diverged. We also applied NFSP to LHE, learning directly from the raw inputs. NFSP learnt a competitive strategy, approaching the performance of state-of-the-art methods based on handcrafted abstractions.

2. Background

In this section we provide a brief overview of reinforcement learning, extensive-form games and fictitious self-play. For a more detailed exposition we refer the reader to (Sutton & Barto, 1998), (Myerson, 1991), (Fudenberg, 1998) and (Heinrich et al., 2015).

2.1. Reinforcement Learning

Reinforcement learning (Sutton & Barto, 1998) agents typically learn to maximize their expected future rewards from interaction with an environment. The environment is usually modelled as a Markov decision process (MDP). An agent behaves according to a policy that specifies a distribution over available actions at each state of the MDP. The agent's goal is to improve its policy in order to maximize its gain, $G_t = \sum_{i=t}^{T} R_{i+1}$, which is a random variable of the agent's cumulative future rewards starting from time $t$. Many reinforcement learning algorithms learn from sequential experience in the form of transition tuples, $(s_t, a_t, r_{t+1}, s_{t+1})$, where $s_t$ is the state at time $t$, $a_t$ is the action chosen in that state, $r_{t+1}$ the reward received thereafter, and $s_{t+1}$ the next state.
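As a toy illustration of these definitions (ours, not the paper's), the snippet below rolls out one episode of a small, made-up MDP, collects the resulting transition tuples, and computes the undiscounted gain $G_t$ from each time step.

    import numpy as np

    def run_episode(policy, transition, reward, s0, horizon=10, seed=0):
        """Roll out one episode and return the transition tuples (s_t, a_t, r_{t+1}, s_{t+1})."""
        rng = np.random.default_rng(seed)
        tuples, s = [], s0
        for t in range(horizon):
            a = rng.choice(len(policy[s]), p=policy[s])      # sample an action from the policy's distribution at s
            s_next = transition(s, a, rng)
            r = reward(s, a, s_next)
            tuples.append((s, a, r, s_next))
            s = s_next
        return tuples

    def gains(tuples):
        """G_t = sum_{i=t}^{T} R_{i+1}: undiscounted cumulative future reward from each time step."""
        rewards = [r for (_, _, r, _) in tuples]
        return [sum(rewards[t:]) for t in range(len(rewards))]

    # Toy two-state MDP, purely for illustration.
    policy = {0: [0.5, 0.5], 1: [0.9, 0.1]}                  # action distribution per state
    transition = lambda s, a, rng: int(rng.random() < 0.5)   # next state ignores (s, a) in this toy example
    reward = lambda s, a, s_next: float(s_next == 1)         # reward 1 whenever the agent lands in state 1

    episode = run_episode(policy, transition, reward, s0=0)
    print(gains(episode)[0])                                 # the episode's gain G_0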