Deep Reinforcement Learning from Self-Play in Imperfect-Information Games

Johannes Heinrich    J.HEINRICH@CS.UCL.AC.UK
David Silver    D.SILVER@CS.UCL.AC.UK
University College London, UK

Abstract

Many real-world applications can be described as large-scale games of imperfect information. To deal with these challenging domains, prior work has focused on computing Nash equilibria in a handcrafted abstraction of the domain. In this paper we introduce the first scalable end-to-end approach to learning approximate Nash equilibria without any prior knowledge. Our method combines fictitious self-play with deep reinforcement learning. When applied to Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium, whereas common reinforcement learning methods diverged. In Limit Texas Hold'em, a poker game of real-world scale, NFSP learnt a competitive strategy that approached the performance of human experts and state-of-the-art methods.

1. Introduction

Games have a tradition of encouraging advances in artificial intelligence and machine learning (Samuel, 1959; Tesauro, 1995; Campbell et al., 2002; Riedmiller et al., 2009; Gelly et al., 2012; Bowling et al., 2015). Game theory defines a game as a domain of conflict or cooperation between several entities (Myerson, 1991). One motivation of studying the simpler recreational games is to develop algorithms that will scale to more complex, real-world games such as airport and network security, financial and energy trading, traffic control and routing (Lambert III et al., 2005; Nevmyvaka et al., 2006; Bazzan, 2009; Tambe, 2011; Urieli & Stone, 2014; Durkota et al., 2015). Most of these real-world games involve decision making with imperfect information and high-dimensional information state spaces. Unfortunately, many machine learning methods that have been applied to classical games lack convergence guarantees for learning in imperfect-information games. On the other hand, many game-theoretic approaches lack the ability to extract relevant patterns and generalise from data. This results in limited scalability to large games, unless the domain is abstracted to a manageable size using human expert knowledge, heuristics or modelling. However, acquiring human expertise often requires expensive resources and time. In addition, humans can be easily fooled into irrational decisions or assumptions (Selten, 1990; Ariely & Jones, 2008). This motivates algorithms that learn useful strategies end-to-end.

In this paper we introduce NFSP, a deep reinforcement learning method for learning approximate Nash equilibria of imperfect-information games. NFSP agents learn by playing against themselves without explicit prior knowledge. Technically, NFSP extends and instantiates Fictitious Self-Play (FSP) (Heinrich et al., 2015) with neural network function approximation. An NFSP agent consists of two neural networks and two kinds of memory. Memorized experience of play against fellow agents is used by reinforcement learning to train a network that predicts the expected values of actions. Experience of the agent's own behaviour is stored in a separate memory, which is used by supervised learning to train a network that predicts the agent's own average behaviour. An NFSP agent acts cautiously by sampling its actions from a mixture of its average, routine strategy and its greedy strategy that maximizes its predicted expected value. NFSP approximates fictitious play, which is a popular game-theoretic model of learning in games that converges to Nash equilibria in some classes of games, e.g. two-player zero-sum and many-player potential games.
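To make the two-network, two-memory structure above concrete, here is a minimal sketch of an NFSP-style agent. It is an illustration under simplifying assumptions, not the paper's implementation: the deep networks are replaced by linear approximators, the reinforcement learning update is plain one-step Q-learning, the supervised learning update is a log-likelihood step towards the agent's own past actions, and the mixing probability eta between the greedy and the average strategy is a name introduced here for illustration.

    import numpy as np
    from collections import deque

    class NFSPAgentSketch:
        """Illustrative sketch of an NFSP-style agent: one value network trained by
        reinforcement learning, one policy network trained by supervised learning on
        the agent's own behaviour, and a memory feeding each of them."""

        def __init__(self, n_features, n_actions, eta=0.1, lr=0.01):
            self.n_actions = n_actions
            self.eta = eta        # assumed probability of acting greedily w.r.t. predicted values
            self.lr = lr
            # Linear stand-ins for the two neural networks described in the text.
            self.q_weights = np.zeros((n_actions, n_features))   # predicts expected action values
            self.pi_weights = np.zeros((n_actions, n_features))  # predicts the agent's average behaviour
            self.rl_memory = deque(maxlen=10000)  # experience of play against fellow agents
            self.sl_memory = deque(maxlen=10000)  # the agent's own (state, action) behaviour

        def average_policy(self, s):
            logits = self.pi_weights @ s
            exp = np.exp(logits - logits.max())
            return exp / exp.sum()

        def act(self, s):
            """Sample an action from a mixture of the greedy and the average strategy."""
            if np.random.rand() < self.eta:
                a = int(np.argmax(self.q_weights @ s))   # greedy w.r.t. predicted expected values
            else:
                a = int(np.random.choice(self.n_actions, p=self.average_policy(s)))
            self.sl_memory.append((s, a))                # record own behaviour for supervised learning
            return a

        def observe(self, s, a, r, s_next, done):
            self.rl_memory.append((s, a, r, s_next, done))

        def train_step(self, gamma=1.0):
            # Reinforcement learning update (one-step Q-learning) on a sampled transition.
            if self.rl_memory:
                s, a, r, s_next, done = self.rl_memory[np.random.randint(len(self.rl_memory))]
                target = r + (0.0 if done else gamma * np.max(self.q_weights @ s_next))
                td_error = target - self.q_weights[a] @ s
                self.q_weights[a] += self.lr * td_error * s
            # Supervised learning update towards the agent's own past actions.
            if self.sl_memory:
                s, a = self.sl_memory[np.random.randint(len(self.sl_memory))]
                grad = -self.average_policy(s)
                grad[a] += 1.0                           # gradient of the log-likelihood of the taken action
                self.pi_weights += self.lr * np.outer(grad, s)

In the paper's method the two approximators are deep neural networks learning from the agents' experience of play; the linear stand-ins above only serve to make the cycle of acting, storing experience, and training the two networks explicit.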
We empirically evaluate our method in two-player zero-sum computer poker games. In this domain, current game-theoretic approaches use heuristics of card strength to abstract the game to a tractable size (Zinkevich et al., 2007; Gilpin et al., 2007; Johanson et al., 2013). While Limit Texas Hold'em (LHE), a poker game of real-world scale, has come within reach of being solved with current computational resources (Bowling et al., 2015), most other poker and real-world games remain far out of scope without abstraction. Our approach does not rely on engineering such abstractions or any other prior knowledge. NFSP agents leverage deep reinforcement learning to learn directly from their experience of interacting in the game. When applied to Leduc poker, NFSP approached a Nash equilibrium, whereas common reinforcement learning methods diverged. We also applied NFSP to LHE, learning directly from the raw inputs. NFSP learnt a competitive strategy, approaching the performance of state-of-the-art methods based on handcrafted abstractions.

2. Background

In this section we provide a brief overview of reinforcement learning, extensive-form games and fictitious self-play. For a more detailed exposition we refer the reader to (Sutton & Barto, 1998), (Myerson, 1991), (Fudenberg, 1998) and (Heinrich et al., 2015).

2.1. Reinforcement Learning

Reinforcement learning (Sutton & Barto, 1998) agents typically learn to maximize their expected future rewards from interaction with an environment. The environment is usually modelled as a Markov decision process (MDP). An agent behaves according to a policy that specifies a distribution over available actions at each state of the MDP. The agent's goal is to improve its policy in order to maximize its gain, $G_t = \sum_{i=t}^{T} R_{i+1}$, which is a random variable of the agent's cumulative future rewards starting from time $t$. Many reinforcement learning algorithms learn from sequential experience in the form of transition tuples, $(s_t, a_t, r_{t+1}, s_{t+1})$, where $s_t$ is the state at time $t$, $a_t$ is the action chosen in that state, $r_{t+1}$ the reward received thereafter, and $s_{t+1}$ the next state.
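As a toy illustration of these definitions (ours, not the paper's), the snippet below rolls out one episode of a small, made-up MDP, collects the resulting transition tuples, and computes the undiscounted gain $G_t$ from each time step.

    import numpy as np

    def run_episode(policy, transition, reward, s0, horizon=10, seed=0):
        """Roll out one episode and return the transition tuples (s_t, a_t, r_{t+1}, s_{t+1})."""
        rng = np.random.default_rng(seed)
        tuples, s = [], s0
        for t in range(horizon):
            a = rng.choice(len(policy[s]), p=policy[s])      # sample an action from the policy's distribution at s
            s_next = transition(s, a, rng)
            r = reward(s, a, s_next)
            tuples.append((s, a, r, s_next))
            s = s_next
        return tuples

    def gains(tuples):
        """G_t = sum_{i=t}^{T} R_{i+1}: undiscounted cumulative future reward from each time step."""
        rewards = [r for (_, _, r, _) in tuples]
        return [sum(rewards[t:]) for t in range(len(rewards))]

    # Toy two-state MDP, purely for illustration.
    policy = {0: [0.5, 0.5], 1: [0.9, 0.1]}                  # action distribution per state
    transition = lambda s, a, rng: int(rng.random() < 0.5)   # next state ignores (s, a) in this toy example
    reward = lambda s, a, s_next: float(s_next == 1)         # reward 1 whenever the agent lands in state 1

    episode = run_episode(policy, transition, reward, s0=0)
    print(gains(episode)[0])                                 # the episode's gain G_0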