Mastering the game of Go without human knowledge
Presentation: 邢翔瑞, 2017.11.13
DeepMind, 5 New Street Square, London EC4A 3TW, UK. Nature, 2017 Oct 19.

Contents
01 Motivation
02 Methods
03 Experiments
04 Conclusion

01 Motivation
01 A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains.
02 Expert datasets are often expensive, unreliable or simply unavailable.
03 Supervised learning imposes a ceiling on the performance of systems trained in this manner.

02 Methods

Differences from AlphaGo
01 First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
02 Second, it uses only the black and white stones from the board as input features.
03 Third, it uses a single neural network, rather than separate policy and value networks.

Network
01 A deep neural network f_θ with parameters θ.
02 Input: the raw board representation s of the position and its history.
03 Output: both move probabilities and a value, (p, v) = f_θ(s).
04 The vector of move probabilities p represents the probability of selecting each move a (including pass).
05 The scalar evaluation v estimates the probability of the current player winning from position s.
The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities.

Self-play training pipeline
Powerful policy improvement operator: the MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network.
Powerful policy evaluation operator: the game winner z is used as a sample of the value.
The main idea of the reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the network parameters are updated so that (p, v) more closely match the improved search probabilities and self-play winner (π, z).
Self-play reinforcement learning; MCTS in AlphaGo Zero.

03 Experiments

Empirical evaluation
AlphaGo Zero: a single machine with 4 TPUs, trained for 36 hours. AlphaGo Lee: 48 TPUs, trained over several months.
Final performance of AlphaGo Zero and AlphaGo Lee.
4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s of thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks.
Performance of AlphaGo Zero.

04 Conclusion
01 A pure reinforcement learning approach is fully feasible, even in the most challenging of domains.
02 It is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond the basic rules.
03 Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
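Appendix: the "policy improvement" step described in the Methods section, turning MCTS visit counts N(s, a) into search probabilities π with π_a ∝ N(s, a)^(1/τ) for a temperature τ, can be sketched in a few lines of plain Python. This is a minimal illustration, not the DeepMind implementation; the function name and the toy visit counts are ours.

```python
def search_probabilities(visit_counts, temperature=1.0):
    """Convert MCTS visit counts N(s, a) into search probabilities pi.

    Per the paper, pi_a is proportional to N(s, a)^(1/temperature);
    as the temperature approaches 0, play becomes greedy in visit count.
    """
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [x / total for x in powered]

# Toy example: visit counts over three legal moves after a search
pi = search_probabilities([800, 150, 50])  # -> [0.8, 0.15, 0.05]
```

Lowering the temperature sharpens π toward the most-visited move, which is how later moves in a self-play game are selected deterministically.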
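The update that makes (p, v) more closely match (π, z) minimizes, per the paper, a loss summing mean-squared value error, policy cross-entropy, and L2 weight regularization: l = (z − v)² − π·log p + c‖θ‖². A minimal sketch in plain Python (the function name, default c, and the flat parameter list are illustrative assumptions):

```python
import math

def alphazero_loss(z, v, pi, p, theta=(), c=1e-4):
    """Combined training loss from the paper:
    l = (z - v)^2  -  pi . log p  +  c * ||theta||^2
    z: game outcome, v: predicted value,
    pi: search probabilities, p: network move probabilities,
    theta: flat iterable of weights (illustrative stand-in for the network).
    """
    value_loss = (z - v) ** 2
    policy_loss = -sum(pa_target * math.log(pa)
                       for pa_target, pa in zip(pi, p) if pa_target > 0)
    regularization = c * sum(w * w for w in theta)
    return value_loss + policy_loss + regularization
```

With a perfect value prediction (v = z) and p = π, the loss reduces to the entropy of π plus the regularization term, its minimum for that target.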