Machine Learning, 8, 257-277 (1992)
© 1992 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Practical Issues in Temporal Difference Learning

GERALD TESAURO
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA

Abstract. This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.

Keywords. Temporal difference learning, neural networks, connectionist methods, backgammon, games, feature discovery

1. Introduction

One of the most fascinating and challenging paradigms of traditional machine learning research is the delayed reinforcement learning paradigm. In the simplest form of this paradigm, the learning system passively observes a temporal sequence of input states that eventually leads to a final reinforcement or reward signal (usually a scalar). The learning system's task in this case is to predict expected reward given an observation of an input state or sequence of input states. The system may also be set up so that it can generate control signals that influence the sequence of states. In this case the learning task is usually to generate the optimal control signals that will lead to maximum reinforcement.

Delayed reinforcement learning is difficult for two reasons. First, there is no explicit teacher signal that indicates the correct output at each time step. Second, the temporal delay of the reward signal implies that the learning system must solve a temporal credit assignment problem, i.e., must apportion credit and blame to each of the states and actions that resulted in the final outcome of the sequence.

Despite these difficulties, delayed reinforcement learning has attracted considerable interest for many years in the machine learning community. The notion of a learning system interacting with an environment and learning to perform a task solely from the outcome of its experience in the environment is very intellectually appealing. It could also have numerous practical applications in areas such as manufacturing process control, navigation and path planning, and trading in financial markets.

One possible approach to temporal credit assignment is to base the apportionment of credit on the difference between temporally successive predictions. Algorithms using this approach have been termed temporal difference methods in Sutton (1988), and have been studied for many years in a variety of contexts. Examples include Samuel's checkers program (Samuel, 1959) and Holland's bucket brigade algorithm (Holland, 1986). An incremental real-time algorithm called TD(λ) has been proposed in Sutton (1988) for adjusting the weights in a connectionist network. It has the following form:

    Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k        (1)

where P_t is the network's output upon observation of input pattern x_t at time t, w is the vector of weights that parameterizes the network, and ∇_w P_k is the gradient of network output with respect to weights. Equation 1 basically couples a temporal difference method for temporal credit assignment with a gradient-descent method for structural credit assignment. Many supervised learning procedures use gradient-descent methods to optimize network structures; for example, the back-propagation learning procedure (Rumelhart et al., 1986) uses gradient descent to optimize the weights in a feed-forward multilayer perceptron. Equation 1 provides a way to adapt such supervised learning procedures to solve temporal credit assignment problems. (An interesting open question is whether more complex supervised learning procedures, such as those that dynamically add nodes or connections during training, could be adapted to do temporal credit assignment.)
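To make the update concrete, the sketch below (mine, not from the paper) shows one way Equation 1 could be realized incrementally in NumPy, using the familiar eligibility-trace recurrence e_t = λ e_{t−1} + ∇_w P_t so that the sum over past gradients never has to be stored explicitly. It assumes a simple linear predictor P_t = w · x_t (so ∇_w P_t = x_t) and applies the updates online as the episode unfolds; the function name and parameter values are illustrative only.

```python
import numpy as np

def td_lambda_episode(states, z, w, alpha=0.1, lam=0.7):
    """One online pass of the TD(lambda) update of Equation 1 over a single
    episode, for a linear predictor P_t = w . x_t (so grad_w P_t = x_t).

    states : list of input vectors x_1 ... x_T observed during the episode
    z      : final reward signal, playing the role of P_{T+1}
    w      : float weight vector, updated in place and returned
    """
    e = np.zeros_like(w)                  # eligibility trace: sum_k lambda^(t-k) grad_w P_k
    for t, x_t in enumerate(states):
        P_t = w @ x_t                     # current prediction
        e = lam * e + x_t                 # accumulate exponentially weighted gradients
        if t + 1 < len(states):
            P_next = w @ states[t + 1]    # next prediction P_{t+1}
        else:
            P_next = z                    # final reward terminates the sequence
        w += alpha * (P_next - P_t) * e   # Delta w_t = alpha (P_{t+1} - P_t) e_t
    return w
```

With lam = 1 the trace retains every past gradient, so each prediction is effectively pushed toward the final outcome; with lam = 0 only the most recent gradient survives, pairing each prediction with its immediate successor.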
It can be shown that the case λ = 1 corresponds to an explicit supervised pairing of each input pattern x_t with the final reward signal z. Similarly, the case λ = 0 corresponds to an explicit pairing of x_t with the next prediction P_{t+1}. The parameter λ provides a smooth heuristic interpolation between these two limits. Sutton provides a number of intuitive arguments why TD(λ) should be a more efficient learning procedure than explicit supervised pairing of input states with final reward. A rigorous proof is also given that TD(0) converges to the optimal predictions for a linear network and a linearly independent set of input patterns. This proof has recently been extended to arbitrary values of λ in Dayan (1992). However, no theoretical or empirical results are available for more complex tasks requiring multilayer networks, although a related algorithm called the Adaptive Heuristic Critic (Sutton, 1984) has been successfully applied to a relatively small-scale cart-pole balancing problem (Barto, Sutton & Anderson, 1983; Anderson, 1987).

The present paper seeks to determine whether temporal difference learning procedures such as TD(λ) …
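As a supplement to the λ = 1 claim above, the following check (again mine, not the paper's, and again assuming a linear predictor with illustrative names) verifies numerically that, when the weights are held fixed over an episode, the TD(1) increments of Equation 1 telescope into exactly the supervised updates that pair every input pattern with the final reward z.

```python
import numpy as np

def td1_equals_supervised(states, z, w, alpha=0.1):
    """Numerically confirm that summing the lambda = 1 increments of
    Equation 1 over an episode (with w held fixed) gives the same total
    weight change as supervised pairing of each x_k with the final reward z."""
    preds = [w @ x for x in states] + [z]      # P_1 ... P_T, with z acting as P_{T+1}
    e = np.zeros_like(w)
    td_total = np.zeros_like(w)
    for t, x_t in enumerate(states):
        e = e + x_t                             # lambda = 1: plain (undiscounted) gradient sum
        td_total += alpha * (preds[t + 1] - preds[t]) * e
    sup_total = sum(alpha * (z - preds[k]) * states[k] for k in range(len(states)))
    return np.allclose(td_total, sup_total)

# Example: three random 5-dimensional states and a scalar outcome.
rng = np.random.default_rng(0)
states = [rng.normal(size=5) for _ in range(3)]
print(td1_equals_supervised(states, z=1.0, w=rng.normal(size=5)))   # True
```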