Machine Learning, 8, 257-277 (1992)
© 1992 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Practical Issues in Temporal Difference Learning

GERALD TESAURO
IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 USA

Abstract. This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex non-trivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory, and it suggests that further analysis of TD methods, as well as applications in other complex domains, may be worth investigating.

Keywords. Temporal difference learning, neural networks, connectionist methods, backgammon, games, feature discovery

1. Introduction

One of the most fascinating and challenging paradigms of traditional machine learning research is the delayed reinforcement learning paradigm. In the simplest form of this paradigm, the learning system passively observes a temporal sequence of input states that eventually leads to a final reinforcement or reward signal (usually a scalar). The learning system's task in this case is to predict expected reward given an observation of an input state or sequence of input states. The system may also be set up so that it can generate control signals that influence the sequence of states. In this case the learning task is usually to generate the optimal control signals that will lead to maximum reinforcement.

Delayed reinforcement learning is difficult for two reasons. First, there is no explicit teacher signal that indicates the correct output at each time step. Second, the temporal delay of the reward signal implies that the learning system must solve a temporal credit assignment problem, i.e., must apportion credit and blame to each of the states and actions that resulted in the final outcome of the sequence.

Despite these difficulties, delayed reinforcement learning has attracted considerable interest for many years in the machine learning community. The notion of a learning system interacting with an environment and learning to perform a task solely from the outcome of its experience in the environment is very intellectually appealing. It could also have numerous practical applications in areas such as manufacturing process control, navigation and path planning, and trading in financial markets.

One possible approach to temporal credit assignment is to base the apportionment of credit on the difference between temporally successive predictions. Algorithms using this approach have been termed temporal difference methods in Sutton (1988), and have been studied for many years in a variety of contexts. Examples include Samuel's checkers program (Samuel, 1959) and Holland's bucket brigade algorithm (Holland, 1986). An incremental real-time algorithm called TD(λ) has been proposed in Sutton (1988) for adjusting the weights in a connectionist network. It has the following form:

    Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k        (1)

where P_t is the network's output upon observation of input pattern x_t at time t, w is the vector of weights that parameterizes the network, and ∇_w P_k is the gradient of network output with respect to weights. Equation 1 basically couples a temporal difference method for temporal credit assignment with a gradient-descent method for structural credit assignment. Many supervised learning procedures use gradient-descent methods to optimize network structures; for example, the back-propagation learning procedure (Rumelhart et al., 1986) uses gradient descent to optimize the weights in a feed-forward multilayer perceptron. Equation 1 provides a way to adapt such supervised learning procedures to solve temporal credit assignment problems. (An interesting open question is whether more complex supervised learning procedures, such as those that dynamically add nodes or connections during training, could be adapted to do temporal credit assignment.)
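To make the update concrete, the sketch below (mine, not from the paper) shows one way Equation 1 could be realized incrementally in NumPy, using the familiar eligibility-trace recurrence e_t = λ e_{t−1} + ∇_w P_t so that the sum over past gradients never has to be stored explicitly. It assumes a simple linear predictor P_t = w · x_t (so ∇_w P_t = x_t) and applies the updates online as the episode unfolds; the function name and parameter values are illustrative only.

```python
import numpy as np

def td_lambda_episode(states, z, w, alpha=0.1, lam=0.7):
    """One online pass of the TD(lambda) update of Equation 1 over a single
    episode, for a linear predictor P_t = w . x_t (so grad_w P_t = x_t).

    states : list of input vectors x_1 ... x_T observed during the episode
    z      : final reward signal, playing the role of P_{T+1}
    w      : float weight vector, updated in place and returned
    """
    e = np.zeros_like(w)                  # eligibility trace: sum_k lambda^(t-k) grad_w P_k
    for t, x_t in enumerate(states):
        P_t = w @ x_t                     # current prediction
        e = lam * e + x_t                 # accumulate exponentially weighted gradients
        if t + 1 < len(states):
            P_next = w @ states[t + 1]    # next prediction P_{t+1}
        else:
            P_next = z                    # final reward terminates the sequence
        w += alpha * (P_next - P_t) * e   # Delta w_t = alpha (P_{t+1} - P_t) e_t
    return w
```

With lam = 1 the trace retains every past gradient, so each prediction is effectively pushed toward the final outcome; with lam = 0 only the most recent gradient survives, pairing each prediction with its immediate successor.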
It can be shown that the case λ = 1 corresponds to an explicit supervised pairing of each input pattern x_t with the final reward signal z. Similarly, the case λ = 0 corresponds to an explicit pairing of x_t with the next prediction P_{t+1}. The parameter λ provides a smooth heuristic interpolation between these two limits. Sutton provides a number of intuitive arguments why TD(λ) should be a more efficient learning procedure than explicit supervised pairing of input states with final reward. A rigorous proof is also given that TD(0) converges to the optimal predictions for a linear network and a linearly independent set of input patterns. This proof has recently been extended to arbitrary values of λ in Dayan (1992). However, no theoretical or empirical results are available for more complex tasks requiring multilayer networks, although a related algorithm called the Adaptive Heuristic Critic (Sutton, 1984) has been successfully applied to a relatively small-scale cart-pole balancing problem (Barto, Sutton & Anderson, 1983; Anderson, 1987).

The present paper seeks to determine whether temporal difference learning procedures such as TD(λ) …
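As a supplement to the λ = 1 claim above, the following check (again mine, not the paper's, and again assuming a linear predictor with illustrative names) verifies numerically that, when the weights are held fixed over an episode, the TD(1) increments of Equation 1 telescope into exactly the supervised updates that pair every input pattern with the final reward z.

```python
import numpy as np

def td1_equals_supervised(states, z, w, alpha=0.1):
    """Numerically confirm that summing the lambda = 1 increments of
    Equation 1 over an episode (with w held fixed) gives the same total
    weight change as supervised pairing of each x_k with the final reward z."""
    preds = [w @ x for x in states] + [z]      # P_1 ... P_T, with z acting as P_{T+1}
    e = np.zeros_like(w)
    td_total = np.zeros_like(w)
    for t, x_t in enumerate(states):
        e = e + x_t                             # lambda = 1: plain (undiscounted) gradient sum
        td_total += alpha * (preds[t + 1] - preds[t]) * e
    sup_total = sum(alpha * (z - preds[k]) * states[k] for k in range(len(states)))
    return np.allclose(td_total, sup_total)

# Example: three random 5-dimensional states and a scalar outcome.
rng = np.random.default_rng(0)
states = [rng.normal(size=5) for _ in range(3)]
print(td1_equals_supervised(states, z=1.0, w=rng.normal(size=5)))   # True
```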