Journal of Machine Learning Research 8 (2007) 2125-2167    Submitted 11/06; Revised 4/07; Published 9/07

Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Matthew E. Taylor    MTAYLOR@CS.UTEXAS.EDU
Peter Stone    PSTONE@CS.UTEXAS.EDU
Yaxin Liu    YXLIU@CS.UTEXAS.EDU
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712-1188

Editor: Michael L. Littman

Abstract

Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005).

Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

1. Introduction

Machine learning has traditionally been limited to training and testing on the same distribution of problem instances. However, humans are able to learn to perform well in complex tasks by utilizing principles learned in previous tasks. Few current machine learning methods are able to transfer knowledge between pairs of tasks, and none are able to transfer between a broad range of tasks to the extent that humans are. This article presents a new method for transfer learning in the reinforcement learning (RL) framework using temporal difference (TD) learning methods (Sutton and Barto, 1998), whereby an agent can learn faster in a target task after training on a different, typically less complex, source task.

TD learning methods have shown some success in many reinforcement learning tasks because of their ability to learn where there is limited prior knowledge and minimal environmental feedback. However, the basic unenhanced TD algorithms, such as Q-Learning (Watkins, 1989) and Sarsa (Rummery and Niranjan, 1994; Singh and Sutton, 1996), have been found slow to produce near-optimal behaviors in practice. Many techniques exist (Selfridge et al., 1985; Colombetti and Dorigo, 1993; Asada et al., 1994) which attempt, with more or less success, to speed up the learning process. Section 9 will discuss in depth how our transfer learning method differs from other existing methods and can potentially be combined with them if desired.

In this article we introduce transfer via inter-task mapping (TVITM), whereby a TD learner trained on one task with action-value function RL can learn faster when training on another task with related, but different, state and action spaces. TVITM thus enables faster TD learning in situations where there are two or more similar tasks. This transfer formulation is analogous to a human being told how a novel task is related to a known task, and then using this relation to decide how to perform the novel task. The key technical challenge is mapping an action-value function—the expected return or value of taking a particular action in a particular state—in one representation to a meaningful action-value function in another, typically larger, representation. It is this transfer functional which defines transfer in the TVITM framework.
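As a purely illustrative sketch of this idea (not the transfer functionals defined later in the article, which operate on the parameters of trained function approximators), the following Python fragment shows how hand-coded inter-task mappings might seed a target task's action-value function from a learned source-task function. The names chi_X, chi_A, and the tabular representation are hypothetical simplifications introduced only for this example.

    from collections import defaultdict

    def transfer_action_values(Q_source, target_states, target_actions, chi_X, chi_A):
        """Seed a target-task action-value table from a learned source-task table.

        Q_source:     dict mapping (source_state, source_action) -> learned value
        chi_X, chi_A: hand-coded inter-task mappings taking target-task states
                      and actions to their source-task analogues
        """
        Q_target = defaultdict(float)
        for s in target_states:
            for a in target_actions:
                # Each novel state-action pair inherits the value of its source-task
                # analogue, so the target learner starts from informed estimates
                # rather than from scratch.
                Q_target[(s, a)] = Q_source.get((chi_X(s), chi_A(a)), 0.0)
        return Q_target

The article's actual algorithms apply analogous mappings to the learned weights of a CMAC, ANN, or RBF rather than to an explicit table, but the principle is the same: values learned for source-task state-action pairs provide initial estimates for their target-task counterparts.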
In stochastic domains with continuous state spaces, agents will rarely (if ever) visit the same state twice. It is therefore necessary for learning agents to use function approximation when estimating the action-value function. Without some form of approximation, an agent would only be able to predict a value for states that it had previously visited. In this work we are primarily concerned with a different kind of generalization. Instead of finding similarities between different states, we focus on exploiting similarities between different tasks.

The primary contribution of this article is an existence proof that there are domains in which it is possible to construct a mapping between tasks and thereby speed up learning by transferring an action-value function. This approach may seem counterintuitive initially: the action-value function is the learned information which is directly tied to the particular task it was learned in. Nevertheless, we will demonstrate the efficacy of using TVITM to speed up learning in agents across tasks, irrespective of the representation used by the function approximator. Three different function approximators (as defined in Section 4.3), a CMAC, an ANN, and an RBF, are used to learn a single reinforcement learning problem. We will compare their effectiveness and demonstrate why TVITM is promising for future transfer studies.

The remainder of this article is organized as follows. Section 2 formally defines TVITM. Section 3 gives an overview of the tasks over which we quantitati
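To make the TD updates and function approximation discussed above concrete, the fragment below sketches a single Sarsa update with a generic linear function approximator. It is a minimal sketch under assumed simplifications: the feature function phi and the parameters alpha and gamma are hypothetical placeholders, not the CMAC, ANN, or RBF configurations evaluated later in the article.

    import numpy as np

    def sarsa_update(theta, phi, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
        """One Sarsa update for a linear action-value function Q(s, a) = theta . phi(s, a).

        theta:        weight vector (numpy array)
        phi(s, a):    feature vector for a state-action pair (numpy array)
        alpha, gamma: illustrative step-size and discount parameters
        """
        td_error = r + gamma * np.dot(theta, phi(s_next, a_next)) - np.dot(theta, phi(s, a))
        # For a linear approximator the gradient of Q with respect to theta is
        # phi(s, a), so the weights move along the features in proportion to
        # the temporal-difference error.
        return theta + alpha * td_error * phi(s, a)

Function approximation of this kind is what allows such an update to generalize across the continuous state spaces considered here, where an agent essentially never revisits exactly the same state.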