Journal of Machine Learning Research 8 (2007) 2125-2167    Submitted 11/06; Revised 4/07; Published 9/07

Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Matthew E. Taylor    MTAYLOR@CS.UTEXAS.EDU
Peter Stone    PSTONE@CS.UTEXAS.EDU
Yaxin Liu    YXLIU@CS.UTEXAS.EDU
Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712-1188

Editor: Michael L. Littman

Abstract

Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005).

Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

1. Introduction

Machine learning has traditionally been limited to training and testing on the same distribution of problem instances. However, humans are able to learn to perform well in complex tasks by utilizing principles learned in previous tasks. Few current machine learning methods are able to transfer knowledge between pairs of tasks, and none are able to transfer between a broad range of tasks to the extent that humans are. This article presents a new method for transfer learning in the reinforcement learning (RL) framework using temporal difference (TD) learning methods (Sutton and Barto, 1998), whereby an agent can learn faster in a target task after training on a different, typically less complex, source task.

TD learning methods have shown some success in many reinforcement learning tasks because of their ability to learn where there is limited prior knowledge and minimal environmental feedback. However, the basic unenhanced TD algorithms, such as Q-Learning (Watkins, 1989) and Sarsa (Rummery and Niranjan, 1994; Singh and Sutton, 1996), have been found slow to produce near-optimal behaviors in practice. Many techniques exist (Selfridge et al., 1985; Colombetti and Dorigo, 1993; Asada et al., 1994) which attempt, with more or less success, to speed up the learning process. Section 9 will discuss in depth how our transfer learning method differs from other existing methods and can potentially be combined with them if desired.

In this article we introduce transfer via inter-task mapping (TVITM), whereby a TD learner trained on one task with action-value function RL can learn faster when training on another task with related, but different, state and action spaces. TVITM thus enables faster TD learning in situations where there are two or more similar tasks. This transfer formulation is analogous to a human being told how a novel task is related to a known task, and then using this relation to decide how to perform the novel task. The key technical challenge is mapping an action-value function—the expected return or value of taking a particular action in a particular state—in one representation to a meaningful action-value function in another, typically larger, representation. It is this transfer functional which defines transfer in the TVITM framework.
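As a purely illustrative sketch of this idea (not the transfer functionals defined later in the article, which operate on the parameters of trained function approximators), the following Python fragment shows how hand-coded inter-task mappings might seed a target task's action-value function from a learned source-task function. The names chi_X, chi_A, and the tabular representation are hypothetical simplifications introduced only for this example.

    from collections import defaultdict

    def transfer_action_values(Q_source, target_states, target_actions, chi_X, chi_A):
        """Seed a target-task action-value table from a learned source-task table.

        Q_source:     dict mapping (source_state, source_action) -> learned value
        chi_X, chi_A: hand-coded inter-task mappings taking target-task states
                      and actions to their source-task analogues
        """
        Q_target = defaultdict(float)
        for s in target_states:
            for a in target_actions:
                # Each novel state-action pair inherits the value of its source-task
                # analogue, so the target learner starts from informed estimates
                # rather than from scratch.
                Q_target[(s, a)] = Q_source.get((chi_X(s), chi_A(a)), 0.0)
        return Q_target

The article's actual algorithms apply analogous mappings to the learned weights of a CMAC, ANN, or RBF rather than to an explicit table, but the principle is the same: values learned for source-task state-action pairs provide initial estimates for their target-task counterparts.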
In stochastic domains with continuous state spaces, agents will rarely (if ever) visit the same state twice. It is therefore necessary for learning agents to use function approximation when estimating the action-value function. Without some form of approximation, an agent would only be able to predict a value for states that it had previously visited. In this work we are primarily concerned with a different kind of generalization. Instead of finding similarities between different states, we focus on exploiting similarities between different tasks.

The primary contribution of this article is an existence proof that there are domains in which it is possible to construct a mapping between tasks and thereby speed up learning by transferring an action-value function. This approach may seem counterintuitive initially: the action-value function is the learned information which is directly tied to the particular task it was learned in. Nevertheless, we will demonstrate the efficacy of using TVITM to speed up learning in agents across tasks, irrespective of the representation used by the function approximator. Three different function approximators (as defined in Section 4.3), a CMAC, an ANN, and an RBF, are used to learn a single reinforcement learning problem. We will compare their effectiveness and demonstrate why TVITM is promising for future transfer studies.

The remainder of this article is organized as follows. Section 2 formally defines TVITM. Section 3 gives an overview of the tasks over which we quantitati
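To make the TD updates and function approximation discussed above concrete, the fragment below sketches a single Sarsa update with a generic linear function approximator. It is a minimal sketch under assumed simplifications: the feature function phi and the parameters alpha and gamma are hypothetical placeholders, not the CMAC, ANN, or RBF configurations evaluated later in the article.

    import numpy as np

    def sarsa_update(theta, phi, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
        """One Sarsa update for a linear action-value function Q(s, a) = theta . phi(s, a).

        theta:        weight vector (numpy array)
        phi(s, a):    feature vector for a state-action pair (numpy array)
        alpha, gamma: illustrative step-size and discount parameters
        """
        td_error = r + gamma * np.dot(theta, phi(s_next, a_next)) - np.dot(theta, phi(s, a))
        # For a linear approximator the gradient of Q with respect to theta is
        # phi(s, a), so the weights move along the features in proportion to
        # the temporal-difference error.
        return theta + alpha * td_error * phi(s, a)

Function approximation of this kind is what allows such an update to generalize across the continuous state spaces considered here, where an agent essentially never revisits exactly the same state.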