Under review as a conference paper at ICLR 2018

ON THE CONVERGENCE OF ADAM AND BEYOND

Anonymous authors
Paper under double-blind review

ABSTRACT

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSPROP, ADAM, ADADELTA, NADAM, etc., are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where ADAM does not converge to the optimal solution, and describe the precise problems with the previous analysis of the ADAM algorithm. Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the ADAM algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

1 INTRODUCTION

Stochastic gradient descent (SGD) is the dominant method to train deep networks today. This method iteratively updates the parameters of a model by moving them in the direction of the negative gradient of the loss evaluated on a minibatch. In particular, variants of SGD that scale coordinates of the gradient by square roots of some form of averaging of the squared coordinates in the past gradients have been particularly successful, because they automatically adjust the learning rate on a per-feature basis. The first popular algorithm in this line of research is ADAGRAD (Duchi et al., 2011; McMahan & Streeter, 2010), which can achieve significantly better performance compared to vanilla SGD when the gradients are sparse, or in general small. Although ADAGRAD works well for sparse settings, its performance has been observed to deteriorate in settings where the loss functions are nonconvex and gradients are dense, due to a rapid decay of the learning rate in these settings, since it uses all the past gradients in the update. This problem is especially exacerbated in high dimensional problems arising in deep learning.

To tackle this issue, several variants of ADAGRAD, such as RMSPROP (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2015), ADADELTA (Zeiler, 2012), NADAM (Dozat, 2016), etc., have been proposed which mitigate the rapid decay of the learning rate using exponential moving averages of squared past gradients, essentially limiting the reliance of the update to only the past few gradients. While these algorithms have been successfully employed in several practical applications, they have also been observed to not converge in some other settings. It has been typically observed that in these settings some minibatches provide large gradients but only quite rarely, and while these large gradients are quite informative, their influence dies out rather quickly due to the exponential averaging, thus leading to poor convergence.
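To make this concrete, the following is a minimal Python sketch of an ADAM-style update from this family, in which the step is scaled per coordinate by the square root of an exponential moving average of squared past gradients. The hyperparameter values (lr, beta1, beta2, eps) are common illustrative defaults, not something prescribed by this section.

```python
import numpy as np

def adam_style_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM-style update step (illustrative sketch; t is the 1-based step count)."""
    # Exponential moving averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction as in Kingma & Ba (2015).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-coordinate scaling by the square root of the EMA of squared gradients.
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

Because beta2 < 1, the weight placed on a squared gradient from k steps in the past decays like beta2**k, so the second-moment estimate effectively forgets all but a window of recent gradients; this limited reliance on past gradients is precisely what the analysis below identifies as a source of non-convergence.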
In this paper, we analyze this situation in detail. We rigorously prove that the intuition conveyed in the above paragraph is indeed correct: limiting the reliance of the update on essentially only the past few gradients can indeed cause significant convergence issues. In particular, we make the following key contributions:

- We elucidate how the exponential moving average in the RMSPROP and ADAM algorithms can cause non-convergence by providing an example of a simple convex optimization problem where RMSPROP and ADAM provably do not converge to an optimal solution. Our analysis easily extends to other algorithms using exponential moving averages, such as ADADELTA and NADAM, but we omit this for the sake of clarity. In fact, the analysis is flexible enough to extend to other algorithms that average squared gradients over essentially a fixed-size window in the immediate past (for exponential moving averages, the influence of gradients beyond a fixed window size becomes negligibly small). We omit the general analysis in this paper for the sake of clarity.

- The above result indicates that in order to have guaranteed convergence, the optimization algorithm must have "long-term memory" of past gradients. Specifically, we point out a problem with the proof of convergence of the ADAM algorithm given by Kingma & Ba (2015). To resolve this issue, we propose new variants of ADAM which rely on long-term memory of past gradients, but can be implemented with the same time and space requirements as the original ADAM algorithm. We provide a convergence analysis for the new variants in the convex setting, based on the analysis of Kingma & Ba (2015), and show a data-dependent regret bound similar to the one in ADAGRAD.

- We provide a preliminary empirical study of one of the variants we proposed and show that it performs similarly or better on some commonly used problems in machine learning.

2 PRELIMINARIES

Notation. We use $\mathcal{S}^d_+$ to denote the set of all positive definite $d \times d$ matrices. With a slight abuse of notation, for a vector $a \in \mathbb{R}^d$ and a positive definite matrix $M \in \mathbb{R}^{d \times d}$, we use $a/M$ to denote $M^{-1}a$, $\|M_i\|_2$ to denote the $\ell_2$-norm of the $i$-th row of $M$, and $\sqrt{M}$ to represent $M^{1/2}$. Furthermore, for any vectors $a, b \in \mathbb{R}^d$, we use $\sqrt{a}$ for element-wise square root, $a^2$ for element-wise square, $a/b$ to denote element-wise division, and $\max(a, b)$ to denote element-wise maximum. For any vector $\theta_i \in \mathbb{R}^d$, $\theta_{i,j}$ denotes its $j$-th coordinate, where $j \in [d]$. The projection operation $\Pi_{\mathcal{F},A}(y)$ for $A \in \mathcal{S}^d_+$ is defined as $\arg\min_{x \in \mathcal{F}} \|A^{1/2}(x - y)\|$ for $y \in \mathbb{R}^d$. Finally, we say $\mathcal{F}$ has bounded diameter $D_\infty$ if $\|x - y\|_\infty \leq D_\infty$ for all $x, y \in \mathcal{F}$.

Optimization setup. A flexible framework to analyze iterative optimization methods is the online optimization problem in the full information feedback setting.
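As a concrete illustration of the notation above, the sketch below implements the element-wise operations and the projection $\Pi_{\mathcal{F},A}$ for one simple special case: $A$ diagonal with positive entries and $\mathcal{F}$ a box $[l, u]^d$. Both assumptions, as well as the function names and the eps term, are mine, chosen only so that the projection has an obvious closed form; they are not stated in the text.

```python
import numpy as np

def project_box(y, lower, upper):
    """Pi_{F,A}(y) when A is diagonal with positive entries and F = [lower, upper]^d.
    The weighted objective ||A^{1/2}(x - y)||^2 is then separable across coordinates,
    so the minimizer is coordinate-wise clipping, independent of the diagonal weights."""
    return np.clip(y, lower, upper)

def projected_adaptive_step(x, m, v, lr, lower, upper, eps=1e-8):
    """Illustrative use of the element-wise notation: np.sqrt(v) is the element-wise
    square root, m / np.sqrt(v) is element-wise division, and the tentative iterate
    is projected back onto the feasible set F."""
    y = x - lr * m / (np.sqrt(v) + eps)
    return project_box(y, lower, upper)

# For this box, the bounded-diameter condition holds with D_inf = upper - lower,
# since ||x - y||_inf <= upper - lower for all x, y in [lower, upper]^d.
```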