A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

S. Bhatnagar†, V. S. Borkar‡ and A. Madhukar§

February 2006

†Corresponding author. Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India. E-Mail: shalabh@csa.iisc.ernet.in
‡School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India. E-Mail: borkar@tifr.res.in
§Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India. E-Mail: madhukar@ee.iisc.ernet.in

Abstract

We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In a related recent work, a simulation-based algorithm for estimating performance measures associated with a Markov chain conditioned on a rare event has been developed. We extend ideas from this work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates that is based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates is used. Deterministic perturbation sequences obtained from appropriate normalized Hadamard matrices are used here. The fast timescale recursions compute the conditional transition probabilities of an associated chain by obtaining solutions to the multiplicative Poisson equation (for a given policy estimate). Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies in between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks.

Key Words: Markov decision processes, optimal control conditioned on a rare event, simulation-based algorithms, SPSA with deterministic perturbations, reinforcement learning.

1 Introduction

Markov decision processes (MDPs) [5], [35] form a general framework for studying problems of control of stochastic dynamic systems (SDS). Many times, one encounters situations involving control of an SDS conditioned on a rare event of asymptotically zero probability. This could be, e.g., a problem of damage control when faced with a catastrophic event. For instance, in the setting of a large communication network such as the internet, one may be interested in obtaining optimal flow and congestion control or routing strategies in a subnetwork given that an extremal event such as a link failure has occurred in another remote subnetwork. Our objective in this paper is to consider a problem of this nature, wherein the rare event is specifically defined to be the event that the time average of a function of the MDP and its associated control-valued process exceeds a threshold that is larger than its mean. We consider the infinite horizon long-run average cost criterion for our problem and devise an algorithm based on policy iteration for the same.
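In symbols (the notation here is ours, introduced only for illustration; the paper's formal definitions come later), one way to write such a conditioning event is

\[
\frac{1}{n}\sum_{m=0}^{n-1} g(X_m, Z_m) \;\ge\; \beta, \qquad \text{where} \qquad \beta \;>\; \lim_{n\to\infty}\frac{1}{n}\sum_{m=0}^{n-1} E\bigl[g(X_m, Z_m)\bigr],
\]

with \(\{X_m\}\) the controlled Markov chain, \(\{Z_m\}\) its associated control-valued process, \(g\) the given function and \(\beta\) the prescribed threshold. Since \(\beta\) exceeds the long-run mean of the time average, the event has asymptotically zero probability, which is what makes it rare.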
Research on developing simulation-based methods for control of SDS has gathered momentum in recent times. These largely go under the names of neuro-dynamic programming (NDP) [7] or reinforcement learning (RL) [39] and are applicable in the case of systems for which model information is not known or is computationally forbiddingly expensive, but for which output data obtained either through a real system or a simulated one is available. Our problem does not share this last feature, but we do borrow certain algorithmic paradigms from this literature. Before we proceed further, we first review some representative recent work along these lines.

In [3], an algorithm for long-run average cost MDPs is presented. The average cost gradient is approximated using that associated with a corresponding infinite horizon discounted cost MDP problem. The variance of the estimates, however, increases rapidly as the discount factor is brought closer to one. In [4], certain variants based on the algorithm in [3] are presented and applications on some experimental settings shown. In [25], a perturbation analysis (PA) type approach is used to obtain the performance gradient based on sample path analysis. In [24], a PA-based method is proposed for solving long-run average cost MDPs. This requires keeping track of the regeneration epochs of the underlying process for any policy and aggregating data over these. These epochs can, however, be very infrequent in most real life systems. In [32], the average cost gradient is computed by assuming that sample path gradients of performance and transition probabilities are known in functional form.

Amongst other RL-based approaches, temporal difference (TD) learning [39] and Q-learning [42] have been popular in recent times. These are based on value function approximations. A parallel development is that of actor-critic algorithms based on the classical policy iteration algorithm in dynamic programming. Note that the classical policy iteration algorithm proceeds via two nested loops: an outer loop in which the policy improvement step is performed and an inner loop in which the policy evaluation step for the policy prescribed by the outer loop is conducted. The respective operations in the two loops are performed one after the other in a cyclic manner. The inner loop can in principle take a long time to converge, making the overall procedure slow in practice. In [29], certain simulation-based algorithms that use multi-timescale stochastic approximation are proposed. The idea is to use coupled stochastic recursions driven by different step-size schedules or timescales. The recursion corresponding to policy evaluation is run on the faster timescale while that corresponding to policy improvement is run on the slower one. Thus the two loops of classical policy iteration are, in effect, executed simultaneously, with the separation of timescales playing the role of the nested-loop structure.
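To make the timescale coupling concrete, the following is a minimal, self-contained sketch of the two-timescale paradigm on a toy stochastic optimization problem; it is not the algorithm of [29] nor the one developed in this paper. The fast iterate tracks a noisy gradient for the current parameter (the analogue of policy evaluation), while the slow iterate descends along it (the analogue of policy improvement). The target mean mu, the horizon, and the step-size exponents are arbitrary illustrative choices.

```python
import numpy as np

# Toy two-timescale stochastic approximation: the fast iterate w tracks
# the gradient 2*(theta - mu) of J(theta) = E[(theta - X)^2], X ~ N(mu, 1),
# for the current (quasi-static) theta; the slow iterate theta descends
# along w.  All constants here are illustrative, not from the paper.

rng = np.random.default_rng(0)
mu = 3.0                 # J is minimized at theta* = mu
theta, w = 0.0, 0.0

for n in range(1, 200_001):
    a_n = 1.0 / n ** 0.6          # fast schedule ("policy evaluation")
    b_n = 1.0 / n                 # slow schedule ("policy improvement")
    # Since b_n / a_n -> 0, theta looks frozen from w's viewpoint, so w
    # converges to E[2*(theta - X)] = 2*(theta - mu) for the current theta.
    x = mu + rng.standard_normal()
    w += a_n * (2.0 * (theta - x) - w)   # fast: track the mean gradient
    theta -= b_n * w                     # slow: gradient descent using w

print(f"theta after coupling: {theta:.3f} (optimum {mu})")
```

The condition b_n / a_n -> 0 is the standard requirement ensuring that the fast recursion sees a quasi-static slow iterate; the three-timescale algorithm summarized in the abstract extends this same separation principle to an additional intermediate schedule.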