A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

S. Bhatnagar†, V. S. Borkar‡ and A. Madhukar§

February 2006

†Corresponding author. Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India. E-Mail: shalabh@csa.iisc.ernet.in
‡School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India. E-Mail: borkar@tifr.res.in
§Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India. E-Mail: madhukar@ee.iisc.ernet.in

Abstract

We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In a related recent work, a simulation-based algorithm for estimating performance measures associated with a Markov chain conditioned on a rare event has been developed. We extend ideas from this work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates that is based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates is used. Deterministic perturbation sequences obtained from appropriate normalized Hadamard matrices are used here. The fast timescale recursions compute the conditional transition probabilities of an associated chain by obtaining solutions to the multiplicative Poisson equation (for a given policy estimate). Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies in between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks.

Key Words: Markov decision processes, optimal control conditioned on a rare event, simulation-based algorithms, SPSA with deterministic perturbations, reinforcement learning.

1 Introduction

Markov decision processes (MDPs) [5], [35] form a general framework for studying problems of control of stochastic dynamic systems (SDS). Many times, one encounters situations involving control of an SDS conditioned on a rare event of asymptotically zero probability. This could be, e.g., a problem of damage control when faced with a catastrophic event. For instance, in the setting of a large communication network such as the internet, one may be interested in obtaining optimal flow and congestion control or routing strategies in a subnetwork given that an extremal event such as a link failure has occurred in another remote subnetwork. Our objective in this paper is to consider a problem of this nature, wherein the rare event is specifically defined to be the event that the time average of a function of the MDP and its associated control-valued process exceeds a threshold that is larger than its mean. We consider the infinite horizon long-run average cost criterion for our problem and devise an algorithm based on policy iteration for the same.
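In symbols (the notation here is ours, introduced only for illustration; the paper's formal definitions come later), one way to write such a conditioning event is

\[
\frac{1}{n}\sum_{m=0}^{n-1} g(X_m, Z_m) \;\ge\; \beta, \qquad \text{where} \qquad \beta \;>\; \lim_{n\to\infty}\frac{1}{n}\sum_{m=0}^{n-1} E\bigl[g(X_m, Z_m)\bigr],
\]

with \(\{X_m\}\) the controlled Markov chain, \(\{Z_m\}\) its associated control-valued process, \(g\) the given function and \(\beta\) the prescribed threshold. Since \(\beta\) exceeds the long-run mean of the time average, the event has asymptotically zero probability, which is what makes it rare.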
Research on developing simulation-based methods for control of SDS has gathered momentum in recent times. These largely go under the names of neuro-dynamic programming (NDP) [7] or reinforcement learning (RL) [39] and are applicable in the case of systems for which model information is not known or is computationally forbiddingly expensive, but for which output data obtained either through a real system or a simulated one is available. Our problem does not share this last feature, but we do borrow certain algorithmic paradigms from this literature. Before we proceed further, we first review some representative recent work along these lines.

In [3], an algorithm for long-run average cost MDPs is presented. The average cost gradient is approximated using that associated with a corresponding infinite horizon discounted cost MDP problem. The variance of the estimates, however, increases rapidly as the discount factor is brought closer to one. In [4], certain variants based on the algorithm in [3] are presented and applications on some experimental settings shown. In [25], a perturbation analysis (PA) type approach is used to obtain the performance gradient based on sample path analysis. In [24], a PA-based method is proposed for solving long-run average cost MDPs. This requires keeping track of the regeneration epochs of the underlying process for any policy and aggregating data over these. These epochs can, however, be very infrequent in most real life systems. In [32], the average cost gradient is computed by assuming that sample path gradients of performance and transition probabilities are known in functional form.

Amongst other RL-based approaches, temporal difference (TD) learning [39] and Q-learning [42] have been popular in recent times. These are based on value function approximations. A parallel development is that of actor-critic algorithms based on the classical policy iteration algorithm in dynamic programming. Note that the classical policy iteration algorithm proceeds via two nested loops: an outer loop in which the policy improvement step is performed and an inner loop in which the policy evaluation step for the policy prescribed by the outer loop is conducted. The respective operations in the two loops are performed one after the other in a cyclic manner. The inner loop can in principle take a long time to converge, making the overall procedure slow in practice. In [29], certain simulation-based algorithms that use multi-timescale stochastic approximation are proposed. The idea is to use coupled stochastic recursions driven by different step-size schedules or timescales. The recursion corresponding to policy evaluation is run on the faster timescale while that corresponding to policy improvement is run on the slower one. Thus the two loops of classical policy iteration are, in effect, executed simultaneously, with the separation of timescales playing the role of the nested-loop structure.
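To make the timescale coupling concrete, the following is a minimal, self-contained sketch of the two-timescale paradigm on a toy stochastic optimization problem; it is not the algorithm of [29] nor the one developed in this paper. The fast iterate tracks a noisy gradient for the current parameter (the analogue of policy evaluation), while the slow iterate descends along it (the analogue of policy improvement). The target mean mu, the horizon, and the step-size exponents are arbitrary illustrative choices.

```python
import numpy as np

# Toy two-timescale stochastic approximation: the fast iterate w tracks
# the gradient 2*(theta - mu) of J(theta) = E[(theta - X)^2], X ~ N(mu, 1),
# for the current (quasi-static) theta; the slow iterate theta descends
# along w.  All constants here are illustrative, not from the paper.

rng = np.random.default_rng(0)
mu = 3.0                 # J is minimized at theta* = mu
theta, w = 0.0, 0.0

for n in range(1, 200_001):
    a_n = 1.0 / n ** 0.6          # fast schedule ("policy evaluation")
    b_n = 1.0 / n                 # slow schedule ("policy improvement")
    # Since b_n / a_n -> 0, theta looks frozen from w's viewpoint, so w
    # converges to E[2*(theta - X)] = 2*(theta - mu) for the current theta.
    x = mu + rng.standard_normal()
    w += a_n * (2.0 * (theta - x) - w)   # fast: track the mean gradient
    theta -= b_n * w                     # slow: gradient descent using w

print(f"theta after coupling: {theta:.3f} (optimum {mu})")
```

The condition b_n / a_n -> 0 is the standard requirement ensuring that the fast recursion sees a quasi-static slow iterate; the three-timescale algorithm summarized in the abstract extends this same separation principle to an additional intermediate schedule.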