I.J. Information Technology and Computer Science, 2015, 04, 73-78
Published Online March 2015 in MECS
DOI: 10.5815/ijitcs.2015.04.08
Copyright © 2015 MECS

Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

S. Suresh
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: sureshtvmalai85@gmail.com

N. P. Gopalan
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: npgopalan@nitt.edu

Abstract—The volume of data generated and processed by modern computing systems is growing rapidly. MapReduce is an important programming model for large-scale, data-intensive applications. Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). Its scalability and fault tolerance have made Hadoop a standard for Big Data processing. Hadoop uses the Hadoop Distributed File System (HDFS) for storing data. Data reliability and fault tolerance are achieved in HDFS through replication. In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (dereplicate) the popular (unpopular) files/blocks in HDFS based on information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.

Index Terms—Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

I. INTRODUCTION

As data grows rapidly, the complexity of processing it becomes a challenge. Applications need to process very large amounts of data of different types in a short time to achieve a better user experience. To provide abstracted data services to application programs, several solutions have been proposed, ranging from traditional databases to current Big Data management systems. The performance of an application depends mainly on these back-end data management systems. To enable distributed processing with high availability, fault tolerance and load balancing, replication is the evergreen solution. On the other hand, maintaining consistency among replicas in distributed environments is a time-consuming process, which in turn affects availability and performance. Most of the data generated and processed by current Big Data applications follows the 'write once, read many' pattern, which eliminates the complexity of maintaining consistency among replicas.

Recent distributed file systems such as the Google File System (GFS) [1] and the Hadoop Distributed File System (HDFS) [2] use replication mechanisms to enable fault-tolerant, high-performance parallel processing. Blindly replicating all files/blocks at many places increases availability and fault tolerance, but it also increases the memory requirement proportionally. Finding hotspots and replicating only those may yield better performance with less demand on memory. Determining the optimal number of replicas is a challenging problem and has long been an active research topic, since it depends on application load, data size, quality of service, etc.

Current distributed computing environments such as grid computing and cloud computing are designed to process petabytes of data in a massively parallel style. As processing speed increases rapidly with the advent of multicore processors, the underlying file systems determine the performance of computing environments. To support stream-like data access, modern file systems (Bigtable [3], Cassandra [4]) use a very simple data model supporting a limited number of operations. These are some of the popular distributed file systems, and Hadoop [5] is an emerging open-source platform for parallel processing of large-scale, data-intensive applications, supported by HDFS.

In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate the popular files/blocks (hotspots) in HDFS using information collected from the Delay Scheduling technique. The performance of the proposed algorithm is evaluated through extensive experiments. It is observed that it excels in terms of response time, locality and fairness.

The paper is organized as follows: Section 2 gives background on Hadoop and HDFS. Section 3 is dedicated to related work. Section 4 elaborates the proposed replication algorithm. Section 5 describes the simulation environment and discusses the simulation results. Section 6 concludes the paper and highlights future research directions.

II. HADOOP AND HDFS BACKGROUND

Hadoop is a popular parallel processing framework for cloud environments. It is an open-source implementation of MapReduce [6] and GFS [1]. Owing to its simplicity and scalability, it has become a de facto standard for data-intensive applications. Hadoop provides an abstracted, distributed, fault-tolerant environment for Big Data processing. Jobs submitted to the system are divided into small tasks and executed in parallel on a cluster of commodity hardware machines. Hadoop adopts the master-slave architecture. Users need to write only two functions for their applications: map and reduce. All other operations, such as synchronization, parallelization and failure handling, are handled by the framework. Hadoop contains two major components: (i) MapReduce, a runtime environment for parallel processing, and (ii) HDFS, a distributed file system for storing input and output files. MapReduce has two major components: JobTracker and TaskTracker. JobTracker is the master component to keep track of all
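
The division of labour described in this section, where the user supplies only map and reduce and the framework does everything else, can be illustrated with a word count. The sketch below is a single-process stand-in written for illustration only: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, not part of Hadoop's API, and `run_job` merely imitates the grouping step that a real cluster would perform in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """User-written map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-written reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        # The framework, not the user, feeds each record to map_fn.
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    # Each distinct key is reduced independently of the others.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data in hdfs", "hdfs replicates data"]))
```

Because each key is reduced independently and each input line is mapped independently, both phases can be split into tasks and distributed, which is the property the framework exploits.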