Delay Scheduling Based Replication Scheme for Hadoop Distributed File System (IJITCS-V7-N4-8)

I.J. Information Technology and Computer Science, 2015, 04, 73-78
Published Online March 2015 in MECS
DOI: 10.5815/ijitcs.2015.04.08
Copyright © 2015 MECS

Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

S. Suresh
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: sureshtvmalai85@gmail.com

N. P. Gopalan
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: npgopalan@nitt.edu

Abstract: The data generated and processed by modern computing systems is growing rapidly. MapReduce is an important programming model for large scale data intensive applications. Hadoop is a popular open source implementation of MapReduce and Google File System (GFS). The scalability and fault tolerance of Hadoop make it a standard for Big Data processing. Hadoop uses Hadoop Distributed File System (HDFS) for storing data. Data reliability and fault tolerance are achieved through replication in HDFS. In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (dereplicate) the popular (unpopular) files/blocks in HDFS based on the information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.

Index Terms: Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

I. INTRODUCTION

As data grows rapidly, the complexity of processing becomes a challenge. Applications need to process very large amounts of data of different types in a short time to achieve a better user experience. To provide abstracted data services to application programs, several solutions have been proposed, ranging from traditional databases to current Big Data management systems. The performance of an application depends mainly on these backend data management systems. To enable distributed processing with high availability, fault tolerance and load balancing, replication is the evergreen solution. On the other hand, maintaining consistency among the replicas in distributed environments is a time-consuming process, which in turn affects availability and performance. Most of the data generated and processed by current Big Data applications follows the 'write once and read many' pattern, which eliminates the complexity of maintaining consistency among replicas. Recently emerged distributed file systems such as Google File System (GFS) [1] and Hadoop Distributed File System (HDFS) [2] use replication mechanisms to enable fault-tolerant, high-performance parallel processing.

Blindly replicating all files/blocks at many places increases availability and fault tolerance, but it also increases the memory requirement proportionally. Finding hotspots and replicating only them may yield better performance with less demand on memory. Determining the optimal number of replicas is challenging and has long been an active research problem, as it must account for application load, data size, quality of service, etc.

Current distributed computing environments such as grid computing and cloud computing are designed to process petabytes of data in a massively parallel style. As processing speed increases rapidly with the advent of multicore processors, the underlying file systems determine the performance of computing environments. To support stream-like data access, modern file systems (Bigtable [3], Cassandra [4]) use very simple data models supporting a limited number of operations. Hadoop [5], supported by HDFS, is an emerging open source platform for parallel data processing in large scale data intensive applications.

In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate the popular files/blocks (hotspots) in HDFS using the information collected from the Delay Scheduling technique. The performance of the proposed algorithm is evaluated by exhaustive experiments. It is observed that it excels in terms of response time, locality and fairness.

The paper is organized as follows: Section 2 gives background on Hadoop and HDFS. Section 3 is dedicated to related works. Section 4 elaborates the proposed replication algorithm. Section 5 describes the simulation environment and discusses the simulation results. Section 6 concludes the paper and highlights future research directions.

II. HADOOP AND HDFS BACKGROUND

Hadoop is a popular parallel processing framework for cloud environments. It is an open source implementation of MapReduce [6] and GFS [1]. Due to its simplicity and scalability, it has become a de facto standard for data-intensive applications. Hadoop provides an abstracted, distributed, fault-tolerant environment for Big Data processing. The jobs submitted to the system are divided into small tasks and executed in parallel on a cluster of commodity hardware machines. Hadoop adopts the master-slave architecture. Users need to write only two functions, map and reduce, for their applications. All other operations such as synchronization, parallelization and failure handling are handled by the framework. Hadoop contains two major components: (i) MapReduce, a runtime environment for parallel processing, and (ii) HDFS, a distributed file system for storing input and output files. MapReduce has two major components: JobTracker and TaskTracker. JobTracker is the master component to keep track of all
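The statement that users write only a map and a reduce function can be illustrated with a small word-count sketch in Python. This is not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names chosen for illustration, and `run_job` is a single-process stand-in for the work the framework would do (grouping intermediate pairs by key), with no parallelism, scheduling or fault handling.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(_offset: int, line: str) -> Iterator[tuple[str, int]]:
    """User-written map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """User-written reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[int, str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Toy stand-in for the framework: map each line, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)  # the "shuffle" step, done in memory
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["Hadoop stores data in HDFS", "HDFS replicates data"]
    print(run_job(sample, map_fn, reduce_fn))
```

In a real Hadoop job the two user functions would look similar in spirit, while splitting the input, running tasks on different nodes and re-running failed tasks would be left to the framework.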
