© Spinnaker Labs, Inc.

Google Cluster Computing Faculty Training Workshop
Module V: Hadoop Technical Review

Overview
• Hadoop Technical Walkthrough
• HDFS
• Databases
• Using Hadoop in an Academic Environment
• Performance tips and other tools

You Say, "tomato…"
Google calls it:    Hadoop equivalent:
MapReduce           Hadoop
GFS                 HDFS
Bigtable            HBase
Chubby              (nothing yet… but planned)

Some MapReduce Terminology
• Job: A "full program", an execution of a Mapper and Reducer across a data set
• Task: An execution of a Mapper or a Reducer on a slice of data
  – a.k.a. Task-In-Progress (TIP)
• Task Attempt: A particular instance of an attempt to execute a task on a machine

Terminology Example
• Running "Word Count" across 20 files is one job
• 20 files to be mapped imply 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, etc.

Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
  – If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
  – Task ID from TaskInProgress is not a unique identifier; don't use it that way

MapReduce: High Level
[Diagram: a client computer submits a MapReduce job to the JobTracker on the master node; each of three slave nodes runs a TaskTracker with a task instance.]

Node-to-Node Communication
• Hadoop uses its own RPC protocol
• All communication begins in slave nodes
  – Prevents circular-wait deadlock
  – Slaves periodically poll for "status" message
• Classes must provide explicit serialization

Nodes, Trackers, Tasks
• Master node runs JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks separate Java process for task instances

Job Distribution
• MapReduce programs are contained in a Java "jar" file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where's the data distribution?

Data Distribution
• Implicit in design of MapReduce!
  – All mappers are equivalent; so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
  – Data transfer is handled implicitly by HDFS

Configuring With JobConf
• MR programs have many configurable options
• JobConf objects hold (key, value) components mapping String → 'a
  – e.g., "mapred.map.tasks" → 20
  – JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf

What Happens In MapReduce?
Depth First

Job Launch Process: Client
• Client program creates a JobConf
  – Identify classes implementing Mapper and Reducer interfaces
    • JobConf.setMapperClass(), setReducerClass()
  – Specify inputs, outputs
    • JobConf.setInputPath(), setOutputPath()
  – Optionally, other options too:
    • JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…

Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
  – runJob() blocks, submitJob() does not
• JobClient:
  – Determines proper division of input into InputSplits
  – Sends job data to master JobTracker server

Job Launch Process: JobTracker
• JobTracker:
  – Inserts jar and JobConf (serialized to XML) in shared location
  – Posts a JobInProgress to its run queue

Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
  – main() is provided by Hadoop

Job Launch Process: Task
• TaskTracker.Child.main():
  – Sets up the child TaskInProgress attempt
  – Reads XML configuration
  – Connects back to necessary MapReduce components via RPC
  – Uses TaskRunner to launch user process

Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
  – Task knows ahead of time which InputSplits it should be mapping
  – Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same

Creating the Mapper
• You provide the instance of Mapper
  – Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
  – Exists in a separate process from all other instances of Mapper: no data sharing!

Mapper
• void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)

What is Writable?
• Hadoop defines its own "box" classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable

Writing For Cache Coherency

    while (more input exists) {
      myIntermediate = new intermediate(input);
      myIntermediate.process();
      export outputs;
    }

Writing For Cache Coherency

    myIntermediate = new intermediate(junk);
    while (more input exists) {
      myIntermediate.setupState(input);
      myIntermediate.process();
      export outputs;
    }

Writing For Cache Coherency
• Running the GC takes time
• Reusing locations allows better cache usage
• Speedup can be as much as two-fold
• All serializable types must be Writable anyway, so make use of the interface

Getting Data To The Mapper
[Diagram, truncated in the source: Input file, InputSplit, InputSplit]
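
The two loops on the "Writing For Cache Coherency" slides can be made concrete with a small, Hadoop-free sketch. The class names IntBox and ReuseSketch are hypothetical and used only for illustration; IntBox is a stand-in for a mutable box class such as IntWritable, and the doubling step stands in for whatever "process" and "export outputs" do in a real job. Both methods compute the same result; the second allocates a single object and resets its state for each record.

```java
/** Hypothetical stand-in for a mutable Hadoop box class such as IntWritable. */
final class IntBox {
    private int value;

    void set(int value) { this.value = value; }

    int get() { return value; }
}

final class ReuseSketch {
    /** First slide: allocate a fresh intermediate object for every record. */
    static long sumAllocating(int[] records) {
        long total = 0;
        for (int record : records) {
            IntBox box = new IntBox();  // one allocation per iteration
            box.set(record);
            total += box.get() * 2L;    // "process" and "export" the result
        }
        return total;
    }

    /** Second slide: allocate once, then reset the state for each record. */
    static long sumReusing(int[] records) {
        long total = 0;
        IntBox box = new IntBox();      // single allocation, hoisted out of the loop
        for (int record : records) {
            box.set(record);            // plays the role of setupState(input)
            total += box.get() * 2L;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] records = {1, 2, 3, 4};
        System.out.println(sumAllocating(records));
        System.out.println(sumReusing(records));
    }
}
```

Reuse of this kind is only safe when the consumer copies or serializes the value before the next set() call; anything that keeps a reference to the shared object will see it change on the following record.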
Title: Google Cloud Computing Course, Module 5 - Hadoop Technical Review