Hadoop Design and k-Means Clustering
Kenneth Heafield
Google Inc
January 15, 2008

Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.


Outline

Hadoop Design
  1 Fault Tolerance
  2 Data Flow
      Input
      Output
  3 Map Task
      Map
      Partition
  4 Reduce Task
      Fetch and Sort
      Reduce

Later in this talk: Performance and k-Means Clustering


Fault Tolerance: Managing Tasks

[Diagram: one JobTracker, two TaskTrackers, three MapTasks, and one ReduceTask]

Design
  TaskTracker reports status or requests work every 10 seconds
  MapTask and ReduceTask report progress every 10 seconds

Issues
  + Detects failures and slow workers quickly
  - JobTracker is a single point of failure


Fault Tolerance: Coping With Failure

Failed Tasks
  Rerun map and reduce as necessary.

Slow Tasks
  Start a second backup instance of the same task.

Consistency
  Any MapTask or ReduceTask might be run multiple times
  Map and Reduce should be functional


Fault Tolerance: Use of Random Numbers

Purpose
  Support randomized algorithms while remaining consistent

Sampling Mapper
    private Random rand = new Random();

    // Seed from the task number so that a rerun of this task draws the same sample.
    void configure(JobConf conf) {
      rand.setSeed((long) conf.getInt("mapred.task.partition", 0));
    }

    // Pass through roughly 10% of the input records.
    void map(WritableComparable key, Writable value,
             OutputCollector output, Reporter reporter) throws IOException {
      if (rand.nextFloat() < 0.1) {
        output.collect(key, value);
      }
    }


Data Flow

  1. HDFS Input: InputFormat splits and reads files
  2. Mapper
  3. Local Output: SequenceFileOutputFormat writes serialized values
  4. HTTP Input: Map outputs are retrieved over HTTP and merged
  5. Reduce
  6. HDFS Output: OutputFormat writes a SequenceFile or text


Data Flow, Input: InputSplit

Purpose
  Locate a single map task's input.

Important Functions
    Path FileSplit.getPath();

Implementations
  MultiFileSplit is a list of small files to be concatenated.
  FileSplit is a file path, offset, and length.
  TableSplit is a table name, start row, and end row.


Data Flow, Input: RecordReader

Purpose
  Parse input specified by InputSplit into keys and values.
  Handle records on split boundaries.

Important Functions
    boolean next(Writable key, Writable value);

Implementations
  LineRecordReader reads lines. Key is an offset, value is the text.
  KeyValueLineRecordReader reads delimited key-value pairs.
  SequenceFileRecordReader reads a SequenceFile, Hadoop's binary representation of key-value pairs.


Data Flow, Input: InputFormat

Purpose
  Specifies input file format by constructing InputSplit and RecordReader.

Important Functions
    RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter);
    InputSplit[] getSplits(JobConf job, int numSplits);

Implementations
  TextInputFormat reads text files.
  TableInputFormat reads from a table.


Data Flow, Output: OutputFormat

Purpose
  Machine or human readable output.
  Makes RecordWriter, which is analogous to RecordReader

Important Functions
    RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name, Progressable progress);

Formats
  SequenceFileOutputFormat writes a binary SequenceFile
  TextOutputFormat writes text files


Map Task: Default Setup

  InputFormat       Split files and read records
  MapRunnable       Map all records in the task
  Mapper            Map a record
  OutputCollector   Consult Partitioner and save files
  Partitioner       Assign key-value pairs to reducers
  Reducer           Reducers retrieve files over HTTP


Map Task, Map: MapRunnable

Purpose
  Sequence of map operations

Default Implementation
    public void run(RecordReader input, OutputCollector output,
                    Reporter reporter) throws IOException {
      try {
        // Reuse one key object and one value object for every record.
        WritableComparable key = input.createKey();
        Writable value = input.createValue();
        while (input.next(key, value)) {
          mapper.map(key, value, output, reporter);
        }
      } finally {
        mapper.close();
      }
    }


Map Task, Map: Mapper

Purpose
  Single map operation

Important Functions
    void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter);

Pre-defined Mappers
  IdentityMapper
  InverseMapper flips key and value.
  RegexMapper matches regular expressions set in job.
  TokenCountMapper implements word count map.


Map Task, Partition: Partitioner

Purpose
  Decide which reducer handles map output.

Important Functions
    int getPartition(WritableComparable key, Writable value, int numReduceTasks);

Implementations
  HashPartitioner uses key.hashCode() % numReduceTasks.
  KeyFieldBasedPartitioner hashes only part of key.


Reduce Task, Fetch and Sort: Fetch and Sort

Fetch
  TaskTracker tells Reducer where mappers are
  Reducer requests input files from mappers via HTTP

Merge Sort
  Recursively merge
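
The consistency requirement from the fault tolerance slides (any task may run more than once, so map and reduce should be functional) can be tried out without a cluster. What follows is a minimal sketch in plain Java, not Hadoop code. The names RerunSketch, sample, and getPartition are hypothetical, and the integer passed to sample stands in for the mapred.task.partition value. Masking the sign bit in getPartition is an assumption of this sketch that the slides do not mention; it is there because Java's % operator can return a negative result when hashCode() is negative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RerunSketch {
    // Hypothetical stand-in for the sampling mapper: keeps roughly 10% of the
    // record indices, using a generator seeded from the task number.
    static List<Integer> sample(int taskPartition, int numRecords) {
        Random rand = new Random((long) taskPartition);
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < numRecords; i++) {
            if (rand.nextFloat() < 0.1f) {
                kept.add(i);
            }
        }
        return kept;
    }

    // Hash partitioning: the same key always maps to the same reducer index.
    // Assumption of this sketch: the sign bit is cleared before the modulo so
    // the result always lies in [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Simulate the original task and a backup instance of the same task.
        List<Integer> firstRun = sample(3, 1000);
        List<Integer> backupRun = sample(3, 1000);
        System.out.println("runs agree: " + firstRun.equals(backupRun));

        System.out.println("partition of \"hadoop\": " + getPartition("hadoop", 4));
        System.out.println("partition of -5: " + getPartition(Integer.valueOf(-5), 4));
    }
}
```

Because both simulated runs seed the generator with the same task number, they keep exactly the same records, so it does not matter which instance finishes first. Seeding from the clock instead would let the original task and its backup emit different samples.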