Hadoop In 45 Minutes or Less
Large-Scale Data Processing for Everyone
Tom Wheeler

Who Am I?
- Principal Software Engineer at Object Computing, Inc.
- I worked on large-scale data processing in a previous job
  – If only I'd had Hadoop back then...

What I'm Going to Cover
- I'll explain what Hadoop is
- I'll tell you what problems it can (and can't) solve
- I'll describe how it works
- I'll show examples so you can see it in action

What is Hadoop?
It's a framework for large-scale data processing:
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project – Hadoop is open source
- Written in Java, plus a few shell scripts

How Did Hadoop Originate?
[Figure not included]

Why Should I Care About Hadoop?
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work – you can focus on processing data

Who's Using Hadoop?
[Figure not included]

What Features Does Hadoop Offer?
- API + implementation for working with MapReduce
- More importantly, it provides infrastructure
Hadoop Infrastructure
- Job configuration and efficient scheduling
- Browser-based monitoring of important cluster stats
- Handling failures in both computation and data nodes
- A distributed FS optimized for HUGE amounts of data

When is Hadoop a Good Choice?
- When you must process lots of unstructured data
- When your processing can easily be made parallel
- When running batch jobs is acceptable
- When you have access to lots of cheap hardware

When is Hadoop Not A Good Choice?
- For intense calculations with little or no data
- When your processing cannot be easily made parallel
- When your data is not self-contained
- When you need interactive results
- If you own stock in Cray!

Hadoop Examples/Anti-Examples
Hadoop would be a good choice for...
- Indexing log files
- Sorting vast amounts of data
- Image analysis
Hadoop would be a poor choice for...
- Figuring Pi to 1,000,000 digits
- Calculating Fibonacci sequences
- A general RDBMS replacement

HDFS Overview
HDFS is perhaps Hadoop's most interesting feature.
- HDFS = Hadoop Distributed Filesystem (userspace)
- Inspired by Google File System
- High aggregate throughput for streaming large files
- Replication and locality

How HDFS Works: Splits
- Data copied into HDFS is split into blocks
- Typical block size: UNIX = 4 KB vs. HDFS = 128 MB

How HDFS Works: Replication
- Each data block is replicated to multiple machines
- Allows for node failure without data loss

Hadoop Architecture Overview
[Figure not included]

The Hadoop Cast of Characters: NameNode
- There is only one (active) namenode per cluster
- It manages the filesystem namespace and metadata
- SPOF: the one place to spend $$$ for good hardware

The Hadoop Cast of Characters: DataNode
- There are typically lots of datanodes
- It manages data blocks + serves them to clients
- Data is replicated – failure is no big deal

The Hadoop Cast of Characters: JobTracker
- There is exactly one jobtracker per cluster
- Receives job requests submitted by client
- Schedules and monitors MR jobs on tasktrackers

The Hadoop Cast of Characters: TaskTracker
- There are typically lots of tasktrackers
- Responsible for executing MR operations
- Reads blocks from datanodes

Hadoop Modes of Operation
Hadoop supports three modes of operation:
- Standalone
- Pseudo-distributed
- Fully-distributed

Installing Hadoop
The installation process, for distributed modes:
- Requirements: Linux, Java 1.6, sshd, rsync
- Configure SSH for password-free authentication
- Unpack Hadoop distribution
- Edit a few configuration files
- Format the DFS on the namenode
- Start all the daemon processes

Running Hadoop
The basic steps for running a Hadoop job are:
- Compile your job into a JAR file
- Copy input data into HDFS
- Execute bin/hadoop jar with relevant args
- Monitor tasks via Web interface (optional)
- Examine output when job is complete

A Simple Hadoop Job
I'll demonstrate Hadoop with a simple MapReduce example
- Input is historical data for 30 stocks, 1987-2009
- Records are CSV: symbol, low price, high price, etc.
- Goal is to find largest intra-day price fluctuation
- Output is one record per stock, showing max delta

Our Hadoop Job Illustrated
- Hadoop is all about data processing
- Seeing the data will help to explain the job

Our Hadoop Job Illustrated: Mapper Input
[Figure not included]

Our Hadoop Job Illustrated: Mapper Output
[Figure not included]

Our Hadoop Job Illustrated: Reducer Output
[Figure not included]

Show Me the Code
- Now that you understand the data... let's see the code!

Mapper for StockAnalyzer (part 1)

    public class StockAnalyzerMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, FloatWritable> {

      @Override
      public void map(LongWritable key, Text value,
          OutputCollector<Text, FloatWritable> output, Reporter reporter)
          throws IOException {

        String record = value.toString();

        if (record.startsWith("Symbol")) {
          // ig
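
The mapper listing above is cut off in this copy, so the sketch below is not the author's code. It is a minimal plain-Java illustration, with no Hadoop dependency, of the data flow the "A Simple Hadoop Job" slide describes: a map step that turns each CSV record into a (symbol, high - low) pair, and a reduce step that keeps the largest delta per symbol. The class name StockDeltaSketch, the method names, and the column positions are all assumptions made for illustration; the slides only say the records contain "symbol, low price, high price, etc."

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StockDeltaSketch {

    // Hypothetical column positions; the real file layout is not given in the slides.
    static final int SYMBOL_COL = 0;
    static final int LOW_COL = 1;
    static final int HIGH_COL = 2;

    /** "Map" step: one CSV record in, one (symbol, high - low) pair out; null for the header row. */
    static Map.Entry<String, Float> map(String record) {
        if (record.startsWith("Symbol")) {
            return null; // the header row carries no price data
        }
        String[] fields = record.split(",");
        float delta = Float.parseFloat(fields[HIGH_COL]) - Float.parseFloat(fields[LOW_COL]);
        return Map.entry(fields[SYMBOL_COL], delta);
    }

    /** "Reduce" step: keep the largest delta seen for each symbol. */
    static Map<String, Float> maxDeltas(List<String> records) {
        Map<String, Float> result = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Float> pair = map(record);
            if (pair != null) {
                // merge() stores the first value, then combines later ones with Float.max
                result.merge(pair.getKey(), pair.getValue(), Float::max);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, not data from the presentation.
        List<String> records = List.of(
            "Symbol,Low,High",
            "AAA,10.0,12.5",
            "AAA,11.0,12.0",
            "BBB,20.0,20.5");
        System.out.println(maxDeltas(records)); // prints {AAA=2.5, BBB=0.5}
    }
}
```

In a real Hadoop job the framework, not a local loop, groups the mapper output by key and hands each symbol's values to a reducer, which is what lets the same logic scale across many machines.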