Hadoop In 45 Minutes or Less

Large-Scale Data Processing for Everyone
Tom Wheeler

Who Am I?
- Principal Software Engineer at Object Computing, Inc.
- I worked on large-scale data processing in a previous job
  - If only I'd had Hadoop back then...

What I'm Going to Cover
- I'll explain what Hadoop is
- I'll tell you what problems it can (and can't) solve
- I'll describe how it works
- I'll show examples so you can see it in action

What is Hadoop?
It's a framework for large-scale data processing:
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project; Hadoop is open source
- Written in Java, plus a few shell scripts

How Did Hadoop Originate?

Why Should I Care About Hadoop?
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work; you can focus on processing data

Who's Using Hadoop?

What Features Does Hadoop Offer?
- API + implementation for working with MapReduce
- More importantly, it provides infrastructure
Hadoop Infrastructure
- Job configuration and efficient scheduling
- Browser-based monitoring of important cluster stats
- Handling failures in both computation and data nodes
- A distributed FS optimized for HUGE amounts of data

When is Hadoop a Good Choice?
- When you must process lots of unstructured data
- When your processing can easily be made parallel
- When running batch jobs is acceptable
- When you have access to lots of cheap hardware

When is Hadoop Not A Good Choice?
- For intense calculations with little or no data
- When your processing cannot be easily made parallel
- When your data is not self-contained
- When you need interactive results
- If you own stock in Cray!

Hadoop Examples/Anti-Examples
Hadoop would be a good choice for...
- Indexing log files
- Sorting vast amounts of data
- Image analysis
Hadoop would be a poor choice for...
- Figuring Pi to 1,000,000 digits
- Calculating Fibonacci sequences
- A general RDBMS replacement

HDFS Overview
HDFS is perhaps Hadoop's most interesting feature.
- HDFS = Hadoop Distributed Filesystem (userspace)
- Inspired by Google File System
- High aggregate throughput for streaming large files
- Replication and locality

How HDFS Works: Splits
- Data copied into HDFS is split into blocks
- Typical block size: UNIX = 4 KB vs. HDFS = 128 MB

How HDFS Works: Replication
- Each data block is replicated to multiple machines
- Allows for node failure without data loss

Hadoop Architecture Overview

The Hadoop Cast of Characters: NameNode
- There is only one (active) namenode per cluster
- It manages the filesystem namespace and metadata
- SPOF: the one place to spend $$$ for good hardware

The Hadoop Cast of Characters: DataNode
- There are typically lots of datanodes
- It manages data blocks + serves them to clients
- Data is replicated; failure is no big deal

The Hadoop Cast of Characters: JobTracker
- There is exactly one jobtracker per cluster
- Receives job requests submitted by client
- Schedules and monitors MR jobs on tasktrackers

The Hadoop Cast of Characters: TaskTracker
- There are typically lots of tasktrackers
- Responsible for executing MR operations
- Reads blocks from datanodes

Hadoop Modes of Operation
Hadoop supports three modes of operation:
- Standalone
- Pseudo-distributed
- Fully-distributed

Installing Hadoop
The installation process, for distributed modes:
- Requirements: Linux, Java 1.6, sshd, rsync
- Configure SSH for password-free authentication
- Unpack Hadoop distribution
- Edit a few configuration files
- Format the DFS on the namenode
- Start all the daemon processes

Running Hadoop
The basic steps for running a Hadoop job are:
- Compile your job into a JAR file
- Copy input data into HDFS
- Execute bin/hadoop jar with relevant args
- Monitor tasks via Web interface (optional)
- Examine output when job is complete

A Simple Hadoop Job
I'll demonstrate Hadoop with a simple MapReduce example
- Input is historical data for 30 stocks, 1987-2009
- Records are CSV: symbol, low price, high price, etc.
- Goal is to find largest intra-day price fluctuation
- Output is one record per stock, showing max delta

Our Hadoop Job Illustrated
- Hadoop is all about data processing
- Seeing the data will help to explain the job

Our Hadoop Job Illustrated: Mapper Input

Our Hadoop Job Illustrated: Mapper Output

Our Hadoop Job Illustrated: Reducer Output

Show Me the Code
- Now that you understand the data... let's see the code!

Mapper for StockAnalyzer (part 1)

    public class StockAnalyzerMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, FloatWritable> {

      @Override
      public void map(LongWritable key, Text value,
          OutputCollector<Text, FloatWritable> output, Reporter reporter)
          throws IOException {

        String record = value.toString();

        if (record.startsWith("Symbol")) {
          // ig
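
Because the mapper listing breaks off here, the logic of the stock-analysis job described earlier (one CSV record in, one symbol/delta pair out, then the largest delta per symbol) may be easier to follow as a plain-Java sketch that runs without a Hadoop cluster. This is not the presenter's code: the class name StockDeltaSketch, the helper names, the assumption that the symbol, low price, and high price sit in the first three CSV columns, and the sample records are all made up for illustration. Only the header check on "Symbol" is taken from the listing above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map and reduce phases; no Hadoop API is used.
public class StockDeltaSketch {

    // "Map" step: turn one CSV record into a (symbol, high - low) pair.
    // Returns null for the header line, which starts with "Symbol".
    static Map.Entry<String, Float> map(String record) {
        if (record.startsWith("Symbol")) {
            return null;
        }
        String[] fields = record.split(",");
        String symbol = fields[0];
        float low = Float.parseFloat(fields[1]);   // assumed column position
        float high = Float.parseFloat(fields[2]);  // assumed column position
        return Map.entry(symbol, high - low);
    }

    // "Reduce" step: keep only the largest delta seen for each symbol.
    static Map<String, Float> maxDeltas(List<String> records) {
        Map<String, Float> result = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Float> pair = map(record);
            if (pair != null) {
                result.merge(pair.getKey(), pair.getValue(), Float::max);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, not data from the talk.
        List<String> records = List.of(
            "Symbol,Low,High",
            "AAA,10.0,12.5",
            "AAA,11.0,12.0",
            "BBB,5.0,5.5");
        maxDeltas(records).forEach(
            (symbol, delta) -> System.out.println(symbol + "\t" + delta));
    }
}
```

In a real Hadoop job the two steps would live in separate Mapper and Reducer classes, and the framework would group the mapper output by key across machines; the single loop here only stands in for that grouping.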
