MEAP Edition
Manning Early Access Program
Copyright 2010 Manning Publications
For more information on this and other Manning titles go to

– A Distributed Programming Framework

Part 1 of this book introduces the basics for understanding and using Hadoop. We describe the hardware components that make up a Hadoop cluster, as well as the installation and configuration to create a working system. We cover the MapReduce framework at a high level and get your first MapReduce program up and running.

1 Introducing Hadoop

This chapter covers
■ The basics of writing a scalable, distributed data-intensive program
■ Understanding Hadoop and MapReduce
■ Writing and running a basic MapReduce program

Today, we’re surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. You may even be reading this book as digital data on your computer screen, and certainly your purchase of this book is recorded as data with some retailer.[1]

[1] Of course, you’re reading a legitimate copy of this, right?

The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate to process such large data sets. Google was the first to publicize MapReduce, a system they had used to scale their data processing needs. This system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Many more traditional businesses, such as media and telecom, are beginning to adopt this system too. Our case studies in chapter 12 will describe how companies including New York Times, China Mobile, and IBM are using Hadoop.

Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. An effective programmer, today, must have knowledge of relational databases, networking, and security, all of which were considered optional skills a couple decades ago. Similarly, basic understanding of distributed data processing will soon become an essential part of every programmer’s toolbox. Leading universities, such as Stanford and CMU, have already started introducing Hadoop into their computer science curriculum. This book will help you, the practicing programmer, get up to speed on Hadoop quickly and start using it to process your data sets.

This chapter introduces Hadoop more formally, positioning it in terms of distributed systems and data processing systems. It gives an overview of the MapReduce programming model. A simple word counting example with existing tools highlights the challenges around processing data at large scale. You’ll implement that example using Hadoop to gain a deeper appreciation of Hadoop’s simplicity. We’ll also discuss the history of Hadoop and some perspectives on the MapReduce paradigm. But let me first briefly explain why I wrote this book and why it’s useful to you.

1.1 Why “Hadoop in Action”?

Speaking from experience, I first found Hadoop to be tantalizing in its possibilities, yet frustrating to progress beyond coding the basic examples. The documentation at the official Hadoop site is fairly comprehensive, but it isn’t always easy to find straightforward answers to straightforward questions.

The purpose of writing the book is to address this problem. I won’t focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice.

1.2 What is Hadoop?

Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is

■ Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple: Hadoop allows users to quickly write efficient parallel code.

Figure 1.1 A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.

Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create
Document title: 北风网 Hadoop in Action
Link: https://www.777doc.com/doc-4312165 .html