2 The Basics of a MapReduce Job

2.1 The Parts of a Hadoop MapReduce Job
    2.1.1 Input Splitting
    2.1.2 A Simple Map Function: IdentityMapper
2.2 Configuring a Job
    2.2.1 Specifying Input Formats
    2.2.2 Setting the Output Parameters
    2.2.3 Configuring the Reduce Phase
2.3 Running a Job
2.4 Creating a Custom Mapper and Reducer

In this chapter we look at a MapReduce job as a whole. After reading it, you will be able to write and run MapReduce job programs in standalone (local) mode. The examples in this chapter assume that you have completed the setup described in Chapter 1. You can run them on a single machine using a dedicated local-mode configuration; you do not need to start the Hadoop Core framework. The local-mode configuration is ideal for debugging and unit testing. You can download the example code from this book's page on the Apress website. The download also includes a JAR file for running the examples. Let's begin by looking at the essential parts of a MapReduce job.

2.1 The Parts of a Hadoop MapReduce Job

The user configures a MapReduce job (or simply, a job) and submits it to the framework. A MapReduce job consists of a set of map tasks, a shuffle, a sort, and a set of reduce tasks. The framework then manages the distribution and execution of the job, collects the output, and reports the result back to the user.

The user is responsible for setting up the job, specifying the input location and the input itself, and ensuring that the input format and location are correct. The framework is responsible for distributing the job among the TaskTracker nodes of the cluster; running the map, shuffle, sort, and reduce phases; writing the output to the output directory; and informing the user of the job's completion status.

All of the examples in this chapter are based on the file MapReduceIntro.java, shown in Listing 2-1. The job built by this code reads its input line by line and sorts the lines on the portion of each line before the first tab character; if a line contains no tab character, the framework sorts on the entire line. MapReduceIntro.java is a simple example of configuring and running a MapReduce job.

Listing 2-1. MapReduceIntro.java

package com.apress.hadoopbook.examples.ch2;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.log4j.Logger;

/**
 * A very simple MapReduce example that reads textual input where each record is
 * a single line, and sorts all of the input lines into a single output file.
 *
 * The records are parsed into Key and Value using the first TAB character as a
 * separator. If there is no TAB character the entire line is the Key.
 *
 * @author Jason Venner
 */
public class MapReduceIntro {
    protected static Logger logger = Logger.getLogger(MapReduceIntro.class);

    /**
     * Configure and run the MapReduceIntro job.
     *
     * @param args
     *            Not used.
     */
    public static void main(final String[] args) {
        try {
            /**
             * Construct the job conf object that will be used to submit this
             * job to the Hadoop framework. Ensure that the jar or directory
             * that contains MapReduceIntroConfig.class is made available to all
             * of the TaskTracker nodes that will run maps or reduces for this
             * job.
             */
            final JobConf conf = new JobConf(MapReduceIntro.class);

            /**
             * Take care of some housekeeping to ensure that this simple example
             * job will run.
             */
            MapReduceIntroConfig.exampleHouseKeeping(conf,
                    MapReduceIntroConfig.getInputDirectory(),
                    MapReduceIntroConfig.getOutputDirectory());

            /**
             * This section is the actual job configuration portion.
             */
            /**
             * Configure the inputDirectory and the type of input. In this case
             * we are stating that the input is text, each record is a single
             * line, and the first TAB is the separator between the key and the
             * value of the record.
             */
            conf.setInputFormat(KeyValueTextInputFormat.class);
            FileInputFormat.setInputPaths(conf,
                    MapReduceIntroConfig.getInputDirectory());

            /**
             * Inform the framework that the mapper class will be the
             * {@link IdentityMapper}. This class simply passes the input Key
             * Value pairs directly to its output, which in our case will be the
             * shuffle.
             */
            conf.setMapperClass(IdentityMapper.class);

            /**
             * Configure the output of the job to go to the output directory.
             * Inform the framework that the Output Key and Value classes will
             * be {@link Text} and the output file format will be
             * {@link TextOutputFormat}. The TextOutputFormat class produces a
             * record of output for each Key, Value pair, with the following
             * format: Formatter.format("%s\t%s%n", key.toString(),
             * value.toString());.
             *
             * In addition, indicate to the framework that there will be 1
             * reduce. This results in all input keys being placed into the
             * same, single, partition, and the final output being a single
             * sorted file.
             */
            FileOutputFormat.setOutputPath(conf,
                    MapReduceIntroConfig.getOutputDirectory());
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setNumReduceTasks(1);

            /**
             * Inform the framework that the reducer class will be the
             * {@link IdentityReducer}. This class simply writes an output
             * record key, value record for each value in the key, value set it
             * receives as input. The value ordering is arbitrary.
             */
            conf.setReducerClass(IdentityReducer.class);

            logger.info("Launching the job.");
            /**
             * Send the job configuration to the framework and request that the
             * job be run.
             */
            final RunningJob job = JobClient.runJob(conf);
            logger.info("The job has completed.");
            if (!job.isSuccessful()) {
                logger.error("The job failed.");
                System.exit(1);
            }
        } catch (final IOException e) {
            // The source listing is cut off after System.exit; this minimal
            // handler is required because JobClient.runJob throws IOException.
            logger.error("The job has failed due to an IO error", e);
            System.exit(1);
        }
    }
}
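The listing above depends on the rule that KeyValueTextInputFormat splits each input line at the first tab character: everything before the tab is the key, everything after it is the value, and a line with no tab becomes a key with an empty value. As a rough illustration of that rule outside Hadoop, here is a minimal plain-Java sketch; KeyValueSplit is a hypothetical helper written for this example, not part of the Hadoop API.

```java
public class KeyValueSplit {
    /**
     * Split a line on its first TAB character, mirroring the splitting rule
     * described for KeyValueTextInputFormat: the text before the first TAB is
     * the key, the text after it is the value. If there is no TAB, the whole
     * line is the key and the value is empty.
     */
    public static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("alpha\tbeta\tgamma");
        // Only the first TAB separates key from value; later TABs stay in the value.
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Note that only the first tab matters: a line such as "alpha\tbeta\tgamma" yields the key "alpha" and the value "beta\tgamma", which is why the job in Listing 2-1 sorts on the portion of each line before the first tab.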