Deep Dive – Amazon Elastic MapReduce
韩小勇

Why Amazon EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Elastic: Easily add or remove capacity
• Reliable: Spend less time monitoring
• Secure: Managed Firewalls
• Flexible: You control the cluster

Easy to deploy
• AWS Console
• Command Line
• Or use the EMR API with your favorite SDK

Easy to monitor and debug
• Monitor
• Debug
• Integrated with Amazon CloudWatch
• Monitor Cluster, Node and IO

Try different configurations to find your optimal architecture
Choose your instance types
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Workloads shown: Batch process, Machine learning, Spark and interactive, Large HDFS

Resizable clusters
• Easy to add and remove compute capacity on your cluster.
• Match compute demands with cluster sizing.

Easy to use Spot Instances
• Spot for task nodes: Up to 90% off EC2 on-demand pricing
• On-demand for core nodes: Standard EC2 pricing for on-demand capacity
• Meet SLA at predictable cost
• Exceed SLA at lower cost

Amazon EMR Integration with Amazon Kinesis
• Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
• No Intermediate Data Persistence Required
• Simple way to introduce real time sources into Batch Oriented Systems
• Multi-Application Support & Automatic Checkpointing

The Hadoop ecosystem can run in Amazon EMR
Use bootstrap actions to install applications…

• Amazon S3
  – Designed for 99.999999999% durability
  – Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3

EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to "s3://"
• Consistent view
  – For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata

EMRFS client-side encryption
[Diagram: Amazon S3 (client-side encrypted objects); Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; Key vendor (AWS KMS or your custom key vendor)]

Amazon S3 EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Fast listing of Amazon S3 objects using EMRFS metadata*

Number of objects    Without Consistent Views    With Consistent Views
1,000,000            147.72                      29.70
100,000              12.70                       3.69

*Tested using a single node cluster with an m3.xlarge instance.

HDFS is still there if you need it
• Iterative workloads
  – If you're processing the same dataset more than once
  – Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing

Amazon EMR – Design Patterns

EMR example #1: Batch Processing
• GB of logs pushed to S3 hourly
• Daily EMR cluster using Hive to process data
• Input and output stored in S3
• 250 Amazon EMR jobs per day, processing 30 TB of data

Long-running Cluster
• Data pushed to S3
• Daily EMR cluster ETL data into database
• 24/7 EMR cluster running HBase holds last 2 years of data
• Front-end service uses HBase cluster to power dashboard with high concurrency

EMR example #3: Interactive query
• TBs of logs sent daily
• Logs stored in Amazon S3
• Hive Metastore on Amazon EMR
• Interactive query using Presto on Multi-petabyte warehouse

Streaming data processing
• TBs of logs sent daily
• Logs stored in Amazon Kinesis
[Diagram: Amazon Kinesis Client Library; AWS Lambda; Amazon EMR; Amazon EC2]

Optimizations for storage

File formats
• Row Oriented
  – Text Files
  – Sequence Files
    • Writable object
  – Avro Data Files
    • Described by schema
• Columnar Format
  – Object Record Columnar (ORC)
  – Parquet
[Diagram: Logical Table; Row oriented; Column oriented]

Choosing the right file format
• Processing and query tools
  – Hive, Impala and Presto
• Evolution of schema
  – Avro for Schema and Presto for Storage
• File format "splittability"
  – Avoid JSON/XML Files. Use them as records
• Compression
  – Block or File

File sizes
• Avoid small files
  – Avoid anything smaller than 100MB
• Each mapper processes a single File
• Fewer files, matching closely to block size
  – Fewer calls to Amazon S3
  – Fewer network/HDFS requests

Dealing with Small Files
• Reduce HDFS Block Size, e.g. 1MB (default is 128MB)
  – --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together
  – S3DistCp takes a pattern and target path to combine smaller input files to larger ones
  – Supply a target size and compression codec

Compression
• Always Compress Data Files On Amazon S3
  – Reduces network traffic between Amazon S3 and Amazon EMR
  – Speeds Up Your Job
• Compress Mappers and Reducer Output
Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2

Choosing the right Compression
• Time sensitive, faster compressions are a better choice
• Large amount of data, use space efficient compressions
• Combined Workload, use gzip

Algorithm         Splittable?    Compression ratio    Compress + Decompress speed
Gzip (DEFLATE)    No             High                 Medium
bzip2             Yes            Very high            Slow
LZO               Yes            Low                  Fast
Snappy            No             Low                  Very fast

Cost saving tips for Amazon EMR
• Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save 80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for N hours

DEMO: Reading Twitter Stream and show Top 10 #topics every minute. Using EMR Spark and Scala. Show the feature of “eas
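
The guidance from the "Choosing the right Compression" slide can be sketched as a small lookup. This is only an illustration of the slide's table and three rules of thumb, not an EMR or Hadoop API: the `Codec` class, the `choose_codec` function, the priority names, and the numeric ranks standing in for "Low/High/Very high" and "Slow/Medium/Fast/Very fast" are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    splittable: bool
    ratio: int  # assumed rank: 1 = low, 2 = high, 3 = very high
    speed: int  # assumed rank: 1 = slow, 2 = medium, 3 = fast, 4 = very fast


# Rows copied from the slide's comparison table; the ranks are an assumed encoding.
CODECS = [
    Codec("gzip", splittable=False, ratio=2, speed=2),
    Codec("bzip2", splittable=True, ratio=3, speed=1),
    Codec("lzo", splittable=True, ratio=1, speed=3),
    Codec("snappy", splittable=False, ratio=1, speed=4),
]


def choose_codec(priority: str, need_splittable: bool = False) -> str:
    """Pick a codec name following the slide's three rules of thumb.

    priority: "speed" (time sensitive), "space" (large amount of data),
              or "balanced" (combined workload).
    """
    candidates = [c for c in CODECS if c.splittable or not need_splittable]
    if priority == "speed":
        return max(candidates, key=lambda c: c.speed).name
    if priority == "space":
        return max(candidates, key=lambda c: c.ratio).name
    if priority == "balanced":
        if need_splittable:
            # The slide's pick for combined workloads is gzip, which the
            # table lists as not splittable, so there is no answer here.
            raise ValueError("gzip, the balanced pick, is not splittable")
        return "gzip"
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    for p in ("speed", "space", "balanced"):
        print(p, "->", choose_codec(p))
```

Passing `need_splittable=True` restricts the choice to the codecs the table marks as splittable, which matters when a single large compressed file should be processed by more than one mapper.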