T6-S2-P1-韩小勇

Deep Dive – Amazon Elastic MapReduce
韩小勇

Why Amazon EMR?
• Easy to Use – Launch a cluster in minutes
• Low Cost – Pay an hourly rate
• Elastic – Easily add or remove capacity
• Reliable – Spend less time monitoring
• Secure – Managed firewalls
• Flexible – You control the cluster

Easy to deploy
• AWS Console
• Command Line
• Or use the EMR API with your favorite SDK

Easy to monitor and debug
• Monitor
• Debug
• Integrated with Amazon CloudWatch
• Monitor Cluster, Node and IO

Choose your instance types
Try different configurations to find your optimal architecture
• CPU – c3 family, cc1.4xlarge, cc2.8xlarge
• Memory – m2 family, r3 family
• Disk/IO – d2 family, i2 family
• General – m1 family, m3 family
Workloads: Batch process, Machine learning, Spark and interactive, Large HDFS

Resizable clusters
• Easy to add and remove compute capacity on your cluster.
• Match compute demands with cluster sizing.

Easy to use Spot Instances
• Spot for task nodes – Up to 90% off EC2 on-demand pricing
• On-demand for core nodes – Standard EC2 pricing for on-demand capacity
• Meet SLA at predictable cost
• Exceed SLA at lower cost

Amazon EMR Integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
• No intermediate data persistence required
• Simple way to introduce real-time sources into batch-oriented systems
• Multi-application support & automatic checkpointing

The Hadoop ecosystem can run in Amazon EMR
Use bootstrap actions to install applications…

• Amazon S3
  – Designed for 99.999999999% durability
  – Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3

EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to “s3://”
• Consistent view – For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata

EMRFS client-side encryption
• Amazon S3
• Amazon S3 encryption clients
• EMRFS enabled for Amazon S3 client-side encryption
• Key vendor (AWS KMS or your custom key vendor)
• (client-side encrypted objects)

Amazon S3 EMRFS metadata in Amazon DynamoDB
• List and read-after-write consistency
• Faster list operations

Fast listing of Amazon S3 objects using EMRFS metadata*

Number of objects    Without Consistent Views    With Consistent Views
1,000,000            147.72                      29.70
100,000              12.70                       3.69

*Tested using a single node cluster with a m3.xlarge instance.

HDFS is still there if you need it
• Iterative workloads
  – If you’re processing the same dataset more than once
  – Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing

Amazon EMR – Design Patterns

EMR example #1: Batch Processing
• GB of logs pushed to S3 hourly
• Daily EMR cluster using Hive to process data
• Input and output stored in S3
• 250 Amazon EMR jobs per day, processing 30 TB of data

EMR example #2: Long-running Cluster
• Data pushed to S3
• Daily EMR cluster ETL data into database
• 24/7 EMR cluster running HBase holds last 2 years of data
• Front-end service uses HBase cluster to power dashboard with high concurrency

EMR example #3: Interactive query
• TBs of logs sent daily
• Logs stored in Amazon S3
• Hive Metastore on Amazon EMR
• Interactive query using Presto on multi-petabyte warehouse

EMR example #4: Streaming data processing
• TBs of logs sent daily
• Logs stored in Amazon Kinesis
• Amazon Kinesis Client Library
• AWS Lambda
• Amazon EMR
• Amazon EC2

Optimizations for storage

File formats
• Row Oriented
  – Text Files
  – Sequence Files
    • Writable object
  – Avro Data Files
    • Described by schema
• Columnar Format
  – Object Record Columnar (ORC)
  – Parquet
Diagram: Logical Table, Row oriented, Column oriented

Choosing the right file format
• Processing and query tools – Hive, Impala and Presto
• Evolution of schema – Avro for Schema and Presto for Storage
• File format “splittability” – Avoid JSON/XML Files. Use them as records
• Compression – Block or File

File sizes
• Avoid small files – Avoid anything smaller than 100MB
• Each mapper processes a single File
• Fewer files, matching closely to block size
  – Fewer calls to Amazon S3
  – Fewer network/HDFS requests

Dealing with Small Files
• Reduce HDFS Block Size, e.g. 1MB (default is 128MB)
  – --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args “-m,dfs.block.size=1048576”
• Better: use S3DistCp to combine smaller files together
  – S3DistCp takes a pattern and target path to combine smaller input files to larger ones
  – Supply a target size and compression codec

Compression
• Always Compress Data Files On Amazon S3
  – Reduces network traffic between Amazon S3 and Amazon EMR
  – Speeds Up Your Job
• Compress Mappers and Reducer Output
Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2

Choosing the right Compression
• Time sensitive, faster compressions are a better choice
• Large amount of data, use space efficient compressions
• Combined Workload, use gzip

Algorithm         Splittable?    Compression ratio    Compress + Decompress speed
Gzip (DEFLATE)    No             High                 Medium
bzip2             Yes            Very high            Slow
LZO               Yes            Low                  Fast
Snappy            No             Low                  Very fast

Cost saving tips for Amazon EMR
• Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save 80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for N hours

DEMO: Reading Twitter Stream and show Top 10 #topics every minute. Using EMR Spark and Scala. Show the feature of “eas
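
The counting logic behind the demo (top 10 #topics per minute) can be sketched without a cluster. The block below is not the demo's Spark/Scala code, which is not shown in the deck; it is a plain-Python illustration of the same idea, assuming tweets arrive as (epoch seconds, text) pairs. The function name top_hashtags_per_minute and the sample data are hypothetical.

```python
import re
from collections import Counter, defaultdict

# A hashtag is "#" followed by one or more word characters.
HASHTAG = re.compile(r"#\w+")


def top_hashtags_per_minute(tweets, n=10):
    """Group tweets into one-minute windows and rank hashtags in each window.

    tweets: iterable of (epoch_seconds, text) pairs (an assumed input shape).
    Returns a dict mapping the window start (epoch seconds, multiple of 60)
    to a list of (hashtag, count) pairs, most frequent first.
    """
    windows = defaultdict(Counter)
    for timestamp, text in tweets:
        window_start = int(timestamp) // 60 * 60
        # Lower-case so "#Spark" and "#spark" count as the same topic.
        windows[window_start].update(tag.lower() for tag in HASHTAG.findall(text))
    return {start: counts.most_common(n) for start, counts in sorted(windows.items())}


# Hypothetical sample: three tweets in the first minute, two in the second.
sample_tweets = [
    (0, "Loving #EMR and #Spark"),
    (30, "#spark on #emr is fast"),
    (59, "#Spark streaming"),
    (60, "#Hive query"),
    (61, "more #hive and #presto"),
]

for start, ranking in top_hashtags_per_minute(sample_tweets).items():
    print(start, ranking)
```

In a streaming engine the same grouping would typically be expressed as a one-minute window over the incoming stream followed by a count per key; the fixed-window arithmetic here only stands in for that step.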
