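© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Real-Time Big Data Analytics on the Apache Spark Software Stack
Jason Dai (戴金权), Chief Architect of Big Data, Intel
2014-12-12

Next-generation big data analytics
• Volume
 – Massive data & exponential growth
• Variety
 – Multi-structured data from different sources & inconsistent data schemas
• Value
 – Simple (SQL): descriptive analytics
 – Complex (non-SQL): predictive analytics
• Velocity
 – Interactive analysis (the speed of thought)
 – Streaming analysis (drinking from the firehose)

The Apache Spark software stack

Project overview
• A research and open source project started by the AMPLab at UC Berkeley
• Intel and the AMPLab (together with the open source community) collaborate closely on the open source development of Spark
 – The collaboration began in 2012 (when Spark was still a research project)
 – Intel currently ranks among the top three worldwide in code contributions to Spark
   • Several committers have come from Intel since the start of the Spark project
• Intel works closely with a number of partners (e.g., large websites)
 – Building next-generation big data analytics on the Apache Spark software stack
 – In particular real-time, in-memory, complex data analytics

Berkeley Data Analytics Stack (BDAS)
[Stack diagram]
  Libraries on Spark: Spark SQL, Spark Streaming, GraphX (graph-parallel), MLlib (machine learning)
  Processing: Spark; MapReduce, MPI, …
  Storage: Tachyon; HDFS / Hadoop Storage
  Cluster management: Mesos, YARN, Standalone

Real-time big data analytics processing
• Next-generation real-time big data analytics architecture
 – Data captured & processed in a (semi) real-time/streaming fashion
 – Data mined using SQL queries as well as complex machine learning & graph analysis
 – Iterative and/or interactive analysis leveraging a distributed in-memory data store
[Architecture diagram]
  Input: Events/Logs
  Components: Messaging/Queue, Stream Processing, In-Memory Store, Low Latency Processing Engine, Low Latency Query Engine
  Storage: Persistent Storage, NoSQL, Data Warehouse
  Workloads: ad-hoc, interactive query & OLAP/BI; iterative machine learning and data mining

Real-time big data analytics on the Apache Spark software stack
[The same architecture diagram, with each component filled in by the Spark stack]
  Messaging/Queue: Kafka
  Stream Processing: Spark Streaming
  In-Memory Store: RDD Cache
  Low Latency Processing Engine: Spark (MLlib, GraphX, etc.)
  Low Latency Query Engine: Spark-SQL
  Persistent Storage: HDFS (alongside NoSQL and Data Warehouse)
  Input and workloads unchanged: Events/Logs; ad-hoc, interactive query & OLAP/BI; iterative machine learning and data mining

Spark Stream-SQL: stream processing + SQL queries
• Supports SQL queries for processing and analyzing input data streams (including in combination with historical data and reference data)
• Built on top of the Spark Streaming and Spark SQL frameworks

Spark Streaming overview
• Discretized Stream (DStream) concept
 – Run streaming computation as a series of very small, deterministic (mini-batch) Spark jobs
   • As frequent as ~1/2 second
 – Better fault tolerance, straggler handling & state consistency
[Diagram: the input stream is cut into intervals (time = 0-1, time = 1-2, …). Each interval's input is an immutable distributed dataset (replicated in memory) that is processed by one Spark (mini-batch) job. Each job's output or state is an immutable distributed dataset stored in memory, and together these form the state/output stream.]

Spark SQL overview
• SQL queries supported on the Spark framework
 – Structured data analysis using SQL queries on Spark
   • Hive tables, Parquet files, etc.
 – Integration with analytics pipelines
• Hive compatibility
 – Directly reading data stored in Hive
 – Writing queries in HiveQL

Running Spark / Spark-SQL on EMR
(Source:)

Spark Streaming + Kinesis
(Source:)

Spark Stream-SQL: a streaming SQL analysis framework
• Users write Stream-SQL queries to process and analyze input data streams
• The framework automatically compiles each Stream-SQL query into a Discretized Stream
• The generated Discretized Stream runs one Spark job per "batch"
 – Conceptually, each job runs the same Spark-SQL query as the Stream-SQL query (with the input "Stream" replaced by an input table)
 – The input table contains the data received over that stream during this "batch" (or the data received in the "current" window)

Stream-SQL queries
  CREATE STREAM IF NOT EXISTS people_stream1 (name STRING, age INT)
    STORED AS LOCATION 'kafka://…';
  CREATE STREAM IF NOT EXISTS people_stream2 (name STRING, zipcode INT)
    STORED AS LOCATION 'kafka://…';

  SELECT count(*) FROM people_stream1 WHERE age >= 10 && age <= 19;

  SELECT zipcode, AVG(age)
    FROM people_stream1 JOIN people_stream2
    ON people_stream1.name = people_stream2.name
    GROUP BY zipcode;

Spark Stream-SQL and Hive compatibility
• Hive: a data warehouse system on the Hadoop platform
• Stream-SQL extends Hive into a data stream management system built on Spark
 – Supports writing queries in HiveQL for streams
 – Streams are created & registered in the Hive MetaStore (just like normal Hive tables)
 – Queries can use both the input data stream and (history/reference) data tables stored in Hive

Stream-SQL queries
  CREATE TABLE IF NOT EXISTS city_table (zipcode INT, city_name STRING);
  CREATE STREAM IF NOT EXISTS people_stream (name STRING, zipcode INT)
    STORED AS LOCATION 'kafka://…';
  ...
  SELECT city_name, count(*)
    FROM people_stream JOIN city_table
    ON people_stream.zipcode = city_table.zipcode
    GROUP BY city_table.city_name;

Spark Stream-SQL development status
• Open source under the Apache 2.0 license
 – Developer preview (based on Spark 1.0) available
• Currently under active development
 – An update based on the latest Spark version will be available soon
 – Many more features & optimizations are being added
 – Plan to contribute back to the main Spark project
Collaboration is welcome!

Tachyon overview
• A reliable, distributed in-memory file system that supports a variety of underlying storage systems
• Provides reliable, memory-speed data sharing across different cluster computing frameworks and jobs

A distributed in-memory file system supporting multiple frameworks
[Diagram: Tachyon sits between the computing frameworks and the underlying storage systems]
  Frameworks: Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …
  Underlying storage: HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, …
(Source:)

Application performance improvement
Performance comparison for a realistic workflow: the workflow ran 4x faster on Tachyon than on MemHDFS. In case of node failure, applications on Tachyon still finish 3.8x faster.
(Source:)

Tachyon tiered storage management
• The current two-level storage architecture in Tachyon
 – Memory across the different servers in the cluster is organized as a cache pool to provide memory-speed data sharing
 – All data is reliably persisted in the underlying file system
• The new tiered storage management in Tachyon
 – The data cache pool manages multiple storage tiers (for different types of storage) to provide memory-speed data sharing
 – Provides efficient support for new storage media (e.g., flash) and/or computing environments (e.g., cloud, HPC)
[Diagram: each server holds three tiers: Ramdisk, Local SSD, Local HDD]

Tachyon tiered storage management (flash SSD case)
Ramdisk Local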
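
Returning to the Stream-SQL execution model from the earlier slides (one job per "batch", each running the same SQL query with the input stream replaced by a table holding that batch's data), the idea can be illustrated without a cluster. The sketch below is only a single-process simulation using Python's standard `sqlite3` module; it is not how Spark Stream-SQL is implemented. The function names `run_batch` and `run_stream` and the sample rows are hypothetical, and the age-range predicate (10 to 19 inclusive) is an assumed reading of the first query on the slides.

```python
import sqlite3

# Assumed reading of the slide's first query; the table name stands in for
# the stream of the same name, as the slides describe.
QUERY = "SELECT count(*) FROM people_stream1 WHERE age >= 10 AND age <= 19"


def run_batch(rows, query=QUERY):
    """Load one mini-batch into a fresh in-memory table and run the query on it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people_stream1 (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO people_stream1 VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def run_stream(batches, query=QUERY):
    """Run the same query once per mini-batch, yielding one result per batch."""
    for rows in batches:
        yield run_batch(rows, query)


if __name__ == "__main__":
    # Hypothetical data: three consecutive batch intervals of (name, age) records.
    batches = [
        [("ann", 12), ("bob", 35)],
        [("cat", 19), ("dan", 10), ("eve", 9)],
        [],
    ]
    for i, result in enumerate(run_stream(batches)):
        print(f"batch {i}: {result}")
```

Each batch gets its own table, so a result reflects only the records received in that interval. A windowed query would instead load the rows of the last few batches into the table before running the same query.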