Xia Junluan: Spark, an In-Memory Next-Generation Big Data Analytics Framework

Spark: High-Speed Big Data Analysis Framework
Intel, Andrew Xia
Weibo: Andrew-Xia

Agenda
• Intel contributions to Spark
• Collaboration
• Real world cases
• Summary

Spark Overview
• Open source project initiated by AMPLab at UC Berkeley
• In Apache incubation since June 2013
• Intel closely collaborating with AMPLab & the community on open source development

Contributions by Intel
• Netty-based shuffle for Spark
• Fair Scheduler for Spark
• Spark job log files
• Metrics system for Spark
• Spark shell on YARN
• Spark (standalone mode) integration with secure Hadoop
• Bytecode generation for Shark
• Co-partitioned join in Shark
• ...
Intel China
• 3 committers
• 7 contributors
• 50+ patches

Collaboration Partners
• Intel partnering with several big websites
  – Building next-gen big data analytics using the Spark stack
  – E.g., Alibaba, Baidu iQiyi, Youku, etc.

Big Data in Partners
• Advertising
  – Operation analysis
  – Effect analysis
  – Directional optimization
• Analysis
  – Website report
  – Platform report
  – Monitor system
• Recommendation
  – Ranking list
  – Personal recommendation
  – Hot-click analysis

FAQs

FAQ #1: Poor performance
  – Machine learning and graph computation
  – OLAP for tabular data, interactive query

Hadoop Data Sharing
[Diagram: Input -> iter. 1 -> iter. 2 -> ..., and Input -> query 1 / query 2 / query 3 -> result 1 / result 2 / result 3, each step going through an HDFS read]
• Slow due to replication, serialization, and disk IO

Spark Data Sharing
[Diagram: Input -> one-time processing -> query 1 / query 2 / query 3 ..., and Input -> iter. 1 -> iter. 2 -> ..., with data shared in memory]
• 10-100× faster than network and disk

FAQ #2: Too many big data systems

Big Data Systems Today
• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): …

Vision of Spark Ecosystem
• One stack to rule them all!

Spark Ecosystem
[Diagram: HDFS / Hadoop Storage at the bottom; Mesos and YARN; Tachyon; Spark; on top of Spark: Shark (SQL), Spark Streaming, GraphX (graph-parallel), MLBase (machine learning); alongside: MPI, ..., MapReduce]

FAQ #3: Learning cost

Code Size
[Chart: non-test, non-example source lines, axis from 0 to 140,000, comparing Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark (with Streaming, Shark*, and GraphX). * Shark also calls into Hive.]

FAQ #4: Is Spark stable?

Spark Status
• Spark 0.8 has been released
• Spark 0.9 will be released in Jan 2014

FAQ #5: Not enough memory to cache

Not Enough Memory
• Graceful degradation
• Scheduler takes care of this
• Other options (storage levels)
  – MEMORY_ONLY
  – MEMORY_ONLY_SER
  – MEMORY_AND_DISK
  – DISK_ONLY

FAQ #6: How to recover from failures?

How to Failover?
[Diagram: Input -> one-time processing -> query 1 / query 2 / query 3 ..., and Input -> iter. 1 -> iter. 2 -> ...]
• Lineage: track the graph of transformations that built an RDD
• Checkpoint: for when lineage graphs get large

FAQ #7: Is Spark compatible with the Hadoop ecosystem?

FAQ #8: Do applications need to be ported to Spark?

FAQ #9: Any drawbacks of Spark?

Case #1: Real-Time Log Aggregation
• Logs continuously collected & streamed in
  – Through queuing/messaging systems
• Incoming logs processed in a (semi) streaming fashion
  – Aggregations for different time periods, demographics, etc.
  – Join logs and history tables when necessary
• Aggregation results then consumed in a (semi) streaming fashion
  – Monitoring, alerting, etc.

Real-Time Log Aggregation: Spark Streaming
[Diagram: Log Collectors, Kafka Cluster, Spark Cluster, RDBMS]
• Implications
  – Better streaming framework support
    • Complex (e.g., stateful) analysis, fault-tolerance, etc.
  – Kafka & Spark not collocated
    • DStream retrieves logs in the background (over the network) and caches blocks in memory
  – Memory tuning to reduce GC is critical
    • spark.cleaner.ttl (throughput * spark.cleaner.ttl < Spark free memory size)
    • Storage level (MEMORY_ONLY_SER_2)
  – Lower latency (several seconds)
    • No startup overhead (reusing SparkContext)

Case #2: Machine Learning & Graph Analysis
• Algorithm: complex match operations
  – Mostly matrix-based
    • Multiplication, factorization, etc.
  – Sometimes graph-based
    • E.g., sparse matrix
• Iterative computations
  – Matrix (graph) cached in memory across iterations

Graph Analysis: N-Degree Association
• N-degree association in the graph
  – Computing associations between two vertices that are n hops away
  – E.g., friends of friends
• Graph-parallel implementation
  – Bagel (Pregel on Spark) and GraphX
• Memory optimizations for efficient graph caching are critical
  – Speedup from 20+ minutes to 2 minutes

Graph Analysis: N-Degree Association
[Diagram: vertices v and w, each with an edge to vertex u, shown in three steps]
• Step 1: each vertex holds its current state
  – State[w] = list of Weight(x, w) (for current top K weights to vertex w)
  – State[v] = list of Weight(x, v) (for current top K weights to vertex v)
• Step 2: v and w send messages to u
  – Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w])
  – Messages = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v])
• Step 3: u updates its state
  – State[u] = list of Weight(x, u) (for current top K weights to vertex u)

Summary
• Memory is King!
• One stack to rule them all!
• Contribute to community!
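
The State / Messages / State steps on the N-degree association slides can be illustrated with a small single-machine sketch. This is not the Bagel or GraphX implementation from the talk; it is plain Python under stated assumptions: the function name `n_degree_association`, the edge dictionary format, the initial weight of 1.0 for a vertex reaching itself, and keeping only the strongest message per origin vertex are all hypothetical choices that the slides do not specify.

```python
import heapq
from collections import defaultdict


def n_degree_association(edges, n_hops, top_k):
    """Propagate weights along directed edges for n_hops supersteps.

    edges: dict mapping (src, dst) -> edge weight (hypothetical input format).
    Returns State: dict mapping vertex u -> {origin x: Weight(x, u)},
    keeping at most top_k origins per vertex.
    """
    out_edges = defaultdict(list)
    vertices = set()
    for (src, dst), weight in edges.items():
        out_edges[src].append((dst, weight))
        vertices.update((src, dst))

    # Assumption: at hop 0 every vertex reaches itself with weight 1.0.
    state = {v: {v: 1.0} for v in vertices}

    for _ in range(n_hops):
        inbox = defaultdict(dict)
        # Message step: D(x, u) = Weight(x, w) * edge(w, u) for each Weight(x, w) in State[w].
        for w, weights in state.items():
            for u, edge_weight in out_edges.get(w, []):
                for x, weight_xw in weights.items():
                    d = weight_xw * edge_weight
                    # Assumption: if several messages share origin x, keep the largest.
                    if d > inbox[u].get(x, 0.0):
                        inbox[u][x] = d
        # Update step: State[u] = top K received weights for vertex u.
        state = {
            u: dict(heapq.nlargest(top_k, inbox.get(u, {}).items(), key=lambda kv: kv[1]))
            for u in vertices
        }
    return state


# Tiny example graph: a -> b, b -> c, a -> c.
example_edges = {("a", "b"): 0.5, ("b", "c"): 0.4, ("a", "c"): 0.9}
two_hop = n_degree_association(example_edges, n_hops=2, top_k=2)
print(two_hop["c"])
```

In this sketch the state is replaced at every superstep, so after n supersteps each vertex holds only origins reached by paths of exactly n edges. A distributed version would partition the vertex states and exchange the messages between workers, which is where caching the graph in memory across iterations matters.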


