Spark: High-Speed Big Data Analysis Framework
Intel
Andrew Xia
Weibo: Andrew-Xia

Agenda
• Intel contributions to Spark
• Collaboration
• Real world cases
• Summary

Spark Overview
• Open source project initiated by AMPLab at UC Berkeley
• Apache incubation since June 2013
• Intel closely collaborating with AMPLab & the community on open source development

Contributions by Intel
• Netty based shuffle for Spark
• Fair Scheduler for Spark
• Spark job log files
• Metrics system for Spark
• Spark shell on YARN
• Spark (standalone mode) integration with secure Hadoop
• Bytecode generation for Shark
• Co-partitioned join in Shark
• ...

Intel China
• 3 committers
• 7 contributors
• 50+ patches

Collaboration Partners
• Intel partnering with several big websites
  – Building next-gen big data analytics using the Spark stack
  – E.g., Alibaba, Baidu iQiyi, Youku, etc.

Big Data in Partners
• Advertising
  – Operation analysis
  – Effect analysis
  – Directional optimization
• Analysis
  – Website report
  – Platform report
  – Monitor system
• Recommendation
  – Ranking list
  – Personal recommendation
  – Hot-click analysis

FAQs

FAQ #1: Poor performance
  – Machine learning and graph computation
  – OLAP for tabular data, interactive query

Hadoop Data Sharing
[Figure: iterations (iter. 1, iter. 2, ...) and queries (query 1 to 3, producing result 1 to 3) each read their input from HDFS]
Slow due to replication, serialization, and disk IO

Spark Data Sharing
[Figure: the input goes through one-time processing, then serves iter. 1, iter. 2, ... and query 1, query 2, query 3, ...]
10-100× faster than network and disk

FAQ #2: Too many big data systems

Big Data Systems Today
[Figure: MapReduce for general batch processing; specialized systems for iterative, interactive and streaming apps]

Vision of Spark Ecosystem
One stack to rule them all!

Spark Ecosystem
[Figure: stack diagram with HDFS / Hadoop Storage, Mesos, YARN, Tachyon, Spark, Shark (SQL), Spark Streaming, GraphX (graph-parallel), MLBase (machine learning), MPI, ..., MapReduce]

FAQ #3: Study cost

Code Size
[Figure: bar chart of non-test, non-example source lines (axis from 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph) and Spark; GraphX, Shark* and Streaming are marked on the Spark bar]
* also calls into Hive

FAQ #4: Is Spark stable?

Spark Status
• Spark 0.8 has been released
• Spark 0.9 will be released in Jan 2014

FAQ #5: Not enough memory to cache

Not Enough Memory
• Graceful degradation
• Scheduler takes care of this
• Other options
  – MEMORY_ONLY
  – MEMORY_ONLY_SER
  – MEMORY_AND_DISK
  – DISK_ONLY

FAQ #6: How to recover from failure?

How to Failover?
[Figure: the input goes through one-time processing, then serves iter. 1, iter. 2, ... and query 1, query 2, query 3, ...]
• Lineage: track the graph of transformations that built the RDD
• Checkpoint: lineage graphs get large

FAQ #7: Is Spark compatible with the Hadoop ecosystem?

FAQ #8: Need to port to Spark?

FAQ #9: Any cons about Spark?

Case #1: Real-Time Log Aggregation
• Logs continuously collected & streamed in
  – Through queuing/messaging systems
• Incoming logs processed in a (semi) streaming fashion
  – Aggregations for different time periods, demographics, etc.
  – Join logs and history tables when necessary
• Aggregation results then consumed in a (semi) streaming fashion
  – Monitoring, alerting, etc.

Real-Time Log Aggregation: Spark Streaming
[Figure: Log Collectors, Kafka Cluster, Spark Cluster, RDBMS]
• Implications
  – Better streaming framework support
    • Complex (e.g., stateful) analysis, fault-tolerance, etc.
  – Kafka & Spark not collocated
    • DStream retrieves logs in background (over network) and caches blocks in memory
  – Memory tuning to reduce GC is critical
    • spark.cleaner.ttl (throughput * spark.cleaner.ttl < Spark free memory size)
    • Storage level (MEMORY_ONLY_SER_2)
  – Lower latency (several seconds)
    • No startup overhead (reusing SparkContext)

Case #2: Machine Learning & Graph Analysis
• Algorithm: complex math operations
  – Mostly matrix based
    • Multiplication, factorization, etc.
  – Sometimes graph-based
    • E.g., sparse matrix
• Iterative computations
  – Matrix (graph) cached in memory across iterations

Graph Analysis: N-Degree Association
• N-degree association in the graph
  – Computing associations between two vertices that are n hops away
  – E.g., friends of a friend
• Graph-parallel implementation
  – Bagel (Pregel on Spark) and GraphX
• Memory optimizations for efficient graph caching are critical
  – Speedup from 20+ minutes to 2 minutes

Graph Analysis: N-Degree Association
[Figure: three frames of a graph in which vertices v and w each have an edge to vertex u]
• Frame 1, state held by the senders
  – State[w] = list of Weight(x, w) (for current top K weights to vertex w)
  – State[v] = list of Weight(x, v) (for current top K weights to vertex v)
• Frame 2, messages sent to u
  – Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w])
  – Messages = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v])
• Frame 3, state updated at the receiver
  – State[u] = list of Weight(x, u) (for current top K weights to vertex u)

Summary
• Memory is King!
• One stack to rule them all!
• Contribute to community!
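
The N-degree association slides describe a Pregel-style loop: each vertex holds its current top K weights, sends D(x, u) = Weight(x, w) * edge(w, u) along every outgoing edge, and the receiver keeps its own top K. The following is a minimal single-machine sketch of that loop in plain Python, not the Bagel or GraphX code used in the talk. The function names, the initial self-weight of 1.0, the choice to sum messages that arrive for the same source x, and the choice to keep only exact n-hop weights are all assumptions made for illustration; the slides do not specify them.

```python
from collections import defaultdict
import heapq


def top_k(weights, k):
    """Keep the k largest (source, weight) entries of a dict."""
    return dict(heapq.nlargest(k, weights.items(), key=lambda item: item[1]))


def n_degree_association(edges, n, k):
    """Return state[u] = {x: weight} for sources x that reach u in exactly n hops.

    edges maps (src, dst) -> edge weight. Hypothetical helper, not a Spark API.
    """
    vertices = {v for edge in edges for v in edge}
    # Assumption: at hop 0 every vertex is associated with itself, weight 1.0.
    state = {v: {v: 1.0} for v in vertices}

    for _ in range(n):
        # Each vertex w sends D(x, u) = Weight(x, w) * edge(w, u) to neighbour u.
        inbox = defaultdict(lambda: defaultdict(float))
        for (w, u), edge_weight in edges.items():
            for x, weight_xw in state[w].items():
                # Assumption: messages for the same source x are summed.
                inbox[u][x] += weight_xw * edge_weight
        # Each receiver keeps only its current top K weights.
        state = {u: top_k(inbox[u], k) for u in vertices}
    return state


if __name__ == "__main__":
    # Toy "friends of a friend" graph with made-up weights.
    toy_edges = {("a", "b"): 0.5, ("b", "c"): 0.4, ("a", "c"): 0.9, ("c", "d"): 1.0}
    print(n_degree_association(toy_edges, n=2, k=2))
```

Bounding each vertex's state to K entries is what keeps the per-vertex memory fixed across iterations, which fits the talk's point that memory-efficient graph caching was critical for this workload.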
Title: 夏俊鸾: Spark, an In-Memory Next-Generation Big Data Analysis Framework
Link: https://www.777doc.com/doc-3558859 .html