您好,欢迎访问三七文档
当前位置:首页 > IT计算机/网络 > 数据挖掘与识别 > 基于Spark Graphx的大规模用户图计算和应用
基于Graphx的大规模用户图计算淘宝技术部——数据挖掘与计算明风目录Graphx简介和特性图计算场景整体模型,流程和算法调优与改进性能和技巧总结Graphx的发展0.5•2012-10-24•Bagel0.8•2013-09-25•Graphx-Branch0.9•2014-03-03•Graphx-Alpha特性•离线•批量•同步竞争对手•GraphLab•Giraph合作伴侣•Neo4j•TitanGraphx架构PregelGraphLabGraphConnectedComponentsSVDPlusPlusPageRankTriangleCountVertexRDDEdgeRDDRDD[EdgeTriplet]RoutingTableReplicatedVertexViewGraphImplStronglyConnectedComponentsGraphOpsMessageToPartitionPartitionStrategyEdgePartitionVertexPartitionEdgeEdgeTriplet算法模型实现关键点3个核心的RDDVerticesEdgesTriplets3个特性不变性——Immutable分布性——Distributed容错性——Fault-Tolerant边分割和点分割•VertexPartition&EdgePartition•RoutingTablePart.2Part.1VertexTable(RDD)BCADFEADPropertyGraphEdgeTable(RDD)ABACCDBCAEAFEFEDBCDEAFRoutingTable(RDD)BCDEAF121212122DVertexCutHeuristic点分割实现Graphx主要的接口基本信息接口•numEdges•numVertices•degrees(in/out)转换操作•mapVertices•mapEdges•mapTriplets结构操作•reverse•subgraph•mask•groupEdges联边聚合接口•mapReduceTriplets•collectNeighbors缓存操作•cache•unpersistVertices重要接口mapReduceTripletsattr11dstAttr1valsumWeight:VertexRDD[Double]=g.mapReduceTriplets()map(srcId1,weight11)(srcId1,weight12)(srcId1,weight13)(srcId2,weight21)(srcId2,weight22)(srcId2,weight23)reduceVertexRDD[Double](srcId1,sumW1)……(srcId2,sumW2)……attr13dstAttr3srcAttr1attr12dstAttr2MAPtriplet={Array((triplet.srcAttr.id,triplet.attr.weight)).iterator},REDUCE(a,b)=a+b用户图计算的场景基于度分布的中枢节点发现基于最大连通图的社区发现基于三角形计数的关系衡量基于随机游走的用户属性传播……用户能量模型SupervisedRandomWalks:PredictingandRecommendingLinksinSocialNetworkEdge{f1,f2……weight}Vertex{tScore,dScore,weights[],pScores[]}模型设计1.点能量传播2.训练集求AUC3.AUC求偏导4.更新边的权重正能量,负能量都会向周围的点传播,而能量从点i到点j传播的比例是:边(j-i)在点i的所有边的权重和的占比Aij=(1+e-w(xij-q))-1j(1+e-w×(xij-q))-1å使用新的权重调节因子,对图的所有边的权重,再做一次更新S(x;b)=11+e-bX标签集能量传播之后,在训练集上计算AUCAUC(w,q)=iÎPTjÎNTååS(P(i)-P(j))|PT||NT|对AUC求偏导,得到每个维度的权重和偏移量¶AUC(w,q)¶w=iÎPTjÎNT¶S(dij)dij¶P(i)¶w-¶P(j)¶wæèçöø÷ååPTNT模型训练initGraphtrcwbrfrcspwpsswssWeightAdjustorpartialDerivativeAUCPTNTPTTNTTderivative_AUC_cwderivative_AUC_csderivative_AUC_pwderivative_AUC_psderivative_AUC_swderivative_AUC_ssAUC训练次数:3+每次训练Pregel次数:2+6每次平均迭代:15360基于Pregel的随机游走gmessagenewMessagenewVertsnewGgiMaxIter&&activeMessage0mapReduceTripletsmessageinnerJoinouterJoinVerticesmapReduceTripletsdiffVertsdiffPerthreshold内存释放:unpersistVertices3个函数•点消息处理vprog{v.score=msg+(1-ALPHA)*v.weight}•发送消息sendMsg{(destId,ALPHA*edgeTriplet.srcAttr.score*edgeTriplet.attr.weight)}•消息合并mergeMsg{v1+v2}Graphx的性能参考0501002000w2亿20亿秒单次Pregel迭代性能Edges运行环境200Worker5Core/PerWorker30G/PerWorkerGraphx使用技巧循序渐进使用Sample,逐步增大数据量从边生成点,再去关联点属性valedgeRdd=loadEdges(edgefile).sample(false,0.01,1)valverticesRdd=extractVerticesFromEdge(edgeRdd)Graph(verticesRdd,edgeRdd)Graphx使用技巧释放内存保留旧图的引用,并尽快释放prevG=gg=g.outerJoinVertices(newVerts){……}.cache()(……dosomeactiontog……)prevG.unpersistVertices(blocking=false)Graphx使用技巧异常定位•加入带时间点的count,作为调试锚点g=g.outerJoinVertices(newVerts){……}.cache()if(debug_mode){logWithTime(g.vertices.count())}Graphx的配置建议spark.default.parallelismnum-workers*worker-coresmaster-memoryspark.storage.memoryFraction=0.5加入我们•我们需要•复杂网络算法师•Spark攻城师联系方式•微博:@明风Andy谢谢Q&A
本文标题:基于Spark Graphx的大规模用户图计算和应用
链接地址:https://www.777doc.com/doc-3347980 .html