1.GraphX提供了几种方式从RDD或者磁盘上的顶点和边集合构造图?
2.PageRank算法在图中发挥什么作用?
3.三角形计数算法的作用是什么?
Spark中文手册-编程指南
Spark之一个快速的例子Spark之基本概念
Spark之基本概念
Spark之基本概念(2)
Spark之基本概念(3)
Spark-sql由入门到精通
Spark-sql由入门到精通续
spark GraphX编程指南(1)
Pregel API
- class GraphOps[VD, ED] {
- def pregel[A]
- (initialMsg: A,
- maxIter: Int = Int.MaxValue,
- activeDir: EdgeDirection = EdgeDirection.Out)
- (vprog: (VertexId, VD, A) => VD,
- sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
- mergeMsg: (A, A) => A)
- : Graph[VD, ED] = {
- // Receive the initial message at each vertex
- var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
- // compute the messages
- var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
- var activeMessages = messages.count()
- // Loop until no messages remain or maxIterations is achieved
- var i = 0
- while (activeMessages > 0 && i < maxIterations) {
- // Receive the messages: -----------------------------------------------------------------------
- // Run the vertex program on all vertices that receive messages
- val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
- // Merge the new vertex values back into the graph
- g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
- // Send Messages: ------------------------------------------------------------------------------
- // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
- // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked
- // on edges in the activeDir of vertices in newVerts
- messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
- activeMessages = messages.count()
- i += 1
- }
- g
- }
- }
复制代码
- import org.apache.spark.graphx._
- // Import random graph generation library
- import org.apache.spark.graphx.util.GraphGenerators
- // A graph with edge attributes containing distances
- val graph: Graph[Int, Double] =
- GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
- val sourceId: VertexId = 42 // The ultimate source
- // Initialize the graph such that all vertices except the root have distance infinity.
- val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
- val sssp = initialGraph.pregel(Double.PositiveInfinity)(
- (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
- triplet => { // Send Message
- if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
- Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
- } else {
- Iterator.empty
- }
- },
- (a,b) => math.min(a,b) // Merge Message
- )
- println(sssp.vertices.collect.mkString("\n"))
复制代码
图构造者
- object GraphLoader {
- def edgeListFile(
- sc: SparkContext,
- path: String,
- canonicalOrientation: Boolean = false,
- minEdgePartitions: Int = 1)
- : Graph[Int, Int]
- }
复制代码
- # This is a comment
- 2 1
- 4 1
- 1 2
复制代码
- object Graph {
- def apply[VD, ED](
- vertices: RDD[(VertexId, VD)],
- edges: RDD[Edge[ED]],
- defaultVertexAttr: VD = null)
- : Graph[VD, ED]
- def fromEdges[VD, ED](
- edges: RDD[Edge[ED]],
- defaultValue: VD): Graph[VD, ED]
- def fromEdgeTuples[VD](
- rawEdges: RDD[(VertexId, VertexId)],
- defaultValue: VD,
- uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
- }
复制代码
顶点和边RDDs
VertexRDDs
- class VertexRDD[VD] extends RDD[(VertexID, VD)] {
- // Filter the vertex set but preserves the internal index
- def filter(pred: Tuple2[VertexId, VD] => Boolean): VertexRDD[VD]
- // Transform the values without changing the ids (preserves the internal index)
- def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
- def mapValues[VD2](map: (VertexId, VD) => VD2): VertexRDD[VD2]
- // Remove vertices from this set that appear in the other set
- def diff(other: VertexRDD[VD]): VertexRDD[VD]
- // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
- def leftJoin[VD2, VD3](other: RDD[(VertexId, VD2)])(f: (VertexId, VD, Option[VD2]) => VD3): VertexRDD[VD3]
- def innerJoin[U, VD2](other: RDD[(VertexId, U)])(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
- // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
- def aggregateUsingIndex[VD2](other: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
- }
复制代码
- val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
- val rddB: RDD[(VertexId, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
- // There should be 200 entries in rddB
- rddB.count
- val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
- // There should be 100 entries in setB
- setB.count
- // Joining A and B should now be fast!
- val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
复制代码
EdgeRDDs
- // Transform the edge attributes while preserving the structure
- def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
- // Revere the edges reusing both attributes and structure
- def reverse: EdgeRDD[ED]
- // Join two `EdgeRDD`s partitioned using the same partitioning strategy.
- def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexId, VertexId, ED, ED2) => ED3): EdgeRDD[ED3]
复制代码
图算法
PageRank算法
- // Load the edges as a graph
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Run PageRank
- val ranks = graph.pageRank(0.0001).vertices
- // Join the ranks with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ranksByUsername = users.join(ranks).map {
- case (id, (username, rank)) => (username, rank)
- }
- // Print the result
- println(ranksByUsername.collect().mkString("\n"))
复制代码
连通体算法
- / Load the graph as in the PageRank example
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Find the connected components
- val cc = graph.connectedComponents().vertices
- // Join the connected components with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val ccByUsername = users.join(cc).map {
- case (id, (username, cc)) => (username, cc)
- }
- // Print the result
- println(ccByUsername.collect().mkString("\n"))
复制代码
三角形计数算法
- // Load the edges in canonical order and partition the graph for triangle count
- val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
- // Find the triangle count for each vertex
- val triCounts = graph.triangleCount().vertices
- // Join the triangle counts with the usernames
- val users = sc.textFile("graphx/data/users.txt").map { line =>
- val fields = line.split(",")
- (fields(0).toLong, fields(1))
- }
- val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
- (username, tc)
- }
- // Print the result
- println(triCountByUsername.collect().mkString("\n"))
复制代码
例子
- // Connect to the Spark cluster
- val sc = new SparkContext("spark://master.amplab.org", "research")
- // Load my user data and parse into tuples of user id and attribute list
- val users = (sc.textFile("graphx/data/users.txt")
- .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
- // Parse the edge data which is already in userId -> userId format
- val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
- // Attach the user attributes
- val graph = followerGraph.outerJoinVertices(users) {
- case (uid, deg, Some(attrList)) => attrList
- // Some users may not have attributes so we set them as empty
- case (uid, deg, None) => Array.empty[String]
- }
- // Restrict the graph to users with usernames and names
- val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
- // Compute the PageRank
- val pagerankGraph = subgraph.pageRank(0.001)
- // Get the attributes of the top pagerank users
- val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
- case (uid, attrList, Some(pr)) => (pr, attrList.toList)
- case (uid, attrList, None) => (0.0, attrList.toList)
- }
- println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
复制代码
相关内容:
Spark中文手册1-编程指南
http://www.aboutyun.com/thread-11413-1-1.html
Spark中文手册2:Spark之一个快速的例子
http://www.aboutyun.com/thread-11484-1-1.html
Spark中文手册3:Spark之基本概念
http://www.aboutyun.com/thread-11502-1-1.html
Spark中文手册4:Spark之基本概念(2)
http://www.aboutyun.com/thread-11516-1-1.html
Spark中文手册5:Spark之基本概念(3)
http://www.aboutyun.com/thread-11535-1-1.html
Spark中文手册6:Spark-sql由入门到精通
http://www.aboutyun.com/thread-11562-1-1.html
Spark中文手册7:Spark-sql由入门到精通【续】
http://www.aboutyun.com/thread-11575-1-1.html
Spark中文手册8:spark GraphX编程指南(1)
http://www.aboutyun.com/thread-11589-1-1.html
Spark中文手册10:spark部署:提交应用程序及独立部署模式
http://www.aboutyun.com/thread-11615-1-1.html
Spark中文手册11:Spark 配置指南
http://www.aboutyun.com/thread-10652-1-1.html