GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel

Size: px

Start display at page:

Download "GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel"

Barnard Small
6 years ago
Views:

1 GraphX Graph nalytics in Spark nkur ave Graduate Student, U Berkeley MPLab Joint work with Joseph Gonzalez, Reynold Xin, aniel U BRKLY rankshaw, Michael Franklin, and Ion Stoica

2 Graphs

3 Social Networks

4 Web Graphs

5 User-Item Graphs

6 Graph lgorithms

7 PageRank

8 Triangle ounting

9 ollaborative Filtering Users Ratings Users x Products f(j) f(i) Products

10 ollaborative Filtering f(3) r 13 f(1) User Factors f(2) r 14 r 24 f(4) Product Factors r 25 f(5) X f[i] = arg min w2r d r ij w T f[j] 2 + w 2 2 j2nbrs(i)

11 The Graph-Parallel Pattern

12 The Graph-Parallel Pattern

13 The Graph-Parallel Pattern

14 Many Graph-Parallel lgorithms ollaborative Filtering» lternating Least Squares» Stochastic Gradient escent» Tensor Factorization ommunity etection» Triangle-ounting» K-core ecomposition» K-Truss Structured Prediction» Loopy Belief Propagation» Max-Product Linear Programs» Gibbs Sampling Graph nalytics» PageRank» Personalized PageRank» Shortest Path» Graph oloring Semi-supervised ML» Graph SSL lassification» Neural Networks» om

15 High-egree Vertices hallenges: 1. Storage: How to store a graph where one vertex s edges don t fit on a machine? 2. PI: How to expose parallelism within vertex neighborhoods?

16 omplex Pipelines Link Table Hyperlinks PageRank Top 20 Pages Title Link Title PR Raw Wikipedia < / > < / > < / > Top ommunities om. PR.. XML ditor ommunity User Table ditor Graph etection ommunity ditor Title User om.

17 omplex Pipelines Solution: mbed graph processing within a tableoriented system (Spark) hallenges: 1. Storage: How to store graphs as tables? 2. omputation: How to express graph ops as table ops (map, reduce, join, etc.)? 3. PI: How to present the two views to the user?

18 The GraphX PI

19 Property Graphs Vertex Property: User Profile urrent PageRank Value dge Property: Weights Relationships Timestamps

20 reating a Graph (Scala) type VertexId = Long Graph val vertices: R[(VertexId, String)] = sc.parallelize(list( (1L, lice ), (2L, Bob ), (3L, harlie ))) 1 coworker lice class dge[]( val srcid: VertexId, val dstid: VertexId, val attr: ) val edges: R[dge[String]] = sc.parallelize(list( dge(1l, 2L, coworker ), dge(2l, 3L, friend ))) 2 friend 3 Bob harlie val graph = Graph(vertices, edges)

21 Graph Operations (Scala) class Graph[V, ] { // Table Views def vertices: R[(VertexId, V)] def edges: R[dge[]] def triplets: R[dgeTriplet[V, ]] // Transformations def mapvertices[v2](f: (VertexId, V) => V2): Graph[V2, ] def mapdges[2](f: dge[] => 2): Graph[V2, ] def reverse: Graph[V, ] def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] // Joins def outerjoinvertices[u, V2] (tbl: R[(VertexId, U)]) (f: (VertexId, V, Option[U]) => V2): Graph[V2, ] // omputation def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )]

22 Built-in lgorithms (Scala) } // ontinued from previous slide def pagerank(tol: ouble): Graph[ouble, ouble] def triangleount(): Graph[Int, ] def connectedomponents(): Graph[VertexId, ] //...and more: org.apache.spark.graphx.lib PageRank Triangle ount onnected omponents

23 The triplets view class Graph[V, ] { } def triplets: R[dgeTriplet[V, ]] class dgetriplet[v, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) Graph 1 lice R coworker 2 friend Bob triplets srcttr dstttr attr lice coworker Bob Bob friend harlie 3 harlie

24 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(epred = (edge) => edge.attr!= relative ) Graph Graph lice coworker Bob lice coworker Bob relative friend subgraph friend harlie relative avid harlie avid

25 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(vpred = (id, name) => name!= Bob ) Graph Graph lice coworker Bob lice relative friend subgraph relative harlie relative avid harlie relative avid

26 omputation with aggregatemessages class Graph[V, ] { def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )] } class dgeontext[v,, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) { def sendtosrc(msg: ) def sendtost(msg: ) } graph.aggregatemessages( ctx => { ctx.sendtosrc(1) ctx.sendtost(1) }, _ + _)

27 omputation with aggregatemessages R Graph vertex id degree lice relative harlie coworker friend relative Bob avid aggregatemessages lice 2 Bob 2 harlie 3 avid 1

28 xample: Graph oarsening Web Pages Intra-omain Links Pages by omain subgraph onnected omponents Graph constructor omains omain Graph

29 How GraphX Works

30 Storing Graphs as Tables Property Graph Vertex Table dge Table (R) (R) B B F Machine 1 Machine 2 B B F F F

31 Simple Operations Reuse vertices or edges across multiple graphs Input Graph Transformed Graph Vertex dge Vertex dge Table Table Table Table Transform Vertex Properties

32 Implementing triplets Vertex Table Routing Table dge Table (R) (R) (R) B 1 2 Machine 1 Machine 2 B F B F B F F

33 Implementing triplets dge Table (R) B B F F Mirror ache B Mirror ache F Vertex Table (R) B F B F Routing Table (R) B F

34 Reduction in ommunication ue to ached Updates onnected omponents on Twitter Graph (1.4B edges) Network omm. (MB) Most vertices are within 8 hops of all vertices in their component Iteration

35 Implementing aggregatemessages dge Table (R) B B F F Mirror ache B Mirror ache F Scan Result Vertex Table B F B F

36 Speedup ue To ccess Method Selection 30 onnected omponents on Twitter (1.4B edges) Runtime (Seconds) Scan ll dges Scan Indexed 10 5 Index of ctive dges Iteration

7000 6000 5000 4000 3000 18x 1000 2000 500 1000 0 0 GraphX

37 PageRank Benchmark 2 luster of 16 x m2.4xlarge (8 cores) + 1Gig Twitter Graph (42M Vertices,1.5B dges) UK-Graph (106M Vertices, 3.7B dges) Runtime (Seconds) x x GraphX GraphLab Giraph Naïve GraphX GraphLab Giraph Naïve Spark Spark GraphX performs comparably to state-of-the-art graph processing systems.

38 Open Source lpha release with Spark in Feb 2014 Stable release with Spark in ec 2014

39 Future of GraphX 1. Language support a) Java PI: PR #3234 b) Python PI: collaborating with Intel, SPRK More algorithms a) L (topic modeling): PR #2388 b) orrelation clustering 3. Research a) Local graphs b) Streaming/time-varying graphs c) Graph database like queries

40 IndexedR Support efficient updates to immutable Rs using purely functional data structures

41 Thanks!

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics GraphX: Unifying ata-parallel and Graph-Parallel nalytics Presented by aniel rankshaw Joint work with Reynold Xin, Joseph Gonzalez, nkur ave, Michael Franklin, and Ion Stoica Spark Meetup 3/25/2014 U BERKELEY