Tegra Time-evolving Graph Processing on Commodity Clusters

Size: px

Start display at page:

Download "Tegra Time-evolving Graph Processing on Commodity Clusters"

Briana Kathleen Curtis
5 years ago
Views:

1 Tegra Time-evolving Graph Processing on ommodity lusters SIM SE 17 2 March 2017 nand Iyer Qifan Pu Joseph Gonzalez Ion Stoica

2 1 Graphs are everywhere Social Networks

3 2 Graphs are everywhere Gnutella network subgraph

4 Graphs are everywhere 3

5 4 Graphs are everywhere Metabolic network of a single cell organism Tuberculosis

6 5 Plenty of interest in processing them Graph MS 25% of all enterprises by end of Many open-source and research prototypes on distributed graph processing frameworks: Giraph, Pregel, GraphLab, GraphX, 1 Forrester Research

7 6 Real-world Graphs are ynamic Earthquake Occurrence ensity

8 Real-world Graphs are ynamic 7

9 Real-world Graphs are ynamic 8

10 9 Processing Time-evolving Graphs Many interesting business and research insights possible by processing such dynamic graphs little or no work in supporting such workloads in existing big-data graph-processing frameworks

11 10 hallenge #1: Storage Time E G 1 G 2 G 3 Redundant storage of graph entities over time

12 11 hallenge #2: omputation Time E G 1 G 2 G 3 E R 1 R 2 R 3 Wasted computation across snapshots

13 12 hallenge #3: ommunication Time E G 1 G 2 G 3 E uplicate messages sent over the network

14 13 How do we process time-evolving, Share Storage dynamically changing ommunication graphs omputation efficiently? Tegra

15 14 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage ommunication omputation Tegra

16 15 Sharing Storage Time E G 1 G 2 G 3 δg 1 δg 2 Storing deltas result in the most optimal storage, but creating snapshot from deltas can be expensive! δg 3 E

17 16 etter Storage Solution Use a persistent datastructure Snapshot 1 Snapshot 2 t 1 t 2 Store snapshots in Persistent daptive Radix Trees (PRT)

18 17 Graph Snapshot Index Snapshot I Management Snapshot 1 Snapshot 2 Snapshot 1 Snapshot 2 t 1 t 1 t 2 t 2 Vertex Edge Partition Shares structure between snapshots, and enables efficient operations

19 18 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage ommunication omputation Tegra

20 19 Graph Parallel bstraction - GS Gather: ccumulate information from neighborhood pply: pply the accumulated value Scatter: Update adjacent edges & vertices with updated value

21 20 Processing Multiple Snapshots Time E G 1 G 2 G 3 for (snapshot in snapshots) { for (stage in graph-parallel-computation) { } }

22 21 Reducing Redundant Messages for (step in graph-parallel-computation) { for (snapshot in snapshots) { } } Time E E G 1 G 2 G 3 an potentially avoid large number of redundant messages

23 22 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage ommunication omputation Tegra

24 23 Updating Results If result from a previous snapshot is available, how can we reuse them? Three approaches in the past: Restart the algorithm Redundant computations Memoization (GraphInc 1 ) Too much state Operator-wise state (Naiad 2,3 ) Too much overhead Fault tolerance 1 Facilitating real- time graph mining, loud 12 2 Naiad: timely dataflow system, SOSP 13 3 ifferential dataflow, IR 13

25 24 Key Idea Leverage how GS model executes computation Each iteration in GS modifies the graph by a little an be seen as another time-evolving graph! Upon change to a graph: Mark parts of the graph that changed Expand the marked parts to involve regions for recomputation in every iteration orrow results from parts not changed

26 25 Incremental omputation Iterations G 0 1 G 1 1 G 2 1 Time G 0 2 G 1 2 G 2 2 Larger graphs and more iterations can yield significant improvements

27 26 Implementation & Evaluation Implemented on Spark 2.0 Extended dataframes with versioning information and iterate operator Extended GraphX PI to allow computation on multiple snapshots Preliminary evaluation on two real-world graphs Twitter: 41,652,230 vertices, 1,468,365,182 edges uk-2007: 105,896,555 vertices, 3,738,733,648 edges

28 27 enefits of Storage Sharing Storage Reduction atastructure overheads Significant improvements with more snapshots Number of Snapshots

29 28 enefits of sharing communication Time (s) Number of Snapshots GraphX Tegra

30 29 enefits of Incremental omputing 250 omputation Time (s) Only 5% of the graph modified in every snapshot 50x reduction by processing only the modified part Incremental Full omputation Snapshot I

31 30 Summary & Future Work Processing time-evolving graph efficiently can be useful Sharing storage, computation and communication key to efficient time-evolving graph analysis ode release Incremental pattern matching pproximate graph analytics Geo-distributed graph analytics

GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel

GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel GraphX Graph nalytics in Spark nkur ave Graduate Student, U Berkeley MPLab Joint work with Joseph Gonzalez, Reynold Xin, aniel U BRKLY rankshaw, Michael Franklin, and Ion Stoica Graphs Social Networks