Tegra: Time-evolving Graph Processing on Commodity Clusters. SIM SE '17, 2 March 2017. Anand Iyer, Qifan Pu, Joseph Gonzalez, Ion Stoica.
Graphs are everywhere: social networks, a subgraph of the Gnutella peer-to-peer network, the metabolic network of a single-cell organism (tuberculosis).
Plenty of interest in processing them: graph DBMSs will be used by 25% of all enterprises by the end of 2017 [1]. Many open-source and research prototypes of distributed graph processing frameworks: Giraph, Pregel, GraphLab, GraphX, ... [1] Forrester Research.
Real-world graphs are dynamic (e.g., earthquake occurrence density evolving over time).
Processing Time-evolving Graphs: many interesting business and research insights are possible by processing such dynamic graphs, yet there is little or no support for these workloads in existing big-data graph processing frameworks.
Challenge #1: Storage. Keeping every snapshot (G1, G2, G3) independently means redundant storage of graph entities over time.
Challenge #2: Computation. Running the computation on each snapshot from scratch (G1 -> R1, G2 -> R2, G3 -> R3) wastes computation across snapshots.
Challenge #3: Communication. Processing each snapshot independently sends duplicate messages over the network.
How do we process time-evolving, dynamically changing graphs efficiently? Share storage, communication, and computation: Tegra.
Sharing Storage: storing deltas (δg1, δg2, δg3) gives the most compact representation, but creating a snapshot from deltas can be expensive!
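A minimal sketch of the trade-off above (illustrative only, not Tegra's code): deltas store only what changed between snapshots, but materializing snapshot k means replaying all k deltas.

```python
# Sketch: delta-based snapshot storage. Each snapshot is recorded as a
# delta (edges added, edges removed) against the previous one, so
# reconstructing snapshot k requires replaying k deltas -- O(k) work.

def apply_delta(edges, delta):
    """Apply one delta: a pair (added, removed) of edge sets."""
    added, removed = delta
    return (edges | added) - removed

def reconstruct(base, deltas, k):
    """Rebuild snapshot k by replaying the first k deltas."""
    edges = set(base)
    for delta in deltas[:k]:
        edges = apply_delta(edges, delta)
    return edges

base = {(1, 2), (2, 3)}
deltas = [({(3, 4)}, set()),   # snapshot 1: edge (3, 4) added
          (set(), {(1, 2)})]   # snapshot 2: edge (1, 2) removed
assert reconstruct(base, deltas, 2) == {(2, 3), (3, 4)}
```

The storage is compact, but the replay cost grows with the number of deltas, which motivates the persistent-data-structure approach on the next slide.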
A Better Storage Solution: use a persistent data structure. Store snapshots in Persistent Adaptive Radix Trees (PART).
Graph Snapshot Index: manages snapshot IDs over each snapshot's vertex, edge, and partition structures. It shares structure between snapshots and enables efficient operations.
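The structural sharing above comes from path copying, the idea behind persistent structures such as PART. The toy trie below is an assumption-laden sketch (not Tegra's implementation): an update copies only the path to the changed leaf, so two snapshots physically share every untouched subtree.

```python
# Sketch of path copying: insert() returns a NEW root, copying only the
# nodes on the key's path; all other subtrees are shared by reference.

class Node:
    def __init__(self, children=None, value=None):
        self.children = children or {}   # branch character -> child Node
        self.value = value

def insert(root, key, value):
    """Persistent insert: returns a new version, leaves `root` untouched."""
    if not key:
        return Node(root.children, value)
    head, rest = key[0], key[1:]
    child = root.children.get(head, Node())
    new_children = dict(root.children)       # shallow copy: children shared
    new_children[head] = insert(child, rest, value)
    return Node(new_children, root.value)

v1 = insert(insert(Node(), "ab", 1), "cd", 2)   # snapshot 1
v2 = insert(v1, "ab", 99)                        # snapshot 2
# The "cd" subtree is the same object in both snapshots:
assert v1.children["c"] is v2.children["c"]
# ...while each snapshot sees its own value for "ab":
assert v1.children["a"].children["b"].value == 1
assert v2.children["a"].children["b"].value == 99
```

An adaptive radix tree applies the same principle with cache-friendly, variable-width nodes, which is why PART keeps many snapshots cheaply.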
Sharing Communication.
Graph-Parallel Abstraction (GAS). Gather: accumulate information from the neighborhood. Apply: apply the accumulated value at the vertex. Scatter: update adjacent edges and vertices with the new value.
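One superstep of the gather/apply phases can be sketched as follows. This is a generic illustration of the GAS model, not GraphX's actual API; the function names and the PageRank-style update are assumptions.

```python
# Sketch of one GAS superstep: gather accumulates contributions from
# in-neighbors, apply combines them into new vertex values. (Scatter,
# which activates neighbors of changed vertices, is omitted here.)

def gas_step(vertices, edges, gather, apply_fn):
    # Gather: sum contributions along in-edges.
    acc = {v: 0.0 for v in vertices}
    for (src, dst) in edges:
        acc[dst] += gather(vertices[src])
    # Apply: combine each vertex's old value with its accumulated value.
    return {v: apply_fn(vertices[v], acc[v]) for v in vertices}

# One PageRank-like superstep on a 3-vertex cycle.
vertices = {1: 1.0, 2: 1.0, 3: 1.0}
edges = [(1, 2), (2, 3), (3, 1)]
new = gas_step(vertices, edges,
               gather=lambda rank: rank,              # neighbor contributes its rank
               apply_fn=lambda old, s: 0.15 + 0.85 * s)
assert new == {1: 1.0, 2: 1.0, 3: 1.0}   # cycle is already at the fixed point
```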
Processing Multiple Snapshots: the straightforward approach runs the full computation on each snapshot in turn:
for (snapshot in snapshots) {
  for (stage in graph-parallel-computation) { ... }
}
Reducing Redundant Messages: inverting the loop order runs each step of the computation across all snapshots at once, which can avoid a large number of redundant messages:
for (step in graph-parallel-computation) {
  for (snapshot in snapshots) { ... }
}
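Why the inverted loop helps can be sketched numerically. The message-generation function below is hypothetical; the point is that snapshots sharing graph regions generate identical messages, which a per-step batch can deduplicate but per-snapshot runs cannot.

```python
# Sketch: counting messages per superstep. Snapshots that share edges
# produce identical messages for the shared region; batching one step
# across all snapshots lets those messages be sent once.

def messages_for(snapshot, step):
    """Hypothetical per-step message set for one snapshot."""
    return {(src, dst, step) for (src, dst) in snapshot}

g1 = {(1, 2), (2, 3)}
g2 = {(1, 2), (2, 3), (3, 4)}   # shares two edges with g1
snapshots = [g1, g2]

# Outer loop over snapshots: every snapshot sends its own copies.
naive = sum(len(messages_for(s, 0)) for s in snapshots)
# Outer loop over steps: shared messages are deduplicated within the step.
shared = len(set().union(*(messages_for(s, 0) for s in snapshots)))
assert (naive, shared) == (5, 3)   # 2 of 5 messages were redundant
```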
Sharing Computation.
Updating Results: if results from a previous snapshot are available, how can we reuse them? Three approaches in the past:
- Restart the algorithm: redundant computation.
- Memoization (GraphInc [1]): too much state.
- Operator-wise state (Naiad [2,3]): too much overhead; fault-tolerance issues.
[1] Facilitating real-time graph mining, CloudDB '12. [2] Naiad: a timely dataflow system, SOSP '13. [3] Differential dataflow, CIDR '13.
Key Idea: leverage how the GAS model executes computation. Each GAS iteration modifies the graph by a little, so the iterations themselves can be seen as another time-evolving graph! Upon a change to the graph:
- Mark the parts of the graph that changed.
- Expand the marked parts in every iteration to cover the regions needing recomputation.
- Borrow results from the parts that did not change.
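The mark-and-expand step above can be sketched as a frontier expansion. This is an assumed simplification, not Tegra's implementation: changed vertices are marked, and each iteration the affected set grows one hop along out-edges, while everything outside it reuses the previous snapshot's results.

```python
# Sketch: after an update, only vertices whose inputs changed need
# recomputation; the affected region expands one hop per GAS iteration.

def affected_frontier(changed, edges, iterations):
    """Expand the changed set along out-edges, one hop per iteration."""
    affected = set(changed)
    frontier = set(changed)
    for _ in range(iterations):
        frontier = {dst for (src, dst) in edges if src in frontier} - affected
        affected |= frontier
    return affected

edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
# Only vertex 1 changed; after 2 iterations just {1, 2, 3} needs
# recomputing, and results for 4 and 5 are borrowed from the old snapshot.
assert sorted(affected_frontier({1}, edges, 2)) == [1, 2, 3]
```

When the change touches a small fraction of a large graph, the affected region stays far smaller than the graph for many iterations, which is where the incremental savings come from.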
Incremental Computation: iterations within a snapshot (G1_0, G1_1, G1_2) and snapshots across time (G2_0, G2_1, G2_2, ...) form a two-dimensional space of graph versions. Larger graphs and more iterations can yield significant improvements.
Implementation & Evaluation. Implemented on Spark 2.0: extended DataFrames with versioning information and an iterate operator, and extended the GraphX API to allow computation on multiple snapshots. Preliminary evaluation on two real-world graphs: Twitter (41,652,230 vertices, 1,468,365,182 edges) and uk-2007 (105,896,555 vertices, 3,738,733,648 edges).
Benefits of Storage Sharing (figure: storage reduction vs. number of snapshots, 1-20): the datastructure adds some overhead at first, but improvements become significant with more snapshots.
Benefits of Sharing Communication (figure: processing time vs. number of snapshots, 1-20, GraphX vs. Tegra).
Benefits of Incremental Computing (figure: computation time per snapshot ID, incremental vs. full computation): with only 5% of the graph modified in every snapshot, processing only the modified part gives a 50x reduction.
Summary & Future Work. Processing time-evolving graphs efficiently can be useful; sharing storage, computation, and communication is key to efficient time-evolving graph analysis. Future work: code release, incremental pattern matching, approximate graph analytics, geo-distributed graph analytics. Contact: api@cs.berkeley.edu, www.cs.berkeley.edu/~api.