Tutorial: Large-Scale Graph Processing in Shared Memory

Size: px

Start display at page:

Download "Tutorial: Large-Scale Graph Processing in Shared Memory"

Eustace Norris
6 years ago
Views:

1 Tutorial: Large-Scale Graph Processing in Shared Memory Julian Shun (University of California, Berkeley) Laxman Dhulipala (Carnegie Mellon University) Guy Blelloch (Carnegie Mellon University)

2 PPoPP Tutorial Agenda 2:00-3:30pm Overview of graph processing and Ligra Walk through installation Do examples in Ligra 3:30-4:00pm Break 4:00-5:30pm Implementation details of Ligra Overview of other graph processing systems Exercises Tutorial website: Slides available at

3 Graphs are everywhere! PPoPP Can contain billions of vertices and edges! 6.6 billion edges 128 billion edges ~1 trillion edges [VLDB 2015]

PPoPP 2016 4 Graph processing challenges Many random memory accesses, poor

parallelism throughout execution Race conditions, load balancing Need to

4 PPoPP Graph processing challenges Many random memory accesses, poor locality Relatively high communication-to-computation ratio Varying parallelism throughout execution Race conditions, load balancing Need to efficiently analyze large graphs Running time efficiency Space efficiency Programming efficiency

5 PPoPP Ligra Graph Processing Framework EdgeMap VertexMap Breadth-first search Betweenness centrality Connected components K-core decomposition Belief propagation Maximal independent set Single-source shortest paths Eccentricity estimation (Personalized) PageRank Local graph clustering Biconnected components Collaborative filtering Simplicity, Performance, Scalability

6 PPoPP Graph Processing Systems Existing (at the time Ligra was developed): Pregel/ Giraph/GPS, GraphLab, Pegasus, Knowledge Discovery Toolbox, GraphChi, and many others Our system: Ligra - Lightweight graph processing system for shared memory Takes advantage of frontier-based nature of many algorithms (active set is dynamic and often small)

Breadth-first Search (BFS) PPoPP 2016 7 Compute a BFS tree rooted at source r containing all vertices reachable from r Applications Betweenness centrality

7 Breadth-first Search (BFS) PPoPP Compute a BFS tree rooted at source r containing all vertices reachable from r Applications Betweenness centrality Eccentricity estimation Maximum flow Web crawlers Network broadcasting Cycle detection Can process each level in parallel Race conditions, load balancing.

Steps for Graph Traversal PPoPP 2016 8 Operate on a subset of vertices Map computation over subset of edges in

Map computation over subset of vertices in parallel VertexMap } VertexSubset EdgeMap We built the Ligra

8 Steps for Graph Traversal PPoPP Operate on a subset of vertices Map computation over subset of edges in parallel Return new subset of vertices Many graph traversal Graph algorithms do this! Map computation over subset of vertices in parallel VertexMap } VertexSubset EdgeMap We built the Ligra abstraction for these kinds of computations Think with flat data-parallel operators Shared memory parallelism Optimizations: hybrid traversal graph compression

9 PPoPP Ligra Framework T F F T VertexSubset bool f(v){ data[v] = data[v] + 1; return (data[v] == 1); } VertexFilter VertexMap 6 3 VertexSubset 8 6

PPoPP 2016 10 Ligra Framework T 2 0 5 T VertexSubset 8 0 4 6 T T 7 4 1 T

10 PPoPP Ligra Framework T T VertexSubset T T T 8 bool update(u,v){ } EdgeMap bool cond(v){ } 6 3 F VertexSubset

11 PPoPP Breadth-first Search in Ligra parents = {-1,, -1}; //-1 indicates unexplored procedure UPDATE(s, d): return compare_and_swap(parents[d], -1, s); procedure COND(v): return parents[v] == -1; //checks if unexplored procedure BFS(G, r): parents[r] = r; frontier = {r}; //VertexSubset while (size(frontier) > 0): frontier frontier = EDGEMAP(G, frontier, UPDATE, COND); T T F T T

12 PPoPP Install and code examples in Ligra

13 Breadth-first Search (BFS) Compute a BFS tree rooted at source r containing all vertices reachable from r Frontier PPoPP r

Connected Components PPoPP 2016 14 Takes an unweighted, undirected graph G = (V, E) Returns a

14 Connected Components PPoPP Takes an unweighted, undirected graph G = (V, E) Returns a label array L such that L[v] = L[w] if and only if v is connected to w Image Source: docs.roguewave.com

15 PPoPP Parallel Label Propagation Algorithm Processing all vertices in each iteration seems wasteful Optimization: only place vertices who changed on frontier 45 5 Warning: this algorithm is only good for low-diameter graphs

16 PPoPP Single-Source Shortest Paths Takes a weighted graph G = (V, E, w(e)) and starting vertex r V Returns a shortest paths array SP where SP[v] stores the shortest path distance from r to v ( if v unreachable from r)

17 PPoPP Parallel Bellman-Ford Shortest Paths 3 D B 5 E 63 A C -2 4 F 2

18 PPoPP Parallel Bellman-Ford Performance Times for Bellman-Ford on rmat24 Running time (seconds) Our Bellman-Ford Naive Bellman-Ford Number of threads

19 K-core decomposition PPoPP A k-core of a graph is a maximal connected subgraph in which all vertices have degree at least k A vertex has core number k if it belongs to a k-core but not a (k+1)-core Algorithm: Takes an unweighted, undirected graph G and returns the core number of each vertex in G k = 1 while(g is not empty) { } while(there exists vertices with degree < k in G) { } k = k+1; assign a core number of k-1 to all vertices with degree < k; remove all vertices with degree < k from G;

20 PageRank PPoPP

$PPoPP 2016 21 PageRank in Ligra p_curr = {1/ V,, 1/ V }; p_next = {0,, 0}; diff = {}; procedure UPDATE(s, d): return atomic_increment(p_next[d], p_curr[s] / degree(s)); procedure COMPUTE(i):$

21 PPoPP PageRank in Ligra p_curr = {1/ V,, 1/ V }; p_next = {0,, 0}; diff = {}; procedure UPDATE(s, d): return atomic_increment(p_next[d], p_curr[s] / degree(s)); procedure COMPUTE(i): p_next[i] = α p_next[i] + (1- α) (1/ V ); diff[i] = abs(p_next[i] p_curr[i]); p_curr[i] = 0; return 1; procedure PageRank(G, α, ε): frontier = {0,, V -1}; error = while (error > ε): frontier = EDGEMAP(G, frontier, UPDATE, COND true ); VERTEXMAP(frontier, COMPUTE); error = sum of diff entries; swap(p_curr, p_next) return p_curr;

22 PPoPP PageRank Sparse version? PageRank-Delta: Only update vertices whose PageRank value has changed by more than some Δ-fraction (discussed in GraphLab and McSherry WWW 05)

23 PageRank-Delta in Ligra PPoPP PR[i] = {1/ V,, 1/ V }; nghsum = {0,, 0}; Change = {}; //store changes in PageRank values procedure UPDATE(s, d): //passed to EdgeMap return atomic_increment(nghsum[d], Change[s] / degree(s)); procedure COMPUTE(i): Change[i] = α nghsum[i]; PR[i] = PR[i] + Change[i]; return (abs(change[i]) > Δ); //now passed to VertexFilter //check if absolute value of change is big enough procedure PageRank(G, α, ε): frontier = VERTEXFILTER(frontier, COMPUTE);

24 Eccentricity estimation PPoPP Takes an unweighted, undirected graph G = (V, E) Returns an estimate of the eccentricity of each vertex where The eccentricity of a vertex v is the distance to furthest reachable vertex from v ecc(a) = 4 ecc(b) = 3 ecc(c) = 2 ecc(d) = 3 a b c d e f g ecc(e) = 4 ecc(f) = 3 ecc(g) = 4

25 PPoPP Multiple BFS s Run multiple BFS s from a sample of random vertices and use distance from furthest sample as eccentricity estimate s 1 v s 4 s 2 s 3 eĉc(v) = max(d(v, s 1 ), d(v, s 2 ), d(v, s 3 ), d(v, s 4 )) for all v In practice, need to run two sweeps to get good accuracy [KDD 2015]

Eccentricity estimation implementation Run all k BFS s simultaneously 1e+11 Take advantage of

Bitwise-OR of bit-vectors 1e+07 when visiting neighbor 2 6 2 7 2 8 2 9 2 10 2 11 2 12 2 13 2

26 Eccentricity estimation implementation Run all k BFS s simultaneously 1e+11 Take advantage of bit-level parallelism to store visited 1e+10 information v Num. edges traversed PPoPP com-youtube 1e+12 1e+09 k/64 words k-bfs Naive BFS 1e+08 Bitwise-OR of bit-vectors 1e+07 when visiting neighbor Unique bit set for each source k s 1 s

Eccentricity estimation implementation Initial frontier = {s 1, s 2,, s k } d = 0 While

BFS s by 1 level nextfrontier = {} d = d+1 EdgeMap For each vertex v in frontier: For each

store in ngh If any of ngh s words changed: eĉc(ngh) = max(eĉc(ngh), d) and place ngh on

27 Eccentricity estimation implementation Initial frontier = {s 1, s 2,, s k } d = 0 While frontier not empty: PPoPP atomic bitwise-or using compareand-swap //Advance all BFS s by 1 level nextfrontier = {} d = d+1 EdgeMap For each vertex v in frontier: For each neighbor ngh: //pass visited information Do bitwise-or of v s words with ngh s words and store in ngh If any of ngh s words changed: eĉc(ngh) = max(eĉc(ngh), d) and place ngh on nextfrontier if not there frontier = nextfrontier We will implement this example in Ligra for k=64

28 PPoPP Ligra Implementation Details

29 PPoPP VertexSubset, VertexMap, and VertexFilter VertexSubset has one of two representations: Sparse integer array, e.g. {1, 4, 7} Dense boolean array, e.g. {0,1,0,0,1,0,0,1} procedure VERTEXMAP(VertexSubset U, func F): parallel foreach v in U: F(v); //side-effects application data procedure VERTEXFILTER(VertexSubset U, bool func F): result = {} parallel foreach v in U: if(f(v) == 1) then: add v to result; return result;

30 Sparse or Dense EdgeMap? Frontier PPoPP Dense method better when frontier is large and many vertices have been visited Sparse (traditional) method better for small frontiers Switch between the two methods based on frontier size [Beamer et al. SC 12] 6 7 Limited to BFS? 8

PPoPP 2016 31 EdgeMap procedure EDGEMAP(G, frontier, Update, Cond): if (above threshold) then: return

outgoing edges of frontier vertices in parallel Loop through incoming edges of unexplored vertices (in

31 PPoPP EdgeMap procedure EDGEMAP(G, frontier, Update, Cond): if (above threshold) then: return EDGEMAP_DENSE(G, frontier, Update, Cond); else: return EDGEMAP_SPARSE(G, frontier, Update, Cond); Loop through outgoing edges of frontier vertices in parallel Loop through incoming edges of unexplored vertices (in parallel), breaking early if possible More general than just BFS! Generalized to many other problems Users need not worry about this

32 EdgeMap (sparse version) How to represent VertexSubset? Array of integers, e.g. U = {0, 5, 7} PPoPP procedure EDGEMAP_SPARSE(G, frontier, UPDATE, COND): nextfrontier = {}; parallel foreach v in frontier: parallel foreach w in out_neighbors(v): if(cond(w) == 1 and UPDATE(v, w) == 1) then: add w to nextfrontier; remove duplicates from nextfrontier; return nextfrontier; parents = {-1,, -1}; //-1 indicates unvisited procedure UPDATE(s, d): return compare_and_swap(parents[d], -1, s); procedure COND(i): return parents[i] == -1; //checks if unvisited procedure BFS(G, r): parents[r] = r; frontier = {r}; //vertexsubset while (size(frontier) > 0): frontier = EDGEMAP(G, frontier, UPDATE, COND);

33 EdgeMap (dense version) How to represent dense VertexSubset? Byte array, e.g. U = {1, 0, 0, 0, 0, 1, 0, 1} procedure EDGEMAP_DENSE(G, frontier, UPDATE, COND): nextfrontier = {0,, 0}; parallel foreach v in G: if (COND(v) == 1) then: foreach ngh in in_neighbors(v): if ngh in frontier and UPDATE(ngh, v) == 1 then: nextfrontier[v] = 1; if (COND(v) == 0) then: break; return nextfrontier; PPoPP parents = {-1,, -1}; //-1 indicates unvisited procedure UPDATE(s, d): return compare_and_swap(parents[d], -1, s); procedure COND(i): return parents[i] == -1; //checks if unvisited procedure BFS(G, r): parents[r] = r; frontier = {r}; //vertexsubset while (size(frontier) > 0): frontier = EDGEMAP(G, frontier, UPDATE, COND);

34 Frontier-based approach enables sparse/dense traversal Twitter graph (41M vertices, 1.5B edges) PPoPP core running time (seconds) BFS Betweenness Centrality Connected Components Eccentricity Estimation Dense Sparse Sparse/Dense (using default threshold of E /20)

Ligra BFS Performance PPoPP 2016 35 Running time (seconds) 0.3 0.25 0.2 0.15 0.1 0.05 0 Twitter graph (41M vertices, 1.

35 Ligra BFS Performance PPoPP Running time (seconds) Twitter graph (41M vertices, 1.5B edges) BFS Ligra (40-core machine) Hand-written Cilk/ OpenMP (40-core machine) Lines of code BFS Comparing against direction-optimizing code by Beamer et al.

PPoPP 2016 36 Ligra PageRank Performance Running time (seconds) 14 12 10 8 6 4 2 0 Twitter graph (41M vertices, 1.

36 PPoPP Ligra PageRank Performance Running time (seconds) Twitter graph (41M vertices, 1.5B edges) Page Rank (1 iteration) GraphLab (64 x 8- cores) GraphLab (40-core machine) Lines of code Ligra (40-core machine) Hand-written Cilk/ OpenMP (40-core 10 machine) 0 Page Rank Easy to implement sparse version of PageRank in Ligra

Running time (seconds) Ligra Connected Components Performance 14 12 10 8 6 4 2 0 Twitter graph (41M vertices, 1.

37 Running time (seconds) Ligra Connected Components Performance Twitter graph (41M vertices, 1.5B edges) Connected Components GraphLab (16 x 8- cores) GraphLab (40-core machine) Lines of code Ligra (40-core machine) Hand-written Cilk/ OpenMP (40-core 50 machine) 0 PPoPP Connected Components Performance close to hand-written code Faster than existing high-level frameworks at the time Shared-memory graph processing is very efficient Several shared-memory graph processing systems subsequently developed: Galois [SOSP 13], X-stream [SOSP 13], PRISM [SPAA 14], Polymer [PPoPP 15], Ringo [SIGMOD 15], GraphMat [VLDB 15]

PPoPP 2016 38 Large Graphs Available RAM 6.

38 PPoPP Large Graphs Available RAM 6.6 billion edges Running Time 128 billion edges Memory Required ~1 trillion edges [VLDB 2015] Most can fit on commodity shared memory machine What if you don t have that much memory?

39 PPoPP Ligra+: Adding Graph Compression to Ligra

40 Ligra+: Adding Graph Compression to Ligra Interface PPoPP Graph VertexSubset EdgeMap VertexMap Use compressed representation Same as before Decode edges on-the-fly Same as before Same interface as Ligra All changes hidden from the user!

Graph representation PPoPP 2016 41 Vertex IDs 0 1 2 3 Offsets 0 4 5 11... Edges 2 7 9 16 0 1 6 9 12 2-0 = 2 7-2 = 5 1-2 = -1 Compressed Edges 2 5 2 7-1 -1 5 3 3.

41 Graph representation PPoPP Vertex IDs Offsets Edges = = = -1 Compressed Edges Sort edges and encode differences Graph reordering to improve locality Goal: give neighbors IDs close to vertex ID BFS, DFS, METIS, our own separator-based algorithm

42 Variable-length codes k-bit codes PPoPP Encode value in chunks of k bits Use k-1 bits for data, and 1 bit as the continue bit Example: encode 401 using 8-bit (byte) code In binary: bits for data continue bit

43 Encoding optimization PPoPP Another idea: get rid of continue bits Number of bytes required to encode each integer Number of bytes per integer Header x 1 x 2 x 3 x 4 x 5 x 6 x 7 x Size of group (max 64) Use run-length encoding Integers in group encoded in byte chunks Increases space, but makes decoding cheaper (no branch misprediction from checking continue bit)

44 Ligra+: Adding Graph Compression to Ligra Interface PPoPP Graph VertexSubset EdgeMap VertexMap Use compressed representation Same as before Decode edges on-the-fly Same as before Same interface as Ligra All changes hidden from the user!

45 Modifying EdgeMap PPoPP Processes outgoing edges of a subset of vertices VertexSubset All vertices processed in parallel What about high-degree vertices?

Handling high-degree vertices High-degree vertex PPoPP 2016 46-1 2 4 3 16 2 1 5 8 19 4 1 23 14 12 1 9 10 3 5 Chunks of size T -1 2 4 3 16 2 27 5 8 19 4 1 87 14 12

46 Handling high-degree vertices High-degree vertex PPoPP Chunks of size T Encode first entry relative to source vertex All chunks can be decoded in parallel! We chose T=1000 Similar performance and space usage for a wide range of T

47 PPoPP Ligra+: Adding Graph Compression to Ligra Space relative to Ligra Ligra Ligra core time relative to Ligra Using 8-bit codes with run-length encoding Cost of decoding on-the-fly? Memory bottleneck a bigger issue as graph algorithms are memory-bound

48 PPoPP Demo on compressed graphs in Ligra

49 PPoPP Other Graph Processing Systems

50 Many existing frameworks PPoPP Pregel/Giraph/GPS, GraphLab/PowerGraph, PRISM, Pegasus, Knowledge Discovery Toolbox/CombBLAS, GraphChi, GraphX, Galois, X-Stream, Gunrock, GraphMat, Ringo, TurboGraph, FlashGraph, Grace, PathGraph, Polymer, GoFFish, Blogel, LightGraph, MapGraph, PowerLyra, PowerSwitch, XDGP, Signal/ Collect, PrefEdge, Parallel BGL, KLA, Grappa, Chronos, Green-Marl, GraphHP, P++, LLAMA, Venus, Cyclops, Medusa, NScale, Neo4J, Trinity, GBase, HyperGraphDB, Horton, GSPARQL, Titan, and many others Cannot list everything here. For more information, see: Systems for Big Graphs, Khan and Elnikety VLDB 2014 Tutorial Trade-offs in Large Graph Processing: Representations, Storage, Systems and Algorithms, Ajwani et al. WWW 2015 Tutorial A Survey of Parallel Graph Processing Frameworks, Doekemeijer and Varbanescu 2014 Thinking like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing, McCune et al. 2015

51 PPoPP Pregel Think like a vertex Distributed-memory, uses message passing Bulk synchronous model Vertices can be either active or inactive Pregel: A System for Large-Scale Graph Processing, Malewicz et al. SIGMOD 2010

52 GraphLab/PowerGraph PPoPP Think like a vertex Vertices define Gather, Apply, and Scatter functions Data on edges as well as vertices Scheduler processes active vertices Supports asynchronous execution Useful for some machine learning applications Different levels of consistency Different scheduling orders Current version for distributed memory (original version was for shared memory) Vertex Data: Edge Data: GraphLab: A New Framework for Parallel Machine Learning, Low et al. UAI 2010 PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al. OSDI 2012

PPoPP 2016 53 GraphChi Think like a vertex Define Update function for vertices Optimized for disk-based

Parallel sliding windows method for efficient execution One shard in memory, other shards read in streaming

53 PPoPP GraphChi Think like a vertex Define Update function for vertices Optimized for disk-based execution Divides edges into shards Each shard has a range of target IDs Each shard sorted by source ID Parallel sliding windows method for efficient execution One shard in memory, other shards read in streaming fashion About O((V+E)/B) I/O s per iteration GraphChi: Large-Scale Graph Computation on Just a PC, Kyrola et al. OSDI 2012

54 Linear algebra abstraction PPoPP PEGASUS: Framework using matrix-vector abstraction in MapReduce [Kang et al. ICDM 2009] Knowledge Discovery Toolbox/CombBLAS: Uses matrixvector and matrix-matrix routines for implementing graph algorithms; uses Python frontend, C++/MPI backend [Lugowski et al. SDM 2012]

55 PPoPP Some other shared memory systems X-Stream [SOSP 2013] Edge-centric abstraction Supports disk-based execution Galois [SOSP 2013] Parallel programming framework based on dynamic sets with various schedulers Implements graph abstractions of Ligra, X-Stream, and GraphLab PRISM [SPAA 2014] Deterministic version of GraphLab using graph coloring for scheduling; implemented in Cilk Polymer [PPoPP 2015] NUMA-aware implementation of Ligra s abstraction GraphMat [VLDB 2015] Uses optimized matrix-vector routines to implement graph algorithms Gunrock [PPoPP 2016] Abstraction for frontier-based computations on GPUs

56 Exercises PPoPP

57 PPoPP Exercises Implement a version of BFS that gives a deterministic BFS tree (i.e., the Parents array is deterministic) Implement a version of BFS that uses a bit-vector to check the visited status before updating the Parents array

PPoPP 2016 58 Thank you! Code: https://github.com/jshun/ligra/ References Ligra: A Lightweight Graph Processing Framework for Shared Memory, PPoPP 2013.

58 PPoPP Thank you! Code: References Ligra: A Lightweight Graph Processing Framework for Shared Memory, PPoPP Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+, DCC An Evaluation of Parallel Eccentricity Estimation Algorithms on Undirected Real-World Graphs, KDD 2015.

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge