Duane Merrill (NVIDIA) Michael Garland (NVIDIA)

2 Agenda
- Sparse graphs
- Breadth-first search (BFS)
- Maximal independent set (MIS)

4 Sparse graphs
- Graph G(V, E) where n = |V| and m = |E|
- Convey structure via relationships
[Figures: Wikipedia '07; 3D lattice]

5 Graph sparsity
[Figure: adjacency matrix of an example graph G(n, m) on vertices v0..v3; spy plots]

6 Compressed sparse row (CSR) representation
- Column indices: O(m) storage (entries [0]..[8] in the example)
- Row offsets: O(n) storage (entries [0]..[4] in the example)
[Figure: adjacency matrix of the example graph G(n, m) on vertices v0..v3]
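A minimal sketch of how the two CSR arrays can be built, assuming a directed edge list as input. The function names and the 4-vertex example graph are hypothetical and are not the graph drawn on the slide.

```python
def build_csr(n, edges):
    """Build CSR arrays from a list of directed (source, destination) pairs.

    row_offsets has n + 1 entries; column_indices has one entry per edge.
    """
    degree = [0] * n
    for src, _ in edges:
        degree[src] += 1

    # Running sum of the degrees gives the start of each adjacency list.
    row_offsets = [0] * (n + 1)
    for v in range(n):
        row_offsets[v + 1] = row_offsets[v] + degree[v]

    column_indices = [0] * len(edges)
    cursor = row_offsets[:-1].copy()  # next free slot in each row
    for src, dst in edges:
        column_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, column_indices


def neighbors(row_offsets, column_indices, v):
    """Adjacency list of v is the slice between two consecutive row offsets."""
    return column_indices[row_offsets[v]:row_offsets[v + 1]]


# Hypothetical undirected graph 0-1, 0-2, 1-2, 1-3, stored as 8 directed edges.
edges = [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (3, 1)]
row_offsets, column_indices = build_csr(4, edges)
```

The extra (n+1)-th row offset is what lets every adjacency list be read as R[v] .. R[v+1]-1 without a special case for the last vertex.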

8-12 Breadth-first search (BFS)
1. Pick a source node
2. Rank every vertex by the length of the shortest path from the source (or by predecessor vertex)

13 BFS applications
- Algorithmic primitive: reachability (e.g., garbage collection), belief propagation (statistical inference), finding community structure, path-finding
- Simple performance analog for many applications: pointer-chasing, work queues
- Core benchmark kernel: Graph 500, Rodinia, Parboil

14 The sequential algorithm
Cycles newly-discovered-but-unvisited vertices through a queue:
1. Dequeue vertex v
2. For each neighbor n_i of v that has not yet been discovered:
   dist[n_i] = dist[v] + 1
   Enqueue n_i
[Figure: example graph on vertices a..v; the queue goes from Q: {a} to Q: {b, c, d, e, f}]
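The queue-based procedure can be sketched as follows over CSR arrays. The use of -1 as the "not yet discovered" marker and the 5-vertex example graph are assumptions made for illustration.

```python
from collections import deque


def bfs_sequential(row_offsets, column_indices, source):
    """Label every reachable vertex with its hop distance from source."""
    n = len(row_offsets) - 1
    dist = [-1] * n  # -1 marks "not yet discovered" (an assumption of this sketch)
    dist[source] = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for offset in range(row_offsets[v], row_offsets[v + 1]):
            nbr = column_indices[offset]
            if dist[nbr] == -1:  # enqueue each vertex only once
                dist[nbr] = dist[v] + 1
                queue.append(nbr)
    return dist


# Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 3-4.
row_offsets = [0, 2, 4, 5, 7, 8]
column_indices = [1, 2, 0, 3, 0, 1, 4, 3]
dist = bfs_sequential(row_offsets, column_indices, 0)
```

Each vertex is enqueued once and each edge is inspected once, which is the O(n+m) bound that the parallel variants are measured against.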

15 Common parallelization strategies
- Level-synchronous: parallel work done in epochs by level
  a) Quadratic work
  b) Linear work
- Randomized: parallel work done within independent sub-problems
  a) Ullman-Yannakakis

16-19 Quadratic-work approach
1. Inspect every vertex at every time step (e.g., one thread per vertex)
2. If updated, update its neighbors
3. Repeat until no further updates

20 Quadratic-work approach
Input: Vertex set V, row-offsets array R, column-indices array C, source vertex s
Output: Array dist[0..n-1] with dist[v] holding the distance from s to v

    parallel for (i in V) :                      // kernel 1: initialize
        dist[i] := ∞
    dist[s] := 0
    iteration := 0
    do :
        done := true
        parallel for (i in V) :                  // kernel 2: one launch per iteration
            if (dist[i] == iteration)
                done := false
                for (offset in R[i] .. R[i+1]-1) :
                    j := C[offset]
                    if (dist[j] == ∞)            // leave already-labeled vertices alone
                        dist[j] := iteration + 1
        iteration++
    while (!done)

Good: trivial data-parallel GPU stencils
Bad: worst-case quadratic O(n^2 + m) work
Harish et al., Deng et al., Hong et al.
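A serial Python emulation of this approach, given as a sketch only: the inner loop over all vertices stands in for the parallel kernel, and the `inspected` counter is a hypothetical addition that makes the extra work visible. The guard that skips already-labeled neighbors is an assumption of this sketch; without it a vertex would relabel its own parent on an undirected graph.

```python
import math


def bfs_quadratic(row_offsets, column_indices, source):
    """Sweep all n vertices once per BFS level; returns (dist, vertices inspected)."""
    n = len(row_offsets) - 1
    dist = [math.inf] * n
    dist[source] = 0
    iteration = 0
    inspected = 0
    done = False
    while not done:
        done = True
        for i in range(n):  # stands in for "parallel for": one thread per vertex
            inspected += 1
            if dist[i] == iteration:
                done = False
                for offset in range(row_offsets[i], row_offsets[i + 1]):
                    j = column_indices[offset]
                    if dist[j] == math.inf:  # do not overwrite labeled vertices
                        dist[j] = iteration + 1
        iteration += 1
    return dist, inspected


# Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 3-4.
row_offsets = [0, 2, 4, 5, 7, 8]
column_indices = [1, 2, 0, 3, 0, 1, 4, 3]
dist, inspected = bfs_quadratic(row_offsets, column_indices, 0)
```

On this 5-vertex graph the search is 3 levels deep, so the loop runs 5 sweeps (the last one finds nothing) and inspects 25 vertices to label 5. Sweeps grow with search depth, which is where the quadratic worst case comes from.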

21 Active edges per iteration
[Figure: six panels plotting vertices (millions) against search depth: Wikipedia, 3D cubic lattice, R-MAT small-world, Europe road atlas, PDE-constrained optimization, auto transmission mesh]

22 Linear-work approach
1. Expand neighbors in parallel: enqueue adjacency lists (vertex frontier to edge frontier)
2. Filter out previously-visited vertices in parallel (edge frontier to vertex frontier)
3. Repeat
Good: linear O(n+m) work
Merrill et al. (PPoPP '12), Luo et al., etc.
Challenging: need shared queues for tracking frontiers
[Figure: active vertex identifiers versus search depth, showing the logical edge frontier and the logical vertex frontier]
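The expand/filter loop can be sketched serially as below. This is an illustration of the frontier bookkeeping only, not the GPU implementation; the `edges_expanded` counter and the example graph are hypothetical.

```python
def bfs_frontier(row_offsets, column_indices, source):
    """Level-synchronous BFS that only touches the current frontier."""
    n = len(row_offsets) - 1
    dist = [-1] * n  # -1 marks "not yet visited" (an assumption of this sketch)
    dist[source] = 0
    vertex_frontier = [source]
    depth = 0
    edges_expanded = 0
    while vertex_frontier:
        # Expansion: concatenate the adjacency lists of the vertex frontier.
        edge_frontier = []
        for v in vertex_frontier:
            edge_frontier.extend(column_indices[row_offsets[v]:row_offsets[v + 1]])
        edges_expanded += len(edge_frontier)

        # Contraction: keep only neighbors that were never visited before.
        depth += 1
        vertex_frontier = []
        for nbr in edge_frontier:
            if dist[nbr] == -1:
                dist[nbr] = depth
                vertex_frontier.append(nbr)
    return dist, edges_expanded


# Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 3-4.
row_offsets = [0, 2, 4, 5, 7, 8]
column_indices = [1, 2, 0, 3, 0, 1, 4, 3]
dist, edges_expanded = bfs_frontier(row_offsets, column_indices, 0)
```

Every vertex enters the vertex frontier at most once, so the total edge-frontier size over the whole search equals m (8 here).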

23 Quadratic vs. linear
3D Poisson lattice (300^3 = 27M vertices): 130x more work!!
[Figure: referenced vertex identifiers (millions) versus BFS iteration, quadratic and linear]

24-25 Linear-work challenges for parallelization
Generic challenges
1. Load imbalance between cores: coarse work-stealing queues
2. Bandwidth inefficiency: use a status bitmask when filtering already-visited nodes
GPU-specific challenges
1. Cooperative allocation (global queues)
2. Load imbalance within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (i.e., the benign race condition)

26 1) Cooperative data movement (queuing)
Problem: need shared work queues
GPU rate limits (C2050):
- 16 billion vertex-identifiers / sec (14 GB/s)
- Only 67M global-atomics / sec (238x slowdown)
- Only 600M smem-atomics / sec (27x slowdown)
Solution: compute enqueue offsets using prefix sum
[Figure: available bandwidth (GB/s) versus frequency of atomic ops per vertex dequeued]

27 Prefix sum for allocation
- Each output is computed to be the sum of the previous inputs
- O(n) work
- Use results as scatter offsets
- Fits the GPU machine model well: proceeds at copy bandwidth, only ~8 thread-instructions per input
[Figure: threads t0..t3 with their allocation requirements, the result of the prefix sum, and the packed output]
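A small sketch of allocation by prefix sum, assuming each "thread" holds a short list of items it wants to enqueue. The function names and the sample data are hypothetical; the scan here is serial, whereas a GPU would compute it cooperatively.

```python
def exclusive_scan(counts):
    """Return (offsets, total) where offsets[i] is the sum of counts[0..i-1]."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running


def scatter_with_scan(items_per_thread):
    """Pack per-thread outputs into one array without any atomic counter."""
    counts = [len(items) for items in items_per_thread]
    offsets, total = exclusive_scan(counts)
    output = [None] * total
    # Each thread owns a disjoint range, so the writes never conflict.
    for t, items in enumerate(items_per_thread):
        for k, item in enumerate(items):
            output[offsets[t] + k] = item
    return offsets, output


offsets, output = scatter_with_scan([["a0", "a1"], ["b0"], [], ["d0", "d1", "d2"]])
```

Because every thread learns its own write offset from the scan, no thread has to increment a shared queue pointer.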

28 Prefix sum for edge expansion
[Figure: threads t0..t3 with their adjacency list sizes, the result of the prefix sum, and the expanded edge frontier]
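One way to sketch edge expansion driven by a prefix sum over adjacency list sizes is shown below. Here each output slot looks up its owning frontier vertex with a binary search over the scanned offsets; that gather-by-search formulation is an assumption of this sketch rather than the scheme on the slide, and all names are hypothetical.

```python
from bisect import bisect_right


def expand_frontier(row_offsets, column_indices, vertex_frontier):
    """Expand a vertex frontier into a densely packed edge frontier."""
    degrees = [row_offsets[v + 1] - row_offsets[v] for v in vertex_frontier]

    # Exclusive prefix sum of the degrees: where each adjacency list starts.
    starts, total = [], 0
    for d in degrees:
        starts.append(total)
        total += d

    edge_frontier = [None] * total
    for slot in range(total):  # one logical thread per output slot
        k = bisect_right(starts, slot) - 1  # frontier position owning this slot
        v = vertex_frontier[k]
        edge_frontier[slot] = column_indices[row_offsets[v] + (slot - starts[k])]
    return starts, edge_frontier


# Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 3-4.
row_offsets = [0, 2, 4, 5, 7, 8]
column_indices = [1, 2, 0, 3, 0, 1, 4, 3]
starts, edge_frontier = expand_frontier(row_offsets, column_indices, [1, 2, 3])
```

Every output slot does the same amount of work regardless of how skewed the adjacency list sizes are.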

29 Data-flow of efficient CTA scan (granularity coarsening)
[Figure: x16, barrier, x4, x1, x4, barrier, x16]
(For good GPU scan details, see Baxter's ...)

30-31 2) Poor load balancing within SIMD lanes
Problem: large variance in adjacency list sizes, exacerbated by wide SIMD widths
Solution: cooperative neighbor expansion. Enlist nearby threads to help process each adjacency list in parallel
a) Bad: serial expansion & processing
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)
[Figure: assignment of threads to adjacency list entries for a), b), and c)]

32 3) Poor locality within SIMD lanes
Problem: the referenced adjacency lists are arbitrarily located
- Exacerbated by wide SIMD widths
- Can't afford to have SIMD threads streaming through unrelated data
Solution: cooperative neighbor expansion
a) Bad: serial expansion & processing
b) Slightly better: coarse warp-centric parallel expansion
c) Best: fine-grained parallel expansion (packed by prefix sum)
[Figure: assignment of threads to adjacency list entries for a), b), and c)]

33 4) SIMD simultaneous discovery
- Duplicates in the edge frontier mean redundant work
- Exacerbated by wide SIMD
- Compounded every iteration
[Figure: frontier after iterations 1 to 4]

34 4) SIMD simultaneous discovery
- Normally not a problem for CPU implementations (serial adjacency list inspection)
- A big problem for cooperative SIMD expansion
Solution: localized duplicate removal using hashes in local scratch
[Figure: vertices (millions) versus search depth for the logical edge frontier, logical vertex frontier, actually expanded, and actually contracted; annotations: spatially-descriptive datasets, power-law datasets]

35-36 Single-socket comparison
[Table: one row per graph (europe.osm, grid5pt.5000, hugebubbles-00020, grid7pt.300, nlpkkt160, audikw1, cage15, kkt_power, copapersciteseer, wikipedia, kron_g500-logn20, random.2mv.128me, rmat.2mv.128me); columns: spy plot, average search depth, billion TE/s and parallel speedup on Nehalem (4-core or 8-core), billion TE/s and parallel speedup on Tesla C2050]
Notes: Core i7 4-core (Leiserson et al.); EX Xeon 8-core (Agarwal et al.); vs. 3.4 GHz Core i7 2600K (Sandy Bridge)

38-39 Maximal independent set
A vertex subset U ⊆ V of an undirected graph G = (V, E)
- U is independent if no two vertices in U are adjacent
- U is maximal if no node can be added without violating independence
[Figure: graph on vertices a, b, c] Both {a} and {b, c} are maximal independent sets
(Finding the maximum independent set is generally NP-hard; its decision version is NP-complete.)

40 MIS applications
Graph coloring (vertex coloring)
- Co-scheduling sets of independent resources
- Useful for GPUs: avoid atomic updates by touching items in different kernels
Matching (edge coloring)
- Pair-wise assignments
- Scheduling of jobs onto hosts
- Matching people with desks, dance partners, kidneys, etc.
- Run MIS on G' where: there is a node in G' for every edge in G, and two nodes are connected in G' if the corresponding edges in G are adjacent
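The reduction from matching to MIS can be sketched as follows: build the derived graph (one node per edge of G, linked when two edges share an endpoint), then take any maximal independent set of it. The function names, the first-fit MIS helper, and the 4-cycle example are all hypothetical.

```python
def line_graph(edges):
    """One node per edge of G; two nodes are linked if the edges share an endpoint."""
    adj = {i: set() for i in range(len(edges))}
    for i, (a, b) in enumerate(edges):
        for j in range(i + 1, len(edges)):
            c, d = edges[j]
            if {a, b} & {c, d}:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def first_fit_mis(adj):
    """Sequential MIS: take the lowest remaining node, drop it and its neighbors."""
    mis, removed = [], set()
    for v in sorted(adj):
        if v not in removed:
            mis.append(v)
            removed.add(v)
            removed |= adj[v]
    return mis


edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-cycle
matching = [edges[i] for i in first_fit_mis(line_graph(edges))]
```

An independent set in the derived graph is a set of edges of G with no shared endpoints, which is exactly a matching; maximality carries over as well.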

41 Example: MIS(2) for mesh refinement in algebraic multi-grid (Bell et al., Cohen et al., etc.)

42 The sequential algorithm
Repeatedly add a vertex to the MIS, removing it and its neighbors from the graph:
1. Select first identifier v from V
2. Add v to MIS
3. Remove v and neighbors(v) from V
4. Repeat until V = {}
[Figure: example graph; MIS: {} becomes MIS: {a} after the first step]
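A minimal sketch of the sequential procedure over CSR arrays. The two small graphs below are hypothetical encodings of a three-vertex star (one hub adjacent to two leaves) with different vertex numberings.

```python
def mis_sequential(row_offsets, column_indices):
    """Greedy MIS: scan identifiers in order, taking every vertex still remaining."""
    n = len(row_offsets) - 1
    remaining = [True] * n
    mis = []
    for v in range(n):
        if remaining[v]:
            mis.append(v)
            remaining[v] = False
            # Remove the neighbors of v from further consideration.
            for offset in range(row_offsets[v], row_offsets[v + 1]):
                remaining[column_indices[offset]] = False
    return mis


# Star with the hub numbered 0 (neighbors 1 and 2).
hub_first = mis_sequential([0, 2, 3, 4], [1, 2, 0, 0])
# Same star with the hub numbered 2 (neighbors 0 and 1).
hub_last = mis_sequential([0, 1, 2, 4], [2, 2, 0, 1])
```

The result depends on identifier order: with the hub first the MIS is just the hub, with the hub last it is the two leaves. Both are maximal, but only the second is maximum.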

43 The parallel algorithm (Luby85)
Repeatedly add sets to the MIS, removing them and their neighbors from the graph:
1. Randomly permute identifier ordering of V
2. Identify all m_i such that identifier m_i < all neighbors(m_i)
3. Add all m_i to MIS
4. Remove m_i and neighbors(m_i) from V
5. Repeat until V = {}
Generally converges after O(log n) steps
[Figure: example graph with permuted identifiers; MIS: {} and MIS: {a, b, c, d, e, f, k, l, p}]
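The rounds above can be sketched serially as below. Drawing a fresh random permutation in every round is an assumption of this sketch (the steps could also be read as permuting once); the adjacency-dictionary input, the seed parameter, and the example graph are hypothetical.

```python
import random


def luby_mis(adj, seed=0):
    """Luby-style MIS over an adjacency dict {vertex: set_of_neighbors}."""
    rng = random.Random(seed)
    remaining = set(adj)
    mis = set()
    rounds = 0
    while remaining:
        # Step 1: random permutation of the remaining identifiers.
        order = list(remaining)
        rng.shuffle(order)
        rank = {v: r for r, v in enumerate(order)}

        # Step 2: local minima, i.e. vertices ranked below all remaining neighbors.
        winners = {v for v in remaining
                   if all(rank[v] < rank[u] for u in adj[v] if u in remaining)}

        # Steps 3 and 4: add the winners, remove them and their neighbors.
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v]
        remaining -= removed
        rounds += 1
    return mis, rounds


# Hypothetical undirected graph.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
mis, rounds = luby_mis(adj)
```

Two adjacent vertices can never both be local minima, so each round adds an independent set, and the lowest-ranked remaining vertex always wins, so every round makes progress.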

44 Data-parallel GPU approach
Input: Vertex set V, row-offsets array R, column-indices array C
Output: Array MIS[0..n-1] with MIS[v] holding the membership flag of v in the constructed maximal independent set (1 = member, -1 = excluded)

    parallel for (i in V) :                      // kernel 1: initialize
        MIS[i] := 0
    do :
        done := true
        parallel for (i in V) :                  // kernel 2: one launch per round
            if (MIS[i] == 0)
                dependent_i := false
                done := false
                local_min_i := i
                for (offset in R[i] .. R[i+1]-1) :
                    j := C[offset]
                    if (MIS[j] == 1)
                        dependent_i := true
                    else if ((MIS[j] == 0) && (permute(j) < permute(i)))
                        local_min_i := j
                if (dependent_i)
                    MIS[i] := -1
                else if (local_min_i == i)
                    MIS[i] := 1
    while (!done)

Good: trivial data-parallel GPU stencils (e.g., CUSP library, Hoberock, Cohen, Wang, etc.)
Bad: average case O(m log n) work because edges are never removed
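A serial Python emulation of the flag-based approach, offered as a sketch: the loop over all vertices stands in for the parallel kernel, and the `priority` list is a hypothetical stand-in for permute(), assumed to hold distinct values.

```python
def mis_data_parallel(row_offsets, column_indices, priority):
    """Flag-based MIS: 0 = undecided, 1 = in the MIS, -1 = excluded."""
    n = len(row_offsets) - 1
    state = [0] * n
    sweeps = 0
    done = False
    while not done:
        done = True
        for i in range(n):  # stands in for "parallel for": one thread per vertex
            if state[i] != 0:
                continue
            done = False
            dependent = False
            is_local_min = True
            for offset in range(row_offsets[i], row_offsets[i + 1]):
                j = column_indices[offset]
                if state[j] == 1:
                    dependent = True  # a neighbor is already in the MIS
                elif state[j] == 0 and priority[j] < priority[i]:
                    is_local_min = False  # an undecided neighbor outranks i
            if dependent:
                state[i] = -1
            elif is_local_min:
                state[i] = 1
        sweeps += 1
    return state, sweeps


# Hypothetical undirected graph with edges 0-1, 0-2, 1-3, 3-4.
row_offsets = [0, 2, 4, 5, 7, 8]
column_indices = [1, 2, 0, 3, 0, 1, 4, 3]
state, sweeps = mis_data_parallel(row_offsets, column_indices, [3, 0, 4, 1, 2])
```

Note that every sweep still walks all n vertices and rereads the adjacency lists of undecided ones, even after most vertices are settled; that is the repeated work the "edges are never removed" remark refers to.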

45 Future work: actually remove vertices from G
- Don't visit all n vertices at every time step!
- Should significantly improve utilization in subsequent rounds (higher density of active vertices)
- CSR isn't a particularly friendly format for vertex removal
  Easy: use prefix sum to compact the row_offsets array (double the size to pair segment-begin offsets with explicit segment-end offsets)
  Hard: no great way to find and remove vertex identifiers from adjacency lists
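The "easy" half can be sketched as a prefix-sum compaction of the row offsets into explicit begin/end pairs. The function name, the `keep` flags, and the returned `vertex_ids` array are hypothetical; the column-indices array is left untouched, so removed vertices can still appear inside surviving adjacency lists (the "hard" half).

```python
def compact_rows(row_offsets, keep):
    """Compact surviving rows into (vertex_ids, segment_begin, segment_end)."""
    n = len(row_offsets) - 1

    # Exclusive prefix sum over the keep flags gives each survivor its new slot.
    dest, total = [], 0
    for v in range(n):
        dest.append(total)
        total += 1 if keep[v] else 0

    vertex_ids = [None] * total
    segment_begin = [None] * total
    segment_end = [None] * total
    for v in range(n):  # each surviving vertex writes to its own slot
        if keep[v]:
            vertex_ids[dest[v]] = v
            segment_begin[dest[v]] = row_offsets[v]
            segment_end[dest[v]] = row_offsets[v + 1]
    return vertex_ids, segment_begin, segment_end


row_offsets = [0, 2, 4, 5, 7, 8]
vertex_ids, segment_begin, segment_end = compact_rows(
    row_offsets, [True, False, True, False, True])
```

Explicit end offsets are needed because, once rows are dropped, the end of one surviving segment no longer coincides with the beginning of the next.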

46 Summary
- Purely data-parallel approaches can be uncompetitive (e.g., one thread per vertex at each time step)
- Dynamic workload management: prefix sum (vs. atomic-add)
- Utilization requires fine-grained expansion and contraction
- GPUs can be very amenable to dynamic, cooperative problems

49 Experimental corpus
[Table columns: graph, spy plot, description, average search depth, vertices (millions), edges (millions)]
europe.osm: road network
grid5pt.5000: 2D Poisson stencil
hugebubbles-00020: 2D mesh
grid7pt.300: 3D Poisson stencil
nlpkkt160: constrained optimization problem
audikw1: finite element matrix
cage15: transition probability matrix
kkt_power: optimization (KKT)
copapersciteseer: citation network
wikipedia: Wikipedia page links
kron_g500-logn20: Graph500 random graph
random.2mv.128me: uniform random graph
rmat.2mv.128me: RMAT random graph

50 Comparison of expansion techniques

51 Coupling of expansion & contraction phases
Alternatives:
- Expand-contract: wholly realize the vertex frontier in DRAM; suitable for all types of BFS iterations
- Contract-expand: wholly realize the edge frontier in DRAM; even better for small, fleeting BFS iterations
- Two-phase: wholly realize both frontiers in DRAM; even better for large, saturating BFS iterations (surprising!!)
[Figure: logical edge frontier and logical vertex frontier versus BFS iteration]

52 Comparison of coupling approaches
[Figure: 10^9 edges / sec (normalized) for Expand-Contract, Contract-Expand, 2-Phase, and Hybrid]

53 Local duplicate culling
Contraction uses local collision hashes in smem scratch space:
- Hash per warp (instantaneous coverage)
- Hash per CTA (recent history coverage)
Redundant work < 10% in all datasets
No atomics needed
[Figure: redundant expansion factor (log scale) per dataset, simple vs. local duplicate culling]
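The history-style hash can be illustrated with a tiny direct-mapped table, processed serially here. This is a sketch of the idea only: the table size, the modulo hash, and the function name are assumptions, and the real kernels keep the table in shared memory and probe it from many threads at once.

```python
def cull_duplicates(edge_frontier, table_size=8):
    """Drop vertex ids that collide with an identical, recently seen id."""
    table = [None] * table_size  # small scratch table, one id per slot
    kept = []
    for v in edge_frontier:
        slot = v % table_size  # hypothetical hash function
        if table[slot] == v:
            continue  # same id seen recently: cull it
        table[slot] = v  # otherwise remember it, evicting whatever was there
        kept.append(v)
    return kept


kept = cull_duplicates([5, 5, 13, 5, 7, 7])
```

The filter is conservative: it never drops a vertex that has not been seen, but it can miss duplicates after an eviction (5 reappears once 13 has displaced it). That is acceptable because the later visited-status check still removes whatever slips through.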

54 Multi-GPU traversal
Expand neighbors, sort by GPU (with filter), read from peer GPUs (with filter), repeat
[Figure: 10^9 edges / sec (normalized) for C2050 x1, x2, x3, ...]
