GPU Sparse Graph Traversal. Duane Merrill

1 GPU Sparse Graph Traversal Duane Merrill

2 Breadth-first search of graphs (BFS)
1. Pick a source node
2. Rank every vertex by the length of the shortest path from the source
   (or label every vertex by its predecessor in this traversal order)

3 The punchline
Conventional wisdom: GPUs are poorly suited for dynamic problems like BFS
- Dynamic work distribution seems hard to implement
- Requires cooperation
We show GPUs provide exceptional performance
- Billions of traversed edges / second / socket vs. hundreds of millions for multi-core (SC '10, SPAA '10)
- For diverse synthetic and real-world datasets
  - Small-diameter (e.g., RMAT and social networks)
  - Large-diameter (e.g., road networks and spatial lattices)

4 BFS applications
A common algorithmic building block
- Tracking reachable heap during garbage collection
- Belief propagation in statistical inference
- Finding community structure in networks
- Path-finding
Simple performance-analog for many applications
- Pointer-chasing
- Work queues
Core benchmark kernel
- Graph 500
- Rodinia
- Parboil
(Figures: 3D lattice; Wikipedia '07)

5 Level-synchronous BFS parallelization strategies

6 Quadratic-work approach
1. Inspect every vertex at every timestep
2. If updated, update its neighbors
3. Repeat
Details
- Quadratic O(n^2 + m) work
- Trivial data-parallel stencils
- One thread per vertex
(Figure: example graph with one thread per vertex, t0 through t17)
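The three steps above can be sketched serially in Python. This is only an illustration under assumptions: the function name `bfs_quadratic`, the adjacency-list input `adj`, and the inspection counter are hypothetical, and the inner `for v in range(n)` loop stands in for "one thread per vertex".

```python
def bfs_quadratic(adj, source):
    """Level-synchronous BFS that inspects every vertex at every timestep.

    adj is a list of neighbor lists (an assumed input format).
    Returns (distances, number of vertex inspections).
    """
    n = len(adj)
    dist = [-1] * n              # -1 marks "not yet reached"
    dist[source] = 0
    depth = 0
    inspections = 0
    changed = True
    while changed:
        changed = False
        for v in range(n):       # stands in for one thread per vertex
            inspections += 1
            if dist[v] == depth:             # v was updated in the last step
                for u in adj[v]:
                    if dist[u] == -1:
                        dist[u] = depth + 1  # update its neighbors
                        changed = True
        depth += 1
    return dist, inspections


# A 5-vertex path graph: the worst case for this approach.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(bfs_quadratic(path, 0))
```

On the path graph the search needs as many timesteps as there are vertices, so the inspection count grows as n^2 even though only one vertex is discovered per step.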

7 Linear-work approach
1. Expand neighbors of unexplored nodes
   Expand: vertex-frontier -> edge-frontier
2. Filter neighbors (previously visited and duplicates)
   Contract: edge-frontier -> vertex-frontier
3. Repeat
Details
- Linear O(n+m) work
- Need cooperative buffers for tracking frontiers
(Figure: items explored vs. distance from source, for the logical edge frontier and the logical vertex frontier)
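A minimal serial sketch of the expand/contract loop, assuming an adjacency-list input. The names `bfs_linear`, `vertex_frontier`, and `edge_frontier` are illustrative, and the Python lists stand in for the cooperative frontier buffers a GPU implementation would need.

```python
def bfs_linear(adj, source):
    """Level-synchronous BFS using explicit vertex and edge frontiers.

    Returns (distances, total number of edge-frontier items processed).
    """
    dist = [-1] * len(adj)       # -1 marks "not yet visited"
    dist[source] = 0
    vertex_frontier = [source]
    depth = 0
    edges_touched = 0
    while vertex_frontier:
        # Expand: vertex frontier -> edge frontier
        edge_frontier = [u for v in vertex_frontier for u in adj[v]]
        edges_touched += len(edge_frontier)
        # Contract: edge frontier -> vertex frontier.
        # Labeling a vertex the first time it is seen filters both
        # previously visited vertices and duplicates in this frontier.
        depth += 1
        vertex_frontier = []
        for u in edge_frontier:
            if dist[u] == -1:
                dist[u] = depth
                vertex_frontier.append(u)
    return dist, edges_touched


# Diamond graph: 0 - {1, 2} - 3, stored as 8 directed edges.
diamond = [[1, 2], [0, 3], [0, 3], [1, 2]]
print(bfs_linear(diamond, 0))
```

Each directed edge enters an edge frontier exactly once, which is where the O(n+m) bound comes from.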

8 The problem with quadratic approach
Too much work!!!
(Figure: referenced nodes & edges (millions) per BFS iteration, quadratic vs. linear, on a 3D Poisson lattice (300^3 = 27M vertices), with the work ratio annotated)

9 Goal: Demonstrate O(m+n) traversal on diverse datasets (not a one-trick pony)
(Figure: six panels plotting edge-frontier and vertex-frontier size, in millions of vertices, against search depth)
- Wikipedia (social)
- 3D Poisson grid (cubic lattice)
- R-MAT (random, power-law, small-world)
- Europe road atlas (Euclidean space)
- PDE-constrained optimization (non-linear KKT)
- Auto transmission manifold (tetrahedral mesh)

10 Challenges

11 Linear-work challenges for parallelization
General challenges
1. Load imbalance between processing elements
   - Work-stealing queues
2. Bandwidth inefficiency
   - Use status bitmask when filtering already-visited nodes
GPU-specific challenges
1. Cooperative placement within global queues
2. Poor load balancing within SIMD lanes
3. Poor locality within SIMD lanes
4. Simultaneous discovery (benign race condition)

12 (1) Cooperative placement within global queues
Problem: Need shared work queues
- Threads must cooperatively determine enqueue offsets
GPU rate limits (C2050):
- Can copy 16 billion vertex identifiers / s
- Only 67M global atomics / s (238x slowdown)
- Only 600M smem atomics / s (27x slowdown)
Solution:
- Compute local offsets using CTA-wide prefix sum
- Compute CTA offset using a single coarse-grained atomic-add
(Figure: dynamic expansion; vertices A, C, D, G)
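The two-step solution can be emulated serially. In this sketch the name `enqueue_by_cta`, the `cta_size` parameter, and the `atomics` counter are all assumptions for illustration; a plain integer stands in for the queue's global tail pointer, and each increment of it represents the single atomic-add a CTA would issue.

```python
from itertools import accumulate


def enqueue_by_cta(counts, cta_size):
    """counts[i] is the number of items thread i wants to enqueue.

    Returns (per-thread enqueue offsets, final queue length,
             number of simulated global atomic-adds).
    """
    global_tail = 0      # stands in for the global queue counter
    atomics = 0
    offsets = []
    for start in range(0, len(counts), cta_size):
        cta_counts = counts[start:start + cta_size]
        # CTA-wide exclusive prefix sum gives each thread a local offset.
        local = [0] + list(accumulate(cta_counts))[:-1]
        cta_total = sum(cta_counts)
        # One coarse-grained atomic-add reserves space for the whole CTA.
        cta_base = global_tail
        global_tail += cta_total
        atomics += 1
        offsets.extend(cta_base + off for off in local)
    return offsets, global_tail, atomics


print(enqueue_by_cta([2, 1, 0, 3, 1, 2], cta_size=4))
```

With six threads in CTAs of four, only two atomic-adds are issued instead of one per thread (or one per item), which is the point of the design: the scarce operation is amortized over a whole CTA.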

13 Prefix sum for allocation
- Each output is computed to be the sum of the previous inputs
  - O(n) work
  - Use results as scatter offsets
- Fits the GPU machine model well [1]
  - Proceed at copy bandwidth
  - Only ~8 thread-instructions per input
(Figure: input (e.g., adjacency list sizes) for A, C, D, G, one per thread; result of prefix sum; frontier output)
[1] Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS , University of Virginia
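As an illustration of an O(n)-work prefix sum, here is the textbook up-sweep/down-sweep (Blelloch-style) exclusive scan, emulated serially. It is not claimed to be the strategy of the cited report; the function name `exclusive_scan` and the input sizes are made up for the example. Each inner `for` loop corresponds to one parallel step.

```python
def exclusive_scan(values):
    """Work-efficient exclusive prefix sum (up-sweep, then down-sweep).

    Returns (offsets, total), where offsets[i] == sum(values[:i]).
    """
    n = len(values)
    size = 1
    while size < n:              # pad to a power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in a balanced tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    total = data[size - 1]
    data[size - 1] = 0           # identity element at the root

    # Down-sweep: push prefixes back down the tree.
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2

    return data[:n], total


# Hypothetical adjacency-list sizes for four frontier vertices.
offsets, total = exclusive_scan([2, 1, 3, 4])
print(offsets, total)
```

The offsets say where each vertex's neighbors start in the output frontier, and `total` is the amount of queue space to reserve.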

14 (2) Poor load balancing within SIMD lanes
Problem: Large variance in adjacency list sizes
- E.g., power-law degree distributions
- Exacerbated by wide SIMD widths
Solution: Cooperative neighbor expansion
- Enlist nearby threads to help process each adjacency list in parallel
(Figure: thread-to-neighbor assignments for three strategies)
(a) Bad: Serial expansion & processing
(b) Slightly better: Coarse warp-centric parallel expansion
(c) Best: Fine-grained parallel expansion (packed by prefix sum)
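Strategy (c) can be sketched as follows, assuming the graph is stored as CSR row-offset and column-index arrays. The function name `expand_fine_grained` is hypothetical, and the binary search used to map an output slot back to its source vertex is an assumption of this sketch rather than something the slide specifies. The `for slot` loop stands in for one thread per output element.

```python
from bisect import bisect_right
from itertools import accumulate


def expand_fine_grained(row_offsets, column_indices, frontier):
    """Gather the neighbors of all frontier vertices, one output slot at a time."""
    degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
    # starts[i] is the first output slot owned by frontier[i];
    # starts[-1] is the total number of neighbors to gather.
    starts = [0] + list(accumulate(degrees))
    total = starts[-1]
    edge_frontier = [None] * total
    for slot in range(total):                    # one "thread" per slot
        owner = bisect_right(starts, slot) - 1   # which vertex owns this slot
        rank = slot - starts[owner]              # position within its list
        v = frontier[owner]
        edge_frontier[slot] = column_indices[row_offsets[v] + rank]
    return edge_frontier


# Vertex 0 has three neighbors, vertex 1 has none, vertex 2 has two.
row_offsets = [0, 3, 3, 5, 6]
column_indices = [1, 2, 3, 0, 3, 0]
print(expand_fine_grained(row_offsets, column_indices, [0, 1, 2]))
```

Every slot does the same small amount of work regardless of how skewed the degrees are, so no lane idles while a neighbor walks a long adjacency list.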

15 (3) Poor locality within SIMD lanes
Problem: The referenced adjacency lists are arbitrarily located
- Exacerbated by wide SIMD widths and DRAM transaction sizes
- Can't afford to have SIMD threads streaming through unrelated data
Solution: Cooperative neighbor expansion
(Figure: thread-to-neighbor assignments for three strategies)
(a) Bad: Serial expansion & processing
(b) Slightly better: Coarse warp-centric parallel expansion
(c) Best: Fine-grained parallel expansion (packed by prefix sum)

16 (4) SIMD simultaneous discovery
- Duplicates in edge-frontier yield redundant work
- Exacerbated by wide SIMD
- Compounded every iteration
(Figure: Iterations 1 through 4)

17 Simultaneous discovery
Normally not a problem for CPU implementations
- Low hardware parallelism
- Serial adjacency list inspection
A big problem for cooperative SIMD expansion
- Spatially-descriptive datasets
- Power-law datasets
Solution: Collision hashes in local scratch
1. Hash per warp (instantaneous coverage)
2. Hash per CTA (recent history coverage)
(Figure: vertices (millions) vs. search depth, for the logical edge-frontier, logical vertex-frontier, actually expanded, and actually contracted)
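A rough serial emulation of a per-warp collision hash, under stated assumptions: the function name `cull_with_hash`, the modulo hash, and the table size are hypothetical, and the write-then-read-back protocol is one plausible way to get best-effort culling without atomics, not necessarily the authors' exact scheme. Every lane writes its candidate into a small scratch table, then reads the slot back; a lane that finds its own vertex stored by a different lane drops out.

```python
def cull_with_hash(candidates, table_size=8):
    """Best-effort duplicate removal among the candidates held by one warp.

    candidates[lane] is the vertex id held by that lane.
    Unique vertices are never dropped; some duplicates may survive.
    """
    table = [None] * table_size
    # Phase 1: every lane writes (lane, vertex) into its hash slot.
    # The order of racing writes is arbitrary on hardware; here the last wins.
    for lane, v in enumerate(candidates):
        table[v % table_size] = (lane, v)
    # Phase 2: every lane reads its slot back.
    survivors = []
    for lane, v in enumerate(candidates):
        winner_lane, winner_v = table[v % table_size]
        if winner_v == v and winner_lane != lane:
            continue      # another lane holds the same vertex: duplicate
        survivors.append(v)
    return survivors


print(cull_with_hash([5, 9, 5, 3, 9, 5]))   # duplicates of 5 and 9 removed
print(cull_with_hash([1, 1, 9]))            # 1 and 9 collide: a duplicate survives
```

Because a hash collision only lets a duplicate through and never discards a unique vertex, the filter is safe to apply before the authoritative visited-status check.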

18 Absolute and relative performance

19 Single-socket performance comparison
Columns: Graph; Spy Plot; Average Search Depth; Sequential, Sandybridge (Billion TE/s); Parallel, Nehalem (Billion TE/s, parallel speedup); Parallel, Tesla C2050 (Billion TE/s, parallel speedup)
Graphs: europe.osm, grid5pt, hugebubbles, grid7pt, nlpkkt, audikw, cage, kkt_power, copapersciteseer, wikipedia, kron_g500-logn, random.2mv.128me, rmat.2mv.128me
CPU parallel speedup (core count):
  grid7pt            (4-core)  3.0x
  nlpkkt             (4-core)  1.8x
  cage               (4-core)  1.8x
  kkt_power          (4-core)  2.2x
  wikipedia          (4-core)  2.7x
  random.2mv.128me   (8-core)  5.0x
  rmat.2mv.128me     (8-core)  4.6x
Harmonic mean (across all / common datasets): 12x / 20x
Platforms: 3.4GHz Core i7 2600K; 2.5 GHz Core i7 4-core, Leiserson et al.; 2.7 GHz EX Xeon X core, Agarwal et al.

20 Summary
BFS conclusions
- Quadratic-work approaches are uncompetitive
  - Need linear-work algorithm
- Dynamic workload management:
  - Prefix sum instead of atomic read-modify-write ops
  - Fine-grained expansion and contraction for high utilization
Overall conclusions
- GPUs can be very amenable to dynamic, cooperative problems
- BFS is actually well-suited to GPU strengths
  - Exposes lots of concurrency
  - Is memory-bound
Merrill et al. High Performance and Scalable GPU Graph Traversal. Technical Report CS , Department of Computer Science, University of Virginia, Aug

21 Questions? Merrill, D., Garland, M., and Grimshaw, A. High Performance and Scalable GPU Graph Traversal. Technical Report CS , Department of Computer Science, University of Virginia. Aug

22 Arbitrary locality
- 3D (7-point) Poisson lattice: adjacency lists reference nearby vertices
- Wikipedia '07: adjacency lists do not reference nearby vertices
(Figures: sparsity pattern (adjacency matrix) for each graph)

23 Experimental corpus
Columns: Graph; Spy Plot; Description; Average Search Depth; Vertices (millions); Edges (millions)

Graph               Description
europe.osm          Road network
grid5pt             D Poisson stencil
hugebubbles         D mesh
grid7pt.300         3D Poisson stencil
nlpkkt160           Constrained optimization problem
audikw1             Finite element matrix
cage15              Transition prob. matrix
kkt_power           Optimization (KKT)
copapersciteseer    Citation network
wikipedia           Wikipedia page links
kron_g500-logn20    Graph500 random graph
random.2mv.128me    Uniform random graph
rmat.2mv.128me      RMAT random graph

24 Compressed sparse row (CSR) representation
- Column indices: O(m) array
- Row offsets: O(n) array
(Figure: logical adjacency matrix for vertices v0 through v3, with the column-indices array (entries [0] to [8]) and the row-offsets array (entries [0] to [4]))
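A small sketch of how the two CSR arrays relate, assuming the graph arrives as a list of directed (source, destination) pairs. The helper names `build_csr` and `neighbors` and the example edges are illustrative and are not the graph drawn on the slide.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from a list of directed (src, dst) edges."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    # row_offsets has n + 1 entries; vertex v owns
    # column_indices[row_offsets[v] : row_offsets[v + 1]].
    row_offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_offsets[v + 1] = row_offsets[v] + degree[v]
    column_indices = [0] * len(edges)
    cursor = row_offsets[:-1]        # next free slot per vertex (a copy)
    for src, dst in edges:
        column_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, column_indices


def neighbors(row_offsets, column_indices, v):
    """Return the adjacency list of vertex v."""
    return column_indices[row_offsets[v]:row_offsets[v + 1]]


edges = [(0, 1), (0, 2), (1, 3), (2, 0), (2, 3), (3, 1)]
row_offsets, column_indices = build_csr(4, edges)
print(row_offsets, column_indices, neighbors(row_offsets, column_indices, 2))
```

The difference of two consecutive row offsets is a vertex's degree, which is exactly the per-vertex count fed to the prefix sum during neighbor expansion.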

25 Coupling of expansion & contraction phases
Alternatives:
a) Expand-contract (one kernel)
   - Maintain vertex-frontier (unexplored nodes) in DRAM between BFS iterations: O(2n) bytes
   - Produce and consume edge-frontier online, in-core
b) Contract-expand (one kernel)
   - Maintain edge-frontier (untraversed edges) in DRAM between BFS iterations: O(2m) bytes
   - Produce and consume vertex-frontier online, in-core
c) Two-phase (two kernels)
   - Wholly produce both edge- and vertex-frontiers in DRAM: O(m+n) bytes

26 Coupling of expansion & contraction phases
Alternatives:
a) Expand-contract (vertex-frontier in DRAM)
   - Suitable for all types of BFS iterations
b) Contract-expand (edge-frontier in DRAM)
   - Even better for small, fleeting BFS iterations
c) Two-phase (both frontiers in DRAM)
   - Even better for large, saturating BFS iterations (surprising!!)
The two fused strategies (a & b above) suffer from the interaction between:
- TLB misses during expansion (neighbor gathering)
- Uncoalesced reads during contraction (status lookup)

27 Comparison of expansion techniques

28 Multi-GPU traversal
1. Expand neighbors
2. Sort by GPU (with filter)
3. Read from peer GPUs (with filter)
4. Repeat
(Figure: normalized 10^9 edges / sec for C2050 x1, C2050 x2, C2050 x3, C2050 x)
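The loop above can be emulated on one machine to show the data flow. Everything specific here is an assumption made for illustration: the name `multi_gpu_bfs`, the modulo rule for assigning vertices to GPUs, and filtering only on the read side (the slide also filters while sorting). The shared `dist` list is only ever written by a vertex's owner, so it behaves like per-GPU partitioned status.

```python
def multi_gpu_bfs(adj, source, num_gpus):
    """Serial emulation of BFS with vertices partitioned across GPUs."""
    def owner(v):
        return v % num_gpus          # assumed partitioning rule

    dist = [-1] * len(adj)
    dist[source] = 0
    frontiers = [[] for _ in range(num_gpus)]
    frontiers[owner(source)].append(source)
    depth = 0
    while any(frontiers):
        depth += 1
        # Each GPU expands its frontier and sorts neighbors into one bin
        # per destination GPU: bins[sender][receiver].
        bins = [[[] for _ in range(num_gpus)] for _ in range(num_gpus)]
        for g in range(num_gpus):
            for v in frontiers[g]:
                for u in adj[v]:
                    bins[g][owner(u)].append(u)
        # Each GPU reads its bin from every peer and filters visited vertices.
        frontiers = [[] for _ in range(num_gpus)]
        for g in range(num_gpus):
            for peer in range(num_gpus):
                for u in bins[peer][g]:
                    if dist[u] == -1:
                        dist[u] = depth
                        frontiers[g].append(u)
    return dist


diamond = [[1, 2], [0, 3], [0, 3], [1, 2]]
print(multi_gpu_bfs(diamond, 0, num_gpus=2))
```

The result does not depend on the number of GPUs, since each vertex's status is checked and written only by its owner.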

29 Comparison of coupling approaches
(Figure: normalized 10^9 edges / sec for Expand-Contract, Contract-Expand, 2-Phase, and Hybrid)

30 Local duplicate culling
Contraction uses local collision hashes in smem scratch space:
1. Hash per warp (instantaneous coverage)
2. Hash per CTA (recent history coverage)
- Redundant work reduced < 10% in all datasets
- No atomics needed
(Figure: redundant expansion factor (log scale), Simple vs. Local duplicate culling; bar labels: 4.2x, 1.1x, 2.4x, 2.6x, 1.7x, 1.2x, 1.1x, 1.4x, 1.1x, n/a, 1.3x)

31 CTA local prefix sum
(Figure: thread diagram of the prefix sum across the CTA's T threads, t0 through tT-1, in groups of T/4)

32 Filter duplicates within edge-frontier
Implement local collision hashes in smem scratch space during contraction
1. Hash per warp (instantaneous coverage)
2. Hash per CTA (recent history coverage)
- Redundant work reduced < 10% in all datasets
- No atomics needed

GPU Sparse Graph Traversal. Duane Merrill (NVIDIA), Michael Garland (NVIDIA), Andrew Grimshaw (Univ. of Virginia)