GPU Sparse Graph Traversal

Size: px

Start display at page:

Download "GPU Sparse Graph Traversal"

Gwen Mills
5 years ago
Views:

1 GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA

2 Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source (or by predecessor vertex)

BFS applications A common algorithmic building block Tracking reachable heap during garbage collection Belief propagation (statistical inference) Simple

3 BFS applications A common algorithmic building block Tracking reachable heap during garbage collection Belief propagation (statistical inference) Simple performance-analog for many applications Pointer-chasing Work queues Finding community structure Path-finding Core benchmark kernel Graph 500 Rodinia Parboil 3D lattice Wikipedia 07 3

4 Conventional wisdom GPUs would be poorly suited for BFS Not pleasingly data-parallel (if you want to solve it efficiently) Insufficient atomic throughput for cooperation 4

5 Conventional wisdom GPUs would be poorly suited for BFS Not pleasingly data-parallel (if you want to solve it efficiently) Insufficient atomic throughput for cooperation Issues with scaling to 10,000s of processing elements Workload imbalance, SIMD underutilization Not enough parallelism in all graphs 5

6 Billions of travesed edges/sec/socket The punchline We show GPUs provide exceptional performance Billions of traversed-edges per second per socket For diverse synthetic and real-world datasets Linear-work algorithm Tesla C2050 Multi-core (per socket) Sequential using efficient parallel prefix sum For cooperative data movement Average traversal depth Agarwal et al. (SC 10) & Leiserson et al. (SPAA 10) 6

7 Level-synchronous parallelization strategies UNIVERSITY of VIRGINIA

8 (a) Quadratic-work approach Typical GPU BFS (e.g., Harish et al., Deng et al. Hong et al.) 1. Inspect every vertex at every time step t 1 t 0 2. If updated, update its neighbors 3. Repeat t 5 t 4 t 2 t 6 t 3 t 7 t 8 t 9 t 10 Quadratic O(n2+m) work t 11 t 13 t 12 Trivial data-parallel stencils t 14 t15 One thread per vertex t 16 t 17 8

Vertex identifiers (b) Linear-work approach 1.

2 Logical Edge Frontier Logical Vertex Frontier 0.

9 Vertex identifiers (b) Linear-work approach 1. Expand neighbors (vertex frontier edge-frontier) 2. Filter out previously visited & duplicates (edge-frontier vertex frontier) 3. Repeat Logical Edge Frontier Logical Vertex Frontier 0.1 Linear O(n+m) work Need cooperative buffers for tracking frontiers 0 Distance from source 9

! 40 35 30 25 20 Quadratic Linear 3D Poisson Lattice (300 3 = 27M

10 Referenced vertex identifiers (millions) Quadratic vs. linear Quadratic is too much work!! Quadratic Linear 3D Poisson Lattice (300 3 = 27M vertices) x work BFS iteration 2.5x 2,300x speedup vs. quadratic GPU methods 10

Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Goal: GPU competence on diverse datasets (Not a one-trick pony ) 30 0.5 140 25 20 15 10 5 0.4 0.3 0.

11 Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Vertices (millions) Goal: GPU competence on diverse datasets (Not a one-trick pony ) Search Depth Search Depth Search Depth Wikipedia (social) 3D Poisson grid (cubic lattice) R-MAT (random, power-law, small-world) Search Depth Search Depth Europe road atlas (Euclidian space) PDE-constrained optimization (non-linear KKT) Auto transmission manifold (tetrahedral mesh) 11

12 Challenges UNIVERSITY of VIRGINIA

13 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues 2. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues) 2. Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 13

14 Linear-work challenges for parallelization Generic challenges 1. Load imbalance between cores Coarse workstealing queues 2. Bandwidth inefficiency Use status bitmask when filtering already-visited nodes GPU-specific challenges 1. Cooperative allocation (global queues) 2. Load imbalance within SIMD lanes 3. Poor locality within SIMD lanes 4. Simultaneous discovery (i.e., the benign race condition) 14

15 GB / sec (1) Cooperative data movement (queuing) Problem: Need shared work queues 1000 Transfer Rate GPU rate-limits (C2050): 16 billion vertex-identifiers / sec (124 GB/s) Only 67M global-atomics / sec (238x slowdown) Only 600M smem-atomics / sec (27x slowdown) 10 1 Solution: Compute enqueue offsets using prefixsum Vertex stride between atomic ops (log-scale) 15

16 Prefix sum for allocation Allocation requirement Result of prefix sum Output A 2 C 1 0 D 3 t 0 t 1 t 2 t 3 A 0 C 2 3 D 3 t 0 t 1 t 2 t Each output is computed to be the sum of the previous inputs O(n) work Use results as a scatter offsets Fits the GPU machine model well Proceed at copy-bandwidth Only ~8 thread-instructions per input 16

17 Prefix sum for edge expansion Adjacency list size A 2 C 1 0 D 3 Result of prefix sum Expanded edge frontier t 0 t 1 t 2 t 3 A 0 2 C 3 D 3 t 0 t 1 t 2 t

18 Efficient local scan (granularity coarsening) x16 barrier x4 x1 x4 barrier x16 18

19 (2) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t 2 b) Slightly better: Coarse warp-centric parallel expansion c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t 2 t 3 t 4 t 5 t 31 t 32 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 t 0 t 1 t 2 t 3 t 4 t 5 19

20 (2) Poor load balancing within SIMD lanes Problem: Solution: Large variance in adjacency list sizes Exacerbated by wide SIMD widths Cooperative neighbor expansion Enlist nearby threads to help process each adjacency list in parallel a) Bad: Serial expansion & processing t 1 t 3 t 0 t 2 b) Slightly better: Coarse warp-centric parallel expansion c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t 2 t 3 t 4 t 5 t 31 t 32 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 t 0 t 1 t 2 t 3 t 4 t 5 20

21 (3) Poor locality within SIMD lanes Problem: The referenced adjacency lists are arbitrarily located Exacerbated by wide SIMD widths a) Bad: Serial expansion & processing t 1 t 3 t 0 t 2 b) Slightly better: Coarse warp-centric parallel expansion Solution: Can t afford to have SIMD threads streaming through unrelated data t 0 t 1 t 2 t 3 t 4 t 5 t 31 t 32 t 33 t 34 t 35 t 36 t 37 t 63 t 64 t 65 t 66 t 67 t 68 t 69 t 95 Cooperative neighbor expansion c) Best: Fine-grained parallel expansion (packed by prefix sum) t 0 t 1 t 2 t 3 t 4 t 5 21

22 (4) SIMD simultaneous discovery Iteration 1 Iteration 2 Iteration 3 Iteration 4 Duplicates in the edge-frontier redundant work Exacerbated by wide SIMD Compounded every iteration 22

23 Vertices (millions) Simultaneous discovery Normally not a problem for CPU implementations Serial adjacency list inspection A big problem for cooperative SIMD expansion Logical edgefrontier Logical vertexfrontier Actually expanded Actually contracted Spatially-descriptive datasets 0.6 Power-law datasets 0.4 Solution: Localized duplicate removal using hashes in local scratch Search Depth 23

24 Absolute and relative performance UNIVERSITY of VIRGINIA 24

25 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C2050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ) 2.4 x x nlpkkt (4-core ) 3.0 x x audikw x cage (4-core ) 2.6 x x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3.2 x x kron_g500-logn x random.2mv.128me (8-core ) 7.0 x x rmat.2mv.128me (8-core ) 6.0 x x 2.5 GHz Core i7 4-core (Leiserson et al.) 2.7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 2600K (Sandybridge)

26 Single-socket comparison Graph Spy Plot Average Search Depth Billion TE/s Nehalem Parallel speedup Billion TE/s Tesla C2050 Parallel speedup europe.osm x grid5pt x hugebubbles x grid7pt (4-core ) 2.4 x x nlpkkt (4-core ) 3.0 x x audikw x cage (4-core ) 2.6 x x kkt_power (4-core ) 3.0 x x copapersciteseer x wikipedia (4-core ) 3.2 x x kron_g500-logn x random.2mv.128me (8-core ) 7.0 x x rmat.2mv.128me (8-core ) 6.0 x x 2.5 GHz Core i7 4-core (Leiserson et al.) 2.7 GHz EX Xeon X core (Agarwal et al) vs 3.4GHz Core i7 2600K (Sandybridge)

27 Summary Quadratic-work approaches are uncompetitive Dynamic workload management: Prefix sum (vs. atomic-add) Utilization requires fine-grained expansion and contraction GPUs can be very amenable to dynamic, cooperative problems 27

28 Questions? UNIVERSITY of VIRGINIA

29 Experimental corpus Graph Spy Plot Description Average Search Depth Vertices (millions) Edges (millions) europe.osm Road network grid5pt D Poisson stencil hugebubbles D mesh grid7pt.300 3D Poisson stencil nlpkkt160 Constrained optimization problem audikw1 Finite element matrix cage15 Transition prob. matrix kkt_power Optimization (KKT) copapersciteseer Citation network wikipedia Wikipedia page links kron_g500-logn20 Graph500 random graph random.2mv.128me Uniform random graph rmat.2mv.128me RMAT random graph

30 Comparison of expansion techniques 30

31 Coupling of expansion & contraction phases Alternatives: Expand-contract Wholly realize vertex-frontier in DRAM Suitable for all types of BFS iterations Contract-expand Logical Edge Frontier Logical Vertex Frontier Two-phase Wholly realize edge-frontier in DRAM Even better for small, fleeting BFS iterations Wholly realize both frontiers in DRAM BFS iteration Even better for large, saturating BFS iterations (surprising!!) 31

32 10 9 edges / sec normalized Comparison of coupling approaches Expand-Contract Contract-Expand 2-Phase Hybrid

33 Redundant expansion factor (log) Local duplicate culling Contraction uses local collision hashes in smem scratch space: 10 Simple Local duplicate culling Hash per warp (instantaneous coverage) 4.2x Hash per CTA (recent history coverage) Redundant work < 10% in all datasets No atomics needed 1 1.1x 2.4x 2.6x 1.7x 1.2x 1.1x 1.4x 1.1x 1.3x 33

34 10 9 edges / sec normalized Multi-GPU traversal Expand neighbors sort by GPU (with filter) read from peer GPUs (with filter) repeat C2050 x1 C2050 x2 C2050 x3 C2050 x

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor