GPU-Accelerated Algebraic Multigrid for Commercial Applications
Joe Eaton, Ph.D., Manager, NVAMG CUDA Library, NVIDIA
ANSYS Fluent
Fluent control flow

Non-linear iterations:
1. Assemble linear system of equations (~33% of runtime)
2. Solve linear system of equations Ax = b (~67% of runtime: accelerate this first)
3. Converged? If no, repeat; if yes, stop
Aggregation (Unsmoothed) AMG

SETUP
1. Choose aggregates based on A_f
2. Construct coarsening operator (R = P^T)
3. Construct coarse matrix (A_c = R A_f P)
4. Initialize smoother (if needed)

SOLVE
1. Smooth
2. Compute residual (r_f = b_f - A_f x_f)
3. Restrict residual (r_c = R r_f)
4. Recurse to solve coarse problem
5. Prolongate correction (x_f = x_f + P e_c)
6. Smooth
7. If not converged, go to 1

(A host-side sketch of the SOLVE loop follows.)
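The SOLVE phase can be written down as a short recursive V-cycle. The sketch below is hedged: the helper functions (smooth, residual, restrict_to, prolongate_add, coarse_solve, set_zero) are hypothetical stand-ins for the GPU kernels discussed on later slides, and Level is an assumed container for one grid's operators and work vectors; this is not NVAMG's actual interface.

```cuda
#include <vector>

// Hypothetical per-grid bundle: matrix A, operators R/P, smoother
// state, and work vectors (only the vectors used below are shown).
struct Level {
    double *r;    // fine residual
    double *r_c;  // restricted residual (lives on the next grid)
    double *e_c;  // coarse-grid correction
};

// Hypothetical wrappers around the GPU kernels described later.
void smooth(Level&, double* x, const double* b);
void residual(Level&, const double* x, const double* b, double* r);
void restrict_to(Level&, const double* r_f, double* r_c);   // r_c = R r_f
void prolongate_add(Level&, const double* e_c, double* x);  // x_f += P e_c
void coarse_solve(Level&, double* x, const double* b);
void set_zero(Level&, double* v);

void v_cycle(std::vector<Level>& levels, std::size_t l,
             double* x, const double* b)
{
    Level& L = levels[l];
    if (l + 1 == levels.size()) {            // coarsest grid
        coarse_solve(L, x, b);
        return;
    }
    smooth(L, x, b);                         // 1. pre-smooth
    residual(L, x, b, L.r);                  // 2. r_f = b_f - A_f x_f
    restrict_to(L, L.r, L.r_c);              // 3. r_c = R r_f
    set_zero(L, L.e_c);
    v_cycle(levels, l + 1, L.e_c, L.r_c);    // 4. recurse on coarse problem
    prolongate_add(L, L.e_c, x);             // 5. x_f += P e_c
    smooth(L, x, b);                         // 6. post-smooth
}
```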
Challenge of Multigrid on GPU

CPU approach (domain decomposition):
- Chop the problem into one subdomain per CPU core
- Run a serial algorithm per subdomain (fix up the boundaries)
- Works because cores are fast and subdomains are big enough that boundaries stay small

GPU requirements are different:
- Each thread is slow, but there are O(10,000) threads per GPU
- Therefore subdomain size is O(1), and typical domain decomposition approaches break down
- Methods must scale to the limit of one grid cell per thread
Computational Pattern - SpMV

SpMV (sparse matrix-vector product) computes y = Ax: a parallel map over rows that visits each edge, with a reduction per row.

for each row i in matrix A:
    row_sum = 0
    for each non-zero entry A_{i,j} in row i:
        row_sum += A_{i,j} * x_j
    y_i = row_sum
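This pattern maps directly onto CUDA. As a minimal sketch (not NVAMG's actual kernel), here is the one-thread-per-row translation of the pseudocode above, assuming CSR storage with illustrative array names row_ptr, col_idx, vals:

```cuda
// Scalar CSR SpMV: one thread per row, serial reduction over that
// row's non-zeros. Array names are illustrative CSR arrays, not
// NVAMG's data structures.
__global__ void spmv_scalar(int n_rows,
                            const int* __restrict__ row_ptr,
                            const int* __restrict__ col_idx,
                            const double* __restrict__ vals,
                            const double* __restrict__ x,
                            double* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_rows) return;
    double row_sum = 0.0;
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
        row_sum += vals[jj] * x[col_idx[jj]];
    y[i] = row_sum;
}
```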
SpMV-like pattern:

for every vertex a_i:
    for every neighbor b_j of a_i:
        compute F(a_i, b_j)
    reduce {F(a_i, b_j)}
    write result into position i

SpMM-like pattern:

for every vertex a_i:
    for every neighbor b_j of a_i:
        for every neighbor c_k of b_j:
            segmented_reduce {F(a_i, b_j, c_k)} by (i, k)
for every (i, k):
    write result into position (i, k)
CUDA Implementation: SpMV-like

Scalar kernel:
- One thread per row/vertex
- For-loop over edges
- Serial reduction

Vector kernel:
- Multiple threads per row/vertex
- Process edges in parallel
- Parallel reduction
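A minimal sketch of the vector variant, assuming the same hypothetical CSR arrays as before: one warp per row, lanes striding over the row's non-zeros, followed by a shuffle-based parallel reduction. (__shfl_down_sync requires CUDA 9+; the 2013-era equivalent was __shfl_down. Assumes blockDim.x is a multiple of 32.)

```cuda
// Vector CSR SpMV: one warp (32 threads) per row; the lanes process
// edges in parallel, then combine partial sums with a warp reduction.
__global__ void spmv_vector(int n_rows,
                            const int* __restrict__ row_ptr,
                            const int* __restrict__ col_idx,
                            const double* __restrict__ vals,
                            const double* __restrict__ x,
                            double* __restrict__ y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x & 31;
    if (warp_id >= n_rows) return;   // uniform across the warp

    double sum = 0.0;
    for (int jj = row_ptr[warp_id] + lane; jj < row_ptr[warp_id + 1]; jj += 32)
        sum += vals[jj] * x[col_idx[jj]];

    // parallel reduction within the warp
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}
```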
CUDA Implementation: SpMM-like

For every vertex in the 2-ring: segmented reduction of F(a_i, b_j, c_k) by (i, k)

Challenge: #{(a_i, b_j, c_k) triplets} > #{unique (i, k)}
- Unpredictable data expansion and contraction
- Difficult with 10,000 threads!

Approaches:
- Shared-memory / register-file caching (spill to global memory)
- Count, allocate, execute
- Everything in global memory with tuples (cusp/thrust approach; see the sketch below)
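A compact sketch of the third approach, under the assumption that the (i, k, value) triplets have already been expanded into device vectors (names are illustrative): sorting by a zipped (i, k) key makes duplicates adjacent, and reduce_by_key performs the segmented reduction entirely in global memory.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Contract expanded (i, k, value) triplets down to one value per
// unique (i, k). Returns the number of unique keys.
size_t contract_triplets(thrust::device_vector<int>&    I,
                         thrust::device_vector<int>&    K,
                         thrust::device_vector<double>& V,
                         thrust::device_vector<int>&    I_out,
                         thrust::device_vector<int>&    K_out,
                         thrust::device_vector<double>& V_out)
{
    size_t n = V.size();
    auto keys = thrust::make_zip_iterator(
        thrust::make_tuple(I.begin(), K.begin()));
    // sort triplets so equal (i, k) keys are adjacent
    thrust::sort_by_key(keys, keys + n, V.begin());

    I_out.resize(n); K_out.resize(n); V_out.resize(n);
    auto out_keys = thrust::make_zip_iterator(
        thrust::make_tuple(I_out.begin(), K_out.begin()));
    // segmented reduction: one output per unique (i, k)
    auto ends = thrust::reduce_by_key(keys, keys + n, V.begin(),
                                      out_keys, V_out.begin());
    size_t m = ends.second - V_out.begin();
    I_out.resize(m); K_out.resize(m); V_out.resize(m);
    return m;
}
```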
Scaling Beyond a Single GPU

Workload breakdown approach:
- SpMV-like => communicate the edges, and their attached vertices, that connect data between GPUs
- SpMM-like => communicate the edges, and their attached rows (e.g. the 1-ring), that connect data between GPUs
- Plus lots of complex software engineering
Workload breakdown

SETUP
1. Choose aggregates based on A_f: SpMV
2. Construct coarsening operator (R = P^T): transpose (sort)
3. Construct coarse matrix (A_c = R A_f P): SpMM
4. Initialize smoother (if needed): SpMV / SpMM (ILU1)

SOLVE
1. Smooth: SpMV
2. Compute residual (r_f = b_f - A_f x_f): SpMV
3. Restrict residual (r_c = R r_f): SpMV
4. Recurse on coarse problem
5. Prolongate correction (x_f = x_f + P e_c): SpMV
6. Smooth: SpMV
7. If not converged, go to 1: reduction
Size-2 Aggregation via Graph Matching

Graph matching: a set of edges such that no two edges share a vertex
- A maximum matching includes the largest number of edges
- Equivalent: an independent set on the dual of the graph, i.e. independent pairs of connected vertices
One-Phase Handshaking

1. Each vertex extends a hand to its strongest neighbor
2. Each vertex checks whether its strongest neighbor extended a hand back; mutual pairs become matched
3. Repeat with unmatched vertices

(A two-kernel sketch of one round follows.)
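One round of handshaking can be written as two CUDA kernels, sketched below (not the NVAMG implementation). Assumptions: match is initialized to -1, strength is an illustrative edge-strength array parallel to col_idx, and the pair of launches is repeated until almost all vertices are matched; leftover vertices would become singleton aggregates or be merged in a cleanup pass.

```cuda
// Phase 1: each unmatched vertex "extends a hand" to its strongest
// unmatched neighbor.
__global__ void extend_hand(int n, const int* row_ptr, const int* col_idx,
                            const double* strength, const int* match,
                            int* hand)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || match[v] >= 0) return;           // already matched
    int best = -1; double best_s = -1.0;
    for (int jj = row_ptr[v]; jj < row_ptr[v + 1]; ++jj) {
        int u = col_idx[jj];
        if (u != v && match[u] < 0 && strength[jj] > best_s) {
            best_s = strength[jj]; best = u;
        }
    }
    hand[v] = best;                                // may stay -1
}

// Phase 2: a vertex is matched if its chosen neighbor chose it back.
__global__ void check_handshake(int n, const int* hand, int* match)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || match[v] >= 0) return;
    int u = hand[v];
    if (u >= 0 && hand[u] == v)                    // mutual handshake
        match[v] = min(u, v);                      // smaller index = aggregate id
}
```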
Create P from Aggregates

P_{i,J} = 1 if fine vertex i belongs to aggregate J, 0 otherwise

Each row of P holds exactly one unit entry, so P is a Boolean selection matrix.
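Because each row of P has a single unit entry, neither P nor R = P^T needs to be stored explicitly: an aggregate-label array suffices. A minimal sketch with illustrative names, where agg[i] is the aggregate containing fine vertex i (the double atomicAdd requires compute capability 6.0+):

```cuda
// Prolongate-and-correct: x_f += P * e_c
// (each fine vertex picks up its aggregate's coarse correction)
__global__ void prolongate(int n_fine, const int* agg,
                           const double* e_c, double* x_f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_fine) x_f[i] += e_c[agg[i]];
}

// Restrict: r_c = R * r_f = P^T * r_f
// (segmented sum over aggregates; assumes r_c was zeroed first)
__global__ void restrict_residual(int n_fine, const int* agg,
                                  const double* r_f, double* r_c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_fine) atomicAdd(&r_c[agg[i]], r_f[i]);
}
```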
Workload breakdown

SETUP
1. Choose aggregates based on A_f: SpMV
2. Construct coarsening operator (R = P^T): transpose (sort)
3. Construct coarse matrix (A_c = R A_f P): SpMM
4. Initialize smoother (if needed): SpMV / SpMM (ILU1)

SOLVE
1. Smooth: SpMV
2. Compute residual (r_f = b_f - A_f x_f): SpMV
3. Restrict residual (r_c = R r_f): SpMV
4. Recurse on coarse problem
5. Prolongate correction (x_f = x_f + P e_c): SpMV
6. Smooth: SpMV
7. If not converged, go to 1: reduction
Tricks for Computing A_c

For UA-AMG, every entry of P is either 0 or 1, so the Galerkin product simplifies to

A^c_{I,J} = sum over {fine points i in aggregate I, fine points j in aggregate J} of A^f_{i,j}

This is an SpMM-like kernel.
Computing A_c is SpMM-like

for every coarse vertex I:
    for every contained fine vertex i:
        for every non-zero entry A^f_{i,j} with coarse(j) == J:
            segmented_sum {A^f_{i,j}} by (I, J)
for every (I, J):
    write result into position A^c_{I,J}

Example: if aggregate I = 0 contains fine vertices i = 0 and i = 1, and both have non-zeros in column j = 2 with coarse(2) = J = 1, then A^c_{0,1} = A^f_{0,2} + A^f_{1,2}.
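Concretely, the first stage can be a trivial relabeling kernel over COO entries of A_f: entry (i, j, v) contributes v to coarse position (agg[i], agg[j]). The duplicates are then contracted with the same sort + reduce_by_key tuple pipeline sketched on the SpMM slide. A hedged sketch with illustrative names:

```cuda
// Map each fine COO non-zero to its coarse (I, J) position; the
// values carry over unchanged. After this kernel, contract the
// (coarse_i, coarse_j, value) triplets as in the earlier
// contract_triplets sketch to obtain A_c.
__global__ void relabel_to_coarse(int nnz, const int* agg,
                                  const int* fine_i, const int* fine_j,
                                  int* coarse_i, int* coarse_j)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < nnz) {
        coarse_i[t] = agg[fine_i[t]];
        coarse_j[t] = agg[fine_j[t]];
    }
}
```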
Workload breakdown SETUP 1. Choose aggregates based on A f SpMV 2. Construct coarsening operator (R = P T ) tranpose (sort) 3. Construct coarse matrix (A c = R A f P) SpMM 4. Initialize smoother (if needed) SpMV / SpMM (ILU1) SOLVE 1. Smooth SpMV 2. Compute residual (r f = b f A f x f ) SpMV 3. Restrict residual (R r f = r c ) SpMV 4. Recurse on coarse problem 5. Prolongate correction (x f = x f + Pe c ) SpMV 6. Smooth SpMV 7. If not converged, goto 1 reduction
Smoothers

Preconditioned Richardson iteration: the solution vector x is updated iteratively,

x_{k+1} = x_k + M^{-1} (b - A x_k)

M is called the preconditioning matrix; the smoother is encoded via the choice of M.
Jacobi Smoother

Trivial parallelism: for Jacobi, M is the block-diagonal part D of A (blocks could be n x n for coupled systems).

Setup phase: compute the inverses of the D blocks.
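A minimal sketch of one Jacobi sweep in the scalar (1x1 block) case, assuming the same illustrative CSR arrays and precomputed diagonal inverses inv_diag from the setup phase. Double-buffering x and x_new is what makes the update trivially parallel; the damping factor omega is an assumption (omega = 1 recovers plain Jacobi):

```cuda
// One (weighted) Jacobi sweep: x_new = x + omega * D^{-1} (b - A x)
__global__ void jacobi_sweep(int n, const int* row_ptr, const int* col_idx,
                             const double* vals, const double* inv_diag,
                             const double* b, const double* x,
                             double* x_new, double omega)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double r = b[i];
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj)
        r -= vals[jj] * x[col_idx[jj]];            // r = (b - A x)_i
    x_new[i] = x[i] + omega * inv_diag[i] * r;
}
```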
Graph Coloring

Assignment of a color (integer) to each vertex such that no two adjacent vertices share a color
- Each color forms an independent set (conflict-free)
- Reveals the parallelism inherent in the graph topology
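One way to compute such a coloring in parallel is a Jones-Plassmann-style iteration, sketched below (an assumption for illustration, not necessarily NVAMG's algorithm): each uncolored vertex whose random weight strictly beats all uncolored neighbors takes the current color, so each round colors an independent set; the host loops over increasing colors until every vertex is colored.

```cuda
// Per-vertex pseudo-random weight (any integer hash works).
__device__ unsigned hash(unsigned v)
{
    v = (v ^ 61u) ^ (v >> 16); v *= 9u; v ^= v >> 4;
    v *= 0x27d4eb2du; v ^= v >> 15;
    return v;
}

// One Jones-Plassmann step: color the uncolored local maxima.
// colors[] is initialized to -1; launch with color = 0, 1, 2, ...
__global__ void color_jpl_step(int n, const int* row_ptr, const int* col_idx,
                               int color, int* colors)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || colors[v] >= 0) return;           // already colored
    unsigned wv = hash((unsigned)v);
    for (int jj = row_ptr[v]; jj < row_ptr[v + 1]; ++jj) {
        int u = col_idx[jj];
        if (u == v || colors[u] >= 0) continue;     // self-loop or colored
        unsigned wu = hash((unsigned)u);
        if (wu > wv || (wu == wv && u > v)) return; // not the local maximum
    }
    colors[v] = color;   // strict local max among uncolored: safe to color
}
```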
Reordering via Graph Coloring
DILU Smoother

The DILU preconditioner has the form M = (E + L) E^{-1} (E + U), where L and U are the strict lower and upper triangles of A, and E is a diagonal matrix chosen such that diag(M) = diag(A):

E_{ii} = A_{ii} - sum_{j < i} A_{ij} E_{jj}^{-1} A_{ji}

- Same as the ILU(0) preconditioner for some banded matrices
- Only requires one extra diagonal of storage
- Cheap, strong, low-storage
DILU Smoother

- Setup is sequential: the recurrence for E visits rows in order
- Solve is also sequential: two triangular solves
Multi-color DILU Smoother

Use coloring to extract parallelism:
- Setup: compute E one color at a time; vertices of the same color are independent
- Forward solve: include neighbors whose color is less than yours in SpMV-like updates
- Backward solve: include neighbors whose color is greater than yours in SpMV-like updates

(A sketch of the forward sweep follows.)
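A hedged sketch of the forward sweep (E + L) z = r under the multicolor scheme: one kernel launch per color, in increasing color order, so every row only gathers z from strictly lower-colored neighbors whose values are already final. Names (color_rows, inv_E, etc.) are illustrative; inv_E holds the inverted E diagonal from setup.

```cuda
// Forward sweep of multicolor DILU for one color class.
// color_rows lists the vertices of the current color; the host
// launches this kernel once per color, smallest color first.
__global__ void dilu_forward_color(int n_color, const int* color_rows,
                                   const int* row_ptr, const int* col_idx,
                                   const double* vals, const double* inv_E,
                                   const int* color, const double* r,
                                   double* z)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_color) return;
    int i = color_rows[t];
    double s = r[i];
    for (int jj = row_ptr[i]; jj < row_ptr[i + 1]; ++jj) {
        int j = col_idx[jj];
        if (color[j] < color[i])    // lower color: z[j] is already final
            s -= vals[jj] * z[j];
    }
    z[i] = inv_E[i] * s;            // E is diagonal, so this is a scale
}
// The backward sweep is symmetric: iterate colors in decreasing order
// and gather neighbors of strictly greater color.
```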
Workload breakdown

SETUP
1. Choose aggregates based on A_f: SpMV
2. Construct coarsening operator (R = P^T): transpose (sort)
3. Construct coarse matrix (A_c = R A_f P): SpMM
4. Initialize smoother (if needed): SpMV / SpMM (ILU1)

SOLVE
1. Smooth: SpMV
2. Compute residual (r_f = b_f - A_f x_f): SpMV
3. Restrict residual (r_c = R r_f): SpMV
4. Recurse on coarse problem
5. Prolongate correction (x_f = x_f + P e_c): SpMV
6. Smooth: SpMV
7. If not converged, go to 1: reduction
NVAMG Library

- Implements AMG, Krylov methods, and some utilities
- Supports standard matrix formats; easy-to-integrate C API
- MPI support; interoperates with MPI-enabled applications
- General beta release in July 2013
ANSYS Fluent AMG Solver Time

ANSYS Fluent 14.5 with NVAMG (beta feature), Helix model. AMG solver time in seconds; lower is better.

- 2 x Xeon X5650, only 1 core used: 2832 s
- 2 x Xeon X5650, all 12 cores used: 933 s
- Dual socket CPU + Tesla C2075: 517 s (5.5x vs. 1 core, 1.8x vs. 12 cores)

Helix geometry: 1.2M hex cells, unsteady, laminar, coupled PBNS, double precision. AMG F-cycle on CPU, AMG V-cycle on GPU.

NOTE: This is a performance preview; GPU support is a beta feature; all jobs solver time only.
Fluent + NVAMG Preview Results

(Chart: ANSYS Fluent AMG solve time in ms on a single GPU (K20X, 1 GPU) vs. a single CPU (3930K, 6 cores), best solver settings on each platform; lower is better. Cases: Helix (hex, 208K), Helix (tet, 1173K), Airfoil (hex, 784K).)
NVAMG Strong Scaling

(Chart: strong scaling, 3M unknowns.)
ANSYS Fluent GPU Acceleration of Truck 14M

Truck body model: 14M mixed cells, DES turbulence, coupled PBNS, single precision. Times are for 1 iteration of the AMG solver, in seconds; lower is better. AMG F-cycle on the CPU; on the GPU, preconditioned FGMRES with AMG. CPU: Intel Xeon E5-2667, 2.90 GHz; GPU: Tesla K20X.

- 1 node, 2 CPUs (12 cores total): 69 s, CPU only
- 2 nodes, 4 CPUs (24 cores total) + 8 GPUs (4 per node): 41 s CPU only vs. 12 s with GPUs (3.5x)
- 4 nodes, 8 CPUs (48 cores total) + 16 GPUs (4 per node): 28 s CPU only vs. 9 s with GPUs (3.3x)

NOTE: All jobs solver time only.
Tremendous Collaboration and Team

Thanks to an awesome team (in alphabetical order): Marat Arsaev, Patrice Castonguay, Jonathan Cohen, Julien Demouth, Joe Eaton, Justin Luitjens, Nikolay Markovskiy, Maxim Naumov, Stan Posey, Nikolai Sakharnykh, Robert Strzodka, Zhenhai Zhu

Star interns: Peter Zaspel, Simon Layton, Lu Wang, Istvan Reguly, Francesco Rossi, Christoph Franke, Felix Abecassis

Our collaborators and AMG advisors:
- ANSYS: Sunil Sathe, Prasad Alavilli, Rongguang Jia
- PSU: James Brannick, Ludmil Zikatanov, Jinchao Xu, Xiaozhe Hu
- Developers at companies to be named later