GPU-Accelerated Algebraic Multigrid for Commercial Applications. Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA


1 GPU-Accelerated Algebraic Multigrid for Commercial Applications Joe Eaton, Ph.D. Manager, NVAMG CUDA Library NVIDIA

2 ANSYS Fluent

3 Fluent control flow
Non-linear iterations:
1. Assemble linear system of equations (runtime: ~33%)
2. Solve linear system of equations: Ax = b (runtime: ~67%; accelerate this first)
3. Converged? No: return to step 1. Yes: stop.

4 Aggregation (Un-smoothed) AMG
SETUP
1. Choose aggregates based on $A_f$
2. Construct coarsening operator ($R = P^T$)
3. Construct coarse matrix ($A_c = R A_f P$)
4. Initialize smoother (if needed)
SOLVE
1. Smooth
2. Compute residual ($r_f = b_f - A_f x_f$)
3. Restrict residual ($r_c = R r_f$)
4. Recurse to solve coarse problem
5. Prolongate correction ($x_f = x_f + P e_c$)
6. Smooth
7. If not converged, goto 1

5 Challenge of Multigrid on GPU
CPU approach (domain decomposition):
  Chop the problem into 1 subdomain per CPU core
  Run the serial algorithm per subdomain (and fix up the boundaries)
  => Cores are fast, and subdomains are big enough that the boundaries are small
GPU requirements are different:
  Each thread is slow, but there are O(10,000) threads per GPU
  Therefore, subdomain size is O(1)
  Typical domain decomposition approaches break down
  => Methods must scale to the limit of 1 grid cell per thread


6 Computational Pattern - SpMV
SpMV (Sparse Matrix-Vector product) computes y = Ax:
  for each row i in matrix A                (parallel map)
    row_sum = 0
    for each non-zero entry j in row i      (visit each edge)
      row_sum += A_{i,j} * x_j              (reduction)
    y_i = row_sum
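A minimal runnable version of the SpMV loop, assuming the matrix is stored in CSR form. The array names `row_ptr`, `col_idx` and `values` are illustrative choices, not names from the slides, and the outer loop runs sequentially here where a GPU would run it as a parallel map.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A x for a CSR matrix."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):                            # parallel map over rows
        row_sum = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):    # visit each edge of row i
            row_sum += values[k] * x[col_idx[k]]       # reduction
        y[i] = row_sum
    return y

# 3x3 example: [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```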

7 SpMV-like pattern:
  for every vertex a_i
    for every neighbor(a_i) b_i
      compute F(a_i, b_i)
    reduce {F(a_i, b_i)}
    write result into position i
SpMM-like pattern:
  for every vertex a_i
    for every neighbor(a_i) b_i
      for every neighbor(b_i) c_k
        segmented_reduce {F(a_i, b_i, c_k) by (i,k)}
    for every (i,k)
      write result into position i,k

8 CUDA Implementation: SpMV-like
Scalar kernel:
  One thread per row / vertex
  For-loop over edges
  Serial reduction
Vector kernel:
  Multiple threads per row / vertex
  Process edges in parallel
  Parallel reduction
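The difference between the two kernels can be emulated sequentially. The sketch below is not CUDA: each "thread" is a loop iteration, `lanes` is a hypothetical number of threads cooperating on one row (assumed to be a power of two), and the tree reduction stands in for a shared-memory or warp-level reduction.

```python
def spmv_scalar(row_ptr, col_idx, values, x):
    # One "thread" per row: serial loop over the row's edges, serial reduction.
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

def spmv_vector(row_ptr, col_idx, values, x, lanes=4):
    # `lanes` "threads" per row: strided partial sums, then a tree reduction.
    y = []
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]
        partial = [0.0] * lanes
        for lane in range(lanes):                     # lanes run in parallel on a GPU
            for k in range(start + lane, end, lanes):
                partial[lane] += values[k] * x[col_idx[k]]
        stride = lanes // 2
        while stride > 0:                             # log2(lanes) reduction steps
            for lane in range(stride):
                partial[lane] += partial[lane + stride]
            stride //= 2
        y.append(partial[0])
    return y

# 2x6 example with one longer row (5 non-zeros) and one short row (2 non-zeros).
row_ptr = [0, 5, 7]
col_idx = [0, 1, 2, 3, 4, 1, 5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y_scalar = spmv_scalar(row_ptr, col_idx, values, x)
y_vector = spmv_vector(row_ptr, col_idx, values, x)
```

The short row shows the trade-off: with 4 lanes and only 2 non-zeros, half of the lanes do no useful work, which is why the vector variant tends to pay off only when rows are long enough.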

9 CUDA Implementation: SpMM-like
For every vertex in the 2-ring: segmented reduction of F(a_i, b_i, c_k) by (i,k)
Challenge: #{(a_i, b_i, c_k) triplets} > #{unique (i,k)}
  Unpredictable data expansion & contraction
  Difficult with 10,000 threads!
Approaches:
  Shared memory / register file caching (spill to global memory)
  Count, allocate, execute
  Everything in global memory with tuples (cusp/thrust approach)
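The tuple-based approach can be illustrated with a sequential sketch of expand, sort, and reduce-by-key. It mirrors the general idea only and does not reproduce any cusp or thrust code: the dictionary storage and the function name `spmm_tuples` are assumptions, and F is taken to be a plain product so that the result is the matrix product C = A B.

```python
from itertools import groupby

def spmm_tuples(A, B):
    """A, B: dicts {(row, col): value}. Returns (C, number of expanded triplets)."""
    rows_of_B = {}
    for (j, k), b in B.items():
        rows_of_B.setdefault(j, []).append((k, b))
    # Expand: one (i, k, value) tuple per pair (a_ij, b_jk).
    triplets = [(i, k, a * b)
                for (i, j), a in A.items()
                for (k, b) in rows_of_B.get(j, [])]
    # Sort by key (i, k) so equal keys are contiguous, then reduce each segment.
    triplets.sort(key=lambda t: (t[0], t[1]))
    C = {}
    for key, segment in groupby(triplets, key=lambda t: (t[0], t[1])):
        C[key] = sum(v for _, _, v in segment)
    return C, len(triplets)

# Tridiagonal 3x3 example: C = A * A.
A = {(0, 0): 2.0, (0, 1): -1.0,
     (1, 0): -1.0, (1, 1): 2.0, (1, 2): -1.0,
     (2, 1): -1.0, (2, 2): 2.0}
C, n_triplets = spmm_tuples(A, A)
```

Even on this tiny matrix the expansion produces more triplets than there are unique (i,k) output positions, which is the expansion and contraction problem the slide refers to.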

10 Scaling Beyond Single GPU
Workload breakdown approach:
  SpMV-like => communicate the edges and attached vertices that connect data between GPUs
  SpMM-like => communicate the edges and attached rows (e.g. 1-ring) that connect data between GPUs
Plus lots of complex software engineering
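For the SpMV-like case, the data to communicate can be found by scanning the edges that cross a partition boundary. The sketch below assumes a CSR matrix and a hypothetical `owner` array assigning each vertex to a GPU; it only identifies the vertices to receive and does not model the actual transfer.

```python
def halo_vertices(row_ptr, col_idx, owner, part):
    """Vertices owned elsewhere whose x-values `part` needs before its local SpMV."""
    halo = set()
    for i in range(len(row_ptr) - 1):
        if owner[i] != part:
            continue
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if owner[j] != part:          # edge (i, j) crosses the partition boundary
                halo.add(j)
    return sorted(halo)

# Path graph with 6 vertices (tridiagonal matrix), split across two partitions.
n = 6
row_ptr, col_idx = [0], []
for i in range(n):
    col_idx += [j for j in (i - 1, i, i + 1) if 0 <= j < n]
    row_ptr.append(len(col_idx))
owner = [0, 0, 0, 1, 1, 1]
halo0 = halo_vertices(row_ptr, col_idx, owner, 0)
halo1 = halo_vertices(row_ptr, col_idx, owner, 1)
```

For the SpMM-like case the same scan applies, except that whole matrix rows attached to the halo vertices would have to be exchanged rather than single vector entries.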

11 Workload breakdown
SETUP
1. Choose aggregates based on $A_f$: SpMV
2. Construct coarsening operator ($R = P^T$): transpose (sort)
3. Construct coarse matrix ($A_c = R A_f P$): SpMM
4. Initialize smoother (if needed): SpMV / SpMM (ILU1)
SOLVE
1. Smooth: SpMV
2. Compute residual ($r_f = b_f - A_f x_f$): SpMV
3. Restrict residual ($r_c = R r_f$): SpMV
4. Recurse on coarse problem
5. Prolongate correction ($x_f = x_f + P e_c$): SpMV
6. Smooth: SpMV
7. If not converged, goto 1: reduction


13 Size-2 Aggregation via Graph Matching
Graph matching: a set of edges such that no two edges share a vertex
Maximum matching: a matching that includes the largest number of edges
Equivalent: an independent set on the dual of the graph, i.e. independent pairs of connected vertices
FEMTEC NVIDIA 2013

14-24 One-Phase Handshaking
1. Each vertex extends a hand to its strongest neighbor
2. Each vertex checks if its strongest neighbor extended a hand back
3. Repeat with unmatched vertices
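The handshaking rounds can be sketched sequentially as follows. Several details are assumptions rather than anything stated on the slides: edge strength is given as a plain weight per undirected edge, ties are broken towards the smaller vertex index, and the loop stops after `max_rounds` or when a round produces no new pair.

```python
def handshake_matching(weights, n, max_rounds=10):
    """weights: dict {(i, j): strength} for undirected edges.
    Returns partner[v], with -1 for vertices left unmatched."""
    nbrs = {v: {} for v in range(n)}
    for (i, j), w in weights.items():
        nbrs[i][j] = w
        nbrs[j][i] = w
    partner = [-1] * n
    for _ in range(max_rounds):
        # Step 1: each unmatched vertex extends a hand to its strongest unmatched neighbor.
        hand = {}
        for v in range(n):
            if partner[v] != -1:
                continue
            candidates = [(w, -u) for u, w in nbrs[v].items() if partner[u] == -1]
            if candidates:
                hand[v] = -max(candidates)[1]
        # Step 2: a pair is formed when both hands point at each other.
        matched = [(v, u) for v, u in hand.items() if v < u and hand.get(u) == v]
        if not matched:
            break
        for v, u in matched:
            partner[v], partner[u] = u, v
        # Step 3: the next round repeats with the remaining unmatched vertices.
    return partner

# Path 0-1-2-3-4 with edge strengths 1, 3, 2, 1.
weights = {(0, 1): 1.0, (1, 2): 3.0, (2, 3): 2.0, (3, 4): 1.0}
partner = handshake_matching(weights, 5)
```

In this example vertices 1 and 2 pair up in the first round, 3 and 4 in the second, and vertex 0 stays unmatched because its only neighbor is already taken. How such leftover vertices are handled (for example, merged into a neighboring aggregate) is not covered by this sketch.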

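25 Create P from Aggregate
$P_{i,j} = 1$ if fine vertex $i$ is in aggregate $j$, and $P_{i,j} = 0$ otherwise (one row per fine vertex, one column per aggregate, so that $x_f + P e_c$ is well defined)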
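A small sketch of building P from an aggregate assignment, assuming P is stored with one row per fine vertex and one column per aggregate, and that aggregates are given as an array `agg` with `agg[i]` the aggregate of fine vertex i. Dense storage is used only for readability.

```python
def prolongation_from_aggregates(agg):
    """Return dense P (n_fine x n_coarse) with P[i][J] = 1 if vertex i is in aggregate J."""
    n_coarse = max(agg) + 1
    return [[1.0 if agg[i] == J else 0.0 for J in range(n_coarse)]
            for i in range(len(agg))]

def transpose(M):
    return [list(col) for col in zip(*M)]

agg = [0, 0, 1, 1, 2]                 # size-2 aggregates plus one singleton
P = prolongation_from_aggregates(agg)
R = transpose(P)                      # restriction R = P^T
```

Each row of P contains exactly one 1, so in practice the `agg` array alone carries all the information in P.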


28 Tricks for Computing $A_c$
For UA-AMG, every entry of P is either 0 or 1, so the Galerkin product simplifies to
$(A_c)_{I,J} = \sum_{i \in I} \sum_{j \in J} (A_f)_{i,j}$,
the sum of the fine-matrix entries over fine points i in aggregate I and fine points j in aggregate J
SpMM-like kernel

31 Computing $A_c$ is SpMM-like
  for every coarse vertex I
    for every contained fine vertex i
      for every non-zero entry j with coarse(j) == J
        segmented_sum {(A_f)_{i,j} by (I,J)}
    for every (I,J)
      write result into position (A_c)_{I,J}
Example: aggregate I=0 contains fine vertices i=0 and i=1, and aggregate J=1 contains fine vertex j=2, so $(A_c)_{0,1} = (A_f)_{0,2} + (A_f)_{1,2}$
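The segmented sum can be written out in a few lines. In this sketch a dictionary keyed by (I,J) stands in for the parallel segmented reduction, and the matrix values are made up for illustration; only the aggregate layout (I=0 holds i=0 and i=1, J=1 holds j=2) follows the slide's example.

```python
def coarse_matrix(A_f, agg):
    """A_f: dict {(i, j): value}; agg[i]: aggregate of fine vertex i.
    Returns A_c as a dict {(I, J): value}."""
    A_c = {}
    for (i, j), a in A_f.items():
        key = (agg[i], agg[j])               # coarse position of this fine entry
        A_c[key] = A_c.get(key, 0.0) + a     # accumulate the segment for (I, J)
    return A_c

# Illustrative 3x3 fine matrix; aggregate 0 = {0, 1}, aggregate 1 = {2}.
A_f = {(0, 0): 4.0, (0, 1): -1.0, (0, 2): -1.0,
       (1, 0): -1.0, (1, 1): 4.0, (1, 2): -2.0,
       (2, 0): -1.0, (2, 1): -2.0, (2, 2): 4.0}
agg = [0, 0, 1]
A_c = coarse_matrix(A_f, agg)
```

Here `A_c[(0, 1)]` is the sum of the fine entries at (0,2) and (1,2), matching the example on the slide.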


33 Smoothers
Preconditioned Richardson iteration: $x_{k+1} = x_k + M^{-1}(b - A x_k)$
M is called the preconditioning matrix
The smoother is encoded via the choice of M
The solution vector x is updated iteratively

34 Jacobi Smoother
Trivial parallelism
For Jacobi, M is block-diagonal (the blocks could be n x n)

35 Jacobi Smoother
Setup phase: compute the inverses of the D blocks (the diagonal blocks of A)
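A minimal Jacobi sketch, assuming 1x1 diagonal blocks (scalar Jacobi) and dense storage. The setup phase stores the inverted diagonal, and each sweep applies the update x <- x + D^{-1}(b - A x), which is a Richardson iteration with M = D.

```python
def jacobi_setup(A):
    # Setup phase: invert the (1x1) diagonal blocks once.
    return [1.0 / A[i][i] for i in range(len(A))]

def jacobi_sweep(A, d_inv, x, b):
    # Every entry of the residual and of the update is independent of the others.
    r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    return [xi + di * ri for xi, di, ri in zip(x, d_inv, r)]

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]          # equals A applied to [1, 2, 3]
d_inv = jacobi_setup(A)
x = [0.0, 0.0, 0.0]
for _ in range(50):
    x = jacobi_sweep(A, d_inv, x, b)
```

With larger blocks, `jacobi_setup` would invert small dense matrices instead of scalars, but the sweep keeps the same structure.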

36 Graph Coloring
Assignment of a color (integer) to each vertex, with no two adjacent vertices the same color
Each color forms an independent set (conflict-free)
Reveals the parallelism inherent in the graph topology

37 Reordering via Graph Coloring
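As an illustration of coloring followed by reordering, the sketch below uses a sequential greedy coloring. This is an assumption made for brevity: the slides do not say which coloring algorithm is used, and a GPU implementation would need a parallel one. The permutation simply groups vertices by color.

```python
def greedy_coloring(adj):
    """adj[v]: list of neighbors of v. Returns color[v] (smallest color unused by neighbors)."""
    color = [-1] * len(adj)
    for v in range(len(adj)):
        used = {color[u] for u in adj[v] if color[u] != -1}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def color_permutation(color):
    # New ordering: all vertices of color 0, then color 1, and so on.
    return sorted(range(len(color)), key=lambda v: (color[v], v))

# Path graph 0-1-2-3-4.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
color = greedy_coloring(adj)
perm = color_permutation(color)
```

After reordering, the rows belonging to one color have no couplings among themselves, so they can be updated at the same time.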

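38 DILU Smoother
DILU preconditioner has the form $M = (E + L) E^{-1} (E + U)$, where L and U are the strictly lower and upper triangular parts of A
E is a diagonal matrix such that $\mathrm{diag}(M) = \mathrm{diag}(A)$
Same as ILU(0) preconditioner for some banded matrices
Only requires one extra diagonal of storage
Cheap, strong, low-storage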

39 DILU Smoother
Setup is sequential
Solve is also sequential (two triangular solves)

40 Multi-color DILU Smoother
Use coloring to extract parallelism
Setup: [equation not recovered]
Forward solve: include neighbors whose color is less than yours in SpMV-like updates
Backward solve: include neighbors whose color is greater than yours in SpMV-like updates
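A sequential sketch of a multi-color DILU smoother, under stated assumptions: the preconditioner is taken in the standard DILU form M = (E + L) E^{-1} (E + U) with E diagonal and diag(M) = diag(A), "lower" and "upper" are defined by color order instead of row order, the matrix is dense, and a valid coloring is supplied. The function names are hypothetical and this is not the NVAMG code.

```python
def dilu_setup(A, color):
    # E_ii = a_ii - sum over neighbors j with color[j] < color[i] of a_ij * a_ji / E_jj.
    n = len(A)
    E = [0.0] * n
    for c in sorted(set(color)):                  # colors in order
        for i in (v for v in range(n) if color[v] == c):   # independent within a color
            E[i] = A[i][i] - sum(A[i][j] * A[j][i] / E[j]
                                 for j in range(n)
                                 if A[i][j] != 0.0 and color[j] < c)
    return E

def dilu_apply(A, color, E, r):
    # Solve M z = r with two color-ordered triangular sweeps.
    n = len(A)
    colors = sorted(set(color))
    y = [0.0] * n
    for c in colors:                              # forward: use lower-colored neighbors
        for i in (v for v in range(n) if color[v] == c):
            s = sum(A[i][j] * y[j] for j in range(n) if A[i][j] != 0.0 and color[j] < c)
            y[i] = (r[i] - s) / E[i]
    z = [0.0] * n
    for c in reversed(colors):                    # backward: use higher-colored neighbors
        for i in (v for v in range(n) if color[v] == c):
            s = sum(A[i][j] * z[j] for j in range(n) if A[i][j] != 0.0 and color[j] > c)
            z[i] = y[i] - s / E[i]
    return z

def apply_M(A, color, E, z):
    # Multiply by M = (E + L) E^{-1} (E + U); used here only to check the solve.
    n = len(A)
    w = [E[i] * z[i] + sum(A[i][j] * z[j] for j in range(n) if color[j] > color[i])
         for i in range(n)]
    v = [w[i] / E[i] for i in range(n)]
    return [E[i] * v[i] + sum(A[i][j] * v[j] for j in range(n) if color[j] < color[i])
            for i in range(n)]

A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
color = [0, 1, 0, 1]                              # two-coloring of the path graph
E = dilu_setup(A, color)
r = [1.0, 2.0, 3.0, 4.0]
z = dilu_apply(A, color, E, r)
Mz = apply_M(A, color, E, z)
```

Within one color, every vertex reads only values from other colors, so the inner loops over a color are the part that could run in parallel as SpMV-like updates.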


43 NVAMG Library
Implements AMG, Krylov methods, and some utilities
Supports standard matrix formats; easy-to-integrate C API
MPI support; interoperates with MPI-enabled applications
General beta release in July 2013

44 ANSYS Fluent 14.5 with NVAMG (beta feature)
[Bar chart: ANSYS Fluent AMG solver time (sec), lower is better. Series: dual-socket CPU; dual-socket CPU + Tesla C2075. Configurations: 2 x Xeon X5650, only 1 core used; 2 x Xeon X5650, all 12 cores used. Bar values and speedup labels were not recoverable.]
Helix model: helix geometry, 1.2M hex cells, unsteady, laminar, coupled PBNS, DP
AMG F-cycle on CPU, AMG V-cycle on GPU
NOTE: This is a performance preview. GPU support is a beta feature. All jobs report solver time only.

45 Fluent + NVAMG Preview Results
[Bar chart: ANSYS Fluent AMG on single CPU/GPU, time in ms, lower is better; best solver settings on each platform. Cases: Helix (hex 208K), Helix (tet 1173K), Airfoil (hex 784K). Platforms: K20X (1), 3930K (6).]

46 NVAMG Strong Scaling, 3M unknowns

47 ANSYS Fluent GPU Acceleration of Truck 14M
[Bar chart: ANSYS Fluent AMG solver time (sec), lower is better. Series: Intel Xeon E5-2667, 2.90GHz; Intel Xeon E5-2667, 2.90GHz + Tesla K20X. Configurations: 1 node, 2 CPUs (12 cores total); 2 nodes, 4 CPUs (24 cores total), 8 GPUs (4 per node); 4 nodes, 8 CPUs (48 cores total), 16 GPUs (4 per node). Labels surviving extraction: 41, 28, 12, 9 and 3.3x; their assignment to bars is not recoverable.]
Truck body model: 14M mixed cells, DES turbulence, coupled PBNS, SP; times for 1 iteration
CPU: AMG F-cycle. GPU: preconditioned FGMRES with AMG
NOTE: All jobs report solver time only.

48 Tremendous Collaboration and Team
Thanks to an awesome team (in alphabetical order): Marat Arsaev, Patrice Castonguay, Jonathan Cohen, Julien Demouth, Joe Eaton, Justin Luitjens, Nikolay Markovskiy, Maxim Naumov, Stan Posey, Nikolai Sakharnykh, Robert Strzodka, Zhenhai Zhu
Star interns: Peter Zaspel, Simon Layton, Lu Wang, Istvan Reguly, Francesco Rossi, Christoph Franke, Felix Abecassis
Our collaborators and AMG advisors:
  ANSYS: Sunil Sathe, Prasad Alavilli, Rongguang Jia
  PSU: James Brannick, Ludmil Zikatanov, Jinchao Xu, Xiaozhe Hu
  Developers at companies to be named later
