On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Karl Rupp, Barry Smith
rupp@mcs.anl.gov
Mathematics and Computer Science Division, Argonne National Laboratory
FEMTEC 2013, May 20th, 2013

Hardware: Parallel Hardware Constraints

GPUs: Disillusion
Computing architecture schematic:
- CPU: 100 GFLOPs SP, 50 GFLOPs DP, 2x12 GB/s memory bandwidth
- GPU: 1000 GFLOPs SP, 250 GFLOPs DP, 8x20 GB/s memory bandwidth
- PCI Express between CPU and GPU: 8 GB/s, ~1 us latency
- GPUs are good for large FLOP-intensive tasks and offer high memory bandwidth
- PCI Express can be a bottleneck
- 10-fold speedups are (usually) not backed by hardware
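
The claim that 10-fold speedups are usually not backed by hardware can be checked with a few lines of arithmetic. The sketch below uses the peak figures quoted on this slide; the assignment of 2x12 GB/s to the CPU and 8x20 GB/s to the GPU is an assumption about the schematic, and the helper names are made up for illustration.

```python
# Peak figures quoted on the slide. Which bandwidth belongs to which
# device is an assumption about the original schematic.
cpu = {"bandwidth_gbs": 2 * 12, "gflops_sp": 100, "gflops_dp": 50}
gpu = {"bandwidth_gbs": 8 * 20, "gflops_sp": 1000, "gflops_dp": 250}
pcie_gbs = 8.0          # PCI Express bandwidth in GB/s
pcie_latency_s = 1e-6   # PCI Express latency in seconds


def peak_ratios(cpu, gpu):
    """GPU-over-CPU ratio for every peak figure."""
    return {key: gpu[key] / cpu[key] for key in cpu}


def pcie_transfer_time(num_bytes):
    """Idealized time to move num_bytes across PCI Express."""
    return pcie_latency_s + num_bytes / (pcie_gbs * 1e9)


def vector_add_time(n, bandwidth_gbs, bytes_per_value=8):
    """Idealized time for x = y + z: two reads and one write per entry."""
    return 3 * n * bytes_per_value / (bandwidth_gbs * 1e9)


ratios = peak_ratios(cpu, gpu)
n = 1_000_000
transfer = pcie_transfer_time(8 * n)              # one double-precision vector
gpu_add = vector_add_time(n, gpu["bandwidth_gbs"])

for key, value in ratios.items():
    print(f"{key}: GPU/CPU = {value:.1f}")
print(f"PCIe transfer of one vector / GPU vector add = {transfer / gpu_add:.1f}")
```

Only the single-precision FLOP rate reaches a factor of 10. Bandwidth-bound kernels such as vector addition or sparse triangular solves are limited by the bandwidth ratio, and moving a single vector over PCI Express already costs several times as much as adding two vectors on the GPU.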

Benchmarks: Vector Addition x = y + z
[Figure: execution time (sec) versus vector size, 10^1 to 10^7. Series: NVIDIA GTX 285 (CUDA); NVIDIA GTX 285 (OpenCL); AMD Radeon HD 7970 (OpenCL); Intel Xeon Phi Beta (OpenCL); Intel Xeon Phi Beta (native); Intel Xeon X5550 (OpenMP); Intel Xeon X5550 (single-threaded).]

Basic Idea
- Factor sparse matrix A ≈ LŨ
- L and Ũ sparse, triangular
- ILU0: pattern of L, Ũ equal to that of A
- ILUT: keep k elements per row
Solver Cycle Phase
- Residual correction: LŨx = z
- Forward solve: Ly = z
- Backward solve: Ũx = y
- Little parallelism in general
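
As a minimal sketch of the two phases named on this slide, the following pure-Python code computes an ILU0 factorization and applies it through a forward and a backward solve. It uses dense lists of lists for readability and assumes a unit diagonal for L; it is an illustration only, not the ViennaCL or PETSc implementation, and all function names are hypothetical.

```python
def ilu0(A):
    """ILU(0): Gaussian elimination that drops all fill-in outside the
    nonzero pattern of A. Returns one matrix holding L (strictly below the
    diagonal, unit diagonal implied) and U (on and above the diagonal)."""
    n = len(A)
    LU = [row[:] for row in A]
    pattern = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def forward_solve(LU, z):
    """Solve L y = z with unit lower-triangular L (sequential)."""
    y = [0.0] * len(z)
    for i in range(len(z)):
        y[i] = z[i] - sum(LU[i][j] * y[j] for j in range(i))
    return y


def backward_solve(LU, y):
    """Solve U x = y with upper-triangular U (sequential)."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(LU[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / LU[i][i]
    return x


def apply_ilu(LU, z):
    """Preconditioner application: forward solve, then backward solve."""
    return backward_solve(LU, forward_solve(LU, z))


# 1D Poisson matrix (tridiagonal) as a small test case.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
z = [0.0, 0.0, 0.0, 5.0]
x = apply_ilu(ilu0(A), z)
print(x)
```

A tridiagonal matrix produces no fill-in, so ILU0 coincides with the exact LU factorization here and `x` solves `A x = z` up to rounding. Each row of both solves waits for previously computed entries, which is the source of the limited parallelism.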

Level Scheduling
- Build the dependency graph
- Substitute as many entries as possible simultaneously
- Trade-off: one kernel launch per step vs. multiple steps in a single kernel
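
The level construction can be sketched as follows: a row belongs to level 1 if it depends on no other row, and otherwise to one level above the highest level among the rows it depends on. The data layout (one dict of off-diagonal entries per row, unit diagonal) and the function names are assumptions chosen for brevity.

```python
def compute_levels(lower_deps):
    """lower_deps[i] lists the columns j < i with a nonzero L[i][j].
    Returns the level of every row (levels start at 1)."""
    level = [0] * len(lower_deps)
    for i, deps in enumerate(lower_deps):
        level[i] = 1 + max((level[j] for j in deps), default=0)
    return level


def group_by_level(level):
    """Collect the row indices of each level, lowest level first."""
    groups = {}
    for row, lvl in enumerate(level):
        groups.setdefault(lvl, []).append(row)
    return [groups[lvl] for lvl in sorted(groups)]


def level_scheduled_forward_solve(L_rows, z):
    """Solve L y = z, where L_rows[i] = {j: L_ij for j < i} and the
    diagonal of L is one. Rows within one level are independent, so the
    inner loop could run in parallel (for example, one thread per row)."""
    deps = [sorted(row) for row in L_rows]
    y = list(z)
    for rows in group_by_level(compute_levels(deps)):
        for i in rows:  # parallelizable loop
            y[i] = z[i] - sum(v * y[j] for j, v in L_rows[i].items())
    return y


L_rows = [{}, {}, {0: 0.5}, {1: 0.5}, {2: 0.5, 3: 0.5}]
levels = compute_levels([sorted(row) for row in L_rows])
schedule = group_by_level(levels)
y = level_scheduled_forward_solve(L_rows, [1.0] * 5)
print(levels, schedule, y)
```

In this example five rows collapse into three levels. On an accelerator, each level either becomes its own kernel launch or several levels are processed inside one kernel with synchronization between them, which is the trade-off named on the slide.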

Interpretation on Structured Grids
- 2D finite-difference discretization
- An unknown can be substituted as soon as all neighbors with smaller index have been computed
- Works particularly well in 3D
Lexicographic (row-wise) numbering:
  16 17 18 19 20
  11 12 13 14 15
   6  7  8  9 10
   1  2  3  4  5

Interpretation on Structured Grids (serpentine numbering, row direction alternates)
  20 19 18 17 16
  11 12 13 14 15
  10  9  8  7  6
   1  2  3  4  5

Interpretation on Structured Grids (diagonal-wise numbering)
  10 14 17 19 20
   6  9 13 16 18
   3  5  8 12 15
   1  2  4  7 11

Interpretation on Structured Grids (red-black numbering)
  18  9 19 10 20
   6 16  7 17  8
  13  4 14  5 15
   1 11  2 12  3
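
To compare the numberings on these slides, the number of levels can be counted directly on the 5x4 grid. The sketch below assumes a 5-point stencil (each node couples to its four direct neighbors) and applies the rule that a node can be processed once all neighbors with smaller index are done. The red-black grid is written as a proper permutation with red nodes 1 to 10 and black nodes 11 to 20. Rows are listed bottom to top; the function name is hypothetical.

```python
def count_levels(numbering):
    """numbering[r][c] is the index of grid node (r, c), with r = 0 the
    bottom row. Returns the number of levels for a 5-point stencil."""
    ny, nx = len(numbering), len(numbering[0])
    order = sorted((numbering[r][c], r, c)
                   for r in range(ny) for c in range(nx))
    level = {}
    for index, r, c in order:  # visit nodes by increasing index
        deps = []
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < ny and 0 <= cc < nx and numbering[rr][cc] < index:
                deps.append(level[(rr, cc)])
        level[(r, c)] = 1 + max(deps, default=0)
    return max(level.values())


lexicographic = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],
                 [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
serpentine = [[1, 2, 3, 4, 5], [10, 9, 8, 7, 6],
              [11, 12, 13, 14, 15], [20, 19, 18, 17, 16]]
diagonal = [[1, 2, 4, 7, 11], [3, 5, 8, 12, 15],
            [6, 9, 13, 16, 18], [10, 14, 17, 19, 20]]
red_black = [[1, 11, 2, 12, 3], [13, 4, 14, 5, 15],
             [6, 16, 7, 17, 8], [18, 9, 19, 10, 20]]

results = {
    "lexicographic": count_levels(lexicographic),  # 8
    "serpentine": count_levels(serpentine),        # 20
    "diagonal": count_levels(diagonal),            # 8
    "red-black": count_levels(red_black),          # 2
}
print(results)
```

Lexicographic and diagonal-wise numbering both yield diagonal wavefronts (nx + ny - 1 levels), the serpentine numbering is fully sequential, and red-black needs only two levels. In 3D the lexicographic wavefronts become planes, so the number of unknowns per level grows faster relative to the number of levels.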

Block-ILU
- Apply ILU to diagonal blocks
- Higher parallelism
- Usually more iterations required (problem-dependent)
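
The effect of restricting ILU to diagonal blocks on the dependency structure can be illustrated on a chain of unknowns (the lower-triangular pattern of a tridiagonal matrix). This sketch only models the dependencies, not the numerical factorization, and the helper names are hypothetical.

```python
def num_levels(lower_deps):
    """Number of levels for a lower-triangular dependency pattern,
    where lower_deps[i] lists the rows j < i that row i depends on."""
    level = []
    for deps in lower_deps:
        level.append(1 + max((level[j] for j in deps), default=0))
    return max(level, default=0)


def restrict_to_blocks(lower_deps, block_size):
    """Drop every dependency that couples two different diagonal blocks."""
    return [[j for j in deps if j // block_size == i // block_size]
            for i, deps in enumerate(lower_deps)]


n, block_size = 12, 4
chain = [[i - 1] if i > 0 else [] for i in range(n)]

full_levels = num_levels(chain)
block_levels = num_levels(restrict_to_blocks(chain, block_size))
print(full_levels, block_levels)
```

The full chain needs 12 sequential steps, while three blocks of size 4 need only 4 steps because the blocks are processed independently. The dropped couplings between blocks are the reason the preconditioner becomes weaker and more solver iterations are usually required.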

Benchmark: Setup

Benchmark Setup
Hardware
- NVIDIA GTX 580 (default)
- AMD HD 7970 (only for final benchmark)
- Intel Core2Quad 9550
Numbering
- Lexicographic
- Red-Black
- Minimum Degree
Remarks
- Setup purely on CPU, not included
- Data transfer costs not included
- OpenCL for both GPUs

Benchmark Case Study 1: 2D Poisson, Structured Grid

Benchmarks
[Figure "2D FDM - Solver Time": execution time (sec) versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; CPU; unpreconditioned.]
[Figure "2D FDM": iterations versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; unpreconditioned.]

Benchmarks
[Figure "2D FDM - Preconditioner Application": execution time (sec) versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]
[Figure "2D FDM": ILU levels versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]

Benchmark Case Study 2: 3D Poisson, Structured Grid

Benchmarks
[Figure "3D FDM - Solver Time": execution time (sec) versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; CPU; unpreconditioned.]
[Figure "3D FDM": iterations versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; unpreconditioned.]

Benchmarks
[Figure "3D FDM - Preconditioner Application": execution time (sec) versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]
[Figure "3D FDM": ILU levels versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]

Benchmark Case Study 3: 2D Poisson, Unstructured Grid

Benchmarks
[Figure "2D FEM - Solver Time": execution time (sec) versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; CPU; unpreconditioned.]
[Figure "2D FEM": iterations versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; unpreconditioned.]

Benchmarks
[Figure "2D FEM - Preconditioner Application": execution time (sec) versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]
[Figure "2D FEM": ILU levels versus unknowns, 10^2 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]

Benchmark Case Study 4: 3D Poisson, Unstructured Grid

Benchmarks
[Figure "3D FEM - Solver Time": execution time (sec) versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; CPU; unpreconditioned.]
[Figure "3D FEM": iterations versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth; unpreconditioned.]

Benchmarks
[Figure "3D FEM - Preconditioner Application": execution time (sec) versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]
[Figure "3D FEM": ILU levels versus unknowns, 10^3 to 10^6. Legend: Lexicographic; Red-Black; Min-Bandwidth.]

Coloring
- Color the dependency graph
- Purely algebraic
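
A purely algebraic coloring needs only the adjacency structure of the matrix, not the mesh. The sketch below uses a simple greedy coloring as a stand-in (the slides do not say which coloring algorithm was used), renumbers the unknowns color by color, and counts the resulting levels. All names are hypothetical.

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color not used by an already
    colored neighbor. adjacency[v] lists the neighbors of node v."""
    color = [None] * len(adjacency)
    for v in range(len(adjacency)):
        used = {color[w] for w in adjacency[v] if color[w] is not None}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def color_ordering(color):
    """New ordering: all nodes of color 0 first, then color 1, and so on."""
    return sorted(range(len(color)), key=lambda v: (color[v], v))


def levels_for_ordering(adjacency, order):
    """Number of levels when unknowns are numbered according to order."""
    new_index = {v: k for k, v in enumerate(order)}
    level = {}
    for v in order:
        deps = [level[w] for w in adjacency[v] if new_index[w] < new_index[v]]
        level[v] = 1 + max(deps, default=0)
    return max(level.values())


# Path graph with 6 nodes (pattern of a tridiagonal matrix).
adjacency = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
colors = greedy_coloring(adjacency)
order = color_ordering(colors)

natural_levels = levels_for_ordering(adjacency, list(range(6)))
colored_levels = levels_for_ordering(adjacency, order)
print(colors, order, natural_levels, colored_levels)
```

Nodes of the same color are never adjacent, so after renumbering the number of levels is at most the number of colors. For the path graph this reduces 6 levels to 2.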

Benchmarks
[Figure "2D FEM - Solver Time": execution time (sec) versus unknowns, 10^2 to 10^6. Legend: NVIDIA GTX 580, ILU; NVIDIA GTX 580, no; AMD HD 7970, ILU; AMD HD 7970, no; CPU, no.]

Conclusion
ILU Preconditioners
- Fine-grained parallelism exploitable (if done right)
- Higher-order discretizations less parallel
Matrix Pattern
- CPU: banded for cache reuse
- GPU: colored for parallelism
Availability
- ViennaCL: http://viennacl.sourceforge.net/
- (PETSc: http://www.mcs.anl.gov/petsc/)
- ViennaCL + PETSc tutorial on Thursday afternoon!