On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

1 On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators. Karl Rupp, Barry Smith. Mathematics and Computer Science Division, Argonne National Laboratory. FEMTEC 2013, May 20th, 2013.

2 Hardware: Parallel Hardware Constraints

3 GPUs: Disillusion. Computing architecture schematic: CPU (100 GFLOPs SP, 50 GFLOPs DP; 2x12 GB/s memory bandwidth) connected to GPU (1000 GFLOPs SP, 250 GFLOPs DP; 8x20 GB/s memory bandwidth) via PCI Express (8 GB/s, ~1 us latency). GPUs are good for large FLOP-intensive tasks and offer high memory bandwidth. PCI Express can be a bottleneck. 10-fold speedups are (usually) not backed by hardware.
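A rough back-of-the-envelope sketch of the PCI Express bottleneck, using the vector addition x = y + z as the workload. It assumes that the schematic's 8 GB/s and ~1 us belong to PCI Express and that 8x20 GB/s is the GPU memory bandwidth; the function names and the simple "latency plus bytes over bandwidth" cost model are illustrative assumptions, not measurements from the talk.

```python
# Figures read off the schematic (their assignment to the components is an assumption).
PCIE_BANDWIDTH = 8e9           # bytes per second
PCIE_LATENCY = 1e-6            # seconds per transfer
GPU_MEM_BANDWIDTH = 8 * 20e9   # bytes per second


def pcie_time(n, bytes_per_entry=8, transfers=3):
    """Time to move y and z to the device and x back (three vectors of n doubles)."""
    return transfers * (PCIE_LATENCY + n * bytes_per_entry / PCIE_BANDWIDTH)


def device_time(n, bytes_per_entry=8, streams=3):
    """Time to stream two reads and one write through device memory."""
    return streams * n * bytes_per_entry / GPU_MEM_BANDWIDTH


def transfer_to_compute_ratio(n):
    """How many times longer the bus transfers take than the on-device work."""
    return pcie_time(n) / device_time(n)


if __name__ == "__main__":
    for n in (10**3, 10**6):
        print(n, transfer_to_compute_ratio(n))
```

Under this model the bus traffic costs roughly 20 times the on-device work for large vectors, which is why data should stay on the accelerator across many operations.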

4 Benchmarks: Vector Addition x = y + z. [Figure: execution time (sec) vs. vector size; series: NVIDIA GTX 285 (CUDA), NVIDIA GTX 285 (OpenCL), AMD Radeon HD 7970 (OpenCL), Intel Xeon Phi Beta (OpenCL), Intel Xeon Phi Beta (native), Intel Xeon X5550 (OpenMP), Intel Xeon X5550 (single-threaded).]

5 Basic Idea: Factor sparse matrix A ≈ LŨ, with L and Ũ sparse and triangular. ILU0: pattern of L, Ũ equal to that of A. ILUT: keep k elements per row. Solver cycle phase: residual correction LŨx = z via forward solve Ly = z, then backward solve Ũx = y. Little parallelism in general.
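The two triangular solves can be sketched as follows. This is a minimal sequential illustration, not the talk's implementation: it assumes L has a unit diagonal (the usual ILU convention), and the row-wise storage as lists of (column, value) pairs as well as all function names are hypothetical.

```python
def forward_solve(L_rows, z):
    """Solve L y = z for a unit lower triangular L.

    L_rows[i] holds (j, value) pairs of the strictly lower part of row i (j < i).
    """
    y = [0.0] * len(z)
    for i, row in enumerate(L_rows):
        # y[i] needs every y[j] with j < i referenced by row i: a sequential chain.
        y[i] = z[i] - sum(v * y[j] for j, v in row)
    return y


def backward_solve(U_rows, U_diag, y):
    """Solve U x = y for an upper triangular U.

    U_rows[i] holds (j, value) pairs of the strictly upper part of row i (j > i),
    U_diag[i] is the diagonal entry of row i.
    """
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(v * x[j] for j, v in U_rows[i])) / U_diag[i]
    return x


def apply_ilu(L_rows, U_rows, U_diag, z):
    """Residual correction: solve L U x = z by a forward and a backward solve."""
    return backward_solve(U_rows, U_diag, forward_solve(L_rows, z))


if __name__ == "__main__":
    L_rows = [[], [(0, 0.5)], [(1, 0.25)]]
    U_rows = [[(1, 1.0)], [(2, 2.0)], []]
    U_diag = [2.0, 4.0, 8.0]
    print(apply_ilu(L_rows, U_rows, U_diag, [4.0, 16.0, 27.5]))
```

Each loop iteration reads results of earlier iterations, which is the source of the limited parallelism noted on the slide.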

6 Level Scheduling: Build the dependency graph. Substitute as many entries as possible simultaneously. Trade-off: a separate kernel for each step vs. multiple steps in a single kernel.
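A minimal sketch of level scheduling for the forward solve, assuming a unit lower triangular L stored row-wise as lists of (column, value) pairs. The function names are hypothetical, and the inner loop over a level runs sequentially here; on an accelerator it is the part that would be executed in parallel.

```python
def build_levels(L_rows):
    """Group the rows of a lower triangular matrix into dependency levels.

    Row i depends on every row j < i for which L[i][j] is nonzero. Its level is
    one more than the highest level among its dependencies (0 if it has none).
    """
    level = [0] * len(L_rows)
    for i, row in enumerate(L_rows):
        level[i] = 1 + max((level[j] for j, _ in row), default=-1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        levels[lvl].append(i)
    return levels


def level_scheduled_forward_solve(L_rows, levels, z):
    """Solve L y = z level by level for a unit lower triangular L."""
    y = list(z)
    for rows in levels:
        # All rows of one level only read entries from earlier levels,
        # so they could be processed simultaneously.
        for i in rows:
            y[i] = z[i] - sum(v * y[j] for j, v in L_rows[i])
    return y


if __name__ == "__main__":
    L_rows = [[], [], [(0, 0.5)], [(1, 0.5)], [(2, 1.0), (3, 1.0)]]
    levels = build_levels(L_rows)
    print(levels)
    print(level_scheduled_forward_solve(L_rows, levels, [2.0, 4.0, 3.0, 5.0, 10.0]))
```

The number of levels equals the number of synchronization points, so few levels with many rows each is the favorable case.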

9 Interpretation on Structured Grids: 2D finite-difference discretization. An unknown can be substituted as soon as all of its neighbors with smaller index have been computed. Works particularly well in 3D.
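To make the grid interpretation concrete, the sketch below computes the levels for a lexicographically numbered grid. It assumes the standard 5-point (2D) and 7-point (3D) finite-difference stencils, which the slide does not state explicitly; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def grid_levels(shape):
    """Level of every grid point for a lexicographically numbered stencil.

    A point depends on its neighbors with smaller index, i.e. the points
    obtained by decreasing exactly one coordinate by one.
    """
    level = {}
    for idx in product(*(range(n) for n in shape)):  # lexicographic order
        preds = [
            idx[:d] + (idx[d] - 1,) + idx[d + 1:]
            for d in range(len(shape))
            if idx[d] > 0
        ]
        level[idx] = 1 + max((level[p] for p in preds), default=-1)
    return level


def level_sizes(shape):
    """Number of unknowns that can be substituted simultaneously in each level."""
    counts = Counter(grid_levels(shape).values())
    return [counts[lvl] for lvl in range(len(counts))]


if __name__ == "__main__":
    print(level_sizes((4, 4)))     # levels are the anti-diagonals of the grid
    print(level_sizes((4, 4, 4)))  # levels are the planes i + j + k = const
```

On a 4x4 grid the 16 unknowns fall into 7 levels, while the 64 unknowns of a 4x4x4 grid fall into only 10 levels, so each level holds far more independent work in 3D.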

21 Block-ILU: Apply ILU to diagonal blocks. Higher parallelism. Usually more iterations required (problem-dependent).
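A small sketch of the block approach, assuming ILU0 as the per-block factorization and equally sized contiguous blocks. Dense lists of lists are used only to keep the example short, and all function names are hypothetical; a real implementation would work on sparse storage.

```python
def ilu0(A):
    """ILU0 of a square matrix given as a list of rows (zeros mark the pattern).

    Returns one matrix holding the strictly lower part of L (unit diagonal
    implied) and the upper triangular factor. No fill-in is created outside
    the nonzero pattern of A.
    """
    n = len(A)
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    LU = [list(row) for row in A]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            LU[i][k] /= LU[k][k]
            for j in range(k + 1, n):
                if pattern[i][j]:
                    LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def block_ilu0(A, block_size):
    """Factor each diagonal block independently; couplings between blocks are dropped."""
    factors = []
    for start in range(0, len(A), block_size):
        stop = min(start + block_size, len(A))
        block = [row[start:stop] for row in A[start:stop]]
        factors.append((start, ilu0(block)))
    return factors


def apply_block_ilu0(factors, z):
    """Apply the preconditioner: one forward and one backward solve per block."""
    x = list(z)
    for start, LU in factors:  # blocks share no data and could run in parallel
        m = len(LU)
        for i in range(m):  # forward solve with the unit lower factor
            x[start + i] -= sum(LU[i][j] * x[start + j] for j in range(i))
        for i in reversed(range(m)):  # backward solve with the upper factor
            s = sum(LU[i][j] * x[start + j] for j in range(i + 1, m))
            x[start + i] = (x[start + i] - s) / LU[i][i]
    return x


if __name__ == "__main__":
    A = [[2.0, -1.0, 0.0, 0.0],
         [-1.0, 2.0, -1.0, 0.0],
         [0.0, -1.0, 2.0, -1.0],
         [0.0, 0.0, -1.0, 2.0]]
    factors = block_ilu0(A, block_size=2)
    print(apply_block_ilu0(factors, [1.0, 1.0, 1.0, 1.0]))
```

In the example the entries coupling unknowns 1 and 2 are ignored by the preconditioner, which illustrates why the gain in parallelism is usually paid for with additional solver iterations.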

23 Benchmark: Setup

24 Benchmark Setup. Hardware: NVIDIA GTX 580 (default), AMD HD 7970 (only for final benchmark), Intel Core2Quad 9550. Numbering: Lexicographic, Red-Black, Minimum Degree. Remarks: preconditioner setup runs purely on the CPU and is not included in the reported times; data transfer costs are not included; OpenCL is used for both GPUs.

25 Benchmark Case Study 1: 2D Poisson, Structured Grid

26 Benchmarks: 2D FDM - Solver Time. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, CPU, unpreconditioned.] [Figure: iterations vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]

27 Benchmarks: 2D FDM - Preconditioner Application. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.] [Figure: ILU levels vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.]

28 Benchmark Case Study 2: 3D Poisson, Structured Grid

29 Benchmarks: 3D FDM - Solver Time. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, CPU, unpreconditioned.] [Figure: iterations vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]

30 Benchmarks: 3D FDM - Preconditioner Application. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.] [Figure: ILU levels vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.]

31 Benchmark Case Study 3: 2D Poisson, Unstructured Grid

32 Benchmarks: 2D FEM - Solver Time. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, CPU, unpreconditioned.] [Figure: iterations vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]

33 Benchmarks: 2D FEM - Preconditioner Application. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.] [Figure: ILU levels vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.]

34 Benchmark Case Study 4: 3D Poisson, Unstructured Grid

35 Benchmarks: 3D FEM - Solver Time. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, CPU, unpreconditioned.] [Figure: iterations vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth, unpreconditioned.]

36 Benchmarks: 3D FEM - Preconditioner Application. [Figure: execution time (sec) vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.] [Figure: ILU levels vs. number of unknowns; series: Lexicographic, Red-Black, Min-Bandwidth.]

37 Coloring: Color the dependency graph. Purely algebraic.
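One simple way to color a dependency graph is a greedy pass over the nodes, sketched below. Greedy coloring is an assumption chosen for illustration; the slide does not say which coloring algorithm is used. The adjacency dictionary stands in for the sparsity pattern of the matrix, and the function names are hypothetical.

```python
def greedy_coloring(adjacency):
    """Assign each node the smallest color not used by an already colored neighbor.

    adjacency maps a node to the nodes it is coupled with, i.e. the off-diagonal
    nonzeros of its matrix row. Only the sparsity pattern is needed, no mesh data.
    """
    colors = {}
    for node in sorted(adjacency):
        used = {colors[nb] for nb in adjacency[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def color_ordering(colors):
    """New numbering: all nodes of color 0 first, then color 1, and so on."""
    return sorted(colors, key=lambda node: (colors[node], node))


if __name__ == "__main__":
    # Pattern of a tridiagonal matrix: a chain of six unknowns.
    n = 6
    adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
    colors = greedy_coloring(adjacency)
    print(colors)
    print(color_ordering(colors))
```

Unknowns of the same color are not coupled with each other, so after renumbering by color the triangular solves need at most one level per color. For the chain above the result is the familiar red-black ordering.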

40 Benchmarks: FEM - Solver Time. [Figure: execution time (sec) vs. number of unknowns; series: NVIDIA GTX 580 (ILU), NVIDIA GTX 580 (no ILU), AMD HD 7970 (ILU), AMD HD 7970 (no ILU), CPU (no ILU).]

41 Conclusion. ILU Preconditioners: fine-grained parallelism is exploitable (if done right); higher-order discretizations are less parallel. Matrix Pattern: on the CPU, banded for cache reuse; on the GPU, colored for parallelism. Availability: ViennaCL, PETSc. ViennaCL + PETSc tutorial on Thursday afternoon!
