Iterative Sparse Triangular Solves for Preconditioning


1 Euro-Par 2015, Vienna Aug 24-28, 2015 Iterative Sparse Triangular Solves for Preconditioning Hartwig Anzt, Edmond Chow and Jack Dongarra

2 Incomplete Factorization Preconditioning Incomplete LU factorizations (ILU) form a popular class of preconditioners for iteratively solving sparse linear systems of equations $Ax = b$. Basic idea: Factor the sparse matrix $A \approx LU$, with $L$ and $U$ sparse and triangular. For the symmetric case: $A \approx LL^{T}$. Fill-in is restricted to a sparsity pattern $S$. For ILU(0), $S$ is the sparsity pattern of $A$. Iterative top-level solver: Compute the ILU in the preconditioner setup. Residual correction $LUx = z$. Iterations solve the preconditioned problem: forward solve $Ly = b$, backward solve $Ux = y$.
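To make the role of the sparsity pattern $S$ concrete, here is a minimal Python sketch of ILU(0). It is an illustration only, not the implementation used in the talk: the function names `ilu0` and `matmul`, the dense list-of-lists storage, and the 3x3 arrow-shaped example matrix are all assumptions chosen for brevity.

```python
def ilu0(A):
    """Return (L, U) with unit-diagonal L; fill-in is restricted to the pattern of A."""
    n = len(A)
    # Sparsity pattern S of A: the only positions allowed to become nonzero.
    S = [[A[i][j] != 0.0 for j in range(n)] for i in range(n)]
    W = [row[:] for row in A]  # working copy, overwritten by the factors
    for i in range(1, n):
        for k in range(i):
            if not S[i][k]:
                continue
            W[i][k] /= W[k][k]                 # multiplier l_ik
            for j in range(k + 1, n):
                if S[i][j]:                    # updates outside S are dropped
                    W[i][j] -= W[i][k] * W[k][j]
    L = [[W[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[W[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U


def matmul(X, Y):
    """Dense matrix product, used only to inspect L*U in this example."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Arrow-shaped matrix: an exact LU factorization would create fill-in
# at positions (1, 2) and (2, 1), which ILU(0) discards.
A = [[4.0, -1.0, -1.0],
     [-1.0, 4.0, 0.0],
     [-1.0, 0.0, 4.0]]
L, U = ilu0(A)
P = matmul(L, U)
```

In this example the product `P` agrees with `A` on every position inside the pattern, while the discarded fill-in shows up as a nonzero entry of `P` at position (1, 2), where `A` is zero. This is the sense in which $LU \approx A$ holds only approximately.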

3 Sparse Triangular Solves Traditionally handled via forward/backward substitutions: Limited parallelism: level scheduling, multi-color ordering. Error-prone: a bit flip propagates through the whole solution. Idea: Approximate Sparse Triangular Solves: Replace the forward/backward substitutions with an iterative method. Only low solution accuracy is required, as $LU \approx A$ is typically only a rough approximation. Better scalability of iterative methods.
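The limited parallelism of level scheduling can be seen in a small Python sketch. This is an illustration under stated assumptions, not the cuSPARSE algorithm: the names `level_sets` and `solve_by_levels`, the dense storage, and the 4x4 example are hypothetical. A row's level is one more than the highest level among the rows it depends on; rows within the same level are independent, but the levels themselves must be processed one after another.

```python
def level_sets(L):
    """Assign each row of a lower triangular matrix to a dependency level."""
    n = len(L)
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j in range(i) if L[i][j] != 0.0]
        level[i] = (1 + max(deps)) if deps else 0
    return level


def solve_by_levels(L, b):
    """Forward substitution that processes the rows level by level."""
    n = len(L)
    level = level_sets(L)
    x = [0.0] * n
    for lev in range(max(level) + 1):
        # All rows in this level are independent and could run in parallel.
        for i in (r for r in range(n) if level[r] == lev):
            acc = sum(L[i][j] * x[j] for j in range(i))
            x[i] = (b[i] - acc) / L[i][i]
    return x


L = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 4.0, 0.0],
     [0.0, 3.0, 1.0, 2.0]]
b = [2.0, 2.0, 13.0, 17.0]
levels = level_sets(L)        # rows 0 and 1 share a level; rows 2 and 3 do not
x = solve_by_levels(L, b)
```

For a bidiagonal matrix every row lands in its own level, so the number of sequential steps equals the matrix dimension and no parallelism remains.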

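4 Approximate Sparse Triangular Solves Jacobi iteration: $x^{k+1} = D^{-1}\left(b - (A - D)\,x^{k}\right) = D^{-1} b + M x^{k}$ with $M = D^{-1}(D - A)$. For the triangular factors, with $D_L$ and $D_U$ the diagonal parts of $L$ and $U$: $M_L = D_L^{-1}(D_L - L) = I - L$ (unit-diagonal $L$, so $D_L = I$) and $M_U = D_U^{-1}(D_U - U) = I - D_U^{-1} U$. Advantage: fine-grained parallelism on component level. Drawback: synchronizations between iterations.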
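The Jacobi iteration for a triangular system can be sketched in a few lines of Python. The function name `jacobi_sweeps`, the zero initial guess, the dense storage, and the 4x4 example are assumptions made for illustration; a GPU implementation would instead use a sparse format and an SpMV kernel.

```python
def jacobi_sweeps(T, b, sweeps):
    """Apply `sweeps` Jacobi iterations to T x = b, starting from x = 0."""
    n = len(T)
    x = [0.0] * n
    for _ in range(sweeps):
        # Every component is computed from the previous iterate only,
        # so all n updates of one sweep are independent of each other.
        x = [(b[i] - sum(T[i][j] * x[j] for j in range(n) if j != i)) / T[i][i]
             for i in range(n)]
    return x


L = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 4.0, 0.0],
     [0.0, 3.0, 1.0, 2.0]]
b = [2.0, 2.0, 13.0, 17.0]    # chosen so that the exact solution is [1, 2, 3, 4]

one_sweep = jacobi_sweeps(L, b, 1)      # equals D^{-1} b
n_sweeps = jacobi_sweeps(L, b, len(L))  # exact up to rounding
```

Because the iteration matrix of a triangular system is strictly triangular and therefore nilpotent, the iteration started from zero reproduces the exact solution after at most $n$ sweeps. The approach is only attractive when far fewer sweeps already give a useful preconditioner.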

5 Approximate Sparse Triangular Solves Jacobi iteration: $x^{k+1} = D^{-1}\left(b - (A - D)\,x^{k}\right) = D^{-1} b + M x^{k}$. Block-asynchronous (BA) Jacobi: Always use the latest available values in the update. Jacobi-type updates on (GPU thread) blocks: $x^{k+1} = D^{-1} b + M x^{\ast}$, where $x^{\ast}$ denotes the latest available values. Advantages: fine-grained parallelism on component level; no synchronizations between iterations.

6 Experiment Setup Tesla K40 GPU (1,682 GFlop/s DP, 12 GB, 288 GB/s). CUDA v. 7.0, thread-block size 128. SpMV and exact sparse triangular solves from cuSPARSE v. 7.0. All results averaged over 50 runs. Test matrices from the University of Florida Matrix Collection (UFMC) or from a finite-difference (FD) discretization of the Laplace operator in 3D.

7 cuSPARSE Triangular Solves vs. Jacobi Runtime comparison [ms] between the exact triangular solve using the cuSPARSE level-scheduling implementation and one Jacobi sweep. Relevant in production code: how many Jacobi sweeps are necessary to provide the same improvement to the preconditioned top-level solver?

8 Jacobi vs. Block-asynchronous Jacobi Block-asynchronous Jacobi (BA-Jacobi) uses newer values for some of the other blocks (Gauss-Seidel flavor). If the blocks are updated in dependency order, information propagates similarly to exact forward/backward solves. Faster convergence is expected for top-down scheduling in $Ly = b$ and bottom-up scheduling in $Ux = y$. NVIDIA's GPUs typically execute independent thread blocks in scheduling order.
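The effect of the block update order can be emulated sequentially in Python. This is a deterministic toy model, not the GPU kernel from the talk: the names `block_sweep` and `sweeps_to_converge`, the block size of 2, and the bidiagonal test matrix are assumptions. Each block performs Jacobi-type updates from a snapshot taken when the block starts, so it sees the new values of blocks processed earlier in the same sweep.

```python
def block_sweep(T, b, x, block_size, order):
    """One sweep of block-wise Jacobi updates, visiting blocks in `order`."""
    n = len(T)
    starts = list(range(0, n, block_size))
    if order == "backward":
        starts.reverse()
    for s in starts:
        snapshot = x[:]  # values visible to this block when it starts
        for i in range(s, min(s + block_size, n)):
            acc = sum(T[i][j] * snapshot[j] for j in range(n) if j != i)
            x[i] = (b[i] - acc) / T[i][i]
    return x


def sweeps_to_converge(T, b, exact, block_size, order, tol=1e-12, max_sweeps=100):
    """Count sweeps until the maximum componentwise error drops below tol."""
    x = [0.0] * len(T)
    for sweep in range(1, max_sweeps + 1):
        block_sweep(T, b, x, block_size, order)
        if max(abs(xi - ei) for xi, ei in zip(x, exact)) < tol:
            return sweep
    return None


# Lower bidiagonal system L y = b with exact solution y = [1, ..., 1].
n = 8
L = [[2.0 if i == j else (-1.0 if j == i - 1 else 0.0) for j in range(n)]
     for i in range(n)]
exact = [1.0] * n
b = [sum(L[i][j] * exact[j] for j in range(n)) for i in range(n)]

forward = sweeps_to_converge(L, b, exact, 2, "forward")
backward = sweeps_to_converge(L, b, exact, 2, "backward")
```

In this toy setting the forward (top-down) order reaches the exact solution in 5 sweeps, while the backward order needs 8, the same as plain Jacobi, because no block ever sees a freshly updated dependency. For an upper triangular system the roles of the two orders are reversed.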

9 Approximate Sparse Triangular Solves [Figure: Laplace3D, $Ly = b$. Residual norm for Jacobi, forward, and backward; one panel over runtime [ms].] Faster convergence of BA-Jacobi when thread blocks are scheduled in forward order. Better TTS performance of the cuSPARSE-based Jacobi.

10 Approximate Sparse Triangular Solves [Figure: Laplace3D, $Ux = y$. Residual norm for Jacobi, forward, and backward; one panel over runtime [ms].] Faster convergence of BA-Jacobi when thread blocks are scheduled in backward order. Better TTS performance of the cuSPARSE-based Jacobi.

11 Approximate Sparse Triangular Solves [Figure: DC, $Ly = b$. Residual norm for Jacobi, forward, and backward; one panel over runtime [ms].] Faster convergence of BA-Jacobi when thread blocks are scheduled in forward order. The cuSPARSE-based Jacobi suffers from the unbalanced nonzero distribution.

12 Approximate Sparse Triangular Solves [Figure: DC, $Ux = y$. Residual norm for Jacobi, forward, and backward; one panel over runtime [ms].] Faster convergence of BA-Jacobi when thread blocks are scheduled in backward order. All methods suffer from the unbalanced nonzero distribution.

13 F-GMRES(50) [Figure: FGMRES(50) iterations for the test matrices CHP, DC, STO, and VEN.]

14 F-GMRES(50) [Figure: FGMRES(50) runtime [s] for the test matrices CHP, DC, STO, and VEN.]

15 F-GMRES(50) [Figure: FGMRES(50) iterations and FGMRES(50) runtime [s] for the test matrix LAP.] Better convergence properties when using dependency-aware scheduling. Cannot compete with the performance of Jacobi based on the highly tuned cuSPARSE SpMV.

16 Summary A few sweeps are sufficient to provide an improvement comparable to exact triangular solves. The iteration overhead is easily compensated for by faster solver execution. An unbalanced nonzero distribution requires sophisticated implementations to load-balance approximate triangular solves. Future research: Add local iterations on Jacobi blocks (local in GPU thread block). Faster information propagation by using overlapping Jacobi blocks (RAS) (first results at SIAM LA '15 in Atlanta). This research is based on a cooperation with Edmond Chow from Georgia Institute of Technology, and supported by the U.S. Department of Energy and NVIDIA.

Iterative Sparse Triangular Solves for Preconditioning. Hartwig Anzt (1), Edmond Chow (2), and Jack Dongarra (1). (1) University of Tennessee, Knoxville, TN, USA, hanzt@icl.utk.edu, dongarra@eecs.utk.edu. (2) Georgia