Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs

Size: px

Start display at page:

Download "Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs"

Daniel Daniel
5 years ago
Views:

1 Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran Flegar, Enrique Quintana-Orti

2 Workshop on Batched, Reproducible, and Reduced Precision BLAS Atlanta, GA 02/25/2017 Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs Hartwig Anzt Joint work with Goran Flegar, Enrique Quintana-Orti Pitch from the sparse community: Small size (<30), somewhat flexible Add inversion routine to Batched-Blas Go for Gauss-Jordan Elimination instead of LU

Iterative linear system solvers Goal: solve linear system Ax = b, A 2 R n n Use iterative Krylov method for generating solution approximation Convergence typically benefits from using a

3 Iterative linear system solvers Goal: solve linear system Ax = b, A 2 R n n Use iterative Krylov method for generating solution approximation Convergence typically benefits from using a preconditioner We focus on Jacobi preconditioners based on diagonal scaling P =(diag(a)) 1 Provide a high level of parallelism Attractive for many-core accelerators like GPUs Scalar Jacobi acts on the main diagonal block-jacobi acts on the block-diagonal 3

4 Block-Jacobi preconditioner Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM) Often superior to scalar Jacobi preconditioner Light-weight preconditioner (compared to ILU) Parallel preconditioner application ( batched trsv / SpMV + vector scaling ) Inversion of diag. blocks in prec. setup + SpMV in prec. application Factorization of diag. blocks in prec. setup + trsv in prec. application 4

5 Block-Jacobi preconditioner Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM) Often superior to scalar Jacobi preconditioner Light-weight preconditioner (compared to ILU) Parallel preconditioner application ( batched trsv / SpMV + vector scaling ) Inversion of diag. blocks in prec. setup + SpMV in prec. application Factorization of diag. blocks in prec. setup + trsv in prec. application 5

6 Block-Jacobi preconditioner Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM) Often superior to scalar Jacobi preconditioner Light-weight preconditioner (compared to ILU) Parallel preconditioner application ( batched trsv / SpMV + vector scaling ) Inversion of diag. blocks in prec. setup + SpMV in prec. application Factorization of diag. blocks in prec. setup + trsv in prec. application Preconditioner setup (inversion of diagonal blocks) can become a challenge Batched inversion of many small systems Diagonal blocks are typically of small size 6

7 Block-Jacobi preconditioner generation 1. extract diagonal block from sparse data structure invert diagonal block via Gauss-Jordan elimination insert solution as diagonal block into the preconditioner matrix 7

8 Block-Jacobi preconditioner generation 1. extract diagonal block from sparse data structure invert diagonal block via Gauss-Jordan elimination insert solution as diagonal block into the preconditioner matrix 8

Batched Gauss-Jordan elimination (BGJE) on GPUs Flexible size up to 32 x 32 One warp per system Inversion in registers Communication via shfl() Implicit pivoting: accumulate

9 Batched Gauss-Jordan elimination (BGJE) on GPUs Flexible size up to 32 x 32 One warp per system Inversion in registers Communication via shfl() Implicit pivoting: accumulate row/col swaps realize them when writing results Advantages over LU-based approach: No operations on triangular systems Implicit pivoting Implementation can use static system size 9

10 NVIDIA Pascal architecture (P100) NVIDIA whitepaper: 56 SMX, 5.3 TF DP, 768 GB/s ( 6 : 1 ) 2n 3 operations for inversion Flexible-size inversion up to size 32x32, one warp per system. 600 BGJE 600 BGJE GFlop/s GFlop/s Batch size Matrix size Batched Gauss-Jordan Elimination (BGJE) achieves >10% of DP peak! Anzt et al. Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs. PMAM'17, 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp 1-10,

11 Block-Jacobi preconditioner generation 1. extract diagonal block from sparse data structure invert diagonal block via Gauss-Jordan elimination insert solution as diagonal block into the preconditioner matrix 11

12 Challenge: diagonal block extraction from sparse data structures 1. extract diagonal block from sparse data structure invert diagonal block via Gauss-Jordan elimination 2. Matrix in CSR format Diagonal block extraction costly Assign one thread per row Unbalanced nonzero distribution results in load imbalance Dense rows become bottleneck Non-coalescent memory access 3. insert solution as diagonal block into the preconditioner matrix 12

13 Challenge: diagonal block extraction from sparse data structures memory requests cached extraction Matrix in CSR format 1 Diagonal block extraction costly Assign one thread per row 2 Unbalanced nonzero distribution results in load imbalance Dense rows become bottleneck 3 Non-coalescent memory access 4 5 row-ptr col-indices r 13

14 Challenge: diagonal block extraction from sparse data structures memory requests row-ptr cached extraction col-indices row-ptr shared memory extraction col-indices shared_mem Matrix in CSR format Diagonal block extraction costly Assign one thread per row Unbalanced nonzero distribution results in load imbalance Dense rows become bottleneck Non-coalescent memory access Add intermediate step: Extract diag. blocks to shared mem. Multiple threads for each row Coalescent access to data in CSR Col-major storage 14

Challenge: diagonal block extraction from sparse data structures memory requests 1 2 3 4 5 row-ptr cached extraction col-indices row-ptr shared memory extraction col-indices shared_mem Matrix in CSR

15 Challenge: diagonal block extraction from sparse data structures memory requests row-ptr cached extraction col-indices row-ptr shared memory extraction col-indices shared_mem Matrix in CSR format Diagonal block extraction costly Assign one thread per row Unbalanced nonzero distribution results in load imbalance Dense rows become bottleneck Non-coalescent memory access Add intermediate step: Extract diag. blocks to shared mem. Multiple threads for each row Coalescent access to data in CSR Col-major storage 15

16 Challenge: diagonal block extraction from sparse data structures memory requests row-ptr cached extraction col-indices row-ptr shared memory extraction col-indices shared_mem Matrix in CSR format Diagonal block extraction costly Assign one thread per row Unbalanced nonzero distribution results in load imbalance Dense rows become bottleneck Non-coalescent memory access Add intermediate step: Extract diag. blocks to shared mem. Multiple threads for each row Coalescent access to data in CSR Col-major storage Merge kernels vs. 3 separate kernels: [ extraction + inversion + insertion ] [ extraction ] + [ inversion ] + [ insertion ] 16

17 Diagonal block extraction from sparse data structures US: unmerged + shared extraction UC: unmerged + cached extraction MS: merged + shared extraction MC: merged + cached extraction Results show: Runtime block-jacobi vs. BGJE 56 test matrices (SuiteSparse) Data arranged w.r.t. performance of (avg.) best strategy BJ generation time / BGJE time BJ generation time / BGJE time US UC MS MC Block K40 size Test matrices US UC MS MC Block size Test matrices 17

18 Diagonal block extraction from sparse data structures US: unmerged + shared extraction UC: unmerged + cached extraction MS: merged + shared extraction MC: merged + cached extraction Tridiagonal-type matrix structure: - balanced nnz/row Arrow-type matrix structure: - few rows with many elements - corresponding threads are bottleneck for cached extraction BJ generation time / BGJE time BJ generation time / BGJE time US UC MS MC Matrix size US UC MS MC Tridiagonal-type Arrow-type Matrix size

19 Block-Jacobi preconditioner setup cost Block Jacobi generation time [s] Block size 4 Block size 8 Block size 16 Block size 24 Block size 32 Chebyshev2 dw2048 dw1024 LeGresley_2508 Chebyshev3 saylr4 sts4098 olm5000 nasa2910 s2rmq4m1 s2rmt3m1 s3rmt3m1 s3rmt3m3 s3rmq4m1 s1rmt3m1 dw8192 dw4096 Kuu bcsstk38 CurlCurl_0 linverse nemeth15 bcsstk18 piston bcsstk17 nd3k cbuckle Pres_Poisson msc10848 ABACUS_shell_ud nd6k Preconditioner setup needs up to 0.01 s on P100 GPU. Irrelevant for overall Iterative solver execution time. spmsrtls cz40948 gridgena ibm_matrix_2 sme3dc 3D_51448_3D 2D_54019_highK nd12k gas_sensor rail_79841 crankseg_1 matrix_9 dc3 matrix-new_3 nd24k ship_003 CurlCurl_1 F1 af_shell3 ecology2 Emilia_923 G3_circuit Hook_1498 ML_Geer Flan_1565 K40

20 Block-Jacobi preconditioner Can reflect the block-structure of sparse matrices (e.g. Higher-Order FEM) Often superior to scalar Jacobi preconditioner Light-weight preconditioner (compared to ILU) Parallel preconditioner application ( batched trsv / SpMV + vector scaling ) Inversion of diag. blocks in prec. setup + SpMV in prec. application Factorization of diag. blocks in prec. setup + trsv in prec. application 20

21 Factorization-based block-jacobi Preconditioner setup Preconditioner application 2. Generate factorization and pivoting. Apply trsv using pivoting information to turn the right-hand side sub-vector into the solution of the diagonal-block linear problem. 1. extract diagonal block from sparse data structure. 3. Insert factorization as block into the preconditioner matrix and store pivoting. *ipiv 21

22 Factorization-based block-jacobi Batched routines: Small-Size LU, Gauss-Huard, Gauss-Huard-T. Gauss-Huard-T writes the factors in access-friendly fashion. Flexible size up to limit of 32 x 32. Use one warp per system, factorization in registers. GFlop/s Small-Size LU Gauss-Huard Gauss-Huard-T cublas LU dgetrf GFlop/s Small-Size LU Gauss-Huard Gauss-Huard-T cublas LU dgetri Matrix size Matrix size NVIDIA P100 GPU 22

23 NVIDIA Pascal architecture (P100) Gauss-Jordan elimination (BGJE) MAGMA dgetrf-inv (BLU) Gauss-Huard (BGH) Gauss-Huard + transp (BGHT) Time/matrix [ms] Batch size 10 4 BGJE and BLU invert diagonal blocks vs. BGH / BGHT factorizing only. BGHT writes factors in non-coalescent fashion for faster preconditioner application. NVIDIA P100 GPU

24 Inversion-based block-jacobi vs. factorization-based block-jacobi Convergence of BiCGSTAB with block-jacobi preconditioner for 300 matrices Test cases block size 4 block size 8 block size 16 block size 24 block size BGH needs more iterations BGJE needs more iterations Explicit inversion better. Factorization better. NVIDIA P100 GPU 24

25 25 Block-Jacobi preconditioning: Performance goal Runtime (BGJE) [s] Runtime (BGH) [s] Test matrices 10-3 solve setup solve setup Test matrices Execution time of BiCGSTAB with Jacobi(24) preconditioner Runtime [s] BGJE BGH BGHT Test matrices Almost no difference in total execution time. BGH better if setup is significant to iterative solver time. BGJE better for long solver runs. NVIDIA P100 GPU

Pitch from the sparse community: Small size

routine to Batched-Blas Go for Gauss-Jordan

Dongarra (University of Tennessee), Goran

26 Pitch from the sparse community: Small size (<30) somewhat flexible Add inversion routine to Batched-Blas Go for Gauss-Jordan Elimination instead of LU This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (University of Jaume I), and partly funded by the Department of Energy. 26

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning

Variable-Size Batched Gauss-Huard for Block-Jacobi Preconditioning Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 18C (217) 1783 1792 International Conference on Computational Science, ICCS 217, 12-14 June 217, Zurich, Switzerland Variable-Size