Workshop on Batched, Reproducible, and Reduced Precision BLAS, Atlanta, GA, 02/25/2017
Batched Factorization and Inversion Routines for Block-Jacobi Preconditioning on GPUs
Hartwig Anzt
Joint work with Goran Flegar, Enrique Quintana-Orti
http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf
Pitch from the sparse community:
- Small size (<30), somewhat flexible
- Add an inversion routine to Batched BLAS
- Go for Gauss-Jordan elimination instead of LU
Iterative linear system solvers
Goal: solve the linear system Ax = b, A ∈ R^{n×n}
- Use an iterative Krylov method to generate a solution approximation
- Convergence typically benefits from using a preconditioner
- We focus on Jacobi preconditioners based on diagonal scaling: P = (diag(A))^{-1}
- These provide a high level of parallelism
- Attractive for many-core accelerators like GPUs
- Scalar Jacobi acts on the main diagonal; block-Jacobi acts on the block diagonal
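The scaling above can be made concrete with a minimal sketch. This is not the talk's GPU code: it is plain Python on a tiny dense matrix, only to illustrate what P = (diag(A))^{-1} means and how it is applied to a residual vector.

```python
# Illustrative sketch (not the talk's GPU kernels): scalar Jacobi
# preconditioning builds P = (diag(A))^{-1} and applies it as an
# element-wise scaling of the residual.

def scalar_jacobi(A):
    """Return the diagonal of P = (diag(A))^{-1}."""
    n = len(A)
    return [1.0 / A[i][i] for i in range(n)]

def apply_scalar_jacobi(p, r):
    """Preconditioner application: z = P r (element-wise scaling)."""
    return [p_i * r_i for p_i, r_i in zip(p, r)]

A = [[4.0, 1.0], [1.0, 2.0]]
r = [8.0, 9.0]
print(apply_scalar_jacobi(scalar_jacobi(A), r))  # [2.0, 4.5]
```

Block-Jacobi replaces the scalar reciprocals with inverses of small diagonal blocks, which is where the batched inversion routines discussed next come in.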
Block-Jacobi preconditioner
- Can reflect the block structure of sparse matrices (e.g., higher-order FEM)
- Often superior to the scalar Jacobi preconditioner
- Light-weight preconditioner (compared to ILU)
- Parallel preconditioner application (batched trsv / SpMV + vector scaling)
- Option 1: inversion of the diagonal blocks in the preconditioner setup + SpMV in the preconditioner application
- Option 2: factorization of the diagonal blocks in the preconditioner setup + trsv in the preconditioner application
The preconditioner setup (inversion of the diagonal blocks) can become a challenge:
- Batched inversion of many small systems
- The diagonal blocks are typically of small size
Block-Jacobi preconditioner generation
1. Extract the diagonal block from the sparse data structure
2. Invert the diagonal block via Gauss-Jordan elimination
3. Insert the inverse as a diagonal block into the preconditioner matrix
Batched Gauss-Jordan elimination (BGJE) on GPUs
- Flexible size up to 32 × 32
- One warp per system
- Inversion in registers
- Communication via warp shuffles (__shfl())
- Implicit pivoting: accumulate the row/column swaps and realize them when writing the results

Advantages over an LU-based approach:
- No operations on triangular systems
- Implicit pivoting
- The implementation can use a static system size
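The in-place Gauss-Jordan inversion with implicit pivoting can be sketched as follows. This is a hedged, plain-Python illustration of the arithmetic, not the per-warp register implementation from the talk (there, each warp lane holds a row and exchanges values via shuffles): row swaps from partial pivoting are recorded and undone as column swaps only when the inverse is written out.

```python
# Sketch of Gauss-Jordan elimination (GJE) inverting one small block
# in place, with "implicit pivoting": row swaps are accumulated and
# realized as column swaps when writing back the result.

def gje_invert(A):
    n = len(A)
    M = [row[:] for row in A]            # work on a copy
    perm = list(range(n))                # accumulated row swaps
    for k in range(n):
        # partial pivoting: bring the largest entry of column k to the diagonal
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        perm[k], perm[p] = perm[p], perm[k]
        piv = M[k][k]
        M[k][k] = 1.0                    # identity column is written over A
        M[k] = [a / piv for a in M[k]]
        for i in range(n):
            if i != k:
                f = M[i][k]
                M[i][k] = 0.0
                M[i] = [a - f * b for a, b in zip(M[i], M[k])]
    # realize the accumulated row swaps as column swaps of the inverse
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            B[i][perm[j]] = M[i][j]
    return B

print(gje_invert([[2.0, 1.0], [1.0, 1.0]]))  # [[1.0, -1.0], [-1.0, 2.0]]
```

Note that, unlike LU, the loop body is identical for every step and never touches a triangular structure, which is why a static system size and a fixed per-warp code path work so well.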
NVIDIA Pascal architecture (P100)
- NVIDIA whitepaper: 56 SMs, 5.3 TFLOP/s double precision, 768 GB/s (roughly 6:1)
- 2n^3 operations for the inversion
- Flexible-size inversion up to 32 × 32, one warp per system

[Figures: BGJE performance in GFLOP/s over batch size and over matrix size]

Batched Gauss-Jordan elimination (BGJE) achieves >10% of DP peak!

Anzt et al. Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs. PMAM'17, 8th International Workshop on Programming Models and Applications for Multicores and Manycores, pp. 1-10, 2017.
Challenge: diagonal block extraction from sparse data structures
Matrix in CSR format; extracting the diagonal blocks is costly:
- Assign one thread per row
- An unbalanced nonzero distribution results in load imbalance
- Dense rows become the bottleneck
- Non-coalescent memory access
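What "extraction" (step 1 of the generation) means can be shown with a short sketch. This is a hedged, sequential illustration, not the talk's parallel kernels: it copies one dense diagonal block out of a matrix stored in CSR format, skipping entries outside the block and leaving missing entries zero.

```python
# Sketch: extract the dense bs-x-bs diagonal block starting at row/column
# `start` from a matrix stored in CSR format (row_ptr, col_idx, vals).

def extract_diag_block(row_ptr, col_idx, vals, start, bs):
    block = [[0.0] * bs for _ in range(bs)]
    for i in range(bs):
        r = start + i
        for nz in range(row_ptr[r], row_ptr[r + 1]):
            j = col_idx[nz]
            if start <= j < start + bs:          # keep only in-block entries
                block[i][j - start] = vals[nz]
    return block

# 3x3 example matrix in CSR: rows (4,-,1), (-,3,-), (5,-,2)
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [4.0, 1.0, 3.0, 5.0, 2.0]
print(extract_diag_block(row_ptr, col_idx, vals, 0, 2))  # [[4.0, 0.0], [0.0, 3.0]]
```

The GPU issue is visible even here: the inner loop length is the row's nonzero count, so with one thread per row a single dense row serializes the whole block, and neighboring threads read non-contiguous positions of `vals`.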
[Figure: memory requests of the cached extraction from CSR (row-ptr, col-indices)]
Add an intermediate step: shared-memory extraction
- Extract the diagonal blocks to shared memory first
- Multiple threads per row
- Coalescent access to the data in CSR
- Column-major storage

[Figure: memory requests of the cached vs. shared-memory extraction from CSR]
Merged kernel vs. three separate kernels:
[extraction + inversion + insertion] vs. [extraction] + [inversion] + [insertion]
Diagonal block extraction from sparse data structures
- US: unmerged + shared extraction
- UC: unmerged + cached extraction
- MS: merged + shared extraction
- MC: merged + cached extraction

Results show the runtime of the full block-Jacobi generation relative to BGJE alone, for 56 test matrices (SuiteSparse), arranged w.r.t. the performance of the (on average) best strategy.

[Figures: BJ generation time / BGJE time for block sizes 8 and 32 on a K40]
Diagonal block extraction from sparse data structures
- Tridiagonal-type matrix structure: balanced nnz/row
- Arrow-type matrix structure: few rows with many elements; the corresponding threads become the bottleneck for the cached extraction

[Figures: BJ generation time / BGJE time over matrix size for US/UC/MS/MC, tridiagonal-type vs. arrow-type]
Block-Jacobi preconditioner setup cost

[Figure: block-Jacobi generation time for block sizes 4, 8, 16, 24, and 32 across the SuiteSparse test matrices]

The preconditioner setup needs up to 0.01 s on a P100 GPU: irrelevant for the overall iterative solver execution time.
Factorization-based block-Jacobi

Preconditioner setup:
1. Extract the diagonal block from the sparse data structure
2. Generate the factorization and the pivoting
3. Insert the factors as a block into the preconditioner matrix and store the pivoting (ipiv)

Preconditioner application:
- Apply trsv using the pivoting information to turn the right-hand-side sub-vector into the solution of the diagonal-block linear problem
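The setup and application steps above can be sketched for one diagonal block. This is a hedged, plain-Python illustration, not the batched GPU kernels: LU with partial pivoting produces the factors and `ipiv` in the setup, and the application permutes the right-hand side and runs the two triangular solves (the trsv step).

```python
# Sketch of the factorization-based variant on one dense block:
# setup = LU with partial pivoting, application = pivoted trsv pair.

def lu_factor(A):
    """LU with partial pivoting; returns the packed factors and ipiv."""
    n = len(A)
    LU = [row[:] for row in A]
    ipiv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        ipiv[k], ipiv[p] = ipiv[p], ipiv[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                 # multiplier, stored in place
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, ipiv

def lu_apply(LU, ipiv, r):
    """Preconditioner application: solve A z = r using the stored factors."""
    n = len(LU)
    z = [r[p] for p in ipiv]                     # apply pivoting to the rhs
    for i in range(n):                           # forward solve, unit lower part
        z[i] -= sum(LU[i][j] * z[j] for j in range(i))
    for i in reversed(range(n)):                 # backward solve
        z[i] = (z[i] - sum(LU[i][j] * z[j] for j in range(i + 1, n))) / LU[i][i]
    return z

LU, ipiv = lu_factor([[2.0, 1.0], [1.0, 1.0]])
print(lu_apply(LU, ipiv, [3.0, 2.0]))  # [1.0, 1.0]
```

Compared with the inversion-based variant, the setup is cheaper (no explicit inverse), but every application performs data-dependent triangular solves instead of a single SpMV, which is the trade-off quantified on the following slides.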
Factorization-based block-Jacobi
- Batched routines: small-size LU, Gauss-Huard, Gauss-Huard-T
- Gauss-Huard-T writes the factors in an access-friendly fashion
- Flexible size up to the limit of 32 × 32
- One warp per system, factorization in registers

[Figures: GFLOP/s over matrix size for small-size LU, Gauss-Huard, Gauss-Huard-T, and cuBLAS LU (dgetrf / dgetri) on an NVIDIA P100 GPU]
NVIDIA Pascal architecture (P100)

[Figure: time per matrix over batch size for Gauss-Jordan elimination (BGJE), MAGMA dgetrf-inv (BLU), Gauss-Huard (BGH), and Gauss-Huard + transposition (BGHT)]

- BGJE and BLU invert the diagonal blocks, whereas BGH / BGHT only factorize
- BGHT writes the factors in a non-coalescent fashion to enable a faster preconditioner application
Inversion-based vs. factorization-based block-Jacobi

Convergence of BiCGSTAB with block-Jacobi preconditioning for 300 matrices:

[Figure: histogram of the iteration-count difference for block sizes 4, 8, 16, 24, and 32; left side: BGH needs more iterations, so explicit inversion is better; right side: BGJE needs more iterations, so factorization is better]

NVIDIA P100 GPU
Block-Jacobi preconditioning: performance

[Figures: setup vs. solve time of BiCGSTAB with a Jacobi(24) preconditioner for BGJE and BGH on ten selected matrices; total runtime of BGJE, BGH, and BGHT across 300 test matrices]

- Almost no difference in the total execution time
- BGH is better if the setup time is significant relative to the iterative solver time
- BGJE is better for long solver runs

NVIDIA P100 GPU
Pitch from the sparse community:
- Small size (<30), somewhat flexible
- Add an inversion routine to Batched BLAS
- Go for Gauss-Jordan elimination instead of LU

http://icl.cs.utk.edu/magma/

This research is based on a cooperation between Hartwig Anzt, Jack Dongarra (University of Tennessee), Goran Flegar and Enrique Quintana-Orti (University of Jaume I), and partly funded by the Department of Energy.

http://www.icl.utk.edu/~hanzt/talks/batched_bjac.pdf