GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

1 GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU
Steve Rennich, Darko Stosic, Tim Davis
April 6, 2016
April 4-7, 2016, Silicon Valley

2 OBJECTIVE
Direct sparse methods are among the most widely used in science and engineering.
GPU acceleration is challenging due to irregularity in operations and data access.
Investigate methods for GPU acceleration of sparse Cholesky factorization.
Implement within CHOLMOD.

3 AGENDA
Sparse Cholesky
Root algorithm
Subtree algorithm
Custom batched BLAS/Lapack
Multi-GPU
Performance

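4 DENSE CHOLESKY FACTORIZATION
Dense block Cholesky (diagram labels: supernodes, compressed column):
$$
\begin{bmatrix} A_{11} & A_{21}^{T} \\ A_{21} & A_{22} \end{bmatrix}
=
\begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & A_{22}^{*} \end{bmatrix}
\begin{bmatrix} L_{11}^{T} & L_{21}^{T} \\ 0 & I \end{bmatrix}
$$
POTRF (dense Cholesky): $L_{11} L_{11}^{T} = A_{11}$
TRSM (triangular solve): $L_{11} L_{21}^{T} = A_{21}^{T}$
GEMM (matrix multiplication): $A_{22}^{*} = A_{22} - L_{21} L_{21}^{T}$ (Schur complement)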
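The three steps on this slide can be traced on a small dense matrix. The following is a minimal pure-Python sketch, not CHOLMOD code: the function names `potrf`, `trsm`, `schur_update` and `block_cholesky` are illustrative stand-ins for the corresponding BLAS/LAPACK calls, and the matrix is assumed to be symmetric positive definite.

```python
import math

def potrf(a):
    """Dense Cholesky: return lower-triangular L with L L^T = a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def trsm(l11, a21):
    """Solve L21 L11^T = A21 for L21, one forward substitution per row."""
    n = len(l11)
    l21 = []
    for row in a21:
        x = [0.0] * n
        for j in range(n):
            s = row[j] - sum(x[k] * l11[j][k] for k in range(j))
            x[j] = s / l11[j][j]
        l21.append(x)
    return l21

def schur_update(a22, l21):
    """Schur complement A22* = A22 - L21 L21^T."""
    m = len(a22)
    return [[a22[i][j] - sum(p * q for p, q in zip(l21[i], l21[j]))
             for j in range(m)] for i in range(m)]

def block_cholesky(a, nb):
    """Factor SPD matrix a by splitting off its leading nb x nb block."""
    a11 = [row[:nb] for row in a[:nb]]
    a21 = [row[:nb] for row in a[nb:]]
    a22 = [row[nb:] for row in a[nb:]]
    l11 = potrf(a11)                        # POTRF
    l21 = trsm(l11, a21)                    # TRSM
    l22 = potrf(schur_update(a22, l21))     # GEMM/SYRK, then factor A22*
    n = len(a)
    top = [r + [0.0] * (n - nb) for r in l11]
    bottom = [r21 + r22 for r21, r22 in zip(l21, l22)]
    return top + bottom

A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]
L = block_cholesky(A, 1)
```

Splitting at `nb = 1` or `nb = 2` yields the same factor as calling `potrf` on the whole matrix; only the grouping of the arithmetic into POTRF, TRSM and update calls changes.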

5 SPARSE CHOLESKY ISSUES
Lots of small math
Irregular operations / access patterns
PCIe communication
Diagram labels: fill; POTRF, TRSM, SYRK, GEMM

6 ROOT ALGORITHM
Send appropriate BLAS to GPU (row/column threshold: ndrow >= 256, ndcol >= 32).
Assemble supernodes on GPU & CPU.
Hide PCIe communication.
Handles large matrices.
Figure: GFlop/s for CPU and CPU + GPU, SuiteSparse (CHOLMOD), Florida Sparse Matrix Collection; annotated 1.5x. Hardware: 2 x Xeon E v3 + K40 (max boost, ECC=off).
Diagram labels: supernode score, descendant supernodes, GPU, CPU.

7 ROOT ALGORITHM
Send appropriate BLAS to GPU (row/column threshold: ndrow >= 256, ndcol >= 32).
Assemble supernodes on GPU & CPU.
Hide PCIe communication.
Handles large matrices.
Figure: GFlop/s for CPU and CPU + GPU, SuiteSparse (CHOLMOD), Florida Sparse Matrix Collection; annotated "why not higher?" and "why so low?". Hardware: 2 x Xeon E v3 + K40 (max boost, ECC=off).
Diagram labels: supernode score, descendant supernodes, GPU, CPU.
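The row/column threshold quoted for the root algorithm can be read as a simple dispatch test on each descendant update. The sketch below assumes both conditions must hold for an update to go to the GPU (the slide lists the two bounds without stating how they combine); the function and constant names are hypothetical, not CHOLMOD identifiers.

```python
# Thresholds as quoted on the slide.
GPU_MIN_NDROW = 256
GPU_MIN_NDCOL = 32

def use_gpu_for_update(ndrow, ndcol):
    """Return True if a descendant update is large enough for the GPU.

    Assumption: both bounds must be met; smaller updates stay on the CPU.
    """
    return ndrow >= GPU_MIN_NDROW and ndcol >= GPU_MIN_NDCOL

def split_descendants(descendants):
    """Partition (ndrow, ndcol) pairs into GPU-bound and CPU-bound lists."""
    gpu, cpu = [], []
    for ndrow, ndcol in descendants:
        (gpu if use_gpu_for_update(ndrow, ndcol) else cpu).append((ndrow, ndcol))
    return gpu, cpu

gpu_work, cpu_work = split_descendants([(512, 64), (300, 16), (100, 40), (256, 32)])
```

Under this reading, many small descendant updates never reach the GPU, which is one way to interpret the "why so low?" annotation.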

8 SUBTREE ALGORITHM
Send entire subtrees to GPU.
Factorization performed entirely on GPU.
Minimizes PCIe communication.
Requires batched BLAS/Lapack (with variable m, n, k).
Previous method is used for root.
Diagram labels: level 2, level 1, level 0; subtree 1, subtree 2, subtree 3, subtree 4.

9 SUBTREE ALGORITHM
Send entire subtrees to GPU.
Factorization performed entirely on GPU.
Minimizes PCIe communication.
Requires batched BLAS/Lapack (with variable m, n, k).
Previous method is used for root.
Diagram labels: ROOT alg.; SUBTREE alg.; level 2, level 1, level 0; subtree 1, subtree 2, subtree 3, subtree 4.
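Supernodes on the same level of a subtree do not depend on each other, which is what makes batching possible. As an illustration only, the sketch below groups the nodes of an elimination tree by level, assuming level 0 means the leaves and a node sits one level above its highest child. The `parent` array convention (-1 for a root, `parent[i] > i` otherwise, which holds for elimination trees) and the function name are assumptions of this example.

```python
def group_by_level(parent):
    """Group tree nodes by height above the leaves.

    parent[i] is the parent of node i, or -1 for a root. Because
    parent[i] > i, every child is visited before its parent.
    """
    n = len(parent)
    level = [0] * n
    for node in range(n):
        p = parent[node]
        if p >= 0:
            level[p] = max(level[p], level[node] + 1)
    groups = [[] for _ in range(max(level) + 1)] if n else []
    for node, lv in enumerate(level):
        groups[lv].append(node)
    return groups

# Two small branches (0,1 -> 2 and 3,4 -> 5) joined at node 6.
levels = group_by_level([2, 2, 6, 5, 5, 6, -1])
```

Each inner list is a set of nodes that could be processed together, for example with one batched call per operation type.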

10 RESULTS - SUBTREE
1.38x average speedup vs. previous CPU+GPU.
2x average speedup vs. CPU.
Poorly performing matrices see the greatest speedup.
PCIe well avoided.
Figure: GFlop/s for CPU, CPU + GPU, and GPU Branches (Subtrees); CHOLMOD 4.4.3; 2x Xeon E v3 + K40 (max boost, ECC=off); Florida Sparse Matrix Collection.

11 CURRENT WORK
1. Releasable CUDA versions of batched BLAS/Lapack
2. Support multi-GPU for both SUBTREE and ROOT algs.
3. General implementation improvements
4. Release as merged with latest SuiteSparse library (SuiteSparse BETA)

12 BATCHED BLAS/LAPACK
For each level, for GEMM, SYRK, TRSM, POTRF, batch if:
GEMM, SYRK: m <= 128 & n <= 128 & k <= 128
POTRF, TRSM: m <= 64 & n <= 64
Stream remaining BLAS/LAPACK operations.
Require batches with variable-sized elements: m, n, k.
Irregular operations don't give large uniform batches.
Cannot afford to copy/pad.
Previous work used modified cuBLAS/cuSOLVER code.
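The batching rule above amounts to a size test per operation. A minimal sketch, with the limits taken from the slide and everything else (function names, the tuple layout of an operation) assumed for illustration:

```python
def should_batch(op, m, n, k=0):
    """Apply the slide's size limits to one BLAS/LAPACK operation."""
    if op in ("GEMM", "SYRK"):
        return m <= 128 and n <= 128 and k <= 128
    if op in ("POTRF", "TRSM"):
        return m <= 64 and n <= 64
    raise ValueError("unsupported operation: " + op)

def split_level(ops):
    """Split one level's operations into a batched list and a streamed list.

    Each operation is a hypothetical (name, m, n, k) tuple; sizes may
    differ from one element to the next, so no padding is needed.
    """
    batched, streamed = [], []
    for op, m, n, k in ops:
        (batched if should_batch(op, m, n, k) else streamed).append((op, m, n, k))
    return batched, streamed

batched, streamed = split_level([
    ("GEMM", 100, 40, 12),
    ("GEMM", 300, 40, 12),
    ("POTRF", 64, 64, 0),
    ("TRSM", 80, 20, 0),
])
```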

13 CUSTOM BATCHED BLAS/LAPACK
Written in CUDA; accepts lists of m, n, k.
Every BLAS/Lapack operation gets assigned to a threadblock: grid size = #batches, automatic scheduling.
All threadblocks are 16x16 = 256 threads.
If result matrix < 16x16: idle threads.
If result matrix > 16x16: tiled.
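The cost of a fixed 16x16 threadblock can be estimated with simple arithmetic. The sketch below is a simplified accounting model, not the CUDA kernels: it assumes one thread per result entry and that a block sweeps the result in 16x16 tiles, and it counts the thread slots left idle. The name `block_plan` is hypothetical.

```python
TILE = 16  # threadblock edge from the slide: 16 x 16 = 256 threads

def block_plan(m, n):
    """Return (tiles, idle_slots) for an m x n result matrix.

    tiles: number of 16x16 tiles one threadblock must sweep.
    idle_slots: thread slots with no result entry to compute,
    summed over all tiles (assumed model: one thread per entry).
    """
    tile_rows = -(-m // TILE)  # ceiling division
    tile_cols = -(-n // TILE)
    tiles = tile_rows * tile_cols
    idle_slots = tiles * TILE * TILE - m * n
    return tiles, idle_slots

small = block_plan(10, 12)   # fits in one tile, some threads idle
large = block_plan(40, 20)   # needs several tiles
```

In this model a 10x12 result keeps 120 of 256 threads busy, while a 40x20 result is covered by six tiles with idle slots only in the edge tiles.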

14 MULTI-GPU
Subtree:
1 subtree per GPU.
Automatically scaled: subtree size <= GPU memory (as large as possible).
Static load-balancing based on flops.
Root:
At least one supernode in Root.
OMP parallel loop over supernodes: nthreads(#gpus), ordered.
Spinwait on unfinished descendant supernodes.
Diagram labels: elimination tree, subtrees, root, supernodes, spin wait, synchronize; 4x GPU: GPU 1, GPU 2, GPU 3, GPU 4.
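The slide does not say how the static, flop-based load balancing is computed. One plausible scheme, shown here purely as an assumption, is a greedy "largest first" assignment: sort the work items by estimated flops and give each to the GPU with the smallest load so far. The function name and the flop estimates are illustrative.

```python
import heapq

def assign_by_flops(flops, num_gpus):
    """Greedy static assignment of work items to GPUs by flop count.

    flops[i] is the estimated flop count of item i. Returns one list of
    item indices per GPU. This is a generic heuristic, not necessarily
    the scheme used in the presented code.
    """
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (current load, gpu id)
    assignment = [[] for _ in range(num_gpus)]
    for item in sorted(range(len(flops)), key=lambda i: -flops[i]):
        load, gpu = heapq.heappop(heap)           # least-loaded GPU
        assignment[gpu].append(item)
        heapq.heappush(heap, (load + flops[item], gpu))
    return assignment

plan = assign_by_flops([8, 7, 6, 5, 4], 2)
```

Because the assignment is fixed before factorization starts, any error in the flop estimates shows up as one GPU waiting at the synchronize point before the root phase.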

15 MULTI-GPU
Figure: Serena.mtx on GPU 0 and GPU 1.

16 MULTI-GPU
Figure: Serena.mtx on GPU 0 and GPU 1; labels: subtree, synchronize, root.

17 CURRENT PERFORMANCE K40
1x K40 = 1.8x
2x K40 = 2.3x
4x K40 = 2.6x
Figure: GF/s for numerical factorization, SuiteSparse, for CPU, 1x K40, 2x K40, 4x K40; Florida Sparse Matrix Collection.
Hardware: x E GHz, 1x - 4x K40 (full boost, ECC=ON).

18 CURRENT PERFORMANCE K40 VS K80
Figure: numerical factorization GF/s (axis 0 to 2000) for CPU, 1x K40, 2x K40, 4x K40, 1x K80, 2x K80, 4x K80; Florida Sparse Matrix Collection.
Hardware: 2x E5-2698 v3; 1x - 4x K40 (full boost, ECC=ON); 1x - 4x K80 (full boost, pl=175, ECC=ON).

19 CURRENT PERFORMANCE
Figure: speedup (GPU/CPU) versus factor flops / nnz(L) for K40 and K80, with the 1x line marked; 103 SPD matrices from the Florida Sparse Matrix Collection.
Hardware: x E GHz + 1x K40 (full boost, ECC=ON) or 1x K80 (board, full boost, pl=175, ECC=ON).

20 FURTHER WORK
Further optimization of batched routines
Improved overlap/load-balancing for multi-GPU case
LU pivoting
Accelerating all other aspects of matrix solution

21 CONCLUSIONS
Sparse factorization can be well accelerated on GPUs.
Subtree algorithm / batched BLAS / careful implementation.
Plenty yet to be done.
SuiteSparse BETA

22 THANK YOU
Join the NVIDIA Developer Program at developer.nvidia.com/join
