Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

1 Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments
Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra
SIAM Conference on Computational Science and Engineering 2017
Monday, February 27th, 2017

2 Outline
1 At a Glance
2 Introduction
3 MAGMA Batched
4 Batched BLAS Design
5 Batched LAPACK Design
6 Batched Computation on Tiny Sizes
7 Conclusion
A. Abdelfattah 2/30

3 Section 1: At a Glance

4 What this talk is all about
- A design framework to perform batched computation using GPUs
- Part of the MAGMA library
- A foundation of batched linear algebra operations (BLAS and LAPACK) on small matrices of the same size or different sizes

5 Workshop on Batched, Reproducible, and Reduced Precision BLAS
- A community effort to establish standards for batch BLAS
- Two workshops so far: May 2016 and Feb. 2017
- We are always looking for feedback

6 Section 2: Introduction

7 Why do we care?
This is an application-driven need.
- Astrophysics: batched LU
- Tensor contractions: batched GEMMs
- Large scale sparse direct solvers: batched LU, QR, and Cholesky
- CA Krylov solvers: batched GEMM, GEMV and others
- Anomaly detection in hyperspectral images: batched Cholesky
Problems are generally of different sizes.

8 Can we use existing GPU numerical software?
Mostly no.
- Relatively small caches
- Most existing software focuses on big matrices
- Using streams does not help
- We need to batch the execution in one computational kernel (batched kernel)

9 Section 3: MAGMA Batched

10 The MAGMA Batched Library
- A subpackage of MAGMA
- Supports fixed and variable size problems
- A unified design structure
- Various BLAS and LAPACK functionality
BLAS (columns: fixed size, var. size): Level-3 (all routines); Level-2 (GEMV, HEMV)
LAPACK, one-sided factorization: fixed size: LU/QR/Chol.; var. size: Chol.

11 Design Assumptions
- Column major layout
- Same problem settings (e.g. transposition, upper/lower triangular matrix, etc.)
- Generally, matrices are not stored consecutively
A sample dgemm_vbatched interface:
    void magmablas_dgemm_vbatched(
        enum transa, enum transb,
        int* m, int* n, int* k,
        double alpha,
        double ** Aarray, int* lda,
        double ** Barray, int* ldb,
        double beta,
        double ** Carray, int* ldc,
        int batchcount, magma_queue_t queue );
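To make the semantics of such an interface concrete, here is a minimal CPU reference sketch. It is not MAGMA code: the name ref_dgemm_vbatched is hypothetical, transposition is left out, and the queue argument is dropped. It only illustrates how per-problem sizes, leading dimensions, and pointer arrays describe a batch of independent column-major GEMMs.

```cpp
#include <cassert>

// Hypothetical CPU reference (not MAGMA code): for each problem p in the batch,
// C_p = alpha * A_p * B_p + beta * C_p, column-major, no transposition.
void ref_dgemm_vbatched(const int* m, const int* n, const int* k,
                        double alpha,
                        const double* const* Aarray, const int* lda,
                        const double* const* Barray, const int* ldb,
                        double beta,
                        double* const* Carray, const int* ldc,
                        int batchCount)
{
    assert(batchCount >= 0);
    for (int p = 0; p < batchCount; ++p) {
        // Each problem has its own pointers, sizes, and leading dimensions,
        // so the matrices need not be stored consecutively.
        const double* A = Aarray[p];
        const double* B = Barray[p];
        double* C = Carray[p];
        for (int j = 0; j < n[p]; ++j) {
            for (int i = 0; i < m[p]; ++i) {
                double acc = 0.0;
                for (int l = 0; l < k[p]; ++l)
                    acc += A[i + l * lda[p]] * B[l + j * ldb[p]];
                C[i + j * ldc[p]] = alpha * acc + beta * C[i + j * ldc[p]];
            }
        }
    }
}
```

On a GPU the outer loop over p is what a batched kernel would spread across subgrids; the scalar alpha and beta are shared by the whole batch, matching the "same problem settings" assumption.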

12 Kernel Design Framework
- A kernel is a 1D array of 2D subgrids
- Each subgrid has a unique batchid
- Preprocessing layer for variable size batched kernel
[Figure labels: vbatched kernel, subgrid, thread block, non-batched device code, preprocessing]

13 Working with Different Sizes
A preprocessing layer performs adaptive subgrid truncation (ASGT) and argument checking.
- ASGT: kernels use self-adaptive subgrids
  - Each subgrid is configured to accommodate any matrix in the input batch
  - Each subgrid truncates itself to exactly fit the assigned problem
- Argument checking
  - Separate GPU kernels
  - Can be skipped through advanced APIs (similar to MKL direct calls)
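The truncation idea can be emulated on the CPU. The sketch below is an illustration under assumptions, not the MAGMA implementation: Subgrid, configure_subgrid, active_blocks, and the square tile size are all hypothetical names and choices. Every subgrid is sized for the largest matrix in the batch, and each thread block whose tile origin falls outside its assigned matrix exits immediately.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical subgrid shape, counted in thread blocks.
struct Subgrid { int blocks_x; int blocks_y; };

// Configure one subgrid shape that can cover the largest matrix in the batch.
Subgrid configure_subgrid(const std::vector<int>& m, const std::vector<int>& n, int tile)
{
    assert(tile > 0 && m.size() == n.size());
    if (m.empty()) return Subgrid{0, 0};
    const int max_m = *std::max_element(m.begin(), m.end());
    const int max_n = *std::max_element(n.begin(), n.end());
    return Subgrid{(max_m + tile - 1) / tile, (max_n + tile - 1) / tile};
}

// Emulate the early-exit test each thread block would run on the GPU:
// a block whose tile starts outside the m-by-n matrix does no work.
// Returns how many blocks of the subgrid remain active for this problem.
int active_blocks(const Subgrid& g, int m, int n, int tile)
{
    int active = 0;
    for (int by = 0; by < g.blocks_y; ++by) {
        for (int bx = 0; bx < g.blocks_x; ++bx) {
            if (bx * tile >= m || by * tile >= n) continue;  // truncated block
            ++active;
        }
    }
    return active;
}
```

For a batch with sizes 100x64, 33x33, and 8x8 and a tile of 32, the subgrid has 4x2 blocks, of which 8, 4, and 1 stay active for the three problems respectively.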

14 Section 4: Batched BLAS Design

15 A GEMM-concentric Design
- Every BLAS kernel inherits/uses GEMM
  - Except Hermitian matrix multiplication
- Any improvement in GEMM propagates to other routines
[Figure labels: GEMM code base: GEMM (generic), internal routine (generic), TRSM (small size), TRSM (generic), TRMM (small size), TRMM (generic), HERK/SYRK (generic), HER2K/SYR2K (generic); similar code base: HEMM/SYMM (generic)]

16 Design of the GEMM kernel
- Every thread block computes an output submatrix
- Square/rectangular recursive blocking
- Tuning is crucial
  - Performance is sensitive to tuning parameters
  - Tune for realistic cases found in LAPACK
  - One parameterized code base, 6400 candidates
  - Ended up with 34 winning kernels
    /* some code */
    case DGEMM_NT: // B is transposed
    {
        if(k < 128){
            gemm_kernel_nt<double, version(nt,160)>(...);
        }else{
            if(m < 256){
                gemm_kernel_nt<double, version(nt,134)>(...);
            }else{
                gemm_kernel_nt<double, version(nt,190)>(...);
            }
        }
    }
    break;
    /* some code */
[Figure labels: matrices A (M x K), B (K x N), C (M x N); tile sizes BLK_M, BLK_N, BLK_K]
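The combination of one parameterized code base and a size-based dispatch can be sketched on the CPU as follows. This is an assumption-laden illustration rather than the MAGMA kernel: gemm_nn_tiled and gemm_nn are hypothetical names, the tile shapes are invented, only the non-transposed case C = A * B is covered, and the thresholds 128 and 256 merely echo the dispatch shown on the slide.

```cpp
#include <algorithm>
#include <cassert>

// One parameterized code base: the tile sizes are compile-time parameters,
// so each instantiation is a distinct tuning candidate. Computes C = A * B
// for column-major matrices; each (i0, j0) tile plays the role of the
// output submatrix owned by one thread block.
template <int BLK_M, int BLK_N, int BLK_K>
void gemm_nn_tiled(int M, int N, int K,
                   const double* A, int lda,
                   const double* B, int ldb,
                   double* C, int ldc)
{
    assert(M >= 0 && N >= 0 && K >= 0);
    for (int j0 = 0; j0 < N; j0 += BLK_N) {
        for (int i0 = 0; i0 < M; i0 += BLK_M) {
            const int mb = std::min(BLK_M, M - i0);
            const int nb = std::min(BLK_N, N - j0);
            double acc[BLK_M][BLK_N] = {};  // tile accumulator, zero-initialized
            for (int k0 = 0; k0 < K; k0 += BLK_K) {
                const int kb = std::min(BLK_K, K - k0);
                for (int j = 0; j < nb; ++j)
                    for (int l = 0; l < kb; ++l)
                        for (int i = 0; i < mb; ++i)
                            acc[i][j] += A[(i0 + i) + (k0 + l) * lda]
                                       * B[(k0 + l) + (j0 + j) * ldb];
            }
            for (int j = 0; j < nb; ++j)
                for (int i = 0; i < mb; ++i)
                    C[(i0 + i) + (j0 + j) * ldc] = acc[i][j];
        }
    }
}

// Size-based dispatch among instantiations; the tile shapes are made up.
void gemm_nn(int M, int N, int K,
             const double* A, int lda, const double* B, int ldb,
             double* C, int ldc)
{
    if (K < 128)      gemm_nn_tiled<16, 16, 8>(M, N, K, A, lda, B, ldb, C, ldc);
    else if (M < 256) gemm_nn_tiled<32, 16, 16>(M, N, K, A, lda, B, ldb, C, ldc);
    else              gemm_nn_tiled<32, 32, 16>(M, N, K, A, lda, B, ldb, C, ldc);
}
```

Because the tile sizes are template parameters, an autotuner could instantiate many candidates from the same source and keep only the winners behind the dispatch.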

17 DGEMM Performance
2k matrices, Pascal P100 GPU, NVIDIA CUDA 8.0
[Plots: Fixed Size, Variable Size]

18 DSYRK/DTRSM Performance
2k matrices, Pascal P100 GPU, NVIDIA CUDA 8.0
[Plots: Fixed Size, Variable Size]

19 Section 5: Batched LAPACK Design

20 Design of Batched LAPACK
- Batched LAPACK builds on top of batched BLAS
  - Extract performance from level-3 batched BLAS kernels
- We started with batched one-sided factorizations
- A new component is needed: batched panel factorizations
  - Non-batched MAGMA is hybrid: it uses the CPU for this task
  - We cannot do the same on batched workloads
    - Trailing matrix updates cannot hide the CPU-GPU communication
    - Communication of non-consecutive data
- A full GPU solution is developed

21 Cholesky Panel Factorization
- Left-looking variant with recursive blocking
- Kernel fusion: the panel (C) is reused in shared memory
- A is read in registers, B is transposed in shared memory
- Prefetching helps hide the memory latency
[Figure 8: left-looking Cholesky factorization; labels: A, C, m-i, ib, lb]
Algorithm 3 The fused potf2 kernel.
    for i = 0, ib to m = nb do
        rak ← A(i:m,0:lb); rc ← 0
        for k = 0, lb to m - i do
            rakk ← rak
            sb ← rak(i:lb,k:k+lb)       (inplace transpose)
            barrier()
            ra1 ← A(i:m,k+lb:k+2lb)     (prefetching)
            rc ← rc + rakk × sb         (multiplying)
            barrier()
        end for
        sc ← ra1 - rc
        factorize sc
    end for
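As a plain CPU reference for the left-looking blocked scheme, the sketch below factorizes one matrix in place. It does not reproduce the fused GPU kernel (no shared memory, registers, or prefetching); the function name cholesky_left_looking and the 1-based failure code are assumptions chosen for this illustration. Running it once per matrix in a batch gives the result a batched routine would be expected to produce.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Left-looking blocked Cholesky, A = L * L^T, on the lower triangle of a
// column-major n-by-n matrix, in place. ib is the panel width.
// Returns 0 on success, or j+1 if the pivot in column j is not positive.
int cholesky_left_looking(int n, double* A, int lda, int ib)
{
    assert(n >= 0 && ib > 0 && lda >= n);
    for (int j0 = 0; j0 < n; j0 += ib) {
        const int jb = std::min(ib, n - j0);

        // 1) Left-looking update: subtract the contribution of all columns
        //    already factorized (to the left of the panel) from the panel.
        for (int j = j0; j < j0 + jb; ++j) {
            for (int k = 0; k < j0; ++k) {
                const double ljk = A[j + k * lda];
                for (int i = j; i < n; ++i)
                    A[i + j * lda] -= A[i + k * lda] * ljk;
            }
        }

        // 2) Factorize the updated panel column by column.
        for (int j = j0; j < j0 + jb; ++j) {
            double d = A[j + j * lda];
            if (d <= 0.0) return j + 1;
            d = std::sqrt(d);
            A[j + j * lda] = d;
            for (int i = j + 1; i < n; ++i)
                A[i + j * lda] /= d;
            // Update only the remaining columns inside the current panel.
            for (int c = j + 1; c < j0 + jb; ++c) {
                const double lcj = A[c + j * lda];
                for (int i = c; i < n; ++i)
                    A[i + c * lda] -= A[i + j * lda] * lcj;
            }
        }
    }
    return 0;
}
```

The defining property of the left-looking variant is visible in step 1: a panel reads everything to its left but never writes to the trailing matrix on its right.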

22 Performance of Batched Cholesky Factorization
2k matrices, Pascal P100 GPU, NVIDIA CUDA

23 LU Panel Factorization
- LAPACK equivalent: DGETF2
- Swapping in LU factorization is very expensive
  - One row at a time
- Parallel swapping allows concurrent row interchanges
DGETRF on 2000 matrices of size using a K40c GPU
Figure 2. Execution trace of the batched LU factorization using either classical swap (top)
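One way to make row interchanges independent of each other is to fold the sequential pivot swaps into a single permutation and then gather every row in one pass. The sketch below shows that idea on the CPU; it is not taken from MAGMA, the function names are hypothetical, and the pivot indices are assumed to be 0-based (LAPACK's ipiv is 1-based).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Fold sequential interchanges "swap row i with row ipiv[i]" (0-based)
// into one permutation: row i of the result comes from row perm[i].
std::vector<int> pivots_to_permutation(const std::vector<int>& ipiv, int m)
{
    assert(static_cast<int>(ipiv.size()) <= m);
    std::vector<int> perm(m);
    std::iota(perm.begin(), perm.end(), 0);
    for (std::size_t i = 0; i < ipiv.size(); ++i) {
        assert(ipiv[i] >= 0 && ipiv[i] < m);
        std::swap(perm[i], perm[ipiv[i]]);
    }
    return perm;
}

// Apply the permutation out of place on a column-major m-by-n matrix.
// Every (i, j) assignment is independent of the others, so on a GPU each
// row (or element) could be handled by a different thread.
void apply_row_permutation(int m, int n, const double* A, int lda,
                           double* out, int ldo, const std::vector<int>& perm)
{
    assert(static_cast<int>(perm.size()) == m);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            out[i + j * ldo] = A[perm[i] + j * lda];
}
```

The classical approach applies the swaps one row at a time in pivot order; the gather form trades an extra buffer for the removal of that ordering dependency.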

24 Performance of Batched LU Factorization
2k matrices, Pascal P100 GPU, NVIDIA CUDA

25 Section 6: Batched Computation on Tiny Sizes

26 What if the matrices are "embarrassingly small"?
For example, less than a warp size (32).
- LAPACK-style blocking is not a wise choice
  - We cannot afford any redundant memory traffic
- We must preserve minimal memory traffic
  - Each problem is entirely solved by one thread block
  - Data is kept in registers/shared memory
- Instruction mix is also important
  - Minimize the amount of integer instructions by fully unrolling the code for every size
  - Add the size as a tuning parameter
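Making the size a compile-time parameter is one common way to let the compiler unroll every loop. The sketch below illustrates that with a tiny square GEMM on the CPU; gemm_tiny and gemm_tiny_dispatch are hypothetical names, the local arrays only stand in for registers, and the supported sizes 1 to 4 are an arbitrary choice for the example.

```cpp
#include <cassert>

// N is a template parameter, so all loop bounds are compile-time constants
// and the compiler is free to unroll them completely. Computes C = A * B
// for column-major N-by-N matrices stored contiguously.
template <int N>
void gemm_tiny(const double* A, const double* B, double* C)
{
    double rA[N][N], rB[N][N];  // local copies standing in for registers
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            rA[i][j] = A[i + j * N];
            rB[i][j] = B[i + j * N];
        }
    // Each input is read from memory exactly once, each output written once.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            double acc = 0.0;
            for (int l = 0; l < N; ++l)
                acc += rA[i][l] * rB[l][j];
            C[i + j * N] = acc;
        }
}

// Runtime size selects a fully specialized instance. Returns false when the
// size has no specialized instance in this sketch.
bool gemm_tiny_dispatch(int n, const double* A, const double* B, double* C)
{
    assert(A != nullptr && B != nullptr && C != nullptr);
    switch (n) {
        case 1: gemm_tiny<1>(A, B, C); return true;
        case 2: gemm_tiny<2>(A, B, C); return true;
        case 3: gemm_tiny<3>(A, B, C); return true;
        case 4: gemm_tiny<4>(A, B, C); return true;
        default: return false;
    }
}
```

Treating the size as a template argument also makes it a natural tuning parameter: each size gets its own instance that can be benchmarked and specialized separately.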

27 Performance for tiny matrices
100k matrices, Pascal P100 GPU, NVIDIA CUDA 8.0
[Plots: DGEMM, DPOTRF, DGETRF]
M matrices, Pascal P100 GPU, NVIDIA CUDA

28 Section 7: Conclusion

29 Conclusion and Future Work
To summarize:
- MAGMA provides many BLAS/LAPACK routines for batched workloads
- High performance through careful design and tuning
- Special approach for tiny sizes
Future directions:
- Performance improvements in one-sided factorization
- Autotuning framework for batched routines
- Other LAPACK functionalities

30 Thank You!
