A Standard for Batching BLAS Operations

1 A Standard for Batching BLAS Operations. Jack Dongarra (University of Tennessee, Oak Ridge National Laboratory, University of Manchester). 5/8/16

2 API for Batching BLAS Operations. We are proposing, as a community standard, an API for Batched Basic Linear Algebra Operations. The focus is on multiple independent BLAS operations: think of many small matrices that are operated on in a single routine call. The goal is to be more efficient and portable on multi/manycore and GPU hardware systems. We can show 2x speedup and 3x better energy efficiency.

3 Definition. Think of making one call to a function and having that operation performed in parallel across the collection. For matrix multiply, each matrix in the collection is updated as C = C + A * B.
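The definition above can be read as a loop whose iterations are independent. The following is a minimal reference sketch of that meaning only, assuming column-major storage and a fixed size for all matrices; the function name batched_gemm_accumulate is hypothetical and not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics only: C[b] = C[b] + A[b] * B[b] for every b.
 * Column-major storage; A[b] is m x k, B[b] is k x n, C[b] is m x n.
 * The name batched_gemm_accumulate is hypothetical. */
void batched_gemm_accumulate(int m, int n, int k,
                             double **A, double **B, double **C,
                             int batch_count)
{
    assert(m >= 0 && n >= 0 && k >= 0 && batch_count >= 0);
    /* Iterations over b do not depend on each other, so an optimized
     * implementation is free to run them in parallel. */
    for (int b = 0; b < batch_count; ++b) {
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                double sum = 0.0;
                for (int p = 0; p < k; ++p)
                    sum += A[b][i + (size_t)p * m] * B[b][p + (size_t)j * k];
                C[b][i + (size_t)j * m] += sum;
            }
        }
    }
}
```

A tuned library would replace the outer loop with its own scheduling; the sketch only pins down what result the single batched call is expected to produce.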

4 Definition. The matrices in the collection can be of different sizes. Internally, the Batched BLAS (BBLAS) can then schedule the operations to optimize use of the hardware. Nvidia is already doing a good job of this and has a subset of the BLAS implemented. Intel has an implementation of batched DGEMM only.

5 Dense Linear Algebra
Software/algorithms follow hardware evolution in time:
- LINPACK (70s): vector operations. Relies on Level-1 BLAS operations.
- LAPACK (80s): blocking, cache friendly. Relies on Level-3 BLAS operations.
- ScaLAPACK (90s): distributed memory. Relies on PBLAS and message passing.
- PLASMA (00s): new algorithms (many-core friendly). Relies on a DAG/scheduler, block data layout, and some extra kernels (BLAS on tiles + DAG scheduling).
- MAGMA: hybrid algorithms (heterogeneity friendly). BLAS tasking + (CPU / GPU / Xeon Phi) hybrid scheduling.
Examples: classic fork-join BLAS model
    for (int i = 0; ) {
        // factor a panel (sequential LAPACK)
        DGETRF2( );
        // backward swap (sequential LAPACK)
        DLASWP( );
        // forward swap (sequential LAPACK)
        DLASWP( );
        // triangular solve (parallel BLAS)
        DTRSM( );
        // matrix multiply (parallel BLAS)
        DGEMM( );
    }
These models work well for large DLA problems.

6 Examples
Need for batched routines in numerical LA, e.g., sparse direct multifrontal methods, preconditioners for sparse iterative methods, tiled algorithms in dense linear algebra, etc. [collaboration with Tim Davis et al., Texas A&M University]
Sparse / dense matrix system, DAG-based factorization. To capture the main LA patterns needed in a numerical library for batched LA:
- LU, QR, or Cholesky on small diagonal matrices
- TRSMs, QRs, or LUs
- TRSMs, TRMMs
- Updates (Schur complement): GEMMs, SYRKs, TRMMs
Example matrix from quantum chromodynamics (nz = 6716), reordered and ready for a sparse direct multifrontal solver. Diagonal blocks can be handled in parallel through batched LU, QR, or Cholesky factorizations.
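To illustrate how independent diagonal blocks of different sizes could be factored in one batched call, here is a minimal sketch. It uses LU without pivoting purely to keep the example short (an assumption that only suits well-conditioned blocks such as diagonally dominant ones; a production routine would pivot). The names lu_nopiv and batched_lu_nopiv and the per-block info convention are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU without pivoting of one n x n column-major block.
 * On return the strict lower triangle holds L (unit diagonal implied)
 * and the upper triangle holds U.
 * Returns 0 on success, or j + 1 if pivot j is exactly zero. */
int lu_nopiv(int n, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        double pivot = A[j + (size_t)j * lda];
        if (pivot == 0.0)
            return j + 1;
        for (int i = j + 1; i < n; ++i)
            A[i + (size_t)j * lda] /= pivot;          /* column of L */
        for (int c = j + 1; c < n; ++c)               /* Schur complement update */
            for (int i = j + 1; i < n; ++i)
                A[i + (size_t)c * lda] -=
                    A[i + (size_t)j * lda] * A[j + (size_t)c * lda];
    }
    return 0;
}

/* Factor every block of a variable-size batch; blocks are independent,
 * so the loop over b could be parallelized. info[b] reports block b. */
void batched_lu_nopiv(const int *n, double **blocks, const int *lda,
                      int batch_count, int *info)
{
    assert(batch_count >= 0);
    for (int b = 0; b < batch_count; ++b)
        info[b] = lu_nopiv(n[b], blocks[b], lda[b]);
}
```

Reporting a status per block lets one singular block be flagged without discarding the rest of the batch.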

7 Machine Learning
Need for batched and/or tensor contraction routines in machine learning, e.g., Convolutional Neural Networks (CNNs) used in computer vision. The key computation is the convolution of the filters F_n (feature detectors) with the input image D (data), producing the output O.
Convolution operation: for every filter F_n and every channel, the computation of every pixel value O_{n,k} is a tensor contraction:
$$O_{n,k} = \sum_{i} D_{k,i} \, F_{n,i}$$
Plenty of parallelism; small operations that must be batched. With a data reshape, the computation can be transformed into a batched GEMM (and hence efficiently implemented, among other approaches).
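The contraction over i is a matrix product once the data are laid out as matrices: O = F * D^T. The sketch below makes that reading concrete under stated assumptions: row-major storage, F stored as num_filters x patch_len, and D already reshaped to num_pixels x patch_len. All names are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* O[n][k] = sum_i D[k][i] * F[n][i], i.e. O = F * D^T.
 * Row-major storage (an assumption of this sketch):
 *   F is num_filters x patch_len,
 *   D is num_pixels  x patch_len (image already reshaped into patches),
 *   O is num_filters x num_pixels. */
void contract_filters(int num_filters, int num_pixels, int patch_len,
                      const double *F, const double *D, double *O)
{
    assert(num_filters >= 0 && num_pixels >= 0 && patch_len >= 0);
    for (int n = 0; n < num_filters; ++n) {
        for (int k = 0; k < num_pixels; ++k) {
            double sum = 0.0;
            for (int i = 0; i < patch_len; ++i)
                sum += D[(size_t)k * patch_len + i] * F[(size_t)n * patch_len + i];
            O[(size_t)n * num_pixels + k] = sum;
        }
    }
}
```

Calling this once per channel or per image gives many small, independent products of the same shape, which is the pattern a batched GEMM is meant to serve.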

8 Specification: Level 3 BLAS DGEMM, C = αA*B + βC. Calling sequence: dgemm ( char * transa, char * transb, integer * m, integer * n, integer * k, double * alpha, double * A, integer * lda, double * B, integer * ldb, double * beta, double * C, integer * ldc );

9 Specification and proposed specification, C = αA*B + βC.
Level 3 BLAS DGEMM calling sequence: dgemm ( char * transa, char * transb, integer * m, integer * n, integer * k, double * alpha, double * A, integer * lda, double * B, integer * ldb, double * beta, double * C, integer * ldc );
Batched Level 3 BLAS DGEMM calling sequence: dgemm_batch ( enum * transa, enum * transb, integer * m, integer * n, integer * k, double * alpha, double ** arraya, integer * lda, double ** arrayb, integer * ldb, double * beta, double ** arrayc, integer * ldc, integer batch_count, enum batch_opts, integer * info );
The matrix arguments arraya, arrayb, and arrayc are arrays of pointers, one per matrix in the batch.
batch_count: number of matrices in the batch.
batch_opts: enumerated value specifying the style of the batched computation. Valid values are BATCH_FIXED or BATCH_VARIABLE, specifying fixed or variable size matrices, respectively.
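A minimal reference sketch of what a call with this shape could compute is given below. Only BATCH_FIXED and BATCH_VARIABLE come from the slide. The function name sketch_dgemm_batch, the sketch_trans enum, the rule that BATCH_FIXED reads a single shared entry from each parameter array, and the info convention are assumptions made for illustration. The sketch takes an inner dimension k, which any GEMM needs.

```c
#include <assert.h>
#include <stddef.h>

enum batch_opts { BATCH_FIXED, BATCH_VARIABLE };      /* named on the slide */
enum sketch_trans { SKETCH_NO_TRANS, SKETCH_TRANS };  /* hypothetical */

/* Element (i, j) of op(X); X is column-major with leading dimension ld. */
static double op_elem(enum sketch_trans t, const double *X, int ld, int i, int j)
{
    return t == SKETCH_NO_TRANS ? X[i + (size_t)j * ld] : X[j + (size_t)i * ld];
}

/* C[b] = alpha * op(A[b]) * op(B[b]) + beta * C[b], b = 0 .. batch_count-1.
 * Assumed: BATCH_FIXED uses entry 0 of every parameter array for the whole
 * batch; BATCH_VARIABLE uses entry b for matrix b.
 * Assumed: info[0] is 0 on success and -1 for a negative batch_count. */
void sketch_dgemm_batch(const enum sketch_trans *transa,
                        const enum sketch_trans *transb,
                        const int *m, const int *n, const int *k,
                        const double *alpha, double **arraya, const int *lda,
                        double **arrayb, const int *ldb,
                        const double *beta, double **arrayc, const int *ldc,
                        int batch_count, enum batch_opts opts, int *info)
{
    assert(info != NULL);
    if (batch_count < 0) { info[0] = -1; return; }
    for (int b = 0; b < batch_count; ++b) {
        int s = (opts == BATCH_VARIABLE) ? b : 0;  /* parameter index */
        for (int j = 0; j < n[s]; ++j) {
            for (int i = 0; i < m[s]; ++i) {
                double sum = 0.0;
                for (int p = 0; p < k[s]; ++p)
                    sum += op_elem(transa[s], arraya[b], lda[s], i, p)
                         * op_elem(transb[s], arrayb[b], ldb[s], p, j);
                double *c = &arrayc[b][i + (size_t)j * ldc[s]];
                *c = alpha[s] * sum + beta[s] * (*c);
            }
        }
    }
    info[0] = 0;
}
```

Keeping the scalar parameters as arrays in both modes means one signature covers fixed and variable batches; the option value only changes how those arrays are indexed.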

10 Batched Level 3 BLAS DGEMM Example. [Plot: Gflop/s versus matrix size M = N, K = 32. DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: CPU nonbatched (loop around each DGEMM using the 16 cores).]

11 Batched Level 3 BLAS DGEMM Example. [Plot: Gflop/s versus matrix size M = N, K = 32. DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: CPU batched (each core does a DGEMM); CPU nonbatched (loop around each DGEMM using the 16 cores).]

12 Batched Level 3 BLAS DGEMM Example. [Plot: Gflop/s versus matrix size M = N, K = 32. DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: GPU nonbatched.]

13 Batched Level 3 BLAS DGEMM Example. [Plot: Gflop/s versus matrix size M = N, K = 32. DGEMM (NN), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: GPU batched; GPU nonbatched.]

14 Batched Level 2 BLAS DGEMV Example. [Plot: Gflop/s versus matrix size M = N. DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: CPU nonbatched.]

15 Batched Level 2 BLAS DGEMV Example. [Plot: Gflop/s versus matrix size M = N. DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: CPU batched; CPU nonbatched.]

16 Batched Level 2 BLAS DGEMV Example. [Plot: Gflop/s versus matrix size M = N. DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: GPU nonbatched.]

17 Batched Level 2 BLAS DGEMV Example. [Plot: Gflop/s versus matrix size M = N. DGEMV (N), batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: GPU batched; GPU nonbatched.]

18 Batched Level 1 BLAS DAXPY Example. [Plot: Gflop/s versus size M = N. DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: CPU nonbatched.]

19 Batched Level 1 BLAS DAXPY Example. [Plot: Gflop/s versus size M = N. DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: CPU batched; CPU nonbatched.]

20 Batched Level 1 BLAS DAXPY Example. [Plot: Gflop/s versus size M = N. DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curve: GPU nonbatched.]

21 Batched Level 1 BLAS DAXPY Example. [Plot: Gflop/s versus size M = N. DAXPY, batch_count = 500, 16-core Intel Xeon E5-2670, 1 Tesla K40c GPU. Curves: GPU batched; GPU nonbatched.]

22 Batched BLAS Performance. Batched Level 3 BLAS: DGEMM example. Batched Level 2 BLAS: DGEMV example. Batched Level 1 BLAS: DAXPY example. Versions designed and autotuned for various sizes and shapes. Reference: J. Dongarra, I. Duff, M. Gates, A. Haidar, S. Hammarling, N. Higham, J. Hogg, P. Lara, M. Zounon, S. Relton, and S. Tomov, A Proposed API for Batched Basic Linear Algebra Subprograms, UTK Computer Science Technical Report (available at ...), March ...
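For readers reproducing Gflop/s figures of this kind, a common convention is to count one multiply and one add per inner-product term and divide the total over the batch by the elapsed time. The helper below is a sketch of that arithmetic only; that the plots in this deck use exactly this convention is an assumption, and the function names are hypothetical.

```c
#include <assert.h>

/* Conventional flop counts: one multiply plus one add per term. */
double gemm_flops(int m, int n, int k) { return 2.0 * m * n * k; }
double gemv_flops(int m, int n)        { return 2.0 * m * n; }
double axpy_flops(int n)               { return 2.0 * n; }

/* Aggregate rate, in Gflop/s, for a batch of identical operations
 * that were timed together and took `seconds` in total. */
double batched_gflops(double flops_per_op, int batch_count, double seconds)
{
    assert(batch_count >= 0 && seconds > 0.0);
    return flops_per_op * batch_count / seconds / 1e9;
}
```

For example, 500 DGEMMs with M = N = K = 32 amount to 32,768,000 flops in total, so finishing them in one millisecond corresponds to about 32.8 Gflop/s.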

23 Batched BLAS Performance. Batched Level 3 BLAS: DGEMM example. Versions designed and autotuned for various sizes and shapes.

24 A Proposed API for Batched LAPACK. Batched LU example.
batch_count: number of matrices in the batch.
batch_opts: enumerated value specifying the style of the batched computation. Valid values are BATCH_FIXED or BATCH_VARIABLE, specifying fixed or variable size matrices, respectively.

25 Community Activity. We want to engage the community in developing this proposed standard. The specification is open for discussion, and we welcome input. Workshop on Batched, Reproducible, and Reduced Precision BLAS, May 18 & 19, University of Tennessee. See ...

26 Questions? What does the word Salishan mean?

27 Questions? What does the word Salishan mean? A group of languages of the Pacific Northwest. The Salishan language family consists of twenty-three languages.

28 Collaborators / Software / Support
- PLASMA
- MAGMA
- PaRSEC (Parallel Runtime Scheduling and Execution Control)
- Collaborating partners: University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver
