GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES. Nikolay Markovskiy Peter Messmer


1 GREAT PERFORMANCE FOR TINY PROBLEMS: BATCHED PRODUCTS OF SMALL MATRICES Nikolay Markovskiy Peter Messmer

2 ABOUT CP2K
Atomistic and molecular simulations of solid state
From ab initio DFT and Hartree-Fock to classical Hamiltonians
Problem of SCF: O(N^3)
Linear Scaling SCF: reduce O(N^3) problem to O(N), discard distant interactions
Core calculation: iteration of distributed block sparse matrix-matrix products (up to 90%)
DBCSR: a Sparse Matrix Library

3 DBCSR: A SPARSE MATRIX LIBRARY
Distributed Blocked Compressed Sparse Row
Multiple software layers
Asynchronous MPI
Scheduler
Cache optimization
Stack generation
Scheduling and load balancing CPU and GPU
Diagram labels: MPI, Scheduler, Cluster, Node, Host driver, CUDA driver, Libsmm, Libcusmm

4 LIBCUSMM GPU Accelerated Small Matrix Multiplications Stack of equally sized small matrices + index

5 TINY MATRICES
C (m x n) = A (m x k) B (k x n)
m, n, k = 5, 13, 23, 26, 36
Arithmetic intensity: FLOP/Byte
Floating point operations required: 2mnk
Data to transfer: from 8(mk + kn) to 8(mk + kn + 2mn) bytes
K20X: 1300 GFLOPS in DP, 250 GB/s (ECC OFF)

6 ARITHMETIC INTENSITY EXAMPLE
m, n, k = (20,20,20)
Theoretical: 1.25 FLOP/Byte (2mnk FLOP over 8(mk + kn + 2mn) bytes)
K20X GPU: 5.2 FLOP/Byte (1300 / 250)
Realistic K20X bandwidth with ECC ON: ~180 GB/s
Not a compute bound task
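The intensity figures on slides 5 and 6 can be reproduced with a few lines of host-side C++. This is only a sketch of the arithmetic; the function names are hypothetical and not part of libcusmm.

```cpp
#include <cassert>

// FLOPs for one (m x k) times (k x n) product: one multiply and one add per term.
double flops(int m, int n, int k) { return 2.0 * m * n * k; }

// Bytes moved in double precision when only A and B are read: 8(mk + kn).
double bytes_min(int m, int n, int k) { return 8.0 * (m * k + k * n); }

// Bytes moved when C is also read and written back: 8(mk + kn + 2mn).
double bytes_max(int m, int n, int k) { return 8.0 * (m * k + k * n + 2 * m * n); }

// Arithmetic intensity in FLOP/Byte for a given traffic estimate.
double intensity(double flop, double bytes) {
    assert(bytes > 0.0);
    return flop / bytes;
}
```

For (20,20,20) the larger traffic estimate gives the 1.25 FLOP/Byte quoted on the slide, well below the 5.2 FLOP/Byte balance point of the K20X.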

7 BATCHED CUBLAS
Results are not directly comparable: C = AB
Is not optimized for very small sizes
Plot axis: GFLOPS
E5-2667, K20X ECC

8 DBCSR
Arithmetic intensity: FLOPs / Bytes
Mixed C = C + AB and C = AB
Projected GFLOPS: (Arithmetic Intensity) * (180 GB/s)
E5-2667, K20X ECC
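The projection on this slide is a bandwidth-bound estimate. A minimal C++ sketch follows; the second function, which caps the estimate at the device peak, is an added assumption (the usual roofline form) and does not appear on the slide.

```cpp
#include <algorithm>
#include <cassert>

// Slide 8 projection: GFLOPS = intensity [FLOP/Byte] * bandwidth [GB/s].
double projected_gflops(double flop_per_byte, double bandwidth_gbs) {
    assert(flop_per_byte >= 0.0 && bandwidth_gbs >= 0.0);
    return flop_per_byte * bandwidth_gbs;
}

// Assumed extension: the projection cannot exceed the peak compute rate.
double roofline_gflops(double flop_per_byte, double bandwidth_gbs, double peak_gflops) {
    return std::min(peak_gflops, projected_gflops(flop_per_byte, bandwidth_gbs));
}
```

With the realistic 180 GB/s and an intensity of 1.25 FLOP/Byte, the projection is 225 GFLOPS, far below the 1300 GFLOPS double precision peak quoted for the K20X.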

9 GENERAL GUIDES
Sizes: 5x5 to 64x64
C = C + AB (not just C = AB)
Always store C in registers
Outer product formulation for larger cases
Consider the (26,13,13) case: arithmetic intensity, estimated K20X GFLOPS
Start with naïve implementation: A and B in smem, one C output per thread

10 NAÏVE IMPLEMENTATION
// r, c: row and column of the C element owned by this thread; myc: its accumulator
for (int run = 0; run < nrun; run++) {
    a_loc = ...; b_loc = ...;                             // get locations of A and B
    for (int i = threadIdx.x; i < mk; i += blockDim.x)    // load A to smem
        buff_l[i] = a_data[a_loc + i];
    // load B to smem
    __syncthreads();
    if (threadIdx.x < mn)                                 // C = C + AB
        for (int l = 0; l < k; l++)
            myc = myc + buff_l[l * m + r] * buff_r[c * k + l];
    iflush = ...;
    if (iflush) {
        c_loc = ...;                                      // get location of C
        if (threadIdx.x < mn)
            atomicAdd(&c_data[c_loc + threadIdx.x], myc);
        myc = 0;
    }                                                     // end of iflush condition
    __syncthreads();
}                                                         // end of run loop
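The indexing of the naïve kernel can be checked on the host. The sketch below is a sequential C++ emulation, not the CUDA kernel: it assumes column-major blocks, the thread-to-element mapping r = t % m, c = t / m, and a flush after every stack entry. The `StackEntry` struct and the function name are hypothetical.

```cpp
#include <cassert>
#include <vector>

// One stack entry: offsets of the A, B and C blocks in their data arrays.
struct StackEntry { int a_loc, b_loc, c_loc; };

// CPU emulation of the naive kernel: each emulated thread t < m*n owns one element of C.
void naive_stack_multiply(const std::vector<StackEntry>& stack,
                          const std::vector<double>& a_data,
                          const std::vector<double>& b_data,
                          std::vector<double>& c_data,
                          int m, int n, int k) {
    for (const StackEntry& e : stack) {
        const double* buff_l = &a_data[e.a_loc];  // stands in for A in shared memory
        const double* buff_r = &b_data[e.b_loc];  // stands in for B in shared memory
        for (int t = 0; t < m * n; ++t) {         // loop over emulated threads
            const int r = t % m;                  // assumed row of this thread's element
            const int c = t / m;                  // assumed column of this thread's element
            double myc = 0.0;
            for (int l = 0; l < k; ++l)
                myc += buff_l[l * m + r] * buff_r[c * k + l];
            c_data[e.c_loc + t] += myc;           // plays the role of atomicAdd
        }
    }
}
```

Two stack entries that point at the same C block accumulate into it, which is why the kernel needs an atomic update when it flushes.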

11 NAÏVE IMPLEMENTATION LSU
i7-2600, K20 ECC
Occupancy: 84.1%
Smem efficiency: 73.2%
Issue slot utilization: 33.1%
Device Memory: 58.9 GB/s
GFLOPS measured in application: 84.4

12 NAÏVE IMPLEMENTATION+ LSU
i7-2600, K20 ECC
Occupancy: 83.8%
Smem efficiency: 100.0%
Issue slot utilization: 33.1%
Device Memory: 70.0 GB/s
GFLOPS measured in application: 100.5
Double precision: add cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte)

13 THREAD TILES
Each thread processes MxN tile of input
Stripes of A and B
Increases work per thread (ILP)
Use __ldg to free up load/store unit
double C[N*M];
for (int l = 0; l < k; l++)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[N*i+j] = C[N*i+j] + buff_l[l * m + M*r+i] * buff_r[(N*c+j) * k + l];
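The tile indexing can be exercised on the host with a sequential C++ emulation. It is a sketch under assumptions: column-major blocks, one emulated thread per (r, c) tile, and edge guards for sizes that M or N do not divide (the slide code has no such guards). The function name is hypothetical.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of thread tiling: emulated thread (r, c) owns an M x N tile of C.
template <int M, int N>
std::vector<double> tiled_multiply(const std::vector<double>& buff_l,
                                   const std::vector<double>& buff_r,
                                   int m, int n, int k) {
    std::vector<double> c_out(m * n, 0.0);
    const int tile_rows = (m + M - 1) / M;  // number of tiles along the rows of C
    const int tile_cols = (n + N - 1) / N;  // number of tiles along the columns of C
    for (int c = 0; c < tile_cols; ++c)
        for (int r = 0; r < tile_rows; ++r) {
            double C[N * M] = {0.0};        // per-thread accumulators ("registers")
            for (int l = 0; l < k; ++l)
                for (int i = 0; i < M; ++i)
                    for (int j = 0; j < N; ++j) {
                        const int row = M * r + i, col = N * c + j;
                        if (row < m && col < n)  // assumed guard for partial tiles
                            C[N * i + j] += buff_l[l * m + row] * buff_r[col * k + l];
                    }
            for (int i = 0; i < M; ++i)     // write the finished tile back
                for (int j = 0; j < N; ++j) {
                    const int row = M * r + i, col = N * c + j;
                    if (row < m && col < n) c_out[col * m + row] = C[N * i + j];
                }
        }
    return c_out;
}
```

Each value loaded from `buff_l` is reused N times and each value from `buff_r` M times, which is where the extra instruction-level parallelism comes from.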

14 THREAD TILES GMEM
i7-2600, K20 ECC
Before: Occupancy: 83.8%; Issue slot utilization: 33.1%; Device Memory: 70.0 GB/s; GFLOPS measured in application: 100.5
After: Occupancy: 65.3%; Issue slot utilization: 42.3%; Device Memory: GB/s; GFLOPS measured in application: 171.3

15 DOUBLE BUFFERING
Two groups of threads
Group 1 load A and B from global memory to registers
Several elements per thread
Group 2 unpack from registers to shared memory
perform C = C + AB
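The schedule behind double buffering can be sketched sequentially in C++. This is not the kernel and does not model the two thread groups; it only shows the ordering, in which the next pair of blocks is fetched into a staging buffer while the current pair is multiplied from a working buffer. All names are hypothetical, blocks are assumed column-major and stored back to back, and every product is accumulated into a single C block.

```cpp
#include <cassert>
#include <vector>

// Sequential sketch of a double-buffered stack loop.
// regs stands in for registers, smem for shared memory.
std::vector<double> double_buffered_sum(const std::vector<double>& a_data,
                                        const std::vector<double>& b_data,
                                        int nrun, int m, int n, int k) {
    const int mk = m * k, kn = k * n;
    std::vector<double> regs(mk + kn), smem(mk + kn), c(m * n, 0.0);
    auto load = [&](int run) {              // load stage: global memory -> registers
        for (int i = 0; i < mk; ++i) regs[i] = a_data[run * mk + i];
        for (int i = 0; i < kn; ++i) regs[mk + i] = b_data[run * kn + i];
    };
    if (nrun > 0) load(0);                  // prologue: fill the pipeline
    for (int run = 0; run < nrun; ++run) {
        smem = regs;                        // unpack registers -> shared memory
        if (run + 1 < nrun) load(run + 1);  // prefetch; on the GPU this overlaps the compute
        for (int col = 0; col < n; ++col)   // compute stage: C = C + AB
            for (int row = 0; row < m; ++row)
                for (int l = 0; l < k; ++l)
                    c[col * m + row] += smem[l * m + row] * smem[mk + col * k + l];
    }
    return c;
}
```

In sequential code the prefetch buys nothing; on the GPU the point is that the global memory latency of the next load is hidden behind the arithmetic of the current product.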

16 DOUBLE BUFFERING GMEM
i7-2600, K20 ECC
Before: Occupancy: 65.3%; Issue slot utilization: 42.3%; Device Memory: GB/s; GFLOPS measured in application: 171.3
After: Occupancy: 48.3%; Issue slot utilization: 49.4%; Device Memory: GB/s; GFLOPS measured in application: 199.0

17 PARAMETERS SPACE SAMPLING 26x13x13 block size: 64 max: 213 GFLOPS MxN: 4x2 E5-2667, K20X ECC

18 PARAMETERS SPACE SAMPLING 26x13x13 block size: 96 max: 234 GFLOPS MxN: 2x2 E5-2667, K20X ECC

19 PARAMETERS SPACE SAMPLING 26x13x13 block size: 128 max: 218 GFLOPS MxN: 2x2 E5-2667, K20X ECC

20 PARAMETERS SPACE SAMPLING 26x13x13 block size: 256 max: 156 GFLOPS MxN: 1x2 E5-2667, K20X ECC
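The selection step of such a parameter sweep is simple to sketch in C++. The table below holds only the maxima reported on slides 17 to 20 for the 26x13x13 case; the struct and function names are hypothetical, and a real tuner would fill the table from timing runs.

```cpp
#include <cassert>
#include <vector>

// One sampled kernel configuration and its best measured performance.
struct Sample { int block_size; int tile_m; int tile_n; double gflops; };

// Maxima for the 26x13x13 case as reported on slides 17 to 20.
std::vector<Sample> reported_samples() {
    return {{64, 4, 2, 213.0}, {96, 2, 2, 234.0}, {128, 2, 2, 218.0}, {256, 1, 2, 156.0}};
}

// Pick the fastest configuration among the samples.
Sample best_sample(const std::vector<Sample>& samples) {
    assert(!samples.empty());
    Sample best = samples.front();
    for (const Sample& s : samples)
        if (s.gflops > best.gflops) best = s;
    return best;
}
```

For the reported numbers the winner is a block size of 96 with 2x2 thread tiles at 234 GFLOPS.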

21 THREAD PANELS
Larger sizes (23x23 and up)
Outer product formulation
v, w additional template parameters
Diagram labels: Inner, Outer
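The difference between the inner and outer product formulations is only the loop order. The host-side C++ sketch below shows both on column-major blocks; it illustrates the two orderings and is not the panel kernel itself (the template parameters v and w are not modeled). Function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Inner-product form: each C(row, col) is a dot product over l.
std::vector<double> multiply_inner(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   int m, int n, int k) {
    std::vector<double> c(m * n, 0.0);
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < m; ++row)
            for (int l = 0; l < k; ++l)
                c[col * m + row] += a[l * m + row] * b[col * k + l];
    return c;
}

// Outer-product form: for each l, add the rank-1 update (column l of A) x (row l of B).
std::vector<double> multiply_outer(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   int m, int n, int k) {
    std::vector<double> c(m * n, 0.0);
    for (int l = 0; l < k; ++l)
        for (int col = 0; col < n; ++col) {
            const double b_lc = b[col * k + l];  // loaded once, reused for a column of C
            for (int row = 0; row < m; ++row)
                c[col * m + row] += a[l * m + row] * b_lc;
        }
    return c;
}
```

In the outer form, row l of a column-major B is read with stride k, which is one plausible reason a transposed B helps in the later slides.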

22 DBCSR Thread specialization several groups perform independent work E5-2667, K20X ECC

23 DBCSR Thread specialization Double buffering E5-2667, K20X ECC

24 DBCSR Thread specialization Double buffering Thread panels + transpose of B E5-2667, K20X ECC

25 BENCHMARK (CP2K)
Full application performance comparison of the multi-threaded DBCSR library, based on 23x23 matrix blocks and not using the MPI capabilities.
The benchmark was run on a dual Sandy Bridge (E5-2620, 2.0 GHz, 6 cores) machine equipped with one NVIDIA Tesla K20 card.

26 LARGE SIMULATION Aggregated nanoparticles in explicit solution (77538 atoms) can be run on the Piz Daint computer (5272 hybrid compute nodes) at approx. 122s per SCF step

27 CONCLUSIONS
Library of template kernels used in production code
Profiler guided algorithmic choices
Optimized shared memory accesses
Increased ILP
Double buffering
Outer product formulation and transpose optimization
Automated optimization procedure

Acknowledgements
Ole Schütt and Joost VandeVondele (Nanoscale Simulations, Department of Materials, ETH Zürich)
Jürg Hutter (Institute of Physical Chemistry, University of Zürich)
