A High-performance Drop-in GPU BLAS Library to Accelerate Existing Applications. Steve Rennich, Nvidia Developer Technology

1 A High-performance Drop-in GPU BLAS Library to Accelerate Existing Applications Steve Rennich, Nvidia Developer Technology

2 Preliminaries Start EC2 lab instance Create account / sign in to the EC2 / qwiklab nvidia.qwiklab.com Select Nvidia Workshop -Santa Clara

3 Preliminaries Start EC2 lab instance Select Start Lab button for this workshop This means lab is running (don't press anything) Connect using ec compute-1.amazonaws.com

4 Preliminaries Connect to EC2 instance Use Nomachine NX OpenNX (mac) (remote desktop) (ssh to IP# ) configure Host: configure desktop: xterm user: gpudev1 pass: GTC2013! accept RSA key should bring up a terminal tutorial materials are at:

5 Source setup cd ~ source gpudev1/ex_setup copy Exercise source to /run and adjust permissions cd /run/gpudev1

6 Problem GPU provides large computational resource Many apps could benefit, but have not been ported Too many apps Source not available Would be nice if there was an easy way to allow more apps to benefit from GPU acceleration

7 Solution Many apps use libraries with standard interfaces BLAS, FFTW, LAPACK, Intercept standard calls and route to GPU Only when appropriate Without modifying application Minimal programming use LD_PRELOAD Obtain high performance

8 Workshop Learn about LD_PRELOAD library manipulation Nvidia cublas library (DGEMM) Create a drop-in library to intercept all calls to BLAS DGEMM and demonstrate its performance in GNU Octave Just enough to tie things together

9 Workshop Ex 1 Baseline performance Ex 2 Send all DGEMM calls to GPU Ex 3 Send appropriate calls to GPU Ex 4 Better performance using tiles & concurrency Ex 5 Can be applied to any application

10 Exercises Materials in /run/gpudev1/dropingpublas/ex*_ Exercise directories contain EXERCISE An ascii description of the Exercise dropingpublas.c The code to modify / build buildscript How to build the code matmul.m GNU Octave matrix multiplication script Ex#_runscript Executable script

11 Objective Create a GPU DGEMM library that Is always beneficial Leverages the computational power of the GPU Handles arbitrarily large matrices Can be applied to existing applications Demonstrate the use and performance of this library with GNU Octave

12 Workshop Red Cup / Blue Cup Display Red Cup to get expert help CHEAT! Refer to solution Help each other Work ahead / skip exercises Experiment

13 GNU Octave / test script General purpose matrix math package software/octave/ general utility large uses BLAS complex code would benefit from GPU acceleration
matmul.m:
    for n = [2, 4, 8, 16, 4096 ]
      # Initialize the matrices with sample data
      a = ones(n,n);
      b = ones(n,n);
      # Record the start time
      tstart = time;
      # Perform the multiplication
      c = a * b;
      # Record the end time
      tend = time;
      # Compute and output performance
    endfor
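
The "Compute and output performance" step is not shown on the slide. A minimal sketch of the usual calculation, written in C and assuming the conventional count of 2n^3 floating-point operations for an n x n multiply; the name `dgemm_gflops` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: GF/s achieved by an n x n matrix multiply.
 * Assumes the conventional count of 2*n^3 floating-point operations
 * (n^3 multiplies plus n^3 adds). */
double dgemm_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1.0e9;
}
```

Under this count, a 4096 x 4096 multiply that finishes in 0.5 s corresponds to roughly 275 GF/s.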

14 LD_PRELOAD Loaded before any other libraries Will replace any subsequent libraries with same signature man ld.so A list of additional, user-specified, ELF shared libraries to be loaded before all others. This can be used to selectively override functions in shared libraries. OpenBLAS Tuned BLAS libs for CPU (for comparison)
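
To make the override mechanism concrete, here is a minimal sketch of a function that could be preloaded in place of BLAS DGEMM. It assumes the usual Fortran calling convention (symbol `dgemm_`, every argument passed by pointer, column-major storage). The call counter and the plain-loop body are illustrative stand-ins, not the workshop's code; a real drop-in would forward to cuBLAS or a host BLAS at that point.

```c
#include <assert.h>
#include <stddef.h>

/* Counts intercepted calls; a hypothetical diagnostic. */
static unsigned long dgemm_calls = 0;

/* Same name and argument list as the Fortran BLAS routine, so the dynamic
 * linker binds the application's dgemm_ calls here when this file is built
 * as a shared library and listed in LD_PRELOAD.
 * Sketch only: handles the untransposed case with plain loops. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    assert((*transa == 'N' || *transa == 'n') &&
           (*transb == 'N' || *transb == 'n'));
    dgemm_calls++;
    for (int j = 0; j < *n; j++) {
        for (int i = 0; i < *m; i++) {
            double sum = 0.0;
            for (int p = 0; p < *k; p++)
                sum += a[i + (size_t)p * *lda] * b[p + (size_t)j * *ldb];
            double *cij = &c[i + (size_t)j * *ldc];
            /* BLAS convention: C is not read when beta is zero. */
            *cij = (*beta == 0.0) ? *alpha * sum
                                  : *alpha * sum + *beta * *cij;
        }
    }
}
```

Built with `gcc -shared -fPIC` into a shared object (the file name is up to you) and named in `LD_PRELOAD`, this definition is found before the one in the application's own BLAS.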

15 Ex 1 Baseline Performance /run/gpudev1/dropingpublas/ex1_baseline/ EC2 nodes: X5570 -> 94 GF/s Ex1_runscript bash script that will run the exercise Exercise: Use Ex1_runscript to measure base performance Modify Ex1_runscript with LD_PRELOAD to use DGEMM from /run/gpudev1/lib/libopenblas.so Measure performance with OpenBLAS

16 Ex 1 - Baseline [Plot: GF/s vs. Matrix Dimension; series: DropinCuBLAS, OpenBLAS]

17 Ex 2 Simple GPU Replacement /run/gpudev1/dropingpublas/ex2_simple libdigpublas.so : cuBLAS DGEMM copy A, B, C to GPU compute C = alpha*A*B + beta*C on GPU copy C back to host build using source buildscript Use LD_PRELOAD to send all DGEMM calls to the new library Exercise Use LD_PRELOAD=/run/gpudev1/DropinGpuBlas/Ex2_DropinSimple/libdigpublas.so How fast is the simple GPU DGEMM library? Is it always faster than OpenBLAS?

18 Ex 2 Simple GPU Replacement [Plot: GF/s vs. Matrix Dimension; series: DropinCuBLAS, OpenBLAS; annotations: slower for small dimension, fails for large dimension]

19 Ex 2 Simple GPU Replacement Better peak performance 250 GF/s vs. 85 GF/s (2.9x) Suffers from Slower for small matrices Doesn't work for large matrices We'd like to avoid these issues
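
One way to see why the simple library fails for large matrices is to count the device memory it needs when A, B and C are all resident at once. A small sketch, assuming 8-byte doubles and no padding; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of device memory needed when A (m x k), B (k x n) and C (m x n)
 * are all held on the GPU at the same time, as in the simple replacement. */
size_t dgemm_device_bytes(size_t m, size_t n, size_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    return sizeof(double) * (m * k + k * n + m * n);
}
```

For three 16384 x 16384 matrices this comes to 6 GiB, which is more than the roughly 3 GB cutoff the next exercise uses.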

20 Ex 3 Best of Both CPU and GPU Use GPU or CPU when appropriate Send small matrices to CPU Send matrices > 3GB to CPU Explicitly define host library dlopen("/run/gpudev1/lib/libopenblas.so", ...) dlsym(..., "dgemm_") Exercise Properly load host library Choose appropriate conditionals for sending DGEMM to GPU Verify that this provides a robust solution
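
The two pieces of this exercise can be sketched as follows: loading the host `dgemm_` through an explicit handle, and a predicate that decides where a call goes. The function names and the small-matrix cutoff of 256 are assumptions for illustration. Only the 3 GB limit comes from the slide, and it is treated here as 3 GiB of combined A, B and C storage.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Pointer type matching the Fortran BLAS dgemm_ argument list. */
typedef void (*dgemm_fn)(const char *, const char *,
                         const int *, const int *, const int *,
                         const double *, const double *, const int *,
                         const double *, const int *,
                         const double *, double *, const int *);

/* Look up dgemm_ in an explicitly named host BLAS.
 * Returns NULL if the library or the symbol cannot be found. */
dgemm_fn load_host_dgemm(const char *path)
{
    void *handle = dlopen(path, RTLD_LAZY);
    if (handle == NULL)
        return NULL;
    /* The handle is never closed: the function must stay mapped. */
    return (dgemm_fn)dlsym(handle, "dgemm_");
}

/* Hypothetical thresholds; the exercise asks for suitable values. */
#define MIN_GPU_DIM   256            /* assumed small-matrix cutoff */
#define MAX_GPU_BYTES (3ULL << 30)   /* assumed reading of "> 3GB" */

/* Returns 1 when the call should go to the GPU, 0 for the host BLAS. */
int use_gpu(size_t m, size_t n, size_t k)
{
    unsigned long long bytes = 8ULL * ((unsigned long long)m * k +
                                       (unsigned long long)k * n +
                                       (unsigned long long)m * n);
    if (m < MIN_GPU_DIM || n < MIN_GPU_DIM || k < MIN_GPU_DIM)
        return 0;
    return bytes <= MAX_GPU_BYTES;
}
```

Looking the symbol up through the handle returns the host library's own `dgemm_`, not the preloaded one, so the fallback path does not call back into the drop-in library.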

21 Ex 3 Best of Both [Plot: GF/s vs. Matrix Dimension; series: OpenBLAS, DropinBoth; annotation: Always best]

22 Ex 3 Best of Both CPU and GPU Best of both CPU and GPU Good performance for small matrices Leverages GPU when appropriate Reverts to CPU when matrices are too large to fit on GPU No reason not to use this by default Doesn't achieve peak GPU performance capability Doesn't accelerate large matrices

23 Ex 4 Tiled DGEMM on GPU Tiling only part of the matrix is on the GPU at any one time REMOVES LIMIT ON MATRIX SIZE Leverages concurrency on the GPU HIDES TRANSFERS BEHIND COMPUTE Simple tiling algorithm

24 Ex4 - Tiled DGEMM on GPU [Diagram: tiles of A, B and C on CPU and GPU; H2D copy, kernel (1,2 += 1,1 * 1,2) and D2H copy run concurrently] Only need space for 7 tiles on GPU! With suitably large tiles communication is hidden!
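
The loop structure behind the tiling can be sketched on the CPU alone. The version below assumes square n x n column-major matrices and a tile edge t, and accumulates C += A * B one tile product at a time. It is not the workshop's implementation: in a GPU version each tile product would be a cuBLAS DGEMM on tiles copied to the device, with copies for neighbouring tiles issued while that product runs.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n column-major matrices, processed in t x t tiles.
 * Ragged edge tiles are handled by clamping the tile bounds to n. */
void tiled_dgemm(size_t n, size_t t,
                 const double *a, const double *b, double *c)
{
    assert(t > 0);
    for (size_t j0 = 0; j0 < n; j0 += t)            /* tile column of C */
        for (size_t i0 = 0; i0 < n; i0 += t)        /* tile row of C */
            for (size_t p0 = 0; p0 < n; p0 += t) {
                /* One tile product: C(i0,j0) += A(i0,p0) * B(p0,j0) */
                size_t i1 = min_sz(i0 + t, n);
                size_t j1 = min_sz(j0 + t, n);
                size_t p1 = min_sz(p0 + t, n);
                for (size_t j = j0; j < j1; j++)
                    for (size_t p = p0; p < p1; p++)
                        for (size_t i = i0; i < i1; i++)
                            c[i + j * n] += a[i + p * n] * b[p + j * n];
            }
}
```

Because each tile product touches only three tiles, the working set is bounded by the tile size and not by n, which is what removes the limit on matrix size.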

25 Ex 4 Tiled DGEMM on GPU Exercise Find appropriate tile sizes for maximum performance Note the amount of device memory this uses Optional Profile matmul by prefixing octave with nvprof -o exercise4.nvvp Visualize timeline using nvvp
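
When experimenting with tile sizes, the device memory in use can be estimated from the tile count. A small sketch, assuming square t x t tiles of 8-byte doubles; the count of 7 comes from the earlier slide, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Device bytes occupied by a given number of square t x t tiles of doubles. */
size_t tile_buffer_bytes(size_t t, size_t tiles)
{
    assert(t > 0 && tiles > 0);
    return tiles * t * t * sizeof(double);
}
```

With 7 tiles, a tile edge of 2048 takes 224 MiB and an edge of 4096 takes 896 MiB.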

26 Ex 4 Tiled DGEMM on GPU [Plot: GF/s vs. Matrix Dimension; series: OpenBLAS, DropinBoth, DropinTiled]

27 Ex 4 Tiled DGEMM on GPU Tiling provides peak performance 300 GF/s 3.5x Tiling permits multiplication of arbitrarily large matrices GPU accelerates the most demanding problems Small matrices are still sent to the GPU

28 Ex 5 Use Drop-in library with Other Apps Many other applications use DGEMM (BLAS) Scilab Freemat Exercise Use LD_PRELOAD to perform matrix-multiplication in Scilab and verify it has been accelerated on the GPU Experiment with matrix sizes

29 Ex 5 Use Drop-in library with Other Apps 2048 square matrix multiplication using Scilab achieves: 10 GF/s using OpenBLAS 35 GF/s using DropinGpuBlas Once the drop-in library is written, accelerating other apps is simple

30 Performance Note Performance plots have represented EC2 & Fermi SandyBridge and Kepler performance is quite improved 1200 GF/s [Plot: GF/s vs. Matrix Dimension; series: OpenBlas, DropinBoth, DropinTiled; hardware labels: core 3.2 GHz, K20C]

31 Summary If codes use standard libraries, which are amenable to GPU acceleration They can be easily accelerated on the GPU Access to source is not necessary Shown here with DGEMM, could apply to BLAS3, FFTW, LAPACK, etc. Substantial optimization / refinement still possible multi-GPU, hybrid computing This is not meant to replace source porting That is still the way to achieve optimal performance We hope this helps you get more out of your GPU!
