CUDA math libraries APC

Arnold Cunningham
5 years ago

1 CUDA math libraries APC

2 CUDA Libraries
CUDA Toolkit:
  CUBLAS: linear algebra
  CUSPARSE: linear algebra with sparse matrices
  CUFFT: fast discrete Fourier transform
  CURAND: random number generation
  Thrust: STL-like template library
  NPP: signal and image processing
  NVCUVENC/NVCUVID: video encoder and decoder libraries

3 3rd-party libraries
  MAGMA: heterogeneous LAPACK and BLAS
  CUSP: algorithms for sparse linear algebra and graph computations
  ArrayFire: comprehensive GPU matrix library
  CULA Tools
  IMSL Fortran Numerical Library
  GPU AI: path finding
  GPU AI for board games

4 CUBLAS
BLAS interface implementation
Column-major addressing, 0- and 1-based indexing
C compatibility macros

Level               Complexity   Examples
1 (vector-vector)   O(n)         AXPY: y = a*x + y;  DOT: s = (x, y)
2 (matrix-vector)   O(n^2)       GEMV: matrix-vector multiplication
3 (matrix-matrix)   O(n^3)       GEMM: matrix-matrix multiplication
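To make the table and the column-major remark concrete, here is a minimal CPU-side sketch in C++. It is not CUBLAS code: the names idx2c, axpy_ref, dot_ref and gemv_ref are hypothetical and chosen only for illustration. The index helper follows the shape of the IDX2C macro suggested in the CUBLAS documentation, on the assumption that this is the kind of "C compatibility macro" the slide refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major offset of element (row i, column j), 0-based.
// ld is the leading dimension (number of rows of the stored matrix).
inline std::size_t idx2c(std::size_t i, std::size_t j, std::size_t ld) {
    return j * ld + i;
}

// Level 1, O(n): y = a*x + y
void axpy_ref(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] += a * x[i];
    }
}

// Level 1, O(n): s = (x, y)
double dot_ref(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        s += x[i] * y[i];
    }
    return s;
}

// Level 2, O(n^2): returns A*x for an m-by-n matrix A stored column by column.
std::vector<double> gemv_ref(std::size_t m, std::size_t n,
                             const std::vector<double>& A,
                             const std::vector<double>& x) {
    assert(A.size() == m * n && x.size() == n);
    std::vector<double> y(m, 0.0);
    for (std::size_t j = 0; j < n; ++j) {       // walk columns first:
        for (std::size_t i = 0; i < m; ++i) {   // contiguous in memory
            y[i] += A[idx2c(i, j, m)] * x[j];
        }
    }
    return y;
}
```

With column-major storage the matrix [1 2; 3 4] is laid out as {1, 3, 2, 4}, so a row-major C array has to be transposed (or indexed through a helper like idx2c) before it is handed to a BLAS-style routine.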

5 CUBLAS
Naming convention: cublas<T><func>
  <T>: data type
    S: single precision, real number
    D: double precision, real number
    C: single precision, complex number
    Z: double precision, complex number
  <func>: BLAS routine name
  Example: cublasDgemm
In API v2 (CUDA 4.0+), handles are used for thread safety

6 CUBLAS
Additional types:
  cuComplex, cuDoubleComplex
  cublasHandle_t
  cublasStatus_t
Helper functions:
  cublasCreate() / cublasDestroy()
  cublas{Get,Set}Stream()
  cublas{Get,Set}{Vector,Matrix}[Async]()

7 CUBLAS - workflow
1. Initialize the CUBLAS descriptor (cublasCreate())
2. Allocate GPU memory and upload data
3. Call all the necessary CUBLAS functions
4. Copy data from the GPU to host memory
5. Free the CUBLAS descriptor (cublasDestroy())

8 CUSPARSE
BLAS-like interface implementation for sparse matrices
Sparse = most elements are zero
Formats:
  Dense format (often inefficient)
  COO: Coordinate
  CSR/CSC: Compressed Sparse Row/Column
  ELL: Ellpack-Itpack
  HYB: Hybrid
  BSR: Block Compressed Sparse Row

9 Sparse Formats: COO
A = [matrix not preserved]   nnz = [not preserved]
cooValA    = [ ]
cooRowIndA = [ ]
cooColIndA = [ ]

10 Sparse Formats: CSR
A = [matrix not preserved]   nnz = [not preserved]
csrValA    = [ ]
csrRowPtrA = [ ]
csrColIndA = [ ]

11 Sparse Formats: CSC
A = [matrix not preserved]   nnz = [not preserved]
cscValA    = [ ]
cscRowIndA = [ ]
cscColPtrA = [ ]
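Because the example matrix and array contents on the three format slides did not survive, the following C++ sketch shows the same three layouts on a small made-up matrix. Everything here is an assumption for illustration: the 3x3 matrix, the struct and function names, and the use of 0-based indices. It is a CPU reference, not the CUSPARSE conversion API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// COO: one (row, column, value) triple per nonzero.
struct Coo {
    int rows, cols;
    std::vector<double> val;
    std::vector<int> rowInd, colInd;
};

// CSR: values and column indices grouped by row; rowPtr[r]..rowPtr[r+1]
// delimits the nonzeros of row r, so rowPtr has rows + 1 entries.
struct Csr {
    int rows, cols;
    std::vector<double> val;
    std::vector<int> rowPtr, colInd;
};

// Hypothetical example matrix (not the one from the slides), nnz = 5:
//   [ 1 0 2 ]
//   [ 0 0 3 ]
//   [ 4 5 0 ]
Coo exampleCoo() {
    return Coo{3, 3, {1, 2, 3, 4, 5}, {0, 0, 1, 2, 2}, {0, 2, 2, 0, 1}};
}

// COO -> CSR by counting nonzeros per row, prefix-summing the counts,
// and scattering each entry to the next free slot of its row.
Csr cooToCsr(const Coo& a) {
    const std::size_t nnz = a.val.size();
    assert(a.rowInd.size() == nnz && a.colInd.size() == nnz);
    Csr c{a.rows, a.cols, std::vector<double>(nnz),
          std::vector<int>(a.rows + 1, 0), std::vector<int>(nnz)};
    for (int r : a.rowInd) ++c.rowPtr[r + 1];
    for (int r = 0; r < a.rows; ++r) c.rowPtr[r + 1] += c.rowPtr[r];
    std::vector<int> next(c.rowPtr.begin(), c.rowPtr.end() - 1);
    for (std::size_t k = 0; k < nnz; ++k) {
        const int dst = next[a.rowInd[k]]++;
        c.val[dst] = a.val[k];
        c.colInd[dst] = a.colInd[k];
    }
    return c;
}

// The CSC arrays of A equal the CSR arrays of A^T. In the result, rowPtr
// plays the role of the column pointer and colInd holds row indices.
Csr cooToCscViaTranspose(const Coo& a) {
    Coo t{a.cols, a.rows, a.val, a.colInd, a.rowInd};
    return cooToCsr(t);
}
```

For the made-up matrix, CSR keeps the values in the order 1, 2, 3, 4, 5 with row pointer {0, 2, 3, 5}, while CSC reorders them column by column to 1, 4, 5, 2, 3 with column pointer {0, 2, 3, 5}. COO stores 3*nnz numbers; the compressed formats replace one index array of length nnz with a pointer array of length rows + 1 (or cols + 1).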

12 CUSPARSE features
Naming: cusparse<t><func>
4 levels:
  Sparse and dense vectors
  Sparse matrices and dense vectors
  Sparse matrices and dense matrices
  Format conversions
Single/Double precision, Real/Complex values

13 CUSPARSE - workflow
1. Initialize the CUSPARSE descriptor (cusparseCreate())
2. Allocate GPU memory and upload data
3. Call all the necessary CUSPARSE functions
4. Copy data from the GPU to host memory
5. Free the CUSPARSE descriptor (cusparseDestroy())

14 CUFFT - Fast Discrete Fourier Transform
$$F_k = \sum_{n=0}^{N-1} f_n \exp\left(-\frac{2\pi i}{N}\, kn\right)$$
$$f_n = \frac{1}{N} \sum_{k=0}^{N-1} F_k \exp\left(\frac{2\pi i}{N}\, kn\right)$$

15 CUFFT
Interface similar to FFTW (FFTW compatibility)
1D, 2D and 3D forward and inverse DFT
Single/Double precision, Real/Complex
Up to 128M single-precision elements in each dimension, 64M for double precision
CUDA Streams support (asynchronous transforms)
Transforms are unnormalized: IFFT(FFT(a)) = len(a)*a
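The last point, IFFT(FFT(a)) = len(a)*a, is easy to trip over, so here is a small CPU sketch of it in C++. It uses a naive O(N^2) DFT rather than CUFFT; the function names dft and roundTrip are hypothetical. The sign argument mirrors the convention of the CUFFT direction constants (CUFFT_FORWARD is -1, CUFFT_INVERSE is +1), and neither direction applies a 1/N factor.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Unnormalized DFT: out[k] = sum_n in[n] * exp(sign * 2*pi*i * k*n / N).
// sign = -1 gives the forward transform, sign = +1 the inverse one.
std::vector<cd> dft(const std::vector<cd>& in, int sign) {
    const std::size_t N = in.size();
    const double pi = std::acos(-1.0);
    std::vector<cd> out(N);
    for (std::size_t k = 0; k < N; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t n = 0; n < N; ++n) {
            // k*n is reduced mod N to keep the angle small and accurate.
            const double ang = sign * 2.0 * pi * double((k * n) % N) / double(N);
            sum += in[n] * cd(std::cos(ang), std::sin(ang));
        }
        out[k] = sum;
    }
    return out;
}

// Forward then inverse returns N*a, so the caller divides by N
// to recover the original signal.
std::vector<cd> roundTrip(const std::vector<cd>& a) {
    std::vector<cd> b = dft(dft(a, -1), +1);
    for (cd& v : b) v /= double(a.size());
    return b;
}
```

In a real CUFFT pipeline the same division is the caller's job, typically folded into a kernel that already touches the data after the inverse transform.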

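16 CUFFT - Example
Poisson equation: $\Delta u(p) = f(p)$, with the right-hand side $f(x, y)$ built from an auxiliary function $s(x, y)$ and an exponential.
Exact solution: $u(x, y)$, likewise expressed through $s(x, y)$ and an exponential.
[The explicit formulas for the domain, s(x, y), f(x, y) and u(x, y) are not preserved in this copy.]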

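17 CUFFT - Example
Numeric solution:
$$W = \exp\left(\frac{2\pi i}{N}\right)$$
RHS expanded in Fourier harmonics:
$$\hat{f}(n, m) = \frac{1}{N^2} \sum_{j,k=0}^{N-1} f(x_k, y_j)\, W^{-(nk + mj)}$$
$$\hat{u}(n, m) = \hat{f}(n, m)\, h^2 \left(W^{n} + W^{-n} + W^{m} + W^{-m} - 4\right)^{-1}$$
$$u(x_k, y_j) = \sum_{n,m=0}^{N-1} \hat{u}(n, m)\, W^{nk + mj}$$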
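The factor built from four powers of W and the constant 4 can be recovered with a short derivation. The sketch below assumes a periodic N x N grid with spacing h and the standard five-point stencil for the Laplacian; neither is stated on the slide, but both are consistent with that factor. The hats and the index names are notation chosen here.

```latex
\begin{align*}
&\text{Five-point stencil:}\quad
  \frac{u_{k+1,j} + u_{k-1,j} + u_{k,j+1} + u_{k,j-1} - 4\,u_{k,j}}{h^2} = f_{k,j},
  \qquad W = \exp\!\left(\frac{2\pi i}{N}\right) \\
&\text{Expand both sides:}\quad
  u_{k,j} = \sum_{n,m=0}^{N-1} \hat{u}(n,m)\, W^{nk+mj},
  \qquad
  f_{k,j} = \sum_{n,m=0}^{N-1} \hat{f}(n,m)\, W^{nk+mj} \\
&\text{Shifting } k \to k \pm 1 \text{ multiplies a harmonic by } W^{\pm n},
  \text{ shifting } j \to j \pm 1 \text{ by } W^{\pm m}: \\
&\qquad \hat{u}(n,m)\,\frac{W^{n} + W^{-n} + W^{m} + W^{-m} - 4}{h^2} = \hat{f}(n,m) \\
&\Longrightarrow\quad
  \hat{u}(n,m) = \frac{h^2\,\hat{f}(n,m)}{W^{n} + W^{-n} + W^{m} + W^{-m} - 4}
\end{align*}
```

In CUFFT terms this is one forward 2D transform of f, an element-wise scaling, and one inverse 2D transform. For n = m = 0 the denominator vanishes, so that mode has to be fixed separately (for example set to zero); this is a standard caveat for periodic Poisson problems rather than something shown on the slide.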

18 CURAND
Pseudo- and quasi-random number generation
Generators: XORWOW, MRG32K3A, MTGP32 and SOBOL
Distributions:
  Uniform
  [Log]Normal
  Poisson
Has 2 interfaces: for device and for host

19 NPP: Image & Signal Processing
Similar to IPP
  Arithmetic and logical operations
  Color model conversion
  Compression
  Filtering functions
  Geometry transforms
  Statistics functions

20 ArrayFire
A comprehensive GPU matrix library:
  Linear algebra
  Signal & image processing
  Statistics
  Code timing
  Graphics
Unified array container type:
  Single/Double precision, Real/Complex
  [Un]signed + boolean
  ND support
Easy index manipulation (Matlab-like)
Parallel gfor loops and multi-GPU scaling

21 ArrayFire Example: Conway's Game of Life
    array da(nx+2, ny+2, nz+2, A, afHost, 1);    // Initialization from host data A
    array dc(nx+2, ny+2, nz+2, s32);
    array kernel = constant(1, 3, 3, 3, s32);    // Convolution kernel: 3x3x3 box of ones
    for (int step = 1; step <= num_steps; step++) {
        // Neighbors count: box sum minus the cell itself
        dc = convolve(da.as(f32), kernel.as(f32)).as(s32);
        dc -= da;
        // Evolution: a dead cell is born with 6 or 7 neighbors,
        // a live cell survives with 4 to 7 neighbors
        da = ((da==0)*((dc==6) || (dc==7)) + (da==1)*((dc<=7) && (dc>=4))).as(s32);
    }

22 ArrayFire Example: Conway's Game of Life
Timing table with columns: Steps | Host, sec | AF, sec [values not preserved]

23 Conclusion
If you are not a professional in some area, use libraries.
If you think you are a professional in a particular area, use libraries at the beginning.
Do not worry if you cannot implement a routine more efficiently than the library does.
Sometimes everything above is wrong. But only sometimes.
