CUDA 6 Performance Report
April 2014
CUDA 6 Performance Report
- CUDART: CUDA Runtime Library
- cuFFT: Fast Fourier Transforms Library
- cuBLAS: Complete BLAS Library
- cuSPARSE: Sparse Matrix Library
- cuRAND: Random Number Generation (RNG) Library
- NPP: Performance Primitives for Image & Video Processing
- Thrust: Templated Parallel Algorithms & Data Structures
- math.h: C99 floating-point Library
Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries
CUDA 6: 2x Faster GPU Kernel Launches
[Chart: launch overhead (usec) for dynamic parallelism kernel launches, CUDA 5 vs. CUDA 6]
- Back-to-back launches (a<<<...>>>(); b<<<...>>>();): 1.5x faster
- Launch and synchronize (a<<<...>>>(); cudaDeviceSynchronize();): 2.0x faster
Performance may vary based on OS version and motherboard configuration. CUDA 5.0 and CUDA 6.0 on Tesla K20.
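The two launch patterns measured above can be sketched as follows; the empty kernels a and b and the single-thread launch configuration are placeholders, and the timing loop is omitted:

```cuda
// Hypothetical empty kernels standing in for real work.
__global__ void a() {}
__global__ void b() {}

int main() {
    // Back-to-back launches: both kernels are queued asynchronously,
    // so only the launch overhead itself is measured.
    a<<<1, 1>>>();
    b<<<1, 1>>>();

    // Launch and synchronize: the host blocks until the kernel
    // completes, fully exposing the launch latency.
    a<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```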
cuFFT: Multi-dimensional FFTs
- Real and complex
- Single- and double-precision data types
- 1D, 2D and 3D batched transforms
- Flexible input and output data layouts
New in CUDA 6: XT interface supports dual-GPU cards (Tesla K10, GeForce GTX 690, ...)
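A batched 1D complex-to-complex transform, the configuration benchmarked on the following slides, can be set up as sketched below; the transform size and batch count are example values, and error checking is omitted:

```cuda
#include <cufft.h>

int main() {
    const int n = 1 << 20;   // transform size (example value)
    const int batch = 32;    // number of transforms per call (example value)

    cufftComplex *data;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);         // single-precision C2C plan
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);   // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```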
cuFFT: Up to 700 GFLOPS
1D complex, batched FFTs: used in audio processing and as a foundation for 2D and 3D FFTs
[Charts: single- and double-precision GFLOPS vs. log2(transform size), for sizes 2^1 through 2^25]
cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device.
cuFFT: Consistently High Performance
1D complex, batched FFTs: used in audio processing and as a foundation for 2D and 3D FFTs
[Charts: single- and double-precision GFLOPS vs. transform size on a log scale, from 1 to over 1,000,000]
cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device.
cuFFT-XT: Boosts Performance on K10 (New in CUDA 6)
[Charts: execution time (ms), cuFFT vs. cuFFT-XT]
- 256x256x256: 25% faster
- 512x512x512: 65% faster
cuFFT 6.0 on K10, ECC ON, input and output data on device.
cuBLAS: Dense Linear Algebra on GPUs
- Complete BLAS implementation plus useful extensions
- Supports all 152 standard routines for single, double, complex, and double complex
- Host- and device-callable interface
New in CUDA 6: XT interface for Level 3 BLAS
- Distributed computations across multiple GPUs
- Out-of-core streaming to GPU, no upper limit on matrix size
- "Drop-in" BLAS intercepts CPU BLAS calls, streams to GPU
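A single-precision GEMM (C = alpha*A*B + beta*C) through the host-callable v2 interface can be sketched as below; matrices are assumed already on the device in column-major order, the dimensions match the benchmark on the next slide, and error checking is omitted:

```cuda
#include <cublas_v2.h>

int main() {
    const int m = 4096, n = 4096, k = 4096;  // dims used in the following slide
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major SGEMM: leading dimensions equal the row counts here.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, m, B, k, &beta, C, m);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```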
cuBLAS: >3 TFLOPS Single-Precision, >1 TFLOPS Double-Precision
[Chart: GFLOPS for GEMM, SYMM, TRSM, and SYRK in single, single-complex, double, and double-complex precision]
cuBLAS 6.0 on K40m, ECC ON, input and output data on device; m=n=k=4096, transpose=no, side=right, fill=lower.
cuBLAS: ZGEMM 5x Faster than MKL
[Chart: GFLOPS vs. matrix dimension (m=n=k), cuBLAS vs. MKL]
cuBLAS 6.0 on K40m, ECC ON, input and output data on device. MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.
cuBLAS-XT: Multi-GPU Performance Scaling (New in CUDA 6)
16K x 16K SGEMM on Tesla K10:
- 1 x K10: 2.2 TFLOPS
- 2 x K10: 4.2 TFLOPS
- 3 x K10: 6.0 TFLOPS
- 4 x K10: 7.9 TFLOPS
cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host.
cuSPARSE: Sparse Linear Algebra Routines
- Optimized sparse linear algebra BLAS routines: matrix-vector, matrix-matrix, triangular solve
- Support for a variety of formats (CSR, COO, block variants)
New in CUDA 6: many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners
[Illustration: sparse matrix-vector multiply, y = alpha * op(A) * x + beta * y]
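The sparse matrix-vector product illustrated above (y = alpha*op(A)*x + beta*y) can be sketched with the CSR SpMV routine from the cuSPARSE 6.0-era API; the device arrays are assumed already populated and error checking is omitted:

```cuda
#include <cusparse_v2.h>

// Hypothetical helper: multiply an m x n CSR matrix by a dense vector.
int spmv(int m, int n, int nnz,
         const float *csrVal, const int *csrRowPtr, const int *csrColInd,
         const float *x, float *y) {
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);  // defaults: general matrix, zero-based indexing

    const float alpha = 1.0f, beta = 0.0f;
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   csrVal, csrRowPtr, csrColInd, x, &beta, y);

    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
    return 0;
}
```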
cuSPARSE: 5x Faster than MKL
[Chart: speedup over MKL for sparse matrix x dense vector (SpMV), up to ~5x across test matrices]
Average of s/c/d/z routines. cuSPARSE 6.0 on K40m, ECC ON, input and output data on device. MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz. Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/
cuRAND: Random Number Generation
- Generating high-quality random numbers in parallel is hard; don't do it yourself, use a library!
- Pseudo- and quasi-RNGs
- Supports several output distributions
- Statistical test results in documentation
New in CUDA 6: Mersenne Twister 19937
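A minimal host-API sketch, filling a device buffer with uniform numbers from the MT19937 generator added in CUDA 6 (buffer size is an example value; error checking omitted):

```cuda
#include <curand.h>

int main() {
    const size_t n = 1 << 24;   // example buffer size
    float *devData;
    cudaMalloc(&devData, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);  // new in CUDA 6
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, devData, n);  // n uniform floats in (0, 1]

    curandDestroyGenerator(gen);
    cudaFree(devData);
    return 0;
}
```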
cuRAND: Up to 75x Faster vs. Intel MKL
[Chart: Gsamples/sec for Sobol32 and MRG32k3a, cuRAND vs. MKL, under uniform, normal, and log-normal distributions]
cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device. MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz.
cuRAND: High Performance RNGs
[Chart: Gsamples/sec under uniform, normal, and log-normal distributions for pseudo-random generators (XORWOW, Philox, MRG32k3a, MTGP32) and quasi-random generators (Sobol32, Scrambled Sobol32, Sobol64, Scrambled Sobol64)]
cuRAND 6.0 on K40m, ECC ON, double-precision input and output data on device.
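For generation inside kernels, the device-side API can be sketched with one of the generators charted above; the one-state-per-thread pattern shown here is a common convention, not the only option:

```cuda
#include <curand_kernel.h>

// Initialize one XORWOW state per thread.
__global__ void setup(curandState *states, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);  // seed, subsequence, offset
}

// Draw one uniform sample per thread from its own state.
__global__ void generate(curandState *states, float *out, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) out[id] = curand_uniform(&states[id]);
}
```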
NPP: NVIDIA Performance Primitives
- Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation
New in CUDA 6: over 500 new routines, including median filter, BGR/YUV conversion, 3D LUT color conversion, improvements to JPEG primitives, plus many more
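Calling an NPP image primitive can be sketched as below: fill an 8-bit single-channel ROI with a constant. The image size is an example value; nppiMalloc_8u_C1 returns pitched device memory along with its line step, and error checking is omitted:

```cuda
#include <npp.h>

int main() {
    NppiSize roi = {640, 480};   // example image size
    int step;                    // line step (pitch) in bytes, set by NPP
    Npp8u *img = nppiMalloc_8u_C1(roi.width, roi.height, &step);

    nppiSet_8u_C1R(128, img, step, roi);  // set every pixel in the ROI to 128

    nppiFree(img);
    return 0;
}
```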
NPP: Speedup vs. Intel IPP
[Chart: speedups from 5.7x to 28.5x for Image Set (8-bit RGB), Image Set Channel (8-bit RGB), Image Resize (8-bit RGB), Image Gaussian Filter (32-bit float), Color Conversion (8-bit YUV422 to 8-bit RGB), and JPEG 8x8 Forward DCT]
NPP 6.0 on K40m, input and output data on device. IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.
Thrust: CUDA C++ Template Library
- Template library for CUDA C++
- Host and device containers that mimic the C++ STL
- Optimized algorithms for sort, reduce, scan, etc.
- OpenMP backend for portability
- Also available on GitHub: thrust.github.com
- Allows applications and prototypes to be built quickly
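The STL-like containers and algorithms described above can be sketched as follows, sized like the 32M-integer workload benchmarked on the next slide (fill step elided; error handling omitted):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

int main() {
    // Device container that mimics std::vector; 32M integers.
    thrust::device_vector<int> d(32 << 20);
    // ... fill d, e.g. by assigning from a thrust::host_vector<int> ...

    thrust::sort(d.begin(), d.end());   // parallel sort on the GPU
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL);  // parallel reduction
    (void)sum;
    return 0;
}
```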
Thrust: Performance vs. Intel TBB
[Charts: Thrust vs. TBB on 32M integers (reduce, transform, scan, sort) and Thrust sort vs. TBB on 32M samples (char, short, int, long, float, double), with speedups ranging from roughly 5x to 44x]
Thrust v1.7.1 on K40m, ECC ON, input and output data on device. TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.
math.h: C99 Floating-point Library + Extras
- CUDA math.h is industry proven, high performance, accurate
- Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
- Exponentials: exp, exp2, log, log2, log10, ...
- Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
- Special functions: lgamma, tgamma, erf, erfc
- Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
- Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in CUDA 6:
- Over 80 new SIMD instructions, useful for video processing: __v*2, __v*4
- Cylindrical Bessel: cyl_i{0,1}
- 1/hypotenuse: rhypot
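A few of the extras above can be sketched inside a kernel; sincospif and rsqrtf are the single-precision forms of the CUDA-specific functions, and normcdff is the single-precision normal CDF (the kernel itself is illustrative):

```cuda
// Illustrative kernel combining several math.h extras.
__global__ void demo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c;
        sincospif(in[i], &s, &c);        // sin(pi*x) and cos(pi*x) in one call
        out[i] = rsqrtf(s * s + c * c)   // reciprocal square root
               + normcdff(in[i]);        // Phi(x), standard normal CDF
    }
}
```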