CUDA 6.5 Performance Report

Size: px

Start display at page:

Download "CUDA 6.5 Performance Report"

Brendan Eaton
6 years ago
Views:

1 CUDA 6.5 Performance Report 1

2 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number Generation (RNG) Library NPP Performance Primitives for Image & Video Processing Thrust Templated Parallel Algorithms & Data Structures math.h C99 floating-point Library cudnn Deep Neural Net building blocks Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries 2

cufft: Multi-dimensional FFTs Real and complex

3 cufft: Multi-dimensional FFTs Real and complex Single- and double-precision data types 1D, 2D and 3D batched transforms Flexible input and output data layouts XT interface supports dual-gpu cards New in CUDA 6.5 Device callbacks optimize use cases such as FFT + datatype conversion 6

4 cufft in 4 easy steps Step 1 Allocate space on GPU memory Step 2 Create plan specifying transform configuration like the size and type (real, complex, 1D, 2D and so on). Step 3 Execute the plan as many times as required, providing the pointer to the GPU data created in Step 1. Step 4 Destroy plan, free GPU memory 7

5 #include <stdlib.h> #include <stdio.h> #include cufft.h #define NX 256 #define NY 128 Example cufft Program /* Use the CUFFT plan to transform the signal out of place. * Note: idata!= odata indicates an out of place * transformation to CUFFT at execution time. */ cufftexecc2c(plan, idata, odata, CUFFT_FORWARD); int main() { cuffthandle plan; cufftcomplex *idata, *odata; int i; cudamalloc((void**)&idata, sizeof(cufftcomplex)*nx*ny); cudamalloc((void**)&odata, sizeof(cufftcomplex)*nx*ny); /* create a 2D FFT plan */ cufftplan2d(&plan, NX, NY, CUFFT_C2C); /* Inverse transform the signal in place */ cufftexecc2c(plan, odata, odata, CUFFT_INVERSE); /* Destroy the CUFFT plan */ cufftdestroy(plan); cudafree(idata); cudafree(odata); return 0 } /* initialize idata */ Check out the CUDA SDK for more 8

6 GFLOPS GFLOPS cufft: up to 700 GFLOPS Single Precision Double Precision log2(transform_size) log2(transform_size) 1D Complex, Batched FFTs Used in Audio Processing and as a Foundation for 2D and 3D FFTs Performance may vary based on OS version and motherboard configuration cufft 6.5 on K40c, ECC ON, 32M elements, input and output data on device Excludes time to create cufft plans 9

7 GFLOPS GFLOPS cufft: Consistently High Performance 800 Single Precision 350 Double Precision Powers of 2 Powers of 3 Powers of 5 Powers of E+0 1E+3 1E+6 1E+9 Transform Size Performance may vary based on OS version and motherboard configuration 0 1E+0 1E+3 1E+6 1E+9 Transform Size 1D Complex, Batched FFTs Used in Audio Processing and as a Foundation for 2D and 3D FFTs cufft 6.5 on K40c, ECC ON, 28M-33M elements, input and output data on device Excludes time to create cufft plans 10

8 Execution Time (ms) Execution Time (ms) cufft-xt on K10: 30% Faster on Large 3D FFTs cufft cufft-xt 256x256x256 0 cufft cufft-xt 512x512x512 Performance may vary based on OS version and motherboard configuration cufft 6.5 on K10, ECC ON, input and output data on device Excludes time to create cufft plans 11

9 cufft User-Defined Callbacks Without Callbacks: 3 kernels, 3 memory roundtrips Read input Convert to 32-bit Write 32-bit Read Perform FFT Write Read FFT output Convert to 8-bit Write 8- bit data User-defined Kernel cufft Internal Kernel User-defined Kernel With Callbacks: 1 kernel, 1 memory roundtrip Read 8- bit input Convert to 32-bit Perform FFT Convert to 8-bit Write output Merged into a single kernel 12

10 Normalized Performance Performance of cufft Callbacks Performance of single-precision complex cufft on 8-bit complex input and output datasets, including data conversion time 2.5x 2.0x 1.5x 1.0x 0.5x 0.0x cufft with separate kernels for data conversion cufft with callbacks for data conversion cufft 6.5 on K40, ECC ON, 512 1D C2C forward trasforms, 32M total elements Input and output data on device, excludes time to create cufft plans 13

11 cublas: Dense Linear Algebra on GPUs Complete BLAS implementation plus useful extensions Supports all 152 standard routines for single, double, complex, and double complex Host and device-callable interface XT Interface for Level 3 BLAS Distributed computations across multiple GPUs Out-of-core streaming to GPU, no upper limit on matrix size Drop-in BLAS intercepts CPU BLAS calls, streams to GPU 14

12 SGEMM SSYMM STRSM SSYRK CGEMM CSYMM CTRSM CSYRK DGEMM DSYMM DTRSM DSYRK ZGEMM ZSYMM ZTRSM ZSYRK GFLOPS cublas: >3 TFLOPS single-precision >1 TFLOPS double-precision 0 Single Single Complex Double Double Complex Performance may vary based on OS version and motherboard configuration cublas 6.5 on K40m, ECC ON, input and output data on device m=n=k=4096, transpose=no, side=right, fill=lower 15

13 GFLOPS cublas: ZGEMM 6x Faster than MKL cublas MKL Matrix Dimension (m=n=k) Performance may vary based on OS version and motherboard configuration cublas 6.5 on K40m, ECC ON, input and output data on device MKL on Intel IvyBridge single socket 12-core E GHz 16

14 SGEMM SSYRK STRSM CGEMM CSYRK DGEMM DSYRK DTRSM ZGEMM ZSYRK ZTRSM GFLOPS cublas-xt: Automatic Performance Boost >10 TF on a single node x GPU 4x GPUs 0 Single Single Complex Double Double Complex Performance may vary based on OS version and motherboard configuration cublas 6.5 on K40, double precision mode Input and output data on device, m=n=k=16384, transpose=no 17

15 cusparse: Sparse linear algebra routines Optimized sparse linear algebra BLAS routines matrixvector, matrix-matrix, triangular solve Support for variety of formats (CSR, COO, block variants) Incomplete-LU and Cholesky preconditioners y 1 y 2 y \alpha + \beta 4.0 y y 1 y 2 y 3 y 4 18

16 Speedup over MKL cusparse: 5x Faster than MKL 6x Sparse Matrix x Dense Vector (SpMV) 5x 4x 3x 2x 1x 0x Performance may vary based on OS version and motherboard configuration Average of s/c/d/z routines cusparse 6.5 on K40m, ECC ON, input and output data on device MKL on Intel IvyBridge single socket 12-core E GHz Matrices obtained from: 19

17 curand: Random Number Generation Generating high quality random numbers in parallel is hard Don t do it yourself, use a library! Pseudo- and Quasi-RNGs Mersenne Twister Supports several output distributions Statistical test results in documentation 20

18 GSamples / sec curand: Up to 70x Faster vs. Intel MKL curand MKL Sobol32 MRG32k3a Sobol32 MRG32k3a Sobol32 MRG32k3a Uniform Distribution Normal Distribution Log-Normal Distribution Performance may vary based on OS version and motherboard configuration curand 6.5 on K40c, ECC ON, double-precision input and output data on device MKL on Intel IvyBridge single socket 12-core E GHz 21

19 Gsamples / sec curand: High Performance RNGs XORWOW Philox MRG32k3a MTGP32 Sobol32 Scrambled Pseudo-random Uniform Distribution Normal Distribution Log-Normal Distribution Quasi-random Sobol64 Scrambled Performance may vary based on OS version and motherboard configuration curand 6.5 on K40m, ECC ON, double precision input and output data on device 22

filters, image & signal statistics, image & signal arithmetic, JPEG building

20 NPP: NVIDIA Performance Primitives Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation, median filter, BGR/YUV conversion, 3D LUT color conversion 23

21 NPP Speedup vs. Intel IPP 35x 30x 25x 20x 15x 28.9x 10x 5x 0x 5.7x Image Set (8-bit RGB) 20.4x Image Set Channel (8-bit RGB) 22.2x Image Resize (8-bit RGB) 6.4x Image Gaussian Filter (32-bit float) 15.2x Color Conversion 8-bit YUV422 to 8-bit RGB JPEG 8x8 Forward DCT Performance may vary based on OS version and motherboard configuration NPP 6.5 on K40m, input and output data on device IPP 7.0 on Intel IvyBridge single socket 12-core E GHz 24

22 Image Processing Function Group NPP Speedup vs. Intel IPP Filter Statistics Geometry Transforms JPEG Morphological Linear Transform Color Processing Color Conversion Threshold And Compare Alpha Composition Logical Arithmetic Data Exchange And Initialization 4.1x 6.4x 19.1x 22.0x 7.7x 10.0x 109.6x 11.2x 6.7x 15.3x 8.8x 17.1x 15.9x 0x 5x 10x 15x 20x 25x Performance may vary based on OS version and motherboard configuration NPP 6.5 on K40m, input and output data on device Each bar represents the average speedup over all routines in the function group IPP 7.0 on Intel IvyBridge single socket 12-core E GHz 25

23 CUDA C++ Template Library Template library for CUDA C++ Host and Device Containers that mimic the C++ STL Optimized Algorithms for sort, reduce, scan, etc. OpenMP Backend for portability Also available on github: thrust.github.com Allows applications and prototypes to be built quickly 26

24 Speedup Speedup Thrust Performance vs. Intel TBB 14x Thrust vs. TBB on 32M integers 50x Thrust Sort vs. TBB on 32M samples 12x 10x 40x 8x 30x 6x 20x 4x 2x 10x 0x reduce transform scan sort 0x char short int long float double Performance may vary based on OS version and motherboard configuration Thrust v1.7.2 on K40m, ECC ON, input and output data on device TBB 4.2 on Intel IvyBridge single socket 12-core E GHz 27

25 math.h: C99 floating-point library + extras CUDA math.h is industry proven, high performance, accurate Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes) Exponentials: exp, exp2, log, log2, log10,... Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh,... Special functions: lgamma, tgamma, erf, erfc Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs,... Bessel: j0, j1, jn, y0, y1, yn, cyl_bessel_i0, cyl_bessel_i1 Vector SIMD: vadd, vsub, vavrg, vabsdiff, vmin, vmax, vset Extras: rsqrt, rhypot, rcbrt, exp10, sinpi, sincos[pi], cospi, erf[c]inv, normcdf[inv] New in CUDA 6.5 Performance improvements [r]hypot[f], cbrt, sqrt, rsqrt, atan[f], exp[10][m1]f 28

26 Double-precision GFLOPS Math Performance Improvements RSQRT Improvement Increases N-Body Performance by 15% Significantly improved doubleprecision functions: [r]sqrt(), [r]hypot(), cbrt(), atan(), acosh() Significantly improved singleprecision functions: [r]hypot(), atanf(), expf(), exp10f(), expm1f() Performance of N-body sample code on Tesla K GFLOPS 801 GFLOPS CUDA 6.0 CUDA 6.5 CUDA 6.0 and CUDA 6.5 on K40m Sample code: 29

27 Introducing NVIDIA cudnn Forward and backward convolution routines tuned for NVIDIA GPUs Will be optimized for all future NVIDIA GPU generations Arbitrary dimension ordering, striding, and subregions for 4d tensors means easy integration into any neural net implementation Forward and backward paths for common layer types ReLu, Sigmoid, Tanh, Pooling, Softmax Download: Contact: 30

(targeting official release with Caffe 1.

28 Using Caffe with cudnn Accelerate Caffe layer types by 1.2 3x Example: AlexNet Layer 2 forward: 1.9x faster convolution, 2.7x faster pooling Integrated into Caffe dev branch today! (targeting official release with Caffe 1.0) Overall AlexNet training time Caffe (GPU) 11x Caffe (cudnn) 14x Caffe (CPU*) 1x *CPU is 24 core 2.4GHz Intel MKL Baseline Caffe compared to Caffe accelerated by cudnn on K40 31

CUDA 6.0 Performance Report. April 2014

CUDA 6.0 Performance Report. April 2014 CUDA 6. Performance Report April 214 1 CUDA 6 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random