CUDA 7.0 Performance Report. May 2015



2 CUDA 7.0 Performance Report

cuFFT: Fast Fourier Transforms Library
cuBLAS: Complete BLAS Library
cuSPARSE: Sparse Matrix Library
cuSOLVER: Linear Solver Library (new in CUDA 7.0)
cuRAND: Random Number Generation (RNG) Library
NPP: Performance Primitives for Image & Video Processing
Thrust: Templated Parallel Algorithms & Data Structures
math.h: C99 floating-point Library
cuDNN: Deep Neural Net building blocks

Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries


3 cuFFT: Multi-dimensional FFTs

Real and complex, single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
Also supports simple drop-in replacement of a CPU FFTW library
XT interface supports up to 4 GPUs
Device callbacks optimize use cases such as FFT + datatype conversion

4 cuFFT: up to 700 GFLOPS

1D Complex, Batched FFTs
Used in Signal Processing and as a Foundation for 2D and 3D FFTs

[Charts: GFLOPS vs. transform size, one panel for single precision and one for double precision; series: powers of 2, powers of 3, powers of 5, and a fourth series whose base was lost in extraction; data points not recoverable]

cuFFT 7.0 on K40m, Base clocks, ECC ON
Batched transforms on 28M-33M total elements, input and output data on device
Excludes time to create cuFFT plans
Performance may vary based on OS and software versions, and motherboard configuration

5 New in CUDA 7.0: cuFFT up to 3x Faster for sizes that are composites of small primes

1D Single Precision Complex-to-Complex Transforms

[Chart: CUDA 7.0 vs. CUDA 6.5 speedup (axis 1x to 5x) vs. transform size; labeled points: Size = 15, Size = 30, Size = 31, Size = 121, Size = 127; data points not recoverable]

cuFFT 6.5 and 7.0 on K20m, ECC ON
Batched transforms on 32M total elements, input and output data on device
Excludes time to create cuFFT plans

6 cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex
Host and device-callable interface
Batched routines for higher performance on small problem sizes
XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
Drop-in BLAS intercepts CPU BLAS calls, streams to GPU

7 cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision

[Chart: GFLOPS by routine. Single: SGEMM, SSYMM, STRSM, SSYRK. Single Complex: CGEMM, CSYMM, CTRSM, CSYRK. Double: DGEMM, DSYMM, DTRSM, DSYRK. Double Complex: ZGEMM, ZSYMM, ZTRSM, ZSYRK. Bar values not recoverable.]

cuBLAS 7.0 on K40m, Base clocks, ECC ON, input and output data on device
m=n=k=4096, transpose=no, side=right, fill=lower

8 cuBLAS-XT: Multi-GPU Performance Scaling

>12 TF on a single node

[Chart: GFLOPS by routine for multiple K80 configurations (legend partly lost; "3xK80" survives). Single: SGEMM, SSYRK, STRSM. Single Complex: CGEMM, CSYRK. Double: DGEMM, DSYRK, DTRSM. Double Complex: ZGEMM, ZSYRK, ZTRSM. Bar values not recoverable.]

cuBLAS 7.0 on K80, Base clocks, ECC ON
Input and output data on host, m=n=k=32768, transpose=no

9 cuSPARSE: Sparse linear algebra routines

[Figure: sparse matrix-vector product of the form y = \alpha A x + \beta y; matrix and vector entries not recoverable]

Optimized sparse linear algebra BLAS routines for matrix-vector, matrix-matrix, triangular solve
Support for variety of formats (CSR, COO, block variants)
Incomplete-LU and Cholesky preconditioners
New in CUDA 7.0: Graph Coloring
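As an illustration of the matrix-vector operation and the CSR format mentioned on this slide, here is a minimal CPU-only C++ sketch of y = alpha * A * x + beta * y. It is not the cuSPARSE API: the `CsrMatrix` struct and the `spmv_csr` function are hypothetical names introduced only for this example, and a GPU library would parallelize the row loop rather than run it sequentially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR container for this sketch (not a cuSPARSE type).
struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row i spans [row_ptr[i], row_ptr[i + 1])
    std::vector<std::size_t> col_idx;  // column index of each stored value
    std::vector<double> values;        // nonzero values, stored row by row
};

// Computes y = alpha * A * x + beta * y on the CPU, one row at a time.
// Each row is independent of the others, which is what a GPU version exploits.
void spmv_csr(const CsrMatrix& A, double alpha, const std::vector<double>& x,
              double beta, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.rows; ++i) {
        double dot = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k) {
            dot += A.values[k] * x[A.col_idx[k]];
        }
        y[i] = alpha * dot + beta * y[i];
    }
}
```

CSR stores only the nonzeros, so the inner loop touches `row_ptr[i + 1] - row_ptr[i]` entries per row instead of a full row of the dense matrix.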

10 cuSPARSE: 4x Faster than MKL

Sparse Matrix x Dense Vector (SpMV)

[Chart: speedup over MKL per matrix (axis 0x to 5x); matrix names and bar values not recoverable]

Average of S/C/D/Z routines
cuSPARSE 7.0 on K40m, Base clocks, ECC ON, input and output data on device
MKL on Intel Xeon Haswell single-socket 16-core E GHz, 3.6GHz Turbo
Matrices obtained from:

11 New in CUDA 7.0: Graph Coloring & Preconditioned CG Solver

Without graph coloring:
Input matrix -> Analysis and ilu0 (cuSPARSE) -> Solve (cuSPARSE)
Execution time of the ilu0 and Solve phases is limited by the available parallelism in the input matrix

With graph coloring:
Input matrix -> Graph coloring (cuSPARSE) -> Reorder matrix (Thrust) -> Analysis and ilu0 (cuSPARSE) -> Solve (cuSPARSE)
Coloring and reordering the matrix exposes more parallelism
ilu0 and Solve phases run faster due to extra parallelism
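The coloring and reordering steps in this pipeline can be sketched on the CPU as follows. This is a simple sequential greedy coloring, not the parallel algorithm cuSPARSE uses, and `Graph`, `greedy_coloring` and `order_by_color` are hypothetical names for this example only. The idea it shows is general: rows that share a color have no edges between them, so once rows are grouped by color, every row within a group can be processed concurrently.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Adjacency list of the matrix graph: adj[i] lists the rows j != i
// coupled to row i by a nonzero entry.
using Graph = std::vector<std::vector<std::size_t>>;

// Sequential greedy coloring: each vertex takes the smallest color
// not already used by one of its neighbors.
std::vector<int> greedy_coloring(const Graph& adj) {
    std::vector<int> color(adj.size(), -1);
    std::vector<char> used;
    for (std::size_t v = 0; v < adj.size(); ++v) {
        used.assign(adj[v].size() + 1, 0);  // degree + 1 colors always suffice
        for (std::size_t u : adj[v]) {
            int c = color[u];
            if (c >= 0 && static_cast<std::size_t>(c) < used.size()) used[c] = 1;
        }
        int c = 0;
        while (used[c]) ++c;
        color[v] = c;
    }
    return color;
}

// Permutation that groups rows by color, keeping the original order within a color.
std::vector<std::size_t> order_by_color(const std::vector<int>& color) {
    std::vector<std::size_t> perm(color.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::stable_sort(perm.begin(), perm.end(),
                     [&](std::size_t a, std::size_t b) { return color[a] < color[b]; });
    return perm;
}
```

Fewer colors means larger independent groups, which is why the quality of the coloring affects how much parallelism the later ilu0 and Solve phases can use.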

12 New in CUDA 7.0: cuSPARSE Graph Coloring Speeds Up Factorization

Speedup of Incomplete LU Factorization (ILU0) after Graph Coloring

[Chart: speedup per matrix; recovered labels: 20x, 28x, 9x, and axis ticks 0x to 6x; matrix names and remaining bar values not recoverable]

Full results at: research.nvidia.com/publication/parallel-graph-coloring-applications-incomplete-lu-factorization-gpu
cuSPARSE 7.0 on K40c
Matrices obtained from:

13 New in CUDA 7.0: cuSOLVER

cusolverDN: Dense Cholesky, LU, SVD, (batched) QR
Applications: Optimization, Computer vision, CFD

cusolverSP: Sparse direct solvers & Eigensolvers
Applications: Newton's method, Oil & Gas Well Models

cusolverRF: Sparse refactorization solver
Applications: Chemistry, ODEs, Combustion, Circuit simulation

14 New in CUDA 7.0: cuSOLVER Dense vs. MKL

[Chart: GFLOPS, cuSOLVER vs. MKL. Cholesky Factorization: SPOTRF, DPOTRF, CPOTRF, ZPOTRF. LU Factorization: SGETRF, DGETRF, CGETRF, ZGETRF. QR Factorization: SGEQRF, DGEQRF, CGEQRF, ZGEQRF. Bar values not recoverable.]

cuSOLVER 7.0 on K40c, ECC ON, M=N=4096
MKL on Intel Xeon Haswell 14-core E GHz


15 New in CUDA 7.0: cuSOLVER Sparse QR

Analysis, Factorization and Solve

[Chart: speedup over CPU (axis 0x to 12x) for matrices 1138_bus, Chem97ZtZ, Muu, ex9, nasa1824; bar labels recovered: 11.3x, 2.0x, 1.9x, 1.4x, 1.2x (pairing of labels to matrices not recoverable)]

cuSOLVER 7.0 on K40c, ECC ON
SuiteSparse v4.4 on Intel Xeon Haswell 14-core E GHz
Matrices obtained from:

16 cuRAND: Random Number Generation

Generating high quality random numbers in parallel is hard
Don't do it yourself, use a library!
Pseudo- and Quasi-RNGs
Mersenne Twister
Supports several output distributions
Statistical test results in documentation

17 cuRAND: > 50x Faster vs. Intel MKL

[Chart: GSamples / sec, cuRAND vs. MKL, for Sobol32 and MRG32k3a under Uniform, Normal and Log-Normal distributions; bar values not recoverable]

cuRAND 7.0 on K40m, Base clocks, ECC ON, double-precision input and output data on device
MKL on Intel Xeon Haswell single-socket 16-core E GHz, 3.6GHz Turbo

18 cuRAND: High Performance RNGs

[Chart: GSamples / sec under Uniform, Normal and Log-Normal distributions. Pseudo-random: XORWOW, Philox, MRG32k3a, MTGP32. Quasi-random: Sobol32 Scrambled, Sobol64 Scrambled. Bar values not recoverable.]

cuRAND 7.0 on K40m, Base clocks, ECC ON, double-precision input and output data on device

19 Thrust: CUDA C++ Parallel Template Library

Parallel Template library for CUDA C++
Host and Device Containers that mimic the C++ STL
Optimized Parallel Algorithms for sort, reduce, scan, etc.
TBB and OpenMP CPU Backends
Performance portable
Also available on GitHub: thrust.github.com
Allows applications and prototypes to be built quickly

20 New in CUDA 7.0: Thrust Performance Improvements

sort: 1.1x to 1.8x faster (3x for user-defined types)
merge: 2x faster
scan: 1.15x faster
reduce_by_key: 1.25x faster

Thrust Sort, 7.0 vs. 6.5 (32M samples)
[Chart: speedup (axis 0.0x to 2.0x) by key type: char, short, int, long, float, double; bar labels recovered: 1.7x, 1.8x, 1.2x, 1.3x, 1.1x, 1.1x (pairing of labels to types not certain)]

CUDA streams argument (concurrency between threads):
thrust::count_if(thrust::cuda::par.on(stream1), text, text+n, myfunc());

CUDA 7.0 and 6.5 on K40m, ECC ON, input and output data on device
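The slide does not show what `myfunc` looks like. Because Thrust algorithms mirror their C++ STL counterparts, the shape of the call can be sketched on the CPU with `std::count_if`. In the sketch below the predicate body (counting space characters) and the wrapper name `count_matching` are assumptions made purely for illustration; in the Thrust call, the extra first argument `thrust::cuda::par.on(stream1)` selects the CUDA stream, and a functor used on the device would also need to be callable from device code.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical predicate standing in for the slide's myfunc:
// here it returns true for space characters.
struct myfunc {
    bool operator()(char c) const { return c == ' '; }
};

// CPU analog of the slide's call: counts the characters in
// [text, text + n) for which the predicate returns true.
std::ptrdiff_t count_matching(const char* text, std::size_t n) {
    return std::count_if(text, text + n, myfunc());
}
```

Calling `count_matching` on the five characters of "a b c" counts its two spaces.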


21 NPP: NVIDIA Performance Primitives

Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation, median filter, BGR/YUV conversion, 3D LUT color conversion

22 NPP Speedup vs. Intel IPP

[Chart: speedup (axis 0x to 25x) for six operations: Image Set (8-bit RGB), Image Set Channel (8-bit RGB), Image Resize (8-bit RGB), Image Gaussian Filter (32-bit float), Color Conversion 8-bit YUV422 to 8-bit RGB, JPEG 8x8 Forward DCT; bar labels recovered: 19.9x, 22.4x, 19.9x, 10.7x, 5.0x, 5.3x (pairing of labels to operations not recoverable)]

NPP 7.0 on K40m, ECC ON, Base clocks, input and output data on device
IPP 7.0 on Intel Xeon Haswell single-socket 16-core E GHz, 3.6GHz Turbo

23 NPP Speedup vs. Intel IPP, by Image Processing Function Group

[Chart: speedup (axis 0x to 25x) for thirteen function groups: Filter, Statistics, Geometry Transforms, JPEG, Morphological, Linear Transform, Color Processing, Color Conversion, Threshold And Compare, Alpha Composition, Logical, Arithmetic, Data Exchange And Initialization; bar labels recovered: 4.0x, 5.5x, 9.1x, 15.2x, 5.5x, 7.6x, 71.6x, 8.9x, 4.1x, 10.8x, 5.9x, 12.3x, 9.8x (pairing of labels to groups not recoverable)]

Each bar represents the average speedup over all routines in the function group
NPP 7.0 on K40m, ECC ON, Base clocks, input and output data on device
IPP 7.0 on Intel Xeon Haswell single-socket 16-core E GHz, 3.6GHz Turbo

24 math.h: C99 floating-point library + extras

CUDA math.h is industry proven, high performance, accurate
Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
Exponentials: exp, exp2, log, log2, log10,...
Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh,...
Special functions: lgamma, tgamma, erf, erfc
Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs,...
Bessel: j0, j1, jn, y0, y1, yn, cyl_bessel_i0, cyl_bessel_i1
Vector SIMD: vadd, vsub, vavrg, vabsdiff, vmin, vmax, vset
Extras: rsqrt, rhypot, rcbrt, exp10, sinpi, sincos[pi], cospi, erf[c]inv, normcdf[inv]

New in CUDA 7.0:
Significantly optimized double precision reciprocal rcp()
3D/4D Euclidean Norms: [r]norm3d[f], norm4d[f]
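For readers unfamiliar with the new Euclidean norm functions, the following CPU-only C++ sketch spells out what they compute mathematically: norm3d is sqrt(a^2 + b^2 + c^2), rnorm3d is its reciprocal, and norm4d extends the sum to four terms. The `_ref` function names are hypothetical and exist only in this example. These are naive reference formulas, so they make no claim about the accuracy, overflow handling, or speed of the CUDA device functions.

```cpp
#include <cmath>

// Reference for norm3d: length of the vector (a, b, c).
// Naive formula; squaring very large or very small inputs can overflow or underflow.
double norm3d_ref(double a, double b, double c) {
    return std::sqrt(a * a + b * b + c * c);
}

// Reference for rnorm3d: reciprocal of the 3D Euclidean norm.
double rnorm3d_ref(double a, double b, double c) {
    return 1.0 / norm3d_ref(a, b, c);
}

// Reference for norm4d: length of the vector (a, b, c, d).
double norm4d_ref(double a, double b, double c, double d) {
    return std::sqrt(a * a + b * b + c * c + d * d);
}
```

A dedicated reciprocal-norm function is useful for vector normalization, where each component is multiplied by 1/|v| instead of divided by |v|.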

25 NVIDIA releases cuDNN Version 2

Accelerates key routines to improve performance of neural net training
Up to 1.8x faster on AlexNet than a baseline GPU implementation
New support for 3D convolutions
Integrated into all major Deep Learning frameworks: Caffe, Theano, Torch

[Chart: Images Trained Per Day (Caffe AlexNet), in millions of images; configurations: 16 Core CPU (E GHz), GTX Titan, Titan Black with cuDNN v1, Titan X with cuDNN v2; recovered labels: 18M, 23M, 43M (remaining values and pairing not recoverable)]

[Chart: Baseline (GPU) vs. With cuDNN for Caffe (GoogLeNet) and Torch (OverFeat); recovered labels: 1.6x, 1.0x, 1.0x, 1.2x (pairing not certain)]

26 CUDA Registered Developer Program

Sign up for free at:
Exclusive access to pre-release CUDA Installers
Submit bugs and feature requests to NVIDIA
Keep informed about latest releases and training opportunities
Access to exclusive downloads
Exclusive activities and special offers
