GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group


1 GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group

2 Why CUDA? Mainstream: massively parallel programming, over 300 million CUDA-capable GPUs sold, runs on GPU and CPU (PGI CUDA-x86). Additional compute resource: significant speedups of parallel computing tasks, offloads tasks from the CPU. Efficient: great perf/watt and perf/$ ratios.

3 Tesla GPUs Power 3 of Top 5 Supercomputers. #1: Tianhe-1A, 7168 Tesla GPUs, 2.5 PFLOPS. #3: Nebulae, 4650 Tesla GPUs, 1.2 PFLOPS. #4: Tsubame, Tesla GPUs, PFLOPS. "We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation." -- Premier Wen Jiabao, public comments acknowledging Tianhe-1A

4 World's Fastest MD Simulation: sustained performance of 1.87 petaflops. Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS). MD simulation of crystalline silicon; used all 7168 Tesla GPUs on the Tianhe-1A GPU supercomputer.

5 GPU Computing Research & Education. World-class research leadership and teaching: University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University, Georgia Institute of Technology. Proven research vision: Johns Hopkins University, Mass. Gen. Hospital/NE Univ, Nanyang University, North Carolina State University, Swinburne University of Tech., CSIRO, TU Munich, SINTEF, UCLA, HP Labs, University of New Mexico, ICHEC, University of Warsaw-ICM, Barcelona Supercomputing Center, Clemson University, VSB-Technical University of Ostrava (Czech Republic), Uni Bonn/Fraunhofer SCAI, TU Darmstadt, Karlsruhe Institute of Technology, and more coming shortly. Academic partnerships / fellowships. GPGPU education at 350+ universities.

6 Commercial Apps Accelerated by GPUs. Molecular dynamics: AMBER, CHARMM, DL_POLY, GROMACS, LAMMPS, NAMD. Fluid dynamics: Altair AcuSolve, Autodesk Moldflow, OpenFOAM, Prometech Particleworks, Turbostream. Earth sciences: ASUCA, HOMME, NASA GEOS-5, NOAA NIM, WRF. Engineering simulation: Agilent EMPro, ANSYS Mechanical, ANSYS Nexxim, CST Microwave Studio, Impetus AFEA, Remcom XFdtd, SIMULIA Abaqus. Others: GADGET2, MATLAB, Mathematica, NBODY, Paradigm VoxelGeo, PARATEC, Schlumberger Petrel.

7 Rich Toolchain & Ecosystem for Fast Ramp-up on GPUs. Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA. Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView. GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python. Parallelizing compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP. Libraries: BLAS, FFT, LAPACK, NPP, Sparse, Imaging, RNG. GPGPU consultants & training, OEM solution providers: ANEO, GPU Tech.

8 GPU COMPUTING FLOW

9-12 Processing Flow (PCIe Bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Data is persistent in GPU memory; while the GPU is computing, data can also be copied
4. Copy results from GPU memory to CPU memory
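
A minimal host-side sketch of this four-step flow (the kernel name process is hypothetical; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n);                   // hypothetical kernel

    void run(float *h_in, float *h_out, int n)
    {
        float *d_data;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_data, bytes);                       // allocate GPU memory
        cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);   // 1. copy input CPU -> GPU

        process<<<(n + 255) / 256, 256>>>(d_data, n);              // 2. launch GPU program

        // 3. d_data stays resident in GPU memory between kernel launches

        cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);  // 4. copy results GPU -> CPU
        cudaFree(d_data);
    }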

13 NVIDIA GPUDirect (new with CUDA 4.0). Efficient access to GPU memory for 3rd-party devices over the PCIe bus and InfiniBand NICs (Mellanox & QLogic). GPU peer-to-peer memory access, transfers & synchronization. MPI implementations natively support GPU data transfers (MVAPICH2-GPU, OpenMPI).

14 GPU ARCHITECTURE

15 GPU Architecture: Two Main Components.
Global memory: analogous to RAM in a CPU server; accessible by both GPU and CPU; currently up to 6 GB; bandwidth currently up to 150 GB/s for Quadro and Tesla products; ECC on/off option for Quadro and Tesla products.
Streaming Multiprocessors (SMs): perform the actual computations; each SM has its own control units, registers, execution pipelines, and caches.
[Diagram: SMs around an L2 cache, with GigaThread scheduler, host interface, and six DRAM interfaces]

16 GPU Architecture Fermi: Streaming Multiprocessor (SM). 32 CUDA cores per SM: 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock. 2 warp schedulers; up to 1536 threads concurrently. 4 special-function units. 64 KB shared memory + L1 cache. 32K 32-bit registers. [Diagram: SM with instruction cache, dual scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable cache/shared memory, uniform cache]

17 GPU Architecture Fermi: CUDA Core. Floating-point & integer unit: IEEE 754 floating-point standard; fused multiply-add (FMA) instruction for both single and double precision. Logic unit. Move, compare unit. Branch unit. [Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, and result queue]

18 GPU Architecture Fermi: Memory System. L1: 16 or 48 KB per SM, selectable by the program; hardware-managed; aggregate bandwidth per GPU: 1.03 TB/s. Shared memory: user-managed scratch-pad; hardware will not evict until threads overwrite; 16 or 48 KB per SM (the 64 KB total is split between shared memory and L1); aggregate bandwidth per GPU: 1.03 TB/s.
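
The 16/48 KB split is selected per kernel from host code. A sketch using the runtime API (myKernel is a placeholder name):

    // Prefer 48 KB shared memory / 16 KB L1 for this kernel
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or prefer 48 KB L1 / 16 KB shared memory instead
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);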

19 GPU Architecture Fermi: Memory System. ECC protection: DRAM ECC supported for GDDR5 memory. All major internal memories are ECC protected: register file, L1 cache, L2 cache.

20 CUDA PROGRAMMING MODEL

21 Anatomy of a CUDA C/C++ Application. Serial code executes in a host (CPU) thread. Parallel code executes in many device (GPU) threads across multiple processing elements. [Diagram: application alternating between serial host (CPU) code and parallel device (GPU) code]

22 Compiling CUDA C Applications. Starting point, a serial C application:

    void serial_function(...) { ... }
    void other_function(int ...) { ... }
    void saxpy_serial(float ...)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    void main(...)
    {
        float x;
        saxpy_serial(...);
        ...
    }

Modify it into parallel CUDA C code. NVCC (Open64) compiles the CUDA C functions into CUDA object files; the CPU compiler compiles the rest of the C application into CPU object files; the linker combines both into a single CPU-GPU executable.

23 CUDA C: C with a few keywords.

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

24 CUDA C: C with a few keywords. Kernel: a function called by the host that executes on the GPU. It can only access GPU memory; no variable number of arguments; no static variables; no recursion before Fermi (since Fermi, recursion is supported). Functions must be declared with a qualifier: __global__: GPU kernel function, launched by the CPU, must return void. __device__: can be called from GPU functions. __host__: can be called from CPU functions (default). The __host__ and __device__ qualifiers can be combined.
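
A short sketch of the qualifiers (function names are illustrative):

    __device__ float square(float x)           // callable from GPU code only
    {
        return x * x;
    }

    __host__ __device__ float twice(float x)   // compiled for both CPU and GPU
    {
        return 2.0f * x;
    }

    __global__ void kernel(float *out)         // launched by the CPU, runs on the GPU, returns void
    {
        out[threadIdx.x] = square(twice(1.0f));
    }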

25 CUDA Kernels. The parallel portion of an application executes as a kernel: the entire GPU executes the kernel, with many threads. CUDA threads are lightweight, switch fast, and execute in the 1000s simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.

26 CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU as an array of threads in parallel. All threads execute the same code but can take different paths:

    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;

Each thread has an ID, used to select input/output data and make control decisions.

27-31 CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks. Blocks are grouped into a grid. A kernel is executed on the GPU as a grid of blocks of threads.
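
Grids and blocks can be multi-dimensional; a sketch of a 2D launch with dim3 (kernel and variable names are illustrative):

    __global__ void fill(float *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global 2D coordinates
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 1.0f;
    }

    // Launch: a 2D grid of 16x16-thread blocks covering the image
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_img, width, height);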

32 Kernel Execution. Each CUDA thread is executed by a CUDA core. Each thread block is executed by one SM and does not migrate; several concurrent blocks can reside on one SM, depending on the blocks' resource requirements and the SM's available resources. Each kernel grid is executed on one device; multiple kernels can execute on a device at one time.

33 Thread blocks allow cooperation. Threads may need to cooperate: cooperatively load/store blocks of memory that all will use; share results with each other or cooperate to produce a single result; synchronize with each other. [Diagram: SM with instruction cache, schedulers, register file, cores, load/store units, SFUs, interconnect network, and configurable cache/shared memory]
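
A minimal sketch of such cooperation: threads stage data in shared memory, synchronize, then read values written by other threads (assumes a 1D block of 256 threads):

    __global__ void reverseBlock(int *d)
    {
        __shared__ int s[256];             // one element per thread
        int t = threadIdx.x;

        s[t] = d[t];                       // cooperative load into shared memory
        __syncthreads();                   // wait until every thread has written

        d[t] = s[blockDim.x - 1 - t];      // read an element another thread stored
    }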

34 Thread blocks allow scalability. Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs. [Diagram: the same 8-block kernel grid scheduled onto a device with 2 SMs and onto a device with 4 SMs]

35 CUDA MEMORY SYSTEM

36-38 Memory hierarchy. Per thread: registers and local memory. Per block of threads: shared memory.

39 Memory hierarchy: Shared memory.

    __shared__ int a[size];

Allocated per thread block, same lifetime as the block. Accessible by any thread in the block. Latency: a few cycles. High aggregate bandwidth: 14 SMs * 32 banks * 4 B * 1.15 GHz / 2 = 1.03 TB/s. Several uses: sharing data among threads in a block; user-managed cache (reducing gmem accesses).

40 Memory hierarchy (full picture). Per thread: registers and local memory. Per block of threads: shared memory. Across all blocks: global memory.

41 Memory hierarchy: Global memory. Accessible by all threads of any kernel. Data lifetime: from allocation to deallocation by host code:

    cudaMalloc(void **pointer, size_t nbytes)
    cudaMemset(void *pointer, int value, size_t count)
    cudaFree(void *pointer)

Latency: hundreds of cycles. Bandwidth: 156 GB/s. Note: there are requirements on the access pattern to reach peak performance.

42 HARDWARE DETAILS

43 Warps: Vectors of Threads. A warp is a group of 32 threads that share one program counter (PC). Threads of a warp execute in lock-step (similar to a vector processor). Divergence within warps is supported at the HW level through automatic serialization / instruction replays. This improves maintainability and flexibility over explicit vector programming. [Diagram: one warp spanning ThreadIdx 0-31]

44-46 Divergence Examples: No Serialization (Warp 0, Warp 1)

    // ...
    if (threadIdx.x < 32) {
        // Do something
    } else {
        // Do something else
    }
    // ...

The branch condition is uniform within each warp (warp 0 takes the if-branch, warp 1 the else-branch), so no serialization occurs.

47-50 Divergence Examples: Serialization (Warp 0, Warp 1)

    // ...
    if (threadIdx.x < 16) {
        // Do something
    } else {
        // Do something else
    }
    // ...

Here warp 0 diverges (threads 0-15 take the if-branch, threads 16-31 the else-branch), so both paths are serialized within that warp.

51 Memory Operations. Memory operations are issued per warp (32 threads): the threads in a warp provide memory addresses; the hardware determines which lines/segments are needed and requests them. Hardware supports two types of loads. Full caching (default mode): attempts to hit in L1, then L2, then GMEM; load granularity is a 128-byte line. L2-only: compile with the -Xptxas -dlcm=cg option to nvcc; attempts to hit in L2, then GMEM; load granularity is 32 bytes. Stores: invalidate L1, write-back for L2.

52 Caching Load. Warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 1 cache line. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

53 L2-only Load. Warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 4 segments. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

54 Caching Load. Warp requests 32 aligned, permuted 4-byte words. Addresses fall within 1 cache line. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

55 L2-only Load. Warp requests 32 aligned, permuted 4-byte words. Addresses fall within 4 segments. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

56 Caching Load. Warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within 2 cache lines. Warp needs 128 bytes; 256 bytes move across the bus on misses. Bus utilization: 50%.

57 L2-only Load. Warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within at most 5 segments. Warp needs 128 bytes; 160 bytes move across the bus on misses. Bus utilization: at least 80% (some misaligned patterns fall within 4 segments, giving 100% utilization).

58 Optimization Strategies SHARED MEMORY

59 Shared Memory. Organization: 32 banks, 4 bytes wide each; successive 4-byte words belong to different banks. Performance: 4 bytes per bank per 2 clocks per multiprocessor; smem accesses are issued per 32 threads (warp). Serialization: if n threads of the 32 access different 4-byte words in the same bank, the n accesses are executed serially. Multicast: n threads accessing the same word get it in one fetch (they may even read different bytes within the same word). Prior to Fermi, only broadcast was available, and sub-word accesses within the same bank caused serialization.

60 Bank Addressing Examples: No Bank Conflicts. [Diagram: two access patterns, each mapping threads 0-31 one-to-one onto banks 0-31, so no conflicts occur]

61 Bank Addressing Examples: 2-way and 8-way Bank Conflicts. [Diagram: left, pairs of threads share a bank (2-way conflict); right, groups of eight threads share a bank (8-way conflict, marked x8)]

62 Shared Memory: Avoiding Bank Conflicts. A 32x32 float array in SMEM: when a warp accesses a column, all 32 threads hit the same bank, giving 32-way bank conflicts. [Diagram: 32x32 array with each column falling entirely within one bank]

63 Shared Memory: Avoiding Bank Conflicts. Add a column for padding: a 32x33 SMEM array. When a warp accesses a column, the 32 accesses now fall into 32 different banks: no bank conflicts. [Diagram: 32x33 array; the padding column shifts successive column elements across banks]
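
A sketch of the padding trick in context, transposing a 32x32 tile (the 33rd column is never used; it only shifts the bank mapping):

    __global__ void transposeTile(float *out, const float *in)
    {
        __shared__ float tile[32][33];          // 33, not 32: the extra column avoids bank conflicts

        int x = threadIdx.x, y = threadIdx.y;   // launched as one 32x32 block
        tile[y][x] = in[y * 32 + x];            // row-wise write: conflict-free
        __syncthreads();
        out[y * 32 + x] = tile[x][y];           // column-wise read: conflict-free thanks to padding
    }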

64 Measure Before You Optimize. "Premature optimization is the root of all evil." -- Donald Knuth. Use the Visual Profiler: identify the limiting factor; analyze instruction throughput; analyze memory throughput; analyze kernel occupancy; compare to theoretical device limits.

65 CUDA 4.0: Highlights. Easier parallel application porting: share GPUs across multiple threads; single-thread access to all GPUs; no-copy pinning of system memory; new CUDA C/C++ features; Thrust templated primitives library; NPP image/video processing library; layered textures. Faster multi-GPU programming: NVIDIA GPUDirect v2.0; peer-to-peer access; peer-to-peer transfers; unified virtual addressing. New & improved developer tools: auto performance analysis; C++ debugging; GPU binary disassembler; cuda-gdb for MacOS.

66 CUDA Programming Resources. CUDA Toolkit (current version 4.0): compiler, libraries, and documentation; free download for Windows, Linux, and MacOS. GPU Computing SDK: code samples, whitepapers. Instructional materials on the NVIDIA developer site: CUDA introduction & optimization webinar (slides and audio); parallel programming course at the University of Illinois (UIUC); tutorials; forums.

67 GPU Tools (more in the hands-on session). Profiler: available for all supported OSs; command-line or GUI; samples signals on the GPU for memory access parameters and execution (serialization, divergence); automatic performance analysis (new in CUDA 4.0). Debugger: cuda-gdb on Linux, Parallel Nsight on Windows; true step debugging on the GPU.

68 GPU DIRECT

69 MPI Integration of GPUDirect. GPUDirect + UVA allow native support of GPU data transfers in MPI implementations: they enable MPI_Send / MPI_Recv from/to GPU memory; the MPI library can differentiate between device memory and host memory without any hints from the user; the programmer doesn't have to deal with low-level details.

Code without MPI integration. At the sender:

    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At the receiver:

    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

Code with MPI integration. At the sender:

    MPI_Send(s_device, size, ...);

At the receiver:

    MPI_Recv(r_device, size, ...);

70 MVAPICH2-GPU: upcoming MVAPICH2 support for GPU-GPU communication. (Measurements from: H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", Int'l Supercomputing Conference 2011 (ISC), Hamburg.) Two-sided: with GPUDirect, 45% improvement compared to Memcpy+Send (4 MB) and 24% compared to MemcpyAsync+Isend (4 MB); without GPUDirect, 38% compared to Memcpy+Send (4 MB) and 33% compared to MemcpyAsync+Isend (4 MB). One-sided: with GPUDirect, 45% improvement compared to Memcpy+Put; without GPUDirect, 39% compared to Memcpy+Put; similar improvement for the Get operation. Major improvement in programming.

71 MVAPICH2-GPU (measurements from the same ISC 2011 paper by Wang et al.). Collectives: with GPUDirect, 37% improvement for MPI_Gather (1 MB) and 32% for MPI_Scatter (4 MB); without GPUDirect, 33% for MPI_Gather (1 MB) and 23% for MPI_Scatter (4 MB). With GPUDirect, 30% improvement for 4 MB messages; without GPUDirect, 26% for 4 MB messages.

72 OpenMPI. OpenMPI support to send data directly from CUDA device memory via MPI calls, developed by Rolf vandeVaart (NVIDIA). The code is currently available in the trunk. By default, the code is disabled and has to be configured into the library: --with-cuda(=DIR) builds CUDA support, optionally adding DIR/include, DIR/lib, and DIR/lib64; --with-cuda-libdir=DIR searches for CUDA libraries in DIR.

73 CUDA Math Libraries. High-performance math routines for your applications: cufft (Fast Fourier Transforms library), cublas (complete BLAS library), cusparse (sparse matrix library), curand (random number generation, RNG, library), NPP (performance primitives for image & video processing), Thrust (templated parallel algorithms & data structures), math.h (C99 floating-point library). All included in the CUDA Toolkit (free download).

74 cufft: Multi-dimensional FFTs. New in CUDA 4.0: significant performance improvements in double precision, radix 2, 3, 5 and 7, and 2D/3D sizes that contain prime factors larger than 7. Flexible input and output data layouts*, similar to the FFTW advanced interface, eliminating extra data transposes and copies. (*Only supported for complex-to-complex transforms in this release.)
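
A minimal sketch of a 1D complex-to-complex transform (in-place, batch of 1; error checking omitted):

    #include <cufft.h>

    cufftHandle plan;
    cufftComplex *d_data;                          // N complex values in device memory

    cudaMalloc((void **)&d_data, N * sizeof(cufftComplex));
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);           // 1D C2C plan, batch = 1
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaFree(d_data);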

75 FFTs up to 10x Faster than MKL. 1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs. [Charts: GFLOPS vs. log2(size) for cufft in single and double precision, radix 2/3/5/7, compared against MKL.] MKL 10.1r1 on Intel quad-core Core i7, 2.93 GHz; cufft 4.0 on Tesla C2070, ECC on. Performance measured for ~16M total elements, split into batches of transforms of the size on the x-axis.

76 2D/3D Primes Now Use the Bluestein Algorithm. Significant performance improvement for 2D and 3D transform sizes. [Chart: GFLOPS vs. transform size (NxN, for prime N), single-precision 2D, cufft 4.0 vs. cufft 3.2 vs. MKL.] MKL 10.1r1 on Intel quad-core Core i7, 2.93 GHz; cufft 4.0 on C2070, ECC on.

77 cublas: Dense Linear Algebra on GPUs. Complete BLAS implementation plus useful extensions: supports all 152 standard routines for single, double, complex, and double complex. New in CUDA 4.0: a new API that facilitates multi-GPU programming, is thread-safe, and lets more routines provide parallelism using streams (the previous legacy API is still supported out-of-the-box); documentation rewritten from scratch; performance improvements (e.g., ZGEMM performance improved 10% on Fermi, 325 GFLOPS peak on C2050).
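
A sketch of the new handle-based API (SGEMM on N x N matrices already resident on the device; error checking omitted):

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C, column-major; d_A, d_B, d_C allocated with cudaMalloc
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

    cublasDestroy(handle);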

78 cublas Level 3 Performance: up to ~800 GFLOPS and ~17x speedup over MKL. [Charts: GFLOPS and speedup over MKL for the Level 3 routines GEMM, SYMM/HEMM, SYRK, TRMM, and TRSM in single, complex, double, and double-complex precision.] 4Kx4K matrix size; cublas 4.0, Tesla C2050 (Fermi), ECC on; MKL, 4-core 2.66 GHz.

79 ZGEMM Performance vs. Matrix Size: up to 8x speedup over MKL. cublas 4.0 reaches 97% of peak performance at 1024x1024; MKL reaches 80% of peak at 1024x1024. [Chart: GFLOPS vs. matrix size, cublas 4.0 vs. MKL.] Performance may vary based on OS version and motherboard configuration. cublas 4.0, Tesla C2050 (Fermi), ECC on; MKL, 4-core 2.66 GHz.

80 cusparse: Sparse Linear Algebra Routines. Conversion routines for dense, COO, CSR and CSC formats. Optimized sparse matrix-vector multiplication for CSR, computing y = \alpha \cdot op(A) \cdot x + \beta \cdot y. New: sparse triangular solve. CUDA 4.0 API optimized for common iterative solve algorithms.

81 cusparse is up to 6x Faster than MKL: sparse matrix x dense vector (csrmv). [Chart: speedup over MKL for single, double, complex, and double-complex precision.] Performance may vary based on OS version and motherboard configuration. cusparse 4.0, NVIDIA C2050 (Fermi), ECC on; MKL, 4-core 3.07 GHz.

82 Up to 35x Faster with 6 Dense Vectors (csrmm): useful for block iterative solve schemes. [Chart: speedup over MKL for single, double, complex, and double-complex precision.] Performance may vary based on OS version and motherboard configuration. cusparse 4.0, NVIDIA C2050 (Fermi), ECC on; MKL, 4-core 3.07 GHz.

83 curand: Random Number Generation. New in CUDA 4.0: scrambled and 64-bit Sobol'; log-normal distribution; new parallel ordering supports faster XORWOW initialization. Results of curand generators against standard statistical test batteries are reported in the documentation.
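
A sketch of the host-side generator API, filling a device buffer with uniform floats (d_data and n are assumed to exist):

    #include <curand.h>

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);     // n floats in (0,1], written to device memory
    curandDestroyGenerator(gen);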

84 curand Performance: curand 64-bit scrambled Sobol' is 8x faster than MKL's 32-bit plain Sobol'. [Chart: GSamples/sec for curand XORWOW, 32-bit Sobol', 32-bit scrambled Sobol', 64-bit Sobol', 64-bit scrambled Sobol', and single-threaded MKL 32-bit Sobol', for uniform, normal, and log-normal distributions in single and double precision.] Performance may vary based on OS version and motherboard configuration. curand 4.0, NVIDIA C2050 (Fermi), ECC on.

85 NVIDIA Performance Primitives: up to 40x speedups. Arithmetic, logic, conversions, filters, statistics, etc. ~420 image functions (+70 in 4.0); ~500 signal functions (+400 in 4.0). The majority of primitives are 5x to 10x faster than analogous routines in Intel IPP. NPP 4.0, NVIDIA C2050 (Fermi); IPP 6.1, dual-socket Core i7 2.67 GHz.

86 Thrust: CUDA C++ Template Library. Added to the CUDA Toolkit as of CUDA 4.0 (also available on Google Code). Template library for CUDA: host and device containers that mimic the C++ STL; optimized algorithms for sort, reduce, scan, etc.; OpenMP backend for portability. Allows applications and prototypes to be built quickly.
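
A sketch of the STL-like style: copy host data to the device, sort it, then reduce it (h_vec is an assumed std::vector<int>):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    thrust::device_vector<int> d_vec(h_vec.begin(), h_vec.end());  // host -> device copy
    thrust::sort(d_vec.begin(), d_vec.end());                      // parallel sort on the GPU
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);       // parallel reduction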

87 Thrust Algorithm Performance. [Charts: speedup over C++ STL for Thrust 4.0 and Intel TBB. Left: various algorithms (reduce, transform, scan, sort) on 32M integers. Right: sort on 32M samples across datatypes (char, short, int, long, float, double).] Thrust 4.0, NVIDIA Tesla C2050 (Fermi); Core i7 3.07 GHz.

88 math.h: C99 Floating-Point Library + Extras. CUDA math.h is industry proven, high performance, high accuracy. Basics: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes). Exponentials: exp, exp2, log, log2, log10, ... Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ... Special functions: lgamma, tgamma, erf, erfc. Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ... Extras: rsqrt, rcbrt, exp10, sinpi, sincos, cospi, erfinv, erfcinv, ... Performance improvements in CUDA 4.0: double-precision /, rsqrt(), erfc(), and sinh() are all >~30% faster on Fermi; cospi() added in CUDA 4.0.
