Advanced CUDA Optimization 1. Introduction

1 Advanced CUDA Optimization 1. Introduction Thomas Bradley

2 Agenda
- CUDA Review
  - Review of CUDA Architecture
  - Programming & Memory Models
  - Programming Environment
  - Execution
- Performance
  - Optimization Guidelines
- Productivity
  - Resources

3 CUDA Review REVIEW OF CUDA ARCHITECTURE

6 Processing Flow (CPU and GPU connected by the PCI bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
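The three steps of the processing flow can be sketched with the CUDA runtime API. This is a minimal illustration, not code from the slides: the kernel `scale_by_two` and the host helper `run_processing_flow` are hypothetical names chosen for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles every element in place.
__global__ void scale_by_two(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical host helper: returns true if the round trip gave the expected values.
bool run_processing_flow()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.5f);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    // 1. Copy input data from CPU memory to GPU memory.
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // 2. Launch the GPU program (the kernel) on the data now resident on the device.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_by_two<<<blocks, threads>>>(dev, n);

    // 3. Copy results from GPU memory to CPU memory.
    //    This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (host[i] != 3.0f) return false;
    return true;
}
```

Called from `main`, `run_processing_flow()` returns true when every element came back doubled.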

7 CUDA Parallel Computing Architecture
- Parallel computing architecture and programming model
- Includes a CUDA C compiler, support for OpenCL and DirectCompute
- Architected to natively support multiple computational interfaces (standard languages and APIs)

8 CUDA Parallel Computing Architecture
- CUDA defines:
  - Programming model
  - Memory model
  - Execution model
- CUDA uses the GPU, but is for general-purpose computing
  - Facilitates heterogeneous computing: CPU + GPU
- CUDA is scalable
  - Scales to run on 100s of cores/1000s of parallel threads

9 CUDA Review PROGRAMMING MODEL

10 CUDA Kernels
- Parallel portion of application: execute as a kernel
  - Entire GPU executes kernel, many threads
- CUDA threads:
  - Lightweight
  - Fast switching
  - 1000s execute simultaneously
- CPU (host): executes functions
- GPU (device): executes kernels

11 CUDA Kernels: Parallel Threads
- A kernel is a function executed on the GPU
  - Array of threads, in parallel
- All threads execute the same code, can take different paths
    float x = input[threadid];
    float y = func(x);
    output[threadid] = y;
- Each thread has an ID
  - Select input/output data
  - Control decisions
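The three-line fragment on this slide can be turned into a complete kernel. In the sketch below, `func` is a hypothetical stand-in (the slides do not define it), written so that threads take different paths depending on their data; `apply_func` and `run_apply_func` are likewise assumed names.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-element function standing in for func() on the slide.
// Threads with negative input take one path, the others take another.
__device__ float func(float x)
{
    return (x < 0.0f) ? -x : x * x;
}

__global__ void apply_func(const float *input, float *output, int n)
{
    // The thread ID selects which input/output element this thread handles.
    int threadid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadid < n) {               // control decision based on the ID
        float x = input[threadid];
        float y = func(x);
        output[threadid] = y;
    }
}

// Hypothetical host helper: runs the kernel and checks every result.
bool run_apply_func()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)(i - 500);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    apply_func<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float v = in[i];
        float expected = (v < 0.0f) ? -v : v * v;
        if (out[i] != expected) return false;
    }
    return true;
}
```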

16 CUDA Kernels: Subdivide into Blocks
- Threads are grouped into blocks
- Blocks are grouped into a grid
- A kernel is executed as a grid of blocks of threads
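The grid/block grouping is visible in the launch configuration `<<<blocks, threadsPerBlock>>>` and in the built-in variables `blockIdx`, `blockDim` and `threadIdx`. The sketch below, with the assumed names `record_ids` and `run_record_ids`, has each thread record which block it belongs to and its position inside that block.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records its block index and its index within the block.
__global__ void record_ids(int *block_of, int *thread_of, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the grid
    if (i < n) {
        block_of[i]  = blockIdx.x;
        thread_of[i] = threadIdx.x;
    }
}

// Hypothetical host helper: launches a grid of 3 blocks of 4 threads for 10 elements.
bool run_record_ids()
{
    const int n = 10;
    const int threadsPerBlock = 4;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // rounds up to 3
    const size_t bytes = n * sizeof(int);

    int *d_block = nullptr, *d_thread = nullptr;
    if (cudaMalloc(&d_block, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_thread, bytes) != cudaSuccess) { cudaFree(d_block); return false; }

    record_ids<<<blocks, threadsPerBlock>>>(d_block, d_thread, n);

    std::vector<int> b(n), t(n);
    cudaError_t e1 = cudaMemcpy(b.data(), d_block, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(t.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block);
    cudaFree(d_thread);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    // Element i is handled by thread (i % 4) of block (i / 4).
    for (int i = 0; i < n; ++i)
        if (b[i] != i / threadsPerBlock || t[i] != i % threadsPerBlock) return false;
    return true;
}
```

The last block has two threads with no element to process, which is why the kernel guards with `i < n`.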

17 Communication Within a Block
- Threads may need to cooperate
  - Memory accesses
  - Share results
- Cooperate using shared memory
  - Accessible by all threads within a block
- Restriction to within a block permits scalability
  - Fast communication between N threads is not feasible when N is large
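One common form of cooperation within a block is a sum reduction through shared memory. The sketch below is an illustration under stated assumptions, not the presenter's code: the names `block_sum` and `run_block_sum` are hypothetical, and the kernel assumes it is launched with exactly `BLOCK_SIZE` threads per block, a power of two.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed launch block size; must be a power of two

// Each block sums its own BLOCK_SIZE elements and writes one partial sum.
__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float partial[BLOCK_SIZE];  // visible to all threads of this block only

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads a neighbour

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = partial[0];
}

// Hypothetical host helper: 1024 ones summed by 4 blocks of 256 threads.
bool run_block_sum()
{
    const int n = 1024;
    const int blocks = n / BLOCK_SIZE;
    std::vector<float> in(n, 1.0f), sums(blocks);

    float *d_in = nullptr, *d_sums = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_sums, blocks * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<blocks, BLOCK_SIZE>>>(d_in, d_sums, n);
    cudaError_t err = cudaMemcpy(sums.data(), d_sums, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < blocks; ++b)
        if (sums[b] != 256.0f) return false;
    return true;
}
```

Blocks never communicate with each other inside the kernel; each writes an independent partial sum, which is what lets the hardware schedule blocks in any order.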

18 Transparent Scalability
[Figure: thread blocks scheduled across the multiprocessors of different GPUs; some multiprocessors are marked Idle.]

21 CUDA Programming Model - Summary
- A kernel executes as a grid of thread blocks
- A block is a batch of threads
  - Communicate through shared memory
- Each block has a block ID
- Each thread has a thread ID
[Figure: host launching kernels on the device; blocks of a 2D grid labeled 0,0 through 1,3.]
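Block and thread IDs can be multi-dimensional, as the 2D block labels in the figure suggest. The following sketch uses `dim3` to launch a 2D grid of 2D blocks; `label_blocks` and `run_label_blocks` are hypothetical names, and the 16x16 block shape is an arbitrary choice for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread writes the linearized ID of its block into its matrix element.
__global__ void label_blocks(int *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        out[row * width + col] = blockIdx.y * gridDim.x + blockIdx.x;
}

// Hypothetical host helper: labels a 30x50 matrix using 16x16 thread blocks.
bool run_label_blocks()
{
    const int width = 50, height = 30;
    const size_t bytes = width * height * sizeof(int);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,    // 4 blocks across
              (height + block.y - 1) / block.y);  // 2 blocks down

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) return false;
    label_blocks<<<grid, block>>>(d_out, width, height);

    std::vector<int> out(width * height);
    cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int row = 0; row < height; ++row)
        for (int col = 0; col < width; ++col) {
            int expected = (row / 16) * (int)grid.x + (col / 16);
            if (out[row * width + col] != expected) return false;
        }
    return true;
}
```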

22 CUDA Review MEMORY MODEL

28 Memory hierarchy
- Thread:
  - Registers
  - Local memory
- Block of threads:
  - Shared memory
- All blocks:
  - Global memory
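A single kernel can touch every level of this hierarchy. The sketch below is illustrative only: `hierarchy_demo` and `run_hierarchy_demo` are hypothetical names, the kernel assumes the element count is a multiple of `BLOCK`, and whether `scratch` actually lives in registers or in local memory is a compiler decision, not something the code guarantees.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128  // assumed block size; n must be a multiple of it in this sketch

__global__ void hierarchy_demo(const float *in, float *out)
{
    __shared__ float tile[BLOCK];  // shared memory: one copy per block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    float x = in[i];               // global memory -> per-thread register

    // Per-thread array; because it is indexed dynamically below, the compiler
    // may place it in local memory rather than registers.
    float scratch[4];
    for (int k = 0; k < 4; ++k) scratch[k] = (float)k;

    tile[tid] = x;                 // publish to the rest of the block
    __syncthreads();
    float neighbour = tile[(tid + 1) % blockDim.x];  // read another thread's value

    out[i] = neighbour + scratch[tid & 3];           // result back to global memory
}

// Hypothetical host helper: checks the result for in[i] = i.
bool run_hierarchy_demo()
{
    const int n = 512;
    const size_t bytes = n * sizeof(float);
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    hierarchy_demo<<<n / BLOCK, BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int tid = i % BLOCK, base = (i / BLOCK) * BLOCK;
        float expected = (float)(base + (tid + 1) % BLOCK) + (float)(tid & 3);
        if (out[i] != expected) return false;
    }
    return true;
}
```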

29 Additional Memories
- Host can also allocate textures and arrays of constants
- Textures and constants have dedicated caches
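As a sketch of the constant-array case only (textures are not shown), the host can fill a `__constant__` array with `cudaMemcpyToSymbol` and every thread can then read it through the constant cache. The names `coeffs`, `eval_poly` and `run_eval_poly` are hypothetical, chosen for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Constant memory: written by the host, read-only inside kernels.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
// All threads read the same coefficient at the same time.
__global__ void eval_poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Hypothetical host helper: evaluates 1 + 2x + 3x^2 + 4x^3 for x = 0..7.
bool run_eval_poly()
{
    const int n = 8;
    const size_t bytes = n * sizeof(float);
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> x(n), y(n);
    for (int i = 0; i < n; ++i) x[i] = (float)i;

    // Host copies the constants into the device's constant memory.
    if (cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)) != cudaSuccess) return false;

    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }

    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    eval_poly<<<1, n>>>(d_x, d_y, n);
    cudaError_t err = cudaMemcpy(y.data(), d_y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        float v = x[i];
        float expected = ((4.0f * v + 3.0f) * v + 2.0f) * v + 1.0f;
        if (y[i] != expected) return false;
    }
    return true;
}
```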

30 CUDA Review PROGRAMMING ENVIRONMENT

31 CUDA C and OpenCL
- OpenCL: entry point for developers who want low-level API
- CUDA C: entry point for developers who prefer high-level C
- Shared back-end compiler and optimization technology

32 Visual Studio
- Separate file types
  - .c/.cpp for host code
  - .cu for device/mixed code
- Compilation rules: cuda.rules
- Syntax highlighting
- Intellisense
- Integrated debugger and profiler: Nexus

33 NVIDIA Nexus IDE
- The industry's first IDE for massively parallel applications
- Accelerates co-processing (CPU + GPU) application development
- Complete Visual Studio-integrated development environment

34 Linux
- Separate file types
  - .c/.cpp for host code
  - .cu for device/mixed code
- Typically makefile driven
- cuda-gdb for debugging
- CUDA Visual Profiler

35 Performance OPTIMIZATION GUIDELINES

36 Optimize Algorithms for GPU
- Algorithm selection
  - Understand the problem, consider alternate algorithms
  - Maximize independent parallelism
  - Maximize arithmetic intensity (math/bandwidth)
- Recompute?
  - GPU allocates transistors to arithmetic, not memory
  - Sometimes better to recompute rather than cache
- Serial computation on GPU?
  - Low parallelism computation may be faster on GPU vs copy to/from host

37 Optimize Memory Access
- Coalesce global memory access
  - Maximise DRAM efficiency
  - Order of magnitude impact on performance
- Avoid serialization
  - Minimize shared memory bank conflicts
  - Understand constant cache semantics
- Understand spatial locality
  - Optimize use of textures to ensure spatial locality
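To make the coalescing point concrete, the sketch below contrasts two kernels that copy the same array. In `copy_coalesced`, neighbouring threads touch neighbouring addresses; in `copy_strided`, neighbouring threads touch addresses `cols` elements apart. Both produce the same result, so the code only checks correctness and makes no timing claim; the access pattern is the point. All names here are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Neighbouring threads access neighbouring elements: accesses can be coalesced.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighbouring threads access elements `cols` apart: accesses are scattered.
// The index mapping is a permutation, so every element is still copied once.
__global__ void copy_strided(const float *in, float *out, int rows, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        int j = (i % rows) * cols + (i / rows);
        out[j] = in[j];
    }
}

// Hypothetical host helper: runs both kernels and checks both copies.
bool run_copy_patterns()
{
    const int rows = 1024, cols = 1024, n = rows * cols;
    const size_t bytes = n * sizeof(float);
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)(i % 97 + 1);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256, blocks = (n + threads - 1) / threads;
    bool ok = true;

    copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    ok = ok && cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && (out == in);

    cudaMemset(d_out, 0, bytes);  // clear so the second check is independent
    copy_strided<<<blocks, threads>>>(d_in, d_out, rows, cols);
    ok = ok && cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && (out == in);

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Timing the two kernels with a profiler is one way to observe the difference on a given GPU.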

38 Exploit Shared Memory
- Hundreds of times faster than global memory
- Inter-thread cooperation via shared memory and synchronization
- Cache data that is reused by multiple threads
- Stage loads/stores to allow reordering
  - Avoid non-coalesced global memory accesses
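A matrix transpose is a standard illustration of staging through shared memory: a tile is read row by row, reordered in shared memory, and written row by row, so both the global loads and the global stores touch consecutive addresses. The sketch below is one common way to write it, not the presenter's code; `transpose_tiled` and `run_transpose` are hypothetical names, and the extra padding column is a widely used trick to reduce shared memory bank conflicts.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32

// in is height x width (row-major); out is width x height (row-major).
__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    // +1 column of padding so that reading a tile column does not hit one bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // row-wise load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // output column
    y = blockIdx.x * TILE + threadIdx.y;      // output row
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise store
}

// Hypothetical host helper: transposes a 70x100 matrix and checks every element.
bool run_transpose()
{
    const int width = 100, height = 70, n = width * height;
    const size_t bytes = n * sizeof(float);
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaError_t err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < width; ++r)        // rows of the transposed matrix
        for (int c = 0; c < height; ++c)   // columns of the transposed matrix
            if (out[r * height + c] != in[c * width + r]) return false;
    return true;
}
```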

39 Use Resources Efficiently
- Partition the computation to keep multiprocessors busy
  - Many threads, many thread blocks
  - Multiple GPUs
- Monitor per-multiprocessor resource utilization
  - Registers and shared memory
  - Low utilization per thread block permits multiple active blocks per multiprocessor
- Overlap computation with I/O
  - Use asynchronous memory transfers
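Overlapping computation with transfers is usually done with pinned host memory, `cudaMemcpyAsync` and streams. The sketch below splits the work into two chunks, each issued to its own stream, so that the copy for one chunk may overlap the kernel for the other; whether overlap actually occurs depends on the hardware. The names `add_one` and `run_overlap` are hypothetical.

```cuda
#include <cuda_runtime.h>

__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical host helper: processes an array in two chunks on two streams.
bool run_overlap()
{
    const int n = 1 << 20, nStreams = 2, chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *host = nullptr, *dev = nullptr;
    // Asynchronous copies require page-locked (pinned) host memory.
    if (cudaMallocHost(&host, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(host); return false; }
    for (int i = 0; i < n; ++i) host[i] = (float)(i % 1000);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256, blocks = (chunk + threads - 1) / threads;
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        // Work in one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(dev + offset, host + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        add_one<<<blocks, threads, 0, streams[s]>>>(dev + offset, chunk);
        cudaMemcpyAsync(host + offset, dev + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();  // wait for both streams

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        if (host[i] != (float)(i % 1000) + 1.0f) ok = false;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```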

40 Productivity RESOURCES

41 Getting Started
- CUDA Zone
  - Introductory tutorials/webinars
  - Forums
- Documentation
  - Programming Guide
  - Best Practices Guide
- Examples
  - CUDA SDK

42 Libraries
- NVIDIA
  - cublas: Dense linear algebra (subset of full BLAS suite)
  - cufft: 1D/2D/3D real and complex
- Third party
  - NAG: Numeric libraries e.g. RNGs
  - culapack/magma
- Open Source
  - Thrust: STL/Boost style template language
  - cudpp: Data parallel primitives (e.g. scan, sort and reduction)
  - CUSP: Sparse linear algebra and graph computation
- Many more...
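As a small taste of the STL-like style of Thrust, the sketch below sorts and sums a handful of integers on the GPU without writing any kernel. The helper name `run_thrust_demo` is hypothetical; the Thrust calls themselves (`thrust::sort`, `thrust::reduce`, `host_vector`, `device_vector`) are part of the library.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>

// Hypothetical host helper: sorts and sums five integers on the device.
bool run_thrust_demo()
{
    const int raw[5] = {5, 1, 4, 2, 3};
    thrust::host_vector<int> h(raw, raw + 5);

    // Assigning a host_vector to a device_vector copies the data to the GPU.
    thrust::device_vector<int> d = h;

    thrust::sort(d.begin(), d.end());                 // sorts on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // sums on the device

    // Copy the sorted data back to the host for checking.
    thrust::host_vector<int> sorted = d;
    for (int i = 0; i < 5; ++i)
        if (sorted[i] != i + 1) return false;
    return sum == 15;
}
```

Because Thrust is a header-only template library shipped with the CUDA toolkit, a file like this only needs to be compiled with `nvcc`.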
