Introduction to CUDA
Lecture originally by Luke Durant and Tamas Szalay
Today
- Why CUDA?
- Overview of CUDA architecture
- Dense matrix multiplication with CUDA
Shader GPGPU
- Before the current generation, this is all we had
- Lots of things are faster in GLSL than on the CPU, but:
- No scatter!
- No communication between fragments (threads)
- Awkward interface; requires familiarity with graphics APIs
- Memory modes not what we would like
- Hard to transfer data from GPU to CPU
CUDA
- NVIDIA's solution to GPGPU
- An extension to the C language
- Has been far more popular than CTM/Brook for GPGPU, and is thus the focus of this course
- Still a proprietary environment
- Keep your eyes on OpenCL; DirectX 11 includes a compute shader
CUDA
- Only works on NVIDIA G80 (GeForce 8000 series) and newer cards
- Can run in emulation on other hardware
- Floating point might not be exactly the same; watch out for OpenGL integration
- Designed to scale well over time
CUDA
- Compute Unified Device Architecture
- A different way of looking at GPU programming
- Provides far more features than we're used to from GL
- Less hassle, more access to the hardware
What is a CUDA Program?
- Two main parts: host and device
- Host code: runs on the CPU, uses special library calls (.cpp or .cu)
- Device code: runs on the GPU, written in C with some extensions; called kernels (.cu)
Host vs. Device
- Host code: Single Program, Single Data; not parallel; typically few threads, since threads carry overhead
- Device code: Single Program, Multiple Data; parallel; typically thousands of threads, with very little thread-creation overhead
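To make the host/device split concrete, here is a minimal sketch (names like addOne and the sizes are hypothetical): a kernel marked __global__ is device code, and everything in main() is ordinary host code that allocates device memory and launches a grid.

```cuda
#include <cstdio>

// Device code: a kernel, executed in parallel by many threads.
__global__ void addOne(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

// Host code: ordinary C/C++ that sets up memory and launches the kernel.
int main()
{
    const int N = 256;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    cudaMemset(d_data, 0, N * sizeof(float));

    // Launch one grid of 2 blocks x 128 threads (2 * 128 = N threads total).
    addOne<<<2, 128>>>(d_data);
    cudaDeviceSynchronize();   // wait for the grid to finish

    cudaFree(d_data);
    return 0;
}
```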
Graphics Mode
[Block diagram: the host feeds an input assembler; vertex, geometry, and pixel thread-issue units dispatch work to streaming processors (SP) grouped with texture fetch (TF) units and L1 caches; a thread processor sits above L2 cache and framebuffer (FB) partitions.]
CUDA Mode
[Block diagram: the host feeds an input assembler and a thread execution manager; each processor cluster pairs a parallel data cache with texture units; load/store paths connect all clusters to global memory.]
Overview of CUDA architecture
Basic Units of CUDA: The Grid
- A grid is a group of threads all running the same kernel (not synchronized)
- Every call into CUDA from the CPU is one grid
- Starting a grid from the CPU is a synchronous operation, but multiple grids can run at once
- On multi-GPU systems, grids cannot be shared between GPUs; use several grids for maximum efficiency
Basic Units of CUDA: The Block
- Grids are composed of blocks
- Each block is a logical unit containing a number of coordinating threads and a certain amount of shared memory
- Just as grids are not shared between GPUs, blocks are not shared between multiprocessors
Basic Units of CUDA: The Block
- All blocks in a grid run the same program
- How do you tell which block you are in? The blockIdx built-in variable
- Block IDs can be 1D or 2D (based on the grid dimensions)
Basic Units of CUDA: The Thread
- Blocks are composed of threads
- Threads run on the individual cores of the multiprocessors, but unlike grids and blocks, they are not married to a single core
- Like blocks, each thread has an ID (threadIdx)
- Thread IDs can be 1D, 2D, or 3D (based on the block dimensions)
- The thread ID is relative to the block it is in
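Since threadIdx is relative to its block, a kernel usually combines it with blockIdx to get a unique global index. A sketch (the kernel name and the bounds guard are our additions):

```cuda
// Combine the block index and the block-relative thread index into one
// global index -- the standard CUDA idiom for 1D data.
__global__ void indexDemo(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be only partially full
        out[i] = (float)i;
}
```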
Overview of CUDA architecture
Thread Storage and Communication
- Threads have a certain amount of register memory
- Register memory per multiprocessor is limited
- Several ways of communicating with other threads within the block
- Outside of the block, not a lot of communication; ideally, there should be none
Memory Model
Memory Areas: Globals
- Global memory: the main communication channel between device and host; read/write from both device and host
- Texture memory: read-only from the device; uses 2D hardware caching
- Constant memory: read-only from the device
- All are persistent across grid runs
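The usual global-memory round trip looks like this (a sketch; the buffer names and sizes are hypothetical):

```cuda
// Allocate on both sides, copy host -> device, run kernels, copy back.
int n = 1024;
size_t bytes = n * sizeof(float);

float *h_buf = (float *)malloc(bytes);   // host memory
float *d_buf;
cudaMalloc(&d_buf, bytes);               // device global memory

cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // device -> host

cudaFree(d_buf);
free(h_buf);
```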
Memory Areas: Per-Block
- Shared memory: accessible by all threads within a block; physically stored in each multiprocessor
Memory Areas: Per-Thread
- Register memory: efficient, but can't be indexed (no arrays!) and very limited in space
- Local memory: more space, more access modes
- Both are limited by the physical memory in each multiprocessor
- Unless you say otherwise, the compiler will try to put things into register memory
- Easy to run out of register memory
Memory Synchronization
- All memory accesses are thread-safe in the sense that each one happens atomically, although the order is undefined without explicit synchronization
- If two threads write the same location at the same time, one will win
- Try to avoid these situations: no two threads should write the same location at the same time
Memory Areas: Comparison
- We have many more memory tools than GLSL:
- Shared memory effectively allows communication between threads
- Global memory offers read/write access to memory areas directly accessible from the CPU
- Memory formats aren't restricted by graphics APIs; we don't have to stuff data into textures, we just get a void*!
Synchronization
- Basic unit of synchronization: __syncthreads()
- Waits until all threads in the block reach the call to __syncthreads()
- Be careful using it in conditionals! If some threads never reach it, you deadlock. It is OK if the conditional depends only on blockIdx
- No easy way to synchronize between blocks; be careful of having blocks write to the same area of global memory
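A typical use of the barrier: each thread stages one element in shared memory, and __syncthreads() guarantees the whole tile is populated before any thread reads a neighbor's entry. A sketch (the kernel and the fixed tile size of 256 are our assumptions):

```cuda
// Each thread loads one element into shared memory; after the barrier,
// it can safely read the element its right-hand neighbor loaded.
__global__ void shiftRight(const float *in, float *out)
{
    __shared__ float tile[256];          // per-block shared memory
    int t = threadIdx.x;

    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                     // all threads in the block wait here

    // Safe: tile[] is fully populated once the barrier is passed.
    out[blockIdx.x * blockDim.x + t] = tile[(t + 1) % blockDim.x];
}
```

Note that every thread in the block reaches the barrier unconditionally, which is exactly what the conditional warning above is about.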
What did we really get?
- A cleaner interface (no more ARB, EXT, etc.)
- No more graphics APIs, opening windows, messing with textures, etc.
- Scatter!
- Read/write memory (no more ping-ponging)
- Synchronization
- Easy library calls to share data with the CPU
- Now that we have scatter and synchronization, the whole literature of parallel algorithms is easy to implement
CUDA Example
- Now let's use our new knowledge to figure out how to do a familiar problem efficiently in CUDA
- Matrix multiplication: Y = A*B
- For simplicity, A and B are square (NxN)
Dense Multiplication #1
- First attempt: use 1 block
- Use NxN threads with 2D indexing
- In each thread, compute the value of the corresponding element of Y in the normal way
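Attempt #1 as a kernel sketch (the kernel name is ours; matrices are assumed to be stored row-major in global memory):

```cuda
// Attempt #1: one block of NxN threads; thread (x, y) computes Y[y][x].
// N*N must not exceed the per-block thread limit, so N stays small.
__global__ void matmulOneBlock(const float *A, const float *B, float *Y, int N)
{
    int col = threadIdx.x;
    int row = threadIdx.y;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    Y[row * N + col] = sum;
}
// Launched as a single block: matmulOneBlock<<<1, dim3(N, N)>>>(dA, dB, dY, N);
```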
What's wrong with attempt #1?
- Biggest problem: 1 thread block means only 1 multiprocessor is in use
- On the GTX 280, this means we're using only 3% of our processing power!
- Also, global memory accesses are out of control: each source element is read N times from global memory!
- We'll fix this later
Dense Multiplication #2
- Use N blocks, with N threads per block
- Each thread calculates the element determined by its block and thread indices
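Attempt #2 as a sketch (kernel name ours, row-major layout assumed): the block index selects the row and the thread index selects the column, so all N multiprocessors can get work.

```cuda
// Attempt #2: N blocks of N threads; block index picks the row,
// thread index picks the column.
__global__ void matmulRowPerBlock(const float *A, const float *B, float *Y, int N)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];   // every read hits global memory
    Y[row * N + col] = sum;
}
// Launched as: matmulRowPerBlock<<<N, N>>>(dA, dB, dY, N);
```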
What's wrong with attempt #2?
- Biggest problem: global memory access
- Global memory reads are not particularly fast!
- We want to minimize the amount of data read from global memory
- Key idea: every thread in a block is accessing the same row data
- We can move that data into shared memory, which is much faster
Dense Multiplication #3
- Use N blocks, with N threads per block
- Before any calculation, copy the row corresponding to each block into shared memory
- Now we've drastically cut down on global memory accesses
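Attempt #3 as a sketch (kernel name and the MAX_N bound are ours; this version assumes the whole row fits in shared memory, which the next slide revisits):

```cuda
// Attempt #3: each block first stages its row of A in shared memory,
// so that row is read from global memory once instead of N times.
#define MAX_N 1024   // assumed upper bound so the row fits in shared memory

__global__ void matmulSharedRow(const float *A, const float *B, float *Y, int N)
{
    __shared__ float rowA[MAX_N];
    int row = blockIdx.x;
    int col = threadIdx.x;

    rowA[col] = A[row * N + col];   // cooperative copy: one element per thread
    __syncthreads();                // row must be complete before anyone reads it

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += rowA[k] * B[k * N + col];   // B is still read from global memory
    Y[row * N + col] = sum;
}
```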
Potential Problems with #3
- Shared memory probably isn't large enough to store a whole row
- Solution: store as much as we can at one time, then grab more
- Problem with this: we would need to sync threads between each memory access; otherwise we may produce the wrong answer
Dense Multiplication
- There is a trade-off to be made here
- If we use fewer blocks, memory accesses are generally less common
- If we use more blocks, there could be more dead time in the processors
- We have to experiment to find the best balance
- In general, as size increases, fewer blocks will give better performance for a while, then suddenly perform terribly
Dense Multiplication
- Other factors matter as well
- Global vs. texture memory? Texture memory is heavily cached
How to Learn CUDA in 3 Steps
1. Read the CUDA Programming Guide
2. Look at the SDK
3. Repeat steps 1 and 2 for a while