GPU Computing with CUDA


1 GPU Computing with CUDA
Gabriel Noaje (gabriel.noaje@univ-reims.fr)
University of Reims Champagne-Ardenne, France
CUDA Training Day, June 26, 2012

3 INTRODUCTION TO MASSIVELY PARALLEL COMPUTING

4 Moore's Law (paraphrased)
"The number of transistors on an integrated circuit doubles every two years." (Gordon E. Moore)

5 Moore's Law (visualized)
Credits: Wikimedia

6 Serial Performance Scaling is Over
- Cannot continue to scale processor frequencies
  - no 10 GHz chips
- Cannot continue to increase power consumption
  - can't melt chip
- Can continue to increase transistor density
  - as per Moore's Law

7 How to Use Transistors?
- Instruction-level parallelism
  - out-of-order execution, speculation, ...
  - vanishing opportunities in power-constrained world
- Data-level parallelism
  - vector units, SIMD execution, ...
  - increasing: SSE, AVX, Cell SPE, Clearspeed, GPU
- Thread-level parallelism
  - increasing: multithreading, multicore, manycore
  - Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, ...

8 Why Massively Parallel Processing?
- A quiet revolution and potential build-up
  - Computation: TFLOPs vs. 100 GFLOPs
- GPU in every PC: massive volume & potential impact
(Figure: peak computation over time for GPUs NV30, NV40, G70, G80, GT200, T12 versus CPUs 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)

9 Why Massively Parallel Processing?
- A quiet revolution and potential build-up
  - Bandwidth: ~10x
- GPU in every PC: massive volume & potential impact
(Figure: memory bandwidth over time for the same GPUs and CPUs)

10 KEPLER ARCHITECTURE

11 Kepler GK110 Block Diagram
Architecture:
- 7.1B transistors
- 15 SMX units
- > 1 TFLOP FP
- L2 cache
- 384-bit GDDR5
- PCI Express Gen3

12 Kepler GK110 SMX vs Fermi SM

13 SMX: Efficient Performance
- 192 CUDA cores
- 64 FP units
- 32 Special Function Units
- 32 load/store units dedicated to memory access
- registers
- 64KB shared memory
- 48KB cache

14 Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
(Figure: the host launches Kernel 1 and Kernel 2, which run on the device as Grid 1 and Grid 2; Grid 1 holds Blocks (0, 0) to (1, 1), and Block (1, 1) is expanded into its threads, Thread (0,0,0) to Thread (3,1,0) in front and (0,0,1) to (3,0,1) behind)

15 The New Moore's Law
- Computers no longer get faster, just wider
- You must re-think your algorithms to be parallel!
- Data-parallel computing is the most scalable solution
  - Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores, ...
  - You will always have more data than cores: build the computation around the data

16 Generic Multicore Chip
(Figure: processors, each with its own nearby memory, above a shared global memory)
- Handful of processors, each supporting ~1 hardware thread
- On-chip memory near processors (cache, RAM, or both)
- Shared global memory space (external DRAM)

17 Generic Manycore Chip
(Figure: processors, each with its own nearby memory, above a shared global memory)
- Many processors, each supporting many hardware threads
- On-chip memory near processors (cache, RAM, or both)
- Shared global memory space (external DRAM)

18 Enter GPU Computing
- Massive economies of scale
- Massively parallel

19 Movie playing, 3D animation, CAD (Computer-Aided Design), games

20 GPU Evolution
- High throughput computation
  - GeForce GTX 280: 933 GFLOP/s
- High bandwidth memory
  - GeForce GTX 280: 140 GB/s
- High availability to all
  - 180+ million CUDA-capable GPUs in the wild
(Figure: transistor counts over time, from RIVA 128 (3M xtors, 1995) through GeForce 3 (60M xtors) and GeForce FX (125M xtors) to Fermi (3B xtors))

21 Why is this different from a CPU?
- Different goals produce different designs
  - GPU assumes work load is highly parallel
  - CPU must be good at everything, parallel or not
- CPU: minimize latency experienced by 1 thread
  - big on-chip caches
  - sophisticated control logic
- GPU: maximize throughput of all threads
  - # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  - multithreading can hide latency => skip the big caches
  - share control logic across many threads

22 Weather simulation, financial analysis, molecular docking, medical MRI, computer virus detection, geological modeling


23
- Initially: use graphics calls
  - Cg by NVIDIA
  - HLSL by Microsoft
- Nowadays: rich GPU programming ecosystem
  - ATI Stream by AMD
  - CUDA by NVIDIA
  - OpenCL by Khronos Group
  - DirectCompute by Microsoft
- CUDA (NVIDIA hardware specific) = Compute Unified Device Architecture
  - extends the ANSI C standard
  - gentle learning curve (compared to Cg, HLSL, etc.)
  - opens the underlying architecture to the user (driver + toolkit + SDK + docs + ...)

24 CUDA: Scalable parallel programming
- Augment C/C++ with minimalist abstractions
  - let programmers focus on parallel algorithms
  - not mechanics of a parallel programming language
- Provide straightforward mapping onto hardware
  - good fit to GPU architecture
  - maps well to multi-core CPUs too
- Scale to 100s of cores & 10,000s of parallel threads
  - GPU threads are lightweight: create / switch is free
  - GPU needs 1000s of threads for full utilization

25 Key Parallel Abstractions in CUDA
- Hierarchy of concurrent threads
- Lightweight synchronization primitives
- Shared memory model for cooperating threads

26 Hierarchy of concurrent threads
- Parallel kernels composed of many threads
  - all threads execute the same sequential program
- Threads are grouped into thread blocks
  - threads in the same block can cooperate
- Threads/blocks have unique IDs
(Figure: a single thread t; a block b made of threads t0, t1, ..., tB)

27 CUDA Model of Parallelism
(Figure: blocks, each with its own memory, above a shared global memory)
- CUDA virtualizes the physical hardware
  - thread is a virtualized scalar processor (registers, PC, state)
  - block is a virtualized multiprocessor (threads, shared mem.)
- Scheduled onto physical hardware without pre-emption
  - threads/blocks launch & run to completion
  - blocks should be independent

28 Heterogeneous Computing Multicore CPU

29 C for CUDA
- Philosophy: provide minimal set of extensions necessary to expose power
- Function qualifiers:
    __global__ void my_kernel() { }
    __device__ float my_device_func() { }
- Variable qualifiers:
    __constant__ float my_constant_array[32];
    __shared__ float my_shared_array[32];
- Execution configuration:
    dim3 grid_dim(100, 50);                        // 5000 thread blocks
    dim3 block_dim(4, 8, 8);                       // 256 threads per block
    my_kernel <<< grid_dim, block_dim >>> (...);   // Launch kernel
- Built-in variables and functions valid in device code:
    dim3 gridDim;            // Grid dimension
    dim3 blockDim;           // Block dimension
    dim3 blockIdx;           // Block index
    dim3 threadIdx;          // Thread index
    void __syncthreads();    // Thread synchronization

30 CUDA PROGRAMMING BASICS

31 Outline of CUDA Basics
- Basic Kernels and Execution on GPU
- Basic Memory Management
- Coordinating CPU and GPU Execution
- See the Programming Guide for the full API

32 CUDA Programming Model
- Parallel code (kernel) is launched and executed on a device by many threads
- Launches are hierarchical
  - Threads are grouped into blocks
  - Blocks are grouped into grids
- Familiar serial code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables

33 High Level View
(Figure: a GPU made of several multiprocessors, each with its own shared memory (SMEM), all attached to one global memory; the GPU connects over PCIe to the chipset and the CPU)

34 Blocks of threads run on an SM
(Figure: a thread, with its registers and per-thread memory, runs on a streaming processor; a thread block, with its per-block shared memory (SMEM), runs on a streaming multiprocessor)

35 Whole grid runs on GPU
(Figure: many blocks of threads spread over all multiprocessors, each with its SMEM, sharing global memory)

36 Thread Hierarchy
- Threads launched for a parallel section are partitioned into thread blocks
  - Grid = all blocks for a given launch
- Thread block is a group of threads that can:
  - Synchronize their execution
  - Communicate via shared memory

37 Memory Model
(Figure: sequential kernels, Kernel 0, Kernel 1, ..., all access the same per-device global memory)

38 Memory Model
(Figure: host memory exchanges data with Device 0 memory and Device 1 memory through cudaMemcpy())

39 Example: Vector Addition Kernel
Device Code
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }
Host Code
    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        // (assumes N is a multiple of 256)
        vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
    }


41 Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = ..., *h_B = ..., *h_C = ...;   // h_C starts empty
    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));
    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

42 Example: Host code for vecAdd (2)
    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    // copy result back to host memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
    // do something with the result
    // free device (GPU) memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
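The host and device code for vector addition can be put together as one compilable unit. The sketch below is only one way to do it, under stated assumptions: the names vecAddGuarded, vecAddOnGpu, and check are hypothetical, and a bounds check (not on the slides) is added so that the element count need not be a multiple of 256.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Same computation as the slide's kernel, plus a bounds check so that
// threads past the end of the arrays do nothing.
__global__ void vecAddGuarded(const float* a, const float* b, float* c, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host wrapper: allocate, copy in, launch, copy out, free.
std::vector<float> vecAddOnGpu(const std::vector<float>& h_a,
                               const std::vector<float>& h_b)
{
    assert(h_a.size() == h_b.size());
    const int n = static_cast<int>(h_a.size());
    std::vector<float> h_c(n);
    if (n == 0) return h_c;

    const size_t bytes = n * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    check(cudaMalloc((void**)&d_a, bytes));
    check(cudaMalloc((void**)&d_b, bytes));
    check(cudaMalloc((void**)&d_c, bytes));
    check(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;   // round up
    vecAddGuarded<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaGetLastError());                        // reports launch errors

    // cudaMemcpy waits for the kernel to finish before it copies.
    check(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_a));
    check(cudaFree(d_b));
    check(cudaFree(d_c));
    return h_c;
}
```

Calling vecAddOnGpu(a, b) from a main() compiled with nvcc returns the element-wise sum. Rounding the block count up and guarding with i < n is a common way to lift the "N is a multiple of 256" restriction.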

43 IDs and Dimensions
- Threads:
  - 3D IDs, unique within a block
- Blocks:
  - 2D IDs, unique within a grid
- Dimensions set at launch
  - Can be unique for each grid
- Built-in variables:
  - threadIdx, blockIdx
  - blockDim, gridDim
(Figure: on the device, Grid 1 holds Blocks (0, 0) to (2, 1); Block (1, 1) is expanded into Threads (0, 0) to (4, 2))

44 Kernel Variations and Output
    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = 7;
    }
    Output:

    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = blockIdx.x;
    }
    Output:

    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = threadIdx.x;
    }
    Output:
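What the three variations write depends on the launch configuration. As a rough sketch, the hypothetical driver below (runVariant, check, and the three kernel names are not from the slides) runs each variation for a chosen number of blocks and threads and returns the array.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// The three variations, renamed so they can live in one file.
__global__ void fillConstant(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = 7;
}

__global__ void fillBlockIndex(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}

__global__ void fillThreadIndex(int* a)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}

// Hypothetical driver: run variation 0, 1, or 2 and return the array.
std::vector<int> runVariant(int variant, int blocks, int threads)
{
    assert(blocks > 0 && threads > 0);
    const int n = blocks * threads;        // exactly one element per thread
    std::vector<int> h_a(n);
    int* d_a = nullptr;
    check(cudaMalloc((void**)&d_a, n * sizeof(int)));
    switch (variant) {
        case 0:  fillConstant<<<blocks, threads>>>(d_a);    break;
        case 1:  fillBlockIndex<<<blocks, threads>>>(d_a);  break;
        default: fillThreadIndex<<<blocks, threads>>>(d_a); break;
    }
    check(cudaGetLastError());
    check(cudaMemcpy(h_a.data(), d_a, n * sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d_a));
    return h_a;
}
```

Assuming 4 blocks of 4 threads, variation 0 returns sixteen 7s, variation 1 returns 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3, and variation 2 returns 0 1 2 3 repeated four times.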

45 Code executed on GPU
- C/C++ with some restrictions:
  - Can only access GPU memory
  - No variable number of arguments
  - No static variables
  - No recursion
  - No dynamic polymorphism
- Must be declared with a qualifier:
  - __global__ : launched by CPU, cannot be called from GPU, must return void
  - __device__ : called from other GPU functions, cannot be called by the CPU
  - __host__ : can be called by CPU
  - __host__ and __device__ qualifiers can be combined
    - Function is compiled for both host and device
    - Sample use: overloading operators

46 Memory Spaces
- CPU and GPU have separate memory spaces
  - Data is moved across PCIe bus
  - Use functions to allocate/set/copy memory on GPU
    - Very similar to corresponding C functions
- Pointers are just addresses
  - Can't tell from the pointer value whether the address is on CPU or GPU
  - Must exercise care when dereferencing:
    - Dereferencing CPU pointer on GPU will likely crash
    - Same for vice versa

47 CUDA Device Memory Model
- Device code can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
- Host code can:
  - Transfer data to/from per-grid global and constant memories

48 GPU Memory Allocation / Release
- Host (CPU) manages device (GPU) memory:
  - cudaMalloc (void ** pointer, size_t nbytes)
  - cudaMemset (void * pointer, int value, size_t count)
  - cudaFree (void* pointer)
    int n = 1024;
    int nbytes = n*sizeof(int);
    int * d_a = 0;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes);
    cudaFree(d_a);

49 Data Copies
- cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
  - returns after the copy is complete
  - blocks CPU thread until all bytes have been copied
  - doesn't start copying until previous CUDA calls complete
- enum cudaMemcpyKind
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice
- Non-blocking copies are also available
  - cudaMemcpyAsync

50-53 Code Walkthrough 1
    // walkthrough1.cu
    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        int dimx = 16;
        int num_bytes = dimx*sizeof(int);
        int *d_a=0, *h_a=0; // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc( (void**)&d_a, num_bytes );
        if( 0==h_a || 0==d_a )
        {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );
        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        for(int i=0; i<dimx; i++)
            printf("%d ", h_a[i] );
        printf("\n");

        free( h_a );
        cudaFree( d_a );
        return 0;
    }

54 Example: Shuffling Data
    // Reorder values based on keys
    // Each thread moves one element
    __global__ void shuffle(int* prev_array, int* new_array, int* indices)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        new_array[i] = prev_array[indices[i]];
    }
Host Code
    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        shuffle<<< N/256, 256>>>(d_old, d_new, d_ind);
    }

55 Kernel with 2D Indexing
    __global__ void kernel( int *a, int dimx, int dimy )
    {
        int ix = blockIdx.x*blockDim.x + threadIdx.x;
        int iy = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = iy*dimx + ix;
        a[idx] = a[idx]+1;
    }

56 Kernel with 2D Indexing (host code)
    int main()
    {
        int dimx = 16;
        int dimy = 16;
        int num_bytes = dimx*dimy*sizeof(int);
        int *d_a=0, *h_a=0; // device and host pointers

        h_a = (int*)malloc(num_bytes);
        cudaMalloc( (void**)&d_a, num_bytes );
        if( 0==h_a || 0==d_a )
        {
            printf("couldn't allocate memory\n");
            return 1;
        }

        cudaMemset( d_a, 0, num_bytes );

        dim3 grid, block;
        block.x = 4;
        block.y = 4;
        grid.x = dimx / block.x;
        grid.y = dimy / block.y;
        kernel<<<grid, block>>>( d_a, dimx, dimy );

        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        for(int row=0; row<dimy; row++)
        {
            for(int col=0; col<dimx; col++)
                printf("%d ", h_a[row*dimx+col] );
            printf("\n");
        }

        free( h_a );
        cudaFree( d_a );
        return 0;
    }

57 Blocks must be independent
- Any possible interleaving of blocks should be valid
  - presumed to run to completion without preemption
  - can run in any order
  - can run concurrently OR sequentially
- A thread block is a batch of threads that can cooperate with each other by:
  - synchronizing their execution: __syncthreads()
  - sharing data with shared memory
- Independence requirement gives scalability

58 CUDA MEMORIES

59 Hardware Implementation of CUDA Memories
- Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read-only per-grid constant memory
(Figure: a grid with Block (0, 0) and Block (1, 0); each block has its own shared memory and Threads (0, 0) and (1, 0) with their own registers; the host and all blocks reach global memory and constant memory)

60 CUDA Variable Type Qualifiers
    Variable declaration              Memory     Scope    Lifetime
    int var;                          register   thread   thread
    int array_var[10];                local      thread   thread
    __shared__ int shared_var;        shared     block    block
    __device__ int global_var;        global     grid     application
    __constant__ int constant_var;    constant   grid     application
- automatic scalar variables without qualifier reside in a register
  - compiler will spill to thread-local memory when registers run out
- automatic array variables without qualifier reside in thread-local memory

61 CUDA Variable Type Performance
    Variable declaration              Memory     Penalty
    int var;                          register   1x
    int array_var[10];                local      100x
    __shared__ int shared_var;        shared     1x
    __device__ int global_var;        global     100x
    __constant__ int constant_var;    constant   1x
- scalar variables reside in fast, on-chip registers
- shared variables reside in fast, on-chip memories
- thread-local arrays & global variables reside in uncached off-chip memory
- constant variables reside in cached off-chip memory

62 Where to declare variables?
Can host access it?
- Yes: declare outside of any function
    __constant__ int constant_var;
    __device__ int global_var;
- No: declare in the kernel
    int var;
    int array_var[10];
    __shared__ int shared_var;
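The placement rule for host-accessible variables can be illustrated with a constant-memory array. The sketch below is an assumption-laden example, not code from the slides: c_coeffs, polyEval, polyEvalOnGpu, and check are hypothetical names. The host fills the file-scope __constant__ array with cudaMemcpyToSymbol and the kernel only reads it.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Declared outside of any function: the host can write it, kernels can read it.
__constant__ float c_coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 for each input (Horner's rule).
__global__ void polyEval(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];   // automatic scalar declared in the kernel: a register
        y[i] = ((c_coeffs[3] * xi + c_coeffs[2]) * xi + c_coeffs[1]) * xi
               + c_coeffs[0];
    }
}

// Hypothetical host wrapper around the kernel.
std::vector<float> polyEvalOnGpu(const std::vector<float>& coeffs,
                                 const std::vector<float>& h_x)
{
    assert(coeffs.size() == 4);
    const int n = static_cast<int>(h_x.size());
    std::vector<float> h_y(n);
    if (n == 0) return h_y;

    // Host -> constant memory; the symbol itself is passed, not a pointer.
    check(cudaMemcpyToSymbol(c_coeffs, coeffs.data(), 4 * sizeof(float)));

    const size_t bytes = n * sizeof(float);
    float *d_x = nullptr, *d_y = nullptr;
    check(cudaMalloc((void**)&d_x, bytes));
    check(cudaMalloc((void**)&d_y, bytes));
    check(cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    polyEval<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, n);
    check(cudaGetLastError());

    check(cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_x));
    check(cudaFree(d_y));
    return h_y;
}
```

Every thread reads the same four coefficients, which is the access pattern constant memory is meant for; the per-thread value xi stays in a register.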

63 Example: thread-local variables
    // motivate per-thread variables with
    // Ten Nearest Neighbors application
    __global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
    {
        // p goes in a register
        float2 p = ps[threadIdx.x];
        // per-thread heap goes in off-chip memory
        float2 heap[10];
        // read through num_qs points, maintaining
        // the nearest 10 qs to p in the heap
        ...
        // write out the contents of heap to result
        ...
    }

64-67 Example: shared variables
    // motivate shared variables with
    // Adjacent Difference application:
    // compute result[i] = input[i] - input[i-1]
    __global__ void adj_diff_naive(int *result, int *input)
    {
        // compute this thread's global index
        unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
        if(i > 0)
        {
            // each thread loads two elements from global memory
            int x_i = input[i];
            int x_i_minus_one = input[i-1];
            result[i] = x_i - x_i_minus_one;
        }
    }
- What are the bandwidth requirements of this kernel? Two loads per thread
- How many times does this kernel load input[i]? Twice: once by thread i, again by thread i+1
- Idea: eliminate redundancy by sharing data

68-69 Example: shared variables
    // optimized version of adjacent difference
    __global__ void adj_diff(int *result, int *input)
    {
        // shorthand for threadIdx.x
        int tx = threadIdx.x;
        // allocate a shared array, one element per thread
        __shared__ int s_data[BLOCK_SIZE];
        // each thread reads one element to s_data
        unsigned int i = blockDim.x * blockIdx.x + tx;
        s_data[tx] = input[i];
        // avoid race condition: ensure all loads
        // complete before continuing
        __syncthreads();
        if(tx > 0)
            result[i] = s_data[tx] - s_data[tx-1];
        else if(i > 0)
        {
            // handle thread block boundary
            result[i] = s_data[tx] - input[i-1];
        }
    }

70 Example: shared variables
    // when the size of the array isn't known at compile time...
    __global__ void adj_diff(int *result, int *input)
    {
        // use extern to indicate a shared array will be
        // allocated dynamically at kernel launch time
        extern __shared__ int s_data[];
        ...
    }
    // pass the size of the per-block array, in bytes, as the third
    // argument to the triple chevrons
    adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r,i);
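The adjacent-difference kernel with a dynamically sized shared array can be completed into a small compilable unit. This is a minimal sketch under assumptions: adjDiffShared, adjDiffOnGpu, and check are hypothetical names, a length parameter n is added so the last block may be partially filled, and the output is zeroed first so that result[0], which the kernel never writes, has a defined value.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void adjDiffShared(int* result, const int* input, int n)
{
    // Size is supplied as the third launch parameter, in bytes.
    extern __shared__ int s_data[];

    int tx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + tx;

    // Each in-range thread stages one element in shared memory.
    if (i < n) s_data[tx] = input[i];

    // The barrier sits outside the if, so every thread of the block reaches it.
    __syncthreads();

    if (i >= n) return;
    if (tx > 0)
        result[i] = s_data[tx] - s_data[tx - 1];
    else if (i > 0)
        result[i] = s_data[tx] - input[i - 1];   // thread block boundary
}

// Hypothetical host wrapper; blockSize is chosen by the caller at run time.
std::vector<int> adjDiffOnGpu(const std::vector<int>& h_in, int blockSize)
{
    assert(blockSize > 0);
    const int n = static_cast<int>(h_in.size());
    std::vector<int> h_out(n);
    if (n == 0) return h_out;

    const size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    check(cudaMalloc((void**)&d_in, bytes));
    check(cudaMalloc((void**)&d_out, bytes));
    check(cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemset(d_out, 0, bytes));   // assumption: result[0] defaults to 0

    const int blocks = (n + blockSize - 1) / blockSize;
    adjDiffShared<<<blocks, blockSize, blockSize * sizeof(int)>>>(d_out, d_in, n);
    check(cudaGetLastError());

    check(cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_in));
    check(cudaFree(d_out));
    return h_out;
}
```

With this layout each input element is loaded from global memory once per block, plus one extra load by the first thread of every block after the first.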

71-75 A Common Programming Strategy
- Partition data into subsets that fit into shared memory
- Handle each data subset with one thread block
- Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
- Perform the computation on the subset from shared memory
- Copy the result from shared memory back to global memory

76 A Common Programming Strategy
- Carefully partition data according to access patterns
  - Read-only -> constant memory (fast)
  - R/W & shared within block -> shared memory (fast)
  - R/W within each thread -> registers (fast)
  - Indexed R/W within each thread -> local memory (slow)
  - R/W inputs/results -> cudaMalloc'ed global memory (slow)

77 Communication Through Memory
- Question:
    __global__ void race(void)
    {
        __shared__ int my_shared_variable;
        my_shared_variable = threadIdx.x;
        // what is the value of
        // my_shared_variable?
    }

78 Communication Through Memory
- This is a race condition
- The result is undefined
- The order in which threads access the variable is undefined without explicit coordination
- Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics

79 Communication Through Memory
- Use __syncthreads to ensure data is ready for access
    __global__ void share_data(int *input)
    {
        __shared__ int data[BLOCK_SIZE];
        data[threadIdx.x] = input[threadIdx.x];
        __syncthreads();
        // the state of the entire data array
        // is now well-defined for all threads
        // in this block
    }

80 Communication Through Memory
- Use atomic operations to ensure exclusive access to a variable
    // assume *result is initialized to 0
    __global__ void sum(int *input, int *result)
    {
        atomicAdd(result, input[threadIdx.x]);
    }
    // after this kernel exits, the value of
    // *result will be the sum of the input

81 Hierarchical Atomics
(Figure: per-block partial sums Σ0, Σ1, ..., Σi feed the total Σ)
- Divide & Conquer
  - Per-thread atomicAdd to a shared partial sum
  - Per-block atomicAdd to the total sum
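The two-level scheme can be sketched as follows. This is an illustrative version under assumptions, not the author's code: sumHierarchical, sumOnGpu, and check are hypothetical names, and a length parameter n is added for partially filled blocks.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Assumes *result is zero before the launch.
__global__ void sumHierarchical(const int* input, int* result, int n)
{
    __shared__ int partial;                  // one partial sum per block
    if (threadIdx.x == 0) partial = 0;
    __syncthreads();                         // partial is initialized for all

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&partial, input[i]);   // per-thread, in shared memory
    __syncthreads();                         // all contributions have landed

    // Per-block: a single atomic on global memory.
    if (threadIdx.x == 0) atomicAdd(result, partial);
}

// Hypothetical host wrapper around the kernel.
int sumOnGpu(const std::vector<int>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return 0;
    assert(n > 0);

    int *d_in = nullptr, *d_result = nullptr;
    check(cudaMalloc((void**)&d_in, n * sizeof(int)));
    check(cudaMalloc((void**)&d_result, sizeof(int)));
    check(cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice));
    check(cudaMemset(d_result, 0, sizeof(int)));

    const int threads = 256;
    sumHierarchical<<<(n + threads - 1) / threads, threads>>>(d_in, d_result, n);
    check(cudaGetLastError());

    int h_result = 0;
    check(cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d_in));
    check(cudaFree(d_result));
    return h_result;
}
```

Compared with one global atomicAdd per thread, contention on the global counter drops to one update per block; the per-thread atomics hit on-chip shared memory instead.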

82 SMX EXECUTION

83 How an SM executes threads
- Overview of how a Streaming Multiprocessor works
- SIMT Execution
- Divergence

84 Scheduling Blocks onto SMs
(Figure: Thread Blocks 5, 27, 61, and 2001 resident on one Streaming Multiprocessor)
- HW schedules thread blocks onto available SMs
  - No guarantee of ordering among thread blocks
  - HW will schedule thread blocks as soon as a previous thread block finishes

85 Warps
- Each thread block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in SMs
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 x 3 = 24 warps

86 Mapping of Thread Blocks
- Each thread block is mapped to one or more warps
- The hardware schedules each warp independently
(Figure: Thread Block N (128 threads) split into warps TB N W1, TB N W2, TB N W3, TB N W4)

87 Thread Scheduling Example
- SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by SM
  - Warps whose next instruction has its inputs ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
(Figure: timeline of issued instructions, TB = Thread Block, W = Warp; the SM switches to another eligible warp whenever TB1 W1, TB2 W1, or TB3 W2 stalls)

88 Control Flow Divergence
- What happens if you have the following code?
    if(foo(threadIdx.x))
    {
        do_a();
    }
    else
    {
        do_b();
    }

89 Control Flow Divergence
(Figure: Branch, then Path A, then Path B)

90 Control Flow Divergence
- Nested branches are handled as well
    if(foo(threadIdx.x))
    {
        if(bar(threadIdx.x))
            do_a();
        else
            do_b();
    }
    else
        do_c();

91 Control Flow Divergence
(Figure: Branch, Branch, then Path A, Path B, Path C)

92 Control Flow Divergence
- You don't have to worry about divergence for correctness
- You might have to think about it for performance
  - Depends on your branch conditions

93 Control Flow Divergence
- Performance drops off with the degree of divergence
    switch(threadIdx.x % N)
    {
        case 0: ...
        case 1: ...
        ...
    }
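How much a branch diverges depends on how its condition maps onto warps. The sketch below is a hypothetical illustration (branchKernel, runBranch, and check are invented names): the same two code paths are selected either per thread, so both paths occur inside every warp, or per warp, so all threads of a warp agree. Both settings give well-defined results; only the second avoids intra-warp divergence.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void branchKernel(int* out, int n, bool uniformPerWarp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Per-thread selector: neighbouring threads disagree (divergent warp).
    // Per-warp selector: all 32 threads of a warp agree (no divergence).
    int selector = uniformPerWarp ? (threadIdx.x / warpSize) : threadIdx.x;

    if (selector % 2 == 0)
        out[i] = 2 * i;        // path A
    else
        out[i] = i + 100;      // path B
}

// Hypothetical host wrapper around the kernel.
std::vector<int> runBranch(int n, int threads, bool uniformPerWarp)
{
    assert(n > 0 && threads > 0);
    std::vector<int> h_out(n);
    int* d_out = nullptr;
    check(cudaMalloc((void**)&d_out, n * sizeof(int)));

    branchKernel<<<(n + threads - 1) / threads, threads>>>(d_out, n, uniformPerWarp);
    check(cudaGetLastError());

    check(cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost));
    check(cudaFree(d_out));
    return h_out;
}
```

Timing the two settings on real hardware (for example with cudaEvent timers) is left out here; the paths in this sketch are too short to show a meaningful difference.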

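94 Divergence
(Figure: performance versus divergence)

95 Compiling a CUDA program
(Figure: compilation flow. A CUDA source is split into host code (.c / .cpp) and device code (.gpu); together with the C/C++ libraries and the CUDA libraries these produce an object file and a fat binary, which are combined into one executable with embedded GPU code)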

96 CUDA Makefile
CC=nvcc
CUDA_DIR= /opt/cuda/4.1
SDK_DIR= ${CUDA_DIR}/sdk
CFLAGS= -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc
LDFLAGS= -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib
LIB= -lm -lrt
SOURCES= vector_add.cu
EXECNAME= vector_add

all:
	$(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

clean:
	$(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)
	rm -f *.o core

97 OPENACC API

98 3 Ways to Accelerate Applications
- Libraries: "drop-in" acceleration
- OpenACC directives: easily accelerate applications
- Programming languages: maximum flexibility


99 OpenACC Directives
- Simple compiler hints
- Compiler parallelizes code
- Works on many-core GPUs & multicore CPUs
Your original Fortran or C code:
    Program myscience
      ... serial code ...            ! runs on the CPU
    !$acc kernels                    ! OpenACC compiler hint
      do k = 1,n1                    ! runs on the GPU
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

100 OpenACC Open Programming Standard for Parallel Computing

101 OpenACC: The Standard for GPU Directives
- Easy: Directives are the easy path to accelerate compute intensive applications
- Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
- Powerful: GPU Directives allow complete access to the massive parallel power of a GPU

102 High-level, with low-level access
- Compiler directives to specify parallel regions in C, C++, Fortran
  - OpenACC compilers offload parallel regions from host to accelerator
  - Portable across OSes, host CPUs, accelerators, and compilers
- Create high-level heterogeneous programs
  - Without explicit accelerator initialization
  - Without explicit data or program transfers between host and accelerator
- Programming model allows programmers to start simple
  - Enhance with additional guidance for compiler on loop mappings, data location, and other performance details
- Compatible with other GPU languages and libraries
  - Interoperate between CUDA C/Fortran and GPU libraries
  - e.g. CUFFT, CUBLAS, CUSPARSE, etc.

103 OpenACC Specification and Website
- Full OpenACC 1.0 Specification available online: http://www.openacc-standard.
- Quick reference card also available
- Available compilers:
  - Caps HMPP OpenACC (soon at Romeo)
  - PGI OpenACC (soon at Romeo)

104 A Very Simple Exercise: SAXPY
SAXPY in C
    void saxpy(int n, float a, float *x, float *restrict y)
    {
    #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...
SAXPY in Fortran
    subroutine saxpy(n, a, x, y)
      real :: x(:), y(:), a
      integer :: n, i
    !$acc kernels
      do i=1,n
        y(i) = a*x(i)+y(i)
      enddo
    !$acc end kernels
    end subroutine saxpy
    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

105 Directive Syntax
- Fortran
    !$acc directive [clause [,] clause] ...]
  Often paired with a matching end directive surrounding a structured code block
    !$acc end directive
- C
    #pragma acc directive [clause [,] clause] ...]
  Often followed by a structured code block

106 CUDA = much more
- Atomic, shuffle operations
- Streams
- GPU Direct
- CUBLAS, NPP, CUFFT
- OpenCL
Credits: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign


107 References
- CUDA zone, developer zone
- Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu
- CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot
- CUDA Application Design and Development, by Rob Farber

108 Full Version Questions?

109 Matrix Multiplication Example
- Generalize adjacent_difference example
- AB = A * B
  - Each element AB_ij = dot(row(A,i), col(B,j))
- Parallelization strategy
  - Thread -> AB_ij
  - 2D kernel

110 First Implementation
    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // calculate the row & col index of the element
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        float result = 0;
        // do dot product between row of a and col of b
        for(int k = 0; k < width; ++k)
            result += a[row*width+k] * b[k*width+col];
        ab[row*width+col] = result;
    }

111 Idea: Use shared memory to reuse global data
- Each input element is read by width threads
- Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth

112 Tiled Multiply
- Partition kernel loop into phases
- Load a tile of both matrices into shared memory each phase
- Each phase, each thread computes a partial result
(Figure: both input matrices cut into tiles of TILE_WIDTH x TILE_WIDTH)

113-114 Better Implementation
    __global__ void mat_mul(float *a, float *b, float *ab, int width)
    {
        // shorthand
        int tx = threadIdx.x, ty = threadIdx.y;
        int bx = blockIdx.x, by = blockIdx.y;
        // allocate tiles in shared memory
        __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
        __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];
        // calculate the row & col index
        int row = by*blockDim.y + ty;
        int col = bx*blockDim.x + tx;
        float result = 0;
        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into shared memory
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();
            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }
        ab[row*width+col] = result;
    }

115 Use of Barriers in mat_mul
- Two barriers per phase:
  - __syncthreads after all data is loaded into shared memory
  - __syncthreads after all data is read from shared memory
  - Note that second __syncthreads in phase p guards the load in phase p+1
- Use barriers to guard data
  - Guard against using uninitialized data
  - Guard against bashing live data

116 First Order Size Considerations
- Each thread block should have many threads
  - TILE_WIDTH = 16 -> 16*16 = 256 threads
- There should be many thread blocks
  - 1024*1024 matrices -> 64*64 = 4096 thread blocks
  - TILE_WIDTH = 16 -> gives each SM 4 blocks, 1024 threads
  - Full occupancy
- In each phase, each thread block performs 2 * 256 = 512 x 4B loads for 256 * (2 * 16) = 8,192 fp ops (0.25 B/op)
  - Compare to 4B/op for the first implementation
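The bytes-per-operation figures can be checked by hand. The working below assumes 4-byte floats, counts a multiply and an add as two operations, and takes TILE_WIDTH = 16:

```latex
\begin{align*}
\text{first implementation, per loop iteration:}\quad
  & \frac{2\ \text{loads} \times 4\,\text{B}}{2\ \text{ops}} = 4\ \text{B/op} \\[4pt]
\text{tiled, per block and phase:}\quad
  & \frac{2 \times 256\ \text{loads} \times 4\,\text{B}}{256 \times (2 \times 16)\ \text{ops}}
    = \frac{2048\ \text{B}}{8192\ \text{ops}} = 0.25\ \text{B/op} \\[4pt]
\text{reduction in global traffic:}\quad
  & \frac{4\ \text{B/op}}{0.25\ \text{B/op}} = 16 = \text{TILE\_WIDTH}
\end{align*}
```

The reduction factor equals the tile width because each element staged in shared memory is reused by TILE_WIDTH threads of the block.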

117 TILE_SIZE Effects

118 Memory Resources as Limit to Parallelism
    Resource         Per GTX480 SM   Full Occupancy on GTX480
    Registers        32768           <= 32768 / 1024 threads = 32 per thread
    Shared memory    48KB            <= 48KB / 8 blocks = 6KB per block
- Effective use of different memory resources reduces the number of accesses to global memory
- These resources are finite!
- The more memory locations each thread requires -> the fewer threads an SM can accommodate
