
1 A crash course on (C-)CUDA programming for (multi-)GPU. Massimo Bernaschi, massimo.bernaschi@cnr.it, http://www.iac.cnr.it/~massimo

2 Single GPU part of the class. Code samples available from (wget) http://twin.iac.rm.cnr.it/cccsample.tgz. Slides available from (wget) http://twin.iac.rm.cnr.it/ccc.pdf. Let us start with a few basic facts.

3 A Graphics Processing Unit is not a CPU! [Diagram: a CPU devotes most of its area to control logic and cache with a few ALUs; a GPU devotes most of its area to ALUs. Each has its own DRAM.] And there is no GPU without a CPU.

4 Graphics Processing Unit. Originally dedicated to the specific operations required by video cards for accelerating graphics, GPUs have become flexible general-purpose computational engines (GPGPU): optimized for high memory throughput; very suitable for data-parallel processing. Traditionally GPU programming has been tricky, but CUDA, OpenCL and now OpenACC have made it affordable.

5 What is CUDA? CUDA=Compute Unified Device Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard C A handful of language extensions to allow heterogeneous programs Straightforward APIs to manage devices, memory, etc.

6 CUDA Programming Model. The GPU is viewed as a compute device that: has its own RAM (device memory); runs data-parallel portions of an application as kernels, using many threads. GPU vs. CPU threads: GPU threads are extremely lightweight (very little creation overhead); the GPU needs 1000s of threads for full efficiency, whereas a multi-core CPU needs only a few (basically one thread per core).

7 CUDA C Jargon: The Basics. Host: the CPU and its memory (host memory). Device: the GPU and its memory (device memory).

8 What a Programmer Expresses in CUDA. [Diagram: HOST (CPU) and DEVICE (GPU), each with processors (P) and memories (M), joined by an interconnect.] Computation partitioning (where does computation occur?): declarations on functions (__host__, __global__, __device__); mapping of thread programs to the device: compute<<<gs, bs>>>(args). Data partitioning (where does data reside, who may access it and how?): declarations on data (__shared__, __device__, __constant__). Data management and orchestration: copying to/from the host, e.g. cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost). Concurrency management: e.g. __syncthreads().

9 Code Samples. 1. Connect to the system: ssh -X XYZ@abcd.zz 2. Get the code samples: wget http://twin.iac.rm.cnr.it/ccc.tgz and wget http://twin.iac.rm.cnr.it/cccsample.tgz 3. Unpack the samples: tar zxvf cccsample.tgz 4. Go to the source directory: cd cccsample/source 5. Get the CUDA C Quick Reference: wget http://twin.iac.rm.cnr.it/cuda_c_quickref.pdf

10 Hello, World! int main( void ) { printf( "Hello, World!\n" ); return 0; } To compile: nvcc -o hello_world hello_world.cu To execute: ./hello_world This basic program is just standard C that runs on the host. NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code. At its simplest, CUDA C is just C!

11 Hello, World! with Device Code. __global__ void kernel( void ) { } int main( void ) { kernel<<<1,1>>>(); printf( "Hello, World!\n" ); return 0; } To compile: nvcc -o simple_kernel simple_kernel.cu To execute: ./simple_kernel Two notable additions to the original Hello, World!

12 Hello, World! with Device Code. __global__ void kernel( void ) { } The CUDA C keyword __global__ indicates that a function runs on the device and is called from host code. nvcc splits the source file into host and device components: NVIDIA's compiler handles device functions like kernel(); the standard host compiler (gcc, icc, Microsoft Visual C) handles host functions like main().

13 Hello, World! with Device Code. int main( void ) { kernel<<< 1, 1 >>>(); printf( "Hello, World!\n" ); return 0; } Triple angle brackets mark a call from host code to device code: a "kernel launch" in CUDA jargon. We'll discuss the parameters inside the angle brackets later. This is all that's required to execute a function on the GPU! The function kernel() does nothing, so let us run something a bit more useful:

14 A kernel to make an addition. __global__ void add(int a, int b, int *c) { *c = a + b; } Compile: nvcc -o addint addint.cu Run: ./addint Look at line 28 of the addint.cu code. How do we pass the first two arguments?

15 A More Complex Example. Another kernel to add two integers: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } As before, __global__ is a CUDA C keyword meaning: add() will execute on the device; add() will be called from the host.

16 A More Complex Example. Notice that now we use pointers for all our variables: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } add() runs on the device, so a, b, and c must point to device memory. How do we allocate memory on the GPU?

17 Memory Management. Up to CUDA 4.0, host and device memory were distinct entities from the programmer's viewpoint. Device pointers point to GPU memory: they may be passed to and from host code, but (in general) may not be dereferenced from host code. Host pointers point to CPU memory: they may be passed to and from device code, but (in general) may not be dereferenced from device code. Starting with CUDA 4.0 there is a Unified Virtual Addressing feature. Basic CUDA API for dealing with device memory: cudaMalloc(&p, size) (note: it takes a pointer to a pointer), cudaFree(p), cudaMemcpy(t, s, size, direction). Similar to their C equivalents: malloc(), free(), memcpy().

18 A More Complex Example: main()
int main( void ) {
    int a, b, c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = sizeof( int );     // we need space for an integer
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = 2;
    b = 7;
    // copy inputs to device
    cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );
    // launch add() kernel on GPU, passing parameters
    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

19 Parallel Programming in CUDA C. But wait... GPU computing is about massive parallelism. So how do we run code in parallel on the device? The solution lies in the parameters between the triple angle brackets: add<<< 1, 1 >>>( dev_a, dev_b, dev_c ); becomes add<<< N, 1 >>>( dev_a, dev_b, dev_c ); Instead of executing add() once, add() is executed N times in parallel.

20 Parallel Programming in CUDA C. With add() running in parallel, let's do vector addition. Terminology: each parallel invocation of add() is referred to as a block. A kernel can refer to its block's index with the variable blockIdx.x. Each block adds a value from a[] and b[], storing the result in c[]: __global__ void add( int *a, int *b, int *c ) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } By using blockIdx.x to index the arrays, each block handles a different index. blockIdx.x is the first example of a CUDA predefined variable.

21 Parallel Programming in CUDA C. We write this code: __global__ void add( int *a, int *b, int *c ) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } This is what runs in parallel on the device: Block 0: c[0]=a[0]+b[0]; Block 1: c[1]=a[1]+b[1]; Block 2: c[2]=a[2]+b[2]; Block 3: c[3]=a[3]+b[3];

22 Parallel Addition: main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

23 Parallel Addition: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with N parallel blocks
    add<<< N, 1 >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

24 Review. Difference between host and device: Host = CPU, Device = GPU. Using __global__ to declare a function as device code: it runs on the device and is called from the host. Passing parameters from host code to a device function.

25 Review (cont). Basic device memory management: cudaMalloc(), cudaMemcpy(), cudaFree(). Launching parallel kernels: launch N copies of add() with add<<< N, 1 >>>(); blockIdx.x gives access to each block's index. Exercise: look at, compile and run the add_simple_blocks.cu code.

26 Threads. Terminology: a block can be split into parallel threads. Let's change vector addition to use parallel threads instead of parallel blocks: __global__ void add( int *a, int *b, int *c ) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; } We use threadIdx.x instead of blockIdx.x in add(). main() will require one change as well...

27 Parallel Addition (Threads): main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

28 Parallel Addition (Threads): main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with N threads in 1 block
    add<<< 1, N >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: compile and run the add_simple_threads.cu code

29 Using Threads And Blocks. We've seen parallel vector addition using: many blocks with 1 thread apiece; 1 block with many threads. Let's adapt vector addition to use lots of both blocks and threads. After using threads and blocks together, we'll talk about why threads. First, let's discuss data indexing.

30 Indexing Arrays With Threads & Blocks. No longer as simple as just using threadIdx.x or blockIdx.x as indices. [Diagram: an array indexed with 1 thread per entry, using 8 threads/block; threadIdx.x runs 0..7 within each of blockIdx.x = 0, 1, 2, 3.] If we have M threads/block, a unique array index for each entry is given by: int index = threadIdx.x + blockIdx.x * M; (analogous to int index = x + y * width;)

31 Indexing Arrays: Example. In this example, the red entry would have an index of 21: with M = 8 threads/block and blockIdx.x = 2, int index = threadIdx.x + blockIdx.x * M = 5 + 2 * 8 = 21;

32 Indexing Arrays: other examples (4 blocks with 4 threads per block)

33 Addition with Threads and Blocks. blockDim.x is a built-in variable giving the number of threads per block: int index = threadIdx.x + blockIdx.x * blockDim.x; gridDim.x is a built-in variable giving the number of blocks in a grid. A combined version of our vector addition kernel that uses blocks and threads: __global__ void add( int *a, int *b, int *c ) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; } So what changes in main() when we use both blocks and threads?

34 Parallel Addition (Blocks/Threads): main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

35 Parallel Addition (Blocks/Threads): main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with blocks and threads
    add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: compile and run the add_simple.cu code

36 Exercise: array reversal. Start from the arrayreversal.cu code. The code must fill an array d_out with the contents of an input array d_in in reverse order, so that if d_in is [100, 110, 200, 220, 300] then d_out must be [300, 220, 200, 110, 100]. You must complete lines 13, 14 and 34. Remember that: blockDim.x is the number of threads per block; gridDim.x is the number of blocks in the grid.

37 Array Reversal: Solution
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    int out = (blockDim.x * gridDim.x - 1) - in;
    d_out[out] = d_in[in];
}

38 Why Bother With Threads? Threads seem unnecessary: they add a level of abstraction and complexity. What did we gain? Unlike parallel blocks, parallel threads have mechanisms to communicate and synchronize. Let's see how.

39 Dot Product. Unlike vector addition, the dot product is a reduction from vectors to a scalar: c = a · b = (a0, a1, a2, a3) · (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3

40 Dot Product. Parallel threads have no problem computing the pairwise products, so we can start a dot product CUDA kernel by doing just that: __global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x]; ...

41 Dot Product. But we need to share data between threads to compute the final sum: __global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x]; // Can't compute the final sum here: each thread's copy of temp is private!!! }

42 Sharing Data Between Threads. Terminology: a block of threads shares memory called shared memory. Extremely fast, on-chip memory (user-managed cache). Declared with the __shared__ CUDA keyword. Not visible to threads in other blocks running in parallel. [Diagram: Block 0, Block 1, Block 2, each with its own threads and its own shared memory.]

43 Parallel Dot Product: dot(). We perform parallel multiplication, serial addition:
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

44 Parallel Dot Product Recap. We perform parallel, pairwise multiplications. Shared memory stores each thread's result. We sum these pairwise products from a single thread. Sounds good... but... Exercise: compile and run dot_simple_threads.cu. Does it work as expected?

45 Faulty Dot Product Exposed! Step 1: in parallel, each thread writes its pairwise product into __shared__ int temp[]. Step 2: thread 0 reads and sums the products from temp[]. But there's an assumption hidden in Step 1...

46 Read-Before-Write Hazard. Suppose thread 0 finishes its write in Step 1, then thread 0 reads index 12 in Step 2, before thread 12 has written to index 12 in Step 1. This read returns garbage!

47 Synchronization. We need threads to wait between the sections of dot():
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    // * NEED THREADS TO SYNCHRONIZE HERE *
    // No thread can advance until all threads
    // have reached this point in the code
    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

48 __syncthreads(). We can synchronize threads with the function __syncthreads(). Threads in the block wait until all threads have hit the __syncthreads(). [Diagram: Thread 0 .. Thread 4 each reach __syncthreads() at different times; none proceeds until all have arrived.] Threads are only synchronized within a block!

49 Parallel Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}
With a properly synchronized dot() routine, let's look at main()

50 Parallel Dot Product: main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );
    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );
    random_ints( a, N );
    random_ints( b, N );

51 Parallel Dot Product: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch dot() kernel with 1 block and N threads
    dot<<< 1, N >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: insert __syncthreads() in the dot kernel. Compile and run.

52 Review. Launching kernels with parallel threads: launch add() with N threads: add<<< 1, N >>>(); use threadIdx.x to access each thread's index. Using both blocks and threads: use (threadIdx.x + blockIdx.x * blockDim.x) to index input/output; N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads total.

53 Review (cont). Using __shared__ to declare memory as shared memory: data is shared among threads in a block and not visible to threads in other parallel blocks. Using __syncthreads() as a barrier: no thread executes instructions after the __syncthreads() until all threads have reached it. Needed to prevent data hazards.

54 Multiblock Dot Product. Recall our dot product launch: // launch dot() kernel with 1 block and N threads dot<<< 1, N >>>( dev_a, dev_b, dev_c ); Launching with one block will not utilize much of the GPU. Let's write a multiblock version of the dot product.

55 Multiblock Dot Product: Algorithm Each block computes a sum of its pairwise products like before:

56 Multiblock Dot Product: Algorithm And then contributes its sum to the final result:

57 Multiblock Dot Product: dot()
#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        *c += sum;   // race condition! fix: atomicAdd( c, sum );
    }
}
But we have a race condition... Compile and run dot_simple_multiblock.cu. We can fix it with one of CUDA's atomic operations: use atomicAdd in the dot kernel. Compile with nvcc -o dot_simple_multiblock dot_simple_multiblock.cu -arch=sm_35

58 Race Conditions. Terminology: a race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences. What actually takes place to execute the line in question, *c += sum;: read the value at address c, add sum to the value, write the result to address c. Terminology: Read-Modify-Write. What if two threads try to do this at the same time? Thread 0 of Block 0 and Thread 0 of Block 1 each read the value at address c, add sum to it, and write the result back to address c.

59 Global Memory Contention (Read-Modify-Write). [Diagram: *c starts at 0. Block 0 (sum = 3) reads 0, computes 0+3, writes 3. Block 1 (sum = 4) then reads 3, computes 3+4, writes 7. Correct result: 7.]

60 Global Memory Contention (Read-Modify-Write). [Diagram: *c starts at 0. Block 0 (sum = 3) reads 0 and Block 1 (sum = 4) also reads 0 before Block 0 writes. Block 0 writes 3, then Block 1 writes 0+4 = 4. Wrong result: 4 instead of 7.]

61 Atomic Operations. Terminology: a read-modify-write operation is uninterruptible when atomic. Many atomic operations on memory are available in CUDA C: atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicExch(), atomicCAS() (old == compare ? val : old). They give a predictable result when simultaneous access to memory is required. We need to atomically add sum to c in our multiblock dot product.
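As an aside, atomicCAS() is the building block for atomics the hardware does not provide directly. The sketch below is the well-known compare-and-swap retry loop (essentially the pattern shown in the CUDA C Programming Guide) that emulates atomicAdd() on double values on GPUs lacking a native double-precision atomicAdd; the helper name atomicAddDouble is ours.
__device__ double atomicAddDouble(double *address, double val)
{
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (val + old value); retry if another thread changed it meanwhile
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);   // previous value, like atomicAdd
}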

62 Multiblock Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        atomicAdd( c, sum );
    }
}
Now let's fix up main() to handle a multiblock dot product

63 Parallel Dot Product: main()
#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N ints
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );
    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );
    random_ints( a, N );
    random_ints( b, N );

64 Parallel Dot Product: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch dot() kernel
    dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

65 Review. Race conditions: behavior depends upon the relative timing of multiple event sequences; they can occur when an implied read-modify-write is interruptible. Atomic operations: CUDA provides read-modify-write operations guaranteed to be atomic; atomics ensure correct results when multiple threads modify memory; to use atomic operations, a suitable -arch= option must be used at compile time.

66 Eratosthenes' Sieve. A sieve has holes in it and is used to filter out the juice. Eratosthenes' sieve filters out numbers to find the prime numbers.

67 Definition: Prime Number — a number that has only two factors, itself and 1. For example, 7 is prime because the only numbers that divide into it evenly are 1 and 7.

68 The Eratosthenes Sieve

69 The prime numbers from 1 to 120 are the following: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113. Download (wget) http://twin.iac.rm.cnr.it/eratostenepr.cu
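For reference, a minimal CPU sketch of the sieve (plain C, not the eratostenepr.cu code itself): it marks the multiples of each surviving number as composite and prints the primes up to 120.
#include <stdio.h>
#include <string.h>

#define LIMIT 120

int main(void)
{
    char composite[LIMIT + 1];
    memset(composite, 0, sizeof(composite));     // initially assume every number is prime
    for (int i = 2; i * i <= LIMIT; i++) {
        if (!composite[i]) {
            for (int j = i * i; j <= LIMIT; j += i)
                composite[j] = 1;                 // j has divisor i, so it is not prime
        }
    }
    for (int i = 2; i <= LIMIT; i++)
        if (!composite[i]) printf("%d ", i);
    printf("\n");
    return 0;
}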

70 CUDA Thread Organization: Grids and Blocks. A kernel is executed as a 1D, 2D or 3D grid of thread blocks. All threads share the global memory. A thread block is a 1D, 2D or 3D batch of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses); efficiently sharing data through a low-latency shared memory. Thread blocks are independent of each other and can be executed in any order! [Diagram: the host launches Kernel 1 and Kernel 2; each grid is a 2D arrangement of blocks and each block a 2D arrangement of threads. Courtesy: NVIDIA]

71 Built-in Variables to manage grids and blocks. dim3: a datatype defined by CUDA: struct dim3 { unsigned int x, y, z }; three unsigned ints where any unspecified component defaults to 1. dim3 gridDim: dimensions of the grid in blocks. dim3 blockDim: dimensions of the block in threads. dim3 blockIdx: block index within the grid. dim3 threadIdx: thread index within the block.

72 Bi-dimensional thread configuration by example: set the elements of a square matrix (assume the matrix is a single block of memory!)
__global__ void kernel( int *a, int dimx, int dimy ) {
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = idx+1;
}
int main() {
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers
    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;
    kernel<<<grid, block>>>( d_a, dimx, dimy );
    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);
    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }
    free( h_a );
    cudaFree( d_a );
    return 0;
}
Exercise: compile and run setmatrix.cu

73 Matrix multiply with one thread block. One block of threads computes the matrix dp: each thread computes one element of dp. Each thread loads a row of matrix dm and a column of matrix dn, and performs one multiply and one addition for each pair of dm and dn elements. The size of the matrix is limited by the number of threads allowed in a thread block! [Diagram: a single block; thread (2,2) computes one element of dp from a row of dm and a column of dn of size WIDTH.]

74 Exercise: matrix-matrix multiply.
__global__ void MatrixMulKernel(DATA* dm, DATA* dn, DATA* dp, int Width) {
    DATA Pvalue = 0.0;
    for (int k = 0; k < Width; ++k) {
        DATA Melement = dm[     ];
        DATA Nelement = dn[     ];
        Pvalue += Melement * Nelement;
    }
    dp[     ] = Pvalue;
}
Start from MMGlob.cu. Complete the function MatrixMulOnDevice and the kernel MatrixMulKernel. Use one bi-dimensional block. Threads will have an x (threadIdx.x) and a y (threadIdx.y) identifier. Max size of the matrix: 16. Compile: nvcc -o MMGlob MMGlob.cu Run: ./MMGlob

75 CUDA Compiler: nvcc basic options. -arch=sm_13 (or sm_20, sm_35): enable double precision (on compatible hardware). -G: enable debug for device code. --ptxas-options=-v: show register and memory usage. --maxrregcount <N>: limit the number of registers. -use_fast_math: use the fast math library. -O3: enable compiler optimization. -ccbin compiler_path: use a different C compiler (e.g., Intel instead of gcc). Exercises: 1. Re-compile the MMGlob.cu program and check how many registers the kernel uses: nvcc -o MMGlob MMGlob.cu --ptxas-options=-v 2. Edit MMGlob.cu and change DATA from float to double, then compile for double precision and run.

76 CUDA Error Reporting to the CPU. All CUDA calls return an error code (except kernel launches): type cudaError_t. cudaError_t cudaGetLastError(void) returns the code for the last error (cudaSuccess means "no error"). char* cudaGetErrorString(cudaError_t code) returns a null-terminated character string describing the error. Example: printf("%s\n", cudaGetErrorString(cudaGetLastError()));
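A common convenience, sketched here rather than taken from the course samples, is to wrap runtime calls in a checking macro and to check kernel launches explicitly with cudaGetLastError():
#include <stdio.h>
#include <stdlib.h>

// Minimal error-checking helper: wrap CUDA runtime calls with CUDA_CHECK(...)
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage sketch:
//   CUDA_CHECK( cudaMalloc((void**)&dev_a, size) );
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK( cudaGetLastError() );        // catches launch-configuration errors
//   CUDA_CHECK( cudaDeviceSynchronize() );   // catches errors raised during execution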

77 CUDA Event API. Events are inserted (recorded) into CUDA call streams. Usage scenarios: measure the elapsed time (in milliseconds) of CUDA calls (resolution ~0.5 microseconds); query the status of an asynchronous CUDA call; block the CPU until CUDA calls prior to the event have completed.
cudaEvent_t start, stop;
float et;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Kernel execution time=%f\n", et);
Read the function MatrixMulOnDevice of the MMGlobWT.cu program. Compile and run.

78 Execution Model (Software vs. Hardware). Threads are executed by scalar processors. Thread blocks are executed on multiprocessors; thread blocks do not migrate. Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file). A kernel is launched as a grid of thread blocks on the device.

79 Transparent Scalability. The hardware is free to assign blocks to any multiprocessor at any time. A kernel scales across any number of parallel processors. [Diagram: the same kernel grid of blocks 0-7 runs two blocks at a time on a small device and four at a time on a larger one.] Each block can execute in any order relative to the other blocks!

80 Warps and Half-Warps. A thread block consists of 32-thread warps. A warp is executed physically in parallel (SIMD) on a multiprocessor. A half-warp of 16 threads can coordinate global memory accesses into a single transaction. [Diagram: a thread block split into 32-thread warps on a multiprocessor; device memory with global and local regions.]

81 Executing Thread Blocks. The number of threads in a block depends on the capability (i.e., the version) of the GPU: on a Fermi or Kepler GPU a block may have up to 1024 threads. Threads are assigned to Streaming Multiprocessors (SM) at block granularity: up to 16 blocks per SM on Kepler, 8 blocks per SM on Fermi. A Fermi SM can take up to 1536 threads: this could be 256 threads/block * 6 blocks, or 192 threads/block * 8 blocks, but, for instance, 128 threads/block * 12 blocks is not allowed! A Kepler SM can take up to 2048 threads. Threads run concurrently; the SM manages/schedules thread execution.

82 Block Granularity Considerations. For a kernel running on Fermi and using multiple blocks, should we use 8x8, 16x16 or 32x32 blocks? For 8x8, we have 64 threads per block. Since each Fermi SM can take up to 1536 threads, that would be 24 blocks; however, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! For 32x32, we have 1024 threads per block: only one block can fit into an SM! For 16x16, we have 256 threads per block. Since each Fermi SM can take up to 1536 threads, it can take up to 6 blocks and achieve full capacity, unless other resource considerations overrule. Repeat the analysis for a Kepler card!

83 Programmer View of the Register File. There are up to 32768 (32-bit) registers in each Fermi SM and up to 65536 (32-bit) registers in each Kepler SM. This is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks: each thread in the same block only accesses the registers assigned to itself.

84 Register occupancy example. If a block has 16x16 threads and each thread uses 30 registers, how many blocks can run on each Fermi SM? Each block requires 30*256 = 7680 registers; 32768/7680 = 4 plus change (2048), so 4 blocks can run on an SM as far as registers are concerned. What if each thread increases its register use by 10%? Each block now requires 33*256 = 8448 registers; 32768/8448 = 3 plus change (7424), so only three blocks can run on an SM: a 25% reduction of parallelism! Repeat the analysis for a Kepler card!

85 Optimizing threads per block. Choose threads per block as a multiple of the warp size (32): avoid wasting computation on under-populated warps and facilitate efficient memory access (coalescing). Run as many warps as possible per multiprocessor (to hide latency); an SM can run up to 16 blocks at a time (on Kepler). Heuristics: minimum 64 threads per block; 192 or 256 threads is usually a better choice (usually still enough registers to compile and invoke successfully). The right tradeoff depends on your computation, so experiment, experiment, experiment!

86 Review of the Memory Hierarchy. Local memory: per-thread, private to each thread but slow! (auto variables, register spills). Shared memory: per-block, shared by the threads of the same block, fast inter-thread communication. Global memory: per-application, shared by all threads, used for inter-grid communication (sequential grids in time).

87 Optimizing Global Memory Access. Memory bandwidth is very high (~150 GB/s), but latency is also very high (hundreds of cycles)! Local memory has the same latency. Reduce the number of memory transactions by applying a proper memory access pattern: array of structures (data locality!) vs. structure of arrays (thread locality!). Thread locality is better than data locality on a GPU!
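As an illustrative sketch of the two layouts (the type and array names are ours): in the array-of-structures form, consecutive threads read addresses that are a whole struct apart, while in the structure-of-arrays form thread i and thread i+1 read adjacent words, which is what coalescing favours.
#define NPART 1024

/* Array of structures: the fields of one particle are contiguous (data locality). */
struct ParticleAoS { float x, y, z, w; };
struct ParticleAoS particles_aos[NPART];

/* Structure of arrays: the same field of consecutive particles is contiguous
   (thread locality): thread i reads x[i], thread i+1 reads x[i+1]. */
struct ParticlesSoA {
    float x[NPART];
    float y[NPART];
    float z[NPART];
    float w[NPART];
};
struct ParticlesSoA particles_soa;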

88 Global Memory Operations. Two types of loads: Caching (the default mode): attempts to hit in L1, then L2, then GMEM; load granularity is a 128-byte line. Non-caching: compile with the -Xptxas -dlcm=cg option to nvcc; attempts to hit in L2, then GMEM (does not hit in L1, and invalidates the line if it is in L1 already); load granularity is 32 bytes. Stores: invalidate L1, write-back to L2.

89 Caching Load. The warp requests 32 aligned, consecutive 4-byte words. The addresses fall within 1 cache line. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

90 Non-caching Load. The warp requests 32 aligned, consecutive 4-byte words. The addresses fall within 4 segments. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

91 Caching Load. The warp requests 32 aligned, permuted 4-byte words. The addresses fall within 1 cache line. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

92 Non-caching Load. The warp requests 32 aligned, permuted 4-byte words. The addresses fall within 4 segments. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

93 Caching Load. The warp requests 32 misaligned, consecutive 4-byte words. The addresses fall within 2 cache lines. The warp needs 128 bytes; 256 bytes move across the bus on misses. Bus utilization: 50%.

94 Non-caching Load. The warp requests 32 misaligned, consecutive 4-byte words. The addresses fall within at most 5 segments. The warp needs 128 bytes; 160 bytes move across the bus on misses. Bus utilization: at least 80% (some misaligned patterns fall within 4 segments, giving 100% utilization).

95 Caching Load. All threads in a warp request the same 4-byte word. The addresses fall within a single cache line. The warp needs 4 bytes; 128 bytes move across the bus on a miss. Bus utilization: 3.125%.

96 Non-caching Load. All threads in a warp request the same 4-byte word. The addresses fall within a single segment. The warp needs 4 bytes; 32 bytes move across the bus on a miss. Bus utilization: 12.5%.

97 Caching Load. The warp requests 32 scattered 4-byte words. The addresses fall within N cache lines. The warp needs 128 bytes; N*128 bytes move across the bus on a miss. Bus utilization: 128 / (N*128).

98 Non-caching Load. The warp requests 32 scattered 4-byte words. The addresses fall within N segments. The warp needs 128 bytes; N*32 bytes move across the bus on a miss. Bus utilization: 128 / (N*32).

99 Shared Memory. On-chip memory: 2 orders of magnitude lower latency than global memory, and much higher bandwidth. Up to 48 KB per multiprocessor (on Fermi and Kepler). Allocated per thread-block; accessible by any thread in the thread-block; not accessible to other thread-blocks. Several uses: sharing data among threads in a thread-block; user-managed cache (reducing global memory accesses). Example declaration: __shared__ float As[16][16];

100 Shared Memory Example: compute adjacent_difference. Given an array d_in, produce a d_out array containing the differences: d_out[0] = d_in[0]; d_out[i] = d_in[i] - d_in[i-1]; /* i > 0 */ Exercise: complete the shared_variables.cu code.

101 N-Body problems. An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. The all-pairs approach to N-body simulation is a brute-force technique that evaluates all pair-wise interactions among the N bodies.

102 N-Body problems. We use the gravitational potential to illustrate the basic form of computation in an all-pairs N-body simulation.

103 N-Body problems. To integrate over time, we need the acceleration a_i = F_i / m_i to update the position and velocity of body i. Download http://twin.iac.rm.cnr.it/mini_nbody.tgz and unpack it: tar zxvf mini_nbody.tgz; then look at nbody.c

104 N-Body: main loop
void bodyForce(Body *p, float dt, int n) {
    for (int i = 0; i < n; i++) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = 1.0f / sqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        p[i].vx += dt*Fx;
        p[i].vy += dt*Fy;
        p[i].vz += dt*Fz;
    }
}

105 N-Body: parallel processing. We may think of the all-pairs algorithm as calculating each entry f_ij in an NxN grid of all pair-wise forces. Each entry can be computed independently, so there is O(N^2) available parallelism. We may assign each body to a thread. If the block size is, say, 256, then the number of blocks is (nBodies + 255) / 256.

106 N-Body: a first CUDA code
__global__ void bodyForce(Body *p, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = rsqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        p[i].vx += dt*Fx;
        p[i].vy += dt*Fy;
        p[i].vz += dt*Fz;
    }
}

107 N-Body: a first CUDA code. Do we expect this code to be efficient? All threads read all data from global memory. It is better to serialize some of the computations to achieve the data reuse needed to reach the peak performance of the arithmetic units and to reduce the required memory bandwidth. To that purpose, let us introduce the notion of a computational tile.

108 N-Body: computational tile. A tile is a square region of the grid of pair-wise forces consisting of p rows and p columns. Only 2p body descriptions are required to evaluate all p^2 interactions in the tile (p of which can be reused later). These body descriptions can be stored in shared memory or in registers.

109 N-Body: computational tile. Each thread in a block loads one body; then all threads in the block compute the interactions with the bodies that have been loaded.

110 N-Body: using Shared Memory (1)
__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int tile = 0; tile < gridDim.x; tile++) {
            __shared__ float3 spos[BLOCK_SIZE];
            float4 tpos = p[tile * blockDim.x + threadIdx.x];
            spos[threadIdx.x] = make_float3(tpos.x, tpos.y, tpos.z);
            __syncthreads();
            for (int j = 0; j < BLOCK_SIZE; j++) {
                ...
            }
            __syncthreads();
        }
        v[i].x += dt*Fx;
        v[i].y += dt*Fy;
        v[i].z += dt*Fz;
    }
}

111 N-Body: using Shared Memory (2)
__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        ...
        for (int j = 0; j < BLOCK_SIZE; j++) {
            float dx = spos[j].x - p[i].x;
            float dy = spos[j].y - p[i].y;
            float dz = spos[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = rsqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        ...
    }
}

112 N-Body: further optimizations. Tuning of the block size. Unrolling of the inner loop: for (int j = 0; j < BLOCK_SIZE; j++) { ... } Test: 1. shmoo-cpu-nbody.sh 2. cuda/shmoo-cuda-nbody-orig.sh 3. cuda/shmoo-cuda-nbody-block.sh 4. cuda/shmoo-cuda-nbody-unroll.sh

113 Shared Memory Organization: 32 banks, 4-byte wide. Successive 4-byte words belong to different banks. Performance: 4 bytes per bank per 2 clocks per multiprocessor. Shared memory accesses are issued per 32 threads (warp). Serialization: if N threads out of 32 access different 4-byte words in the same bank, the N accesses are executed serially. Multicast: N threads accessing the same word are served in one fetch (they could even read different bytes within the same word).

114 Bank Addressing Examples: No Bank Conflicts. [Diagram: left, each thread accesses a different bank in order (thread i -> bank i); right, threads access different banks in a permuted order. Both patterns are conflict-free.]

115 Bank Addressing Examples: Bank Conflicts. [Diagram: left, a stride-2 access pattern causes 2-way bank conflicts; right, a stride-8 (x8) pattern causes 8-way bank conflicts.]

116 Shared Memory: Avoiding Bank Conflicts. A 32x32 shared memory array: when a warp accesses a column, there are 32-way bank conflicts (all threads in the warp access the same bank).

117 Shared Memory: Avoiding Bank Conflicts. Add a column for padding: a 32x33 shared memory array. When a warp accesses a column it now touches 32 different banks: no bank conflicts.
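A minimal sketch of the padding trick, using the classic shared-memory matrix transpose (assumptions: a square matrix whose side is a multiple of TILE_DIM, and a TILE_DIM x TILE_DIM thread block):
#define TILE_DIM 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    // The extra column avoids 32-way bank conflicts when a warp reads a column
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced, row-wise write
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read, conflict-free
}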

118 Tree-Based Parallel Reductions in Shared Memory. CUDA addresses this by exposing on-chip shared memory: reduce blocks of data in shared memory to save bandwidth.

119 Parallel Reduction: Interleaved Addressing. [Diagram: values in shared memory are combined in steps with stride 1, 2, 4, 8.] Arbitrarily bad bank conflicts; requires barriers if N > warpSize; supports non-commutative operators. Interleaved addressing results in bank conflicts.

120 Parallel Reduction: Sequential Addressing. [Diagram: values in shared memory are combined in steps with stride 8, 4, 2, 1.] No bank conflicts; requires barriers if N > warpSize; does not support non-commutative operators. Sequential addressing is conflict-free.
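A minimal sketch of a sequential-addressing reduction kernel (one partial sum per block; it assumes the block size is a power of two and the input length is a multiple of it):
#define BLOCK_SIZE 512

// Each block reduces BLOCK_SIZE elements of g_in into one partial sum in g_out.
__global__ void reduceSum(const int *g_in, int *g_out)
{
    __shared__ int sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    // Sequential addressing: the stride halves each step, threads 0..s-1 stay active,
    // and a warp always touches 32 different banks (no conflicts).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];   // write the block's partial sum
}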

121 Parallel Prefix Sum (Scan). Given an array A = [a0, a1, ..., an-1] and a binary associative operator ⊕ with identity I, scan(A) = [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]. Example: if ⊕ is addition, then the scan of an input array returns the running sums of its prefixes (see the worked example below).
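The worked example referred to above, as a sequential reference implementation in plain C (useful for checking any GPU scan); the input values are just an illustration:
#include <stdio.h>

/* Exclusive scan (prefix sum) with + as the operator and 0 as the identity:
   out[0] = 0, out[i] = in[0] + ... + in[i-1]. Sequential reference version. */
void exclusiveScan(const int *in, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

int main(void)
{
    int in[8] = { 3, 1, 7, 0, 4, 1, 6, 3 };
    int out[8];
    exclusiveScan(in, out, 8);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  /* prints: 0 3 4 11 11 15 16 22 */
    printf("\n");
    return 0;
}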

122 O(n log n) Scan. Step efficient (log n steps), but not work efficient (n log n work). Requires barriers at each step (WAR dependencies).

123 Constant Memory. Both scalar and array values can be stored. Constants are stored in DRAM and cached on chip (per SM). A constant value can be broadcast to all threads in a warp: an extremely efficient way of accessing a value that is common to all threads in a block! Example: cudaMemcpyToSymbol(dS, &hS, sizeof(int), 0, cudaMemcpyHostToDevice);
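A minimal sketch of how the symbol in that call might be declared and used (the names dS, hS and the kernel are illustrative):
__constant__ int dS;   // device constant: cached and broadcast to all threads of a warp

__global__ void scaleByConstant(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= dS;   // every thread reads the same constant value
}

// Host side (sketch):
//   int hS = 3;
//   cudaMemcpyToSymbol(dS, &hS, sizeof(int), 0, cudaMemcpyHostToDevice);
//   scaleByConstant<<<(n + 255) / 256, 256>>>(d_a, n);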

124 Texture Memory (and co-stars). Another type of memory system, featuring: spatially-cached read-only access; no coalescing worries; interpolation; (other) fixed-function capabilities; graphics interoperability.

125 Traditional Caching. When reading, nearby elements are cached (i.e. a cache line). Memory is linear! Applies to the CPU, the GPU L1/L2 cache, etc.

126 Texture-Memory Caching. Can cache spatially! (2D, 3D). Dimensions (1D, 2D, 3D) are specified on creation. 1D applications: interpolation, clamping (later); caching when, e.g., coalesced access is infeasible.

127 Texture Memory. Memory is just "an unshaped bucket of bits" (CUDA Handbook). A texture reference is needed in order to: interpret the data; deliver it to registers.

128 Interpolation. Can read "between the lines"! Download http://twin.iac.rm.cnr.it/readtexels.cu (see also http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html)

129 Clamping. Seamlessly handles reads beyond the region! (see http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html)

130 CUDA Arrays. So far, we've used standard linear arrays. CUDA arrays: different addressing calculation; contiguous addresses have 2D/3D locality! Not pointer-addressable; designed specifically for texturing.

131 Texture Memory. A texture reference can be attached to: an ordinary device-memory array; a CUDA array. Many more capabilities...

132 Texturing Example (2D)

133 Texturing Example (2D)

134 Texturing Example (2D)

135 Texturing Example (2D)

136 Arithmetic Instruction Throughput. [Table: number of operations per clock cycle per multiprocessor, by operation type and compute capability.]

137 Control Flow Instructions. The main performance concern with branching is divergence: threads within a single warp (group of 32) take different paths, and the different execution paths are serialized; the control paths taken by the threads in a warp are traversed one at a time until there are no more. Avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 7) { ... } creates two different control paths for the threads in a block; the branch granularity is smaller than the warp size: threads 0, 1, ..., 7 follow a different path than threads 8, 9, ..., 31 of the same warp. Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { ... } also creates two different control paths, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path.

138 Warp Vote Functions and other useful primitives. int __all(predicate): returns true iff the predicate evaluates as true for all threads of the warp. int __any(predicate): returns true iff the predicate evaluates as true for any thread of the warp. unsigned int __ballot(predicate): returns an unsigned int whose Nth bit is set iff the predicate evaluates as true for the Nth thread of the warp. int __popc(int x): POPulation Count, returns the number of bits set to 1 in the 32-bit integer x. int __clz(int x): Count Leading Zeros, returns the number of consecutive zero bits beginning at the most significant bit of the 32-bit integer x.
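A small sketch combining two of these primitives: counting, per warp, how many lanes hold a positive value (pre-CUDA 9 non-_sync intrinsics, matching the ones listed above; the launch is assumed to use exactly n threads, with n a multiple of 32, and warpCounts has n/32 entries):
__global__ void countPositivePerWarp(const int *in, int *warpCounts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // One bit per lane of this warp: bit set iff that lane's element is positive
    unsigned int mask = __ballot(in[i] > 0);
    if (threadIdx.x % 32 == 0)
        warpCounts[i / 32] = __popc(mask);   // lane 0 stores the warp's count
}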

139 Shuffle Instruction. An instruction to exchange data within a warp: threads can read other threads' registers; no shared memory is needed. Available starting from SM 3.0: Kepler's shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp.

140 Shuffle Instruction. 4 variants: indexed (__shfl), up (__shfl_up), down (__shfl_down), butterfly (__shfl_xor).

141 A simple shuffle example (1). int __shfl_down(int var, unsigned int delta, int width=warpSize); __shfl_down() calculates a source lane ID by adding delta to the caller's lane ID (the lane ID is a thread's index within its warp, from 0 to 31; it can be computed as threadIdx.x % 32). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.

142 A simple shuffle example (2). int i = threadIdx.x % 32; int j = __shfl_down(i, 2, 8);

143 Using shuffle for a reduce operation.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = warpSize/2; offset > 0; offset /= 2) {
        val += __shfl_down(val, offset);
    }
    return val;
}

144 Using shuffle for an all-reduce operation. If all threads in the warp need the result, it is possible to implement an all-reduce within a warp using __shfl_xor() instead of __shfl_down(), like the following:
__inline__ __device__ int warpAllReduceSum(int val) {
    for (int mask = warpSize/2; mask > 0; mask /= 2) {
        val += __shfl_xor(val, mask);
    }
    return val;
}

145 Block reduce using shuffle. First reduce within warps. Then the first thread of each warp writes its partial sum to shared memory. Finally, after synchronizing, the first warp reads from shared memory and reduces again.
__inline__ __device__ int blockReduceSum(int val) {
    static __shared__ int shared[32];   // shared mem for 32 partial sums
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;
    val = warpReduceSum(val);           // each warp performs a partial reduction
    if (lane==0) shared[wid] = val;     // write the reduced value to shared memory
    __syncthreads();                    // wait for all partial reductions
    // read from shared memory only if that warp existed
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid==0) val = warpReduceSum(val); // final reduce within the first warp
    return val;
}

146 Reducing Large Arrays. Reducing across blocks means global communication, which means synchronizing the grid. Two possible approaches: break the computation into two separate kernel launches, or use atomic operations. Exercise (homework!): write a deviceReduce that works across a complete grid (many thread blocks).

147 Implementing SHFL for 64-bit numbers. Generic SHFL: https://github.com/bryancatanzaro/generics

148 Optimized Filtering. Let us have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. If the predicate is src[i] > 0, a simple CPU solution is:
int filter(int *dst, const int *src, int n) {
    int nres = 0;
    for (int i = 0; i < n; i++)
        if (src[i] > 0)
            dst[nres++] = src[i];
    // return the number of elements copied
    return nres;
}

149 Optimized Filtering on GPU: global memory. A straightforward approach is to use a single global counter and atomically increment it for each new element written into the dst array:
__global__ void filter_k(int *dst, int *nres, const int *src, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] > 0)
        dst[atomicAdd(nres, 1)] = src[i];
}
Here the atomic operation is clearly a bottleneck and needs to be removed or reduced to increase performance.

150 Optimized Filtering on GPU: using shared memory
__global__ void filter_shared_k(int *dst, int *nres, const int *src, int n) {
    __shared__ int l_n;
    int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;
    for (int iter = 0; iter < NPER_THREAD; iter++) {
        // zero the counter
        if (threadIdx.x == 0)
            l_n = 0;
        __syncthreads();
        // get the value, evaluate the predicate, and
        // increment the counter if needed
        int d, pos;
        if (i < n) {
            d = src[i];
            if (d > 0)
                pos = atomicAdd(&l_n, 1);
        }
        __syncthreads();
        // leader increments the global counter
        if (threadIdx.x == 0)
            l_n = atomicAdd(nres, l_n);
        __syncthreads();
        // threads with true predicates write their elements
        if (i < n && d > 0) {
            pos += l_n;   // increment local pos by global counter
            dst[pos] = d;
        }
        __syncthreads();
        i += BS;
    }
}

151 Optimized Filtering on GPU: warp-aggregated atomics. Warp aggregation is the process of combining the atomic operations from multiple threads in a warp into a single atomic. 1. Threads in the warp elect a leader thread. 2. Threads in the warp compute the total atomic increment for the warp. 3. The leader thread performs an atomic add to compute the offset for the warp. 4. The leader thread broadcasts the offset to all other threads in the warp. 5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.

152 Step 1: elect a leader. The thread identifier within a warp can be computed as:
#define WARP_SZ 32
__device__ inline int lane_id(void) { return threadIdx.x % WARP_SZ; }
A reasonable choice is to select as leader the lowest active lane:
int mask = __ballot(1);        // mask of active lanes
int leader = __ffs(mask) - 1;  // -1 for 0-based indexing
__ffs(int v) returns the 1-based position of the lowest bit that is set in v, or 0 if no bit is set (ffs stands for "find first set").

153 Step 2: compute the total increment. The total increment for the warp is equal to the number of active lanes holding true predicates. The value can be obtained with the __popc intrinsic, which returns the number of bits set in the binary representation of its argument: int change = __popc(mask);

154 Step 3: perform the atomic. Only the leader lane performs the atomic operation:
int res;
if (lane_id() == leader)
    res = atomicAdd(ctr, change); // ctr is the pointer to the counter

155 Step 4: broadcast the result. We can implement the broadcast efficiently using the shuffle intrinsic:
__device__ int warp_bcast(int v, int leader) { return __shfl(v, leader); }

156 Step 5: compute the index for each lane. This is a bit tricky: we must compute the output index for each lane by adding the broadcast counter value for the warp to the lane's intra-warp offset. The intra-warp offset is the number of active lanes with IDs lower than the current lane ID. We can compute it by applying a mask that selects only the lower lane IDs and then applying the __popc() intrinsic to the result: res += __popc(mask & ((1 << lane_id()) - 1));

157 Final code
// warp-aggregated atomic increment
__device__ int atomicAddInc(int *ctr) {
    int mask = __ballot(1);
    // select the leader
    int leader = __ffs(mask) - 1;
    // leader does the update
    int res;
    if (lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));
    // broadcast result
    res = warp_bcast(res, leader);
    // each thread computes its own value
    return res + __popc(mask & ((1 << lane_id()) - 1));
} // atomicAddInc
Download http://twin.iac.rm.cnr.it/filter.cu

158 Exercise. Write a shuffle-based kernel for the N-body problem. Hint: start from the nbody-unroll.cu code.

159 CUDA Dynamic Parallelism. A facility introduced by NVIDIA in their GK110 chip/architecture. It allows a kernel to call another kernel from within it, without returning to the host. Each such kernel call has the same calling construction: the grid and block sizes and dimensions are set at the time of the call. The facility allows computations to be done with dynamically altering grid structures and recursion, to suit the computation; for example, it allows a 2D/3D simulation mesh to be non-uniform, with increased precision at places of interest (see previous slides).

160 Kernels calling other kernels (Dynamic Parallelism). [Sketch: host code launches kernel1<<<G1,B1>>>(); the device code of __global__ void kernel1() launches kernel2<<<G2,B2>>>() and kernel3<<<G3,B3>>>() from within the kernel.] Notice that the kernel call uses the standard syntax and allows each kernel call to have different grid/block structures. The nesting depth is limited by memory and is <= 63 or 64.

161 Recursion is allowed (Dynamic Parallelism). [Sketch: host code launches kernel1<<<G1,B1>>>(); inside __global__ void kernel1(), the kernel launches kernel1<<<G2,B2>>>() again, i.e. it calls itself recursively.]

162 Parent-Child Launch Nesting. [Timeline: the host (CPU) thread launches Grid A (the parent); Grid A launches Grid B (the child); Grid B completes before Grid A completes. Derived from Fig. 20.4 of Kirk and Hwu, 2nd Ed.] There is implicit synchronization between parent and child, forcing the parent to wait until all children exit before it can exit.

163 Host-Device Data Transfers. Device-to-host memory bandwidth is much lower than device-to-device bandwidth: 8 GB/s peak (PCIe x16 Gen 2) vs. ~150 GB/s peak. Minimize transfers: intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory. Group transfers: one large transfer is usually much better than many small ones.

164 Synchronizing GPU and CPU. All kernel launches are asynchronous: control returns to the CPU immediately (be careful when measuring times!); the kernel starts executing once all previous CUDA calls have completed. Memcopies are synchronous: control returns to the CPU once the copy is complete; the copy starts once all previous CUDA calls have completed. cudaDeviceSynchronize(): the CPU code blocks until all previous CUDA calls complete. Asynchronous CUDA calls provide non-blocking memcopies and the ability to overlap memcopies and kernel execution.

165 Page-Locked Data Transfers. cudaMallocHost() allows allocation of page-locked ("pinned") host memory: same syntax as cudaMalloc() but works with CPU pointers and memory. Enables the highest cudaMemcpy performance: 3.2 GB/s on PCIe x16 Gen1, 6.5 GB/s on PCIe x16 Gen2. Use with caution!!! Allocating too much page-locked memory can reduce overall system performance. See and test the bandwidth.cu sample.

166 Overlapping Data Transfers and Computation. The async copy and stream APIs allow overlap of H2D or D2H data transfers with computation: CPU computation can overlap data transfers on all CUDA-capable devices; kernel computation can overlap data transfers on devices with "concurrent copy and execution" (compute capability >= 1.1). A stream is a sequence of operations that execute in order on the GPU; operations from different streams can be interleaved. The stream ID is used as an argument to async calls and kernel launches.

167 Asynchronous Data Transfers

- An asynchronous host-device memory copy returns control immediately to the CPU: cudaMemcpyAsync(dst, src, size, dir, stream);
- Requires pinned host memory (allocated with cudaMallocHost)
- Overlap CPU computation with the data transfer (0 = default stream):

    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d);
    cpuFunction();          // overlapped with the copy and the kernel

168 Overlapping kernel and data transfer

- Requires that the kernel and the transfer use different, non-zero streams
- To create a stream: cudaStreamCreate
- Remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
- Example:

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>( ... );   // overlapped with the copy

Exercise: read, compile and run simplestreams.cu
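When transfers and kernels operate on the same data, the usual pattern is to split the work into chunks and pipeline them across streams. A hedged sketch (NSTREAMS, the chunk sizes and the process kernel are placeholders, not taken from simplestreams.cu; h_a must be pinned with cudaMallocHost for the async copies to overlap):

    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    int chunk = n / NSTREAMS;                 // assume n is divisible by NSTREAMS
    for (int s = 0; s < NSTREAMS; s++) {
        int off = s * chunk;
        // the copy of chunk s overlaps with the kernel working on chunk s-1
        cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<chunk / 256, 256, 0, streams[s]>>>(d_a + off, chunk);
        cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamSynchronize(streams[s]);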

169 Device Management

The CPU can query and select GPU devices:
- cudaGetDeviceCount( int* count )
- cudaSetDevice( int device )
- cudaGetDevice( int *current_device )
- cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
- cudaChooseDevice( int *device, cudaDeviceProp* prop )

Multi-GPU setup:
- Device 0 is used by default
- Starting with CUDA 4.0, a CPU thread can control more than one GPU
- Multiple CPU threads can control the same GPU; the driver regulates the access

170 Device Management (sample code)

    int cudaDevice;
    struct cudaDeviceProp prop;
    cudaGetDevice(&cudaDevice);
    cudaGetDeviceProperties(&prop, cudaDevice);
    mpc     = prop.multiProcessorCount;
    mtpb    = prop.maxThreadsPerBlock;
    shmsize = prop.sharedMemPerBlock;
    printf("Device %d: number of multiprocessors %d\n"
           "max number of threads per block %d\n"
           "shared memory per block %d\n",
           cudaDevice, mpc, mtpb, shmsize);

Exercise: compile and run the enum_gpu.cu code

171 CUDA libraries

- CUDA includes a number of widely used libraries:
  - CUBLAS: BLAS implementation
  - CUFFT: FFT implementation
  - CURAND: random number generation
  - Etc.
- Thrust (http://docs.nvidia.com/cuda/thrust): reduction, scan, sort, etc.

172 CUDA libraries

- There are two types of runtime math operations:
  - __func(): direct mapping to the hardware ISA; fast but lower accuracy; examples: __sin(x), __exp(x), __pow(x,y)
  - func(): compiled to multiple instructions; slower but higher accuracy (5 ulp, units in the last place, or less); examples: sin(x), exp(x), pow(x,y)
- The -use_fast_math compiler option forces every func() to compile to __func()
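A tiny illustration of the difference (a hedged sketch; the numerical gap between the two results depends on the input and on the GPU):

    __global__ void fast_vs_accurate(float x, float *out)
    {
        out[0] = sinf(x);     // accurate version, compiled to several instructions
        out[1] = __sinf(x);   // hardware intrinsic: faster, lower accuracy
    }
    // nvcc -use_fast_math file.cu  =>  every sinf() above is compiled as __sinf()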

173 CURAND Library

Library for generating random numbers. Features:
- XORWOW pseudo-random generator
- Sobol quasi-random number generators
- Host API for generating random numbers in bulk
- Inline implementation allows use inside GPU functions/kernels
- Single- and double-precision; uniform, normal and log-normal distributions

174 CURAND Features

- Pseudo-random numbers: George Marsaglia, "Xorshift RNGs", Journal of Statistical Software, 8(14). Available at http://
- Quasi-random: 32-bit and 64-bit Sobol sequences with up to 20,000 dimensions
- Host API: call a kernel from the host, generate numbers on the GPU, consume numbers on the host or on the GPU
- GPU API: generate and consume numbers during kernel execution

175 CURAND use

1. Create a generator: curandCreateGenerator()
2. Set a seed: curandSetPseudoRandomGeneratorSeed()
3. Generate the data from a distribution:
   - curandGenerateUniform() / curandGenerateUniformDouble(): uniform
   - curandGenerateNormal() / curandGenerateNormalDouble(): Gaussian
   - curandGenerateLogNormal() / curandGenerateLogNormalDouble(): log-normal
4. Destroy the generator: curandDestroyGenerator()
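Putting the four steps together, a minimal host-API sketch that fills a device array with uniform single-precision numbers (the array size and seed are arbitrary):

    #include <curand.h>

    float *d_rand;
    size_t n = 1 << 20;
    cudaMalloc((void**)&d_rand, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);   // step 1
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);        // step 2
    curandGenerateUniform(gen, d_rand, n);                   // step 3: n numbers in (0,1]
    curandDestroyGenerator(gen);                             // step 4

    cudaFree(d_rand);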

176 Example CURAND Program: Computing π with the Monte Carlo method

π ≈ 4 * (Σ red points) / (Σ points)

1. Generate 0 <= x < 1
2. Generate 0 <= y < 1
3. Compute z = x^2 + y^2
4. If z <= 1, the point is red (inside the unit circle)

Accuracy increases slowly (the error decreases only as 1/sqrt(N)).
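A hedged sketch of the GPU part using the CURAND device API, with one generator state per thread; the names, sizes and reduction strategy are illustrative and are not taken from pigpu.cu:

    #include <curand_kernel.h>

    __global__ void count_hits(unsigned long long *hits, int iters, unsigned long long seed)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, id, 0, &state);         // one independent sequence per thread

        int local = 0;
        for (int i = 0; i < iters; i++) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) local++;   // point falls inside the quarter circle
        }
        atomicAdd(hits, (unsigned long long)local);
    }
    // On the host, pi is then estimated as 4.0 * hits / (total threads * iters).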

177 Example CURAND Program: Computing π with the Monte Carlo method

Compare the speed of pigpu and picpu:
    cd Pi
    gcc -o picpu -O3 picpu.c -lm
    nvcc -o pigpu -O3 pigpu.cu -arch=sm_13 -lm
Run both for a large number of iterations.
Look at pigpu.cu and try to understand: why is it so SLOW?

178 Multi-GPU Memory

- GPUs do not share global memory
- But starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
- Inter-GPU communication:
  - Application code is responsible for copying/moving data between GPUs
  - Data travels across the PCIe bus, even when the GPUs are connected to the same PCIe switch

179 Multi-GPU Environment

- GPUs have consecutive integer IDs, starting with 0
- Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
- cudaSetDevice allows changing the active GPU (and so the context)
- Device 0 is chosen when cudaSetDevice is not called
- Remember that multiple host threads can establish contexts with the same GPU
- The driver handles time-sharing and resource partitioning, unless the GPU is in exclusive mode

180 Multi-GPU Environment: a simple approach

    // Run an independent kernel on each CUDA device
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    ...
    for (int d = 0; d < numDevs; d++) {
        cudaSetDevice(d);
        kernel<<<blocks, threads>>>(args);
    }

Exercise: compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_35 option: nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_35

181 General Multi-GPU Programming Pattern

The goal is to hide the communication cost by overlapping it with computation.
So, every time-step, each GPU should:
- Compute the parts (halos) to be sent to its neighbors
- Compute the internal region (bulk)        } overlapped
- Exchange halos with its neighbors         }
Linear scaling holds as long as the internal computation takes longer than the halo exchange.
Actually, the separate halo computation adds some overhead.

182 Example: Two Subdomains

183 Example: Two Subdomains

[Figure: Phase 1, both GPUs compute their halo cells; Phase 2, each GPU computes its internal region while the halos are sent to the neighbor. GPU-0: green subdomain, GPU-1: grey subdomain.]

184 CUDA Features Useful for Multi-GPU

- Control multiple GPUs with a single CPU thread
  - Simpler coding: no need for CPU multithreading
- Streams:
  - Enable executing kernels and memcopies concurrently
  - Up to 2 concurrent memcopies: to/from the GPU
- Peer-to-Peer (P2P) GPU memory copies
  - Transfer data between GPUs using PCIe P2P support
  - Done by the GPU DMA hardware; the host CPU is not involved
  - Data traverses the PCIe links without touching CPU memory
  - Disjoint GPU pairs can communicate simultaneously

185 Peer-to-Peer GPU Communication

Up to CUDA 4.0, GPUs could not exchange data directly.
With CUDA >= 4.0, two GPUs connected to the same PCI-express switch can exchange data directly:

    cudaStreamCreate(&stream_on_gpu_0);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     /* device 0 can access memory of device 1 */
    cudaMalloc(&d_0, num_bytes);          /* create array on GPU 0 */
    cudaSetDevice(1);
    cudaMalloc(&d_1, num_bytes);          /* create array on GPU 1 */
    cudaMemcpyPeerAsync(d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0);
                                          /* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */

Example: read p2pcopy.cu, compile with nvcc -o p2pcopy p2pcopy.cu -arch=sm_35
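Before enabling peer access it is good practice to check that the two devices can actually reach each other. A small hedged sketch (device IDs 0 and 1 are assumed):

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 read device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the other way around?

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    } else {
        // fall back to staging the copy through host memory
    }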

186 What does the code look like?

    for (int i = 0; i < num_gpus-1; i++)   // "right" stage
        cudaMemcpyPeerAsync(d_a[i+1], gpu[i+1], d_a[i], gpu[i], num_bytes, stream[i]);
    for (int i = 0; i < num_gpus; i++)
        cudaStreamSynchronize(stream[i]);
    for (int i = 1; i < num_gpus; i++)     // "left" stage
        cudaMemcpyPeerAsync(d_b[i-1], gpu[i-1], d_b[i], gpu[i], num_bytes, stream[i]);

Arguments to the cudaMemcpyPeerAsync() call: dest address, dest GPU, src address, src GPU, num bytes, stream ID.
The middle loop isn't necessary for correctness; it improves performance by preventing the two stages from interfering with each other (15 vs. 11 GB/s in a 4-GPU example).

187 Code Pattern for Multi-GPU

Every time-step, each GPU will:
- Compute the halos to be sent to its neighbors (stream h)
- Compute the internal region (stream i)
- Exchange halos with its neighbors (stream h)

188 Code For Looping through Time

    for (int istep = 0; istep < nsteps; istep++) {
        for (int i = 0; i < num_gpus; i++) {            // compute halos
            cudaSetDevice(gpu[i]);
            kernel<<<..., stream_halo[i]>>>(...);
            kernel<<<..., stream_halo[i]>>>(...);
            cudaStreamQuery(stream_halo[i]);
            kernel<<<..., stream_internal[i]>>>(...);   // compute internal subdomain
        }
        for (int i = 0; i < num_gpus-1; i++)            // exchange halos
            cudaMemcpyPeerAsync(..., stream_halo[i]);
        for (int i = 0; i < num_gpus; i++)
            cudaStreamSynchronize(stream_halo[i]);
        for (int i = 1; i < num_gpus; i++)
            cudaMemcpyPeerAsync(..., stream_halo[i]);
        for (int i = 0; i < num_gpus; i++) {            // synchronize before the next step
            cudaSetDevice(gpu[i]);
            cudaDeviceSynchronize();
            // swap input/output pointers
        }
    }

189 Code For Looping through Time (cont.)

Same loop as on the previous slide, with the stream setup highlighted:
- Two CUDA streams per GPU: halo and internal
- Each stored in an array indexed by GPU
- Calls issued to different streams can overlap
- Prior to the time loop: create the streams and enable P2P access

190 GPUs in Different Cluster Nodes

Add network communication to the halo exchange:
1. Transfer data from the edge GPUs of each node to the CPU
2. Hosts exchange the data via MPI (or another API)
3. Transfer data from the CPU to the edge GPUs of each node

With CUDA >= 4.0 there is transparent interoperability between CUDA and Infiniband:
    export CUDA_NIC_INTEROP=1
lets MPI use, for Remote Direct Memory Access operations, memory pinned by the GPU.

191 Inter-GPU Communication with MPI

Example: simplempi.c
- Generate some random numbers on one node.
- Dispatch them to all nodes.
- Compute their square root on each node's GPU.
- Compute the average of the results by using MPI.

To compile:
    $ nvcc -c simplecudampi.cu
    $ mpic++ -o simplempi simplempi.c simplecudampi.o -L/usr/local/cuda/lib64/ -lcudart
To run (on a single line):
    $ mpirun -np 2 simplempi
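A hedged sketch of how the CUDA side of such a program might look (the kernel, wrapper name and launch sizes are illustrative, not the actual contents of simplecudampi.cu): each MPI rank pushes its slice of the data to its GPU, runs the kernel, and copies the result back before the MPI reduction.

    // device kernel: in-place square root of a block of numbers
    __global__ void sqrt_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = sqrtf(data[i]);
    }

    // host-side wrapper callable from the MPI (C/C++) part of the program
    extern "C" void gpu_sqrt(float *h_data, int n)
    {
        float *d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        sqrt_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }
    // In simplempi.c: MPI_Scatter the input, call gpu_sqrt() on each rank,
    // then MPI_Reduce the partial results to compute the average.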
