
1 A crash course on (C-)CUDA programming for (multi-)GPU. Massimo Bernaschi, massimo.bernaschi@cnr.it, http://www.iac.cnr.it/~massimo

2 Single GPU part of the class. Code samples available from (wget) http://twin.iac.rm.cnr.it/cccsample.tgz. Slides available from (wget) http://twin.iac.rm.cnr.it/ccc.pdf. Let us start with a few basic facts.

3 A Graphics Processing Unit is not a CPU! [Diagram: a CPU devotes most of its area to control logic and cache with a few ALUs; a GPU devotes most of its area to ALUs. Each has its own DRAM.] And there is no GPU without a CPU.

4 Graphics Processing Unit. Originally dedicated to the specific operations required by video cards for accelerating graphics, GPUs have become flexible general-purpose computational engines (GPGPU): optimized for high memory throughput; very suitable for data-parallel processing. Traditionally GPU programming has been tricky, but CUDA, OpenCL and now OpenACC have made it affordable.

5 What is CUDA? CUDA=Compute Unified Device Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard C A handful of language extensions to allow heterogeneous programs Straightforward APIs to manage devices, memory, etc.

6 CUDA Programming Model. The GPU is viewed as a compute device that: has its own RAM (device memory); runs data-parallel portions of an application as kernels, using many threads. GPU vs. CPU threads: GPU threads are extremely lightweight (very little creation overhead); the GPU needs 1000s of threads for full efficiency, whereas a multi-core CPU needs only a few (basically one thread per core).

7 CUDA C Jargon: The Basics. Host: the CPU and its memory (host memory). Device: the GPU and its memory (device memory).

8 What a Programmer Expresses in CUDA. [Diagram: HOST (CPU) and DEVICE (GPU), each with processors (P) and memories (M), joined by an interconnect.] Computation partitioning (where does computation occur?): declarations on functions (__host__, __global__, __device__); mapping of thread programs to the device: compute<<<gs, bs>>>(args). Data partitioning (where does data reside, who may access it and how?): declarations on data (__shared__, __device__, __constant__). Data management and orchestration: copying to/from the host, e.g. cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost). Concurrency management: e.g. __syncthreads().

9 Code Samples. 1. Connect to the system: ssh -X XYZ@abcd.zz 2. Get the code samples: wget http://twin.iac.rm.cnr.it/ccc.tgz and wget http://twin.iac.rm.cnr.it/cccsample.tgz 3. Unpack the samples: tar zxvf cccsample.tgz 4. Go to the source directory: cd cccsample/source 5. Get the CUDA C Quick Reference: wget http://twin.iac.rm.cnr.it/cuda_c_quickref.pdf

10 Hello, World! int main( void ) { printf( "Hello, World!\n" ); return 0; } To compile: nvcc -o hello_world hello_world.cu To execute: ./hello_world This basic program is just standard C that runs on the host. NVIDIA's compiler (nvcc) will not complain about CUDA programs with no device code. At its simplest, CUDA C is just C!

11 Hello, World! with Device Code. __global__ void kernel( void ) { } int main( void ) { kernel<<<1,1>>>(); printf( "Hello, World!\n" ); return 0; } To compile: nvcc -o simple_kernel simple_kernel.cu To execute: ./simple_kernel Two notable additions to the original Hello, World!

12 Hello, World! with Device Code. __global__ void kernel( void ) { } The CUDA C keyword __global__ indicates that a function runs on the device and is called from host code. nvcc splits the source file into host and device components: NVIDIA's compiler handles device functions like kernel(); the standard host compiler (gcc, icc, Microsoft Visual C) handles host functions like main().

13 Hello, World! with Device Code. int main( void ) { kernel<<< 1, 1 >>>(); printf( "Hello, World!\n" ); return 0; } Triple angle brackets mark a call from host code to device code: a "kernel launch" in CUDA jargon. We'll discuss the parameters inside the angle brackets later. This is all that's required to execute a function on the GPU! The function kernel() does nothing, so let us run something a bit more useful:

14 A kernel to make an addition. __global__ void add(int a, int b, int *c) { *c = a + b; } Compile: nvcc -o addint addint.cu Run: ./addint Look at line 28 of the addint.cu code. How do we pass the first two arguments?

15 A More Complex Example. Another kernel to add two integers: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } As before, __global__ is a CUDA C keyword meaning: add() will execute on the device; add() will be called from the host.

16 A More Complex Example. Notice that now we use pointers for all our variables: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; } add() runs on the device, so a, b, and c must point to device memory. How do we allocate memory on the GPU?

17 Memory Management. Up to CUDA 4.0, host and device memory were distinct entities from the programmer's viewpoint. Device pointers point to GPU memory: they may be passed to and from host code, but (in general) may not be dereferenced from host code. Host pointers point to CPU memory: they may be passed to and from device code, but (in general) may not be dereferenced from device code. Starting with CUDA 4.0 there is a Unified Virtual Addressing feature. Basic CUDA API for dealing with device memory: cudaMalloc(&p, size) (note: it takes a pointer to a pointer), cudaFree(p), cudaMemcpy(t, s, size, direction). Similar to their C equivalents: malloc(), free(), memcpy().

18 A More Complex Example: main()
int main( void ) {
    int a, b, c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = sizeof( int );     // we need space for an integer
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = 2;
    b = 7;
    // copy inputs to device
    cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice );
    // launch add() kernel on GPU, passing parameters
    add<<< 1, 1 >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

19 Parallel Programming in CUDA C. But wait... GPU computing is about massive parallelism. So how do we run code in parallel on the device? The solution lies in the parameters between the triple angle brackets: add<<< 1, 1 >>>( dev_a, dev_b, dev_c ); becomes add<<< N, 1 >>>( dev_a, dev_b, dev_c ); Instead of executing add() once, add() is executed N times in parallel.

20 Parallel Programming in CUDA C. With add() running in parallel, let's do vector addition. Terminology: each parallel invocation of add() is referred to as a block. A kernel can refer to its block's index with the variable blockIdx.x. Each block adds a value from a[] and b[], storing the result in c[]: __global__ void add( int *a, int *b, int *c ) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } By using blockIdx.x to index the arrays, each block handles a different index. blockIdx.x is the first example of a CUDA predefined variable.

21 Parallel Programming in CUDA C. We write this code: __global__ void add( int *a, int *b, int *c ) { c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x]; } This is what runs in parallel on the device: Block 0: c[0]=a[0]+b[0]; Block 1: c[1]=a[1]+b[1]; Block 2: c[2]=a[2]+b[2]; Block 3: c[3]=a[3]+b[3];

22 Parallel Addition: main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

23 Parallel Addition: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with N parallel blocks
    add<<< N, 1 >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

24 Review. Difference between host and device: Host = CPU, Device = GPU. Using __global__ to declare a function as device code: it runs on the device and is called from the host. Passing parameters from host code to a device function.

25 Review (cont). Basic device memory management: cudaMalloc(), cudaMemcpy(), cudaFree(). Launching parallel kernels: launch N copies of add() with add<<< N, 1 >>>(); blockIdx.x gives access to each block's index. Exercise: look at, compile and run the add_simple_blocks.cu code.

26 Threads. Terminology: a block can be split into parallel threads. Let's change vector addition to use parallel threads instead of parallel blocks: __global__ void add( int *a, int *b, int *c ) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; } We use threadIdx.x instead of blockIdx.x in add(). main() will require one change as well...

27 Parallel Addition (Threads): main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for 512 integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

28 Parallel Addition (Threads): main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with N threads in 1 block
    add<<< 1, N >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: compile and run the add_simple_threads.cu code

29 Using Threads And Blocks. We've seen parallel vector addition using: many blocks with 1 thread apiece; 1 block with many threads. Let's adapt vector addition to use lots of both blocks and threads. After using threads and blocks together, we'll talk about why threads. First, let's discuss data indexing.

30 Indexing Arrays With Threads & Blocks. No longer as simple as just using threadIdx.x or blockIdx.x as indices. [Diagram: an array indexed with 1 thread per entry, using 8 threads/block; threadIdx.x runs 0..7 within each of blockIdx.x = 0, 1, 2, 3.] If we have M threads/block, a unique array index for each entry is given by: int index = threadIdx.x + blockIdx.x * M; (analogous to int index = x + y * width;)

31 Indexing Arrays: Example. In this example, the red entry would have an index of 21: with M = 8 threads/block and blockIdx.x = 2, int index = threadIdx.x + blockIdx.x * M = 5 + 2 * 8 = 21;

32 Indexing Arrays: other examples (4 blocks with 4 threads per block)

33 Addition with Threads and Blocks. blockDim.x is a built-in variable giving the number of threads per block: int index = threadIdx.x + blockIdx.x * blockDim.x; gridDim.x is a built-in variable giving the number of blocks in a grid. A combined version of our vector addition kernel that uses blocks and threads: __global__ void add( int *a, int *b, int *c ) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; } So what changes in main() when we use both blocks and threads?

34 Parallel Addition (Blocks/Threads): main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );
    a = (int*)malloc( size );
    b = (int*)malloc( size );
    c = (int*)malloc( size );
    random_ints( a, N );
    random_ints( b, N );

35 Parallel Addition (Blocks/Threads): main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch add() kernel with blocks and threads
    add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: compile and run the add_simple.cu code

36 Exercise: array reversal. Start from the arrayreversal.cu code. The code must fill an array d_out with the contents of an input array d_in in reverse order, so that if d_in is [100, 110, 200, 220, 300] then d_out must be [300, 220, 200, 110, 100]. You must complete lines 13, 14 and 34. Remember that: blockDim.x is the number of threads per block; gridDim.x is the number of blocks in the grid.

37 Array Reversal: Solution
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int in  = blockDim.x * blockIdx.x + threadIdx.x;
    int out = (blockDim.x * gridDim.x - 1) - in;
    d_out[out] = d_in[in];
}

38 Why Bother With Threads? Threads seem unnecessary: they add a level of abstraction and complexity. What did we gain? Unlike parallel blocks, parallel threads have mechanisms to communicate and synchronize. Let's see how.

39 Dot Product. Unlike vector addition, the dot product is a reduction from vectors to a scalar: c = a · b = (a0, a1, a2, a3) · (b0, b1, b2, b3) = a0*b0 + a1*b1 + a2*b2 + a3*b3

40 Dot Product. Parallel threads have no problem computing the pairwise products, so we can start a dot product CUDA kernel by doing just that: __global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x]; ...

41 Dot Product. But we need to share data between threads to compute the final sum: __global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x]; // Can't compute the final sum here: each thread's copy of temp is private!!! }

42 Sharing Data Between Threads. Terminology: a block of threads shares memory called shared memory. Extremely fast, on-chip memory (user-managed cache). Declared with the __shared__ CUDA keyword. Not visible to threads in other blocks running in parallel. [Diagram: Block 0, Block 1, Block 2, each with its own threads and its own shared memory.]

43 Parallel Dot Product: dot(). We perform parallel multiplication, serial addition:
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    // Shared memory for results of multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

44 Parallel Dot Product Recap. We perform parallel, pairwise multiplications. Shared memory stores each thread's result. We sum these pairwise products from a single thread. Sounds good... but... Exercise: compile and run dot_simple_threads.cu. Does it work as expected?

45 Faulty Dot Product Exposed! Step 1: in parallel, each thread writes its pairwise product into __shared__ int temp[]. Step 2: thread 0 reads and sums the products from temp[]. But there's an assumption hidden in Step 1...

46 Read-Before-Write Hazard. Suppose thread 0 finishes its write in Step 1, then thread 0 reads index 12 in Step 2, before thread 12 has written to index 12 in Step 1. This read returns garbage!

47 Synchronization. We need threads to wait between the sections of dot():
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    // * NEED THREADS TO SYNCHRONIZE HERE *
    // No thread can advance until all threads
    // have reached this point in the code
    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}

48 __syncthreads(). We can synchronize threads with the function __syncthreads(). Threads in the block wait until all threads have hit the __syncthreads(). [Diagram: Thread 0 .. Thread 4 each reach __syncthreads() at different times; none proceeds until all have arrived.] Threads are only synchronized within a block!

49 Parallel Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = N-1; i >= 0; i-- ) {
            sum += temp[i];
        }
        *c = sum;
    }
}
With a properly synchronized dot() routine, let's look at main()

50 Parallel Dot Product: main()
#define N 512
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N integers
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );
    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );
    random_ints( a, N );
    random_ints( b, N );

51 Parallel Dot Product: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch dot() kernel with 1 block and N threads
    dot<<< 1, N >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}
Exercise: insert __syncthreads() in the dot kernel. Compile and run.

52 Review. Launching kernels with parallel threads: launch add() with N threads: add<<< 1, N >>>(); use threadIdx.x to access each thread's index. Using both blocks and threads: use (threadIdx.x + blockIdx.x * blockDim.x) to index input/output; N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads total.

53 Review (cont). Using __shared__ to declare memory as shared memory: data is shared among threads in a block and not visible to threads in other parallel blocks. Using __syncthreads() as a barrier: no thread executes instructions after the __syncthreads() until all threads have reached it. Needed to prevent data hazards.

54 Multiblock Dot Product. Recall our dot product launch: // launch dot() kernel with 1 block and N threads dot<<< 1, N >>>( dev_a, dev_b, dev_c ); Launching with one block will not utilize much of the GPU. Let's write a multiblock version of the dot product.

55 Multiblock Dot Product: Algorithm Each block computes a sum of its pairwise products like before:

56 Multiblock Dot Product: Algorithm And then contributes its sum to the final result:

57 Multiblock Dot Product: dot()
#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        *c += sum;   // race condition! fix: atomicAdd( c, sum );
    }
}
But we have a race condition... Compile and run dot_simple_multiblock.cu. We can fix it with one of CUDA's atomic operations: use atomicAdd in the dot kernel. Compile with nvcc -o dot_simple_multiblock dot_simple_multiblock.cu -arch=sm_35

58 Race Conditions. Terminology: a race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences. What actually takes place to execute the line in question, *c += sum;: read the value at address c, add sum to the value, write the result to address c. Terminology: Read-Modify-Write. What if two threads try to do this at the same time? Thread 0 of Block 0 and Thread 0 of Block 1 each read the value at address c, add sum to it, and write the result back to address c.

59 Global Memory Contention (Read-Modify-Write). [Diagram: *c starts at 0. Block 0 (sum = 3) reads 0, computes 0+3, writes 3. Block 1 (sum = 4) then reads 3, computes 3+4, writes 7. Correct result: 7.]

60 Global Memory Contention (Read-Modify-Write). [Diagram: *c starts at 0. Block 0 (sum = 3) reads 0 and Block 1 (sum = 4) also reads 0 before Block 0 writes. Block 0 writes 3, then Block 1 writes 0+4 = 4. Wrong result: 4 instead of 7.]

61 Atomic Operations. Terminology: a read-modify-write operation is uninterruptible when atomic. Many atomic operations on memory are available in CUDA C: atomicAdd(), atomicSub(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicExch(), atomicCAS() (old == compare ? val : old). They give a predictable result when simultaneous access to memory is required. We need to atomically add sum to c in our multiblock dot product.
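As an aside, atomicCAS() is the building block for atomics the hardware does not provide directly. The sketch below is the well-known compare-and-swap retry loop (essentially the pattern shown in the CUDA C Programming Guide) that emulates atomicAdd() on double values on GPUs lacking a native double-precision atomicAdd; the helper name atomicAddDouble is ours.
__device__ double atomicAddDouble(double *address, double val)
{
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Try to swap in (val + old value); retry if another thread changed it meanwhile
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);   // previous value, like atomicAdd
}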

62 Multiblock Dot Product: dot()
__global__ void dot( int *a, int *b, int *c ) {
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ ) {
            sum += temp[i];
        }
        atomicAdd( c, sum );
    }
}
Now let's fix up main() to handle a multiblock dot product

63 Parallel Dot Product: main()
#define THREADS_PER_BLOCK 512
#define N (1024*THREADS_PER_BLOCK)
int main( void ) {
    int *a, *b, *c;               // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = N * sizeof( int ); // we need space for N ints
    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, sizeof( int ) );
    a = (int *)malloc( size );
    b = (int *)malloc( size );
    c = (int *)malloc( sizeof( int ) );
    random_ints( a, N );
    random_ints( b, N );

64 Parallel Dot Product: main() (cont)
    // copy inputs to device
    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    // launch dot() kernel
    dot<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );
    // copy device result back to host copy of c
    cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost );
    free( a ); free( b ); free( c );
    cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c );
    return 0;
}

65 Review. Race conditions: behavior depends upon the relative timing of multiple event sequences; they can occur when an implied read-modify-write is interruptible. Atomic operations: CUDA provides read-modify-write operations guaranteed to be atomic; atomics ensure correct results when multiple threads modify memory; to use atomic operations, a suitable -arch= option must be used at compile time.

66 Eratosthenes' Sieve. A sieve has holes in it and is used to filter out the juice. Eratosthenes' sieve filters out numbers to find the prime numbers.

67 Definition: Prime Number — a number that has only two factors, itself and 1. For example, 7 is prime because the only numbers that divide into it evenly are 1 and 7.

68 The Eratosthenes Sieve

69 The prime numbers from 1 to 120 are the following: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113. Download (wget) http://twin.iac.rm.cnr.it/eratostenepr.cu
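For reference, a minimal CPU sketch of the sieve (plain C, not the eratostenepr.cu code itself): it marks the multiples of each surviving number as composite and prints the primes up to 120.
#include <stdio.h>
#include <string.h>

#define LIMIT 120

int main(void)
{
    char composite[LIMIT + 1];
    memset(composite, 0, sizeof(composite));     // initially assume every number is prime
    for (int i = 2; i * i <= LIMIT; i++) {
        if (!composite[i]) {
            for (int j = i * i; j <= LIMIT; j += i)
                composite[j] = 1;                 // j has divisor i, so it is not prime
        }
    }
    for (int i = 2; i <= LIMIT; i++)
        if (!composite[i]) printf("%d ", i);
    printf("\n");
    return 0;
}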

70 CUDA Thread Organization: Grids and Blocks. A kernel is executed as a 1D, 2D or 3D grid of thread blocks. All threads share the global memory. A thread block is a 1D, 2D or 3D batch of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses); efficiently sharing data through a low-latency shared memory. Thread blocks are independent of each other and can be executed in any order! [Diagram: the host launches Kernel 1 and Kernel 2; each grid is a 2D arrangement of blocks and each block a 2D arrangement of threads. Courtesy: NVIDIA]

71 Built-in Variables to manage grids and blocks. dim3: a datatype defined by CUDA: struct dim3 { unsigned int x, y, z }; three unsigned ints where any unspecified component defaults to 1. dim3 gridDim: dimensions of the grid in blocks. dim3 blockDim: dimensions of the block in threads. dim3 blockIdx: block index within the grid. dim3 threadIdx: thread index within the block.

72 Bi-dimensional thread configuration by example: set the elements of a square matrix (assume the matrix is a single block of memory!)
__global__ void kernel( int *a, int dimx, int dimy ) {
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = idx+1;
}
int main() {
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers
    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;
    kernel<<<grid, block>>>( d_a, dimx, dimy );
    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);
    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }
    free( h_a );
    cudaFree( d_a );
    return 0;
}
Exercise: compile and run setmatrix.cu

73 Matrix multiply with one thread block. One block of threads computes the matrix dp: each thread computes one element of dp. Each thread loads a row of matrix dm and a column of matrix dn, and performs one multiply and one addition for each pair of dm and dn elements. The size of the matrix is limited by the number of threads allowed in a thread block! [Diagram: a single block; thread (2,2) computes one element of dp from a row of dm and a column of dn of size WIDTH.]

74 Exercise: matrix-matrix multiply.
__global__ void MatrixMulKernel(DATA* dm, DATA* dn, DATA* dp, int Width) {
    DATA Pvalue = 0.0;
    for (int k = 0; k < Width; ++k) {
        DATA Melement = dm[     ];
        DATA Nelement = dn[     ];
        Pvalue += Melement * Nelement;
    }
    dp[     ] = Pvalue;
}
Start from MMGlob.cu. Complete the function MatrixMulOnDevice and the kernel MatrixMulKernel. Use one bi-dimensional block. Threads will have an x (threadIdx.x) and a y (threadIdx.y) identifier. Max size of the matrix: 16. Compile: nvcc -o MMGlob MMGlob.cu Run: ./MMGlob

75 CUDA Compiler: nvcc basic options. -arch=sm_13 (or sm_20, sm_35): enable double precision (on compatible hardware). -G: enable debug for device code. --ptxas-options=-v: show register and memory usage. --maxrregcount <N>: limit the number of registers. -use_fast_math: use the fast math library. -O3: enable compiler optimization. -ccbin compiler_path: use a different C compiler (e.g., Intel instead of gcc). Exercises: 1. Re-compile the MMGlob.cu program and check how many registers the kernel uses: nvcc -o MMGlob MMGlob.cu --ptxas-options=-v 2. Edit MMGlob.cu and change DATA from float to double, then compile for double precision and run.

76 CUDA Error Reporting to the CPU. All CUDA calls return an error code (except kernel launches): type cudaError_t. cudaError_t cudaGetLastError(void) returns the code for the last error (cudaSuccess means "no error"). char* cudaGetErrorString(cudaError_t code) returns a null-terminated character string describing the error. Example: printf("%s\n", cudaGetErrorString(cudaGetLastError()));
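A common convenience, sketched here rather than taken from the course samples, is to wrap runtime calls in a checking macro and to check kernel launches explicitly with cudaGetLastError():
#include <stdio.h>
#include <stdlib.h>

// Minimal error-checking helper: wrap CUDA runtime calls with CUDA_CHECK(...)
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Usage sketch:
//   CUDA_CHECK( cudaMalloc((void**)&dev_a, size) );
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK( cudaGetLastError() );        // catches launch-configuration errors
//   CUDA_CHECK( cudaDeviceSynchronize() );   // catches errors raised during execution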

77 CUDA Event API. Events are inserted (recorded) into CUDA call streams. Usage scenarios: measure the elapsed time (in milliseconds) of CUDA calls (resolution ~0.5 microseconds); query the status of an asynchronous CUDA call; block the CPU until CUDA calls prior to the event have completed.
cudaEvent_t start, stop;
float et;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Kernel execution time=%f\n", et);
Read the function MatrixMulOnDevice of the MMGlobWT.cu program. Compile and run.

78 Execution Model (Software vs. Hardware). Threads are executed by scalar processors. Thread blocks are executed on multiprocessors; thread blocks do not migrate. Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file). A kernel is launched as a grid of thread blocks on the device.

79 Transparent Scalability. The hardware is free to assign blocks to any multiprocessor at any time. A kernel scales across any number of parallel processors. [Diagram: the same kernel grid of blocks 0-7 runs two blocks at a time on a small device and four at a time on a larger one.] Each block can execute in any order relative to the other blocks!

80 Warps and Half-Warps. A thread block consists of 32-thread warps. A warp is executed physically in parallel (SIMD) on a multiprocessor. A half-warp of 16 threads can coordinate global memory accesses into a single transaction. [Diagram: a thread block split into 32-thread warps on a multiprocessor; device memory with global and local regions.]

81 Executing Thread Blocks. The number of threads in a block depends on the capability (i.e., the version) of the GPU: on a Fermi or Kepler GPU a block may have up to 1024 threads. Threads are assigned to Streaming Multiprocessors (SM) at block granularity: up to 16 blocks per SM on Kepler, 8 blocks per SM on Fermi. A Fermi SM can take up to 1536 threads: this could be 256 threads/block * 6 blocks, or 192 threads/block * 8 blocks, but, for instance, 128 threads/block * 12 blocks is not allowed! A Kepler SM can take up to 2048 threads. Threads run concurrently; the SM manages/schedules thread execution.

82 Block Granularity Considerations. For a kernel running on Fermi and using multiple blocks, should we use 8x8, 16x16 or 32x32 blocks? For 8x8, we have 64 threads per block. Since each Fermi SM can take up to 1536 threads, that would be 24 blocks; however, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! For 32x32, we have 1024 threads per block: only one block can fit into an SM! For 16x16, we have 256 threads per block. Since each Fermi SM can take up to 1536 threads, it can take up to 6 blocks and achieve full capacity, unless other resource considerations overrule. Repeat the analysis for a Kepler card!

83 Programmer View of the Register File. There are up to 32768 (32-bit) registers in each Fermi SM and up to 65536 (32-bit) registers in each Kepler SM. This is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks: each thread in the same block only accesses the registers assigned to itself.

84 Register occupancy example. If a block has 16x16 threads and each thread uses 30 registers, how many blocks can run on each Fermi SM? Each block requires 30*256 = 7680 registers; 32768/7680 = 4 plus change (2048), so 4 blocks can run on an SM as far as registers are concerned. What if each thread increases its register use by 10%? Each block now requires 33*256 = 8448 registers; 32768/8448 = 3 plus change (7424), so only three blocks can run on an SM: a 25% reduction of parallelism! Repeat the analysis for a Kepler card!

85 Optimizing threads per block. Choose threads per block as a multiple of the warp size (32): avoid wasting computation on under-populated warps and facilitate efficient memory access (coalescing). Run as many warps as possible per multiprocessor (to hide latency); an SM can run up to 16 blocks at a time (on Kepler). Heuristics: minimum 64 threads per block; 192 or 256 threads is usually a better choice (usually still enough registers to compile and invoke successfully). The right tradeoff depends on your computation, so experiment, experiment, experiment!

86 Review of the Memory Hierarchy. Local memory: per-thread, private to each thread but slow! (auto variables, register spills). Shared memory: per-block, shared by the threads of the same block, fast inter-thread communication. Global memory: per-application, shared by all threads, used for inter-grid communication (sequential grids in time).

87 Optimizing Global Memory Access. Memory bandwidth is very high (~150 GB/s), but latency is also very high (hundreds of cycles)! Local memory has the same latency. Reduce the number of memory transactions by applying a proper memory access pattern: array of structures (data locality!) vs. structure of arrays (thread locality!). Thread locality is better than data locality on a GPU!
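As an illustrative sketch of the two layouts (the type and array names are ours): in the array-of-structures form, consecutive threads read addresses that are a whole struct apart, while in the structure-of-arrays form thread i and thread i+1 read adjacent words, which is what coalescing favours.
#define NPART 1024

/* Array of structures: the fields of one particle are contiguous (data locality). */
struct ParticleAoS { float x, y, z, w; };
struct ParticleAoS particles_aos[NPART];

/* Structure of arrays: the same field of consecutive particles is contiguous
   (thread locality): thread i reads x[i], thread i+1 reads x[i+1]. */
struct ParticlesSoA {
    float x[NPART];
    float y[NPART];
    float z[NPART];
    float w[NPART];
};
struct ParticlesSoA particles_soa;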

88 Global Memory Operations. Two types of loads: Caching (the default mode): attempts to hit in L1, then L2, then GMEM; load granularity is a 128-byte line. Non-caching: compile with the -Xptxas -dlcm=cg option to nvcc; attempts to hit in L2, then GMEM (does not hit in L1, and invalidates the line if it is in L1 already); load granularity is 32 bytes. Stores: invalidate L1, write-back to L2.

89 Caching Load. The warp requests 32 aligned, consecutive 4-byte words. The addresses fall within 1 cache line. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

90 Non-caching Load. The warp requests 32 aligned, consecutive 4-byte words. The addresses fall within 4 segments. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

91 Caching Load. The warp requests 32 aligned, permuted 4-byte words. The addresses fall within 1 cache line. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

92 Non-caching Load. The warp requests 32 aligned, permuted 4-byte words. The addresses fall within 4 segments. The warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

93 Caching Load. The warp requests 32 misaligned, consecutive 4-byte words. The addresses fall within 2 cache lines. The warp needs 128 bytes; 256 bytes move across the bus on misses. Bus utilization: 50%.

94 Non-caching Load. The warp requests 32 misaligned, consecutive 4-byte words. The addresses fall within at most 5 segments. The warp needs 128 bytes; 160 bytes move across the bus on misses. Bus utilization: at least 80% (some misaligned patterns fall within 4 segments, giving 100% utilization).

95 Caching Load. All threads in a warp request the same 4-byte word. The addresses fall within a single cache line. The warp needs 4 bytes; 128 bytes move across the bus on a miss. Bus utilization: 3.125%.

96 Non-caching Load. All threads in a warp request the same 4-byte word. The addresses fall within a single segment. The warp needs 4 bytes; 32 bytes move across the bus on a miss. Bus utilization: 12.5%.

97 Caching Load. The warp requests 32 scattered 4-byte words. The addresses fall within N cache lines. The warp needs 128 bytes; N*128 bytes move across the bus on a miss. Bus utilization: 128 / (N*128).

98 Non-caching Load. The warp requests 32 scattered 4-byte words. The addresses fall within N segments. The warp needs 128 bytes; N*32 bytes move across the bus on a miss. Bus utilization: 128 / (N*32).

99 Shared Memory. On-chip memory: 2 orders of magnitude lower latency than global memory, and much higher bandwidth. Up to 48 KB per multiprocessor (on Fermi and Kepler). Allocated per thread-block; accessible by any thread in the thread-block; not accessible to other thread-blocks. Several uses: sharing data among threads in a thread-block; user-managed cache (reducing global memory accesses). Example declaration: __shared__ float As[16][16];

100 Shared Memory Example: compute adjacent_difference. Given an array d_in, produce a d_out array containing the differences: d_out[0] = d_in[0]; d_out[i] = d_in[i] - d_in[i-1]; /* i > 0 */ Exercise: complete the shared_variables.cu code.

101 N-Body problems. An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. The all-pairs approach to N-body simulation is a brute-force technique that evaluates all pair-wise interactions among the N bodies.

102 N-Body problems. We use the gravitational potential to illustrate the basic form of computation in an all-pairs N-body simulation.

103 N-Body problems. To integrate over time, we need the acceleration a_i = F_i / m_i to update the position and velocity of body i. Download http://twin.iac.rm.cnr.it/mini_nbody.tgz and unpack it: tar zxvf mini_nbody.tgz; then look at nbody.c

104 N-Body: main loop
void bodyForce(Body *p, float dt, int n) {
    for (int i = 0; i < n; i++) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = 1.0f / sqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        p[i].vx += dt*Fx;
        p[i].vy += dt*Fy;
        p[i].vz += dt*Fz;
    }
}

105 N-Body: parallel processing. We may think of the all-pairs algorithm as calculating each entry f_ij in an NxN grid of all pair-wise forces. Each entry can be computed independently, so there is O(N^2) available parallelism. We may assign each body to a thread. If the block size is, say, 256, then the number of blocks is (nBodies + 255) / 256.

106 N-Body: a first CUDA code
__global__ void bodyForce(Body *p, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = rsqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        p[i].vx += dt*Fx;
        p[i].vy += dt*Fy;
        p[i].vz += dt*Fz;
    }
}

107 N-Body: a first CUDA code. Do we expect this code to be efficient? All threads read all data from global memory. It is better to serialize some of the computations to achieve the data reuse needed to reach the peak performance of the arithmetic units and to reduce the required memory bandwidth. To that purpose, let us introduce the notion of a computational tile.

108 N-Body: computational tile. A tile is a square region of the grid of pair-wise forces consisting of p rows and p columns. Only 2p body descriptions are required to evaluate all p^2 interactions in the tile (p of which can be reused later). These body descriptions can be stored in shared memory or in registers.

109 N-Body: computational tile. Each thread in a block loads one body; then all threads in the block compute the interactions with the bodies that have been loaded.

110 N-Body: using Shared Memory (1)
__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f;
        for (int tile = 0; tile < gridDim.x; tile++) {
            __shared__ float3 spos[BLOCK_SIZE];
            float4 tpos = p[tile * blockDim.x + threadIdx.x];
            spos[threadIdx.x] = make_float3(tpos.x, tpos.y, tpos.z);
            __syncthreads();
            for (int j = 0; j < BLOCK_SIZE; j++) {
                ...
            }
            __syncthreads();
        }
        v[i].x += dt*Fx;
        v[i].y += dt*Fy;
        v[i].z += dt*Fz;
    }
}

111 N-Body: using Shared Memory (2)
__global__ void bodyForce(float4 *p, float4 *v, float dt, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        ...
        for (int j = 0; j < BLOCK_SIZE; j++) {
            float dx = spos[j].x - p[i].x;
            float dy = spos[j].y - p[i].y;
            float dz = spos[j].z - p[i].z;
            float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING;
            float invDist = rsqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            Fx += dx * invDist3;
            Fy += dy * invDist3;
            Fz += dz * invDist3;
        }
        ...
    }
}

112 N-Body: further optimizations. Tuning of the block size. Unrolling of the inner loop: for (int j = 0; j < BLOCK_SIZE; j++) { ... } Test: 1. shmoo-cpu-nbody.sh 2. cuda/shmoo-cuda-nbody-orig.sh 3. cuda/shmoo-cuda-nbody-block.sh 4. cuda/shmoo-cuda-nbody-unroll.sh

113 Shared Memory Organization: 32 banks, 4-byte wide. Successive 4-byte words belong to different banks. Performance: 4 bytes per bank per 2 clocks per multiprocessor. Shared memory accesses are issued per 32 threads (warp). Serialization: if N threads out of 32 access different 4-byte words in the same bank, the N accesses are executed serially. Multicast: N threads accessing the same word are served in one fetch (they could even read different bytes within the same word).

114 Bank Addressing Examples: No Bank Conflicts. [Diagram: left, each thread accesses a different bank in order (thread i -> bank i); right, threads access different banks in a permuted order. Both patterns are conflict-free.]

115 Bank Addressing Examples: Bank Conflicts. [Diagram: left, a stride-2 access pattern causes 2-way bank conflicts; right, a stride-8 (x8) pattern causes 8-way bank conflicts.]

116 Shared Memory: Avoiding Bank Conflicts. A 32x32 shared memory array: when a warp accesses a column, there are 32-way bank conflicts (all threads in the warp access the same bank).

117 Shared Memory: Avoiding Bank Conflicts. Add a column for padding: a 32x33 shared memory array. When a warp accesses a column it now touches 32 different banks: no bank conflicts.
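A minimal sketch of the padding trick, using the classic shared-memory matrix transpose (assumptions: a square matrix whose side is a multiple of TILE_DIM, and a TILE_DIM x TILE_DIM thread block):
#define TILE_DIM 32

__global__ void transposeTile(float *out, const float *in, int width)
{
    // The extra column avoids 32-way bank conflicts when a warp reads a column
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced, row-wise write
    __syncthreads();
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read, conflict-free
}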

118 Tree-Based Parallel Reductions in Shared Memory. CUDA addresses this by exposing on-chip shared memory: reduce blocks of data in shared memory to save bandwidth.

119 Parallel Reduction: Interleaved Addressing. [Diagram: values in shared memory are combined in steps with stride 1, 2, 4, 8.] Arbitrarily bad bank conflicts; requires barriers if N > warpSize; supports non-commutative operators. Interleaved addressing results in bank conflicts.

120 Parallel Reduction: Sequential Addressing. [Diagram: values in shared memory are combined in steps with stride 8, 4, 2, 1.] No bank conflicts; requires barriers if N > warpSize; does not support non-commutative operators. Sequential addressing is conflict-free.
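A minimal sketch of a sequential-addressing reduction kernel (one partial sum per block; it assumes the block size is a power of two and the input length is a multiple of it):
#define BLOCK_SIZE 512

// Each block reduces BLOCK_SIZE elements of g_in into one partial sum in g_out.
__global__ void reduceSum(const int *g_in, int *g_out)
{
    __shared__ int sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    // Sequential addressing: the stride halves each step, threads 0..s-1 stay active,
    // and a warp always touches 32 different banks (no conflicts).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];   // write the block's partial sum
}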

121 Parallel Prefix Sum (Scan). Given an array A = [a0, a1, ..., an-1] and a binary associative operator ⊕ with identity I, scan(A) = [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)]. Example: if ⊕ is addition, then the scan of an input array returns the running sums of its prefixes (see the worked example below).
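The worked example referred to above, as a sequential reference implementation in plain C (useful for checking any GPU scan); the input values are just an illustration:
#include <stdio.h>

/* Exclusive scan (prefix sum) with + as the operator and 0 as the identity:
   out[0] = 0, out[i] = in[0] + ... + in[i-1]. Sequential reference version. */
void exclusiveScan(const int *in, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

int main(void)
{
    int in[8] = { 3, 1, 7, 0, 4, 1, 6, 3 };
    int out[8];
    exclusiveScan(in, out, 8);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  /* prints: 0 3 4 11 11 15 16 22 */
    printf("\n");
    return 0;
}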

122 O(n log n) Scan. Step efficient (log n steps), but not work efficient (n log n work). Requires barriers at each step (WAR dependencies).

123 Constant Memory. Both scalar and array values can be stored. Constants are stored in DRAM and cached on chip (per SM). A constant value can be broadcast to all threads in a warp: an extremely efficient way of accessing a value that is common to all threads in a block! Example: cudaMemcpyToSymbol(dS, &hS, sizeof(int), 0, cudaMemcpyHostToDevice);
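A minimal sketch of how the symbol in that call might be declared and used (the names dS, hS and the kernel are illustrative):
__constant__ int dS;   // device constant: cached and broadcast to all threads of a warp

__global__ void scaleByConstant(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= dS;   // every thread reads the same constant value
}

// Host side (sketch):
//   int hS = 3;
//   cudaMemcpyToSymbol(dS, &hS, sizeof(int), 0, cudaMemcpyHostToDevice);
//   scaleByConstant<<<(n + 255) / 256, 256>>>(d_a, n);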

124 Texture Memory (and co-stars). Another type of memory system, featuring: spatially-cached read-only access; no coalescing worries; interpolation; (other) fixed-function capabilities; graphics interoperability.

125 Traditional Caching. When reading, nearby elements are cached (i.e. a cache line). Memory is linear! Applies to the CPU, the GPU L1/L2 cache, etc.

126 Texture-Memory Caching. Can cache spatially! (2D, 3D). Dimensions (1D, 2D, 3D) are specified on creation. 1D applications: interpolation, clamping (later); caching when, e.g., coalesced access is infeasible.

127 Texture Memory. Memory is just "an unshaped bucket of bits" (CUDA Handbook). A texture reference is needed in order to: interpret the data; deliver it to registers.

128 Interpolation. Can read "between the lines"! Download http://twin.iac.rm.cnr.it/readtexels.cu (see also http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html)

129 Clamping. Seamlessly handles reads beyond the region! (see http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html)

130 CUDA Arrays. So far, we've used standard linear arrays. CUDA arrays: different addressing calculation; contiguous addresses have 2D/3D locality! Not pointer-addressable; designed specifically for texturing.

131 Texture Memory. A texture reference can be attached to: an ordinary device-memory array; a CUDA array. Many more capabilities...

132 Texturing Example (2D)

133 Texturing Example (2D)

134 Texturing Example (2D)

135 Texturing Example (2D)

136 Arithmetic Instruction Throughput. [Table: number of operations per clock cycle per multiprocessor, by operation type and compute capability.]

137 Control Flow Instructions. The main performance concern with branching is divergence: threads within a single warp (group of 32) take different paths, and the different execution paths are serialized; the control paths taken by the threads in a warp are traversed one at a time until there are no more. Avoid divergence when the branch condition is a function of the thread ID. Example with divergence: if (threadIdx.x > 7) { ... } creates two different control paths for the threads in a block; the branch granularity is smaller than the warp size: threads 0, 1, ..., 7 follow a different path than threads 8, 9, ..., 31 of the same warp. Example without divergence: if (threadIdx.x / WARP_SIZE > 2) { ... } also creates two different control paths, but the branch granularity is a whole multiple of the warp size: all threads in any given warp follow the same path.

138 Warp Vote Functions and other useful primitives. int __all(predicate): returns true iff the predicate evaluates as true for all threads of the warp. int __any(predicate): returns true iff the predicate evaluates as true for any thread of the warp. unsigned int __ballot(predicate): returns an unsigned int whose Nth bit is set iff the predicate evaluates as true for the Nth thread of the warp. int __popc(int x): POPulation Count, returns the number of bits set to 1 in the 32-bit integer x. int __clz(int x): Count Leading Zeros, returns the number of consecutive zero bits beginning at the most significant bit of the 32-bit integer x.
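A small sketch combining two of these primitives: counting, per warp, how many lanes hold a positive value (pre-CUDA 9 non-_sync intrinsics, matching the ones listed above; the launch is assumed to use exactly n threads, with n a multiple of 32, and warpCounts has n/32 entries):
__global__ void countPositivePerWarp(const int *in, int *warpCounts, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // One bit per lane of this warp: bit set iff that lane's element is positive
    unsigned int mask = __ballot(in[i] > 0);
    if (threadIdx.x % 32 == 0)
        warpCounts[i / 32] = __popc(mask);   // lane 0 stores the warp's count
}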

139 Shuffle Instruction. An instruction to exchange data within a warp: threads can read other threads' registers; no shared memory is needed. Available starting from SM 3.0: Kepler's shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp.

140 Shuffle Instruction. 4 variants: indexed (__shfl), up (__shfl_up), down (__shfl_down), butterfly (__shfl_xor).

141 A simple shuffle example (1). int __shfl_down(int var, unsigned int delta, int width=warpSize); __shfl_down() calculates a source lane ID by adding delta to the caller's lane ID (the lane ID is a thread's index within its warp, from 0 to 31; it can be computed as threadIdx.x % 32). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.

142 A simple shuffle example (2). int i = threadIdx.x % 32; int j = __shfl_down(i, 2, 8);

143 Using shuffle for a reduce operation.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = warpSize/2; offset > 0; offset /= 2) {
        val += __shfl_down(val, offset);
    }
    return val;
}

144 Using shuffle for an all-reduce operation. If all threads in the warp need the result, it is possible to implement an all-reduce within a warp using __shfl_xor() instead of __shfl_down(), like the following:
__inline__ __device__ int warpAllReduceSum(int val) {
    for (int mask = warpSize/2; mask > 0; mask /= 2) {
        val += __shfl_xor(val, mask);
    }
    return val;
}

145 Block reduce using shuffle. First reduce within warps. Then the first thread of each warp writes its partial sum to shared memory. Finally, after synchronizing, the first warp reads from shared memory and reduces again.
__inline__ __device__ int blockReduceSum(int val) {
    static __shared__ int shared[32];   // shared mem for 32 partial sums
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;
    val = warpReduceSum(val);           // each warp performs a partial reduction
    if (lane==0) shared[wid] = val;     // write the reduced value to shared memory
    __syncthreads();                    // wait for all partial reductions
    // read from shared memory only if that warp existed
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid==0) val = warpReduceSum(val); // final reduce within the first warp
    return val;
}

146 Reducing Large Arrays. Reducing across blocks means global communication, which means synchronizing the grid. Two possible approaches: break the computation into two separate kernel launches, or use atomic operations. Exercise (homework!): write a deviceReduce that works across a complete grid (many thread blocks).

147 Implementing SHFL for 64-bit numbers. Generic SHFL: https://github.com/bryancatanzaro/generics

148 Optimized Filtering. Let us have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. If the predicate is src[i] > 0, a simple CPU solution is:
int filter(int *dst, const int *src, int n) {
    int nres = 0;
    for (int i = 0; i < n; i++)
        if (src[i] > 0)
            dst[nres++] = src[i];
    // return the number of elements copied
    return nres;
}

149 Optimized Filtering on GPU: global memory. A straightforward approach is to use a single global counter and atomically increment it for each new element written into the dst array:
__global__ void filter_k(int *dst, int *nres, const int *src, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n && src[i] > 0)
        dst[atomicAdd(nres, 1)] = src[i];
}
Here the atomic operation is clearly a bottleneck and needs to be removed or reduced to increase performance.

150 Optimized Filtering on GPU: using shared memory
__global__ void filter_shared_k(int *dst, int *nres, const int *src, int n) {
    __shared__ int l_n;
    int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x;
    for (int iter = 0; iter < NPER_THREAD; iter++) {
        // zero the counter
        if (threadIdx.x == 0)
            l_n = 0;
        __syncthreads();
        // get the value, evaluate the predicate, and
        // increment the counter if needed
        int d, pos;
        if (i < n) {
            d = src[i];
            if (d > 0)
                pos = atomicAdd(&l_n, 1);
        }
        __syncthreads();
        // leader increments the global counter
        if (threadIdx.x == 0)
            l_n = atomicAdd(nres, l_n);
        __syncthreads();
        // threads with true predicates write their elements
        if (i < n && d > 0) {
            pos += l_n;   // increment local pos by global counter
            dst[pos] = d;
        }
        __syncthreads();
        i += BS;
    }
}

151 Optimized Filtering on GPU: warp-aggregated atomics. Warp aggregation is the process of combining the atomic operations from multiple threads in a warp into a single atomic. 1. Threads in the warp elect a leader thread. 2. Threads in the warp compute the total atomic increment for the warp. 3. The leader thread performs an atomic add to compute the offset for the warp. 4. The leader thread broadcasts the offset to all other threads in the warp. 5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.

152 Step 1: elect a leader. The thread identifier within a warp can be computed as:
#define WARP_SZ 32
__device__ inline int lane_id(void) { return threadIdx.x % WARP_SZ; }
A reasonable choice is to select as leader the lowest active lane:
int mask = __ballot(1);        // mask of active lanes
int leader = __ffs(mask) - 1;  // -1 for 0-based indexing
__ffs(int v) returns the 1-based position of the lowest bit that is set in v, or 0 if no bit is set (ffs stands for "find first set").

153 Step 2: compute the total increment. The total increment for the warp is equal to the number of active lanes holding true predicates. The value can be obtained with the __popc intrinsic, which returns the number of bits set in the binary representation of its argument: int change = __popc(mask);

154 Step 3: perform the atomic. Only the leader lane performs the atomic operation:
int res;
if (lane_id() == leader)
    res = atomicAdd(ctr, change); // ctr is the pointer to the counter

155 Step 4: broadcast the result. We can implement the broadcast efficiently using the shuffle intrinsic:
__device__ int warp_bcast(int v, int leader) { return __shfl(v, leader); }

156 Step 5: compute the index for each lane. This is a bit tricky: we must compute the output index for each lane by adding the broadcast counter value for the warp to the lane's intra-warp offset. The intra-warp offset is the number of active lanes with IDs lower than the current lane ID. We can compute it by applying a mask that selects only the lower lane IDs and then applying the __popc() intrinsic to the result: res += __popc(mask & ((1 << lane_id()) - 1));

157 Final code
// warp-aggregated atomic increment
__device__ int atomicAddInc(int *ctr) {
    int mask = __ballot(1);
    // select the leader
    int leader = __ffs(mask) - 1;
    // leader does the update
    int res;
    if (lane_id() == leader)
        res = atomicAdd(ctr, __popc(mask));
    // broadcast result
    res = warp_bcast(res, leader);
    // each thread computes its own value
    return res + __popc(mask & ((1 << lane_id()) - 1));
} // atomicAddInc
Download http://twin.iac.rm.cnr.it/filter.cu

158 Exercise. Write a shuffle-based kernel for the N-body problem. Hint: start from the nbody-unroll.cu code.

159 CUDA Dynamic Parallelism. A facility introduced by NVIDIA in their GK110 chip/architecture. It allows a kernel to call another kernel from within it, without returning to the host. Each such kernel call has the same calling construction: the grid and block sizes and dimensions are set at the time of the call. The facility allows computations to be done with dynamically altering grid structures and recursion, to suit the computation; for example, it allows a 2D/3D simulation mesh to be non-uniform, with increased precision at places of interest (see previous slides).

160 Kernels calling other kernels (Dynamic Parallelism). [Sketch: host code launches kernel1<<<G1,B1>>>(); the device code of __global__ void kernel1() launches kernel2<<<G2,B2>>>() and kernel3<<<G3,B3>>>() from within the kernel.] Notice that the kernel call uses the standard syntax and allows each kernel call to have different grid/block structures. The nesting depth is limited by memory and is <= 63 or 64.

161 Recursion is allowed (Dynamic Parallelism). [Sketch: host code launches kernel1<<<G1,B1>>>(); inside __global__ void kernel1(), the kernel launches kernel1<<<G2,B2>>>() again, i.e. it calls itself recursively.]

162 Parent-Child Launch Nesting. [Timeline: the host (CPU) thread launches Grid A (the parent); Grid A launches Grid B (the child); Grid B completes before Grid A completes. Derived from Fig. 20.4 of Kirk and Hwu, 2nd Ed.] There is implicit synchronization between parent and child, forcing the parent to wait until all children exit before it can exit.

163 Host-Device Data Transfers. Device-to-host memory bandwidth is much lower than device-to-device bandwidth: 8 GB/s peak (PCIe x16 Gen 2) vs. ~150 GB/s peak. Minimize transfers: intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory. Group transfers: one large transfer is usually much better than many small ones.

164 Synchronizing GPU and CPU. All kernel launches are asynchronous: control returns to the CPU immediately (be careful when measuring times!); the kernel starts executing once all previous CUDA calls have completed. Memcopies are synchronous: control returns to the CPU once the copy is complete; the copy starts once all previous CUDA calls have completed. cudaDeviceSynchronize(): the CPU code blocks until all previous CUDA calls complete. Asynchronous CUDA calls provide non-blocking memcopies and the ability to overlap memcopies and kernel execution.

165 Page-Locked Data Transfers. cudaMallocHost() allows allocation of page-locked ("pinned") host memory: same syntax as cudaMalloc() but works with CPU pointers and memory. Enables the highest cudaMemcpy performance: 3.2 GB/s on PCIe x16 Gen1, 6.5 GB/s on PCIe x16 Gen2. Use with caution!!! Allocating too much page-locked memory can reduce overall system performance. See and test the bandwidth.cu sample.

166 Overlapping Data Transfers and Computation. The async copy and stream APIs allow overlap of H2D or D2H data transfers with computation: CPU computation can overlap data transfers on all CUDA-capable devices; kernel computation can overlap data transfers on devices with "concurrent copy and execution" (compute capability >= 1.1). A stream is a sequence of operations that execute in order on the GPU; operations from different streams can be interleaved. The stream ID is used as an argument to async calls and kernel launches.

167 Asynchronous Data Transfers

- An asynchronous host-device memory copy returns control immediately to the CPU: cudaMemcpyAsync(dst, src, size, dir, stream);
- Requires pinned host memory (allocated with cudaMallocHost)
- Overlap CPU computation with the data transfer (0 = default stream):

    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d);
    cpuFunction();          // overlapped with the copy and the kernel

168 Overlapping kernel and data transfer

- Requires that the kernel and the transfer use different, non-zero streams
- To create a stream: cudaStreamCreate
- Remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
- Example:

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>( ... );   // overlapped with the copy

Exercise: read, compile and run simplestreams.cu
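When transfers and kernels operate on the same data, the usual pattern is to split the work into chunks and pipeline them across streams. A hedged sketch (NSTREAMS, the chunk sizes and the process kernel are placeholders, not taken from simplestreams.cu; h_a must be pinned with cudaMallocHost for the async copies to overlap):

    const int NSTREAMS = 4;
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    int chunk = n / NSTREAMS;                 // assume n is divisible by NSTREAMS
    for (int s = 0; s < NSTREAMS; s++) {
        int off = s * chunk;
        // the copy of chunk s overlaps with the kernel working on chunk s-1
        cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<chunk / 256, 256, 0, streams[s]>>>(d_a + off, chunk);
        cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamSynchronize(streams[s]);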

169 Device Management

The CPU can query and select GPU devices:
- cudaGetDeviceCount( int* count )
- cudaSetDevice( int device )
- cudaGetDevice( int *current_device )
- cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
- cudaChooseDevice( int *device, cudaDeviceProp* prop )

Multi-GPU setup:
- Device 0 is used by default
- Starting with CUDA 4.0, a CPU thread can control more than one GPU
- Multiple CPU threads can control the same GPU; the driver regulates the access

170 Device Management (sample code)

    int cudaDevice;
    struct cudaDeviceProp prop;
    cudaGetDevice(&cudaDevice);
    cudaGetDeviceProperties(&prop, cudaDevice);
    mpc     = prop.multiProcessorCount;
    mtpb    = prop.maxThreadsPerBlock;
    shmsize = prop.sharedMemPerBlock;
    printf("Device %d: number of multiprocessors %d\n"
           "max number of threads per block %d\n"
           "shared memory per block %d\n",
           cudaDevice, mpc, mtpb, shmsize);

Exercise: compile and run the enum_gpu.cu code

171 CUDA libraries

- CUDA includes a number of widely used libraries:
  - CUBLAS: BLAS implementation
  - CUFFT: FFT implementation
  - CURAND: random number generation
  - Etc.
- Thrust (http://docs.nvidia.com/cuda/thrust): reduction, scan, sort, etc.

172 CUDA libraries

- There are two types of runtime math operations:
  - __func(): direct mapping to the hardware ISA; fast but lower accuracy; examples: __sin(x), __exp(x), __pow(x,y)
  - func(): compiled to multiple instructions; slower but higher accuracy (5 ulp, units in the last place, or less); examples: sin(x), exp(x), pow(x,y)
- The -use_fast_math compiler option forces every func() to compile to __func()
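A tiny illustration of the difference (a hedged sketch; the numerical gap between the two results depends on the input and on the GPU):

    __global__ void fast_vs_accurate(float x, float *out)
    {
        out[0] = sinf(x);     // accurate version, compiled to several instructions
        out[1] = __sinf(x);   // hardware intrinsic: faster, lower accuracy
    }
    // nvcc -use_fast_math file.cu  =>  every sinf() above is compiled as __sinf()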

173 CURAND Library

Library for generating random numbers. Features:
- XORWOW pseudo-random generator
- Sobol quasi-random number generators
- Host API for generating random numbers in bulk
- Inline implementation allows use inside GPU functions/kernels
- Single- and double-precision; uniform, normal and log-normal distributions

174 CURAND Features

- Pseudo-random numbers: George Marsaglia, "Xorshift RNGs", Journal of Statistical Software, 8(14). Available at http://
- Quasi-random: 32-bit and 64-bit Sobol sequences with up to 20,000 dimensions
- Host API: call a kernel from the host, generate numbers on the GPU, consume numbers on the host or on the GPU
- GPU API: generate and consume numbers during kernel execution

175 CURAND use

1. Create a generator: curandCreateGenerator()
2. Set a seed: curandSetPseudoRandomGeneratorSeed()
3. Generate the data from a distribution:
   - curandGenerateUniform() / curandGenerateUniformDouble(): uniform
   - curandGenerateNormal() / curandGenerateNormalDouble(): Gaussian
   - curandGenerateLogNormal() / curandGenerateLogNormalDouble(): log-normal
4. Destroy the generator: curandDestroyGenerator()
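Putting the four steps together, a minimal host-API sketch that fills a device array with uniform single-precision numbers (the array size and seed are arbitrary):

    #include <curand.h>

    float *d_rand;
    size_t n = 1 << 20;
    cudaMalloc((void**)&d_rand, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);   // step 1
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);        // step 2
    curandGenerateUniform(gen, d_rand, n);                   // step 3: n numbers in (0,1]
    curandDestroyGenerator(gen);                             // step 4

    cudaFree(d_rand);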

176 Example CURAND Program: Computing π with the Monte Carlo method

π ≈ 4 * (Σ red points) / (Σ points)

1. Generate 0 <= x < 1
2. Generate 0 <= y < 1
3. Compute z = x^2 + y^2
4. If z <= 1, the point is red (inside the unit circle)

Accuracy increases slowly (the error decreases only as 1/sqrt(N)).
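A hedged sketch of the GPU part using the CURAND device API, with one generator state per thread; the names, sizes and reduction strategy are illustrative and are not taken from pigpu.cu:

    #include <curand_kernel.h>

    __global__ void count_hits(unsigned long long *hits, int iters, unsigned long long seed)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, id, 0, &state);         // one independent sequence per thread

        int local = 0;
        for (int i = 0; i < iters; i++) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) local++;   // point falls inside the quarter circle
        }
        atomicAdd(hits, (unsigned long long)local);
    }
    // On the host, pi is then estimated as 4.0 * hits / (total threads * iters).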

177 Example CURAND Program: Computing π with the Monte Carlo method

Compare the speed of pigpu and picpu:
    cd Pi
    gcc -o picpu -O3 picpu.c -lm
    nvcc -o pigpu -O3 pigpu.cu -arch=sm_13 -lm
Run both for a large number of iterations.
Look at pigpu.cu and try to understand: why is it so SLOW?

178 Multi-GPU Memory

- GPUs do not share global memory
- But starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
- Inter-GPU communication:
  - Application code is responsible for copying/moving data between GPUs
  - Data travels across the PCIe bus, even when the GPUs are connected to the same PCIe switch

179 Multi-GPU Environment

- GPUs have consecutive integer IDs, starting with 0
- Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
- cudaSetDevice allows changing the active GPU (and so the context)
- Device 0 is chosen when cudaSetDevice is not called
- Remember that multiple host threads can establish contexts with the same GPU
- The driver handles time-sharing and resource partitioning, unless the GPU is in exclusive mode

180 Multi-GPU Environment: a simple approach

    // Run an independent kernel on each CUDA device
    int numDevs = 0;
    cudaGetDeviceCount(&numDevs);
    ...
    for (int d = 0; d < numDevs; d++) {
        cudaSetDevice(d);
        kernel<<<blocks, threads>>>(args);
    }

Exercise: compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_35 option: nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_35

181 General Multi-GPU Programming Pattern

The goal is to hide the communication cost by overlapping it with computation.
So, every time-step, each GPU should:
- Compute the parts (halos) to be sent to its neighbors
- Compute the internal region (bulk)        } overlapped
- Exchange halos with its neighbors         }
Linear scaling holds as long as the internal computation takes longer than the halo exchange.
Actually, the separate halo computation adds some overhead.

182 Example: Two Subdomains

183 Example: Two Subdomains

[Figure: Phase 1, both GPUs compute their halo cells; Phase 2, each GPU computes its internal region while the halos are sent to the neighbor. GPU-0: green subdomain, GPU-1: grey subdomain.]

184 CUDA Features Useful for Multi-GPU

- Control multiple GPUs with a single CPU thread
  - Simpler coding: no need for CPU multithreading
- Streams:
  - Enable executing kernels and memcopies concurrently
  - Up to 2 concurrent memcopies: to/from the GPU
- Peer-to-Peer (P2P) GPU memory copies
  - Transfer data between GPUs using PCIe P2P support
  - Done by the GPU DMA hardware; the host CPU is not involved
  - Data traverses the PCIe links without touching CPU memory
  - Disjoint GPU pairs can communicate simultaneously

185 Peer-to-Peer GPU Communication

Up to CUDA 4.0, GPUs could not exchange data directly.
With CUDA >= 4.0, two GPUs connected to the same PCI-express switch can exchange data directly:

    cudaStreamCreate(&stream_on_gpu_0);
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     /* device 0 can access memory of device 1 */
    cudaMalloc(&d_0, num_bytes);          /* create array on GPU 0 */
    cudaSetDevice(1);
    cudaMalloc(&d_1, num_bytes);          /* create array on GPU 1 */
    cudaMemcpyPeerAsync(d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0);
                                          /* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */

Example: read p2pcopy.cu, compile with nvcc -o p2pcopy p2pcopy.cu -arch=sm_35
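Before enabling peer access it is good practice to check that the two devices can actually reach each other. A small hedged sketch (device IDs 0 and 1 are assumed):

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   // can device 0 read device 1's memory?
    cudaDeviceCanAccessPeer(&can10, 1, 0);   // and the other way around?

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);    // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    } else {
        // fall back to staging the copy through host memory
    }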

186 What does the code look like?

    for (int i = 0; i < num_gpus-1; i++)   // "right" stage
        cudaMemcpyPeerAsync(d_a[i+1], gpu[i+1], d_a[i], gpu[i], num_bytes, stream[i]);
    for (int i = 0; i < num_gpus; i++)
        cudaStreamSynchronize(stream[i]);
    for (int i = 1; i < num_gpus; i++)     // "left" stage
        cudaMemcpyPeerAsync(d_b[i-1], gpu[i-1], d_b[i], gpu[i], num_bytes, stream[i]);

Arguments to the cudaMemcpyPeerAsync() call: dest address, dest GPU, src address, src GPU, num bytes, stream ID.
The middle loop isn't necessary for correctness; it improves performance by preventing the two stages from interfering with each other (15 vs. 11 GB/s in a 4-GPU example).

187 Code Pattern for Multi-GPU

Every time-step, each GPU will:
- Compute the halos to be sent to its neighbors (stream h)
- Compute the internal region (stream i)
- Exchange halos with its neighbors (stream h)

188 Code For Looping through Time

    for (int istep = 0; istep < nsteps; istep++) {
        for (int i = 0; i < num_gpus; i++) {            // compute halos
            cudaSetDevice(gpu[i]);
            kernel<<<..., stream_halo[i]>>>(...);
            kernel<<<..., stream_halo[i]>>>(...);
            cudaStreamQuery(stream_halo[i]);
            kernel<<<..., stream_internal[i]>>>(...);   // compute internal subdomain
        }
        for (int i = 0; i < num_gpus-1; i++)            // exchange halos
            cudaMemcpyPeerAsync(..., stream_halo[i]);
        for (int i = 0; i < num_gpus; i++)
            cudaStreamSynchronize(stream_halo[i]);
        for (int i = 1; i < num_gpus; i++)
            cudaMemcpyPeerAsync(..., stream_halo[i]);
        for (int i = 0; i < num_gpus; i++) {            // synchronize before the next step
            cudaSetDevice(gpu[i]);
            cudaDeviceSynchronize();
            // swap input/output pointers
        }
    }

189 Code For Looping through Time (cont.)

Same loop as on the previous slide, with the stream setup highlighted:
- Two CUDA streams per GPU: halo and internal
- Each stored in an array indexed by GPU
- Calls issued to different streams can overlap
- Prior to the time loop: create the streams and enable P2P access

190 GPUs in Different Cluster Nodes

Add network communication to the halo exchange:
1. Transfer data from the edge GPUs of each node to the CPU
2. Hosts exchange the data via MPI (or another API)
3. Transfer data from the CPU to the edge GPUs of each node

With CUDA >= 4.0 there is transparent interoperability between CUDA and Infiniband:
    export CUDA_NIC_INTEROP=1
lets MPI use, for Remote Direct Memory Access operations, memory pinned by the GPU.

191 Inter-GPU Communication with MPI

Example: simplempi.c
- Generate some random numbers on one node.
- Dispatch them to all nodes.
- Compute their square root on each node's GPU.
- Compute the average of the results by using MPI.

To compile:
    $ nvcc -c simplecudampi.cu
    $ mpic++ -o simplempi simplempi.c simplecudampi.o -L/usr/local/cuda/lib64/ -lcudart
To run (on a single line):
    $ mpirun -np 2 simplempi
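A hedged sketch of how the CUDA side of such a program might look (the kernel, wrapper name and launch sizes are illustrative, not the actual contents of simplecudampi.cu): each MPI rank pushes its slice of the data to its GPU, runs the kernel, and copies the result back before the MPI reduction.

    // device kernel: in-place square root of a block of numbers
    __global__ void sqrt_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = sqrtf(data[i]);
    }

    // host-side wrapper callable from the MPI (C/C++) part of the program
    extern "C" void gpu_sqrt(float *h_data, int n)
    {
        float *d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
        sqrt_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_data);
    }
    // In simplempi.c: MPI_Scatter the input, call gpu_sqrt() on each rank,
    // then MPI_Reduce the partial results to compute the average.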
