GPU Computing with CUDA

Size: px
Start display at page:

Download "GPU Computing with CUDA"

Transcription

1 September 4, 2011 Gabriel Noaje Source-to-source code translator : OpenMP to CUDA 1 GPU Computing with CUDA Gabriel Noaje gabriel.noaje@univ-reims.fr University of Reims Champagne-Ardenne, France CUDA Training Day June 26, 2012

2

3 INTRODUCTION TO MASSIVELY PARALLEL COMPUTING

4 Moore s Law (paraphrased) The number of transistors on an integrated circuit doubles every two years. Gordon E. Moore

5 Moore s Law (visualized) Credits: Wikimedia

6 Serial Performance Scaling is Over! Cannot continue to scale processor frequencies! no 10 GHz chips! Cannot continue to increase power consumption! can t melt chip! Can continue to increase transistor density! as per Moore s Law

7 How to Use Transistors?! Instruction-level parallelism! out-of-order execution, speculation,! vanishing opportunities in power-constrained world! Data-level parallelism! vector units, SIMD execution,! increasing SSE, AVX, Cell SPE, Clearspeed, GPU! Thread-level parallelism! increasing multithreading, multicore, manycore! Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi,

8 Why Massively Parallel Processing?! A quiet revolution and potential build-up! Computation: TFLOPs vs. 100 GFLOPs T12 GT200 G80 NV30 NV40 G70 3GHz Dual Core P4 3GHz Core2 Duo 3GHz Xeon Quad Westmere! GPU in every PC massive volume & potential impact

9 Why Massively Parallel Processing?! A quiet revolution and potential build-up! Bandwidth: ~10x T12 GT200 G80 NV30 NV40 G70 3GHz Dual Core P4 3GHz Core2 Duo 3GHz Xeon Quad Westmere! GPU in every PC massive volume & potential impact

10 KEPLER ARCHITECTURE

11 Kepler GK110 Block Diagram Architecture 7.1B Transistors 15 SMX units > 1 TFLOP FP MB L2 Cache 384-bit GDDR5 PCI Express Gen3

12 Kepler GK110 SMX vs Fermi SM

13 SMX: Efficient Performance 192 CUDA cores 64 FP units 32 Special Function Units 32 load/store units dedicated for access memories registers 64KB shared memory 48KB cache

14 Block IDs and Thread IDs Each thread uses IDs to decide what data to work on Block ID: 1D or 2D Thread ID: 1D, 2D, or 3D Simplifies memory addressing when processing multidimensional data Image processing Solving PDEs on volumes Host Kernel 1 Kernel 2 Device Grid 1 Grid 2 Block (0, 0) Block (0, 1) Block (1, 0) Block (1, 1) Block (1, 1) (0,0,1) (1,0,1) (2,0,1) (3,0,1) T hrea d (0,0,0) T hrea d (1,0,0) T hrea d (2,0,0) T hrea d (3,0,0) T hrea d (0,1,0) T hrea d (1,1,0) T hrea d (2,1,0) T hrea d (3,1,0) Courtesy:

15 The New Moore s Law! Computers no longer get faster, just wider! You must re-think your algorithms to be parallel!! Data-parallel computing is most scalable solution! Otherwise: refactor code for 2 cores 4 cores 8 cores 16 cores! You will always have more data than cores build the computation around the data

16 Generic Multicore Chip Processor Memory Processor Memory Global Memory! Handful of processors each supporting ~1 hardware thread! On-chip memory near processors (cache, RAM, or both)! Shared global memory space (external DRAM)

17 Generic Manycore Chip Processor Memory Processor Memory Global Memory! Many processors each supporting many hardware threads! On-chip memory near processors (cache, RAM, or both)! Shared global memory space (external DRAM)

18 Enter the GPU Computing! Massive economies of scale! Massively parallel

19 Movie playing 3D animation CAD (Computer-Aided Design) Games

20 GPU Evolution! High throughput computation!! High bandwidth memory!! GeForce GTX 280: 933 GFLOP/s GeForce GTX 280: 140 GB/s Fermi 3B xtors High availability to all! 180+ million CUDA-capable GPUs in the wild GeForce M xtors RIVA 128 3M xtors 1995 GeForce M xtors 2000 GeForce 3 60M xtors GeForce FX 125M xtors

21 Why is this different from a CPU?! Different goals produce different designs! GPU assumes work load is highly parallel! CPU must be good at everything, parallel or not! CPU: minimize latency experienced by 1 thread! big on-chip caches! sophisticated control logic! GPU: maximize throughput of all threads! # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)! multithreading can hide latency => skip the big caches! share control logic across many threads

22 Weather simulation Financial analysis Molecular docking Medical MRI Computer virus detection Geological modeling

23 Initially - use graphic calls: Cg by NVIDIA HLSL by Microsoft Nowadays - rich GPU programming ecosystem: ATI Stream by AMD CUDA by NVIDIA OpenCL by Khronos Group Direct Computing by Microsoft ( NVIDIA hardware specific ) = Compute Unified Device Architecture extends the ANSI C standard gentle learning curve (compared to Cg, HLSL, etc.) opens the underlying architecture to the user (driver + toolkit + SDK + docs + )

24 CUDA: Scalable parallel programming! Augment C/C++ with minimalist abstractions! let programmers focus on parallel algorithms! not mechanics of a parallel programming language! Provide straightforward mapping onto hardware! good fit to GPU architecture! maps well to multi-core CPUs too! Scale to 100s of cores & 10,000s of parallel threads! GPU threads are lightweight create / switch is free! GPU needs 1000s of threads for full utilization

25 Key Parallel Abstractions in CUDA! Hierarchy of concurrent threads! Lightweight synchronization primitives! Shared memory model for cooperating threads

26 Hierarchy of concurrent threads! Parallel kernels composed of many threads! all threads execute the same sequential program Thread t! Threads are grouped into thread blocks! threads in the same block can cooperate Block b t0 t1 tb! Threads/blocks have unique IDs

27 CUDA Model of Parallelism Block Memory Block Memory Global Memory! CUDA virtualizes the physical hardware! thread is a virtualized scalar processor (registers, PC, state)! block is a virtualized multiprocessor (threads, shared mem.)! Scheduled onto physical hardware without pre-emption! threads/blocks launch & run to completion! blocks should be independent

28 Heterogeneous Computing Multicore CPU

29 C for CUDA! Philosophy: provide minimal set of extensions necessary to expose power! Function qualifiers: global void my_kernel() { } device float my_device_func() { }! Variable qualifiers: constant float my_constant_array[32]; shared float my_shared_array[32];! Execution configuration: dim3 grid_dim(100, 50); // 5000 thread blocks dim3 block_dim(4, 8, 8); // 256 threads per block my_kernel <<< grid_dim, block_dim >>> (...); // Launch kernel! Built-in variables and functions valid in device code: dim3 griddim; // Grid dimension dim3 blockdim; // Block dimension dim3 blockidx; // Block index dim3 threadidx; // Thread index void syncthreads(); // Thread synchronization

30 CUDA PROGRAMMING BASICS

31 Outline of CUDA Basics! Basic Kernels and Execution on GPU! Basic Memory Management! Coordinating CPU and GPU Execution! See the Programming Guide for the full API

32 CUDA Programming Model! Parallel code (kernel) is launched and executed on a device by many threads! Launches are hierarchical! Threads are grouped into blocks! Blocks are grouped into grids! Familiar serial code is written for a thread! Each thread is free to execute a unique code path! Built-in thread and block ID variables

33 High Level View SMEM SMEM SMEM SMEM Global Memory PCIe CPU Chipset

34 Blocks of threads run on an SM Streaming Processor Streaming Multiprocessor SMEM Registers Thread Memory Threadblock Per-block Shared Memory Memory

35 Whole grid runs on GPU Many blocks of threads... SMEM SMEM SMEM SMEM Global Memory

36 Thread Hierarchy! Threads launched for a parallel section are partitioned into thread blocks! Grid = all blocks for a given launch! Thread block is a group of threads that can:! Synchronize their execution! Communicate via shared memory

37 Memory Model Kernel 0 Kernel 1... Per-device Global Memory Sequential Kernels...

38 Memory Model Host memory cudamemcpy() Device 0 memory Device 1 memory

39 Example: Vector Addition Kernel Device Code // Compute vector sum C = A+B // Each thread performs one pair-wise addition global void vecadd(float* A, float* B, float* C) { int i = threadidx.x + blockdim.x * blockidx.x; C[i] = A[i] + B[i]; } int main() { // Run grid of N/256 blocks of 256 threads each vecadd<<< N/256, 256>>>(d_A, d_b, d_c); }

40 Example: Vector Addition Kernel // Compute vector sum C = A+B // Each thread performs one pair-wise addition global void vecadd(float* A, float* B, float* C) { int i = threadidx.x + blockdim.x * blockidx.x; C[i] = A[i] + B[i]; } Host Code int main() { // Run grid of N/256 blocks of 256 threads each vecadd<<< N/256, 256>>>(d_A, d_b, d_c); }

41 Example: Host code for vecadd // allocate and initialize host (CPU) memory float *h_a =, *h_b = ; *h_c = (empty) // allocate device (GPU) memory float *d_a, *d_b, *d_c; cudamalloc( (void**) &d_a, N * sizeof(float)); cudamalloc( (void**) &d_b, N * sizeof(float)); cudamalloc( (void**) &d_c, N * sizeof(float)); // copy host memory to device cudamemcpy( d_a, h_a, N * sizeof(float), cudamemcpyhosttodevice) ); cudamemcpy( d_b, h_b, N * sizeof(float), cudamemcpyhosttodevice) ); // execute grid of N/256 blocks of 256 threads each vecadd<<<n/256, 256>>>(d_A, d_b, d_c);

42 Example: Host code for vecadd (2) // execute grid of N/256 blocks of 256 threads each vecadd<<<n/256, 256>>>(d_A, d_b, d_c); // copy result back to host memory cudamemcpy( h_c, d_c, N * sizeof(float), cudamemcpydevicetohost) ); // do something with the result // free device (GPU) memory cudafree(d_a); cudafree(d_b); cudafree(d_c);

43 IDs and Dimensions! Threads:! 3D IDs, unique within a block! Blocks:! 2D IDs, unique within a grid! Dimensions set at launch! Can be unique for each grid! Built-in variables:! threadidx, blockidx! blockdim, griddim Device Block (1, 1) Grid 1 Block (0, 0) Block (0, 1) Block (1, 0) Block (1, 1) Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Block (2, 0) Block (2, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2)

44 Kernel Variations and Output global void kernel( int *a ) { int idx = blockidx.x*blockdim.x + threadidx.x; a[idx] = 7; } Output: global void kernel( int *a ) { int idx = blockidx.x*blockdim.x + threadidx.x; a[idx] = blockidx.x; } Output: global void kernel( int *a ) { int idx = blockidx.x*blockdim.x + threadidx.x; a[idx] = threadidx.x; } Output:

45 Code executed on GPU! C/C++ with some restrictions: Can only access GPU memory No variable number of arguments No static variables No recursion! No dynamic polymorphism! Must be declared with a qualifier:! global : launched by CPU, cannot be called from GPU must return void! device : called from other GPU functions, cannot be called by the CPU! host : can be called by CPU! host and device qualifiers can be combined! Function is compiled for both host and device! Sample use: overloading operators

46 Memory Spaces! CPU and GPU have separate memory spaces! Data is moved across PCIe bus! Use functions to allocate/set/copy memory on GPU! Very similar to corresponding C functions! Pointers are just addresses! Can t tell from the pointer value whether the address is on CPU or GPU! Must exercise care when dereferencing:! Dereferencing CPU pointer on GPU will likely crash! Same for vice versa

47 CUDA Device Memory Model! Device code can:! R/W per-thread registers! R/W per-thread local memory! R/W per-block shared memory! R/W per-grid global memory! Read only per-grid constant memory! Host code can:! Transfer data to/from per-grid global and constant memories

48 GPU Memory Allocation / Release! Host (CPU) manages device (GPU) memory:! cudamalloc (void ** pointer, size_t nbytes)! cudamemset (void * pointer, int value, size_t count)! cudafree (void* pointer) int n = 1024; int nbytes = 1024*sizeof(int); int * d_a = 0; cudamalloc( (void**)&d_a, nbytes ); cudamemset( d_a, 0, nbytes); cudafree(d_a);

49 Data Copies! cudamemcpy( void *dst, void *src, size_t nbytes, enum cudamemcpykind direction);! returns after the copy is complete! blocks CPU thread until all bytes have been copied! doesn t start copying until previous CUDA calls complete! enum cudamemcpykind! cudamemcpyhosttodevice! cudamemcpydevicetohost! cudamemcpydevicetodevice! Non-blocking copies are also available! cudamemcpyasync

50 Code Walkthrough 1 // walkthrough1.cu #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers

51 Code Walkthrough 1 // walkthrough1.cu #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudamalloc( (void**)&d_a, num_bytes ); if( 0==h_a 0==d_a ) { printf("couldn't allocate memory\n"); return 1; }

52 Code Walkthrough 1 // walkthrough1.cu #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudamalloc( (void**)&d_a, num_bytes ); if( 0==h_a 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudamemset( d_a, 0, num_bytes ); cudamemcpy( h_a, d_a, num_bytes, cudamemcpydevicetohost );

53 Code Walkthrough 1 // walkthrough1.cu #include <stdio.h> int main() { int dimx = 16; int num_bytes = dimx*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudamalloc( (void**)&d_a, num_bytes ); if( 0==h_a 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudamemset( d_a, 0, num_bytes ); cudamemcpy( h_a, d_a, num_bytes, cudamemcpydevicetohost ); for(int i=0; i<dimx; i++) printf("%d ", h_a[i] ); printf("\n"); free( h_a ); cudafree( d_a ); } return 0;

54 Example: Shuffling Data // Reorder values based on keys // Each thread moves one element global void shuffle(int* prev_array, int* new_array, int* indices) { int i = threadidx.x + blockdim.x * blockidx.x; new_array[i] = prev_array[indices[i]]; } Host Code int main() { // Run grid of N/256 blocks of 256 threads each shuffle<<< N/256, 256>>>(d_old, d_new, d_ind); }

55 Kernel with 2D Indexing global void kernel( int *a, int dimx, int dimy ) { int ix = blockidx.x*blockdim.x + threadidx.x; int iy = blockidx.y*blockdim.y + threadidx.y; int idx = iy*dimx + ix; } a[idx] = a[idx]+1;

56 int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudamalloc( (void**)&d_a, num_bytes ); global void kernel( int *a, int dimx, int dimy ) { int ix = blockidx.x*blockdim.x + threadidx.x; int iy = blockidx.y*blockdim.y + threadidx.y; int idx = iy*dimx + ix; } a[idx] = a[idx]+1; if( 0==h_a 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudamemset( d_a, 0, num_bytes ); dim3 grid, block; block.x = 4; block.y = 4; grid.x = dimx / block.x; grid.y = dimy / block.y; kernel<<<grid, block>>>( d_a, dimx, dimy ); cudamemcpy( h_a, d_a, num_bytes, cudamemcpydevicetohost ); for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col] ); printf("\n"); } free( h_a ); cudafree( d_a ); } return 0;

57 Blocks must be independent! Any possible interleaving of blocks should be valid! presumed to run to completion without preemption! can run in any order! can run concurrently OR sequentially! A thread block is a batch of threads that can cooperate with each other by:! synchronizing their execution: _syncthreads()! sharing data with shared memory! Independence requirement gives scalability

58 CUDA MEMORIES

59 Hardware Implementation of CUDA Memories Grid! Each thread can:! Read/write per-thread registers! Read/write per-thread local memory Block (0, 0) Shared Memory Registers Registers Block (1, 0) Shared Memory Registers Registers! Read/write per-block shared memory Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0)! Read/write per-grid global memory! Read/only per-grid constant memory Host Global Memory Constant Memory

60 CUDA Variable Type Qualifiers Variable declaration Memory Scope Lifetime int var; register thread thread int array_var[10]; local thread thread shared int shared_var; shared block block device int global_var; global grid application constant int constant_var; constant grid application! automatic scalar variables without qualifier reside in a register! compiler will spill to thread local memory! automatic array variables without qualifier reside in thread-local memory

61 CUDA Variable Type Performance Variable declaration Memory Penalty int var; register 1x int array_var[10]; local 100x shared int shared_var; shared 1x device int global_var; global 100x constant int constant_var; constant 1x! scalar variables reside in fast, on-chip registers! shared variables reside in fast, on-chip memories! thread-local arrays & global variables reside in uncached off-chip memory! constant variables reside in cached off-chip memory

62 Where to declare variables? Can host access it? Yes No Outside of any function In the kernel constant int constant_var; device int global_var; int var; int array_var[10]; shared int shared_var;

63 Example thread-local variables // motivate per-thread variables with // Ten Nearest Neighbors application global void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs) { // p goes in a register float2 p = ps[threadidx.x]; // per-thread heap goes in off-chip memory float2 heap[10]; } // read through num_qs points, maintaining // the nearest 10 qs to p in the heap... // write out the contents of heap to result...

64 Example shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] input[i-1] global void adj_diff_naive(int *result, int *input) { // compute this thread s global index unsigned int i = blockdim.x * blockidx.x + threadidx.x; if(i > 0) { // each thread loads two elements from global memory int x_i = input[i]; int x_i_minus_one = input[i-1]; } } result[i] = x_i x_i_minus_one;

65 Example shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] input[i-1] global void adj_diff_naive(int *result, int *input) { // compute this thread s global index unsigned int i = blockdim.x * blockidx.x + threadidx.x; if(i > 0) { // what are the bandwidth requirements of this kernel? int x_i = input[i]; Two loads int x_i_minus_one = input[i-1]; } } result[i] = x_i x_i_minus_one;

66 Example shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] input[i-1] global void adj_diff_naive(int *result, int *input) { // compute this thread s global index unsigned int i = blockdim.x * blockidx.x + threadidx.x; if(i > 0) { // How many times does this kernel load input[i]? int x_i = input[i]; // once by thread i int x_i_minus_one = input[i-1]; // again by thread i+1 } } result[i] = x_i x_i_minus_one;

67 Example shared variables // motivate shared variables with // Adjacent Difference application: // compute result[i] = input[i] input[i-1] global void adj_diff_naive(int *result, int *input) { // compute this thread s global index unsigned int i = blockdim.x * blockidx.x + threadidx.x; if(i > 0) { // Idea: eliminate redundancy by sharing data int x_i = input[i]; int x_i_minus_one = input[i-1]; } } result[i] = x_i x_i_minus_one;

68 Example shared variables // optimized version of adjacent difference global void adj_diff(int *result, int *input) { // shorthand for threadidx.x int tx = threadidx.x; // allocate a shared array, one element per thread shared int s_data[block_size]; // each thread reads one element to s_data unsigned int i = blockdim.x * blockidx.x + tx; s_data[tx] = input[i]; } // avoid race condition: ensure all loads // complete before continuing syncthreads();...

69 Example shared variables // optimized version of adjacent difference global void adj_diff(int *result, int *input) {... if(tx > 0) result[i] = s_data[tx] s_data[tx 1]; else if(i > 0) { // handle thread block boundary result[i] = s_data[tx] input[i-1]; } }

70 Example shared variables // when the size of the array isn t known at compile time... global void adj_diff(int *result, int *input) { // use extern to indicate a shared array will be // allocated dynamically at kernel launch time extern shared int s_data[];... } // pass the size of the per-block array, in bytes, as the third // argument to the triple chevrons adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r,i);

71 A Common Programming Strategy! Partition data into subsets that fit into shared memory

72 A Common Programming Strategy! Handle each data subset with one thread block

73 A Common Programming Strategy! Load the subset from global memory to shared memory, using multiple threads to exploit memorylevel parallelism

74 A Common Programming Strategy! Perform the computation on the subset from shared memory

75 A Common Programming Strategy! Copy the result from shared memory back to global memory

76 A Common Programming Strategy! Carefully partition data according to access patterns! Read-only è constant memory (fast)! R/W & shared within block è shared memory (fast)! R/W within each thread è registers (fast)! Indexed R/W within each thread è local memory (slow)! R/W inputs/results è cudamalloc ed global memory (slow)

77 Communication Through Memory! Question: global void race(void) { shared int my_shared_variable; my_shared_variable = threadidx.x; } // what is the value of // my_shared_variable?

78 Communication Through Memory! This is a race condition! The result is undefined! The order in which threads access the variable is undefined without explicit coordination! Use barriers (e.g., syncthreads) or atomic operations (e.g., atomicadd) to enforce well-defined semantics

79 Communication Through Memory! Use syncthreads to ensure data is ready for access global void share_data(int *input) { shared int data[block_size]; data[threadidx.x] = input[threadidx.x]; syncthreads(); // the state of the entire data array // is now well-defined for all threads // in this block }

80 Communication Through Memory! Use atomic operations to ensure exclusive access to a variable // assume *result is initialized to 0 global void sum(int *input, int *result) { atomicadd(result, input[threadidx.x]); } // after this kernel exits, the value of // *result will be the sum of the input

81 Hierarchical Atomics Σ Σ 0 Σ 1 Σ ι! Divide & Conquer! Per-thread atomicadd to a shared partial sum! Per-block atomicadd to the total sum

82 SMX EXECUTION

83 How an SM executes threads! Overview of how a Stream Multiprocessor works! SIMT Execution! Divergence

84 Scheduling Blocks onto SMs Streaming Multiprocessor Thread Block 5 Thread Block 27 Thread Block 61 Thread Block 2001! HW Schedules thread blocks onto available SMs! No guarantee of ordering among thread blocks! HW will schedule thread blocks as soon as a previous thread block finishes

85 Warps! Each thread block is executed as 32-thread warps! An implementation decision, not part of the CUDA programming model! Warps are scheduling units in SMs! If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM? Each Block is divided into 256/32 = 8 Warps There are 8 x 3 = 24 Warps

86 Mapping of Thread Blocks! Each thread block is mapped to one or more warps! The hardware schedules each warp independently Thread Block N (128 threads) TB N W1 TB N W2 TB N W3 TB N W4

87 Thread Scheduling Example! SM implements zero-overhead warp scheduling! At any time, only one of the warps is executed by SM! Warps whose next instruction has its inputs ready for consumption are eligible for execution! Eligible Warps are selected for execution on a prioritized scheduling policy! All threads in a warp execute the same instruction when selected Time TB1 W1 TB2 W1 TB1, W1 stall TB2, W1 stall TB3 W1 TB3 W2 TB2 W1 TB3, W2 stall Instruction: TB1 W1 TB1 W2 TB = Thread Block, W = Warp TB1 W3 TB3 W2

88 Control Flow Divergence! What happens if you have the following code? if(foo(threadidx.x)) { do_a(); } else { do_b(); }

89 Control Flow Divergence Branch Path A Path B

90 Control Flow Divergence! Nested branches are handled as well if(foo(threadidx.x)) { if(bar(threadidx.x)) do_a(); else do_b(); } else do_c();

91 Control Flow Divergence Branch Branch Path A Path B Path C

92 Control Flow Divergence! You don t have to worry about divergence for correctness! You might have to think about it for performance! Depends on your branch conditions

93 Control Flow Divergence! Performance drops off with the degree of divergence switch(threadidx.x % N) { case 0:... case 1:... }

94 Divergence Performance Divergence

95 Compiling a CUDA program CUDA.c /.cpp Host code.gpu Device code C/C++ Libraries CUDA Libraries Object File Fat bin Executable (with embedded GPU code)

96 CUDA Makefile CC=nvcc CUDA_DIR= /opt/cuda/4.1 SDK_DIR= ${CUDA_DIR}/sdk CFLAGS= -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc LDFLAGS= -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib LIB= -lm -lrt SOURCES= vector_add.cu EXECNAME= vector_add all: clean: $(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS) $(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS) rm -f *.o core

97 OPENACC API

98 3 Ways to Accelerate Applications Applications Libraries OpenACC Directives Programming Languages Drop-in Acceleration Easily Accelerate Applications Maximum Flexibility

99 OpenACC Directives CPU GPU Simple Compiler hints Program myscience... serial code...!$acc kernels do k = 1,n1 do i = 1,n2... parallel code... enddo enddo!$acc end kernels... End Program myscience OpenACC Compiler Hint Compiler Parallelizes code Works on many-core GPUs & multicore CPUs Your original Fortran or C code

100 OpenACC Open Programming Standard for Parallel Computing

101 OpenACC The Standard for GPU Directives Easy: Directives are the easy path to accelerate compute intensive applications Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors Powerful: GPU Directives allow complete access to the massive parallel power of a GPU

102 High-level, with low-level access Compiler directives to specify parallel regions in C, C++, Fortran OpenACC compilers offload parallel regions from host to accelerator Portable across OSes, host CPUs, accelerators, and compilers Create high-level heterogeneous programs Without explicit accelerator initialization, Without explicit data or program transfers between host and accelerator Programming model allows programmers to start simple Enhance with additional guidance for compiler on loop mappings, data location, and other performance details Compatible with other GPU languages and libraries Interoperate between CUDA C/Fortran and GPU libraries e.g. CUFFT, CUBLAS, CUSPARSE, etc.

103 OpenACC Specification and Website Full OpenACC 1.0 Specification available online Quick reference card also available Available compilers: Caps HMPP OpenACC (soon at Romeo) PGI OpenACC (soon at Romeo)

104 A Very Simple Exercise: SAXPY SAXPY in C SAXPY in Fortran void saxpy(int n, float a, float *x, float *restrict y) { #pragma acc kernels for (int i = 0; i < n; ++i) y[i] = a*x[i] + y[i]; }... // Perform SAXPY on 1M elements saxpy(1<<20, 2.0, x, y);... subroutine saxpy(n, a, x, y) real :: x(:), y(:), a integer :: n, i $!acc kernels do i=1,n y(i) = a*x(i)+y(i) enddo $!acc end kernels end subroutine saxpy... $ Perform SAXPY on 1M elements call saxpy(2**20, 2.0, x_d, y_d)...

105 Directive Syntax Fortran!$acc directive [clause [,] clause] ] Often paired with a matching end directive surrounding a structured code block!$acc end directive C #pragma acc directive [clause [,] clause] ] Often followed by a structured code block

106 CUDA = much more Atomic, shuffle operations Streams GPU Direct CUBLAS, NPP, CUFFT, OpenCL David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana- 20

107 References (CUDA zone, developer zone) Programming Massively Parallel Processors: A Hands-on Approach by David Kirk, Wen-mei Hwu CUDA by Example: An Introduction to General-Purpose GPU Programming by Jason Sanders and Edward Kandrot CUDA Application Design and Development by Rob Farber

108 Full Version Questions?

109 Matrix Multiplication Example! Generalize adjacent_difference example! AB = A * B! Each element AB ij! = dot(row(a,i),col(b,j))! Parallelization strategy! Thread à AB ij! 2D kernel

110 First Implementation global void mat_mul(float *a, float *b, float *ab, int width) { // calculate the row & col index of the element int row = blockidx.y*blockdim.y + threadidx.y; int col = blockidx.x*blockdim.x + threadidx.x; float result = 0; // do dot product between row of a and col of b for(int k = 0; k < width; ++k) result += a[row*width+k] * b[k*width+col]; } ab[row*width+col] = result;

111 Idea: Use shared memory to reuse global data! Each input element is read by width threads! Load each element into shared memory and have several threads use the local version to reduce the memory bandwidth width

112 Tiled Multiply TILE_WIDTH! Partition kernel loop into phases! Load a tile of both matrices into shared each phase! Each phase, each thread computes a partial result

113 Better Implementation global void mat_mul(float *a, float *b, float *ab, int width) { // shorthand int tx = threadidx.x, ty = threadidx.y; int bx = blockidx.x, by = blockidx.y; // allocate tiles in shared memory shared float s_a[tile_width][tile_width]; shared float s_b[tile_width][tile_width]; // calculate the row & col index int row = by*blockdim.y + ty; int col = bx*blockdim.x + tx; float result = 0;

114 Better Implementation // loop over the tiles of the input in phases for(int p = 0; p < width/tile_width; ++p) { // collaboratively load tiles into shared s_a[ty][tx] = a[row*width + (p*tile_width + tx)]; s_b[ty][tx] = b[(m*tile_width + ty)*width + col]; syncthreads(); } // dot product between row of s_a and col of s_b for(int k = 0; k < TILE_WIDTH; ++k) result += s_a[ty][k] * s_b[k][tx]; syncthreads(); } ab[row*width+col] = result;

115 Use of Barriers in mat_mul! Two barriers per phase:! syncthreads after all data is loaded into shared memory! syncthreads after all data is read from shared memory! Note that second syncthreads in phase p guards the load in phase p+1! Use barriers to guard data! Guard against using uninitialized data! Guard against bashing live data

116 First Order Size Considerations! Each thread block should have many threads! TILE_WIDTH = 16 à 16*16 = 256 threads! There should be many thread blocks! 1024*1024 matrices à 64*64 = 4096 thread blocks! TILE_WIDTH = 16 à gives each SM 4 blocks, 1024 threads! Full occupancy! Each thread block performs 2 * 256 = 512 x 4B loads for 256 * (2 * 16) = 8,192 fp ops (0.25 B/op)! Compare to 4B/op

117 TILE_SIZE Effects

118 Memory Resources as Limit to Parallelism Resource Per GTX480 SM Full Occupancy on GTX480 Registers <= / 1024 threads = 32 per thread shared Memory 48KB <= 48KB / 8 blocks = 6KB per block! Effective use of different memory resources reduces the number of accesses to global memory! These resources are finite!! The more memory locations each thread requires à the fewer threads an SM can accommodate

CS 193G. Lecture 4: CUDA Memories

CS 193G. Lecture 4: CUDA Memories CS 193G Lecture 4: CUDA Memories Hardware Implementation of CUDA Memories Grid! Each thread can:! Read/write per-thread registers! Read/write per-thread local memory Block (0, 0) Shared Memory Registers

More information

EEM528 GPU COMPUTING

EEM528 GPU COMPUTING EEM528 CS 193G GPU COMPUTING Lecture 2: GPU History & CUDA Programming Basics Slides Credit: Jared Hoberock & David Tarjan CS 193G History of GPUs Graphics in a Nutshell Make great images intricate shapes

More information

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci

CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci TECHNISCHE UNIVERSITÄT WIEN Fakultät für Informatik Cyber-Physical Systems Group CUDA Programming (Basics, Cuda Threads, Atomics) Ezio Bartocci Outline of CUDA Basics Basic Kernels and Execution on GPU

More information

GPU Programming. CUDA Memories. Miaoqing Huang University of Arkansas Spring / 43

GPU Programming. CUDA Memories. Miaoqing Huang University of Arkansas Spring / 43 1 / 43 GPU Programming CUDA Memories Miaoqing Huang University of Arkansas Spring 2016 2 / 43 Outline CUDA Memories Atomic Operations 3 / 43 Hardware Implementation of CUDA Memories Each thread can: Read/write

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

More CUDA. Advanced programming

More CUDA. Advanced programming More CUDA 1 Advanced programming Synchronization and atomics Warps and Occupancy Memory coalescing Shared Memory Streams and Asynchronous Execution Bank conflicts Other features 2 Synchronization and Atomics

More information

CS 193G. Lecture 1: Introduction to Massively Parallel Computing

CS 193G. Lecture 1: Introduction to Massively Parallel Computing CS 193G Lecture 1: Introduction to Massively Parallel Computing Course Goals! Learn how to program massively parallel processors and achieve! High performance! Functionality and maintainability! Scalability

More information

CUDA C Programming Mark Harris NVIDIA Corporation

CUDA C Programming Mark Harris NVIDIA Corporation CUDA C Programming Mark Harris NVIDIA Corporation Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory Models Programming Environment

More information

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list

CUDA Programming. Week 1. Basic Programming Concepts Materials are copied from the reference list CUDA Programming Week 1. Basic Programming Concepts Materials are copied from the reference list G80/G92 Device SP: Streaming Processor (Thread Processors) SM: Streaming Multiprocessor 128 SP grouped into

More information

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks.

Register file. A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks. Sharing the resources of an SM Warp 0 Warp 1 Warp 47 Register file A single large register file (ex. 16K registers) is partitioned among the threads of the dispatched blocks Shared A single SRAM (ex. 16KB)

More information

Introduction to CUDA Programming

Introduction to CUDA Programming Introduction to CUDA Programming Steve Lantz Cornell University Center for Advanced Computing October 30, 2013 Based on materials developed by CAC and TACC Outline Motivation for GPUs and CUDA Overview

More information

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690

Stanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690 Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium

More information

CS : Many-core Computing with CUDA

CS : Many-core Computing with CUDA CS4402-9535: Many-core Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 (Moreno Maza) CS4402-9535: Many-core Computing with CUDA UWO-CS4402-CS9535

More information

GPU COMPUTING. Ana Lucia Varbanescu (UvA)

GPU COMPUTING. Ana Lucia Varbanescu (UvA) GPU COMPUTING Ana Lucia Varbanescu (UvA) 2 Graphics in 1980 3 Graphics in 2000 4 Graphics in 2015 GPUs in movies 5 From Ariel in Little Mermaid to Brave So 6 GPUs are a steady market Gaming CAD-like activities

More information

GPU CUDA Programming

GPU CUDA Programming GPU CUDA Programming 이정근 (Jeong-Gun Lee) 한림대학교컴퓨터공학과, 임베디드 SoC 연구실 www.onchip.net Email: Jeonggun.Lee@hallym.ac.kr ALTERA JOINT LAB Introduction 차례 Multicore/Manycore and GPU GPU on Medical Applications

More information

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA

CUDA Parallel Programming Model. Scalable Parallel Programming with CUDA CUDA Parallel Programming Model Scalable Parallel Programming with CUDA Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of

More information

CUDA Parallel Programming Model Michael Garland

CUDA Parallel Programming Model Michael Garland CUDA Parallel Programming Model Michael Garland NVIDIA Research Some Design Goals Scale to 100s of cores, 1000s of parallel threads Let programmers focus on parallel algorithms not mechanics of a parallel

More information

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun

Outline 2011/10/8. Memory Management. Kernels. Matrix multiplication. CIS 565 Fall 2011 Qing Sun Outline Memory Management CIS 565 Fall 2011 Qing Sun sunqing@seas.upenn.edu Kernels Matrix multiplication Managing Memory CPU and GPU have separate memory spaces Host (CPU) code manages device (GPU) memory

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

HPCSE II. GPU programming and CUDA

HPCSE II. GPU programming and CUDA HPCSE II GPU programming and CUDA What is a GPU? Specialized for compute-intensive, highly-parallel computation, i.e. graphic output Evolution pushed by gaming industry CPU: large die area for control

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

CUDA Programming Model

CUDA Programming Model CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?

More information

Introduction to GPGPUs and to CUDA programming model

Introduction to GPGPUs and to CUDA programming model Introduction to GPGPUs and to CUDA programming model www.cineca.it Marzia Rivi m.rivi@cineca.it GPGPU architecture CUDA programming model CUDA efficient programming Debugging & profiling tools CUDA libraries

More information

Module 2: Introduction to CUDA C

Module 2: Introduction to CUDA C ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

Module 2: Introduction to CUDA C. Objective

Module 2: Introduction to CUDA C. Objective ECE 8823A GPU Architectures Module 2: Introduction to CUDA C 1 Objective To understand the major elements of a CUDA program Introduce the basic constructs of the programming model Illustrate the preceding

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

Matrix Multiplication in CUDA. A case study

Matrix Multiplication in CUDA. A case study Matrix Multiplication in CUDA A case study 1 Matrix Multiplication: A Case Study Matrix multiplication illustrates many of the basic features of memory and thread management in CUDA Usage of thread/block

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Basic Elements of CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Basic Elements of CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include 3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI

More information

Lecture 5. Performance Programming with CUDA

Lecture 5. Performance Programming with CUDA Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Administrivia Introduction to CUDA (1 of n*) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Paper presentation due Wednesday, 02/23 Topics first come, first serve Assignment 4 handed today

More information

GPU Programming Using CUDA

GPU Programming Using CUDA GPU Programming Using CUDA Michael J. Schnieders Depts. of Biomedical Engineering & Biochemistry The University of Iowa & Gregory G. Howes Department of Physics and Astronomy The University of Iowa Iowa

More information

Introduction to Parallel Computing with CUDA. Oswald Haan

Introduction to Parallel Computing with CUDA. Oswald Haan Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries

More information

CUDA Parallelism Model

CUDA Parallelism Model GPU Teaching Kit Accelerated Computing CUDA Parallelism Model Kernel-Based SPMD Parallel Programming Multidimensional Kernel Configuration Color-to-Grayscale Image Processing Example Image Blur Example

More information

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation

Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Introduction to GPU Computing Junjie Lai, NVIDIA Corporation Outline Evolution of GPU Computing Heterogeneous Computing CUDA Execution Model & Walkthrough of Hello World Walkthrough : 1D Stencil Once upon

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: CUDA basics Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline

More information

GPU Computing: Introduction to CUDA. Dr Paul Richmond

GPU Computing: Introduction to CUDA. Dr Paul Richmond GPU Computing: Introduction to CUDA Dr Paul Richmond http://paulrichmond.shef.ac.uk This lecture CUDA Programming Model CUDA Device Code CUDA Host Code and Memory Management CUDA Compilation Programming

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

CSE 160 Lecture 24. Graphical Processing Units

CSE 160 Lecture 24. Graphical Processing Units CSE 160 Lecture 24 Graphical Processing Units Announcements Next week we meet in 1202 on Monday 3/11 only On Weds 3/13 we have a 2 hour session Usual class time at the Rady school final exam review SDSC

More information

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture Rafia Inam Mälardalen Real-Time Research Centre Mälardalen University, Västerås, Sweden http://www.mrtc.mdh.se rafia.inam@mdh.se CONTENTS

More information

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission) CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

University of Bielefeld

University of Bielefeld Geistes-, Natur-, Sozial- und Technikwissenschaften gemeinsam unter einem Dach Introduction to GPU Programming using CUDA Olaf Kaczmarek University of Bielefeld STRONGnet Summerschool 2011 ZIF Bielefeld

More information

Scientific discovery, analysis and prediction made possible through high performance computing.

Scientific discovery, analysis and prediction made possible through high performance computing. Scientific discovery, analysis and prediction made possible through high performance computing. An Introduction to GPGPU Programming Bob Torgerson Arctic Region Supercomputing Center November 21 st, 2013

More information

High Performance Linear Algebra on Data Parallel Co-Processors I

High Performance Linear Algebra on Data Parallel Co-Processors I 926535897932384626433832795028841971693993754918980183 592653589793238462643383279502884197169399375491898018 415926535897932384626433832795028841971693993754918980 592653589793238462643383279502884197169399375491898018

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

CSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Memories. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Memories Klaus Mueller Computer Science Department Stony Brook University Importance of Memory Access Efficiency Every loop iteration has two global memory accesses two floating

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

CE 431 Parallel Computer Architecture Spring Graphics Processor Units (GPU) Architecture

CE 431 Parallel Computer Architecture Spring Graphics Processor Units (GPU) Architecture CE 431 Parallel Computer Architecture Spring 2017 Graphics Processor Units (GPU) Architecture Nikos Bellas Computer and Communications Engineering Department University of Thessaly Some slides borrowed

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

Lecture 2: Introduction to CUDA C

Lecture 2: Introduction to CUDA C CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013 1 CUDA /OpenCL Execution Model Integrated host+device app C program Serial or

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

Parallel Accelerators

Parallel Accelerators Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon

More information

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro INTRODUCTION TO GPU COMPUTING WITH CUDA Topi Siro 19.10.2015 OUTLINE PART I - Tue 20.10 10-12 What is GPU computing? What is CUDA? Running GPU jobs on Triton PART II - Thu 22.10 10-12 Using libraries Different

More information

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA

Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Fundamental Optimizations in CUDA Peng Wang, Developer Technology, NVIDIA Optimization Overview GPU architecture Kernel optimization Memory optimization Latency optimization Instruction optimization CPU-GPU

More information

COSC 6339 Accelerators in Big Data

COSC 6339 Accelerators in Big Data COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,

More information

Introduction to CUDA (1 of n*)

Introduction to CUDA (1 of n*) Agenda Introduction to CUDA (1 of n*) GPU architecture review CUDA First of two or three dedicated classes Joseph Kider University of Pennsylvania CIS 565 - Spring 2011 * Where n is 2 or 3 Acknowledgements

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan

CUDA Workshop. High Performance GPU computing EXEBIT Karthikeyan CUDA Workshop High Performance GPU computing EXEBIT- 2014 Karthikeyan CPU vs GPU CPU Very fast, serial, Low Latency GPU Slow, massively parallel, High Throughput Play Demonstration Compute Unified Device

More information

GPU Computing with CUDA. Part 2: CUDA Introduction

GPU Computing with CUDA. Part 2: CUDA Introduction GPU Computing with CUDA Part 2: CUDA Introduction Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation" Dominik Göddeke Angewandte Mathematik und Numerik TU Dortmund dominik.goeddeke@math.tu-dortmund.de

More information

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

More information

GPU Memory Memory issue for CUDA programming

GPU Memory Memory issue for CUDA programming Memory issue for CUDA programming Variable declaration Memory Scope Lifetime device local int LocalVar; local thread thread device shared int SharedVar; shared block block device int GlobalVar; global

More information

GPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA

GPU Architecture and Programming. Andrei Doncescu inspired by NVIDIA GPU Architecture and Programming Andrei Doncescu inspired by NVIDIA Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions are executed

More information

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms

What is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D

More information

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

Overview: Graphics Processing Units

Overview: Graphics Processing Units advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Lecture 1: Introduction and Computational Thinking

Lecture 1: Introduction and Computational Thinking PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

More information

Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden. Computing with Graphical Processing Units CUDA Programming Matrix multiplication

Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden. Computing with Graphical Processing Units CUDA Programming Matrix multiplication Lecture 6 CSE 260 Parallel Computation (Fall 2015) Scott B. Baden Computing with Graphical Processing Units CUDA Programming Matrix multiplication Announcements A2 has been released: Matrix multiplication

More information

GPU programming: CUDA. Sylvain Collange Inria Rennes Bretagne Atlantique

GPU programming: CUDA. Sylvain Collange Inria Rennes Bretagne Atlantique GPU programming: CUDA Sylvain Collange Inria Rennes Bretagne Atlantique sylvain.collange@inria.fr This lecture: CUDA programming We have seen some GPU architecture Now how to program it? 2 Outline GPU

More information

Programming with CUDA, WS09

Programming with CUDA, WS09 Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller Lecture 3 Thursday, 29 Nov, 2009 Recap Motivational videos Example kernel Thread IDs Memory overhead CUDA hardware and programming

More information

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60

GPU Programming. Performance Considerations. Miaoqing Huang University of Arkansas Fall / 60 1 / 60 GPU Programming Performance Considerations Miaoqing Huang University of Arkansas Fall 2013 2 / 60 Outline Control Flow Divergence Memory Coalescing Shared Memory Bank Conflicts Occupancy Loop Unrolling

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

GPU computing Simulating spin models on GPU Lecture 1: GPU architecture and CUDA programming. GPU computing. GPU computing.

GPU computing Simulating spin models on GPU Lecture 1: GPU architecture and CUDA programming. GPU computing. GPU computing. GPU computing Simulating spin models on GPU Lecture 1: GPU architecture and CUDA programming Martin Weigel Applied Mathematics Research Centre, Coventry University, Coventry, United Kingdom and Institut

More information

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics GPU Programming Rüdiger Westermann Chair for Computer Graphics & Visualization Faculty of Informatics Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

COSC 6385 Computer Architecture. - Data Level Parallelism (II)

COSC 6385 Computer Architecture. - Data Level Parallelism (II) COSC 6385 Computer Architecture - Data Level Parallelism (II) Fall 2013 SIMD Instructions Originally developed for Multimedia applications Same operation executed for multiple data items Uses a fixed length

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS

HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or

More information

Lecture 3: Introduction to CUDA

Lecture 3: Introduction to CUDA CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

More information

General Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop

General Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop General Purpose GPU programming (GP-GPU) with Nvidia CUDA Libby Shoop 3 What is (Historical) GPGPU? General Purpose computation using GPU and graphics API in applications other than 3D graphics GPU accelerates

More information