CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)


CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23

Outline
1 CUDA qualifiers
2 CUDA Kernel
  Thread hierarchy
  Kernel, configuration and execution
  Exercise: Calculate the number of threads per block
3 Warps and Warp Scheduling
4 Synchronization
N. Cardoso & P. Bicudo CUDA programming model 2/23

CUDA qualifiers
Host code and Device code can be written in the same source file. The Host controls the Device!
How do we declare functions in CUDA that:
- run only on the Device and are called by the Host?
- run on the Device and are called only by the Device?
- run only on the Host?
N. Cardoso & P. Bicudo CUDA programming model 3/23

CUDA qualifiers
__global__ defines a kernel function; it must return void and defines the code to be executed by all threads during a parallel execution call.
__device__ defines a function that can be called inside a kernel or by other __device__ functions.
__device__ and __host__ can be used together; the compiler creates code for both host and device.
When no qualifier is used, the function is a host function.

                                                 Executed on:   Only callable from:
  __device__ void/float/int/... DeviceFunc()     device         device
  __global__ void KernelFunc()                   device         host*
  __host__   void/float/int/... HostFunc()       host           host

* In Kepler and newer architectures, __global__ functions can be called inside another __global__ function (Dynamic Parallelism).
N. Cardoso & P. Bicudo CUDA programming model 4/23
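As an illustration, a minimal sketch (not from the slides) combining the three qualifiers; the function names are hypothetical:

    // A minimal sketch; function names are hypothetical.
    __device__ float square(float x) {                 // device-only helper, callable from device code
        return x * x;
    }

    __host__ __device__ float twice(float x) {         // compiled for both host and device
        return 2.0f * x;
    }

    __global__ void fill_squares(float *out, int n) {  // kernel: launched from the host, returns void
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = square(twice((float)i));
    }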

CUDA Kernel
A kernel is a function executed on the GPU as an array of threads in parallel.
- All threads execute the same code, but can take different paths.
- Each thread has an ID, used to select input/output data and for control decisions.
- Threads are grouped into blocks; blocks are grouped into a grid.
Kernel Execution
A kernel is executed as a grid of blocks of threads (CUDA thread -> CUDA core, CUDA thread block -> CUDA Streaming Multiprocessor, CUDA kernel grid -> CUDA-enabled GPU):
- Each thread is executed by a core.
- Each block is executed by one SM and does not migrate.
- Several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources.
- Each kernel is executed on one device; multiple kernels can execute on a device at one time.
N. Cardoso & P. Bicudo CUDA programming model 5/23

CUDA Kernel: Thread hierarchy
- Code is executed on the GPU (device) by a parallel array of threads; all threads run the same code.
- Hierarchical launches: threads are grouped in blocks; blocks are grouped in grids.
- CUDA has identification variables for threads, blocks and grids.
Threads are logically grouped into thread blocks. Threads in the same block can cooperate (see the sketch after this slide):
- cooperatively load/store blocks of memory that all will use, through shared memory;
- share results with each other or cooperate to produce a single result;
- synchronize with each other (__syncthreads()).
Threads in different blocks cannot cooperate.
[Figure: the host launches Kernel 1 and Kernel 2 as Grid 1 and Grid 2 on the device; each grid contains blocks (0,0)..(1,1), and each block contains threads indexed (0,0,0)..(3,1,0).]
N. Cardoso & P. Bicudo CUDA programming model 6/23
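A minimal sketch (not from the slides) of block-level cooperation: each block stages its data into shared memory, synchronizes, and then every thread reads a neighbour's value. The kernel and array names are hypothetical, and blocks of 256 threads are assumed.

    __global__ void shift_left(const float *in, float *out, int n) {
        __shared__ float tile[256];                    // assumes blocks of 256 threads
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // cooperative load into shared memory
        __syncthreads();                               // all loads finish before any thread reads
        if (i < n) {
            int src = (threadIdx.x + 1 < blockDim.x) ? threadIdx.x + 1 : threadIdx.x;
            out[i] = tile[src];                        // read a neighbour's value from shared memory
        }
    }
    // launch, e.g.: shift_left<<<(n + 255) / 256, 256>>>(d_in, d_out, n);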

CUDA Kernel: Thread hierarchy
IDs and dimensions:
- Threads can have 3D IDs, unique within a single block (threadIdx.x, threadIdx.y, threadIdx.z); the maximum number of threads per block is architecture dependent.
- Blocks can have 2D (Fermi and below) or 3D (Kepler and newer architectures) IDs, unique within the same grid (blockIdx.x, blockIdx.y, blockIdx.z).
- Also available are the block dimensions (blockDim.x, blockDim.y, blockDim.z) and the grid dimensions (gridDim.x, gridDim.y, gridDim.z).
- All dimensions are defined by the user/programmer in the kernel call.
- The limits supported by your GPU can be listed by running ./deviceQuery from the CUDA samples directory.
Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs.
N. Cardoso & P. Bicudo CUDA programming model 7/23

CUDA Kernel
Using 1D blocks and grids:

    __global__ void kernel() {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        (...)
    }

Using 2D blocks and grids:

    __global__ void kernel() {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int id = i + j * Nx;
        (...)
    }

Important: start using 1D arrays from now on!
- a 2D array can be accessed as: i + j * Nx
- a 3D array can be accessed as: i + j * Nx + k * Nx * Ny
- pay attention to the fastest and slowest indexes, otherwise you lose performance; in the previous example, i is the fastest and k the slowest!
N. Cardoso & P. Bicudo CUDA programming model 8/23
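As a sketch (not from the slides), the same flattened indexing for a 3D problem could look like this, assuming the dimensions Nx, Ny, Nz are passed as kernel arguments:

    __global__ void kernel3d(float *data, int Nx, int Ny, int Nz) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // fastest-varying index
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;  // slowest-varying index
        if (i < Nx && j < Ny && k < Nz) {
            int id = i + j * Nx + k * Nx * Ny;          // flattened 1D index
            data[id] = 0.0f;
        }
    }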

CUDA Kernel: Kernel, configuration and execution
A kernel function can be called in the following way:

    dim3 DimGrid(100, 50);    // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);   // 256 threads per block
    kernel<<< DimGrid, DimBlock >>>(...);

dim3 can take 1, 2 or 3 parameters.
N. Cardoso & P. Bicudo CUDA programming model 9/23
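A common pattern (a sketch, not from the slides) is to size a 2D grid so that it covers an Nx x Ny domain, rounding the number of blocks up with integer division; Nx and Ny are assumed to be host-side variables:

    // Hypothetical host-side configuration covering an Nx x Ny domain
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((Nx + block.x - 1) / block.x,          // ceil(Nx / 16) blocks in x
              (Ny + block.y - 1) / block.y);         // ceil(Ny / 16) blocks in y
    kernel<<< grid, block >>>(...);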

CUDA Kernel: Kernel, configuration and execution
Calculating Y = aX + Y in parallel:

    __global__ void kernel(int n, float a, float *x, float *y){
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if(i < n) y[i] = a * x[i] + y[i];
    }

    // call the kernel function with 256 threads per block
    int nblocks = (n + 255) / 256;   // n is the array size
    kernel<<<nblocks, 256>>>(n, 2.0, X, Y);

N. Cardoso & P. Bicudo CUDA programming model 10/23
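For completeness, a minimal host-side sketch (not part of the slides) of how X and Y might be allocated and transferred around such a launch; h_x and h_y are hypothetical host arrays and error checking is omitted:

    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *X, *Y;                                         // device pointers
    cudaMalloc(&X, bytes);
    cudaMalloc(&Y, bytes);
    cudaMemcpy(X, h_x, bytes, cudaMemcpyHostToDevice);    // h_x, h_y: host arrays
    cudaMemcpy(Y, h_y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    kernel<<<nblocks, 256>>>(n, 2.0f, X, Y);

    cudaMemcpy(h_y, Y, bytes, cudaMemcpyDeviceToHost);    // copy the result back
    cudaFree(X);
    cudaFree(Y);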

CUDA Kernel: Execution configuration identifier in 1D

    __device__ inline float func(float x){ (...) }

    __global__ void kernel(float *input, float *output){
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }

    // call the kernel function with 8 threads per block
    // problem size: N*8 = total number of threads
    kernel<<<N, 8>>>(input, output);

Thread Blocks: Scalable Cooperation
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization.
- Threads in different blocks cannot cooperate.
- Divide the monolithic thread array into multiple blocks; within each block every thread computes its own threadID and runs the same code: float x = input[threadID]; float y = func(x); output[threadID] = y;
[Figure: the thread array divided into Thread Block 0, Thread Block 1, ..., Thread Block N-1, each containing threads 0..7 executing those three lines.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010, ECE 498AL, University of Illinois, Urbana-Champaign
N. Cardoso & P. Bicudo CUDA programming model 11/23

What is the maximum number of threads per block? How many blocks can I run in a single SM? How is a thread block executed in a single SM? N. Cardoso & P. Bicudo CUDA programming model 12/23

Fermi Architecture Example
- 32 CUDA Cores per SM (512 total)
- 8x peak FP64 performance
- 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock
- 2 warp schedulers
- Up to 1536 threads resident concurrently
- Up to 1024 threads per block
- 4 special-function units
- 64 KB shared memory + L1 cache
- 32K 32-bit registers
[SM block diagram: Instruction Cache, dual Scheduler/Dispatch, Register File, CUDA cores, 16 Load/Store Units, 4 Special Function Units, Interconnect Network, 64 KB Configurable Cache/Shared Memory, Uniform Cache]
N. Cardoso & P. Bicudo CUDA programming model 13/23

SM, Streaming Multiprocessor
- The GPU serves as a coprocessor to the CPU and has its own device memory on the card.
- SIMT (Single Instruction Multiple Thread) execution: many threads execute concurrently, the same instruction on different data elements; the hardware automatically handles divergence.
- Threads run in groups of 32 (the warp size) called warps.
- Each warp is executed in a SIMD fashion (i.e. all threads within a warp must execute the same instruction at any given time). Problem: branch divergence.
- Hardware multithreading: hardware resource allocation & thread scheduling; an excess of threads is used to hide latency; any thread not waiting for something can run; context switching is (basically) free.
- The hardware relies on threads to hide latency.
[Same Fermi SM block diagram as on the previous slide.]
N. Cardoso & P. Bicudo CUDA programming model 14/23

Maximum number of blocks per SM (architecture dependent): Fermi: 8, Kepler: 16, Maxwell: 32.
Maximum number of threads per SM (architecture dependent): Fermi: 1536, Kepler: 2048, Maxwell: 2048.
Maximum number of threads per block: 1024.
Block size should be a multiple of the warp size (32):
- the warp size is currently 32 threads;
- the warp size can change in future GPUs.
There isn't a universal best block size:
- it needs to be tuned manually for each kernel;
- it depends a lot on the resources used by each thread;
- start with 128 or 256 threads per block and tune at the end.
N. Cardoso & P. Bicudo CUDA programming model 15/23
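These limits can also be queried at run time; a minimal sketch (not from the slides) using cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
        printf("warp size:              %d\n", prop.warpSize);
        printf("max threads per block:  %d\n", prop.maxThreadsPerBlock);
        printf("max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
        printf("number of SMs:          %d\n", prop.multiProcessorCount);
        return 0;
    }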

Exercise: Calculate the number of threads per block
Goal: get the maximum number of threads resident on an SM.
Limitations (for this exercise):
- maximum number of threads per block: 512
- maximum number of threads per SM: 1024
- maximum number of blocks per SM: 8
Block size 8x8: 64 threads per block; at most 8 blocks per SM, so 8 x 64 = 512 threads -> 50% occupation of the SM.
Block size 16x16: 256 threads per block; 1024/256 = 4 blocks per SM -> full occupation.
Block size 32x32: 1024 threads per block > 512 -> exceeds the maximum number of threads per block.
[Figure: Tiled Matrix Multiplication, showing how blockIdx.x/blockIdx.y and threadIdx.x/threadIdx.y select the element of d_P assigned to each thread and the tile assigned to each thread block.]
N. Cardoso & P. Bicudo CUDA programming model 16/23

Warps and Warp Scheduling
- Each thread block is divided into 32-thread "warps". This is an implementation decision, not part of the CUDA programming model.
- Warps are the basic scheduling unit in the SM.
- The SM implements zero-overhead warp scheduling: warps whose next instruction has its operands ready for consumption are eligible for execution; eligible warps are selected for execution on a prioritized scheduling policy; all threads in a warp execute the same instruction.
- Prefer thread block sizes that result in mostly full warps:
  Bad:    kernel<<<N, 1>>>(...)
  Okay:   kernel<<<N / 32, 32>>>(...)
  Better: kernel<<<N / 128, 128>>>(...)
- Prefer to have enough threads per block to provide the hardware with many warps to switch between. This is how the GPU hides memory access latency.
N. Cardoso & P. Bicudo CUDA programming model 17/23

Warps and Warp Scheduling Exercise: If 3 blocks are processed by an SM and each Block has 256 threads, how many Warps are managed by the SM? Each Block is divided into 256/32 = 8 Warps There are 8 * 3 = 24 Warps At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution. N. Cardoso & P. Bicudo CUDA programming model 18/23

Warps and Warp Scheduling
Branch Divergence in Warps
- Occurs when threads inside a warp branch to different execution paths.
- Example: a branch with Path A and Path B taken by different threads of the same warp -> 50% performance loss.
[Figure: a warp serializing Path A and Path B after a divergent branch.]
N. Cardoso & P. Bicudo CUDA programming model 19/23
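An illustrative sketch (not from the slides): the first kernel diverges inside every warp because even and odd lanes take different paths, while the second branches on a warp-aligned condition so each warp follows a single path.

    __global__ void divergent(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)          // even and odd lanes of the same warp diverge
            out[i] = 1.0f;                 // Path A
        else
            out[i] = 2.0f;                 // Path B
    }

    __global__ void warp_aligned(float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)   // condition is uniform within each warp
            out[i] = 1.0f;
        else
            out[i] = 2.0f;
    }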

Warps and Warp Scheduling: Scoreboarding
- The SM schedules threads in groups of 32 parallel threads called warps.
- Fermi: 2 warp schedulers and two instruction dispatch units per SM, allowing two warps to be issued and executed concurrently. Fermi's dual warp scheduler selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, the scheduler does not need to check for dependencies within the instruction stream.
- The number of warp schedulers is architecture dependent!
Scoreboarding:
- All register operands of all instructions in the instruction buffer are scoreboarded: an instruction becomes ready after the needed values are deposited; this prevents hazards; cleared instructions are eligible for issue.
- Decoupled memory/processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, allowing memory/processor ops to proceed in the shadow of other waiting memory/processor ops.
Warp scheduling:
- Thread warps are scheduled to hide memory access delays.
- Each warp runs until it is stalled, e.g. by a memory access.
- Upon a warp stall, another warp is scheduled with near-zero overhead.
- The original warp will be ready to be scheduled again once the memory content it asked for is ready.
Note (Fermi): most instructions can be dual issued (two integer instructions, two floating point instructions, or a mix of integer, floating point, load, store, and SFU instructions); double precision instructions do not support dual dispatch with any other operation. G80 and GT200 have 16 KB of shared memory per SM; in the Fermi architecture, each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or vice versa.
N. Cardoso & P. Bicudo CUDA programming model 20/23

Thread Synchronization (within a block only)
CUDA allows thread synchronization within the same block using

    void __syncthreads();

__syncthreads():
- waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block;
- is used to coordinate communication between the threads of the same block;
- is allowed in conditional code, but only if the conditional evaluates identically across the entire thread block; otherwise the code execution is likely to hang or produce unintended side effects.
Devices of compute capability 2.x and higher support three variations of __syncthreads():
- int __syncthreads_count(int predicate); identical to __syncthreads(), with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
- int __syncthreads_and(int predicate); identical to __syncthreads(), with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.
- int __syncthreads_or(int predicate); identical to __syncthreads(), with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.
N. Cardoso & P. Bicudo CUDA programming model 21/23
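A minimal sketch (not from the slides) using one of these variants: each block counts how many of its elements exceed a threshold. The kernel and array names are hypothetical.

    __global__ void count_above(const float *in, int *block_counts, int n, float threshold) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int pred = (i < n && in[i] > threshold);        // per-thread predicate
        int count = __syncthreads_count(pred);          // barrier + block-wide count of non-zero predicates
        if (threadIdx.x == 0)
            block_counts[blockIdx.x] = count;           // one result per block
    }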

Synchronization in the Host
All kernel calls are asynchronous with respect to the host:
- control returns to the host (CPU) immediately after the kernel launch;
- the kernel is executed once all previously issued CUDA calls have completed and there are enough resources for it.
Synchronization barrier using cudaDeviceSynchronize():

    cudaError_t cudaDeviceSynchronize(void)

Blocks until the device has completed all preceding requested tasks.
The only safe way to synchronize threads in different blocks is to terminate the kernel and start a new kernel for the activities after the synchronization point.
N. Cardoso & P. Bicudo CUDA programming model 22/23
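A short host-side sketch (not from the slides) showing where the barrier is typically placed, e.g. before stopping a host-side timer:

    kernel<<<nblocks, 256>>>(n, 2.0f, X, Y);   // returns immediately (asynchronous launch)
    cudaDeviceSynchronize();                   // block here until the kernel has finished
    // it is now safe to stop a host-side timer or use results produced by the kernel;
    // note that a plain cudaMemcpy is a blocking call and also waits for prior work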

Conclusion
We must be aware of the restrictions imposed by the hardware:
- threads/SM
- blocks/SM
- threads/block
- threads/warp
The only safe way to synchronize threads in different blocks is to terminate the kernel and start a new kernel for the activities after the synchronization point.
The warp size is the most important number in CUDA.
Try to avoid or minimize branch divergence inside warps.
N. Cardoso & P. Bicudo CUDA programming model 23/23