Overview: Graphics Processing Units


Overview: Graphics Processing Units
- the advent of GPUs
- GPU architecture: the NVIDIA Fermi processor
- the CUDA programming model: simple example, thread organization, memory model
- case study: matrix multiply: memories, thread synchronization, scheduling
- case study: reductions
- performance considerations: bandwidth, scheduling, resource conflicts, instruction mix
- host-device data transfer: multiple GPUs, NVLink, Unified Memory, APUs
- the OpenCL programming model
- directive-based programming models
- refs: CUDA Toolkit Documentation; An Even Easier Introduction to CUDA (tutorial); NCI NF GPU page; Programming Massively Parallel Processors, Kirk & Hwu, Morgan-Kaufmann, 2010; CUDA by Example, Sanders & Kandrot; OpenCL web page; OpenCL in Action, Matthew Scarpino

Advent of General-Purpose Graphics Processing Units
- many applications have massive amounts of mostly independent calculations, e.g. ray tracing, image rendering, matrix computations, molecular simulations, HDTV
- these can be largely expressed in terms of SIMD operations, implementable with minimal control logic & caches and simple instruction sets
- design point: maximize the number of ALUs & FPUs and the memory bandwidth, to take advantage of Moore's Law (shown here)
- put this on a co-processor (GPU); have a normal CPU to co-ordinate, run the operating system, launch applications, etc.
- such an architecture/infrastructure requires a massive economic base for its development (the gaming industry!)
- pre 2006: only specialized graphics operations (integer & float data)
- 2006: General Purpose (GPGPU): general computations, but only through a graphics library (e.g. OpenGL)
- 2009: programmable for general (numeric) calculations (e.g. CUDA, OpenCL)
- some applications achieve large speedups (10-500x) over a single CPU core

Graphics Processor Unit Systems
- GPU systems are a co-processor device on a CPU-based system ([O'H. & Bryant, fig 1.4])
- separate memory spaces (DRAMs) for the CPU (host) and the GPU (device)
- must allocate space on the GPU and copy data from CPU memory to GPU memory (and vice versa) via the PCIe bus
- also need a way to copy the GPU executable code across and start it (kernel launch); a minimal sketch of this workflow follows below
- issues? Why not use the same memory space?
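A minimal sketch of the above workflow (not from the original slides; someKernel, x, n, numBlocks and threadsPerBlock are hypothetical placeholders):

float *x_d;                                                     // device pointer
cudaMalloc(&x_d, n * sizeof(float));                            // allocate space in device DRAM
cudaMemcpy(x_d, x, n * sizeof(float), cudaMemcpyHostToDevice);  // host -> device copy over PCIe
someKernel<<<numBlocks, threadsPerBlock>>>(x_d, n);             // kernel launch on the device
cudaMemcpy(x, x_d, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy results back to the host
cudaFree(x_d);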

Graphics Processor Unit Architecture
- GPU chip: an array of streaming multiprocessors (SMs) sharing an L2 cache; comparison with UltraSPARC T2 (courtesy Real World Tech)
- each SM has (8-32) streaming processors (SPs); only SPs (= cores) within an SM can (easily) synchronize and share data
- identical threads are organized into fixed-size blocks, each allocated to an SM; blocks in turn are divided into warps
- at any timestep, all SPs execute an instruction from a warp ("SIMT" mode); latencies are hidden by scheduling from many warps
- Tesla S2050 co-processor; Tesla S2050 architecture (courtesy NVIDIA)

The Fermi Graphics Processor Chip
- GF110 model: 1.15 GHz; 900 W; 3D grid & thread blocks; warp size: 32; max resident: blocks 8, warps 32, threads 1536 (from NCI NF page)

GPU vs CPU Floating Point Speed and Memory Bandwidth
(figure only; not reproduced in this transcription)

The Common Unified Device Architecture Programming Model
- "device" refers to a co-processor with its own DRAM that can run many threads in parallel
- "host" performs serial execution, transfers data to/from the device (via DMA), and sends (highly parallel) kernels to the device
- a kernel's threads are organized into a grid of blocks; each block is sent to an SM
- a CUDA program is a C/C++ program with device calls & kernels (each with many threads) embedded into it
- GPU threads are very lightweight (some overheads in invoking a kernel and dispatching each block)
- threads are identical but have thread (& block) ids (courtesy NCSU); see the sketch below
- the CUDA compiler (e.g. nvcc) produces a normal executable with the device code embedded into it; it has the CUDA runtime (cudart) and core (cuda) libraries linked into it
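A minimal kernel sketch illustrating these ids (not from the slides; scale, x_d and n are placeholder names):

__global__ void scale(float *x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id from block & thread ids
  if (i < n) x[i] *= alpha;                        // each thread handles one array element
}
// host side: launch a grid of ceil(n/256) blocks of 256 threads each
// scale<<<(n + 255) / 256, 256>>>(x_d, 2.0f, n);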

CUDA Program: Simple Example
reverse an array (reverseArray.cu):

__global__ void reverseArray(int *a_d, int N) {
  int idx = threadIdx.x;
  int v = a_d[N - idx - 1];
  a_d[N - idx - 1] = a_d[idx];
  a_d[idx] = v;
}

#define N (1<<16)
int main() {
  int a[N], *a_d, a_size = N * sizeof(int);   // may not dereference a_d!
  ...
  cudaMalloc((void **)&a_d, a_size);
  cudaMemcpy(a_d, a, a_size, cudaMemcpyHostToDevice);
  reverseArray<<<1, N/2>>>(a_d, N);
  cudaThreadSynchronize();                    // wait till threads finish
  cudaMemcpy(a, a_d, a_size, cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  ...
}

cf. OpenMP on a normal multicore: style; practicality?

#pragma omp parallel num_threads(N/2) default(shared)
{
  int idx = omp_get_thread_num();
  int v = a[N - idx - 1];
  a[N - idx - 1] = a[idx];
  a[idx] = v;
}

CUDA Thread Organization and Memory Model
- a 2x1 grid with 2x1 blocks; the memory model (left) reflects that of the GPU
- a 2x2 grid with 4x2x2 blocks (a launch sketch follows below)
(courtesy Real World Tech.) (courtesy NCSC)
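A sketch of how such a configuration would be launched (not from the slides; someKernel is a placeholder):

dim3 dimGrid(2, 2);       // 2x2 grid of blocks: gridDim.x == 2, gridDim.y == 2
dim3 dimBlock(4, 2, 2);   // each block has 4x2x2 threads: blockDim.x == 4, .y == 2, .z == 2
someKernel<<<dimGrid, dimBlock>>>(/* ... arguments ... */);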

Case Study: Matrix Multiply
- perform C += A*B, where C is N x N, A is N x K, and B is K x N
- column-major storage: C_ij is at C[i + j*N]
- 1st attempt: each thread computes one element of C, C_ij
- invocation with W x W thread blocks (assume W divides N)
- why is this better than using an N x N thread block? (2 reasons, both important!)
- for thread (t_x, t_y) of block (b_x, b_y): i = b_y*W + t_y and j = b_x*W + t_x
(courtesy xfig)

CUDA Matrix Multiply: Implementation
kernel:

__global__ void matMult(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  double cij = C_d[i + j*N];
  for (int k = 0; k < K; k++)
    cij += A_d[i + k*N] * B_d[k + j*K];
  C_d[i + j*N] = cij;
}

main program: needs to allocate device versions of A, B & C (A_d, B_d and C_d) and cudaMemcpy() the host versions into them
invocation with W x W thread blocks (assume W divides N):

dim3 dimG(N/W, N/W);
dim3 dimB(W, W);    // in kernel, blockDim.x == W
matMult<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);

what if N % W > 0? Add if (i < N && j < N) to the kernel and declare dim3 dimG((N+W-1)/W, (N+W-1)/W); (see the sketch below)
note: due to the SIMD nature of the SPs, cycles for both branches of the if are consumed
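A sketch of that boundary-checked variant (assumed, not from the original slides; matMultGuarded is a placeholder name):

__global__ void matMultGuarded(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && j < N) {                       // threads that fall outside C simply do nothing
    double cij = C_d[i + j*N];
    for (int k = 0; k < K; k++)
      cij += A_d[i + k*N] * B_d[k + j*K];
    C_d[i + j*N] = cij;
  }
}
// launched with dim3 dimG((N + W - 1)/W, (N + W - 1)/W), dim3 dimB(W, W)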

CUDA Memories and Thread Synchronization
- GPUs can potentially suffer even more from the memory wall: DRAM access may still be 100s of cycles, and bandwidth is limited for load/store intensive kernels
- the shared memory is on-chip (hence very fast); the __shared__ type modifier may be used to denote a (fixed-size) array allocated to shared memory
- threads within a block can synchronize via the (efficient - why?) __syncthreads() intrinsic
- (SM-level) atomic instructions can enforce data consistency within a block (see the sketch below)
- note: there is no way to synchronize between blocks, or to safely ensure data consistency across blocks; this can only be done across separate kernel invocations
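A small sketch of a block-level atomic (assumed, not from the slides; blockSum, A_d and sums_d are placeholder names; float atomicAdd() is available from compute capability 2.x, and A_d is assumed to have blockDim.x * gridDim.x elements):

__global__ void blockSum(const float *A_d, float *sums_d) {
  __shared__ float blockTotal;                 // one accumulator per block, in shared memory
  if (threadIdx.x == 0) blockTotal = 0.0f;
  __syncthreads();                             // all threads see the initialized value
  atomicAdd(&blockTotal, A_d[blockIdx.x * blockDim.x + threadIdx.x]);
  __syncthreads();                             // all atomics complete before the write-out
  if (threadIdx.x == 0) sums_d[blockIdx.x] = blockTotal;
}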

Matrix Multiply Using Shared Memory
- threads (t_x, 0) ... (t_x, W-1) all access B_{k, b_x*W + t_x}; threads (0, t_y) ... (W-1, t_y) all access A_{b_y*W + t_y, k}
- the high ratio of load to FP instructions makes it harder to hide L1 cache latencies and strains memory bandwidth
- can improve the kernel by utilizing the SM's shared memory (a launch sketch follows the kernel below):

__global__ void matMult_s(int N, int K, double *A_d, double *B_d, double *C_d) {
  __shared__ double A_s[W][W], B_s[W][W];
  int ty = threadIdx.y, tx = threadIdx.x;
  int i = blockIdx.y*W + ty, j = blockIdx.x*W + tx;
  double cij = C_d[i + j*N];
  for (int k = 0; k < K; k += W) {
    A_s[ty][tx] = A_d[i + (k + tx)*N];
    B_s[ty][tx] = B_d[(k + ty) + j*K];
    __syncthreads();
    for (int w = 0; w < W; w++)
      cij += A_s[ty][w] * B_s[w][tx];
    __syncthreads();   // can this be avoided?
  }
  C_d[i + j*N] = cij;
}
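A usage sketch (assumed, not from the slides): W must be a compile-time constant visible to the kernel (e.g. #define W 16), and is assumed here to divide both N and K:

dim3 dimG(N / W, N / W), dimB(W, W);
matMult_s<<<dimG, dimB>>>(N, K, A_d, B_d, C_d);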

GPU Scheduling - Warps
- the GigaThread scheduler assigns the (independently executable) thread blocks to each SM
- each block is divided into groups (of 32) called warps; grouping occurs in linear order of the thread id t_x + blockDim.x*t_y + blockDim.x*blockDim.y*t_z (the figure uses warp size 4); see the snippet below
- the warp scheduler determines which warps are ready to run
- with 32-thread warps, suitable block sizes range from 4x8 to 16x16
- SIMT: each SP executes the next instruction SIMD-style (note: this requires only a single instruction fetch!)
- thus, a kernel with enough blocks can scale across a GPU with any number of cores
(courtesy NVIDIA - both)
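A small sketch of that linearization inside a kernel (not from the slides):

int tid  = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);  // linear id within the block
int warp = tid / warpSize;   // which warp of the block this thread belongs to (warpSize == 32)
int lane = tid % warpSize;   // position within that warp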

Reductions and Thread Divergence
threads within a single 1D block summing A[0..N-1]:

__global__ void sumV(int N, double *A, double *s) {
  extern __shared__ double psum[];   // dynamically sized: blockDim.x doubles (see the launch sketch below)
  int bx = blockDim.x;
  int tx = threadIdx.x, x;
  psum[tx] = ...
  for (x = bx/2; x > 0; x /= 2) {
    __syncthreads();
    if (tx < x) psum[tx] += psum[tx + x];
  }
  if (tx == 0) *s = psum[tx];
}

- predicated execution: threads in a warp where the condition is false execute a no-op
- if-else statements thus cause thread divergence (worse when nested) (courtesy NVIDIA)
- here divergence is minimized: it occurs only when x < 32 (i.e. within one warp)
- cf. the alternative algorithm:

for (x = 1; x < bx; x *= 2) {
  __syncthreads();
  if (tx % (2*x) == 0) psum[tx] += psum[tx + x];
}
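A usage sketch for the dynamically sized psum array (assumed, not from the slides; bx is the chosen block size, assumed a power of two, and A_d, s_d are device pointers):

sumV<<<1, bx, bx * sizeof(double)>>>(N, A_d, s_d);   // third launch parameter = bytes of dynamic shared memory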

Global Memory Bandwidth Issues
- in the reduction example, all threads in a warp contiguously access the (shared) array psum
- this is very important when you have global memory accesses: the memory subsystem can coalesce these into a single access (see the sketch below)
- this allows the DRAM banks to deliver peak bandwidth (burst mode); reason: the 2D organization of DRAM chips (same row address) (Lect 3, p14)
- matMult example: threads within a warp access A contiguously, but not B; the effect of the accesses to B in this case is mitigated by the use of shared memory in the multiply
- note that this effect is the opposite of normal cores, where contiguous access within a thread is most desirable (maximizes spatial locality)
- worst case scenario: memory strides in (large) powers of 2, which cause memory bank conflicts
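An illustrative sketch of the two access patterns (not from the slides; copyCoalesced and copyStrided are placeholder names):

__global__ void copyCoalesced(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                       // consecutive threads in a warp touch consecutive words: coalesced
}

__global__ void copyStrided(const float *in, float *out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * stride < n) out[i] = in[i * stride];     // stride > 1 scatters the warp's accesses: poor coalescing
}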

SM Registers and Warp Scheduling
- the SM maintains the block ids of the scheduled blocks, and the thread ids (and block sizes) of the scheduled threads
- the SM's (32K-word) register file is shared between all of these; the block and thread ids are used to index the file for the registers allocated to a particular thread
- warps whose next instruction has its operands ready for consumption may be selected; round-robin is used if there are several ready; thus, registers need to be scoreboarded
- can make use of this to (software) prefetch data and better hide latencies (shared memory matmult)
- example: if there are 4 instructions between a load & its use on the G80, with 4 clock cycles needed to process an instruction, each warp covers only 4 x 4 = 16 cycles of independent work, so we need 14 active warps (the stalled one plus roughly 200/16 others) to tolerate a 200-cycle memory latency
(courtesy NVIDIA)

Performance Considerations: Shared SM Resources
- on Fermi GPUs, an SM may have resident: 8 blocks, 32 warps and 1536 threads; a 128 KB register file; 64 KB shared memory / L1 cache
- to fully utilize the block & thread slots, need at least 192 threads per block
- assuming 4-byte operands, can have at most 16 registers per thread
- optimizations on a kernel resulting in more registers may result in fewer blocks being resident... (courtesy NVIDIA)
- resource contention can cause a dramatic loss of performance; the CUDA occupancy calculator can help evaluate this

Performance Considerations: Instruction Mix
- goal: keep the SPs' FPUs fully occupied doing useful operations; every other kind of instruction (loads, address calculations, branches) hinders this!
- matrix multiply revisited:
- strategy 1: unroll the k loop, which halves loop index increments & branches:

for (int k = 0; k < K; k += 2)
  cij += A_d[i + k*N] * B_d[k + j*K] + A_d[i + (k+1)*N] * B_d[k+1 + j*K];

- strategy 2: each thread computes a 2x2 tile of C instead of a single element (see the sketch below); this reduces load instructions and reduces branches by 4x, but may require 4x the registers! It also increases thread granularity: may help if K is not large
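A sketch of strategy 2 (assumed, not from the original slides; matMult2x2 is a placeholder name, and 2W is assumed to divide N, with W x W thread blocks each covering a 2W x 2W tile of C):

__global__ void matMult2x2(int N, int K, double *A_d, double *B_d, double *C_d) {
  int i = 2 * (blockIdx.y * blockDim.y + threadIdx.y);   // top row of this thread's 2x2 tile
  int j = 2 * (blockIdx.x * blockDim.x + threadIdx.x);   // left column of the tile
  double c00 = C_d[i   + j*N],     c01 = C_d[i   + (j+1)*N];
  double c10 = C_d[i+1 + j*N],     c11 = C_d[i+1 + (j+1)*N];
  for (int k = 0; k < K; k++) {
    double a0 = A_d[i + k*N], a1 = A_d[i+1 + k*N];       // two loads of A ...
    double b0 = B_d[k + j*K], b1 = B_d[k + (j+1)*K];     // ... and two of B feed four multiply-adds
    c00 += a0*b0; c01 += a0*b1;
    c10 += a1*b0; c11 += a1*b1;
  }
  C_d[i   + j*N] = c00;  C_d[i   + (j+1)*N] = c01;
  C_d[i+1 + j*N] = c10;  C_d[i+1 + (j+1)*N] = c11;
}
// launch: matMult2x2<<<dim3(N/(2*W), N/(2*W)), dim3(W, W)>>>(N, K, A_d, B_d, C_d);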

Host-Device Issues: Multiple GPUs, NVLink, and Unified Memory
- transfer of data to/from host to device is error-prone and potentially a performance bottleneck (what if the array for an advection solver could not fit in GPU memory?)
- the problem is exacerbated when multiple GPUs are connected to one host; we can select the required device by cudaSetDevice():

cudaSetDevice(0);
cudaMalloc(&a_d, n); cudaMemcpy(a_d, a, n, ...);
reverseArray<<<1, n/2>>>(a_d, n);
cudaThreadSynchronize();
cudaMemcpyPeer(b_d, 1, a_d, 0, n);   // copy a_d on device 0 to b_d on device 1
cudaSetDevice(1);
reverseArray<<<1, n/2>>>(b_d, n);

- fast interconnects such as NVLink will reduce the transfer costs (e.g. the Sierra system)
- CUDA's Unified Memory will improve programmability issues (and in some cases, performance): cudaMallocManaged(&a, n); allocates the array on the host so that it can migrate, page-by-page, to/from the GPU(s) transparently and on demand (see the sketch below)
- alternatively, have the device and the CPU use the same memory, as on AMD's APU for Exascale Computing
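A sketch of the reverseArray example using Unified Memory (assumed, not from the slides):

int *a;
cudaMallocManaged(&a, n * sizeof(int));   // managed allocation, visible to both host and device
for (int i = 0; i < n; i++) a[i] = i;     // host writes it directly; pages migrate on demand
reverseArray<<<1, n/2>>>(a, n);
cudaDeviceSynchronize();                  // ensure the kernel has finished before the host reads a[]
cudaFree(a);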

The Open Computing Language for Devices and Regular Cores
- open standard, not proprietary like CUDA; based on C (no C++)
- design philosophy: treat GPUs and CPUs as peers; data- and task-parallel compute model
- similar execution model to CUDA:
  NDRange (CUDA grid): operates on global data; units within it cannot synchronize
  WorkGroup (CUDA block): units within it can synchronize and share local data (CUDA shared memory)
  WorkItem (CUDA thread): independent unit of execution, also has private data
- example kernel:

__kernel void reverseArray(__global int *a_d, int N) {
  int idx = get_global_id(0);
  int v = a_d[N - idx - 1];
  a_d[N - idx - 1] = a_d[idx];
  a_d[idx] = v;
}

- recall that in CUDA we could launch this as reverseArray<<<1, N/2>>>(a_d, N), but in OpenCL...

OpenCL Kernel Launch
must explicitly create a device handle, compute context and work-queue, load and compile the kernel, and finally enqueue it for execution:

clGetDeviceIDs(..., CL_DEVICE_TYPE_GPU, 1, &device, ...);
context = clCreateContext(0, 1, &device, ...);
queue = clCreateCommandQueue(context, device, ...);
program = clCreateProgramWithSource(context, 1, &source, ...);   // source: the contents of "reverseArray.cl"
clBuildProgram(program, 1, &device, ...);
reverseArray_k = clCreateKernel(program, "reverseArray", ...);
clSetKernelArg(reverseArray_k, 0, sizeof(cl_mem), &a_d);
clSetKernelArg(reverseArray_k, 1, sizeof(int), &N);
cnDimension = N/2; cnBlockSize = N/2;   // global and per-workgroup work sizes (cf. <<<1, N/2>>>)
clEnqueueNDRangeKernel(queue, reverseArray_k, 1, 0, &cnDimension, &cnBlockSize, 0, 0, 0);

- note: CUDA host code is compiled into .cubin intermediate files which follow a similar sequence for usage
- on a normal core (CL_DEVICE_TYPE_CPU), a WorkItem corresponds to an item in a work queue that a number of (kernel-level) threads get work from; the compiler may aggregate these to reduce overheads

Directive-Based Programming Models
- OpenACC enables us to specify which code is to run on a device, and how to transfer data to/from it:

#pragma acc parallel loop copyin(A, B) copy(C)
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++) {
    double cij = C[i + j*N];
    for (int k = 0; k < K; k++)
      cij += A[i + k*N] * B[k + j*K];
    C[i + j*N] = cij;
  }

- the data directive may be used to specify data placement across kernels
- the code can also be compiled to run across multiple CPUs
- OpenMP 4.0 operates similarly; for the above example:

#pragma omp target map(to: A[0:N*K], B[0:N*K]) map(tofrom: C[0:N*N])
#pragma omp parallel for default(shared)

- studies on complex applications, where all data must be kept on the device, indicate a productivity gain and a performance loss of about 2x relative to CUDA (e.g. Zhe14)

Graphics Processing Units: Summary
- designed to exploit computations expressible as large numbers of identical, independent threads, grouped into blocks: each block is allocated to an SM and hence can have synchronization within it
- GPU cores are designed for throughput, not single-thread speed: low clock speed, instructions taking several clock cycles
- SIMT execution to hide long latencies; large amounts of hardware to maintain many thread contexts
- destructive sharing appears as resource contention: performance may be lost through poor utilization, but not from load imbalance
- L2 cache and memory bandwidth are important considerations, but the main consideration in access patterns is within a warp