Stanford University
[Images: NVIDIA Tesla M2090, NVIDIA GeForce GTX 690]
Moore's Law
Clock Speed
[Figure: CPU clock speed (MHz, log scale from 1 to 10000) versus date of introduction, 1974 to 2011, covering the 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4 Willamette, Pentium 4 Prescott, Core 2, Nehalem, and Sandy Bridge]
How did they use all those additional transistors?
  - Additional functionality
      - Floating point units
      - SSE vector units
  - Caches
      - Data
      - Instructions
      - Translation Lookaside Buffer (virtual memory)
      - Hardware prefetcher
  - Instruction Level Parallelism
      - Instruction pipelining
      - Superscalar execution
      - Out of order execution
      - Speculative execution
      - Branch prediction
Multicore CPUs
[Figure: maximum number of cores (1 to 16) versus date of introduction for AMD Opteron CPUs, 2003 to 2011; image of a six core AMD Opteron from AMD]
Graphics Processing Unit (GPU)
  - Leverages demand and volume from video game players
  - Consumer versions widely available at any electronics store at a variety of price points
  - Programmable via free software options
(images from NVIDIA)
Compute and Bandwidth Performance
[Figures: FLOPS and bandwidth. Source: NVIDIA CUDA C Programming Guide Version 4.2]
Hardware Characteristics
  - 3 or 4 generations of NVIDIA CUDA architectures were released from 2007 to 2012
  - Key characteristics are:
      - Architectural differences
          - Double precision floating point on newer generations
      - Memory system
          - Newer generations are much more flexible
          - L1 and L2 caches have been added in later generations
      - Number of cores and the rate at which they're clocked
      - Bandwidth
          - Depends on memory clock and width of the memory interface
      - Amount of on-board memory (256 MB to 6 GB)
      - Power consumption
          - Higher end GPUs need auxiliary 6- and/or 8-pin PCIe power connectors
NVIDIA Fermi Architecture
Fermi Streaming Multiprocessor
GPU Parallelism
  - A multicore CPU needs one or two threads per core to run efficiently
  - GPUs need thousands to tens of thousands of threads to run efficiently
      - Each time a GPU computes a frame (which it does tens of times a second) it uses a thread per pixel, of which there are millions
      - In the context of computation, this means fine grain parallelism
  - This is possible because GPU threads are different from CPU threads
GPU Threads
  - Lightweight compared to CPU threads
      - Creation, scheduling, and destruction are done in hardware
      - Fast switching between threads
  - Many times more threads than cores
      - Typically about 100x more threads than cores
  - Why so many threads?
GPU Memory System
  - GPU doesn't rely solely on cache to hide memory latency
      - Many more transistors available for computational units
      - Can do without cache because it's specialized to handle parallel computations
  - When a thread stalls due to memory access latency, a core switches to executing another thread, and when that thread eventually stalls, the core switches to another thread, etc.
      - Registers are partitioned to allow fast switching (one clock cycle)
      - More threads means more opportunity for latency hiding
  - A GPU can run a single-threaded function but the performance will be horrible
GPU Programming Today
  - NVIDIA CUDA C and Fortran
  - OpenCL
  - Microsoft DirectCompute
  - OpenACC
Getting Started with CUDA
  - Hardware
      - Requires compatible hardware (emulator no longer available)
      - Any reasonably new NVIDIA GPU supports CUDA
      - Check NVIDIA website for more information: http://www.nvidia.com/object/cuda_learn_products.html
  - CUDA C
      - Freely available from NVIDIA website: http://www.nvidia.com/getcuda
      - CUDA driver (part of display driver), toolkit (compiler and libraries), and SDK code examples
      - Windows, Linux, and Mac OS X supported
      - Use the Quick Start Guides for installation and verification
CUDA C
  - CPU code
      - API for interacting with the GPU(s)
      - Extension for easily invoking computational kernels that run on the GPU
  - GPU code
      - Subset of C
          - Kernels must return void
          - Recursion, malloc/free or new/delete, printf, assert, etc. only supported on Fermi GPUs and beyond
      - Extensions
          - Parallel programming model
  - Libraries
      - Callable from the CPU, run on the GPU
      - BLAS, FFT, CURAND, CUSPARSE, NPP, Thrust, etc.
Simple CUDA C Program

    /* GPU code: __global__ is the keyword that indicates GPU code. */
    __global__ void addvectors(float *a, float *b, float *c)
    {
        /* Each thread computes a unique index to select the data
           it will access and produce. */
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        c[idx] = a[idx] + b[idx];
    }

    /* CPU code calls GPU code. The CPU specifies how many GPU threads
       to start, in this case one thread per element in the vectors. */
    addvectors<<<nvalues/256, 256>>>(a_d, b_d, c_d);
CPU and GPU Memory
  - CPU and GPU each have their own physical memory
  - Data transferred over PCI Express (PCIe)
      - 8 GB/s theoretical peak for Gen 2 x16, up to ~5.5 GB/s observed
[Figure: CPU and GPU memories connected by PCI Express]
Allocating GPU Memory
  - GPU memory is explicitly allocated and freed
      - cudaMalloc, cudaFree
  - Pointers to memory allocated on the GPU are not valid on the CPU and vice versa
  - GPU uses a virtual memory system, but:
      - On Windows Vista systems (and their derivatives, e.g. Windows HPC Server 2008) and up, allocating beyond physical memory will automatically result in paging to CPU memory
      - On all other operating systems allocations will fail
          - Done for performance reasons
  - Allocation has the life of the host CPU process/thread
      - Automatically cleaned up by the driver if the application doesn't
cudaMalloc

    cudaError_t cudaMalloc(void **devPtr, size_t size)

    float *a_d;
    cudaError_t status = cudaMalloc((void **) &a_d, 1024*sizeof(float));
    assert(status == cudaSuccess);

  - No type distinction between CPU and GPU pointers
      - Recommend adopting a standard convention, e.g. _d suffix
  - Be sure to implement some sort of error checking consistent with your application
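One common way to follow the error checking advice is to wrap every runtime call in a macro. The sketch below is only an illustration: CUDA_CHECK and alloc_and_free are hypothetical names introduced here, not part of CUDA. It relies on cudaGetErrorString from the runtime API to turn the status code into a readable message.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print the error string
// and exit if a runtime API call does not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t status_ = (call);                               \
        if (status_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(status_));                   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Allocate and free n floats on the GPU.
// Returns true if cudaMalloc produced a non-NULL device pointer.
bool alloc_and_free(size_t n)
{
    float *a_d = NULL;   // _d suffix marks a device pointer
    CUDA_CHECK(cudaMalloc((void **) &a_d, n*sizeof(float)));
    bool ok = (a_d != NULL);
    CUDA_CHECK(cudaFree(a_d));
    return ok;
}

int main(void)
{
    bool ok = alloc_and_free(1024);
    printf("allocation %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

Exiting on the first error is the simplest policy; an application with its own error handling would likely return the status to the caller instead.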
cudaFree

    cudaError_t cudaFree(void *devPtr)

    status = cudaFree(a_d);
    assert(status == cudaSuccess);
cudaMemcpy

    cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

  - Copies from the memory area pointed to by src to the memory area pointed to by dst
  - kind specifies the direction of the copy:
      - cudaMemcpyHostToHost
      - cudaMemcpyHostToDevice
      - cudaMemcpyDeviceToHost
      - cudaMemcpyDeviceToDevice
  - Calls with dst and src pointers inconsistent with the copy direction will result in undefined behavior (typically garbage in the destination, or perhaps even an application crash)
  - cudaMemcpy will block until the memory copy has completed
      - Options for asynchronous copies are also available
cudaMemcpy Example

    float *a, *b, *a_d, *b_d;
    size_t bytes = 1024*sizeof(float);
    a = (float *) malloc(bytes);
    b = (float *) malloc(bytes);
    /* Give a known contents so the comparison below is meaningful. */
    for (int n = 0; n < 1024; n++) a[n] = (float) n;
    cudaMalloc((void **) &a_d, bytes);
    cudaMalloc((void **) &b_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, bytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b, b_d, bytes, cudaMemcpyDeviceToHost);
    for (int n = 0; n < 1024; n++) assert(a[n] == b[n]);
Parallel Programming Model
  - Parallel portions of the code are initiated from the CPU and run on the GPU
  - Parallelism is based on many threads running in parallel
  - Developer writes one thread program
      - Each instance of the thread will use a unique index to determine which portion of the computation to perform
  - Sometimes referred to as SIMD, SPMD, or SIMT
      - Single Instruction Multiple Data
      - Single Program Multiple Data
      - Single Instruction Multiple Threads
Parallel Threads
[Figure: threads with idx = 0, 1, 2, 3, ..., nthreads - 1, each executing the same code:]

    x = input[idx];
    y = func(x);
    output[idx] = y;
Thread Cooperation
  - Useful to have threads cooperate with one another
      - Share intermediate results
      - Reduce memory accesses (stencil operations, etc.)
  - Cooperation is difficult to scale
      - Synchronization is expensive
      - Potential for deadlock
  - Kernels are launched as a grid of thread blocks
Grid of Thread Blocks
[Figure: a grid made up of blocks 0, 1, 2, ..., nblocks - 1]
  - Threads in the same block can cooperate
      - More on this later when we discuss shared memory
  - Threads in different blocks cannot cooperate
Hardware Execution

    Software        Hardware
    Thread          Thread Processor
    Thread Block    Multiprocessor
    Grid            Device
Scalability Across GPUs
  - Blocks are scheduled across one or more multiprocessors
  - A correctly written program will work for any number of multiprocessors and any ordering of blocks
  - Blocks which won't fit are queued and started when other blocks finish
[Figure: block scheduling over time on Device A and Device B]
GPU Code
  - A kernel is a C function with restrictions:
      - Must return void
      - Cannot access host memory
      - No variable number of arguments
      - No recursion on older generations of GPUs
      - No static variables
  - Function arguments are automatically copied from host to device
      - But not the memory that backs pointers
Function Qualifiers
  - Function qualifiers specify where a function will be called from and where it will execute
  - __global__
      - Called from the host and executes on the device
  - __device__
      - Called from the device and executes on the device
  - __host__
      - Called from the host and executes on the host
      - Combine with __device__ for overloading
Kernel Launch
  - Special syntax for invoking kernels:

        mykernel<<<dim3 dimgrid, dim3 dimblock>>>( );

  - The <<< and >>> pair encloses what is referred to as the execution configuration
  - dimgrid is the number of blocks in the grid
      - One or two dimensional: dimgrid.x, dimgrid.y
  - dimblock is the number of threads in a block
      - One, two, or three dimensional: dimblock.x, dimblock.y, dimblock.z
  - Multidimensional grids and blocks are for programming convenience
  - Unspecified dim3 fields default to 1
Execution Configuration Examples

    dim3 grid, block;
    grid.x = 2; grid.y = 4;
    block.x = 16; block.y = 16;
    mykernel<<<grid, block>>>( );

    dim3 grid(2, 4), block(16, 16);
    mykernel<<<grid, block>>>( );

    mykernel<<<8, 256>>>( );
Built-in Variables
  - __global__ and __device__ functions have access to several automatically defined variables
      - dim3 gridDim: dimension of the grid in blocks
      - dim3 blockDim: dimension of the block in threads
      - dim3 blockIdx: block index within the grid
      - dim3 threadIdx: thread index within the block
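To illustrate how the built-in variables combine with a two dimensional execution configuration, here is a minimal sketch. The kernel fill2d and the helpers check and count_index_mismatches are hypothetical names chosen for this example; each thread derives a column and a row from the .x and .y fields and writes its own flattened position.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// Each thread writes its own flattened (row-major) position into out.
__global__ void fill2d(int *out, int width, int height)
{
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    if (col < width && row < height)   // the grid may overhang the array
        out[row*width + col] = row*width + col;
}

// Launch fill2d over a width x height array and count wrong entries.
int count_index_mismatches(int width, int height)
{
    size_t bytes = (size_t) width*height*sizeof(int);
    int *out = (int *) malloc(bytes);
    int *out_d = NULL;
    check(cudaMalloc((void **) &out_d, bytes));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1)/block.x,    // integer ceiling
              (height + block.y - 1)/block.y);
    fill2d<<<grid, block>>>(out_d, width, height);
    check(cudaGetLastError());
    check(cudaMemcpy(out, out_d, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int n = 0; n < width*height; n++)
        if (out[n] != n) mismatches++;

    check(cudaFree(out_d));
    free(out);
    return mismatches;
}

int main(void)
{
    int mismatches = count_index_mismatches(100, 37);
    assert(mismatches == 0);
    printf("mismatches: %d\n", mismatches);
    return 0;
}
```

The sizes 100 and 37 are deliberately not multiples of 16, so the bounds test in the kernel is exercised.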
Globally Unique Thread Indices
[Figure: a grid of blocks with four threads per block]

    blockIdx.x  |    0    |    1    |     2     | ... | nblocks - 1
    threadIdx.x | 0 1 2 3 | 0 1 2 3 | 0 1 2  3  | ... | 0 1 2 3
    idx         | 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | ... |

    idx = blockIdx.x*blockDim.x + threadIdx.x;
Vector Addition Example

    __global__ void addvectors(float *a, float *b, float *c, int N)
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        /* The last block may extend past the end of the vectors. */
        if (idx < N) c[idx] = a[idx] + b[idx];
    }
    ...
    blocksize = 256;
    dim3 dimgrid(ceil(nvalues/(float)blocksize));
    addvectors<<<dimgrid, blocksize>>>(a_d, b_d, c_d, nvalues);
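For reference, the vector addition kernel can be embedded in a complete program along the following lines. This is a sketch rather than the original course code: the host wrapper add_on_gpu and the helper check are hypothetical names, and the vector length of 1000 is chosen only so that the last block is partially full.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

__global__ void addvectors(const float *a, const float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

// Hypothetical host wrapper: computes c = a + b on the GPU for host arrays.
void add_on_gpu(const float *a, const float *b, float *c, int nvalues)
{
    size_t bytes = nvalues*sizeof(float);
    float *a_d = NULL, *b_d = NULL, *c_d = NULL;
    check(cudaMalloc((void **) &a_d, bytes));
    check(cudaMalloc((void **) &b_d, bytes));
    check(cudaMalloc((void **) &c_d, bytes));
    check(cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice));

    int blocksize = 256;
    int nblocks = (nvalues + blocksize - 1)/blocksize;   // integer ceiling
    addvectors<<<nblocks, blocksize>>>(a_d, b_d, c_d, nvalues);
    check(cudaGetLastError());   // catches launch configuration errors

    // cudaMemcpy blocks until the kernel and the copy have completed.
    check(cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(a_d));
    check(cudaFree(b_d));
    check(cudaFree(c_d));
}

int main(void)
{
    const int nvalues = 1000;
    float a[nvalues], b[nvalues], c[nvalues];
    for (int n = 0; n < nvalues; n++) { a[n] = (float) n; b[n] = 2.f*n; }

    add_on_gpu(a, b, c, nvalues);

    for (int n = 0; n < nvalues; n++) assert(c[n] == 3.f*n);
    printf("c[%d] = %f\n", nvalues - 1, c[nvalues - 1]);
    return 0;
}
```

The integer ceiling gives the same block count as the floating point ceil in the slide without a conversion to float.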
Reduction
  - Combine the elements of an array using an associative, commutative operator
  - Typical examples include: sum, min, max, product, etc.

        sum = 0.;
        for (n = 0; n < nvalues; n++) sum += a[n];

  - Note that CUDPP and Thrust implement very efficient reductions as library calls:
      - http://gpgpu.org/developer/cudpp
      - http://code.google.com/p/thrust/
Generic Parallel Reduction
  - Implemented using recursive pairwise reduction

        7   3   8  -5  19  27   0   9
         10       3      46       9
             13              55
                     68
Simple Parallel Reduction

    Initial     7   3   8  -5  19  27   0   9
    Kernel 1   26  30   8   4  19  27   0   9
    Kernel 2   34  34   8   4  19  27   0   9
    Kernel 3   68  34   8   4  19  27   0   9
Simple Parallel Reduction

    __global__ void sumreductionkernel(float *a, int nthreads)
    {
        int idx = blockDim.x*blockIdx.x + threadIdx.x;
        if (idx < nthreads) a[idx] += a[idx + nthreads];
    }
    ...
    /* Halving the thread count each pass assumes nvalues is a power of two. */
    nthreads = nvalues/2;
    while (nthreads > 0) {
        dimgrid.x = ceil((float)nthreads/blocksize);
        /* Launch with the grid size just computed for this pass. */
        sumreductionkernel<<<dimgrid, blocksize>>>(a_d, nthreads);
        nthreads /= 2;
    }
    cudaMemcpy(a, a_d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum of a = %f\n", a[0]);
Memory Model
  - So far we've seen per thread variables (like idx), which are stored in registers, and memory in the off-chip DRAM (device/global memory)
  - Registers:
      - Accessible by one thread
      - Life of the thread
  - Device:
      - Accessible by all threads
      - Life of the application
Shared Memory
  - Exchanging data through device memory is expensive due to bandwidth and latency, and it also requires multiple kernel launches
      - Global synchronization happens between kernel launches
  - Instead use high performance on-chip memory
      - ~100 times lower latency than device memory
      - ~10 times more bandwidth
  - 16 KB to 48 KB SRAM per multiprocessor
      - 16 KB on compute capability 1.x
      - 16 KB to 48 KB on compute capability 2.x+
  - Allocated per thread block and can be read and written by any thread in the block
  - Has the lifetime of the thread block
Expanded Memory Model
  - Registers:
      - Accessible by one thread
      - Life of the thread
  - Shared Memory:
      - Accessible by all threads in a block
      - Life of the thread block
  - Device:
      - Accessible by all threads
      - Life of the application
Variable Qualifiers
  - __device__
      - Located in off-chip DRAM memory
      - Allocated with cudaMalloc (__device__ qualifier implied)
      - Life of the application
      - Accessible from threads and host
  - __shared__
      - Located in on-chip shared memory
      - Life of the thread block
      - Only accessible from threads within the block
  - __constant__
      - See the documentation
  - Unqualified variables in device code normally reside in registers
Example of Shared Memory Declaration

    #define BLOCKSIZE 256
    __global__ void mykernel(float *a, int nvalues)
    {
        /* Per thread block shared memory. */
        __shared__ float a_s[BLOCKSIZE];
        /* Local (per thread) variables. */
        int idx;
        ...
    }
More on Shared Memory Declaration

    __global__ void mykernel(float *a, int nvalues)
    {
        /* Per thread block shared memory.
           Size of a_s is specified at kernel launch. */
        extern __shared__ float a_s[];
        /* Local variables. */
        int idx;
        ...
    }
    ...
    bytes = 256*sizeof(float);
    mykernel<<<dimgrid, dimblock, bytes>>>(a_d, nvalues);
Thread Synchronization
  - Threads can cooperate by writing and reading shared memory, but there is potential for race conditions
      - A thread reads from shared memory before another thread has written the data, etc.
  - __syncthreads() synchronizes all threads in a block
      - Acts as a barrier
      - No thread in the block can continue until all threads reach it
      - Allowed in conditional code only if the conditional is uniform across the entire thread block!
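A small example of the barrier in use: reversing one block's worth of data through shared memory. This is a sketch under simple assumptions (a single block of exactly BLOCKSIZE threads); the names reverseblock, reverse_on_gpu, and check are hypothetical. Without the __syncthreads() call, a thread could read an element of a_s that its owner has not written yet.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCKSIZE 256

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// Reverses BLOCKSIZE elements in place; must be launched with one block
// of BLOCKSIZE threads.
__global__ void reverseblock(float *a)
{
    __shared__ float a_s[BLOCKSIZE];
    int t = threadIdx.x;

    a_s[t] = a[t];        // each thread stages one element
    __syncthreads();      // barrier: every element of a_s is now written
    a[t] = a_s[BLOCKSIZE - 1 - t];   // read an element staged by another thread
}

// Hypothetical host wrapper: a must hold BLOCKSIZE floats.
void reverse_on_gpu(float *a)
{
    size_t bytes = BLOCKSIZE*sizeof(float);
    float *a_d = NULL;
    check(cudaMalloc((void **) &a_d, bytes));
    check(cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice));
    reverseblock<<<1, BLOCKSIZE>>>(a_d);
    check(cudaGetLastError());
    check(cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(a_d));
}

int main(void)
{
    float a[BLOCKSIZE];
    for (int n = 0; n < BLOCKSIZE; n++) a[n] = (float) n;

    reverse_on_gpu(a);

    for (int n = 0; n < BLOCKSIZE; n++)
        assert(a[n] == (float) (BLOCKSIZE - 1 - n));
    printf("a[0] = %f\n", a[0]);
    return 0;
}
```

The barrier sits outside any conditional, so every thread in the block reaches it, which satisfies the uniformity requirement.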
Better Parallel Reduction
[Figure: Kernel 1: Block 0 reduces a[0]..a[255], Block 1 reduces a[256]..a[511], and Block 2 reduces a[512]..a[767], producing a[0], a[1], and a[2]. Kernel 2: Block 0 reduces a[0], a[1], a[2] into a[0].]
  - Reductions within each thread block significantly reduce the number of kernel invocations and the amount of data written back to memory from each kernel
Better Parallel Reduction: GPU Code

    #define BLOCKSIZE 256
    __global__ void sumreductionkernel(float *a, int nvalues)
    {
        int n = BLOCKSIZE/2;
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        /* Shared memory common to all threads within a block. */
        __shared__ float a_s[BLOCKSIZE];

        /* Load data from global memory into shared memory. */
        if (idx < nvalues) a_s[threadIdx.x] = a[idx];
        else a_s[threadIdx.x] = 0.f;
        __syncthreads();

        /* Reduction within this thread block. */
        while (n > 0) {
            if (threadIdx.x < n) a_s[threadIdx.x] += a_s[threadIdx.x + n];
            n /= 2;
            __syncthreads();
        }

        /* Thread 0 writes the one value from this block back to global memory. */
        if (threadIdx.x == 0) a[blockIdx.x] = a_s[0];
    }
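Better Parallel Reduction: CPU Code

    ...
    nthreads = nvalues;
    while (nthreads > 1) {
        dimgrid.x = ceil((float)nthreads/BLOCKSIZE);
        sumreductionkernel<<<dimgrid, BLOCKSIZE>>>(a_d, nthreads);
        /* Each block leaves one partial sum, so that many values remain.
           Dividing by BLOCKSIZE would drop a partial sum whenever the
           count is not a multiple of BLOCKSIZE. */
        nthreads = dimgrid.x;
    }
    ...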
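The block-level reduction can be assembled into a complete program roughly as follows. This is a sketch, not the original course code, and it differs from the slides in one respect that is an assumption of this example: each pass writes its partial sums to a second buffer (out) instead of back into a, so that no block overwrites an element another block may not have loaded yet. The names sum_on_gpu and check are hypothetical.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCKSIZE 256

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// Each block sums up to BLOCKSIZE elements of in and writes one partial
// sum to out[blockIdx.x]. Must be launched with BLOCKSIZE threads per block.
__global__ void sumreductionkernel(const float *in, float *out, int nvalues)
{
    __shared__ float a_s[BLOCKSIZE];
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    a_s[threadIdx.x] = (idx < nvalues) ? in[idx] : 0.f;
    __syncthreads();

    for (int n = BLOCKSIZE/2; n > 0; n /= 2) {
        if (threadIdx.x < n) a_s[threadIdx.x] += a_s[threadIdx.x + n];
        __syncthreads();   // uniform: every thread executes every iteration
    }

    if (threadIdx.x == 0) out[blockIdx.x] = a_s[0];
}

// Hypothetical host wrapper: sums nvalues >= 1 host floats on the GPU.
float sum_on_gpu(const float *a, int nvalues)
{
    float *in_d = NULL, *out_d = NULL, result = 0.f;
    int maxblocks = (nvalues + BLOCKSIZE - 1)/BLOCKSIZE;
    check(cudaMalloc((void **) &in_d, nvalues*sizeof(float)));
    check(cudaMalloc((void **) &out_d, maxblocks*sizeof(float)));
    check(cudaMemcpy(in_d, a, nvalues*sizeof(float), cudaMemcpyHostToDevice));

    int n = nvalues;
    do {
        int nblocks = (n + BLOCKSIZE - 1)/BLOCKSIZE;
        sumreductionkernel<<<nblocks, BLOCKSIZE>>>(in_d, out_d, n);
        check(cudaGetLastError());
        // The partial sums become the input of the next pass.
        float *tmp = in_d; in_d = out_d; out_d = tmp;
        n = nblocks;
    } while (n > 1);

    check(cudaMemcpy(&result, in_d, sizeof(float), cudaMemcpyDeviceToHost));
    check(cudaFree(in_d));
    check(cudaFree(out_d));
    return result;
}

int main(void)
{
    const int nvalues = 100000;
    float *a = (float *) malloc(nvalues*sizeof(float));
    for (int n = 0; n < nvalues; n++) a[n] = 1.f;

    float sum = sum_on_gpu(a, nvalues);
    assert(sum == (float) nvalues);   // small integer sums are exact in float
    printf("sum of a = %f\n", sum);
    free(a);
    return 0;
}
```

With 100000 values the loop runs three passes (391 blocks, then 2, then 1), compared with the seventeen kernel launches the simple halving approach would need for a similar size.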
Processor-Memory Gap
[Figure from: Computer Architecture: A Quantitative Approach by Hennessy and Patterson]
Thread Warps
  - Thread blocks are made up of groups of threads called warps
      - Warp size on all current hardware is 32, but could change on future hardware (can be queried through device properties)
  - A warp is executed in lock step SIMD on a multiprocessor
      - Hardware automatically handles divergence due to branching
  - Note that you are free to specify an arbitrary number of threads per block, but the hardware can only work in increments of warps
      - The number of threads is internally rounded up to a multiple of the warp size and the extras are masked out in terms of memory accesses
  - Trivia: "warp" is a term which comes from weaving. Warps are threads woven in parallel.
Warps and Half Warps
[Figure: a thread block is made up of Warp 0, Warp 1, Warp 2, ..., Warp n; each warp is made up of two half warps]
Compute Capability
  - Compute capability is a versioning scheme for keeping track of multiprocessor capabilities / features
  - Compute capability 1.0
      - Tesla architecture, first CUDA capable multiprocessor
  - Compute capability 1.1
      - Adds atomic operations for global memory, etc.
  - Compute capability 1.2
      - Tesla 2 architecture
      - Doubles the number of registers from 1.0 and 1.1
      - Adds atomic operations for shared memory, etc.
  - Compute capability 1.3
      - Adds double precision floating point, etc.
  - Compute capability 2.x
      - Fermi architecture
  - Compute capability 3.x
      - Kepler architecture
  - The compute capability of a GPU can be queried at runtime
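Querying the compute capability (and the warp size) at runtime can be done with cudaGetDeviceProperties from the runtime API. The sketch below prints a few fields of the cudaDeviceProp structure; the function name warp_size_of and the helper check are hypothetical, and device 0 is assumed to exist.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// Print selected properties of the given device and return its warp size.
int warp_size_of(int device)
{
    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, device));

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("  multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("  warp size:               %d\n", prop.warpSize);
    printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("  global memory:           %zu bytes\n", prop.totalGlobalMem);
    return prop.warpSize;
}

int main(void)
{
    int ndevices = 0;
    check(cudaGetDeviceCount(&ndevices));
    assert(ndevices > 0);

    for (int device = 0; device < ndevices; device++)
        warp_size_of(device);
    return 0;
}
```

Code that depends on the warp size can read it this way instead of hard-coding 32.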
Memory Coalescing
  - Coalescing is the process of combining global memory accesses (loads or stores) across threads within a warp or half warp into one or more transactions
  - How coalescing is performed depends on the compute capability
      - 1.0 and 1.1 have the same coalescing characteristics
      - 1.2 and 1.3 have the same coalescing characteristics
          - 1.0 and 1.1 are subsets of 1.2 and 1.3
      - 2.x and 3.x add L1 and L2 caches
  - Global memory is divided into segments of size 32, 64, 128, and 256 bytes
  - Pointers from cudaMalloc are always at least 256 byte aligned
Global Memory Segments
Coalescing on Compute Capability 1.2 and 1.3
  - Global memory accesses by a half warp are combined to minimize the total number of transactions
      - Eliminates the dependence on the order in which threads access data that exists in compute capability 1.0 and 1.1
  - In addition, transaction sizes are automatically reduced to avoid wasted bandwidth
      - Recursively reduces the transaction size if only the upper or lower half of the segment is needed
      - See the programming guide for more details of the algorithm
  - Minimum transaction size of 32 bytes and per thread word sizes of 32, 64, and 128 bits
  - Note that the coalescing for compute capability 1.2 and 1.3 is a superset of the requirements for compute capability 1.0 and 1.1
      - Code that is efficient on 1.0 and 1.1 will continue to be efficient on 1.2 and 1.3, but not necessarily vice versa
Examples of Memory Transactions
[Figures on two slides]
Coalescing and Caching on Compute Capability 2.x+
  - Each multiprocessor has 64 KB of SRAM for shared memory and L1 cache
      - Can choose the split between shared memory and L1 on a per kernel basis
  - The GPU as a whole has an L2 cache
  - Memory accesses are coalesced across the full warp of 32 threads
  - The cache line size is 128 bytes, so any misses in L1 will result in one or more 128 byte transactions from L2 to L1
  - If the L1 cache is bypassed, either through a compilation flag or an inline assembly instruction, the requests are served from L2 using 32 byte transactions
  - There is no way to bypass the L2 cache
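The per kernel split between shared memory and L1 can be requested with cudaFuncSetCacheConfig from the runtime API. The following sketch asks for a larger L1 for a kernel that uses no shared memory; the kernel scale and the helpers scale_on_gpu and check are hypothetical names. The setting is a preference, and the runtime may choose a different configuration if the kernel requires it.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// Uses no shared memory, so a larger L1 cache is the natural preference.
__global__ void scale(float *a, float s, int n)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n) a[idx] *= s;
}

// Hypothetical host wrapper: multiplies n host floats by s on the GPU.
void scale_on_gpu(float *a, float s, int n)
{
    // Prefer L1 over shared memory for this kernel. Other values include
    // cudaFuncCachePreferShared and cudaFuncCachePreferNone.
    check(cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1));

    float *a_d = NULL;
    check(cudaMalloc((void **) &a_d, n*sizeof(float)));
    check(cudaMemcpy(a_d, a, n*sizeof(float), cudaMemcpyHostToDevice));
    scale<<<(n + 255)/256, 256>>>(a_d, s, n);
    check(cudaGetLastError());
    check(cudaMemcpy(a, a_d, n*sizeof(float), cudaMemcpyDeviceToHost));
    check(cudaFree(a_d));
}

int main(void)
{
    float a[1000];
    for (int n = 0; n < 1000; n++) a[n] = (float) n;

    scale_on_gpu(a, 2.f, 1000);

    for (int n = 0; n < 1000; n++) assert(a[n] == 2.f*n);
    printf("a[999] = %f\n", a[999]);
    return 0;
}
```

A kernel that stages data in shared memory, such as a block-level reduction, would more likely request cudaFuncCachePreferShared.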
Big Picture on Global Memory Access
  - On a CPU, you want spatial locality of data down a thread
  - On a GPU, you want spatial locality of data across threads
      - True for both NVIDIA and AMD GPUs
[Figure: memory access patterns over time on a CPU and on a GPU]
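The difference between locality down a thread and locality across threads can be made concrete with two kernels over the same row-major matrix. In this sketch (all names are hypothetical), sum_rows gives each thread its own row, so neighboring threads touch addresses a full row apart, while sum_cols gives each thread its own column, so neighboring threads touch adjacent addresses that can be coalesced. Timings depend on the hardware, so none are quoted here.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message on any CUDA error.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
        exit(EXIT_FAILURE);
    }
}

// CPU-style locality: each thread walks along its own row, so at each
// step neighboring threads read addresses cols floats apart.
__global__ void sum_rows(const float *m, float *out, int rows, int cols)
{
    int r = blockIdx.x*blockDim.x + threadIdx.x;
    if (r < rows) {
        float s = 0.f;
        for (int c = 0; c < cols; c++) s += m[r*cols + c];
        out[r] = s;
    }
}

// GPU-style locality: each thread walks down its own column, so at each
// step neighboring threads read adjacent addresses.
__global__ void sum_cols(const float *m, float *out, int rows, int cols)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x;
    if (c < cols) {
        float s = 0.f;
        for (int r = 0; r < rows; r++) s += m[r*cols + c];
        out[c] = s;
    }
}

// Runs one kernel on an n x n matrix of ones. Returns the elapsed time in
// milliseconds and stores the first row or column sum in *first.
float timed_sum(int by_column, int n, float *first)
{
    size_t bytes = (size_t) n*n*sizeof(float);
    float *m = (float *) malloc(bytes);
    for (size_t k = 0; k < (size_t) n*n; k++) m[k] = 1.f;

    float *m_d = NULL, *out_d = NULL, ms = 0.f;
    check(cudaMalloc((void **) &m_d, bytes));
    check(cudaMalloc((void **) &out_d, n*sizeof(float)));
    check(cudaMemcpy(m_d, m, bytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    check(cudaEventCreate(&start));
    check(cudaEventCreate(&stop));
    int blocksize = 256, nblocks = (n + blocksize - 1)/blocksize;

    check(cudaEventRecord(start, 0));
    if (by_column) sum_cols<<<nblocks, blocksize>>>(m_d, out_d, n, n);
    else           sum_rows<<<nblocks, blocksize>>>(m_d, out_d, n, n);
    check(cudaEventRecord(stop, 0));
    check(cudaEventSynchronize(stop));
    check(cudaGetLastError());
    check(cudaEventElapsedTime(&ms, start, stop));

    check(cudaMemcpy(first, out_d, sizeof(float), cudaMemcpyDeviceToHost));
    check(cudaEventDestroy(start));
    check(cudaEventDestroy(stop));
    check(cudaFree(m_d));
    check(cudaFree(out_d));
    free(m);
    return ms;
}

int main(void)
{
    const int n = 2048;
    float first = 0.f;
    timed_sum(1, n, &first);   // warm-up run, timing discarded

    float ms_rows = timed_sum(0, n, &first);
    assert(first == (float) n);
    float ms_cols = timed_sum(1, n, &first);
    assert(first == (float) n);

    printf("thread per row    (strided across threads):  %f ms\n", ms_rows);
    printf("thread per column (adjacent across threads): %f ms\n", ms_cols);
    return 0;
}
```

On a CPU the preference would be reversed: a single thread walking along a row makes good use of its cache lines, which is the contrast the slide draws.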