GPU Computing
K. Cooper
Department of Mathematics, Washington State University
2014
Review of Parallel Paradigms - MIMD Computing
Multiple Instruction, Multiple Data
- Several separate program streams, each executing possibly different sets of instructions.
- Each instruction stream operates on different data; each stream may have access to only a fragment of the data.
Review of Parallel Paradigms - SIMD Computing
Single Instruction, Multiple Data
- Only one program stream, though it may launch multiple threads.
- The instruction stream may be applied simultaneously to many different data elements.
Review of Parallel Paradigms - Advantages of MIMD
Advantages
- Instructions can be wildly different for individual streams.
- Instructions can be separated, even to different nodes.
- Memory is distributed: it becomes limited only by the number of nodes.
- Nodes can be unsophisticated, hence cheap.
Disadvantages
- Communication.
Review of Parallel Paradigms - Advantages of SIMD
Disadvantages
- All computations must happen on a single machine: limited memory and processors.
- Hardware must be very complex, therefore expensive.
Advantages
- All computations happen on a single machine: fast.
Review of Parallel Paradigms - SIMT
Historically, SIMD computing involved vastly complex CPUs with many ALUs and complicated switch architectures. This is, in some sense, a description of a modern video card. Ever since SGI, video cards have had small specialized processors designed for the arithmetic involved in 3-d projections.
Single Instruction - Multiple Thread: we start one program; that program can launch many threads to perform small tasks in parallel on a Graphics Processing Unit (GPU).
CUDA Computing - NVidia
- The company that really drives this is NVidia, which makes video cards for 3-d games.
- It provides an interface (API) for programmers to send instructions to the card: CUDA - Compute Unified Device Architecture.
- AMD/ATI is playing too, but uses a different API.
CUDA Computing - Model
1. Start one program.
2. Write function(s) to handle the core of the computation in parallel: the kernel.
3. Allocate memory in RAM and also on the video card.
4. Copy data from the CPU to the video card.
5. Run the kernel on the card.
6. Copy data back from the card to the CPU.
CUDA Computing Example - Assignment 1
Here is some code to parallelize.

h = 1.0/(double)n;
for(i=0;i<n;i++){
    x = i*h;
    u[i] = (sin(x+h)-sin(x))/h;
    err[i] = u[i]-cos(x);
}
CUDA Computing Example - Kernel
Write the kernel.

__global__ void fwd_diff(double *u, double *err, int n){
    int i = blockIdx.x*blockDim.x + threadIdx.x;   /* global index of this thread */
    if(i < n){                /* guard: n need not be a multiple of the block size */
        double h = 1.0/(double)n;
        double x = i*h;
        u[i] = (sin(x+h)-sin(x))/h;
        err[i] = u[i]-cos(x);
    }
}
CUDA Computing Example - Allocate
Allocate memory.

size_t size = n*sizeof(double);
double *u = (double *)malloc(size);
double *err = (double *)malloc(size);
double *d_u;
prob = cudaMalloc((void **)&d_u, size);
double *d_err;
prob = cudaMalloc((void **)&d_err, size);
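The CUDA runtime calls return a status code, stored here in prob. The slides do not show a check, but a minimal sketch might look like this, assuming prob is declared as cudaError_t:

cudaError_t prob = cudaMalloc((void **)&d_u, size);
if(prob != cudaSuccess){
    fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(prob));  /* human-readable error */
    exit(1);
}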
CUDA Computing Example - Copy to Card
Copy data to the video card.

prob = cudaMemcpy(d_u, u, size, cudaMemcpyHostToDevice);
prob = cudaMemcpy(d_err, err, size, cudaMemcpyHostToDevice);
CUDA Computing Example - Run the Kernel
Run the kernel. Note the peculiar syntax.

fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u, d_err, n);

Note that we pass the pointers to the device memory. Blocks and threads have to do with the device architecture.
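The slides do not show how blocksPerGrid and threadsPerBlock are set. One common sizing, as an assumed sketch (the value 256 is a choice, not from the slides):

int threadsPerBlock = 256;   /* a multiple of the 32-thread warp size */
int blocksPerGrid = (n + threadsPerBlock - 1)/threadsPerBlock;   /* round up so every i < n is covered */
fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u, d_err, n);

Rounding the block count up is why the kernel needs its i < n guard: the last block may contain threads past the end of the data.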
CUDA Computing Example - Copy Back to CPU
Copy the results back to the CPU.

prob = cudaMemcpy(u, d_u, size, cudaMemcpyDeviceToHost);
prob = cudaMemcpy(err, d_err, size, cudaMemcpyDeviceToHost);
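The walkthrough ends here; a complete program would also release the memory it allocated. A sketch, not from the slides:

cudaFree(d_u);    /* release device memory */
cudaFree(d_err);
free(u);          /* release host memory */
free(err);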
CUDA Computing - Comparison
[Figure omitted: performance comparison.]
Grids, Blocks, and Threads - SPs and SMs
- Each device is organized as a collection of streaming multiprocessors (SMs).
- Each SM is composed of a collection of streaming processors (SPs) and an L1 cache.
- Each SP has access to some number of registers.
Grids, Blocks, and Threads - Blocks
- You must write your code so that it recognizes a block structure.
- Each block is loaded onto a single SM. The number of blocks does not have to match the number of SMs; still, some recognition of the number of SMs can help efficiency.
- Blocks comprise a collection of threads. Each thread is one little program fragment.
- Threads from a given block are executed in warps (32 threads). Thus, the block size should probably be a multiple of 32.
Grids, Blocks, and Threads - Threads
Spawning threads:

fwd_diff<<<blocksPerGrid,threadsPerBlock>>>(d_u, d_err, n);

Code for one thread - the kernel:

__global__ void fwd_diff(double *u, double *err, int n){
    int i = blockIdx.x*blockDim.x + threadIdx.x;   /* global index of this thread */
    if(i < n){                /* guard: n need not be a multiple of the block size */
        double h = 1.0/(double)n;
        double x = i*h;
        u[i] = (sin(x+h)-sin(x))/h;
        err[i] = u[i]-cos(x);
    }
}
Grids, Blocks, and Threads - Memory
- The main memory for a device is called global memory. It is comparable in speed to the L2 cache on a CPU.
- Each block has access to its L1 cache, called shared memory, which is much faster than global memory.
- Each thread has access to some number of registers, which are much faster than shared memory.
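As an illustration (not from the slides), a kernel can stage data in shared memory using the __shared__ qualifier; here each block copies its chunk of u into fast on-SM memory before working on it. The kernel name scale and the fixed 256-element buffer are assumptions:

__global__ void scale(double *u, int n){
    __shared__ double s[256];              /* one block's chunk; assumes blockDim.x <= 256 */
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) s[threadIdx.x] = u[i];       /* stage from slow global memory */
    __syncthreads();                       /* wait for the whole block to finish loading */
    if(i < n) u[i] = 2.0*s[threadIdx.x];   /* work out of shared memory */
}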
Grids, Blocks, and Threads - Cards
Here are a few numbers from our cards.

                     GT 610        GTX Titan
CUDA Cores           48            2688
SMs                  1             14
Total Memory         2 GB          6 GB
Memory Bus           64-bit        384-bit
Shared Mem/Block     49152 bytes   49152 bytes
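Numbers like these can also be read at run time with cudaGetDeviceProperties. A sketch, assuming we query device 0:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                            /* properties of device 0 */
printf("SMs:              %d\n",  prop.multiProcessorCount);
printf("Total memory:     %zu\n", prop.totalGlobalMem);       /* bytes */
printf("Shared mem/block: %zu\n", prop.sharedMemPerBlock);    /* bytes */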
Performance - Block Size
[Figure omitted: performance versus block size.]
Performance - Comparison to MPI
Excerpt from an email message regarding a system with 12000 particles, integrated over a period of a second or so:
- scalar program: 6 days
- scalar program with parallel fudge on ethernet cluster: 1 day
- parallel program on ethernet cluster: 44 minutes
- parallel program on infiniband cluster: 20 minutes
- CUDA: 19 minutes
Performance - Conclusions
GPU computing has limitations:
- Memory
- Scalability
- Reduction is still an issue
- Difficulty in programming
Still, when our little $2K machine can compete with a $1M cluster on a real problem... this is important.