Cartoon parallel architectures: CPUs and GPUs. CSE 6230, Fall 2014 (Thu Sep 11). Thanks to Jee Choi (a senior PhD student) for a big assist.
Rough CPU analogies for the GPU:
  GPU ~ socket
  SMX ~ core
  threads ~ HWMT + SIMD (i.e., SIMT)
Intel E5-2687W vs. NVIDIA K20X (Sandy Bridge-EP vs. Kepler):
  peak single-precision: ~500 GF/s (CPU) vs. ~4 TF/s (GPU)
  memory bandwidth: ~50 GB/s (CPU) vs. ~250 GB/s (GPU)
  link between host and device: ~6 GB/s
System Comparison: Intel Xeon E5-2687W vs. NVIDIA K20X

  Metric                  E5-2687W                             K20X                       Ratio
  # Cores / SMXs          8 cores                              14 SMXs                    1.75
  Clock frequency (max)   3.8 GHz                              735 MHz                    0.20
  SIMD width              256 bits                             --                         --
  Thread processors       --                                   2688 SP + 896 DP           --
  Perf. (single)          8 cores x 3.8 GHz x (8 Add + 8 Mul)  2688 x 735 MHz x 2 (FMA)   8.12
  Perf. (double)          8 cores x 3.8 GHz x (4 Add + 4 Mul)  896 x 735 MHz x 2 (FMA)    5.42
  Memory bandwidth        51.2 GB/s                            250 GB/s                   4.88
  TDP                     150 W                                235 W                      1.57
CUDA is NVIDIA's implementation of this execution model.
Thread hierarchy: single instruction, multiple threads (SIMT).
An example to compare models

Naïve (sequential):
    for (i = 0; i < n; i++) A[i] += 2;

OpenMP:
    #pragma omp parallel for
    for (i = 0; i < n; i++) A[i] += 2;

CUDA, with N threads:
    int i = f(global ID);
    A[i] += 2;
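The CUDA fragment can be fleshed out into a complete kernel; a minimal sketch (the kernel name, block size, and launch line are illustrative, not from the slides):

```cuda
__global__ void addTwo(int* A, int n) {
    // each thread computes its own global ID and updates one element
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)        // guard: n need not be a multiple of the block size
        A[i] += 2;
}

// launch enough 256-thread blocks to cover all n elements:
// addTwo<<<(n + 255) / 256, 256>>>(d_A, n);
```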
Global IDs
With 4 blocks of 4 threads each:
  blockIdx.x:   0 0 0 0 | 1 1 1 1 | 2 2 2  2 | 3  3  3  3
  threadIdx.x:  0 1 2 3 | 0 1 2 3 | 0 1 2  3 | 0  1  2  3
  global ID:    0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15
(global ID = threadIdx.x + blockDim.x * blockIdx.x)
Thread hierarchy
Given a 3-D grid of blocks:
  there are (gridDim.x * gridDim.y * gridDim.z) blocks in the grid
  each block's position is identified by blockIdx.x, blockIdx.y, and blockIdx.z
Similarly for a 3-D block:
  blockDim.x, blockDim.y, blockDim.z give the block's dimensions
  threadIdx.x, threadIdx.y, threadIdx.z identify each thread within its block
Thread-to-data mapping depends on how the work is divided amongst the threads.
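One common convention for flattening a 3-D launch into a unique linear thread ID (a sketch; other orderings are equally valid):

```cuda
__global__ void whoAmI(int* ids) {
    // flatten the 3-D block position into a linear block ID
    int blockId = blockIdx.x
                + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;
    // flatten the 3-D thread position within the block
    int threadId = threadIdx.x
                 + threadIdx.y * blockDim.x
                 + threadIdx.z * blockDim.x * blockDim.y;
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int globalId = threadId + blockId * threadsPerBlock;
    ids[globalId] = globalId;   // e.g., use globalId to index data
}
```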
Memory hierarchy
  thread: local variables, local memory (registers)
  block: shared memory
  grid: global memory
  constant memory (read-only)
  texture memory (read-only)
CUDA by example
Basic CUDA code:

    __global__ void test (int* in, int* out, int N) {
        int gid = threadIdx.x + blockDim.x * blockIdx.x;
        out[gid] = in[gid];
    }

    int main (int argc, char** argv) {
        int N = 1048576;
        int tbsize = 256;
        int nblocks = N / tbsize;
        dim3 grid (nblocks);
        dim3 block (tbsize);
        test <<<grid, block>>> (d_in, d_out, N);
        cudaThreadSynchronize ();
    }
CUDA by example
Basic CUDA code, with memory management:

    int main (int argc, char** argv) {
        int N = 1048576;

        /* allocate memory on host and device */
        int *h_in, *h_out, *d_in, *d_out;
        h_in = (int*) malloc (N * sizeof (int));
        h_out = (int*) malloc (N * sizeof (int));
        cudaMalloc ((void**) &d_in, N * sizeof (int));
        cudaMalloc ((void**) &d_out, N * sizeof (int));

        /* copy data from host to device (CPU to GPU) */
        cudaMemcpy (d_in, h_in, N * sizeof (int), cudaMemcpyHostToDevice);

        /* body of the problem here */
        ...

        /* copy data back from device to host (GPU to CPU) */
        cudaMemcpy (h_out, d_out, N * sizeof (int), cudaMemcpyDeviceToHost);

        /* free memory */
        free (h_in); free (h_out);
        cudaFree (d_in); cudaFree (d_out);
    }
CUDA by example
What is this code doing?

    __global__ void mysteryFunction (int* in) {
        int tidx, tidy, gidx, gidy;
        tidx = threadIdx.x; tidy = threadIdx.y;
        gidx = tidx + blockDim.x * blockIdx.x;
        gidy = tidy + blockDim.y * blockIdx.y;

        __shared__ int buffer[16][16];

        buffer[tidx][tidy] = in[gidx + gidy * blockDim.x * gridDim.x];
        __syncthreads ();

        int temp;
        if (tidx > 0 && tidy > 0) {
            temp = (buffer[tidx][tidy - 1] +
                    buffer[tidx][tidy + 1] +
                    buffer[tidx - 1][tidy] +
                    buffer[tidx + 1][tidy] +
                    buffer[tidx][tidy]) / 5;
        } else {
            /* take care of boundary conditions */
        }
        in[gidx + gidy * blockDim.x * gridDim.x] = temp;
    }

Note the use of shared memory: why do we need it here?
Synchronization
Within a block: via __syncthreads ();
Global synchronization:
  there is implicit synchronization between kernel launches
  the only way to synchronize globally is to finish the grid and start another grid
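The global-synchronization pattern looks like this in host code (kernel names are illustrative):

```cuda
// step1 must complete for ALL blocks before step2 may read its results,
// so split the work into two kernel launches; the grid boundary is the
// only device-wide barrier.
step1<<<grid, block>>>(d_data);
step2<<<grid, block>>>(d_data);   // sees all of step1's global-memory writes
cudaDeviceSynchronize();          // host waits for outstanding kernels
```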
Scheduling
Each block gets scheduled on a multiprocessor (SMX):
  there is no guarantee of the order in which blocks get scheduled
  blocks run independently of each other
Multiple blocks can reside on a single SMX simultaneously (occupancy):
  the number of resident blocks is determined by resource usage and availability (shared memory and registers)
Once scheduled, each block runs to completion.
Execution
Minimum unit of execution: the warp, typically 32 threads.
At any given time, multiple warps will be executing; they could be from the same or different blocks.
A warp of threads could be either executing or waiting (for data, or for its turn).
When a warp stalls, it can be switched out instantaneously so that another warp can start executing: hardware multithreading.
Performance Notes: Thread Divergence
On a branch, threads in a warp can diverge:
  execution is serialized; threads taking one branch execute while the others idle
Avoid divergence!
  use bitwise operations when possible
  diverge at the granularity of warps (no penalty)
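The contrast between a divergent branch and a warp-aligned branch can be sketched as follows (kernel name illustrative; warp size of 32 assumed):

```cuda
__global__ void branchExamples(int* out) {
    int tid = threadIdx.x;

    // BAD: even and odd threads within the SAME warp take different
    // paths, so the two paths execute serially (each half idles while
    // the other runs).
    if (tid % 2 == 0) out[tid] = 1;
    else              out[tid] = 2;

    // OK: all 32 threads of a warp land on the same side of the branch
    // (tid / 32 is the warp index), so there is no divergence penalty.
    if (tid / 32 == 0) out[tid] += 10;
    else               out[tid] += 20;
}
```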
Performance Notes: Occupancy
Occupancy = # resident warps / max # warps
  # resident warps is determined by per-thread register and per-block shared memory usage
  max # warps is specific to the hardware generation
More warps means more threads with which to hide latency:
  increases the chance of keeping the GPU busy at all times
  does not necessarily mean better performance
Performance Notes: Bandwidth Utilization
Reading from DRAM occurs at the granularity of 128-byte transactions:
  requests are further decomposed into aligned cache lines
  read-only cache: 128 bytes; L2 cache: 32 bytes
Minimize loading redundant cache lines to maximize bandwidth utilization:
  aligned access to memory
  sequential access patterns
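The difference between a well-coalesced and a poorly coalesced access pattern can be sketched as follows (kernel name illustrative):

```cuda
__global__ void accessPatterns(float* out, const float* in, int n) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    // GOOD: consecutive threads touch consecutive addresses, so a warp's
    // 32 four-byte loads coalesce into a handful of 128-byte transactions.
    if (tid < n) out[tid] = in[tid];

    // BAD (shown for contrast): a stride of 32 floats (128 bytes) puts
    // every thread of a warp on a different cache line, so each line is
    // fetched to deliver only 4 of its 128 bytes.
    // if (tid * 32 < n) out[tid] = in[tid * 32];
}
```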
Backup
GPU Architecture
Performance Notes: Bandwidth Utilization II
Little's Law: L = λW
  L = average number of customers in a store
  λ = arrival rate
  W = average time spent
Applied to memory: bandwidth plays the role of λ and memory latency the role of W, so keeping the memory system busy takes tens of thousands of in-flight requests!
In summary
Use as many cheap threads as possible:
  maximizes occupancy
  increases the number of in-flight memory requests
Avoid divergence:
  if unavoidable, diverge at the warp level
Use aligned and sequential data access patterns:
  minimize redundant data loads
CUDA by example: Quicksort
Let's now consider quicksort on a GPU.
Step 1: partition the initial list.
  how do we partition the list amongst blocks?
  recall that blocks CANNOT cooperate and blocks can run in ANY order
  however, we need MANY threads and blocks in order to see good performance
CUDA by example: Quicksort
Partition a 16-element list across 4 blocks, with 2 threads (thr 0, thr 1) per block:
  4 2 3 5 | 6 1 9 3 | 4 7 6 5 | 9 8 3 1
  block 0 | block 1 | block 2 | block 3
CUDA by example: Quicksort
Each thread counts its elements on either side of the pivot (5); per block, the (thr 0, thr 1) counts are:
  < pivot (5):   2 1 | 0 2 | 1 0 | 1 1    (blocks 0-3)
  >= pivot (5):  0 1 | 2 0 | 1 2 | 1 1
CUDA by example: Quicksort
Do a cumulative sum on the < pivot and >= pivot counts; this should be done in shared memory, in parallel:
  < pivot:   2 3 | 0 2 | 1 1 | 1 2    (blocks 0-3)
  >= pivot:  0 1 | 2 2 | 1 3 | 1 2
This tells us how much space each block needs, and where each thread should store its values.
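The in-block cumulative sum can be sketched as a naive Hillis-Steele inclusive scan in shared memory (the block size and names are assumptions for illustration; a real implementation would use a work-efficient scan):

```cuda
#define TB 256  // threads per block (assumed)

__global__ void inclusiveScan(int* counts) {
    __shared__ int buf[TB];
    int tid = threadIdx.x;
    buf[tid] = counts[blockIdx.x * TB + tid];
    __syncthreads();

    // after the step with offset d, buf[i] holds the sum of inputs
    // i-2d+1 .. i; after log2(TB) steps it holds the inclusive prefix sum
    for (int d = 1; d < TB; d *= 2) {
        int v = (tid >= d) ? buf[tid - d] : 0;  // read before the barrier
        __syncthreads();
        buf[tid] += v;
        __syncthreads();
    }
    counts[blockIdx.x * TB + tid] = buf[tid];
}
```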
CUDA by example: Quicksort
A temporary array holds the partitioned output. Each block reserves its slice using an atomic fetch-and-add (FAA) on shared start and end pointers: < pivot values are packed forward from the start, >= pivot values backward from the end. Block 0, for example, with cumulative counts (< pivot: 2 3; >= pivot: 0 1), writes its < pivot values 4 3 2 at the front and its >= pivot value 5 at the back.
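The space-reservation step can be sketched with atomicAdd; the counter layout and names here are illustrative, not from the slides:

```cuda
__global__ void reserveSlices(int* start, int* end,
                              const int* ltCounts, const int* geCounts,
                              int* ltBases, int* geBases) {
    // one thread per block reserves both slices for the whole block;
    // FAA returns the OLD counter value, so blocks get disjoint slices
    // no matter what order they run in
    if (threadIdx.x == 0) {
        int lt = ltCounts[blockIdx.x];
        int ge = geCounts[blockIdx.x];
        ltBases[blockIdx.x] = atomicAdd(start, lt);       // front grows up
        geBases[blockIdx.x] = atomicAdd(end, -ge) - ge;   // back grows down
    }
    // ...threads then scatter their values into the reserved slices
}
```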
CUDA by example: Quicksort
Phew, that was the first part. This is repeated until there are enough independent partitions to assign one per block. In the next part, each block does something similar, minus the FAA. When the sequences become small enough, you can sort them with an alternative sorting algorithm (e.g., bitonic sort).