CS377P: Programming for Performance
GPU Programming - II
Sreepathi Pai, UTCS
November 11, 2015
Outline
1. GPU Occupancy
2. Divergence
3. Costs
4. Cooperation to reduce costs
5. Scheduling Regular Work
Occupancy Recap
- GPUs partition resources among running threads
- The NVIDIA manual says to maximize occupancy
- Why?
Reasoning about occupancy

    kernel<<<x, y>>>()

Consider launching:
- 1 thread block
- N thread blocks, N equal to the number of SMs/SMX
- N × Residency thread blocks
- more than N × Residency thread blocks
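As a concrete (hypothetical) illustration of these launch configurations — the SM count and residency figures below are assumptions for the sketch, not measured values:

    __global__ void kernel() { /* placeholder work */ }

    int main() {
        // Hypothetical device: 8 SMs, each with room for 2 resident
        // blocks of this kernel (both numbers are assumptions).
        const int NUM_SMS = 8, RESIDENCY = 2;

        kernel<<<1, 256>>>();                        // one block: all but one SM idle
        kernel<<<NUM_SMS, 256>>>();                  // one block per SM
        kernel<<<NUM_SMS * RESIDENCY, 256>>>();      // fills every residency slot
        kernel<<<4 * NUM_SMS * RESIDENCY, 256>>>();  // surplus blocks queue for free slots
        cudaDeviceSynchronize();
        return 0;
    }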
Less Occupancy?
- Is there a case for reducing occupancy/residency?
  - i.e. letting threads consume more resources?
  - smaller thread blocks?
Better Performance at Lower Occupancy
Volkov, V., "Better Performance at Lower Occupancy", GTC 2010
Volkov's Insights
- Do more parallel work per thread to hide latency with fewer threads (i.e. increase ILP)
  - Unroll
- Use more registers per thread so as to access slower shared memory less
  - Shared memory latency is comparable to registers, but shared memory throughput is lower!
- Both may be accomplished by computing multiple outputs per thread
- Note that Volkov underutilizes threads, but maxes out registers!
  - Fermi had 63 registers/thread; Kepler has 255 registers/thread
  - Why have a register limit?
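A minimal sketch of "multiple outputs per thread" — the kernel name and the factor of 4 are illustrative assumptions, and n is assumed to be a multiple of 4 × blockDim.x:

    __global__ void scale4(float *out, const float *in, float a) {
        // Each thread computes 4 outputs instead of 1. The four loads are
        // independent, so they can all be in flight at once (ILP), and the
        // intermediate values live in registers.
        int base = blockIdx.x * blockDim.x * 4 + threadIdx.x;
        float x0 = in[base];
        float x1 = in[base + blockDim.x];
        float x2 = in[base + 2 * blockDim.x];
        float x3 = in[base + 3 * blockDim.x];
        out[base]                  = a * x0;
        out[base + blockDim.x]     = a * x1;
        out[base + 2 * blockDim.x] = a * x2;
        out[base + 3 * blockDim.x] = a * x3;
    }

Launched as scale4<<<n / (4 * 256), 256>>>(out, in, 2.0f), a quarter as many threads cover the same array, with accesses still coalesced because consecutive threads touch consecutive elements in each of the four loads.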
SIMT Issue
- All threads in a warp execute the same instruction (same PC)
- What happens when:
  - that instruction is a conditional branch?
  - it is a load that misses for some threads but not others?
Divergence
- If threads in a warp decide to execute different PCs, the warp splits
  - Two directions for a branch, so two splits
- Each split is executed serially
- Nested branches also split correctly
- Splits join back at a pre-determined meet point
  - The immediate post-dominator
Example

    if (cond) {
        x = 1;
    } else {
        y = 1;
    }

Assume each warp contains four threads, and that only T0 and T2 have cond == true:

    Time  T0     T1     T2     T3
    0     x = 1         x = 1
    1            y = 1         y = 1

If cond is true for all threads:

    Time  T0     T1     T2     T3
    0     x = 1  x = 1  x = 1  x = 1
Tackling Divergence
- Threads in the same warp should avoid divergent conditions
  - Easier said than done
- Threads in the same warp should try to access locations in the same memory line
  - Memory divergence: the memory request is repeated until all threads have received their data
- The compiler will predicate instructions
  - No divergence: both sides are executed
  - Predicated instructions are executed but do not commit (shown as [] below)

    Time  T0       T1       T2       T3
    0     x = 1    [x = 1]  x = 1    [x = 1]
    1     [y = 1]  y = 1    [y = 1]  y = 1
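A toy illustration of warp-uniform versus per-thread conditions — both kernels and the even/odd work split are hypothetical:

    // Divergent: within every warp, half the threads take each side
    // of the branch, so the two sides execute serially.
    __global__ void per_thread_branch(int *a, int *b, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        if (tid % 2 == 0) a[tid] = 1;   // even lanes
        else              b[tid] = 1;   // odd lanes
    }

    // Non-divergent: the condition is uniform across each 32-thread
    // warp, so every warp takes exactly one side of the branch.
    __global__ void per_warp_branch(int *a, int *b, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        if ((tid / 32) % 2 == 0) a[tid] = 1;  // even warps
        else                     b[tid] = 1;  // odd warps
    }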
The Cost of Everything
What is the ordering of operations based on cost for a GPU?
- ALU: integer, FP (+, *, /, %)
- Special Function Unit: trig, log, etc.
- Atomics: to the same address, to different addresses
- Loads/Stores: global memory, shared memory, registers, caches (texture/constant/L1/L2)
- Barriers (__syncthreads()), memory fences
Throughputs
[Table of per-operation throughputs not reproduced here.]
Modeling GPU Performance
Performance equation:

    Time = Operations / Throughput

where Throughput is the rate at which operations complete.

Example: load 144 MByte from memory
- Memory bandwidth: 144 GByte/s
- Time = 144M / 144G = 1 ms
Identifying Bottlenecks
A GPU program:
- reads 144M bytes (144 GB/s)
- performs 144M atomic operations (1/clock, 745 MHz)
- carries out 144M FMADDs (192/clock, 745 MHz)
What is the most likely bottleneck?
- Reading: 144M / 144 GB/s = ? ms
- Atomics: (144M / 1) / 745 MHz = ? ms
- FMADDs: (144M / 192) / 745 MHz = ? ms
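Working these out (plain arithmetic from the rates given above):
- Reading: 144×10^6 B / 144×10^9 B/s = 1 ms
- Atomics: (144×10^6 / 1) cycles / 745×10^6 Hz ≈ 193 ms
- FMADDs: (144×10^6 / 192) cycles / 745×10^6 Hz ≈ 1 ms
The serialized atomics dominate by roughly two orders of magnitude.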
The Scan Primitive
- Fold: reduce a list of values to a single value
  - ([1 2 0 1 3 5], +)
  - Result: 12
- Scan: reduce a list of values and return the intermediate values
  - ([1 2 0 1 3 5], +)
  - Inclusive scan: [1 3 3 4 7 12]
  - Exclusive scan: [0 1 3 3 4 7]
- Also known as: all prefix sums, prefix scan, tree reduction, etc.
Serial implementations of Scan

Exclusive scan:

    result[0] = 0;
    for (int i = 1; i < N; i++)
        result[i] = result[i - 1] + A[i - 1];

Inclusive scan:

    result[0] = A[0];
    for (int i = 1; i < N; i++)
        result[i] = result[i - 1] + A[i];
Parallel Implementation of Scan: Upsweep
(Sengupta et al., "Scan Primitives for GPU Computing"; Harris et al., GPU Gems 3 — diagram not reproduced)

Parallel Implementation of Scan: Downsweep
(Sengupta et al., "Scan Primitives for GPU Computing"; Harris et al., GPU Gems 3 — diagram not reproduced)
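A minimal single-block sketch of the upsweep/downsweep exclusive scan in the style of Harris et al. — it assumes n is a power of two with n ≤ 2 × blockDim.x, and omits refinements such as bank-conflict padding:

    __global__ void block_exclusive_scan(int *data, int n) {
        extern __shared__ int temp[];
        int tid = threadIdx.x;
        int offset = 1;
        // Each of the n/2 threads loads two elements into shared memory.
        temp[2 * tid]     = data[2 * tid];
        temp[2 * tid + 1] = data[2 * tid + 1];
        // Upsweep: build partial sums up the tree.
        for (int d = n >> 1; d > 0; d >>= 1) {
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                temp[bi] += temp[ai];
            }
            offset <<= 1;
        }
        // Clear the root, then downsweep: push prefixes back down.
        if (tid == 0) temp[n - 1] = 0;
        for (int d = 1; d < n; d <<= 1) {
            offset >>= 1;
            __syncthreads();
            if (tid < d) {
                int ai = offset * (2 * tid + 1) - 1;
                int bi = offset * (2 * tid + 2) - 1;
                int t = temp[ai];
                temp[ai] = temp[bi];
                temp[bi] += t;
            }
        }
        __syncthreads();
        temp[2 * tid]     = temp[2 * tid];       // values now hold exclusive prefix sums
        data[2 * tid]     = temp[2 * tid];
        data[2 * tid + 1] = temp[2 * tid + 1];
    }

It would be launched with n/2 threads and n ints of dynamic shared memory: block_exclusive_scan<<<1, n / 2, n * sizeof(int)>>>(d_data, n).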
GPU Scan Implementations
- Very large arrays
  - Store the array in global memory
  - Synchronize using multiple kernel calls
- Arrays that fit in shared memory
  - Store the array in shared memory
  - Synchronize using __syncthreads()
- Arrays smaller than warp size
  - Use warp collective instructions
- Use the NVIDIA CUB library!
Using Scan to reduce the cost of atomics
Assume you have a worklist which is implemented as follows:

    __device__ int worklist[1024];
    __device__ int tail = 0;

    __device__ void push_parallel(int item) {
        int old_tail = atomicAdd(&tail, 1);
        worklist[old_tail] = item;
    }
Client Code: 1 Thread

    for (int i = 0; i < n; i++)
        push_parallel(work[i]);
Client Code Optimized: 1 Thread

    int old_tail = atomicAdd(&tail, n);
    for (int i = 0; i < n; i++)
        worklist[old_tail + i] = work[i];
Client Code: Adding More Threads
T threads, each with n items; n is the same for every thread.

    __shared__ int old_tail;
    if (tid == 0)
        old_tail = atomicAdd(&tail, n * T);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + n * tid + i] = work[i];
Client Code: General Problem
T threads, each with n items; n may be different for each thread.

    __shared__ int old_tail;
    if (tid == 0)
        old_tail = atomicAdd(&tail, ?);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + ? + i] = work[i];
Client Code: Solution
T threads, each with n items; n may be different for each thread.

    __shared__ int old_tail;
    int offset, total;
    ExclusiveSum(n, total, offset);   // cooperative scan across the block
    if (tid == 0)
        old_tail = atomicAdd(&tail, total);
    __syncthreads();
    for (int i = 0; i < n; i++)
        worklist[old_tail + offset + i] = work[i];

            T0  T1  T2  T3  T4
    n        1   0   3   5   1
    offset   0   1   1   4   9
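The ExclusiveSum above maps naturally onto CUB's BlockScan collective. A hedged sketch, assuming a 128-thread block and a hypothetical per-thread count array (consult the CUB documentation for the authoritative API):

    #include <cub/cub.cuh>

    __device__ int worklist[1024];
    __device__ int tail = 0;

    __global__ void push_all(const int *counts) {
        // counts[tid] = how many items this thread pushes (assumed input layout)
        typedef cub::BlockScan<int, 128> BlockScan;
        __shared__ typename BlockScan::TempStorage temp;
        __shared__ int old_tail;

        int n = counts[threadIdx.x];
        int offset, total;
        // offset = exclusive prefix sum of n; total = block-wide aggregate
        BlockScan(temp).ExclusiveSum(n, offset, total);
        if (threadIdx.x == 0)
            old_tail = atomicAdd(&tail, total);   // one atomic for the whole block
        __syncthreads();
        for (int i = 0; i < n; i++)
            worklist[old_tail + offset + i] = threadIdx.x; // stand-in for real items
    }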
Performance
[Performance comparison figure not reproduced here.]
The Many Uses of Scan
- Stream compaction/filtering
  - When you want to filter an array into another array (see the sketch below)
- Radix sort
- Many more...
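A quick illustration of scan-based stream compaction, written serially in C for clarity (the function name and the "keep positives" predicate are assumptions); on a GPU each step would be a data-parallel kernel:

    // Compact the positive elements of in[] into out[]; returns the count.
    int compact(const int *in, int *out, int n) {
        int flags[n], idx[n];                  // scratch arrays (VLAs for brevity)
        for (int i = 0; i < n; i++)            // step 1: predicate -> 0/1 flags
            flags[i] = (in[i] > 0);
        idx[0] = 0;                            // step 2: exclusive scan of flags
        for (int i = 1; i < n; i++)
            idx[i] = idx[i - 1] + flags[i - 1];
        for (int i = 0; i < n; i++)            // step 3: scatter kept elements
            if (flags[i])
                out[idx[i]] = in[i];           // idx[i] is the output position
        return idx[n - 1] + flags[n - 1];
    }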
Scalar Product
Problem: given n pairs of vectors, all w elements wide, compute the scalar products of all the pairs.
- Multiplications: n × w
- Additions: n × w
How shall we distribute the work?
Distribution 1
- Assign a pair to a thread block
  - A thread block executes on a single SM
- What happens if the number of pairs is less than the number of SMs?
Distribution 2
- Divide the vectors into parts
- Assign parts to thread blocks
- All thread blocks handle one pair at a time
- What happens if the width of the vectors is less than the number of SMs?
Input size sensitivity
(Samadi et al., "Adaptive Input-aware Compilation for Graphics Engines", PLDI '12)
Solution
- There is enough work to saturate the GPU:
  - wide pairs in the first case
  - lots of pairs in the second case
  - it is just not distributed evenly
- Write both versions (sketched below)
- Choose between the two versions at runtime depending on input size
- See MonteCarlo in the CUDA SDK (4.2) for an example
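A hedged sketch of the two distributions and the runtime switch — kernel names, the 256-thread blocks, and the dispatch threshold are all assumptions:

    // Distribution 1: one vector pair per thread block.
    // Good when there are many pairs (n >= number of SMs).
    __global__ void dot_per_pair(const float *a, const float *b,
                                 float *out, int w) {
        extern __shared__ float s[];
        int pair = blockIdx.x;
        float sum = 0.0f;
        for (int i = threadIdx.x; i < w; i += blockDim.x)
            sum += a[pair * w + i] * b[pair * w + i];
        s[threadIdx.x] = sum;
        __syncthreads();
        for (int d = blockDim.x / 2; d > 0; d >>= 1) {  // tree reduction
            if (threadIdx.x < d) s[threadIdx.x] += s[threadIdx.x + d];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[pair] = s[0];
    }

    // Distribution 2: several blocks per pair; each block reduces one
    // slice and accumulates atomically. Good when pairs are few but wide.
    __global__ void dot_partitioned(const float *a, const float *b,
                                    float *out, int w, int parts) {
        extern __shared__ float s[];
        int pair = blockIdx.y, part = blockIdx.x;
        int per = (w + parts - 1) / parts;
        int lo = part * per, hi = min(lo + per, w);
        float sum = 0.0f;
        for (int i = lo + threadIdx.x; i < hi; i += blockDim.x)
            sum += a[pair * w + i] * b[pair * w + i];
        s[threadIdx.x] = sum;
        __syncthreads();
        for (int d = blockDim.x / 2; d > 0; d >>= 1) {
            if (threadIdx.x < d) s[threadIdx.x] += s[threadIdx.x + d];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(&out[pair], s[0]);
    }

    void scalar_products(const float *a, const float *b, float *out,
                         int n, int w, int num_sms) {
        if (n >= num_sms) {
            dot_per_pair<<<n, 256, 256 * sizeof(float)>>>(a, b, out, w);
        } else {
            cudaMemset(out, 0, n * sizeof(float));   // partial sums accumulate
            int parts = (num_sms + n - 1) / n;       // enough blocks to fill SMs
            dim3 grid(parts, n);
            dot_partitioned<<<grid, 256, 256 * sizeof(float)>>>(a, b, out, w, parts);
        }
    }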
Conclusion
- GPU architecture has different tradeoffs:
  - occupancy
  - divergence
- GPU costs are different from CPU costs
- GPU programs can take advantage of several parallel programming primitives
  - scan, in particular
  - novel ways to reduce costs using such collective operations
- GPU utilization may require multiple schedules
  - we did not cover dynamic scheduling