Power-efficient Computing for Compute-intensive GPGPU Applications
Nam Sung Kim, w/ Syed Zohaib Gilani* and Michael J. Schulte*
University of Wisconsin-Madison; Advanced Micro Devices*


modern GPU architectures
- deeply pipelined for efficient resource sharing, w/ several buffering and collision-avoidance stages
- long read-after-write (RAW) latency hidden w/ extensive multi-threading
- no power-hungry data forwarding network (DFN)
modern GPGPU applications
- insufficient thread-level parallelism (TLP) to hide the long RAW latency (18-24 cycles): 40% of total stalls

overview of proposed techniques: 23% (SP/INT) and 33% (DP) higher performance
- most-recent result forwarding (MoRF): store the 3 most recent exe. results per thread, covering 80% of RAW stalls
- high-throughput FMA unit (HFMA): predict the exponent, reducing effective FP op latency to 1 cycle
- dual-VDD pipeline: reduce the VDD of the FPUs while keeping the frequency the same; the negative impact of the increased pipeline stages is countered w/ MoRF and HFMA, allowing more cores under the power constraint

outline
- motivation: insufficient TLP to hide long RAW latency
- proposed GPU architectural techniques: MoRF, HFMA, dual-VDD pipeline
- evaluation
- summary

baseline SM (Fermi-like arch.)
- 32-wide SIMT w/ 32K registers; up to 1536 threads (8 CTAs) per SM
- 24-cycle RAW latency (longer for SFU ops)
- hiding the RAW latency requires 24 active warps: 32 threads/warp x 24 warps = 768 active threads
[SM pipeline diagram: front-end -> register address unit -> bank arbitration unit -> RF banks+caches -> dispatcher -> operand collectors -> FP/INT exe. units -> result queue -> writeback; 24 cycles]
SM: streaming multiprocessor; CTA: cooperative thread array; SIMT: single-instruction, multiple-thread

occupancy-limiting factors (a minimal allocation sketch follows below)
- application characteristics: how is an application divided into kernels? how much TLP exists in different kernels? avoiding non-coalesced memory accesses & resource contention; synchronization and data sharing b/w threads
- register file size (10% of GPU power): GPUs allocate registers statically during work distribution; threads are grouped in units of CTAs; the # of registers per thread is determined by the compiler/driver; a CTA can be issued only if there are enough registers for all its threads
- shared memory size (8% of GPU power): the shared/constant memory requirement per CTA is determined by the compiler/driver; the SM must have enough shared memory to accommodate a complete CTA
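to make the allocation rule concrete, here is a minimal C++ sketch (not from the talk) of how static per-CTA resource requirements bound the number of resident CTAs and hence occupancy; the 32K registers and 1536-thread limit come from the slides, while the 48 KB shared memory, the 8-CTA cap, and the example kernel in main() are illustrative assumptions:

```cpp
// Minimal sketch of static CTA allocation on a Fermi-like SM (illustrative only).
#include <algorithm>
#include <cstdio>

struct SM {
    int regs        = 32 * 1024;   // 32K registers per SM (from the slides)
    int smem_bytes  = 48 * 1024;   // assumed shared memory per SM
    int max_threads = 1536;        // max resident threads per SM (from the slides)
    int max_ctas    = 8;           // assumed hardware CTA limit per SM
};

struct Kernel {
    int threads_per_cta;
    int regs_per_thread;
    int smem_per_cta;              // bytes
};

// A CTA is issued only if the SM still has enough registers, shared memory,
// and thread slots for the *whole* CTA.
int max_resident_ctas(const SM& sm, const Kernel& k) {
    int by_regs    = sm.regs / (k.regs_per_thread * k.threads_per_cta);
    int by_smem    = k.smem_per_cta ? sm.smem_bytes / k.smem_per_cta : sm.max_ctas;
    int by_threads = sm.max_threads / k.threads_per_cta;
    return std::min({by_regs, by_smem, by_threads, sm.max_ctas});
}

int main() {
    SM sm;
    Kernel k{256, 40, 0};          // made-up kernel: 256 threads/CTA, 40 regs/thread
    int ctas = max_resident_ctas(sm, k);
    printf("%d CTAs -> %d threads -> %.0f%% occupancy\n",
           ctas, ctas * k.threads_per_cta,
           100.0 * ctas * k.threads_per_cta / sm.max_threads);
}
```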

example low-occupancy kernels: performance constrained by long RAW latencies (worked numbers below)
- DCT: 8 CTAs w/ 64 threads/CTA = 512 active threads, 33% occupancy (limited by application partitioning, thread synchronization, data dependencies)
- NDL: 16 KB/CTA w/ 44 threads/CTA = 132 active threads, 9% occupancy (limited by shared memory per CTA)
- CFD: 47 registers/thread w/ 192 threads/CTA = 576 of 1536 threads, 37% occupancy (limited by registers per thread)
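as a worked check of the slide's numbers using the sketch above: CFD at 47 registers/thread and 192 threads/CTA needs 47 x 192 = 9024 registers per CTA, so a 32K-register SM fits only 3 CTAs = 576 of 1536 threads, i.e. the quoted ~37% occupancy; NDL at 16 KB of shared memory per CTA fits 3 CTAs (assuming 48 KB of shared memory per SM) of 44 threads each = 132 threads, ~9%.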

across applications ranging from 8% to 50% occupancy, RAW stalls due to low occupancy account for 40% of total stalled cycles; an ideal DFN would give an 18% IPC improvement

ideal DFN
- forwarding requirement: from 3 post-exe stages to 14 pre-exe stages
- high-complexity wiring: 86K wires + multiplexers (3 post-exe stages x 14 pre-exe stages x 32 lanes per SM x 32 bits per lane x 2 datapaths for INT/FP)
- high overhead of an ideal DFN: 16% of GPU power
[pipeline diagram: RF banks+caches -> dispatcher -> operand collectors (14 pre-exe stages) -> FP/INT exe. units -> result queue -> writeback (3 post-exe stages)]
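a quick arithmetic check of the 86K figure, using the per-term breakdown quoted on the slide (a throwaway C++ snippet, not part of the talk):

```cpp
#include <cstdio>

int main() {
    // Per-term breakdown taken from the slide.
    const int post_exe_stages = 3, pre_exe_stages = 14;
    const int lanes_per_sm = 32, bits_per_lane = 32, datapaths = 2;   // INT + FP
    const int wires = post_exe_stages * pre_exe_stages * lanes_per_sm
                    * bits_per_lane * datapaths;
    printf("%d forwarding wires\n", wires);   // prints 86016, i.e. the "86K wires"
}
```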

key observation: most dependences are short
  mul.wide.u16 $r3, %ctaid.y, 0x8;
  cvt.u32.u16  $r2, $r0.lo;
  mov.u32      $r0, s[0x0018];
  add.u32      $r6, $r3, $r2;
  mul.wide.u16 $r4, $r0.lo, $r6.hi;
- the snippet exhibits dependence distances of one, two, and three
- 80% of dynamic instructions have a dependence distance of 3 or less
- so: store the 3 most recent exe. results per thread in a forwarding buffer (FB)
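for illustration only, a hypothetical C++ sketch of how RAW dependence distance could be measured over an instruction trace; the mapping of the PTX registers above to small integers is mine:

```cpp
// Dependence distance = how many instructions back the producer of a source is.
#include <unordered_map>
#include <vector>
#include <cstdio>

struct Instr { int dst; std::vector<int> srcs; };

int main() {
    // The PTX snippet from the slide: $r3=3, $r2=2, $r0=0, $r6=6, $r4=4;
    // non-register sources are omitted.
    std::vector<Instr> trace = {
        {3, {}}, {2, {0}}, {0, {}}, {6, {3, 2}}, {4, {0, 6}},
    };
    std::unordered_map<int, int> last_writer;   // reg -> index of producing instr
    for (int i = 0; i < (int)trace.size(); ++i) {
        for (int s : trace[i].srcs)
            if (last_writer.count(s))
                printf("instr %d reads r%d at distance %d\n", i, s, i - last_writer[s]);
        last_writer[trace[i].dst] = i;
    }
}
```

run on this snippet it reports distances 3, 2, 2, and 1 for the register reads whose producers are in the window, consistent with the "one, two, three" annotation above.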

most-recent result forwarding (MoRF): forwarding buffer (FB)
- a small FB sits between the operand collectors and the FP/INT exe. units, organized as 3 banks (FB[0], FB[1], FB[2]) fed from the result queue through rd/wr ports
- each bank holds one 32-bit entry per warp (warp[0] ... warp[47]), indexed by warp id
- example (the PTX snippet above): as each instruction completes, its result is captured in the issuing warp's FB entries - $r3 into FB[0], $r2 into FB[1], $r0 into FB[2]
- when add.u32 $r6, $r3, $r2 issues, its source register ids are dynamically modified by the scoreboarding logic to read FB[0] and FB[1] instead of the register file; its result $r6 in turn overwrites the oldest entry (FB[0], which held $r3)
- dependent instructions within distance 3 thus receive their operands from the FB, bypassing the long RAW path through the register file
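a minimal functional model of the FB (a C++ sketch under my own assumptions: 48 warps, 3 entries per warp, and the 32-wide SIMD lanes collapsed to a single value per entry for brevity); this is illustrative, not the paper's implementation:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

constexpr int WARPS_PER_SM = 48;
constexpr int FB_ENTRIES   = 3;

struct ForwardingBuffer {
    struct Entry { int reg_id = -1; uint32_t value = 0; };
    // fb[w][i]: entry i of warp w; head[w]: next slot to overwrite (the oldest).
    std::array<std::array<Entry, FB_ENTRIES>, WARPS_PER_SM> fb{};
    std::array<int, WARPS_PER_SM> head{};

    // Writeback path: capture a just-computed result, evicting the oldest entry.
    void capture(int warp, int dst_reg, uint32_t value) {
        fb[warp][head[warp]] = {dst_reg, value};
        head[warp] = (head[warp] + 1) % FB_ENTRIES;
    }

    // Scoreboard path: if a source register was produced by one of the last 3
    // instructions of this warp, read it from the FB instead of the register
    // file (the "src. reg. id dynamically modified" step, modeled functionally).
    std::optional<uint32_t> forward(int warp, int src_reg) const {
        for (const Entry& e : fb[warp])
            if (e.reg_id == src_reg) return e.value;
        return std::nullopt;   // miss: fall back to the register file read
    }
};

int main() {
    ForwardingBuffer fb;
    const int warp = 2;                       // the warp shown in the slide example
    fb.capture(warp, /*$r3*/3, 0xAAAA);
    fb.capture(warp, /*$r2*/2, 0xBBBB);
    fb.capture(warp, /*$r0*/0, 0xCCCC);
    // add.u32 $r6, $r3, $r2 : both sources hit in the FB, no RF read needed.
    bool hit = fb.forward(warp, 3).has_value() && fb.forward(warp, 2).has_value();
    fb.capture(warp, /*$r6*/6, 0xDDDD);       // evicts the oldest entry ($r3)
    printf("operands forwarded: %s\n", hit ? "yes" : "no");
}
```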

high-throughput FMA unit (HFMA): early forwarding
- predict the exponent range of the accumulated result as [exp_blk - 13 : exp_blk + 13]
- store the intermediate (un-normalized) result within that range in a 48-entry accumulate register file (ARF); this requires a 75-bit adder instead of a 48-bit one
- forwarding the intermediate result before normalization and rounding breaks the 7-cycle dependence chain and allows an effective 1-cycle exe. latency for dependent ops
[datapath diagram: exponents of A, B, C (8 bits each) -> exponent comparison; significands of A, B (24 bits each) -> significand multiplier (48-bit product); C or the 75-bit ARF value -> alignment -> 75-bit adder -> accumulate RF, and -> LZC & normalizer -> exponent update -> rounder -> result exponent (8 bits) and result significand (24 bits)]
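a plausible reading of the 75-bit adder width (this derivation is an assumption on my part, not stated on the slide): the 24 x 24-bit significand product is 48 bits wide; if un-normalized accumulator values are only allowed to drift within [exp_blk - 13 : exp_blk + 13], alignment can shift by up to 13 bit positions in either direction, so the adder must cover roughly 48 + 13 + 13 + 1 = 75 bits.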

supporting double precision: SP HFMA w/ DP support (SPwDP)
- a dedicated DP FPU is power-hungry (54-bit significand multiplier)
- instead, reuse the SP HFMA for DP ops by shifting/accumulating partial products: A_HI x B_HI, A_HI x B_LO, A_LO x B_HI, A_LO x B_LO
- requires a 27-bit multiplier instead of a 24-bit one; 4 cycles per DP FMA
- replace 8 out of 16 DP FPUs w/ SPwDP units
pipeline comparison:
  SP HFMA : significand multiplier (24-bit x 24-bit), align (75-bit), add (75-bit), normalize (75-bit), round (24-bit)
  DP HFMA : significand multiplier (54-bit x 54-bit), align (163-bit), add (163-bit), normalize (163-bit), round (53-bit)
  SPwDP   : significand multiplier (27-bit x 27-bit), align (163-bit), add (163-bit), normalize (163-bit), round (53-bit)
power normalized to SP HFMA:
  DP HFMA : dynamic 4.6, leakage 3.7
  SPwDP   : dynamic 1.4, leakage 1.2
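an integer analogy for the SPwDP scheme (not the actual floating-point datapath): a double-width product formed by shifting and accumulating four half-width partial products, the way SPwDP forms the 54-bit significand product on a 27-bit multiplier over 4 cycles; here 16-bit halves of 32-bit operands stand in for the 27-bit halves:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t a = 0x89ABCDEF, b = 0x12345678;
    uint64_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint64_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    // One partial product per "cycle", shifted into place and accumulated,
    // mirroring A_HI*B_HI, A_HI*B_LO, A_LO*B_HI, A_LO*B_LO.
    uint64_t acc = 0;
    acc += a_lo * b_lo;
    acc += (a_lo * b_hi) << 16;
    acc += (a_hi * b_lo) << 16;
    acc += (a_hi * b_hi) << 32;

    printf("matches built-in multiply: %s\n",
           acc == (uint64_t)a * b ? "yes" : "no");
}
```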

SP/INT benchmarks (groups include memory-intensive and INT-only kernels): MoRF speedup 14%, MoRF+HFMA speedup 18%

DP benchmarks (the longer DP FP latency is replaced by the shorter HFMA latency; some kernels are memory-intensive): MoRF speedup 15%, MoRF+HFMA speedup 30%

dual-VDD pipeline
- FPU power consumption is 35% of total GPU power under TDP conditions!
- lower VDD for the exe. units, doubling exe. latency (8 cycles -> 16 cycles) to maintain the same frequency
- only a voltage-domain crossing, w/ no frequency-domain crossing
- MoRF + HFMA counter the negative impact of the increased exe. latency
- the lower GPU power allows increasing the # of SMs (i.e., higher perf.) for the same power budget
[pipeline diagram: RF banks+caches -> dispatcher -> operand collectors -> FP/INT exe. units (lower VDD) -> result queue -> writeback]
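a back-of-the-envelope illustration of why lowering VDD on the exe. units pays off, using the standard CMOS dynamic-power relation P_dyn ~ C * V^2 * f; the 1.00 V and 0.80 V supplies are assumed for illustration (not the paper's values) and leakage is ignored:

```cpp
#include <cstdio>

int main() {
    // Assumed supply voltages for illustration only.
    const double v_nom = 1.00, v_low = 0.80;
    const double exe_share = 0.35;              // FPUs ~35% of GPU power (from the slide)

    // At the same frequency, dynamic power scales with (V_low / V_nom)^2.
    double scale = (v_low / v_nom) * (v_low / v_nom);        // = 0.64
    printf("exe-unit dynamic power scales to %.0f%% of nominal\n", 100 * scale);
    printf("rough upper bound on total GPU power saved: ~%.0f%%\n",
           100 * exe_share * (1.0 - scale));                 // ~13%, ignoring leakage
}
```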

power summary
  DFN                : GPU peak power overhead 16%
  MoRF               : GPU peak power overhead 0.9%
  MoRF+HFMA          : SP FMA power increase[1] 20%, DP FMA power decrease[2] 50%, GPU peak power reduction 6.4%
  MoRF+HFMA+dual-VDD : exe. units power reduction[3] 35%, GPU peak power reduction 14%
[1] to support DP FMA ops. [2] for half as many DP FMA units. [3] including additional pipeline overhead.
- MoRF+HFMA: slightly higher SP FMA power due to additional logic and SRAM
- dual-VDD design: reduced SP and DP power consumption

SP/INT benchmarks: 23% speedup w/ MoRF+HFMA+dual-VDD+more SMs (5% additional improvement over MoRF+HFMA)

DP benchmarks: 33% speedup w/ MoRF+HFMA+dual-VDD+more SMs (3% additional improvement over MoRF+HFMA)

summary: 23% (SP/INT) and 33% (DP) higher performance
- insufficient TLP to hide the long RAW latency: 40% of total stalls, leading to an 18% perf. loss
- MoRF: stores the 3 most recent exe. results per thread
- HFMA: reduces effective FP op latency to 1 cycle
- dual-VDD pipeline: reduces the VDD of the FPUs while keeping the frequency the same, allowing more cores under the power constraint

Questions?