Power-efficient Computing for Compute-intensive GPGPU Applications

1 Nam Sung Kim, w/ Syed Zohaib Gilani* and Michael J. Schulte*. University of Wisconsin-Madison; *Advanced Micro Devices
2 Modern GPU architectures are deeply pipelined for efficient resource sharing, with several buffering and collision-avoidance stages. The long read-after-write (RAW) latency is hidden with extensive multi-threading, so there is no power-hungry data forwarding network (DFN). Modern GPGPU applications, however, often have insufficient thread-level parallelism (TLP) to hide the long RAW latency (18-24 cycles): RAW stalls account for 40% of total stalls.
3-4 Contributions (23% SP/INT and 33% DP higher performance):
- most-recent result forwarding (MoRF): stores the 3 most recent execution results per thread, covering 80% of RAW stalls
- high-throughput FMA unit (HFMA): predicts the exponent, reducing effective FP op latency to 1 cycle
- dual-VDD pipeline: reduces the VDD of the FPUs while keeping frequency the same; MoRF and HFMA counter the negative impact of the increased pipeline stages, allowing more cores under the power constraint
5 Outline:
- motivation: insufficient TLP to hide long RAW latency
- proposed GPU architectural techniques: MoRF, HFMA, dual-VDD pipeline
- evaluation
- summary
6 SM (Fermi-like architecture): 32-wide SIMT with 32K registers; up to 1536 threads (8 CTAs) per SM; 24-cycle RAW latency, longer for SFU ops. Hiding the RAW latency requires 24 active warps: 32 threads/warp x 24 warps = 768 active threads. Pipeline: front-end, register address unit, bank arbitration unit, RF banks+caches, dispatcher, operand collectors, FP/INT execution units, result queue, writeback. (SM: streaming multiprocessor; CTA: cooperative thread array; SIMT: single-instruction, multiple-thread)
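Slide 6's warp-count requirement can be checked with simple arithmetic. This is a back-of-the-envelope sketch, not the paper's model: it assumes one instruction issues per warp per cycle, so one resident warp per latency cycle keeps the scheduler busy.

```python
# With a 24-cycle RAW latency and one issue slot per cycle, 24 warps
# must be resident so a ready warp is always available while the
# others wait on their results.
RAW_LATENCY_CYCLES = 24
THREADS_PER_WARP = 32

warps_needed = RAW_LATENCY_CYCLES          # one issue slot per cycle
threads_needed = warps_needed * THREADS_PER_WARP

print(warps_needed, threads_needed)        # 24 warps -> 768 threads
```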
7-9 Application characteristics. The register file is 10% of GPU power, and shared memory is 8%. How is an application divided into kernels, and how much TLP exists in different kernels? Kernels are partitioned to avoid non-coalesced memory accesses and resource contention, and to allow synchronization and data sharing between threads. GPUs allocate registers statically during work distribution: threads are grouped in units of CTAs, the number of registers per thread is determined by the compiler/driver, and a CTA can be issued only if there are enough registers for all its threads. Likewise, the shared/constant memory requirement per CTA is determined by the compiler/driver, and an SM must have enough memory to accommodate a complete CTA.
10-12 Performance of low-occupancy applications is constrained by long RAW latencies. Examples:
- DCT: application partitioning (512 threads), thread synchronization, and data dependencies limit it to 8 CTAs with 64 threads/CTA, yielding low occupancy
- NDL: 16 KB of shared memory per CTA with 44 threads/CTA limits it to 132 resident threads, 9% occupancy
- CFD: 47 registers per thread with 192 threads/CTA yields 37% occupancy
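The occupancy limits on slides 10-12 follow from the per-SM resource budgets on slide 6. The sketch below computes the binding limiter; the 48 KB shared-memory capacity is an assumption based on Fermi, not stated on the slides.

```python
# Fermi-like SM budgets from slide 6 (48 KB shared memory assumed).
SM = {"regs": 32 * 1024, "smem": 48 * 1024, "max_threads": 1536, "max_ctas": 8}

def occupancy(threads_per_cta, regs_per_thread, smem_per_cta):
    """Fraction of the SM's thread slots filled by resident CTAs."""
    limits = [SM["max_ctas"], SM["max_threads"] // threads_per_cta]
    if regs_per_thread:
        limits.append(SM["regs"] // (regs_per_thread * threads_per_cta))
    if smem_per_cta:
        limits.append(SM["smem"] // smem_per_cta)
    resident_ctas = min(limits)            # the scarcest resource wins
    return resident_ctas * threads_per_cta / SM["max_threads"]

# CFD from slide 12: 47 registers/thread, 192 threads/CTA
cfd = occupancy(192, 47, 0)
print(f"{int(cfd * 100)}%")   # registers allow only 3 CTAs -> 37%
```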
13 (chart of per-application stall breakdown, occupancies ranging from 8% to 50%) RAW stalls due to low occupancy account for 40% of total stalled cycles; an ideal DFN would give an 18% IPC improvement.
14-15 Forwarding requirement: results in 3 post-execution stages must be forwarded to 14 pre-execution stages. With 32 lanes per SM, 32 bits per lane, and 2 datapaths (INT and FP), an ideal DFN needs about 86K wires plus multiplexers: very high wiring complexity, and a high overhead of 16% of GPU power. (Pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units, result queue, writeback.)
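The "86K wires" figure is consistent with simply multiplying the slide's numbers together. How the slide counts wires is an assumption on my part (every post-execution stage forwarding to every pre-execution stage), but the product matches:

```python
# One consistent reading of slides 14-15: each of the 3 post-execution
# stages forwards to each of the 14 pre-execution stages, across 32
# lanes of 32 bits each, duplicated for the INT and FP datapaths.
post_exe_stages = 3
pre_exe_stages = 14
lanes = 32
bits_per_lane = 32
datapaths = 2   # INT and FP

wires = post_exe_stages * pre_exe_stages * lanes * bits_per_lane * datapaths
print(wires)    # 86016, i.e. ~86K
```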
16 Dependence-distance example:
mul.wide.u16 $r3, %ctaid.y, 0x8;
cvt.u32.u16 $r2, $r0.lo;
mov.u32 $r0, s[0x0018];
add.u32 $r6, $r3, $r2;
mul.wide.u16 $r4, $r0.lo, $r6.hi;
The source operands here have dependence distances of one, two, and three. 80% of dynamic instructions read a result produced within the last three instructions, so MoRF stores the 3 most recent execution results per thread in a forwarding buffer (FB).
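The dependence-distance statistic on slide 16 could be gathered from an instruction trace as sketched below. The trace encoding (a destination plus a list of sources per instruction) is invented for illustration; the five instructions are the slide's example.

```python
# For each source register, the dependence distance is how many
# instructions back its producer executed in the trace.
from collections import Counter

# (dest, [sources]) for the slide's five instructions
trace = [
    ("$r3", ["%ctaid.y"]),       # mul.wide.u16
    ("$r2", ["$r0"]),            # cvt.u32.u16
    ("$r0", []),                 # mov.u32 (reads a constant bank)
    ("$r6", ["$r3", "$r2"]),     # add.u32
    ("$r4", ["$r0", "$r6"]),     # mul.wide.u16
]

last_writer = {}
distances = Counter()
for i, (dest, srcs) in enumerate(trace):
    for s in srcs:
        if s in last_writer:                 # producer inside the trace
            distances[i - last_writer[s]] += 1
    last_writer[dest] = i

print(sorted(distances.items()))   # [(1, 1), (2, 2), (3, 1)]
```

All observed distances fall within three, which is exactly what three FB entries per thread can cover.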
17-22 Most-recent result forwarding (MoRF). The forwarding buffer (FB) sits beside the operand collectors, between the dispatch port and the FP/INT units, and has three banks, FB[0], FB[1], and FB[2], with one 32-bit entry per warp (warp[0] .. warp[47]) in each bank, indexed by warp id. As the example warp executes, each writeback also deposits its result into the warp's FB entries in turn: $r3, then $r2, then $r0. When add.u32 $r6, $r3, $r2 issues, the scoreboarding logic dynamically modifies its source register ids to read $r3 and $r2 from FB[0] and FB[1] instead of the register file; the new result $r6 then overwrites the oldest entry. The FB thus always holds the three most recent execution results per thread.
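The FB behavior on slides 17-22 can be sketched behaviorally. This is a deliberate simplification of the hardware: one warp, register ids only, no values, no banking, and eviction modeled as a bounded deque.

```python
# Each warp keeps its 3 most recent destination registers; a source
# found there is read from the FB instead of the register file.
from collections import deque

fb = deque(maxlen=3)          # the three FB entries for one warp, newest first

def issue(dest, srcs):
    """Return which sources hit the FB, then write back dest."""
    hits = [s for s in srcs if s in fb]   # scoreboarding redirects these reads
    fb.appendleft(dest)                   # newest result in; oldest evicted
    return hits

issue("$r3", ["%ctaid.y"])
issue("$r2", ["$r0"])
issue("$r0", [])
add_hits = issue("$r6", ["$r3", "$r2"])   # both sources forwarded from the FB
mul_hits = issue("$r4", ["$r0", "$r6"])   # both sources forwarded from the FB
print(add_hits, mul_hits)
```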
23-24 HFMA: early forwarding via exponent prediction. Predict that the result exponent lies in the range [exp_blk - 13 : exp_blk + 13]; this requires a 75-bit adder instead of a 48-bit one. The intermediate result, together with its [exp_blk - 13 : exp_blk + 13] exponent range, is stored in a 48-entry accumulate register file (ARF). Datapath: the 8-bit exponents of A, B, and C feed an exponent comparison; the 24-bit significands go through the significand multiplier and alignment into the 75-bit adder, then LZC & normalizer, exponent update, and rounder. Forwarding the unrounded intermediate result from the ARF breaks the 7-cycle dependence chain and allows a 1-cycle effective execution latency.
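The 75-bit adder width on slide 23 is consistent with the following arithmetic. The derivation is my reading of the slide's numbers, not the paper's exact reasoning:

```python
# A 24x24-bit significand product is 48 bits wide; accepting any
# accumulator whose exponent lies in [exp_blk-13 : exp_blk+13] means
# the addend may be shifted up to 13 positions either way, so the
# adder must span 48 + 13 + 13 + 1 (carry) = 75 bits.
product_bits = 24 * 2        # 48-bit significand product
exponent_slack = 13          # predicted range is +/-13 around exp_blk

adder_bits = product_bits + 2 * exponent_slack + 1
print(adder_bits)            # 75
```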
25-26 A dedicated DP FPU is power-hungry: it needs a 54-bit x 54-bit significand multiplier, 163-bit align, add, and normalize stages, and a 53-bit rounder. Instead, reuse the SP HFMA for DP ops by shifting and accumulating partial products: each 54-bit significand is split into high and low halves, and the four partial products (A_HI x B_HI, A_HI x B_LO, A_LO x B_HI, A_LO x B_LO) are accumulated over 4 cycles per DP FMA. This requires a 27-bit multiplier instead of the SP HFMA's 24-bit one. The resulting SPwDP unit (27-bit x 27-bit multiplier, 163-bit align/add/normalize, 53-bit round) has far lower dynamic and leakage power than a dedicated DP HFMA (power normalized to the SP HFMA). Replacing 8 out of 16 DP FPUs with SPwDP units saves power.
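The shift/accumulate scheme on slides 25-26 can be verified with integer arithmetic. This sketch shows why a 27-bit multiplier suffices: four 27x27 partial products, shifted into place, reconstruct the full 54x54 product. The operand values are arbitrary test inputs.

```python
# Build a 54x54-bit significand product from four 27x27-bit partial
# products, one per cycle, as the SPwDP unit does.
HALF = 27
MASK = (1 << HALF) - 1

def dp_multiply_27bit(a, b):
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    acc = 0
    # one partial product per cycle, shifted into place and accumulated
    for pp, shift in [(a_hi * b_hi, 2 * HALF),
                      (a_hi * b_lo, HALF),
                      (a_lo * b_hi, HALF),
                      (a_lo * b_lo, 0)]:
        acc += pp << shift
    return acc

a, b = (1 << 53) | 0x1234567, (1 << 53) | 0x89ABCDE   # 54-bit significands
print(dp_multiply_27bit(a, b) == a * b)               # True
```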
27 SP/INT benchmark results (memory-intensive and INT-only benchmarks marked in the chart): MoRF speedup 14%; MoRF+HFMA speedup 18%.
28 DP benchmark results (memory-intensive benchmarks marked in the chart): the longer DP FP latency is replaced by the shorter HFMA latency. MoRF speedup 15%; MoRF+HFMA speedup 30%.
29 Baseline pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units (8 cycles), result queue, writeback. FPU power consumption is 35% of total GPU power under TDP conditions!
30-32 Dual-VDD pipeline: lower the VDD of the execution units and double the execution latency (8 to 16 cycles) to maintain the same frequency; this needs only a voltage-domain crossing, with no frequency-domain crossing. MoRF and HFMA counter the negative impact of the increased execution latency. The lower GPU power then allows increasing the number of SMs (i.e., higher performance) for the same power budget. (Pipeline: RF banks+caches, dispatcher, operand collectors, FP/INT execution units at lower VDD, result queue, writeback.)
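A rough sanity check of the dual-VDD savings: dynamic power scales roughly as V^2 at fixed frequency. The 35% execution-unit fraction comes from slide 29; the specific voltage levels below are my assumption, not stated on the slides.

```python
# Dynamic power ~ C * V^2 * f; only the execution units drop to the
# lower voltage domain, so only their 35% share of GPU power scales.
exe_fraction = 0.35          # slide 29: FPUs are 35% of GPU power
v_nominal, v_low = 1.0, 0.8  # assumed voltages for illustration

exe_scale = (v_low / v_nominal) ** 2     # 0.64x execution-unit power
gpu_saving = exe_fraction * (1 - exe_scale)

print(f"{gpu_saving:.3f}")   # 0.126, i.e. ~12.6% of GPU power, in the
                             # ballpark of slide 33's 14% peak reduction
```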
33 Power overhead/savings summary:
- ideal DFN: 16% GPU peak power overhead
- MoRF: 0.9% GPU peak power overhead
- MoRF+HFMA: 20% SP FMA power increase (to support DP FMA operations), 50% DP FMA power decrease (for half as many DP FMA units), 6.4% GPU peak power reduction
- MoRF+HFMA+dual-VDD: 35% execution-unit power reduction (including additional pipeline overhead), 14% GPU peak power reduction
MoRF+HFMA draws slightly higher power due to the additional logic and SRAM; the dual-VDD design reduces both SP and DP power consumption.
34 SP/INT benchmarks: 23% speedup with MoRF+HFMA+dual-VDD and more SMs, a 5% additional improvement over MoRF+HFMA.
35 DP benchmarks: 33% speedup with MoRF+HFMA+dual-VDD and more SMs, a 3% additional improvement over MoRF+HFMA.
36-37 Summary: insufficient TLP to hide the long RAW latency causes 40% of total stalls, leading to an 18% performance loss. MoRF stores the 3 most recent execution results per thread; HFMA reduces effective FP op latency to 1 cycle; the dual-VDD pipeline reduces the VDD of the FPUs while keeping frequency the same, allowing more cores under the power constraint. Overall: 23% (SP/INT) and 33% (DP) higher performance.
38 Questions?
Power-efficient Computing for Compute-intensive GPGPU Applications. Syed Zohaib Gilani, Nam Sung Kim, Michael J. Schulte. The University of Wisconsin-Madison, WI, U.S.A.; Advanced Micro Devices, TX, U.S.A.
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationCORF: Coalescing Operand Register File for GPUs
CORF: Coalescing Operand Register File for GPUs Hodjat Asghari Esfeden University of California, Riverside Riverside, CA hodjat.asghari@email.ucr.edu Farzad Khorasani Tesla, Inc. Palo Alto, CA fkhorasani@tesla.com
More informationMIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14
MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK
More informationPipelined Processors. Ideal Pipelining. Example: FP Multiplier. 55:132/22C:160 Spring Jon Kuhl 1
55:3/C:60 Spring 00 Pipelined Design Motivation: Increase processor throughput with modest increase in hardware. Bandwidth or Throughput = Performance Pipelined Processors Chapter Bandwidth (BW) = no.
More information1/26/09. Administrative. L4: Hardware Execution Model and Overview. Recall Execution Model. Outline. First assignment out, due Friday at 5PM
Administrative L4: Hardware Execution Model and Overview January 26, 2009 First assignment out, due Friday at 5PM Any questions? New mailing list: cs6963-discussion@list.eng.utah.edu Please use for all
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More information3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?
CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationece4750-t11-ooo-execution-notes.txt ========================================================================== ece4750-l12-ooo-execution-notes.txt ==========================================================================
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationGPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 GPU Performance vs. Thread-Level Parallelism: Scalability Analysis
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationUnderstanding Outstanding Memory Request Handling Resources in GPGPUs
Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationVertex Shader Design II
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationCO Computer Architecture and Programming Languages CAPL. Lecture 18 & 19
CO2-3224 Computer Architecture and Programming Languages CAPL Lecture 8 & 9 Dr. Kinga Lipskoch Fall 27 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be
More informationAppendix C: Pipelining: Basic and Intermediate Concepts
Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More information