Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding
1 Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern California
2 GPUs Have Long Memory Latency
[Diagram: GPU cores (SMs), each with I-cache/fetch/decode, warp scheduler, register file, ALUs, scratchpad, and L1 cache, connected through the on-chip interconnect to memory partitions (L2 cache, memory controller, off-chip interface) and off-chip DRAM]
3 GPUs Have Long Memory Latency
[Same diagram, annotated with the DRAM access latency in cycles]
4 Long Latency Stalls Warp Execution
Instruction stream of a warp (issued in-order):
0x80 ld.global $r0, [$r1]   ; long-latency memory instruction
0x88 add $r2, $r2, #1
0x90 mul $r6, $r0, $r6
0x98 set.lt $r4, $r2, $r5
0xA0 br $r4, 0x80
5 Long Latency Stalls Warp Execution
Stall point: with in-order issue, the warp stalls at the first instruction that consumes the result of the long-latency ld.global (here, mul $r6, $r0, $r6 at 0x90) until the load completes.
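The stall above can be sketched with a toy in-order issue model (the branch is omitted for brevity). The 200-cycle load latency, the instruction tuples, and the run_in_order helper are illustrative assumptions, not figures from the talk.

```python
# Toy model of in-order issue with a scoreboard: an instruction may not
# issue until every source register it reads is ready.
DRAM_LATENCY = 200  # assumed round-trip memory latency in cycles

# (opcode, destination, sources, result latency); branch omitted for brevity
program = [
    ("ld.global", "$r0", ["$r1"], DRAM_LATENCY),
    ("add",       "$r2", ["$r2"], 1),
    ("mul",       "$r6", ["$r0", "$r6"], 1),  # consumes the load result
    ("set.lt",    "$r4", ["$r2", "$r5"], 1),
]

def run_in_order(prog):
    """Return the cycle at which the last instruction has issued."""
    ready_at = {}  # register -> cycle its value becomes available
    cycle = 0
    for op, dst, srcs, lat in prog:
        # In-order issue: stall until all sources are ready.
        cycle = max([cycle] + [ready_at.get(s, 0) for s in srcs])
        ready_at[dst] = cycle + lat
        cycle += 1  # one issue slot per cycle
    return cycle

print(run_in_order(program))  # → 202: mul waits ~200 cycles on $r0
```

Even though set.lt is independent of the load, in-order issue cannot move it ahead of the stalled mul.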
6 Warp Interleaving GPUs interleave multiple warps for latency hiding. For each warp, instructions are issued in-order; warps can be interleaved in any order. [Diagram: per-warp next instruction and scoreboard feeding a warp scheduler and the execution units]
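A minimal round-robin interleaving sketch: every warp runs the same two-instruction stream (a load, then a dependent use), and each cycle the scheduler issues from the first warp whose next instruction is ready. The 8-cycle latency and the simulate helper are illustrative assumptions.

```python
LOAD_LAT = 8  # assumed memory latency in cycles

def simulate(num_warps):
    """Cycles to finish [load, dependent use] in every warp."""
    pc = [0] * num_warps          # per-warp program counter
    load_ready = [0] * num_warps  # cycle the warp's load result arrives
    cycle, done = 0, 0
    while done < num_warps:
        for w in range(num_warps):        # round-robin warp scheduler
            if pc[w] == 0:                # issue the load
                load_ready[w] = cycle + LOAD_LAT
                pc[w] = 1
                break
            if pc[w] == 1 and cycle >= load_ready[w]:
                pc[w] = 2                 # dependent use issues
                done += 1
                break
        cycle += 1                        # one issue slot per cycle
    return cycle

print(simulate(1), simulate(8))  # → 9 16
```

With one warp the core idles for the whole load latency; with eight warps the other warps' loads fill those slots, so the per-warp cost falls from 9 cycles to 2.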
9 Need for Improving Latency Tolerance
[Stacked bar chart: fraction of total cycles spent in long-latency RAW stalls, load/store unit stalls, and other stalls, per benchmark (SG, LI, MC, WP, KM, BF, SV, FD, MU, BT, LP, NW, FW, CF, LK, DW, PF, DC, S1, BP, S2, MT, LB, ST, AE, SF, MQ, SD)]
Long-latency RAW stalls: warps waiting for a response from the memory system (long memory latency)
Load/store unit stalls: no more memory requests can be generated (limited memory throughput)
10 Need for Improving Latency Tolerance
[Same chart, with benchmarks split into latency-sensitive and latency-insensitive groups]
On average, 23% of cycles are stalled on long memory latency
11 More TLP Improves Performance
Better latency tolerance by interleaving more threads
[Chart: relative performance vs. active thread blocks per core (TB2, Baseline, TB8, TB16)]
12 Cost of Increasing Thread Parallelism
Better latency tolerance by interleaving more threads
Need more registers/scratchpad for thread contexts
[Charts: relative performance and on-chip storage per SM (KB, registers + scratchpad) vs. active thread blocks per core]
13 Cost of Increasing Thread Parallelism
Speedup diminishes as the thread count increases: +6.7% and +4.0% performance for 1.5x and 2x the storage
14 Cost of Increasing Thread Parallelism
Can we improve latency hiding without adding more threads?
15 Cost of Increasing Thread Parallelism
Exploit instruction-level parallelism!
16 Where is Instruction-level Parallelism?
[Instruction stream with the warp stalled at the long-latency dependent instruction]
17 Where is Instruction-level Parallelism?
Long-latency independent instructions appear past the branch
18 Where is Instruction-level Parallelism?
Across loop iterations, the stream repeats:
0x80 ld.global $r0, [$r1]
0x88 add $r2, $r2, #1
0x90 mul $r6, $r0, $r6
0x98 set.lt $r4, $r2, $r5
0xA0 br $r4, 0x80
(and again for the next iteration)
The add, set.lt, and br instructions are long-latency independent
19 Where is Instruction-level Parallelism?
Rescheduling is required to exploit this instruction-level parallelism, but hardware schedulers may be too expensive for multiple threads
[Execution order shown with independent instructions hoisted past the stall; producer-consumer correctness between ld and mul ignored for illustration]
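Finding the long-latency-independent instructions amounts to tainting the load's dependence chain through register dataflow, in the spirit of the skip logic described later; the instruction tuples and the helper function are illustrative assumptions.

```python
# (opcode, destination, sources) for one loop body from the slides
program = [
    ("ld.global", "$r0", ["$r1"]),
    ("add",       "$r2", ["$r2"]),
    ("mul",       "$r6", ["$r0", "$r6"]),  # consumes the load result
    ("set.lt",    "$r4", ["$r2", "$r5"]),
]

def independent_of_load(prog, load_dst="$r0"):
    """Names of instructions after the load that do not (transitively)
    depend on its result; the rest form the skipped dependence chain."""
    tainted = {load_dst}
    free = []
    for op, dst, srcs in prog[1:]:
        if tainted & set(srcs):
            tainted.add(dst)   # joins the long-latency chain -> skip
        else:
            free.append(op)    # long-latency independent -> pre-execute
    return free

print(independent_of_load(program))  # → ['add', 'set.lt']
```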
20 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics
21 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics P-mode (Pre-execution mode) At long-latency stall point, Switch warp to P-mode Skip instructions in long-latency dependence chain Execute long-latency independent instructions
22 Solution: Warp Pre-execution Idea: Skip long-latency dependent instructions while maintaining in-order semantics P-mode (Pre-execution mode) At long-latency stall point, Switch warp to P-mode Skip instructions in long-latency dependence chain Execute long-latency independent instructions N-mode (Normal execution mode) Resume warp at stall point after completion of long-latency operation Execute long-latency dependent instructions Key: Keep in-order Fetch-decode-issue
23 Warp Pre-execution Warp mode: Normal (N) Key modifications: Each warp has a mode (Normal or Pre-execution) Scoreboard and SIMT stack per mode P-queue: records pre-executed instructions in program order N-Scoreboard P-Scoreboard P-queue Head N-Stack P-Stack
24 Warp Pre-execution Warp mode: Normal (N) Long-latency Inst. N-Scoreboard $r0 (Long) P-Scoreboard N-Stack 1111 P-Stack P-queue Head
25 Warp Pre-execution Warp mode: Pre-execution (N → P) Long-latency dependent instruction detected: switch from N-mode to P-mode; the N-Scoreboard and N-Stack are copied to the P-Scoreboard and P-Stack N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) P-queue Head N-Stack 1111 P-Stack 1111
26 Warp Pre-execution Skip Warp mode: Pre-execution (P) In P-mode, instructions dependent on the in-flight long-latency operation are skipped N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue Head N-Stack 1111 P-Stack 1111
27 Warp Pre-execution Warp mode: Pre-execution (P) In P-mode, executed instructions are recorded in P-queue N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack 1111
28 Warp Pre-execution Warp mode: Pre-execution (P) Divergent state handled using P-stack P-mode continues beyond branches No branch speculation Stop at dependent branch N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack
29 Warp Pre-execution Prefetch Warp mode: Pre-execution (P) In P-mode, loads are used as accurate prefetch (treated as skipped) N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue set.lt Head N-Stack 1111 P-Stack
30 Warp Pre-execution Warp mode: Pre-execution (P) 2nd instance of mul is also skipped N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue add set.lt Head N-Stack 1111 P-Stack
31 Warp Pre-execution Warp mode: Pre-execution (P) 2nd instance of set.lt pushed to P-queue The P-queue can hold multiple instances of an instruction in the presence of loops N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Skip) $r6 (Skip) P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
32 Warp Pre-execution Warp mode: Normal (P → N) After the 1st ld.global completes, switch to N-mode and continue from the original next instruction N-Scoreboard $r0 (Long) P-Scoreboard $r0 (Long) $r6 (Skip) P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
33 Warp Pre-execution Reuse Warp mode: Normal (N) set.lt was pre-executed: popped from the P-queue head; no re-execution, the result is reused immediately N-Scoreboard P-Scoreboard P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
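The record-and-reuse behavior of this walkthrough can be sketched as a FIFO keyed by instruction identity; the Warp class and its methods are illustrative assumptions, not the hardware interface.

```python
from collections import deque

class Warp:
    def __init__(self):
        self.mode = "N"
        self.pqueue = deque()   # (inst_id, result) in program order

    def pre_execute(self, inst_id, result):
        """In P-mode, record an executed independent instruction."""
        assert self.mode == "P"
        self.pqueue.append((inst_id, result))

    def issue(self, inst_id, compute):
        """In N-mode, reuse a pre-executed result if it is at the head."""
        if self.pqueue and self.pqueue[0][0] == inst_id:
            return self.pqueue.popleft()[1]   # reuse, no re-execution
        return compute()                      # execute normally

w = Warp()
w.mode = "P"                      # long-latency stall: switch to P-mode
w.pre_execute("set.lt", True)     # independent instruction, recorded
w.mode = "N"                      # load returned: resume at stall point
print(w.issue("mul", lambda: 42))     # skipped earlier -> computed now: 42
print(w.issue("set.lt", lambda: 0))   # reused from P-queue head: True
```

Because the P-queue is consumed strictly from the head, reuse preserves the warp's in-order semantics.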
34 Warp Pre-execution Prefetch Hit Warp mode: Normal (N) Prefetches generated for the 2nd ld.global now hit in the cache N-Scoreboard $r0 (Cache hit) P-Scoreboard P-queue set.lt add Head N-Stack P-Stack
35 Warp Pre-execution Reuse Warp mode: Normal (N) The result of the 2nd add is also reused through the P-queue N-Scoreboard P-Scoreboard P-queue set.lt add Head N-Stack P-Stack
36 Problem: WAR and WAW Hazards (N → P)
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $r2, $r4, $r5
sub.$r2 must be written to the register file after mul.$r2 is read: sub can be pre-executed, but its result cannot update $r2 until mul.$r2 is read!
37 Solution: Register Renaming (N → P)
ld.global $r0, [$r1]
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $p0, $r4, $r5 ($r2 renamed to $p0)
set.lt $r4, $p0, $r5 ($r2 renamed to $p0)
Idea: store pre-executed results in rename registers
In P-mode, source and destination registers are renamed: mul.$r2 is not overwritten; sub.$r2 is renamed to $p0, and subsequent reads of $r2 follow the rename to $p0
Values propagate via rename registers in P-mode
38 Solution: Register Renaming (N → P → N)
ld.global $r0, [$r1]
add $r1, $r1, #4
mul $r3, $r2, $r0
sub $p0, $r4, $r5 ($r2 renamed to $p0)
set.lt $r4, $p0, $r5 ($r2 renamed to $p0)
mul $r3, $r2, $r0
sub $r2, $r4, $r5 (result read from $p0, committed to $r2)
In N-mode, renamed registers are committed in order (reuse): the pre-computed sub.$r2 is read from rename register $p0
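The WAR/WAW fix above can be sketched as a small rename map: a P-mode destination gets a fresh rename register, and later P-mode reads follow the map. The register names mirror the slides; the RenameMap class and its allocator are illustrative assumptions.

```python
class RenameMap:
    def __init__(self):
        self.map = {}        # arch register -> rename register (P-mode only)
        self.next_free = 0

    def write(self, arch_reg):
        """Rename a P-mode destination so the N-mode value is not clobbered."""
        preg = f"$p{self.next_free}"
        self.next_free += 1
        self.map[arch_reg] = preg
        return preg

    def read(self, arch_reg):
        """P-mode sources read the renamed copy if one exists."""
        return self.map.get(arch_reg, arch_reg)

rm = RenameMap()
print(rm.write("$r2"))   # sub $r2,... pre-executed -> writes $p0
print(rm.read("$r2"))    # later P-mode read of $r2 follows the map -> $p0
print(rm.read("$r4"))    # unrenamed registers read architecturally -> $r4
```

On the switch back to N-mode, the hardware would commit $p0 to $r2 in program order; this sketch covers only the P-mode mapping.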
39 Warp Pre-execution Reuse Warp mode: Normal (N) P-queue entry format: Inst. ID = set.lt, Arch. Dst ID = $r4, Rename Register ID = $p100 More details in the paper N-Scoreboard P-Scoreboard P-queue set.lt add set.lt Head N-Stack 1111 P-Stack
40 Do We Need More Registers?
[Chart: register file utilization (%) per benchmark, Baseline vs. P-mode]
GPUs underutilize the register file (registers are managed at thread-block granularity)
Unused registers serve as P-mode rename registers
Register utilization increases from 73% to 86%
41 Performance and Storage
Up to +38% for the latency-sensitive top 50% of applications, combined with OptTB (optimized thread block count)
Only 10% storage overhead
[Charts: relative performance (+22%, +22%, +38% annotations) and on-chip storage per SM (KB, including P-mode overhead) for Base, 2xTLP, P-mode, and P-mode+OptTB]
Outperforms 2x more threads, with less storage overhead
42 P-mode Instruction Coverage
[Stacked bar chart per benchmark: fraction of dynamic instructions pre-executed/prefetched vs. skipped]
Pre-execution covers 38% of dynamic instructions: 16% useful instructions (pre-execution/prefetch), 22% skipped
Up to 90% coverage, up to 60% useful pre-execution
Implication: GPU apps may have further ILP to exploit
43 Conclusion Scaling TLP for higher latency tolerance demands high storage cost Our solution: warp pre-execution Exploits instruction-level parallelism Retains in-order fetch-decode-scheduling Challenges: WAW and WAR hazards Solution: register renaming Pre-execution shows promising results: 38% performance improvement (combined with TLP throttling)
44 Thank you for listening! Warped-Preexecution: A GPU Pre-execution Approach for Improving Latency Hiding Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, *Gunjae Koo, Won Woo Ro, *Murali Annavaram Yonsei University *University of Southern California
More informationInstruction-Level Parallelism (ILP)
Instruction Level Parallelism Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance 2 approaches to exploit ILP: 1. Rely on hardware to help discover and exploit
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2018 Static Instruction Scheduling 1 Techniques to reduce stalls CPI = Ideal CPI + Structural stalls per instruction + RAW stalls per instruction + WAR stalls per
More informationCS 2410 Mid term (fall 2018)
CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Multiple Issue: Superscalar and VLIW CS425 - Vassilis Papaefstathiou 1 Example: Dynamic Scheduling in PowerPC 604 and Pentium Pro In-order Issue, Out-of-order
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationLIMITS OF ILP. B649 Parallel Architectures and Programming
LIMITS OF ILP B649 Parallel Architectures and Programming A Perfect Processor Register renaming infinite number of registers hence, avoids all WAW and WAR hazards Branch prediction perfect prediction Jump
More informationAnnouncement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory
ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcement Lab1 is released Start early we only have limited computing
More informationCS 152 Computer Architecture and Engineering
CS 152 Computer Architecture and Engineering Lecture 18 Advanced Processors II 2006-10-31 John Lazzaro (www.cs.berkeley.edu/~lazzaro) Thanks to Krste Asanovic... TAs: Udam Saini and Jue Sun www-inst.eecs.berkeley.edu/~cs152/
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationInstruction-Level Parallelism. Instruction Level Parallelism (ILP)
Instruction-Level Parallelism CS448 1 Pipelining Instruction Level Parallelism (ILP) Limited form of ILP Overlapping instructions, these instructions can be evaluated in parallel (to some degree) Pipeline
More informationGood luck and have fun!
Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationDYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD
DYNAMIC INSTRUCTION SCHEDULING WITH SCOREBOARD Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson,
More informationBackground: Pipelining Basics. Instruction Scheduling. Pipelining Details. Idealized Instruction Data-Path. Last week Register allocation
Instruction Scheduling Last week Register allocation Background: Pipelining Basics Idea Begin executing an instruction before completing the previous one Today Instruction scheduling The problem: Pipelined
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence
More informationOOO Execution and 21264
OOO Execution and 21264 1 Parallelism ET = IC * CPI * CT IC is more or less fixed We have shrunk cycle time as far as we can We have achieved a CPI of 1. Can we get faster? 2 Parallelism ET = IC * CPI
More informationProcessor Architecture
Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationSudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Active thread Idle thread
Intra-Warp Compaction Techniques Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated) Goal Active thread Idle thread Compaction Compact threads in a warp to coalesce (and eliminate)
More informationPerformance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.
Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More information