Lecture 10: Pipelined Implementations: Hazards and Resolutions. Instruction Pipeline Reality


Lecture 10: Pipelined Implementations: Hazards and Resolutions
James C. Hoe, José F. Martínez
Electrical and Computer Engineering, Carnegie Mellon University
February 15, 2010 (S 09 L10-1)

Instruction Pipeline Reality
Identical operations ... NOT!
    unifying instruction types
    - coalescing instruction types into one "multi-function" pipe
    - external fragmentation (some idle stages)
Uniform suboperations ... NOT!
    balance pipeline stages
    - stage quantization to yield balanced stages
    - internal fragmentation (some too-fast stages)
Independent operations ... NOT!
    resolve data and resource hazards
    - duplicate contended resources
    - inter-instruction dependency detection and resolution
MIPS ISA features are engineered for improved pipelineability

Data Dependence
Data dependence, Read-after-Write (RAW):
    r3 ← r1 op r2
    r5 ← r3 op r4
Anti-dependence, Write-after-Read (WAR):
    r3 ← r1 op r2
    r1 ← r4 op r5
Output-dependence, Write-after-Write (WAW):
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
We discuss control-flow dependence in a later lecture

RAW Dependency and Hazard
The following RAW dependencies lead to hazards in the 5-stage pipeline from L10
    addi ra r- -
[Figure: pipeline diagram of the instructions that follow and read ra; not recoverable from the extracted text]
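As a rough illustration of the three dependence types, the sketch below classifies the dependences between an older and a younger instruction from their destination and source registers. `Instr` and `dependences` are hypothetical names introduced here, not part of the lecture, and the check is purely pairwise: it ignores any instructions that sit in between.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]


def dependences(older: Instr, younger: Instr) -> Set[str]:
    """Return the dependence types that order `younger` after `older`."""
    kinds = set()
    if older.dest in younger.srcs:
        kinds.add("RAW")   # younger reads what older writes
    if younger.dest in older.srcs:
        kinds.add("WAR")   # younger overwrites what older reads
    if younger.dest == older.dest:
        kinds.add("WAW")   # both write the same register
    return kinds


# The slide's examples, with r3 <- r1 op r2 as the older instruction.
first = Instr("r3", ("r1", "r2"))
raw_example = dependences(first, Instr("r5", ("r3", "r4")))
war_example = dependences(first, Instr("r1", ("r4", "r5")))
waw_example = dependences(first, Instr("r3", ("r6", "r7")))
```

A single pair can carry more than one dependence type, which is why the function returns a set.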

Necessary Condition
    i: rk ← _     Reg Write in stage Y
    j: _ ← rk     Reg Read in stage X
RAW hazard test:
    dist(i,j) ≤ dist(X,Y)  =>  Hazard!!
    dist(i,j) > dist(X,Y)  =>  Safe

RAW Hazard Analysis Example
           R/I-Type   LW         SW        Br        J     Jr
    IF
    ID     read RF    read RF    read RF   read RF         read RF
    EX
    MEM
    WB     write RF   write RF
Instructions I_A and I_B (where I_A comes before I_B) have a RAW hazard iff
- I_B (R/I, LW, SW, Br or JR) reads a register written by I_A (R/I or LW), and
- dist(I_A, I_B) ≤ dist(ID, WB) = 3
What about WAW and WAR hazards?
What about memory data hazards?
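The necessary condition can be written as a one-line check. The sketch below is an assumption-laden illustration, not the lecture's own formulation: it assumes the classic IF/ID/EX/MEM/WB stage order, that every register read happens in ID and every register write in WB, and that a write must complete in a cycle strictly before the dependent access. `ordering_hazard` and `STAGE_INDEX` are hypothetical names.

```python
# Assumed stage order of the 5-stage pipeline.
STAGE_INDEX = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3, "WB": 4}


def ordering_hazard(dist_ij: int, older_stage: str, younger_stage: str) -> bool:
    """True if the younger instruction j (dist_ij slots behind i) would act in
    `younger_stage` no later than the older instruction i acts in `older_stage`."""
    return dist_ij <= STAGE_INDEX[older_stage] - STAGE_INDEX[younger_stage]


# RAW: older writes in WB, younger reads in ID -> hazard iff dist <= 3.
raw = [ordering_hazard(d, "WB", "ID") for d in (1, 2, 3, 4)]
# WAR: older reads in ID, younger writes in WB -> never a hazard in order.
war = [ordering_hazard(d, "ID", "WB") for d in (1, 2, 3, 4)]
# WAW: both write in WB -> never a hazard in order.
waw = [ordering_hazard(d, "WB", "WB") for d in (1, 2, 3, 4)]
```

Under these assumptions only RAW can produce a register hazard, because all writes happen in the same late stage and in program order.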

1. Pipeline Stall (Interlocking)
[Figure: pipeline timing diagram over cycles t0..t5 for instructions h, i, j, k, l]
    i: rx ← _
    (bubble)  j: _ ← rx    dist(i,j)=1
    (bubble)  j: _ ← rx    dist(i,j)=2
    (bubble)  j: _ ← rx    dist(i,j)=3
              j: _ ← rx    dist(i,j)=4
Stall == make the younger instruction wait until the hazard has passed
1. stop all up-stream stages
2. drain all down-stream stages

Pipeline Stall
           t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10
    IF     i    j    k    k    k    k    l
    ID     h    i    j    j    j    j    k    l
    EX          h    i    bub  bub  bub  j    k    l
    MEM              h    i    bub  bub  bub  j    k    l
    WB                    h    i    bub  bub  bub  j    k    l
    i: rx ← _
    j: _ ← rx
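The stall timing table can be reproduced with a small model. This is a minimal sketch under stated assumptions: a stall-only 5-stage pipeline (no forwarding), the first instruction shown already in ID at cycle 0, and a consumer that may read the register file only once it is more than dist(ID, WB) = 3 slots behind its producer. `stall_timeline` and its arguments are hypothetical names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stall_timeline(names, deps, dist_id_wb=3):
    """names: instruction names in program order.
    deps: {consumer: producer} RAW dependences.
    Returns {stage: {cycle: name or 'bub'}}."""
    table = {s: {} for s in STAGES}
    if_first, id_first, ex = {}, {}, {}
    prev = None
    for name in names:
        if prev is None:
            if_first[name], id_first[name] = -1, 0
        else:
            if_first[name] = id_first[prev]  # IF frees up when prev enters ID
            id_first[name] = ex[prev]        # ID frees up when prev enters EX
        ex[name] = id_first[name] + 1
        if name in deps:
            # Leave ID only once the producer's write-back has completed.
            ex[name] = max(ex[name], ex[deps[name]] + dist_id_wb + 1)
        for t in range(if_first[name], id_first[name]):
            table["IF"][t] = name
        for t in range(id_first[name], ex[name]):
            table["ID"][t] = name
        if prev is not None:
            # Bubbles injected into EX while this instruction waits in ID.
            for t in range(ex[prev] + 1, ex[name]):
                for k, s in enumerate(STAGES[2:]):
                    table[s][t + k] = "bub"
        for k, s in enumerate(STAGES[2:]):
            table[s][ex[name] + k] = name
        prev = name
    return table


if __name__ == "__main__":
    t = stall_timeline(list("hijkl"), {"j": "i"})
    for s in STAGES:
        cells = [t[s].get(c, ".") for c in range(0, 11)]
        print(f"{s:>3}: " + " ".join(f"{c:>3}" for c in cells))
```

The model keeps the upstream stages (IF, ID) frozen while the downstream stages drain, which is the two-step stall recipe listed on the slide.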

Stall
[Figure: pipelined datapath with a "stall" control signal]
Stall => disable PC and IR latching; control should set RegWrite=0 and MemWrite=0
Based on figures from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

Stall Conditions
Instructions I_A and I_B (where I_A comes before I_B) have a RAW hazard iff
- I_B (R/I, LW, SW, Br or JR) reads a register written by I_A (R/I or LW), and
- dist(I_A, I_B) ≤ dist(ID, WB) = 3
In other words, must stall when I_B in the ID stage wants to read a register to be written by I_A in the EX, MEM or WB stage

Stall Condition
Helper functions
- rs(I) returns the rs field of I
- use_rs(I) returns true if I requires RF[rs] and rs!=r0
Stall when
       (rs(IR_ID)==dest_EX)  && use_rs(IR_ID) && RegWrite_EX
    or (rs(IR_ID)==dest_MEM) && use_rs(IR_ID) && RegWrite_MEM
    or (rs(IR_ID)==dest_WB)  && use_rs(IR_ID) && RegWrite_WB
    or (rt(IR_ID)==dest_EX)  && use_rt(IR_ID) && RegWrite_EX
    or (rt(IR_ID)==dest_MEM) && use_rt(IR_ID) && RegWrite_MEM
    or (rt(IR_ID)==dest_WB)  && use_rt(IR_ID) && RegWrite_WB
It is crucial that the EX, MEM and WB stages continue to advance normally during stall cycles

Impact of Stall on Performance
Each stall cycle corresponds to 1 lost ALU cycle
For a program with N instructions and S stall cycles, Average IPC = N/(N+S)
S depends on
- frequency of RAW hazards
- exact distance between the hazard-causing instructions
- distance between hazards: suppose i1, i2 and i3 all depend on i0; once i1's hazard is resolved, i2 and i3 must be okay too
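The stall condition and the IPC formula translate almost directly into code. The sketch below is illustrative only: `Older`, `must_stall` and `average_ipc` are hypothetical names, and the EX/MEM/WB stage labels assume the standard 5-stage pipeline in which the stalled instruction sits in ID.

```python
from typing import NamedTuple


class Older(NamedTuple):
    """Hypothetical view of an older instruction held in EX, MEM or WB."""
    dest: int        # destination register number
    reg_write: bool  # RegWrite control signal for that stage


def must_stall(rs: int, use_rs: bool, rt: int, use_rt: bool,
               ex: Older, mem: Older, wb: Older) -> bool:
    """Stall the instruction in ID if any older in-flight instruction will
    write a register it needs. use_rs/use_rt already exclude r0."""
    for older in (ex, mem, wb):
        if not older.reg_write:
            continue
        if use_rs and rs == older.dest:
            return True
        if use_rt and rt == older.dest:
            return True
    return False


def average_ipc(n_instructions: int, stall_cycles: int) -> float:
    """Average IPC = N / (N + S)."""
    return n_instructions / (n_instructions + stall_cycles)


idle = Older(dest=0, reg_write=False)
stall_on_ex = must_stall(3, True, 0, False, Older(3, True), idle, idle)
no_stall_on_store = must_stall(3, True, 0, False, Older(3, False), idle, idle)
```

For example, `average_ipc(100, 50)` evaluates 100/150, so a program that stalls once for every two instructions runs at two thirds of the ideal rate.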

Sample Assembly [p126, P&H]
for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { ... }

             addi $s1, $s0, -1
    for2tst: slti $t0, $s1, 0         # 3 stalls
             bne  $t0, $zero, exit2   # 3 stalls
             sll  $t1, $s1, 2
             add  $t2, $a0, $t1       # 3 stalls
             lw   $t3, 0($t2)         # 3 stalls
             lw   $t4, 4($t2)
             slt  $t0, $t4, $t3       # 3 stalls
             beq  $t0, $zero, exit2   # 3 stalls
             ...
             addi $s1, $s1, -1
             j    for2tst
    exit2:

2. Code Scheduling
Compiler moves operations in between producer and consumer instructions
Must be semantically invariant (NOPs are always safe)
    addi ra r- -
    <indep. inst.>
    <indep. inst.>
    <indep. inst.>
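The six "3 stalls" annotations on the sample assembly can be checked with a short calculation. This sketch assumes a stall-only pipeline (no forwarding), straight-line execution from the first addi through beq with neither branch taken, and a consumer that must trail its producer by more than dist(ID, WB) = 3 slots. The `(dest, srcs)` encoding and the name `count_stalls` are hypothetical.

```python
def count_stalls(program, dist_id_wb=3):
    """program: list of (dest, srcs) in program order.
    Returns the stall cycles charged to each instruction."""
    read_cycle = []  # cycle in which each instruction finally reads the RF
    stalls = []
    for n, (_, srcs) in enumerate(program):
        earliest = read_cycle[-1] + 1 if read_cycle else 0
        ready = earliest
        for p in range(n):
            pdest = program[p][0]
            # $zero is never a real destination; None means "writes nothing".
            if pdest is not None and pdest != "$zero" and pdest in srcs:
                ready = max(ready, read_cycle[p] + dist_id_wb + 1)
        stalls.append(ready - earliest)
        read_cycle.append(ready)
    return stalls


LOOP = [
    ("$s1", ("$s0",)),          # addi $s1, $s0, -1
    ("$t0", ("$s1",)),          # slti $t0, $s1, 0
    (None,  ("$t0", "$zero")),  # bne  $t0, $zero, exit2
    ("$t1", ("$s1",)),          # sll  $t1, $s1, 2
    ("$t2", ("$a0", "$t1")),    # add  $t2, $a0, $t1
    ("$t3", ("$t2",)),          # lw   $t3, 0($t2)
    ("$t4", ("$t2",)),          # lw   $t4, 4($t2)
    ("$t0", ("$t4", "$t3")),    # slt  $t0, $t4, $t3
    (None,  ("$t0", "$zero")),  # beq  $t0, $zero, exit2
]

per_instruction = count_stalls(LOOP)
total_stalls = sum(per_instruction)
ipc = len(LOOP) / (len(LOOP) + total_stalls)
```

Under these assumptions the nine instructions incur 18 stall cycles, so N/(N+S) gives an average IPC of one third; inserting independent instructions between each producer and consumer is what code scheduling tries to do to recover those cycles.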

3. Data Forwarding
It is intuitive to think of RF as state
- "add rx ry rz" literally means get values from RF[ry] and RF[rz] respectively and put the result in RF[rx]
But, RF is just a part of a computing abstraction
- "add rx ry rz" means
  1. get the results of the last instructions to define the values of RF[ry] and RF[rz], respectively, and
  2. until another instruction redefines RF[rx], younger instructions that refer to RF[rx] should use this instruction's result
What matters is to maintain the correct "dataflow" between operations, thus
    add  ra r- r-
    addi r- ra r-

Resolving RAW Hazard by Forwarding
Instructions I_A and I_B (where I_A comes before I_B) have a RAW hazard iff
- I_B (R/I, LW, SW, Br or JR) reads a register written by I_A (R/I or LW), and
- dist(I_A, I_B) ≤ dist(ID, WB) = 3
In other words, if I_B in the ID stage reads a register written by I_A in the EX, MEM or WB stage, then the operand required by I_B is not yet in RF
- retrieve the operand from the datapath instead of the RF
- retrieve the operand from the youngest definition if multiple definitions are outstanding

Forwarding Paths
[Figure: pipelined datapath with forwarding paths labeled dist(i,j)=1, dist(i,j)=2 and dist(i,j)=3]
[Based on figures from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Assumes RF forwards internally

Forwarding Logic
    if (rs_EX!=0) && (rs_EX==dest_MEM) && RegWrite_MEM then
        forward operand from MEM stage        // dist=1
    else if (rs_EX!=0) && (rs_EX==dest_WB) && RegWrite_WB then
        forward operand from WB stage         // dist=2
    else
        use A_EX (operand from register file) // dist >= 3
Ordering matters!! Must check the youngest match first
Why doesn't use_rs( ) appear in the forwarding logic?
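The priority order in the forwarding logic is easy to get wrong, so a small executable version may help. This is a sketch of the mux-select decision only, assuming the consumer is in EX and the candidate producers are in MEM and WB; `forward_select` and its return labels are hypothetical.

```python
def forward_select(rs_ex: int,
                   dest_mem: int, reg_write_mem: bool,
                   dest_wb: int, reg_write_wb: bool) -> str:
    """Choose the source of the rs operand for the instruction in EX."""
    if rs_ex != 0 and rs_ex == dest_mem and reg_write_mem:
        return "MEM"  # dist = 1: youngest outstanding definition
    if rs_ex != 0 and rs_ex == dest_wb and reg_write_wb:
        return "WB"   # dist = 2
    return "RF"       # dist >= 3: value read from the register file is current


# Both older instructions write r5: the younger one (in MEM) must win.
both_match = forward_select(5, dest_mem=5, reg_write_mem=True,
                            dest_wb=5, reg_write_wb=True)
# r0 is never forwarded, even if an older instruction names it as dest.
zero_reg = forward_select(0, dest_mem=0, reg_write_mem=True,
                          dest_wb=0, reg_write_wb=True)
```

Swapping the two `if` tests would hand the consumer a stale value whenever two in-flight instructions write the same register, which is the point of "must check the youngest match first".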

Load Delay Slot
    LW   ra ---
    addi r- ra r-
    addi r- ra r-
R2000 defined load with an architectural latency of 1 instruction
- the instruction immediately following a load (in the "delay slot") still sees the old register value (this is the behavior if we don't do anything special beyond forwarding)
- ISA feature tailored to the 5-stage pipelined microarchitecture. Warning!! Implementation exposed!!
If loads are defined normally, i.e., atomic
- a dependent immediate successor to LW must stall 1 cycle in ID
    Stall = (rs(IR_ID)==dest_EX) && use_rs(IR_ID) && MemRead_EX

Sample Assembly [p126, P&H]
for (j=i-1; j>=0 && v[j] > v[j+1]; j-=1) { ... }

             addi $s1, $s0, -1
    for2tst: slti $t0, $s1, 0
             bne  $t0, $zero, exit2
             sll  $t1, $s1, 2
             add  $t2, $a0, $t1
             lw   $t3, 0($t2)
             lw   $t4, 4($t2)
             nop
             slt  $t0, $t4, $t3
             beq  $t0, $zero, exit2
             ...
             addi $s1, $s1, -1
             j    for2tst
    exit2:
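With forwarding in place, the only remaining RAW stall is a load followed immediately by a consumer of its result. The sketch below counts such load-use pairs; it assumes every consumer needs its operands in EX and ignores control hazards. The `(op, dest, srcs)` encoding and the name `load_use_stalls` are hypothetical.

```python
def load_use_stalls(program):
    """program: list of (op, dest, srcs). Counts adjacent pairs in which a
    load's destination is read by the very next instruction."""
    stalls = 0
    for older, younger in zip(program, program[1:]):
        op, dest, _ = older
        if op == "lw" and dest in younger[2]:
            stalls += 1
    return stalls


LOOP = [
    ("addi", "$s1", ("$s0",)),
    ("slti", "$t0", ("$s1",)),
    ("bne",  None,  ("$t0", "$zero")),
    ("sll",  "$t1", ("$s1",)),
    ("add",  "$t2", ("$a0", "$t1")),
    ("lw",   "$t3", ("$t2",)),
    ("lw",   "$t4", ("$t2",)),
    ("slt",  "$t0", ("$t4", "$t3")),
    ("beq",  None,  ("$t0", "$zero")),
]

# Same code with the nop from the sample assembly filling the delay slot.
LOOP_WITH_NOP = LOOP[:7] + [("nop", None, ())] + LOOP[7:]

without_nop = load_use_stalls(LOOP)
with_nop = load_use_stalls(LOOP_WITH_NOP)
```

Only the `lw $t4` / `slt` pair is flagged, which matches the single nop in the listing: on a machine with an architectural load delay slot the nop is required for correctness, while on an interlocked machine the hardware would insert the equivalent bubble.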

Terminology
Dependencies: ordering requirements between instructions
Pipeline Hazards: (potential) violations of dependencies
Hazard Resolution:
- static: schedule instructions at compile time to avoid hazards
- dynamic: detect hazards and adjust pipeline operation (Stall, Flush or Forward)
Pipeline Interlock: hardware mechanisms for dynamic hazard resolution
- detect and enforce dependences at run time

Why not very deep pipelines?
The 5-stage pipeline still has plenty of combinational delay between registers
Superpipelining: increase pipelining such that even intrinsic operations (e.g. ALU, RF access, memory access) require multiple stages
What's the problem?
    Inst 0: r1 ← r2 + r3
    Inst 1: r4 ← r1 + 2
[Figure: timing diagram over t0..t5 for Inst 0 and Inst 1, with each of the F, D, E, M, W stages split into sub-stages a and b]

Intel P4's Superpipelined Integer ALU
[Figure: A_lower and B_lower feed a 16-bit add producing S_lower; A_upper and B_upper feed a second 16-bit add producing S_upper]
- 32-bit addition pipelined over 2 stages, BW = 1/latency of a 16-bit add
- No stall between back-to-back dependencies

What if you really can't superpipeline?
[Figure: two copies of a resource with 2T delay; input 0 to output 0, input 1 to output 1]
If you can't double the bandwidth by pipelining, doubling the resource also doubles the bandwidth
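The two-stage addition can be modeled in a few lines. This is a functional sketch only, assuming unsigned 32-bit inputs and a carry passed from the lower 16-bit add to the upper one; it does not model timing, and `staggered_add32` is a hypothetical name rather than anything from Intel documentation.

```python
MASK16 = 0xFFFF


def staggered_add32(a: int, b: int) -> int:
    """Add two unsigned 32-bit values as two 16-bit additions."""
    # Stage 1: lower halves; the result's low 16 bits are final here.
    low = (a & MASK16) + (b & MASK16)
    s_lower, carry = low & MASK16, low >> 16
    # Stage 2: upper halves plus the carry produced by stage 1.
    s_upper = ((a >> 16) + (b >> 16) + carry) & MASK16
    return (s_upper << 16) | s_lower


carry_case = staggered_add32(0x0001FFFF, 0x00000001)
wrap_case = staggered_add32(0xFFFFFFFF, 0x00000001)
```

A dependent add can start its own stage 1 as soon as `s_lower` exists, since its lower half needs only the lower half of the previous result; that overlap is one way to read the slide's "no stall between back-to-back dependencies".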

R Stage Pipeline
[Figure: pipeline timing diagram over cycles t0..t5 for Inst i, Inst i+1, Inst i+2, Inst i+3 and Inst i+4]

Instruction Ordering/Dependencies
Data Dependence
- True dependence or Read after Write (RAW): instruction must wait for all required input operands
- Anti-dependence or Write after Read (WAR): later write must not clobber a still-pending earlier read
- Output dependence or Write after Write (WAW): earlier write must not clobber an already-finished later write
Control Dependence (or Procedural Dependence)
- Conditional branches cause uncertainty in instruction sequencing
- Instructions following a conditional branch depend on the resolution of the branch instruction (more on control in Lec 14)
