Advanced Computer Architecture Pipelining

1 Advanced Computer Architecture Pipelining Dr. Shadrokh Samavi

2 Some slides are from the instructor's resources that accompany the 6th and previous editions of the textbook. Some slides are from David Patterson, David Culler and Krste Asanovic of UC Berkeley; Israel Koren of UMass Amherst; Milos Prvulovic and Sean Lee of Georgia Tech. Sources of some slides are mentioned at the bottom of the page. Please inform me if I am missing a name in the above list.

3 What Is Pipelining?

4 Pipeline: In a computer pipeline, each step completes a part of an instruction.
Pipe stage: one step of the pipeline.
Throughput: how often an instruction exits the pipeline.
Machine cycle (processor cycle): the time required to move an instruction one step down the pipeline.

5 RISC-V instruction set architecture formats
All instructions are 32 bits long.
R: integer register-to-register operations.
I: loads and immediate operations.
B: branches.
J: jump and link.
S: stores.
U: wide immediate instructions (LUI, AUIPC).
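The format list above can be illustrated with a small lookup. This is a minimal sketch, assuming the standard RV32I major opcodes in bits 6..0; the names `OPCODE_FORMAT` and `instruction_format` are made up for illustration and do not come from the slides.

```python
# Map the 7-bit major opcode (bits 6..0) of an RV32I instruction word
# to its encoding format. Only base-ISA opcodes are listed (assumption:
# no compressed or extension opcodes are of interest here).
OPCODE_FORMAT = {
    0b0110011: "R",  # register-register ALU operations (add, sub, ...)
    0b0010011: "I",  # ALU immediate operations (addi, ...)
    0b0000011: "I",  # loads (lw, ...)
    0b1100111: "I",  # jalr
    0b0100011: "S",  # stores (sw, ...)
    0b1100011: "B",  # conditional branches (beq, ...)
    0b1101111: "J",  # jal (jump and link)
    0b0110111: "U",  # lui
    0b0010111: "U",  # auipc
}


def instruction_format(word: int) -> str:
    """Return the format letter for a 32-bit instruction word."""
    opcode = word & 0x7F  # low seven bits hold the major opcode
    return OPCODE_FORMAT.get(opcode, "unknown")


# add x3, x1, x2 encodes as 0x002081B3; addi x1, x0, 10 as 0x00A00093
print(instruction_format(0x002081B3), instruction_format(0x00A00093))
```

Because every format keeps the opcode in the same seven bits, the decode stage can pick the format before it knows anything else about the instruction.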

6 IF ID EXE MEM WB

7 IF

8 ID

9 EXE

10 MEM

11 WB

12 Pipeline Implementation

13 Pipeline Stage: F/F, Combinational Logic, F/F

14 [Figure: INST 1 and INST 2 passing through the IF, ID, EX, MEM and WB stages, which are separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers]

15 Five-stage Pipelined Datapath: Inst. Fetch, Inst. Decode, Exec, Mem, WB

16 Example for lw instruction: Instruction fetch (IF)
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB; the active stage is highlighted. The same datapath is used on slides 17 to 24.]

17 Example for lw instruction: Instruction decode (ID)

18 Example for lw instruction: Execution (EX)

19 Example for lw instruction: Memory (MEM)

20 Example for lw instruction: Write back (WB)

21 Example for sw instruction: Memory (MEM)

22 Example for sw instruction: Write back (WB): do nothing

23 Corrected Datapath (for lw)

24 Pipelining Example
add $4, $5, $6
lw $3, 24($)
add $2, $3, $4
sub $, $2, $3
lw $, 2($)

25 Pipeline Control
[Figure: pipelined datapath annotated with the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite and MemtoReg]

26 Pipeline control
We have 5 stages. What needs to be controlled in each stage?
Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution (4 lines)
» RegDst
» ALUOp[1:0]
» ALUSrc
Memory Stage (3 lines)
» Branch
» MemRead
» MemWrite
Write Back (2 lines)
» MemtoReg
» RegWrite (note that this signal is in ID stage)

27 Pipeline Control
Extend pipeline registers to include control information (created in ID).
Pass control signals along just like the data.

Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc
Memory access stage control lines: Branch, MemRead, MemWrite
Write-back stage control lines: RegWrite, MemtoReg

Instruction | RegDst | ALUOp1 | ALUOp0 | ALUSrc | Branch | MemRead | MemWrite | RegWrite | MemtoReg
R-format | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0
lw | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1
sw | X | 0 | 0 | 1 | 0 | 0 | 1 | 0 | X
beq | X | 0 | 1 | 0 | 1 | 0 | 0 | 0 | X

[Figure: Control in ID produces the WB, M and EX fields; ID/EX carries all three, EX/MEM carries WB and M, MEM/WB carries WB]
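One way to picture "pass control signals along just like the data" is a toy model in which each pipeline register keeps only the control fields that later stages still need. This is a minimal sketch: the `CONTROL` values are the usual textbook MIPS settings and are an assumption here, `None` stands for a don't-care (X), and `clock` is a hypothetical helper name.

```python
# Control bits per instruction class, grouped by the stage that consumes them.
# None marks a don't-care (X). Values are assumed textbook MIPS settings.
CONTROL = {
    "R-format": {"EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
                 "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
                 "WB": {"RegWrite": 1, "MemtoReg": 0}},
    "lw": {"EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
           "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
           "WB": {"RegWrite": 1, "MemtoReg": 1}},
    "sw": {"EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
           "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
           "WB": {"RegWrite": 0, "MemtoReg": None}},
    "beq": {"EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
            "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
            "WB": {"RegWrite": 0, "MemtoReg": None}},
}


def clock(id_ex, ex_mem, mem_wb, decoded_op):
    """Advance one clock edge and return the new (ID/EX, EX/MEM, MEM/WB).

    Each register drops the field consumed by the stage just completed.
    decoded_op is the instruction class leaving ID, or None for a bubble.
    """
    new_mem_wb = None if ex_mem is None else {"WB": ex_mem["WB"]}
    new_ex_mem = None if id_ex is None else {"MEM": id_ex["MEM"], "WB": id_ex["WB"]}
    new_id_ex = None if decoded_op is None else dict(CONTROL[decoded_op])
    return new_id_ex, new_ex_mem, new_mem_wb


# Decode one lw, then let it drain: after three edges only its WB field remains.
regs = (None, None, None)
for op in ["lw", None, None]:
    regs = clock(*regs, op)
print(regs[2])
```

The point of the sketch is the shrinking payload: EX bits stop at ID/EX, MEM bits stop at EX/MEM, and only the write-back bits reach MEM/WB.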

28 Datapath with Control
[Figure: pipelined datapath with the control unit; the WB, M and EX control fields are carried in ID/EX, EX/MEM and MEM/WB. The same figure is used on slides 29 to 37, which show the stage occupancy cycle by cycle.]

29 Datapath with Control
IF: lw $, 8($)

30 Datapath with Control
IF: sub $, $2, $3
ID: lw $, 8($)

31 Datapath with Control
IF: and $2, $4, $5
ID: sub $, $2, $3
EX: lw $, 8($)

32 Datapath with Control
IF: or $3, $6, $7
ID: and $2, $4, $5
EX: sub $, $2, $3
MEM: lw $, 8($)

33 Datapath with Control
IF: add $4, $8, $9
ID: or $3, $6, $7
EX: and $2, $4, $5
MEM: sub $,..
WB: lw $, 8($)

34 Datapath with Control
IF: xxxx
ID: add $4, $8, $9
EX: or $3, $6, $7
MEM: and $2
WB: sub $,..

35 Datapath with Control
IF: xxxx
ID: xxxx
EX: add $4, $8, $9
MEM: or $3,..
WB: and $2

36 Datapath with Control
IF: xxxx
ID: xxxx
EX: xxxx
MEM: add $4,..
WB: or $3

37 Datapath with Control
IF: xxxx
ID: xxxx
EX: xxxx
MEM: xxxx
WB: add $4..

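38 Simple RISC Pipeline

Instruction number | Clock 1 | Clock 2 | Clock 3 | Clock 4 | Clock 5 | Clock 6 | Clock 7 | Clock 8 | Clock 9
Instruction i | IF | ID | EX | MEM | WB | | | |
Instruction i+1 | | IF | ID | EX | MEM | WB | | |
Instruction i+2 | | | IF | ID | EX | MEM | WB | |
Instruction i+3 | | | | IF | ID | EX | MEM | WB |
Instruction i+4 | | | | | IF | ID | EX | MEM | WB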
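The staircase pattern of the simple RISC pipeline can be generated mechanically. The following is a minimal sketch, assuming an ideal five-stage pipeline with no stalls; `pipeline_diagram` and `total_cycles` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instr: int) -> int:
    """Cycles needed to finish n_instr instructions with no stalls."""
    return n_instr + len(STAGES) - 1


def pipeline_diagram(n_instr: int):
    """Return one row per instruction; row[c] is the stage in clock c+1."""
    rows = []
    for i in range(n_instr):
        row = [""] * total_cycles(n_instr)
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i enters IF in clock i+1
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(5)):
    label = "i" if i == 0 else f"i+{i}"
    print(f"{label:<4}" + " ".join(f"{stage:<4}" for stage in row))
```

With five instructions the diagram spans nine clocks: the first instruction needs five, and each later one adds a single clock.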

39 Review: Visualizing Pipelining
[Figure: instructions in program order (vertical axis) against time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
Adapted from Patterson, Katz and Kubiatowicz, UCB

41 Example: Consider the unpipelined processor in the previous section. Assume that it has a 0.5 ns clock cycle (a 2 GHz clock) and that it uses four cycles for ALU operations and branches and five cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.1 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline?
Answer: The average instruction execution time on the unpipelined processor is
Clock cycle × Average CPI = 0.5 ns × (40% × 4 + 20% × 4 + 40% × 5) = 0.5 ns × 4.4 = 2.2 ns.
In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 0.5 + 0.1 or 0.6 ns; this is the average instruction execution time. Thus, the speedup from pipelining is
2.2 ns / 0.6 ns ≈ 3.7 times.
The 0.1 ns overhead essentially establishes a limit on the effectiveness of pipelining. If the overhead is not affected by changes in the clock cycle, Amdahl's Law tells us that the overhead limits the speedup.
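The arithmetic of this example can be checked with a few lines. This is a sketch under the assumption that the figures are a 0.5 ns cycle, a 40% / 20% / 40% mix of ALU, branch and memory operations, and 0.1 ns of pipelining overhead; the variable names are illustrative only.

```python
# Assumed inputs for the worked example (see the lead-in above).
cycle_ns = 0.5
overhead_ns = 0.1
# operation class -> (relative frequency, cycles when unpipelined)
mix = {"ALU": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)}

# Average CPI of the unpipelined machine, weighted by instruction mix.
avg_cpi = sum(freq * cycles for freq, cycles in mix.values())

# Average instruction time without pipelining.
unpipelined_ns = cycle_ns * avg_cpi

# With pipelining one instruction completes per (slightly longer) clock.
pipelined_ns = cycle_ns + overhead_ns

speedup = unpipelined_ns / pipelined_ns
print(f"CPI {avg_cpi:.1f}, {unpipelined_ns:.1f} ns vs {pipelined_ns:.1f} ns, "
      f"speedup {speedup:.2f}")
```

Setting `overhead_ns` to zero gives a speedup equal to the average CPI, which is the ceiling the overhead eats into.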

42 Pipeline Hazards

43 Hazards: circumstances that would cause incorrect execution if the next instruction were launched.
Structural hazards: attempting to use the same hardware to do two different things at the same time.
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

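44 Speedup from pipelining

\[
\text{Speedup from pipelining}
= \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}
= \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}
= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]

\[
\text{CPI pipelined} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instruction}
= 1 + \text{Pipeline stall clock cycles per instruction}
\]

Assuming the same clock cycle for pipelined and unpipelined:

\[
\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall cycles per instruction}}
\]

If every instruction takes the same number of cycles, equal to the pipeline depth:

\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}
\]

If pipelining is instead viewed as shortening the clock cycle, with an unpipelined CPI of 1:

\[
\text{Speedup from pipelining}
= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
= \frac{1}{1 + \text{Pipeline stall cycles per instruction}} \times \frac{\text{Clock cycle unpipelined}}{\text{Clock cycle pipelined}}
\]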
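The speedup relations on this slide reduce to one small function. This is a minimal sketch, assuming an ideal pipelined CPI of 1; `pipeline_speedup` and its parameter names are hypothetical.

```python
def pipeline_speedup(cpi_unpipelined: float,
                     stall_cycles_per_instr: float,
                     clock_unpipelined: float = 1.0,
                     clock_pipelined: float = 1.0) -> float:
    """Speedup = (CPI_unpip / CPI_pip) * (clock_unpip / clock_pip).

    CPI_pip is taken as 1 + stall cycles per instruction (ideal CPI of 1).
    """
    cpi_pipelined = 1.0 + stall_cycles_per_instr
    return (cpi_unpipelined / cpi_pipelined) * (clock_unpipelined / clock_pipelined)


# Same clock cycle, unpipelined CPI equal to a pipeline depth of 5:
print(pipeline_speedup(5, 0.0))   # no stalls: speedup equals the depth
print(pipeline_speedup(5, 0.25))  # a quarter of a stall cycle per instruction
```

Passing an unpipelined CPI of 1 together with a clock ratio equal to the depth reproduces the second view on the slide, where pipelining shortens the clock rather than lowering the CPI.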

45 Example: Structural Hazard
[Figure: Load, Instr 1, Instr 2, Instr 3 and Instr 4 in program order against time, Cycle 1 to Cycle 7; each passes through Ifetch, Reg, ALU, DMem, Reg, and the structural hazard is marked]

46 Resolving structural hazards
Definition of structural hazard: attempt to use the same hardware for two different things at the same time.
Solution 1: Wait (stall)
» must detect the hazard
» must have a mechanism to stall
Solution 2: Use more hardware

47 Detecting and Resolving Structural Hazard
[Figure: Load, Instr 1, Instr 2, Stall, Instr 3 in program order against time, Cycle 1 to Cycle 7; the stall appears as a row of bubbles before Instr 3]
Adapted from Patterson, Katz and Kubiatowicz, UCB

48 Role of ISA in Structural Hazard Resolution
Simple to determine the sequence of resources used by an instruction: the opcode tells it all.
Uniformity in the resource usage.
Compare MIPS to IA32?
MIPS approach => all instructions flow through the same 5-stage pipeline.

49 Data Hazards
[Figure: instructions in program order against time (clock cycles), with stages IF, ID/RF, EX, MEM, WB]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Adapted from Patterson, Katz and Kubiatowicz, UCB

50 [Figure: program execution order (in instructions) against time (in clock cycles), CC1 to CC6]
ADD R1, R2, R3
LW R4, 0(R1)
SW 12(R1), R4
Stores require an operand during MEM, and forwarding of that operand is shown here.

51 Pipeline w/o forwarding
[Figure: five-stage datapath (Inst. Fetch, Inst. Decode, Exec, Mem, WB) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

52 Forwarding (from EX/MEM)
[Figure: EX/MEM result fed back through MUXes to the ALU inputs]

53 Forwarding (from MEM/WB)
[Figure: MEM/WB result fed back through MUXes to the ALU inputs]

54 Control unit for forwarding
[Figure: the Forwarding Unit compares ID/EX Rs and Rt with EX/MEM Rd and MEM/WB Rd and drives the MUXes at the ALU inputs]

55 Forwarding
In every row the destination instruction is held in pipeline register ID/EX.

Pipeline register containing source instruction | Opcode of source instruction | Opcode of destination instruction | Destination of the forwarded result | Comparison (if equal then forward)
EX/MEM | Register-register ALU | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR16..20 == ID/EX.IR6..10
EX/MEM | Register-register ALU | Register-register ALU | Bottom ALU input | EX/MEM.IR16..20 == ID/EX.IR11..15
MEM/WB | Register-register ALU | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR16..20 == ID/EX.IR6..10
MEM/WB | Register-register ALU | Register-register ALU | Bottom ALU input | MEM/WB.IR16..20 == ID/EX.IR11..15
EX/MEM | ALU immediate | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | EX/MEM.IR11..15 == ID/EX.IR6..10
EX/MEM | ALU immediate | Register-register ALU | Bottom ALU input | EX/MEM.IR11..15 == ID/EX.IR11..15
MEM/WB | ALU immediate | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 == ID/EX.IR6..10
MEM/WB | ALU immediate | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 == ID/EX.IR11..15
MEM/WB | Load | Register-register ALU, ALU immediate, load, store, branch | Top ALU input | MEM/WB.IR11..15 == ID/EX.IR6..10
MEM/WB | Load | Register-register ALU | Bottom ALU input | MEM/WB.IR11..15 == ID/EX.IR11..15
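The comparisons performed by a forwarding unit can be sketched as a pure function. This is a simplified illustration rather than the exact logic of the slides: it assumes register 0 is hardwired to zero, that each older instruction carries a single destination register number plus a register-write flag, and that the newest result (EX/MEM) wins over MEM/WB; all names are hypothetical.

```python
def forward_select(id_ex_rs: int, id_ex_rt: int,
                   ex_mem_rd: int, ex_mem_regwrite: bool,
                   mem_wb_rd: int, mem_wb_regwrite: bool):
    """Choose the source for the top (rs) and bottom (rt) ALU inputs.

    Returns a pair of strings, each "EX/MEM", "MEM/WB" or "REGFILE".
    """
    def pick(src: int) -> str:
        # The most recent producer takes priority.
        if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src:
            return "EX/MEM"
        if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src:
            return "MEM/WB"
        return "REGFILE"  # no hazard: use the value read in ID

    return pick(id_ex_rs), pick(id_ex_rt)


# add r1,r2,r3 is in MEM while sub r4,r1,r3 is in EX:
print(forward_select(1, 3, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))
```

The priority order matters when two in-flight instructions write the same register: the value in EX/MEM is the later one in program order.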

56 Data Hazards Classification
Resource Objects (R.O.): all addressable locations.
Data Objects (D.O.): contents of resource objects.
D(I): Domain of instruction I = all R.O. whose D.O. affect the operation of I.
R(I): Range of instruction I = all R.O. whose D.O. are affected by the execution of I.

57 For instruction i followed by instruction j:
A RAW hazard exists on register $\rho$ if $\rho \in R(i) \cap D(j)$
A WAW hazard exists on register $\rho$ if $\rho \in R(i) \cap R(j)$
A WAR hazard exists on register $\rho$ if $\rho \in D(i) \cap R(j)$
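The domain/range definitions translate directly into set intersections. This is a minimal sketch, assuming instruction i precedes instruction j and that each instruction is described only by the registers it reads (domain) and writes (range); `classify_hazards` is a hypothetical name.

```python
def classify_hazards(i_domain, i_range, j_domain, j_range):
    """Return the registers involved in each hazard class for i before j."""
    i_domain, i_range = set(i_domain), set(i_range)
    j_domain, j_range = set(j_domain), set(j_range)
    return {
        "RAW": i_range & j_domain,   # j reads what i writes
        "WAW": i_range & j_range,    # both write the same location
        "WAR": i_domain & j_range,   # j writes what i reads
    }


# I: add r1,r2,r3   J: sub r4,r1,r3
print(classify_hazards({"r2", "r3"}, {"r1"}, {"r1", "r3"}, {"r4"}))
```

Whether a non-empty set actually causes a stall depends on the pipeline; the sets only say which orderings must be preserved.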

58 [Figure: three diagrams of the domains D(I), D(J) and ranges R(I), R(J) of instructions I and J, one each for RAW, WAW and WAR]

59 Situations that the pipeline hazard detection hardware can see by comparing the destination and sources of adjacent instructions.

Situation | Example code sequence | Action
No dependence | LW R1,45(R2); ADD R5,R6,R7; SUB R8,R6,R7; OR R9,R6,R7 | No hazard possible because no dependence exists on R1 in the immediately following three instructions.
Dependence requiring stall | LW R1,45(R2); ADD R5,R1,R7; SUB R8,R6,R7; OR R9,R6,R7 | Comparators detect the use of R1 in the ADD and stall the ADD (and SUB and OR) before the ADD begins EX.
Dependence overcome by forwarding | LW R1,45(R2); ADD R5,R6,R7; SUB R8,R1,R7; OR R9,R6,R7 | Comparators detect use of R1 in SUB and forward the result of the load to the ALU in time for SUB to begin EX.
Dependence with accesses in order | LW R1,45(R2); ADD R5,R6,R7; SUB R8,R6,R7; OR R9,R1,R7 | No action required because the read of R1 by OR occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half.

60 Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read operand before Instr I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

61 Three Generic Data Hazards
Write After Read (WAR): Instr J writes operand before Instr I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
Can't happen in the MIPS 5-stage pipeline because:
» All instructions take 5 stages, and
» Reads are always in stage 2, and
» Writes are always in stage 5

62 Three Generic Data Hazards
Write After Write (WAW): Instr J writes operand before Instr I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
Can't happen in the MIPS 5-stage pipeline because:
» All instructions take 5 stages, and
» Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes.

63 Forwarding to Avoid Data Hazard
[Figure: instructions in program order against time (clock cycles), with forwarding paths from add to the following instructions]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

64 Data Hazard Even with Forwarding
[Figure: instructions in program order against time (clock cycles)]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Adapted from Patterson, Katz and Kubiatowicz, UCB

65 Resolving this load hazard
Adding hardware? ... not
Detection?
Compilation techniques?
What is the cost of load delays?

66 Resolving the Load Data Hazard
[Figure: the same sequence with a bubble inserted after the load, delaying sub, and, or by one cycle]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
How is this different from the instruction issue stall?

67 Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
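The benefit of the reordering can be counted. The sketch below assumes a five-stage pipeline with full forwarding, so the only stall is one cycle when an instruction uses a loaded value in the slot right after the load; the tuple layout and the name `load_use_stalls` are illustrative assumptions.

```python
def load_use_stalls(program) -> int:
    """Count one-cycle stalls from a load followed directly by a consumer.

    Each instruction is (opcode, destination register or None, sources).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
print(load_use_stalls(slow), load_use_stalls(fast))
```

Both sequences contain the same eight instructions; the fast version simply places an independent instruction after each load whose value is needed next.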

68 Instruction Set Connection
What is exposed about this organizational hazard in the instruction set?
k cycle delay?
» bad, CPI is not part of ISA
k instruction slot delay
» load should not be followed by use of the value in the next k instructions
Nothing, but code can reduce run-time delays
MIPS did the transformation in the assembler

69 [Chart: fraction of loads that cause a stall, by benchmark: compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor]

70 Control Hazard on Branches => Three Stage Stall
[Figure: each instruction passes through Ifetch, Reg, ALU, DMem, Reg]
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

71 Example: Branch Stall Impact
If 30% of instructions are branches, a 3-cycle stall is significant.
Two part solution:
» Determine branch taken or not sooner, AND
» Compute taken branch address earlier
MIPS branch tests if register = 0 or ≠ 0
MIPS Solution:
» Move Zero test to ID/RF stage
» Adder to calculate new PC in ID/RF stage
» 1 clock cycle penalty for branch versus 3
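The cost of the branch stall can be put in numbers. This is a sketch under the assumptions that the ideal CPI is 1, that 30% of instructions are branches, and that every branch pays the full penalty; `cpi_with_branch_stalls` is a hypothetical name.

```python
def cpi_with_branch_stalls(branch_fraction: float,
                           penalty_cycles: int,
                           ideal_cpi: float = 1.0) -> float:
    """Effective CPI when every branch stalls for penalty_cycles."""
    return ideal_cpi + branch_fraction * penalty_cycles


three_cycle = cpi_with_branch_stalls(0.30, 3)  # branch resolved late
one_cycle = cpi_with_branch_stalls(0.30, 1)    # zero test and adder moved to ID/RF
print(f"CPI {three_cycle:.2f} with a 3-cycle stall, {one_cycle:.2f} with 1 cycle")
```

Under these assumptions, moving the test and the target adder into ID/RF removes two thirds of the branch overhead.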

72 [Chart: the frequency of instructions that may change the PC, as a percentage of instructions executed, by benchmark (compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor); series: forward conditional branches, backward conditional branches, unconditional branches]

73 Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
» Execute successor instructions in sequence
» "Squash" instructions in pipeline if branch actually taken
» Advantage of late pipeline state update
» 47% of MIPS branches not taken on average
» PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
» 53% of MIPS branches taken on average
» But haven't calculated branch target address in MIPS
» MIPS still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome

74 Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER a following instruction:
branch instruction
sequential successor 1
sequential successor 2
...
sequential successor n
...
branch target if taken
(Branch delay of length n)
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this

75 Hardware interlock - No-op

76 Delayed Branch
Where to get instructions to fill branch delay slot?
» Before branch instruction
» From the target address: only valuable when branch taken
» From fall through: only valuable when branch not taken
» Canceling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
» Fills about 60% of branch delay slots
» About 80% of instructions executed in branch delay slots useful in computation
» About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

77 Scheduling the branch delay slot

(a) From before
ADD R1, R2, R3
if R2 = 0 then
Delay slot
Becomes
if R2 = 0 then
ADD R1, R2, R3

(b) From target
SUB R4, R5, R6
ADD R1, R2, R3
if R1 = 0 then
Delay slot
Becomes
SUB R4, R5, R6
ADD R1, R2, R3
if R1 = 0 then
SUB R4, R5, R6

(c) From fall through
ADD R1, R2, R3
if R1 = 0 then
Delay slot
OR R7, R8, R9
SUB R4, R5, R6
Becomes
ADD R1, R2, R3
if R1 = 0 then
OR R7, R8, R9
SUB R4, R5, R6

78 Canceling or Nullifying Branch
In a canceling branch, the instruction includes the direction in which the branch was predicted. When the branch behaves as predicted, the instruction in the branch-delay slot is simply executed as it would normally be with a delayed branch. When the branch is incorrectly predicted, the instruction in the branch-delay slot is turned into a no-op.
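The rule for a canceling branch fits in a few lines. This is a minimal sketch of the behavior described above, with hypothetical names; it models only what happens to the delay-slot instruction.

```python
def delay_slot_action(predicted_taken: bool, actually_taken: bool) -> str:
    """Return what happens to the delay-slot instruction of a canceling branch."""
    if predicted_taken == actually_taken:
        return "execute"   # branch behaved as predicted: normal delayed branch
    return "no-op"         # misprediction: the slot instruction is nullified


# A predicted-taken canceling branch, for both outcomes:
for outcome in (True, False):
    print("taken" if outcome else "not taken",
          "->", delay_slot_action(True, outcome))
```

This is why a compiler may fill the slot from the target of a predicted-taken branch even when the instruction would be harmful on the fall-through path.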

79 Delayed-branch scheduling schemes

Scheduling strategy | Requirements | Improves performance when?
(a) From before | Branch must not depend on the rescheduled instructions. | Always.
(b) From target | Must be OK to execute rescheduled instructions if branch is not taken. May need to duplicate instructions. | When branch is taken. May enlarge program if instructions are duplicated.
(c) From fall through | Must be OK to execute instructions if branch is taken. | When branch is not taken.

80 The behavior of a predicted-taken canceling branch depends on whether the branch is taken or not.
[Figure, two cases: branch is NOT TAKEN (misprediction); branch is TAKEN (predicted correctly)]

81 Delayed and Canceling Delay Branches

82 Overall costs of a variety of branch schemes

83 For an R4000-style pipeline, it takes 3 pipeline stages before the branch target address is known and 1 more cycle to evaluate the branch condition. This leads to the following branch penalties:
[Table: branch penalty per branch scheme]
Find the effective addition to the CPI arising from branches, given that the relative frequencies of unconditional, conditional untaken, and conditional taken branches are 4%, 6%, and 10%, respectively.
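The exercise can be worked numerically. The penalty table did not survive on this slide, so the cycle counts below are assumed values for the three usual schemes (flush pipeline, predicted taken, predicted untaken), as are the 4% / 6% / 10% frequencies; treat the results as illustrative and the names as hypothetical.

```python
# Assumed branch frequencies (fraction of all instructions).
FREQ = {"unconditional": 0.04, "untaken": 0.06, "taken": 0.10}

# Assumed penalties in cycles for each scheme and branch kind.
PENALTY = {
    "flush pipeline":    {"unconditional": 2, "untaken": 3, "taken": 3},
    "predicted taken":   {"unconditional": 2, "untaken": 3, "taken": 2},
    "predicted untaken": {"unconditional": 2, "untaken": 0, "taken": 3},
}


def added_cpi(scheme: str) -> float:
    """Stall cycles per instruction contributed by branches under a scheme."""
    return sum(FREQ[kind] * PENALTY[scheme][kind] for kind in FREQ)


for scheme in PENALTY:
    print(f"{scheme:<18} adds {added_cpi(scheme):.2f} to the CPI")
```

Under these assumed penalties, predicting untaken costs the least because the untaken conditional branches then pay nothing.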

84 Control Hazards
1. Find out whether the branch is taken or not taken earlier in the pipeline.
2. Compute the taken PC (i.e., the address of the branch target) earlier.

Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Branch instruction | IF | ID | EX | MEM | WB | | | | |
Branch successor | | IF | stall | stall | IF | ID | EX | MEM | WB |
Branch successor + 1 | | | | | | IF | ID | EX | MEM | WB
Branch successor + 2 | | | | | | | IF | ID | EX | MEM
Branch successor + 3 | | | | | | | | IF | ID | EX
Branch successor + 4 | | | | | | | | | IF | ID
Branch successor + 5 | | | | | | | | | | IF

85 Multiple Pipelines

86 Superscalar
[Figure: superscalar of degree 3; instructions against clock cycles]

87 Superpipelined
[Figure: superpipelined of degree 3; instructions against clock cycles]

88 Multicycle Operations

89 The MIPS pipeline with three additional unpipelined, floating-point, functional units.

91 Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
The initiation or repeat interval: the number of cycles that must elapse between issuing two operations of a given type.
The pipe stages are shown in the order in which they are used for any operation. The notation S+A indicates a clock cycle in which both the S and A stages are used. The notation D^28 indicates that the D stage is used 28 times in a row.

92 An FP multiply issued at clock 0 is followed by a single FP add issued between clocks 1 and 7. The second column indicates whether an instruction of the specified type stalls when it is issued n cycles later, where n is the clock cycle number in which the U stage of the second instruction occurs. The stage or stages that cause a stall are in bold. Note that this table deals with only the interaction between the multiply and one add issued between clocks 1 and 7. In this case, the add will stall if it is issued four or five cycles after the multiply; otherwise, it issues without stalling. Notice that the add will be stalled for two cycles if it issues in cycle 4, because on the next clock cycle it will still conflict with the multiply; if, however, the add issues in cycle 5, it will stall for only 1 clock cycle, because that will eliminate the conflicts.

93 A multiply issuing after an add can always proceed without stalling, because the shorter instruction clears the shared pipeline stages before the longer instruction reaches them.

94 Support Multiple FP Operations
[Figure: IF and ID feed parallel execution units (integer unit EX, FP multiplier, FP adder stages A1 to A4, non-pipelined FP divider), followed by MEM and WB]
Complicates bypassing
FP divider (non-pipelined): potential structural hazard
Multiple (FP) instructions can complete at the same time
» RF might need to be multi-ported
» Ordering issue: who gets to update the register?
Out-of-order completion/retirement: precise exception issue
Hsien-Hsin S. Lee, Georgia Institute of Technology

95 Bypass/Forwarding
Pipeline timing over clock cycles (S = stall):
L.D F4,0(R2): IF ID EX MEM WB
MUL.D F0,F4,F6: IF ID S M1 M2 M3 M4 M5 M6 M7 MEM WB
ADD.D F2,F0,F8: IF S ID S S S S S S A1 A2 A3 A4 MEM WB
S.D F2,0(R2): IF S S S S S S ID EX S S S MEM WB

96 The pipeline timing of a set of independent FP operations.
A typical FP code sequence showing the stalls arising from RAW hazards.

97 Three instructions want to perform a write back to the FP register file simultaneously, as shown in clock cycle 11.
