Implementing a MIPS Processor. Readings:

1 Implementing a MIPS Processor Readings: 4.1-4.11

2 Goals for this Class Understand how CPUs run programs How do we express the computation to the CPU? How does the CPU execute it? How does the CPU support other system components (e.g., the OS)? What techniques and technologies are involved and how do they work? Understand why CPU performance (and other metrics) varies How does CPU design impact performance? What trade-offs are involved in designing a CPU? How can we meaningfully measure and compare computer systems? Understand why program performance varies How do program characteristics affect performance? How can we improve a program's performance by considering the CPU running it? How do other system components impact program performance? 2

3 Goals Understand how the 5-stage MIPS pipeline works See examples of how architecture impacts ISA design Understand how the pipeline affects performance Understand hazards and how to avoid them Structural hazards Data hazards Control hazards 3

4 Processor Design in Two Acts Act I: A single-cycle CPU

5 Foreshadowing Act I: A Single-cycle Processor Simplest design Not how many real machines work (maybe some deeply embedded processors) Figure out the basic parts; what it takes to execute instructions Act II: A Pipelined Processor This is how many real machines work Exploit parallelism by executing multiple instructions at once.

6 Target ISA We will focus on part of MIPS Enough to run into the interesting issues Memory operations A few arithmetic/logical operations (Generalizing is straightforward) BEQ and J This corresponds pretty directly to what you'll be implementing in 141L. 6

7 Basic Steps for Execution Fetch an instruction from the instruction store Decode it What does this instruction do? Gather inputs From the register file From memory Perform the operation Write the outputs To register file or memory Determine the next instruction to execute 7

8 The Processor Design Algorithm Once you have an ISA Design/Draw the datapath Identify and instantiate the hardware for your architectural state For each instruction Simulate the instruction Add and connect the datapath elements it requires Is it workable? If not, fix it. Design the control For each instruction Simulate the instruction What control lines do you need? How will you compute their value? Modify control accordingly Is it workable? If not, fix it. You've already done much of this in 141L. 8

9 Arithmetic; R-Type Inst = Mem[PC] PC = PC + 4 REG[rd] = REG[rs] op REG[rt] bits 31:26 25:21 20:16 15:11 10:6 5:0 name op rs rt rd shamt funct # bits 6 5 5 5 5 6

17 ADDI; I-Type PC = PC + 4 REG[rt] = REG[rs] op SignExtImm bits 31:26 25:21 20:16 15:0 name op rs rt imm # bits 6 5 5 16

19 Load Word PC = PC + 4 REG[rt] = MEM[SignExtImm + REG[rs]] bits 31:26 25:21 20:16 15:0 name op rs rt immediate # bits 6 5 5 16

21 Store Word PC = PC + 4 MEM[SignExtImm + REG[rs]] = REG[rt] bits 31:26 25:21 20:16 15:0 name op rs rt immediate # bits 6 5 5 16

23 Branch-equal; I-Type PC = (REG[rs] == REG[rt]) ? PC + SignExtImmediate*4 : PC + 4; bits 31:26 25:21 20:16 15:0 name op rs rt displacement # bits 6 5 5 16

25 A Single-cycle Processor Performance refresher ET = IC * CPI * CT Single cycle CPI == 1; That sounds great Unfortunately, single-cycle CT is large Even RISC instructions take quite a bit of effort to execute This is a lot to do in one cycle 14

26 Our Hardware is Mostly Idle Cycle time = 15 ns Slowest module (alu) is ~6ns 15

28 Processor Design in Two Acts Act II: A pipelined CPU

29 Pipelining Review 17

30 Pipelining [timing diagram: cycles 1-3 at 10 ns, 20 ns, 30 ns; each cycle a latch feeding 10 ns of logic feeding a latch] What's the throughput? What's the latency for one unit of work?

31 Pipelining Break up the logic with latches into pipeline stages Each stage can act on different data Latches hold the inputs to their stage Every clock cycle data transfers from one pipe stage to the next [figure: one 10 ns logic block vs. five 2 ns logic blocks separated by latches]

32 Pipelining [timing diagram: cycles 1-6 at 2 ns, 4 ns, 6 ns, 8 ns, 10 ns, 12 ns; five pipelined logic stages with a new unit of work entering each cycle] What's the latency for one unit of work? What's the throughput?
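
A small Python sketch (my own illustration, not from the slides) of the latency/throughput arithmetic behind these two slides, assuming ideal latches with zero overhead:

def pipeline_metrics(logic_delay_ns, n_stages):
    cycle_time = logic_delay_ns / n_stages   # each stage does 1/N of the logic
    latency = cycle_time * n_stages          # one unit of work still sees the full delay
    throughput = 1.0 / cycle_time            # one result per cycle once the pipe is full
    return cycle_time, latency, throughput

print(pipeline_metrics(10.0, 1))   # unpipelined: 10 ns cycle, 0.1 results/ns
print(pipeline_metrics(10.0, 5))   # five 2 ns stages: same 10 ns latency, 5x the throughput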

33 Critical path review Critical path is the longest possible delay between two registers in a design. The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path. Lengthening or shortening non-critical paths does not change performance Ideally, all paths are about the same length 21

34 Pipelining and Logic [figure: a chain of logic blocks cut into thirds by pipeline registers] Hopefully, critical path reduced by 1/3 22

35 Limitations of Pipelining You cannot pipeline forever Some logic cannot be pipelined arbitrarily -- Memories Some logic is inconvenient to pipeline. How do you insert a register in the middle of a multiplier? Registers have a cost They cost area -- choose narrow points in the logic They cost energy -- latches don't do any useful work They cost time Extra logic delay Set-up and hold times. Pipelining may not affect the critical path as you expect 23

36 Pipelining Overhead Logic Delay (LD) -- How long does the logic take (i.e., the useful part) Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready? Relatively short on our FPGAs Register delay (RD) -- Delay through the internals of the register. Longer for our FPGAs. Much, much shorter for raw CMOS. 24

37 Pipelining Overhead [timing diagram: register inputs must hold new data for the setup time before the clock edge; the register output switches from old data to new data after the register delay] 25

38 Pipelining Overhead Logic Delay (LD) -- How long does the logic take (i.e., the useful part) Set up time (ST) -- How long before the clock edge do the inputs to a register need to be ready? Register delay (RD) -- Delay through the internals of the register. CT_base -- cycle time before pipelining: CT_base = LD + ST + RD CT_pipe -- cycle time after pipelining N times: CT_pipe = ST + RD + LD/N Total time = N*ST + N*RD + LD 26
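
The cycle-time model above written out as a Python sketch; the LD/ST/RD values below are made up for illustration, not measurements from the class FPGA:

def cycle_times(LD, ST, RD, N):
    ct_base = LD + ST + RD             # single-stage cycle time
    ct_pipe = ST + RD + LD / N         # per-stage overhead does not shrink with N
    total_time = N * ST + N * RD + LD  # time for one item to cross all N stages
    return ct_base, ct_pipe, total_time

LD, ST, RD = 10.0, 0.2, 0.5            # ns, illustrative values only
for N in (1, 2, 5, 10):
    print(N, cycle_times(LD, ST, RD, N))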

39 Pipelining Difficulties [figure: alternating fast and slow logic blocks between registers] You can't always pipeline how you would like 27

40 How to pipeline a processor Break each instruction into pieces -- remember the basic algorithm for execution Fetch Decode Collect arguments Execute Write results Compute next PC The classic 5-stage MIPS pipeline Fetch -- read the instruction Decode -- decode and read from the register file Execute -- Perform arithmetic ops and address calculations Memory -- access data memory. Write -- Store results in the register file. 28

41 Pipelining a processor 29

42 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) 29

43 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) Fetch Decode/reg read Mem Reg Write (Easier to draw) 29

44 Pipelining a processor Fetch Decode/reg read Mem Reg Write (Reality) Fetch Decode/reg read Mem Reg Write (Easier to draw) Fetch Decode/reg read Mem Reg Write (Pipelined) 29

45 Pipelined Datapath [datapath figure: PC, +4 adder, instruction memory, register file, sign extend (16 to 32), shift left 2, branch adder, ALU, data memory]

46 Pipelined Datapath [same datapath with the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB added]

47 Pipelined Datapath [animation spanning slides 47-51: a sequence of instructions (add, lw, sub, subi) advances through the pipelined datapath one stage per cycle]

53 Pipelined Datapath [pipelined datapath figure] This is a bug: rd must flow through the pipeline with the instruction. This signal needs to come from the WB stage.

55 Pipelined Control Control lives in the decode stage; signals flow down the pipe with the instructions they correspond to [pipelined datapath figure with a control unit in decode; the ex, mem, and write control signals are carried through the Dec/Exec, Exec/Mem, and Mem/WB pipeline registers] 38

56 Impact of Pipelining ET = IC * CPI * CT Break the processor into P pipe stages CT_new = CT/P CPI_new = CPI_old CPI is an average: cycles/instruction The latency of one instruction is P cycles The average CPI = 1 IC_new = IC_old Total speedup should be 5x! Except for the overhead of the pipeline registers And the realities of logic design... 39

57

58

59

60 Imem 2.77 ns

61 Imem 2.77 ns Ctrl ns

62 Imem 2.77 ns Ctrl ns ArgBMux ns

63 Imem 2.77 ns Ctrl ns ArgBMux ns ALU 5.94ns

64 Dmem ns

65

66 ALU 5.94ns

67

68 WriteRegMux ns

69

70 RegFile ns

71 [cycle-time breakdown chart: PC +, IMem, RegFile read, ALU, Mux, RegFile write, DMem read, Control, Mux, DMem write vs. cycle time]

73 What we want: 5 balanced pipeline stages Each stage would take 3.16 ns Clock speed: 316 MHz Speedup: 5x What's easy: 5 unbalanced pipeline stages Longest stage: 6.14 ns <- this is the critical path Shortest stage: 0.29 ns Clock speed: 163 MHz Speedup: 2.6x 44

75 Cycle time = 15.8 ns Clock Speed = 63 MHz [cycle-time breakdown chart] What we want: 5 balanced pipeline stages Each stage would take 3.16 ns Clock speed: 316 MHz Speedup: 5x What's easy: 5 unbalanced pipeline stages Longest stage: 6.14 ns <- this is the critical path Shortest stage: 0.29 ns Clock speed: 163 MHz Speedup: 2.6x 44
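
A quick Python check of the arithmetic on this slide (a sketch using the 15.8 ns single-cycle time and 6.14 ns critical stage quoted above):

single_cycle_ct = 15.8                  # ns
balanced_ct = single_cycle_ct / 5       # 3.16 ns per stage -> ~316 MHz
unbalanced_ct = 6.14                    # ns, the longest (critical) stage

print("balanced:   %.0f MHz, speedup %.1fx" % (1e3 / balanced_ct, single_cycle_ct / balanced_ct))
print("unbalanced: %.0f MHz, speedup %.1fx" % (1e3 / unbalanced_ct, single_cycle_ct / unbalanced_ct))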

78 A Structural Hazard Both the decode and write stages have to access the register file. There is only one register file. A structural hazard!! Solution: Write early, read late Writes occur at the clock edge and complete long before the end of the cycle This leaves enough time for the outputs to settle for the reads. Hazard avoided! Fetch Decode Mem Write 45

79 Quiz 4 1. For my CPU, CPI = 1 and CT = 4ns. a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline? b. What will the new CT, CPI, and IC be in the ideal case? c. Give 2 reasons why achieving the speedup in (a) will be very difficult. 2. Instructions in a VLIW instruction word are executed a. Sequentially b. In parallel c. Conditionally d. Optionally 3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like. 4. List 3 characteristics of MIPS that make it a RISC ISA. 5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide. 46

80 HW 1 Score Distribution [histogram: # of students vs. grade, F through A] 47

81 HW 2 Score Distribution [histogram: # of students vs. grade, F through A] 48

82 Projected Grade Distribution [histogram: # of students vs. grade, F through A] 49

83 What You Like More real world stuff. More class interactions. Cool factoids, side notes, etc. Examples in the slides. Jokes 50

84 What You Would Like to See Improved More time for the quizzes. Too much math. Going too fast. The subject of the class. The pace should be faster Late-breaking changes to the homeworks. Too much discussion of latency and metrics. It'd be nice to have the slides before lecture. Not enough time to ask questions MIPS is not the most relevant ISA. More precise instructions for HW and quizzes. The pace should be slower The software (SPIM, Quartus, and Modelsim) is hard to use 51

85 Pipelining is Tricky Simple pipelining is easy If the data flows in one direction only If the stages are independent In fact the tool can do this automatically via retiming (If you are curious, experiment with this in Quartus). Not so, for processors. Branch instructions affect the next PC -- backward flow Instructions need values computed by previous instructions -- not independent 52

86 Not just tricky, Hazardous! Hazards are situations where pipelining does not work as elegantly as we would like Three kinds Structural hazards -- we have run out of a hardware resource. Data hazards -- an input is not available on the cycle it is needed. Control hazards -- the next instruction is not known. Dealing with hazards increases complexity or decreases performance (or both) Dealing efficiently with hazards is much of what makes processor design hard. That, and the Quartus tools ;-) 53

87 Hazards: Key Points They prevent us from achieving CPI = 1 Hazards cause imperfect pipelining They are generally caused by counter-flow data dependences in the pipeline Three kinds Structural -- contention for hardware resources Data -- a data value is not available when/where it is needed. Control -- the next instruction to execute is not known. Two ways to deal with hazards Removal -- add hardware and/or complexity to work around the hazard so it does not occur Stall -- Hold up the execution of new instructions. Let the older instructions finish, so the hazard will clear. 54

88 Data Dependences A data dependence occurs whenever one instruction needs a value produced by another. Register values Also memory accesses (more on this later) add $s0, $t0, $t1 sub $t2, $s0, $t3 add $t3, $s0, $t4 add $t3, $t2, $t4 55
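
A minimal Python sketch (my own illustration) that finds the read-after-write dependences in the four-instruction sequence above; each instruction is written as (destination, source1, source2):

insts = [
    ("$s0", "$t0", "$t1"),   # add $s0, $t0, $t1
    ("$t2", "$s0", "$t3"),   # sub $t2, $s0, $t3
    ("$t3", "$s0", "$t4"),   # add $t3, $s0, $t4
    ("$t3", "$t2", "$t4"),   # add $t3, $t2, $t4
]
for i, (dest, _, _) in enumerate(insts):
    for j in range(i + 1, len(insts)):
        if dest in insts[j][1:]:
            print("instruction %d depends on instruction %d (through %s)" % (j, i, dest))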

94 Dependences in the pipeline In our simple pipeline, these instructions cause a data hazard Time Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 56

97 How can we fix it? Ideas? 57

98 Solution 1: Make the compiler deal with it. Expose hazards to the big-A architecture A result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. This is called a register delay slot What is N? Can it change? add $s0, $t0, $t1 Fetch Decode Mem Write delay slot Fetch Decode Mem Write delay slot Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 58

99 Solution 1: Make the compiler deal with it. Expose hazards to the big-A architecture A result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. This is called a register delay slot What is N? Can it change? N = 2, for our design add $s0, $t0, $t1 Fetch Decode Mem Write delay slot Fetch Decode Mem Write delay slot Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 58

100 Compiling for delay slots The compiler must fill the delay slots Ideally, with useful instructions, but nops will work too. add $s0, $t0, $t1 sub $t2, $s0, $t3 add $t3, $s0, $t4 and $t7, $t5, $t4 add $s0, $t0, $t1 and $t7, $t5, $t4 nop sub $t2, $s0, $t3 add $t3, $s0, $t4 59

101 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write NOPs in the bubble Mem Write Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

102 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write Both of these instructions are stalled NOPs in the bubble Mem Write Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

103 Solution 2: Stall When you need a value that is not ready, stall Suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write Both of these instructions are stalled NOPs in the bubble Mem Write Mem Write One instruction or nop completes each cycle add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write

104 Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This is equivalent to inserting nops into the pipeline when a hazard exists Insert nop control bits at stalled stage (decode in our example) How is this solution still potentially better than relying on the compiler? 61

105 Stalling the pipeline Freeze all pipeline stages before the stage where the hazard occurred. Disable the PC update Disable the pipeline registers This is equivalent to inserting nops into the pipeline when a hazard exists Insert nop control bits at stalled stage (decode in our example) How is this solution still potentially better than relying on the compiler? The compiler can still act like there are delay slots to avoid stalls. Implementation details are not exposed in the ISA 61

106 Calculating CPI for Stalls In this case, the bubble lasts for 2 cycles. As a result, in cycles 6 and 7, no instruction completes. What happens to CPI? In the absence of stalls, CPI is one, since one instruction completes per cycle If an instruction stalls for N cycles, its CPI goes up by N Cyc 1 Cyc 2 Cyc 3 Cyc 4 Cyc 5 Cyc 6 Cyc 7 Cyc 8 Cyc 9 Cyc 10 add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Decode Decode Mem Write add $t3, $s0, $t4 Fetch Fetch Fetch Decode Mem Write and $t3, $t2, $t4 Fetch Decode Mem Write

107 Midterm and Midterm Review Midterm in 1 week! Midterm review in 2 days! I'll just take questions from you. Come prepared with questions. Today Go over last 2 quizzes More about hazards 63

108 Quiz 4 1. For my CPU, CPI = 1 and CT = 4ns. a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline? a. 8x b. What will the new CT, CPI, and IC be in the ideal case? a. CT = CT/8, CPI = 1, IC = IC c. Give 2 reasons why achieving the speedup in (a) will be very difficult. a. Setup time; difficult-to-pipeline logic; register delay. 2. Instructions in a VLIW instruction word are executed a. Sequentially b. In parallel c. Conditionally d. Optionally 3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like. 1. x86, ARM, MIPS 4. List 3 characteristics of MIPS that make it a RISC ISA. 1. Fixed inst size; few inst formats; orthogonality; general purpose registers; designed for fast implementations; few addressing modes. 5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide. 1. Reduced code size 2. User-friendliness for hand-written assembly 64

109 Hardware for Stalling Turn off the enables on the earlier pipeline stages The earlier stages will keep processing the same instruction over and over. No new instructions get fetched. Insert control and data values corresponding to a nop into the downstream pipeline register. This will create the bubble. The nops will flow downstream, doing nothing. When the stall is over, re-enable the pipeline registers The instructions in the upstream stages will start moving again. New instructions will start entering the pipeline again. [figure: inject_nop_decode_execute, stall_pc, and stall_fetch_decode signals driving the enables and nop muxes of the Fetch, Decode/reg read, Mem, and Write stages] 65
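
A Python sketch of the stall logic described above. The signal names follow the figure; the hazard check itself is a simplification I am assuming (a design without forwarding), not the actual 141L implementation:

def stall_controls(dec_srcs, ex_dest, mem_dest):
    # Stall while an in-flight instruction will write a register that the
    # instruction in decode wants to read (no forwarding assumed).
    hazard = any(r is not None and r in dec_srcs for r in (ex_dest, mem_dest))
    stall_pc = hazard                    # PC keeps its value
    stall_fetch_decode = hazard          # Fetch/Decode register keeps its value
    inject_nop_decode_execute = hazard   # a nop (bubble) enters Execute instead
    return stall_pc, stall_fetch_decode, inject_nop_decode_execute

# add $s0,... is in Execute while sub $t2, $s0, $t3 sits in Decode -> stall
print(stall_controls({"$s0", "$t3"}, "$s0", None))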

110 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? What do we need to know to figure it out? 66

111 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = ? 67

112 The Impact of Stalling On Performance ET = I * CPI * CT I and CT are constant What is the impact of stalling on CPI? Fraction of instructions that stall: 30% Baseline CPI = 1 Stall CPI = 1 + 2 = 3 New CPI = 0.3*3 + 0.7*1 = 1.6
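
The same calculation as a short Python check (numbers from this slide):

stall_fraction = 0.30          # 30% of instructions stall
stall_cpi = 1 + 2              # each stalled instruction spends 2 extra cycles
new_cpi = stall_fraction * stall_cpi + (1 - stall_fraction) * 1
print(new_cpi)                 # 1.6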

113 Solution 3: Bypassing/Forwarding Data values are computed in Ex and Mem but published in Write The data exists! We should use it. [pipeline diagram: Fetch Decode Mem Write, annotated with where inputs are needed, where results are known, and where results are "published" to the registers] 68

114 Bypassing or Forwarding Take the values from wherever they are Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 69

115 Forwarding Paths Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 70

116 Forwarding in Hardware [pipelined datapath figure: PC, instruction memory, register file, sign extend, ALU, data memory, with the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB]

117 Hardware Cost of Forwarding In our pipeline, adding forwarding required relatively little hardware. For deeper pipelines it gets much more expensive Roughly: ALUs * pipe_stages you need to forward over Some modern processors have multiple ALUs (4-5) And deeper pipelines (4-5 stages to forward across) Not all forwarding paths need to be supported. If a path does not exist, the processor will need to stall. 72
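
A Python sketch (my own simplification, not the 141L design) of the operand selection that forwarding adds in front of each ALU input; newer in-flight results take priority over the stale register-file copy:

def forward_operand(src_reg, regfile_value,
                    ex_mem_dest, ex_mem_value,
                    mem_wb_dest, mem_wb_value):
    if src_reg == ex_mem_dest:    # result computed last cycle, not yet written back
        return ex_mem_value
    if src_reg == mem_wb_dest:    # older result, being written back this cycle
        return mem_wb_value
    return regfile_value          # no hazard: the register file copy is current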

118 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem 73

119 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Time travel presents significant implementation challenges 73

120 Forwarding for Loads Values can also come from the Mem stage In this case, forwarding is not possible to the next instruction (and is not necessary for later instructions) Choices Always stall! Stall when needed! Expose this in the ISA. Which does MIPS do? Cycles ld $s0, 0($t0) Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Time travel presents significant implementation challenges 73

121 Pros and Cons Punt to the compiler This is what MIPS does and is the source of the load delay slot Future versions must emulate a single load-delay slot. The compiler fills the slot if possible, or drops in a nop. Always stall. The compiler is oblivious, but performance will suffer 10-15% of instructions are loads, and the CPI for loads will be 2 Forward when possible, stall otherwise Here the compiler can order instructions to avoid the stall. If the compiler can't fix it, the hardware will. 74
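
A quick Python check of the "always stall" cost mentioned above, assuming loads are 10-15% of instructions and every load pays one stall cycle:

for load_fraction in (0.10, 0.15):
    cpi = load_fraction * 2 + (1 - load_fraction) * 1
    print("load fraction %.2f -> CPI %.2f" % (load_fraction, cpi))   # 1.10 and 1.15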

122 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

123 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem All stages of the pipeline earlier than the stall stand still. To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

124 Stalling for Load Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem Write back Fetch Decode Mem All stages of the pipeline earlier than the stall stand still. Only four stages are occupied. What's in Mem? To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 75

125 Inserting Noops Cycles Fetch Decode Mem Write Fetch Decode Mem Write Load $s1, 0($s1) Fetch Decode Mem Write The noop is in Mem Noop inserted Mem Write Addi $t1, $s1, 4 Fetch Decode Mem Write Fetch Decode Decode Mem Write Fetch Fetch Decode Mem To stall we insert a noop in place of the instruction and freeze the earlier stages of the pipeline 76

126 Quiz 1 Score Distribution [histogram: # of students vs. grade, F through A] 77

127 Quiz 2 Score Distribution [histogram: # of students vs. grade, F through A] 78

128 Quiz 3 Score Distribution [histogram: # of students vs. grade, F through A] 79

129 Quiz 4 Score Distribution [histogram: # of students vs. grade, F through A] 80

130 Key Points: Control Hazards Control hazards occur when we don't know what the next instruction is Caused by branches and jumps. Strategies for dealing with them Stall Guess! Leads to speculation Flushing the pipeline Strategies for making better guesses Understand the difference between stall and flush 81

131 Computing the PC Normally Non-branch instruction PC = PC + 4 When is PC ready? Fetch Decode Mem Write 82

133 Computing the PC Normally Non-branch instruction PC = PC + 4 When is PC ready? Fetch Decode Mem Write Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 82

134 Fixing the Ubiquitous Control Hazard We need to know if an instruction is a branch in the fetch stage! How can we accomplish this? Solution 1: Partially decode the instruction in fetch. You just need to know if it's a branch, a jump, or something else. Solution 2: We'll discuss later. Fetch Decode Mem Write 83

136 Computing the PC Normally Pre-decode in the fetch unit. PC = PC + 4 The PC is ready for the next fetch cycle. Fetch Decode Mem Write 84

138 Computing the PC Normally Pre-decode in the fetch unit. PC = PC + 4 The PC is ready for the next fetch cycle. Fetch Decode Mem Write Cycles add $s0, $t0, $t1 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write sub $t2, $s0, $t3 Fetch Decode Mem Write 84

139 Computing the PC for Branches Branch instructions bne $s1, $s2, offset if ($s1 != $s2) { PC = PC + offset } else { PC = PC + 4; } When is the value ready? Fetch Decode Mem Write Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write bne $t2, $s0, somewhere Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Decode Mem Write and $s4, $t0, $t1 Fetch Decode Mem Write 85

141 Computing the PC for Jumps Jump instructions jr $s1 -- jump register PC = $s1 When is the value ready? Fetch Decode Mem Write Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write jr $s4 Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Decode Mem Write 86

143 Dealing with Branches: Option 0 -- stall Cycles sll $s4, $t6, $t5 Fetch Decode Mem Write bne $t2, $s0, somewhere Fetch Decode Mem Write add $s0, $t0, $t1 Fetch Fetch Stall Fetch Decode Mem Write NOP Fetch Decode Mem Write NOP Fetch Decode Mem Write and $s4, $t0, $t1 Fetch Decode Mem Write What does this do to our CPI? 87

144 Option 1: The compiler Use branch delay slots. The next N instructions after a branch are always executed How big is N? For jumps? For branches? Good Simple hardware Bad N cannot change. 88

145 Delay slots. Cycles Taken bne $t2, $s0, somewhere Fetch Decode Mem Write back add $t2, $s4, $t1 Fetch Decode Mem Branch Delay add $s0, $t0, $t1 Fetch Decode... somewhere: sub $t2, $s0, $t3 Fetch Decode 89

146 But MIPS Only Has One Delay Slot! The second branch delay slot is expensive! Filling one slot is hard. Filling two is even more so. Solution!: Resolve branches in decode. [simplified datapath figure: PC, instruction memory, register file, branch comparator (CMP) in decode, sign extend, ALU, data memory] 90

147 For the rest of this slide deck, we will assume that MIPS has no branch delay slot. If you have questions about whether part of the homework/test/quiz makes this assumption, ask. 91

148 Option 2: Simple Prediction Can a processor tell the future? For non-taken branches, the new PC is ready immediately. Let's just assume the branch is not taken Also called branch prediction or control speculation What if we are wrong? Branch prediction vocabulary Prediction -- a guess about whether a branch will be taken or not taken Misprediction -- a prediction that turns out to be incorrect. Misprediction rate -- fraction of predictions that are incorrect. 92

149 Predict Not-taken Not-taken bne $t2, $s0, foo Cycles Fetch Decode Mem Write add $s4, $t6, $t5 Fetch Decode Mem Write Taken bne $t2, $s4, else Fetch Decode Mem Write and $s4, $t0, $t1... else: sub $t2, $s0, $t3 Fetch Decode Mem Write Fetch Squash Decode Mem We start the add, and then, when we discover the branch outcome, we squash it. Also called flushing the pipeline Just like a stall, flushing one instruction increases the branch's CPI by 1 93

150 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. [figure: stall_pc, stall_fetch_decode, inject_nop_decode_execute, and inject_nop_execute_mem control signals driving the Fetch, Decode, Reg file, Mem, and Write stages] 94

151 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. These signals are for stalling [same figure] 94

152 Flushing the Pipeline When we flush the pipe, we convert instructions into noops Turn off the write enables for write and mem stages Disable branches (i.e., make sure the ALU does not raise the branch signal). Instructions do not stop moving through the pipeline For the example on the previous slide, the inject_nop_decode_execute signal will go high for one cycle. These signals are for stalling This signal is for both stalling and flushing [same figure] 94

153 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Predict not-taken Pros? Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

154 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Loops are common Predict not-taken Pros? Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

155 Simple static Prediction static means before run time Many prediction schemes are possible Predict taken Pros? Loops are common Predict not-taken Pros? Not all branches are for loops. Backward taken/forward not taken The best of both worlds! Most loops have a backward branch at the bottom; those will predict taken Other (non-loop) branches will be not-taken. 95

156 Basic Pipeline Recap The PC is required in Fetch For branches, it's not known till decode. [simplified datapath figure: PC, instruction memory, register file, branch comparator in decode, sign extend, ALU, data memory] Should this be here? Branches only, one delay slot, simplified ISA, no control 96

157 Supporting Speculation [pipelined datapath figure, highlighted in stages across slides 157-162: PC calculation and branch prediction in fetch, squash/flush control in decode] Predict: Compute all possible next PCs in fetch. Choose one. The correct next PC is known in decode Flush as needed: Replace wrong-path instructions with no-ops. 97

163 Implementing Backward taken/forward not taken (BTFNT) A new branch predictor module determines what guess we are going to make. The BTFNT branch predictor has two inputs The sign of the offset -- to make the prediction The branch signal from the comparator -- to check if the prediction was correct. And two outputs The PC mux selector Steers execution in the predicted direction Re-directs execution when the branch resolves. A mis-predict signal that causes control to flush the pipe. 98
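
A Python sketch of the BTFNT rule itself (my own illustration of the predictor's decision, not the Verilog module):

def btfnt_predict(branch_offset):
    # Backward branches (negative offset) are predicted taken, forward ones not taken.
    return branch_offset < 0

def mispredicted(branch_offset, actually_taken):
    return btfnt_predict(branch_offset) != actually_taken

print(btfnt_predict(-16), btfnt_predict(8))   # loop-style branch: taken; forward branch: not taken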

164 Performance Impact (ex 1) ET = I * CPI * CT BTFNT has a misprediction rate of 20%. Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? 99

165 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05
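
The example above as a short Python check (same assumptions: 20% branches, 20% misprediction rate, a 1-cycle penalty, 10% longer cycle time):

branch_frac, mispredict_rate, penalty = 0.2, 0.2, 1
btfnt_cpi = branch_frac * mispredict_rate * (1 + penalty) + (1 - branch_frac * mispredict_rate) * 1
btfnt_et = btfnt_cpi * 1.1                             # relative ET = CPI * CT (IC cancels out)
stall_cpi = branch_frac * 2 + (1 - branch_frac) * 1    # every branch stalls one cycle
stall_et = stall_cpi * 1.0
print(btfnt_cpi, btfnt_et, stall_et / btfnt_et)        # 1.04, 1.144, ~1.05x speedup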

166 The Branch Delay Penalty The number of cycles between fetch and branch resolution is called the branch delay penalty It is the number of instructions that get flushed on a misprediction. It is the number of extra cycles the branch gets charged (i.e., the CPI for mispredicted branches goes up by the penalty) 101

167 Performance Impact (ex 2) ET = I * CPI * CT Our current design resolves branches in decode, so the branch delay penalty is 1 cycle. If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance? Mispredict rate = 20% Branches are 20% of instructions Resolve in Decode CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1 IC = IC ET = 1.04 Resolve in Execute CPI = 0.2*0.2*(1 + 2) + (1 - .2*.2)*1 = 1.08 CT = 0.8 IC = IC ET = 0.864 Speedup = 1.04/0.864 = 1.2 102

168 Performance Impact (ex 2) ET = I * CPI * CT Our current design resolves branches in decode, so the branch delay penalty is 1 cycle. If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance? Mispredict rate = 20% Branches are 20% of instructions 103

169 The Importance of Pipeline Depth There are two important parameters of the pipeline that determine the impact of branches on performance Branch decode time -- how many cycles does it take to identify a branch (in our case, this is less than 1) Branch resolution time -- cycles until the real branch outcome is known (in our case, this is 2 cycles) 104

170 Pentium 4 pipeline Branches take 19 cycles to resolve Stalling is not an option. Identifying a branch takes 4 cycles. 80% branch prediction accuracy is also not an option. Not quite as bad now, but BP is still very important.

171 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 What if this were 20 instead of 1? IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05

172 Performance Impact (ex 1) ET = I * CPI * CT Backward taken, forward not taken is 80% accurate Branches are 20% of instructions Changing the front end increases the cycle time by 10% What is the speedup of BTFNT compared to just stalling on every branch? BTFNT CPI = 0.2*0.2*(1 + 1) + (1 - .2*.2)*1 = 1.04 CT = 1.1 IC = IC ET = 1.144 Stall CPI = .2*2 + .8*1 = 1.2 CT = 1 What if this were 20 instead of 1? Branches are relatively infrequent (~20% of instructions), but Amdahl's Law tells us that we can't completely ignore this uncommon case. IC = IC ET = 1.2 Speedup = 1.2/1.144 = 1.05

173 Dynamic Branch Prediction Long pipes demand higher accuracy than static schemes can deliver. Instead of making the guess once (i.e., statically), make it every time we see the branch. Many ways to predict dynamically We will focus on predicting future behavior based on past behavior 107

174 Predictable control Use previous branch behavior to predict future branch behavior. When is branch behavior predictable? 108

175 Predictable control Use previous branch behavior to predict future branch behavior. When is branch behavior predictable? Loops -- for(i = 0; i < 10; i++) {} 9 taken branches, 1 not-taken branch. All 10 are pretty predictable. Run-time constants Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } } The branch is always taken or not taken. Correlated control a = 10; b = <something usually larger than a> if (a > 10) {} if (b > 10) {} Function calls LibraryFunction() -- Converts to a jr (jump register) instruction, but it's always the same. BaseClass * t; // t is usually of type SubClass t->someVirtualFunction() // will usually call the same function 109

176 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Cons? 110

177 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Dead simple. Keep a bit in the fetch stage. Works ok for simple loops. The compiler might be able to arrange things to make it work better. Cons? 110

178 Dynamic Predictor 1: The Simplest Thing Predict that this branch will go the same way as the previous branch did. Pros? Dead simple. Keep a bit in the fetch stage. Works ok for simple loops. The compiler might be able to arrange things to make it work better. Cons? An unpredictable branch in a loop will mess everything up. It can't tell the difference between branches 110

179 Dynamic Prediction 2: A table of bits Give each branch its own bit in a table Look up the prediction bit for the branch How big does the table need to be? Pros: Cons: 111

180 Dynamic Prediction 2: A table of bits Give each branch its own bit in a table Look up the prediction bit for the branch How big does the table need to be? Pros: It can differentiate between branches. Bad behavior by one won't mess up others... mostly. Cons: 111
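
A Python sketch (my own illustration) of such a table of 1-bit predictors, indexed by the low bits of the branch PC:

TABLE_BITS = 10
table = [False] * (1 << TABLE_BITS)              # False = predict not taken

def index(pc):
    return (pc >> 2) & ((1 << TABLE_BITS) - 1)   # drop the byte offset, keep low bits

def predict(pc):
    return table[index(pc)]

def update(pc, taken):
    table[index(pc)] = taken                     # remember the branch's last outcome

# A 10-iteration loop branch: mispredicts only on the first and last iterations.
pc = 0x400100
for i in range(10):
    taken = i < 9
    print("hit" if predict(pc) == taken else "miss")
    update(pc, taken)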


More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 35: Final Exam Review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Material from Earlier in the Semester Throughput and latency

More information

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14

MIPS Pipelining. Computer Organization Architectures for Embedded Computing. Wednesday 8 October 14 MIPS Pipelining Computer Organization Architectures for Embedded Computing Wednesday 8 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17 Short Pipelining Review! ! Readings! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 4.5-4.7!! (Over the next 3-4 lectures)! Lecture 17" Short Pipelining Review! 3! Processor components! Multicore processors and programming! Recap: Pipelining

More information

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16

4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3. Emil Sekerinski, McMaster University, Fall Term 2015/16 4. The Processor Computer Architecture COMP SCI 2GA3 / SFWR ENG 2GA3 Emil Sekerinski, McMaster University, Fall Term 2015/16 Instruction Execution Consider simplified MIPS: lw/sw rt, offset(rs) add/sub/and/or/slt

More information

ECE331: Hardware Organization and Design

ECE331: Hardware Organization and Design ECE331: Hardware Organization and Design Lecture 27: Midterm2 review Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Midterm 2 Review Midterm will cover Section 1.6: Processor

More information

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes

EE557--FALL 1999 MAKE-UP MIDTERM 1. Closed books, closed notes NAME: STUDENT NUMBER: EE557--FALL 1999 MAKE-UP MIDTERM 1 Closed books, closed notes Q1: /1 Q2: /1 Q3: /1 Q4: /1 Q5: /15 Q6: /1 TOTAL: /65 Grade: /25 1 QUESTION 1(Performance evaluation) 1 points We are

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin.   School of Information Science and Technology SIST CS 110 Computer Architecture Pipelining Guest Lecture: Shu Yin http://shtech.org/courses/ca/ School of Information Science and Technology SIST ShanghaiTech University Slides based on UC Berkley's CS61C

More information

Basic Pipelining Concepts

Basic Pipelining Concepts Basic ipelining oncepts Appendix A (recommended reading, not everything will be covered today) Basic pipelining ipeline hazards Data hazards ontrol hazards Structural hazards Multicycle operations Execution

More information

1 Hazards COMP2611 Fall 2015 Pipelined Processor

1 Hazards COMP2611 Fall 2015 Pipelined Processor 1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 4 Processor Part 2: Pipelining (Ch.4) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations from Mike

More information

14:332:331 Pipelined Datapath

14:332:331 Pipelined Datapath 14:332:331 Pipelined Datapath I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Single Cycle Disadvantages & Advantages Uses the clock cycle inefficiently the clock cycle must be timed to accommodate

More information

CPU Pipelining Issues

CPU Pipelining Issues CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! Finishing up Chapter 6 L20 Pipeline Issues 1 Structural Data Hazard Consider LOADS: Can we fix all

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

ELE 655 Microprocessor System Design

ELE 655 Microprocessor System Design ELE 655 Microprocessor System Design Section 2 Instruction Level Parallelism Class 1 Basic Pipeline Notes: Reg shows up two places but actually is the same register file Writes occur on the second half

More information

EECS Digital Design

EECS Digital Design EECS 150 -- Digital Design Lecture 11-- Processor Pipelining 2010-2-23 John Wawrzynek Today s lecture by John Lazzaro www-inst.eecs.berkeley.edu/~cs150 1 Today: Pipelining How to apply the performance

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

EITF20: Computer Architecture Part2.2.1: Pipeline-1

EITF20: Computer Architecture Part2.2.1: Pipeline-1 EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle

More information

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141

EECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 13 EE141 EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 13 Project Introduction You will design and optimize a RISC-V processor Phase 1: Design

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

More CPU Pipelining Issues

More CPU Pipelining Issues More CPU Pipelining Issues What have you been beating your head against? This pipe stuff makes my head hurt! Important Stuff: Study Session for Problem Set 5 tomorrow night (11/11) 5:30-9:00pm Study Session

More information

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T.

Introduction to Pipelining. Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. Introduction to Pipelining Silvina Hanono Wachman Computer Science & Artificial Intelligence Lab M.I.T. L15-1 Performance Measures Two metrics of interest when designing a system: 1. Latency: The delay

More information

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3. Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview

More information

Computer Architecture. Lecture 6.1: Fundamentals of

Computer Architecture. Lecture 6.1: Fundamentals of CS3350B Computer Architecture Winter 2015 Lecture 6.1: Fundamentals of Instructional Level Parallelism Marc Moreno Maza www.csd.uwo.ca/courses/cs3350b [Adapted from lectures on Computer Organization and

More information

Processor Design Pipelined Processor (II) Hung-Wei Tseng

Processor Design Pipelined Processor (II) Hung-Wei Tseng Processor Design Pipelined Processor (II) Hung-Wei Tseng Recap: Pipelining Break up the logic with pipeline registers into pipeline stages Each pipeline registers is clocked Each pipeline stage takes one

More information

CS 152 Computer Architecture and Engineering Lecture 4 Pipelining

CS 152 Computer Architecture and Engineering Lecture 4 Pipelining CS 152 Computer rchitecture and Engineering Lecture 4 Pipelining 2014-1-30 John Lazzaro (not a prof - John is always OK) T: Eric Love www-inst.eecs.berkeley.edu/~cs152/ Play: 1 otorola 68000 Next week

More information

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds? Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. 2. Define throughput. 3. Describe why using the clock rate of a processor is a bad way to measure performance. Provide

More information

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.

CENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs. Exam 2 April 12, 2012 You have 80 minutes to complete the exam. Please write your answers clearly and legibly on this exam paper. GRADE: Name. Class ID. 1. (22 pts) Circle the selected answer for T/F and

More information

Pipelining. CSC Friday, November 6, 2015

Pipelining. CSC Friday, November 6, 2015 Pipelining CSC 211.01 Friday, November 6, 2015 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file Not

More information

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017! Advanced Topics on Heterogeneous System Architectures Pipelining! Politecnico di Milano! Seminar Room @ DEIB! 30 November, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2 Outline!

More information

Pipeline Control Hazards and Instruction Variations

Pipeline Control Hazards and Instruction Variations Pipeline Control Hazards and Instruction Variations Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H Appendix 4.8 Goals for Today Recap: Data Hazards Control Hazards

More information

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath

The Big Picture: Where are We Now? EEM 486: Computer Architecture. Lecture 3. Designing a Single Cycle Datapath The Big Picture: Where are We Now? EEM 486: Computer Architecture Lecture 3 The Five Classic Components of a Computer Processor Input Control Memory Designing a Single Cycle path path Output Today s Topic:

More information

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture

The Processor. Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut. CSE3666: Introduction to Computer Architecture The Processor Z. Jerry Shi Department of Computer Science and Engineering University of Connecticut CSE3666: Introduction to Computer Architecture Introduction CPU performance factors Instruction count

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 Computer Architecture ELEC3441 RISC vs CISC Iron Law CPUTime = # of instruction program # of cycle instruction cycle Lecture 5 Pipelining Dr. Hayden Kwok-Hay So Department of Electrical and Electronic

More information

CS 61C: Great Ideas in Computer Architecture. MIPS CPU Datapath, Control Introduction

CS 61C: Great Ideas in Computer Architecture. MIPS CPU Datapath, Control Introduction CS 61C: Great Ideas in Computer Architecture MIPS CPU Datapath, Control Introduction Instructor: Alan Christopher 7/28/214 Summer 214 -- Lecture #2 1 Review of Last Lecture Critical path constrains clock

More information

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception

Outline. A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception Outline A pipelined datapath Pipelined control Data hazards and forwarding Data hazards and stalls Branch (control) hazards Exception 1 4 Which stage is the branch decision made? Case 1: 0 M u x 1 Add

More information

361 datapath.1. Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath

361 datapath.1. Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath 361 datapath.1 Computer Architecture EECS 361 Lecture 8: Designing a Single Cycle Datapath Outline of Today s Lecture Introduction Where are we with respect to the BIG picture? Questions and Administrative

More information

DEE 1053 Computer Organization Lecture 6: Pipelining

DEE 1053 Computer Organization Lecture 6: Pipelining Dept. Electronics Engineering, National Chiao Tung University DEE 1053 Computer Organization Lecture 6: Pipelining Dr. Tian-Sheuan Chang tschang@twins.ee.nctu.edu.tw Dept. Electronics Engineering National

More information

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl.

Lecture 4: Review of MIPS. Instruction formats, impl. of control and datapath, pipelined impl. Lecture 4: Review of MIPS Instruction formats, impl. of control and datapath, pipelined impl. 1 MIPS Instruction Types Data transfer: Load and store Integer arithmetic/logic Floating point arithmetic Control

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation

More information

Thomas Polzer Institut für Technische Informatik

Thomas Polzer Institut für Technische Informatik Thomas Polzer tpolzer@ecs.tuwien.ac.at Institut für Technische Informatik Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =

More information

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions. Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions Stage Instruction Fetch Instruction Decode Execution / Effective addr Memory access Write-back Abbreviation

More information

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ... CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100

More information

CSEE 3827: Fundamentals of Computer Systems

CSEE 3827: Fundamentals of Computer Systems CSEE 3827: Fundamentals of Computer Systems Lecture 21 and 22 April 22 and 27, 2009 martha@cs.columbia.edu Amdahl s Law Be aware when optimizing... T = improved Taffected improvement factor + T unaffected

More information

Computer Architecture

Computer Architecture Lecture 3: Pipelining Iakovos Mavroidis Computer Science Department University of Crete 1 Previous Lecture Measurements and metrics : Performance, Cost, Dependability, Power Guidelines and principles in

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware 4.1 Introduction We will examine two MIPS implementations

More information

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard Computer Architecture and Organization Pipeline: Control Hazard Lecturer: Prof. Yifeng Zhu Fall, 2015 Portions of these slides are derived from: Dave Patterson UCB Lec 15.1 Pipelining Outline Introduction

More information

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution

EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution EC 413 Computer Organization - Fall 2017 Problem Set 3 Problem Set 3 Solution Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version

5 th Edition. The Processor We will examine two MIPS implementations A simplified version A more realistic pipelined version COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 5 th Edition Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined

More information

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Moore s Law Gordon Moore @ Intel (1965) 2 Computer Architecture Trends (1)

More information

Processor Design Pipelined Processor. Hung-Wei Tseng

Processor Design Pipelined Processor. Hung-Wei Tseng Processor Design Pipelined Processor Hung-Wei Tseng Pipelining 7 Pipelining Break up the logic with isters into pipeline stages Each stage can act on different instruction/data States/Control signals of

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and

More information

Lecture 2: Processor and Pipelining 1

Lecture 2: Processor and Pipelining 1 The Simple BIG Picture! Chapter 3 Additional Slides The Processor and Pipelining CENG 6332 2 Datapath vs Control Datapath signals Control Points Controller Datapath: Storage, FU, interconnect sufficient

More information

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction

More information

Appendix C: Pipelining: Basic and Intermediate Concepts

Appendix C: Pipelining: Basic and Intermediate Concepts Appendix C: Pipelining: Basic and Intermediate Concepts Key ideas and simple pipeline (Section C.1) Hazards (Sections C.2 and C.3) Structural hazards Data hazards Control hazards Exceptions (Section C.4)

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information