Implementing a MIPS Processor. Readings: 4.1-4.11
Goals for this Class: Understand how CPUs run programs. How do we express the computation to the CPU? How does the CPU execute it? How does the CPU support other system components (e.g., the OS)? What techniques and technologies are involved, and how do they work? Understand why CPU performance (and other metrics) varies. How does CPU design impact performance? What trade-offs are involved in designing a CPU? How can we meaningfully measure and compare computer systems? Understand why program performance varies. How do program characteristics affect performance? How can we improve a program's performance by considering the CPU running it? How do other system components impact program performance?
Goals: Understand how the 5-stage MIPS pipeline works. See examples of how architecture impacts ISA design. Understand how the pipeline affects performance. Understand hazards and how to avoid them: structural hazards, data hazards, control hazards.
Processor Design in Two Acts Act I: A single-cycle CPU
Foreshadowing. Act I: A Single-cycle Processor. Simplest design. Not how many real machines work (maybe some deeply embedded processors). Figure out the basic parts; what it takes to execute instructions. Act II: A Pipelined Processor. This is how many real machines work. Exploit parallelism by executing multiple instructions at once.
Target ISA. We will focus on part of MIPS: enough to run into the interesting issues. Memory operations, a few arithmetic/logical operations (generalizing is straightforward), BEQ and J. This corresponds pretty directly to what you'll be implementing in 141L.
Basic Steps for Execution. Fetch an instruction from the instruction store. Decode it: what does this instruction do? Gather inputs: from the register file, from memory. Perform the operation. Write the outputs: to the register file or memory. Determine the next instruction to execute.
The Processor Design Algorithm. Once you have an ISA: Design/draw the datapath. Identify and instantiate the hardware for your architectural state. For each instruction: simulate the instruction; add and connect the datapath elements it requires. Is it workable? If not, fix it. Design the control. For each instruction: simulate the instruction. What control lines do you need? How will you compute their value? Modify the control accordingly. Is it workable? If not, fix it. You've already done much of this in 141L.
Arithmetic; R-Type. Inst = Mem[PC]; PC = PC + 4; REG[rd] = REG[rs] op REG[rt]. Encoding -- bits: 31:26, 25:21, 20:16, 15:11, 10:6, 5:0; name: op, rs, rt, rd, shamt, funct; # bits: 6, 5, 5, 5, 5, 6.
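The field boundaries in the R-type table can be checked with a short sketch. This is illustrative Python, not part of the processor design itself; the example word 0x01098020 encodes add $s0, $t0, $t1 (rs = 8 for $t0, rt = 9 for $t1, rd = 16 for $s0, funct = 0x20 for add).

```python
# Sketch: extracting the R-type fields of a 32-bit MIPS instruction word.
# Field positions follow the table above: op 31:26, rs 25:21, rt 20:16,
# rd 15:11, shamt 10:6, funct 5:0.

def decode_rtype(inst):
    """Return the six R-type fields of a 32-bit instruction word."""
    return {
        "op":    (inst >> 26) & 0x3F,
        "rs":    (inst >> 21) & 0x1F,
        "rt":    (inst >> 16) & 0x1F,
        "rd":    (inst >> 11) & 0x1F,
        "shamt": (inst >> 6)  & 0x1F,
        "funct": inst & 0x3F,
    }

fields = decode_rtype(0x01098020)   # add $s0, $t0, $t1
```

The same shift-and-mask pattern is what the decode stage does in hardware, just with wires instead of arithmetic.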
ADDI; I-Type. PC = PC + 4; REG[rt] = REG[rs] op SignExtImm. Encoding -- bits: 31:26, 25:21, 20:16, 15:0; name: op, rs, rt, imm; # bits: 6, 5, 5, 16.
Load Word. PC = PC + 4; REG[rt] = MEM[SignExtImm + REG[rs]]. Encoding -- bits: 31:26, 25:21, 20:16, 15:0; name: op, rs, rt, immediate; # bits: 6, 5, 5, 16.
Store Word. PC = PC + 4; MEM[SignExtImm + REG[rs]] = REG[rt]. Encoding -- bits: 31:26, 25:21, 20:16, 15:0; name: op, rs, rt, immediate; # bits: 6, 5, 5, 16.
Branch-equal; I-Type. PC = (REG[rs] == REG[rt]) ? PC + 4 + SignExtImm * 4 : PC + 4. Encoding -- bits: 31:26, 25:21, 20:16, 15:0; name: op, rs, rt, displacement; # bits: 6, 5, 5, 16.
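The branch-equal PC update rule can be sketched directly. This is an illustrative sketch (the function names are made up for the example); the key detail is that the 16-bit displacement is sign-extended and scaled by 4, since it counts instructions, not bytes.

```python
# Sketch of the beq next-PC rule: sign-extend the 16-bit displacement,
# scale by 4, and add to PC + 4 only when the registers are equal.

def sign_extend16(x):
    """Interpret a 16-bit value as a signed integer."""
    return x - 0x10000 if x & 0x8000 else x

def next_pc_beq(pc, rs_val, rt_val, disp16):
    """Return the next PC for beq rs, rt, disp16."""
    if rs_val == rt_val:
        return pc + 4 + sign_extend16(disp16) * 4
    return pc + 4

# A taken branch with displacement -1 (0xFFFF) loops back to the branch itself:
# next_pc_beq(0x1000, 7, 7, 0xFFFF) -> 0x1000
```

Note that the displacement is relative to PC + 4, not to the branch's own address; that is why a displacement of -1 re-executes the branch.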
A Single-cycle Processor. Performance refresher: ET = IC * CPI * CT. Single-cycle CPI == 1; that sounds great. Unfortunately, single-cycle CT is large. Even RISC instructions take quite a bit of effort to execute, and this is a lot to do in one cycle.
Our Hardware is Mostly Idle. Cycle time = 15 ns; the slowest module (the ALU) is only ~6 ns.
Processor Design in Two Acts Act II: A pipelined CPU
Pipelining Review 17
Pipelining (diagram): in cycle 1 (ending at 10 ns), a latch feeds Logic (10 ns) into a second latch; cycles 2 and 3 repeat the same thing, ending at 20 ns and 30 ns. What's the throughput? What's the latency for one unit of work?
Pipelining. Break up the logic with latches into pipeline stages. Each stage can act on different data. Latches hold the inputs to their stage. Every clock cycle, data transfers from one pipe stage to the next: the single Logic (10 ns) block becomes five Logic (2 ns) stages separated by latches.
Pipelining (diagram): with 2 ns stages, cycle boundaries fall at 2, 4, 6, 8, 10, and 12 ns, and several units of work are in flight at once, one per stage. What's the latency for one unit of work? What's the throughput?
Critical path review. The critical path is the longest possible delay between two registers in a design. The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path. Lengthening or shortening non-critical paths does not change performance. Ideally, all paths are about the same length.
Pipelining and Logic. Inserting pipeline registers splits the logic into three stages; hopefully, the critical path is reduced to 1/3 of its original length.
Limitations of Pipelining. You cannot pipeline forever. Some logic cannot be pipelined arbitrarily -- memories. Some logic is inconvenient to pipeline: how do you insert a register in the middle of a multiplier? Registers have a cost. They cost area -- choose narrow points in the logic. They cost energy -- latches don't do any useful work. They cost time -- extra logic delay, plus set-up and hold times. Pipelining may not affect the critical path as you expect.
Pipelining Overhead. Logic delay (LD) -- how long the logic takes (i.e., the useful part). Set-up time (ST) -- how long before the clock edge the inputs to a register need to be ready. Relatively short -- 0.036 ns on our FPGAs. Register delay (RD) -- delay through the internals of the register. Longer -- 1.5 ns on our FPGAs. Much, much shorter in raw CMOS.
Pipelining Overhead (timing diagram): the register inputs must be stable a set-up time before the clock edge; after the edge, the register output holds the old data for a register delay before presenting the new data.
Pipelining Overhead. Logic delay (LD) -- how long the logic takes (i.e., the useful part). Set-up time (ST) -- how long before the clock edge the inputs to a register need to be ready. Register delay (RD) -- delay through the internals of the register. CTbase -- cycle time before pipelining: CTbase = LD + ST + RD. CTpipe -- cycle time after pipelining N times: CTpipe = ST + RD + LD/N. Total time = N*ST + N*RD + LD.
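The CTpipe formula makes the diminishing returns easy to see. A quick sketch using the slide's FPGA numbers (ST = 0.036 ns, RD = 1.5 ns) and assuming LD = 15 ns of logic:

```python
# Sketch of CT_pipe = ST + RD + LD/N. As N grows, cycle time approaches
# the fixed register overhead ST + RD, so speedup saturates.

def ct_pipe(ld, st, rd, n):
    """Cycle time (ns) after splitting ld ns of logic into n stages."""
    return st + rd + ld / n

LD, ST, RD = 15.0, 0.036, 1.5   # ns; LD is an assumed logic delay

ct_base = ct_pipe(LD, ST, RD, 1)   # the unpipelined cycle time
for n in (2, 5, 10, 50):
    speedup = ct_base / ct_pipe(LD, ST, RD, n)
    print(f"N={n}: CT={ct_pipe(LD, ST, RD, n):.3f} ns, speedup={speedup:.2f}x")
```

Even at N = 5 the speedup comes out to about 3.6x rather than 5x, because the ST + RD overhead is paid once per stage and does not shrink with N.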
Pipelining Difficulties. Fast logic and slow logic alternate in a real design; you can't always pipeline how you would like.
How to pipeline a processor. Break each instruction into pieces -- remember the basic algorithm for execution: fetch, decode, collect arguments, execute, write results, compute the next PC. The classic 5-stage MIPS pipeline: Fetch -- read the instruction. Decode -- decode and read from the register file. Execute -- perform arithmetic ops and address calculations. Memory -- access data memory. Write -- store results in the register file.
Pipelining a processor (diagram, drawn three ways: reality, easier to draw, and pipelined). The stages are Fetch, Decode/reg read, Execute, Mem, and Reg Write.
Pipelined Datapath (diagram: PC and a +4 adder feed the instruction memory; the register file has two read ports and a write port; a sign-extend unit widens the 16-bit immediate to 32 bits; shift-left-2 and an adder compute the branch target; the ALU feeds the data memory address).
Pipelined Datapath (diagram, now with pipeline registers inserted: IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB).
Pipelined Datapath (animation: a sequence of instructions -- add, lw, sub, subi, add, add -- flows through the datapath, one instruction per stage per cycle).
Pipelined Datapath (diagram). This is a bug: rd must flow through the pipeline with the instruction. The register file's write-address signal needs to come from the WB stage.
Pipelined Control. Control lives in the decode stage; control signals flow down the pipe with the instructions they correspond to (ex ctrl signals at Dec/Exec, mem ctrl signals at Exec/Mem, write ctrl signals at Mem/WB).
Impact of Pipelining. ET = IC * CPI * CT. Break the processor into P pipe stages: CTnew = CTold/P. CPInew = CPIold. CPI is an average: cycles/instruction. The latency of one instruction is P cycles, but the average CPI = 1. ICnew = ICold. Total speedup should be 5x! Except for the overhead of the pipeline registers, and the realities of logic design...
Module delays: IMem 2.77 ns; Ctrl 0.797 ns; ArgBMux 1.124 ns; ALU 5.94 ns; DMem 2.098 ns; WriteRegMux 2.628 ns; RegFile 1.424 ns.
Single-cycle cycle time (diagram: a stacked bar of the PC adder, IMem, RegFile read, mux, ALU, DMem read/write, control, mux, and RegFile write delays). Cycle time = 15.8 ns; clock speed = 63 MHz. What we want: 5 balanced pipeline stages. Each stage would take 3.16 ns. Clock speed: 316 MHz. Speedup: 5x. What's easy: 5 unbalanced pipeline stages. Longest stage: 6.14 ns <- this is the critical path. Shortest stage: 0.29 ns. Clock speed: 163 MHz. Speedup: 2.6x.
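The balanced-vs-unbalanced comparison follows from one rule: the clock period is the delay of the slowest stage. A sketch -- the 15.8 ns base cycle time and 6.14 ns / 0.29 ns extreme stages come from the slide, while the other unbalanced stage delays are hypothetical values chosen to sum to 15.8 ns:

```python
# Sketch: pipeline speedup is limited by the slowest stage, so an
# unbalanced split wastes most of the potential speedup.

def pipeline_speedup(base_ct_ns, stage_delays_ns):
    """Speedup over a single-cycle design with cycle time base_ct_ns."""
    return base_ct_ns / max(stage_delays_ns)

base_ct = 15.8                               # ns, single-cycle design
balanced   = [base_ct / 5] * 5               # 3.16 ns per stage
unbalanced = [2.77, 1.7, 6.14, 4.9, 0.29]    # hypothetical split, sums to 15.8

print(f"balanced:   {pipeline_speedup(base_ct, balanced):.1f}x")
print(f"unbalanced: {pipeline_speedup(base_ct, unbalanced):.1f}x")
```

The unbalanced case gives 15.8 / 6.14, i.e. about 2.6x -- shortening any stage other than the 6.14 ns one changes nothing.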
A Structural Hazard. Both the decode and write stages have to access the register file, but there is only one register file. A structural hazard!! Solution: write early, read late. Writes occur at the clock edge and complete long before the end of the cycle. This leaves enough time for the outputs to settle for the reads. Hazard avoided!
Quiz 4. 1. For my CPU, CPI = 1 and CT = 4 ns. a. What is the maximum possible speedup I can achieve by turning it into an 8-stage pipeline? b. What will the new CT, CPI, and IC be in the ideal case? c. Give 2 reasons why achieving the speedup in (a) will be very difficult. 2. Instructions in a VLIW instruction word are executed: a. sequentially, b. in parallel, c. conditionally, d. optionally. 3. List x86, MIPS, and ARM in order from most CISC-like to most RISC-like. 4. List 3 characteristics of MIPS that make it a RISC ISA. 5. Give one advantage of the multiple, complex addressing modes that ARM and x86 provide.
HW 1 Score Distribution (histogram: # of students by grade, F through A)
HW 2 Score Distribution (histogram: # of students by grade, F through A)
Projected Grade Distribution (histogram: # of students by grade, F through A)
What You Like. More real-world stuff. More class interactions. Cool factoids, side notes, etc. Examples in the slides. Jokes.
What You Would Like to See Improved. More time for the quizzes. Too much math. Going too fast. The subject of the class. The pace should be faster. Late-breaking changes to the homeworks. Too much discussion of latency and metrics. It'd be nice to have the slides before lecture. Not enough time to ask questions. MIPS is not the most relevant ISA. More precise instructions for HW and quizzes. The pace should be slower. The software (SPIM, Quartus, and ModelSim) is hard to use.
Pipelining is Tricky. Simple pipelining is easy if the data flows in one direction only and the stages are independent. In fact, the tool can do this automatically via retiming (if you are curious, experiment with this in Quartus). Not so for processors: branch instructions affect the next PC -- backward flow; instructions need values computed by previous instructions -- not independent.
Not just tricky, Hazardous! Hazards are situations where pipelining does not work as elegantly as we would like. Three kinds: Structural hazards -- we have run out of a hardware resource. Data hazards -- an input is not available on the cycle it is needed. Control hazards -- the next instruction is not known. Dealing with hazards increases complexity or decreases performance (or both). Dealing efficiently with hazards is much of what makes processor design hard. That, and the Quartus tools ;-)
Hazards: Key Points. Hazards cause imperfect pipelining; they prevent us from achieving CPI = 1. They are generally caused by counter-flow data dependences in the pipeline. Three kinds: Structural -- contention for hardware resources. Data -- a data value is not available when/where it is needed. Control -- the next instruction to execute is not known. Two ways to deal with hazards: Removal -- add hardware and/or complexity to work around the hazard so it does not occur. Stall -- hold up the execution of new instructions; let the older instructions finish, so the hazard will clear.
Data Dependences. A data dependence occurs whenever one instruction needs a value produced by another. Register values; also memory accesses (more on this later). Example: add $s0, $t0, $t1; sub $t2, $s0, $t3; add $t3, $s0, $t4; add $t3, $t2, $t4.
Dependences in the pipeline. In our simple pipeline, these instructions cause a data hazard: add $s0, $t0, $t1 followed immediately by sub $t2, $s0, $t3 -- the sub reads $s0 in Decode before the add has written it back.
How can we fix it? Ideas?
Solution 1: Make the compiler deal with it. Expose hazards to the big-A Architecture: a result is available N instructions after the instruction that generates it. In the meantime, the register file has the old value. This is called a register delay slot. What is N? Can it change? N = 2 for our design. (Diagram: add $s0, $t0, $t1; delay slot; delay slot; sub $t2, $s0, $t3.)
Compiling for delay slots. The compiler must fill the delay slots -- ideally with useful instructions, but nops will work too. Before: add $s0, $t0, $t1; sub $t2, $s0, $t3; add $t3, $s0, $t4; and $t7, $t5, $t4. After: add $s0, $t0, $t1; and $t7, $t5, $t4; nop; sub $t2, $s0, $t3; add $t3, $s0, $t4.
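The fallback half of that job -- inserting nops when no useful instruction fills a slot -- can be sketched as a small pass. This is an illustration, not a real compiler: instructions are (dest, src1, src2) register-name tuples, and N = 2 matches our design's delay slots.

```python
# Sketch: insert nops so no instruction reads a register written by one
# of the previous N issued instructions (N = 2 register delay slots).
# A real compiler would first try to fill the slots with useful,
# independent instructions, as in the slide's example.

N = 2

def insert_nops(program):
    """program: list of (dest, src1, src2, ...) register-name tuples."""
    out = []
    for dest, *srcs in program:
        # While any of the last N issued slots wrote one of our sources,
        # pad with a nop (nops never match a register name).
        while any(out[-k][0] in srcs
                  for k in range(1, min(N, len(out)) + 1)):
            out.append(("nop",))
        out.append((dest, *srcs))
    return out
```

On the slide's example (with the independent and already hoisted between the add and the sub), this pass emits exactly one nop, reproducing the "After" sequence.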
Solution 2: Stall. When you need a value that is not ready, stall: suspend the execution of the executing instruction and those that follow. This introduces a pipeline bubble. A bubble is a lack of work to do; it propagates through the pipeline like nop instructions. (Diagram: sub $t2, $s0, $t3 holds in Decode for two extra cycles behind add $s0, $t0, $t1; both it and the following add are stalled; nops fill the bubble; one instruction or nop completes each cycle.)
Stalling the pipeline. Freeze all pipeline stages before the stage where the hazard occurred: disable the PC update and disable the pipeline registers. This is equivalent to inserting nops into the pipeline when a hazard exists: insert nop control bits at the stalled stage (decode in our example). How is this solution still potentially better than relying on the compiler? The compiler can still act like there are delay slots to avoid stalls, and implementation details are not exposed in the ISA.
Calculating CPI for Stalls. In this case, the bubble lasts for 2 cycles; as a result, in cycles 6 and 7, no instruction completes. What happens to CPI? In the absence of stalls, CPI is one, since one instruction completes per cycle. If an instruction stalls for N cycles, its CPI goes up by N. (Diagram: add $s0, $t0, $t1; sub $t2, $s0, $t3 stalls in Decode; add $t3, $s0, $t4; and $t3, $t2, $t4.)
Midterm and Midterm Review. Midterm in 1 week! Midterm review in 2 days! I'll just take questions from you -- come prepared with questions. Today: go over the last 2 quizzes; more about hazards.
Quiz 4 (answers). 1. For my CPU, CPI = 1 and CT = 4 ns. a. Maximum possible speedup from an 8-stage pipeline: 8x. b. In the ideal case: CT = CT/8, CPI = 1, IC = IC. c. Why achieving the speedup in (a) will be very difficult: set-up time; difficult-to-pipeline logic; register delay. 2. Instructions in a VLIW instruction word are executed in parallel. 3. From most CISC-like to most RISC-like: x86, ARM, MIPS. 4. Characteristics of MIPS that make it a RISC ISA: fixed instruction size; few instruction formats; orthogonality; general-purpose registers; designed for fast implementations; few addressing modes. 5. Advantages of the multiple, complex addressing modes that ARM and x86 provide: reduced code size; user-friendliness for hand-written assembly.
Hardware for Stalling. Turn off the enables on the earlier pipeline stages: the earlier stages will keep processing the same instruction over and over, and no new instructions get fetched. Insert control and data values corresponding to a nop into the downstream pipeline register: this creates the bubble, and the nops flow downstream, doing nothing. When the stall is over, re-enable the pipeline registers: the instructions in the upstream stages start moving again, and new instructions start entering the pipeline. (Diagram signals: inject_nop_decode_execute; stall_pc; stall_fetch_decode; stages: Fetch, Decode/reg read, Exec, Mem, Write.)
The Impact of Stalling On Performance. ET = IC * CPI * CT. IC and CT are constant. What is the impact of stalling on CPI? What do we need to know to figure it out?
The Impact of Stalling On Performance. ET = IC * CPI * CT; IC and CT are constant. Fraction of instructions that stall: 30%. Baseline CPI = 1. Stall CPI = 1 + 2 = 3. New CPI = 0.3*3 + 0.7*1 = 1.6.
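The weighted-average calculation above generalizes to any stall fraction and stall length; a one-function sketch:

```python
# Sketch: overall CPI as a weighted average over stalled and
# unstalled instructions.

def average_cpi(stall_fraction, stall_cycles, base_cpi=1.0):
    """CPI when stall_fraction of instructions each stall stall_cycles."""
    stalled = base_cpi + stall_cycles
    return stall_fraction * stalled + (1 - stall_fraction) * base_cpi

print(average_cpi(0.3, 2))   # the slide's example: 0.3*3 + 0.7*1 = 1.6
```

Since IC and CT are unchanged, this 1.6/1.0 ratio is also the slowdown in execution time.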
Solution 3: Bypassing/Forwarding. Data values are computed in Exec and Mem but published in Write. The data exists! We should use it. (Diagram: inputs are needed in Exec; results are known by the end of Exec or Mem; results are "published" to the registers in Write.)
Bypassing or Forwarding. Take the values, wherever they are. (Diagram: the result of add $s0, $t0, $t1 is forwarded directly to the Exec stage of the following sub $t2, $s0, $t3.)
Forwarding Paths (diagram: add $s0, $t0, $t1 followed by sub $t2, $s0, $t3 one, two, or three instructions later; each distance takes a different forwarding path back to the ALU).
Forwarding in Hardware (diagram: the pipelined datapath extended with forwarding connections from the later pipeline registers back to the ALU inputs).
Hardware Cost of Forwarding. In our pipeline, adding forwarding required relatively little hardware. For deeper pipelines it gets much more expensive. Roughly: ALUs * pipe_stages you need to forward over. Some modern processors have multiple ALUs (4-5) and deeper pipelines (4-5 stages to forward across). Not all forwarding paths need to be supported; if a path does not exist, the processor will need to stall.
Forwarding for Loads. Values can also come from the Mem stage. In this case, forwarding to the next instruction is not possible (time travel presents significant implementation challenges), but it is not necessary for later instructions. Choices: Always stall! Stall when needed! Expose this in the ISA. Which does MIPS do? (Diagram: lw $s0, 0($t0) followed by sub $t2, $s0, $t3.)
Pros and Cons. Punt to the compiler: this is what MIPS does and is the source of the load-delay slot. Future versions must emulate a single load-delay slot; the compiler fills the slot if possible, or drops in a nop. Always stall: the compiler is oblivious, but performance will suffer -- 10-15% of instructions are loads, and the CPI for loads will be 2. Forward when possible, stall otherwise: here the compiler can order instructions to avoid the stall; if the compiler can't fix it, the hardware will.
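The CPI cost of these choices can be compared using the slide's 10-15% load frequency and a 1-cycle load-use stall (load CPI = 2, as on the slide). The 50% "compiler separates the load from its use" rate below is a made-up illustration, not a measured number:

```python
# Sketch: CPI under two of the load-hazard strategies above.

def cpi_always_stall(load_frac):
    # Every load pays one stall cycle (load CPI = 2).
    return 1 + load_frac * 1

def cpi_forward_or_stall(load_frac, unscheduled_frac):
    # Only loads the compiler could not separate from their use stall.
    return 1 + load_frac * unscheduled_frac * 1

for f in (0.10, 0.15):
    print(f"loads={f:.0%}: always-stall CPI={cpi_always_stall(f):.3f}, "
          f"forward+stall CPI={cpi_forward_or_stall(f, 0.5):.3f}")
```

"Punt to the compiler" lands between the two: a filled slot costs nothing, while an unfillable slot costs a nop, which raises IC instead of CPI.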
Stalling for Load. (Diagram: Load $s1, 0($s1) followed by Addi $t1, $s1, 4; the addi holds in Decode, and all stages of the pipeline earlier than the stall stand still. Only four stages are occupied -- what's in Mem?) To stall, we insert a nop in place of the instruction and freeze the earlier stages of the pipeline.
Inserting Noops. (Diagram: the nop inserted at Decode has now reached Mem.) To stall, we insert a nop in place of the instruction and freeze the earlier stages of the pipeline.
Quiz 1 Score Distribution (histogram: # of students by grade, F through A)
Quiz 2 Score Distribution (histogram: # of students by grade, F through A)
Quiz 3 Score Distribution (histogram: # of students by grade, F through A)
Quiz 4 Score Distribution (histogram: # of students by grade, F through A)
Key Points: Control Hazards. Control hazards occur when we don't know what the next instruction is. Caused by branches and jumps. Strategies for dealing with them: Stall. Guess! -- leads to speculation. Flushing the pipeline. Strategies for making better guesses. Understand the difference between stall and flush.
Computing the PC Normally. Non-branch instruction: PC = PC + 4. When is the PC ready? (Diagram: back-to-back instructions each need the next PC at the start of the following cycle.)
Fixing the Ubiquitous Control Hazard. We need to know if an instruction is a branch in the fetch stage! How can we accomplish this? Solution 1: partially decode the instruction in fetch; you just need to know if it's a branch, a jump, or something else. Solution 2: we'll discuss later.
Computing the PC Normally
Pre-decode in the fetch unit: PC = PC + 4. The PC is ready for the next fetch cycle.
[Pipeline diagram: add $s0, $t0, $t1 followed by sub instructions, one fetched per cycle with no bubbles.] 84
Computing the PC for Branches
Branch instructions: bne $s1, $s2, offset
if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }
When is the value ready?
[Pipeline diagram: sll $s4, $t6, $t5; bne $t2, $s0, somewhere; add $s0, $t0, $t1; and $s4, $t0, $t1 -- the bne's target is not yet known while the following instructions are being fetched.] 85
Computing the PC for Jumps
Jump instructions: jr $s1 -- jump register. PC = $s1. When is the value ready?
[Pipeline diagram: sll $s4, $t6, $t5; jr $s4; add $s0, $t0, $t1.] 86
Dealing with Branches: Option 0 -- Stall
[Pipeline diagram: sll; then bne $t2, $s0, somewhere; the following add $s0, $t0, $t1 stalls in fetch while two NOPs flow down the pipe, then and $s4, $t0, $t1 proceeds normally.]
What does this do to our CPI? 87
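The CPI question above can be answered with a quick model (a sketch; the penalty value depends on where the branch resolves, which varies between slides in this deck):

```python
# CPI model for a pipeline that stalls on every branch.
# Assumes a base CPI of 1 for all other instructions.
def cpi_with_branch_stalls(branch_frac, penalty, base_cpi=1.0):
    """CPI when every branch pays `penalty` bubble cycles."""
    return base_cpi + branch_frac * penalty

# With branches at 20% of instructions:
print(cpi_with_branch_stalls(0.20, 2))  # 2 bubbles (as drawn above) -> 1.4
print(cpi_with_branch_stalls(0.20, 1))  # resolve in decode -> 1.2
```

Each bubble cycle charged to a branch adds branch_frac cycles to the average CPI, which is why later slides work so hard to shrink the penalty.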
Option 1: The Compiler
Use branch delay slots: the next N instructions after a branch are always executed.
How big is N? For jumps? For branches?
Good: Simple hardware.
Bad: N cannot change. 88
Delay Slots
[Pipeline diagram: a taken bne $t2, $s0, somewhere followed by its branch delay slot instruction, add $t2, $s4, $t1, which always executes; the next instruction is replaced by the branch target, somewhere: sub $t2, $s0, $t3.] 89
But MIPS Only Has One Delay Slot!
The second branch delay slot is expensive! Filling one slot is hard. Filling two is even more so.
Solution!: Resolve branches in decode.
[Datapath diagram: the comparator (CMP) moves into the decode stage so the branch target is known one cycle after fetch. Shown: PC, Instruction Memory, Register File, ALU, Data Memory, branch/jump control, sign extension, and the PC mux.] 90
For the rest of this slide deck, we will assume that MIPS has no branch delay slot. Ask if you have questions about whether part of the homework/test/quiz makes this assumption. 91
Option 2: Simple Prediction
Can a processor tell the future? For non-taken branches, the new PC is ready immediately. Let's just assume the branch is not taken. Also called branch prediction or control speculation. What if we are wrong?
Branch prediction vocabulary:
Prediction -- a guess about whether a branch will be taken or not taken.
Misprediction -- a prediction that turns out to be incorrect.
Misprediction rate -- the fraction of predictions that are incorrect. 92
Predict Not-taken
[Pipeline diagram: a not-taken bne $t2, $s0, foo proceeds with no penalty; a taken bne $t2, $s4, else fetches the fall-through instruction, which is squashed, and then fetches the target, else: sub $t2, $s0, $t3.]
We start the fall-through instruction, and then, when we discover the branch outcome, we squash it. Also called flushing the pipeline. Just like a stall, flushing one instruction increases the branch's CPI by 1. 93
Flushing the Pipeline
When we flush the pipe, we convert instructions into noops:
Turn off the write enables for the write and mem stages.
Disable branches (i.e., make sure the ALU does not raise the branch signal).
Instructions do not stop moving through the pipeline.
For the example on the previous slide, the inject_nop_decode_execute signal goes high for one cycle.
[Datapath diagram: stall_pc and stall_fetch_decode are signals for stalling; inject_nop_decode_execute serves both stalling and flushing; inject_nop_execute_mem feeds the later stages.] 94
Simple Static Prediction
"Static" means before run time. Many prediction schemes are possible:
Predict taken. Pros? Loops are common.
Predict not-taken. Pros? Not all branches are for loops.
Backward taken/forward not taken: the best of both worlds! Most loops have a backward branch at the bottom; those will predict taken. Other (non-loop) branches will predict not-taken. 95
Basic Pipeline Recap
The PC is required in Fetch. For branches, it's not known until Decode.
[Datapath diagram: branches only, one delay slot, simplified ISA, no control logic shown. An annotation asks "Should this be here?" about part of the PC path.] 96
Supporting Speculation
[Datapath diagram: a Branch Predict module and squash/flush control are added to the pipeline; nop-injection muxes sit between stages, and the fetch stage computes both PC + 4 and the branch target so either can be selected.]
Predict: Compute all possible next PCs in fetch. Choose one.
The correct next PC is known in decode. Flush as needed: replace wrong-path instructions with no-ops. 97
Implementing Backward Taken/Forward Not Taken (BTFNT)
A new branch predictor module determines what guess we are going to make.
The BTFNT branch predictor has two inputs:
The sign of the offset -- to make the prediction.
The branch signal from the comparator -- to check if the prediction was correct.
And two outputs:
The PC mux selector -- steers execution in the predicted direction, and re-directs execution when the branch resolves.
A mis-predict signal that causes control to flush the pipe. 98
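The prediction half of the module above fits in a couple of lines (a Python sketch under the deck's assumptions; the function names are ours, not from any real RTL):

```python
# BTFNT: backward branches (negative offset) are predicted taken --
# they are usually loop-bottom branches -- and forward branches are
# predicted not taken.
def btfnt_predict(offset):
    return offset < 0  # True means "predict taken"

# The mis-predict signal: the prediction disagrees with the
# comparator's actual outcome, so control must flush the pipe.
def btfnt_mispredict(offset, actually_taken):
    return btfnt_predict(offset) != actually_taken

print(btfnt_predict(-16))          # backward loop branch -> True
print(btfnt_mispredict(8, False))  # forward, not taken -> False (correct)
```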
Performance Impact (ex 1)
ET = IC * CPI * CT
BTFNT has a misprediction rate of 20%. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%.
What is the speedup of BTFNT compared to just stalling on every branch? 99
Performance Impact (ex 1)
ET = IC * CPI * CT
Backward taken/forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%.
What is the speedup of BTFNT compared to just stalling on every branch?
BTFNT: CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04; CT = 1.1; IC = IC; ET = 1.144
Stall: CPI = 0.2*2 + 0.8*1 = 1.2; CT = 1; IC = IC; ET = 1.2
Speedup = 1.2/1.144 = 1.05 100
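The arithmetic above can be double-checked with a few lines of Python (values taken straight from the slide; IC cancels, so we compare CPI * CT directly):

```python
# ET = IC * CPI * CT; relative performance, so IC drops out.
mispredict  = 0.20   # BTFNT is 80% accurate
branch_frac = 0.20   # branches are 20% of instructions
penalty     = 1      # branches resolve in decode

frac_mis  = branch_frac * mispredict
cpi_btfnt = frac_mis * (1 + penalty) + (1 - frac_mis) * 1
et_btfnt  = cpi_btfnt * 1.1          # front-end change adds 10% to cycle time

cpi_stall = branch_frac * 2 + (1 - branch_frac) * 1
et_stall  = cpi_stall * 1.0

print(round(et_btfnt, 3), round(et_stall, 3), round(et_stall / et_btfnt, 2))
# 1.144 1.2 1.05
```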
The Branch Delay Penalty
The number of cycles between fetch and branch resolution is called the branch delay penalty. It is the number of instructions that get flushed on a misprediction. It is the number of extra cycles the branch gets charged (i.e., the CPI for mispredicted branches goes up by the penalty). 101
Performance Impact (ex 2)
ET = IC * CPI * CT
Our current design resolves branches in decode, so the branch delay penalty is 1 cycle. If removing the comparator from decode (and resolving branches in execute) would reduce cycle time by 20%, would it help or hurt performance? Mispredict rate = 20%; branches are 20% of instructions.
Resolve in decode: CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04; CT = 1; IC = IC; ET = 1.04
Resolve in execute: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 0.8; IC = IC; ET = 0.864
Speedup = 1.2 102
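A quick Python check of this trade-off, mirroring the slide's numbers, shows the faster clock wins even though the penalty grows:

```python
# Compare resolving branches in decode (penalty 1, CT 1.0) versus
# execute (penalty 2, CT 0.8). Mispredict rate 20%, branches 20%.
frac_mis = 0.20 * 0.20   # fraction of all instructions that mispredict

et_decode  = (frac_mis * (1 + 1) + (1 - frac_mis)) * 1.0
et_execute = (frac_mis * (1 + 2) + (1 - frac_mis)) * 0.8

print(round(et_decode, 3), round(et_execute, 3))   # 1.04 0.864
print(round(et_decode / et_execute, 2))            # 1.2
```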
The Importance of Pipeline Depth
There are two important parameters of the pipeline that determine the impact of branches on performance:
Branch decode time -- how many cycles it takes to identify a branch (in our case, this is less than 1).
Branch resolution time -- cycles until the real branch outcome is known (in our case, this is 2 cycles). 104
Pentium 4 Pipeline
Branches take 19 cycles to resolve; stalling is not an option. Identifying a branch takes 4 cycles. 80% branch prediction accuracy is also not an option. Not quite as bad now, but branch prediction is still very important.
Performance Impact (ex 1)
ET = IC * CPI * CT
Backward taken/forward not taken is 80% accurate. Branches are 20% of instructions. Changing the front end increases the cycle time by 10%.
What is the speedup of BTFNT compared to just stalling on every branch?
BTFNT: CPI = 0.2*0.2*(1 + 1) + (1 - 0.2*0.2)*1 = 1.04; CT = 1.1; IC = IC; ET = 1.144
Stall: CPI = 0.2*2 + 0.8*1 = 1.2; CT = 1; IC = IC; ET = 1.2
Speedup = 1.2/1.144 = 1.05
What if the branch delay penalty were 20 instead of 1? Branches are relatively infrequent (~20% of instructions), but Amdahl's Law tells us that we can't completely ignore this uncommon case. 106
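The "what if it were 20" question is easy to explore by parameterizing the penalty (a sketch; the 1.1 cycle-time factor carries over from the slide's front-end change):

```python
# Relative ET as a function of the branch delay penalty.
def et(branch_frac, mispredict, penalty, ct):
    frac_mis = branch_frac * mispredict
    cpi = frac_mis * (1 + penalty) + (1 - frac_mis)
    return cpi * ct

print(round(et(0.2, 0.2, 1, 1.1), 3))   # our 5-stage pipeline: 1.144
print(round(et(0.2, 0.2, 20, 1.1), 3))  # deep pipeline: 1.98
```

With a 20-cycle penalty, mispredicted branches alone push CPI from 1.04 to 1.8, which is exactly why deep pipelines like the Pentium 4's cannot live with 80% prediction accuracy.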
Dynamic Branch Prediction
Long pipes demand higher accuracy than static schemes can deliver. Instead of making the guess once (i.e., statically), make it every time we see the branch. There are many ways to predict dynamically; we will focus on predicting future behavior based on past behavior. 107
Predictable Control
Use previous branch behavior to predict future branch behavior. When is branch behavior predictable?
Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch. All 10 are pretty predictable.
Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }: the branch is always taken or always not taken.
Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}
Function calls -- LibraryFunction() converts to a jr (jump register) instruction, but the target is always the same. BaseClass *t; // t is usually a pointer to one subclass; t->SomeVirtualFunction() // will usually call the same function. 109
Dynamic Predictor 1: The Simplest Thing
Predict that this branch will go the same way as the previous branch did.
Pros? Dead simple. Keep a bit in the fetch stage. Works OK for simple loops. The compiler might be able to arrange things to make it work better.
Cons? An unpredictable branch in a loop will mess everything up, and it can't tell the difference between branches. 110
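This "same as the last branch" scheme is easy to simulate (a hypothetical Python model of our own, not the deck's hardware):

```python
# One global bit: predict each branch goes the way the previous one did.
def run_last_outcome(outcomes, initial_guess=False):
    last, mispredicts = initial_guess, 0
    for taken in outcomes:
        if last != taken:       # the prediction was wrong
            mispredicts += 1
        last = taken            # remember the most recent outcome
    return mispredicts

loop = [True] * 9 + [False]     # a single 10-iteration loop branch
print(run_last_outcome(loop))   # 2: the first iteration and the loop exit
```

Feed it an alternating pattern ([True, False] repeated) and it mispredicts every single time, which illustrates the "unpredictable branch messes everything up" con.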
Dynamic Prediction 2: A Table of Bits
Give each branch its own bit in a table. Look up the prediction bit for the branch. How big does the table need to be?
Pros: It can differentiate between branches. Bad behavior by one won't mess up others... mostly.
Cons: 111
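A per-branch table removes most of the interference between branches. Here is a minimal one-bit-table sketch (the table size and PC-based indexing are our own assumptions, not values from the slides):

```python
# One prediction bit per entry, indexed by low PC bits (word-aligned,
# so we drop the bottom two bits before indexing).
class OneBitPredictor:
    def __init__(self, size=1024):
        self.table = [False] * size   # start out predicting not-taken

    def predict(self, pc):
        return self.table[(pc >> 2) % len(self.table)]

    def update(self, pc, taken):
        self.table[(pc >> 2) % len(self.table)] = taken

p = OneBitPredictor()
mis = 0
for taken in [True] * 9 + [False]:    # a 10-iteration loop branch
    if p.predict(0x400048) != taken:
        mis += 1
    p.update(0x400048, taken)
print(mis)  # 2: wrong on the first iteration and on loop exit
```

Note the "mostly" in the pros: two branches whose PCs map to the same table entry still alias and can corrupt each other's prediction bit.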