Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = PDF Free Download

Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Speedup =2n/05n+15 2n/0.5n 1.5 4 = number of stages 4.5 An Overview of Pipelining 37 Chapter 4 The Processor 1

MIPS Pipeline Five stages, one step per stage 1. IF: Instruction Fetch from memory 2. ID: Instruction Decode & register read 3. EX: Execute operation or calculate address 4. MEM: Access MEMory operand 5. WB: Write result Back to register 38 Chapter 4 The Processor 2

Pipeline Performance Assume time for stages is 100ps for register read or write 200ps for other stages Compare pipelined datapath with single-cycle datapath Instr Instr fetch Register read ALU op Memory access Register write Total time lw 200ps 100 ps 200ps 200ps 100 ps 800ps sw 200ps 100 ps 200ps 200ps 700ps R-format 200ps 100 ps 200ps 100 ps 600ps beq 200ps 100 ps 200ps 500ps 39 Chapter 4 The Processor 3

Pipeline Performance Single-cycle (T c = 800ps) Pipelined (T c = 200ps) 40 Chapter 4 The Processor 4

Pipeline Speedup If all stages are balanced i.e., all take the same time Time between instructions pipelined = Time between instructions nonpipelined Number of stages If not balanced, speedup is less Speedup due to increased throughput Latency (time for each instruction) does not decrease 41 Chapter 4 The Processor 5

MIPS Pipeline Stages (review) Instruction Fetch (IF): The instruction is fetched from memory and placed in the instruction register (IR). Instruction decode (ID): The bits of the instruction are decoded into control signals. Operands are moved from registers or immediate fields to working registers. For branch instructions, the branch condition is tested and the branch address computed. Execution (EX): The instruction is executed. Specifically, if the instruction is an arithmetic or logical operation, its results are computed. If it is a load-store instruction, the address is computed. All this is done by an elaborate logic circuit called the arithmeticlogical unit (ALU). MEMory read/write (MEM): If the instruction is a load-store, the memory is read or written. Write Back (WB): The results of the operation are written to the destination register. 42 Chapter 4 The Processor 6

Pipelining and ISA Design MIPS ISA designed for pipelining All instructions are 32-bits Easier to fetch and decode in one cycle c.f. x86: 1- to 17-byte instructions Few and regular instruction formats Can decode and read registers in one step Load/store addressing Can calculate address in 3 rd stage, access memory in 4 th stage Alignment of memory operands Memory access takes only one cycle 43 Chapter 4 The Processor 7

Hazards Situations that prevent starting the next instruction ti in the next cycle Structure hazards A required resource is busy Data hazard Need to wait for previous instruction to complete its data read/write Control hazard Deciding on control action depends d on previous instruction 44 Chapter 4 The Processor 8

Structure Hazards Conflict for use of a resource Execution of Instruction combination in one clock cycle is not supported. (Combine Washer + Dryer) In MIPS pipeline with a single memory Load/store requires data access Instruction fetch would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches 45 Chapter 4 The Processor 9

Structure Hazards lw $4, 400($0) Single memory for data and program? 46 Chapter 4 The Processor 10

Data Hazards An instruction depends on completion of data access by a previous instruction add $s0, $t0, $t1 sub $t2, $s0, $t3 47 Chapter 4 The Processor 11

Forwarding (aka Bypassing) Use result when it is computed Don t wait for it to be stored in a register Requires extra connections in the datapath 48 Chapter 4 The Processor 12

Load-Use Data Hazard Can t always avoid stalls by forwarding If value not computed when needed Can t forward backward in time! 49 Chapter 4 The Processor 13

Code Scheduling to Avoid Stalls Reorder code to avoid use of load result in the next instruction C code for A = B + E; C = B + F; lw $t1, 0($t0) lw $t1, 0($t0) lw $t2, 4($t0) lw $t2, 4($t0) stall add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) add $t3, $t1, $t2 lw $t4, 8($t0) sw $t3, 12($t0) stall add $t5, $t1, $t4 add $t5, $t1, $t4 sw $t5, 16($t0) sw $t5, 16($t0) 13 cycles 11 cycles 50 Chapter 4 The Processor 14

Control/Branch Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always fetch correct instruction Still working on ID stage of branch In MIPS pipeline Need to compare registers and compute target early in the pipeline Add hardware to do it in ID stage 51 Chapter 4 The Processor 15

Stall on Branch Wait until branch outcome determined before fetching next instruction 52 Chapter 4 The Processor 16

Branch Prediction Longer pipelines can t readily determine branch outcome early Stall penalty becomes unacceptable Predict outcome of branch Only stall if prediction is wrong In MIPS pipeline Can predict branches not taken Fetch instruction after branch, with no delay 53 Chapter 4 The Processor 17

MIPS with Predict Not Taken Prediction correct Prediction incorrect 54 Chapter 4 The Processor 18

More-Realistic Branch Prediction Static branch prediction Based on typical branch behavior Example: loop and if-statement branches Predict backward branches taken Predict forward branches not taken Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history 55 Chapter 4 The Processor 19

Pipeline Summary The BIG Picture Pipelining improves performance by increasing instruction throughput Executes multiple instructions in parallel Each instruction has the same latency Subject to hazards Structure, data, control Instruction set design affects complexity of pipeline implementation 56 Chapter 4 The Processor 20

MIPS Pipelined Datapath 4.6 Pipelined Datapath and Control MEM Right-to-left t flow leads to hazards WB 57 Chapter 4 The Processor 21

Pipeline Instruction Execution Single cycle datapath Chapter 4 The Processor 58 Chapter 4 The Processor 22

Pipeline registers Need registers between stages To hold information produced in previous cycle 59 64, 128, 97, and 64 Chapter 4 The Processor 23

Pipeline Operation Cycle-by-cycle flow of instructions through the pipelined datapath Single-clock-cycle pipeline diagram Shows pipeline usage in a single cycle Highlight resources used c.f. multi-clock-cycle diagram Graph of operation over time We ll look at single-clock-cycle diagrams for load & store 60 64, 128, 97, and 64 Chapter 4 The Processor 24

IF for Load, Store, Right Half Shaded: Read Left Half Shaded: Write 61 64, 128, 97, and 64 Chapter 4 The Processor 25

ID for Load, Store, IF/ID =16-bit immediate field the register numbers to read the two registers. All three values are stored in the ID/EX pipeline register, along with the 62 incremented PC address. 64, 128, 97, and 64 Chapter 4 The Processor 26

EX for Load EX/MEM=contents of register 1 [from ID/EX ] + the sign-extended immediate 63 64, 128, 97, and 64 Chapter 4 The Processor 27

MEM for Load MEM/WB = is loaded with the contents from memory location [address EX/MEM] 64 64, 128, 97, and 64 Chapter 4 The Processor 28

WB for Load Wrong register number 65 Chapter 4 The Processor 29

Corrected Datapath for Load 66 Chapter 4 The Processor 30

EX for Store 67 Chapter 4 The Processor 31

MEM for Store 68 Chapter 4 The Processor 32

WB for Store 69 Chapter 4 The Processor 33

Multi-Cycle Pipeline Diagram Form showing resource usage 70 Chapter 4 The Processor 34

Multi-Cycle Pipeline Diagram Traditional form 71 Chapter 4 The Processor 35

Single-Cycle Pipeline Diagram State of pipeline in a given cycle 72 Chapter 4 The Processor 36

Pipelined Control (Simplified) 73 Chapter 4 The Processor 37

Pipelined Control Control signals derived from instruction As in single-cycle implementation 74 Chapter 4 The Processor 38

Pipelined Control 75 Chapter 4 The Processor 39

Data Hazards in ALU Instructions Consider this sequence: sub $2, $1,$3 # Register $2 written by sub and $12,$2,$5 # 1st operand($2) depends on sub or $13,$6,$2 # 2nd operand($2) depends on sub add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub sw $15,100($2) 100($2) # Base ($2) depends d on sub We can resolve hazards with forwarding How do we detect when to forward? 4.7 Data Hazard ds: Forwarding vs. Stalling 76 Chapter 4 The Processor 40

Dependencies & Forwarding 77 Chapter 4 The Processor 41

Detecting the Need to Forward Pass register numbers along pp pipeline e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt Data hazards when 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt Fwd from EX/MEM pipeline reg Fwd from MEM/WB pipeline reg 78 Chapter 4 The Processor 42

Detecting the Need to Forward EX/MEM.RegisterRd = ID/EX.RegisterRs = $2 The sub-and is a type 1a hazard. sub $2, $1,$3 # Register $2 written by sub and $12,$2,$5 # 1st operand($2) depends on sub or $13,$6,$2 # 2nd operand($2) depends on sub add $14,$2,$2 # 1st($2) & 2nd($2) depend on sub sw $15,100($2) # Base ($2) depends on sub The sub-or is a type 2b hazard:mem/wb.registerrd = ID/EX.RegisterRt = $2 The two dependences on sub-add are not hazards because the register file supplies the proper data during the ID stage of add. There is no data hazard between sub and sw because sw reads $2 the clock cycle after sub writes $2. 79 Chapter 4 The Processor 43

Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd 0, MEM/WB.RegisterRd 0 80 Chapter 4 The Processor 44

Forwarding Paths 81 Chapter 4 The Processor 45

Forwarding Paths 82 Chapter 4 The Processor 46

Forwarding Conditions EX hazard if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10 MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 83 Chapter 4 The Processor 47

Forwarding Registers The control values for the forwarding multiplexors. 84 Chapter 4 The Processor 48

Double Data Hazard Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Both hazards occur Want to use the most recent Revise MEM hazard condition Only fwd if EX hazard a condition dto isn t true 85 Chapter 4 The Processor 49

Revised Forwarding Condition MEM hazard if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01 86 Chapter 4 The Processor 50

Datapath with Forwarding 87 Chapter 4 The Processor 51

Load-Use Data Hazard tests to see if the instruction is a load if (ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))) stall the pipeline Check to see if the destination register field of the load in the EX stage matches either source register of the instruction in the ID stage. 88 Chapter 4 The Processor 52

Load-Use Data Hazard Need to stall for one cycle 89 Chapter 4 The Processor 53

Load-Use Hazard Detection Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt Load-use hazard when ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble 90 Chapter 4 The Processor 54

How to Stall the Pipeline Force control values in ID/EX register to 0 EX, MEM and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw Can subsequently forward to EX stage 91 Chapter 4 The Processor 55

Stall/Bubble in the Pipeline Stall inserted here 92 Chapter 4 The Processor 56

Stall/Bubble in the Pipeline Or, more accurately 93 Chapter 4 The Processor 57

Datapath with Hazard Detection 94 Chapter 4 The Processor 58

Stalls and Performance The BIG Picture Stalls reduce performance But are required to get correct results Compiler can arrange code to avoid hazards and stalls Requires knowledge of the pipeline structure 95 Chapter 4 The Processor 59

Branch Hazards If branch outcome determined in MEM 4.8 Control Hazards beq rs, rt, offset if rs= rt then PC [PC] + 4 + 4*offset else PC [PC] + 4 Flush these instructions (Set control values to 0) PC 96 Chapter 4 The Processor 60

Branch Hazards beq rs, rt, offset if rs= rt then PC [PC] + 4 + (4*offset) else PC [PC] + 4 bgez rs, offset if rs 0 then PC [PC] + 4 + (4*offset) else PC [PC] + 4 bgtz rs, offset if rs > 0 then PC [PC] + 4 + (4*offset) else PC [PC] + 4 97 Chapter 4 The Processor 61

Reducing Branch Delay Move hardware to determine outcome to ID stage Target address adder Register comparator Example: branch taken 36: sub $10, $4, $8 40: beq $1, $3, 7 44: and $12, $2, $5 48: or $13, $2, $6 52: add $14, $4, $2 56: slt $15, $6, $7... 72: lw $4, 50($7) 98 Chapter 4 The Processor 62

Example: Branch Taken 99 Chapter 4 The Processor 63

Example: Branch Taken 100 Chapter 4 The Processor 64

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: Non-stop: Speedup = 8/3.5 = 2.3.