Chapter 4 (Part II) Sequential Laundry

Chapter 4 (Part II) The Processor Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Sequential Laundry 6 P 7 8 9 10 11 12 1 2 A T a s k O r d e r A B C D 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 Time Sequential laundry takes 8 hours for 4 loads 2 Division of Engineering Programs SUNY New Paltz 1

Pipelined Laundry 6 P 7 8 9 T a s k O r d e r A B C Time 30 30 30 30 30 30 30 D Pipelined laundry: overlapping execution Parallelism improves performance 3 Pipelining Analogy Four loads: Speedup = 8/3.5 = 2.3 Non-stop: Pipelining doesn t help latency of single task, it helps throughput of entire workload ultiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup Time to fill pipeline and time to drain it reduces speedup Stall for dependencies Speedup = 2n/0.5n + 1.5 4 = number of stages 4 Division of Engineering Programs SUNY New Paltz 2

Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (2ns), and adders (2ns), register file access (1ns) PCSrc 4 Add Write Shift left 2 Add resu lt 1 u x 0 PC address In s tru ction [3 1 0 ] Instruction memory In structio n [25 2 1] In structio n [20 1 6] In structio n [15 1 1] 1 u x 0 Dst In structio n [15 0 ] register 1 register 2 Write register Write data d ata 1 d ata 2 isters 16 Sign 32 extend Src 1 u x 0 control Zero result emwrite Address Write data Data memory em data emto 1 u x 0 Instru ction [5 0] Op Instruction Instr. emory ister Op. Data emory. Write Total R-format 200ps 100ps 200ps 0 100ps 600ps lw 200ps 100ps 200ps 200ps 100ps 800ps sw 200ps 100ps 200ps 200ps 700ps beq 200ps 100ps 200ps 500ps 5 Single Stage VS. Pipeline Performance Instruction Instr. emory ister Op. Data emory. Write Total R-format 200ps 100ps 200ps 0 100ps 600ps lw 200ps 100ps 200ps 200ps 100ps 800ps sw 200ps 100ps 200ps 200ps 700ps beq 200ps 100ps 200ps 500ps Single-cycle (T c = 800ps) Pipelined (T c = 200ps) 6 Division of Engineering Programs SUNY New Paltz 3

IPS Pipeline Five stages, one step per stage 1. IF: Instruction fetch from memory 2. ID: Instruction decode & register read 3. EX: Execute operation or calculate address 4. E: Access memory operand 5. WB: Write result back to register 7 Pipelining What makes it easy in IPS All instructions are the same length Just a few instruction formats emory operands appear only in loads and stores What makes it hard? Structural hazards: suppose we had only one memory Control hazards: need to worry about branch instructions Data hazards: an instruction depends on a previous instruction We ll build a simple pipeline and look at these issues what makes it really hard for modern processors Exception handling Trying to improve performance with out-of-order execution 8 Division of Engineering Programs SUNY New Paltz 4

Pipeline Speedup 9 The Five Stages of Load What do we need to add to actually split the datapath into stages? LW Ifetch /Dec Exec em Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction emory /Dec: isters Fetch and Instruction Decode Exec: Calculate the memory address em: the data from the Data emory Wr: Write the data back to the register file 10 Division of Engineering Programs SUNY New Paltz 5

Basic Idea IF: Instruction fetch 0 u x ID: Instruction decode/ registerfile read EX: Execute/ address calculation E: emory access WB: Write back 1 Add 4 Add Add re s ult Shift left 2 PC Addre s s Ins tructio n Ins truction me mory re gis ter 1 re gis ter 2 isters Write re gis te r Write data data 1 data 2 0 u x 1 Ze ro res ult Addre ss Data me mory Write data data 1 u x 0 16 Sign extend 32 11 Basic Idea IF: Instruction fetch 0 u x ID: Instruction decode/ registerfile read EX: Execute/ address calculation E: emory access WB: Write back 1 Add 4 Add Add re s ult Shift left 2 PC Addre s s Ins tructio n Ins truction me mory re gis ter 1 re gis ter 2 isters Write re gis te r Write data data 1 data 2 0 u x 1 Ze ro res ult Addre ss Data me mory Write data data 1 u x 0 16 Sign extend 32 12 Division of Engineering Programs SUNY New Paltz 6

Pipelining and ISA Design IPS ISA designed for pipelining All instructions are 32-bits Easier to fetch and decode in one cycle c.f. x86: 1- to 17-byte instructions Few and regular instruction formats Can decode and read registers in one step Load/store addressing Can calculate address in 3 rd stage, access memory in 4 th stage Alignment of memory operands emory access takes only one cycle 13 Pipeline registers Need registers between stages To hold information produced in previous cycle Can you find a problem? What instructions can we execute to manifest the problem? 14 Division of Engineering Programs SUNY New Paltz 7

Pipeline Operation Cycle-by-cycle flow of instructions through the pipelined datapath Single-clock-cycle pipeline diagram Shows pipeline usage in a single cycle Highlight resources used c.f. multi-clock-cycle diagram Graph of operation over time We ll look at single-clock-cycle diagrams for load & store 15 IF for Load, Store, 16 Division of Engineering Programs SUNY New Paltz 8

ID for Load, Store, 17 EX for Load 18 Division of Engineering Programs SUNY New Paltz 9

E for Load 19 WB for Load Wrong register number 20 Division of Engineering Programs SUNY New Paltz 10

Corrected Datapath for Load 21 EX for Store 22 Division of Engineering Programs SUNY New Paltz 11

E for Store 23 WB for Store 24 Division of Engineering Programs SUNY New Paltz 12

Graphically Representing Pipelines Program execution order (in instructions) lw $10, 20($1) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 I D sub $11, $2, $3 I D Can help with answering questions like: How many cycles does it take to execute this code? What is the doing during cycle 4? Use this representation to help understand datapaths 25 Why Pipeline? Time (clock cycles) I n s t r. O r d e r Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Im Dm Im Dm Im Dm Im Dm Im Dm 26 Division of Engineering Programs SUNY New Paltz 13

ulti-cycle Pipeline Diagram Form showing resource usage 27 Single-Cycle Pipeline Diagram State of pipeline in a given cycle 28 Division of Engineering Programs SUNY New Paltz 14

Pipelined Control (Simplified) 29 Pipeline Control We have 5 stages. What needs to be controlled in each stage? Instruction Fetch and PC Increment Instruction Decode / ister Fetch Execution emory Stage Write Back How would control be handled in an automobile plant? A fancy control center telling everyone what to do? Should we use a finite state machine? 30 Division of Engineering Programs SUNY New Paltz 15

Pipeline Control Control signals derived from instruction As in single-cycle implementation Pass control signals along just like the data Execution/Address Calculation stage control lines emory access stage control lines Write-back stage control lines Instruction Dst Op1 Op0 Src Branch em em Write write em to R-format 1 1 0 0 0 0 0 1 0 lw 0 0 0 1 0 1 0 1 1 sw X 0 0 1 0 0 1 0 X beq X 0 1 0 1 0 0 0 X WB Instruction Control WB EX WB IF/ID ID/EX EX/ E E/W B 31 Pipelined Control 32 Division of Engineering Programs SUNY New Paltz 16

Designing a Pipelined Processor Go back and examine your datapath and control diagram Associated resources with states Ensure that flows do not conflict, or figure out how to resolve Assert control in appropriate stage 33 Pipelining Troubles? Pipeline Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards: attempt to use the same resource two different ways at the same time. Data hazards: attempt to use item before it is ready. Instruction depends on result of prior instruction still in the pipeline. Control hazards: attempt to make a decision before condition is evaluated. Branch instructions Can always resolve hazards by waiting. Pipeline control must detect the hazard. Take action (or delay action) to resolve hazards. 34 Division of Engineering Programs SUNY New Paltz 17

Structure Hazards Conflict for use of a resource In IPS pipeline with a single memory Load/store requires data access Instruction fetch would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches 35 Data Hazards Problem with starting next instruction before first is finished Dependencies that go backward in time are data hazards Time (in clock cycles) Value of register $2 : Program execution order (in instructions) sub $2, $1,$3 CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 10 10 10 10/ 20 20 20 20 20 I D and $12, $2, $5 I D or $13, $6, $2 I D add $14, $2, $2 I D sw $15, 100($2) I D 36 Division of Engineering Programs SUNY New Paltz 18

Data Hazard Solution An instruction depends on completion of data access by a previous instruction add $s0, $t0, $t1 sub $t2, $s0, $t3 nop nop 37 Software Solution Have compiler guarantee no hazards Where should compiler insert nop instructions? sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Problem: It happens too often to rely on compiler It really slows us down! 38 Division of Engineering Programs SUNY New Paltz 19

Data Hazards Software Solution Time (in clock cycles) Value of CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 register $2 : 10 10 10 10 10/ 20 20 20 20 20 Program execution order (in instructions) sub $2, $1,$3 I D and $12, $2, $5 I D or $13, $6, $2 I D add $14, $2, $2 I D sw $15, 100($2) I D 39 Code Scheduling to Avoid Stalls C code for A = B + E; C = B + F; stall stall lw $t1, 0($t0) lw $t2, 4($t0) add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) add $t5, $t1, $t4 sw $t5, 16($t0) I I D D I D I D I D I D I D 11 cycles not counting the dependencies 40 Division of Engineering Programs SUNY New Paltz 20

Code Scheduling to Avoid Stalls Reorder code to avoid use of load result in the next instruction C code for A = B + E; C = B + F; stall stall 15 cycles lw $t1, 0($t0) lw $t2, 4($t0) nop nop add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) nop nop add $t5, $t1, $t4 sw $t5, 16($t0) 12 cycles lw $t1, 0($t0) lw $t2, 4($t0) nop lw $t4, 8($t0) add $t3, $t1, $t2 sw $t3, 12($t0) add $t5, $t1, $t4 sw $t5, 16($t0) 41 Forwarding (aka Bypassing) Use result when it is computed Don t wait for it to be stored in a register Requires extra connections in the datapath 42 Division of Engineering Programs SUNY New Paltz 21

Data Hazards in Instructions Consider this sequence: sub $2, $1,$3 and $12,$2,$5 or $13,$6,$2 add $14,$2,$2 sw $15,100($2) We can resolve hazards with forwarding How do we detect when to forward? 4.7 Data Hazards: Forwarding vs. Stalling 43 Data Hazard Solution: Forwarding Use temporary results ( forwarding), don t wait for them to be written Also, write register file during 1st half of clock and read during 2nd Time (in clock cycles) half CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 Value of register $2 : 10 10 10 10 10/ 20 20 20 20 20 Value of EX/E : X X X 20 X X X X X Value of E/W B : X X X X 20 X X X X Program execution order (in instructions) sub $2,$1,$3 I R eg D and $12, $2,$5 I D or $13, $6, $2 I D add $14, $2, $2 I D sw $15, 100($2) I D what if this $2 was $13? 44 Division of Engineering Programs SUNY New Paltz 22

Data Hazard Solution: Forwarding Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 Value of register $2 : 10 10 10 10 10/ 20 20 20 20 20 Value of EX/E : X X X 20 X X X X X Value of E/W B : X X X X 20 X X X X Program execution order (in instructions) sub $2, $1, $3 I R eg D and $12, $2,$5 I D or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) what if this $2 was $13? 45 Dependencies & Forwarding 46 Division of Engineering Programs SUNY New Paltz 23

Forwarding Paths 47 Detecting the Need to Forward Pass register numbers along pipeline e.g., ID/EX.isterRs = register number for Rs sitting in ID/EX pipeline register operand register numbers in EX stage are given by ID/EX.isterRs, ID/EX.isterRt Data hazards when 1a. EX/E.isterRd = ID/EX.isterRs 1b. EX/E.isterRd = ID/EX.isterRt 2a. E/WB.isterRd = ID/EX.isterRs 2b. E/WB.isterRd = ID/EX.isterRt Fwd from EX/E pipeline reg Fwd from E/WB pipeline reg 48 Division of Engineering Programs SUNY New Paltz 24

Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/E.Write, E/WB.Write And only if Rd for that instruction is not $zero EX/E.isterRd 0, E/WB.isterRd 0 49 Datapath with Forwarding 00 ister file 01 em. or earlier 10 Prior 50 Division of Engineering Programs SUNY New Paltz 25

Forwarding Conditions EX hazard if (EX/E.Write and (EX/E.isterRd 0) and (EX/E.isterRd = ID/EX.isterRs)) ForwardA = 10 if (EX/E.Write and (EX/E.isterRd 0) and (EX/E.isterRd = ID/EX.isterRt)) ForwardB = 10 E hazard if (E/WB.Write and (E/WB.isterRd 0) and (E/WB.isterRd = ID/EX.isterRs)) ForwardA = 01 if (E/WB.Write and (E/WB.isterRd 0) and (E/WB.isterRd = ID/EX.isterRt)) ForwardB = 01 51 Hazard Conditions Steer the result from previous instruction to the ID/EX WB 00 ister file 01 em. or earlier 10 Prior EX/E Control WB E/WB IF/ID EX WB PC Instruction memory Instruction isters ux Data memory ux ux EX hazard if (EX/E.Write and (EX/E.isterRd 0) IF/ID.isterRs IF/ID.isterRt IF/ID.isterRt IF/ID.isterRd and (EX /E.isterRd = ID/EX.isterRs)) ForwardA = 10 if (EX/E.Write and (EX/E.isterRd 0) and (EX /E.isterRd = ID/EX.isterRt)) ForwardB = 10 Rs Rt Rt Rd ux Forwarding unit EX/E.isterRd E/WB.isterRd 52 Division of Engineering Programs SUNY New Paltz 26

Hazard Conditions Steer the result from precious instruction to the ID/EX WB EX/E Control WB E/WB IF/ID EX WB PC Instruction memory Instruction isters ux Data memory ux ux E hazard if (E/WB.Write and (E/WB.isterRd 0) IF/ID.isterRs IF/ID.isterRt IF/ID.isterRt IF/ID.isterRd and (E/WB.isterRd = ID/EX.isterRs)) ForwardA = 01 if (E/WB.Write and (E/WB.isterRd 0) and (E/WB.isterRd = ID/EX.isterRt)) ForwardB = 01 Rs Rt Rt Rd ux Forwarding unit EX/E.isterRd E/WB.isterRd 53 Double Data Hazard Consider the sequence: add $1,$1,$2 add $1,$1,$3 add $1,$1,$4 Both hazards occur Want to use the most recent Revise E hazard condition Only fwd if EX hazard condition isn t true I D I D I D 54 Division of Engineering Programs SUNY New Paltz 27

Revised Forwarding Condition E hazard if (E/WB.Write and (E/WB.isterRd 0) and not (EX/E.Write and (EX/E.isterRd 0) and (EX/E.isterRd = ID/EX.isterRs)) and (E/WB.isterRd = ID/EX.isterRs)) ForwardA = 01 if (E/WB.Write and (E/WB.isterRd 0) and not (EX/E.Write and (EX/E.isterRd 0) and (EX/E.isterRd = ID/EX.isterRt)) and (E/WB.isterRd = ID/EX.isterRt)) ForwardB = 01 55 Can't always forward lw can still cause a hazard: An instruction tries to read a register following a load instruction that writes to the same register. Time (in clock cycles) Program CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 execution order (in instructions) lw $2, 20($1) I D and $4, $2, $5 I D CC 7 CC 8 CC 9 Need to stall for one cycle or $8, $2, $6 I D add $9, $4, $2 I D slt $1, $6, $7 I D Thus, we need a hazard detection unit to stall the load instruction 56 Division of Engineering Programs SUNY New Paltz 28

Load-Use Data Hazard Can t always avoid stalls by forwarding If value not computed when needed Can t forward backward in time! 57 Load-Use Hazard Detection Check when using instruction is decoded in ID stage operand register numbers in ID stage are given by IF/ID.isterRs, IF/ID.isterRt Load-use hazard when ID/EX.em and ((ID/EX.isterRt = IF/ID.isterRs) or (ID/EX.isterRt = IF/ID.isterRt)) If detected, stall and insert bubble 58 Division of Engineering Programs SUNY New Paltz 29

How to Stall the Pipeline Force control values in ID/EX register to 0 EX, E and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows E to read data for lw Can subsequently forward to EX stage 59 Stall/Bubble in the Pipeline Stall inserted here 60 Division of Engineering Programs SUNY New Paltz 30

Stall/Bubble in the Pipeline Or, more accurately 61 Datapath with Hazard Detection Stall by letting an instruction that won t write anything go forward Controls writing of the PC and IF/ID plus UX 62 Division of Engineering Programs SUNY New Paltz 31

Code Scheduling to Avoid Stalls Revisiting reordering code to avoid use of load result in the next instruction C code for A = B + E; C = B + F; stall stall lw $t1, 0($t0) lw $t2, 4($t0) nop add $t3, $t1, $t2 sw $t3, 12($t0) lw $t4, 8($t0) nop add $t5, $t1, $t4 sw $t5, 16($t0) lw $t1, 0($t0) lw $t2, 4($t0) lw $t4, 8($t0) add $t3, $t1, $t2 sw $t3, 12($t0) add $t5, $t1, $t4 sw $t5, 16($t0) 11 cycles 13 cycles 63 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can t always fetch correct instruction Still working on ID stage of branch In IPS pipeline Need to compare registers and compute target early in the pipeline Add hardware to do it in ID stage 64 Division of Engineering Programs SUNY New Paltz 32

Branch Hazards When decide to branch, other instructions may be in the pipeline! If branch outcome determined in E 4.8 Control Hazards Flush these instructions (Set control values to 0) PC 65 Stall on Branch Wait until branch outcome determined before fetching next instruction 66 Division of Engineering Programs SUNY New Paltz 33

Branch Prediction Longer pipelines can t readily determine branch outcome early Stall penalty becomes unacceptable Predict outcome of branch Only stall if prediction is wrong In IPS pipeline Can predict branches not taken Fetch instruction after branch, with no delay Need to add hardware for flushing instructions if we are wrong 67 IPS with Predict Not Taken Prediction correct Prediction incorrect 68 Division of Engineering Programs SUNY New Paltz 34

Our Original Datapath PCSrc ux 0 1 Control ID/EX WB EX/E WB E/WB IF/ID EX WB Add PC 4 Address Instruction memory Instruction Write register 1 data 1 register 2 isters Write data 2 register Write data Shift left 2 0 ux 1 Add result Add Src Zero result Branch Write data emwrite Address Data memory data emto ux 1 0 Instruction [15 0] Instruction [20 16] Instruction [15 11] 16 32 Sign extend Dst 69 ux 0 1 6 control Op em Reduce Branch Delay IF.Flush ux Hazard detection unit ID/EX WB EX/E Control 0 ux WB E/WB IF/ID EX WB 4 Shift left 2 PC Instruction memory isters = ux ux Data memory ux Sign extend ux Forwarding unit 70 Division of Engineering Programs SUNY New Paltz 35

Reducing Branch Delay ove hardware to determine outcome to ID stage Target address adder ister comparator Example: branch taken 36: sub $10, $4, $8 40: beq $1, $3, 7 44: and $12, $2, $5 48: or $13, $2, $6 52: add $14, $4, $2 56: slt $15, $6, $7... 72: lw $4, 50($7) 71 Example: Branch Taken 72 Division of Engineering Programs SUNY New Paltz 36

Example: Branch Taken 48 73 Pipeline Summary The BIG Picture Pipelining improves performance by increasing instruction throughput Executes multiple instructions in parallel Each instruction has the same latency Subject to hazards Structure, data, control Instruction set design affects complexity of pipeline implementation 74 Division of Engineering Programs SUNY New Paltz 37

Stalls and Performance The BIG Picture Stalls reduce performance But are required to get correct results Compiler can arrange code to avoid hazards and stalls Requires knowledge of the pipeline structure 75 Division of Engineering Programs SUNY New Paltz 38