Pipelined Datapath. Reading. Sections Practice Problems: 1, 3, 8, 12

Pipelined Datapath
Lecture notes from MKP, H. H. Lee and S. Yalamanchili

Reading
Sections 4.5 4.
Practice Problems: 1, 3, 8, 12
Note: Appendices A-E in the hardcopy text correspond to chapters 7- in the online text.

Pipeline Performance
Assume time for stages is
  100ps for register read or write
  200ps for other stages
Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps

Pipeline Performance
[Figure: timing diagrams for single-cycle (Tc = 800ps) and pipelined (Tc = 200ps) execution]
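The comparison can be checked with a short calculation. The sketch below assumes the stage times are 100 ps for register read or write and 200 ps for every other stage, and that an ideal five-stage pipeline completes n instructions in n + 4 clock cycles; the names `STAGE_PS`, `USES`, and the helper functions are illustrative, not part of the slides.

```python
# Stage latencies in picoseconds (assumed: 100 ps register access, 200 ps otherwise).
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages that each instruction class actually uses.
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def instr_time(kind):
    """Total latency of one instruction class, in ps."""
    return sum(STAGE_PS[stage] for stage in USES[kind])


# Single-cycle clock must fit the slowest instruction; pipelined clock the slowest stage.
single_cycle_tc = max(instr_time(kind) for kind in USES)
pipelined_tc = max(STAGE_PS.values())


def single_cycle_total(n):
    """Time to run n instructions on the single-cycle datapath, in ps."""
    return n * single_cycle_tc


def pipelined_total(n):
    """Time to run n instructions on an ideal pipeline (no stalls), in ps."""
    return (n + len(STAGE_PS) - 1) * pipelined_tc
```

For three back-to-back loads this gives 2400 ps single-cycle against 1400 ps pipelined; the ratio approaches 800/200 = 4 only as the instruction count grows, and it stays below 5 because the stages are not balanced.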

Pipeline Speedup
If all stages are balanced, i.e., all take the same time
  Inter-instruction gap (pipelined) = Inter-instruction gap (nonpipelined) / Number of stages
If not balanced, speedup is less
Speedup is due to increased throughput
  Latency (time for each instruction) does not decrease

Basic Idea
  All instructions are 32 bits
  Few and regular instruction formats
  Alignment of memory operands

Pipelining
What makes it easy?
  All instructions are the same length
  Simple instruction formats
  Memory operands appear only in loads and stores
What makes it hard?
  Structural hazards: suppose we had only one memory
  Control hazards: need to worry about branch instructions
  Data hazards: an instruction depends on a previous instruction
What really makes it hard:
  Exception handling
  Trying to improve performance with out-of-order execution, etc.

Pipeline registers
Need registers between stages
  To hold information produced in previous cycle
[Figure: pipelined datapath with pipeline registers between stages; annotation: pipeline stage execution time]

Graphically Representing Pipelines
[Figure: pipeline timing diagram against time. lw passes through IF, ID, EX, MEM, WB; add follows one stage behind]
  Shading indicates the unit is being used by the instruction
  Shading on the right half of the register file (ID or WB) or memory means the element is being read in that stage
  Shading on the left half means the element is being written in that stage

Graphically Representing Pipelines
[Figure: multi-cycle pipeline diagram. Vertical axis: program execution order (in instructions), an lw followed by a sub. Horizontal axis: time (in clock cycles), CC1 to CC6. Each instruction passes through IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
  How many cycles does it take to execute this code?
  What is the ALU doing during cycle 4?
Use this representation to help understand datapaths
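The two questions above can be answered mechanically. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls, in which instruction i (counting from 0) enters IF in cycle i + 1; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to finish n instructions when nothing stalls."""
    return n_instructions + n_stages - 1


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a 1-based cycle, or None."""
    position = cycle - 1 - instr_index
    return STAGES[position] if 0 <= position < len(STAGES) else None


def diagram(names):
    """Text version of the multi-cycle pipeline diagram, one row per instruction."""
    n_cycles = total_cycles(len(names))
    rows = []
    for i, name in enumerate(names):
        cells = [(stage_in_cycle(i, c) or "").ljust(4) for c in range(1, n_cycles + 1)]
        rows.append(name.ljust(5) + "".join(cells).rstrip())
    return "\n".join(rows)


print(diagram(["lw", "sub"]))
```

For the two-instruction example the code takes 6 cycles, and in cycle 4 the ALU (EX stage) is working on the second instruction while the first is in MEM.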

Structural Hazard
[Figure: pipeline timing diagram of lw, add, sub, add, each passing through IF, ID, EX, MEM, WB one stage behind the previous instruction]
Need to separate instruction and data memory

IF for Load, Store, ...
[Figure: pipelined datapath with the IF stage highlighted]

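ID for Load, Store, ...
[Figure: pipelined datapath with the ID stage highlighted]

EX for Load
[Figure: pipelined datapath with the EX stage highlighted]

MEM for Load
[Figure: pipelined datapath with the MEM stage highlighted]

WB for Load
[Figure: pipelined datapath with the WB stage highlighted]
Wrong register number

Corrected Datapath for Load
[Figure: corrected pipelined datapath for load]

EX for Store
[Figure: pipelined datapath with the EX stage highlighted]

MEM for Store
[Figure: pipelined datapath with the MEM stage highlighted]

WB for Store
[Figure: pipelined datapath with the WB stage highlighted]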

Pipelining Example
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with five instructions in flight, one per stage: add, lw, add, sub, lw]
Note what is happening in the register file

Pipelined Control (Simplified)
[Figure: pipelined datapath with simplified control]

Pipelined Control

              Execution/Address Calculation     Memory access                Write-back
              stage control lines               stage control lines          stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0           1         0
lw            0       0       0       1         0       1        0           1         1
sw            X       0       0       1         0       0        1           0         X
beq           X       0       1       0         1       0        0           0         X

Control signals derived from instruction
  As in single-cycle implementation
Pass control signals along like data

Pipelined Control
[Figure: pipelined datapath with control signals carried in the pipeline registers]

Datapath with Control
[Figure sequence: pipelined datapath with control (PCSrc, RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, MemtoReg, RegWrite) as lw, sub, and, or, add move through the pipeline. Stage occupancy in each successive figure:]

Step  IF    ID    EX    MEM   WB
1     lw
2     sub   lw
3     and   sub   lw
4     or    and   sub   lw
5     add   or    and   sub   lw
6     xxxx  add   or    and   sub
7     xxxx  xxxx  add   or    and
8     xxxx  xxxx  xxxx  add   or

Datapath with Control
[Figure: last step of the sequence. IF: xxxx, ID: xxxx, EX: xxxx, MEM: xxxx, WB: add]

Data Hazards (4.7)
An instruction depends on completion of data access by a previous instruction
  add $s0, $t0, $t1
  sub $t2, $s0, $t3

Dependencies
Problem with starting next instruction before first is finished
  Dependencies that go backward in time are data hazards
[Figure: multi-cycle pipeline diagram over CC 1 to CC 9 for the sequence below, with the value of register $2 shown per cycle; the value changes in CC 5, when sub writes it back]
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Software Solution
Have compiler guarantee no hazards
Where do we insert the nops?
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Problem: this really slows us down!
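One way to answer "where do we insert the nops?" is sketched below. It assumes the sequence is sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2), that there is no forwarding, and that the register file is written in the first half of a cycle and read in the second half, so a consumer must be at least three instructions behind its producer. The tuple encoding and the name `insert_nops` are hypothetical.

```python
# Hypothetical instruction model: (text, destination register, source registers).
NOP = ("nop", None, ())


def insert_nops(program, min_distance=3):
    """Pad the program so every consumer is at least min_distance behind its producer."""
    scheduled = []
    for instr in program:
        _, _, sources = instr
        needed = 0
        # Look back at the instructions that would still be in flight.
        for back in range(1, min_distance):
            if back > len(scheduled):
                break
            _, dest, _ = scheduled[-back]
            # Register $0 is hard-wired to zero, so it never creates a hazard.
            if dest not in (None, 0) and dest in sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([NOP] * needed)
        scheduled.append(instr)
    return scheduled


PROGRAM = [
    ("sub $2, $1, $3", 2, (1, 3)),
    ("and $12, $2, $5", 12, (2, 5)),
    ("or $13, $6, $2", 13, (6, 2)),
    ("add $14, $2, $2", 14, (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

for text, _, _ in insert_nops(PROGRAM):
    print(text)
```

Under these assumptions two nops go between sub and and; the later readers of $2 are then far enough behind. Two wasted slots in a five-instruction block is the slowdown the slide complains about.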

A Better Solution
Consider this sequence:
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
We can resolve hazards with forwarding
  How do we detect when to forward?

Dependencies & Forwarding
Do not wait for results to be written to the register file
  Find them in the pipeline and forward to the ALU

Forwarding
[Figure: pipelined datapath with control. IF: add, ID: or, EX: and, MEM: sub, WB: lw]

Forwarding (simplified)
[Figure: ID/EX, EX/MEM, and MEM/WB pipeline registers around the register file, ALU, data memory, and MUX]

Forwarding (from EX/MEM)
[Figure: simplified datapath with MUXes on the ALU inputs and a forwarding path from EX/MEM]

Forwarding (from MEM/WB)
[Figure: simplified datapath with MUXes on the ALU inputs and a forwarding path from MEM/WB]

Forwarding (operand selection)
[Figure: simplified datapath with a forwarding unit controlling the ALU input MUXes]

Forwarding (operand propagation)
[Figure: Rs, Rt, and Rd carried in ID/EX; the forwarding unit compares Rs and Rt with EX/MEM Rd and MEM/WB Rd]
Combinational Logic!

Detecting the Need to Forward
Pass register numbers along pipeline
  e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ALU operand register numbers in EX stage are given by
  ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
      (forward from EX/MEM pipeline register)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
      (forward from MEM/WB pipeline register)

Detecting the Need to Forward
But only if forwarding instruction will write to a register!
  EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not $zero
  EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths
[Figure: datapath showing the forwarding paths]

Forwarding Conditions
EX hazard
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
      and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 10
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
      and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 10
MEM hazard
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01

Double Data Hazard
Consider the sequence:
  add $1, $1, $2
  add $1, $1, $3
  add $1, $1, $4
Both hazards occur
  Want to use the most recent
Revise MEM hazard condition
  Only forward if EX hazard condition isn't true

Revised Forwarding Condition
MEM hazard
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
The "not (...)" term checks the precedence of the EX hazard
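The forwarding conditions, including the revised MEM hazard check, can be written as a small combinational function. This is a sketch only: it assumes the usual encoding in which 10 selects the EX/MEM result, 01 selects the MEM/WB value, and 00 selects the register file operand, and the `PipeReg` tuple and `forward` function are illustrative names.

```python
from collections import namedtuple

# Only the pipeline-register fields the forwarding unit needs (illustrative).
PipeReg = namedtuple("PipeReg", ["reg_write", "rd"])


def forward(ex_mem, mem_wb, rs, rt):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""

    def select(src):
        # EX hazard: the most recent result wins, so it is tested first.
        if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
            return 0b10
        # MEM hazard: reached only when the EX hazard condition is false,
        # which is the "not (...)" term of the revised condition.
        if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
            return 0b01
        return 0b00

    return select(rs), select(rt)


# Third add of the double-hazard sequence: both earlier adds wrote $1.
print(forward(PipeReg(True, 1), PipeReg(True, 1), rs=1, rt=4))
```

Testing the EX hazard first is what gives it precedence; a hardware forwarding unit evaluates the same conditions in parallel as combinational logic.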

Datapath with Forwarding
[Figure: pipelined datapath with forwarding unit]

Concurrent Execution
Correct execution is about managing dependencies
  Producer-consumer
  Structural (using the same hardware component)
We will come across other types of dependencies later!

Load-Use Data Hazard
[Figure: pipeline diagram of a load followed by an instruction that uses its result]
Need to stall for one cycle

Forwarding
[Figure: pipelined datapath with control. IF: add, ID: or, EX: and, MEM: lw, WB: lw]

Load-Use Hazard Detection
Check when using instruction is decoded in ID stage
ALU operand register numbers in ID stage are given by
  IF/ID.RegisterRs, IF/ID.RegisterRt
Load-use hazard when
  ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt))
If detected, stall and insert bubble

Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction
C code for A = B + E; C = B + F;

Original order (13 cycles):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  (stall)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  (stall)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

Reordered (11 cycles):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
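The cycle counts for the two orderings can be reproduced by applying the load-use check to each adjacent pair of instructions. The sketch assumes a five-stage pipeline with full forwarding, so the only stalls are one-cycle load-use stalls; the tuple encoding (op, destination, sources) and the function names are hypothetical.

```python
def count_load_use_stalls(program):
    """One stall per instruction that reads the destination of the load right before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls


def cycles(program, n_stages=5):
    """Fill the pipeline once, then one instruction per cycle, plus stall cycles."""
    return len(program) + n_stages - 1 + count_load_use_stalls(program)


# (op, destination, sources); a store has no destination register.
ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]
# Same instructions with the third load hoisted above the first add.
REORDERED = [ORIGINAL[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(cycles(ORIGINAL), cycles(REORDERED))
```

Hoisting the third load removes both stalls because neither add then follows the load that produces one of its operands.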

How to Stall the Pipeline
Force control values in ID/EX register to 0
  EX, MEM and WB perform a nop (no-operation)
Prevent update of PC and IF/ID register
  Using instruction is decoded again
  Following instruction is fetched again
  1-cycle stall allows MEM to read data for lw
    Can subsequently forward to EX stage

Stall/Bubble in the Pipeline
[Figure: pipeline diagram with a bubble; annotation: stall inserted here]

Stall/Bubble in the Pipeline
[Figure: pipeline diagram with a bubble; annotation: or, more accurately]

Datapath with Hazard Detection
[Figure: pipelined datapath with hazard detection unit]
ALUSrc mux is missing!

Control Hazards (4.8)
Branch instruction determines flow of control
  Fetching next instruction depends on branch outcome
  Pipeline cannot always fetch correct instruction
    Still working on ID stage of branch
In MIPS pipeline
  Need to compare registers and determine the branch condition

Branch Hazards
If branch outcome determined in MEM
[Figure: pipeline diagram of a branch and the instructions fetched after it, with the PC updated once the outcome is known]
  Flush these instructions (set control values to 0)

Reducing Branch Delay
Move hardware to determine outcome to ID stage
  Target address adder
  Register comparator
  IF.Flush signal to squash IF/ID register
Example: branch taken
  36: sub $10, $4, $8
  40: beq $1, $3, 72
  44: and $12, $2, $5
  48: or  $13, $2, $6
  52: add $14, $4, $2
  56: slt $15, $6, $7
  ...
  72: lw  $4, 50($7)

Example: Branch Taken
[Figure: pipelined datapath during the taken branch]

Example: Branch Taken
[Figure: pipelined datapath in the cycle after the taken branch]

Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
  add $1, $2, $3        IF  ID  EX  MEM WB
  add $4, $5, $6            IF  ID  EX  MEM WB
  ...                           IF  ID  EX  MEM WB
  beq $1, $4, target                IF  ID  EX  MEM WB
Can resolve using forwarding to ID
Need to add forwarding logic!

Data Hazards for Branches
If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  Need 1 stall cycle
  lw  $1, addr          IF  ID  EX  MEM WB
  add $4, $5, $6            IF  ID  EX  MEM WB
  beq stalled                   IF  ID
  beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches
If a comparison register is a destination of immediately preceding load instruction
  Need 2 stall cycles
  lw  $1, addr          IF  ID  EX  MEM WB
  beq stalled               IF  ID
  beq stalled                   ID
  beq $1, $0, target                ID  EX  MEM WB
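The three branch cases (no stall, one stall, two stalls) follow one rule once the branch is resolved in ID: an ALU result can be forwarded to ID if its producer is at least two instructions ahead of the branch, and a loaded value if the load is at least three ahead. The function below is a sketch of that rule; its name and the `producer`/`distance` parameters are hypothetical.

```python
# Minimum distance (in instructions) at which a result can be forwarded
# to a branch that is resolved in ID.
MIN_DISTANCE = {"alu": 2, "lw": 3}


def branch_stall_cycles(producer, distance):
    """Stall cycles for a branch whose operand comes from `producer`.

    producer: "alu" or "lw"; distance: 1 means the immediately preceding instruction.
    """
    return max(0, MIN_DISTANCE[producer] - distance)


for producer in ("alu", "lw"):
    for distance in (1, 2, 3):
        print(producer, distance, branch_stall_cycles(producer, distance))
```

The table printed by the loop matches the slides: an ALU producer costs a stall only when it immediately precedes the branch, a load costs two stalls when immediately preceding and one when second preceding.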

Delay Slot (MIPS)
Expose pipeline
  Load and jump/branch entail a "delay slot"
The instruction right after the jump or branch is executed before the jump/branch
  jal function_a
  add $4, $5, $6    ; executed before jmp
  lw  $2, 8($4)     ; executed after return
Jump/branch and the delay slot instruction are considered "indivisible"
In the delay slot, the compiler needs to schedule
  A useful instruction (either before the jmp, or after the jmp w/o side effect)
  Otherwise a NOP

Branch Prediction
Longer pipelines cannot readily determine branch outcome early
  Stall penalty becomes unacceptable
Predict outcome of branch
  Only stall if prediction is wrong
In MIPS pipeline
  Can predict branches not taken
  Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure: pipeline diagrams for the two cases, prediction correct and prediction incorrect]

1-Bit Predictor: Shortcoming
Inner loop branches mispredicted twice!
  outer: ...
         ...
  inner: ...
         ...
         beq ..., ..., inner
         ...
         beq ..., ..., outer
  Mispredict as taken on last iteration of inner loop
  Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor: State Machine
[Figure: four-state predictor state machine]
Only change prediction on two successive mispredictions

More-Realistic Branch Prediction
Static branch prediction
  Based on typical branch behavior
  Example: loop and if-statement branches
    Predict backward branches taken
    Predict forward branches not taken
Dynamic branch prediction
  Hardware measures actual branch behavior
    e.g., record recent history of each branch
  Assume future behavior will continue the trend
    When wrong, stall while re-fetching, and update history
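The difference between the 1-bit and 2-bit schemes shows up on a single loop-closing branch. The sketch below models the 2-bit predictor as a saturating counter (states 0 and 1 predict not taken, 2 and 3 predict taken), which is one common realization and may differ in its transitions from the state machine on the slide; the function names and the branch trace are assumptions made for illustration.

```python
def one_bit_mispredictions(outcomes, prediction=True):
    """1-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken
    return misses


def two_bit_mispredictions(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


# Inner-loop branch: taken three times, then falls through; the loop is entered three times.
TRACE = [True, True, True, False] * 3

print(one_bit_mispredictions(TRACE), two_bit_mispredictions(TRACE))
```

Starting from a "taken" prediction, the 1-bit scheme misses on every loop exit and again on every re-entry after the first, while the 2-bit counter misses only on the exits, because one not-taken outcome is not enough to flip its prediction.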

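AMD Bobcat
[Figure: AMD Bobcat pipeline, from http://hothardware.com. Annotations: "ECE 6, later in this course" and "ECE 6, Instruction Level Parallelism (ILP), later in this course"]

Intel Sandy Bridge
[Figure: Intel Sandy Bridge pipeline, from bdti.com]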

Exceptions and Interrupts (4.9)
"Unexpected" events requiring change in flow of control
  Different ISAs use the terms differently
Exception
  Arises within the CPU
    e.g., undefined opcode, overflow, syscall, ...
Interrupt
  From an external I/O controller
Updates to the datapath
  Recording the cause of the exception and transferring control to the OS
  Consider the impact of hardware modifications on the critical path

Handling Exceptions
In MIPS, exceptions managed by a System Control Coprocessor (CP0)
Save PC of offending (or interrupted) instruction
  In MIPS: Exception Program Counter (EPC)
Save indication of the problem
  In MIPS: Cause register
  We'll assume 1-bit
    0 for undefined opcode, 1 for overflow
Jump to handler at 8000 0180

Exception Handling: Operations
Add two registers to the datapath
  EPC and Cause registers
Add a state for each exception condition
  Use the ALU to compute the EPC contents
  Write the Cause register with exception condition
  Update the PC with OS handler address
  Generate control signals for each operation
See Appendix A.7 for details of MIPS 2000/3000 implementation

The OS Interactions
The MIPS 32 Status and Cause Registers
[Figure: bit layout of the Status register (fields: interrupt mask, user mode, exception level, interrupt enable) and the Cause register (fields: branch delay, pending interrupts, exception code)]
Operating System handlers interrogate these registers
  Manage all state saving requirements
  Example in A.7

An Alternate Mechanism
Vectored Interrupts
  Handler address determined by the cause
Example:
  Undefined opcode: C000 0000
  Overflow:         C000 0020
  ...:              C000 0040
Instructions either
  Deal with the interrupt, or
  Jump to real handler

Handler Actions
Read cause, and transfer to relevant handler
Determine action required
If restartable
  Take corrective action
  Use EPC to return to program
Otherwise
  Terminate program
  Report error using EPC, cause, ...

Exceptions in a Pipeline
Another form of control hazard
Consider overflow on add in EX stage
  add $1, $2, $1
  Prevent $1 from being clobbered
  Complete previous instructions
  Flush add and subsequent instructions
  Set Cause and EPC register values
  Transfer control to handler
Similar to mispredicted branch
  Use much of the same hardware

Pipeline with Exceptions
[Figure: pipelined datapath with exception support]

Exception Properties
Restartable exceptions
  Pipeline can flush the instruction
  Handler executes, then returns to the instruction
    Re-fetched and executed from scratch
PC saved in EPC register
  Identifies causing instruction
  Actually PC + 4 is saved
    Handler must adjust

Exception Example
Exception on add in
  40  sub $11, $2, $4
  44  and $12, $2, $5
  48  or  $13, $2, $6
  4C  add $1, $2, $1
  50  slt $15, $6, $7
  54  lw  $16, 50($7)
  ...
Handler
  80000180  sw $25, 1000($0)
  80000184  sw $26, 1004($0)
  ...

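Exception Example
[Figure: pipelined datapath in the cycle the exception is detected]

Exception Example
[Figure: pipelined datapath in the following cycle]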

Multiple Exceptions
Pipelining overlaps multiple instructions
  Could have multiple exceptions at once
Simple approach: deal with exception from earliest instruction
  Flush subsequent instructions
  "Precise" exceptions
In complex pipelines
  Multiple instructions issued per cycle
  Out-of-order completion
  Maintaining precise exceptions is difficult!

Imprecise Exceptions
Just stop pipeline and save state
  Including exception cause(s)
Let the handler work out
  Which instruction(s) had exceptions
  Which to complete or flush
    May require "manual" completion
Simplifies hardware, but more complex handler software
Not feasible for complex multiple-issue out-of-order pipelines

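Performance
How do we assess the impact of stall cycles?
How close do we approach the ideal of one instruction per cycle execution time?
Back to the CPI model!

Recall: Program Execution Time
With n the number of instruction classes, C_i the instruction count of class i, and CPI_i its cycles per instruction:

$$\text{ExecutionTime} = \left( \sum_{i=1}^{n} C_i \times CPI_i \right) \times \text{cycle\_time} \approx \text{Instruction\_count} \times CPI_{avg} \times \text{clock\_cycle\_time}$$

  Instruction_count: algorithms/compiler
  CPI_avg: architecture
  clock_cycle_time: technology

$$CPI_{avg} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( CPI_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

  Instruction Count_i / Instruction Count is the relative frequency of class i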
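The CPI model reduces to a weighted average, which is easy to evaluate for a given instruction mix. The mix and the 200 ps cycle time below are made-up illustration values, not figures from the lecture, and the function names are hypothetical.

```python
def average_cpi(classes):
    """classes: list of (instruction_count, cpi) pairs, one per instruction class."""
    total_instructions = sum(count for count, _ in classes)
    total_cycles = sum(count * cpi for count, cpi in classes)
    return total_cycles / total_instructions


def execution_time(classes, cycle_time):
    """Sum of C_i * CPI_i over all classes, times the clock cycle time."""
    return sum(count * cpi for count, cpi in classes) * cycle_time


# Hypothetical mix: (instruction count, CPI including stall cycles).
MIX = [(50, 1), (30, 2), (20, 3)]

print(average_cpi(MIX))          # average cycles per instruction
print(execution_time(MIX, 200))  # total time in ps for a 200 ps clock
```

Stall cycles enter the model through the per-class CPI values: a class that never stalls has CPI 1 in an ideal pipeline, and every stall cycle charged to a class raises its CPI and therefore the average.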

Study Guide
Given a code block, and initial register values (those that are accessed), be able to determine state of all pipeline registers at some future clock cycle
Determine the size of each pipeline register
Track pipeline state in the case of forwarding and branches
Compute the number of cycles to execute a code block
Modify the datapath to include forwarding and hazard detection for branches (this is trickier and time consuming but well worth it)

Study Guide (cont.)
Schedule code (manually) to improve performance, for example to eliminate hazards and fill delay slots
Modify the datapath to add new instructions such as j
Modify the datapath to accommodate a two cycle memory access, i.e., the memory itself is a two cycle pipeline
  Modify the forwarding and hazard control logic
Given a code sequence, be able to compute the number of stall cycles

Study Guide (cont.)
Track the state of the 2-bit branch predictor over a sequence of branches in a code segment, for example a for-loop
Show the pipeline state before and after an exception has taken place

Glossary
  Branch prediction
  Branch hazards
  Branch delay
  Control hazard
  Data hazard
  Delay slot
  Dynamic instruction issue
  Forwarding
  Imprecise exception
  Instruction scheduling
  Instruction level parallelism (ILP)
  Load-to-use hazard
  Pipeline bubbles
  Stall cycles
  Static instruction issue
  Structural hazard