What do we have so far? Multi-Cycle Datapath

CPI: R-Type = 4, Load = 5, Store = 4, Branch = 3
Only one instruction is being processed in the datapath at a time.
How can CPI be lowered further?
Lec # 8, Spring 2000, 4-11-2000

Pipelining
Pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped. The next instruction is fetched in the next cycle without waiting for the current instruction to complete.
An instruction execution pipeline involves a number of steps, where each step completes one part of an instruction. Each step is called a pipeline stage or a pipeline segment.
The stages are connected one to the next to form a pipeline: instructions enter at one end, progress through the stages, and exit at the other end when completed.
Pipeline throughput: the instruction completion rate of the pipeline, determined by how often an instruction exits the pipeline.
The time to move an instruction one step down the line is equal to the machine cycle and is determined by the stage with the longest processing delay.
Pipeline latency: the time required to complete an instruction = cycle time × number of pipeline stages.

Single Cycle vs. Pipelining
[Figure: program execution order (in instructions) against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction fetch, Reg, ALU, Data access, Reg.]
Single cycle: each instruction takes 8 ns.
Time for 1000 instructions = 8 × 1000 = 8000 ns
5-stage pipeline: each stage takes 2 ns, and a new instruction starts every 2 ns.
Time for 1000 instructions = time to fill pipeline + cycle time × 1000 = 8 + 2 × 1000 = 2008 ns
Pipelining speedup = 8000 / 2008 = 3.98

Pipelining: Design Goals
The length of the machine clock cycle is determined by the time required for the slowest pipeline stage.
An important pipeline design consideration is to balance the length of each pipeline stage.
If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
    Time per instruction on unpipelined machine / Number of pipe stages
Under these ideal conditions:
    Speedup from pipelining = the number of pipeline stages = k
    One instruction is completed every cycle: CPI = 1.

From MIPS Multi-cycle Datapath: Five Stages of Load
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load     IF       ID       EX       MEM      WB
1- Instruction Fetch (IF): Fetch the instruction from the Instruction Memory.
2- Instruction Decode (ID): Registers fetch and instruction decode.
3- Execute (EX): Calculate the memory address.
4- Memory (MEM): Read the data from the Data Memory.
5- Write Back (WB): Write the data back to the register file.

Pipelined Processing Representation
Time in clock cycles:
Clock cycle   1    2    3    4    5    6    7    8    9
I             IF   ID   EX   MEM  WB
I+1                IF   ID   EX   MEM  WB
I+2                     IF   ID   EX   MEM  WB
I+3                          IF   ID   EX   MEM  WB
I+4                               IF   ID   EX   MEM  WB
Cycles 1 to 4: time to fill the pipeline. Cycle 5: first instruction, I, completed. Cycle 9: last instruction, I+4, completed.
Pipeline stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back
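
The staircase layout above can be generated mechanically. The following is a minimal Python sketch, assuming an ideal pipeline with no stalls; the function names `pipeline_diagram` and `format_diagram` are illustrative and do not come from the lecture.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1.

    Assumes an ideal pipeline: instruction i enters the first stage in
    cycle i+1 and advances one stage per cycle with no stalls.
    """
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for offset, stage in enumerate(stages):
            row[i + offset] = stage
        rows.append(row)
    return rows


def format_diagram(rows):
    """Render the rows as fixed-width text, one line per instruction."""
    lines = []
    for i, row in enumerate(rows):
        label = "I" if i == 0 else f"I+{i}"
        cells = "".join(f"{cell:<5}" for cell in row).rstrip()
        lines.append(f"{label:<5}" + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Five instructions through five stages occupy 5 + 5 - 1 = 9 cycles.
    print(format_diagram(pipeline_diagram(5)))
```

With five instructions the sketch yields nine columns, matching the nine clock cycles in the slide.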

Pipelined Processing Representation
[Figure: six instructions, each passing through IF, ID, EX, MEM, WB. Horizontal axis: time. Vertical axis: program flow.]

Single Cycle, Multi-cycle, vs. Pipeline
Single Cycle Implementation (8 ns clock cycle):
  Cycle 1: Load
  Cycle 2: Store, with 2 ns wasted
Multiple Cycle Implementation:
  Cycle    1    2    3    4    5    6    7    8    9    10
  Load     IF   ID   EX   MEM  WB
  Store                             IF   ID   EX   MEM
  R-type                                                IF
Pipeline Implementation:
  Load     IF   ID   EX   MEM  WB
  Store         IF   ID   EX   MEM  WB
  R-type             IF   ID   EX   MEM  WB

Single Cycle, Multi-cycle, Pipeline: Performance Comparison Example
For 1000 instructions, execution time:
Single cycle machine: 8 ns/cycle × 1 CPI × 1000 inst = 8000 ns
Multicycle machine: 2 ns/cycle × 4.6 CPI (due to inst mix) × 1000 inst = 9200 ns
Ideal pipelined machine, 5 stages: 2 ns/cycle × (1 CPI × 1000 inst + 4 cycle fill) = 2008 ns
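
The three execution times can be checked with a few lines of Python. This is a sketch under the slide's assumptions (ideal pipeline, no stalls); the function names are illustrative only.

```python
def single_cycle_ns(n_inst, cycle_ns):
    """Every instruction takes exactly one (long) clock cycle."""
    return n_inst * cycle_ns


def multi_cycle_ns(n_inst, cycle_ns, avg_cpi):
    """Each instruction takes avg_cpi short cycles on average."""
    return n_inst * avg_cpi * cycle_ns


def pipelined_ns(n_inst, cycle_ns, n_stages):
    """Ideal pipeline: n_stages - 1 fill cycles, then one completion per cycle."""
    return (n_inst + n_stages - 1) * cycle_ns


if __name__ == "__main__":
    n = 1000
    single = single_cycle_ns(n, 8)
    multi = multi_cycle_ns(n, 2, 4.6)
    piped = pipelined_ns(n, 2, 5)
    print(f"single cycle: {single} ns")
    print(f"multi-cycle:  {multi:.0f} ns")
    print(f"pipelined:    {piped} ns")
    print(f"speedup over single cycle: {single / piped:.2f}")
```

Note that the multi-cycle machine is slower than the single-cycle one for this instruction mix, because 4.6 cycles of 2 ns exceed one cycle of 8 ns.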

MIPS Pipeline Stage Identification
IF: Instruction fetch
ID: Instruction decode / register file read
EX: Execute / address calculation
MEM: Memory access
WB: Write back
[Figure: the MIPS datapath (PC, instruction memory, register file, sign extend 16 to 32, shift left 2, adders, ALU, data memory, multiplexors) divided into the five stages above.]
What is needed to divide the datapath into pipeline stages?

MIPS: An Initial Pipelined Datapath
[Figure: the MIPS datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separating the stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory) and WB (Write Back).]
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?

A Corrected Pipelined Datapath
[Figure: the corrected MIPS pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separating the stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory) and WB (Write Back).]

Representing Pipelines Graphically
[Figure: program execution order (in instructions) against time in clock cycles CC 1 to CC 6 for lw $10, 20($1) followed by sub $11, $2, $3, each drawn as IM, Reg, ALU, DM, Reg.]
Can help with answering questions like:
  How many cycles does it take to execute this code?
  What is the ALU doing during cycle 4?
Use this representation to help understand datapaths.

Adding Pipeline Control Points
[Figure: the pipelined datapath annotated with its control signals: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead and MemtoReg, plus the instruction fields [15-0], [20-16] and [15-11].]

Pipeline Control
Pass needed control signals along from one stage to the next as the instruction travels through the pipeline, just like the data.

             Execution/Address Calculation     Memory access                Write-back
             stage control lines               stage control lines          stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0           1         0
lw           0       0       0       1         0       1        0           1         1
sw           X       0       0       1         0       0        1           0         X
beq          X       0       1       0         1       0        0           0         X

[Figure: Control generates the WB, M and EX fields, which are stored in ID/EX; WB and M continue to EX/MEM; WB continues to MEM/WB.]

Pipeline Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: control signals carried by each pipeline register.
  ID/Ex register: ExtOp, ALUSrc, ALUOp, RegDst, MemWr, Branch, MemtoReg, RegWr
  Ex/Mem register: MemWr, Branch, MemtoReg, RegWr
  Mem/Wr register: MemtoReg, RegWr]

Pipelined Datapath with Control Added
[Figure: the pipelined datapath with the Control unit and the WB, M and EX control fields carried through ID/EX, EX/MEM and MEM/WB.]
Target address of branch determined in MEM.

Basic Performance Issues In Pipelining
Pipelining increases the CPU instruction throughput: the number of instructions completed per unit time. Under ideal conditions, instruction throughput is one instruction per machine cycle, or CPI = 1.
Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
It usually slightly increases the execution time of each instruction over unpipelined implementations, due to the increased control overhead of the pipeline and the delays of the pipeline stage registers.

Pipelining Performance Example
Example: For an unpipelined machine:
  Clock cycle = 10 ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20% and 40%, respectively.
  If pipelining adds 1 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:
Non-pipelined average instruction execution time = Clock cycle × Average CPI
  = 10 ns × ((40% + 20%) × 4 + 40% × 5) = 10 ns × 4.4 = 44 ns
In the pipelined implementation five stages are used, with an average instruction execution time of: 10 ns + 1 ns = 11 ns
Speedup from pipelining = Time unpipelined / Time pipelined = 44 ns / 11 ns = 4 times
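
The arithmetic of this example can be reproduced in Python. This is a small sketch; the helper names and the `MIX` constant are illustrative, and the pipelined machine is assumed to complete one instruction per (lengthened) clock cycle, as in the slide.

```python
# (frequency, cycles) pairs for ALU operations, branches and memory operations.
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]


def average_cpi(mix):
    """Weighted average of cycles per instruction over the instruction mix."""
    return sum(freq * cycles for freq, cycles in mix)


def pipelining_speedup(clock_ns, mix, overhead_ns):
    """Unpipelined time per instruction divided by pipelined time per instruction.

    Assumes the pipelined machine has CPI = 1 and a clock cycle lengthened
    by overhead_ns.
    """
    unpipelined_ns = clock_ns * average_cpi(mix)
    pipelined_ns = clock_ns + overhead_ns
    return unpipelined_ns / pipelined_ns


if __name__ == "__main__":
    print(f"average CPI: {average_cpi(MIX):.1f}")
    print(f"speedup:     {pipelining_speedup(10, MIX, 1):.1f}")
```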

Pipeline Hazards
Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle, resulting in one or more stall cycles.
Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
  Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
  Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC.

Structural Hazards
In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
If a resource conflict arises because a hardware resource is required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred. For example:
  when a machine has only one register file write port
  or when a pipelined machine has a single shared memory for data and instructions.
In these cases, stall the pipeline for one cycle for register writes or memory data access.

Structural Hazard Example: Single Memory For Instructions & Data
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) against time in clock cycles; each instruction uses Mem for its fetch and, where needed, Mem again for its data access, so the load's data access coincides with a later instruction's fetch.]
Detection is easy in this case (right half highlight means read, left half write).

Data Hazards Example
Problem with starting the next instruction before the first is finished.
Data dependencies here that go backward in time create data hazards.
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
Value of register $2 in CC 1 to CC 9: 10, 10, 10, 10, 10/-20, -20, -20, -20, -20
[Figure: program execution order (in instructions) against time in clock cycles CC 1 to CC 9; each instruction is drawn as IM, Reg, ALU, DM, Reg.]

Data Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles.
The control unit must detect the need to insert stall cycles.
In this case two stall cycles are needed.
Value of register $2 in CC 1 to CC 11: 10, 10, 10, 10, 10/-20, -20, -20, -20, -20, -20, -20
Instruction       CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9  CC 10 CC 11
sub $2, $1, $3    IM    Reg   ALU   DM    Reg
and $12, $2, $5         IM    STALL STALL Reg   ALU   DM    Reg
or $13, $6, $2                STALL STALL IM    Reg   ALU   DM    Reg
add $14, $2, $2                                 IM    Reg   ALU   DM    Reg
sw $15, 100($2)                                       IM    Reg   ALU   DM    Reg
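
The stall count can be derived with a small in-order model. The Python sketch below is an illustration, not the control unit's actual logic: it assumes a five-stage pipeline without forwarding, in which the register file is written in the first half of a cycle and read in the second half, so a consumer may read its operands in the same cycle in which the producer writes back. The program encoding (destination register, list of source registers) and the function name `count_stalls` are hypothetical.

```python
def count_stalls(program):
    """Return (stall_cycles, total_cycles) for an in-order 5-stage pipeline.

    program: list of (dest, sources) tuples; dest is None when the
    instruction writes no register. No forwarding is modeled: a source
    register can be read in the register-read stage no earlier than the
    cycle in which its producer is in write back.
    """
    wb_cycle = {}        # register -> write-back cycle of its latest producer
    prev_read = 1        # the first instruction reads registers in cycle 2
    stalls = 0
    for dest, sources in program:
        earliest = prev_read + 1
        ready = max((wb_cycle.get(reg, 0) for reg in sources), default=0)
        read_cycle = max(earliest, ready)
        stalls += read_cycle - earliest
        if dest is not None:
            wb_cycle[dest] = read_cycle + 3   # Reg -> ALU -> DM -> Reg
        prev_read = read_cycle
    return stalls, prev_read + 3


if __name__ == "__main__":
    example = [
        (2, [1, 3]),      # sub $2, $1, $3
        (12, [2, 5]),     # and $12, $2, $5
        (13, [6, 2]),     # or  $13, $6, $2
        (14, [2, 2]),     # add $14, $2, $2
        (None, [15, 2]),  # sw  $15, 100($2)
    ]
    print(count_stalls(example))
```

For the five-instruction example the model reports two stall cycles and eleven clock cycles in total, in agreement with the diagram.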

Performance of Pipelines with Stalls
Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles, degrading performance from the ideal CPI of 1.
  CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then:
  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
When all instructions take the same number of cycles, equal to the number of pipeline stages, then:
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
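
A short Python sketch of the two formulas, assuming an ideal CPI of 1, no pipelining overhead and perfectly balanced stages as stated above. The function names and the sample stall rate of 0.25 cycles per instruction are illustrative, not taken from the lecture.

```python
def pipelined_cpi(stall_cycles_per_instruction, ideal_cpi=1.0):
    """CPI pipelined = ideal CPI + pipeline stall clock cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instruction


def speedup_from_depth(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction).

    Valid when every unpipelined instruction takes pipeline_depth cycles.
    """
    return pipeline_depth / pipelined_cpi(stall_cycles_per_instruction)


if __name__ == "__main__":
    # Hypothetical stall rate: 0.25 stall cycles per instruction, 5 stages.
    print(speedup_from_depth(5, 0.0))
    print(speedup_from_depth(5, 0.25))
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction added to the denominator pulls it below that ideal.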

Data Hazard Resolution: Compiler Scheduling
The compiler can guarantee that no data hazards exist by re-ordering instructions and/or adding NOP instructions where needed.
For the previous example:
  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Data Hazard Resolution: Forwarding
Observation: Why not use temporary results produced by the memory/ALU instead of waiting for them to be written back to the register bank?
Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) that makes use of this observation to eliminate or minimize data hazard stalls.
Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.).

Data Hazard Resolution: Forwarding
Register file forwarding to handle read/write to the same register
ALU forwarding

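Pipelined Datapath With Forwarding
[Figure: the pipelined datapath with a forwarding unit. The unit compares the source register numbers of the instruction in EX (Rs, Rt, carried from IF/ID.RegisterRs and IF/ID.RegisterRt) with EX/MEM.RegisterRd and MEM/WB.RegisterRd and selects the ALU inputs accordingly. IF/ID.RegisterRt and IF/ID.RegisterRd are also carried forward as the candidate destination registers.]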
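
The comparison performed by the forwarding unit can be sketched in Python. The conditions below follow the common textbook formulation and are an assumption, since the slide only names the signals: forward from EX/MEM when the instruction there writes the needed register, otherwise from MEM/WB, and never forward register $0. The function name and the string return values are illustrative.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU operand comes from.

    src_reg is the Rs or Rt number of the instruction now in EX.
    The EX/MEM result is checked first because it is the most recent value.
    Register 0 is hardwired to zero in MIPS, so it is never forwarded.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # sub $2, $1, $3 is in MEM while and $12, $2, $5 is in EX.
    print(forward_select(2, True, 2, False, 0))
    # One cycle later: and is in MEM, sub is in WB, or $13, $6, $2 is in EX.
    print(forward_select(2, True, 12, True, 2))
    # Operand $6 of the same or instruction has no pending producer.
    print(forward_select(6, True, 12, True, 2))
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result must win.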

Data Hazard Example With Forwarding
Time (in clock cycles)
                        CC 1    CC 2    CC 3    CC 4    CC 5    CC 6    CC 7    CC 8    CC 9
Value of register $2:   10      10      10      10      10/-20  -20     -20     -20     -20
Value of EX/MEM:        X       X       X       -20     X       X       X       X       X
Value of MEM/WB:        X       X       X       X       -20     X       X       X       X
Program execution order (in instructions):
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
[Figure: each instruction drawn as IM, Reg, ALU, DM, Reg, starting one cycle after the previous one, with no stalls.]

A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value.
Program execution order (in instructions):
  lw  $2, 20($1)
  and $4, $2, $5
  or  $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: the five instructions against time in clock cycles CC 1 to CC 9, each drawn as IM, Reg, ALU, DM, Reg.]
Even with forwarding in place, a stall cycle is needed.
This condition must be detected by hardware.

A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value.
Instruction       CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9  CC 10
lw $2, 20($1)     IM    Reg   ALU   DM    Reg
and $4, $2, $5          IM    Reg   Reg   ALU   DM    Reg
or $8, $2, $6                 IM    IM    Reg   ALU   DM    Reg
add $9, $4, $2                            IM    Reg   ALU   DM    Reg
slt $1, $6, $7                                  IM    Reg   ALU   DM    Reg
A bubble is inserted into the pipeline in CC 4.
We can stall the pipeline by keeping an instruction in the same stage.

Compiler Scheduling Example
Reorder the instructions to avoid as many pipeline stalls as possible:
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
The data hazard occurs on register $16 between the second lw and the first sw, resulting in a stall cycle.
With forwarding we need to find only one independent instruction to place between them; swapping the lw instructions works:
  lw $16, 4($2)
  lw $15, 0($2)
  sw $16, 0($2)
  sw $15, 4($2)
Without forwarding we need three independent instructions to place between them, so in addition two nops are added:
  lw $16, 4($2)
  lw $15, 0($2)
  nop
  nop
  sw $16, 0($2)
  sw $15, 4($2)

Datapath With Hazard Detection Unit
A load followed by an instruction that uses the loaded value is detected and a stall cycle is inserted.
[Figure: the pipelined datapath with forwarding unit and a hazard detection unit. The hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and the selection of the control values written into ID/EX.]
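
The detection condition can be written down directly from the signals named in the figure. The Python sketch below assumes the usual textbook load-use test (the slide itself does not spell the condition out); the function name is illustrative.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return True when the instruction in ID must be held for one cycle.

    A stall is needed when the instruction in EX is a load
    (ID/EX.MemRead asserted) whose destination register ID/EX.RegisterRt
    matches either source register of the instruction in ID. On a stall
    the hardware would deassert PCWrite and IF/IDWrite and send zeroed
    control signals (a bubble) into ID/EX.
    """
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


if __name__ == "__main__":
    # lw $2, 20($1) in EX, and $4, $2, $5 in ID: the loaded value is needed.
    print(load_use_stall(True, 2, 2, 5))
    # lw $2, 20($1) in EX, slt $1, $6, $7 in ID: no dependence.
    print(load_use_stall(True, 2, 6, 7))
    # An R-type instruction in EX never triggers this stall.
    print(load_use_stall(False, 2, 2, 5))
```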

Control Hazards: Example
Three other instructions are in the pipeline before the branch target decision is made, when BEQ is in the MEM stage.
Program execution order (in instructions), with instruction addresses:
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)
[Figure: the five instructions against time in clock cycles CC 1 to CC 9, each drawn as IM, Reg, ALU, DM, Reg.]
In the above diagram, we are predicting branch not taken.
If the prediction is wrong, hardware must flush the three following instructions, losing three cycles.

Reducing Delay of Taken Branches
Next PC of a branch known in MEM stage: costs three lost cycles if taken.
If next PC is known in EX stage, one cycle is saved.
Branch address calculation can be moved to the ID stage using a register comparator, costing only one cycle if the branch is taken.
[Figure: the pipelined datapath with the branch adder, shift left 2 and an equality comparator (=) on the register outputs moved into ID, an IF.Flush control line, the hazard detection unit and the forwarding unit.]

Pipeline Performance Example
Assume the following MIPS instruction mix:
  Type          Frequency
  Arith/Logic   40%
  Load          30%   of which 25% are followed immediately by an instruction using the loaded value
  Store         10%
  Branch        20%   of which 45% are taken
What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage?
CPI = Ideal CPI + Pipeline stall clock cycles per instruction
    = 1 + stalls by loads + stalls by branches
    = 1 + 0.3 × 0.25 × 1 + 0.2 × 0.45 × 1
    = 1 + 0.075 + 0.09
    = 1.165
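
The same calculation as a Python sketch. It assumes, as the slide does, one stall cycle per load-use hazard (forwarding in place) and one lost cycle per taken branch (branch resolved in ID); the function and parameter names are illustrative.

```python
def pipelined_cpi(load_freq, load_use_fraction, branch_freq, taken_fraction,
                  load_stall_cycles=1, branch_penalty_cycles=1, ideal_cpi=1.0):
    """CPI = ideal CPI + stalls from load-use hazards + stalls from taken branches."""
    load_stalls = load_freq * load_use_fraction * load_stall_cycles
    branch_stalls = branch_freq * taken_fraction * branch_penalty_cycles
    return ideal_cpi + load_stalls + branch_stalls


if __name__ == "__main__":
    cpi = pipelined_cpi(load_freq=0.30, load_use_fraction=0.25,
                        branch_freq=0.20, taken_fraction=0.45)
    print(f"CPI = {cpi:.3f}")
```

Arithmetic/logic and store instructions do not appear in the formula because, under these assumptions, they cause no stalls.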