What do we have so far? Multi-Cycle Datapath

Colin O’Brien’
6 years ago

1 What do we have so far? Multi-cycle Datapath
CPI: R-Type = 4, Load = 5, Store = 4, Branch = 3.
Only one instruction is processed in the datapath at a time.
How to lower CPI further?
#1 Lec # 8 Spring

2 Pipelining
Pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped. The next instruction is fetched in the next cycle without waiting for the current instruction to complete.
An instruction execution pipeline involves a number of steps, where each step completes one part of an instruction. Each step is called a pipeline stage or a pipeline segment.
The stages or steps are connected one to the next to form a pipeline: instructions enter at one end, progress through the stages, and exit at the other end when completed.
Pipeline Throughput: The instruction completion rate of the pipeline, determined by how often an instruction exits the pipeline.
The time to move an instruction one step down the line is equal to the machine cycle and is determined by the stage with the longest processing delay.
Pipeline Latency: The time required to complete an instruction: Cycle time x Number of pipeline stages.
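The definitions above can be sketched in a few lines of Python. The stage delays used here (2, 1, 2, 2, 1 ns) are an assumption chosen for illustration, not values given on the slide, and `pipeline_timing` is a hypothetical helper name.

```python
def pipeline_timing(stage_delays_ns):
    """Return (cycle time, latency, throughput) for a pipeline.

    The machine cycle is set by the slowest stage; latency is
    cycle time x number of stages; ideal throughput is one
    instruction per cycle.
    """
    cycle_ns = max(stage_delays_ns)
    latency_ns = cycle_ns * len(stage_delays_ns)
    throughput_per_ns = 1 / cycle_ns
    return cycle_ns, latency_ns, throughput_per_ns


# Assumed per-stage delays in ns (illustrative only).
stage_delays = [2, 1, 2, 2, 1]
cycle_ns, latency_ns, throughput = pipeline_timing(stage_delays)
print(f"cycle = {cycle_ns} ns, latency = {latency_ns} ns, "
      f"throughput = {throughput} instructions/ns")
```

Note that the latency (10 ns under these assumed delays) is longer than the sum of the raw stage delays (8 ns), because every stage is stretched to the length of the slowest one.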

3 Single Cycle Vs. Pipelining
[Figure: program execution order (in instructions) against time for three consecutive lw instructions loading $1, $2 and $3, each passing through Instruction fetch, Reg, ALU, Data access, Reg.]
Single Cycle: each instruction takes 8 ns.
  Time for 1000 instructions = 8 x 1000 = 8000 ns
5 Stage Pipeline: a new instruction starts every 2 ns.
  Time for 1000 instructions = time to fill pipeline + cycle time x 1000 = 8 + 2 x 1000 = 2008 ns
Pipelining Speedup = 8000/2008 = 3.98

4 Pipelining: Design Goals
The length of the machine clock cycle is determined by the time required for the slowest pipeline stage.
An important pipeline design consideration is to balance the length of each pipeline stage.
If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
  Time per instruction on unpipelined machine / Number of pipe stages
Under these ideal conditions:
  Speedup from pipelining = the number of pipeline stages = k
  One instruction is completed every cycle: CPI = 1.

5 From MIPS Multi-cycle Datapath: Five Stages of Load
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    IF       ID       EX       MEM      WB
1- Instruction Fetch (IF): Fetch the instruction from the Instruction Memory.
2- Instruction Decode (ID): Registers fetch and instruction decode.
3- Execute (EX): Calculate the memory address.
4- Memory (MEM): Read the data from the Data Memory.
5- Write Back (WB): Write the data back to the register file.

6 Pipelined Processing Representation
Time in clock cycles (clock cycle number) against instruction number:
Instruction  1    2    3    4    5    6    7    8    9
I            IF   ID   EX   MEM  WB
I+1               IF   ID   EX   MEM  WB
I+2                    IF   ID   EX   MEM  WB
I+3                         IF   ID   EX   MEM  WB
I+4                              IF   ID   EX   MEM  WB
Time to fill the pipeline: cycles 1 to 4. First instruction, I, completed in cycle 5. Last instruction, I+4, completed in cycle 9.
Pipeline Stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back.
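As a rough illustration of the representation above, the following sketch builds the same staircase for any number of instructions, assuming an ideal five-stage pipeline with no stalls. The function name `pipeline_diagram` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1.

    Instruction i enters IF in cycle i+1 and advances one stage per cycle,
    so the whole run takes num_instructions + len(STAGES) - 1 cycles.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        cells = [""] * total_cycles
        for offset, stage in enumerate(STAGES):
            cells[i + offset] = stage
        rows.append(cells)
    return rows


for i, row in enumerate(pipeline_diagram(5)):
    label = "I" if i == 0 else f"I+{i}"
    print(f"{label:<5}" + "".join(f"{cell:<5}" for cell in row))
```

With five instructions the diagram spans nine cycles: four to fill the pipeline, then one completion per cycle.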

7 Pipelined Processing Representation
[Figure: six instructions, each passing through IF ID EX MEM WB and each starting one cycle after the previous one. Horizontal axis: Time. Vertical axis: Program Flow.]

8 Single Cycle, Multi-Cycle, Vs. Pipeline
Single Cycle Implementation (8 ns clock): Cycle 1: Load. Cycle 2: Store, with 2 ns wasted.
Multiple Cycle Implementation (Cycles 1 to 10):
  Load:   IF ID EX MEM WB
  Store:  IF ID EX MEM
  R-type: IF ...
Pipeline Implementation:
  Load:   IF ID EX MEM WB
  Store:     IF ID EX MEM WB
  R-type:       IF ID EX MEM WB

9 Single Cycle, Multi-Cycle, Pipeline: Performance Comparison Example
For 1000 instructions, execution time:
  Single Cycle Machine: 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns
  Multicycle Machine: 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns
  Ideal pipelined machine, 5 stages: 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns
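The comparison can be reproduced with a short sketch. The default cycle times and the 4.6 multi-cycle CPI follow the example; the instruction count of 1000 and the function names are assumptions made for illustration.

```python
def single_cycle_time(n_instructions, cycle_ns=8):
    # Every instruction takes exactly one (long) cycle: CPI = 1.
    return cycle_ns * 1 * n_instructions


def multi_cycle_time(n_instructions, cycle_ns=2, cpi=4.6):
    # Short cycle, but several cycles per instruction on average.
    return cycle_ns * cpi * n_instructions


def pipelined_time(n_instructions, cycle_ns=2, stages=5):
    # Ideal pipeline: (stages - 1) cycles to fill, then one completion per cycle.
    return cycle_ns * (n_instructions + stages - 1)


n = 1000  # assumed instruction count
t_single = single_cycle_time(n)
t_multi = multi_cycle_time(n)
t_pipe = pipelined_time(n)
print(f"single cycle: {t_single} ns")
print(f"multi-cycle:  {t_multi:.0f} ns")
print(f"pipelined:    {t_pipe} ns")
print(f"speedup over single cycle: {t_single / t_pipe:.2f}")
```

As the instruction count grows, the fill time becomes negligible and the speedup over the single-cycle machine approaches the ratio of the two cycle times.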

10 MIPS Pipeline Stage Identification
IF: Instruction fetch. ID: Instruction decode/register file read. EX: Execute/address calculation. MEM: Memory access. WB: Write back.
[Figure: the single-cycle MIPS datapath (PC, instruction memory, registers, sign extend, ALU, data memory) divided into the five stages.]
What is needed to divide the datapath into pipeline stages?

11 MIPS: An Initial Pipelined Datapath
[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separating the stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory) and WB (Write Back).]
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

12 A Corrected Pipelined Datapath
[Figure: the corrected pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB separating the stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory) and WB (Write Back).]

13 Representing Pipelines Graphically
[Figure: Time (in clock cycles, CC 1 to CC 6) against program execution order (in instructions) for lw $1, 2($1) and sub $11, $2, $3, each drawn as IM, Reg, ALU, DM, Reg.]
Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4?
Use this representation to help understand datapaths.

14 Adding Pipeline Control Points
[Figure: the pipelined datapath annotated with its control points: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite and MemtoReg, with the ALU control driven by Instruction [15-0] and the write register selected from Instruction [20-16] or Instruction [15-11].]

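15 Pipeline Control
Pass the needed control signals along from one stage to the next as the instruction travels through the pipeline, just like the data.
             Execution/Address Calculation     Memory access               Write-back
             stage control lines               stage control lines         stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0          1         0
lw           0       0       0       1         0       1        0          1         1
sw           X       0       0       1         0       0        1          0         X
beq          X       0       1       0         1       0        0          0         X
[Figure: Control generates the EX, M and WB fields, which are carried through the ID/EX, EX/MEM and MEM/WB pipeline registers; each stage uses its own field and passes the rest on.]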

16 Pipeline Control
The Main Control generates the control signals during Reg/Dec.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Figure: the signals travelling through the IF/ID, ID/Ex, Ex/Mem and Mem/Wr registers. ExtOp, ALUSrc, ALUOp and RegDst stop at Exec; MemWr and Branch stop at Mem; MemtoReg and RegWr continue to Wr.]
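A minimal sketch of this idea: the control bundle is produced once at decode and each later stage peels off only the group it needs. The grouping into EX, MEM and WB and the signal values for lw follow the usual textbook MIPS control settings and are assumptions here; `advance` is a hypothetical helper.

```python
def advance(bundle, stage):
    """Consume the control group for `stage`; return (signals used, remaining bundle)."""
    used = bundle[stage]
    remaining = {name: group for name, group in bundle.items() if name != stage}
    return used, remaining


# Control bundle for lw as generated during decode (assumed textbook values).
lw_bundle = {
    "EX": {"RegDst": 0, "ALUOp": "00", "ALUSrc": 1},
    "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
    "WB": {"RegWrite": 1, "MemtoReg": 1},
}

bundle = lw_bundle
for cycles_later, stage in enumerate(["EX", "MEM", "WB"], start=1):
    used, bundle = advance(bundle, stage)
    print(f"{cycles_later} cycle(s) after decode, {stage} uses {used}; "
          f"still carried: {sorted(bundle)}")
```

The bundle shrinks by one group per stage, which is why each successive pipeline register needs fewer control bits than the one before it.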

17 Pipelined Datapath with Control Added
[Figure: the pipelined datapath with the Control unit in ID and the WB, M and EX control fields carried through ID/EX, EX/MEM and MEM/WB. Signals shown: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]
Target address of branch determined in MEM.

18 Basic Performance Issues In Pipelining
Pipelining increases the CPU instruction throughput: the number of instructions completed per unit time. Under ideal conditions, instruction throughput is one instruction per machine cycle, or CPI = 1.
Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
It usually slightly increases the execution time of each instruction over unpipelined implementations, due to the increased control overhead of the pipeline and the delays of the pipeline stage registers.

19 Pipelining Performance Example
Example: For an unpipelined machine: Clock cycle = 10 ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20% and 40%, respectively.
If pipelining adds 1 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:
  Non-pipelined average instruction execution time = Clock cycle x Average CPI
    = 10 ns x ((40% + 20%) x 4 + 40% x 5) = 10 ns x 4.4 = 44 ns
  In the pipelined implementation, five stages are used with an average instruction execution time of: 10 ns + 1 ns = 11 ns
  Speedup from pipelining = time unpipelined / time pipelined = 44 ns / 11 ns = 4 times
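The arithmetic of this example can be checked with a short sketch. The numbers are those of the example (10 ns clock, a 40/20/40 instruction mix, 1 ns of pipelining overhead); the variable and function names are made up for illustration.

```python
def average_cpi(mix):
    """mix: list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


clock_ns = 10
overhead_ns = 1
# (frequency, cycles): ALU operations, branches, memory operations.
mix = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_ns = clock_ns * average_cpi(mix)   # average time per instruction
pipelined_ns = clock_ns + overhead_ns          # one instruction per (longer) cycle
speedup = unpipelined_ns / pipelined_ns

print(f"unpipelined: {unpipelined_ns:.1f} ns per instruction")
print(f"pipelined:   {pipelined_ns} ns per instruction")
print(f"speedup:     {speedup:.2f}")
```

The speedup falls short of the five-stage ideal because the unpipelined machine averages only 4.4 cycles per instruction and the pipelined clock is slightly slower.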

20 Pipeline Hazards
Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle, resulting in one or more stall cycles.
Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
  Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
  Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC.

21 Structural Hazards
In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
If a resource conflict arises because a hardware resource is required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred. For example:
  when a machine has only one register file write port, or
  when a pipelined machine has a single shared memory for data and instructions.
In these cases, stall the pipeline for one cycle for the register write or the memory data access.

22 Structural Hazard Example: Single Memory For Instructions & Data
[Figure: Time (clock cycles) against instruction order for Load, Instr 1, Instr 2, Instr 3 and Instr 4, each drawn as Mem, Reg, ALU, Mem, Reg; the Load's data access and a later instruction's fetch need the single memory in the same cycle.]
Detection is easy in this case (right half highlight means read, left half write).

23 Data Hazards Example
Problem with starting the next instruction before the first is finished: data dependencies that go backward in time create data hazards.
  sub $2, $1, $3
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 1($2)
[Figure: Time (in clock cycles, CC 1 to CC 9) against program execution order (in instructions) for the five instructions, each drawn as IM, Reg, ALU, DM, Reg, with a row tracking the value of register $2 in each cycle.]

24 Data Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles. The control unit must detect the need to insert stall cycles. In this case two stall cycles are needed.
[Figure: Time (in clock cycles, CC 1 to CC 11) against program execution order (in instructions) for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 1($2). Two STALL cycles delay the and and or instructions until the new value of register $2 is available.]

25 Performance of Pipelines with Stalls
Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles, thus degrading performance from the ideal CPI of 1.
  CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then:
  Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
When all instructions take the same number of cycles, equal to the number of pipeline stages, then:
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
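These formulas translate directly into code. The sketch below assumes an ideal CPI of 1, no pipelining overhead and balanced stages, as the slide does; the function names and the sample stall rate of 0.25 cycles per instruction are illustrative assumptions.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    # CPI pipelined = Ideal CPI + stall cycles per instruction.
    return ideal_cpi + stall_cycles_per_instr


def speedup_vs_unpipelined(cpi_unpipelined, stall_cycles_per_instr):
    # With overhead ignored and balanced stages, cycle times cancel out.
    return cpi_unpipelined / pipelined_cpi(stall_cycles_per_instr)


depth = 5  # every unpipelined instruction takes `depth` cycles
for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instr = {stalls}: CPI = {pipelined_cpi(stalls)}, "
          f"speedup = {speedup_vs_unpipelined(depth, stalls)}")
```

With no stalls the speedup equals the pipeline depth; an average of one stall cycle per instruction halves it.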

26 Data Hazard Resolution: Compiler Scheduling
The compiler can guarantee that no data hazards exist by re-ordering instructions and/or adding NOP instructions where needed. For the previous example:
  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or $13, $6, $2
  add $14, $2, $2
  sw $15, 1($2)
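The nop-insertion half of this technique can be sketched as follows. The sketch assumes, as in the example above, that a dependent instruction must be separated from its producer by two instruction slots when there is no forwarding. The tuple representation of instructions and the name `insert_nops` are hypothetical, and no re-ordering is attempted.

```python
def insert_nops(program, gap=2):
    """Insert nops so a consumer is at least `gap` slots behind its producer.

    program: list of (text, dest_register_or_None, source_registers).
    Returns the scheduled instruction texts.
    """
    scheduled = []
    for text, dest, sources in program:
        needed = 0
        # Examine the last `gap` issued slots, nearest first.
        recent = reversed(scheduled[-gap:])
        for distance, (_, prev_dest, _) in enumerate(recent, start=1):
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - distance + 1)
        scheduled.extend([("nop", None, ())] * needed)
        scheduled.append((text, dest, sources))
    return [text for text, _, _ in scheduled]


program = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2",)),
    ("sw $15, 1($2)", None, ("$15", "$2")),
]
for line in insert_nops(program):
    print(line)
```

Only the and instruction needs padding: once it has been pushed two slots back, the later readers of $2 are already far enough from the sub.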

27 Data Hazard Resolution: Forwarding
Observation: Why not use temporary results produced by the memory/ALU instead of waiting for them to be written back to the register bank?
Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) that makes use of this observation to eliminate or minimize data hazard stalls.
Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.).

28 Data Hazard Resolution: Forwarding
[Figure: two forwarding paths. Register file forwarding handles a read and a write to the same register in the same cycle; ALU forwarding feeds a result back to the ALU inputs.]

29 Pipelined Datapath With Forwarding
[Figure: the pipelined datapath (PC, instruction memory, registers, ALU, data memory) with a forwarding unit. Labels: IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, Rs, Rt, Rd, EX/MEM.RegisterRd, MEM/WB.RegisterRd, Forwarding unit.]
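The decision made by a forwarding unit for one ALU operand can be sketched as below. The conditions follow the standard textbook MIPS forwarding rules and are an assumption about this particular datapath; `forward_select` and its string return values are hypothetical names used for illustration.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU operand should come from.

    src_reg is the Rs or Rt number of the instruction now in EX.
    Register 0 is never forwarded because it always reads as zero.
    The EX/MEM result is checked first because it is the most recent.
    """
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


# sub $2, $1, $3 followed by and $12, $2, $5:
# when `and` is in EX, `sub` is in MEM, so Rs = $2 comes from EX/MEM.
print(forward_select(2, True, 2, False, 0))
# or $13, $6, $2 one cycle later: `and` ($12) is in MEM, `sub` ($2) is in WB.
print(forward_select(2, True, 12, True, 2))
```

The same function would be evaluated twice per cycle, once for Rs and once for Rt, to drive the two ALU input multiplexers.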

30 Data Hazard Example With Forwarding
[Figure: Time (in clock cycles, CC 1 to CC 9) against program execution order (in instructions) for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 1($2), with rows tracking the value of register $2, the value of EX/MEM and the value of MEM/WB in each cycle.]

31 A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value:
  lw $2, 2($1)
  and $4, $2, $5
  or $8, $2, $6
  add $9, $4, $2
  slt $1, $6, $7
[Figure: Time (in clock cycles, CC 1 to CC 9) against program execution order (in instructions) for the five instructions, each drawn as IM, Reg, ALU, DM, Reg.]
Even with forwarding in place a stall cycle is needed. This condition must be detected by hardware.

32 A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value.
[Figure: Time (in clock cycles, CC 1 to CC 10) against program execution order (in instructions) for lw $2, 2($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. A bubble is inserted after the lw and the following instructions are delayed by one cycle.]
We can stall the pipeline by keeping an instruction in the same stage.

33 Compiler Scheduling Example
Reorder the instructions to avoid as many pipeline stalls as possible:
  lw $15, 0($2)
  lw $16, 4($2)
  sw $16, 0($2)
  sw $15, 4($2)
The data hazard occurs on register $16 between the second lw and the first sw, resulting in a stall cycle.
With forwarding we need to find only one independent instruction to place between them. Swapping the lw instructions works:
  lw $16, 4($2)
  lw $15, 0($2)
  sw $16, 0($2)
  sw $15, 4($2)
Without forwarding we need three independent instructions to place between them, so in addition two nops are added:
  lw $16, 4($2)
  lw $15, 0($2)
  nop
  nop
  sw $16, 0($2)
  sw $15, 4($2)

34 Datapath With Hazard Detection Unit
A load followed by an instruction that uses the loaded value is detected and a stall cycle is inserted.
[Figure: the pipelined datapath with forwarding unit and an added hazard detection unit. Labels: Hazard detection unit, ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, PCWrite, IF/IDWrite, EX/MEM.RegisterRd, MEM/WB.RegisterRd, Forwarding unit.]
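The load-use check performed by such a unit can be sketched as follows. The condition is the standard textbook one built from the signals named in the figure (ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt); treating PCWrite and IF/IDWrite as booleans and returning them in a dictionary is an assumption made for illustration.

```python
def hazard_detect(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Return the stall decision for the instruction currently in ID.

    A stall is needed when the instruction in EX is a load (MemRead set)
    whose destination (Rt) is a source of the instruction in ID.
    """
    stall = bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "stall": stall,
        "PCWrite": not stall,      # hold the PC so the next fetch repeats
        "IF/IDWrite": not stall,   # hold IF/ID so the decode repeats
    }


# lw $2, 2($1) in EX while and $4, $2, $5 is in ID: stall for one cycle.
print(hazard_detect(True, 2, 2, 5))
# One cycle later the lw has left EX, so no further stall is needed.
print(hazard_detect(False, 0, 2, 5))
```

While the stall is asserted, the control signals entering ID/EX would also be zeroed so that a bubble, not a second copy of the instruction, moves down the pipeline.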

35 Control Hazards: Example
Three other instructions are in the pipeline before the branch target decision is made, when BEQ is in the MEM stage.
[Figure: Time (in clock cycles, CC 1 to CC 9) against program execution order (in instructions):
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or $13, $6, $2
  52 add $14, $2, $2
  72 lw $4, 5($7)]
In the above diagram, we are predicting branch not taken. Hardware must be added to flush the three following instructions if we are wrong, losing three cycles.

36 Reducing Delay of Taken Branches
Next PC of a branch known in MEM stage: costs three lost cycles if taken.
If next PC is known in EX stage, one cycle is saved.
Branch address calculation can be moved to ID stage using a register comparator, costing only one cycle if branch is taken.
[Figure: the pipelined datapath with the branch adder, shift left 2 and a register comparator (=) moved into ID, plus an IF.Flush signal. Other labels: Hazard detection unit, Control, Sign extend, Forwarding unit.]

37 Pipeline Performance Example
Assume the following MIPS instruction mix:
  Type         Frequency
  Arith/Logic  40%
  Load         30%, of which 25% are followed immediately by an instruction using the loaded value
  Store        10%
  Branch       20%, of which 45% are taken
What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in ID stage?
  CPI = Ideal CPI + Pipeline stall clock cycles per instruction
      = 1 + stalls by loads + stalls by branches
      = 1 + 0.30 x 0.25 x 1 + 0.20 x 0.45 x 1
      = 1 + 0.075 + 0.09
      = 1.165
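A short sketch of this calculation, assuming one stall cycle per load-use pair (forwarding in place) and one lost cycle per taken branch (branch resolved in ID). The frequencies come from the instruction mix above; the variable names are illustrative.

```python
ideal_cpi = 1.0

load_freq = 0.30
load_use_fraction = 0.25   # loads immediately followed by a use
load_use_penalty = 1       # assumed: one stall cycle even with forwarding

branch_freq = 0.20
taken_fraction = 0.45
taken_penalty = 1          # assumed: branch address calculated in ID

load_stalls = load_freq * load_use_fraction * load_use_penalty
branch_stalls = branch_freq * taken_fraction * taken_penalty
cpi = ideal_cpi + load_stalls + branch_stalls

print(f"stalls by loads:    {load_stalls:.3f}")
print(f"stalls by branches: {branch_stalls:.3f}")
print(f"CPI:                {cpi:.3f}")
```

Changing `taken_penalty` to 3 models a design that resolves branches in the MEM stage and shows how much the earlier branch decision is worth.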
