Enhanced Performance with Pipelining

Magdalene Snow
5 years ago

1 Chapter 6: Enhanced Performance with Pipelining
Note: The slides being presented represent a mix. Some are created by Mark Franklin, Washington University in St. Louis, Dept. of CSE. Many are taken from the Patterson & Hennessy book, Computer Organization & Design, Copyright 1998 Morgan Kaufmann Publishers. This material may not be copied or distributed for commercial purposes without express written permission of the copyright holder. The original slides may be found at: efaltindividal.asp&isbn= &contry=united+states&srccode=&re f=&sbcode=&head=&pdf=&basiccode=&ttsearch=&searchfield=&operator =&order=&commnity=mk

2 Pipelining: It's Natural!
Laundry example: Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold.
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes

3 Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, one after another]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?

4 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped]
- Pipelined laundry takes 3.5 hours for 4 loads
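The 6-hour and 3.5-hour figures can be checked with a short simulation. This is a minimal sketch, assuming each machine handles one load at a time and a load moves to the next machine as soon as both are free; the function and variable names are illustrative, not from the slides.

```python
def pipeline_finish_time(stage_minutes, num_loads):
    """Return the minute at which the last load leaves the last stage."""
    # finish[s] holds the time at which the previous load left stage s.
    finish = [0] * len(stage_minutes)
    for _ in range(num_loads):
        ready = 0  # time at which this load left its previous stage
        for s, duration in enumerate(stage_minutes):
            # A load starts a stage once it is ready and the machine is free.
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes, from the slides)
LOADS = 4              # Ann, Brian, Cathy, Dave

sequential_minutes = LOADS * sum(STAGES)
pipelined_minutes = pipeline_finish_time(STAGES, LOADS)

print(sequential_minutes / 60, pipelined_minutes / 60)
```

The pipelined total is 30 + 4 × 40 + 20 minutes: the 40-minute dryer is the slowest stage and sets the rate at which loads complete.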

5 Pipelining Lessons
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D in task order]
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup

6 Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- MIPS desirable features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores

7 Pipelining
Improve performance by increasing instruction throughput.
[Figure: program execution order vs. time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Non-pipelined: each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns, and the next one starts only when it finishes. Pipelined: a new instruction starts every 2 ns.]
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
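The question on this slide can be explored numerically. The sketch below uses the 8 ns and 2 ns figures from the slide and assumes a five-stage pipeline with every stage stretched to the 2 ns clock; the function names are illustrative.

```python
def nonpipelined_ns(n, instr_ns=8):
    """Total time when each instruction runs to completion before the next starts."""
    return n * instr_ns


def pipelined_ns(n, stages=5, cycle_ns=2):
    """Total time when one instruction enters the pipeline per cycle.

    The first instruction needs `stages` cycles; each later one adds one cycle.
    """
    return (stages + n - 1) * cycle_ns


for n in (3, 1_000_000):
    speedup = nonpipelined_ns(n) / pipelined_ns(n)
    print(n, nonpipelined_ns(n), pipelined_ns(n), round(speedup, 3))
```

Under these assumptions the speedup for a long instruction stream approaches 8 / 2 = 4 rather than 5, because the stages are not perfectly balanced; for only three instructions the fill time dominates and the gain is smaller still.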

8 Pipelining
What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues.
We'll talk about modern processors and what really makes it hard:
- exception handling
- trying to improve performance with out-of-order execution, etc.

9 Basic Idea
- IF: Instruction fetch
- ID: Instruction decode / register file read
- EX: Execute / address calculation
- MEM: Memory access
- WB: Write back
[Figure: single-cycle datapath (PC, instruction memory, registers, sign extend, ALU, data memory) divided into the five stages above]
What do we need to add to actually split the datapath into stages?

10 Branch Stalls
Introduction of stall slot.

11 Using Delayed Branch Slot

12 Data Hazards
Data dependencies that result in pipeline stalls:
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
(see previous slide)

13 Pipelined Datapath
[Figure: datapath divided by the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Note 1: We no longer share the ALU for PC incrementing & next address calculation (to prevent structural hazards).
Note 2: Flow is from left to right except for register write-back & PC calculation.

14 Corrected Datapath
Need to ensure that the write register selection corresponds to the correct instruction when the data is available at the end of the last stage. We can generally assume that the instruction is passed from one stage to the next.
[Figure: pipelined datapath with the write register number carried through IF/ID, ID/EX, EX/MEM, MEM/WB]

15 Graphically Representing Pipelines
[Figure: time (in clock cycles, CC 1 to CC 6) vs. program execution order (in instructions) for lw $10, 20($1) and sub $11, $2, $3; each instruction passes through IM, Reg, ALU, DM, Reg]
Resources used by the instruction on each cycle are shown. Registers & memories are written during the 1st half of the clock cycle & read during the 2nd half of the cycle (hence the shading).
Can help with answering questions such as:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- use this representation to help understand datapaths

16 Two Successive Instructions

17 Simple Pipeline Performance (no hazards)
Say there are five stages taking 50, 50, 60, 50, 50 ns each. Say there is an overhead (due to clock skew & setup) of 5 ns in each stage. How much speedup in the average instruction execution rate is gained from having a pipelined implementation versus a non-pipelined implementation?
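One way to work this exercise is sketched below. It assumes the non-pipelined machine pays no per-stage overhead and that the pipelined clock is set by the slowest stage plus the overhead; both are assumptions, since the slide only states the stage times and the 5 ns figure.

```python
stage_ns = [50, 50, 60, 50, 50]  # stage delays from the slide
overhead_ns = 5                  # clock skew and setup per stage

# Non-pipelined: one instruction occupies the whole datapath (assumed no overhead).
nonpipelined_instr_ns = sum(stage_ns)

# Pipelined: one instruction completes per clock, and the clock must fit
# the slowest stage plus the pipeline register overhead.
pipelined_cycle_ns = max(stage_ns) + overhead_ns

speedup = nonpipelined_instr_ns / pipelined_cycle_ns
print(nonpipelined_instr_ns, pipelined_cycle_ns, speedup)
```

Under these assumptions the speedup is 260 / 65 = 4, short of the ideal 5 because of the unbalanced 60 ns stage and the added overhead.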

18 Pipeline Control (Figure 6.25)
[Figure: pipelined datapath with control lines PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg; ALU control takes the function bits Instruction [15-0] and ALUOp; RegDst selects between Instruction [20-16] and Instruction [15-11]]

Control lines grouped by stage (EX: execution/address calculation; MEM: memory access; WB: write-back):

              EX stage                        MEM stage                  WB stage
Instruction   RegDst ALUOp1 ALUOp0 ALUSrc     Branch MemRead MemWrite    RegWrite MemtoReg
R-format      1      1      0      0          0      0       0           1        0
lw            0      0      0      1          0      1       0           1        1
sw            X      0      0      1          0      0       1           0        X
beq           X      0      1      0          1      0       0           0        X

19 Pipeline Control
Pass control signals along just like the data.
[Table: control line settings per instruction (R-format, lw, sw, beq), grouped into execution/address calculation, memory access, and write-back stage control lines, as on slide 18]
[Figure: Control decodes the instruction into WB, M, and EX fields that travel through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Only selected bits are passed from stage to stage.

20 Datapath with Control (Figure 6.30)
[Figure: complete pipelined datapath with the Control unit; the WB, M, and EX control fields are carried in ID/EX, EX/MEM, and MEM/WB and drive PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg]

21 Dependencies and Hazards
Data hazard: a problem caused by starting the next instruction before the first is finished.
Techniques to overcome hazards:
1) Software Solution - Reorder Instructions
Original order (hazard: sw $t2 immediately follows the lw that loads $t2):
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
Reordered:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t0, 4($t1)
    sw $t2, 0($t1)
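The effect of the reordering can be checked mechanically. The sketch below assumes a simplified hazard model in which only an instruction that reads a register in the slot directly after the lw that loads it counts as a hazard; the helper name and the regex-based parsing are illustrative and cover only the lw/sw forms used on this slide.

```python
import re


def load_use_hazards(program):
    """Count instructions that read a register loaded by the lw just before them."""
    hazards = 0
    for prev, curr in zip(program, program[1:]):
        if not prev.startswith("lw"):
            continue
        loaded = re.findall(r"\$\w+", prev)[0]  # destination register of the lw
        regs = re.findall(r"\$\w+", curr)
        # A store reads every register it names; other instructions write regs[0].
        sources = regs if curr.startswith("sw") else regs[1:]
        if loaded in sources:
            hazards += 1
    return hazards


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]

print(load_use_hazards(original), load_use_hazards(reordered))
```

Swapping the two stores puts one independent instruction between lw $t2 and sw $t2, which is all this model requires.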

22 Hazards: 2) Software Solution - Insert instructions
Data hazard: a problem caused by starting the next instruction before the first is finished.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, showing the value of register $2 in each cycle; each instruction passes through IM, Reg, ALU, DM, Reg]
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

23 Hazards: Software Solution
Have the compiler guarantee no hazards. Where do we insert the nops?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Problem: It slows us down!
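A compiler pass for this question might look like the sketch below. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer; that distance is exposed as the hypothetical parameter `min_distance`. The parser handles only the instruction forms in this example.

```python
import re


def insert_nops(program, min_distance=3):
    """Pad `program` with nops so every register read is far enough from its write."""
    out = []
    last_write = {}  # register -> index in `out` of the instruction that last wrote it
    for instr in program:
        regs = re.findall(r"\$\w+", instr)
        is_store = instr.startswith("sw")
        # A store only reads registers; other instructions write regs[0].
        sources = regs if is_store else regs[1:]
        for src in sources:
            if src in last_write:
                while len(out) - last_write[src] < min_distance:
                    out.append("nop")
        if not is_store:
            last_write[regs[0]] = len(out)
        out.append(instr)
    return out


program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]

for line in insert_nops(program):
    print(line)
```

With these assumptions two nops go between sub and and; the later readers of $2 are then already far enough away. The two wasted slots are the slowdown the slide complains about.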

24 Hazards: Hardware Solution
3) Register Write/Read Clock Cycle Division
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, showing the value of register $2 in each cycle; each instruction passes through IM, Reg, ALU, DM, Reg]
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
Remove hazard: have the register file written in the 1st half of the clock cycle & read in the 2nd half of the clock cycle.

25 Hazards: Hardware Solution
4) Use Forwarding
Forwarding:
a) Move output of ALU back to its input.
b) Move output of Memory (on LW) directly into ALU input.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, showing the value of register $2 in each cycle; each instruction passes through IM, Reg, ALU, DM, Reg]
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

26 Hazards: Forwarding
Use temporary results; don't wait for them to be written.
- register file forwarding to handle read/write to the same register
- ALU forwarding
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, showing the value of register $2 in each cycle]
                    CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8  CC 9
Value of EX/MEM:    X     X     X     20    X     X     X     X     X
Value of MEM/WB:    X     X     X     X     20    X     X     X     X
Program execution order (in instructions):
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

27 Hazard Detection
Denote the pipeline register/signal set as: ID/EX, EX/MEM, MEM/WB.
Denote the particular register or control signal in the set as: RegisterRd, RegisterRs, RegisterRt, RegWrite.
Combine the two to denote a particular register/signal in the pipeline set, e.g., EX/MEM.RegisterRd.
Specify hazard conditions as follows:
    EX/MEM.RegisterRd = ID/EX.RegisterRs
    EX/MEM.RegisterRd = ID/EX.RegisterRt
    MEM/WB.RegisterRd = ID/EX.RegisterRs
    MEM/WB.RegisterRd = ID/EX.RegisterRt
The first hazard in the example can be expressed as:
    EX/MEM.RegisterRd = ID/EX.RegisterRs = $2

28 Forwarding Paths & Control Signals
Must 1) have a procedure for determining whether a hazard has occurred & 2) set up control for forwarding.

29 Forwarding: Control of Forwarding Paths
EX Hazard:
    If ((EX/MEM.RegWrite and EX/MEM.RegisterRd /= 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
    If ((EX/MEM.RegWrite and EX/MEM.RegisterRd /= 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM Hazard: See book.
We now have a) conditions for data hazards, and b) forwarding control logic based on these conditions. These can be implemented as part of the Forwarding unit.
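The EX hazard rules above translate directly into a small function. In the sketch below the EX branch follows the slide; the MEM branch (select value 01, taken only when no EX hazard applies) is an assumption filled in from the usual textbook formulation, since the slide defers to the book. The function and parameter names are illustrative.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd):
    """Return (ForwardA, ForwardB) mux selects for the two ALU inputs."""

    def select(src):
        # EX hazard (from the slide): take the ALU result held in EX/MEM.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src:
            return 0b10
        # MEM hazard (assumed rule): take the value held in MEM/WB,
        # but only if the more recent EX/MEM result did not match.
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src:
            return 0b01
        return 0b00  # no hazard: use the value read from the register file

    return select(id_ex_rs), select(id_ex_rt)


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM.
print(forwarding_unit(2, 5, True, 2, False, 0))

# or $13, $6, $2 in EX while and ($12) is in MEM and sub ($2) is in WB.
print(forwarding_unit(6, 2, True, 12, True, 2))
```

The check against register 0 keeps the unit from forwarding a result aimed at $0, which always reads as zero in MIPS.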

30 Forwarding: Control and Datapaths
[Figure: pipelined datapath with a Forwarding unit. Inputs to the Forwarding unit: the Rs and Rt register numbers carried from IF/ID into ID/EX, EX/MEM.RegisterRd, and MEM/WB.RegisterRd. The unit controls the multiplexers on the ALU inputs.]

31 Data Hazards: Forwarding can't always solve the problem
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9; the hazard is marked between the lw and the instruction that follows it]
Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
Hardware Technique 5) Introduce stalls.

32 Stalling
We can stall the pipeline by keeping an instruction in the same stage.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 10; a bubble is inserted after the lw, delaying the instructions behind it by one cycle]
Program execution order (in instructions):
    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

33 Hazard Detection Unit to Control Stalling
Detects hazards and inserts stall signals.
[Figure: pipelined datapath with a Hazard detection unit and a Forwarding unit. Hazard detection unit inputs: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt. Outputs: PCWrite, IF/IDWrite, and a multiplexer select that forces the control signals to 0.]
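The stall decision can be sketched from the signals named in the figure. The exact condition is not spelled out on the slide; the version below is the commonly taught load-use test and should be read as an assumption, as should the function and key names.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use check: the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction currently in ID."""
    return bool(id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt))


def stall_controls(stall):
    """Control outputs for one cycle.

    On a stall, PC and IF/ID are held and the control signals entering
    ID/EX are zeroed, which turns the instruction in ID into a bubble.
    """
    return {
        "PCWrite": not stall,
        "IF/IDWrite": not stall,
        "ZeroControl": stall,  # hypothetical name for the control mux select
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID: stall for one cycle.
print(stall_controls(must_stall(True, 2, 2, 5)))
```

After the one-cycle bubble the loaded value is available in MEM/WB, so forwarding can supply it to the ALU.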

34 Branch Hazards
When we decide to branch, other instructions are in the pipeline.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9]
Program execution order (in instructions), with instruction addresses:
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or  $13, $6, $2
    52 add $14, $2, $2
    72 lw  $4, 50($7)
Technique 1) Predict branch not taken
- need to add hardware for flushing instructions if we are wrong

35 Reduce Number of Flushed Instructions
- Move the branch decision earlier in the pipeline.
- Check for equality earlier in the pipeline.
- Flush only a single instruction (IF.Flush).
[Figure: pipelined datapath with the branch adder (shift left 2, sign extend) and an equality comparator (=) on the register outputs in the ID stage, plus the Hazard detection unit, Control, and Forwarding unit]

36 Improving Performance
- Avoid stalls by reordering instructions
- Add a branch delay slot
  - the next instruction after a branch is always executed
  - rely on the compiler to fill the slot with something useful
- Add a more intelligent branch predictor:
  - 2-bit branch predictor
  - Branch history table
  - More advanced predictors
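The 2-bit predictor in the list above can be modeled as a saturating counter. This is a minimal sketch for a single branch, assuming the common encoding in which states 0 and 1 predict not taken and states 2 and 3 predict taken; the class and function names are illustrative.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter for one branch."""

    def __init__(self, state=0):
        self.state = state  # 0, 1: predict not taken; 2, 3: predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times and then falling through, executed twice.
outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(outcomes, TwoBitPredictor(state=3))
print(misses)
```

Because a single loop exit only moves the counter from 3 to 2, the predictor still guesses taken when the loop is entered again, so it misses once per loop execution instead of twice.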

37 The Branch Delay Slot

38 Obtaining Higher Performance
Superpipelining:
- More stages -> less delay per stage -> higher clock rate
- Problems?
Superscalar Design (related to VLIW, Very Long Instruction Word designs):
- More components per stage -> multiple instructions executing per stage -> more parallelism
- CPI can be < 1; 2 to 4 instructions per clock cycle
- Problems?
Dynamic Pipeline Scheduling:
- Allow out-of-order instruction execution if dependencies are satisfied; good for overlapping execution with instruction stalling.
    lw   $t0, 20($s2)
    add  $t1, $t0, $t2    # say lw, add stall on memory fetch
    sub  $s4, $s4, $t3    # sub & slti can execute if they can
    slti $t5, $s4, 20     # bypass the lw & add instructions
Multithreading:
- Have multiple threads available for execution; start one thread when another thread is stalled due to a hazard (e.g., memory delay).

39 A Superscalar MIPS
Assume two instructions issued per clock cycle (fetch 64 bits aligned on a 64-bit double word boundary).

Instruction type        Pipe stages
ALU or BR inst.         IF  ID  EX  MEM WB
Load or store inst.     IF  ID  EX  MEM WB
ALU or BR inst.             IF  ID  EX  MEM WB
Load or store inst.         IF  ID  EX  MEM WB
ALU or BR inst.                 IF  ID  EX  MEM WB
Load or store inst.             IF  ID  EX  MEM WB
ALU or BR inst.                     IF  ID  EX  MEM WB
Load or store inst.                 IF  ID  EX  MEM WB

With no hazards, potentially double the performance. Must increase hardware resources for each stage.

40 MIPS Superscalar Processor
1. Added 32 bits from memory
2. Two more read ports, one more write port on the Reg. File
3. One more ALU (say the new ALU handles address calculations for data transfers; the original ALU does everything else).

41 MIPS Superscalar Processor
Compiler must take into account dependencies between pairs of instructions.
Example:
    Loop: lw   $t0, 0($s1)
          add  $t0, $t0, $s2
          sw   $t0, 0($s1)
          addi $s1, $s1, -4
          bne  $s1, $zero, Loop
Superscalar version:

          ALU or BR inst.          Data transfer inst.    Clk. cycle
    Loop:                          lw $t0, 0($s1)         1
          addi $s1, $s1, -4                               2
          add  $t0, $t0, $s2                              3
          bne  $s1, $zero, Loop    sw $t0, 4($s1)         4

Another technique: Loop unrolling.
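The schedule above can be scored with a few lines of code. The sketch below encodes each clock cycle as a hypothetical (ALU/branch slot, data transfer slot) pair, with None for an empty slot, and compares the achieved CPI with the two-issue ideal.

```python
# One tuple per clock cycle: (ALU or branch slot, data transfer slot).
schedule = [
    (None, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", None),
    ("add $t0, $t0, $s2", None),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]

cycles = len(schedule)
instructions = sum(slot is not None for pair in schedule for slot in pair)

cpi = cycles / instructions
ideal_cpi = 1 / 2  # two instructions issued every cycle

print(instructions, cycles, cpi, ideal_cpi)
```

Five instructions in four cycles gives a CPI of 0.8 against an ideal of 0.5: three of the eight issue slots stay empty because of the dependencies among lw, add, and sw. Filling those slots is the motivation for loop unrolling.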

42 Memory: The Major Performance Bottleneck
Have completed:
- MIPS basic design
- MIPS pipelined design
  - Datapaths
  - Control implementation
  - Dealing with hazards
- Asynchronous designs
- Advanced concepts (superpipelining, superscalar, dynamic execution, multithreading, VLIW)
Next big topic: The MEMORY HIERARCHY
