Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...


Paulina Riley
6 years ago

1 CHAPTER 6

2 Pipelining
Improve performance by increasing instruction throughput.
[Table: time per instruction class (load word, store word, R-format, branch), broken down into instruction memory, register read, ALU, data memory, and register write, with the total in ps. The values did not survive extraction.]
[Figure: three loads (lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)) in program execution order. Without pipelining, a new instruction starts every 8 ns. With pipelining, a new instruction starts every 2 ns.]
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
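The question on this slide can be checked with a little arithmetic. The sketch below uses only the figures visible on the slide (8 ns per unpipelined load, a 2 ns pipeline clock) and assumes a 5-stage pipeline, as used later in the deck. The function names are illustrative and not from the slides.

```python
def nonpipelined_time_ns(n, instr_ns=8):
    """Total time when each instruction runs to completion before the next starts."""
    return n * instr_ns


def pipelined_time_ns(n, stages=5, cycle_ns=2):
    """Total time when a new instruction enters the pipeline every cycle.

    The first instruction needs `stages` cycles; each later one adds one cycle.
    """
    return (stages + n - 1) * cycle_ns


def speedup(n):
    """Ratio of unpipelined to pipelined execution time for n instructions."""
    return nonpipelined_time_ns(n) / pipelined_time_ns(n)


if __name__ == "__main__":
    for n in (3, 1000, 1_000_000):
        print(n, nonpipelined_time_ns(n), pipelined_time_ns(n), round(speedup(n), 3))
```

Under these assumptions the speedup approaches 8 / 2 = 4 for long instruction streams rather than 5, because the clock must fit the slowest stage and the stages are not perfectly balanced.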

3 Pipelining
What makes it easy?
- All instructions are the same length.
- There are just a few instruction formats.
- Memory operands appear only in loads and stores.
What makes it hard?
- Structural hazards: suppose we had only one memory.
- Control hazards: we need to worry about branch instructions.
- Data hazards: an instruction depends on a previous instruction.
We'll build a simple pipeline and look at these issues. We'll then talk about modern processors and what really makes it hard: exception handling, trying to improve performance with out-of-order execution, etc.

4 Hazards
Source statements: A = B + E; C = B + F
First ordering:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Second ordering, with the third load moved up:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
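To compare the two orderings on this slide, the following sketch counts load-use stalls. It assumes a pipeline with forwarding in which an instruction that reads a register loaded by the immediately preceding lw costs one stall cycle. The tuple encoding (opcode, destination, sources) and the function name are hypothetical, chosen only for this illustration.

```python
def count_load_use_stalls(program):
    """Count adjacent pairs where a lw result is read by the very next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, _sources = prev
        if op == "lw" and dest in curr[2]:
            stalls += 1
    return stalls


# Each entry: (opcode, destination register or None, source registers).
first_ordering = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

second_ordering = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

if __name__ == "__main__":
    print(count_load_use_stalls(first_ordering))   # add after lw $t2, add after lw $t4
    print(count_load_use_stalls(second_ordering))  # no add directly follows its load
```

Under the stated assumption, the first ordering stalls twice and the second not at all, which is the point of moving the third load up.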

5 Basic Idea
What do we need to add to actually split the datapath into stages?

6 Pipelined datapath
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?

7 Five Stages (lw)
Memory and registers: left half means write, right half means read.

8 Five Stages (lw)

9 Five Stages (lw)

10 What is wrong with this datapath?

11 Graphically representing pipelines
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
Use this representation to help understand datapaths.

12 Pipeline operation
- In a pipeline, one operation begins in every cycle.
- Also, one operation completes in each cycle.
- Each instruction takes 5 clock cycles (k cycles in general, where k is the pipeline depth).
- When a stage is not used, no control needs to be applied.
- In one clock cycle, several instructions are active.
- Different stages are executing different instructions.
- How to generate control signals for them is an issue.
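A minimal sketch of this timing model, assuming an ideal pipeline with no stalls and the usual five MIPS stage names (IF, ID, EX, MEM, WB). The helper names are illustrative. It answers the two kinds of question a pipeline diagram is used for: how many cycles a code sequence takes, and which instruction occupies a given stage in a given cycle.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def total_cycles(n, k=len(STAGES)):
    """Cycles to run n instructions on an ideal k-stage pipeline."""
    return k + n - 1 if n > 0 else 0


def instruction_in_stage(cycle, stage, n):
    """Return the 0-based index of the instruction in `stage` during
    1-based clock cycle `cycle`, or None if the stage is idle."""
    i = cycle - 1 - STAGES.index(stage)
    return i if 0 <= i < n else None


def diagram(n):
    """Build one text row per instruction, one column per clock cycle."""
    rows = []
    for i in range(n):
        cells = ["." * 3] * i + [s.ljust(3) for s in STAGES]
        cells += ["." * 3] * (total_cycles(n) - len(cells))
        rows.append(" ".join(cells))
    return rows


if __name__ == "__main__":
    print("\n".join(diagram(3)))
    print(total_cycles(3), instruction_in_stage(4, "EX", 3))
```

For three instructions this gives 7 cycles, and in cycle 4 the EX stage (the ALU) is working on the second instruction (index 1).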

13 Pipeline control
We have 5 stages. What needs to be controlled in each stage?
- Instruction Fetch and PC Increment
- Instruction Decode / Register Fetch
- Execution
- Memory Stage
- Write Back
How would control be handled in an automobile plant?
- A fancy control center telling everyone what to do?
- Should we use a finite state machine?

14 Pipeline control
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB, showing the control signals PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, and MemtoReg.]

15 Pipeline control
[Table: control line settings for R-format, lw, sw, and beq, grouped by stage. Execution/address calculation stage: RegDst, ALUOp1, ALUOp0, ALUSrc. Memory access stage: Branch, MemRead, MemWrite. Write-back stage: RegWrite, MemtoReg. The numeric settings did not survive extraction; sw and beq each have two don't-care (X) entries.]

16 Datapath with control
[Figure: pipelined datapath with the control unit added. The control signals are generated in the ID stage and carried through the ID/EX, EX/MEM, and MEM/WB pipeline registers in three groups: EX, M, and WB.]

17 Dependencies
Problem with starting the next instruction before the first is finished: dependencies that go backward in time are data hazards.
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: pipeline diagram of this sequence in program execution order against time in clock cycles, tracking the value of register $2.]

18 Forwarding
Use temporary results; don't wait for them to be written.
- Register file forwarding to handle a read and a write to the same register
- ALU forwarding
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: pipeline diagram of this sequence over CC 1 to CC 9, tracking the value of register $2 and the values held in the pipeline registers. EX/MEM holds 20 in CC 4 and MEM/WB holds 20 in CC 5; both are X in every other cycle.]

19 Forwarding
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Figure: pipelined datapath with a forwarding unit. The Rs, Rt, and Rd fields from IF/ID are carried into ID/EX; the forwarding unit takes Rs and Rt together with EX/MEM.RegisterRd and MEM/WB.RegisterRd and controls the multiplexers in front of the ALU.]

20 Can't always forward
Load word can still cause a hazard: an instruction tries to read a register immediately after a load instruction that writes to the same register.
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram of this sequence over CC 1 to CC 9.]
Thus, we need a hazard detection unit to stall the instruction that follows the load.

21 Forwarding
Forward from the EX/MEM registers:
if (EX/MEM.RegWrite) and (EX/MEM.Rd != 0) and (ID/EX.Rs == EX/MEM.Rd)
Forward from the MEM/WB registers:
if (MEM/WB.RegWrite) and (MEM/WB.Rd != 0) and (ID/EX.Rt == MEM/WB.Rd)
Each test is applied to both ALU source registers, ID/EX.Rs and ID/EX.Rt.
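The forwarding tests can be sketched as a small function. This is an illustration under stated assumptions, not the slide's hardware: the pipeline registers are modeled as dictionaries with hypothetical keys reg_write and rd, and the EX/MEM value is given priority over MEM/WB because it is the more recent result (the usual convention in the textbook MIPS design, assumed here).

```python
def forward_source(src_reg, ex_mem, mem_wb):
    """Pick where one ALU operand should come from.

    Returns "EX/MEM", "MEM/WB", or "REGFILE".
    Register 0 is never forwarded because it always reads as zero.
    """
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == src_reg:
        return "EX/MEM"  # most recent result wins (assumed priority)
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == src_reg:
        return "MEM/WB"
    return "REGFILE"


def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return the operand sources for the two ALU inputs (Rs side, Rt side)."""
    return (
        forward_source(id_ex_rs, ex_mem, mem_wb),
        forward_source(id_ex_rt, ex_mem, mem_wb),
    )


if __name__ == "__main__":
    # "sub $2, $1, $3" is in MEM while "and $12, $2, $5" is in EX.
    ex_mem = {"reg_write": True, "rd": 2}
    mem_wb = {"reg_write": False, "rd": 0}
    print(forwarding_unit(2, 5, ex_mem, mem_wb))
```

For the sub/and pair from the earlier slides, the Rs operand of the and instruction is taken from EX/MEM and the Rt operand from the register file.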

22 Stalling
Hardware detection and no-op insertion is called stalling. Stall the pipeline by keeping an instruction in the same stage.
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: pipeline diagram of this sequence over CC 1 to CC 10, with one bubble inserted.]

23 Example

24 [Figure]

25 Stall logic
if (ID/EX.MemRead) // load word instruction
and ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))
then stall:
- Insert a no-op (no-operation) by deasserting all control signals.
- Stall the following instruction by not writing the program counter (PCWrite) and not writing the IF/ID registers.
Inputs to the hazard detection unit: IF/ID.Rs, IF/ID.Rt, ID/EX.Rt.
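The stall condition translates almost directly into code. The sketch below is illustrative: the function name, the argument names, and the returned dictionary keys (in particular if_id_write) are hypothetical labels for the signals the slide describes.

```python
def hazard_detection(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Decide whether the instruction in ID must wait for a load in EX.

    A stall is needed when the instruction in EX is a load (MemRead set)
    and its destination (Rt) is a source of the instruction in ID.
    """
    stall = id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt)
    return {
        "stall": stall,
        "pc_write": not stall,      # hold the PC so the next fetch repeats
        "if_id_write": not stall,   # hold IF/ID so the decode repeats
        "insert_nop": stall,        # deassert all control signals in ID/EX
    }


if __name__ == "__main__":
    # lw $2, 20($1) in EX; and $4, $2, $5 in ID.
    print(hazard_detection(True, 2, 2, 5))
    # sub $2, $1, $3 in EX (not a load); forwarding handles it without a stall.
    print(hazard_detection(False, 2, 2, 5))
```

The first call reports a stall; the second does not, because only a load produces its result too late for forwarding to the next instruction.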

26 Pipeline with hazard detection

27 Summary

28 Forwarding Case Summary

29 Multi-cycle

30 Multi-cycle

31 Multi-cycle Pipeline

32 Branch Hazards
[Figure: the pipelined datapath with control, repeated from slide 16. The branch target adder, the ALU Zero output, and the Branch signal feed PCSrc after the EX/MEM register.]

33 Branch hazards
When we decide to branch, other instructions are in the pipeline! We are predicting branch not taken, so we need to add hardware for flushing instructions if we are wrong.
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
[Figure: pipeline diagram of this sequence over CC 1 to CC 9; the number before each instruction is its address.]

34 Solution to control hazards
Branch prediction
- We are predicting branch not taken.
- Need to add hardware for flushing instructions if we are wrong.
Reduce the branch penalty
- By advancing the branch decision to the ID stage.
- Compare the data read from the two registers in the ID stage.
- Comparison for equality is a simpler design! (Why?)
- Still need to flush the instruction in the IF stage.
Make the hazard into a feature!
- Delayed branch slot: always execute the instruction following the branch.

35 Branch detection in ID stage


36 Dynamic branch prediction
One-bit scheme
- Use the lower part of the instruction address as the index into the prediction table.
- Use one bit to denote branch taken or not taken.
- Disadvantage: poor performance in loops.
Two-bit scheme
- Use two bits instead of one.
- The condition must be satisfied twice to change the prediction.
More sophisticated
- Count the number of times the branch is taken.
[Figure: 2-bit branch prediction state diagram.]
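The loop weakness of the one-bit scheme is easy to see in a small simulation. The sketch below tracks a single branch (one table entry) and models the two-bit scheme as a saturating counter that predicts taken when the counter is 2 or 3. That counter encoding is an assumption; the state diagram on the slide may label its states differently. The function names are illustrative.

```python
def one_bit_mispredictions(outcomes, state=False):
    """Predict whatever the branch did last time; count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def two_bit_mispredictions(outcomes, counter=0):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# A loop branch: taken 9 times, then falls through; the loop runs 10 times.
loop_branch = ([True] * 9 + [False]) * 10

if __name__ == "__main__":
    print(one_bit_mispredictions(loop_branch))
    print(two_bit_mispredictions(loop_branch))
```

With both predictors starting at "not taken", the one-bit scheme mispredicts twice per loop execution (on entry and on exit, 20 in total), while the two-bit counter mispredicts once per execution after warm-up (12 in total).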

37 Correlating Branches
Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.
In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters. The old 2-bit BHT is then a (0,2) predictor.
Example:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) do something;
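An (m,n) predictor can be sketched in a few lines. The class below is a hypothetical illustration: the table size (index_bits), the use of saturating counters, and the word-aligned index (pc >> 2) are assumptions, not details from the slide.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m tables
    of n-bit saturating counters, indexed by low bits of the branch address."""

    def __init__(self, m=2, n=2, index_bits=4):
        self.m = m
        self.n = n
        self.index_mask = (1 << index_bits) - 1
        self.max_count = (1 << n) - 1
        self.history = 0
        self.tables = [[0] * (1 << index_bits) for _ in range(1 << m)]

    def _index(self, pc):
        return (pc >> 2) & self.index_mask

    def predict(self, pc):
        """Predict taken when the selected counter is in its upper half."""
        return self.tables[self.history][self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        """Update only the counter selected by the current history, then shift."""
        row = self.tables[self.history]
        i = self._index(pc)
        row[i] = min(self.max_count, row[i] + 1) if taken else max(0, row[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False] * 20  # a branch that flips every time

if __name__ == "__main__":
    print(count_mispredictions(CorrelatingPredictor(m=0, n=2), 0x40, alternating))
    print(count_mispredictions(CorrelatingPredictor(m=2, n=2), 0x40, alternating))
```

On a branch that strictly alternates, the (0,2) predictor in this sketch is wrong half the time (20 of 40), while the (2,2) predictor learns the pattern after three mispredictions, because the history bits distinguish "last outcome was taken" from "last outcome was not taken".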

38 Correlating Branches
(2,2) predictor: the behavior of recent branches selects between, say, four predictions of the next branch, updating just that prediction.
[Figure: the branch address indexes a table holding four 2-bit predictors per branch; a 2-bit global branch history selects which of the four supplies the prediction.]

39 Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (axis from 0% to 18%) for the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; and 1,024 entries (2,2). Data labels that survived extraction, series not identifiable: nasa7 1%, matrix300 0%, tomcatv 1%, doducd 5%, gcc 11%, espresso 4%, eqntott 6%, li 5%, plus two 6% labels between spice and fpppp.]

40 Branch Prediction
Sophisticated techniques:
- A branch target buffer to help us look up the destination.
- Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches).
- Tournament predictors that use different types of prediction strategies and keep track of which one is performing best.
- A branch delay slot which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA).
Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective! Modern processors predict correctly 95% of the time!

41 Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken).
Note: must check for a branch match now, since we can't use the wrong branch address.
Outputs: the predicted PC and the branch prediction (taken or not taken).
Return instruction addresses are predicted with a stack.
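A minimal sketch of the lookup described here, assuming a direct-mapped buffer indexed by low bits of the word-aligned branch address and 4-byte instructions. The class, its fields, and the table size are hypothetical. The tag comparison is the "branch match" check: without it, a different instruction that maps to the same entry would be redirected to the wrong target.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: index -> (branch address, target, predict_taken)."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.entries = {}

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def lookup(self, pc):
        """Return the predicted next PC for the instruction at `pc`."""
        entry = self.entries.get(self._index(pc))
        if entry is not None:
            tag, target, predict_taken = entry
            if tag == pc and predict_taken:  # full address must match
                return target
        return pc + 4  # not a known taken branch: fall through

    def update(self, pc, target, taken):
        """Record the outcome of a resolved branch."""
        self.entries[self._index(pc)] = (pc, target, taken)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    btb.update(40, 72, True)   # beq at address 40 branching to 72
    print(btb.lookup(40))      # predicted target
    print(btb.lookup(44))      # ordinary instruction: next sequential PC
    print(btb.lookup(104))     # same index as 40, different address
```

The addresses 40 and 72 are taken from the branch hazard example earlier in the deck; address 104 is chosen only because it aliases to the same entry in this 16-entry sketch.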

42 Scheduling in delayed branching

43 Other issues in pipelines
Exceptions
- Errors in the ALU for arithmetic instructions
- Memory non-availability
Exceptions lead to a jump in a program. However, the current PC value must be saved so that the program can return to it for recoverable errors.
Multiple exceptions can occur in a pipeline.
Preciseness of the exception location is important in some cases.
I/O exceptions are handled in the same manner.

44 Exceptions

45 Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Dynamic Pipeline Scheduling
- Hardware chooses which instructions to execute next.
- Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!).
- Speculates on branches and keeps the pipeline full (may need to roll back if the prediction is incorrect).
Trying to exploit instruction-level parallelism.

46 Advanced Pipelining
- Increase the depth of the pipeline.
- Start more than one instruction each cycle (multiple issue).
- Loop unrolling to expose more ILP (better scheduling).
- Superscalar processors: DEC Alpha 21264 has a 9-stage pipeline and 6-instruction issue.
- All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different "pipes").
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology).
This class has given you the background you need to learn more!

47 Superscalar architecture: two instructions executed in parallel

48 Dynamically scheduled pipeline

49 Motorola G4e

50 Intel Pentium 4

51 IBM PowerPC

52 Important facts to remember
- Pipelined processors divide execution into multiple steps.
- However, pipeline hazards reduce performance: structural, data, and control hazards.
- Data forwarding helps resolve data hazards, but not all hazards can be resolved this way: some data hazards require bubble or no-op insertion.
- Effects of control hazards are reduced by branch prediction: predict always taken, delayed slots, branch prediction table.
- Structural hazards are resolved by duplicating resources.
- Time to execute n instructions depends on the number of stages (k), the number of control hazards and the penalty of each, and the number of data hazards and the penalty for each.
- Time = n + k - 1 + (load hazard penalty) + (branch penalty)
- Load hazard penalty is 1 or 0 cycles, depending on data use with forwarding.
- Branch penalty is 3, 2, 1, or 0 cycles, depending on the scheme.
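The timing formula can be written as a one-line function. This sketch assumes the base term is n + k - 1 cycles (k cycles to fill the pipeline, then one completion per cycle) and that each hazard contributes its own fixed penalty; the function and parameter names are illustrative.

```python
def execution_cycles(n, k, load_hazards=0, load_penalty=1,
                     branch_hazards=0, branch_penalty=1):
    """Cycles to execute n instructions on a k-stage pipeline with stalls.

    load_hazards / branch_hazards count the hazards that actually cost
    cycles; load_penalty / branch_penalty are the cycles lost per hazard.
    """
    if n <= 0:
        return 0
    return (n + k - 1
            + load_hazards * load_penalty
            + branch_hazards * branch_penalty)


if __name__ == "__main__":
    # Five instructions, five stages, one load-use stall (the stalling slide).
    print(execution_cycles(5, 5, load_hazards=1))
    # Same code with a 3-cycle penalty for one mispredicted branch.
    print(execution_cycles(5, 5, load_hazards=1, branch_hazards=1, branch_penalty=3))
```

The first call gives 10 cycles, which matches the stalling example earlier in the deck, where five instructions with one bubble finish in clock cycle 10.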

53 Design and performance issues with pipelining
- Pipelined processors are not EASY to design.
- Technology affects implementation.
- Instruction set design affects performance, e.g., beq, bne.
- More stages do not always lead to higher performance!

54 Chapter 6 Summary
Pipelining does not improve latency, but does improve throughput.
[Figure: two charts placing the designs single-cycle (Section 5.4), multicycle (Section 5.5), pipelined, deeply pipelined, multiple-issue pipelined (Section 6.9), and multiple issue with deep pipeline (Section 6.10). Recoverable axis labels: slower to faster; instructions per clock (IPC = 1/CPI); use latency in instructions, from 1 to several.]
