Complex Pipelines and Branch Prediction

1 Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1

5 Processor Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle), i.e., instruction count × CPI × t_CK

Pipelining lowers t_CK. What about CPI?
CPI = CPI_ideal + CPI_hazard
- CPI_ideal: cycles per instruction if there are no stalls
- CPI_hazard contributors:
  - Data hazards: long operations, cache misses
  - Control hazards: branches, jumps, exceptions
L22-2
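
To make the iron law concrete, here is a minimal Python sketch (with invented numbers rather than figures from the lecture) that plugs assumed values into Time/Program = instruction count × CPI × t_CK:

# Iron-law back-of-the-envelope; every number below is an illustrative assumption.
insts = 1_000_000        # dynamic instruction count
cpi_ideal = 1.0          # 5-stage pipeline with no stalls
cpi_hazard = 0.4         # extra cycles/instruction from data + control hazards
t_ck = 0.5e-9            # 0.5 ns clock period (2 GHz)

cpi = cpi_ideal + cpi_hazard
exec_time = insts * cpi * t_ck
print(f"CPI = {cpi:.2f}, execution time = {exec_time * 1e3:.3f} ms")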

7 5-Stage Pipelined Processors

[Pipeline stages: IF, DEC, EXE, MEM, WB]

Advantages:
- CPI_ideal = 1 (pipelining)
- Simple and elegant
- Still used in ARM & MIPS processors
Shortcomings:
- Upper performance bound is CPI = 1
- High-latency instructions are not handled well: one stage for access to large caches or the multiplier means a long clock cycle time
- Unnecessary stalls due to the rigid pipeline: if one instruction stalls, everything behind it stalls
L22-3

11 Improving 5-Stage Pipeline Performance

- Increase clock frequency: deeper pipelines (overlap more instructions)
- Reduce CPI_ideal: wider pipelines (each pipeline stage processes multiple instructions)
- Reduce impact of data hazards: out-of-order execution (execute each instruction as soon as its source operands are available)
  (Demystification, not on the quiz)
- Reduce impact of control hazards: branch prediction (predict both target and direction of jumps and branches)
L22-4

17 Deeper Pipelines

[Pipeline diagram: Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers]

Break up the datapath into N pipeline stages; ideal t_CK = 1/N compared to non-pipelined. So let's use a large N!
Advantage: higher clock frequency
- The workhorse behind multi-GHz processors
- Pipeline depths: Opteron: 11, Power5: 17, Pentium 4: 34, Nehalem: 16
Cost:
- More pipeline registers
- Complexity: more bypass paths and control logic
Disadvantages:
- More overlapping → more dependencies; CPI_hazard grows due to data and control hazards
- Clock overhead becomes increasingly important
- Power consumption
L22-5
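
The clock-overhead cost above can be seen in a back-of-the-envelope model: if the unpipelined logic delay is split across N stages but each stage adds a fixed pipeline-register delay, then t_CK = t_logic/N + t_reg stops shrinking linearly. A minimal Python sketch, with assumed delays:

# Deeper-pipeline clock model; t_logic and t_reg are illustrative assumptions.
t_logic = 10.0   # ns of combinational logic in the unpipelined datapath
t_reg = 0.2      # ns of pipeline-register overhead added per stage

for n in (1, 5, 10, 20, 40):
    t_ck = t_logic / n + t_reg
    speedup = (t_logic + t_reg) / t_ck   # ideal speedup, ignoring hazards
    print(f"N={n:2d}  t_CK={t_ck:5.2f} ns  ideal speedup={speedup:4.1f}x")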

21 Wider (aka Superscalar) Pipelines

[Pipeline diagram: Fetch, Decode, Read Registers, ALU, Memory, Write Registers, each stage W instructions wide]

Each stage operates on up to W instructions each clock cycle.
Advantage: lower CPI_ideal (1/W)
- Issue widths: Opteron: 3, Pentium 4: 3, Nehalem: 4, Power7: 8
Cost:
- Need a wider path to the instruction cache
- Need more ALUs, register file ports, ...
- Complexity: more bypass & stall cases to check
Disadvantages:
- Parallel execution → more dependencies; CPI_hazard grows due to data and control hazards
L22-6

24 Resolving Hazards

- Strategy 1: Stall. Wait for the result to be available by freezing earlier pipeline stages.
- Strategy 2: Bypass. Route data to the earlier pipeline stage as soon as it is calculated.
- Strategy 3: Speculate. Guess a value and continue executing anyway. When the actual value is available, there are two cases:
  - Guessed correctly → do nothing
  - Guessed incorrectly → kill & restart with the correct value
- Strategy 4: Find something else to do.
L22-7

28 Out-of-Order Execution

Consider the expression D = 3(a - b) + 7ac

Sequential code:
  ld a
  ld b
  sub a-b
  mul 3(a-b)
  ld c
  mul ac
  mul 7ac
  add 3(a-b)+7ac
  st d

[Dataflow graph: ld a and ld b feed the subtract, which feeds the multiply by 3; ld a and ld c feed a multiply, which feeds the multiply by 7; both products feed the add, which feeds st d]

Out-of-order execution runs instructions as soon as their inputs become available.
L22-8
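
The scheduling rule can be sketched in a few lines of Python: keep each instruction's list of producers and issue it once all of them have completed. This models only the dataflow order, not the hardware (no functional-unit limits, renaming, or reorder buffer), and the latencies are invented for illustration.

# Toy dataflow scheduler for D = 3*(a-b) + 7*a*c; latencies are made-up examples.
instrs = {
    "ld_a":   ([], 3), "ld_b": ([], 3), "ld_c": ([], 3),
    "sub":    (["ld_a", "ld_b"], 1),
    "mul3":   (["sub"], 2),
    "mul_ac": (["ld_a", "ld_c"], 2),
    "mul7":   (["mul_ac"], 2),
    "add":    (["mul3", "mul7"], 1),
    "st_d":   (["add"], 1),
}

done, cycle = {}, 0                       # instruction -> completion cycle
while len(done) < len(instrs):
    for name, (srcs, lat) in instrs.items():
        # Issue any instruction whose producers have all completed by this cycle.
        if name not in done and all(s in done and done[s] <= cycle for s in srcs):
            done[name] = cycle + lat
    cycle += 1
print("completion cycles:", done)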

30 Out-of-Order Execution Example

If ld b takes a few cycles (e.g., a cache miss), we can execute instructions that do not depend on b.

Sequential code (same as before):
  ld a
  ld b
  sub a-b
  mul 3(a-b)
  ld c
  mul ac
  mul 7ac
  add 3(a-b)+7ac
  st d

[Dataflow graph annotated with instruction status (completed, executing, not ready): while ld b misses, the ld a / ld c / mul ac / mul 7ac chain can proceed]
L22-9

35 A Modern Out-of-Order Superscalar Processor

[Block diagram: I-Cache and Branch Predict feed the Fetch Unit and Instruction Buffer, then Decode/Rename and Dispatch (in order); Reservation Stations feed Int, Int, FP, FP, L/S, L/S execution units (out of order); a Reorder Buffer retires results in order through a Write Buffer to the D-Cache]

- Decode/Rename reconstructs the dataflow graph
- Reservation stations execute each instruction as soon as its source operands are available
- The reorder buffer writes back results in program order. Why is this needed?
L22-10

41 Control Flow Penalty

Modern processors have >10 pipeline stages between next-PC calculation and branch resolution!

[Pipeline diagram: PC, Fetch, Decode, RegRead, Execute, WriteBack; the next fetch starts at PC, but the branch only executes in the Execute stage, forming a loose loop]

How much work is lost every time the pipeline does not follow the correct instruction flow?
- Loop length × Pipeline width
One branch every 5-20 instructions → what is the performance impact of mispredictions?
L22-11
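
A rough sketch of that impact (all numbers below are assumptions, not measurements): the work lost per misprediction is roughly the loop length times the pipeline width, and the cycles lost per instruction are the branch frequency times the misprediction rate times the loop length.

# Control-hazard CPI estimate; every value here is an illustrative assumption.
pipeline_width = 4       # instructions fetched per cycle
loop_length = 12         # stages between next-PC prediction and branch resolution
branch_freq = 1 / 6      # one branch roughly every 6 instructions
mispredict_rate = 0.05   # 95% prediction accuracy

wasted_slots = loop_length * pipeline_width
cpi_control = branch_freq * mispredict_rate * loop_length
print(f"~{wasted_slots} instruction slots of work lost per misprediction")
print(f"control-hazard CPI contribution ~ {cpi_control:.3f}")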

48 RISC-V Branches and Jumps

Each instruction fetch depends on information from the preceding instruction:
1) Is the preceding instruction a taken branch or jump?
2) If so, what is the target address?

Instruction | Taken known?        | Target known?
JAL         | After Inst. Decode  | After Inst. Decode
JALR        | After Inst. Decode  | After Inst. Execute
Branches    | After Inst. Execute | After Inst. Decode
L22-12

51 Resolving Hazards

- Strategy 1: Stall. Wait for the result to be available by freezing earlier pipeline stages.
- Strategy 2: Bypass. Route data to the earlier pipeline stage as soon as it is calculated.
- Strategy 3: Speculate. Guess a value and continue executing anyway; for control hazards, this means predicting the jump/branch target and direction. When the actual value is available, there are two cases:
  - Guessed correctly → do nothing
  - Guessed incorrectly → kill & restart with the correct value
- Strategy 4: Find something else to do.
L22-13

52 Static Branch Prediction

The overall probability that a branch is taken is ~60-70%, but:
- Backward branches (e.g., a BEQ to a lower address): ~90% taken
- Forward branches (e.g., a BEQ to a higher address): ~50% taken

Some ISAs attach preferred-direction hints to branches, e.g., the Motorola MC88110's bne0 (preferred taken) and beq0 (not taken). Static prediction achieves ~80% accuracy.
L22-14
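
A minimal sketch of the heuristic behind those numbers, backward-taken / forward-not-taken (BTFN): predict taken when the target address is below the branch (a likely loop back-edge). The addresses in the example are hypothetical.

# BTFN static prediction sketch.
def btfn_predict(branch_pc: int, target_pc: int) -> bool:
    """Predict taken for backward branches (likely loop back-edges)."""
    return target_pc < branch_pc

print(btfn_predict(0x1040, 0x1000))  # backward branch -> predict taken (~90% accurate)
print(btfn_predict(0x1040, 0x1080))  # forward branch  -> predict not-taken (~50/50)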

55 Dynamic Branch Prediction: Learning from Past Behavior

[Diagram: the PC indexes a Predictor, which produces a Prediction (predict); the truth/feedback later updates the Predictor (update)]

- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at its next execution
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)
L22-15

56 Predicting the Target Address: Branch Target Buffer (BTB)

[Diagram: k bits of the PC index a 2^k-entry direct-mapped BTB (can also be set-associative); each entry holds an entry PC (tag), a valid bit, and a predicted target PC; a tag match selects the target]

The BTB is a cache for targets: it remembers the last target PC for taken branches and jumps.
- If hit, use the stored target as the predicted next PC
- If miss, use PC+4 as the predicted next PC
- After the target is known, update the BTB if the prediction was wrong
L22-16

60 Integrating the BTB in the Pipeline

[Pipeline diagram: PC, Fetch, Decode, RegRead, Execute, WriteBack; the BTB sits next to the PC, forming a tight loop that predicts the next PC immediately; a correct-misprediction path feeds back from later stages]

- Predict the next PC immediately (tight loop between the PC and the BTB)
- Correct the misprediction when the right outcome is known
L22-17

67 BTB Implementation Details

[Diagram: k PC bits index a 2^k-entry direct-mapped BTB holding tag(pc_i), target_i, and valid bits; a tag match selects the target, which feeds the instruction memory]

Unlike caches, it is fine if the BTB produces an invalid next PC: it's just a prediction!
Therefore, BTB area & delay can be reduced by:
- Making tags arbitrarily small (match with a subset of PC bits)
- Storing only a subset of target PC bits (fill the missing bits from the current PC)
- Not storing valid bits
Even small BTBs are very effective!
L22-18

71 BTB Interface

interface BTB;
    method Addr predict(Addr pc);
    method Action update(Addr pc, Addr nextPc, Bool taken);
endinterface

- predict: simple lookup to predict nextPc in the Fetch stage
- update: on a pc misprediction, if the jump or branch at pc was taken, the BTB is updated with the new (pc, nextPc); otherwise, the entry for pc is deleted
- A BTB is a good way to improve performance on parts 2 and 3 of the design project
L22-19
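
A behavioral model of those two methods, written in Python rather than Bluespec and not taken from the project skeleton: a direct-mapped table indexed by low PC bits, returning PC+4 on a miss, inserting on a taken misprediction, and deleting the entry otherwise.

# Python sketch of a 2^k-entry direct-mapped BTB (behavioral model only).
class BTB:
    def __init__(self, k: int = 6):
        self.k = k
        self.entries = {}                  # index -> (tag, target)

    def _index_tag(self, pc: int):
        word = pc >> 2                     # drop the byte offset
        return word & ((1 << self.k) - 1), word >> self.k

    def predict(self, pc: int) -> int:
        """Return the stored target on a hit, PC+4 on a miss."""
        idx, tag = self._index_tag(pc)
        entry = self.entries.get(idx)
        return entry[1] if entry and entry[0] == tag else pc + 4

    def update(self, pc: int, next_pc: int, taken: bool):
        """On a misprediction: insert (pc, next_pc) if taken, else delete the entry."""
        idx, tag = self._index_tag(pc)
        if taken:
            self.entries[idx] = (tag, next_pc)
        elif idx in self.entries and self.entries[idx][0] == tag:
            del self.entries[idx]

btb = BTB()
btb.update(0x1040, 0x1000, taken=True)     # learn a taken loop branch
print(hex(btb.predict(0x1040)))            # -> 0x1000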

75 Better Branch Direction Prediction

Consider the following loop:
loop: addi a1, a1, -1
      bnez a1, loop

How many mispredictions does the BTB incur per loop?
- One on loop exit
- Another one on the first iteration
L22-20

78 Two-Bit Direction Predictor (Smith 1981)

Use two bits per BTB entry instead of one valid bit, and manage them as a saturating counter (moved up on taken, down on not-taken):
  11  Strongly taken
  10  Weakly taken
  01  Weakly not-taken
  00  Strongly not-taken

The direction prediction changes only after two wrong predictions.
How many mispredictions per loop? 1
L22-21
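
A sketch of one such counter and its behavior on the loop above (the iteration count is an assumed example); it confirms the single misprediction per pass once the counter has warmed up.

# 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken.
class TwoBitCounter:
    def __init__(self, state: int = 3):    # start in "strongly taken"
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# bnez in a 10-iteration loop: taken 9 times, then not-taken on loop exit.
ctr, mispredicts = TwoBitCounter(), 0
for outcome in [True] * 9 + [False]:
    mispredicts += ctr.predict() != outcome
    ctr.update(outcome)
print("mispredictions in one pass through the loop:", mispredicts)   # 1 (loop exit)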

80 Modern Processors Combine Multiple Specialized Predictors

[Pipeline diagram: the BTB predicts the next PC immediately at Fetch; the instruction type and branch/JAL target are known after Decode; the branch direction and JALR target are known after Execute; a branch direction predictor, return address predictor, and loop predictor feed misprediction corrections back at the appropriate stages]

The best predictors reflect program behavior.
L22-22

82 Summary

Modern processors rely on a handful of techniques:
- Deep pipelines → multi-GHz frequencies
- Wide (superscalar) pipelines → multiple instructions/cycle
- Out-of-order execution → reduce the impact of data hazards
- Branch prediction → reduce the impact of control hazards

The main objective is high sequential performance:
- High costs (area, power)
- Requires high memory bandwidth and low latency
- Simple to use (but knowing these techniques is often needed to write high-performance programs!)
L22-23

83 Thank you! Next lecture: Bypassing in Bluespec + Design Project Tips L22-24
