Complex Pipelines and Branch Prediction

1 Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1

5 Processor Performance

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle), i.e., instruction count × CPI × t_CK

Pipelining lowers t_CK. What about CPI?
CPI = CPI_ideal + CPI_hazard
- CPI_ideal: cycles per instruction if there are no stalls
- CPI_hazard contributors:
  - Data hazards: long operations, cache misses
  - Control hazards: branches, jumps, exceptions
L22-2
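
To make the iron law concrete, here is a minimal Python sketch (with invented numbers rather than figures from the lecture) that plugs assumed values into Time/Program = instruction count × CPI × t_CK:

# Iron-law back-of-the-envelope; every number below is an illustrative assumption.
insts = 1_000_000        # dynamic instruction count
cpi_ideal = 1.0          # 5-stage pipeline with no stalls
cpi_hazard = 0.4         # extra cycles/instruction from data + control hazards
t_ck = 0.5e-9            # 0.5 ns clock period (2 GHz)

cpi = cpi_ideal + cpi_hazard
exec_time = insts * cpi * t_ck
print(f"CPI = {cpi:.2f}, execution time = {exec_time * 1e3:.3f} ms")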

7 5-Stage Pipelined Processors

[Pipeline stages: IF, DEC, EXE, MEM, WB]

Advantages:
- CPI_ideal = 1 (pipelining)
- Simple and elegant
- Still used in ARM & MIPS processors
Shortcomings:
- Upper performance bound is CPI = 1
- High-latency instructions are not handled well: one stage for access to large caches or the multiplier means a long clock cycle time
- Unnecessary stalls due to the rigid pipeline: if one instruction stalls, everything behind it stalls
L22-3

11 Improving 5-Stage Pipeline Performance

- Increase clock frequency: deeper pipelines (overlap more instructions)
- Reduce CPI_ideal: wider pipelines (each pipeline stage processes multiple instructions)
- Reduce impact of data hazards: out-of-order execution (execute each instruction as soon as its source operands are available)
  (Demystification, not on the quiz)
- Reduce impact of control hazards: branch prediction (predict both target and direction of jumps and branches)
L22-4

17 Deeper Pipelines

[Pipeline diagram: Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers]

Break up the datapath into N pipeline stages; ideal t_CK = 1/N compared to non-pipelined. So let's use a large N!
Advantage: higher clock frequency
- The workhorse behind multi-GHz processors
- Pipeline depths: Opteron: 11, Power5: 17, Pentium 4: 34, Nehalem: 16
Cost:
- More pipeline registers
- Complexity: more bypass paths and control logic
Disadvantages:
- More overlapping → more dependencies; CPI_hazard grows due to data and control hazards
- Clock overhead becomes increasingly important
- Power consumption
L22-5
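
The clock-overhead cost above can be seen in a back-of-the-envelope model: if the unpipelined logic delay is split across N stages but each stage adds a fixed pipeline-register delay, then t_CK = t_logic/N + t_reg stops shrinking linearly. A minimal Python sketch, with assumed delays:

# Deeper-pipeline clock model; t_logic and t_reg are illustrative assumptions.
t_logic = 10.0   # ns of combinational logic in the unpipelined datapath
t_reg = 0.2      # ns of pipeline-register overhead added per stage

for n in (1, 5, 10, 20, 40):
    t_ck = t_logic / n + t_reg
    speedup = (t_logic + t_reg) / t_ck   # ideal speedup, ignoring hazards
    print(f"N={n:2d}  t_CK={t_ck:5.2f} ns  ideal speedup={speedup:4.1f}x")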

21 Wider (aka Superscalar) Pipelines

[Pipeline diagram: Fetch, Decode, Read Registers, ALU, Memory, Write Registers, each stage W instructions wide]

Each stage operates on up to W instructions each clock cycle.
Advantage: lower CPI_ideal (1/W)
- Issue widths: Opteron: 3, Pentium 4: 3, Nehalem: 4, Power7: 8
Cost:
- Need a wider path to the instruction cache
- Need more ALUs, register file ports, ...
- Complexity: more bypass & stall cases to check
Disadvantages:
- Parallel execution → more dependencies; CPI_hazard grows due to data and control hazards
L22-6

24 Resolving Hazards

- Strategy 1: Stall. Wait for the result to be available by freezing earlier pipeline stages.
- Strategy 2: Bypass. Route data to the earlier pipeline stage as soon as it is calculated.
- Strategy 3: Speculate. Guess a value and continue executing anyway. When the actual value is available, there are two cases:
  - Guessed correctly → do nothing
  - Guessed incorrectly → kill & restart with the correct value
- Strategy 4: Find something else to do.
L22-7

28 Out-of-Order Execution

Consider the expression D = 3(a - b) + 7ac

Sequential code:
  ld a
  ld b
  sub a-b
  mul 3(a-b)
  ld c
  mul ac
  mul 7ac
  add 3(a-b)+7ac
  st d

[Dataflow graph: ld a and ld b feed the subtract, which feeds the multiply by 3; ld a and ld c feed a multiply, which feeds the multiply by 7; both products feed the add, which feeds st d]

Out-of-order execution runs instructions as soon as their inputs become available.
L22-8
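
The scheduling rule can be sketched in a few lines of Python: keep each instruction's list of producers and issue it once all of them have completed. This models only the dataflow order, not the hardware (no functional-unit limits, renaming, or reorder buffer), and the latencies are invented for illustration.

# Toy dataflow scheduler for D = 3*(a-b) + 7*a*c; latencies are made-up examples.
instrs = {
    "ld_a":   ([], 3), "ld_b": ([], 3), "ld_c": ([], 3),
    "sub":    (["ld_a", "ld_b"], 1),
    "mul3":   (["sub"], 2),
    "mul_ac": (["ld_a", "ld_c"], 2),
    "mul7":   (["mul_ac"], 2),
    "add":    (["mul3", "mul7"], 1),
    "st_d":   (["add"], 1),
}

done, cycle = {}, 0                       # instruction -> completion cycle
while len(done) < len(instrs):
    for name, (srcs, lat) in instrs.items():
        # Issue any instruction whose producers have all completed by this cycle.
        if name not in done and all(s in done and done[s] <= cycle for s in srcs):
            done[name] = cycle + lat
    cycle += 1
print("completion cycles:", done)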

30 Out-of-Order Execution Example

If ld b takes a few cycles (e.g., a cache miss), we can execute instructions that do not depend on b.

Sequential code (same as before):
  ld a
  ld b
  sub a-b
  mul 3(a-b)
  ld c
  mul ac
  mul 7ac
  add 3(a-b)+7ac
  st d

[Dataflow graph annotated with instruction status (completed, executing, not ready): while ld b misses, the ld a / ld c / mul ac / mul 7ac chain can proceed]
L22-9

35 A Modern Out-of-Order Superscalar Processor

[Block diagram: I-Cache and Branch Predict feed the Fetch Unit and Instruction Buffer, then Decode/Rename and Dispatch (in order); Reservation Stations feed Int, Int, FP, FP, L/S, L/S execution units (out of order); a Reorder Buffer retires results in order through a Write Buffer to the D-Cache]

- Decode/Rename reconstructs the dataflow graph
- Reservation stations execute each instruction as soon as its source operands are available
- The reorder buffer writes back results in program order. Why is this needed?
L22-10

41 Control Flow Penalty

Modern processors have >10 pipeline stages between next-PC calculation and branch resolution!

[Pipeline diagram: PC, Fetch, Decode, RegRead, Execute, WriteBack; the next fetch starts at PC, but the branch only executes in the Execute stage, forming a loose loop]

How much work is lost every time the pipeline does not follow the correct instruction flow?
- Loop length × Pipeline width
One branch every 5-20 instructions → what is the performance impact of mispredictions?
L22-11
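
A rough sketch of that impact (all numbers below are assumptions, not measurements): the work lost per misprediction is roughly the loop length times the pipeline width, and the cycles lost per instruction are the branch frequency times the misprediction rate times the loop length.

# Control-hazard CPI estimate; every value here is an illustrative assumption.
pipeline_width = 4       # instructions fetched per cycle
loop_length = 12         # stages between next-PC prediction and branch resolution
branch_freq = 1 / 6      # one branch roughly every 6 instructions
mispredict_rate = 0.05   # 95% prediction accuracy

wasted_slots = loop_length * pipeline_width
cpi_control = branch_freq * mispredict_rate * loop_length
print(f"~{wasted_slots} instruction slots of work lost per misprediction")
print(f"control-hazard CPI contribution ~ {cpi_control:.3f}")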

48 RISC-V Branches and Jumps

Each instruction fetch depends on information from the preceding instruction:
1) Is the preceding instruction a taken branch or jump?
2) If so, what is the target address?

Instruction | Taken known?        | Target known?
JAL         | After Inst. Decode  | After Inst. Decode
JALR        | After Inst. Decode  | After Inst. Execute
Branches    | After Inst. Execute | After Inst. Decode
L22-12

51 Resolving Hazards

- Strategy 1: Stall. Wait for the result to be available by freezing earlier pipeline stages.
- Strategy 2: Bypass. Route data to the earlier pipeline stage as soon as it is calculated.
- Strategy 3: Speculate. Guess a value and continue executing anyway; for control hazards, this means predicting the jump/branch target and direction. When the actual value is available, there are two cases:
  - Guessed correctly → do nothing
  - Guessed incorrectly → kill & restart with the correct value
- Strategy 4: Find something else to do.
L22-13

52 Static Branch Prediction

The overall probability that a branch is taken is ~60-70%, but:
- Backward branches (e.g., a BEQ to a lower address): ~90% taken
- Forward branches (e.g., a BEQ to a higher address): ~50% taken

Some ISAs attach preferred-direction hints to branches, e.g., the Motorola MC88110's bne0 (preferred taken) and beq0 (not taken). Static prediction achieves ~80% accuracy.
L22-14
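
A minimal sketch of the heuristic behind those numbers, backward-taken / forward-not-taken (BTFN): predict taken when the target address is below the branch (a likely loop back-edge). The addresses in the example are hypothetical.

# BTFN static prediction sketch.
def btfn_predict(branch_pc: int, target_pc: int) -> bool:
    """Predict taken for backward branches (likely loop back-edges)."""
    return target_pc < branch_pc

print(btfn_predict(0x1040, 0x1000))  # backward branch -> predict taken (~90% accurate)
print(btfn_predict(0x1040, 0x1080))  # forward branch  -> predict not-taken (~50/50)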

55 Dynamic Branch Prediction: Learning from Past Behavior

[Diagram: the PC indexes a Predictor, which produces a Prediction (predict); the truth/feedback later updates the Predictor (update)]

- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at its next execution
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)
L22-15

56 Predicting the Target Address: Branch Target Buffer (BTB)

[Diagram: k bits of the PC index a 2^k-entry direct-mapped BTB (can also be set-associative); each entry holds an entry PC (tag), a valid bit, and a predicted target PC; a tag match selects the target]

The BTB is a cache for targets: it remembers the last target PC for taken branches and jumps.
- If hit, use the stored target as the predicted next PC
- If miss, use PC+4 as the predicted next PC
- After the target is known, update the BTB if the prediction was wrong
L22-16

60 Integrating the BTB in the Pipeline

[Pipeline diagram: PC, Fetch, Decode, RegRead, Execute, WriteBack; the BTB sits next to the PC, forming a tight loop that predicts the next PC immediately; a correct-misprediction path feeds back from later stages]

- Predict the next PC immediately (tight loop between the PC and the BTB)
- Correct the misprediction when the right outcome is known
L22-17

67 BTB Implementation Details

[Diagram: k PC bits index a 2^k-entry direct-mapped BTB holding tag(pc_i), target_i, and valid bits; a tag match selects the target, which feeds the instruction memory]

Unlike caches, it is fine if the BTB produces an invalid next PC: it's just a prediction!
Therefore, BTB area & delay can be reduced by:
- Making tags arbitrarily small (match with a subset of PC bits)
- Storing only a subset of target PC bits (fill the missing bits from the current PC)
- Not storing valid bits
Even small BTBs are very effective!
L22-18

71 BTB Interface

interface BTB;
    method Addr predict(Addr pc);
    method Action update(Addr pc, Addr nextPc, Bool taken);
endinterface

- predict: simple lookup to predict nextPc in the Fetch stage
- update: on a pc misprediction, if the jump or branch at pc was taken, the BTB is updated with the new (pc, nextPc); otherwise, the entry for pc is deleted
- A BTB is a good way to improve performance on parts 2 and 3 of the design project
L22-19
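
A behavioral model of those two methods, written in Python rather than Bluespec and not taken from the project skeleton: a direct-mapped table indexed by low PC bits, returning PC+4 on a miss, inserting on a taken misprediction, and deleting the entry otherwise.

# Python sketch of a 2^k-entry direct-mapped BTB (behavioral model only).
class BTB:
    def __init__(self, k: int = 6):
        self.k = k
        self.entries = {}                  # index -> (tag, target)

    def _index_tag(self, pc: int):
        word = pc >> 2                     # drop the byte offset
        return word & ((1 << self.k) - 1), word >> self.k

    def predict(self, pc: int) -> int:
        """Return the stored target on a hit, PC+4 on a miss."""
        idx, tag = self._index_tag(pc)
        entry = self.entries.get(idx)
        return entry[1] if entry and entry[0] == tag else pc + 4

    def update(self, pc: int, next_pc: int, taken: bool):
        """On a misprediction: insert (pc, next_pc) if taken, else delete the entry."""
        idx, tag = self._index_tag(pc)
        if taken:
            self.entries[idx] = (tag, next_pc)
        elif idx in self.entries and self.entries[idx][0] == tag:
            del self.entries[idx]

btb = BTB()
btb.update(0x1040, 0x1000, taken=True)     # learn a taken loop branch
print(hex(btb.predict(0x1040)))            # -> 0x1000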

75 Better Branch Direction Prediction

Consider the following loop:
loop: addi a1, a1, -1
      bnez a1, loop

How many mispredictions does the BTB incur per loop?
- One on loop exit
- Another one on the first iteration
L22-20

78 Two-Bit Direction Predictor (Smith 1981)

Use two bits per BTB entry instead of one valid bit, and manage them as a saturating counter (moved up on taken, down on not-taken):
  11  Strongly taken
  10  Weakly taken
  01  Weakly not-taken
  00  Strongly not-taken

The direction prediction changes only after two wrong predictions.
How many mispredictions per loop? 1
L22-21
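
A sketch of one such counter and its behavior on the loop above (the iteration count is an assumed example); it confirms the single misprediction per pass once the counter has warmed up.

# 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken.
class TwoBitCounter:
    def __init__(self, state: int = 3):    # start in "strongly taken"
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# bnez in a 10-iteration loop: taken 9 times, then not-taken on loop exit.
ctr, mispredicts = TwoBitCounter(), 0
for outcome in [True] * 9 + [False]:
    mispredicts += ctr.predict() != outcome
    ctr.update(outcome)
print("mispredictions in one pass through the loop:", mispredicts)   # 1 (loop exit)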

80 Modern Processors Combine Multiple Specialized Predictors

[Pipeline diagram: the BTB predicts the next PC immediately at Fetch; the instruction type and branch/JAL target are known after Decode; the branch direction and JALR target are known after Execute; a branch direction predictor, return address predictor, and loop predictor feed misprediction corrections back at the appropriate stages]

The best predictors reflect program behavior.
L22-22

82 Summary

Modern processors rely on a handful of techniques:
- Deep pipelines → multi-GHz frequencies
- Wide (superscalar) pipelines → multiple instructions/cycle
- Out-of-order execution → reduce the impact of data hazards
- Branch prediction → reduce the impact of control hazards

The main objective is high sequential performance:
- High costs (area, power)
- Requires high memory bandwidth and low latency
- Simple to use (but knowing these techniques is often needed to write high-performance programs!)
L22-23

83 Thank you! Next lecture: Bypassing in Bluespec + Design Project Tips L22-24
