Complex Pipelines and Branch Prediction
Complex Pipelines and Branch Prediction
Daniel Sanchez, Computer Science & Artificial Intelligence Lab, M.I.T. (L22-1)
Processor Performance (L22-2)

Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle) = IC × CPI × tCK

Pipelining lowers tCK. What about CPI?

CPI = CPI_ideal + CPI_hazard
- CPI_ideal: cycles per instruction if there were no stalls
- CPI_hazard contributors:
  - Data hazards: long operations, cache misses
  - Control hazards: branches, jumps, exceptions
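The decomposition above can be sketched numerically. A minimal example; the instruction count, CPI terms, and clock period below are illustrative, not values from the lecture:

```python
# Execution time from the performance equation, with CPI split into the
# ideal and hazard components as on the slide. All numbers are made up.
def exec_time(n_instructions, cpi_ideal, cpi_hazard, t_ck):
    cpi = cpi_ideal + cpi_hazard
    return n_instructions * cpi * t_ck

# 1e9 instructions, CPI_ideal = 1, 0.5 extra cycles/instr from hazards,
# 1 ns clock period -> 1.5 seconds
t = exec_time(1e9, 1.0, 0.5, 1e-9)
```

The formula makes the two levers explicit: pipelining attacks t_ck, while the techniques in this lecture attack CPI_hazard.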
5-Stage Pipelined Processors (L22-3)

Pipeline stages: IF, DEC, EXE, MEM, WB

Advantages:
- CPI_ideal = 1 (pipelining)
- Simple and elegant
- Still used in ARM & MIPS processors

Shortcomings:
- The upper performance bound is CPI = 1
- High-latency instructions are not handled well: a single stage for access to large caches or the multiplier forces a long clock cycle time
- Unnecessary stalls due to the rigid pipeline: if one instruction stalls, everything behind it stalls

(November 27, MIT Fall 2018)
Improving 5-Stage Pipeline Performance (L22-4)

- Increase clock frequency: deeper pipelines (overlap more instructions)
- Reduce CPI_ideal: wider pipelines (each pipeline stage processes multiple instructions)
- Reduce the impact of data hazards: out-of-order execution (execute each instruction as soon as its source operands are available)
- Reduce the impact of control hazards: branch prediction (predict both the target and the direction of jumps and branches)

(Demystification, not on the quiz.)
Deeper Pipelines (L22-5)

Stages: Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers

Break up the datapath into N pipeline stages: ideal tCK = 1/N compared to non-pipelined. So let's use a large N!

Advantage: higher clock frequency
- The workhorse behind multi-GHz processors
- Pipeline depths: Opteron: 11, Power5: 17, Pentium 4: 34, Nehalem: 16

Cost:
- More pipeline registers
- Complexity: more bypass paths and control logic

Disadvantages:
- More overlapping means more dependencies: CPI_hazard grows due to data and control hazards
- Clock overhead becomes increasingly important
- Power consumption
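The "clock overhead" disadvantage can be made concrete with a small sketch: once register setup/clock-to-Q overhead is included, the clock period of an N-stage pipeline stops shrinking like 1/N. The logic and register delays below are illustrative assumptions:

```python
# Clock period of an N-stage pipeline with per-stage register overhead.
# t_logic is the total combinational delay, t_reg the fixed overhead
# added to every stage (both in ns, and both illustrative).
def t_ck(n_stages, t_logic=10.0, t_reg=0.5):
    return t_logic / n_stages + t_reg

speedup_5  = t_ck(1) / t_ck(5)    # 10.5 / 2.5  = 4.2
speedup_40 = t_ck(1) / t_ck(40)   # 10.5 / 0.75 = 14.0, far from 40x
```

Even ignoring the growth in CPI_hazard, an 8x deeper pipeline here buys well under 8x more frequency, which is one reason pipeline depths plateaued.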
Wider (aka Superscalar) Pipelines (L22-6)

Stages: Fetch, Decode, Read Registers, ALU, Memory, Write Registers, with each stage operating on up to W instructions each clock cycle.

Advantage: lower CPI_ideal (1/W)
- Widths: Opteron: 3, Pentium 4: 3, Nehalem: 4, Power7: 8

Cost:
- Need a wider path to the instruction cache
- Need more ALUs, register file ports, ...
- Complexity: more bypass and stall cases to check

Disadvantages:
- Parallel execution means more dependencies: CPI_hazard grows due to data and control hazards
Resolving Hazards (L22-7)

- Strategy 1: Stall. Wait for the result to be available by freezing earlier pipeline stages.
- Strategy 2: Bypass. Route data to the earlier pipeline stage as soon as it is calculated.
- Strategy 3: Speculate. Guess a value and continue executing anyway. When the actual value is available, there are two cases: guessed correctly, do nothing; guessed incorrectly, kill and restart with the correct value.
- Strategy 4: Find something else to do.
Out-of-Order Execution (L22-8)

Consider the expression D = 3(a - b) + 7ac.

Sequential code:
  ld a
  ld b
  sub a-b
  mul 3(a-b)
  ld c
  mul ac
  mul 7ac
  add 3(a-b)+7ac
  st d

Dataflow graph: ld a and ld b feed the subtract, which feeds the multiply by 3; ld a and ld c feed the multiply ac, which feeds the multiply by 7; the two products feed the add, which feeds st d.

Out-of-order execution runs instructions as soon as their inputs become available.
Out-of-Order Execution Example (L22-9)

If ld b takes a few cycles (e.g., a cache miss), the processor can execute the instructions that do not depend on b (ld a, ld c, mul ac, mul 7ac) while the load is outstanding. At any instant, each node of the dataflow graph is Completed, Executing, or Not ready.
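The example can be sketched as a dataflow-timing computation: each instruction starts as soon as all of its sources have finished. This is a minimal model assuming unlimited functional units and unit latencies, with a 5-cycle ld b standing in for the cache miss (all latencies illustrative):

```python
# Each entry: (name, list of source instructions, latency in cycles).
# The program is listed in (a valid topological) program order.
prog = [
    ("ld a",    [],                  1),
    ("ld b",    [],                  5),   # cache miss
    ("a-b",     ["ld a", "ld b"],    1),
    ("3(a-b)",  ["a-b"],             1),
    ("ld c",    [],                  1),
    ("ac",      ["ld a", "ld c"],    1),
    ("7ac",     ["ac"],              1),
    ("sum",     ["3(a-b)", "7ac"],   1),
    ("st d",    ["sum"],             1),
]

def ooo_finish_times(prog):
    """Finish time of each instruction when it issues as soon as its
    sources are done (dataflow-limited, unlimited functional units)."""
    done = {}
    for name, srcs, lat in prog:
        start = max((done[s] for s in srcs), default=0)
        done[name] = start + lat
    return done

times = ooo_finish_times(prog)
```

Here `ac` finishes at cycle 2, long before `a-b` (cycle 6): the multiply chain on the right of the graph overlaps with the stalled load, exactly the behavior the slide describes.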
A Modern Out-of-Order Superscalar Processor (L22-10)

- Front end (in order): I-Cache, Branch Predict, Fetch Unit, Instruction Buffer, Decode/Rename, Dispatch. This part reconstructs the dataflow graph.
- Execution (out of order): Reservation Stations feed the functional units (Int, Int, FP, FP, L/S, L/S) and the D-Cache. Each instruction executes as soon as its source operands are available.
- Back end (in order): Reorder Buffer, Retire, Write Buffer. Results are written back in program order. Why is this needed? (It enables precise exceptions and recovery from misspeculation.)
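The in-order write-back can be illustrated with a toy reorder buffer. A minimal sketch, with made-up tags and results; a real ROB also tracks destination registers, exceptions, and speculation state:

```python
from collections import deque

class ReorderBuffer:
    """Instructions complete out of order, but retire strictly in
    program (dispatch) order."""
    def __init__(self):
        self.rob = deque()   # tags in program order
        self.done = {}       # tag -> result, filled out of order
    def dispatch(self, tag):
        self.rob.append(tag)
    def complete(self, tag, result):
        self.done[tag] = result
    def retire(self):
        """Retire the longest prefix of completed instructions."""
        retired = []
        while self.rob and self.rob[0] in self.done:
            tag = self.rob.popleft()
            retired.append((tag, self.done.pop(tag)))
        return retired

rob = ReorderBuffer()
for t in ("i0", "i1", "i2"):
    rob.dispatch(t)
rob.complete("i2", 42)        # i2 finishes first...
assert rob.retire() == []     # ...but cannot retire past i0 and i1
rob.complete("i0", 7)
rob.complete("i1", 9)         # now all three can retire, in order
```

The key invariant: `retire()` never emits a result before every older instruction has one, which is what makes exceptions and misspeculation recoverable.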
Control Flow Penalty (L22-11)

Modern processors have >10 pipeline stages between next-PC calculation and branch resolution!

Pipeline: PC, Fetch, Decode, RegRead, Execute, WriteBack. The next fetch is started at PC, but the branch is executed only at Execute: a loose loop.

How much work is lost every time the pipeline does not follow the correct instruction flow? Loop length × pipeline width. With one branch every 5-20 instructions, what is the performance impact of mispredictions?
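The question can be answered with a back-of-the-envelope sketch. All parameters below (branch frequency, misprediction rate, flush penalty) are illustrative assumptions, not measured values:

```python
# CPI impact of mispredictions: every mispredicted branch flushes the
# loose loop, costing roughly penalty_cycles of fetch work.
def cpi_with_branches(cpi_base, branch_freq, mispredict_rate, penalty_cycles):
    return cpi_base + branch_freq * mispredict_rate * penalty_cycles

# 1 branch every 5 instructions, 10% mispredicted, 15-cycle flush:
cpi = cpi_with_branches(1.0, 1 / 5, 0.10, 15)   # 1.0 + 0.3 = 1.3
```

Even a 10% misprediction rate costs 30% extra cycles here, and in a W-wide machine each flushed cycle wastes W instruction slots, which is why accurate prediction is essential.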
RISC-V Branches and Jumps (L22-12)

Each instruction fetch depends on information from the preceding instruction:
1) Is the preceding instruction a taken branch or jump?
2) If so, what is the target address?

Instruction   Taken known?          Target known?
JAL           After Inst. Decode    After Inst. Decode
JALR          After Inst. Decode    After Inst. Execute
Branches      After Inst. Execute   After Inst. Decode
Resolving Hazards, Revisited (L22-13)

The same four strategies apply: stall, bypass, speculate, or find something else to do. For control hazards, the relevant one is Strategy 3, Speculate: predict the jump/branch target and direction, continue executing anyway, and when the actual outcome is known either do nothing (guessed correctly) or kill and restart with the correct value (guessed incorrectly).
Static Branch Prediction (L22-14)

The probability that a branch is taken is ~60-70% overall, but it depends on direction: backward branches (e.g., a BEQ at the bottom of a loop) are taken ~90% of the time, while forward branches are taken only ~50% of the time. Some ISAs attach preferred-direction hints to branches, e.g., the Motorola MC88110's bne0 (preferred taken) and beq0 (preferred not taken). Static prediction achieves ~80% accuracy.
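A classic static scheme built on exactly these statistics (not named on the slide) is backward-taken/forward-not-taken (BTFN), which needs no hint bits at all. A minimal sketch:

```python
# BTFN static prediction: loops branch backward and are usually taken,
# so predict taken whenever the target address is below the branch PC.
def btfn_predict(pc, target):
    return target < pc   # backward branch -> predict taken

assert btfn_predict(pc=0x100, target=0x0F0)       # loop back edge
assert not btfn_predict(pc=0x100, target=0x140)   # forward skip
```

This captures the ~90% backward-taken behavior for free, but can do nothing about forward branches, which is where dynamic prediction comes in.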
Dynamic Branch Prediction: Learning from Past Behavior (L22-15)

The PC indexes a predictor, which produces a prediction; the truth/feedback from execution updates the predictor. Two sources of predictability:
- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at its next execution.
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution).
Predicting the Target Address: the Branch Target Buffer (BTB) (L22-16)

A 2^k-entry direct-mapped BTB (it can also be set-associative) is indexed by k bits of the PC; each entry holds an entry PC (the tag), a valid bit, and a predicted target PC. The BTB is a cache for targets:
- It remembers the last target PC for taken branches and jumps
- On a hit, use the stored target as the predicted next PC
- On a miss, use PC+4 as the predicted next PC
- After the target is known, update the BTB if the prediction was wrong
Integrating the BTB in the Pipeline (L22-17)

The BTB sits next to the PC register and predicts the next PC immediately, forming a tight loop (PC to BTB and back) instead of waiting for Decode or Execute. When the right outcome is known later in the pipeline, the misprediction is corrected: the wrong-path instructions are killed and fetch is redirected.
BTB Implementation Details (L22-18)

A 2^k-entry direct-mapped BTB stores tag(pc_i), target_i, and a valid bit per entry, and signals a match against the fetch PC in parallel with the instruction memory access.

Unlike caches, it is fine if the BTB produces an invalid next PC: it's just a prediction! Therefore, BTB area and delay can be reduced by:
- Making tags arbitrarily small (match against a subset of PC bits)
- Storing only a subset of the target PC bits (fill the missing bits from the current PC)
- Not storing valid bits

Even small BTBs are very effective!
BTB Interface (L22-19)

    interface BTB;
        method Addr predict(Addr pc);
        method Action update(Addr pc, Addr nextPc, Bool taken);
    endinterface

- predict: a simple lookup to predict nextPc in the Fetch stage.
- update: on a pc misprediction, if the jump or branch at pc was taken, the BTB is updated with the new (pc, nextPc) pair; otherwise, the entry for pc is deleted.

A BTB is a good way to improve performance on parts 2 and 3 of the design project.
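The Bluespec interface above can be mirrored in a small Python model, assuming a direct-mapped table indexed by low PC bits with partial tags (the sizes are illustrative; real BTBs shrink tags and targets further, as the previous slide notes):

```python
class BTB:
    """Direct-mapped branch target buffer with 2**k entries.
    predict() returns a guessed next PC; update() follows the
    semantics above: taken inserts, not-taken deletes."""
    def __init__(self, k=6):
        self.k = k
        self.table = [None] * (1 << k)   # each entry: (tag, target)
    def _index(self, pc):
        return (pc >> 2) & ((1 << self.k) - 1)   # skip byte offset bits
    def _tag(self, pc):
        return pc >> (2 + self.k)
    def predict(self, pc):
        entry = self.table[self._index(pc)]
        if entry and entry[0] == self._tag(pc):
            return entry[1]      # hit: last seen target
        return pc + 4            # miss: fall through
    def update(self, pc, next_pc, taken):
        if taken:
            self.table[self._index(pc)] = (self._tag(pc), next_pc)
        else:
            self.table[self._index(pc)] = None

btb = BTB()
assert btb.predict(0x400) == 0x404    # cold miss: PC+4
btb.update(0x400, 0x100, taken=True)
assert btb.predict(0x400) == 0x100    # now predicts the stored target
```

Aliasing between PCs that share an index and tag is allowed: it only produces a wrong prediction, never incorrect execution.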
Better Branch Direction Prediction (L22-20)

Consider the following loop:

    loop: addi a1, a1, -1
          bnez a1, loop

How many mispredictions does the BTB incur per loop execution? One on loop exit (the BTB predicts taken, but the branch falls through), and another one on the first iteration (the exit deleted the entry, so the BTB predicts PC+4).
Two-Bit Direction Predictor (Smith 1981) (L22-21)

Use two bits per BTB entry instead of one valid bit, and manage them as a saturating counter (taken moves the state up, not-taken moves it down):

    11  Strongly taken
    10  Weakly taken
    01  Weakly not-taken
    00  Strongly not-taken

The direction prediction changes only after two wrong predictions. How many mispredictions per loop execution now? One.
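A minimal sketch of the counter, with a toy workload to confirm the one-misprediction-per-loop claim (the 10-iteration loop is an illustrative example):

```python
class TwoBit:
    """2-bit saturating counter (Smith 1981). States 0..3;
    predict taken when the counter is in the upper half."""
    def __init__(self):
        self.c = 3                       # start strongly taken
    def predict(self):
        return self.c >= 2
    def update(self, taken):
        self.c = min(3, self.c + 1) if taken else max(0, self.c - 1)

# A 10-iteration loop branch, executed twice: taken 9 times, then
# not-taken once on exit.
p = TwoBit()
mispredicts = 0
for _ in range(2):
    for actual in [True] * 9 + [False]:
        if p.predict() != actual:
            mispredicts += 1
        p.update(actual)
# Only the exit mispredicts: the counter dips to "weakly taken" and
# recovers, so the next loop entry is still predicted correctly.
```

A 1-bit scheme would mispredict twice per loop (exit and re-entry); the hysteresis of the second bit removes the re-entry miss.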
Modern Processors Combine Multiple Specialized Predictors (L22-22)

The BTB predicts the next PC immediately at Fetch. The instruction type and the branch/JAL target are known after Decode; the branch direction and the JALR target are known after Execute. Alongside the BTB, modern processors use a branch direction predictor, a return address predictor, and a loop predictor, and correct mispredictions as soon as the true outcome is known. The best predictors reflect program behavior.
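One common way to combine component predictors (a mechanism the slide does not spell out) is a tournament chooser in the spirit of McFarling's combining predictor: a 2-bit counter learns which component to trust. A minimal sketch with toy components:

```python
class Tournament:
    """Selects between two component predictors with a 2-bit chooser
    that saturates toward whichever component has been more accurate."""
    def __init__(self, pred_a, pred_b):
        self.a, self.b = pred_a, pred_b
        self.chooser = 2                 # 0..3; >= 2 means "trust a"
    def predict(self):
        return self.a() if self.chooser >= 2 else self.b()
    def update(self, actual):
        a_ok, b_ok = self.a() == actual, self.b() == actual
        if a_ok and not b_ok:
            self.chooser = min(3, self.chooser + 1)
        elif b_ok and not a_ok:
            self.chooser = max(0, self.chooser - 1)

# Toy components: "always taken" vs. "always not-taken".
t = Tournament(lambda: True, lambda: False)
for actual in [False] * 4:       # a stream of not-taken branches
    t.predict()
    t.update(actual)
# The chooser has migrated to the not-taken component.
```

Real designs use the same idea with far richer components (per-branch counters, global-history predictors), which is how the combined predictor "reflects program behavior".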
Summary (L22-23)

Modern processors rely on a handful of techniques:
- Deep pipelines: multi-GHz frequencies
- Wide (superscalar) pipelines: multiple instructions per cycle
- Out-of-order execution: reduces the impact of data hazards
- Branch prediction: reduces the impact of control hazards

The main objective is high sequential performance, at high cost (area, power), and it requires high memory bandwidth and low latency. These processors are simple to use, but knowing these techniques is often needed to write high-performance programs!
83 Thank you! Next lecture: Bypassing in Bluespec + Design Project Tips L22-24
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II
CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time
More informationStatic, multiple-issue (superscaler) pipelines
Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationECE 505 Computer Architecture
ECE 505 Computer Architecture Pipelining 2 Berk Sunar and Thomas Eisenbarth Review 5 stages of RISC IF ID EX MEM WB Ideal speedup of pipelining = Pipeline depth (N) Practically Implementation problems
More informationComputer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Branch Prediction Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 11: Branch Prediction
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationCS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz
CS 61C: Great Ideas in Computer Architecture Lecture 13: Pipelining Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c/fa17 RISC-V Pipeline Pipeline Control Hazards Structural Data R-type
More informationReal Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel
More informationCOMPUTER ORGANIZATION AND DESIGN
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined
More informationQuiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!
Control Hazards 1 Today Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches! 2 Key Points: Control Hazards Control hazards occur when we don t know which instruction to execute
More informationThe Processor (3) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
The Processor (3) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationLecture 13: Branch Prediction
S 09 L13-1 18-447 Lecture 13: Branch Prediction James C. Hoe Dept of ECE, CMU March 4, 2009 Announcements: Spring break!! Spring break next week!! Project 2 due the week after spring break HW3 due Monday
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationLecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism
Lecture 8: Compiling for ILP and Branch Prediction Kunle Olukotun Gates 302 kunle@ogun.stanford.edu http://www-leland.stanford.edu/class/ee282h/ 1 Advanced pipelining and instruction level parallelism
More informationComputer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM
Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware
More informationCENG 3531 Computer Architecture Spring a. T / F A processor can have different CPIs for different programs.
Exam 2 April 12, 2012 You have 80 minutes to complete the exam. Please write your answers clearly and legibly on this exam paper. GRADE: Name. Class ID. 1. (22 pts) Circle the selected answer for T/F and
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationECE/CS 552: Introduction to Superscalar Processors
ECE/CS 552: Introduction to Superscalar Processors Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Limitations of Scalar Pipelines
More informationInstruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov
Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated
More informationCS 152 Computer Architecture and Engineering. Lecture 5 - Pipelining II (Branches, Exceptions)
CS 152 Computer Architecture and Engineering Lecture 5 - Pipelining II (Branches, Exceptions) John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationReduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction
ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch
More informationInstruction Level Parallelism. Appendix C and Chapter 3, HP5e
Instruction Level Parallelism Appendix C and Chapter 3, HP5e Outline Pipelining, Hazards Branch prediction Static and Dynamic Scheduling Speculation Compiler techniques, VLIW Limits of ILP. Implementation
More informationAdvanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017
Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More informationDetermined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version
MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationHY425 Lecture 05: Branch Prediction
HY425 Lecture 05: Branch Prediction Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS October 19, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 05: Branch Prediction 1 / 45 Exploiting ILP in hardware
More informationDynamic Control Hazard Avoidance
Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>
More informationECEC 355: Pipelining
ECEC 355: Pipelining November 8, 2007 What is Pipelining Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. A pipeline is similar in concept to an assembly
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationChapter 4 The Processor (Part 4)
Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline
More informationLooking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction
Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 15-20% of instructions
More informationImproving Performance: Pipelining
Improving Performance: Pipelining Memory General registers Memory ID EXE MEM WB Instruction Fetch (includes PC increment) ID Instruction Decode + fetching values from general purpose registers EXE EXEcute
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationInstruction Pipelining Review
Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number
More informationCS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12
Assigned 2/28/2018 CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18
More informationAdvanced Instruction-Level Parallelism
Advanced Instruction-Level Parallelism Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu
More informationPipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...
CHAPTER 6 1 Pipelining Instruction class Instruction memory ister read ALU Data memory ister write Total (in ps) Load word 200 100 200 200 100 800 Store word 200 100 200 200 700 R-format 200 100 200 100
More informationWhat about branches? Branch outcomes are not known until EXE What are our options?
What about branches? Branch outcomes are not known until EXE What are our options? 1 Control Hazards 2 Today Quiz Control Hazards Midterm review Return your papers 3 Key Points: Control Hazards Control
More informationArchitectures for Instruction-Level Parallelism
Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 17: Pipelining Wrapup Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Outline The textbook includes lots of information Focus on
More informationFinal Exam Fall 2007
ICS 233 - Computer Architecture & Assembly Language Final Exam Fall 2007 Wednesday, January 23, 2007 7:30 am 10:00 am Computer Engineering Department College of Computer Sciences & Engineering King Fahd
More informationECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines
More informationSRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design
SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these
More informationCOSC3330 Computer Architecture Lecture 14. Branch Prediction
COSC3330 Computer Architecture Lecture 14. Branch Prediction Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston opic Out-of-Order Execution Branch Prediction Superscalar
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationChapter 4. The Processor
Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA
More informationCS 61C: Great Ideas in Computer Architecture Pipelining and Hazards
CS 61C: Great Ideas in Computer Architecture Pipelining and Hazards Instructors: Vladimir Stojanovic and Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/sp16 1 Pipelined Execution Representation Time
More informationLECTURE 9. Pipeline Hazards
LECTURE 9 Pipeline Hazards PIPELINED DATAPATH AND CONTROL In the previous lecture, we finalized the pipelined datapath for instruction sequences which do not include hazards of any kind. Remember that
More informationDynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers
Dynamic Hardware Prediction Importance of control dependences Branches and jumps are frequent Limiting factor as ILP increases (Amdahl s law) Schemes to attack control dependences Static Basic (stall the
More information