Branch Prediction Chapter 3


2 More on Dependencies
We will now look at further techniques for dealing with dependencies, which reduce the ILP available in program code. A dependency may be overcome in two ways:
1. Maintain the dependency but avoid the related hazard, or
2. transform the program code to eliminate the dependency.

3 Data Flow and Dependency Detection
Data flows between instructions either through registers or through memory locations.
- Dependency detection based on registers is straightforward, since registers are named explicitly in each instruction.
- Dependencies that flow through memory are harder to detect.
Aliasing: 100(R4) and 20(R6) may refer to the same memory location.

4 Name Dependencies
Two instructions may use the same register (or memory location), a name, but no data flows between them.
Antidependence:
    DIV.D F0,F2,F4
    ADD.D F2,F6,F8
Output dependence:
    DIV.D F0,F2,F4
    ADD.D F0,F6,F8
No true data dependence. The instructions can be reordered or executed in parallel after register renaming.
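The effect of register renaming on the two instruction pairs above can be sketched as follows. This is a minimal illustration, not a model of any particular CPU: the function `rename` and the physical register names `P0`, `P1`, ... are hypothetical, and every destination simply receives a fresh name.

```python
def rename(instructions):
    """Give every destination a fresh (hypothetical) physical register P0, P1, ..."""
    mapping = {}          # architectural register -> most recent physical register
    renamed = []
    for count, (op, dest, src1, src2) in enumerate(instructions):
        # Sources read the newest name of their register, or the original
        # architectural name if nothing in this sequence has written it yet.
        new_src1 = mapping.get(src1, src1)
        new_src2 = mapping.get(src2, src2)
        new_dest = f"P{count}"
        mapping[dest] = new_dest
        renamed.append((op, new_dest, new_src1, new_src2))
    return renamed


antidependence = [("DIV.D", "F0", "F2", "F4"), ("ADD.D", "F2", "F6", "F8")]
output_dependence = [("DIV.D", "F0", "F2", "F4"), ("ADD.D", "F0", "F6", "F8")]

for pair in (antidependence, output_dependence):
    for instruction in rename(pair):
        print(instruction)
```

After renaming, the ADD.D writes P1 in both cases, so it no longer shares a destination with, or overwrites a source of, the DIV.D.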

5 Control Dependencies
Every instruction in a program, except for those in the first basic block, is control dependent on some branch instruction.
S2 depends on p2:
    if (p1) { S1; }
    if (p2) { S2; }
S2 depends on p1 and p2:
    if (p1) {
        S1;
        if (p2) { S2; }
    }

6 Dynamic Branch Prediction
Branch instructions are frequent in program code, which calls for a closer look at how branch cost can be reduced. The predicted-untaken scheme and delay slots are static techniques for dealing with branches:
- In these schemes, the action of the CPU does not depend on the actual dynamic behavior (jump or fall-through?) of a given branch instruction.

7 1-Bit Branch-Prediction Buffer
Maintain a branch-prediction buffer (branch history table) that records the last behavior of each branch:
- Small memory buffer indexed by the least significant bits of the branch instruction address.
Example: for 0x4000003c: BEQZ R2,label, the LSBs 111100 select the buffer entry; predictor bit 0 = branch untaken, 1 = branch taken.

8 1-Bit Branch-Prediction Buffer
Branch mispredicted: invert the predictor bit and store it back into the branch-prediction buffer. A misprediction has one of two causes:
1. The branch may have changed behavior (e.g., loop exit), or
2. another branch may have overwritten the entry.
The CPU always assumes the predictor bit to be correct and starts fetching instructions from the target (1) or the fall-through path (0).
Most useful for pipelines that determine the branch target address early (and evaluate the condition later on).
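A 1-bit branch-prediction buffer of this kind can be sketched in a few lines. This is an illustrative model only: the class name `OneBitPredictor`, the 6-bit index width (taken from the 111100 example), and the second address 0x4000007c are assumptions made for the sketch.

```python
class OneBitPredictor:
    """1-bit branch-prediction buffer indexed by the low-order bits of the PC."""

    def __init__(self, index_bits=6):
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)   # 0 = untaken, 1 = taken

    def index(self, pc):
        # No tag is stored, so different branches can share an entry.
        return pc & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)] == 1

    def update(self, pc, taken):
        # Storing the actual outcome has the same effect as inverting
        # the bit whenever the prediction was wrong.
        self.table[self.index(pc)] = 1 if taken else 0


buffer = OneBitPredictor(index_bits=6)
print(format(buffer.index(0x4000003C), "06b"))   # 111100
buffer.update(0x4000003C, taken=True)
print(buffer.predict(0x4000007C))                # True: this address maps to the same entry
```

The second lookup shows the overwriting case: a branch at a different address with the same low-order bits reads the bit left behind by the first branch.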

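9 1-Bit Prediction Quality
Assume the following loop is iterated 10 times (the BNEZ is taken 9 times, the 10th time untaken). What is the 1-bit branch-prediction accuracy?
    loop: BNEZ R2,loop
Answer: 80%. Assuming the loop has run before, the predictor misses twice per run: on the first iteration, because the previous loop exit left the bit at untaken, and on the last iteration, when the branch falls through.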
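The 80% figure can be checked with a short simulation. The sketch assumes the predictor bit starts at untaken when the loop is entered (for example, because an earlier run of the same loop ended with an untaken branch); the function name `one_bit_accuracy` is made up for this example.

```python
def one_bit_accuracy(outcomes, initial_bit):
    """Fraction of correct predictions for a single branch with a 1-bit predictor."""
    bit = initial_bit                # True = predict taken, False = predict untaken
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken                  # the bit always records the last outcome
    return correct / len(outcomes)


loop_run = [True] * 9 + [False]      # BNEZ taken 9 times, then untaken
print(one_bit_accuracy(loop_run, initial_bit=False))   # 0.8
```

With `initial_bit=True` only the loop exit is mispredicted and the accuracy rises to 0.9, which shows how much the answer depends on the state left by the previous run.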

10 2-Bit Branch-Prediction
For a 2-bit branch predictor, a prediction must miss twice before it is changed.
[State diagram: four states, 11 (predict taken), 10 (predict taken), 01 (predict untaken), and 00 (predict untaken), connected by transitions labeled taken and untaken.]

11 n-bit Branch-Prediction
The 2-bit predictor is a special case of an n-bit saturating counter with counter values in [0, 2^n - 1]:
- Increment the counter if the branch is taken; decrement it if untaken.
- Predict the branch to be taken if the counter value is >= 2^(n-1).
The accuracy of a 4096-entry buffer ranges from 82% to 99%, even for a 2-bit predictor; predictors with more bits were found to offer no significant improvement.
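The counter rules above translate directly into code. The following is a minimal sketch for a single branch; the class name `SaturatingCounter`, the helper `accuracy`, and the choice of starting value are assumptions for illustration.

```python
class SaturatingCounter:
    """n-bit saturating counter used as a predictor for one branch."""

    def __init__(self, n_bits=2, value=0):
        self.max_value = (1 << n_bits) - 1      # 2^n - 1
        self.threshold = 1 << (n_bits - 1)      # 2^(n-1)
        self.value = value

    def predict(self):
        return self.value >= self.threshold     # True = predict taken

    def update(self, taken):
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def accuracy(counter, outcomes):
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct / len(outcomes)


# Two consecutive runs of a loop whose branch is taken 9 times, then untaken.
outcomes = ([True] * 9 + [False]) * 2
print(accuracy(SaturatingCounter(n_bits=2, value=3), outcomes))   # 0.9
```

Starting from the saturated state, the single untaken outcome at each loop exit only moves the counter from 3 to 2, so the first iteration of the next run is still predicted taken and only the exits are mispredicted.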

12 Branch-Prediction Buffer Size

13 Correlated Branches
More complex predictor setups are required if separate branches are correlated. Consider (d is held in R1):
C code:
    if (d == 0) d = 1;
    if (d == 1)
Assembly:
    b1      BNEZ   R1,l1
            DADDIU R1,R0,#1
    l1:     DADDIU R3,R1,#-1
    b2      BNEZ   R3,l2
    l2:
- Branch b1 not taken => branch b2 not taken.

14 Correlated Branches
C code:
    if (d == 0) d = 1;
    if (d == 1)
Assembly:
    b1      BNEZ   R1,l1
            DADDIU R1,R0,#1
    l1:     DADDIU R3,R1,#-1
    b2      BNEZ   R3,l2
    l2:
d = ?    b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
d = 2    NT              T           T                   NT              T           T
d = 0    T               NT          NT                  T               NT          NT
d = 2    NT              T           T                   NT              T           T
d = 0    T               NT          NT                  T               NT          NT
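The table can be reproduced with a small simulation of two independent 1-bit predictors. This is a sketch under the table's own assumptions: d alternates between 2 and 0 and both predictors start at NT. The function name `simulate_one_bit` is hypothetical.

```python
def simulate_one_bit(d_values):
    """Return one row per d value: (b1 prediction, b1 action, b2 prediction, b2 action)."""
    prediction = {"b1": "NT", "b2": "NT"}      # initial state as in the table
    rows = []
    for d in d_values:
        # b1: BNEZ R1,l1 is taken when d != 0
        b1_prediction = prediction["b1"]
        b1_action = "T" if d != 0 else "NT"
        prediction["b1"] = b1_action
        if d == 0:
            d = 1                              # fall-through path: DADDIU R1,R0,#1
        # b2: BNEZ R3,l2 with R3 = d - 1 is taken when d != 1
        b2_prediction = prediction["b2"]
        b2_action = "T" if d != 1 else "NT"
        prediction["b2"] = b2_action
        rows.append((b1_prediction, b1_action, b2_prediction, b2_action))
    return rows


rows = simulate_one_bit([2, 0, 2, 0])
misses = sum((p1 != a1) + (p2 != a2) for p1, a1, p2, a2 in rows)
print(misses, "of", 2 * len(rows))             # 8 of 8
```

Every prediction is wrong: each branch alternates between taken and not taken, and a 1-bit predictor always predicts the previous outcome.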

15 1-Bit Predictor with 1-Bit Correlation
Each branch is assigned two prediction bits, written as a pair such as NT/T:
- First entry (here NT): prediction used if the last branch was not taken.
- Second entry (here T): prediction used if the last branch was taken.
Note: The last branch in general is not the branch being predicted.

16 Correlated Prediction
C code:
    if (d == 0) d = 1;
    if (d == 1)
Assembly:
    b1      BNEZ   R1,l1
            DADDIU R1,R0,#1
    l1:     DADDIU R3,R1,#-1
    b2      BNEZ   R3,l2
    l2:
d = ?    b1 prediction   b1 action   new b1 prediction   b2 prediction   b2 action   new b2 prediction
d = 2    NT/NT           T           T/NT                NT/NT           T           NT/T
d = 0    T/NT            NT          T/NT                NT/T            NT          NT/T
d = 2    T/NT            T           T/NT                NT/T            T           NT/T
d = 0    T/NT            NT          T/NT                NT/T            NT          NT/T
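The same experiment with one bit of correlation can be sketched as follows. The simulation assumes that both prediction pairs start at NT/NT and that the branch executed before the first b1 was not taken, which is what the first table row implies; the function name `simulate_correlated` is made up for this example.

```python
def simulate_correlated(d_values):
    """1-bit predictor with 1 bit of correlation for branches b1 and b2."""
    # prediction[branch] = [used if last branch was NT, used if last branch was T]
    prediction = {"b1": ["NT", "NT"], "b2": ["NT", "NT"]}
    last = "NT"                     # assumed outcome of the branch before the first b1
    misses = 0
    for d in d_values:
        for branch in ("b1", "b2"):
            if branch == "b1":
                action = "T" if d != 0 else "NT"   # BNEZ R1,l1
                if d == 0:
                    d = 1                          # fall-through: DADDIU R1,R0,#1
            else:
                action = "T" if d != 1 else "NT"   # BNEZ R3,l2 with R3 = d - 1
            slot = 0 if last == "NT" else 1        # select the bit by the last outcome
            if prediction[branch][slot] != action:
                misses += 1
            prediction[branch][slot] = action      # only the selected bit is updated
            last = action
    return misses, prediction


misses, final_state = simulate_correlated([2, 0, 2, 0])
print(misses)        # 2
print(final_state)   # {'b1': ['T', 'NT'], 'b2': ['NT', 'T']}
```

Only the two predictions in the first d = 2 row miss. After that, the outcome of the preceding branch selects the correct bit every time, and the final pairs T/NT and NT/T match the last row of the table.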

17 Branch Target Buffers
In the 5-stage pipeline, the branch-prediction buffer is accessed during stage ID (when the CPU knows it is indeed dealing with a branch).
- By the end of ID, the CPU knows enough to fetch the next instruction (it does not wait for the actual branch outcome).
- This is still 1 cycle too late to fill the pipeline without disruption.

18 Branch Target Buffers
A branch target buffer (BTB) is accessed at stage IF, before the CPU knows that the fetched instruction is a branch.
- Before decoding, the CPU uses the current PC as a key to perform a BTB lookup.
- If the lookup is a hit, the CPU knows the predicted PC (stored in the BTB) at the end of stage IF, just in time.
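The lookup described above can be modeled with a dictionary keyed by the PC. This is a simplified sketch, not a description of real hardware: the class `BranchTargetBuffer`, the 4-byte instruction size, the target address 0x40000100, and the policy of keeping only branches that were last taken are all assumptions, and a real BTB has a fixed number of tagged entries.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a branch to its predicted next PC."""

    def __init__(self, instruction_size=4):
        self.instruction_size = instruction_size
        self.entries = {}                     # branch PC -> predicted target PC

    def next_pc(self, pc):
        # Stage IF: a hit yields the predicted PC, a miss the sequential PC.
        return self.entries.get(pc, pc + self.instruction_size)

    def resolve(self, pc, taken, target):
        # Once the branch outcome is known, keep only branches that were taken
        # (assumed policy for this sketch).
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x4000003C)))           # 0x40000040 (miss: fall through)
btb.resolve(0x4000003C, taken=True, target=0x40000100)
print(hex(btb.next_pc(0x4000003C)))           # 0x40000100 (hit: predicted target)
```

Because the key is the PC alone, the lookup needs no decoded instruction, which is what allows it to complete within stage IF.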

19 Branch Target Buffers

20 Branch Target Buffer Timing