Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Similar documents
Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

CS 104 Computer Organization and Design

Control Hazards. Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Instruction Level Parallelism (Branch Prediction)

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1

Branch prediction ( 3.3) Dynamic Branch Prediction

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Control Hazards. Prediction

Lecture 13: Branch Prediction

Lecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism

November 7, 2014 Prediction

Static Branch Prediction

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Dynamic Branch Prediction

HY425 Lecture 05: Branch Prediction

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

6.004 Tutorial Problems L22 Branch Prediction

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

ECE 571 Advanced Microprocessor-Based Design Lecture 7

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes

ECE 571 Advanced Microprocessor-Based Design Lecture 9

Complex Pipelines and Branch Prediction

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Branch Prediction Chapter 3

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

ECE 571 Advanced Microprocessor-Based Design Lecture 8

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

BRANCH PREDICTORS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

COSC 6385 Computer Architecture. Instruction Level Parallelism

ECE232: Hardware Organization and Design

The Processor: Improving the performance - Control Hazards

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Control Hazards. Branch Recovery. Control Hazard Pipeline Diagram. Branch Performance

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Chapter 4. The Processor

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Alexandria University

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

CS433 Homework 2 (Chapter 3)

Instruction-Level Parallelism and Its Exploitation

COSC 6385 Computer Architecture Dynamic Branch Prediction

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough

ECE331: Hardware Organization and Design

CS152 Computer Architecture and Engineering. Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

COMPUTER ORGANIZATION AND DESIGN

EECS 470 Midterm Exam Fall 2006

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Wide Instruction Fetch

Instruction Level Parallelism

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Computer Systems Architecture

Static, multiple-issue (superscaler) pipelines

CS252 Graduate Computer Architecture Midterm 1 Solutions

What about branches? Branch outcomes are not known until EXE What are our options?

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

CS 152, Spring 2011 Section 8

CS433 Homework 2 (Chapter 3)

Dynamic Control Hazard Avoidance

LECTURE 3: THE PROCESSOR

Lecture Topics ECE 341. Lecture # 14. Unconditional Branches. Instruction Hazards. Pipelining

CS152 Computer Architecture and Engineering January 29, 2013 ISAs, Microprogramming and Pipelining Assigned January 29 Problem Set #1 Due February 14

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

15-740/ Computer Architecture Lecture 29: Control Flow II. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/30/11

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

COSC3330 Computer Architecture Lecture 14. Branch Prediction

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

Micro-Architectural Attacks and Countermeasures

CSE Lecture 13/14 In Class Handout For all of these problems: HAS NOT CANNOT Add Add Add must wait until $5 written by previous add;

Full Datapath. Chapter 4 The Processor 2

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Spring 2009 Prof. Hyesoon Kim

Chapter 4. The Processor

Pipeline Hazards. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

COMPUTER ORGANIZATION AND DESI

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

Instruction Level Parallelism

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Improvement: Correlating Predictors

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

Topic 14: Dealing with Branches

Transcription:

Looking for Instruction Level Parallelism (ILP) Branch Prediction We want to identify and exploit ILP instructions that can potentially be executed at the same time. Branches are 15-20% of instructions Implications? or sometimes you just have to guess Can only keep the pipeline full if we can consistently keep fetching well past unresolved branches. Can only exploit high levels of parallelism if we consistently have multiple basic blocks in the processor at once. Importance of Branch Prediction MIPS R20 -- branch hazard of 1 cycle, 1 instruction issued per cycle delayed branch next generation 2-3 cycle hazard, 1-2 instructions issued per cycle cost of branch misprediction goes up Pentium 4 Branch Prediction Easiest (static prediction) always not taken, always taken forward not taken, backward always taken compiler predicted (branch likely, branch not likely) Next easiest (1-bit dynamic) remember last taken/not taken per branch (1 bit) per branch approximated per I cache line use part of address what happens on a loop? for (I=0;I<5;I++) { loop: } bnez r1, loop:

2-bit branch prediction Two different 2-bit schemes has 4 states instead of 2, allowing for more information about tendencies. Strongly Taken Loops? Decrement when not taken Weakly Taken 10 Weakly Not Taken 01 Increment when taken Strongly Not Taken Branch History Table has limited size 2 bits by N (e.g. 4K) uses low bits of branch address to choose entry branch address what happens when table too small? what about even/odd branch? Can We Do Better? Can we get more information dynamically than just the general history of this branch? We can look at patterns (local predictor) for a particular branch. last eight branches, then it is a good guess that the next one is 1 (taken) address even/odd branch?

Can We Do Better? Correlating Branch Predictors also look at other branches for clues if (i == 0)... if (i > 7)... Typically use two indexes Global history register (GHR) --> history of last m branches (e.g., 010) branch address Correlating Branch Predictors The global history register is a shift register that records the last n branches (of any address) encountered by the processor. ghr 01 2-bit predictors Two-level correlating branch predictors Can use both the PC address and the GHR PC ghr combining function 01 2-bit predictors 2-level branch prediction If we concatenate the GHR and the PC, we get... Branch address 4 2 bit per branch predictors XX XX prediction If the combining function is xor, this is called the predictor. This is a (2,2) scheme (2 bits of global history, 2-bit predictors) (not a particularly effective predictor but described in the book) 2 bit global branch history

Are we happy yet???? But... Combining branch predictors or tournament predictors use multiple schemes and a voter to decide which one typically does better for that branch. P1 P2 When do we need to do the prediction to avoid any control hazards on a correct prediction? A taken/not taken prediction only helps us if...? use P2 PC Branch Target Buffers BTB Operation predict the predict the of branches in the instruction stream of branches use PC (all bits) for lookup match implies this is a branch if match and predict bits => taken, set PC to predicted PC if branch predict wrong, must recover (same as branch hazards we ve already seen) but what about dynamically scheduled (out of order) processor?? if decode indicates branch when no BTB match, two choices: look up prediction now and act on it just predict not taken when branch resolved, update BTB (at least prediction bits, maybe more)

BTB Performance Two things that can go wrong didn t predict the branch (misfetch) mispredicted a branch (mispredict) Suppose BTB hit rate of 85% and predict accuracy of 90%, misfetch penalty of 2 cycles and mispredict penalty of 10 cycles, what is average branch penalty? Can use both BTB and branch predictor have no prediction bits in BTB (why is that a good idea?) presence of PC in BTB indicates a lookup in branch predictor to predict whether the branch will go to destination address in BTB. What about indirect jumps/returns? Branch predictor does really well with conditional jumps BTB does really well with unconditional jumps (jump, jal, etc.) Indirect jumps often jump to different destinations, even from the same instruction. Indirect jumps most often used for return instructions. Return easily handled by a stack. jal-> push PC+4 return -> predict jump to address on top of stack, pop stack Real BP -- PowerPC 620 256-entry 2-way set-associative BTB 2048-entry indexed by PC return-address stack Power 4 Up to 2 branches per cycle predicted PC XOR GHR* 16K 16K 16K gshare *GHR composed of 1 bit per fetch group chooser (use bht or gshare result?)

Pentium Pro Compaq/Digital Alpha 21264 512-entry BTB 4-way set-associative 2-level predictor (1st level in BTB, one per set, 4 bits) return stack PC 10 Local Predictor Global Predictor Chooser 10 3 2 GHR 2 12 branch PC target branch PC target branch pattern next-cache-line field in I-cache replaces BTB return address stack Branch Prediction Branch Prediction Key Points The better we predict, the behinder we get. 2-bit predictors capture tendencies well. Correlating predictors improve accuracy, particularly when combined with 2-bit predictors. Accurate branch prediction does no good if we don t know there was a branch to predict BTB identifies branches in (or before) IF stage. BTB combined with branch prediction table identifies branches to predict, and predicts them well.