Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013

Dynamic Hardware Branch Prediction

- Control hazards are a source of performance loss, especially for processors that want to issue more than 1 instruction per cycle.
- Proposed approach: use hardware to dynamically predict the outcome of a branch (the prediction may change with time).

1. Branch prediction buffer (a.k.a. branch history table)

- Small memory indexed by the lower bits of the address of the branch instruction.
- Contains 1 bit that says whether the branch was recently taken.

- Note: the prediction may refer to another branch that has the same low-order address bits.
- If the hint is wrong, the prediction bit is inverted.
- Problem: if a loop branch is taken 9 times in a row and then not taken once, we have two mispredictions per execution of the loop (at the loop exit, and again on the first iteration the next time the loop runs).

2. Two-bit prediction schemes

- Need to mispredict twice before changing the prediction.
- See Figure C.18.
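The loop problem of the 1-bit scheme can be checked with a few lines of simulation. This is only a sketch of a single buffer entry: the function name `simulate_one_bit`, the initial state, and the loop that runs three times are assumptions made for illustration, not part of the slides.

```python
def simulate_one_bit(outcomes, initial_taken=False):
    """Count mispredictions of a single 1-bit prediction-buffer entry."""
    predict_taken = initial_taken
    mispredictions = 0
    for taken in outcomes:
        if predict_taken != taken:
            mispredictions += 1
            predict_taken = taken  # a wrong hint inverts the prediction bit
    return mispredictions

# Hypothetical loop branch: taken 9 times, then not taken at loop exit.
one_loop = [True] * 9 + [False]

# Entry starts as "not taken": miss on the first iteration and at the exit.
misses_single = simulate_one_bit(one_loop, initial_taken=False)

# Three executions of the loop, entry starts as "taken":
# 1 miss in the first execution, 2 misses in each later one.
misses_three = simulate_one_bit(one_loop * 3, initial_taken=True)

print(misses_single, misses_three)
```

In steady state the entry mispredicts twice per execution of the loop even though the branch is taken 90% of the time, which is the motivation for the two-bit schemes.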

Figure C.18

3. N-bit saturating counter

- An n-bit counter can take values from 0 to $2^n - 1$.
- If count $\ge 2^{n-1}$, predict taken; else predict untaken.
- If the branch is taken, increment; if untaken, decrement.
- Accuracy: with 4096 entries in the table and 2 bits per entry, the misprediction rate is 1-18% on SPEC89; usually better on FP programs (more loops).
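The counter rules above translate almost directly into code. The following is a minimal sketch of one table entry; the class name `SaturatingCounter`, the helper `count_mispredictions`, the initial counter value, and the test loop are assumptions chosen for illustration.

```python
class SaturatingCounter:
    """One n-bit saturating counter (a single prediction-table entry)."""

    def __init__(self, n_bits=2, initial=0):
        self.max_count = (1 << n_bits) - 1   # 2^n - 1
        self.threshold = 1 << (n_bits - 1)   # 2^(n-1)
        self.count = initial

    def predict(self):
        # Predict taken when the count is in the upper half of its range.
        return self.count >= self.threshold

    def update(self, taken):
        # Increment on taken, decrement on untaken, saturating at both ends.
        if taken:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = max(self.count - 1, 0)


def count_mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken 9 times, not taken at exit, loop run 3 times.
loop = ([True] * 9 + [False]) * 3
two_bit_misses = count_mispredictions(SaturatingCounter(n_bits=2, initial=3), loop)
print(two_bit_misses)
```

With 2 bits, a single not-taken outcome at the loop exit only moves the counter from 3 to 2, so the next first iteration is still predicted taken and the entry misses once per loop execution instead of twice.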

4. Look at the recent behavior of other branches too

- Correlating predictors, or two-level predictors.
- Example:
      if (d == 0) d = 1;    (branch b1)
      if (d == 1)           (branch b2)
- E.g., a scheme where each branch has 2 separate prediction bits:
  - one prediction used if the last branch was not taken
  - one prediction used if the last branch was taken
- The example works very well for b2.

- A correct prediction for b1 is by chance.
- b2 will always be predicted correctly (every execution) except if d = 1.
- This is called a (1,1) predictor: the last branch chooses among 2 states (1 bit).
- In general it can be (m,n): look at the last m branches and use n bits to predict.
- The hardware to remember the outcome of the last m branches is easy.
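A small simulation can make the (1,1) scheme concrete. This sketch makes several assumptions that are not stated in the slides: each branch is "taken" when its condition is false (so b1 is taken when d != 0 and b2 is taken when d != 1), all prediction bits start as not taken, and d alternates between 2 and 0 on successive executions. The function name `run_example` is hypothetical.

```python
def run_example(d_values, use_history):
    """Mispredictions per branch, with or without 1 bit of global history."""
    # table[branch][history] holds a 1-bit prediction (True = taken).
    table = {"b1": [False, False], "b2": [False, False]}
    misses = {"b1": 0, "b2": 0}
    last_taken = False  # outcome of the most recently executed branch

    def execute(branch, taken):
        nonlocal last_taken
        slot = int(last_taken) if use_history else 0
        if table[branch][slot] != taken:
            misses[branch] += 1
            table[branch][slot] = taken  # 1-bit entry: flip on a miss
        last_taken = taken

    for d in d_values:
        execute("b1", d != 0)  # assumed polarity: taken means skip "d = 1"
        if d == 0:
            d = 1
        execute("b2", d != 1)
    return misses


inputs = [2, 0, 2, 0]  # hypothetical input sequence
with_history = run_example(inputs, use_history=True)
without_history = run_example(inputs, use_history=False)
print(with_history, without_history)
```

Under these assumptions the plain 1-bit predictor mispredicts both branches on every execution, while the (1,1) predictor mispredicts each branch only once, during warm-up.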

- A two-bit predictor with no global history is a (0,2) predictor.
- Number of bits in an (m,n) predictor: $2^m \times n \times \#\text{entries}$
- E.g., (2,2)
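The size formula is easy to evaluate for a few configurations. In this sketch the function name `predictor_bits` is hypothetical, and the entry counts (4096 and 1024) are illustrative choices, not values given with the (2,2) example.

```python
def predictor_bits(m, n, entries):
    """Total state bits of an (m,n) predictor: 2^m * n * entries."""
    return (2 ** m) * n * entries


# A (0,2) predictor with 4096 entries (illustrative size).
bits_0_2 = predictor_bits(0, 2, 4096)

# A (2,2) predictor with 1024 entries (illustrative size).
bits_2_2 = predictor_bits(2, 2, 1024)

print(bits_0_2, bits_2_2)
```

Both configurations use 8192 bits, so a comparison between them is a comparison at equal hardware cost.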

Tournament Predictors

- Use multiple predictors, usually:
  - one based on global information
  - one based on local information
  - combined with a selector
- They do very well.
- They are a popular form of multilevel branch predictors (which use several levels of branch prediction tables together with an algorithm for choosing among them).
- Existing ones use a 2-bit saturating counter per branch to choose among two different predictors (the four states of the counter dictate whether to use predictor 1 or 2).

- The counter is incremented whenever the "predicted" predictor is correct and the other is incorrect, and is decremented in the reverse situation.
- Advantage: the ability to select the right predictor for the right branch.
- Figure 3.3: fraction of predictions that use the local predictor.
- Figure 3.4: comparing local/global/tournament.
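The selector logic can be sketched as follows. The convention used here is an assumption: counter values 0-1 select predictor 1 and values 2-3 select predictor 2, and the counter moves toward whichever predictor was right when the two disagree. The class name `TournamentSelector` is hypothetical, and the two component predictors are represented only by their predictions.

```python
class TournamentSelector:
    """2-bit saturating counter choosing between two predictors."""

    def __init__(self, initial=0):
        self.counter = initial  # 0..3; assumed: 0-1 -> predictor 1, 2-3 -> predictor 2

    def choose(self, pred1, pred2):
        # Return the prediction of the currently preferred predictor.
        return pred2 if self.counter >= 2 else pred1

    def update(self, pred1, pred2, taken):
        correct1 = pred1 == taken
        correct2 = pred2 == taken
        if correct2 and not correct1:
            self.counter = min(self.counter + 1, 3)
        elif correct1 and not correct2:
            self.counter = max(self.counter - 1, 0)
        # If both are right or both are wrong, the counter is unchanged.


selector = TournamentSelector(initial=0)
# Predictor 1 says not taken, predictor 2 says taken, and the branch is taken.
for _ in range(2):
    selector.update(pred1=False, pred2=True, taken=True)
final_choice = selector.choose(pred1=False, pred2=True)
print(selector.counter, final_choice)
```

After two branches on which only predictor 2 was right, the selector has switched to predictor 2; a single disagreement would not have been enough.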

Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed by a noncorrelating 2-bit predictor with unlimited entries and a 2-bit predictor with 2 bits of global history and a total of 1024 entries. Although these data are for an older version of SPEC, data for more recent SPEC benchmarks would show similar differences in accuracy. Copyright 2011, Elsevier Inc. All rights Reserved.

Figure 3.4 The misprediction rate for three different predictors on SPEC89 as the total number of bits is increased. The predictors are a local 2-bit predictor, a correlating predictor that is optimally structured in its use of global and local information at each point in the graph, and a tournament predictor. Although these data are for an older version of SPEC, data for more recent SPEC benchmarks would show similar behavior, perhaps converging to the asymptotic limit at slightly larger predictor sizes. Copyright 2011, Elsevier Inc. All rights Reserved.

Branch Target Buffers (BTB)

- In addition to predicting the branch, we need to guess the target address.
- If the target address can be determined by the end of IF: zero branch penalty.
- BTB: small cache that stores the predicted address for the next instruction after a branch.
- Note:
  - If we use a branch prediction table accessed during ID: branches have a 1-cycle penalty.
  - If we use a BTB accessed during IF: branches have a 0-cycle penalty.

How a BTB Works

- During IF, we use the address of the instruction to access the BTB.
- If hit, we extract the address of the next instruction to fetch.
- Note: unlike the branch prediction buffer, the entry must be for this instruction; otherwise we would fetch something wrong.
- Note: we only need to store predicted-taken branches.
- Easiest scheme: store in the BTB only PC-relative conditional branches, since their target address is a constant.

- No branch delay if the entry is found in the BTB and it is correct.
- There is some cost in updating the BTB in case of a misprediction or a wrong target.
- We do not try to update the BTB while fetching instructions (we could not access it); it is best to stall 1 or 2 cycles.
- Can be combined with a branch prediction table to decide when to put entries in the BTB.
- See Figure 3.21 and Figure 3.22.

Figure 3.21 A branch-target buffer. The PC of the instruction being fetched is matched against a set of instruction addresses stored in the first column; these represent the addresses of known branches. If the PC matches one of these entries, then the instruction being fetched is a taken branch, and the second field, predicted PC, contains the prediction for the next PC after the branch. Fetching begins immediately at that address. The third field, which is optional, may be used for extra prediction state bits. Copyright 2011, Elsevier Inc. All rights Reserved.

Figure 3.22 The steps involved in handling an instruction with a branch-target buffer. Copyright 2011, Elsevier Inc. All rights Reserved.
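The lookup and update behavior of a BTB can be sketched with a dictionary keyed by the full branch PC. This is a simplification under stated assumptions: a real BTB is a small cache with limited capacity, while this sketch is unbounded; the 4-byte instruction size, the class name `BranchTargetBuffer`, and the addresses are all hypothetical.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a predicted-taken branch to its target."""

    def __init__(self, instr_size=4):
        self.entries = {}             # branch PC -> predicted next PC
        self.instr_size = instr_size  # assumed fixed instruction size

    def next_fetch_pc(self, pc):
        # Hit: fetch from the predicted target. Miss: fall through.
        return self.entries.get(pc, pc + self.instr_size)

    def update(self, pc, taken, target=None):
        if taken:
            self.entries[pc] = target   # insert or correct the target
        else:
            self.entries.pop(pc, None)  # only predicted-taken branches are kept


btb = BranchTargetBuffer()
first = btb.next_fetch_pc(0x100)       # miss: sequential fetch
btb.update(0x100, taken=True, target=0x200)
second = btb.next_fetch_pc(0x100)      # hit: fetch from the stored target
btb.update(0x100, taken=False)
third = btb.next_fetch_pc(0x100)       # entry removed: sequential again
print(hex(first), hex(second), hex(third))
```

Matching on the full PC reflects the requirement that the entry must belong to this exact instruction, unlike the branch prediction buffer where aliasing is tolerated.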

- If the branch is not correctly predicted (not taken, or taken to the wrong target): 1 cycle to update the BTB and 1 cycle to restart fetching.
- If the branch is not found in the BTB and is taken: 2 cycles (update the BTB).
- See Figure 3.23.

Example: find the branch penalty if

- BTB hit rate = 90%
- Prediction accuracy = 90% (for instructions in the buffer)
- Branch taken frequency (for instructions not in the buffer) = 60%

$$\text{Branch penalty} = P(\text{hit in BTB and wrong prediction}) \times 2 + P(\text{miss in BTB and taken}) \times 2$$

$$= 0.90 \times 0.1 \times 2 + 0.1 \times 0.6 \times 2 = 0.18 + 0.12 = 0.30 \text{ cycles}$$
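The same arithmetic can be wrapped in a small function to try other hit rates and accuracies. This is a sketch only; the function name `btb_branch_penalty` and its parameter names are hypothetical, and the 2-cycle penalty for both cases is taken from the example.

```python
def btb_branch_penalty(hit_rate, accuracy, taken_freq_on_miss, penalty_cycles=2):
    """Average branch penalty in cycles for the two penalized cases."""
    hit_and_wrong = hit_rate * (1 - accuracy)             # in BTB, mispredicted
    miss_and_taken = (1 - hit_rate) * taken_freq_on_miss  # not in BTB, taken
    return (hit_and_wrong + miss_and_taken) * penalty_cycles


penalty = btb_branch_penalty(hit_rate=0.90, accuracy=0.90, taken_freq_on_miss=0.60)
print(round(penalty, 2))
```

With the numbers from the example, the two cases contribute 0.18 and 0.12 cycles, for an average of 0.30 cycles per branch.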

Fancier BTBs

- Store the target instruction instead of the target address.
- BTB access can take longer now (larger BTB).
- Can do branch folding (achieves zero-cycle unconditional branches and sometimes zero-cycle conditional branches).
- (Diagram labels: processor, cache, BTB, unconditional branch, target; zero-cost unconditional branch.)

Integrated Instruction Fetch Units

- Have a fancy module that implements IF in multiple cycles (since it has to provide multiple instructions).
- The IFU performs the following functions:
  - Branch prediction
  - Instruction prefetch
  - Buffering of instructions (they may come from multiple cache lines, etc.)
- The IFU provides instructions to the issue stage.

Return Address Predictors

- Handling indirect jumps (from procedure returns):
  - cannot use traditional BTBs (the target changes)
  - use a stack: push the return address on a call; pop it at a return. For example, 1-16 entries.

Other

- Fetch from both the predicted and unpredicted direction: Some processors have used costly
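The return-address stack described above can be sketched as a small bounded stack. The class name `ReturnAddressStack`, the addresses, and the overflow policy (drop the oldest entry when the stack is full) are assumptions for illustration; the slides only specify push on call, pop on return, and a size of roughly 1-16 entries.

```python
class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.capacity:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        # Pop the most recent return address; None means "no prediction".
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(capacity=16)
ras.on_call(0x104)   # outer call, returns to 0x104
ras.on_call(0x204)   # nested call, returns to 0x204
inner = ras.predict_return()
outer = ras.predict_return()
empty = ras.predict_return()
print(hex(inner), hex(outer), empty)
```

Because calls and returns nest, the most recently pushed address is the right prediction for the next return, which a BTB entry indexed by the return instruction's PC cannot capture when the procedure is called from several sites.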