Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Similar documents
Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

November 7, 2014 Prediction

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 5 - Pipelining II (Branches, Exceptions)

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Control Hazards. Branch Prediction

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Control Hazards. Prediction

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

Control Dependence, Branch Prediction

Static Branch Prediction

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Dynamic Control Hazard Avoidance

CS252 S05. Outline. Dynamic Branch Prediction. Static Branch Prediction. Dynamic Branch Prediction. Dynamic Branch Prediction

Lecture 13: Branch Prediction

ISA P40 P36 P38 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24

Branch prediction ( 3.3) Dynamic Branch Prediction

CS 152, Spring 2011 Section 8

Instruction Level Parallelism (Branch Prediction)

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

Dynamic Branch Prediction

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1

CSE4201 Instruction Level Parallelism. Branch Prediction

Instruction Level Parallelism

HY425 Lecture 05: Branch Prediction

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Complex Pipelines and Branch Prediction

COSC 6385 Computer Architecture. Instruction Level Parallelism

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

COSC 6385 Computer Architecture Dynamic Branch Prediction

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

CS252 Graduate Computer Architecture Midterm 1 Solutions

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

Lecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Chapter-5 Memory Hierarchy Design

CS422 Computer Architecture

Instruction Level Parallelism. Taken from

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Spring 2009 Prof. Hyesoon Kim

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

Reducing Branch Costs via Branch Alignment

Tomasulo Loop Example

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018

Page 1. Today s Big Idea. Lecture 18: Branch Prediction + analysis resources => ILP

Review Tomasulo. Lecture 17: ILP and Dynamic Execution #2: Branch Prediction, Multiple Issue. Tomasulo Algorithm and Branch Prediction

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,..

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

PowerPC 620 Case Study

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

ECE 571 Advanced Microprocessor-Based Design Lecture 7

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!

Topic 14: Dealing with Branches

Topic 14: Dealing with Branches

Dynamic Branch Prediction for a VLIW Processor

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

Instruction Level Parallelism

A Study for Branch Predictors to Alleviate the Aliasing Problem

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

NOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism

CS 104 Computer Organization and Design

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough

Wide Instruction Fetch

Review: Independent Fetch unit. Review:Dynamic Branch Prediction Problem. Branch Target Buffer. CS252 Graduate Computer Architecture Lecture 10

Instruction-Level Parallelism and Its Exploitation

Outline. Pipelining basics The Basic Pipeline for DLX & MIPS Pipeline hazards. Handling exceptions Multi-cycle operations

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Computer Architecture and Engineering. CS152 Quiz #4 Solutions

BRANCH PREDICTORS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Metodologie di Progettazione Hardware-Software

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

ECE 571 Advanced Microprocessor-Based Design Lecture 9

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Lecture 5: VLIW, Software Pipelining, and Limits to ILP. Review: Tomasulo

ECE/CS 552: Introduction to Superscalar Processors

15-740/ Computer Architecture Lecture 29: Control Flow II. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/30/11

What about branches? Branch outcomes are not known until EXE What are our options?

Recall from Pipelining Review. Instruction Level Parallelism and Dynamic Execution

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism

Transcription:

ECE4750/CS4420 Computer Architecture L10: Branch Prediction Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab2 and prelim grades Back to the regular office hours 2 1

Overview Dynamic scheduling handles various types of data hazards BUT, how about control hazard? Today dealing with control hazards through prediction Branch prediction Temporal correlation Spatial correlation Branch target buffer (BTB) Reading H&P Chapter 2.3 Scott McFarling, Combining Branch Predictors, WRL Technical Note TN-36 3 Run-Length Between Branches Average dynamic instruction mix from SPEC92: SPECint92 SPECfp92 ALU 39 % 13 % FPU Add 20 % FPU Mult 13 % load 26 % 23 % store 9 % 9 % branch 16 % 8 % other 10 % 12 % SPECint92: SPECfp92: compress, eqntott, espresso, gcc, li doduc, ear, hydro2d, mdijdp2, su2cor What is the average run length between branches? 4 2

MIPS Branches and Jumps Each instruction fetch depends on one or two pieces of information from the preceding instruction: 5 Branch Penalties in Modern Pipelines UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000) Branch Target Address Known Branch Direction & Jump Register Target Known A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B I J R E Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute Remainder of execute pipeline (+ another 6 stages) 6 3

Branch Prediction Motivation: branch penalties limit performance of deeply pipelined processors Required hardware support: 7 Static Branch Predication Overall probability a branch is taken is ~60-70% but: JZ JZ 8 4

Dynamic Branch Prediction Learning based on past behavior Temporal correlation The way a branch resolves may be a good predictor of the way it will resolve at the next execution Spatial correlation Several branches may resolve in a highly correlated manner (a preferred path of execution) 9 Branch History Table (BHT) Small memory to store the history of each branch Sometimes called Pattern History Table (PHT) Simplest: remember last outcome one-bit BHT Targets highly biased branches Ex) loop branch taken 9/10 times; what is the misprediction rate? Branch Instance 1 2 3 4 5 6 7 8 9 10 11 12 Prediction Outcome T T T T T T T T T NT T T 10 5

2-bit BHT Solution: two-bit prediction (saturating counter, generalize to n-bits) On taken On taken 1 1 Strongly taken 1 0 Weakly taken 0 1 Weakly taken 0 0 Strongly taken (note: textbook & HW3 follows different encoding) Branch Instance 1 2 3 4 5 6 7 8 9 10 11 12 Prediction Outcome T T T T T T T T T NT T T 11 2-bit BHT Implementation Fetch PC 0 0 12 6

Accuracy of 2-bit BHT 4K vs. infinite # entries for SPEC89 (average) From McFarling, 1993 13 Local History Predictor for (j=0; j<3; j++) { } Exploit the history of a particular branch The loop: 2 taken branches + 1 not taken branch Select a prediction based on a branch s recent history Local history table remembers the history of a particular branch Use 2-bit BHT for a prediction 14 7

Local Predictor Example Branch Instance 1 2 3 4 5 6 7 8 9 10 11 12 Prediction Outcome T T NT T T NT T T NT T T NT 15 Local Predictor Accuracy From McFarling, 1993 16 8

Exploiting Spatial Correlation Yeh and Patt, 1992 if (x[i] < 7) then y += 1; if (x[i] < 5) then c -= 4; Look at behavior of recently executed branches Ex) If first condition false, second condition also false Select a prediction based on the outcome of past branches (global history) 17 Global-Select Branch Predictor Pentium Pro uses the result from the last two branches to select one of the four sets of BHT bits (~95% correct) 0 0 Fetch PC k 2-bit global branch history shift register Shift in Taken/ Taken results of each branch Taken/ Taken? 18 9

(m,n) Branch Predictors Generalized global-select BPs m-bit GHR n-bit BHTs m = 0? How many bits are required for (2,2) BP? 19 Global Predictor Example B1: if (x[i] < 7) then y += 1; B2: if (x[i] < 5) then c -= 4; Branch Instance B1 B2 B1 B2 B1 B2 B1 B2 B1 B2 B1 B2 Prediction Outcome T T NT NT T T NT NT NT NT T T 20 10

Global Predictor Variants Global Global-Share 21 Tournament Predictors Multilevel predictors predicting predictors! which predictor works best for this particular branch? Useful to pick between local, global All predictors cast vote only predicted predictor s used all predictions compared against branch outcome, then each branch predictor updates its content predictor history table updated Ex.: two-bit saturating counter, select between two predictors 0X use pred. #1, 1X use pred. #2 00 11 01 10 22 11

4,096 2b Tournament Predictor: Alpha 21264 12b branch outcome PC 1,024 10b 1,024 3b 4,096 2b 23 Combined Predictor Performance 24 12

Limitations of BHTs Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined. A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B I J R E Branch Address Calc/Begin Decode Complete Decode Steer Instructions to Functional units Register File Read Integer Execute Remainder of execute pipeline (+ another 6 stages) UltraSPARC-III fetch pipeline 25 Branch Target Buffer (BTB) IMEM k predicted target BPb Branch Target Buffer (2 k entries) PC target BP BP bits are stored with the predicted target address. IF stage: If (BP=taken) then npc=target else npc=pc+4 later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb 26 13

Address Collisions Assume a 64-entry BTB target BPb 0x110 Jump 0x200 0x220 Add... What will be fetched after the instruction at 0x220? BTB prediction = Correct target = Instruction Memory 27 BTB is only for Control Instructions BTB contains useful information for branch and jump instructions only How to achieve this effect without decoding the instruction? 28 14

Branch Target Buffer (BTB) I-Cache PC 2 k -entry direct-mapped BTB (can also be associative) Entry PC Valid predicted target PC k = match valid target Keep both the branch PC and target PC in the BTB PC+4 is fetched if match fails Only taken branches and jumps held in BTB Next PC determined before branch fetched and decoded 29 Consulting BTB Before Decoding 0x110 Jump 0x200 entry PC target BPb 0x220 Add... 30 15

Combining BTB and BHT BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR) BHT can hold many more entries and is more accurate BHT in later pipeline stage corrects when BTB misses a predicted taken branch BTB BHT A PC Generation/Mux P Instruction Fetch Stage 1 F Instruction Fetch Stage 2 B Branch Address Calc/Begin Decode I Complete Decode J Steer Instructions to Functional units R Register File Read E Integer Execute BTB/BHT only updated after branch resolves in E stage 31 Uses of Jump Register (JR) Switch statements (jump to address of matching case) Dynamic function call (jump to run-time function address) Subroutine returns (jump to return address) How well does BTB work for each of these cases? 32 16

Return Address Stack Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs. Push call address when function call executed fa() { fb(); } fb() { fc(); } fc() { fd(); } Pop return address when subroutine return decoded &fd() &fc() &fb() k entries (typically k=8-16) 33 17