Lecture 13: Branch Prediction

Similar documents
Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Instruction Level Parallelism (Branch Prediction)

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Design of Digital Circuits Lecture 18: Branch Prediction. Prof. Onur Mutlu ETH Zurich Spring May 2018

Looking for Instruction Level Parallelism (ILP) Branch Prediction. Branch Prediction. Importance of Branch Prediction

Chapter 3 (CONT) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,2013 1

Control Hazards. Branch Prediction

ECE 571 Advanced Microprocessor-Based Design Lecture 7

Complex Pipelines and Branch Prediction

ECE 571 Advanced Microprocessor-Based Design Lecture 9

Static Branch Prediction

Branch prediction ( 3.3) Dynamic Branch Prediction

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction

Control Hazards. Prediction

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

COSC3330 Computer Architecture Lecture 14. Branch Prediction

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Dynamic Branch Prediction

Lecture 8: Compiling for ILP and Branch Prediction. Advanced pipelining and instruction level parallelism

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University

Improvement: Correlating Predictors

ECE 571 Advanced Microprocessor-Based Design Lecture 8

November 7, 2014 Prediction

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Wide Instruction Fetch

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)

HY425 Lecture 05: Branch Prediction

Fall 2011 Prof. Hyesoon Kim

Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!

Static Branch Prediction

CSE Lecture 13/14 In Class Handout For all of these problems: HAS NOT CANNOT Add Add Add must wait until $5 written by previous add;

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough

CS 104 Computer Organization and Design

CS 152 Computer Architecture and Engineering. Lecture 5 - Pipelining II (Branches, Exceptions)

Control Dependence, Branch Prediction

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

Dynamic Control Hazard Avoidance

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

1 Hazards COMP2611 Fall 2015 Pipelined Processor

Spring 2009 Prof. Hyesoon Kim

What about branches? Branch outcomes are not known until EXE What are our options?

Old formulation of branch paths w/o prediction. bne $2,$3,foo subu $3,.. ld $2,..

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

BRANCH PREDICTORS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

CMU Introduction to Computer Architecture, Spring 2012 Handout 9/ HW 4: Pipelining

6.004 Tutorial Problems L22 Branch Prediction

Appendix A.2 (pg. A-21 A-26), Section 4.2, Section 3.4. Performance of Branch Prediction Schemes

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

ECE331: Hardware Organization and Design

LECTURE 3: THE PROCESSOR

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

PIPELINING: HAZARDS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Computer Architecture Lecture 9: Branch Prediction I. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/4/2015

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Lecture: Branch Prediction

COSC 6385 Computer Architecture Dynamic Branch Prediction

Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

Instruction-Level Parallelism. Instruction Level Parallelism (ILP)

Suggested Readings! Recap: Pipelining improves throughput! Processor comparison! Lecture 17" Short Pipelining Review! ! Readings!

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

COMPUTER ORGANIZATION AND DESI

18-740/640 Computer Architecture Lecture 5: Advanced Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University Fall 2015, 9/16/2015

CS 152, Spring 2011 Section 8

Full Datapath. Chapter 4 The Processor 2

EECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CS252 Graduate Computer Architecture Midterm 1 Solutions

COMPUTER ORGANIZATION AND DESIGN

CS146 Computer Architecture. Fall Midterm Exam

Chapter 4. The Processor

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

ECE/CS 552: Introduction to Superscalar Processors

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

15-740/ Computer Architecture Lecture 29: Control Flow II. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/30/11

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Lecture: Branch Prediction

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

CS 61C: Great Ideas in Computer Architecture. Lecture 13: Pipelining. Krste Asanović & Randy Katz

Last lecture. Some misc. stuff An older real processor Class review/overview.

Lecture 16: Cache in Context (Uniprocessor) James C. Hoe Department of ECE Carnegie Mellon University

ELE 655 Microprocessor System Design

EE482: Advanced Computer Organization Lecture #3 Processor Architecture Stanford University Monday, 8 May Branch Prediction

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Systems Architecture

CS311 Lecture: Pipelining and Superscalar Architectures

Thomas Polzer Institut für Technische Informatik

Transcription:

S 09 L13-1 18-447 Lecture 13: Branch Prediction James C. Hoe Dept of ECE, CMU March 4, 2009 Announcements: Spring break!! Spring break next week!! Project 2 due the week after spring break HW3 due Monday after spring break (no more homework until week 12) Handouts: Control Speculation: PC+4 S 09 L13-2 t 0 t 1 t 2 t 3 t 4 t 5 Inst h IF PC ID ALU MEM Inst i Inst j IF PC+4 ID IF PC+8 ALU ID Inst k Inst l Inst h is a branch IF target first opportunity to decode Inst h should we correct now? Inst h branch condition and target evaluated in ALU When a branch resolves - branch target (Inst k )is fetched - all instructions fetched since inst h (so called wrong-path instructions) must be flushed

Performance Impact S 09 L13-3 correct guess no penalty incorrect guess 2 bubbles Assume no data hazards 20% control flow instructions 70% of control flow instructions are IPC = 1 / [ 1 + (0.20*0.7) * 2 ] = = 1 / [ 1 + 0.14 * 2 ] = 1 / 1.28 = 0.78 ~86% of the time probability of a wrong guess penalty for a wrong guess Can we reduce either of the two penalty terms? Making a Better Guess S 09 L13-4 For ALU instructions can t do better than guessing nextpc=pc+4 still tricky since must guess nextpc before the current instruction is fetched For Branch/Jump instructions why not always guess in the direction since 70% correct again, must guess nextpc before the branch instruction is fetched (but branch target is encoded in the instruction) Must make a guess based only on the current fetch PC!!! Fortunately, - PC-offset branch/jump target is static - We are allowed to be wrong some of the time

Branch Target Buffer (Oracle) S 09 L13-5 BTB (Oracle) a giant table indexed by PC returns the guess for nextpc PC BTB Instruction Memory When encountering a PC for the first time, store in BTB PC + 4 if ALU/LD/ST PC+offset if Branch or Jump?? if JR Effectively guessing branches are always IPC = 1 / [ 1 + (0.20*0.3) * 2 ] = 0.89 PC BTB (Reality) S 09 L13-6 BTB idx unused N-bit BTB with 2 N entries npc Hash PC into a 2 N entry table On collision, BTB returns something meaningless and possibly (since 80% of entries all hold PC+4) wrong How big should this table be?

Tagged BTB S 09 L13-7 BTB idx table BTB PC+4 = 1 0 nextpc Only store branch instructions (save 80% storage) Update and BTB for the new branch after each collision Even Better Guess S 09 L13-8 We can get 100% correct on non-branch instructions Can we do better than 70% on branch instructions? We get 90% right on backward branch (dynamic) We only get 50% right on forward branch (dynamic) What pattern can we leverage on forward branches? a given static branch instruction is likely to be highly biased in one direction (either or not 80~90% correct if we always guessed the same outcome as the last time the same branch was executed IPC = 1 / [ 1 + (0.20*0.15) * 2 ] = 0.94

S 09 L13-9 Branch History Table and Target Buffer BTB idx N-bit table BHT BTB? PC+4 = 1 0 The 1-bit BHT entry is updated with the true outcome after each execution of a branch nextpc Branch Prediction State Machine S 09 L13-10 not ict not ict not

2-Bit Saturation Counter S 09 L13-11 11! 10!! 01!! 00! 2-Bit Hysteresis Counter S 09 L13-12 strongl y weakly!!!!! weakly! strongl y!! Change iction after 2 consecutive mistakes

State-Machine-Based Predictors S 09 L13-13 2-bit ictor can get >90% correct IPC = 1 / [ 1 + (0.20*0.10) * 2 ] = 0.96 any reasonable 2-bit ictor does about the same Adding more bits to counters does not help much more Major branch behaviors exploited almost always do the same thing again and again (>80%) 1-bit and 2-bit ictors equally effective occasionally do the opposite once (5~10%) 2 misiction with a 1-bit ictor 1 misiction with a 2-bit ictor miscellaneous (<10%) some could be captured with more elaborate ictors what does Amdahl s law say about this? (be careful!!) Path History S 09 L13-14 Branch outcome can be correlated to other branches Equntott, SPEC92 if (aa==2) ;; B1 aa=0; if (bb==2) ;; B2 bb=0; if (aa!=bb) { ;; B3. } If B1 is not (i.e. aa==0@b3) and B2 is not (i.e. bb=0@b3) then B3 is certainly How do you capture this information?

S 09 L13-15 Gshare Branch Prediction [McFarling] BTB idx N-bit M-bit xor table BHT BTB BHSR? PC+4 = 1 0 nextpc Global BHSR (Branch History Shift Register) tracks the outcomes of the last M branch instructions Return Address Stack S 09 L13-16 The targets of register-indirect jumps have little locality history-based ictors don t work but a simple stack captures the usage pattern of function call and return very well Return Address Stack (RAS) the return address is pushed when a link instruction (e.g., JAL) is executed when the PC of a return instruction (e.g., JR) is encountered ict npc from the top of the stack and pop What happens when the stack overflows? How do you know when to follow RAS vs BTB?

S 09 L13-17 Alpha 21264 Tournament Predictor [Fig 4, Kessler, IEEE Micro 1999] Make separate ictions using local history (per branch) and global history (correlating all branches) to capture different branch behaviors A meta-ictor decides which ictor to believe Better than 97% correct Multiple Predictors: PPC 604 S 09 L13-18 instruction cache BHT BTAC +2 +4 FAR fetch Prediction Logic (4 instructions) Target Seq Addr decode Prediction Logic (4 instructions) Target Seq Addr dispatch branch execute Prediction Logic (4 instructions) Target Seq Addr Target Exception Logic + complete PC

Speculative Execution Summary S 09 L13-19 Each control flow instruction must carry the icted nextpc down the pipeline When the control flow outcome of an instruction certain, the icted nextpc is checked if nextpc was icted correctly update BHT (reinforce iction) do nothing more if nextpc was icted incorrectly update BHT and/or BTB flush all younger instructions in the pipeline restart fetching at the correct target Involving SW in Branch Prediction S 09 L13-20 Static branch hints can be encoded with every branch vs. not- whether to allocate an entry in the dynamic BP hardware SW and HW has joint control of BP hardware Intel Itanium has a brp (branch iction) instruction that can be issued ahead of the actual branch to preset the state of the BTB TAR (Target Address Register) a small, fully-associative BTB controlled entirely by prepare-to-branch instructions a hit in TAR overrides all other ictions Why wait until the last instruction in the basic block to calculate branch condition and target?

Predicated Execution: If-conversion Example: ication in Intel Itanium each instruction can be separately icated 64 one-bit icate registers each instruction carries a 6-bit icate field an instruction is effectively a NOP if its icate is false Converts control flow into dataflow S 09 L13-21 cmp br else1 else2 br then1 then2 join1 join2 p2 p1 p1 p2 p1 p2 cmp else1 then1 join1 then2 else2 join2 Make sense if processors have lots of spare resources and BP is hard S 09 L13-22 Branch Prediction: the bottom-line Given current PC, how to determine the next PC waiting for anymore information would need stalls The easy part the same PC always points to the same instruction (barring self-modifying code) nextpc is always PC+4 for non-control-flow instructions, the target of a PC-offset control-flow is always the same A memoization table can get these nearly100% right The not so easy part versus not- decision is not static 90% of backward branches are (loops) 50% of forward branches are (if-then-else) a given branch almost always repeats itself