EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)

Similar documents
EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018

EECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)

EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 7 Winter 2018

EECS 470 Midterm Exam Fall 2006

EECS 470 Midterm Exam Answer Key Fall 2004

EECS 470 Midterm Exam Winter 2008 answers

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Multiple Instruction Issue and Hardware Based Speculation

EECS 470 Midterm Exam

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Wide Instruction Fetch

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

EECS 470 Midterm Exam Winter 2015

Last lecture. Some misc. stuff An older real processor Class review/overview.

COSC3330 Computer Architecture Lecture 14. Branch Prediction

What about branches? Branch outcomes are not known until EXE What are our options?

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

EECS 470 Midterm Exam

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Computer Architecture: Branch Prediction. Prof. Onur Mutlu Carnegie Mellon University

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

The Tomasulo Algorithm Implementation

Lecture 13: Branch Prediction

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

EECS 470 Final Exam Fall 2013

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Lecture: Out-of-order Processors

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Pipelining. Ideal speedup is number of stages in the pipeline. Do we achieve this? 2. Improve performance by increasing instruction throughput ...

CSE502: Computer Architecture CSE 502: Computer Architecture

ECE473 Computer Architecture and Organization. Pipeline: Control Hazard

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!

Page 1 ILP. ILP Basics & Branch Prediction. Smarter Schedule. Basic Block Problems. Parallelism independent enough

CMSC Computer Architecture Lecture 18: Exam 2 Review Session. Prof. Yanjing Li University of Chicago

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Control Dependence, Branch Prediction

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Hardware Speculation Support

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

EECS 470 Midterm Exam Fall 2014

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

CSE502: Computer Architecture CSE 502: Computer Architecture

Instruction Level Parallelism

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

Dynamic Control Hazard Avoidance

Announcements. ECE4750/CS4420 Computer Architecture L10: Branch Prediction. Edward Suh Computer Systems Laboratory

Instruction-Level Parallelism Dynamic Branch Prediction. Reducing Branch Penalties

/ : Computer Architecture and Design Fall 2014 Midterm Exam Solution

/ : Computer Architecture and Design Fall Midterm Exam October 16, Name: ID #:

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Instruction Level Parallelism (Branch Prediction)

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Superscalar Processor Design

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Hardware-Based Speculation

Computer Architecture EE 4720 Final Examination

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

Instruction Fetch and Branch Prediction. CprE 581 Computer Systems Architecture Readings: Textbook (4 th ed 2.3, 2.9); (5 th ed 3.

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Four Steps of Speculative Tomasulo cycle 0

Computer Systems Architecture I. CSE 560M Lecture 5 Prof. Patrick Crowley

Lecture-13 (ROB and Multi-threading) CS422-Spring

EE382A Lecture 5: Branch Prediction. Department of Electrical Engineering Stanford University

HY425 Lecture 05: Branch Prediction

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

LECTURE 3: THE PROCESSOR

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Static Branch Prediction

Computer Architecture EE 4720 Final Examination

ECE 4750 Computer Architecture, Fall 2017 T13 Advanced Processors: Branch Prediction

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

ECE 571 Advanced Microprocessor-Based Design Lecture 7

CS 152 Computer Architecture and Engineering

Short Answer: [3] What is the primary difference between Tomasulo s algorithm and Scoreboarding?

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Instruction Level Parallelism

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

Super Scalar. Kalyan Basu March 21,

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

EECS 470 Final Exam Fall 2005

Transcription:

EECS 470 Lecture 6 Branches: Address prediction and recovery (And interrupt recovery too.)

Announcements: P3 posted, due a week from Sunday HW2 due Monday Reading Book: 3.1, 3.3-3.6, 3.8 Combining Branch Predictors, S. McFarling, WRL Technical Note TN-36, June 1993. On the website.

Warning: Crazy times coming HW2 due on Monday 2/3 Will do group formation and project materials on Monday too. P3 is due on Sunday 2/9 It s a lot of work (20 hours?) Proposal is due on Monday (2/10) It s not a lot of work (1 hour?) to do the write-up, but you ll need to meet with your group and discuss things. Don t worry too much about getting this right. You ll be allowed to change (we ll meet the following Friday). Just a line in the sand. HW3 is due on Wednesday 2/12 It s a fair bit of work (3 hours?) 20 minute group meetings on Friday 2/14 (rather than inlab) Midterm is on Monday 2/17 in the evening (6-8pm) Exam Q&A on Saturday (2/15 from 6-8pm) Q&A in class 2/17 Best way to study is look at old exams (posted on-line!)

Also Reminder that my office hours have moved for tomorrow 8:30-10am. This Friday s lab is really helpful for doing P3. I d strongly urge you to be there.

Last time: Started on branch predictors Branch prediction consists of Branch taken predictor Address predictor Mis-predict recovery. Discussed direction predictors and started looking at some variations Bimodal Local history predictor Global history predictor Gshare

Today Review predictors from last time Do detailed example of tournament predictor Start on the P6 scheme

Using history bimodal 1-bit history (direction predictor) Remember the last direction for a branch branchpc Branch History Table NT T How big is the BHT?

Using history bimodal 2-bit history (direction predictor) branchpc Branch History Table SN NT T ST How big is the BHT?

Using History Patterns ~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN For the other 20%, we need to look a patterns of reference to see if they are predictable using a more complex predictor Example: gcc has a branch that flips each time T(1) NT(0) 10101010101010101010101010101010101010

Local history branchpc Branch History Table Pattern History Table 10101010 What is the prediction for this BHT 10101010? NT T When do I update the tables?

Local history branchpc Branch History Table Pattern History Table 01010101 NT T On the next execution of this branch instruction, the branch history table is 01010101, pointing to a different pattern What is the accuracy of a flip/flop branch 0101010101010?

Global history Branch History Register Pattern History Table 01110101 if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa!= bb) { How can branches interfere with each other?

Relative performance (Spec 89)

Gshare predictor branchpc Branch History Register 01110101 xor Pattern History Table Must read! Ref: Combining Branch Predictors

Hybrid predictors Local predictor (e.g. 2-bit) Prediction 1 Global/gshare predictor (much more state) Prediction 2 Selection table (2-bit state machine) Prediction How do you select which predictor to use? How do you update the various predictor/selector?

Trivial example: Tournament Branch Predictor Local 8-entry 3-bit local history table indexed by PC 8-entry 2-bit up/down counter indexed by local history Global 8-entry 2-bit up/down counter indexed by global history Tournament 8-entry 2-bit up/down counter indexed by PC

Local predictor 1 st level table (BHT) 0=NT, 1=T ADR[4:2] History 0 001 1 101 2 100 3 110 4 110 5 001 6 111 7 101 Branch History Register Local predictor 2 nd level table (PHT) 00=NT, 11=T History 0 00 1 11 2 10 3 00 4 01 5 01 6 11 7 11 Pred. state Global predictor table 00=NT, 11=T History 0 11 1 10 2 00 3 00 4 00 5 11 6 11 7 00 Pred. state Tournament selector 00=local, 11=global ADR[4:2] 0 00 1 01 2 00 3 10 4 11 5 00 6 11 7 10 Pred. state

Tournament selector 00=local, 11=global Local predictor 1 st level table (BHT) 0=NT, 1=T Local predictor 2 nd level table (PHT) 00=NT, 11=T Global predictor table 00=NT, 11=T ADR[4:2] Pred. state ADR[4:2] History History Pred. state History Pred. state 0 00 1 01 2 00 3 10 4 11 5 00 6 11 7 10 0 001 1 101 2 100 3 110 4 110 5 001 6 111 7 101 0 00 1 11 2 10 3 00 4 01 5 01 6 11 7 11 0 11 1 10 2 00 3 00 4 00 5 11 6 11 7 00 r1=2, r2=6, r3=10, r4=12, r5=4 Address of joe =0x100 and each instruction is 4 bytes. Branch History Register = 110 joe: skip: next: add r1 r2 r3 beq r3 r4 next bgt r2 r3 skip // if r2>r3 branch lw r6 4(r5) add r6 r8 r8 add r5 r2 r2 bne r4 r5 joe noop

Overriding Predictors Big predictors are slow, but more accurate Use a single cycle predictor in fetch Start the multi-cycle predictor When it completes, compare it to the fast prediction. If same, do nothing If different, assume the slow predictor is right and flush pipline. Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor

BTB (Chapter3.5) Branch Target Buffer Addresses predictor Lots of variations Keep the target of likely taken branches in a buffer With each branch, associate the expected target.

BTB indexed by current PC If entry is in BTB fetch target address next Generally set associative (too slow as FA) Often qualified by branch taken predictor Branch PC 0x05360AF0 Target address 0x05360000

So BTB lets you predict target address during the fetch of the branch! If BTB gets a miss, pretty much stuck with nottaken as a prediction So limits prediction accuracy. Can use BTB as a predictor. If it is there, predict taken. Replacement is an issue LRU seems reasonable, but only really want branches that are taken at least a fair amount.

Pipeline recovery is pretty simple Squash and restart fetch with right address Just have to be sure that nothing has committed its state yet. In our 5-stage pipe, state is only committed during MEM (for stores) and WB (for registers)

Tomasulo s Recovery seems really hard What if instructions after the branch finish after we find that the branch was wrong? This could happen. Imagine R1=MEM[R2+0] BEQ R1, R3 DONE Predicted not taken R4=R5+R6 So we have to not speculate on branches or not let anything pass a branch Which is really the same thing. Branches become serializing instructions. Note that can be executing some things before and after the branch once branch resolves.

What we need is: Some way to not commit instructions until all branches before it are committed. Just like in the pipeline, something could have finished execution, but not updated anything real yet.

Interrupts These have a similar problem. If we can execute out-of-order a slower instruction might not generate an interrupt until an instruction in front of it has finished. This sounds like the end of out-of-order execution I mean, if we can t finish out-of-order, isn t this pointless?

Exceptions and Interrupts Exception Type Sync/Async Maskable? Restartable? I/O request Async Yes Yes System call Sync No Yes Breakpoint Sync Yes Yes Overflow Sync Yes Yes Page fault Sync No Yes Misaligned access Sync No Yes Memory Protect Sync No Yes Machine Check Async/Sync No No Power failure Async No No

Precise Interrupts Instructions Completely Finished PC No Instruction Has Executed At All Precise State Speculative State Implementation approaches Don t E.g., Cray-1 Buffer speculative results E.g., P4, Alpha 21264 History buffer Future file/reorder buffer

Precise Interrupts via the Reorder Buffer MEM IF ID Alloc Sched EX PC Dst regid Dst value Except? In-order Head ROB Any order Tail CT ARF In-order Reorder Buffer (ROB) Circular queue of spec state May contain multiple definitions of same register @ Alloc Allocate result storage at Tail @ Sched Get inputs (ROB T-to-H then ARF) Wait until all inputs ready @ WB Write results/fault to ROB Indicate result is ready @ CT Wait until inst @ Head is done If fault, initiate handler Else, write results to ARF Deallocate entry from ROB

Time Reorder Buffer Example Code Sequence f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 Initial Conditions - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 regid: r8 result: 2 Except: n H regid: r8 result: 2 Except: n H regid: r8 result: 2 Except: n H regid: f1 result:? Except:? T regid: f1 result:? Except:? regid: f1 result:? Except:? ROB regid: r3 result:? Except:? T regid: r3 regid: r4 result: 11 result:? Except: N Except:? r3 T

Time Reorder Buffer Example ROB Code Sequence f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 Initial Conditions regid: r8 result: 2 Except: n H regid: r8 result: 2 Except: n regid: f1 result:? Except:? regid: f1 result:? Except: y regid: r3 result: 11 Except: n regid: r3 result: 11 Except: n regid: r4 result: 5 Except: n T regid: r4 result: 5 Except: n - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 H regid: f1 result:? Except: y H regid: r3 result: 11 Except: n T regid: r4 result: 5 Except: n T

Time Reorder Buffer Example Code Sequence ROB f1 = f2 / f3 r3 = r2 + r3 r4 = r3 r2 Initial Conditions H T first inst of fault handler - reorder buffer empty - f2 = 3.0 - f3 = 2.0 - r2 = 6 - r3 = 5 H T

There is more complexity here Rename table needs to be cleared Everything is in the ARF Really do need to finish everything which was before the faulting instruction in program order. What about branches? Would need to drain everything before the branch. Why not just squash everything that follows it?

And while we re at it Does the ROB replace the RS? Is this a good thing? Bad thing?

ROB ROB ROB is an in-order queue where instructions are placed. Instructions complete (retire) in-order Instructions still execute out-of-order Still use RS Instructions are issued to RS and ROB at the same time Rename is to ROB entry, not RS. When execute done instruction leaves RS Only when all instructions in before it in program order are done does the instruction retire.

Adding a Reorder Buffer

Map Table Reg Tag r0 r1 r2 r3 r4 Tomasulo Data Structures (Timing Free Example) Reservation Stations (RS) T FU busy op R T1 T2 V1 V2 1 2 3 4 5 CDB T V ARF Reg V r0 r1 r2 r3 r4 Instruction r0=r1*r2 r1=r2*r3 Branch if r1=0 r0=r1+r1 r2=r2+1 Reorder Buffer (RoB) RoB Number 0 1 2 3 4 5 6 Dest. Reg. Value

Review Questions Could we make this work without a RS? If so, why do we use it? Why is it important to retire in order? Why must branches wait until retirement before they announce their mispredict? Any other ways to do this?

More review questions 1. What is the purpose of the RoB? 2. Why do we have both a RoB and a RS? Yes, that was pretty much on the last page 3. Misprediction a) When to we resolve a mis-prediction? b) What happens to the main structures (RS, RoB, ARF, Rename Table) when we mispredict? 4. What is the whole purpose of OoO execution?