EECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 7 Winter 2018

Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

Warning: Crazy times coming
- HW2 is due on Tuesday 1/30. We will do group formation and project materials on Tuesday too.
- P3 is due on Sunday 2/4. It's a lot of work (20 hours?).
- The proposal is due on Tuesday 2/6. It's not a lot of work (1 hour?) to do the write-up, but you'll need to meet with your group and discuss things. Don't worry too much about getting this right; you'll be allowed to change it (we'll meet the following Friday). It's just a line in the sand.
- HW3 is due on Thursday 2/8. It's a fair bit of work (4 hours?). Answers will be posted right after class.
- 20-minute group meetings on Friday 2/9 (rather than in-lab).
- The midterm is on Thursday 2/8 in the evening (6-8pm). There will be a review session that weekend (date/time TBA) and Q&A in class on 2/12. The best way to study is to look at old exams (posted online later this week).

Last time
- Covered branch predictors: direction and address.
- Started on the P6 algorithm: Tomasulo's algorithm with a Reorder Buffer.

General speculation
- Control speculation: "I think this branch will go to address 90004."
- Data speculation: "I'll guess the result of the load will be zero."
- Memory conflict speculation: "I don't think this load conflicts with any preceding store."
- Error speculation: "I don't think there were any errors in this calculation."

Speculation in general
- Need to be 100% sure of final correctness, so we need a recovery mechanism.
- Must make forward progress!
- Want to speed up overall performance, so the recovery cost should be low or the expected rate of occurrence should be low.
- There is a real trade-off among accuracy, cost of recovery, and speedup when correct (see the sketch below).
- Should keep the worst case in mind.
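To make that trade-off concrete, here is a minimal back-of-the-envelope sketch in Python; the accuracy, benefit, and penalty numbers are assumptions for illustration, not figures from the lecture:

    # Hypothetical numbers for illustration only.
    p_correct = 0.95   # assumed predictor accuracy
    benefit   = 10     # cycles saved when speculation is right (assumed)
    penalty   = 20     # cycles lost to recovery when it is wrong (assumed)

    expected_gain = p_correct * benefit - (1 - p_correct) * penalty
    print(f"Expected gain per speculation: {expected_gain:.1f} cycles")
    # Speculation pays off only while p*benefit > (1-p)*penalty, i.e.,
    # while accuracy is high or recovery is cheap.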

Precise Interrupts and Branches via the ROB

[Figure: pipeline with IF, ID, Alloc, Sched, EX, MEM, and CT stages. Each ROB entry holds the PC, destination regid, destination value, and an Except? flag. Instructions enter in order at the Tail, complete in any order, and retire in order from the Head into the ARF.]

Reorder Buffer (ROB): a circular queue of speculative state that may contain multiple definitions of the same register.
- @ Alloc: allocate result storage at the Tail.
- @ Sched: get inputs (search the ROB Tail-to-Head, then the ARF); wait until all inputs are ready.
- @ WB: write the result/fault to the ROB and indicate the result is ready.
- @ CT: wait until the instruction at the Head is done. If it faulted, initiate the handler; otherwise, write its result to the ARF. Deallocate the entry from the ROB.
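As a rough illustration of that mechanism, here is a minimal Python sketch of a ROB as a circular queue (the names and simplifications are mine, not the lecture's): allocate at the Tail, write back out of order, retire in order from the Head.

    class ROBEntry:
        def __init__(self, regid):
            self.regid = regid       # destination register
            self.value = None        # result, filled in at writeback
            self.except_ = False     # fault flag
            self.done = False

    class ROB:
        def __init__(self, size):
            self.entries = [None] * size
            self.head = self.tail = self.count = 0

        def alloc(self, regid):
            """@ Alloc: reserve result storage at the Tail."""
            assert self.count < len(self.entries), "ROB full: stall dispatch"
            idx = self.tail
            self.entries[idx] = ROBEntry(regid)
            self.tail = (self.tail + 1) % len(self.entries)
            self.count += 1
            return idx               # this index serves as the rename tag

        def writeback(self, idx, value, fault=False):
            """@ WB: results arrive in any order."""
            e = self.entries[idx]
            e.value, e.except_, e.done = value, fault, True

        def commit(self, arf):
            """@ CT: retire in order from the Head."""
            e = self.entries[self.head]
            if e is None or not e.done:
                return None          # Head not done yet: wait
            if e.except_:
                raise RuntimeError("fault at Head: squash ROB, run handler")
            arf[e.regid] = e.value   # architectural state updated only here
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
            return e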

Reorder Buffer Example

Code sequence: f1 = f2 / f3; r3 = r2 + r3; r4 = r3 - r2
Initial conditions: reorder buffer empty; f2 = 3.0, f3 = 2.0, r2 = 6, r3 = 5.

Time 1: An older, completed entry (regid r8, result 2, Except: n) sits at the Head. The three instructions are allocated at the Tail in program order, each with result ? and Except ?.

Time 2: The add completes out of order: the r3 entry now holds result 11, Except: n. The r8 entry retires from the Head.

Time 3: The subtract completes (r4 entry: result 5, Except: n), and then the divide writes back with Except: y. The faulting divide is now at the Head.

Time 4: The fault is recognized at retirement: the ROB is flushed, and fetch is redirected to the first instruction of the fault handler. Because r3 = 11 and r4 = 5 never reached the ARF, the handler sees the precise pre-fault state (r3 = 5).
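Using the sketch above, the example's timeline can be replayed (slightly simplified; the pre-existing r8 entry is omitted): the younger instructions finish first, but nothing younger than the faulting divide ever reaches the ARF.

    arf = {"f1": None, "r2": 6, "r3": 5, "r4": None}
    rob = ROB(8)
    div = rob.alloc("f1")                  # f1 = f2 / f3
    add = rob.alloc("r3")                  # r3 = r2 + r3
    sub = rob.alloc("r4")                  # r4 = r3 - r2

    rob.writeback(add, 6 + 5)              # r3 = 11, out of order
    rob.writeback(sub, 11 - 6)             # r4 = 5, reads the speculative r3
    rob.commit(arf)                        # Head (the divide) not done: no-op

    rob.writeback(div, None, fault=True)   # the divide faults
    try:
        rob.commit(arf)                    # fault recognized at the Head
    except RuntimeError as e:
        print(e, "| ARF still has r3 =", arf["r3"])   # precise state: r3 = 5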

There is more complexity here
- The rename table needs to be cleared; after the squash, everything lives in the ARF.
- We really do need to finish everything that came before the faulting instruction in program order.
- What about branches? We would need to drain everything before the branch. Why not just squash everything that follows it? (A sketch of that squash follows.)
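Here is a minimal sketch of such a squash, building on the ROB code above (the helper name and the dict-based rename table are my assumptions; real designs typically checkpoint the map table at each branch instead of rebuilding it):

    def squash_after(rob, branch_idx, rename_table):
        """Discard every ROB entry younger than the branch at branch_idx."""
        i = (branch_idx + 1) % len(rob.entries)
        while i != rob.tail:                 # free all younger entries
            rob.entries[i] = None
            rob.count -= 1
            i = (i + 1) % len(rob.entries)
        rob.tail = (branch_idx + 1) % len(rob.entries)

        # Rebuild the rename table from the surviving (older) entries,
        # oldest to youngest, so the youngest live definition wins; any
        # register with no live definition is read from the ARF.
        rename_table.clear()
        i = rob.head
        while i != rob.tail:
            if rob.entries[i].regid is not None:   # branches define nothing
                rename_table[rob.entries[i].regid] = i
            i = (i + 1) % len(rob.entries)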

And while we're at it: does the ROB replace the RS? Is this a good thing? A bad thing?

ROB
- The ROB is an in-order queue where instructions are placed.
- Instructions complete (retire) in order, but they still execute out of order.
- We still use the RS: instructions are issued to the RS and the ROB at the same time.
- Rename is to the ROB entry, not the RS.
- When its execution is done, an instruction leaves the RS.
- Only when all instructions before it in program order are done does an instruction retire. (The dispatch interplay is sketched below.)
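To make the division of labor concrete, here is a rough dispatch sketch building on the ROB code above (the Instr record and field names are my assumptions): one instruction fills both an RS entry and a ROB entry, and the rename table points at the ROB index.

    from collections import namedtuple

    Instr = namedtuple("Instr", "op dest srcs")   # hypothetical record

    def dispatch(instr, rob, rs_pool, rename_table, arf):
        """Issue instr to an RS entry and a ROB entry simultaneously."""
        idx = rob.alloc(instr.dest)        # the rename target is the ROB index
        operands = []
        for src in instr.srcs:
            if src in rename_table:        # value is (or will be) in the ROB
                tag = rename_table[src]
                e = rob.entries[tag]
                operands.append(e.value if e.done else ("tag", tag))
            else:                          # committed value lives in the ARF
                operands.append(arf[src])
        rename_table[instr.dest] = idx     # later readers see this definition
        rs_pool.append({"op": instr.op, "rob": idx, "operands": operands})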

Adding a Reorder Buffer

Tomasulo Data Structures (timing-free example, P6 scheme)

Map Table: a Tag per register (r0-r4), initially empty.
Reservation Stations (RS) 1-5: fields FU busy, op, RoB#, T1, T2, V1, V2.
CDB: fields T, V.
ARF: a value V per register (r0-r4).
Reorder Buffer (RoB), entries 0-6: fields Dest. Reg. and Value.

Instructions:
  r0 = r1 * r2
  r1 = r2 * r3
  Branch if r1 = 0
  r0 = r1 + r1
  r2 = r2 + 1

Review Questions
- Could we make this work without the RS? If so, why do we keep it?
- Why is it important to retire in order?
- Why must branches wait until retirement before they announce their mispredict? Are there any other ways to do this?

More review questions
1. What is the purpose of the RoB?
2. Why do we have both a RoB and an RS? (Yes, that was pretty much the last slide.)
3. Misprediction:
   a) When do we resolve a misprediction?
   b) What happens to the main structures (RS, RoB, ARF, rename table) when we mispredict?
4. What is the whole purpose of OoO execution?


When an instruction is dispatched, how does it impact each major structure? The rename table? The ARF? The RoB? The RS?

When an instruction completes execution, how does it impact each major structure? The rename table? The ARF? The RoB? The RS?

When an instruction retires, how does it impact each major structure? The rename table? The ARF? The RoB? The RS?

Topic change
Why on earth are we doing this? Why do we think it helps? Homework 2 problems 5 and 6 made the argument: we only need to obey true data dependencies, and that gives huge speedup potential.

Optimizing CPU Performance
Golden rule: t_CPU = N_inst * CPI * t_CLK
Given this, what are our options?
- Reduce the number of instructions executed
- Reduce the cycles needed to execute an instruction
- Reduce the clock period
Our first focus: reducing CPI. Approach: instruction-level parallelism (ILP).
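As a quick sanity check of the golden rule, here is a small worked example (all three numbers are made up for illustration):

    n_inst = 1e9    # assumed: 1 billion dynamic instructions
    cpi    = 1.5    # assumed: average cycles per instruction
    t_clk  = 1e-9   # assumed: 1 GHz clock, so a 1 ns period

    print(f"t_CPU = {n_inst * cpi * t_clk:.2f} s")            # 1.50 s
    # Halving CPI (e.g., via ILP) halves t_CPU, as long as
    # N_inst and t_CLK are unchanged.
    print(f"with CPI = {cpi / 2}: {n_inst * (cpi / 2) * t_clk:.2f} s")  # 0.75 s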

Why ILP?
Requirements:
- Parallelism in the program
- A large instruction window
- Limited control dependencies
- Eliminating false dependencies
- Finding run-time dependencies

How Much ILP is There? (Chapter 3.10)

How Large Must the Window Be?

ALU Operations GOOD, Branches BAD
Expected number of branches between mispredicts: E(X) ~ 1/(1-p).
E.g., with p = 95%, E(X) ~ 20 branches, or 100-ish instructions.
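A quick check of that estimate (assuming roughly one branch per five instructions, which is what the slide's "100-ish insts" implies):

    for p in (0.90, 0.95, 0.99):
        branches = 1 / (1 - p)   # expected branches until a mispredict
        print(f"p = {p:.2f}: ~{branches:.0f} branches, ~{5 * branches:.0f} insts")
    # p = 0.95 gives ~20 branches (~100 instructions): the usable window
    # size is capped by predictor accuracy.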

How Accurate are Branch Predictors?

Impact of Physical Storage Limitations
Each instruction in flight must have storage for its result. It is really worse than this because of misspeculation.

Registers GOOD, Memory BAD
Benefits of registers:
- Well-described dependencies
- Fast access
- A finite resource
Memory gives up these benefits for flexibility:
  *p = ...
  *q = ...
  ... = *p ?
Does the load of *p depend on the store through *q? Only if p and q alias, which is generally unknown until run time.
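A tiny illustration of why that is hard (using Python list indices as a stand-in for the slide's C-style pointers):

    def kernel(a, i, j):
        a[i] = 1      # "*p = ..."
        a[j] = 2      # "*q = ..."
        return a[i]   # "... = *p?" sees the 2 only when i == j

    print(kernel([0, 0], 0, 1))   # 1: no alias; the load could move above the store
    print(kernel([0, 0], 0, 0))   # 2: alias! reordering would be wrong
    # The hardware must sort this out at run time
    # (memory conflict speculation, from earlier in the lecture).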

Bottom Line for an Ambitious Design