Precise Exceptions and Out-of-Order Execution. Samira Khan


Multi-Cycle Execution
Not all instructions take the same amount of time to execute.
Idea: Have multiple different functional units that take different numbers of cycles. They can be pipelined or not pipelined.
This lets independent instructions start execution on a different functional unit before a previous long-latency instruction finishes execution.

ISSUES IN PIPELINING: MULTI-CYCLE EXECUTE
Instructions can take different numbers of cycles in the EXECUTE stage: integer ADD versus FP multiply.
FMUL R4 ← R1, R2
ADD R3 ← R1, R2
FMUL R2 ← R5, R6
ADD R4 ← R5, R6
(Pipeline diagram: each FMUL occupies EXECUTE for eight cycles, F D E E E E E E E E, while each ADD needs a single E cycle, so the younger ADDs finish execution before the older FMULs.)
What is wrong with this picture? What if an FMUL incurs an exception? The sequential semantics of the ISA are NOT preserved!

The Von Neumann Model/Architecture
Also called stored program computer (instructions in memory). Two key properties:
Stored program: Instructions stored in a linear memory array. Memory is unified between instructions and data. The interpretation of a stored value depends on the control signals.
Sequential instruction processing: One instruction processed (fetched, executed, and completed) at a time. The program counter (instruction pointer) identifies the current instruction. The program counter is advanced sequentially except for control transfer instructions.

HANDLING EXCEPTIONS IN PIPELINING
Exceptions versus interrupts:
Cause: Exceptions are internal to the running thread; interrupts are external to the running thread.
When to handle: Exceptions when detected (and known to be non-speculative); interrupts when convenient, except for very high priority ones (power failure, machine check).
Priority: process (exception), depends (interrupt).
Handling context: process (exception), system (interrupt).

PRECISE EXCEPTIONS/INTERRUPTS
The architectural state should be consistent when the exception/interrupt is ready to be handled:
1. All previous instructions should be completely retired.
2. No later instruction should be retired.
Retire = commit = finish execution and update architectural state.

WHY DO WE WANT PRECISE EXCEPTIONS?
Aid software debugging.
Enable (easy) recovery from exceptions, e.g., page faults.
Enable (easily) restartable processes.

ENSURING PRECISE EXCEPTIONS IN PIPELINING
Idea: Make each operation take the same amount of time.
FMUL R3 ← R1, R2
ADD R4 ← R1, R2
(Pipeline diagram: every instruction, including each single-cycle ADD, is padded to spend as many cycles in EXECUTE as the slowest operation, F D E E E E E E E E.)
Downside: What about memory operations? Should each functional unit take 500 cycles?

SOLUTION: REORDER BUFFER (ROB)
Idea: Complete instructions out of order, but reorder them before making results visible to architectural state.
When an instruction is decoded, it reserves an entry in the ROB.
When an instruction completes, it writes its result into its ROB entry.
When an instruction is the oldest in the ROB and it has completed, its result is moved to the register file or memory.
(Diagram: the instruction cache feeds several functional units; their results go through the reorder buffer before updating the register file.)
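The slide's three rules (reserve on decode, write on complete, retire in order from the oldest entry) map onto a small simulation. The sketch below is illustrative only and not from the lecture; the names (ROBEntry, ReorderBuffer, allocate, complete, retire) are made up for the example.

# Minimal reorder buffer sketch (illustrative, not the lecture's code).
# Entries are allocated in program order at decode, results are written
# into the ROB out of order at completion, and values reach the
# architectural register file only when the oldest entry has completed.
from collections import deque

class ROBEntry:
    def __init__(self, dest_reg):
        self.dest_reg = dest_reg   # architectural destination register
        self.value = None          # result, filled in at completion
        self.complete = False      # set when the functional unit finishes

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()     # oldest entry at the left end

    def allocate(self, dest_reg):
        # Decode: reserve an entry in program order.
        entry = ROBEntry(dest_reg)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Execute finished: write the result into the ROB, not the regfile.
        entry.value = value
        entry.complete = True

    def retire(self, regfile):
        # Move results of the oldest completed entries, in order, to arch. state.
        while self.entries and self.entries[0].complete:
            oldest = self.entries.popleft()
            regfile[oldest.dest_reg] = oldest.value

# The FMUL/ADD example from the slides: the ADD completes first, but R3 is
# not architecturally updated until the older FMUL has retired.
regfile = {}
rob = ReorderBuffer()
fmul = rob.allocate("R4")   # FMUL R4 <- R1, R2 (long latency)
add = rob.allocate("R3")    # ADD  R3 <- R1, R2 (completes early)
rob.complete(add, 1000)
rob.retire(regfile)         # retires nothing: the FMUL is still the oldest
rob.complete(fmul, 101)
rob.retire(regfile)         # now both entries retire, in program order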

(ROB snapshot from the slides: four in-flight instructions, oldest to youngest FMUL, ADD, FMUL, ADD. Each reorder buffer entry holds a valid bit, DEST REG, DEST VAL, and a COMPLETE bit; the two oldest entries target R4 and R3, and no entry has completed yet.)

REORDER BUFFER: INDEPENDENT OPERATIONS
Instruction stream (oldest to youngest): FMUL R4 ← R1, R2; ADD R3 ← R1, R2; FMUL R2 ← R5, R6; ADD R4 ← R5, R6. Each FMUL occupies EXECUTE for eight cycles (F D E E E E E E E E R) while each ADD needs a single E cycle; every instruction allocates a ROB entry (V, DEST REG, DEST VAL, COMPLETE) in program order.
ROB snapshots from the slides, over cycles 0 through 12:
Cycle 5: the ADD writing R3 has completed and its result (1000) sits in its ROB entry, but it cannot retire because the older FMUL writing R4 is still executing; the younger FMUL (R2) and ADD (R4) hold allocated entries that are not yet complete.
Cycle 11: the oldest FMUL finishes and writes its result (shown as 101) into its R4 entry, still without touching the register file.
Cycle 12: the oldest entry is now complete, so it retires (RETIRE OLDEST): its value is written to the architectural register file, its valid bit is cleared, and the entry is freed; the next-oldest completed entry (R3 = 1000) becomes the retirement candidate.
What if a later operation needs a value that is still in the reorder buffer? Read the reorder buffer in parallel with the register file. How?

REORDER BUFFER: HOW TO ACCESS?
A register value can be in the register file, the reorder buffer, or on bypass paths.
(Diagram: the reorder buffer sits between the functional units and the register file and is a content addressable memory, searched with the register ID; bypass paths forward results from the functional units directly.)

(Figure: searching for a register value. The register file holds a value and a valid bit per register; R2, R3, and R4 are marked invalid because their latest values live in, or are still being produced by, reorder buffer entries, so the ROB must be searched by destination register ID.)
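A CAM-style lookup can be sketched as a reverse search over the ROB entries: the youngest in-flight writer of the register wins, and only if there is none does the register file supply the value. Illustrative only; it reuses the ROBEntry fields from the earlier sketch, and the function name is an assumption.

# CAM-style operand lookup (illustrative sketch).
# Search the ROB from youngest to oldest for the most recent writer of
# the register; fall back to the register file if no in-flight writer exists.
def read_operand(reg, rob_entries, regfile):
    # rob_entries is ordered oldest -> youngest
    for entry in reversed(rob_entries):
        if entry.dest_reg == reg:
            if entry.complete:
                return entry.value           # value is already in the ROB
            return ("wait_for", entry)       # value is still being computed
    return regfile[reg]                      # no in-flight writer: use regfile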

SIMPLIFYING REORDER BUFFER ACCESS
Idea: Use indirection.
Access the register file first. If the register is not valid, the register file stores the ID of the reorder buffer entry that contains (or will contain) the value of the register, i.e., a mapping from the register to a ROB entry. Access the reorder buffer next.
What is in a reorder buffer entry? V, DestRegID, DestRegVal, StoreAddr, StoreData, BranchTarget, PC/IP, control/valid bits.
Can it be simplified further?

(Figure: searching for a register value with indirection. Each register file entry now also stores a TAG: R2, R3, and R4 are invalid and their tags (5, 2, and 6) name the ROB entries that will produce their values, so the ROB is indexed directly instead of searched associatively.)
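The indirection can be sketched by giving each register file entry a valid bit and a ROB tag, so an operand read is two indexed lookups instead of an associative search. Again a hypothetical sketch under the same assumptions as the earlier ROB code; the names are mine.

# Register file with indirection (illustrative sketch): each entry holds
# either a valid architectural value or the tag of the ROB entry that will
# produce it. Lookup is a plain indexed read; no CAM search is required.
class RegFileEntry:
    def __init__(self, value=0):
        self.valid = True      # True: 'value' is the architectural value
        self.value = value
        self.rob_tag = None    # ROB index of the in-flight writer, if any

def rename_dest(reg, regfile, rob_tag):
    # At decode: point the register at the ROB entry that will write it.
    regfile[reg].valid = False
    regfile[reg].rob_tag = rob_tag

def read_with_indirection(reg, regfile, rob):
    # regfile: dict reg -> RegFileEntry; rob: list of ROB entries indexed by tag
    entry = regfile[reg]
    if entry.valid:
        return entry.value                 # value is in the register file
    producer = rob[entry.rob_tag]          # one indexed ROB access
    if producer.complete:
        return producer.value
    return ("wait_for", entry.rob_tag)     # operand is not ready yet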

REORDER BUFFER PROS AND CONS
Pro: Conceptually simple for supporting precise exceptions.
Con: The reorder buffer needs to be accessed to get results that have not yet been written to the register file; CAM or indirection → increased latency and complexity.

Reorder Buffer in Intel Pentium III
Boggs et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001.

In-Order Pipeline with Reorder Buffer
Decode (D): Access regfile/ROB, allocate entry in ROB, check if instruction can execute; if so, dispatch instruction.
Execute (E): Instructions can complete out of order.
Completion (R): Write result to reorder buffer.
Retirement/Commit (W): Check for exceptions; if none, write result to architectural register file or memory; else, flush pipeline and start from the exception handler.
In-order dispatch/execution, out-of-order completion, in-order retirement.
(Pipeline diagram: F and D feed functional units of different latencies, integer add, integer mul, FP mul, and load/store, all of which converge on the R and W stages.)

Out-of-Order Execution (Dynamic Instruction Scheduling)

AN IN-ORDER PIPELINE
(Pipeline diagram: F and D feed functional units of different latencies, integer add (one E cycle), integer mul (four E cycles), FP mul (eight E cycles), and load/store (variable latency, e.g., on a cache miss), which converge on R and W.)
Problem: A true data dependency stalls dispatch of younger instructions into functional (execution) units.
Dispatch: the act of sending an instruction to a functional unit.

CAN WE DO BETTER?
What do the following two pieces of code have in common (with respect to execution in the previous design)?

IMUL R3 ← R1, R2          LD   R3 ← R1 (0)
ADD  R3 ← R3, R1          ADD  R3 ← R3, R1
ADD  R1 ← R6, R7          ADD  R1 ← R6, R7
IMUL R5 ← R6, R8          IMUL R5 ← R6, R8
ADD  R7 ← R9, R9          ADD  R7 ← R9, R9

Answer: The first ADD stalls the whole pipeline! The ADD cannot dispatch because its source register is unavailable, so later independent instructions cannot get executed.
How are the above code portions different?
Answer: Load latency is variable (unknown until runtime). What does this affect? Think compiler vs. microarchitecture.

IN-ORDER VS. OUT-OF-ORDER DISPATCH
Code:
IMUL R3 ← R1, R2
ADD  R3 ← R3, R1
ADD  R1 ← R6, R7
IMUL R5 ← R6, R8
ADD  R7 ← R3, R5
In-order dispatch + precise exceptions: the dependent ADD stalls in decode until the IMUL's result is available, and every younger instruction, independent or not, stalls behind it.
Out-of-order dispatch + precise exceptions: the dependent ADDs WAIT for their operands while the independent ADD and IMUL dispatch and execute underneath the first IMUL; completion is still reordered so that architectural state is updated in program order.
Result: 16 vs. 12 cycles.

PREVENTING DISPATCH STALLS
Is there any way to prevent dispatch stalls?
Dataflow: fetch and fire an instruction when its inputs are ready.
Problem: in-order dispatch (scheduling, or execution).
Solution: out-of-order dispatch (scheduling, or execution).

TOMASULO'S ALGORITHM
OoO with register renaming, invented by Robert Tomasulo. Used in the IBM 360/91 Floating Point Units.
Read: Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967.
What is the major difference today? Precise exceptions: the IBM 360/91 did NOT have them.
Patt, Hwu, Shebanow, "HPS, a New Microarchitecture: Rationale and Introduction," MICRO 1985.
Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," MICRO 1985.
Variants are used in most high-performance processors: initially in the Intel Pentium Pro and AMD K5; later the Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, and ARM Cortex-A15.

These slides are not covered in class. They are for students who want to know more.

What is the insight of OOO execution?

OUT-OF-ORDER EXECUTION (DYNAMIC SCHEDULING)
Idea: Move the dependent instructions out of the way of independent ones (so that the independent ones can execute).
Rest areas for dependent instructions: reservation stations.
Monitor the source values of each instruction in the resting area. When all source values of an instruction are available, fire (i.e., dispatch) the instruction. Instructions are dispatched in dataflow (not control-flow) order.
Benefit: latency tolerance; allows independent instructions to execute and complete in the presence of a long-latency operation.
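The "monitor sources, fire when all are available" rule can be sketched as a tiny scheduler: each waiting instruction sits in a reservation station entry with its sources marked either as a value or as a tag naming the producer, and any entry whose sources are all values may dispatch, regardless of program order. This is a hypothetical sketch (the names RSEntry, wakeup, and select are mine, and real designs tag sources with reservation station IDs rather than register names).

# Reservation station sketch (illustrative): instructions wait with their
# source operands; whenever all sources of an entry are ready it may fire,
# so dispatch follows dataflow order rather than program order.
class RSEntry:
    def __init__(self, op, dest, sources):
        self.op = op
        self.dest = dest
        # each source is either ('value', v) or ('tag', producer)
        self.sources = list(sources)

    def ready(self):
        return all(kind == 'value' for kind, _ in self.sources)

def wakeup(stations, dest, value):
    # Broadcast a produced result: sources waiting on this producer capture it.
    for rs in stations:
        rs.sources = [('value', value) if (kind == 'tag' and tag == dest)
                      else (kind, tag) for kind, tag in rs.sources]

def select(stations):
    # Fire any entry whose sources are all ready (dataflow order).
    for rs in stations:
        if rs.ready():
            stations.remove(rs)
            return rs
    return None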

The Von Neumann Model/Architecture
Also called stored program computer (instructions in memory). Two key properties:
Stored program: Instructions stored in a linear memory array. Memory is unified between instructions and data. The interpretation of a stored value depends on the control signals. When is a value interpreted as an instruction?
Sequential instruction processing: One instruction processed (fetched, executed, and completed) at a time. The program counter (instruction pointer) identifies the current instruction. The program counter is advanced sequentially except for control transfer instructions.

The Dataflow Model (of a Computer)
Von Neumann model: An instruction is fetched and executed in control flow order, as specified by the instruction pointer; sequential unless there is an explicit control flow instruction.
Dataflow model: An instruction is fetched and executed in data flow order, i.e., when its operands are ready; there is no instruction pointer. Instruction ordering is specified by data flow dependence. Each instruction specifies who should receive the result, and an instruction can fire whenever all of its operands are received. Potentially many instructions can execute at the same time: inherently more parallel.

Von Neumann vs. Dataflow
Consider a Von Neumann program:
What is the significance of the program order?
What is the significance of the storage locations?
v <= a + b;
w <= b * 2;
x <= v - w
y <= v + w
z <= x * y
(Figure: the same computation drawn two ways: as a sequential Von Neumann ordering ending in z, and as a dataflow graph in which a and b feed a + node and a *2 node, whose outputs feed a - node and a + node, whose outputs feed the final * node producing z.)
Which model is more natural to you as a programmer?
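As an illustration of "fire when operands are ready," the five assignments above can be executed as a dataflow graph: the + and *2 nodes can fire in the same step, then - and +, then the final *. A minimal sketch, assuming Python and made-up names (graph, run_dataflow); it is not from the lecture.

# Dataflow execution of the slide's example (illustrative sketch):
# nodes fire whenever all of their inputs have arrived, with no
# instruction pointer imposing a sequential order.
import operator

# node name -> (function, list of input names)
graph = {
    'v': (operator.add, ['a', 'b']),    # v <= a + b
    'w': (lambda b: b * 2, ['b']),      # w <= b * 2
    'x': (operator.sub, ['v', 'w']),    # x <= v - w
    'y': (operator.add, ['v', 'w']),    # y <= v + w
    'z': (operator.mul, ['x', 'y']),    # z <= x * y
}

def run_dataflow(inputs):
    values = dict(inputs)               # tokens that have arrived so far
    pending = dict(graph)
    while pending:
        # every node whose operands are all present fires in this step
        firing = [n for n, (_, srcs) in pending.items()
                  if all(s in values for s in srcs)]
        for n in firing:                # 'v' and 'w' fire together,
            fn, srcs = pending.pop(n)   # then 'x' and 'y', then 'z'
            values[n] = fn(*(values[s] for s in srcs))
    return values

print(run_dataflow({'a': 3, 'b': 4}))   # z = (v - w) * (v + w) = -15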

Data Flow Advantages/Disadvantages
Advantages:
Very good at exploiting irregular parallelism.
Only real dependencies constrain processing.
Disadvantages:
Debugging is difficult (no precise state).
Interrupt/exception handling is difficult (what is the precise state semantics?).
Too much parallelism? (Parallelism control needed.)
High bookkeeping overhead (tag matching, data storage).
Memory locality is not exploited.

OOO EXECUTION: RESTRICTED DATAFLOW
An out-of-order engine dynamically builds the dataflow graph of a piece of the program. Which piece? The dataflow graph is limited to the instruction window.
Instruction window: all decoded but not yet retired instructions.
Can we do it for the whole program? Why would we like to? In other words, how can we have a large instruction window?

GENERAL ORGANIZATION OF AN OOO PROCESSOR
Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, Dec. 1995.

TOMASULO'S MACHINE: IBM 360/91
(Diagram: load buffers are filled from memory and FP operations arrive from the instruction unit over the operation bus; the FP registers, load buffers, and operation bus feed reservation stations in front of the FP functional units; store buffers write to memory; results are broadcast on the common data bus.)