Chapter: Out-of-Order Execution

Long EX. Instruction stages. We have assumed that all stages take the same time, but there is a problem with the EX stage: a multiply (MUL) takes more time than an ADD. We can clearly delay the execution of the ADD until the MUL is finished, but this stalls the pipeline, so it is not desirable. Suppose instead that there is no dependence between the two instructions: A = B*C and D = E + F.

Long EX. Start as soon as possible. So start any instructions which are independent of the long-latency instruction. If there are enough of them, we can hide the latency. This works if we have enough instructions. Even if we don't have enough instructions before the next dependent instruction, we may be able to do some re-ordering: move the MUL earlier, and look beyond the next dependent instruction.
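The independence test above can be sketched in a few lines. This is an illustrative helper, not anything from the lecture: it checks the three classic hazards (RAW, WAR, WAW) between two instructions to decide whether the second may start before the first finishes.

```python
# Hypothetical sketch: deciding whether two instructions are independent,
# so the second may start before the first (long-latency) one finishes.
def deps(instr):
    """instr is (dest_reg, src_regs)."""
    dest, srcs = instr
    return dest, set(srcs)

def independent(first, second):
    d1, s1 = deps(first)
    d2, s2 = deps(second)
    raw = d1 in s2   # second reads what first writes (true dependency)
    war = d2 in s1   # second writes what first reads
    waw = d1 == d2   # both write the same register
    return not (raw or war or waw)

mul = ("A", ("B", "C"))      # A = B * C
add = ("D", ("E", "F"))      # D = E + F, independent of the MUL
dep_add = ("D", ("A", "F"))  # D = A + F, depends on the MUL's result

print(independent(mul, add))      # True: the ADD can start early
print(independent(mul, dep_add))  # False: must wait for A
```

Only the `dep_add` case forces a wait; the independent ADD is exactly the instruction a dynamic scheduler would start under the stalled MUL.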

Exception handling. Suppose there is a problem with the MUL instruction (an overflow, for instance). The exception doesn't occur until the next two instructions have completed. We would like the machine to be in a consistent state when the exception is handled: all previous instructions should be retired, and no later instruction should be retired. The same problem arises if an instruction before the MUL throws an exception.

Precise exceptions. Retired means an instruction is committed: execution is finished, values are stored, and the architectural state of the machine is updated to show the effect of that instruction. Why do we want precise exceptions? The von Neumann model requires it; it enables (easy) recovery from exceptions; it enables (easy) restartable processes; and it aids debugging, indeed makes it possible. Debugging otherwise becomes very hard: if, for instance, the statements following the MUL cause a message to be printed, it looks as if the program had completed the instruction that is actually causing the problem. An example of a machine with imprecise exceptions is the IBM 360/195.

ROB: Reorder buffer. Complete the instructions out of order, but retire them (update the architectural state of the machine) in order. Instructions are decoded and reserve an entry in the reorder buffer. When an instruction completes, it writes its result into the ROB. When an instruction is the oldest in the ROB and has completed (without throwing an exception), make it architecturally visible. Architecturally visible: moved to the register file or memory.
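The complete-out-of-order / retire-in-order rule can be sketched as a small simulation. This is a minimal illustration with made-up names, not a real implementation: entries are reserved at decode, results are written at completion, and retirement only ever happens from the oldest end.

```python
from collections import deque

# Minimal reorder-buffer sketch (illustrative, not from the lecture).
class ROB:
    def __init__(self):
        self.entries = deque()   # oldest entry at the left

    def allocate(self, dest):
        # Reserved at decode time, before the result exists.
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        # Completion may happen out of order.
        entry["value"] = value
        entry["done"] = True

    def retire(self, regfile):
        # Retire only completed instructions, strictly from the oldest end.
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]   # now architecturally visible

rob = ROB()
regs = {}
e1 = rob.allocate("r1")   # older instruction (e.g. a slow MUL)
e2 = rob.allocate("r2")   # younger instruction, finishes first
rob.complete(e2, 7)
rob.retire(regs)          # r2 is done but not oldest: nothing retires
print(regs)               # {}
rob.complete(e1, 3)
rob.retire(regs)          # now both retire, in order
print(regs)               # {'r1': 3, 'r2': 7}
```

The key point is the `while` loop in `retire`: the younger result sits in the ROB, invisible to the architectural state, until everything older has completed cleanly.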

ROB modification. Data in the register file is part of the ISA and visible to the programmer: the architectural state. The datapath changes from Register File -> Functional Unit to Register File -> Functional Units -> ROB. We must have more functional units, plus the ROB, which is not architecturally visible.

ROB Entry. Space must be reserved before the operation starts; if there is no space, stall. An entry must contain the information which allows the result to be written to a register: V | DestRegID | DestRegVal | StoreAddr | StoreData | PC | valid bits for reg/data + control bits | Exc? It must have the destination register, the value for the destination register, and the PC (which says which is the oldest instruction). But suppose a value which has been written into the ROB is wanted before it is architecturally visible: get it directly from the ROB.
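The entry fields listed above map naturally onto a record. A sketch, with field names taken from the slide (the types and defaults are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Optional

# One reorder-buffer entry, following the field list on the slide.
@dataclass
class ROBEntry:
    valid: bool = False                    # V: entry in use
    dest_reg_id: Optional[int] = None      # destination register
    dest_reg_val: Optional[int] = None     # value for the destination
    store_addr: Optional[int] = None       # for stores
    store_data: Optional[int] = None       # for stores
    pc: Optional[int] = None               # identifies the oldest instruction
    val_ready: bool = False                # valid bit: has the result arrived?
    exception: bool = False                # Exc?: did this instruction trap?

e = ROBEntry(valid=True, dest_reg_id=2, pc=0x40)
e.dest_reg_val, e.val_ready = 42, True   # result arrives from a functional unit
print(e.val_ready, e.exception)          # True False
```

An entry can only be retired when `val_ready` is set and `exception` is clear, and only when its `pc` marks it as the oldest occupied entry.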

ROB (i). ROB modification: an extra data path from the ROB to the functional units (and of course the forwarding path from the output of the functional units back to their inputs: bypassing). The ROB is content-addressable: it is searched by register ID.

ROB (ii). Simplification: a value for the functional unit is assumed to be in the register. Remember we are already using register renaming. It means the physical location of (say) register 2 depends on context. Register 2 is part of the architectural state, but when you ask for the value of register 2, or ask to store a value in register 2, its physical location varies. So the register has a valid bit, and if that is set to false when you access the register, then what is in the register is the ID of the ROB entry that contains (or will contain) the value of the register.

Register renaming. The register ID is renamed to the reorder-buffer entry that will hold the register's value: Register ID -> ROB entry ID, i.e. architectural register ID -> physical register ID. Now the ROB entry ID is the register ID, and there appear to be a large number of registers.
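The renaming map can be sketched as a small table. This is an illustrative model (the function names and the dict-based regfile are assumptions): a register alias table maps an architectural register to the ROB entry that will produce its value, and a lookup tells a consumer whether to read the architectural register file or wait on a ROB entry.

```python
# Renaming sketch: architectural register ID -> ROB entry ID.
rat = {}                  # arch reg -> ROB entry id (absent: value is in regfile)
regfile = {2: 10, 3: 5}   # architectural register file
next_rob = 0

def rename(dest_reg):
    """Allocate a ROB entry for dest_reg and point the RAT at it."""
    global next_rob
    entry_id = next_rob
    next_rob += 1
    rat[dest_reg] = entry_id
    return entry_id

def lookup(src_reg):
    """Where should a consumer get src_reg from?"""
    if src_reg in rat:
        return ("rob", rat[src_reg])      # value will come from the ROB
    return ("regfile", regfile[src_reg])  # architectural value is current

tag = rename(2)        # an in-flight instruction will write r2
print(lookup(2))       # ('rob', 0): read r2 via its ROB entry
print(lookup(3))       # ('regfile', 5): r3 is unchanged, read it directly
```

This is exactly the valid-bit behaviour described in ROB (ii): a register that is "invalid" holds a pointer into the ROB rather than a value.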

Register renaming: operations. IF: fetch the instruction, in order. D: access the register file or the ROB; check whether the instruction can execute: are all the values present? (They may be in the ROB, in which case they are not yet architectural.) E: instructions can complete out of order. R: completion: write the result to the ROB. W: retirement/commit: if there are no exceptions, write to the architectural register file (or memory); if there is an exception, flush the pipeline and start the exception handler. Dispatch is in order, completion is out of order, retirement is in order.

ROB review. A simple principle, and it supports precise exceptions. But we need to access the ROB to get results not yet in the register file, and this indirection means increased latency and complexity. Elegant, but we are increasing the complexity of execution in order to take care of something which is not likely to occur: an exception. Alternative: assume all is well and only take action when something goes wrong. The history buffer.

History Buffer. A similar buffer, but with a different purpose: it is only used for exceptions. On decode, an entry is reserved in the history buffer. On completion, the old value of the destination is stored in the HB. When the instruction is the oldest, its HB location simply becomes free for the system to write a new value into.
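The contrast with the ROB is worth making concrete. A sketch under stated assumptions (class and method names are invented for illustration): results update the register file eagerly, and the history buffer records the old destination values so the machine can roll back to a precise state on an exception.

```python
# History-buffer sketch: the buffer holds OLD values for undo, not results.
class HistoryBuffer:
    def __init__(self, regfile):
        self.regfile = regfile
        self.log = []   # (dest, old_value), oldest first

    def record(self, dest):
        # Before an instruction overwrites its destination, save the old value.
        self.log.append((dest, self.regfile.get(dest)))

    def retire_oldest(self):
        # Oldest instruction retires cleanly: its undo record is freed.
        self.log.pop(0)

    def rollback(self):
        # Exception: undo in reverse order to restore a precise state.
        for dest, old in reversed(self.log):
            self.regfile[dest] = old
        self.log.clear()

regs = {"r1": 1, "r2": 2}
hb = HistoryBuffer(regs)
hb.record("r1"); regs["r1"] = 100   # instruction writes r1 eagerly
hb.record("r2"); regs["r2"] = 200   # younger instruction writes r2
hb.rollback()                       # exception before either retires
print(regs)                         # {'r1': 1, 'r2': 2}
```

In the common (no-exception) case the only work is appending and later discarding a log entry, which is the point of the design: the expensive path is taken only when something goes wrong.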

Review: dynamic scheduling. The order of execution is determined dynamically (and not by the compiler). The reorder buffer solves the problem of instructions which take longer to complete and, via renaming, solves the problem of data dependencies which are not true data dependencies. This includes delays not predictable at compile time: is an operand in the cache? But a true data dependency stalls the dispatch of instructions to the functional unit: if you don't have the values, you can't execute the instruction.

Out of Order. Operation of an in-order pipeline in the presence of a data dependency: a false data dependency (a register conflict) is handled by the ROB, which allows the dispatch of younger instructions into the execution units. A true data dependency stalls dispatch: younger instructions cannot proceed, and any following instructions cannot dispatch. There is also a problem of register vs. memory operands:

MUL R1 <- R2, R3
ADD R4 <- R5, R1  (true dependency: stall)

LD  R1 <- R2(0)
ADD R4 <- R5, R1  (true dependency: stall)

We know how long we have to wait for the multiply to finish, but we don't know how long the load will take. Is it from memory, level-1 cache, level-2 cache...? We cannot tell ahead of time because of the possibility of dynamic memory allocation.

Preventing dispatch stalls. We have talked about compiler re-ordering, which works for MUL's fixed latency. But LD has variable latency, so this is harder to do at compile time. We can also do fine-grained multi-threading, but again this is difficult in the presence of variable latency. Variable latency is where out-of-order execution becomes valuable: only dispatch an instruction when its operands become available, and dispatch subsequent instructions in the meantime. How do you know when the operands become available? This is a data-flow idea: not von Neumann, but it exists in (nearly?) all current high-performance processors.

Move dependent instructions. Any instructions which are truly dependent are moved to an area to wait: the reservation stations. Monitor the source values of each instruction; when all are available, dispatch the instruction. We no longer dispatch in control order, but in data-flow order. Variable latency is no longer a problem: we just wait until all operands are available. (If the wait is too long, a stall will of course occur.) Non-dependent instructions can keep executing.

Requirements for execution:
1. Need to link the instruction which is waiting for a value to the instruction which will produce it; they no longer communicate through registers.
2. Need to hold instructions out of the way of the execution stream until they have all their operands.
3. Instructions need to keep track of the status of their source values.
4. When all the source values are available, the instruction needs to be re-inserted into the execution stream (sent to a functional unit).

Actions implementing the requirements:
1. Associate a tag with each data value: register renaming. This is the same as using the ROB: associate the architectural register ID with a ROB ID.
2. Reservation stations. After renaming, each instruction is placed in a reservation station: moving it out of the way.
3. When the value of a data item becomes available, broadcast its tag. The tag is a string which identifies the data value; the simplest choice is the reservation-station ID. Broadcast, because more than one instruction may need the value. Instructions in the reservation stations compare the broadcast tag with the tags on their required values.
4. When all source values are ready, dispatch the instruction(s) to the appropriate functional units. If more than one instruction is ready, arbitrate for the functional units.
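The four steps above can be sketched as a toy reservation station with tag broadcast. This is an illustrative model, not Tomasulo's actual hardware: issued instructions carry each operand either as a value or as a tag, a completing producer broadcasts its tag and value to every waiting slot, and only fully-valued instructions dispatch.

```python
# Reservation-station sketch with tag broadcast (illustrative names).
class RS:
    def __init__(self):
        self.slots = []

    def issue(self, op, sources):
        # sources maps operand name -> ("tag", t) if waiting, ("val", v) if ready.
        self.slots.append({"op": op, "src": dict(sources)})

    def broadcast(self, tag, value):
        # Every slot compares the broadcast tag against its waiting sources.
        for slot in self.slots:
            for name, (kind, v) in slot["src"].items():
                if kind == "tag" and v == tag:
                    slot["src"][name] = ("val", value)

    def dispatch_ready(self):
        # Instructions with all operands present go to the functional units.
        ready = [s for s in self.slots
                 if all(k == "val" for k, _ in s["src"].values())]
        self.slots = [s for s in self.slots if s not in ready]
        return ready

rs = RS()
rs.issue("ADD", {"a": ("tag", 1), "b": ("val", 5)})  # waits on tag 1
print(rs.dispatch_ready())   # []: operand 'a' has not arrived yet
rs.broadcast(1, 10)          # producer with tag 1 completes
ready = rs.dispatch_ready()
print(ready[0]["op"])        # ADD
```

Note what the broadcast costs in hardware: every operand slot here does a comparison against the broadcast tag, which is the comparator-per-entry complexity the next slide points out.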

Review of concepts:
1. Register renaming means that only true dependencies are left, and it also provides a mechanism to link the results of calculations to the operations which require those results.
2. The reservation stations allow the pipeline to keep going, bypassing dependent operations.
3. The tag broadcast allows the creation of a data-flow-like architecture.
Note that every place in the RS, as well as every entry in the RAT, needs a comparator to check the tag bits. This means a significant increase in complexity.

Imprecise interrupts/exceptions. Developed for the 360/91, used in the 360/195. "Computer stopped at this PC; probably the problem is close to here." Patterson comments that this was "not so popular with programmers". He can say that again! Basically you do imprecise debugging. It taught me early that debugging by running to a crash is a bad way to develop code.

Final stage of parallelism: CPI must be at least one, unless more than one instruction is issued per clock cycle.
1. Superscalar processors: (a) static, (b) dynamic.
2. VLIW: Very Long Instruction Word.
Reaching 4 instructions per clock.