Multiple Instruction Issue and Hardware Based Speculation

Multiple Instruction Issue and Hardware Based Speculation Soner Önder Michigan Technological University, Houghton MI www.cs.mtu.edu/~soner

Hardware Based Speculation Exploiting more ILP requires that we overcome the limitation of control dependence. With branch prediction we allowed the processor to continue issuing instructions past a branch based on a prediction: those fetched instructions do not modify the processor state, and they are squashed if the prediction turns out to be incorrect. We now allow the processor to execute these instructions before we know whether it is safe to execute them: We need to correctly restore the processor state if such an instruction should not have been executed. We need to pass the results of these instructions to later instructions as if the program were actually following that path.

Hardware Based Speculation [Figure: a control-flow graph with two conditional branches whose taken and not-taken paths assign different values to the variables a, b, and c. Assume the processor predicts the first branch to be taken and executes down that path: what will happen if the prediction was wrong? And if the processor predicts both branches taken and executes instructions along the way, which value of each variable should the final statement d = a + b + c use?]

Hardware Based Speculation In order to execute instructions speculatively, we need to provide the means: To roll back the values of both the registers and the memory to their correct values upon a misprediction, and To communicate speculatively computed values to new uses of those values. Both can be provided by using a simple structure called the Reorder Buffer (ROB).

Reorder Buffer It is a simple circular array with a head and a tail pointer: New instructions are allocated a position at the tail in program order. Each entry provides a location for storing the instruction's result. New instructions look for source values starting from the tail and searching backward. When the instruction at the head completes and becomes non-speculative, its values are committed and the instruction is removed from the buffer.
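The circular-buffer behavior described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's design: the entry fields, method names, and the stall-on-full policy are all choices made here for clarity.

```python
class ROBEntry:
    def __init__(self, dest):
        self.dest = dest      # destination register name
        self.value = None     # result, filled in at write-back
        self.ready = False    # True once the result has arrived

class ReorderBuffer:
    """Circular array with a head (oldest entry) and a tail (next free slot)."""
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def allocate(self, dest):
        """Allocate a slot at the tail in program order; return the slot index."""
        if self.count == len(self.slots):
            raise RuntimeError("ROB full: issue must stall")
        idx = self.tail
        self.slots[idx] = ROBEntry(dest)
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def write_result(self, idx, value):
        self.slots[idx].value = value
        self.slots[idx].ready = True

    def lookup(self, reg):
        """Newest in-flight value of reg, searching from the tail backward."""
        for i in range(self.count):
            idx = (self.tail - 1 - i) % len(self.slots)
            entry = self.slots[idx]
            if entry.dest == reg:
                return entry      # may or may not be ready yet
        return None               # no in-flight writer: read the register file

    def commit(self, regfile):
        """Retire the head entry once its result is present."""
        entry = self.slots[self.head]
        if entry is None or not entry.ready:
            return False          # head still speculative or executing
        regfile[entry.dest] = entry.value
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return True
```

Note how `lookup` searches from the tail backward: the newest in-flight writer of a register supplies the speculative value, while the register file holds only committed state.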

Reorder Buffer Each entry has 3 fields: instruction, destination, and value. The reorder buffer can be an operand source, in effect providing more registers, just like the reservation stations do. When execution completes, results are tagged with the reorder buffer entry number instead of the reservation station number. The reorder buffer supplies operands in the interval between execution complete and commit. Once an instruction commits, its result is put into the register file. As a result, it is easy to undo speculated instructions on mispredicted branches or on exceptions.

Steps of Speculative Tomasulo Algorithm 1. Issue [get instruction from FP Op Queue]: 1. Check if the reorder buffer is full. 2. Check if a reservation station is available. 3. Access the register file and the reorder buffer for the current values of the source operands. 4. Send the instruction, its reorder buffer slot number, and the source operands to the reservation station. Once issued, the instruction stays in the reservation station until it gets both operands.

Steps of Speculative Tomasulo Algorithm 2. Execute [operate on operands (EX)]: When both operands are ready and a functional unit is available, the instruction executes. This step checks RAW hazards; as long as the operands are not ready, it watches the CDB for results.

Steps of Speculative Tomasulo Algorithm 3. Write result [finish execution (WB)]: Write the result on the Common Data Bus to all awaiting functional units and the reorder buffer; mark the reservation station available.

Steps of Speculative Tomasulo Algorithm 4. Commit [update register file with reorder result]: When the instruction reaches the head of the reorder buffer, its result is present, and no exceptions are associated with the instruction, the instruction becomes non-speculative: update the register file with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer.
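The commit step above, including precise-exception handling and the misprediction flush, can be sketched as follows. This is an illustrative sketch only: the ROB is modeled as a plain Python list of dict entries ordered oldest-first, and all field names are assumptions, not the lecture's notation.

```python
def try_commit(rob, regfile, memory):
    """Retire the entry at the head of the ROB (index 0), if possible.

    The register file and memory hold only committed state, so squashing
    speculative work is just a matter of discarding ROB entries.
    """
    if not rob:
        return
    entry = rob[0]
    if not entry["ready"]:
        return                      # head still executing: wait (in-order commit)
    if entry.get("exception"):
        rob.clear()                 # precise exception: discard all speculative work
        raise RuntimeError(entry["exception"])
    if entry["type"] == "branch" and entry["mispredicted"]:
        rob.clear()                 # squash the branch and every younger instruction
    elif entry["type"] == "store":
        memory[entry["dest"]] = entry["value"]   # stores update memory only at commit
        rob.pop(0)
    else:
        regfile[entry["dest"]] = entry["value"]
        rob.pop(0)
```

Because stores also wait until commit to update memory, a misprediction or exception never has to undo a memory write.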

MIPS FP Unit [Figure: the basic structure of a MIPS floating-point unit.]

Renaming Registers A common variation of the speculative design: The reorder buffer keeps instruction information but not the result. The register file is extended with extra renaming registers to hold speculative results. A rename register is allocated at issue; the result goes into the rename register when execution completes; the rename register is copied into the real register at commit. Operands are read either from the register file (real or speculative) or via the Common Data Bus. Advantage: operands always come from a single source (the extended register file).

Renaming Registers 1. Index the map table using the source register identifiers to get the physical register numbers. 2. Get the previous physical register number for the destination register. 3. Allocate a free physical register and modify the map table by indexing it with the destination register identifier. 4. When the instruction commits, return the previous physical register to the free pool. [Figure: the map table and the physical registers.]
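The four renaming steps above can be sketched as a small Python class. The map-table representation, the register naming convention (`rN` architectural, `pN` physical), and the FIFO free list are assumptions made here for illustration, not details from the lecture.

```python
class Renamer:
    def __init__(self, n_arch, n_phys):
        # Map table: architectural register -> current physical register.
        # Initially rI maps to pI; the remaining physicals are free.
        self.map = {f"r{i}": f"p{i}" for i in range(n_arch)}
        self.free = [f"p{i}" for i in range(n_arch, n_phys)]

    def rename(self, dest, srcs):
        # Steps 1-2: read the sources through the map table and remember
        # the previous mapping of the destination.
        phys_srcs = [self.map[s] for s in srcs]
        prev = self.map[dest]
        # Step 3: allocate a free physical register for the destination.
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        new = self.free.pop(0)
        self.map[dest] = new
        return new, phys_srcs, prev

    def commit(self, prev):
        # Step 4: at commit, no one can still need the previous mapping,
        # so the old physical register is recycled.
        self.free.append(prev)
```

Note that a later instruction reading the renamed destination sees the *new* physical register through the map table, which is exactly how WAR and WAW hazards disappear.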

Renaming Registers [Figure sequence: a worked example stepping a short code sequence through renaming, one instruction per slide. Each slide shows the map table, the original code sequence, the renamed code sequence, and a Previous Dest column recording the prior physical mapping of each destination register; the final slide shows the previous physical register being returned to the free pool when an instruction retires. The register numbers in the source transcript are garbled.]

Limits to ILP Assumptions for the ideal/perfect machine to start: 1. Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided. 2. Branch prediction: perfect; no mispredictions. 3. Jump prediction: all jumps perfectly predicted. Together, 2 and 3 give a machine with perfect speculation and an unbounded buffer of instructions available. 4. Memory-address alias analysis: addresses are known, and a load can be moved before a store provided the addresses are not equal. Also: one-cycle latency for all instructions and an unlimited number of instructions issued per clock cycle.

Upper Limit to ILP: Ideal Machine [Figure: IPC achieved by the ideal machine on gcc, espresso, li, fpppp, doducd, and tomcatv. The bar values and the integer/FP ranges are garbled in the source transcript, but the FP programs achieve considerably higher IPC than the integer programs.]

More Realistic HW: Branch Impact [Figure: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv after changing from an infinite instruction window to a finite window and a finite issue width per clock cycle (the exact sizes are garbled in the source), under five branch-prediction schemes: Perfect, Selective predictor, Standard 2-bit, Static, None.]

More Realistic HW: Register Impact [Figure: IPC for the six programs with a finite instruction window, finite issue width, and 8K-entry 2-level branch prediction, as the number of renaming registers is varied from Infinite down to None (the intermediate register counts are garbled in the source).]

More Realistic HW: Alias Impact [Figure: IPC for the six programs (the FP programs are Fortran, no heap) with a finite instruction window, finite issue width, 8K-entry 2-level branch prediction, and a finite number of renaming registers, under four memory-alias-analysis models: Perfect, Global/stack perfect, Inspection, None.]

Realistic HW for '9X: Window Impact [Figure: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv with perfect memory disambiguation in hardware, a large selective branch predictor, a return-address stack, a finite number of renaming registers, and issue width equal to the window size, as the instruction window is varied from Infinite down to small sizes (the exact window sizes are garbled in the source).]