Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Similar documents
Dynamic Scheduling. CSE471 Susan Eggers 1

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Hardware-based Speculation

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

EN164: Design of Computing Systems Lecture 24: Processor / ILP 5

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

Hardware-Based Speculation

Static vs. Dynamic Scheduling

Handout 2 ILP: Part B

Multiple Instruction Issue and Hardware Based Speculation

Hardware-based Speculation

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Superscalar Processors

Case Study IBM PowerPC 620

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Scoreboard information (3 tables) Four stages of scoreboard control

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors

Lecture-13 (ROB and Multi-threading) CS422-Spring

Advanced Computer Architecture

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Instruction Level Parallelism

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Superscalar Processors Ch 14

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Superscalar Processing (5) Superscalar Processors Ch 14. New dependency for superscalar case? (8) Output Dependency?

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

Adapted from David Patterson s slides on graduate computer architecture

5008: Computer Architecture

Super Scalar. Kalyan Basu March 21,

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Processor: Superscalars Dynamic Scheduling

EECC551 Exam Review 4 questions out of 6 questions

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

E0-243: Computer Architecture

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Instruction-Level Parallelism and Its Exploitation

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Chapter 4 The Processor 1. Chapter 4D. The Processor

Four Steps of Speculative Tomasulo cycle 0

Tomasulo s Algorithm

The basic structure of a MIPS floating-point unit

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Instruction Level Parallelism

Computer Architecture Spring 2016

Precise Exceptions and Out-of-Order Execution. Samira Khan

Superscalar Processors

Advanced issues in pipelining

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

Superscalar Machines. Characteristics of superscalar processors

EEC 581 Computer Architecture. Lec 7 Instruction Level Parallelism (2.6 Hardware-based Speculation and 2.7 Static Scheduling/VLIW)

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

EEC 581 Computer Architecture. Instruction Level Parallelism (3.6 Hardware-based Speculation and 3.7 Static Scheduling/VLIW)

Last lecture. Some misc. stuff An older real processor Class review/overview.

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

Portland State University ECE 587/687. Superscalar Issue Logic

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

Superscalar Processor

Tutorial 11. Final Exam Review

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

EECS 470 Midterm Exam

Dynamic Control Hazard Avoidance

Out of Order Processing

Computer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013

CS433 Homework 2 (Chapter 3)

CMSC411 Fall 2013 Midterm 2 Solutions

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Science 146. Computer Architecture

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

Limitations of Scalar Pipelines

CS 423 Computer Architecture Spring Lecture 04: A Superscalar Pipeline

Transcription:

Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers) physical register file that is the same size as the architectural registers one-to-one relationship reorder buffer (ROB) (~ R10K active list) provides in-order instruction commit circular queue with head & tail pointers holds 40 executing instructions in program order (dispatched but not yet committed) field for either integer or FP result after it has been computed a result value is put in its register in the RRF after its producing instruction reaches the head of the buffer & is removed (i.e., commits) Fall 2004 CSE 471 1 Reorder Buffer Implementation (Pentium Pro) register alias table (RAT) (~ R10K map table) provides register renaming important because very few GPRs in the x86 architecture indicates whether a source operand of a new instruction points to the reorder buffer or the physical register file do an associative search of ROB destination registers for the new source operands if found, consumer instruction points to the producer instruction in the ROB the data hazard check before instruction dispatch reservation station (~ IBM 360/91 reservation stations, R10000 instruction buffer) holds instructions waiting to execute provides forwarding to reduce RAW hazards result values go back to the reservation station (as well as ROB) so dependent instructions have source operand values provides out-of-order execution Fall 2004 CSE 471 2

Fall 2004 CSE 471 3 Pentium Pro Execution In-order issue decode instructions enter uops into reorder buffer for in-order completion rename registers via register alias table detect structural hazards for reservation station Out-of-order execution one reservation station, multiple entries check source operands for RAW hazards check structural hazards for separate integer, FP, memory units result goes to reservation station & reorder buffer In-order completion this & previous uops have completed write G PR registers rollback on interrupts Fall 2004 CSE 471 4

fetch & decode pipeline BTB access (1 stage) Pentium Pro instruction fetch & align for decoding (2.5 stages) decode & uop generation (2.5 stages) register renaming & instruction issue to reservation stations (3 stages minimum) integer pipeline execute, resolve branch write registers & commit load pipeline address calculation & to memory reorder buffer integrated L1 & L2 data cache access pipelined FP add & multiply Fall 2004 CSE 471 5 Pentium Pro Fall 2004 CSE 471 6

Pentium Pro Fall 2004 CSE 471 7 Pentium Pro Some bandwidth constraints: maximum for one cycle 16 bytes fetched 3 instructions decoded 6 µops to the reorder buffer 3 µops dispatched to reservation station & functional units 1 load & 1 store access to the L1 data cache 1 cache result returned 3 µops committed if good instruction mix right instruction order operands available functional units available load & store to different cache banks all previous instructions already committed Fall 2004 CSE 471 8

Making Tomasulo Precise Reorder buffer contains: instruction type (register operation, store, branch) destination field (register, memory, none) result values ready bit to indicate instruction has executed replaces load & store buffers does renaming (instead of the reservation stations) operand fields in reservation stations point to reorder buffer location results tagged with reorder buffer entry number provides hardware speculation (speculative execution, not just fetch & issue) plus precise interrupts Fall 2004 CSE 471 9 Precise Implementation for Tomasulo Fall 2004 CSE 471 10

Tomasulo s Algorithm: Execution Steps issue & read (assume the instruction has been fetched) structural hazard detection for entry in reservation station & reorder buffer issue if no hazard stall if hazard read registers for source operands can come from either registers or reorder buffer RAT indicates which allocate entry in reorder buffer execute RAW hazard detection for source operands snoop on common data bus for missing operands tag in reservation station is entry number in reorder buffer dispatch to functional unit when obtain both operand values Fall 2004 CSE 471 11 Tomasulo s Algorithm: Execution Steps write result broadcast result & reorder buffer entry (tag) on the common data bus to reservation stations & reorder buffer commit retire the instruction at the head of the reorder buffer update register with result in reorder buffer or do a store remove instruction from reorder buffer if a mispredicted branch, restart with right-path instructions Fall 2004 CSE 471 12

F6, F8, F2 Name Add1 Add2 Add3 Mult1 Mult2 F0 ROB 3 Busy no F2 Precise Tomasulo, in action 1 First load has written its result subd F4 Issue (ROB 1) F6 ROB 6 Execute (F4) (ROB 1) F8 Write Result ROB 3 F10 ROB 5 Precise Tomasulo, in action 2 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5

Precise Tomasulo, in action 3 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5 Precise Tomasulo, in action 4 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5

Precise Tomasulo, in action 5 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5 Precise Tomasulo, in action 6 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5

Precise Tomasulo, in action 7 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5 Precise Tomasulo, in action 8 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5

Precise Tomasulo, in action 9 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5 Precise Tomasulo, in action 10 F6, F8, F2 Add1 Add2 Mult1 Mult2 ROB 3 (F4) ROB 6 ROB 5

Pool of Physical Registers vs. Reorder Buffer Think about the advantages and disadvantages of these implementations clean-up all done at instruction commit (physical register pool) record that value no longer speculative in register busy table unmap previous mapping for the architectural register instruction issue simpler (physical register pool) only look in one place for the source operands (the physical register file) book claims that deallocating register is more complicated with a physical register pool have to search for outstanding uses in the active list but not done in practice: wait until the instruction that redefines the architectural register commits Fall 2004 CSE 471 23 Limits Limits on out-of-order execution amount of ILP in the code scheduling window size need to do associative searches & its effect on cycle time number & types of functional units number of ports to memory Fall 2004 CSE 471 24