Portland State University ECE 587/687. Superscalar Issue Logic

Portland State University ECE 587/687: Superscalar Issue Logic
Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary, 2017

Instruction Issue Logic (Sohi & Vajapeyam, 1987)
After instructions are fetched, decoded and renamed, they are placed in instruction buffers where they wait until issue.
An instruction can be issued when its input operands are ready and there is a functional unit available.

Issue Logic Operation
All out-of-order issue methods must handle the same basic steps:
Identify all instructions that are ready to issue
Select among ready instructions to issue as many as possible
Issue the selected instructions, e.g., pass operands and other information to the functional units
Reclaim the instruction window storage used by the now-issued instructions
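As a point of reference, these four steps can be sketched as a single scheduler cycle. This is a minimal sketch under assumed data structures (window entries with ready, age, op, and operands fields, plus a free functional unit count and a send_to_fu stub); none of these names come from the papers.

```python
# Minimal sketch of one out-of-order issue cycle (illustrative only; the
# entry fields and the send_to_fu stub are assumptions, not from the papers).

def send_to_fu(op, operands):
    # Stand-in for handing an instruction and its operands to a functional unit.
    print("issuing", op, operands)

def issue_cycle(window, free_fus):
    # 1. Identify all instructions that are ready to issue.
    ready = [entry for entry in window if entry["ready"]]
    # 2. Select among ready instructions, e.g. oldest first, limited by free FUs.
    selected = sorted(ready, key=lambda e: e["age"])[:free_fus]
    # 3. Issue: pass operands and other information to the functional units.
    for entry in selected:
        send_to_fu(entry["op"], entry["operands"])
    # 4. Reclaim the instruction window storage used by the issued instructions.
    for entry in selected:
        window.remove(entry)
    return len(selected)
```

For example, issue_cycle([{"ready": True, "age": 3, "op": "add", "operands": (1, 2)}], free_fus=2) issues the one ready instruction and removes it from the window.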

Methods of Organizing Instruction Issue Buffers
Single shared queue (only for in-order issue)
Multiple queues, one per instruction type
Multiple reservation stations, one per instruction type
Single central reservation station buffer
Reference: Smith & Sohi, Proc. IEEE 1995, The Microarchitecture of Superscalar Processors

Tomasulo's Algorithm
Based on a technique used in the IBM 360/91 floating point execution unit
Dispatch: an instruction is read from the instruction queue and sent to an empty reservation station
The operands, or tags if the operands are not yet available, are read from the register file and written into the operand fields of the reservation station
Tags identify the functional unit that will write a register

Tomasulo's Algorithm (Cont.)
Execute: if both operands are available, execute the instruction
If one or more of the operands is not available, monitor the result writeback bus until the operand is computed and broadcast
Result writeback: when a result is computed, broadcast it with its tag on the writeback bus
If the broadcast tag matches the tag held by a destination register, the result is written into the register file, and all reservation stations waiting on that tag also grab it
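The tag-matching behavior described on the last two slides can be made concrete with a small sketch. This is a simplified model, not the IBM 360/91 design: the reservation station fields (vj/vk for values, qj/qk for producer tags) follow common textbook notation, and the register file is modeled as two dictionaries, one for values and one for pending tags.

```python
# Simplified Tomasulo-style tag tracking (illustrative data structures only).

class ReservationStation:
    def __init__(self):
        self.vj = self.vk = None   # operand values, once available
        self.qj = self.qk = None   # tags of the producers, while values are pending

    def ready(self):
        # Execute: the instruction may execute once both operands are available.
        return self.qj is None and self.qk is None

def dispatch(rs, src1, src2, reg_value, reg_tag):
    # Dispatch: copy each operand's value, or its tag if the value is not
    # yet available, from the register file into the reservation station.
    rs.vj, rs.qj = (reg_value[src1], None) if reg_tag[src1] is None else (None, reg_tag[src1])
    rs.vk, rs.qk = (reg_value[src2], None) if reg_tag[src2] is None else (None, reg_tag[src2])

def writeback(tag, result, stations, reg_value, reg_tag):
    # Result writeback: broadcast (tag, result) on the common data bus.
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = result, None
        if rs.qk == tag:
            rs.vk, rs.qk = result, None
    for reg, pending in reg_tag.items():
        if pending == tag:          # destination register still waiting on this tag
            reg_value[reg], reg_tag[reg] = result, None
```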

[Figure: Structure of the FP unit using Tomasulo's Algorithm, showing the instruction queue, load and store buffers, FP registers, reservation stations, FP adders and FP multipliers, connected by the instruction bus, operand buses, and the common data bus.]

Central Window Example
The Register Update Unit (RUU) combines the reservation stations and the reorder buffer into one unit
Allocation and removal are done in order
Instructions may stay in reservation stations longer than necessary
See Figures 2, 3 and 4 in the Sohi & Vajapeyam paper

Central Window (Cont.)
Note the continuing importance of the reorder buffer's ability to maintain program order:
Exception recovery
Branch misprediction recovery
Memory accesses
Central window simplification

Complexity-Effective Superscalar Processors (Palacharla et al., 1997)
Tradeoff between complexity and speed
The pipeline consists of many stages with different functions (paper Figure 1)
Clock speed is determined by the slowest, most complex stage, unless that stage is split or its work is redistributed to other stages (discuss overhead)
A good design is balanced: no single stage is the only bottleneck
To achieve high IPC, we need a larger instruction window, which increases complexity
Remember the execution time equation
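The execution time equation referred to above is the standard processor performance equation:

Execution time = Instruction count × CPI × Clock cycle time

A larger instruction window tends to lower CPI by exposing more parallelism, but if the added complexity stretches the critical pipeline stage it lengthens the clock cycle, which is exactly the complexity vs. speed tension the slide highlights.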

Structures with High Complexity
Structures that get more complex with larger issue width or a larger instruction window:
Register rename logic: translates logical registers to physical registers
Wakeup logic: wakes up instructions waiting for their source operands
Selection logic: selects instructions for execution from the pool of ready instructions
Bypass logic: bypasses operand values from instructions that have finished execution but not yet written back to subsequent instructions

Register Rename Logic
Translates logical to physical registers by accessing a map table indexed by logical register
The map table is multi-ported: multiple instructions (each with one or more register operands) need renaming every cycle
Block diagram: paper Figure 2
Dependence check logic is needed:
Detects whether a logical register being renamed is written by an earlier instruction in the group currently being renamed
Sets up output MUXes to select the appropriate physical registers

Register Rename Logic Implementations
RAM scheme (e.g., MIPS R10000): the logical register indexes the table, and the corresponding entry contains the current physical register mapping; number of entries = number of logical registers
CAM scheme (e.g., DEC 21264): number of entries = number of physical registers; each entry contains the logical register currently mapped to that physical register, plus a valid bit; renaming is done by CAM matching on the logical register field
Dependence check logic runs in parallel and has a smaller delay, so its latency can be hidden behind the map table access
Rename delay vs. issue width: paper Figure 3
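To illustrate the RAM scheme together with the intra-group dependence check, here is a minimal software sketch. The class and field names (Renamer, map_table, free_list, rename_group) are assumptions for illustration, not the paper's circuit; the point is that later instructions in the same rename group must pick up mappings created earlier in the group rather than stale map-table entries.

```python
# Minimal sketch of RAM-scheme register renaming with an intra-group
# dependence check (illustrative model, not the hardware structure).

class Renamer:
    def __init__(self, num_logical, num_physical):
        # Map table: one entry per logical register (RAM scheme).
        self.map_table = list(range(num_logical))
        # Free list of the remaining physical registers.
        self.free_list = list(range(num_logical, num_physical))

    def rename_group(self, group):
        """group: list of (dest, src1, src2) logical register numbers."""
        renamed = []
        new_maps = {}  # mappings created earlier within this rename group
        for dest, src1, src2 in group:
            # Dependence check: prefer a mapping written earlier in the group,
            # which models the output MUXes selecting in-group physical registers.
            p1 = new_maps.get(src1, self.map_table[src1])
            p2 = new_maps.get(src2, self.map_table[src2])
            pdest = self.free_list.pop(0)      # allocate a new physical register
            new_maps[dest] = pdest
            renamed.append((pdest, p1, p2))
        # Commit the group's new mappings to the map table.
        for logical, physical in new_maps.items():
            self.map_table[logical] = physical
        return renamed
```

For example, with Renamer(32, 64), renaming the group [(1, 2, 3), (4, 1, 5)] allocates physical register 32 for logical r1, and the second instruction's source r1 correctly reads 32 from the in-group mapping rather than the old map table entry.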

Wakeup Logic
Responsible for updating source dependences for instructions in the issue window waiting for source operands to become available (paper Figure 4)
Operation:
When a result is produced, its result tag is broadcast to all instructions in the issue window
Each instruction compares the tag to its source operand tags
On a match, the operand is marked as available by setting the appropriate flag (rdyl or rdyr)
Once both operands of an instruction are available, the ready flag is set (the instruction is ready to execute)
The issue window is a CAM: buffers drive the result tags, and each CAM entry has 2 × IW comparators to compare each of the result tags against the two operand tags of the entry
Wakeup logic delay: paper Figures 5 and 6
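A small sketch of the wakeup operation follows. The entry fields mirror the slide's rdyl/rdyr flags (spelled rdy_l/rdy_r here), and the nested loop corresponds to the 2 × IW comparators per entry, one comparison per (result tag, operand tag) pair. This is an illustrative model, not the paper's CAM circuit.

```python
# Minimal sketch of CAM-style wakeup (illustrative; field names assumed).

class WindowEntry:
    def __init__(self, tag_l, tag_r, rdy_l=False, rdy_r=False):
        self.tag_l, self.tag_r = tag_l, tag_r    # source operand tags
        self.rdy_l, self.rdy_r = rdy_l, rdy_r    # operand-ready flags (rdyl/rdyr)

    def ready(self):
        # Instruction is ready once both source operands are available.
        return self.rdy_l and self.rdy_r

def wakeup(window, result_tags):
    """Broadcast this cycle's result tags to every entry in the window.

    In hardware, each entry has 2 x IW comparators (IW = issue width):
    every result tag is compared against both operand tags in parallel.
    """
    for entry in window:
        for tag in result_tags:          # one comparator per (tag, operand) pair
            if tag == entry.tag_l:
                entry.rdy_l = True
            if tag == entry.tag_r:
                entry.rdy_r = True
    return [i for i, e in enumerate(window) if e.ready()]
```

The list returned by wakeup is the set of entries whose ready flags are now set, which is what drives the request signals into the selection logic on the next slide.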

Selection Logic
Responsible for choosing instructions for execution from the pool of ready instructions
Inputs: request signals, one per instruction, set when the wakeup logic detects that all of the instruction's operands are ready
Outputs: grant signals, one per request signal, allowing the instruction to be issued to the functional unit
The selection policy decides which requesting instruction to grant (e.g., oldest first)
Structure: a tree of arbiters that works in two phases (Figure 7)
First phase: request signals propagate up the tree; the root detects whether any instruction is ready and grants the functional unit to one of its children
Second phase: the grant signal propagates down the tree to the selected instruction
Selection delay: Figure 8
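The two-phase arbiter tree can be modeled recursively, as in the sketch below. A real arbiter propagates request signals up and a grant down in hardware; this software version collapses the two phases into one recursive call and uses a fixed leftmost-first priority as a stand-in for an oldest-first policy. All names are illustrative.

```python
# Minimal sketch of a tree of arbiters for selection (illustrative model).

def select(requests, lo=0, hi=None):
    """Return the index of the granted request, or None if none are set.

    requests: list of booleans, one request signal per issue-window entry.
    """
    if hi is None:
        hi = len(requests)
    if hi - lo == 0:
        return None
    if hi - lo == 1:                       # leaf arbiter cell
        return lo if requests[lo] else None
    mid = (lo + hi) // 2
    # Phase 1 (conceptually): each subtree reports whether any entry requests.
    left = select(requests, lo, mid)
    right = select(requests, mid, hi)
    # Phase 2: the grant is steered down to exactly one child (left has priority).
    return left if left is not None else right

# Example: entries 2 and 5 request; the arbiter grants entry 2.
print(select([False, False, True, False, False, True]))   # -> 2
```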

Data Bypass Logic
Responsible for forwarding results from completing instructions to dependent instructions, bypassing the register file
The number of bypass paths depends on pipeline depth and issue width
Assuming 2-input functional units, issue width IW, and S pipe stages after the first result-producing stage, we need 2 × IW² × S paths
Two components of data bypass logic: datapath and control logic (paper Figure 9)
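As a quick check on the path-count formula, with numbers chosen purely for illustration: a 4-wide machine (IW = 4) with two pipe stages after the first result-producing stage (S = 2) needs 2 × 4² × 2 = 64 bypass paths, which is why wide, deep machines make full bypassing expensive.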

Data Bypass Logic (Cont.)
Datapath: result busses broadcast bypass values from each functional unit source to all possible destinations; buffers drive the bypass values onto the result busses
Control logic: controls the operand MUXes; compares the tags of result values with the tags of the source values required at each functional unit, and sets the appropriate MUX control signals on a match
Delay is dominated by the datapath, not the control logic
Bypass delays: paper Table 1
Summary of delay results: paper Table 2

Reading Assignment
G. Z. Chrysos and J. S. Emer, Memory Dependence Prediction Using Store Sets, ISCA 1998 (Review)