Portland State University ECE 587/687. The Microarchitecture of Superscalar Processors


Portland State University ECE 587/687: The Microarchitecture of Superscalar Processors
Copyright by Alaa Alameldeen and Haitham Akkary, 2011

Program Representation
- An application is written as a program, typically using a high-level language.
- The program is compiled into static machine code (a binary); the sequencing model is implicit in the program.
- The sequence of executed instructions forms a dynamic instruction stream.
- The address of the next dynamic instruction is either the incremented program counter or the target of a taken branch.

Portland State University ECE 587/687, Spring 2011

Sequential Execution Model
- Inherent in instruction sets and program binaries.
- Led to the concept of precise architectural state: interrupt and restart, exceptions, branch mispredictions.
- Out-of-order issue deviates from sequential execution, but we still need to maintain binary compatibility and retain the appearance of sequential execution.

Dependences and Parallel Execution
- To execute more instructions in parallel, control dependences need to be addressed: the program counter (PC) and branches.
- To overcome the PC dependence, one can view the program as a collection of basic blocks separated by branches.
- On average, however, a basic block contains only a limited number of instructions that can execute in parallel.

Dependences and Parallel Execution (cont.)
- Instructions have to be serialized according to true data dependences.
- A true dependence appears as a read-after-write (RAW) sequence.
- Ideally, we should eliminate output dependences and anti-dependences:
  - An output dependence appears as a write-after-write (WAW) sequence.
  - An anti-dependence appears as a write-after-read (WAR) sequence.
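The three dependence types can be sketched as a simple check over the source and destination registers of an instruction pair (an illustrative helper, not from the slides; instruction encoding is made up):

```python
def classify(first, second):
    """Return the dependences of `second` on an earlier instruction `first`.
    Each instruction is a dict with 'srcs' (set of register names) and
    'dst' (register name or None)."""
    deps = set()
    if first["dst"] is not None:
        if first["dst"] in second["srcs"]:
            deps.add("RAW")              # true dependence: read after write
        if first["dst"] == second["dst"]:
            deps.add("WAW")              # output dependence: write after write
    if second["dst"] is not None and second["dst"] in first["srcs"]:
        deps.add("WAR")                  # anti-dependence: write after read
    return deps

# r1 = r2 + r3 ; r4 = r1 + r1  -> RAW on r1
i1 = {"srcs": {"r2", "r3"}, "dst": "r1"}
i2 = {"srcs": {"r1"}, "dst": "r4"}
print(classify(i1, i2))  # → {'RAW'}
```

Only the RAW case forces serialization; the WAW and WAR cases are name conflicts that renaming (discussed later) can remove.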

Elements of Superscalar Processing
- Fetch: strategies for fetching multiple instructions every cycle, supported by:
  - Predicting branch outcomes
  - Fetching beyond conditional branch instructions, well before the branches are executed
- Decode: methods for determining true register dependences and eliminating artificial dependences:
  - Register renaming
  - Mechanisms to communicate register values during execution
- Issue/Dispatch: methods for issuing multiple instructions in parallel, based upon availability of inputs rather than program order

Elements of Superscalar Processing (cont.)
- Execution: parallel execution resources:
  - Multiple pipelined functional units
  - Memory hierarchies capable of simultaneously serving multiple memory requests
- Memory: methods for communicating data through memory via load and store instructions, potentially issued out of order; memory interfaces have to allow for the dynamic and often unpredictable behavior of memory hierarchies
- Commit: methods for committing architectural state in order, to maintain an outward appearance of sequential execution

Typical Superscalar Microarchitecture
- Fig. 3 (paper): parallel execution model
- Fig. 4 (paper): microarchitecture, or hardware organization, of a typical superscalar processor

Instruction Fetch
- Read instructions from the instruction cache and write them to a queue (the instruction buffer in Fig. 4).
- The number of instructions fetched per cycle should at least match the peak decode rate (why?).
- The fetcher must be told the address of the next block of instructions to fetch.
- An instruction cache is usually organized as lines of several instructions; a cache line starts on a fixed boundary, regardless of which instruction is needed from the line.
- Question: what are the pros and cons of having separate I- and D-caches?

Instruction Fetch (cont.)
Calculating the next address to fetch:
- Non-branch instructions: the PC is incremented by the number of bytes in the current instruction; this can require fetching the next cache block.
- Branch instructions: the fetch unit has to recognize the branch, determine its outcome (taken or not taken), and compute the branch target address; it then fetches the next block using either the next sequential address or the branch target address.
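The next-address calculation above can be sketched in a few lines (parameter names are mine, and real fetch units operate on whole cache blocks rather than single instructions):

```python
def next_fetch_addr(pc, instr_bytes, is_branch=False,
                    predicted_taken=False, target=None):
    """Next fetch address for one instruction.
    Branches that are (predicted) taken redirect fetch to the target;
    everything else falls through to the incremented PC."""
    if is_branch and predicted_taken:
        return target
    # Fall-through: increment the PC; the result may land in the next
    # cache block, which then also has to be fetched.
    return pc + instr_bytes

print(hex(next_fetch_addr(0x1000, 4)))                      # 0x1004
print(hex(next_fetch_addr(0x1000, 4, True, True, 0x2000)))  # 0x2000
```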

Instruction Fetch (cont.)
- Branch prediction is used to avoid waiting for the branch execution to complete:
  - The target comes from a branch target buffer (BTB).
  - The outcome comes from static prediction based on branch type or profile (or even compiler hints), or from dynamic prediction based on the results of previous branches.
- If a branch is mispredicted, we must be able to undo the work and fetch the correct instructions; this incurs a significant misprediction penalty.
- Branch prediction is discussed in more detail later in the course.
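The slides don't name a specific dynamic scheme; one classic example of "prediction based on the results of previous branches" is a table of 2-bit saturating counters indexed by low PC bits, sketched below (table size and initialization are illustrative):

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not-taken,
    2-3 predict taken. A single misprediction does not flip a strongly
    biased counter, which tolerates occasional anomalies (e.g. loop exits)."""
    def __init__(self, entries=1024):
        self.table = [2] * entries      # start weakly taken
        self.mask = entries - 1          # entries assumed to be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

Used per branch: call `predict` at fetch, then `update` once the branch resolves.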

Instruction Fetch (cont.)
- Transferring control to the target address on a taken branch could cause pipeline bubbles. Mitigations: stockpile instructions in the instruction queue, keep the next address in the cache block, or use delayed branches.
- The instruction queue helps:
  - Smooth fetch irregularities caused by cache misses
  - Sustain fetch bandwidth in cycles when fewer than the maximum number of instructions can be fetched

Instruction Fetch (cont.)
- Superscalar machines pay a penalty for instruction misalignment: branches and their targets don't always fall on cache line boundaries.
- Fetched instructions that are not executed waste fetch bandwidth; this is sometimes called instruction cache fragmentation due to branches.

ICache Fragmentation (figure): cache lines at addresses X, X+32, X+64, ..., X+192; a branch (BR) in line X+32 targets X+188, so the instructions fetched after the branch in its line, and those before X+188 in the target's line, are discarded.
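The fetch-bandwidth waste from branch-induced fragmentation can be estimated with a small helper (my own sketch; it assumes fixed-size instructions and aligned cache lines):

```python
def wasted_slots(line_bytes, instr_bytes, branch_offset, target_offset):
    """Fetch slots discarded for one taken branch: instructions after the
    branch in its own cache line, plus instructions before the target in
    the target's cache line. Offsets are byte addresses."""
    per_line = line_bytes // instr_bytes
    after_branch = per_line - 1 - (branch_offset % line_bytes) // instr_bytes
    before_target = (target_offset % line_bytes) // instr_bytes
    return after_branch + before_target

# 32-byte lines, 4-byte instructions: a branch 12 bytes into its line,
# targeting an address 28 bytes into another line, wastes 4 + 7 slots.
print(wasted_slots(32, 4, 12, 28))  # → 11
```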

Instruction Fetch (cont.)
- Cache fragmentation caused by branches places a severe limit on very wide superscalars.
- Sequential runs of instructions are easy to fetch, but the average sequential run length is only ~6 for general integer programs; the distribution is very broad, with a few long runs raising the average.
- How many decode cycles are needed for 6 fetched instructions? 3 decode cycles for a two-instruction decoder, but only 1.5 decode cycles for a four-instruction decoder.
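The decode-cycle arithmetic above is just the run length divided by the decoder width (a one-line illustrative helper):

```python
def decode_cycles(run_length, decoder_width):
    """Average decode cycles consumed by a sequential run of instructions."""
    return run_length / decoder_width

print(decode_cycles(6, 2))   # → 3.0 cycles for a two-instruction decoder
print(decode_cycles(6, 4))   # → 1.5 cycles for a four-instruction decoder
```

The fractional result for the four-wide decoder is why the fetcher must merge instructions across cache lines to keep such a decoder busy, as the next slide explains.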

Instruction Fetch (cont.)
- Given enough fetch bandwidth, the fetcher can realign or merge instructions from multiple lines to make more efficient use of the decoder.
- For a branch target in the middle of a cache line, the fetcher combines that cache line with the one following it.
- Decoder "lines" are then not aligned with cache lines, which makes it harder to find the program counter associated with a given instruction.
- The trace cache is discussed in more detail later in the course.

Instruction Decode
- Instructions are removed from the instruction queue.
- An execution tuple is formed for each decoded instruction, containing:
  - Opcode: the operation to be executed
  - Sources: the identities of the storage elements where the inputs reside
  - Destination: the identity of the storage element where the result must be placed

Instruction Decode (cont.)
- In the static program, the input and output identifiers represent storage locations in either the logical register file or memory.
- To overcome WAR and WAW hazards, register renaming maps the logical register identifiers to physical storage locations.
- Allocation logic assigns each instruction physical storage for its result, as well as entries in all required instruction buffers.

Instruction Rename
- The decoder looks at one or more instructions and releases them to scheduling stations after renaming.
- Register values created by an instruction are assigned physical locations, which are recorded in a map table; the map table has as many entries as there are logical registers.
- Source register mappings are read from the map table and attached to the instruction.
- Renaming happens sequentially, so a map table bypass is sometimes necessary.
- Subsequent pipeline stages use the mappings attached to an instruction tuple to read and write the physical locations of register values.
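The sequential renaming described above can be sketched as follows. This is a minimal illustration with made-up register names (8 logical, 16 physical); a real renamer also handles free-list exhaustion, bypassing within a rename group, and checkpointing:

```python
class Renamer:
    """Logical registers r0..r7 start mapped to p0..p7; each destination
    takes a fresh physical register from the free pool."""
    def __init__(self, num_logical=8, num_physical=16):
        self.map = {f"r{i}": f"p{i}" for i in range(num_logical)}
        self.free = [f"p{i}" for i in range(num_logical, num_physical)]

    def rename(self, srcs, dst):
        # Sources read the current mapping; because renaming is sequential,
        # an earlier instruction's destination mapping is visible here.
        phys_srcs = [self.map[s] for s in srcs]
        phys_dst = self.free.pop(0)      # allocate from the free pool
        self.map[dst] = phys_dst         # later readers see the new name
        return phys_srcs, phys_dst

r = Renamer()
print(r.rename(["r2", "r3"], "r1"))  # → (['p2', 'p3'], 'p8')
print(r.rename(["r1"], "r4"))        # → (['p8'], 'p9')   RAW preserved
print(r.rename(["r5"], "r1"))        # → (['p5'], 'p10')  WAW on r1 removed
```

Note how the second writer of r1 gets a different physical register (p10), eliminating the WAW hazard, while the reader of r1 in between still sees p8, preserving the true dependence.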

Rename Map Table (figure): logical source registers index the register map table to obtain physical source registers; logical destination registers are assigned physical destination registers allocated from the free pool.

Renaming Methods
Two methods are commonly used:
- Renaming with a physical register file larger than the logical register file
- Renaming using a reorder buffer (ROB) and a physical register file equal in size to the number of logical registers

Renaming with a Physical Register File (paper Fig. 5)
- A free list of unused physical registers is kept; new register results are assigned physical registers from the free list.
- A physical register is reclaimed into the free list when its usage count is 0 and its logical register has been renamed to another physical register, i.e., when a subsequent instruction writing to the same logical register is committed.
- The register map table is checkpointed at conditional branches (why?).

Freeing Physical Registers at Retirement (example)
- I1: R5 renamed to P3
- ...
- I4: R5 renamed to P5 (P3 is freed when I4 retires)
- ...
- I7: R5 maps to P5

Renaming with a Reorder Buffer
- Physical registers are allocated sequentially in the reorder buffer.
- Physical registers are freed, and their values copied to the register file, at retirement.
- The mapping table maps logical registers to entries in either the reorder buffer or the register file (paper Figs. 6 and 7).
- Branch handling options: map table checkpoints, or resuming renaming from the correct path after the mispredicted branch has retired.
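A minimal sketch of ROB-based renaming, under the scheme above (method names and the "WAIT" convention are mine; a real ROB is a fixed-size circular buffer and also tracks exceptions):

```python
from collections import deque

class ROBRename:
    """The map table points at either the register file or an in-flight ROB
    entry; at retirement the value is copied to the register file and the
    ROB entry is freed."""
    def __init__(self):
        self.regfile = {}
        self.rob = deque()          # entries allocated sequentially
        self.map = {}               # logical reg -> ROB entry, if in flight

    def dispatch(self, dst):
        entry = {"dst": dst, "value": None, "done": False}
        self.rob.append(entry)
        self.map[dst] = entry       # later readers find the ROB entry
        return entry

    def read(self, src):
        if src in self.map:         # value is (or will be) in the ROB
            e = self.map[src]
            return e["value"] if e["done"] else "WAIT"
        return self.regfile.get(src)

    def retire(self):
        e = self.rob.popleft()      # in-order retirement
        self.regfile[e["dst"]] = e["value"]
        if self.map.get(e["dst"]) is e:
            del self.map[e["dst"]]  # mapping now points to the register file
```

Execution would fill in `entry["value"]` and set `entry["done"]` at writeback; readers then get the value from the ROB before retirement and from the register file after.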

Instruction Issue
- After instructions are fetched, decoded, and renamed, they are placed in instruction buffers where they wait until issue.
- An instruction can be issued when its input operands are ready and a functional unit is available.
- Paper Fig. 8 is an example of a parallel execution schedule.

Instruction Issue (cont.)
All out-of-order issue methods must handle the same basic steps:
- Identify all instructions that are ready to issue
- Select among the ready instructions to issue as many as possible
- Issue the selected instructions, e.g., pass operands and other information to the functional units
- Reclaim the instruction window storage used by the now-issued instructions
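The four steps above can be sketched as one issue cycle over a window (illustrative only: entries and field names are made up, and scanning in list order approximates oldest-first selection, which real select logic implements in hardware):

```python
def issue_cycle(window, ready_regs, free_units):
    """One issue cycle. window: list of entries, oldest first, each a dict
    with 'op', 'srcs' (set of physical registers), and 'unit' (FU type).
    ready_regs: set of registers whose values are available.
    free_units: dict mapping FU type -> count available this cycle."""
    issued, remaining = [], []
    for entry in window:
        ready = entry["srcs"] <= ready_regs              # 1. identify ready
        if ready and free_units.get(entry["unit"], 0) > 0:
            free_units[entry["unit"]] -= 1               # 2. select (arbitrate FU)
            issued.append(entry)                         # 3. issue to the unit
        else:
            remaining.append(entry)
    window[:] = remaining                                # 4. reclaim window slots
    return issued
```

With one free ALU, only the oldest ready ALU op issues; a younger ready op stays in the window even though its operands are available, illustrating the difference between being ready and being selected.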

Methods of Organizing Instruction Issue Buffers
- A single shared queue (only for in-order issue)
- Multiple queues, one per instruction type
- Multiple reservation stations, one per instruction type (Fig. 10 shows a typical reservation station)
- A single central reservation station buffer

Multiple Queues
- Requiring instructions to be issued in order at a functional unit greatly simplifies the identification and selection logic.
- Instructions from different queues could still be allowed to issue out of order with respect to each other.

Reservation Stations
- Benefits:
  - Logic to identify and select ready instructions is simpler, since it need only consider a few locations.
  - Storage can be optimized for each type of functional unit; e.g., stores need not have storage for two source operands.
- Drawback:
  - Storage is statically allocated to functional units, which can result in either wasted storage or a resource bottleneck for some programs.

Central Window
- Benefits:
  - Only one copy of the identification and selection logic
  - Only one copy of the storage reclamation logic
  - Dynamically allocated storage
- Drawbacks:
  - Complex identification and selection logic
  - Complex storage reclamation logic
  - Each storage location must be as big as the largest instruction
  - Functional unit arbitration must be handled

Memory Ordering
- Stores consist of address and data uops; store addresses are buffered in a queue.
- Store addresses remain buffered until the store data is available and the store instruction is committed in the reorder buffer.
- New load addresses are checked against the waiting store addresses; if there is a match, either the load waits or the store data is bypassed to the matching load.
- Fig. 11 shows typical memory ordering logic.
- Memory ordering is discussed later in the course; more details in ECE 588/688.
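The load-versus-store-address check can be sketched with a small store queue (an illustrative model, not the slides' design: the class, the "WAIT" convention, and the word-granularity address match are my simplifications):

```python
class StoreQueue:
    """Buffered store addresses (with data once available); loads check the
    queue before reading memory."""
    def __init__(self, memory):
        self.memory = memory        # dict: address -> value
        self.entries = []           # program order: [addr, data or None]

    def store(self, addr, data=None):
        # The address uop may execute before the data uop; data=None models
        # a store whose address is known but whose data is not yet available.
        self.entries.append([addr, data])

    def load(self, addr):
        # Search youngest-to-oldest for a matching buffered store address.
        for entry_addr, data in reversed(self.entries):
            if entry_addr == addr:
                if data is None:
                    return "WAIT"           # match, but data not yet available
                return data                 # bypass store data to the load
        return self.memory.get(addr, 0)     # no match: read memory

    def commit_oldest(self):
        addr, data = self.entries.pop(0)    # store commits: write memory
        self.memory[addr] = data
```

A load to an address with a buffered store either gets the bypassed data or waits; only committed stores actually update memory, which keeps memory state precise across mispredictions and exceptions.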

Commit (Retire)
- Implements the appearance of sequential execution.
- Recovering a precise state requires maintaining both the state needed for recovery and the state being updated.
- Recovery options: history buffer, future file.
- Precise interrupts are discussed later in the course.

Reading Assignments
Monday:
- T.-Y. Yeh and Y. N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," ISCA 1992. (Review)
- S. McFarling, "Combining Branch Predictors," Technical Report TN-36, Digital Western Research Laboratory, June 1993. (Read)