Superscalar Processor Design


Superscalar Processor Design: Superscalar Organization
Virendra Singh
Indian Institute of Science, Bangalore
virendra@computer.org
Lecture 26, SE-273: Processor Design

Superscalar Organization
Pipeline flow: Fetch -> Instruction buffer -> Decode -> Dispatch buffer -> Dispatch -> Reservation stations -> Issue -> Execute -> Finish -> Re-order/Completion buffer -> Complete -> Store buffer -> Retire
Apr 14, 2008, SE-273@SERC

Dynamic Execution Core
[Diagram: the dispatch buffer feeds dispatch, which reads the ARF and RRF, allocates ROB entries, and places instructions in reservation stations; after issue and execute, results are forwarded to the reservation stations and the RRF; register write-back occurs at completion. The re-order buffer (ROB) is managed as a queue: "takeoff" at dispatch, "landing" at completion.]

Dynamic Execution Core
For instruction dispatch, three resources must be available: a rename register, a reservation-station entry, and a re-order buffer entry. If any of the three is unavailable, instruction dispatch stalls. Dispatching is done via a complex routing network (less expensive than a full crossbar).
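The dispatch stall rule above can be sketched in a few lines. This is an illustrative model, not the lecture's hardware; the function and parameter names are assumptions.

```python
# Hypothetical sketch of the dispatch stall rule: an instruction
# dispatches only if a rename register, a reservation-station entry,
# and a ROB entry are all free; otherwise dispatch stalls.

def can_dispatch(free_rename_regs, free_rs_entries, free_rob_entries):
    """All three resources must be available, or dispatch stalls."""
    return free_rename_regs > 0 and free_rs_entries > 0 and free_rob_entries > 0

print(can_dispatch(4, 2, 8))   # resources available: dispatch proceeds
print(can_dispatch(0, 2, 8))   # no rename register free: stall
```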

Reservation Station
[Diagram of an RS entry: fields for Busy, operand 1 with its valid bit, operand 2 with its valid bit, and Ready; the entry is filled from the dispatch slots and the forwarding buses, and tag-match logic compares the operand tags against the tag buses.]

Reservation Station
A reservation station can be quite complex to implement. It must support many possible sources, including all the dispatch slots and the forwarding buses, so the data routing network on its input side can be quite complex. While an instruction waits, every RS operand field with a pending operand must continuously compare its tag against potentially multiple tag buses; this portion is known as the wakeup logic.
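The wakeup logic described above can be sketched as follows. This is a minimal behavioral model under assumed field names (the lecture does not specify an implementation); real wakeup is combinational CAM logic, not software.

```python
# Minimal sketch of RS wakeup: each pending operand compares its tag
# against the tags broadcast on the tag buses this cycle; a match
# marks the operand valid, and the entry is ready when both are valid.

class RSEntry:
    def __init__(self, tag1, valid1, tag2, valid2):
        self.tag1, self.valid1 = tag1, valid1
        self.tag2, self.valid2 = tag2, valid2

    def wakeup(self, broadcast_tags):
        """Compare pending operand tags against all tag buses."""
        if not self.valid1 and self.tag1 in broadcast_tags:
            self.valid1 = True
        if not self.valid2 and self.tag2 in broadcast_tags:
            self.valid2 = True

    def ready(self):
        return self.valid1 and self.valid2

entry = RSEntry(tag1="ROB3", valid1=False, tag2="ROB7", valid2=True)
entry.wakeup({"ROB3", "ROB5"})   # ROB3's result is broadcast this cycle
print(entry.ready())             # True: both operands are now valid
```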

Reservation Station
[Diagram: an allocate unit selects a free entry for a dispatching instruction (setting its Busy bit); an issuing unit selects a Ready entry to issue (setting its Issuing bit).]

Re-Order Buffer (ROB)
[Diagram: each ROB entry holds Busy (B), Issued (I), Finished (F), Instruction Address (IA), Rename Register (RR), and Speculative/Valid (S, V) bits. A tail pointer marks the next entry to be allocated; a head pointer marks the next instruction to complete.]
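The queue management implied by the head and tail pointers can be sketched as below: allocate at the tail, retire in program order from the head. A simplified illustration with assumed names, not the lecture's design.

```python
# Sketch of the ROB as a queue: dispatch allocates at the tail;
# completion retires from the head, strictly in program order, even
# when instructions finish execution out of order.

from collections import deque

class ROB:
    def __init__(self, size):
        self.size = size
        self.entries = deque()          # head = leftmost, tail = rightmost

    def allocate(self, inst):
        if len(self.entries) == self.size:
            return False                # ROB full: dispatch stalls
        self.entries.append({"inst": inst, "finished": False})
        return True

    def mark_finished(self, inst):
        for e in self.entries:
            if e["inst"] == inst:
                e["finished"] = True

    def complete(self):
        """Retire from the head only while the oldest entry is finished."""
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["inst"])
        return retired

rob = ROB(4)
for i in ["add", "mul", "load"]:
    rob.allocate(i)
rob.mark_finished("mul")        # finishes out of order...
print(rob.complete())           # []: nothing retires, "add" not done yet
rob.mark_finished("add")
print(rob.complete())           # ['add', 'mul']: retirement is in order
```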

Dynamic Instruction Scheduler
[Diagram, data-captured scheduler: operands are copied from the register file into the scheduling window (reservation station) at dispatch; results are forwarded into the window for wakeup and also update the register file; the window feeds the functional units.]

Dynamic Instruction Scheduler
[Diagram, non-data-captured scheduler: the scheduling window (reservation station) performs only wakeup; operands are read from the register file after issue, and results are forwarded to the functional units and written back to the register file.]

Memory Data Flow Techniques
Memory instructions move data between memory and the register file. Their long latency makes them a bottleneck. Each memory operation involves address generation, address translation, and the data read/write itself.

Memory Data Flow Techniques
Ordering of memory accesses: out-of-order execution of load instructions is a primary source of performance gain. Two techniques exploit it: load bypassing and load forwarding. In both cases, earlier execution of the load instruction is achieved.

Load/Store Processing
[Diagram: the reservation station feeds a store unit and a load unit, each performing address generation and address translation; the load unit then accesses memory, while stores drain through the (finished) store buffer and the (completed) store buffer before their address and data reach the data cache.]

Load Bypassing
Out-of-order execution of load instructions is a primary source of performance gain. In the dynamic instruction sequence .... Store X ..... Store Y .... Load Z, the load is executed ahead of the two stores.


Load Bypassing
[Diagram: as in the load/store pipeline, but the load's address is tag-matched against the addresses in the (finished) store buffer; if there is no match, the load proceeds to the data cache ahead of the stores and updates its destination register.]

Load Forwarding
In the dynamic instruction sequence .... Store X ..... Store Y .... Load X, the store's data is forwarded directly to the load.

Load Forwarding
[Diagram: the load's address is tag-matched against the addresses in the store buffers; on a match, the store's data is forwarded directly to the load's destination register instead of being read from the data cache.]
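The two cases from the preceding slides, bypass on no match and forward on match, can be sketched together. An illustrative model only; the buffer layout and names are assumptions.

```python
# Hedged sketch of load bypassing vs. load forwarding: the load's
# address is checked against pending stores in the store buffer
# (newest first). A match forwards the store's data to the load;
# no match lets the load bypass the stores and read the data cache.

def execute_load(addr, store_buffer, cache):
    """store_buffer: list of (addr, data) pairs, oldest first."""
    for st_addr, st_data in reversed(store_buffer):   # newest matching store wins
        if st_addr == addr:
            return st_data          # load forwarding
    return cache[addr]              # load bypassing: read the data cache

cache = {0x100: 1, 0x200: 2}
pending = [(0x100, 99)]             # an older store to address X = 0x100
print(execute_load(0x200, pending, cache))   # 2: load bypasses the store
print(execute_load(0x100, pending, cache))   # 99: data forwarded from the store
```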

Load/Store Handling
[Diagram: with loads allowed to execute early, the store's address is tag-matched against already-executed loads at store completion. On a match, the aliased load and all trailing instructions are flushed. At completion, the architected registers are updated.]

Advanced Instruction Flow Techniques
[Diagram: a fetch-address mux selects among the FAR, PC+4, and the BTAC prediction to index the I-Cache; the Branch History Table (BHT) and the Branch Target Address Cache (BTAC) are accessed in parallel, each producing a prediction, and both are updated by the branch unit on resolution.]

Advanced Instruction Flow Techniques
The PowerPC 604 used both a BHT and a BTAC. The BTAC is a 64-entry fully associative cache, and the BHT is a 512-entry direct-mapped table. Both are accessed during the fetch stage using the current instruction fetch address in the PC. The BTAC responds in one cycle; the BHT responds in two cycles.

Advanced Instruction Flow Techniques
If a hit occurs in the BTAC, indicating the presence of a branch instruction in the current fetch group, the branch is predicted taken and the target address retrieved from the BTAC is used in the next fetch cycle. The PowerPC 604 fetches four instructions per cycle, so a fetch group can contain multiple branch instructions; the BTAC entry indexed by the fetch address contains the target address of the first branch in the fetch group that is predicted taken. During the second cycle, in the decode stage, the history bits retrieved from the BHT are used to generate a history-based prediction for the same branch.

Advanced Instruction Flow Techniques
If the BHT prediction agrees with the taken prediction made by the BTAC, the earlier prediction is allowed to stand. If the BHT disagrees with the BTAC prediction, the BTAC prediction is annulled and fetching from the fall-through (not-taken) path is initiated: the BHT prediction overrules the BTAC prediction. When the branch is resolved, the BHT is updated; based on the updated history, the BHT then updates the BTAC, either leaving the entry in place if the branch is to be predicted taken next time, or deleting it if the branch is predicted not taken next time.
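The BTAC/BHT interplay above can be sketched as follows. This assumes 2-bit saturating counters in the BHT, a common design but not stated in the lecture; the function names are illustrative.

```python
# Illustrative sketch of the BTAC/BHT override: the BTAC gives a fast
# taken prediction in cycle 1; the slower BHT can annul it in cycle 2,
# redirecting fetch to the fall-through path.

def bht_predict_taken(counter):
    """Assumed 2-bit saturating counter: 2 or 3 means predict taken."""
    return counter >= 2

def fetch_redirect(btac_hit, bht_counter):
    """Which path fetch follows once the BHT prediction is available."""
    if btac_hit and bht_predict_taken(bht_counter):
        return "taken (BTAC target stands)"
    if btac_hit:
        return "fall-through (BHT annuls BTAC prediction)"
    return "fall-through (no BTAC hit)"

print(fetch_redirect(btac_hit=True, bht_counter=3))
print(fetch_redirect(btac_hit=True, bht_counter=1))
```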

Advanced Instruction Flow Techniques
Two-level adaptive branch prediction can potentially achieve better than 95% accuracy and can adapt to a changing dynamic context. Two-level prediction uses a set of history tables, the pattern history tables; the context is determined by a specific pattern of recently executed branches that is recorded as branch history.
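A minimal two-level predictor can be sketched as below: a global branch history register indexes a pattern history table of 2-bit counters (a GAg-style organization, chosen here as an illustration; the lecture does not fix a specific variant).

```python
# Sketch of a two-level adaptive predictor: the global history register
# (GHR) records recent branch outcomes; the pattern history table (PHT)
# of 2-bit saturating counters is indexed by that history pattern.

HIST_BITS = 4

class TwoLevelPredictor:
    def __init__(self):
        self.ghr = 0                               # global history register
        self.pht = [1] * (1 << HIST_BITS)          # counters, weakly not-taken

    def predict(self):
        return self.pht[self.ghr] >= 2             # counter >= 2: predict taken

    def update(self, taken):
        c = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, c + 1) if taken else max(0, c - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << HIST_BITS) - 1)

p = TwoLevelPredictor()
for outcome in [True, False] * 8:                  # alternating T/F pattern
    p.update(outcome)
print(p.predict())   # True: history 1010 has learned the next outcome is taken
```

After training on the alternating pattern, each history pattern maps to its own counter, which is how two-level prediction adapts to context that a single counter per branch cannot capture.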

Advanced Register Data Flow Techniques
An instruction's scheduled execution time is determined by its position in the dataflow graph (DFG), and the height of the DFG is a lower bound on program execution time. True data dependences are therefore a bottleneck. Can we go beyond them?

Value Reuse
Two forms: instruction reuse and value prediction. Value locality captures the empirical observation that a limited set of unique values constitutes the majority of values produced and consumed by real programs, much like the locality exploited by caches. Two techniques exploit value locality: one non-speculative (instruction reuse) and one speculative (value prediction).

Instruction Reuse
Memoization short-circuits a complex computation by dynamically recording its outcome; subsequent instances can obtain the result by table lookup. Instruction reuse (IR) is a hardware implementation of memoization. Value prediction, in contrast, forecasts the full 32/64-bit result and requires a much wider history.

Instruction Reuse
Causes of value locality: the general nature of program implementations, data redundancy, computed branches, base registers in loads/stores, register spill code, and convergent algorithms. All of these are affected by compilers.

Instruction Reuse
The result of executing an individual instruction, or a set of instructions, is stored in a history structure for future use. These sets of instructions can be defined by either control flow or data flow. The history structure must have a mechanism that guarantees its contents remain coherent with subsequent program execution, and a lookup mechanism that allows subsequent instances to check themselves against the stored instances.

Instruction Reuse
A hit, or match, during this lookup triggers the reuse mechanism, which allows the processor to skip execution of the reuse candidates. The processor eliminates the structural and data dependences caused by the reuse candidates and is able to fast-forward to subsequent program instructions.

Instruction Reuse
[Flowchart: Fetch -> is the instruction a reuse candidate? If not (reuse-buffer miss), execute the instruction and record its outcome. If yes, apply the reuse test: on success (preconditions match a prior instance), reuse the prior result; on failure, execute the instruction and record its outcome.]

Instruction Reuse
[Diagram: the reuse buffer is indexed by the PC of the reuse candidate; each entry holds a valid bit, a PC tag, source operands 1 and 2, an address field, and the result. The candidate's source operands, read from the register file, are compared against the stored operands; on a match, the stored result is reused. All stores check for a matching address and invalidate matching entries.]
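The entry fields and the store-invalidation rule above can be sketched as a small table. An illustrative model under assumed names; a real reuse buffer would be a tagged hardware structure.

```python
# Hedged sketch of a reuse buffer: indexed by PC, an entry's result is
# reused only if the current source operands match the stored ones;
# stores invalidate entries whose address matches, keeping the buffer
# coherent with subsequent program execution.

class ReuseBuffer:
    def __init__(self):
        self.table = {}   # pc -> {"ops": (op1, op2), "result": r, "addr": a}

    def lookup(self, pc, ops):
        e = self.table.get(pc)
        if e and e["ops"] == ops:
            return e["result"]      # reuse test succeeds: skip execution
        return None                 # miss or operand mismatch: must execute

    def record(self, pc, ops, result, addr=None):
        self.table[pc] = {"ops": ops, "result": result, "addr": addr}

    def invalidate_on_store(self, store_addr):
        """Stores kill entries for the matching address."""
        self.table = {pc: e for pc, e in self.table.items()
                      if e["addr"] != store_addr}

rb = ReuseBuffer()
rb.record(pc=0x40, ops=(3, 4), result=7)   # e.g. an add of 3 and 4
print(rb.lookup(0x40, (3, 4)))   # 7: operands match, result reused
print(rb.lookup(0x40, (3, 5)))   # None: reuse test fails, must execute
```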

Reuse Mechanism
Reuse candidates (whether individual instructions or groups of instructions) must inject their results into the processor's architectural state. This requires adding a write port to the already heavily ported physical register file. The instruction wakeup and scheduling logic must also be modified to accommodate reused instructions with effectively zero cycles of result latency.

Reuse Mechanism
A reuse candidate must still enter the processor's ROB in order to maintain support for precise exceptions, yet must simultaneously bypass the issue queue or reservation stations; this nonstandard behaviour brings additional control-path complexity. Reused memory instructions must also still be tracked in the processor's load/store queue (LSQ) to maintain correct memory reference ordering. Since LSQ entries are typically updated after instruction issue, based on the address generated during execution, this may entail additional datapath and LSQ ports to allow updates from an earlier pipeline stage.

Value Prediction
[Diagram: a Classification Table (CT), indexed by the PC of the predicted instruction, holds a valid bit and prediction history indicating whether the instruction predicts well; a Value Prediction Table (VPT) holds the value history and supplies the predicted value. The prediction outcome updates the CT; the computed value updates the VPT.]
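The CT/VPT split can be sketched with a last-value predictor, a common textbook instance of this structure; the 2-bit confidence counter and the names here are assumptions, not necessarily the lecture's exact design.

```python
# Illustrative sketch of value prediction with a CT/VPT split: the VPT
# remembers the last value produced per PC; the CT's 2-bit confidence
# counter (the "prediction history") decides whether the speculative
# value is actually used.

class ValuePredictor:
    def __init__(self):
        self.vpt = {}         # pc -> last value produced
        self.ct = {}          # pc -> 2-bit confidence counter

    def predict(self, pc):
        if self.ct.get(pc, 0) >= 2:     # confident enough to speculate
            return self.vpt.get(pc)
        return None                     # not classified as predictable

    def update(self, pc, actual):
        correct = self.vpt.get(pc) == actual
        c = self.ct.get(pc, 0)
        self.ct[pc] = min(3, c + 1) if correct else max(0, c - 1)
        self.vpt[pc] = actual

vp = ValuePredictor()
for _ in range(3):
    vp.update(0x80, 42)       # the instruction at 0x80 keeps producing 42
print(vp.predict(0x80))       # 42: confidence saturated, prediction used
```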

Thank You