Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley


Plan for Today
- Questions
- Dynamic Execution III discussion
- Multiple Issue
  - Static multiple issue (+ examples)
  - Dynamic multiple issue (+ examples)
- General speculative execution
- Commit Stage
  - Mechanism: Reorder Buffer
  - MIPS+Tomasulo+ROB

Multiple Issue Alternatives
- Superscalar (hardware detects conflicts)
  - Statically scheduled (in-order execution)
  - Dynamically scheduled (out-of-order execution)
- VLIW (compiler schedules)
  - Needs advanced compiler techniques (e.g., trace scheduling)

Impact of Superscalar on IF
- IF: need to fetch more than 1 instruction at a time
  - Simpler if instructions are of fixed length
  - Simpler if restricted to single I-cache lines
- Sample of recent microprocessors
  - Dual issue: Alpha 21064, Sparc 2, Pentium, Cyrix
  - Triple issue: Pentium Pro (P6)
  - Quad issue: Alpha 21164, PowerPC 620, Sun UltraSparc, HP PA-8000, MIPS R10000, Intel Core 2, Intel Nehalem
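The "restricted to single I-cache lines" point can be illustrated with a minimal sketch. The function name, the 4-byte instruction size, and the 32-byte line size are assumptions chosen for illustration; they do not describe any of the processors listed.

```python
def fetch_group_size(pc: int, issue_width: int = 4,
                     instr_bytes: int = 4, line_bytes: int = 32) -> int:
    """Instructions fetched in one cycle when a fetch group may not
    cross an I-cache line boundary (all sizes are illustrative)."""
    # Bytes remaining between pc and the end of its cache line.
    bytes_left_in_line = line_bytes - (pc % line_bytes)
    # Fixed-length instructions make this a simple division.
    return min(issue_width, bytes_left_in_line // instr_bytes)
```

With these assumed sizes, a fetch starting at the beginning of a line delivers a full group of 4, while a fetch starting 8 bytes before the end of a line delivers only 2.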

The Decode Stage
- Simplification: dual issue and static scheduling
- ID: look for conflicts between the (say) 2 instructions
- Of course, more than one functional unit is needed
  - If one integer unit and one f-p unit, only check for structural hazards (i.e., check opcodes)
  - Slight difficulty for f-p loads/stores, which use the integer unit
- Multiple accesses to the register file: dual ports

Pentium
- Dual integer pipeline (one of them used for f-p)
- Decode 2 consecutive instructions I1 and I2. Issue both iff
  - I1 and I2 are simple instructions (no microcode)
  - I1 is not a jump instruction
  - No RAW or WAW hazard between I1 and I2 (I1 precedes I2)
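The pairing conditions on this slide can be sketched as a predicate. This is a simplified illustration, not Intel's actual pairing logic: the `Instr` record and its fields are hypothetical, and the register check tests read-after-write and write-after-write overlap between I1 and I2, which is an assumption about the hazard condition intended here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical decoded-instruction record."""
    op: str
    dst: frozenset = frozenset()    # registers written
    srcs: frozenset = frozenset()   # registers read
    simple: bool = True             # True if not microcoded
    is_jump: bool = False


def can_dual_issue(i1: Instr, i2: Instr) -> bool:
    """True if I1 and I2 (I1 first in program order) may issue together."""
    if not (i1.simple and i2.simple):
        return False                # both must be simple instructions
    if i1.is_jump:
        return False                # I1 must not be a jump
    raw = i1.dst & i2.srcs          # I2 reads a register that I1 writes
    waw = i1.dst & i2.dst           # both write the same register
    return not raw and not waw
```

For example, `add eax, ebx` followed by `sub ecx, edx` passes the check, while `add eax, ebx` followed by an instruction that reads `eax` does not.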

Alpha 21064
- IF: prefetcher fetches 2 instructions (8 bytes) at a time
  - Prefetcher contains branch prediction logic: 4-entry JSR stack; 1 bit/instruction in the I-cache + static prediction BTFNT (backward taken, forward not taken)
- Swap stage: initial decode yields 0, 1 or 2 instruction issue
- Conditions for 2-instruction issue
  - The first instruction must be able to issue (in-order execution)
  - A load/store can issue with an operation, except that stores cannot issue with an operation of a different format
  - An integer op can issue with an f-p op
  - A branch can issue with a load/store/operation

Alpha 21164
- Main differences w.r.t. the 21064 (besides caches)
  - Up to 4 instructions issued/cycle
  - Two integer units
  - Two f-p units (one add, one multiply)
  - Slightly different pipeline organization

Alpha 21164
- Instruction issue (4 static stages: S0-S3)
  - S0: IF. Gets 4 instructions at a time.
  - S1: up to 4 instructions decoded and buffered (some predecoding is done on an I-cache miss, when the line is brought in). Branch prediction also happens in this stage.
  - S2: put up to 4 instructions in slots (check for structural hazards; a few priority rules). A new block of 4 instructions is brought in only when the current 4 are all gone.
  - S3: instruction issue (checks data dependencies and activates forwarding; scoreboarding)
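The rule that a new block of 4 is brought in only after the current block has drained can be sketched with a toy slotting model. This is a simplification under stated assumptions: instructions are represented only by the unit they need, the `UNITS` table (two integer units, one f-p add, one f-p multiply) follows the previous slide, data dependences and the real priority rules are ignored, and all names are hypothetical.

```python
from collections import Counter

# Units available each cycle (assumed from the "two integer units,
# two f-p units (one add, one multiply)" description).
UNITS = {"int": 2, "fadd": 1, "fmul": 1}


def issue_cycles(stream, block_size=4, units=UNITS):
    """Return the list of per-cycle issue groups for an in-order stream.

    Each element of `stream` names the unit the instruction needs.
    Only structural hazards are modeled.
    """
    if any(units.get(u, 0) < 1 for u in stream):
        raise ValueError("instruction needs a unit that does not exist")
    schedule = []
    for start in range(0, len(stream), block_size):
        block = list(stream[start:start + block_size])
        # The next block is not touched until this one is empty.
        while block:
            free = Counter(units)
            group = []
            # In-order: stop at the first instruction whose unit is busy.
            while block and free[block[0]] > 0:
                free[block[0]] -= 1
                group.append(block.pop(0))
            schedule.append(group)
    return schedule
```

In this model, the stream `int, int, int, fadd, fmul, int` takes three cycles: the third `int` waits for a free integer unit, and the `fmul` in the second block cannot join cycle 2 even though the multiply unit is idle, because its block has not been brought in yet.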

Multiple Issue for Dynamic Scheduling
- Issue in order, execute out of order
- Requires the possibility of concurrently issuing dependent instructions; otherwise there is little benefit over static scheduling
  - Update tables during the issue stage
  - Restrict concurrent issue to units using different register files (or reservation stations). The only conflicts left are for load/store/move (from int to f-p regs)

Pentium Pro (P6 architecture)
- Fetch-decode transforms instructions into RISC-like micro-operations (uops)
  - 3 IA-32 instructions fetched per cycle, at most 6 uops created per cycle
- Uops are issued to functional units that execute them and temporarily store the results
  - Uops are stored in reservation stations
  - Register renaming is done (RAT = register alias table)
- The retire unit commits the instructions in order

PowerPC 620
- Issue stage: up to 4 instructions issued/cycle, except if a structural hazard exists
  - No reservation station available
  - No rename register available (every result is renamed)
  - Reorder buffer is full (needed to commit instructions in order)
  - Two operations for the same unit: only one write port per set of reservation stations
  - Miscellaneous use of special registers
- Above, the first 3 are due to the program, while the final two are due to the implementation
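The issue-blocking conditions above can be sketched as a per-cycle check. This is a hedged illustration only: the function and parameter names are hypothetical, each instruction is reduced to the unit it targets, every instruction is assumed to need one rename register and one reorder buffer entry, and the special-register case is not modeled.

```python
def count_issued(window, free_rs, free_rename, free_rob, width=4):
    """Count how many instructions at the front of `window` issue this cycle.

    window      -- unit names, in program order
    free_rs     -- dict: unit name -> free reservation stations
    free_rename -- number of free rename registers
    free_rob    -- number of free reorder buffer entries
    """
    free_rs = dict(free_rs)      # work on a copy
    used_units = set()
    issued = 0
    for unit in window[:width]:
        if free_rs.get(unit, 0) == 0:
            break                # no reservation station available
        if free_rename == 0:
            break                # no rename register available
        if free_rob == 0:
            break                # reorder buffer is full
        if unit in used_units:
            break                # second operation for the same unit
        free_rs[unit] -= 1
        free_rename -= 1
        free_rob -= 1
        used_units.add(unit)
        issued += 1
    return issued
```

Issue is in order, so the loop stops at the first instruction that hits any hazard; later instructions in the window wait for the next cycle.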

General Speculative Execution
- Allow execution of an instruction before knowing that it is to be executed
  - Branch prediction: instructions after the branch are executed, but not completed, before the branch is resolved
  - Moving code across basic block boundaries: instructions are executed before the branch on which they really depend is issued
  - Conditional instructions
  - Out-of-order execution

Instruction Commit Step
- Need a mechanism to
  - Know when an instruction has completed non-speculatively
  - Know whether the result of an instruction is correct
  - Complete instructions in order. This commits the instruction (easy in a single pipeline, since commit is the last stage of the pipe)
- Note that in Tomasulo's solution
  - An instruction completes as soon as it has finished executing (i.e., result put in register unless there is a WAW hazard)
  - This necessarily leads to imprecise exceptions (unless handled by software tests)

Reorder Buffer
- Extend Tomasulo's scheme with a reorder buffer
- A reorder buffer entry contains (this is not the only possible solution)
  - Type of instruction: branch, store, ALU, or load
  - Destination: none, memory address, register
  - Value
- Replaces store buffers
- Reservation station tags and true register tags are now IDs of entries in the reorder buffer
- Reorder buffer implemented as a circular queue
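The entry layout and circular-queue organization described above might look like the following minimal sketch. The class and method names are hypothetical, and the `ready` flag is an added assumption (the slide lists only type, destination, and value) used to mark that a result has arrived.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                      # "branch", "store", "alu", or "load"
    dest: Optional[object]         # None, a memory address, or a register
    value: Optional[int] = None
    ready: bool = False            # assumption: set when the value arrives


class ReorderBuffer:
    """Fixed-size circular queue; an entry's index serves as its tag."""

    def __init__(self, size: int):
        self.entries = [None] * size
        self.head = 0              # oldest instruction (next to commit)
        self.tail = 0              # next free slot
        self.count = 0

    def is_full(self) -> bool:
        return self.count == len(self.entries)

    def allocate(self, kind: str, dest=None) -> int:
        """Append an entry in program order and return its tag."""
        if self.is_full():
            raise RuntimeError("reorder buffer full: issue must stall")
        tag = self.tail
        self.entries[tag] = ROBEntry(kind, dest)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def write_result(self, tag: int, value) -> None:
        """Record a result that was broadcast with this tag."""
        entry = self.entries[tag]
        entry.value = value
        entry.ready = True

    def retire_head(self) -> Optional[ROBEntry]:
        """Remove and return the head entry if it is ready, else None."""
        if self.count == 0 or not self.entries[self.head].ready:
            return None
        entry = self.entries[self.head]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry
```

Because only the head can retire, a younger instruction whose result arrives early simply waits in the buffer, which is what keeps completion in program order.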

4 Instruction Stages
- Tomasulo, 3 stages: issue, execute, write
- Now 4: issue, execute, write, commit
- 1. Issue (often called dispatch)
  - Check for structural hazards (reservation station busy, reorder buffer full).
  - If issue is possible, send values to the reservation station if they are available in either the registers or the reorder buffer. Otherwise send the tag.
  - Allocate an entry in the reorder buffer and send its number to the reservation station.

4 Stages cont'd
- 2. Execute (like Tomasulo)
- 3. Write
  - Broadcast the value and the tag (reorder buffer number) on the common data bus. Any reservation stations that match the tag, and the reorder buffer (always), grab the value.
- 4. Commit
  - When the instruction at the head of the ROB has its result in the buffer, it stores the result in the real register (for ALU) or memory (for store) and is deleted.
  - If the instruction is a branch with a correct prediction, do nothing and delete it.
  - If the instruction is a branch with an incorrect prediction, flush the buffer and restart with the correct target address (do it ASAP).
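The issue, write, and commit steps can be tied together in a small functional sketch. It is a toy model under stated assumptions, not the MIPS+Tomasulo+ROB datapath: reservation stations and the execute stage are left to the caller, tags are running integers rather than buffer indices, the misprediction flush happens at commit, and all names (`SpeculativeCore`, `rename`, and so on) are hypothetical.

```python
from collections import deque


class SpeculativeCore:
    def __init__(self, rob_size=4):
        self.regs = {}        # committed (architectural) register values
        self.memory = {}      # committed memory
        self.rename = {}      # register -> tag of its latest in-flight writer
        self.rob = deque()    # entries in program order, head on the left
        self.rob_size = rob_size
        self.next_tag = 0

    def _find(self, tag):
        return next(e for e in self.rob if e["tag"] == tag)

    def issue(self, kind, dest=None, srcs=()):
        """Stage 1. Returns (tag, operands), or None on a structural hazard.
        Each operand is ("val", value) if available, else ("tag", tag)."""
        if len(self.rob) == self.rob_size:
            return None
        operands = []
        for reg in srcs:
            tag = self.rename.get(reg)
            if tag is None:                        # value is in the registers
                operands.append(("val", self.regs.get(reg, 0)))
            elif self._find(tag)["done"]:          # value is in the ROB
                operands.append(("val", self._find(tag)["value"]))
            else:                                  # not produced yet
                operands.append(("tag", tag))
        tag = self.next_tag
        self.next_tag += 1
        self.rob.append({"tag": tag, "kind": kind, "dest": dest,
                         "value": None, "done": False, "mispredicted": False})
        if kind in ("alu", "load"):
            self.rename[dest] = tag
        return tag, operands

    def write(self, tag, value=None, mispredicted=False):
        """Stage 3. The ROB always captures the broadcast result."""
        entry = self._find(tag)
        entry.update(value=value, done=True, mispredicted=mispredicted)

    def commit(self):
        """Stage 4. Retire the head if its result is in; return its kind."""
        if not self.rob or not self.rob[0]["done"]:
            return None
        e = self.rob.popleft()
        if e["kind"] == "branch":
            if e["mispredicted"]:                  # squash younger work
                self.rob.clear()
                self.rename.clear()
        elif e["kind"] == "store":
            self.memory[e["dest"]] = e["value"]
        else:
            self.regs[e["dest"]] = e["value"]
            if self.rename.get(e["dest"]) == e["tag"]:
                del self.rename[e["dest"]]
        return e["kind"]
```

A usage example: issue an ALU instruction writing R1, then a branch, then an ALU instruction writing R2. Even if the R2 result is written before the branch resolves, it cannot commit until the branch reaches the head; if the branch turns out to be mispredicted, the R2 entry is flushed and R2 is never updated.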

MIPS, Tomasulo + ROB

Assignment
- Readings
  - For Monday: H&P sections 2.8-2.10
  - For Wednesday: HW2 due; H&P sections 2.2, 2.7