Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue

Similar documents
Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Out of Order Processing

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture: Out-of-order Processors

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,


Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

Hardware-based Speculation

Dynamic Scheduling. CSE471 Susan Eggers 1

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

Handout 2 ILP: Part B

Hardware-Based Speculation

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Lecture 19: Instruction Level Parallelism

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

EE 660: Computer Architecture Superscalar Techniques

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Hardware-based Speculation

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Lecture-13 (ROB and Multi-threading) CS422-Spring

Advanced Computer Architecture

Getting CPI under 1: Outline

Static vs. Dynamic Scheduling

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Instruction Level Parallelism

Copyright 2012, Elsevier Inc. All rights reserved.

CS 152 Computer Architecture and Engineering

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

5008: Computer Architecture

Multiple Instruction Issue and Hardware Based Speculation

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Superscalar Processor Design

Chapter 4 The Processor 1. Chapter 4D. The Processor

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

E0-243: Computer Architecture

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Lecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3)

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

Chapter 06: Instruction Pipelining and Parallel Processing

ECE 505 Computer Architecture

Static & Dynamic Instruction Scheduling

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

CS 152, Spring 2011 Section 8

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

Instruction-Level Parallelism (ILP)

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CS252 Graduate Computer Architecture Midterm 1 Solutions

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

EECC551 Exam Review 4 questions out of 6 questions

Adapted from David Patterson s slides on graduate computer architecture

CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

The basic structure of a MIPS floating-point unit

CS6290 Speculation Recovery

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

Multi-cycle Instructions in the Pipeline (Floating Point)

Computer Systems Architecture

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Lecture: Pipeline Wrap-Up and Static ILP

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Instruction Level Parallelism (ILP)

Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches

CS 152, Spring 2012 Section 8

Advanced issues in pipelining

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

CSE 502 Graduate Computer Architecture

Instruction Level Parallelism

Super Scalar. Kalyan Basu March 21,

Instruction Level Parallelism

Computer Architecture Spring 2016

Processor: Superscalars Dynamic Scheduling

Spring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

CSE502: Computer Architecture CSE 502: Computer Architecture

Tomasulo s Algorithm

Transcription:

Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1

Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ? R1 R2+R3 R1 R1+R5 BEQZ R1 R1 R4 + R5 R4 R1 + R7 R1 R6 + R8 R4 R3 + R1 R1 R5 + R9 2

Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ? R1 R2+R3 P33 P2+P3 cycle i R1 R1+R5 P34 P33+P5 i+1 BEQZ R1 BEQZ P34 i+2 R1 R4 + R5 P35 P4+P5 i R4 R1 + R7 P36 P35+P7 i+1 R1 R6 + R8 P1 P6+P8 j R4 R3 + R1 P33 P3+P1 j+1 R1 R5 + R9 P34 P5+P9 j+2 Width is assumed to be 4. j depends on the #stages between issue and commit. 3

OOO Example IQ Assume there are 36 physical registers and 32 logical registers, and width is 4 Estimate the issue time, completion time, and commit time for the sample code 4

Assumptions IQ Perfect branch prediction, instruction fetch, caches ADD dep has no stall; LD dep has one stall An instr is placed in the IQ at the end of its 5 th stage, an instr takes 5 more stages after leaving the IQ (ld/st instrs take 6 more stages after leaving the IQ) 5

OOO Example IQ Original code ADD R1, R2, R3 LD R2, 8(R1) ADD R2, R2, 8 ST R1, (R3) SUB R1, R1, R5 LD R1, 8(R2) ADD R1, R1, R2 Renamed code 6

OOO Example IQ Original code Renamed code ADD R1, R2, R3 ADD P33, P2, P3 LD R2, 8(R1) LD P34, 8(P33) ADD R2, R2, 8 ADD P35, P34, 8 ST R1, (R3) ST P33, (P3) SUB R1, R1, R5 SUB P36, P33, P5 LD R1, 8(R2) Must wait ADD R1, R1, R2 7

OOO Example IQ Original code Renamed code InQ Iss Comp Comm ADD R1, R2, R3 ADD P33, P2, P3 LD R2, 8(R1) LD P34, 8(P33) ADD R2, R2, 8 ADD P35, P34, 8 ST R1, (R3) ST P33, (P3) SUB R1, R1, R5 SUB P36, P33, P5 LD R1, 8(R2) ADD R1, R1, R2 8

OOO Example IQ Original code Renamed code InQ Iss Comp Comm ADD R1, R2, R3 ADD P33, P2, P3 i i+1 i+6 i+6 LD R2, 8(R1) LD P34, 8(P33) i i+2 i+8 i+8 ADD R2, R2, 8 ADD P35, P34, 8 i i+4 i+9 i+9 ST R1, (R3) ST P33, (P3) i i+2 i+8 i+9 SUB R1, R1, R5 SUB P36, P33, P5 i+1 i+2 i+7 i+9 LD R1, 8(R2) ADD R1, R1, R2 9

OOO Example IQ Original code Renamed code InQ Iss Comp Comm ADD R1, R2, R3 ADD P33, P2, P3 i i+1 i+6 i+6 LD R2, 8(R1) LD P34, 8(P33) i i+2 i+8 i+8 ADD R2, R2, 8 ADD P35, P34, 8 i i+4 i+9 i+9 ST R1, (R3) ST P33, (P3) i i+2 i+8 i+9 SUB R1, R1, R5 SUB P36, P33, P5 i+1 i+2 i+7 i+9 LD R1, 8(R2) LD P1, 8(P35) i+7 i+8 i+14 i+14 ADD R1, R1, R2 ADD P2, P1, P35 i+9 i+10 i+15 i+15 10

The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Speculative Reg Map R1 P36 R2 P34 Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 Committed Reg Map R1 P1 R2 P2 P33 P1+P2 P34 P33+P3 BEQZ P34 P35 P33+P34 P36 P35+P34 Register File P1-P64 ALU ALU ALU Results written to regfile and tags broadcast to IQ Issue Queue (IQ) 11

Additional Details When does the decode stage stall? When we either run out of registers, or ROB entries, or issue queue entries Issue width: the number of instructions handled by each stage in a cycle. High issue width high peak ILP Window size: the number of in-flight instructions in the pipeline. Large window size high ILP No more WAR and WAW hazards because of rename registers must only worry about RAW hazards 12

Branch Mispredict Recovery On a branch mispredict, must roll back the processor state: throw away IFQ contents, ROB/IQ contents after branch Committed map table is correct and need not be fixed The speculative map table needs to go back to an earlier state To facilitate this spec-map-table rollback, it is checkpointed at every branch 13

Waking Up a Dependent In an in-order pipeline, an instruction leaves the decode stage when it is known that the inputs can be correctly received, not when the inputs are computed Similarly, an instruction leaves the issue queue before its inputs are known, i.e., wakeup is speculative based on the expected latency of the producer instruction 14

Out-of-Order Loads/Stores St R1 [R2] R3 [R4] R5 [R6] R7 [R8] R9 [R10] What if the issue queue also had load/store instructions? Can we continue executing instructions out-of-order? 15

Memory Dependence Checking St St 0x abcdef 0x abcdef 0x abcd00 0x abc000 0x abcd00 The issue queue checks for register dependences and executes instructions as soon as registers are ready Loads/stores access memory as well must check for RAW, WAW, and WAR hazards for memory as well Hence, first check for register dependences to compute effective addresses; then check for memory dependences 16

Memory Dependence Checking St St 0x abcdef 0x abcdef 0x abcd00 0x abc000 0x abcd00 Load and store addresses are maintained in program order in the Load/Store Queue (LSQ) Loads can issue if they are guaranteed to not have true dependences with earlier stores Stores can issue only if we are ready to modify memory (can not recover if an earlier instr raises an exception) happens at commit 17

The Alpha 21264 Out-of-Order Implementation Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 LD R4 8[R3] ST R4 8[R1] Instr Fetch Queue Decode & Rename Speculative Reg Map R1 P36 R2 P34 Reorder Buffer (ROB) Instr 1 Committed Instr 2 Reg Map Instr 3 R1 P1 Instr 4 R2 P2 Instr 5 Instr 6 Instr 7 P33 P1+P2 P34 P33+P3 BEQZ P34 P35 P33+P34 P36 P35+P34 P37 8[P35] P37 8[P36] Issue Queue (IQ) P37 [P35 + 8] P37 [P36 + 8] LSQ Register File P1-P64 ALU ALU ALU Results written to regfile and tags broadcast to IQ ALU D-Cache 18

Problem 1 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal LD R1 [R2] 3 abcd LD R3 [R4] 6 adde ST R5 [R6] 4 7 abba LD R7 [R8] 2 abce ST R9 [R10] 8 3 abba LD R11 [R12] 1 abba Mem.Acc 19

Problem 1 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal Mem.Acc LD R1 [R2] 3 abcd 4 5 LD R3 [R4] 6 adde 7 8 ST R5 [R6] 4 7 abba 5 commit LD R7 [R8] 2 abce 3 6 ST R9 [R10] 8 3 abba 9 commit LD R11 [R12] 1 abba 2 10 20

Problem 2 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal LD R1 [R2] 3 abcd LD R3 [R4] 6 adde ST R5 [R6] 5 7 abba LD R7 [R8] 2 abce ST R9 [R10] 1 4 abba LD R11 [R12] 2 abba Mem.Acc 21

Problem 2 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal Mem.Acc LD R1 [R2] 3 abcd 4 5 LD R3 [R4] 6 adde 7 8 ST R5 [R6] 5 7 abba 6 commit LD R7 [R8] 2 abce 3 7 ST R9 [R10] 1 4 abba 2 commit LD R11 [R12] 2 abba 3 5 22

Problem 3 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal LD R1 [R2] 3 abcd LD R3 [R4] 6 adde ST R5 [R6] 4 7 abba LD R7 [R8] 2 abce ST R9 [R10] 8 3 abba LD R11 [R12] 1 abba Mem.Acc 23

Problem 3 Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction. Ad. Op St. Op Ad.Val Ad.Cal Mem.Acc LD R1 [R2] 3 abcd 4 5 LD R3 [R4] 6 adde 7 8 ST R5 [R6] 4 7 abba 5 commit LD R7 [R8] 2 abce 3 4 ST R9 [R10] 8 3 abba 9 commit LD R11 [R12] 1 abba 2 3/10 24

Title Bullet 25