250P: Computer Systems Architecture
Lecture 9: Out-of-order execution (continued)
Anton Burtsev
February 2019

The Alpha 21264 Out-of-Order Implementation

[Diagram: instructions flow from the Instr Fetch Queue (branch prediction and instr fetch) through Decode & Rename into the Issue Queue (IQ) and the Reorder Buffer (ROB, holding Instr 1 - Instr 6); the ALUs write results to the Register File (P1-P64) and broadcast tags to the IQ.]

Original code:        Renamed code:
R1 ← R1+R2            P33 ← P1+P2
R2 ← R1+R3            P34 ← P33+P3
BEQZ R2               BEQZ P34
R3 ← R1+R2            P35 ← P33+P34
R1 ← R3+R2            P36 ← P35+P34

Committed Reg Map: R1 → P1, R2 → P2
Speculative Reg Map: R1 → P36, R2 → P34

Managing Register Names

- Temporary values are stored in the register file, not in the ROB
- Logical registers: R1-R32; physical registers: P1-P64
- At the start, R1-R32 can be found in P1-P32
- Instructions stop entering the pipeline when P64 is assigned

R1 ← R1+R2            P33 ← P1+P2
R2 ← R1+R3            P34 ← P33+P3
BEQZ R2               BEQZ P34
R3 ← R1+R2            P35 ← P33+P34

What happens on commit?
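Renaming amounts to a lookup table plus a free list. Here is a minimal Python sketch of that mechanism (the class and method names are hypothetical, and this is an abstraction, not the 21264's actual hardware); it reproduces the mapping shown above.

```python
# Minimal register-rename sketch (assumed structure, not real 21264 RTL).
# Logical registers R1-R32 start out mapped to P1-P32; P33-P64 are free.

class Renamer:
    def __init__(self, n_logical=32, n_physical=64):
        self.spec_map = {r: r for r in range(1, n_logical + 1)}
        self.free = list(range(n_logical + 1, n_physical + 1))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (phys dst, phys srcs)."""
        phys_srcs = [self.spec_map[s] for s in srcs]  # read sources first
        if not self.free:                             # P64 already assigned:
            raise RuntimeError("decode stalls")       # no instr may enter
        phys_dst = self.free.pop(0)                   # allocate a fresh reg
        self.spec_map[dst] = phys_dst                 # update speculative map
        return phys_dst, phys_srcs

r = Renamer()
print(r.rename(1, (1, 2)))  # R1 ← R1+R2  gives (33, [1, 2]):  P33 ← P1+P2
print(r.rename(2, (1, 3)))  # R2 ← R1+R3  gives (34, [33, 3]): P34 ← P33+P3
```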

The Commit Process

- On commit, no copy is required
- The register map table is updated: the committed value of R1 is now in P33, not P1; on an exception, P33 is copied to memory, not P1
- An instruction in the issue queue need not modify its input operand when the producer commits
- When instruction 1 commits, we no longer have any use for P1: it is put in a free pool, and a new instruction can now enter the pipeline
- For every instr that commits, a new instr can enter the pipeline; the number of in-flight instrs is a constant = the number of extra (rename) registers
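The commit-side bookkeeping can be sketched the same way (again with hypothetical names): no value is copied; the committed map is updated and the destination's previous committed home is recycled.

```python
# Commit-side bookkeeping sketch (assumed structure). No value moves:
# only the committed map changes, and the *old* mapping is recycled.

committed_map = {r: r for r in range(1, 33)}  # at start, Ri committed in Pi
free_pool = []

def commit(dst, phys_dst):
    old_home = committed_map[dst]   # e.g. P1, R1's previous committed home
    committed_map[dst] = phys_dst   # R1's committed value now lives in P33
    free_pool.append(old_home)      # P1 is free; one new instr may enter

commit(1, 33)                       # instruction 1 (R1 -> P33) commits
print(committed_map[1], free_pool)  # 33 [1]
```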

Additional Details

- When does the decode stage stall? When we run out of rename registers, ROB entries, or issue queue entries
- Issue width: the number of instructions handled by each stage in a cycle; high issue width → high peak ILP
- Window size: the number of in-flight instructions in the pipeline; large window size → high ILP
- No more WAR and WAW hazards, thanks to the rename registers: we must only worry about RAW hazards
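The stall condition reduces to a three-way resource check, as in this trivial sketch (hypothetical helper):

```python
# Decode proceeds only while all three resources have a free entry.
def decode_can_proceed(free_phys_regs, free_rob_entries, free_iq_entries):
    return free_phys_regs > 0 and free_rob_entries > 0 and free_iq_entries > 0

print(decode_can_proceed(4, 0, 7))  # False: the ROB is full, decode stalls
```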


Branch Mispredict Recovery

- On a branch mispredict, we must roll back the processor state: throw away the IFQ contents and the ROB/IQ contents after the branch
- The committed map table is correct and need not be fixed
- The speculative map table needs to go back to an earlier state
- To facilitate this spec-map-table rollback, it is checkpointed at every branch
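A sketch of the checkpoint/rollback mechanism under the same assumptions as the earlier sketches (hypothetical structures, not the 21264's shadow-map hardware):

```python
# Speculative-map checkpointing sketch (assumed structures).

spec_map = {1: 36, 2: 34}     # current speculative mappings (example)
free = [37, 38]
checkpoints = {}              # branch tag -> (map snapshot, free snapshot)

def checkpoint(tag):
    checkpoints[tag] = (dict(spec_map), list(free))

def rollback(tag):
    # The committed map is correct by construction; only spec state is restored.
    saved_map, saved_free = checkpoints[tag]
    spec_map.clear(); spec_map.update(saved_map)
    free[:] = saved_free

checkpoint("BEQZ P34")        # taken at rename time for every branch
spec_map[1] = 37; free.pop(0) # a wrong-path instruction renames R1 -> P37
rollback("BEQZ P34")          # mispredict detected: restore the snapshot
print(spec_map, free)         # {1: 36, 2: 34} [37, 38]
```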

Waking Up a Dependent

- In an in-order pipeline, an instruction leaves the decode stage when it is known that its inputs can be correctly received, not when the inputs are computed
- Similarly, an instruction leaves the issue queue before its inputs are known; i.e., wakeup is speculative, based on the expected latency of the producer instruction
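In timing terms (a sketch under assumed fixed, known latencies; real schedulers must also replay consumers when a producer, such as a cache-missing load, takes longer than expected):

```python
# Speculative wakeup sketch: schedule a consumer off the producer's
# *expected* completion, so back-to-back dependent ops are possible.

def consumer_issue_cycle(producer_issue_cycle, producer_expected_latency):
    return producer_issue_cycle + producer_expected_latency

print(consumer_issue_cycle(5, 1))  # a 1-cycle ALU op issuing at 5 wakes its
                                   # dependent for cycle 6
```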

Out-of-Order Loads/Stores

[Figure: an in-order sequence of memory instructions, beginning with St R1 → [R2] and continuing with accesses involving R3/[R4], R5/[R6], R7/[R8], and R9/[R10].]

- What if the issue queue also had load/store instructions?
- Can we continue executing instructions out of order?

Memory Dependence Checking

[Figure: a program-order sequence of loads and stores, two of them stores, with addresses 0xabcdef, 0xabcdef, 0xabcd00, 0xabc000, and 0xabcd00.]

- The issue queue checks for register dependences and executes instructions as soon as their registers are ready
- Loads/stores access memory as well: we must check for RAW, WAW, and WAR hazards through memory too
- Hence, first check for register dependences to compute effective addresses; then check for memory dependences

Memory Dependence Checking

- Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
- Loads can issue if they are guaranteed to not have true dependences with earlier stores
- Stores can issue only if we are ready to modify memory (we cannot recover if an earlier instr raises an exception)
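A simplified sketch of the load-issue test this implies (assumed structure; a real LSQ would forward matching store data to the load rather than simply stalling it, and would handle partial address overlaps):

```python
# LSQ dependence-check sketch. Entries are kept in program order; an
# address of None means the store's effective address isn't computed yet.

def load_can_issue(lsq, load_idx):
    load_addr = lsq[load_idx]["addr"]
    for older in lsq[:load_idx]:
        if older["op"] != "store":
            continue
        if older["addr"] is None:        # unknown address: be conservative
            return False
        if older["addr"] == load_addr:   # true memory (RAW) dependence
            return False                 # (a real LSQ would forward the data)
    return True

lsq = [
    {"op": "store", "addr": 0xABCD00},
    {"op": "load",  "addr": 0xABC000},   # no conflict with the older store
    {"op": "load",  "addr": 0xABCD00},   # conflicts: must wait (or forward)
]
print(load_can_issue(lsq, 1), load_can_issue(lsq, 2))  # True False
```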

The Alpha 21264 Out-of-Order Implementation

[Diagram: the same pipeline as before, extended with a Load/Store Queue (LSQ) and D-Cache alongside the ALUs; the ROB now holds Instr 1 - Instr 7.]

Original code:        Renamed code:
R1 ← R1+R2            P33 ← P1+P2
R2 ← R1+R3            P34 ← P33+P3
BEQZ R2               BEQZ P34
R3 ← R1+R2            P35 ← P33+P34
R1 ← R3+R2            P36 ← P35+P34
LD R4 ← 8[R3]         LD P37 ← 8[P35]
ST R4 → 8[R1]         ST P37 → 8[P36]

LSQ entries: P37 ← [P35 + 8]; P37 → [P36 + 8]
Committed Reg Map: R1 → P1, R2 → P2; Speculative Reg Map: R1 → P36, R2 → P34

Thread-Level Parallelism

- Motivation: a single thread leaves a processor under-utilized most of the time; doubling the processor area barely improves single-thread performance
- Strategies for thread-level parallelism:
  - Multiple threads share the same large processor: reduces under-utilization, efficient resource allocation → Simultaneous Multi-Threading (SMT)
  - Each thread executes on its own mini processor: simple design, low interference between threads → Chip Multi-Processing (CMP), or multi-core

How are Resources Shared?

[Figure: cycles run down the vertical axis; each box is an issue slot for a functional unit, and peak throughput is 4 IPC. Three panels compare Superscalar, Fine-Grained Multithreading, and Simultaneous Multithreading, with slots filled by Thread 1-4 or left Idle.]

- A superscalar processor has high under-utilization: not enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can only issue instructions from a single thread in a cycle: it cannot find maximal work every cycle, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle: it has the highest probability of finding work for every issue slot

What Resources are Shared?

- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC, IFQ, and logical regs (and its own mappings from logical to phys regs)
- For performance, each thread could have its own ROB/LSQ (so that a stall in one thread does not stall commit in the others), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing → better utilization of resources
- Each additional thread costs only a PC, IFQ, rename tables, and a ROB: cheap!

Pipeline Structure

[Figure: several per-thread front ends feed one shared execution engine. Front-end structures (I-cache, branch predictor) may be private or shared; the rename tables and ROB are private to each front end; the execution engine (regs, IQ, FUs, D-cache) is shared.]

Resource Sharing

[Figure: each thread has its own instruction fetch and instruction rename stages; the renamed instructions from both threads share a single issue queue, register file, and set of FUs.]

Thread-1 code:        Renamed:
R1 ← R1 + R2          P65 ← P1 + P2
R3 ← R1 + R4          P66 ← P65 + P4
R5 ← R1 + R3          P67 ← P65 + P66

Thread-2 code:        Renamed:
R2 ← R1 + R2          P76 ← P33 + P34
R5 ← R1 + R2          P77 ← P33 + P76
R3 ← R5 + R3          P78 ← P77 + P35

Performance Implications of SMT

- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared); this effect can be mitigated by trying to prioritize one thread
- While fetching instructions, thread priority can dramatically influence total throughput; a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
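The ICOUNT policy itself is tiny; a sketch (hypothetical function name): each cycle, fetch from the thread holding the fewest front-end and issue-queue slots, which steers fetch away from threads that are clogging the pipeline.

```python
# ICOUNT fetch-policy sketch: pick the thread with the fewest instructions
# in decode/rename/IQ, which equalizes each thread's share of resources.

def icount_pick(in_flight_counts):
    """in_flight_counts: thread id -> instrs in the front end and IQ."""
    return min(in_flight_counts, key=in_flight_counts.get)

print(icount_pick({0: 12, 1: 5, 2: 9}))  # fetch from thread 1 this cycle
```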

Pentium4 Hyper-Threading

- Two threads: the Linux operating system operates as if it is executing on a two-processor system
- When there is only one available thread, it behaves like a regular single-threaded superscalar processor
- Statically divided resources: ROB, LSQ, issue queue; a slow thread will not cripple throughput (but this might not scale)
- Dynamically shared resources: trace cache and decode (fine-grained multi-threaded, round-robin), FUs, data cache, bpred

Multi-Programmed Speedup

- sixtrack and eon do not degrade their partners (small working sets?)
- swim and art degrade their partners (cache contention?)
- Best combination: swim & sixtrack; worst combination: swim & art
- Static partitioning ensures low interference: the worst slowdown is 0.9

Problem 2

Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers.

R1 ← R2+R3        P33 ← P2+P3
R3 ← R4+R5        P34 ← P4+P5
BEQZ R1           BEQZ P33
R1 ← R1 + R3      P35 ← P33+P34
R1 ← R1 + R3      P36 ← P35+P34
R3 ← R1 + R3      P1 ← P36+P34
R4 ← R3 + R1      P3 ← P1+P36

Note that the last two instructions cannot be renamed until physical registers are freed: the commit of the first instruction (R1 → P33) returns P1 to the free pool, and the commit of the second (R3 → P34) returns P3.
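A hypothetical walk-through of this renaming in Python, with only four registers (P33-P36) initially free; it makes explicit that the last two instructions can be renamed only after commits return P1 and P3 to the free pool.

```python
free = [33, 34, 35, 36]              # only P33-P36 are initially free
spec = {r: r for r in range(1, 33)}  # R1-R32 start in P1-P32

def rename(dst, srcs):
    s = [spec[x] for x in srcs]      # translate sources first
    p = free.pop(0)                  # stalls here if the pool is empty
    spec[dst] = p
    return p, s

print(rename(1, [2, 3]))  # R1 ← R2+R3  gives  P33 ← P2+P3
print(rename(3, [4, 5]))  # R3 ← R4+R5  gives  P34 ← P4+P5
                          # BEQZ R1     gives  BEQZ P33 (no destination)
print(rename(1, [1, 3]))  # R1 ← R1+R3  gives  P35 ← P33+P34
print(rename(1, [1, 3]))  # R1 ← R1+R3  gives  P36 ← P35+P34
free.append(1)            # instr 1 commits: R1's old home P1 is freed
free.append(3)            # instr 2 commits: R3's old home P3 is freed
print(rename(3, [1, 3]))  # R3 ← R1+R3  gives  P1 ← P36+P34
print(rename(4, [3, 1]))  # R4 ← R3+R1  gives  P3 ← P1+P36
```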

Problem 3

Show the renamed version of the following code. Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ?

R1 ← R2+R3        P33 ← P2+P3       cycle i
R1 ← R1+R5        P34 ← P33+P5      i+1
BEQZ R1           BEQZ P34          i+2
R1 ← R4 + R5      P35 ← P4+P5       i
R4 ← R1 + R7      P36 ← P35+P7      i+1
R1 ← R6 + R8      P1 ← P6+P8        j
R4 ← R3 + R1      P33 ← P3+P1       j+1
R1 ← R5 + R9      P34 ← P5+P9       j+2

The issue width is assumed to be 4, so the first four instructions enter the IQ together and the independent fourth instruction can issue in cycle i alongside the first. The value of j depends on the number of stages between issue and commit: the last three instructions cannot even be renamed until earlier instructions commit and free P1, P33, and P34.

Thank you!