Out of Order Processing

Similar documents
Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Lecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

Lecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )

Lecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

Lecture: Out-of-order Processors

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars


Hardware-based Speculation

Dynamic Scheduling. CSE471 Susan Eggers 1

Handout 2 ILP: Part B

Hardware-based Speculation

EE382A Lecture 7: Dynamic Scheduling. Department of Electrical Engineering Stanford University

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

Hardware-Based Speculation

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections , )

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

Lecture 4: Advanced Pipelines. Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

CS 152, Spring 2011 Section 8

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Instruction Level Parallelism

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Advanced Computer Architecture

Advanced issues in pipelining

Static vs. Dynamic Scheduling

Copyright 2012, Elsevier Inc. All rights reserved.

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Computer Systems Architecture I. CSE 560M Lecture 10 Prof. Patrick Crowley

LSU EE 4720 Dynamic Scheduling Study Guide Fall David M. Koppelman. 1.1 Introduction. 1.2 Summary of Dynamic Scheduling Method 3

Static & Dynamic Instruction Scheduling

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

5008: Computer Architecture

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Three Dynamic Scheduling Methods

CS 152 Computer Architecture and Engineering

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Lecture: Pipeline Wrap-Up and Static ILP

Lecture-13 (ROB and Multi-threading) CS422-Spring

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Chapter 4 The Processor 1. Chapter 4D. The Processor

The Processor Pipeline. Chapter 4, Patterson and Hennessy, 4ed. Section 5.3, 5.4: J P Hayes.

Chapter 7. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 7 <1>

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Computer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Instruction Pipelining Review

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Multiple Instruction Issue and Hardware Based Speculation

Lecture 19: Instruction Level Parallelism

Adapted from David Patterson s slides on graduate computer architecture

EECC551 Exam Review 4 questions out of 6 questions

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects

Getting CPI under 1: Outline

Lecture 2: Pipelining Basics. Today: chapter 1 wrap-up, basic pipelining implementation (Sections A.1 - A.4)

Lecture 7: Pipelining Contd. More pipelining complications: Interrupts and Exceptions

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

EE 660: Computer Architecture Superscalar Techniques

Multi-cycle Instructions in the Pipeline (Floating Point)

COSC4201 Instruction Level Parallelism Dynamic Scheduling

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

E0-243: Computer Architecture

The Processor: Instruction-Level Parallelism

Four Steps of Speculative Tomasulo cycle 0

Instruction-Level Parallelism (ILP)

This Set. Scheduling and Dynamic Execution Definitions From various parts of Chapter 4. Description of Two Dynamic Scheduling Methods

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

ILP: Instruction Level Parallelism

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

6.823 Computer System Architecture

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 14 Very Long Instruction Word Machines

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

Instruction-Level Parallelism and Its Exploitation

Advanced Computer Architecture

Tutorial 11. Final Exam Review

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

Basic Pipelining Concepts

CS152 Computer Architecture and Engineering. Complex Pipelines

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)

Instruction Level Parallelism

COMPUTER ORGANIZATION AND DESI

Chapter 06: Instruction Pipelining and Parallel Processing

ECE 571 Advanced Microprocessor-Based Design Lecture 4

Transcription:

Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative Approach; John Hennessy, David A. Patterson (Book)

A Simple Pipeline IF D/R ALU DM RW Assumptions? All stages take a the exact same amount of time Advantages? Not all instructions same amount of time; simpler ones can finish faster Disadvantages Later instructions can finish earlier

Multi Cycle Instructions Multiple parallel pipelines each pipeline can have a different number of stages Instructions can now complete out of order must make sure that writes to a register happen in the correct order

Effects of Multicycle Instructions Potentially multiple writes to the register file Frequent RAW hazards WAW hazards also possible WAR/RAR? Imprecise exception support

Recap Multicycle Instructions

Dealing With Effects of Multicycle Instructions Multiple Writes to the register file (structural hazard) Increase number of ports Stall one of the writers during its decode (ID) stage WAW hazards Detect hazards during ID; stall later instruction Imprecise Exceptions Buffer results for instructions completing early (ROB) need to make sure to return to the same state where it was left

Exceptions Exceptional circumstances where normal instruction execution order is changed Causes I/O device request OS service from user program Integer/FP arithmetic overflow Misaligned memory access Page fault Hardware/power failure Interrupt/Fault/Exception used interchangeably

Dealing with Exceptions

Issue Queue (IQ) An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 T1 T2 T3 T4 T5 T6 T1 R1+R2 T2 T1+R3 BEQZ T2 T4 T1+T2 T5 T4+T2 Register File R1-R32 ALU ALU ALU Results written to ROB and tags broadcast to IQ

Design Details Instructions enter the pipeline in order No need for branch delay slots if prediction happens in time Instructions leave the pipeline in order all instructions that enter also get placed in the ROB the process of an instruction leaving the ROB (in order) is called commit an instruction commits only if it and all instructions before it have completed successfully (without an exception) A result is written into the register file only when the instruction commits until then, the result is saved in a temporary register in the ROB

Design Details Instructions get renamed and placed in the issue queue some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6) As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue instructions now know if their operands are ready When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution) Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided

Design Details If instr-3 raises an exception, wait until it reaches the top of the ROB at this point, R1-R32 contain results for all instructions up to instr-3 save registers, save PC of instr-3, and service the exception If branch is a mispredict, flush all instructions after the branch and start on the correct path mispredicted instructions will not have updated registers (the branch cannot commit until it has completed and the flush happens as soon as the branch completes) Potential problems:?

Issue Queue (IQ) Recap OoO, Register Renaming Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 T1 T2 T3 T4 T5 T6 T1 R1+R2 T2 T1+R3 BEQZ T2 T4 T1+T2 T5 T4+T2 Register File R1-R32 ALU ALU ALU Results written to ROB and tags broadcast to IQ

Out of Order Implementation Rename Registers Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 T1 T2 T3 T4 T5 T6 T1 R1+R2 T2 T1+R3 BEQZ T2 T4 T1+T2 T5 T4+T2 Register File R1-R32 Logical /Architected Registers ALU ALU ALU Results written to ROB and tags broadcast to IQ Issue Queue (IQ)

Managing Register Names Temporary values are stored in the register file and not the ROB Logical Registers R1-R32 Physical Registers P1-P64 At the start, R1-R32 can be found in P1-P32 Instructions stop entering the pipeline when P64 is assigned R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 P33 P1+P2 P34 P33+P3 BEQZ P34 P35 P33+P34 What happens on commit?

The Commit Process On commit, no copy is required The register map table is updated the committed value of R1 is now in P33 and not P1 on an exception, P33 is copied to memory and not P1 An instruction in the issue queue need not modify its input operand when the producer commits When instruction-1 commits, we no longer have any use for P1 it is put in a free pool and a new instruction can now enter the pipeline for every instr that commits, a new instr can enter the pipeline number of in-flight instrs is a constant = number of extra (rename) registers

The Alpha 21264 Out-of-Order Implementation Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 Instr Fetch Queue Decode & Rename Speculative Reg Map R1 P36 R2 P34 Reorder Buffer (ROB) Instr 1 Committed Instr 2 Reg Map Instr 3 R1 P1 Instr 4 R2 P2 Instr 5 Instr 6 P33 P1+P2 P34 P33+P3 BEQZ P34 P35 P33+P34 P36 P35+P34 Issue Queue (IQ) Register File P1-P64 ALU ALU ALU Results written to regfile and tags broadcast to IQ

More Details When does the decode stage stall? When we either run out of registers, or ROB entries, or issue queue entries What type of instructions are present in the ROB? When does the ROB stall? Issue width: the number of instructions handled by each stage in a cycle. High issue width high peak ILP Window size: the number of in-flight instructions in the pipeline. Large window size high ILP No more WAR and WAW hazards because of rename registers must only worry about RAW hazards

OoO Structures in Intel Processors Intel Skylake (Client) https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)

Waking up Dependent Instructions Inorder pipelines an instruction can leave the decode stage based on the knowledge of when the input will be available, not when it is computed In OoO pipelines, instruction can leave the issue queue (woken up) before its inputs are know, based on knowledge of when they will be available wakeup is speculative based on expected latency of producer

Loads and Stores St R1 [R2] R3 [R4] R5 [R6] R7 [R8] R9 [R10] What if the issue queue also had load/store instructions? Can we continue executing instructions out-of-order?

Memory Dependence Checking St St 0x abcdef 0x abcdef 0x abcd00 0x abc000 0x abcd00 The issue queue checks for register dependences and executes instructions as soon as registers are ready Loads/stores access memory as well must check for hazards for memory as well Hence, first check for register dependences to compute effective addresses; then check for memory dependences

Memory Dependence Checking St St 0x abcdef 0x abcdef 0x abcd00 0x abc000 0x abcd00 Load and store addresses are maintained in program order in the Load/Store Queue (LSQ) Loads can issue if they are guaranteed to not have true dependences with earlier stores Stores can issue only if we are ready to modify memory (can not recover if an earlier instruction raises an exception)

LSQs Intel Processors https://www.realworldtech.com/haswell-cpu/5/

The Alpha 21264 Out-of-Order Implementation Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3 R1+R2 R1 R3+R2 LD R4 8[R3] ST R4 8[R1] Instr Fetch Queue Decode & Rename Speculative Reg Map R1 P36 R2 P34 Reorder Buffer (ROB) Instr 1 Committed Instr 2 Reg Map Instr 3 R1 P1 Instr 4 R2 P2 Instr 5 Instr 6 Instr 7 P33 P1+P2 P34 P33+P3 BEQZ P34 P35 P33+P34 P36 P35+P34 P37 8[P35] P37 8[P36] Issue Queue (IQ) P37 [P35 + 8] P37 [P36 + 8] LSQ Register File P1-P64 ALU ALU ALU Results written to regfile and tags broadcast to IQ ALU D-Cache