Complex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units

Similar documents
CS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 9 Instruction-Level Parallelism Part 2

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

Complex Pipelining. Motivation

ECE 252 / CPS 220 Advanced Computer Architecture I. Lecture 8 Instruction-Level Parallelism Part 1

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

CS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010

Lecture-13 (ROB and Multi-threading) CS422-Spring

Processor: Superscalars Dynamic Scheduling

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

Announcements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 1)

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

Metodologie di Progettazione Hardware-Software

Predict Not Taken. Revisiting Branch Hazard Solutions. Filling the delay slot (e.g., in the compiler) Delayed Branch

Instruction Level Parallelism

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Static vs. Dynamic Scheduling

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Handout 2 ILP: Part B

CS152 Computer Architecture and Engineering. Complex Pipelines

Instruction Level Parallelism. Appendix C and Chapter 3, HP5e

09-1 Multicycle Pipeline Operations

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

Scoreboard information (3 tables) Four stages of scoreboard control

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

Advanced issues in pipelining

C 1. Last time. CSE 490/590 Computer Architecture. Complex Pipelining I. Complex Pipelining: Motivation. Floating-Point Unit (FPU) Floating-Point ISA

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

COSC4201 Instruction Level Parallelism Dynamic Scheduling

Instruction-Level Parallelism and Its Exploitation

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Instruction Level Parallelism

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Review: Evaluating Branch Alternatives. Lecture 3: Introduction to Advanced Pipelining. Review: Evaluating Branch Prediction

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Multi-cycle Instructions in the Pipeline (Floating Point)

Tomasulo s Algorithm

Dynamic Scheduling. CSE471 Susan Eggers 1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

Instruction Level Parallelism. Taken from

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

Reorder Buffer Implementation (Pentium Pro) Reorder Buffer Implementation (Pentium Pro)

Complex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Hardware-based Speculation

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

Reduction of Data Hazards Stalls with Dynamic Scheduling So far we have dealt with data hazards in instruction pipelines by:

E0-243: Computer Architecture

Adapted from David Patterson s slides on graduate computer architecture

Course on Advanced Computer Architectures

CMSC411 Fall 2013 Midterm 2 Solutions

5008: Computer Architecture

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

COSC4201. Prof. Mokhtar Aboelaze York University

Hardware-based Speculation

Graduate Computer Architecture. Chapter 3. Instruction Level Parallelism and Its Dynamic Exploitation

Functional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory

CSE 490/590 Computer Architecture Homework 2

CMCS Mohamed Younis CMCS 611, Advanced Computer Architecture 1

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

The basic structure of a MIPS floating-point unit

Advanced Computer Architecture

Lecture 4: Introduction to Advanced Pipelining

Chapter 4 The Processor 1. Chapter 4D. The Processor

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

ELE 818 * ADVANCED COMPUTER ARCHITECTURES * MIDTERM TEST *

COSC 6385 Computer Architecture - Pipelining (II)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

Super Scalar. Kalyan Basu March 21,

CS 152 Computer Architecture and Engineering

Instruction Level Parallelism (ILP)

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

Instruction Level Parallelism. ILP, Loop level Parallelism Dependences, Hazards Speculation, Branch prediction

EECC551 Exam Review 4 questions out of 6 questions

Execution/Effective address

Advanced Computer Architecture

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Compiler Optimizations. Lecture 7 Overview of Superscalar Techniques. Memory Allocation by Compilers. Compiler Structure. Register allocation

Instruction Level Parallelism

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

NOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism

Lecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ

Multiple Instruction Issue and Hardware Based Speculation

Good luck and have fun!

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

CS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.

Pipeline issues. Pipeline hazard: RaW. Pipeline hazard: RaW. Calcolatori Elettronici e Sistemi Operativi. Hazards. Data hazard.

Announcement. ECE475/ECE4420 Computer Architecture L4: Advanced Issues in Pipelining. Edward Suh Computer Systems Laboratory

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Instruction Pipelining Review

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

Transcription:

6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd GPR s FPR s Fmul Scoreboard: In-order issue, out-of-order completion Register Renaming: Fdiv Out-of-order issue Speculative Execution Page 1

Scoreboard for In-order Issues 6823, L14--3 An instruction is not dispatched by the Issue stage if a RAW or WAW hazard exists or the required FU is busy Busy[unit#] : a bit-vector to indicate unit s availability (unit = Int, Add, Mult, Div) These bits are hardwired to FU s WP[reg#] : a bit-vector to record the registers for which writes are pending Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch FU available? RAW? WAR? WAW? not Busy[FU#] WP[src1] or WP[src2] cannot arise WP[dest] I 2 I 1 Scoreboard Dynamics worksheet Functional Unit Status Int(1) Add(1) Mult(3) Div(4) WB for Writes t0 f6 f6 t1 f2 f6 f6, f2 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Registers Reserved 6823, L14--4 FU available? RAW? WAW? Page 2

I 1 I 2 I 3 I 4 I 5 I 6 Scoreboard Dynamics Functional Unit Status Int(1) Add(1) Mult(3) Div(4) WB for Writes t0 f6 f6 t1 f2 f6 f6, f2 t2 f6 f2 f6, f2 t3 f0 f6 f6, f0 t4 f0 f6 f6, f0 t5 f0 f8 f0, f8 t6 f8 f0 f0, f8 t7 f10 f8 f8, f10 t8 f8 f10 f8, f10 t9 f8 f8 t10 f6 f6 t11 f6 f6 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 Registers Reserved 6823, L14--5 I 2 I 1 I 3 I 5 I 4 I 6 Precise Interrupts 6823, L14--6 It must appear as if an interrupt is taken between two instructions (say I i and I i+1 ) the effect of all instructions up to and including I i is totally complete no effect of any instruction after I i has taken place The interrupt handler either aborts the program or restarts it at I i+1 Page 3

Effect on Interrupts Out-of-order Completion 6823, L14--7 I 1 DIVD f6, f6, f4 I 2 LD f2, 45(r3) I 3 MULTD f0, f2, f4 I 4 DIVD f8, f6, f2 I 5 SUBD f10, f0, f6 I 6 ADDD f6, f8, f2 out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6 restore f2 restore f10 Consider interrupts Precise interrupts are difficult to implement The choice - imprecise interrupts or speculative execution Multiple Instructions in the Issue Stage 6823, L14--8 ALU Mem IF ID Issue WB Fadd Fmul Decode adds an instruction to the buffer only if it is not full and the instruction does not cause a WAR or WAW hazard Any instruction in the Issue buffer whose RAW hazard has been satisfied can be dispatched On a write back (WB) new instructions may get enabled Out-of order dispatch (and completion) Page 4

Out-of-Order Issue: an example 6823, L14--9 latency 1 LD F2, 34(R2) 1 1 2 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 5 DIVD F4, F2, F8 4 4 5 3 6 ADDD F10, F6, F4 1 6 In-order: 1 (2,1) 2 3 4 4 3 5 5 6 6 Out-of-order: 1 (2,1) 4 4 2 3 3 5 5 6 6 Out-of-order did not allow any significant improvement! How many Instructions can be in the pipeline 6823, L14--10 Which features of an ISA limit the number of instructions in the pipeline? Which features of a program limit the number of instructions in the pipeline? Page 5

Overcoming the Lack of Register Names 6823, L14--11 Number of registers in an ISA limits the number of partially executed instructions in complex pipelines Floating Point pipelines often cannot be kept filled with small number of registers IBM 360 had only 8 Floating Point Registers Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility? Robert Tomosulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming Instruction-Level Parallelism with Renaming latency 1 LD F2, 34(R2) 1 6823, L14--12 1 2 2 LD F4, 45(R3) long 3 MULTD F6, F4, F2 3 4 SUBD F8, F2, F2 1 4 X 3 5 DIVD F4, F2, F8 4 5 6 ADDD F10, F6, F4 1 6 In-order: 1 (2,1) 2 3 4 4 (5,3) 5 6 6 Out-of-order: 1 (2,1) 4 4 5 2 (3,5) 3 6 6 Any antidependence can be eliminated by renaming renaming storage Can it be done in hardware? yes! Page 6

Register Renaming 6823, L14--13 ALU Mem IF ID ROB WB Fadd Fmul Decode does register renaming and adds instructions to the reorder buffer (ROB) renaming makes WAR or WAW hazards impossible Any instruction in ROB whose RAW hazard has been satisfied can be dispatched Out-of order or dataflow execution Renaming & Out-of-order Issue An example 6823, L14--14 Renaming table p data F1 F2 F3 F4 F5 F6 F7 F8 Reorder buffer Ins# use exec op p1 src1 p2 src2 t 1 t2 data / t i 1 LD F2, 34(R2) 2 LD F4, 45(R3) 3 MULTD F6, F4, F2 4 SUBD F8, F2, F2 5 DIVD F4, F2, F8 6 ADDD F10, F6, F4 When are names in sources replaced by data? When can a name be reused? Page 7

Data-Driven Execution 6823, L14--15 Renaming table & reg file Reorder buffer Replacing the tag by its value is an expensive operation Ins# use exec op p1 src1 p2 src2 t 1 t2 t n Load Unit FU FU Store Unit < t, result > Instruction template (ie, tag t) is allocated by the Decode stage, which also stores the tag the reg file When an instruction completes, its tag is deallocated 6823, L14--16 Simplifying Allocation/Deallocation Reorder buffer Ins# use exec op p1 src1 p2 src2 ptr 2 next to deallocate t 1 t 2 prt 1 next t n available Instruction buffer is managed circularly When an instruction completes its use bit is marked free ptr 2 is incremented only if the use bit is marked free Page 8

IBM 360/91 Floating Point Unit R M Tomasulo, 1967 6823, L14--17 1 2 3 4 5 6 data load buffers (from storage) instructions 1 2 3 4 p data Floating Point Reg distribute instruction templates by functional units 1 2 3 p data p data Adder 1 2 < t, result > p data p data Mult store buffers (to storage) p data Common bus ensures that data is made available immediately to all the instructions waiting for it Effectiveness? 6823, L14--18 Renaming and Out-of-order execution was first implemented in 1969 in IBM 360/91 but did not show up in the subsequent models until mid Nineties Why? reasons 1 No precise interrupts! 2 Effective on a very small class of programs One more problem needed to be solved Page 9

Effect of Control Transfer on Pipelined Execution 6823, L14--19 Control transfer instructions require insertion of bubbles in the pipeline The number of bubbles depends upon the number of cycles it takes to determine the next instruction address, and to fetch the next instruction Average Run-Length between Branches Average dynamic instruction mix from SPEC92: SPECint92 SPECfp92 ALU 39 % 13 % FPU Add 20 % FPU Mult 13 % load 26 % 23 % store 9 % 9 % branch 16 % 8 % other 10 % 12 % 6823, L14--20 SPECint92: SPECfp92: compress, eqntott, espresso, gcc, li doduc, ear, hydro2d, mdijdp2, su2cor What is the average run length between branches? Page 10

Loss Due to Control Transfers Suppose a branch needed the result of a long latency operation, eg load=1~50, FADD=2, FMUL=4, FDIV=10 How often does each case arise? How far apart are the branch instruction and the instruction that determines the branch condition? 6823, L14--21 Assuming average run length = l (10, 100, 1000) average delay in branch resolution = d (1, 2, 10, 100) maximum pipeline efficiency =? l \ d 1 2 10 100 10 909 833 500 91 100 990 980 909 500 1000 999 998 990 909 6823, L14--22 Reducing Control Transfer Penalties Software solution loop unrolling Increases the run length instruction scheduling Compute the branch condition as early as possible (limited) Hardware solution delay slots replaces pipeline bubbles with useful work (requires software cooperation) branch prediction & speculative execution of instructions beyond the branch Page 11

Branch Prediction and Speculative Execution motivation: without longer run-lengths, pipelines with long latencies cannot be utilized properly 6823, L14--23 speculative execution can increase run length for most programs required hardware support: for prediction: branch target buffers etc for speculative execution: shadow registers and memory buffers for each speculated branch improved prediction by itself does not increase run lengths Speculative Execution 6823, L14--24 next PC Branch Prediction (BTB) update prediction Branch Resolution branch condition Kill Kill Kill PC IF ID IS Ex WB Reg File Mem Buf s Speculated instructions should not update the reg file or memory hold the data in the instruction reorder buffer + no speculative stores Page 12

Shadows and Renaming 6823, L14--25 Two ideas have a lot in common: both require additional resources registers, instruction buffers both require allocation of a resource on instruction issue, and maintenance of a <reg no, tag> map both require deallocation of resources when an instruction completes Differences: the ability to kill an instruction in case of wrong prediction kill free resources occupied by the shadow Extensions for Speculation 6823, L14--26 Instruction reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data ptr 2 next to commit ptr 1 next available add <pd, dest, data> fields in the instruction template commit instructions to reg file and memory in program order buffers can be maintained circularly wrong speculation roll back the next available pointer no speculative stores Page 13

Rollback and Renaming 6823, L14--27 Register File Reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data t 1 t 2 t n Load Unit FU FU FU Store Unit < t, result > Register file does not contain renaming tags any more How does the decode stage find the tag of a source register? Renaming Table 6823, L14--28 Renaming Table r 1 r 2 t i t j Reg File Reorder buffer Ins# use exec op p1 src1 p2 src2 pd dest data t 1 t 2 t n Load Unit FU FU FU Store Unit < t, result > Renaming table is like a cache to speed up register name look up It needs to be loaded after each branch prediction failure! Several renaming tables? Page 14

Temporary Register File 6823, L14--29 The width of the reorder buffer (ROB) can be reduced if data values are stored in a temporary register file instead of ROB This will reduce the width of several datapaths at the expense of extra register fetches while dispatching to functional units Temporary register files does not reduce the need for associative searching of tags when an instruction completes There is tremendous variety in the implementation of ROBs in commercial microprocessors Speculative Loads / Stores 6823, L14--30 Just like register updates, stores should not modify the memory until after the instruction is committed store buffer entry must carry a speculation bit and the tag of the corresponding store instruction If the instruction is committed, the speculation bit of the corresponding store buffer entry is cleared If the instruction is killed, the corresponding store buffer entry is freed Loads work normally -- older store buffer entries needs to be searched before accessing the memory or the cache Page 15