DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011.
Course: ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Outline

Dynamic instruction scheduling:
- Advanced techniques for dynamic branch prediction
- Implementing speculative execution with Tomasulo
- Superscalar processors

Branch Prediction

Dynamic branch prediction using:
- Branch Target Buffer (BTB)
- Branch Prediction Buffer (BPB)
- Branch History Table (BHT)

Branch Prediction: Calculating the jump address

Predicting "branch not taken" is easy: the predicted jump address is simply the next PC. However, loops typically cause many branches to be taken, and even unconditional branches (e.g., function call/return) require knowing the target address. Moving the effective-address calculation to early pipeline stages and using delayed branches can reduce this problem, but these techniques cannot be applied in all cases.

Branch Prediction: Branch Target Buffer (BTB)

Alternative: build a table, at run time, of the target address of each control instruction. To limit memory resources, instead of saving the target address of every instruction, use a cache for the most recent ones: the least-significant bits of the instruction address index the table, the most-significant bits are stored as a tag, and each entry holds the jump address and prediction bits. The larger this cache, the more information can be saved, decreasing branch mispredictions; however, it also means spending more memory.

Where to put the BTB: at the IF stage, so the next instruction can be fetched without stalling the pipeline. On a tag match, a multiplexer selects between the predicted target and PC+4 to form the next PC.
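The lookup/update scheme described above can be sketched as follows. This is a minimal illustration, not the slides' datapath; the table size and word-aligned PC handling are assumptions.

```python
# Minimal direct-mapped BTB sketch: low PC bits index the table,
# the remaining bits form the tag, and a hit supplies the predicted
# target at instruction fetch. Sizes are illustrative assumptions.

INDEX_BITS = 4            # 16-entry table (assumption)
ENTRIES = 1 << INDEX_BITS

btb = [None] * ENTRIES    # each entry: (tag, target) or None

def btb_lookup(pc):
    """Return the predicted target, or None (predict next sequential PC)."""
    tag, index = divmod(pc >> 2, ENTRIES)   # PCs assumed word-aligned
    entry = btb[index]
    if entry is not None and entry[0] == tag:
        return entry[1]
    return None

def btb_update(pc, target):
    """On a taken control instruction, record (tag, target); collisions overwrite."""
    tag, index = divmod(pc >> 2, ENTRIES)
    btb[index] = (tag, target)

btb_update(0x400, 0x480)
assert btb_lookup(0x400) == 0x480   # hit: fetch from predicted target
assert btb_lookup(0x404) is None    # miss: fetch PC + 4
```

A miss simply means "predict not taken"; only taken branches need to populate the table.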

Branch Prediction: Branch prediction buffer

The simplest branch prediction schemes use:
- A 1-bit branch prediction buffer: BPB=1 predicts taken, BPB=0 predicts not taken; the bit is overwritten with the outcome of each branch.
- A 2-bit branch prediction buffer with four states: Strong Predict Taken, Weak Predict Taken, Weak Predict Not Taken, Strong Predict Not Taken. Each taken branch moves the state one step toward Strong Predict Taken and each not-taken branch one step toward Strong Predict Not Taken, so the prediction only flips after two consecutive mispredictions.
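The 2-bit scheme above is a saturating counter; a small sketch (state encoding is an assumption, the transition rules are the ones described):

```python
# 2-bit saturating-counter branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken"; each outcome moves the
# counter one step toward the matching strong state.

class TwoBitPredictor:
    STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

    def __init__(self):
        self.state = self.WEAK_NT   # initial state is an assumption

    def predict(self):
        """True if the branch is predicted taken."""
        return self.state >= self.WEAK_T

    def update(self, taken):
        """Saturating step toward taken (3) or not taken (0)."""
        if taken:
            self.state = min(self.state + 1, self.STRONG_T)
        else:
            self.state = max(self.state - 1, self.STRONG_NT)

# A loop branch: taken 8 times, then not taken once at loop exit.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 8 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
# Only the first iteration and the loop exit mispredict.
```

Note how, unlike a 1-bit buffer, the single not-taken exit does not also cost a misprediction on the next entry into the loop: the counter only steps down to Weak Predict Taken.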

Dynamic Branch Prediction: Correlated branch prediction

The previous schemes consider only a branch's own past behaviour to predict its future behaviour. They work well in typical floating-point algorithms, but not so well in complex algorithms with many control conditions, where many conditional branches are correlated with one another; this typically occurs in programs with integer computation. Example:

if (d==0) d=1;  /* Branch B1 */
if (d==1) ...   /* Branch B2 */

Branch B1 is correlated with branch B2: whenever condition 1 is true, condition 2 is also true.

Dynamic Branch Prediction: Correlated branch prediction

Example (assembly):

      BNE   R1,R0,La    /* S1: d in R1 */
      DADDI R1,R0,#1
La:   DSUBI R2,R1,#1
      BNE   R2,R0,Lb    /* S2 */
Lb:   ...

Iteration | Initial value of d | S1        | S2
1         | 0                  | Not Taken | Not Taken
2         | 2                  | Taken     | Taken

Dynamic Branch Prediction: Correlated (m,n) branch prediction

Dynamic speculation using an (m,n) branch correlation scheme:
- Use an m-bit Branch History Register (BHR), typically implemented as a shift register, to store the outcomes of the m latest branches
- Keep 2^m branch prediction tables, each entry holding an n-bit prediction buffer
- Use the BHR to select which table to use

Dynamic Branch Prediction: Correlated (m,n) branch prediction

Each of the 2^m branch prediction tables (Branch Prediction Table 0 through 2^m - 1) holds n-bit BPB entries and is indexed by the address of the branch instruction; the BHR (an m-bit shift register fed with the result of each branch) selects, via a multiplexer, which table provides the prediction. Examples:
- (0,1) and (0,2) correlation schemes use a single branch prediction table (BPT)
- (1,x) correlation schemes use the outcome of the last branch (taken/not taken) to select which of the 2 tables to use
- A (5,3) scheme uses the outcomes of the five latest branches (5-bit BHR) to select which of the 32 tables to use; each BPT entry holds a 3-bit BPB

Dynamic Branch Prediction: Correlated (m,n) branch prediction

When predicting a branch, use the BHR to select one of the BPTs (e.g., if the current BHR value is 001, table 1 is selected), index the selected table with the address of the branch instruction, and use the n-bit BPB found there to predict the branch result.

Dynamic Branch Prediction: Correlated (m,n) branch prediction

After the branch result (taken/not taken) is known:
- On the BPT that was used to predict the branch, update the prediction buffer (BPB) with the branch result
- Shift the branch result (R) into the BHR

Correlated Branch Prediction: Gselect

A simpler way to implement a correlated (m,n) branch predictor:
- Concatenate the n index bits from the PC with the m history bits of the BHR
- Use the concatenated (n+m)-bit value to index a single, larger BPT of n-bit BPB entries

Correlated Branch Prediction: Gshare

Gshare uses an alternative method: instead of concatenating the index bits from the PC with the BHR, it applies a bitwise XOR between the two, producing an n-bit index into a single large BPT. Gshare has better performance than Gselect, since it achieves a better use of the BPT size.

Correlated Branch Prediction: Tournament predictors

A tournament predictor uses:
- A global predictor (a single BHR shared by all branches)
- A local predictor (a BHR per branch)
It combines the two predictions through a selector based on the recent accuracy of each predictor, aiming to use the right predictor for each branch.

Comparison of branch predictors: accuracy vs. size (SPEC 89)

Speculative Execution

Dynamic instruction scheduling with:
- Tomasulo's algorithm
- Speculative execution

Speculation in Tomasulo: Basic principle

To perform out-of-order execution with speculation, one must be able to roll back to the point where speculation occurred. The same problem occurs with interrupts/exceptions and can be dealt with in the same way. Solution:
- Perform in-order instruction commit, where the commit of an instruction corresponds to its register or memory write
- Allow uncommitted values to be used speculatively

Speculation in Tomasulo: Reorder Buffer (ROB)

To implement the instruction commit stage, add a reorder buffer (ROB):
- After out-of-order execution, the instruction result is stored in the ROB
- Instructions are removed from the ROB and their results committed to the registers/memory in order
- When a branch instruction is found, check whether the prediction was correct: if it was, continue; if it was wrong, remove all other ROB entries and restart execution from the correct address

Speculation in Tomasulo: Reorder Buffer (ROB)

- On issue, instructions are inserted into the ROB
- When the resulting value is written on the CDB, it is copied into the ROB
- Instructions are committed in order, by writing the result to memory or to the register file

(Datapath: the IF and issue stages feed the reservation stations of the functional units — integer ALU, address calculation/memory, FP add, FP multiply, INT/FP divide — whose results are broadcast on the Common Data Bus (CDB) to the reservation stations and the ROB.)

Speculation in Tomasulo: Reorder Buffer (ROB)

ROB fields:
- Instruction type: branch (possible speculation), ST (writes to memory), LD/ALU (writes to a register)
- Destination (register/memory)
- Resulting value
- Execution status (whether the result is ready)

Since the ROB already holds the instruction information:
- Each reservation station has a field indicating the destination ROB entry
- Instructions no longer wait for values using the reservation station ID, but using the ROB entry ID

Speculation in Tomasulo: Issue stage

Check for structural hazards. A structural hazard exists if:
- All reservation stations for the required functional unit are busy, or
- There is no free space in the ROB

If no structural hazard is found:
- Send the instruction to the ROB and assign it a ROB entry ID
- Send the instruction to a reservation station:
  - Write the operand values that are already available, either in the RF or in the ROB
  - Unavailable operands are indexed by the ROB entry ID of the instruction that generates the required result
  - Write the ROB entry ID of the instruction itself

Speculation in Tomasulo: Execute stage

- If an operand is not available, wait for the result to be written on the CDB; it will be tagged with the ROB entry ID of the instruction generating it. When the operand becomes available, copy it to the reservation station.
- When all operands become available, start execution.
- When the result is computed, write the value on the CDB, appending the instruction's ROB entry ID. All reservation stations holding an instruction waiting for that value, and the ROB, read it from the CDB; the register file no longer reads values written on the CDB.

Speculation in Tomasulo: Commit

Commit instructions in order, i.e., when they reach the top of the ROB:
- When an instruction reaches the top of the ROB, check whether it has finished executing; once the result is known, update the registers/memory with that result
- If the instruction at the top of the ROB is a branch, wait until the condition is known, then check whether the prediction matches it; if it does not, clear all other ROB entries and restart execution from the correct instruction address
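The commit discipline described above can be illustrated with a toy ROB: results arrive out of order, but architectural state is only updated at the head of the buffer. All names are assumptions for illustration; this is not the full Tomasulo datapath.

```python
from collections import deque

class ROB:
    """Toy reorder buffer: issue at the tail, commit in order at the head."""

    def __init__(self):
        self.entries = deque()

    def issue(self, dest):
        """Allocate an entry in order; return it as the 'ROB entry ID'."""
        entry = {"dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def writeback(self, entry, value):
        """Result broadcast on the CDB: stored in the ROB, not in the RF."""
        entry["value"] = value
        entry["ready"] = True

    def commit(self, regfile):
        """Retire ready instructions in order from the head of the ROB."""
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]

rob = ROB()
rf = {}
e1 = rob.issue("F0")
e2 = rob.issue("F2")
rob.writeback(e2, 3.5)      # the younger instruction finishes first...
rob.commit(rf)
assert rf == {}             # ...but cannot commit past the unfinished e1
rob.writeback(e1, 1.0)
rob.commit(rf)
assert rf == {"F0": 1.0, "F2": 3.5}
```

Squashing a misprediction would amount to clearing every entry younger than the branch before restarting fetch, which is exactly why uncommitted state is kept only in the ROB.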

Speculation in Tomasulo: Implementation

The ROB is typically implemented as a circular buffer with FIFO access: an instruction is placed in order in the FIFO on issue and removed in order on commit.

Register values:
- In Tomasulo, they can be in the RF or in the reservation stations
- With speculation, the values can also be in the ROB
- Alternatively, all values can be placed in an extended register file; a Register Alias Table (RAT) then maps each architectural register (visible to the programmer) to a physical register

Speculation in Tomasulo: Register Alias Table

Issue stage: rename architectural registers to physical registers by assigning a new physical register to the destination; this solves WAW and WAR hazards.

Simplified commit stage:
- Record that a given register is no longer speculative
- Free the physical register that stored the previous value

Current architectures use a combined RAT+ROB approach.
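The renaming step at issue can be sketched as follows; the free-list size and helper names are illustrative assumptions, and freeing on commit is omitted for brevity.

```python
# Register renaming with a Register Alias Table (RAT): each destination
# architectural register gets a fresh physical register at issue,
# removing WAW and WAR hazards.

free_list = list(range(8))      # free physical registers p0..p7 (assumption)
rat = {}                        # architectural name -> physical register

def rename(dst, srcs):
    """Map sources through the RAT, then give dst a new physical register.

    Unmapped sources (never written since start) keep their architectural
    name, standing in for a read of the committed register file.
    """
    phys_srcs = [rat.get(s, s) for s in srcs]
    p = free_list.pop(0)
    rat[dst] = p
    return p, phys_srcs

# A WAW hazard on R1: the two writes get distinct physical registers,
# so they may complete in any order without conflict.
p_a, _ = rename("R1", ["R2"])
p_b, _ = rename("R1", ["R3"])
assert p_a != p_b
assert rat["R1"] == p_b          # the later write owns the architectural name
```

A later reader of R1 is renamed to `p_b`, so it can never accidentally pick up the older value, which is precisely how the WAR/WAW dependences disappear.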

Superscalar processors

Extending Tomasulo to support multiple instruction issue.

Superscalar processors

Modern superscalar architectures achieve a CPI < 1 by issuing multiple instructions in a single clock cycle (e.g., 0 to 4) for out-of-order execution. How can instructions be combined?

Solution 1: allow any combination of instructions to be issued. This may lead to structural conflicts, with the multiple instructions competing for the same resources.

Superscalar processors

Modern superscalar architectures achieve a CPI < 1 by issuing multiple instructions in a single clock cycle (e.g., 0 to 4) for out-of-order execution. How can instructions be combined?

Solution 2: restrict the combinations of instructions that can be issued simultaneously (a strategy similar to the one used in VLIW processors):
- This simplifies the issue stage by reducing the number of possible hazards in the same clock cycle. For example, a dual-issue processor may only allow one integer and one FP instruction to be issued simultaneously, restricting the hazards to load/store instructions.
- It decreases the maximum instruction-level parallelism (ILP) that can be exploited.

Superscalar processors: Branch prediction is fundamental

In a single-issue processor, a single delay slot can be enough to solve most control hazards, but in multiple-issue processors branch prediction is fundamental. (The slide's pipeline diagram shows a 2-issue pipeline, instructions i to i+5 flowing through IF ID EX ME WB over 7 clock cycles: a conditional branch would require 3 delay slots.)

Superscalar processors: Tomasulo extension

To allow multiple instruction issue, the issue stage must:
- Simultaneously verify the structural hazards for all the instructions being issued
- Update multiple reservation stations and the corresponding control tables (RAT and ROB)

Two possible solutions:
- Develop complex control circuits that perform all operations in a single clock cycle
- Split the issue stage into Issue (cycle 1), which checks for hazards, and Dispatch (cycle 2), which updates the tables

Note that reservation stations can be associated with a single functional unit (FU) or with sets of FUs.

Superscalar processors: Example

Consider the following architecture:
- Support for issuing one INT and one FP operation in each clock cycle (even if there are dependencies)
- Functional units: 2 integer FUs (one for normal operations, another for memory address calculation); 1 pipelined unit for each of FP Add, FP Mult, FP Div
- Latencies: 1 clock cycle (CC) for integer/memory operations; 3 CCs for FP Add
- 2 Common Data Buses (CDBs); forwarding values to the reservation stations takes one clock cycle, which implies starting execution on the following clock cycle
- Dynamic branch predictor (assume an accuracy of 100% for this example)
- Commit of up to 2 instructions per clock cycle

Code:

Cont: L.D    F0,0(R2)
      ADD.D  F2,F0,F1
      S.D    0(R2),F2
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont

Superscalar processors: Example

Filling in the execution table for three iterations of the loop (columns: Fetch, Issue, EX, MEM, Write on CDB, Commit, with fetch, issue and commit in order), note where each hazard appears: a data hazard delays ADD.D, which must wait for the L.D result; a control hazard delays the fetch of each new iteration until the BNE is predicted; and a structural hazard delays issue when the required reservation station or ROB slot is still busy.

Superscalar processors: Example

Iter | Instruction      | Fetch | Issue | EX       | MEM | Write on CDB | Commit
1    | L.D   F0,0(R2)   | 1     | 2     | 3        | 4   | 5            | 6
1    | ADD.D F2,F0,F1   | 1     | 2     | 6,7,8    |     | 9            | 10
1    | S.D   0(R2),F2   | 2     | 3     | 4        |     |              | 10
1    | DSUBI R2,R2,#8   | 2     | 3     | 4        |     | 5            | 11
1    | BNE   R1,R2,Cont | 3     | 4     | 6        |     |              | 11
2    | L.D   F0,0(R2)   | 4     | 5     | 6        | 7   | 8            | 12
2    | ADD.D F2,F0,F1   | 4     | 5     | 9,10,11  |     | 12           | 13
2    | S.D   0(R2),F2   | 5     | 6     | 7        |     |              | 13
2    | DSUBI R2,R2,#8   | 5     | 6     | 7        |     | 8            | 14
2    | BNE   R1,R2,Cont | 6     | 7     | 9        |     |              | 14
3    | L.D   F0,0(R2)   | 7     | 8     | 9        | 10  | 11           | 15
3    | ADD.D F2,F0,F1   | 7     | 8     | 12,13,14 |     | 15           | 16
3    | S.D   0(R2),F2   | 8     | 9     | 10       |     |              | 16
3    | DSUBI R2,R2,#8   | 8     | 9     | 10       |     | 11           | 17
3    | BNE   R1,R2,Cont | 9     | 10    | 12       |     |              | 17

(Fetch, Issue and Commit are in order. The table exhibits the data hazard on ADD.D, the control hazard on each BNE, and the structural hazards on issue.)

Next lesson

Exercises