DYNAMIC AND SPECULATIVE INSTRUCTION SCHEDULING

Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011.
Course: ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Outline

Dynamic instruction scheduling:
- Advanced techniques for dynamic branch prediction
- Implementing speculative execution with Tomasulo
- Superscalar processors

Branch Prediction

Dynamic branch prediction using:
- Branch Target Buffer (BTB)
- Branch Prediction Buffer (BPB)
- Branch History Table (BHT)

Branch Prediction: Calculating the jump address

Predicting "branch not taken" is easy: the predicted jump address is simply the next PC. However, loops typically cause many branches to be taken, and even unconditional branches (e.g., function call/return) require knowing the target address. Moving the effective-address calculation to early pipeline stages and using delayed branches can reduce this problem, but these techniques cannot be applied in all cases.

Branch Prediction: Branch Target Buffer (BTB)

Alternative: build a table, at run time, of the target address of each control instruction. To limit memory resources, instead of saving the target address of every instruction, use a cache for the most recent ones: the least-significant bits of the instruction address index the table, the most-significant bits are stored as a tag, and each entry holds the jump address and prediction bits. The larger this cache, the more information can be saved, decreasing branch mispredictions; however, it also means spending more memory.

Where to put the BTB: at the IF stage, so the next instruction can be fetched without stalling the pipeline. On a tag match, a multiplexer selects between the predicted target and PC+4 to form the next PC.
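The lookup/update scheme described above can be sketched as follows. This is a minimal illustration, not the slides' datapath; the table size and word-aligned PC handling are assumptions.

```python
# Minimal direct-mapped BTB sketch: low PC bits index the table,
# the remaining bits form the tag, and a hit supplies the predicted
# target at instruction fetch. Sizes are illustrative assumptions.

INDEX_BITS = 4            # 16-entry table (assumption)
ENTRIES = 1 << INDEX_BITS

btb = [None] * ENTRIES    # each entry: (tag, target) or None

def btb_lookup(pc):
    """Return the predicted target, or None (predict next sequential PC)."""
    tag, index = divmod(pc >> 2, ENTRIES)   # PCs assumed word-aligned
    entry = btb[index]
    if entry is not None and entry[0] == tag:
        return entry[1]
    return None

def btb_update(pc, target):
    """On a taken control instruction, record (tag, target); collisions overwrite."""
    tag, index = divmod(pc >> 2, ENTRIES)
    btb[index] = (tag, target)

btb_update(0x400, 0x480)
assert btb_lookup(0x400) == 0x480   # hit: fetch from predicted target
assert btb_lookup(0x404) is None    # miss: fetch PC + 4
```

A miss simply means "predict not taken"; only taken branches need to populate the table.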

Branch Prediction: Branch prediction buffer

The simplest branch prediction schemes use:
- A 1-bit branch prediction buffer: BPB=1 predicts taken, BPB=0 predicts not taken; the bit is overwritten with the outcome of each branch.
- A 2-bit branch prediction buffer with four states: Strong Predict Taken, Weak Predict Taken, Weak Predict Not Taken, Strong Predict Not Taken. Each taken branch moves the state one step toward Strong Predict Taken and each not-taken branch one step toward Strong Predict Not Taken, so the prediction only flips after two consecutive mispredictions.
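The 2-bit scheme above is a saturating counter; a small sketch (state encoding is an assumption, the transition rules are the ones described):

```python
# 2-bit saturating-counter branch predictor: states 0-1 predict
# "not taken", states 2-3 predict "taken"; each outcome moves the
# counter one step toward the matching strong state.

class TwoBitPredictor:
    STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

    def __init__(self):
        self.state = self.WEAK_NT   # initial state is an assumption

    def predict(self):
        """True if the branch is predicted taken."""
        return self.state >= self.WEAK_T

    def update(self, taken):
        """Saturating step toward taken (3) or not taken (0)."""
        if taken:
            self.state = min(self.state + 1, self.STRONG_T)
        else:
            self.state = max(self.state - 1, self.STRONG_NT)

# A loop branch: taken 8 times, then not taken once at loop exit.
p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 8 + [False]:
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
# Only the first iteration and the loop exit mispredict.
```

Note how, unlike a 1-bit buffer, the single not-taken exit does not also cost a misprediction on the next entry into the loop: the counter only steps down to Weak Predict Taken.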

Dynamic Branch Prediction: Correlated branch prediction

The previous schemes consider only a branch's own past behaviour to predict its future behaviour. They work well in typical floating-point algorithms, but not so well in complex algorithms with many control conditions, where many conditional branches are correlated with one another; this typically occurs in programs with integer computation. Example:

if (d==0) d=1;  /* Branch B1 */
if (d==1) ...   /* Branch B2 */

Branch B1 is correlated with branch B2: whenever condition 1 is true, condition 2 is also true.

Dynamic Branch Prediction: Correlated branch prediction

Example (assembly):

      BNE   R1,R0,La    /* S1: d in R1 */
      DADDI R1,R0,#1
La:   DSUBI R2,R1,#1
      BNE   R2,R0,Lb    /* S2 */
Lb:   ...

Iteration | Initial value of d | S1        | S2
1         | 0                  | Not Taken | Not Taken
2         | 2                  | Taken     | Taken

Dynamic Branch Prediction: Correlated (m,n) branch prediction

Dynamic speculation using an (m,n) branch correlation scheme:
- Use an m-bit Branch History Register (BHR), typically implemented as a shift register, to store the outcomes of the m latest branches
- Keep 2^m branch prediction tables, each entry holding an n-bit prediction buffer
- Use the BHR to select which table to use

Dynamic Branch Prediction: Correlated (m,n) branch prediction

Each of the 2^m branch prediction tables (Branch Prediction Table 0 through 2^m - 1) holds n-bit BPB entries and is indexed by the address of the branch instruction; the BHR (an m-bit shift register fed with the result of each branch) selects, via a multiplexer, which table provides the prediction. Examples:
- (0,1) and (0,2) correlation schemes use a single branch prediction table (BPT)
- (1,x) correlation schemes use the outcome of the last branch (taken/not taken) to select which of the 2 tables to use
- A (5,3) scheme uses the outcomes of the five latest branches (5-bit BHR) to select which of the 32 tables to use; each BPT entry holds a 3-bit BPB

Dynamic Branch Prediction: Correlated (m,n) branch prediction

When predicting a branch, use the BHR to select one of the BPTs (e.g., if the current BHR value is 001, table 1 is selected), index the selected table with the address of the branch instruction, and use the n-bit BPB found there to predict the branch result.

Dynamic Branch Prediction: Correlated (m,n) branch prediction

After the branch result (taken/not taken) is known:
- On the BPT that was used to predict the branch, update the prediction buffer (BPB) with the branch result
- Shift the branch result (R) into the BHR

Correlated Branch Prediction: Gselect

A simpler way to implement a correlated (m,n) branch predictor:
- Concatenate the n index bits from the PC with the m history bits of the BHR
- Use the concatenated (n+m)-bit value to index a single, larger BPT of n-bit BPB entries

Correlated Branch Prediction: Gshare

Gshare uses an alternative method: instead of concatenating the index bits from the PC with the BHR, it applies a bitwise XOR between the two, producing an n-bit index into a single large BPT. Gshare has better performance than Gselect, since it achieves a better use of the BPT size.

Correlated Branch Prediction: Tournament predictors

A tournament predictor uses:
- A global predictor (a single BHR shared by all branches)
- A local predictor (a BHR per branch)
It combines the two predictions through a selector based on the recent accuracy of each predictor, aiming to use the right predictor for each branch.

Comparison of branch predictors: accuracy vs. size (SPEC 89)

Speculative Execution

Dynamic instruction scheduling with:
- Tomasulo's algorithm
- Speculative execution

Speculation in Tomasulo: Basic principle

To perform out-of-order execution with speculation, one must be able to roll back to the point where speculation occurred. The same problem occurs with interrupts/exceptions and can be dealt with in the same way. Solution:
- Perform in-order instruction commit, where the commit of an instruction corresponds to its register or memory write
- Allow uncommitted values to be used speculatively

Speculation in Tomasulo: Reorder Buffer (ROB)

To implement the instruction commit stage, add a reorder buffer (ROB):
- After out-of-order execution, the instruction result is stored in the ROB
- Instructions are removed from the ROB and their results committed to the registers/memory in order
- When a branch instruction is found, check whether the prediction was correct: if it was, continue; if it was wrong, remove all other ROB entries and restart execution from the correct address

Speculation in Tomasulo: Reorder Buffer (ROB)

- On issue, instructions are inserted into the ROB
- When the resulting value is written on the CDB, it is copied into the ROB
- Instructions are committed in order, by writing the result to memory or to the register file

(Datapath: the IF and issue stages feed the reservation stations of the functional units — integer ALU, address calculation/memory, FP add, FP multiply, INT/FP divide — whose results are broadcast on the Common Data Bus (CDB) to the reservation stations and the ROB.)

Speculation in Tomasulo: Reorder Buffer (ROB)

ROB fields:
- Instruction type: branch (possible speculation), ST (writes to memory), LD/ALU (writes to a register)
- Destination (register/memory)
- Resulting value
- Execution status (whether the result is ready)

Since the ROB already holds the instruction information:
- Each reservation station has a field indicating the destination ROB entry
- Instructions no longer wait for values using the reservation station ID, but using the ROB entry ID

Speculation in Tomasulo: Issue stage

Check for structural hazards. A structural hazard exists if:
- All reservation stations for the required functional unit are busy, or
- There is no free space in the ROB

If no structural hazard is found:
- Send the instruction to the ROB and assign it a ROB entry ID
- Send the instruction to a reservation station:
  - Write the operand values that are already available, either in the RF or in the ROB
  - Unavailable operands are indexed by the ROB entry ID of the instruction that generates the required result
  - Write the ROB entry ID of the instruction itself

Speculation in Tomasulo: Execute stage

- If an operand is not available, wait for the result to be written on the CDB; it will be tagged with the ROB entry ID of the instruction generating it. When the operand becomes available, copy it to the reservation station.
- When all operands become available, start execution.
- When the result is computed, write the value on the CDB, appending the instruction's ROB entry ID. All reservation stations holding an instruction waiting for that value, and the ROB, read it from the CDB; the register file no longer reads values written on the CDB.

Speculation in Tomasulo: Commit

Commit instructions in order, i.e., when they reach the top of the ROB:
- When an instruction reaches the top of the ROB, check whether it has finished executing; once the result is known, update the registers/memory with that result
- If the instruction at the top of the ROB is a branch, wait until the condition is known, then check whether the prediction matches it; if it does not, clear all other ROB entries and restart execution from the correct instruction address
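The commit discipline described above can be illustrated with a toy ROB: results arrive out of order, but architectural state is only updated at the head of the buffer. All names are assumptions for illustration; this is not the full Tomasulo datapath.

```python
from collections import deque

class ROB:
    """Toy reorder buffer: issue at the tail, commit in order at the head."""

    def __init__(self):
        self.entries = deque()

    def issue(self, dest):
        """Allocate an entry in order; return it as the 'ROB entry ID'."""
        entry = {"dest": dest, "value": None, "ready": False}
        self.entries.append(entry)
        return entry

    def writeback(self, entry, value):
        """Result broadcast on the CDB: stored in the ROB, not in the RF."""
        entry["value"] = value
        entry["ready"] = True

    def commit(self, regfile):
        """Retire ready instructions in order from the head of the ROB."""
        while self.entries and self.entries[0]["ready"]:
            e = self.entries.popleft()
            regfile[e["dest"]] = e["value"]

rob = ROB()
rf = {}
e1 = rob.issue("F0")
e2 = rob.issue("F2")
rob.writeback(e2, 3.5)      # the younger instruction finishes first...
rob.commit(rf)
assert rf == {}             # ...but cannot commit past the unfinished e1
rob.writeback(e1, 1.0)
rob.commit(rf)
assert rf == {"F0": 1.0, "F2": 3.5}
```

Squashing a misprediction would amount to clearing every entry younger than the branch before restarting fetch, which is exactly why uncommitted state is kept only in the ROB.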

Speculation in Tomasulo: Implementation

The ROB is typically implemented as a circular buffer with FIFO access: an instruction is placed in order in the FIFO on issue and removed in order on commit.

Register values:
- In Tomasulo, they can be in the RF or in the reservation stations
- With speculation, the values can also be in the ROB
- Alternatively, all values can be placed in an extended register file; a Register Alias Table (RAT) then maps each architectural register (visible to the programmer) to a physical register

Speculation in Tomasulo: Register Alias Table

Issue stage: rename architectural registers to physical registers by assigning a new physical register to the destination; this solves WAW and WAR hazards.

Simplified commit stage:
- Record that a given register is no longer speculative
- Free the physical register that stored the previous value

Current architectures use a combined RAT+ROB approach.
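The renaming step at issue can be sketched as follows; the free-list size and helper names are illustrative assumptions, and freeing on commit is omitted for brevity.

```python
# Register renaming with a Register Alias Table (RAT): each destination
# architectural register gets a fresh physical register at issue,
# removing WAW and WAR hazards.

free_list = list(range(8))      # free physical registers p0..p7 (assumption)
rat = {}                        # architectural name -> physical register

def rename(dst, srcs):
    """Map sources through the RAT, then give dst a new physical register.

    Unmapped sources (never written since start) keep their architectural
    name, standing in for a read of the committed register file.
    """
    phys_srcs = [rat.get(s, s) for s in srcs]
    p = free_list.pop(0)
    rat[dst] = p
    return p, phys_srcs

# A WAW hazard on R1: the two writes get distinct physical registers,
# so they may complete in any order without conflict.
p_a, _ = rename("R1", ["R2"])
p_b, _ = rename("R1", ["R3"])
assert p_a != p_b
assert rat["R1"] == p_b          # the later write owns the architectural name
```

A later reader of R1 is renamed to `p_b`, so it can never accidentally pick up the older value, which is precisely how the WAR/WAW dependences disappear.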

Superscalar processors

Extending Tomasulo to support multiple instruction issue.

Superscalar processors

Modern superscalar architectures achieve a CPI < 1 by issuing multiple instructions in a single clock cycle (e.g., 0 to 4) for out-of-order execution. How can instructions be combined?

Solution 1: allow any combination of instructions to be issued. This may lead to structural conflicts, with the multiple instructions competing for the same resources.

Superscalar processors

Modern superscalar architectures achieve a CPI < 1 by issuing multiple instructions in a single clock cycle (e.g., 0 to 4) for out-of-order execution. How can instructions be combined?

Solution 2: restrict the combinations of instructions that can be issued simultaneously (a strategy similar to the one used in VLIW processors):
- This simplifies the issue stage by reducing the number of possible hazards in the same clock cycle. For example, a dual-issue processor may only allow one integer and one FP instruction to be issued simultaneously, restricting the hazards to load/store instructions.
- It decreases the maximum instruction-level parallelism (ILP) that can be exploited.

Superscalar processors: Branch prediction is fundamental

In a single-issue processor, a single delay slot can be enough to solve most control hazards, but in multiple-issue processors branch prediction is fundamental. (The slide's pipeline diagram shows a 2-issue pipeline, instructions i to i+5 flowing through IF ID EX ME WB over 7 clock cycles: a conditional branch would require 3 delay slots.)

Superscalar processors: Tomasulo extension

To allow multiple instruction issue, the issue stage must:
- Simultaneously verify the structural hazards for all the instructions being issued
- Update multiple reservation stations and the corresponding control tables (RAT and ROB)

Two possible solutions:
- Develop complex control circuits that perform all operations in a single clock cycle
- Split the issue stage into Issue (cycle 1), which checks for hazards, and Dispatch (cycle 2), which updates the tables

Note that reservation stations can be associated with a single functional unit (FU) or with sets of FUs.

Superscalar processors: Example

Consider the following architecture:
- Support for issuing one INT and one FP operation in each clock cycle (even if there are dependencies)
- Functional units: 2 integer FUs (one for normal operations, another for memory address calculation); 1 pipelined unit for each of FP Add, FP Mult, FP Div
- Latencies: 1 clock cycle (CC) for integer/memory operations; 3 CCs for FP Add
- 2 Common Data Buses (CDBs); forwarding values to the reservation stations takes one clock cycle, which implies starting execution on the following clock cycle
- Dynamic branch predictor (assume an accuracy of 100% for this example)
- Commit of up to 2 instructions per clock cycle

Code:

Cont: L.D    F0,0(R2)
      ADD.D  F2,F0,F1
      S.D    0(R2),F2
      DSUBI  R2,R2,#8
      BNE    R2,R1,Cont

Superscalar processors: Example

Filling in the execution table for three iterations of the loop (columns: Fetch, Issue, EX, MEM, Write on CDB, Commit, with fetch, issue and commit in order), note where each hazard appears: a data hazard delays ADD.D, which must wait for the L.D result; a control hazard delays the fetch of each new iteration until the BNE is predicted; and a structural hazard delays issue when the required reservation station or ROB slot is still busy.

Superscalar processors: Example

Iter | Instruction      | Fetch | Issue | EX       | MEM | Write on CDB | Commit
1    | L.D   F0,0(R2)   | 1     | 2     | 3        | 4   | 5            | 6
1    | ADD.D F2,F0,F1   | 1     | 2     | 6,7,8    |     | 9            | 10
1    | S.D   0(R2),F2   | 2     | 3     | 4        |     |              | 10
1    | DSUBI R2,R2,#8   | 2     | 3     | 4        |     | 5            | 11
1    | BNE   R1,R2,Cont | 3     | 4     | 6        |     |              | 11
2    | L.D   F0,0(R2)   | 4     | 5     | 6        | 7   | 8            | 12
2    | ADD.D F2,F0,F1   | 4     | 5     | 9,10,11  |     | 12           | 13
2    | S.D   0(R2),F2   | 5     | 6     | 7        |     |              | 13
2    | DSUBI R2,R2,#8   | 5     | 6     | 7        |     | 8            | 14
2    | BNE   R1,R2,Cont | 6     | 7     | 9        |     |              | 14
3    | L.D   F0,0(R2)   | 7     | 8     | 9        | 10  | 11           | 15
3    | ADD.D F2,F0,F1   | 7     | 8     | 12,13,14 |     | 15           | 16
3    | S.D   0(R2),F2   | 8     | 9     | 10       |     |              | 16
3    | DSUBI R2,R2,#8   | 8     | 9     | 10       |     | 11           | 17
3    | BNE   R1,R2,Cont | 9     | 10    | 12       |     |              | 17

(Fetch, Issue and Commit are in order. The table exhibits the data hazard on ADD.D, the control hazard on each BNE, and the structural hazards on issue.)

Next lesson

Exercises