DYNAMIC SPECULATIVE EXECUTION
Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Outline
- Dynamic instruction scheduling: revision of the Tomasulo algorithm
- Loop unrolling with Tomasulo
- Introduction to dynamic branch prediction
Tomasulo algorithm
Proposed by Robert Tomasulo in 1966:
- Initially proposed to overcome the long latencies of memory accesses and floating-point operations
- First implemented on the IBM 360/91
The algorithm proved to be far more powerful than anticipated, and it is used in all modern superscalar processors.
Tomasulo's algorithm: general idea
- Instructions are issued to reservation stations associated with functional units.
- Operands that are ready are copied directly to the reservation station.
- Operands that are unavailable force the instruction to wait at the reservation station.
- Instructions no longer wait for the value in the register, but for the completion of the instruction on a given reservation station.
[Figure: pipeline with IF and ISSUE stages, a register file, address calculation + memory with load buffers L1-L4, an INT ALU with stations I1-I4 (FU 2), an FP adder with A1-A4 (FU 3), an FP multiplier with M1-M4 (FU 4), and an INT/FP divider with D1-D2 (FU 5), all writing results over a Common Data Bus (CDB).]
Tomasulo's algorithm: execute stage
1. Reservation stations and the register file (RF) snoop writes to the common data bus (CDB).
2. If a value required by a reservation station or by the RF is written on the CDB, it is copied from the CDB.
[Figure: the same pipeline diagram, highlighting the write of a result from the instruction on reservation station D2 onto the CDB.]
Tomasulo's algorithm: reservation stations
Information in each reservation station:
- Busy: station availability
- Op: operation to execute
- Vj, Vk: values of operands j, k (valid if the operands are ready)
- Qj, Qk: readiness of operands j, k (the label of the reservation station holding the instruction that will generate the result)
Load/store buffers have an additional field A, used to store the immediate and later the effective load/store address (for indexed loads/stores, e.g., M[R[AA] + Imm] <- R[BA]).
Information on registers (integer R0..Rn and FP F0..Fn):
- Data: the register value
- Q: a readiness label; each register is marked as ready (Q = 0) or not ready (Q holds the label of the reservation station with the instruction that generates the value)
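The fields above can be sketched in a few lines of Python. This is a minimal illustrative model, not a full Tomasulo implementation: the class and field names mirror the slide, and the CDB broadcast is modeled as a simple loop over stations.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """One Tomasulo reservation-station entry (field names follow the slides)."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid once ready
    vk: Optional[float] = None
    qj: Optional[str] = None     # label of producing station, None = ready
    qk: Optional[str] = None

    def snoop(self, tag: str, value: float) -> None:
        """Snoop a CDB write: capture the value if we were waiting on `tag`."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

    def ready(self) -> bool:
        # An instruction may start executing once both operands are valid.
        return self.busy and self.qj is None and self.qk is None

def broadcast(stations, tag, value):
    """A write-result on the CDB reaches every station (and the RF)."""
    for s in stations:
        s.snoop(tag, value)
```

For example, a MULT.D waiting on a load in buffer L1 becomes ready the moment L1's result is broadcast.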
Tomasulo's algorithm: example
Consider the execution of the instructions below on a processor with:
- Pipelined functional units: 1x integer ALU (1 cycle latency), 1x FP multiplier (4 cycles latency), 1x FP adder/subtractor (3 cycles latency), 1x INT/FP divider (20 cycles latency)
- Load/store unit latency: effective address calculation 1 cycle; level 1 cache 3 cycles; level 2 cache 5 cycles; level 3 cache 12 cycles; main memory 50 cycles
- Reservation stations: 3 load + 3 store buffers, 2 slots for integer operations, 2 slots for FP multiplication/division, 2 slots for FP addition/subtraction

      LWI    R1,#V1
      LWI    R2,#V1+Len(V1)
Cont: L.D    F0,0(R1)
      MULT.D F4,F0,F2
      S.D    0(R1),F4
      DSUBI  R1,R1,#8
      BNE    R1,R2,Cont
Dynamic scheduling with Tomasulo: execution example (initial state)
[Table snapshot: reservation stations I1-I2 (integer), M1-M2 (FP mult/div), A1-A2 (FP add/sub), with fields Busy, Op, Vj, Vk, Qj, Qk; load buffers L1-L3 and store buffers S1-S3, with an extra field A; instruction status (not required, used for illustration only) for L.D F0,0(R1) / MULT.D F4,F0,F2 / S.D 0(R1),F4 / DSUBI R1,R1,#8 / BNE R1,R2,Cont, with ISSUE/EX/WB columns; register status for R1, F0, F2, F4. Everything starts empty/ready.]
Dynamic scheduling with Tomasulo: execution example after cycle 1
[Snapshot: L.D F0,0(R1) issues in cycle 1 to load buffer L1 (Busy = Yes, Vj = R1); register status marks F0 as waiting on L1.]
Dynamic scheduling with Tomasulo: execution example after cycle 2
[Snapshot: MULT.D F4,F0,F2 issues in cycle 2 to M1 (Vk = F2, Qj = L1); the load computes its effective address (A = R1+0); register status: F0 waits on L1, F4 waits on M1.]
Dynamic scheduling with Tomasulo: execution example after cycle 3
[Snapshot: S.D 0(R1),F4 issues in cycle 3 to store buffer S1 (Vj = R1, Qk = M1, A = 0); the load performs L1-cache access cycle 1.]
Dynamic scheduling with Tomasulo: execution example after cycle 4
[Snapshot: DSUBI R1,R1,#8 issues in cycle 4 to I1 (Vj = R1, Vk = #8); the load is in L1-cache access cycle 2; store buffer S1 gets its effective address (A = R1+0); register status: R1 waits on I1.]
Dynamic scheduling with Tomasulo: execution example after cycle 5
[Snapshot: DSUBI executes in cycle 5; the load's L1-cache access MISSES; BNE R1,R2,Cont STALLS at issue.]
Assume for now that branches are resolved at the issue stage: the pipeline stalls until the hazard is resolved.
Dynamic scheduling with Tomasulo: execution example after cycle 6
[Snapshot: DSUBI writes back in cycle 6 (I1 freed, R1 updated); BNE issues in cycle 6; the load starts its L2-cache access (cycle 1).]
Dynamic scheduling with Tomasulo: execution example, cycles 7-8
[Snapshot: L2-cache access cycles 2-3; the second loop iteration (L.D, MULT.D, S.D, DSUBI, BNE) appears in the instruction status, not yet issued.]
Dynamic scheduling with Tomasulo: execution example after cycle 9
[Snapshot: the second L.D issues in cycle 9 to load buffer L2 (Vj = R1); the first load is in L2-cache access cycle 4; register status: F0 now waits on L2.]
Dynamic scheduling with Tomasulo: execution example after cycle 10
[Snapshot: the second MULT.D issues in cycle 10 to M2 (Vk = F2, Qj = L2); the first load's L2 access HITS; the second load's effective address is computed (A = R1+0).]
The WAW hazard on register F4 was resolved by renaming: F4 now waits on M2.
Dynamic scheduling with Tomasulo: execution example after cycle 11
[Snapshot: the first load writes back (buffer L1 freed) and M1 captures F0, starting execution (cycle 1); the second S.D issues in cycle 11 to store buffer S2 (Vj = R1, Qk = M2, A = 0); the second load starts its L1-cache access (cycle 1).]
Notice that two loop iterations are under execution: LOOP UNROLLING is happening dynamically in hardware.
Dynamic scheduling with Tomasulo: execution example after cycle 12
[Snapshot: the second DSUBI issues in cycle 12 to I1 (Vj = R1, Vk = #8); M1 executes cycle 2; the second load is in L1-cache access cycle 2; S2 gets its effective address (A = R1+0); register status: R1 waits on I1.]
Dynamic scheduling with Tomasulo: execution example after cycle 13
[Snapshot: the second BNE STALLS at issue; M1 executes cycle 3; the second load's L1-cache access HITS.]
Dynamic scheduling with Tomasulo: execution example after cycle 14
[Snapshot: the first MULT.D finishes execution in cycle 14; the second load writes back (buffer L2 freed) and M2 captures F0, starting execution (cycle 1); the second DSUBI waits in I1 and the second BNE remains stalled at issue.]
Dynamic scheduling with Tomasulo: execution example after cycle 15
[Snapshot: the first MULT.D writes back on the CDB (M1 freed); store buffer S1 captures the value and the first S.D can proceed; M2 executes cycle 2.]
Dynamic scheduling with Tomasulo: execution example after cycle 16
[Snapshot: the first S.D completes in cycle 16; the second BNE issues in cycle 16; M2 executes cycle 3.]
Dynamic scheduling with Tomasulo: hazards due to memory accesses
Out-of-order memory accesses can generate hazards, namely when:
- a load is followed by a store to the same effective address (WAR);
- a store is followed by a load from the same effective address (RAW);
- a store is followed by a store to the same effective address (WAW).
A simple way to avoid these hazards is to compute effective addresses in program order: delay dispatching a load/store to the load/store buffer when its effective address is already in any of the buffers (RAW/WAR/WAW).
This is an alternative, with hardware support, to the compiler's instruction retiming.
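The dispatch rule above can be sketched as a simple conservative check. This is an illustrative model only: buffers are modeled as (kind, address) tuples in program order, and the function blocks any access that conflicts with an in-flight one.

```python
def can_dispatch(buffers, kind, addr):
    """Decide whether a load/store with a known effective address may be
    dispatched to the load/store buffers.

    `buffers` holds the in-flight accesses, oldest first, each a
    (kind, addr) tuple with kind in {"load", "store"}. A hazard exists
    when an earlier access touches the same address and at least one of
    the two accesses is a store (RAW, WAR, or WAW); load-after-load to
    the same address is harmless.
    """
    for prev_kind, prev_addr in buffers:
        if prev_addr == addr and (prev_kind == "store" or kind == "store"):
            return False  # potential memory hazard: wait
    return True
```

For instance, a load to an address held by a pending store must wait (RAW), while two loads to the same address may both proceed.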
Dynamic scheduling with Tomasulo: problems
While the performance of Tomasulo's algorithm is high, its implementation is complex and requires a large amount of hardware resources:
- Each reservation station requires fast logic to compare the CDB label against the operand labels (Qj, Qk).
- The CDB can seriously compromise performance whenever there are simultaneous writes.
- Multiple CDBs can be implemented; however, that implies increasing the reservation-station logic that compares the labels, and it increases the control logic for CDB arbitration.
Dynamic scheduling with Tomasulo: problems
Control hazards are an important limitation: on average, 1 in 7 instructions is a control instruction.
To resolve the control hazards of conditional jumps/branches, the following information must be obtained:
- the effective jump/branch target address;
- the value of the register associated with the condition.
Deep pipelines impose larger penalties when the effective address or the associated register is unknown.
The problem is aggravated with multiple instruction issue: control instructions appear in more cycles, so the impact of stalling the pipeline becomes worse.
Dynamic scheduling with Tomasulo: branch prediction
To apply branch prediction (static or dynamic), the processor must have a way of recovering execution onto the alternative path whenever the original assumption is wrong. Recovering implies knowing the processor status at the prediction stage.
The efficiency of branch prediction depends on:
- the branch misprediction rate;
- the branch penalty for each misprediction:
  - Pentium II: 12 cycles
  - Pentium 4: 20 cycles
  - Core 2: 13 cycles
  - i3/i5/i7 (Nehalem): 17 cycles
  - i3/i5/i7 (Sandy Bridge): 15 cycles
Branch prediction
Dynamic branch prediction using:
- Branch Target Buffer (BTB)
- Branch Prediction Buffer (BPB)
- Branch History Table (BHT)
Branch prediction: calculation of the jump address
Predict-not-taken is easy: the predicted jump address is simply the next PC.
However, loops typically cause many branches to be taken, and even unconditional branches (e.g., function call/return) require knowing the target address.
Anticipating the effective address calculation to early pipeline stages, and using delayed branches, can minimize this problem; however, these techniques cannot be applied in all cases.
Branch prediction: Branch Target Buffer (BTB)
Alternative: use a Branch Target Buffer (BTB), i.e., build a table, at run time, of the target address of each control instruction.
To reduce memory resources, instead of saving the target address of all instructions, use a cache for the most recent control instructions. The larger the memory ("cache"), the more information can be saved, decreasing branch mispredictions; however, it also implies spending more memory.
Where to put the BTB: at the IF stage, to enable fetching the next instruction without stalling the pipeline.
[Figure: BTB organization. Each entry stores the instruction address (TAG), the jump address, and prediction bits. The p LSBs of the current PC index the table; the remaining n-p MSBs are compared against the stored TAG; on a hit, the prediction bits (taken / not taken) select between PC+4 and the stored jump address as the next PC.]
Branch prediction: Branch Target Buffer (BTB)
Use the PC least significant bits (LSBs) to index the table, and compare the entry's TAG with the PC most significant bits (MSBs):
- If TAG != MSBs(PC): no prediction; fetch from PC+4 (don't jump).
- If TAG = MSBs(PC): check the prediction bits; if the branch history is taken, fetch from the stored jump address; if the branch history is not taken, fetch from PC+4.
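The lookup above can be sketched as a toy direct-mapped BTB. This is an illustrative model (the class name, entry count, and 4-byte instruction size are assumptions, and the prediction field is a single bit), not a description of any real processor's BTB.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB. Each entry holds the tag (PC high bits),
    the jump target, and a 1-bit prediction (True = predict taken)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag, target, taken)

    def _split(self, pc):
        # Instructions are word-aligned, so drop the low 2 bits, then
        # use the LSBs as index and the MSBs as tag (as on the slide).
        index = (pc >> 2) % self.entries
        tag = (pc >> 2) // self.entries
        return index, tag

    def predict(self, pc):
        """Return the predicted next PC at the IF stage."""
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag and entry[2]:
            return entry[1]       # hit and history says taken: jump
        return pc + 4             # miss or predicted not taken: fall through

    def update(self, pc, target, taken):
        """Fill/refresh the entry once the branch is resolved."""
        index, tag = self._split(pc)
        self.table[index] = (tag, target, taken)
```

On the first encounter of a branch the BTB misses and fetch falls through to PC+4; once the branch resolves and the entry is filled, subsequent fetches of the same PC redirect to the stored target.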
Branch prediction: branch prediction strategy
Static branch prediction, whose efficiency depends on the compiler/programmer:
- Always predict taken
- Always predict not taken
- Conditioned on the jump address: predict taken when the predicted address corresponds to a backward jump (i.e., to a lower address); predict not taken when it corresponds to a forward jump (i.e., to a higher address)
Dynamic branch prediction:
- Takes the execution path into consideration
- Uses more hardware and power, but increases the prediction rate to over 80%
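The address-conditioned static heuristic above (often called backward-taken/forward-not-taken) fits in one line; a small sketch, with the function name my own:

```python
def static_predict_taken(pc, target):
    """Backward-taken / forward-not-taken static heuristic: loop-closing
    branches jump backward (to a lower address), so predict those taken."""
    return target < pc
```

The rationale is that most backward branches close loops and are taken on every iteration but the last.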
Branch prediction: dynamic prediction with a 1-bit table
The simplest branch prediction scheme uses a prediction table of just one bit per entry, the Branch Prediction Buffer (BPB):
- BPB = 1 when the latest branch outcome was taken (predict taken)
- BPB = 0 when the latest branch outcome was not taken (predict not taken)
- Complement the BPB value whenever the prediction is wrong
[State diagram: two states, Predict Taken (BPB = 1) and Predict Not Taken (BPB = 0); a taken branch moves/keeps the state at 1, a not-taken branch moves/keeps it at 0.]
The BPB can work:
- Autonomously: because the target address is unknown, it can only be applied at the ID stage
- Integrated into the BTB: works at the IF stage
Using a Branch Target Buffer (BTB) at the IF stage: example case for a 1-bit BPB

for (i = 0; i < n_cols; i++) {
    sum = 0;
    for (j = 0; j < n_rows; j++)
        sum += A[i][j] * A[i][j];
    S[i] = sum;
}

OUTER_LOOP: DSUB.D F4,F4,F4
INNER_LOOP: L.D    F0,0(R1)
            MUL.D  F2,F0,F0
            DADD.D F4,F4,F2
            DADD   R1,R1,#8
            BNE    R1,R2,INNER_LOOP
            S.D    0(R4),F4
            DADD   R2,#
            DADD   R4,#8
            BNE    R2,R3,OUTER_LOOP

[Figure: the BTB from the previous slides (instruction address TAG, jump address, prediction bits), indexed by the PC LSBs and tagged with the PC MSBs, selecting the next PC.]
The following slides animate the BTB at work on this loop:
- Each instruction fetch from the L1 cache takes three cycles (Fetch 1/3, 2/3, 3/3), after which the instruction reaches the issue stage. The issue stage fills the BTB only if the decoded instruction is a branch.
- When BNE R1,R2,INNER_LOOP is issued, a BTB entry is allocated for it: the entry is indexed by the branch PC's LSBs, tagged with its MSBs, and stores the INNER_LOOP target address plus the 1-bit prediction.
- From then on, whenever the same branch is fetched again, the BTB hits at the IF stage: the next PC is selected between PC+4 and the stored target according to the prediction bit, with no pipeline stall.
- The same happens for BNE R2,R3,OUTER_LOOP, which gets its own BTB entry when first issued.
Dynamic prediction with a 1-bit table: example
Consider a typical case of nested for loops:

            DADDI R1,R0,#16
Ciclo_ext:  DADDI R10,R0,#20
Ciclo_int:  DSUBI R10,R10,#4
            BNE   R10,R0,Ciclo_int
            DSUBI R1,R1,#1
            BNE   R1,R0,Ciclo_ext

The internal loop is executed 5 times for each of the 16 iterations of the external loop.
Branch misprediction with a 1-bit predictor:
- Total branches: 16x5 + 16 = 96
- Total misses: 16x2 + 2 = 34 (one miss whenever entering and one whenever exiting the inner/outer loop)
- Miss rate: 34/96
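The miss count above can be checked with a short simulation. This is a sketch under the stated assumptions: one independent 1-bit predictor per branch, both bits initialized to 0 (predict not taken), the inner branch executing 5 times per outer iteration (taken on all but the last), and the outer branch taken on all but its last execution.

```python
def simulate_1bit(outer_iters=16, inner_execs=5):
    """Count branch executions and 1-bit predictor misses for the
    nested-loop example (one predictor bit per branch, starting at 0)."""
    bit = {"inner": 0, "outer": 0}   # 1 = predict taken
    branches = misses = 0

    def run(name, taken):
        nonlocal branches, misses
        branches += 1
        if bit[name] != int(taken):
            misses += 1
        bit[name] = int(taken)       # 1-bit rule: remember the last outcome

    for i in range(outer_iters):
        for j in range(inner_execs):
            run("inner", taken=(j < inner_execs - 1))
        run("outer", taken=(i < outer_iters - 1))
    return branches, misses
```

Running it reproduces the slide's numbers: 96 branches and 34 misses, because the 1-bit scheme mispredicts twice per inner-loop instance (on entry and on exit).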
Dynamic branch prediction: 2-bit Branch Prediction Buffer (BPB)
A 2-bit table has four states:
- Not Taken, Strong prediction (00)
- Not Taken, Weak prediction (01)
- Taken, Weak prediction (10)
- Taken, Strong prediction (11)
It can be implemented with a simple 2-bit counter with saturation: whenever the jump is taken, increase the counter; whenever it is not taken, decrease the counter. States 10 and 11 predict taken; states 00 and 01 predict not taken.
[State diagram: a taken branch moves the state toward Strong Predict Taken; a not-taken branch moves it toward Strong Predict Not Taken. A single misprediction in a strong state only weakens the prediction, so two consecutive mispredictions are needed to flip it.]
Dynamic prediction with a 2-bit table: example
For the previous case (inner loop executed 5 times for each of the 16 iterations of the external loop):
- Total branches: 16x5 + 16 = 96
- Total misses: 16x1 + 1 + 2 = 19 (one per inner-loop exit, plus the initial misses when first entering the inner/outer loops)
- Miss rate: 19/96
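The same simulation idea verifies the 2-bit count. Assumptions, stated explicitly because the slide does not give them: one 2-bit saturating counter per branch, initialized to 01 (weak not taken), with counter >= 2 predicting taken.

```python
def simulate_2bit(outer_iters=16, inner_execs=5):
    """Count branch executions and misses for the nested-loop example
    with one 2-bit saturating counter per branch (initial state 01)."""
    counter = {"inner": 1, "outer": 1}
    branches = misses = 0

    def run(name, taken):
        nonlocal branches, misses
        branches += 1
        if (counter[name] >= 2) != taken:
            misses += 1
        # Saturating update: taken increments, not taken decrements.
        if taken:
            counter[name] = min(3, counter[name] + 1)
        else:
            counter[name] = max(0, counter[name] - 1)

    for i in range(outer_iters):
        for j in range(inner_execs):
            run("inner", taken=(j < inner_execs - 1))
        run("outer", taken=(i < outer_iters - 1))
    return branches, misses
```

This yields 96 branches and 19 misses: unlike the 1-bit scheme, the not-taken exit of the inner loop only weakens the taken prediction, so re-entering the loop no longer mispredicts.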
Branch prediction: dynamic prediction with a 3-bit table
A 3-bit table has eight states:
- Not Taken: Strong++ (000), Strong+ (001), Strong (010), Weak (011)
- Taken: Weak (100), Strong (101), Strong+ (110), Strong++ (111)
It is again implemented as a saturating counter: a taken branch increments the counter, a not-taken branch decrements it, and the most significant bit of the counter states the prediction.
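The 1-, 2-, and 3-bit schemes are all instances of one n-bit saturating counter; a generic sketch (the factory function and its weak-state initialization are my own choices, not from the slides):

```python
def make_saturating_predictor(bits=3):
    """Generic n-bit saturating-counter branch predictor.
    The counter's MSB gives the prediction, as stated on the slide."""
    top = (1 << bits) - 1
    state = {"c": top // 2}          # start just below the taken threshold

    def predict():
        return state["c"] >= (1 << (bits - 1))   # MSB set => predict taken

    def update(taken):
        if taken:
            state["c"] = min(top, state["c"] + 1)
        else:
            state["c"] = max(0, state["c"] - 1)

    return predict, update
```

With more bits, more consecutive mispredictions are needed before a strongly biased branch flips its prediction.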
Next lesson
- Correlated branch prediction schemes
- Branch prediction and dynamic scheduling
- Superscalar architectures
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation
More informationELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism
ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,
More informationScoreboard information (3 tables) Four stages of scoreboard control
Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationCACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás
CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationGood luck and have fun!
Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationHardware-based Speculation
Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationCS433 Homework 2 (Chapter 3)
CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..
More informationWhat is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?
What is ILP? Instruction Level Parallelism or Declaration of Independence The characteristic of a program that certain instructions are, and can potentially be. Any mechanism that creates, identifies,
More informationDynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm:
LECTURE - 13 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationCS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example
CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences
More informationCS433 Homework 2 (Chapter 3)
CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies
More informationMulticycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline
//11 Limitations of Our Simple stage Pipeline Diversified Pipelines The Path Toward Superscalar Processors HPCA, Spring 11 Assumes single cycle EX stage for all instructions This is not feasible for Complex
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationReview: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:
CS152 Computer Architecture and Engineering Lecture 17 Dynamic Scheduling: Tomasulo March 20, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More information吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm
EEF011 Computer Architecture 計算機結構 吳俊興高雄大學資訊工程學系 October 2004 Example to eleminate WAR and WAW by register renaming Original DIV.D ADD.D S.D SUB.D MUL.D F0, F2, F4 F6, F0, F8 F6, 0(R1) F8, F10, F14 F6,
More informationCPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationThe basic structure of a MIPS floating-point unit
Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationSuperscalar Architectures: Part 2
Superscalar Architectures: Part 2 Dynamic (Out-of-Order) Scheduling Lecture 3.2 August 23 rd, 2017 Jae W. Lee (jaewlee@snu.ac.kr) Computer Science and Engineering Seoul NaMonal University Download this
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationAdvanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012
Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code
More informationDAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation
Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Mafijul Islam Department of Computer Science and Engineering November 19, 2009 Study Period 2, 2009 Goals:
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationComputer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution
CSCE 6 (Fall 07) Computer Architecture Homework Set # COVER SHEET Please turn in with your own solution Eun Jung Kim Write your answers on the sheets provided. Submit with the COVER SHEET. If you need
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationChapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences
Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationTopics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation
Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationChapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,
Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationInstruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4
PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI
More informationInstruction Level Parallelism. Taken from
Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationCS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More informationFor this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units
CS333: Computer Architecture Spring 006 Homework 3 Total Points: 49 Points (undergrad), 57 Points (graduate) Due Date: Feb. 8, 006 by 1:30 pm (See course information handout for more details on late submissions)
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationInstruction Level Parallelism (ILP)
Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationReferences EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)
EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s
More informationCSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism
CSE 502 Graduate Computer Architecture Lec 8-10 Instruction Level Parallelism Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,
More informationCOSC 6385 Computer Architecture - Instruction Level Parallelism (II)
COSC 6385 Computer Architecture - Instruction Level Parallelism (II) Edgar Gabriel Spring 2016 Data fields for reservation stations Op: operation to perform on source operands S1 and S2 Q j, Q k : reservation
More informationPipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences
Dynamic Scheduling Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences Increased compiler complexity, especially when attempting
More informationNOW Handout Page 1. Review from Last Time. CSE 820 Graduate Computer Architecture. Lec 7 Instruction Level Parallelism. Recall from Pipelining Review
Review from Last Time CSE 820 Graduate Computer Architecture Lec 7 Instruction Level Parallelism Based on slides by David Patterson 4 papers: All about where to draw line between HW and SW IBM set foundations
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationInstruction Level Parallelism
Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationComputer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "
More informationNOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism
Outline Csci 211 Computer System Architecture Lec 4 Instruction Level Parallelism Xiuzhen Cheng Department of Computer Sciences The George Washington University ILP Compiler techniques to increase ILP
More informationOutline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches
Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,
More informationEECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)
Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static
More informationDynamic Instruction Level Parallelism (ILP)
Dynamic Instruction Level Parallelism (ILP) Introduction to ILP Data and name dependences and hazards CPU dynamic scheduling Branch prediction Hardware speculation Multiple issue Theoretical limits of
More informationFunctional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory
The Big Picture: Where are We Now? CS152 Computer Architecture and Engineering Lecture 18 The Five Classic Components of a Computer Processor Input Control Dynamic Scheduling (Cont), Speculation, and ILP
More informationESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism
Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling 1 Outline ILP Compiler techniques to increase ILP Loop Unrolling Static
More informationExploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville
Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationQuestion 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. The memory is word addressable The size of the
Question 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. he memory is word addressable he size of the cache is 8 blocks; each block is 4 words (32 words cache).
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2004
CS 64 COMPUTER ARCHITECTURE II FALL 004 DUE : October, 005 HOMEWORK II READ : - Portions of Chapters 5, 7, 8 and 9 of the Sima book and - Portions of Chapter 3, 4 and Appendix A of the Hennessy book ASSIGNMENT:
More informationUpdated Exercises by Diana Franklin
C-82 Appendix C Pipelining: Basic and Intermediate Concepts Updated Exercises by Diana Franklin C.1 [15/15/15/15/25/10/15] Use the following code fragment: Loop: LD R1,0(R2) ;load R1 from address
More informationSlide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng
Slide Set 8 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 8 slide
More informationComplications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?
Complications with long instructions CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study So far, all MIPS instructions take 5 cycles But haven't talked
More informationCS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010
CS252 Graduate Computer Architecture Lecture 8 Explicit Renaming Precise Interrupts February 13 th, 2010 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley
More information