DYNAMIC SPECULATIVE EXECUTION


1 DYNAMIC SPECULATIVE EXECUTION Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011 ADVANCED COMPUTER ARCHITECTURES ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

2 Outline 2 Dynamic instruction scheduling: revision of the Tomasulo algorithm; loop unrolling with Tomasulo; introduction to dynamic branch prediction

3 Tomasulo algorithm 3 Proposed by Robert Tomasulo in 1966: initially proposed to overcome the long latencies of both memory accesses and floating-point operations First implemented on the IBM 360/91 The algorithm proved far more powerful than anticipated and is used in all modern superscalar processors

4 Tomasulo's algorithm General idea Instructions are issued to reservation stations associated with functional units Operands that are ready are copied directly to the reservation station Operands that are unavailable force instructions to wait at the reservation station Instructions no longer wait for the value in the register, but for the completion of the instruction on a given reservation station [Pipeline diagram: IF and ISSUE stages feed, through the Register File, the store buffers S1-S4; the address-calculation/MEMORY path with load buffers L1-L4; FU 2 (INT ALU) with slots I1-I4; FU 3 (FP ADD) with slots A1-A4; FU 4 (FP MULT) with slots M1-M3; and FU 5 (INT/FP DIV) with slots D1-D2; all results are broadcast on the Common Data Bus (CDB)]

5 Tomasulo's algorithm Execute stage 5 1. Reservation stations and the register file (RF) snoop writes to the common data bus (CDB) If a value required by a reservation station or the RF is written on the CDB, it is copied from the CDB [Same pipeline diagram as before, here highlighting the write of the result from the instruction on reservation station D2 onto the CDB]

6 Tomasulo's algorithm Reservation stations 6 Information in each reservation station: Busy (station availability); Op (operation to execute); Vj, Vk (values of operands j and k, valid only when the operands are ready); Qj, Qk (readiness of operands j and k: zero when ready, otherwise the label of the reservation station holding the instruction that will generate the result) Load/store buffers have an additional field A for indexed loads/stores, e.g., M[R[AA] + Imm] R[BA]: it stores the immediate and later the effective load/store address Information on registers: each integer register R0..Rn and each FP register F0..Fn holds its data plus a readiness field Q, which labels the register as ready (value of zero) or not ready (indicating the reservation station holding the instruction that generates the value)
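The fields above, together with the CDB snooping rule of the previous slide, can be sketched in software. The following is a minimal, illustrative Python model; the class, field, and method names are my own, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    # Fields named after the slide: Busy, Op, Vj/Vk, Qj/Qk, A
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None   # value of operand j (valid when qj == 0)
    vk: Optional[float] = None   # value of operand k (valid when qk == 0)
    qj: int = 0                  # 0 => ready; else tag of the producing station
    qk: int = 0
    a: Optional[int] = None      # immediate / effective address (loads, stores)

    def snoop_cdb(self, tag: int, value: float) -> None:
        # Snoop a CDB broadcast: if waiting on `tag` (nonzero), grab the value.
        if self.qj == tag:
            self.vj, self.qj = value, 0
        if self.qk == tag:
            self.vk, self.qk = value, 0

    def ready(self) -> bool:
        # An instruction may start executing once both operands are ready.
        return self.busy and self.qj == 0 and self.qk == 0
```

For instance, a MULT.D waiting on a load buffer's tag becomes ready the moment that tag broadcasts its result on the CDB.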

7 Tomasulo's algorithm Example 7 Consider the execution of the instructions below on a processor with: Pipelined functional units: 1x integer ALU, with 1 cycle latency; 1x FP multiplier, with 4 cycles latency; 1x FP adder/subtractor, with 3 cycles latency; 1x INT/FP divider, with 20 cycles latency Load/store unit latency: effective address calculation: 1 cycle; Level 1 cache: 3 cycles; Level 2 cache: 5 cycles; Level 3 cache: 12 cycles; main memory: 50 cycles Code: LWI R1,#V1 LWI R2,#V1+Len(V1) Cont: L.D F0,0(R1) MULT.D F4,F0,F2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont Reservation stations: 3 load + 3 store buffers; 2 slots for integer operations; 2 slots for FP multiplication/division; 2 slots for FP addition/subtraction


9 Dynamic scheduling with Tomasulo Execution example 9 [I1] Integer [I2] Integer [M1] FP Mult/Div [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 [S2] Store Buffer 2 [S3] Store Buffer 3 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk Load/Store Buffers Busy Vj Vk Qj Qk A Instruction Status (not required, used for illustration only) L.D F0,0(R1) MULT.D F4,F0,F2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont ISSUE EX WB Register Status R1 F0 F2 F4

10 Dynamic scheduling with Tomasulo Execution example after cycle 1 10 [I1] Integer [I2] Integer [M1] FP Mult/Div [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 [S2] Store Buffer 2 [S3] Store Buffer 3 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk Load/Store Buffers Busy Vj Vk Qj Qk A Yes R Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont ISSUE EX WB Register Status R1 F0 F2 F4 L1

11 Dynamic scheduling with Tomasulo Execution example after cycle 2 11 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont ISSUE EX WB Effective address [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 [S2] Store Buffer 2 [S3] Store Buffer 3 Load/Store Buffers Busy Vj Vk Qj Qk A Yes R R1+0 Register Status R1 F0 F2 F4 L1 M1

12 Dynamic scheduling with Tomasulo Execution example after cycle 3 12 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 DSUBI R1,R1,#8 BNE R1,R2,Cont ISSUE EX WB L1 Access Cycle 1 Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Register Status R1 F0 F2 F4 L1 M1

13 Dynamic scheduling with Tomasulo Execution example after cycle 4 13 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 BNE R1,R2,Cont ISSUE EX WB L1 Access Cycle 2 Effective address Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 R1+0 [S2] Store Buffer 2 [S3] Store Buffer 3 Register Status R1 F0 F2 F4 I1 L1 M1

14 Dynamic scheduling with Tomasulo Execution example after cycle 5 14 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 ISSUE EX WB DSUBI R1,R1,#8 4 5 BNE R1,R2,Cont STALL Register Status L1 Access MISS Assume for now that branches are resolved at the issue stage; the pipeline stalls until the hazard is resolved R1 F0 F2 F4 I1 L1 M1

15 Dynamic scheduling with Tomasulo Execution example after cycle 6 15 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 ISSUE EX WB DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L2 Access Cycle 1 Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Register Status R1 F0 F2 F4 L1 M1

16 Dynamic scheduling with Tomasulo Execution example cycles 7,8 16 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 ISSUE EX WB DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) MULT.D F4,F0,F2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont Register Status R1 F0 F2 F4 L1 L2 Access Cycle 2,3 M1

17 Dynamic scheduling with Tomasulo Execution example after cycle 9 17 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 Yes R [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 ISSUE EX WB DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont Register Status R1 F0 F2 F4 L2 L2 Access Cycle 4 M1

18 Dynamic scheduling with Tomasulo Execution example after cycle 10 18 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D - F2 L1 - [M2] FP Mult/Div Yes MULT.D - F2 L2 - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 Yes R R1+0 [L2] Load Buffer 2 Yes R R1+0 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) L.D F0,0(R1) 1 MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 ISSUE EX WB DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 DSUBI R1,R1,#8 BNE R1,R2,Cont Register Status R1 F0 F2 F4 L2 M2 L2 Access HIT Effective address The WAW hazard on register F4 was resolved by renaming

19 Dynamic scheduling with Tomasulo Execution example after cycle 11 19 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div Yes MULT.D F0 F2 - - [M2] FP Mult/Div Yes MULT.D - F2 L2 - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 Yes R R1+0 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 Yes R1 - - M2 0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 BNE R1,R2,Cont Register Status R1 F0 F2 F4 L2 Cycle 1 L1 Access Cycle 1 Notice that there are 2 loop iterations under execution LOOP UNROLLING M2

20 Dynamic scheduling with Tomasulo Execution example after cycle 12 20 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div Yes MULT.D F0 F2 - - [M2] FP Mult/Div Yes MULT.D - F2 L2 - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 Yes R R1+0 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 Yes R1 - - M2 R1+0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 12 BNE R1,R2,Cont Register Status Cycle 2 L1 Access Cycle 2 Effective address R1 F0 F2 F4 I1 L2 M2

21 Dynamic scheduling with Tomasulo Execution example after cycle 13 21 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div Yes MULT.D F0 F2 - - [M2] FP Mult/Div Yes MULT.D - F2 L2 - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 Yes R R1+0 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 Yes R1 - - M2 R1+0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 12 BNE R1,R2,Cont STALL Register Status Cycle 3 L1 Access HIT R1 F0 F2 F4 I1 L2 M2

22 Dynamic scheduling with Tomasulo Execution example after cycle 14 22 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div Yes MULT.D F0 F2 - - [M2] FP Mult/Div Yes MULT.D F0 F2 - - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 Yes R1 - - M1 0 [S2] Store Buffer 2 Yes R1 - - M2 R1+0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 14 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 12 STALL BNE R1,R2,Cont STALL Register Status R1 F0 F2 F4 I1 Cycle 1 M2

23 Dynamic scheduling with Tomasulo Execution example after cycle 15 23 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer Yes DSUBI R1 #8 - - [I2] Integer [M1] FP Mult/Div [M2] FP Mult/Div Yes MULT.D F0 F2 - - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 [S2] Store Buffer 2 Yes R1 - - M2 R1+0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 14 S.D 0(R1),F4 3 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 12 STALL BNE R1,R2,Cont STALL Register Status R1 F0 F2 F4 I1 Cycle 2 M2

24 Dynamic scheduling with Tomasulo Execution example after cycle 16 24 Reservation stations Operand Value Res. Station Busy Op Vj Vk Qj Qk [I1] Integer [I2] Integer [M1] FP Mult/Div [M2] FP Mult/Div Yes MULT.D F0 F2 - - [A1] FP Add/Sub [A2] FP Add/Sub Load/Store Buffers Busy Vj Vk Qj Qk A [L1] Load Buffer 1 [L2] Load Buffer 2 [L3] Load Buffer 3 [S1] Store Buffer 1 [S2] Store Buffer 2 Yes R1 - - M2 R1+0 [S3] Store Buffer 3 Instruction Status (not required, used for illustration only) ISSUE EX WB L.D F0,0(R1) MULT.D F4,F0,F2 2 14 S.D 0(R1),F4 3 16 DSUBI R1,R1,#8 4 5 6 BNE R1,R2,Cont 6 L.D F0,0(R1) 9 MULT.D F4,F0,F2 10 S.D 0(R1),F4 11 DSUBI R1,R1,#8 12 BNE R1,R2,Cont 16 Register Status Cycle 3 R1 F0 F2 F4 M2

25 Dynamic scheduling with Tomasulo Hazards due to memory accesses 25 Out-of-order memory accesses can generate hazards, namely when: a LD is followed by a ST on the same effective address (WAR); a ST is followed by a LD on the same effective address (RAW); a ST is followed by a ST on the same effective address (WAW) A simple way to solve these hazards is to compute the effective addresses in order: delay dispatching a LD/ST to the load/store buffer when its effective address is already in any of the buffers (RAW/WAR/WAW) This is a hardware-supported alternative to instruction retiming by the compiler
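The in-order dispatch rule can be sketched in a few lines (the function name and data structure are mine, and the rule is kept deliberately simple, as on the slide):

```python
def may_dispatch(ea, pending_addresses):
    """Delay dispatching a LD/ST to a load/store buffer while its effective
    address `ea` is already held by any buffer. This removes memory
    RAW/WAR/WAW hazards cheaply. It is conservative: it also serializes two
    loads to the same address, which carry no real hazard."""
    return ea not in pending_addresses
```

For example, a load to address 0x2000 must wait while any earlier access to 0x2000 is still pending, but an access to a different address may proceed.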

26 Dynamic scheduling with Tomasulo Problems 26 While the performance of Tomasulo's algorithm is high, its implementation is complex and requires a large amount of hardware resources: each reservation station requires fast logic to compare the CDB tag against the operand (Qj, Qk) labels The CDB can seriously compromise performance whenever there are simultaneous writes Multiple CDBs can be implemented; however, that also implies increasing the reservation-station logic that compares the labels, and it increases the control logic for CDB arbitration

27 Dynamic scheduling with Tomasulo Problems 27 Control hazards are an important limitation: on average, 1/7 of all instructions are control instructions To solve the control hazards in conditional jumps/branches, the following information must be obtained: the effective jump/branch address, and the value of the register associated with the condition Deep pipelines impose larger penalties when the effective address or the associated register is unknown The problem is aggravated when performing multiple instruction issue: control instructions appear in more cycles, so the impact of stalling the pipeline becomes worse

28 Dynamic scheduling with Tomasulo Branch prediction 28 To apply branch prediction (static or dynamic), the processor must have a way of recovering execution to the alternative path whenever the original assumption is wrong Recovering implies knowing the processor status at the prediction stage The efficiency of branch prediction depends on: the branch misprediction rate, and the branch penalty for each misprediction On the Pentium II the branch misprediction penalty is 12 cycles; on the Pentium IV, 20 cycles; on the Core 2, 13 cycles; on the i3/i5/i7 (Nehalem), 17 cycles; on the i3/i5/i7 (Sandy Bridge), 15 cycles
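The two factors combine multiplicatively into the stall cycles added per instruction; as a sketch (the 1/7 branch frequency and the 20-cycle Pentium IV penalty are the slide's figures, the 10% miss rate is an assumed example value):

```python
def branch_stall_cpi(branch_freq, miss_rate, penalty):
    # Average stall cycles added to each instruction by mispredicted branches:
    # fraction of branches x fraction mispredicted x cycles lost per miss.
    return branch_freq * miss_rate * penalty

# 1/7 of instructions are control instructions; assume a 10% miss rate and
# the 20-cycle misprediction penalty quoted for the Pentium IV.
extra_cpi = branch_stall_cpi(1 / 7, 0.10, 20)  # roughly 0.29 cycles/instruction
```

This makes clear why deep pipelines (large penalty) need very low misprediction rates.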

29 29 Branch prediction Dynamic branch prediction using: Branch-Target-Buffer (BTB) Branch-Prediction-Buffer (BPB) Branch-History-Table (BHT)

30 Branch Prediction Calculation of the jump address 30 Branch predict not taken is easy: the predicted jump address is simply the next PC Loops, however, typically impose that many branches are taken Even unconditional branches (e.g., function call/return) require knowing the target address Anticipating the effective-address calculation to early pipeline stages and the use of delayed branches can minimize this problem; however, these techniques cannot be applied in all cases

31 Branch Prediction Branch Target Buffer (BTB) 31 Alternative: Branch Target Buffer (BTB): build a table, at run time, with the target address of each control instruction To reduce memory resources, instead of saving the target address for all instructions, use a cache for the most recent instructions The larger the memory ( cache ), the more information can be saved, thus decreasing branch mispredictions; however, it also implies spending more memory Where to put the BTB: at the IF stage, to enable fetching the next instruction without stalling the pipeline [BTB diagram: each entry holds the instruction address (TAG), the jump address and prediction bits; the current PC's LSBs index the table, the stored TAG is compared with the PC's MSBs, and on a predicted-taken hit the stored jump address becomes the next PC]

32 Branch Prediction Branch Target Buffer (BTB) 32 Use the PC least significant bits (LSBs) to index the table Compare the TAG of the output word with the PC most significant bits (MSBs) If TAG does not match MSBs(PC): there is no branch history; predict not taken (next PC = PC + 4, don't jump) If TAG = MSBs(PC): there is branch history; the prediction bits choose between PC + 4 (don't jump) and the stored jump address (jump)
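The index/TAG lookup above can be modelled in a few lines. This is an illustrative sketch: the table size, the class and method names, and the fixed 4-byte instruction size are assumptions, not from the slides:

```python
class BranchTargetBuffer:
    # Direct-mapped BTB: PC LSBs index the table, PC MSBs are kept as the TAG.
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # slot: (tag, jump_address, predict_taken)

    def _index_tag(self, pc):
        return pc % self.entries, pc // self.entries  # (LSBs, MSBs)

    def lookup(self, pc):
        # Returns (hit, next_pc). On a miss, or on a hit predicted not taken,
        # the next PC is simply PC + 4; on a taken hit it is the jump address.
        index, tag = self._index_tag(pc)
        entry = self.table[index]
        hit = entry is not None and entry[0] == tag
        if hit and entry[2]:
            return True, entry[1]
        return hit, pc + 4

    def update(self, pc, jump_address, taken):
        # Filled at the issue stage, only for control instructions.
        index, tag = self._index_tag(pc)
        self.table[index] = (tag, jump_address, taken)
```

A fetch of a branch seen before therefore gets its predicted target in the same IF cycle, with no pipeline stall.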

33 Branch Prediction Branch prediction strategy 33 Static branch prediction, whose efficiency depends on the compiler/programmer: always predict taken; always predict not taken; or conditioned on the jump address: predict taken when the predicted address corresponds to a backward jump (i.e., to a lower address), and predict not taken when it corresponds to a forward jump (i.e., to a higher address) Dynamic branch prediction: takes the execution path into consideration; uses more hardware and power, but increases the prediction rate to over 80%
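The address-conditioned static scheme reduces to a single comparison; a sketch (the function name is mine):

```python
def static_predict_taken(branch_pc, target_pc):
    # Backward jumps (to a lower address) usually close loops: predict taken.
    # Forward jumps (to a higher address): predict not taken.
    return target_pc < branch_pc
```

Loop-closing branches like BNE at the bottom of a loop jump backwards and are therefore predicted taken by this rule.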

34 Branch Prediction Dynamic prediction with a 1-bit table 34 The simplest branch prediction scheme uses a prediction table of just one bit: the Branch-Predict-Buffer (BPB) entry holds 1 when the latest jump history is taken, and 0 when the latest jump history is not taken The BPB value is complemented when the prediction is wrong [State diagram: BPB=1 (predict taken) and BPB=0 (predict not taken); a taken branch moves to or keeps BPB=1, a not-taken branch moves to or keeps BPB=0] The BPB can work: autonomously (because the target address is unknown, it can only be applied at the ID stage), or integrated into the BTB (works at the IF stage)
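The 1-bit scheme can be sketched as follows (the class name is mine):

```python
class OneBitBPB:
    # 1-bit Branch-Predict-Buffer entry: 1 after a taken branch, 0 after a
    # not-taken branch, so the prediction is simply the last outcome.
    def __init__(self, bit=0):
        self.bit = bit

    def predict_taken(self):
        return self.bit == 1

    def update(self, taken):
        # Remembering the last outcome is equivalent to complementing the bit
        # on every misprediction.
        self.bit = 1 if taken else 0
```

Replaying the outcomes taken, taken, not taken, taken against a fresh entry yields the predictions not taken, taken, taken, not taken: every change of direction is mispredicted.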

35 Using a Branch Target Buffer (BTB) at IF stage Example case for a 1-bit BPB 35 for (i=0;i<n_cols;i++){ sum = 0; for (j=0;j<n_rows;j++) sum += A[i][j]*A[i][j]; S[i] = sum; } Assembly: OUTER_LOOP: DSUB.D F4,F4,F4 INNER_LOOP: L.D F0,0(R1) MUL.D F2,F0,F0 DADD.D F4,F4,F2 DADD R1,R1,#8 BNE R1,R2,INNER_LOOP S.D 0(R4),F4 DADD R2,# DADD R4,# BNE R2,R3,OUTER_LOOP [BTB diagram as before: instruction address (TAG), jump address, prediction bits; PC LSBs index the table, the TAG is compared with the PC MSBs to produce the next PC]

36-51 Using a Branch Target Buffer (BTB) at IF stage Example case for a 1-bit BPB (step-by-step animation, slides 36-51) Each instruction takes three cycles to fetch from the L1 instruction cache (L1 Fetch 1/3, 2/3, 3/3) before reaching the issue stage The ISSUE stage fills the BTB only if the decoded instruction is a branch (BR): when BNE R1,R2,INNER_LOOP reaches the issue stage, its instruction address (TAG), jump address and prediction bit are written into the BTB entry selected by the PC LSBs On later fetches of the same branch the BTB hits at the IF stage, so the predicted next PC is available without stalling the pipeline [Each slide repeats the code and BTB diagram of the previous slide, advancing the fetch/issue annotations by one cycle]

52 Dynamic prediction with a 1-bit table Example 52 Consider a typical case of nested for loops: DADDI R1,R0,#64 Ciclo_ext: DADDI R10,R0,#20 Ciclo_int: DSUBI R10,R10,#4 BNE R10,R0,Ciclo_int DSUBI R1,R1,#4 BNE R1,R0,Ciclo_ext The internal loop is executed 5 times for each of the 16 iterations of the external loop Branch misprediction: total branches: 16x5+16=96; total misses: 16x2+2=34 (one miss whenever entering and one whenever exiting the inner loop, plus two for the outer loop); miss rate: 34/96
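These counts can be checked by replaying both branches' outcome streams through 1-bit predictors; the sketch below assumes each predictor starts out predicting not taken:

```python
class OneBitBPB:
    def __init__(self):
        self.taken = False          # assumed initial state: predict not taken

    def mispredicts(self, outcome):
        miss = self.taken != outcome
        self.taken = outcome        # 1-bit rule: remember the last outcome
        return miss

# Inner branch: taken 4 times then not taken, once per outer iteration;
# outer branch: taken 15 times, then not taken on loop exit.
inner_stream = ([True] * 4 + [False]) * 16
outer_stream = [True] * 15 + [False]

misses = 0
for stream in (inner_stream, outer_stream):
    bpb = OneBitBPB()               # one BPB entry per branch
    misses += sum(bpb.mispredicts(o) for o in stream)

total = len(inner_stream) + len(outer_stream)   # 96 branches in total
```

The simulation yields 34 misses out of 96 branches, matching the 16x2+2 count.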

53 Dynamic Branch Prediction 2-bit Branch-Predict-Buffer (BPB) 53 A 2-bit table has four states: Not Taken Strong Prediction (00), Not Taken Weak Prediction (01), Taken Weak Prediction (10), Taken Strong Prediction (11) It can be implemented with a simple 2-bit counter with saturation: whenever the jump is taken, increase the counter; whenever the jump is not taken, decrease the counter [State diagram: states 11 and 10 PREDICT TAKEN, states 01 and 00 PREDICT NOT TAKEN; taken outcomes move the counter towards 11, not-taken outcomes move it towards 00]
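The saturating counter described above can be sketched as follows (class and method names are mine; the weak not-taken initial state is an assumption):

```python
class TwoBitBPB:
    # Saturating 2-bit counter: 00/01 predict not taken, 10/11 predict taken.
    def __init__(self, state=0b01):      # assumed start: weak not taken
        self.state = state

    def predict_taken(self):
        return self.state >= 0b10

    def update(self, taken):
        if taken:
            self.state = min(0b11, self.state + 1)   # move towards 'taken'
        else:
            self.state = max(0b00, self.state - 1)   # move towards 'not taken'
```

Unlike the 1-bit scheme, a single anomalous not-taken outcome does not flip a well-trained predictor: it only moves it from the strong to the weak taken state.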

54 Dynamic prediction with a 2-bit table Example 54 For the previous case: DADDI R1,R0,#64 Ciclo_ext: DADDI R10,R0,#20 Ciclo_int: DSUBI R10,R10,#4 BNE R10,R0,Ciclo_int DSUBI R1,R1,#4 BNE R1,R0,Ciclo_ext The internal loop is executed 5 times for each of the 16 iterations of the external loop Branch misprediction: total branches: 16x5+16=96; total misses: 16x1+1+2=19 (one miss on every inner-loop exit, one extra on the first entry of the inner loop, and two on the outer loop); miss rate: 19/96
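The 2-bit counts can be checked the same way; the sketch below assumes each counter starts in the weak not-taken state (01):

```python
def run(stream, state=0b01):
    # 2-bit saturating counter, starting at 'weak not taken' (01).
    misses = 0
    for taken in stream:
        if (state >= 0b10) != taken:     # states 10/11 predict taken
            misses += 1
        state = min(0b11, state + 1) if taken else max(0b00, state - 1)
    return misses

inner_stream = ([True] * 4 + [False]) * 16   # 5 inner branches x 16 iterations
outer_stream = [True] * 15 + [False]         # 16 outer branches

misses = run(inner_stream) + run(outer_stream)   # 17 + 2 = 19
```

The counter tolerates the single not-taken outcome at each inner-loop exit without forgetting the taken history, which is where the 2-bit scheme beats the 1-bit one.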

55 Branch Prediction Dynamic prediction with a 3-bit table 55 A 3-bit table has eight states: Not Taken Strong++ Prediction (000), Not Taken Strong+ Prediction (001), Not Taken Strong Prediction (010), Not Taken Weak Prediction (011), Taken Weak Prediction (100), Taken Strong Prediction (101), Taken Strong+ Prediction (110), Taken Strong++ Prediction (111) Taken outcomes increase the counter and not-taken outcomes decrease it, saturating at 000 and 111; the most significant bit states the prediction
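The rule that the most significant bit states the prediction generalises the counter to any width; a sketch (names and defaults are illustrative):

```python
class NBitBPB:
    # n-bit saturating counter; the MSB of the counter gives the prediction.
    def __init__(self, bits=3, state=0):
        self.bits = bits
        self.max_state = (1 << bits) - 1
        self.state = state

    def predict_taken(self):
        return (self.state >> (self.bits - 1)) & 1 == 1   # MSB set => taken

    def update(self, taken):
        if taken:
            self.state = min(self.max_state, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

A wider counter needs a longer run of opposite outcomes before the prediction flips, trading responsiveness for resistance to occasional anomalies.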

56 56 Next lesson Correlated branch prediction schemes Branch prediction and dynamic scheduling Superscalar architectures


More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3

CISC 662 Graduate Computer Architecture. Lecture 10 - ILP 3 CISC 662 Graduate Computer Architecture Lecture 10 - ILP 3 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Processor: Superscalars Dynamic Scheduling

Processor: Superscalars Dynamic Scheduling Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation

More information

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism

ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,

More information

Scoreboard information (3 tables) Four stages of scoreboard control

Scoreboard information (3 tables) Four stages of scoreboard control Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Good luck and have fun!

Good luck and have fun! Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls

Page 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..

More information

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?

What is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP? What is ILP? Instruction Level Parallelism or Declaration of Independence The characteristic of a program that certain instructions are, and can potentially be. Any mechanism that creates, identifies,

More information

Dynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm:

Dynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm: LECTURE - 13 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example

CS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline

Multicycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline //11 Limitations of Our Simple stage Pipeline Diversified Pipelines The Path Toward Superscalar Processors HPCA, Spring 11 Assumes single cycle EX stage for all instructions This is not feasible for Complex

More information

EITF20: Computer Architecture Part3.2.1: Pipeline - 3

EITF20: Computer Architecture Part3.2.1: Pipeline - 3 EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done

More information

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:

Review: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software: CS152 Computer Architecture and Engineering Lecture 17 Dynamic Scheduling: Tomasulo March 20, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

More information

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer

Page # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution

Page 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002

More information

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm

吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm EEF011 Computer Architecture 計算機結構 吳俊興高雄大學資訊工程學系 October 2004 Example to eleminate WAR and WAW by register renaming Original DIV.D ADD.D S.D SUB.D MUL.D F0, F2, F4 F6, F0, F8 F6, 0(R1) F8, F10, F14 F6,

More information

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

The basic structure of a MIPS floating-point unit

The basic structure of a MIPS floating-point unit Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from

More information

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions

CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy

More information

Superscalar Architectures: Part 2

Superscalar Architectures: Part 2 Superscalar Architectures: Part 2 Dynamic (Out-of-Order) Scheduling Lecture 3.2 August 23 rd, 2017 Jae W. Lee (jaewlee@snu.ac.kr) Computer Science and Engineering Seoul NaMonal University Download this

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012

Advanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012 Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code

More information

DAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation

DAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Mafijul Islam Department of Computer Science and Engineering November 19, 2009 Study Period 2, 2009 Goals:

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Computer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution

Computer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution CSCE 6 (Fall 07) Computer Architecture Homework Set # COVER SHEET Please turn in with your own solution Eun Jung Kim Write your answers on the sheets provided. Submit with the COVER SHEET. If you need

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation

Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,

More information

5008: Computer Architecture

5008: Computer Architecture 5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage

More information

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,

Chapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007, Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4

Instruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4 PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI

More information

Instruction Level Parallelism. Taken from

Instruction Level Parallelism. Taken from Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction

More information

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes

CS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not

More information

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units

For this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units CS333: Computer Architecture Spring 006 Homework 3 Total Points: 49 Points (undergrad), 57 Points (graduate) Due Date: Feb. 8, 006 by 1:30 pm (See course information handout for more details on late submissions)

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Instruction Level Parallelism (ILP)

Instruction Level Parallelism (ILP) Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)

References EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions) EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s

More information

CSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism

CSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism CSE 502 Graduate Computer Architecture Lec 8-10 Instruction Level Parallelism Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

COSC 6385 Computer Architecture - Instruction Level Parallelism (II)

COSC 6385 Computer Architecture - Instruction Level Parallelism (II) COSC 6385 Computer Architecture - Instruction Level Parallelism (II) Edgar Gabriel Spring 2016 Data fields for reservation stations Op: operation to perform on source operands S1 and S2 Q j, Q k : reservation

More information

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences

Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences Dynamic Scheduling Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences Increased compiler complexity, especially when attempting

More information

NOW Handout Page 1. Review from Last Time. CSE 820 Graduate Computer Architecture. Lec 7 Instruction Level Parallelism. Recall from Pipelining Review

NOW Handout Page 1. Review from Last Time. CSE 820 Graduate Computer Architecture. Lec 7 Instruction Level Parallelism. Recall from Pipelining Review Review from Last Time CSE 820 Graduate Computer Architecture Lec 7 Instruction Level Parallelism Based on slides by David Patterson 4 papers: All about where to draw line between HW and SW IBM set foundations

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

Instruction Level Parallelism

Instruction Level Parallelism Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated

More information

Multi-cycle Instructions in the Pipeline (Floating Point)

Multi-cycle Instructions in the Pipeline (Floating Point) Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining

More information

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "

More information
