DYNAMIC SPECULATIVE EXECUTION
Slides by: Pedro Tomás
Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 3, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
ADVANCED COMPUTER ARCHITECTURES / ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Outline
- Dynamic instruction scheduling: revision of the Tomasulo algorithm
- Loop unrolling with Tomasulo
- Introduction to dynamic branch prediction
Tomasulo algorithm
Proposed by Robert Tomasulo in 1966:
- Initially proposed to overcome the long latencies of memory accesses and floating-point operations
- First implemented on the IBM 360/91
The algorithm proved to be far more powerful than anticipated, and it is used in all modern superscalar processors.
Tomasulo's algorithm: general idea
- Instructions are issued to reservation stations associated with functional units.
- Operands that are ready are copied directly to the reservation station.
- Operands that are unavailable force the instruction to wait at the reservation station.
- Instructions no longer wait for the value in the register, but for the completion of the instruction on a given reservation station.
[Figure: pipeline with IF and ISSUE stages, a register file, address calculation + memory with load buffers L1-L4, an INT ALU with stations I1-I4 (FU 2), an FP adder with A1-A4 (FU 3), an FP multiplier with M1-M4 (FU 4), and an INT/FP divider with D1-D2 (FU 5), all writing results over a Common Data Bus (CDB).]
Tomasulo's algorithm: execute stage
1. Reservation stations and the register file (RF) snoop writes to the common data bus (CDB).
2. If a value required by a reservation station or by the RF is written on the CDB, it is copied from the CDB.
[Figure: the same pipeline diagram, highlighting the write of a result from the instruction on reservation station D2 onto the CDB.]
Tomasulo's algorithm: reservation stations
Information in each reservation station:
- Busy: station availability
- Op: operation to execute
- Vj, Vk: values of operands j, k (valid if the operands are ready)
- Qj, Qk: readiness of operands j, k (the label of the reservation station holding the instruction that will generate the result)
Load/store buffers have an additional field A, used to store the immediate and later the effective load/store address (for indexed loads/stores, e.g., M[R[AA] + Imm] <- R[BA]).
Information on registers (integer R0..Rn and FP F0..Fn):
- Data: the register value
- Q: a readiness label; each register is marked as ready (Q = 0) or not ready (Q holds the label of the reservation station with the instruction that generates the value)
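The fields above can be sketched in a few lines of Python. This is a minimal illustrative model, not a full Tomasulo implementation: the class and field names mirror the slide, and the CDB broadcast is modeled as a simple loop over stations.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    """One Tomasulo reservation-station entry (field names follow the slides)."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, valid once ready
    vk: Optional[float] = None
    qj: Optional[str] = None     # label of producing station, None = ready
    qk: Optional[str] = None

    def snoop(self, tag: str, value: float) -> None:
        """Snoop a CDB write: capture the value if we were waiting on `tag`."""
        if self.qj == tag:
            self.vj, self.qj = value, None
        if self.qk == tag:
            self.vk, self.qk = value, None

    def ready(self) -> bool:
        # An instruction may start executing once both operands are valid.
        return self.busy and self.qj is None and self.qk is None

def broadcast(stations, tag, value):
    """A write-result on the CDB reaches every station (and the RF)."""
    for s in stations:
        s.snoop(tag, value)
```

For example, a MULT.D waiting on a load in buffer L1 becomes ready the moment L1's result is broadcast.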
Tomasulo's algorithm: example
Consider the execution of the instructions below on a processor with:
- Pipelined functional units: 1x integer ALU (1 cycle latency), 1x FP multiplier (4 cycles latency), 1x FP adder/subtractor (3 cycles latency), 1x INT/FP divider (20 cycles latency)
- Load/store unit latency: effective address calculation 1 cycle; level 1 cache 3 cycles; level 2 cache 5 cycles; level 3 cache 12 cycles; main memory 50 cycles
- Reservation stations: 3 load + 3 store buffers, 2 slots for integer operations, 2 slots for FP multiplication/division, 2 slots for FP addition/subtraction

      LWI    R1,#V1
      LWI    R2,#V1+Len(V1)
Cont: L.D    F0,0(R1)
      MULT.D F4,F0,F2
      S.D    0(R1),F4
      DSUBI  R1,R1,#8
      BNE    R1,R2,Cont
Dynamic scheduling with Tomasulo: execution example (initial state)
[Table snapshot: reservation stations I1-I2 (integer), M1-M2 (FP mult/div), A1-A2 (FP add/sub), with fields Busy, Op, Vj, Vk, Qj, Qk; load buffers L1-L3 and store buffers S1-S3, with an extra field A; instruction status (not required, used for illustration only) for L.D F0,0(R1) / MULT.D F4,F0,F2 / S.D 0(R1),F4 / DSUBI R1,R1,#8 / BNE R1,R2,Cont, with ISSUE/EX/WB columns; register status for R1, F0, F2, F4. Everything starts empty/ready.]
Dynamic scheduling with Tomasulo: execution example after cycle 1
[Snapshot: L.D F0,0(R1) issues in cycle 1 to load buffer L1 (Busy = Yes, Vj = R1); register status marks F0 as waiting on L1.]
Dynamic scheduling with Tomasulo: execution example after cycle 2
[Snapshot: MULT.D F4,F0,F2 issues in cycle 2 to M1 (Vk = F2, Qj = L1); the load computes its effective address (A = R1+0); register status: F0 waits on L1, F4 waits on M1.]
Dynamic scheduling with Tomasulo: execution example after cycle 3
[Snapshot: S.D 0(R1),F4 issues in cycle 3 to store buffer S1 (Vj = R1, Qk = M1, A = 0); the load performs L1-cache access cycle 1.]
Dynamic scheduling with Tomasulo: execution example after cycle 4
[Snapshot: DSUBI R1,R1,#8 issues in cycle 4 to I1 (Vj = R1, Vk = #8); the load is in L1-cache access cycle 2; store buffer S1 gets its effective address (A = R1+0); register status: R1 waits on I1.]
Dynamic scheduling with Tomasulo: execution example after cycle 5
[Snapshot: DSUBI executes in cycle 5; the load's L1-cache access MISSES; BNE R1,R2,Cont STALLS at issue.]
Assume for now that branches are resolved at the issue stage: the pipeline stalls until the hazard is resolved.
Dynamic scheduling with Tomasulo: execution example after cycle 6
[Snapshot: DSUBI writes back in cycle 6 (I1 freed, R1 updated); BNE issues in cycle 6; the load starts its L2-cache access (cycle 1).]
Dynamic scheduling with Tomasulo: execution example, cycles 7-8
[Snapshot: L2-cache access cycles 2-3; the second loop iteration (L.D, MULT.D, S.D, DSUBI, BNE) appears in the instruction status, not yet issued.]
Dynamic scheduling with Tomasulo: execution example after cycle 9
[Snapshot: the second L.D issues in cycle 9 to load buffer L2 (Vj = R1); the first load is in L2-cache access cycle 4; register status: F0 now waits on L2.]
Dynamic scheduling with Tomasulo: execution example after cycle 10
[Snapshot: the second MULT.D issues in cycle 10 to M2 (Vk = F2, Qj = L2); the first load's L2 access HITS; the second load's effective address is computed (A = R1+0).]
The WAW hazard on register F4 was resolved by renaming: F4 now waits on M2.
Dynamic scheduling with Tomasulo: execution example after cycle 11
[Snapshot: the first load writes back (buffer L1 freed) and M1 captures F0, starting execution (cycle 1); the second S.D issues in cycle 11 to store buffer S2 (Vj = R1, Qk = M2, A = 0); the second load starts its L1-cache access (cycle 1).]
Notice that two loop iterations are under execution: LOOP UNROLLING is happening dynamically in hardware.
Dynamic scheduling with Tomasulo: execution example after cycle 12
[Snapshot: the second DSUBI issues in cycle 12 to I1 (Vj = R1, Vk = #8); M1 executes cycle 2; the second load is in L1-cache access cycle 2; S2 gets its effective address (A = R1+0); register status: R1 waits on I1.]
Dynamic scheduling with Tomasulo: execution example after cycle 13
[Snapshot: the second BNE STALLS at issue; M1 executes cycle 3; the second load's L1-cache access HITS.]
Dynamic scheduling with Tomasulo: execution example after cycle 14
[Snapshot: the first MULT.D finishes execution in cycle 14; the second load writes back (buffer L2 freed) and M2 captures F0, starting execution (cycle 1); the second DSUBI waits in I1 and the second BNE remains stalled at issue.]
Dynamic scheduling with Tomasulo: execution example after cycle 15
[Snapshot: the first MULT.D writes back on the CDB (M1 freed); store buffer S1 captures the value and the first S.D can proceed; M2 executes cycle 2.]
Dynamic scheduling with Tomasulo: execution example after cycle 16
[Snapshot: the first S.D completes in cycle 16; the second BNE issues in cycle 16; M2 executes cycle 3.]
Dynamic scheduling with Tomasulo: hazards due to memory accesses
Out-of-order memory accesses can generate hazards, namely when:
- a load is followed by a store to the same effective address (WAR);
- a store is followed by a load from the same effective address (RAW);
- a store is followed by a store to the same effective address (WAW).
A simple way to avoid these hazards is to compute effective addresses in program order: delay dispatching a load/store to the load/store buffer when its effective address is already in any of the buffers (RAW/WAR/WAW).
This is an alternative, with hardware support, to the compiler's instruction retiming.
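The dispatch rule above can be sketched as a simple conservative check. This is an illustrative model only: buffers are modeled as (kind, address) tuples in program order, and the function blocks any access that conflicts with an in-flight one.

```python
def can_dispatch(buffers, kind, addr):
    """Decide whether a load/store with a known effective address may be
    dispatched to the load/store buffers.

    `buffers` holds the in-flight accesses, oldest first, each a
    (kind, addr) tuple with kind in {"load", "store"}. A hazard exists
    when an earlier access touches the same address and at least one of
    the two accesses is a store (RAW, WAR, or WAW); load-after-load to
    the same address is harmless.
    """
    for prev_kind, prev_addr in buffers:
        if prev_addr == addr and (prev_kind == "store" or kind == "store"):
            return False  # potential memory hazard: wait
    return True
```

For instance, a load to an address held by a pending store must wait (RAW), while two loads to the same address may both proceed.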
Dynamic scheduling with Tomasulo: problems
While the performance of Tomasulo's algorithm is high, its implementation is complex and requires a large amount of hardware resources:
- Each reservation station requires fast logic to compare the CDB label against the operand labels (Qj, Qk).
- The CDB can seriously compromise performance whenever there are simultaneous writes.
- Multiple CDBs can be implemented; however, that implies increasing the reservation-station logic that compares the labels, and it increases the control logic for CDB arbitration.
Dynamic scheduling with Tomasulo: problems
Control hazards are an important limitation: on average, 1 in 7 instructions is a control instruction.
To resolve the control hazards of conditional jumps/branches, the following information must be obtained:
- the effective jump/branch target address;
- the value of the register associated with the condition.
Deep pipelines impose larger penalties when the effective address or the associated register is unknown.
The problem is aggravated with multiple instruction issue: control instructions appear in more cycles, so the impact of stalling the pipeline becomes worse.
Dynamic scheduling with Tomasulo: branch prediction
To apply branch prediction (static or dynamic), the processor must have a way of recovering execution onto the alternative path whenever the original assumption is wrong. Recovering implies knowing the processor status at the prediction stage.
The efficiency of branch prediction depends on:
- the branch misprediction rate;
- the branch penalty for each misprediction:
  - Pentium II: 12 cycles
  - Pentium 4: 20 cycles
  - Core 2: 13 cycles
  - i3/i5/i7 (Nehalem): 17 cycles
  - i3/i5/i7 (Sandy Bridge): 15 cycles
Branch prediction
Dynamic branch prediction using:
- Branch Target Buffer (BTB)
- Branch Prediction Buffer (BPB)
- Branch History Table (BHT)
Branch prediction: calculation of the jump address
Predict-not-taken is easy: the predicted jump address is simply the next PC.
However, loops typically cause many branches to be taken, and even unconditional branches (e.g., function call/return) require knowing the target address.
Anticipating the effective address calculation to early pipeline stages, and using delayed branches, can minimize this problem; however, these techniques cannot be applied in all cases.
Branch prediction: Branch Target Buffer (BTB)
Alternative: use a Branch Target Buffer (BTB), i.e., build a table, at run time, of the target address of each control instruction.
To reduce memory resources, instead of saving the target address of all instructions, use a cache for the most recent control instructions. The larger the memory ("cache"), the more information can be saved, decreasing branch mispredictions; however, it also implies spending more memory.
Where to put the BTB: at the IF stage, to enable fetching the next instruction without stalling the pipeline.
[Figure: BTB organization. Each entry stores the instruction address (TAG), the jump address, and prediction bits. The p LSBs of the current PC index the table; the remaining n-p MSBs are compared against the stored TAG; on a hit, the prediction bits (taken / not taken) select between PC+4 and the stored jump address as the next PC.]
Branch prediction: Branch Target Buffer (BTB)
Use the PC least significant bits (LSBs) to index the table, and compare the entry's TAG with the PC most significant bits (MSBs):
- If TAG != MSBs(PC): no prediction; fetch from PC+4 (don't jump).
- If TAG = MSBs(PC): check the prediction bits; if the branch history is taken, fetch from the stored jump address; if the branch history is not taken, fetch from PC+4.
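The lookup above can be sketched as a toy direct-mapped BTB. This is an illustrative model (the class name, entry count, and 4-byte instruction size are assumptions, and the prediction field is a single bit), not a description of any real processor's BTB.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB. Each entry holds the tag (PC high bits),
    the jump target, and a 1-bit prediction (True = predict taken)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag, target, taken)

    def _split(self, pc):
        # Instructions are word-aligned, so drop the low 2 bits, then
        # use the LSBs as index and the MSBs as tag (as on the slide).
        index = (pc >> 2) % self.entries
        tag = (pc >> 2) // self.entries
        return index, tag

    def predict(self, pc):
        """Return the predicted next PC at the IF stage."""
        index, tag = self._split(pc)
        entry = self.table[index]
        if entry is not None and entry[0] == tag and entry[2]:
            return entry[1]       # hit and history says taken: jump
        return pc + 4             # miss or predicted not taken: fall through

    def update(self, pc, target, taken):
        """Fill/refresh the entry once the branch is resolved."""
        index, tag = self._split(pc)
        self.table[index] = (tag, target, taken)
```

On the first encounter of a branch the BTB misses and fetch falls through to PC+4; once the branch resolves and the entry is filled, subsequent fetches of the same PC redirect to the stored target.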
Branch prediction: branch prediction strategy
Static branch prediction, whose efficiency depends on the compiler/programmer:
- Always predict taken
- Always predict not taken
- Conditioned on the jump address: predict taken when the predicted address corresponds to a backward jump (i.e., to a lower address); predict not taken when it corresponds to a forward jump (i.e., to a higher address)
Dynamic branch prediction:
- Takes the execution path into consideration
- Uses more hardware and power, but increases the prediction rate to over 80%
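The address-conditioned static heuristic above (often called backward-taken/forward-not-taken) fits in one line; a small sketch, with the function name my own:

```python
def static_predict_taken(pc, target):
    """Backward-taken / forward-not-taken static heuristic: loop-closing
    branches jump backward (to a lower address), so predict those taken."""
    return target < pc
```

The rationale is that most backward branches close loops and are taken on every iteration but the last.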
Branch prediction: dynamic prediction with a 1-bit table
The simplest branch prediction scheme uses a prediction table of just one bit per entry, the Branch Prediction Buffer (BPB):
- BPB = 1 when the latest branch outcome was taken (predict taken)
- BPB = 0 when the latest branch outcome was not taken (predict not taken)
- Complement the BPB value whenever the prediction is wrong
[State diagram: two states, Predict Taken (BPB = 1) and Predict Not Taken (BPB = 0); a taken branch moves/keeps the state at 1, a not-taken branch moves/keeps it at 0.]
The BPB can work:
- Autonomously: because the target address is unknown, it can only be applied at the ID stage
- Integrated into the BTB: works at the IF stage
Using a Branch Target Buffer (BTB) at the IF stage: example case for a 1-bit BPB

for (i = 0; i < n_cols; i++) {
    sum = 0;
    for (j = 0; j < n_rows; j++)
        sum += A[i][j] * A[i][j];
    S[i] = sum;
}

OUTER_LOOP: DSUB.D F4,F4,F4
INNER_LOOP: L.D    F0,0(R1)
            MUL.D  F2,F0,F0
            DADD.D F4,F4,F2
            DADD   R1,R1,#8
            BNE    R1,R2,INNER_LOOP
            S.D    0(R4),F4
            DADD   R2,#
            DADD   R4,#8
            BNE    R2,R3,OUTER_LOOP

[Figure: the BTB from the previous slides (instruction address TAG, jump address, prediction bits), indexed by the PC LSBs and tagged with the PC MSBs, selecting the next PC.]
The following slides animate the BTB at work on this loop:
- Each instruction fetch from the L1 cache takes three cycles (Fetch 1/3, 2/3, 3/3), after which the instruction reaches the issue stage. The issue stage fills the BTB only if the decoded instruction is a branch.
- When BNE R1,R2,INNER_LOOP is issued, a BTB entry is allocated for it: the entry is indexed by the branch PC's LSBs, tagged with its MSBs, and stores the INNER_LOOP target address plus the 1-bit prediction.
- From then on, whenever the same branch is fetched again, the BTB hits at the IF stage: the next PC is selected between PC+4 and the stored target according to the prediction bit, with no pipeline stall.
- The same happens for BNE R2,R3,OUTER_LOOP, which gets its own BTB entry when first issued.
Dynamic prediction with a 1-bit table: example
Consider a typical case of nested for loops:

            DADDI R1,R0,#16
Ciclo_ext:  DADDI R10,R0,#20
Ciclo_int:  DSUBI R10,R10,#4
            BNE   R10,R0,Ciclo_int
            DSUBI R1,R1,#1
            BNE   R1,R0,Ciclo_ext

The internal loop is executed 5 times for each of the 16 iterations of the external loop.
Branch misprediction with a 1-bit predictor:
- Total branches: 16x5 + 16 = 96
- Total misses: 16x2 + 2 = 34 (one miss whenever entering and one whenever exiting the inner/outer loop)
- Miss rate: 34/96
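The miss count above can be checked with a short simulation. This is a sketch under the stated assumptions: one independent 1-bit predictor per branch, both bits initialized to 0 (predict not taken), the inner branch executing 5 times per outer iteration (taken on all but the last), and the outer branch taken on all but its last execution.

```python
def simulate_1bit(outer_iters=16, inner_execs=5):
    """Count branch executions and 1-bit predictor misses for the
    nested-loop example (one predictor bit per branch, starting at 0)."""
    bit = {"inner": 0, "outer": 0}   # 1 = predict taken
    branches = misses = 0

    def run(name, taken):
        nonlocal branches, misses
        branches += 1
        if bit[name] != int(taken):
            misses += 1
        bit[name] = int(taken)       # 1-bit rule: remember the last outcome

    for i in range(outer_iters):
        for j in range(inner_execs):
            run("inner", taken=(j < inner_execs - 1))
        run("outer", taken=(i < outer_iters - 1))
    return branches, misses
```

Running it reproduces the slide's numbers: 96 branches and 34 misses, because the 1-bit scheme mispredicts twice per inner-loop instance (on entry and on exit).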
Dynamic branch prediction: 2-bit Branch Prediction Buffer (BPB)
A 2-bit table has four states:
- Not Taken, Strong prediction (00)
- Not Taken, Weak prediction (01)
- Taken, Weak prediction (10)
- Taken, Strong prediction (11)
It can be implemented with a simple 2-bit counter with saturation: whenever the jump is taken, increase the counter; whenever it is not taken, decrease the counter. States 10 and 11 predict taken; states 00 and 01 predict not taken.
[State diagram: a taken branch moves the state toward Strong Predict Taken; a not-taken branch moves it toward Strong Predict Not Taken. A single misprediction in a strong state only weakens the prediction, so two consecutive mispredictions are needed to flip it.]
Dynamic prediction with a 2-bit table: example
For the previous case (inner loop executed 5 times for each of the 16 iterations of the external loop):
- Total branches: 16x5 + 16 = 96
- Total misses: 16x1 + 1 + 2 = 19 (one per inner-loop exit, plus the initial misses when first entering the inner/outer loops)
- Miss rate: 19/96
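The same simulation idea verifies the 2-bit count. Assumptions, stated explicitly because the slide does not give them: one 2-bit saturating counter per branch, initialized to 01 (weak not taken), with counter >= 2 predicting taken.

```python
def simulate_2bit(outer_iters=16, inner_execs=5):
    """Count branch executions and misses for the nested-loop example
    with one 2-bit saturating counter per branch (initial state 01)."""
    counter = {"inner": 1, "outer": 1}
    branches = misses = 0

    def run(name, taken):
        nonlocal branches, misses
        branches += 1
        if (counter[name] >= 2) != taken:
            misses += 1
        # Saturating update: taken increments, not taken decrements.
        if taken:
            counter[name] = min(3, counter[name] + 1)
        else:
            counter[name] = max(0, counter[name] - 1)

    for i in range(outer_iters):
        for j in range(inner_execs):
            run("inner", taken=(j < inner_execs - 1))
        run("outer", taken=(i < outer_iters - 1))
    return branches, misses
```

This yields 96 branches and 19 misses: unlike the 1-bit scheme, the not-taken exit of the inner loop only weakens the taken prediction, so re-entering the loop no longer mispredicts.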
Branch prediction: dynamic prediction with a 3-bit table
A 3-bit table has eight states:
- Not Taken: Strong++ (000), Strong+ (001), Strong (010), Weak (011)
- Taken: Weak (100), Strong (101), Strong+ (110), Strong++ (111)
It is again implemented as a saturating counter: a taken branch increments the counter, a not-taken branch decrements it, and the most significant bit of the counter states the prediction.
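The 1-, 2-, and 3-bit schemes are all instances of one n-bit saturating counter; a generic sketch (the factory function and its weak-state initialization are my own choices, not from the slides):

```python
def make_saturating_predictor(bits=3):
    """Generic n-bit saturating-counter branch predictor.
    The counter's MSB gives the prediction, as stated on the slide."""
    top = (1 << bits) - 1
    state = {"c": top // 2}          # start just below the taken threshold

    def predict():
        return state["c"] >= (1 << (bits - 1))   # MSB set => predict taken

    def update(taken):
        if taken:
            state["c"] = min(top, state["c"] + 1)
        else:
            state["c"] = max(0, state["c"] - 1)

    return predict, update
```

With more bits, more consecutive mispredictions are needed before a strongly biased branch flips its prediction.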
Next lesson
- Correlated branch prediction schemes
- Branch prediction and dynamic scheduling
- Superscalar architectures
More informationProcessor: Superscalars Dynamic Scheduling
Processor: Superscalars Dynamic Scheduling Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 (Princeton),
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Basic Compiler Techniques for Exposing ILP Advanced Branch Prediction Dynamic Scheduling Hardware-Based Speculation
More informationELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism
ELEC 5200/6200 Computer Architecture and Design Fall 2016 Lecture 9: Instruction Level Parallelism Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University,
More informationScoreboard information (3 tables) Four stages of scoreboard control
Scoreboard information (3 tables) Instruction : issued, read operands and started execution (dispatched), completed execution or wrote result, Functional unit (assuming non-pipelined units) busy/not busy
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)
Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case
More informationCACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás
CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationGood luck and have fun!
Midterm Exam October 13, 2014 Name: Problem 1 2 3 4 total Points Exam rules: Time: 90 minutes. Individual test: No team work! Open book, open notes. No electronic devices, except an unprogrammed calculator.
More informationInstruction Level Parallelism
Instruction Level Parallelism The potential overlap among instruction execution is called Instruction Level Parallelism (ILP) since instructions can be executed in parallel. There are mainly two approaches
More informationHardware-based Speculation
Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationPage 1. Recall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationWebsite for Students VTU NOTES QUESTION PAPERS NEWS RESULTS
Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly
More informationCS433 Homework 2 (Chapter 3)
CS Homework 2 (Chapter ) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration..
More informationWhat is ILP? Instruction Level Parallelism. Where do we find ILP? How do we expose ILP?
What is ILP? Instruction Level Parallelism or Declaration of Independence The characteristic of a program that certain instructions are, and can potentially be. Any mechanism that creates, identifies,
More informationDynamic Scheduling. Better than static scheduling Scoreboarding: Tomasulo algorithm:
LECTURE - 13 Dynamic Scheduling Better than static scheduling Scoreboarding: Used by the CDC 6600 Useful only within basic block WAW and WAR stalls Tomasulo algorithm: Used in IBM 360/91 for the FP unit
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationCISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1
CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer
More informationEECC551 Exam Review 4 questions out of 6 questions
EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving
More informationCS252 Graduate Computer Architecture Lecture 6. Recall: Software Pipelining Example
CS252 Graduate Computer Architecture Lecture 6 Tomasulo, Implicit Register Renaming, Loop-Level Parallelism Extraction Explicit Register Renaming John Kubiatowicz Electrical Engineering and Computer Sciences
More informationCS433 Homework 2 (Chapter 3)
CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies
More informationMulticycle ALU Operations 2/28/2011. Diversified Pipelines The Path Toward Superscalar Processors. Limitations of Our Simple 5 stage Pipeline
//11 Limitations of Our Simple stage Pipeline Diversified Pipelines The Path Toward Superscalar Processors HPCA, Spring 11 Assumes single cycle EX stage for all instructions This is not feasible for Complex
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationReview: Compiler techniques for parallelism Loop unrolling Ÿ Multiple iterations of loop in software:
CS152 Computer Architecture and Engineering Lecture 17 Dynamic Scheduling: Tomasulo March 20, 2001 John Kubiatowicz (http.cs.berkeley.edu/~kubitron) lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
More informationPage # CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,
More informationPage 1. Recall from Pipelining Review. Lecture 15: Instruction Level Parallelism and Dynamic Execution
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 15: Instruction Level Parallelism and Dynamic Execution March 11, 2002 Prof. David E. Culler Computer Science 252 Spring 2002
More information吳俊興高雄大學資訊工程學系. October Example to eleminate WAR and WAW by register renaming. Tomasulo Algorithm. A Dynamic Algorithm: Tomasulo s Algorithm
EEF011 Computer Architecture 計算機結構 吳俊興高雄大學資訊工程學系 October 2004 Example to eleminate WAR and WAW by register renaming Original DIV.D ADD.D S.D SUB.D MUL.D F0, F2, F4 F6, F0, F8 F6, 0(R1) F8, F10, F14 F6,
More informationCPE 631 Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 11: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationThe basic structure of a MIPS floating-point unit
Tomasulo s scheme The algorithm based on the idea of reservation station The reservation station fetches and buffers an operand as soon as it is available, eliminating the need to get the operand from
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationSuperscalar Architectures: Part 2
Superscalar Architectures: Part 2 Dynamic (Out-of-Order) Scheduling Lecture 3.2 August 23 rd, 2017 Jae W. Lee (jaewlee@snu.ac.kr) Computer Science and Engineering Seoul NaMonal University Download this
More informationPage 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer
CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson
More informationAdvanced Computer Architecture CMSC 611 Homework 3. Due in class Oct 17 th, 2012
Advanced Computer Architecture CMSC 611 Homework 3 Due in class Oct 17 th, 2012 (Show your work to receive partial credit) 1) For the following code snippet list the data dependencies and rewrite the code
More informationDAT105: Computer Architecture Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation
Study Period 2, 2009 Exercise 3 Chapter 2: Instruction-Level Parallelism and Its Exploitation Mafijul Islam Department of Computer Science and Engineering November 19, 2009 Study Period 2, 2009 Goals:
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 09
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationComputer Architecture Homework Set # 3 COVER SHEET Please turn in with your own solution
CSCE 6 (Fall 07) Computer Architecture Homework Set # COVER SHEET Please turn in with your own solution Eun Jung Kim Write your answers on the sheets provided. Submit with the COVER SHEET. If you need
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationChapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences
Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationTopics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation
Digital Systems Architecture EECE 343-01 EECE 292-02 Predication, Prediction, and Speculation Dr. William H. Robinson February 25, 2004 http://eecs.vanderbilt.edu/courses/eece343/ Topics Aha, now I see,
More information5008: Computer Architecture
5008: Computer Architecture Chapter 2 Instruction-Level Parallelism and Its Exploitation CA Lecture05 - ILP (cwliu@twins.ee.nctu.edu.tw) 05-1 Review from Last Lecture Instruction Level Parallelism Leverage
More informationChapter 3 (CONT II) Instructor: Josep Torrellas CS433. Copyright J. Torrellas 1999,2001,2002,2007,
Chapter 3 (CONT II) Instructor: Josep Torrellas CS433 Copyright J. Torrellas 1999,2001,2002,2007, 2013 1 Hardware-Based Speculation (Section 3.6) In multiple issue processors, stalls due to branches would
More informationSuper Scalar. Kalyan Basu March 21,
Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationInstruction Frequency CPI. Load-store 55% 5. Arithmetic 30% 4. Branch 15% 4
PROBLEM 1: An application running on a 1GHz pipelined processor has the following instruction mix: Instruction Frequency CPI Load-store 55% 5 Arithmetic 30% 4 Branch 15% 4 a) Determine the overall CPI
More informationInstruction Level Parallelism. Taken from
Instruction Level Parallelism Taken from http://www.cs.utsa.edu/~dj/cs3853/lecture5.ppt Outline ILP Compiler techniques to increase ILP Loop Unrolling Static Branch Prediction Dynamic Branch Prediction
More informationLecture 8 Dynamic Branch Prediction, Superscalar and VLIW. Computer Architectures S
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Computer Architectures 521480S Dynamic Branch Prediction Performance = ƒ(accuracy, cost of misprediction) Branch History Table (BHT) is simplest
More informationCPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor
Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction
More informationCS433 Midterm. Prof Josep Torrellas. October 19, Time: 1 hour + 15 minutes
CS433 Midterm Prof Josep Torrellas October 19, 2017 Time: 1 hour + 15 minutes Name: Instructions: 1. This is a closed-book, closed-notes examination. 2. The Exam has 4 Questions. Please budget your time.
More informationCPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor
1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation 1 Branch Prediction Basic 2-bit predictor: For each branch: Predict taken or not
More informationFor this problem, consider the following architecture specifications: Functional Unit Type Cycles in EX Number of Functional Units
CS333: Computer Architecture Spring 006 Homework 3 Total Points: 49 Points (undergrad), 57 Points (graduate) Due Date: Feb. 8, 006 by 1:30 pm (See course information handout for more details on late submissions)
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationInstruction Level Parallelism (ILP)
Instruction Level Parallelism (ILP) Pipelining supports a limited sense of ILP e.g. overlapped instructions, out of order completion and issue, bypass logic, etc. Remember Pipeline CPI = Ideal Pipeline
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationReferences EE457. Out of Order (OoO) Execution. Instruction Scheduling (Re-ordering of instructions)
EE457 Out of Order (OoO) Execution Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm) By Gandhi Puvvada References EE557 Textbook Prof Dubois EE557 Classnotes Prof Annavaram s
More informationCSE 502 Graduate Computer Architecture. Lec 8-10 Instruction Level Parallelism
CSE 502 Graduate Computer Architecture Lec 8-10 Instruction Level Parallelism Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,
More informationCOSC 6385 Computer Architecture - Instruction Level Parallelism (II)
COSC 6385 Computer Architecture - Instruction Level Parallelism (II) Edgar Gabriel Spring 2016 Data fields for reservation stations Op: operation to perform on source operands S1 and S2 Q j, Q k : reservation
More informationPipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences
Dynamic Scheduling Pipelining: Issue instructions in every cycle (CPI 1) Compiler scheduling (static scheduling) reduces impact of dependences Increased compiler complexity, especially when attempting
More informationNOW Handout Page 1. Review from Last Time. CSE 820 Graduate Computer Architecture. Lec 7 Instruction Level Parallelism. Recall from Pipelining Review
Review from Last Time CSE 820 Graduate Computer Architecture Lec 7 Instruction Level Parallelism Based on slides by David Patterson 4 papers: All about where to draw line between HW and SW IBM set foundations
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationAs the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.
Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction
More informationCMSC411 Fall 2013 Midterm 2 Solutions
CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has
More informationInstruction Level Parallelism
Instruction Level Parallelism Dynamic scheduling Scoreboard Technique Tomasulo Algorithm Speculation Reorder Buffer Superscalar Processors 1 Definition of ILP ILP=Potential overlap of execution among unrelated
More informationMulti-cycle Instructions in the Pipeline (Floating Point)
Lecture 6 Multi-cycle Instructions in the Pipeline (Floating Point) Introduction to instruction level parallelism Recap: Support of multi-cycle instructions in a pipeline (App A.5) Recap: Superpipelining
More informationComputer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "
More informationNOW Handout Page 1. Outline. Csci 211 Computer System Architecture. Lec 4 Instruction Level Parallelism. Instruction Level Parallelism
Outline Csci 211 Computer System Architecture Lec 4 Instruction Level Parallelism Xiuzhen Cheng Department of Computer Sciences The George Washington University ILP Compiler techniques to increase ILP
More informationOutline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar, VLIW. CPE 631 Session 19 Exploiting ILP with SW Approaches
Session xploiting ILP with SW Approaches lectrical and Computer ngineering University of Alabama in Huntsville Outline Review: Basic Pipeline Scheduling and Loop Unrolling Multiple Issue: Superscalar,
More informationEECC551 - Shaaban. 1 GHz? to???? GHz CPI > (?)
Evolution of Processor Performance So far we examined static & dynamic techniques to improve the performance of single-issue (scalar) pipelined CPU designs including: static & dynamic scheduling, static
More informationDynamic Instruction Level Parallelism (ILP)
Dynamic Instruction Level Parallelism (ILP) Introduction to ILP Data and name dependences and hazards CPU dynamic scheduling Branch prediction Hardware speculation Multiple issue Theoretical limits of
More informationFunctional Units. Registers. The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Input Control Memory
The Big Picture: Where are We Now? CS152 Computer Architecture and Engineering Lecture 18 The Five Classic Components of a Computer Processor Input Control Dynamic Scheduling (Cont), Speculation, and ILP
More informationESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling Instruction level parallelism
Computer Architecture ESE 545 Computer Architecture Instruction-Level Parallelism (ILP) and Static & Dynamic Instruction Scheduling 1 Outline ILP Compiler techniques to increase ILP Loop Unrolling Static
More informationExploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering University of Alabama in Huntsville
Lecture : Exploiting ILP with SW Approaches Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Basic Pipeline Scheduling and Loop
More informationILP: Instruction Level Parallelism
ILP: Instruction Level Parallelism Tassadaq Hussain Riphah International University Barcelona Supercomputing Center Universitat Politècnica de Catalunya Introduction Introduction Pipelining become universal
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationChapter 4 The Processor 1. Chapter 4D. The Processor
Chapter 4 The Processor 1 Chapter 4D The Processor Chapter 4 The Processor 2 Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
More informationQuestion 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. The memory is word addressable The size of the
Question 1 (5 points) Consider a cache with the following specifications Address space is 1024 words. he memory is word addressable he size of the cache is 8 blocks; each block is 4 words (32 words cache).
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2004
CS 64 COMPUTER ARCHITECTURE II FALL 004 DUE : October, 005 HOMEWORK II READ : - Portions of Chapters 5, 7, 8 and 9 of the Sima book and - Portions of Chapter 3, 4 and Appendix A of the Hennessy book ASSIGNMENT:
More informationUpdated Exercises by Diana Franklin
C-82 Appendix C Pipelining: Basic and Intermediate Concepts Updated Exercises by Diana Franklin C.1 [15/15/15/15/25/10/15] Use the following code fragment: Loop: LD R1,0(R2) ;load R1 from address
More informationSlide Set 8. for ENCM 501 in Winter Steve Norman, PhD, PEng
Slide Set 8 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 8 slide
More informationComplications with long instructions. CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3. How slow is slow?
Complications with long instructions CMSC 411 Computer Systems Architecture Lecture 6 Basic Pipelining 3 Long Instructions & MIPS Case Study So far, all MIPS instructions take 5 cycles But haven't talked
More informationCS252 Graduate Computer Architecture Lecture 8. Review: Scoreboard (CDC 6600) Explicit Renaming Precise Interrupts February 13 th, 2010
CS252 Graduate Computer Architecture Lecture 8 Explicit Renaming Precise Interrupts February 13 th, 2010 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley
More information