CISC 662 Graduate Computer Architecture
Lecture 8 - ILP 1
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07

PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from: Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)

Pipeline CPI

Pipeline CPI (I)
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Techniques to reduce stalls - type of stalls (seen so far):
  Forwarding and bypassing - potential data hazard stalls
  Delayed branches and simple branch scheduling - control hazard stalls
  Basic compiler pipeline scheduling - data hazard stalls
Pipeline CPI (II)
Techniques to reduce stalls - type of stalls (we will see in the next few weeks):
  Compiler pipeline scheduling - data hazard stalls
  Loop unrolling - control hazard stalls
  Branch prediction - control stalls
  Dynamic scheduling (scoreboarding) - data hazard stalls from true dependences
  Dynamic scheduling with renaming - data hazard stalls and stalls from antidependences and output dependences
  Dynamic memory disambiguation - data hazard stalls with memory
  Hardware speculation - data hazard and control hazard stalls
  Issuing multiple instructions per cycle - ideal CPI

Pipeline CPI (III)
Techniques to reduce stalls - type of stalls (we will not cover in this course):
  Compiler dependence analysis, software pipelining, trace scheduling - ideal CPI, data hazard stalls
  Hardware support for compiler speculation - ideal CPI, data hazard stalls, branch hazard stalls

Dependences
Dependences and Hazards
  Dependences are a property of programs.
  If two instructions are data dependent, they cannot execute simultaneously.
  Whether a dependence results in a hazard, and whether that hazard actually causes a stall, are properties of the pipeline organization.
  Data dependences may occur through registers or memory.

Dependences and Hazards
  The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
  If two instructions are independent, they can be executed in parallel.
  Otherwise they must execute in order, although they may partially overlap.
  A data dependence:
    Indicates that there is a possibility of a hazard,
    Determines the order in which results must be calculated, and
    Sets an upper bound on the amount of parallelism that can be exploited.

Types of Dependencies
  Data (true) dependencies
  Name dependencies
  Control dependencies
Data Dependencies
Instruction j is data dependent on instruction i if either:
  Instruction i produces a result that may be used by instruction j, or
  Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
  LOOP: L.D    F0, 0(R1)
        ADD.D  F4, F0, F2
        S.D    F4, 0(R1)
        DADDUI R1, R1, #-8
        BNE    R1, R2, LOOP

Data Dependencies
For the loop above:
  What effect do we get if we move the branch condition test to the EX phase?
  Is this a RAW, WAW or WAR hazard?

Data Dependencies
Dependences through registers are easy, just compare register names:
  lw  r10, 10(r11)
  add r12, r10, r8
Dependences through memory are harder:
  sw  r10, 4(r2)
  lw  r6, 0(r4)
Is r2+4 = r4+0? If so, they are dependent; if not, they are not.
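The memory case can be sketched in a few lines of Python. This is only an illustration of the address comparison, not a model of real disambiguation hardware: the register values, the dictionary `regs`, and the helper names are hypothetical, and both accesses are assumed to have the same size and alignment.

```python
def effective_address(regs, base, offset):
    """Effective address of a load/store: contents of the base register plus the offset."""
    return regs[base] + offset


def memory_dependent(regs, store, load):
    """True if the store and the load touch the same address.

    store and load are (base_register, offset) pairs. Assumption: both
    accesses have the same size and alignment, so equal effective
    addresses is the whole test.
    """
    return effective_address(regs, *store) == effective_address(regs, *load)


# sw r10, 4(r2)  followed by  lw r6, 0(r4)
sw = ("r2", 4)
lw = ("r4", 0)

# Hypothetical register contents: r2 + 4 == r4 + 0, so the load depends on the store.
print(memory_dependent({"r2": 96, "r4": 100}, sw, lw))
# Different contents: the two accesses are independent.
print(memory_dependent({"r2": 96, "r4": 200}, sw, lw))
```

The point of the sketch is that the answer depends on register contents, which are generally known only at run time, whereas the register case needs nothing more than the names in the instruction encoding.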
Data Dependencies
Data dependencies can be overcome by:
  Leaving the dependence but avoiding the hazard
  Eliminating the dependence by transforming the code

Name Dependencies (I)
Instructions i and j use the same register or memory location, but no data flows between them
  Antidependence: instruction j writes a location that instruction i reads
    Is this a RAW, WAW or WAR hazard?
  Output dependence: instruction j writes a location that instruction i writes
    Is this a RAW, WAW or WAR hazard?
Since there is no data flow between the instructions, the shared location can be renamed and the instructions executed in parallel - register renaming

Name Dependencies (II)
Antidependence: instruction j writes a register or memory location that instruction i reads:
  i: add r6, r5, r4
  j: sub r5, r8, r11
Output dependence: instructions i and j write the same register or memory location. The ordering must be preserved to leave the correct value in the register:
  add r7, r4, r3
  div r7, r2, r8
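As a rough illustration, the register comparisons behind these definitions can be written as a small Python sketch. The `Instr` record and the `classify` helper are hypothetical names introduced here; the sketch only compares register names for an earlier instruction i and a later instruction j and ignores memory locations.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def classify(i: Instr, j: Instr) -> set:
    """Hazard kinds that the register dependences from earlier i to later j can cause."""
    kinds = set()
    if i.dest is not None and i.dest in j.srcs:
        kinds.add("RAW")       # true data dependence: j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        kinds.add("WAR")       # antidependence: j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        kinds.add("WAW")       # output dependence: both write the same register
    return kinds


# The instruction pairs used on the slides
print(classify(Instr("lw", "r10", ("r11",)), Instr("add", "r12", ("r10", "r8"))))
print(classify(Instr("add", "r6", ("r5", "r4")), Instr("sub", "r5", ("r8", "r11"))))
print(classify(Instr("add", "r7", ("r4", "r3")), Instr("div", "r7", ("r2", "r8"))))
```

Only the first pair carries a value from one instruction to the other; the other two pairs conflict on a name alone, which is why renaming the destination register removes them.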
Control Dependencies
Branches incur some penalty:
  While the target and condition are evaluated, we cannot be sure which instruction is next
  We have to guess
  We have to reorder instructions so that we execute useful instructions while waiting for the branch
The main goal is not to affect the correctness of the program

Control Dependencies
An instruction j is control dependent on i if the execution of j is controlled by instruction i.
  i: if (a < b)
  j:   a = a + 1;
j is control dependent on i.
1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.

Control Dependencies
Preserve exception behavior and data flow
Instruction reordering should not cause exception reordering:
  L:  DADDU R2, R3, R4
      BEQZ  R2, L1
      LW    R1, 0(R2)
  Only those exceptions are allowed that would surely occur
Instructions after the branch depend on it and on all instructions prior to the branch for correct execution:
      DADDU R1, R2, R3
      BEQZ  R4, L
      DSUBU R1, R5, R6
  L:  OR    R7, R1, R8
Preserving Exception Behavior
A simple pipeline preserves control dependences since it executes programs in program order.
  L1: daddu r2, r3, r4
      beqz  r2, L1
      lw    r1, 0(r2)
Can we move lw before the branch?
(Don't worry, it is OK to violate control dependences as long as we can preserve the program semantics.)

Preserving Exception Behavior
Corollary: Any change in the ordering of instructions should not change how exceptions are raised in a program.

Preserving Data Flow
Consider the following example:
      daddu r1, r2, r3
      beqz  r4, L
      dsubu r1, r5, r6
  L:  or    r7, r1, r8
What can you say about the value of r1 used by the or instruction?
Preserving Data Flow
Corollary: Preserving data dependences alone is not sufficient when changing program order. We must preserve the data flow.
These two principles together (preserving exception behavior and preserving data flow) allow us to execute instructions in a different order and still maintain the program semantics.
This is the foundation upon which ILP processors are built.

Instruction Level Parallelism
  The amount of parallelism within a basic block is very small
  We must exploit parallelism across multiple basic blocks
  Pipelining
  Out-of-order execution

Dynamic Scheduling
Techniques we have learned so far are static scheduling techniques:
  Forwarding, delayed branches, flush pipeline, predict taken, predict untaken
  The compiler detects dependencies and schedules instruction execution to minimize hazards
  The pipeline executes instructions in order, detects hazards and inserts stalls
Dynamic scheduling overcomes data hazards by out-of-order execution
Out-of-Order Execution
  If some instruction is stalled, check the following instructions to see whether they can proceed (they have no hazards with previous instructions)
  Check for structural and data hazards
  An instruction can be issued as soon as its operands are available
  Out-of-order issue means out-of-order completion, the possibility of WAR and WAW hazards, and problems with exception handling

Loop Unrolling and Scheduling

Can we make CPI closer to 1?
Let's assume full pipelining:
  If we have a 4-cycle latency, then we need 3 instructions between a producing instruction and its use:
    multf $F0, $F2, $F4
    delay-1
    delay-2
    delay-3
    addf  $F6, $F10, $F0
[Figure: pipeline stages Fetch, Decode, Ex1, Ex2, Ex3, Ex4, WB holding addf, delay3, delay2, delay1, multf, with the earliest forwarding points for 1-cycle and 4-cycle instructions marked]
FP Loop: Where are the Hazards?
  Loop: LD   F0,0(R1)   ;F0=vector element
        ADDD F4,F0,F2   ;add scalar from F2
        SD   0(R1),F4   ;store result
        SUBI R1,R1,8    ;decrement pointer 8B (DW)
        BNEZ R1,Loop    ;branch R1!=zero
        NOP             ;delayed branch slot

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0
Where are the stalls?

FP Loop Showing Stalls
  1 Loop: LD   F0,0(R1)   ;F0=vector element
  2       stall
  3       ADDD F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       SD   0(R1),F4   ;store result
  7       SUBI R1,R1,8    ;decrement pointer 8B (DW)
  8       BNEZ R1,Loop    ;branch R1!=zero
  9       stall           ;delayed branch slot
(Latencies as in the table above.)
9 clocks: Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls
  1 Loop: LD   F0,0(R1)
  2       stall
  3       ADDD F4,F0,F2
  4       SUBI R1,R1,8
  5       BNEZ R1,Loop    ;delayed branch
  6       SD   8(R1),F4   ;altered when moved past SUBI
Swap BNEZ and SD by changing the address of SD
(Latencies as in the table above.)
6 clocks: Unroll the loop 4 times to make the code faster?
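The 9-clock and 6-clock counts can be reproduced with a small scheduling sketch in Python. It is a simplified model under stated assumptions, not the textbook's pipeline: one instruction issues per cycle in program order, a consumer issues no earlier than `latency + 1` cycles after its producer, producer/consumer pairs missing from the latency table are assumed to have latency 0, and only one loop iteration is modeled. The tuple encoding of instructions and the name `issue_cycles` are hypothetical.

```python
# Stall cycles between a producer and a dependent consumer, from the latency table.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def issue_cycles(program):
    """Return the issue cycle of each instruction in an in-order, single-issue pipeline.

    program is a list of (kind, dest, srcs) tuples. Pairs not listed in
    LATENCY are assumed to need no stall (an assumption of this sketch).
    """
    last_writer = {}           # register -> (issue cycle, kind) of its latest producer
    cycles = []
    prev = 0
    for kind, dest, srcs in program:
        cycle = prev + 1       # at best, one cycle after the previous instruction
        for reg in srcs:
            if reg in last_writer:
                p_cycle, p_kind = last_writer[reg]
                cycle = max(cycle, p_cycle + LATENCY.get((p_kind, kind), 0) + 1)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
        cycles.append(cycle)
        prev = cycle
    return cycles


original = [
    ("load", "F0", ("R1",)),           # LD   F0,0(R1)
    ("fpalu", "F4", ("F0", "F2")),     # ADDD F4,F0,F2
    ("store", None, ("F4", "R1")),     # SD   0(R1),F4
    ("int", "R1", ("R1",)),            # SUBI R1,R1,8
    ("branch", None, ("R1",)),         # BNEZ R1,Loop
    ("nop", None, ()),                 # delayed branch slot
]
revised = [
    ("load", "F0", ("R1",)),           # LD   F0,0(R1)
    ("fpalu", "F4", ("F0", "F2")),     # ADDD F4,F0,F2
    ("int", "R1", ("R1",)),            # SUBI R1,R1,8
    ("branch", None, ("R1",)),         # BNEZ R1,Loop
    ("store", None, ("F4", "R1")),     # SD   8(R1),F4 in the delay slot
]

print(issue_cycles(original)[-1], issue_cycles(revised)[-1])
```

Under these assumptions the last instruction of the original loop issues in cycle 9 and the last instruction of the revised loop in cycle 6, matching the slide counts. The revised order works because SUBI and BNEZ fill two of the cycles that ADDD needs before SD can use F4.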
Unroll Loop Four Times (straightforward way)
  1  Loop: LD   F0,0(R1)
           (1 cycle stall)
  2        ADDD F4,F0,F2
           (2 cycles stall)
  3        SD   0(R1),F4     ;drop SUBI & BNEZ
  4        LD   F6,-8(R1)
  5        ADDD F8,F6,F2
  6        SD   -8(R1),F8    ;drop SUBI & BNEZ
  7        LD   F10,-16(R1)
  8        ADDD F12,F10,F2
  9        SD   -16(R1),F12  ;drop SUBI & BNEZ
  10       LD   F14,-24(R1)
  11       ADDD F16,F14,F2
  12       SD   -24(R1),F16
  13       SUBI R1,R1,#32    ;alter to 4*8
  14       BNEZ R1,Loop
  15       NOP
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes the number of iterations is a multiple of 4 (R1 is a multiple of 32)
Rewrite loop to minimize stalls?

Unrolled Loop That Minimizes Stalls
  1  Loop: LD   F0,0(R1)
  2        LD   F6,-8(R1)
  3        LD   F10,-16(R1)
  4        LD   F14,-24(R1)
  5        ADDD F4,F0,F2
  6        ADDD F8,F6,F2
  7        ADDD F12,F10,F2
  8        ADDD F16,F14,F2
  9        SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SD   -16(R1),F12
  12       SUBI R1,R1,#32
  13       BNEZ R1,Loop
  14       SD   8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
What assumptions were made when the code was moved?
  OK to move the store past SUBI even though SUBI changes the register
  OK to move loads before stores: do we get the right data?
When is it safe for the compiler to make such changes?

Loop Level Parallelism
Loop level parallelism can be exploited as:
  ILP - unrolling loops
  Vector machines
Loop level parallelism into ILP: unroll the loop
  Statically by the compiler
  Dynamically by the hardware
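At the source level, the unrolling transformation looks roughly like the following Python sketch. It only illustrates the shape of what a compiler does statically, it is not the assembly above: the function names are hypothetical, the loop adds a scalar `s` to every element of `x` while walking from the last element down (as the assembly loop does), and the length is assumed to be a multiple of 4.

```python
def add_scalar(x, s):
    """Rolled loop: one element and one loop test per iteration."""
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        i -= 1
    return out


def add_scalar_unrolled4(x, s):
    """Same loop unrolled four times: four independent updates per loop test.

    Assumption: len(x) is a multiple of 4, mirroring the assumption on the
    iteration count made for the unrolled assembly loop.
    """
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = list(x)
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s          # the four bodies touch different elements,
        out[i - 1] = out[i - 1] + s  # so they are independent of one another
        out[i - 2] = out[i - 2] + s  # and can be freely interleaved
        out[i - 3] = out[i - 3] + s
        i -= 4                       # one decrement and one test for four elements
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(add_scalar(data, 0.5) == add_scalar_unrolled4(data, 0.5))
```

The unrolled version performs the same updates with a quarter of the loop overhead, and the independence of the four bodies is what lets a scheduler group the loads, adds and stores.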
Branch Predictions

Reducing Branch Penalty
Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mis-predicted branches
Reduce branch penalty:
  Predict branch/jump instructions AND branch direction (taken or not taken)
  Predict branch/jump target address (for taken branches)
  Speculatively execute instructions along the predicted path

What to Use and What to Predict
Available info:
  Current predicted PC
  Past branch history (direction and target)
What to predict:
  Conditional branch inst: branch direction and target address
  Jump inst: target address
  Procedure call/return: target address
  May need instruction predecoded
[Figure: block diagram with labels PC, IM, pred_pc, Predictors, PC & Inst, pred info, feedback PC]
Mis-prediction Detections and Feedbacks
Detections:
  At the end of decoding
    Target address known at decoding, and does not match
    Flush fetch stage
  At commit (most cases)
    Wrong branch direction or target address does not match
    Flush the whole pipeline (at EXE: MIPS R10000)
Feedbacks:
  Any time a mis-prediction is detected
  At a branch's commit (at EXE: called speculative update)
[Figure: pipeline stages FETCH, RENAME, REB/ROB, SCHD, EXE, WB, COMMIT, with the predictors attached]

Branch Direction Prediction
Predict branch direction: taken or not taken (T/NT)
        BNE R1, R2, L1
        ...               (not taken: fall through)
  L1:   ...               (taken)
Static prediction: compilers decide the direction
Dynamic prediction: hardware decides the direction using dynamic information
  1. 1-bit Branch-Prediction Buffer
  2. 2-bit Branch-Prediction Buffer
  3. Correlating Branch Prediction Buffer
  4. Tournament Branch Predictor
  5. and more

Predictor for a Single Branch
General form:
  1. Access: PC selects the state
  2. Predict: output T/NT
  3. Feedback: T/NT
1-bit prediction:
[Figure: two-state diagram with states 1 and 0; state 1 predicts taken; feedback T leads to state 1 and feedback NT leads to state 0]
Branch History Table of 1-bit Predictor
  BHT is also called Branch Prediction Buffer in the textbook
  Can use only one 1-bit predictor, but accuracy is low
  BHT: use a table of simple predictors, indexed by bits from the PC
  Similar to a direct-mapped cache
  More entries: more cost, but fewer conflicts and higher accuracy
  BHT can contain complex predictors
[Figure: k bits of the branch address index a table of 2^k entries, which outputs the prediction]

1-bit BHT Weakness
Example: in a loop, a 1-bit BHT will cause 2 mis-predictions
Consider a loop of 9 iterations before exit:
  for ( ){
    for (i=0; i<9; i++)
      a[i] = a[i] * 2.0;
  }
  End-of-loop case, when it exits instead of looping as before
  First time through the loop on the next time through the code, when it predicts exit instead of looping
Only 80% accuracy even if the loop branch is taken 90% of the time

2-bit Saturating Counter
Solution: a 2-bit scheme where the prediction changes only after two mispredictions in a row (Figure 3.7, p. 249)
[Figure: four-state diagram; states 11 and 10 predict taken, states 01 and 00 predict not taken, with T and NT transitions between them. Blue: stop, not taken. Gray: go, taken]
Adds hysteresis to the decision-making process
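The 80% figure, and the improvement from the 2-bit scheme, can be checked with a short Python simulation. This is a sketch under assumptions: the loop branch is modeled as taken 9 times and then not taken once per pass, the first pass is treated as warm-up, and the 2-bit predictor is a plain saturating counter (the exact transitions in the textbook figure may differ). The function names are hypothetical.

```python
def one_bit(state, taken):
    """1-bit predictor: predict whatever the branch did last time."""
    return state, taken                       # (prediction, new state)


def two_bit(state, taken):
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""
    prediction = state >= 2
    new_state = min(3, state + 1) if taken else max(0, state - 1)
    return prediction, new_state


def accuracy(step, state, outcomes, warmup):
    """Fraction of correct predictions after skipping the first `warmup` branches."""
    correct = 0
    for n, taken in enumerate(outcomes):
        prediction, state = step(state, taken)
        if n >= warmup and prediction == taken:
            correct += 1
    return correct / (len(outcomes) - warmup)


one_pass = [True] * 9 + [False]               # taken 9 times, then the loop exits
outcomes = one_pass * 11                      # 1 warm-up pass + 10 measured passes

acc_1bit = accuracy(one_bit, True, outcomes, warmup=10)
acc_2bit = accuracy(two_bit, 3, outcomes, warmup=10)
print(acc_1bit, acc_2bit)
```

In steady state the 1-bit predictor misses twice per pass (on the exit and again on re-entry), giving 80%, while the saturating counter misses only on the exit, giving 90%.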
Correlating Branches
Code example showing the potential:
  if (d==0)
    d=1;
  if (d==1)
Assembly code:
        BNEZ   R1, L1
        DADDIU R1,R0,#1
  L1:   DADDIU R3,R1,#-1
        BNEZ   R3, L2
        ...
  L2:
Observation: if BNEZ1 (the first branch) is not taken, d is set to 1, so BNEZ2 (the second branch) is not taken either

Correlating Branch Predictor
Idea: the taken/not taken behavior of recently executed branches is related to the behavior of the next branch (as well as to the history of that branch's own behavior)
  The behavior of recent branches then selects between, say, 2 predictions for the next branch, updating just that prediction
(1,1) predictor: 1-bit global, 1-bit local
[Figure: the branch address (4 bits) selects an entry of 1-bit per-branch local predictors; the 1-bit global branch history (0 = not taken) selects which one gives the prediction]

Correlating Branch Predictor
General form: (m, n) predictor
  m bits for global history, n bits for local history
  Records correlation between m+1 branches
  Simple implementation: the global history can be stored in a shift register
Example: (2,2) predictor, 2-bit global, 2-bit local
[Figure: the branch address (4 bits) selects an entry of 2-bit per-branch local predictors; the 2-bit global branch history (01 = not taken then taken) selects which one gives the prediction]
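A minimal Python sketch of an (m, n) predictor may help make the table organization concrete. It is an illustration under assumptions, not a description of any real implementation: the class and function names are hypothetical, counters start at 0 (predict not taken), the history register starts at 0, and the two branches of the d==0 / d==1 example are given the made-up table indices 0 and 1.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2^m n-bit counters per entry."""

    def __init__(self, m, n, index_bits=4):
        self.m, self.n, self.index_bits = m, n, index_bits
        self.history = 0                                   # global history shift register
        self.max_count = (1 << n) - 1
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.table[pc & ((1 << self.index_bits) - 1)]

    def predict(self, pc):
        # Taken when the selected counter is in its upper half.
        return self._row(pc)[self.history] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        row = self._row(pc)
        count = row[self.history]
        row[self.history] = min(self.max_count, count + 1) if taken else max(0, count - 1)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, d_values):
    """Run the two branches of the d==0 / d==1 example for each starting value of d."""
    B1, B2 = 0, 1                                          # hypothetical table indices
    missed = 0
    for d in d_values:
        b1_taken = d != 0                                  # BNEZ R1, L1
        if d == 0:
            d = 1
        b2_taken = d != 1                                  # BNEZ R3, L2
        for pc, taken in ((B1, b1_taken), (B2, b2_taken)):
            if predictor.predict(pc) != taken:
                missed += 1
            predictor.update(pc, taken)
    return missed


d_sequence = [2, 0, 2, 0]
print(mispredictions(CorrelatingPredictor(0, 1), d_sequence))   # no global history
print(mispredictions(CorrelatingPredictor(1, 1), d_sequence))   # (1,1) predictor
```

With d alternating between 2 and 0, the plain 1-bit predictor (m = 0) misses all 8 branches, while the (1,1) predictor misses only the 2 branches of the first iteration, because the outcome of the first branch selects the right counter for the second.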
Accuracy of Different Schemes (Figure 3.15, p. 206)
[Figure: frequency of mispredictions for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]

Accuracy of Return Address Predictor

Branch Target Buffer
Branch Target Buffer (BTB): the address of the branch is the index used to get the prediction AND the branch target address (if taken)
  Note: must check for a branch match now, since we can't use the wrong branch address
Example: BTB combined with BHT
[Figure: the PC of the instruction in FETCH is compared (=?) with the stored branch PC; each entry holds a branch PC, a predicted PC, and extra prediction state bits]
  No: branch not predicted, proceed normally (Next PC = PC+4)
  Yes: instruction is a branch; use the predicted PC as the next PC
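The lookup, including the branch-match check, can be sketched in Python as follows. This is a simplified direct-mapped model under assumptions: instructions are 4 bytes, only branches last seen taken are kept, and the extra prediction state bits are left out. The class name, method names and addresses are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each slot holds (branch_pc, predicted_pc) or None."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries        # drop the byte offset, keep the low bits

    def next_pc(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:  # the stored branch PC must match exactly
            return slot[1]                      # hit: use the predicted PC
        return pc + 4                           # miss: proceed normally

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)        # remember the taken branch and its target
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None                # branch fell through: drop its entry


btb = BranchTargetBuffer(entries=4)
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.next_pc(0x100)))   # the branch itself hits
print(hex(btb.next_pc(0x110)))   # maps to the same slot but fails the match check
```

The second lookup shows why the match check matters: 0x110 indexes the same slot as 0x100, and without comparing the stored branch PC the fetch stage would redirect a non-branch instruction to 0x200.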
Deadlines
  Week  Date    Topic                                             Reading / Quiz
  4     Sep 25  Lec07 - Multi-cycles                              App A.7; Chap 2
        Sep 29  Homework 1 due
  5     Sep 30  Homework review
  5     Oct 2   Lec08 - Instruction Level Parallelism (ILP)       Q3
  6     Oct 7   Lec09 - Dynamic Scheduling: Scoreboard
  6     Oct 9   Lec10 - Dynamic Scheduling: Tomasulo
  7     Oct 14  Lec11 - Hardware Speculation
  7     Oct 16  Lec12 - Multiple Issue
        Oct 20  Homework 2 due
  8     Oct 21  Homework review
  8     Oct 23  Midterm exam                                      Chap 3; App C
  9     Oct 28  Lec13 - Study of the Limitations of ILP
  9     Oct 30  Lec14 - Review Cache and Review Virtual Memory    Q4