Digital Systems Architecture
EECE 33-01 / EECE 292-02
Software Approaches to ILP, Part 2
Dr. William H. Robinson
March 5, 2004
http://eecs.vanderbilt.edu/courses/eece33/

Topics
"A deja vu is usually a glitch in the Matrix. It happens when they change something." - Trinity, from The Matrix
- Loop-carried dependencies
- Software pipelining
- Compiler speculation

Processor Case Studies
- GPU: Daniel & Cole (Nvidia GeForce); Adam D. & Stephen (ATi GPU)
- DSP: Adam N. & Brendan (SHARC II)
- RISC: Amogh & Raj (SunSPARC)
- CISC: Jehanzeb & Sylvester (Pentium)

Chapter 3 Ideas To Reduce Stalls
Technique                                   Reduces
Dynamic scheduling                          Data hazard stalls
Dynamic branch prediction                   Control stalls
Issuing multiple instructions per cycle     Ideal CPI
Speculation                                 Data and control stalls
Dynamic memory disambiguation               Data hazard stalls involving memory
Loop unrolling                              Control hazard stalls
Basic compiler pipeline scheduling          Data hazard stalls
Compiler dependence analysis                Ideal CPI and data hazard stalls
Software pipelining and trace scheduling    Ideal CPI and data hazard stalls
Compiler speculation                        Ideal CPI, data and control stalls
Unroll Loop Four Times (Straightforward Way)

 1 Loop: LD    F0,0(R1)
 2       stall
 3       ADDD  F4,F0,F2
 4       stall
 5       stall
 6       SD    0(R1),F4
 7       LD    F6,-8(R1)
 8       stall
 9       ADDD  F8,F6,F2
10       stall
11       stall
12       SD    -8(R1),F8
13       LD    F10,-16(R1)
14       stall
15       ADDD  F12,F10,F2
16       stall
17       stall
18       SD    -16(R1),F12
19       LD    F14,-24(R1)
20       stall
21       ADDD  F16,F14,F2
22       stall
23       stall
24       SD    -24(R1),F16
25       SUBI  R1,R1,#32
26       stall
27       BNEZ  R1,LOOP
28       NOP

15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 cycles per iteration.
Assumes R1 is a multiple of 4.
Rewrite the loop to minimize stalls.

Unrolled Loop that Minimizes Stalls

 1 Loop: LD    F0,0(R1)
 2       LD    F6,-8(R1)
 3       LD    F10,-16(R1)
 4       LD    F14,-24(R1)
 5       ADDD  F4,F0,F2
 6       ADDD  F8,F6,F2
 7       ADDD  F12,F10,F2
 8       ADDD  F16,F14,F2
 9       SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16    ; 8 - 32 = -24

14 clock cycles, or 3.5 cycles per iteration.
Latencies: LD to ADDD, 1 cycle; ADDD to SD, 2 cycles.
What assumptions were made when the code was moved?
- OK to move the store past SUBI even though SUBI changes the register used in its address (hence the 8(R1) offset).
- OK to move the loads before the stores: do we still get the right data?
When is it safe for the compiler to make such changes?
Adapted from John Kubiatowicz's CS 252 lecture notes. Copyright 2003 UCB.

Compiler Checks for Dependencies
The compiler is concerned about dependencies in the program:
- Data dependence: instruction i produces a result used by instruction j, or instruction j is data dependent on instruction k and instruction k is data dependent on instruction i.
- Anti-dependence: instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence: instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
- Control dependence: an instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
Conversely, an instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
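These dependence classes can be sketched mechanically. The following minimal Python model (the instruction tuples and register names are illustrative, not a real compiler pass) classifies the dependences between an earlier and a later instruction:

```python
# Toy dependence classifier: each "instruction" is (destination, sources).
# Register names below are illustrative (MIPS-style FP registers).
def classify(earlier, later):
    """Return the dependence types of `later` on `earlier`."""
    dest_i, srcs_i = earlier
    dest_j, srcs_j = later
    kinds = []
    if dest_i in srcs_j:
        kinds.append("data")    # RAW: later reads what earlier wrote
    if dest_j in srcs_i:
        kinds.append("anti")    # WAR: later writes what earlier read
    if dest_i == dest_j:
        kinds.append("output")  # WAW: both write the same location
    return kinds

add1 = ("F4", ["F0", "F2"])   # ADDD F4,F0,F2
add2 = ("F0", ["F4", "F6"])   # ADDD F0,F4,F6
print(classify(add1, add2))   # data dependence on F4, anti-dependence on F0
```

Control dependence is the one class this register-based sketch cannot see; it comes from branch structure rather than operand overlap.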
Detecting and Enhancing Loop-Level Parallelism (LLP)
Example: Where are the data dependencies? (A, B, C distinct and non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations.
This implies that the iterations are dependent and cannot be executed in parallel. That was not the case for our prior example, where each iteration was distinct.

Example: Where are the data dependencies? (A, B, C, D distinct and non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this dependence is absent, interchanging the two statements does not affect the execution of S2.
2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.

Example: LLP Analysis (unrolling the original loop):
Iteration 1:  A[1] = A[1] + B[1];    B[2] = C[1] + D[1];
Iteration 2:  A[2] = A[2] + B[2];    B[3] = C[2] + D[2];
...
Iteration 99:  A[99] = A[99] + B[99];     B[100] = C[99] + D[99];
Iteration 100: A[100] = A[100] + B[100];  B[101] = C[100] + D[100];
Loop-carried dependence: each iteration's S1 uses the B value produced by S2 in the previous iteration.

Eliminating the Loop-Carried Dependence
Modified parallel loop:
A[1] = A[1] + B[1];                  /* start-up code */
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];            /* completion code */

Unrolling the modified loop:
Start-up code:   A[1] = A[1] + B[1];
Iteration 1:     B[2] = C[1] + D[1];      A[2] = A[2] + B[2];
Iteration 2:     B[3] = C[2] + D[2];      A[3] = A[3] + B[3];
...
Iteration 98:    B[99] = C[98] + D[98];   A[99] = A[99] + B[99];
Iteration 99:    B[100] = C[99] + D[99];  A[100] = A[100] + B[100];
Completion code: B[101] = C[100] + D[100];
The dependence from B[i+1] to its use in A[i+1] is now within a single iteration (not loop-carried), so the iterations can execute in parallel.
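One can check directly that the transformed loop computes exactly the same values as the original. A quick Python sketch (the 1-indexed arrays padded to length 102, and the random test data, are assumptions of this demo):

```python
import random

def original(A, B, C, D):
    # Original loop: S2 feeds the next iteration's S1 (loop-carried dependence).
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2

def transformed(A, B, C, D):
    # Dependence removed by peeling S1's first instance and S2's last.
    A[1] = A[1] + B[1]              # start-up code
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]        # completion code

vals = [[random.random() for _ in range(102)] for _ in range(4)]
copies = [list(v) for v in vals]
original(*vals)
transformed(*copies)
assert vals == copies               # identical results, element for element
```

The results match exactly (not just approximately) because both versions perform the same additions on the same operands, only in a reordered schedule.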
ILP Compiler Support: Loop-Carried Dependence Detection
Compilers can increase the utilization of ILP by better detection of instruction dependencies. To detect a loop-carried dependence in a loop, the compiler can use the GCD test, which is based on the following: if an array element with index a*i + b is stored, and the element with index c*i + d of the same array is loaded, where the index i runs from m to n, then a dependence exists if the following two conditions hold:
1. There are two iteration indices, j and k, with m <= j, k <= n (within the iteration limits).
2. The loop stores into the array element indexed by a*j + b and later loads from the same array the element indexed by c*k + d; that is, a*j + b = c*k + d with j < k.

The Greatest Common Divisor (GCD) Test
Index of element stored: a*i + b. Index of element loaded: c*i + d.
If a loop-carried dependence exists, then GCD(c, a) must divide (d - b). So when GCD(c, a) does not divide (d - b), the test is sufficient to guarantee that no dependence exists. However, there are cases where the GCD test succeeds (the division is exact) yet no dependence actually exists, because the test does not take the loop bounds into account.

Example:
for (i=1; i<=100; i=i+1) {
    x[2*i+3] = x[2*i] * 5.0;
}
a = 2, b = 3, c = 2, d = 0
GCD(a, c) = 2; (d - b) = -3. Since 2 does not divide -3, no dependence is possible.

Showing the Example Loop's Iterations to Be Independent
Index of element stored: a*i + b = 2*i + 3. Index of element loaded: c*i + d = 2*i + 0.

Iteration i          1   2   3   4   5   6   7
Index of x loaded    2   4   6   8  10  12  14
Index of x stored    5   7   9  11  13  15  17

What if GCD(a, c) divided (d - b)?
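The GCD test is easy to state in code. A small sketch (the function name gcd_test is my own, not from the text):

```python
from math import gcd

def gcd_test(a, b, c, d):
    """GCD test for a loop storing x[a*i + b] and loading x[c*i + d].
    Returns True if a loop-carried dependence is possible. The test is
    conservative: it ignores the loop bounds, so True can be a false alarm,
    but False guarantees independence."""
    return (d - b) % gcd(c, a) == 0

# Slide example: x[2*i+3] = x[2*i] * 5.0  =>  a=2, b=3, c=2, d=0
print(gcd_test(2, 3, 2, 0))   # False: GCD(2, 2) = 2 does not divide -3
print(gcd_test(1, 0, 1, 1))   # True: store x[i+0], load x[i+1] may conflict
```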
(Figure: memory cells x[1] through x[18] for iterations i = 1 through i = 7; the loaded elements x[2], x[4], ..., x[14] never overlap the stored elements x[5], x[7], ..., x[17].)

Software Pipelining
Observation: if iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in software).
(Figure: a software-pipelined iteration draws its instructions from iterations 0 through 4 of the original loop.)
Software Pipelining Example
Show a software-pipelined version of this code:

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,LOOP

Before: unrolled 3 times
 1 L.D    F0,0(R1)
 2 ADD.D  F4,F0,F2
 3 S.D    F4,0(R1)
 4 L.D    F0,-8(R1)
 5 ADD.D  F4,F0,F2
 6 S.D    F4,-8(R1)
 7 L.D    F0,-16(R1)
 8 ADD.D  F4,F0,F2
 9 S.D    F4,-16(R1)
10 DADDUI R1,R1,#-24
11 BNE    R1,R2,LOOP

After: software pipelined
   L.D    F0,0(R1)      ; start-up code
   ADD.D  F4,F0,F2      ; start-up code
   L.D    F0,-8(R1)     ; start-up code
 1 S.D    F4,0(R1)      ; stores M[i]
 2 ADD.D  F4,F0,F2      ; adds to M[i-1]
 3 L.D    F0,-16(R1)    ; loads M[i-2]
 4 DADDUI R1,R1,#-8
 5 BNE    R1,R2,LOOP
   S.D    F4,0(R1)      ; finish code
   ADD.D  F4,F0,F2      ; finish code
   S.D    F4,-8(R1)     ; finish code

(Figure: overlapped operations over time, comparing the software-pipelined loop with the unrolled loop; each needs start-up and finish code to fill and drain the pipeline.)

Trace Scheduling
Parallelism across IF branches vs. LOOP branches. Two steps:
- Trace selection: find a likely sequence of basic blocks (a trace) forming a long run of straight-line code, using static or profile-based branch prediction.
- Trace compaction: squeeze the trace into a few VLIW instructions; bookkeeping code is needed in case the prediction is wrong.
The compiler undoes a bad guess (discards values in registers). Subtle compiler bugs mean a wrong answer rather than merely poorer performance; there are no hardware interlocks.

Predicated Execution
Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP
If the condition is false, the instruction neither stores its result nor causes an exception.
Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction; IA-64 has 64 1-bit condition fields, selected so that any instruction can be conditionally executed.
This transformation is called if-conversion.
Drawbacks of conditional instructions:
- Still takes a clock cycle even if annulled
- Stalls if the condition is evaluated late
- Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline

Increasing Parallelism
Theory: move an instruction across a branch to increase the size of a basic block and thus increase parallelism. The primary difficulty is in avoiding exceptions.
For example:
if (a != 0)
    c = b / a;
Moving the divide above the branch may cause a divide-by-zero error in some cases.
Methods for increasing speculation include:
1. Use a set of status bits (poison bits) associated with the registers; they signal that an instruction's result is invalid until some later time.
2. The result of an instruction isn't written until it's certain that the instruction is no longer speculative.
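Method 1 can be made concrete with a tiny simulation. The register-file class and method names below are hypothetical illustrations of the poison-bit idea, not real hardware or ISA features:

```python
class Poisoned(Exception):
    """Raised when a poisoned (invalid speculative) value is actually used."""

class RegFile:
    def __init__(self):
        self.value = {}
        self.poison = set()

    def spec_load(self, reg, loader):
        # Speculative load: a fault sets the poison bit instead of trapping now.
        try:
            self.value[reg] = loader()
            self.poison.discard(reg)
        except Exception:
            self.poison.add(reg)

    def read(self, reg):
        # The deferred exception fires only when the value is really needed.
        if reg in self.poison:
            raise Poisoned(reg)
        return self.value[reg]

rf = RegFile()
rf.spec_load("R14", lambda: 1 // 0)   # faulting speculative load: no trap yet
try:
    rf.read("R14")                    # first real use raises the exception
except Poisoned:
    print("deferred exception")
```

If the branch direction later shows the speculative value was never needed, the poisoned register is simply never read and the fault vanishes, which is exactly the behavior speculation requires.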
Example: Compiler Speculation
Code for:
    if (A == 0) A = B; else A = A + 4;
Assume A is at 0(R3) and B is at 0(R2). Assume R14 is unused and available.

Original Code:
      LW   R1, 0(R3)     ; load A
      BNEZ R1, L1        ; test A
      LW   R1, 0(R2)     ; if clause
      J    L2            ; skip else
L1:   ADDI R1, R1, #4    ; else clause
L2:   SW   0(R3), R1     ; store A

Speculated Code:
      LW   R1, 0(R3)     ; load A
      LW   R14, 0(R2)    ; speculative load of B
      BEQZ R1, L3        ; other branch of the if
      ADDI R14, R1, #4   ; else clause
L3:   SW   0(R3), R14    ; non-speculative store

Poison Bit Example on Page 36
If the LW* produces an exception, a poison bit is set on its destination register. If a later instruction tries to use that register, an exception is raised at that point.

Speculated Code with a Poison Bit:
      LW   R1, 0(R3)     ; load A
      LW*  R14, 0(R2)    ; speculative load of B
      BEQZ R1, L3        ; other branch of the if
      ADDI R14, R1, #4   ; else clause
L3:   SW   0(R3), R14    ; non-speculative store

Summary
- Compilers are used to statically identify ILP.
- Loop-carried dependencies make loop unrolling less effective.
- Compilers can improve performance with software pipelining, trace scheduling, and speculation.