Exploiting ILP with SW Approaches. Aleksandar Milenković, Electrical and Computer Engineering, University of Alabama in Huntsville
Lecture: Exploiting ILP with SW Approaches
Aleksandar Milenković, milenka@ece.uah.edu
Electrical and Computer Engineering, University of Alabama in Huntsville

Outline
- Basic Pipeline Scheduling and Loop Unrolling
- Multiple Issue: Superscalar, VLIW
- Software Pipelining
ILP: Concepts and Challenges
ILP (Instruction-Level Parallelism): overlap the execution of unrelated instructions.
Techniques that increase the amount of parallelism exploited among instructions
- reduce the impact of data and control hazards
- increase the processor's ability to exploit parallelism
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Reducing each of the terms on the right-hand side minimizes CPI and thus increases instruction throughput.

Basic Pipeline Scheduling: Example
Simple loop:
    for (i = 1; i <= 1000; i++) x[i] = x[i] + s;
Assumed latencies, in clock cycles, between the instruction producing a result and the instruction using it:
    FP ALU op   -> another FP ALU op : 3
    FP ALU op   -> store double      : 2
    Load double -> FP ALU op         : 1
    Load double -> store double      : 0
    Integer op  -> integer op        : 0
    ; R1 points to the last element in the array
    ; for simplicity, we assume that x[0] is at address 0
    Loop: L.D   F0, 0(R1)    ; F0 = array element
          ADD.D F4, F0, F2   ; add scalar in F2
          S.D   0(R1), F4    ; store result
          SUBI  R1, R1, #8   ; decrement pointer (8 bytes per double)
          BNEZ  R1, Loop     ; branch if R1 != 0
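For reference, a minimal C sketch of the same computation written in the pointer-decrement form that the assembly implements; the function and pointer names are illustrative, and the exact array bounds of the slide's example may differ slightly.

    /* Walk the array from the last element down to the first, adding the
     * scalar s to each element; p plays the role of R1.  Assumes x[1..n]
     * are valid elements, matching the slide's 1-based loop. */
    void add_scalar(double *x, long n, double s)
    {
        double *p = &x[n];          /* R1: points to the last element        */
        do {
            *p = *p + s;            /* L.D / ADD.D / S.D                     */
            p--;                    /* SUBI R1, R1, #8 (one 8-byte double)   */
        } while (p != x);           /* BNEZ R1, Loop                         */
    }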
Executing FP Loop
    1. Loop: L.D   F0, 0(R1)
    2.       stall
    3.       ADD.D F4, F0, F2
    4.       stall
    5.       stall
    6.       S.D   0(R1), F4
    7.       SUBI  R1, R1, #8
    8.       stall
    9.       BNEZ  R1, Loop
   10.       stall
10 clocks per iteration (5 stalls) => rewrite the code to minimize stalls.

Revised FP Loop to Minimize Stalls
Swap BNEZ and S.D by changing the address of S.D; SUBI is moved up.
    1. Loop: L.D   F0, 0(R1)
    2.       SUBI  R1, R1, #8
    3.       ADD.D F4, F0, F2
    4.       stall
    5.       BNEZ  R1, Loop    ; delayed branch
    6.       S.D   8(R1), F4   ; altered and interchanged with SUBI
6 clocks per iteration (1 stall), but only 3 instructions do the actual work of processing the array (L.D, ADD.D, S.D) => unroll the loop 4 times to improve the potential for instruction scheduling.
Unrolled Loop
    Loop: LD    F0, 0(R1)      ; 1 cycle stall
          ADDD  F4, F0, F2     ; 2 cycles stall
          SD    0(R1), F4      ; drop SUBI & BNEZ
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4     ; drop SUBI & BNEZ
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4    ; drop SUBI & BNEZ
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
This loop runs in 28 clock cycles (14 stalls) per iteration: each LD has 1 stall, each ADDD 2, SUBI 1, BNEZ 1, plus 14 instruction issue cycles - or 28/4 = 7 cycles for each element of the array (even slower than the scheduled version!) => rewrite the loop to minimize stalls.

Where are the name dependencies?
    1  Loop: LD    F0, 0(R1)
    2        ADDD  F4, F0, F2
    3        SD    0(R1), F4      ; drop DSUBUI & BNEZ
    4        LD    F0, -8(R1)
    5        ADDD  F4, F0, F2
    6        SD    -8(R1), F4     ; drop DSUBUI & BNEZ
    7        LD    F0, -16(R1)
    8        ADDD  F4, F0, F2
    9        SD    -16(R1), F4    ; drop DSUBUI & BNEZ
   10        LD    F0, -24(R1)
   11        ADDD  F4, F0, F2
   12        SD    -24(R1), F4
   13        SUBUI R1, R1, #32    ; alter to 4*8
   14        BNEZ  R1, LOOP
   15        NOP
How can we remove them?
Where are the name dependencies? (after register renaming)
    1  Loop: L.D    F0, 0(R1)
    2        ADD.D  F4, F0, F2
    3        S.D    0(R1), F4      ; drop DSUBUI & BNEZ
    4        L.D    F6, -8(R1)
    5        ADD.D  F8, F6, F2
    6        S.D    -8(R1), F8     ; drop DSUBUI & BNEZ
    7        L.D    F10, -16(R1)
    8        ADD.D  F12, F10, F2
    9        S.D    -16(R1), F12   ; drop DSUBUI & BNEZ
   10        L.D    F14, -24(R1)
   11        ADD.D  F16, F14, F2
   12        S.D    -24(R1), F16
   13        DSUBUI R1, R1, #32    ; alter to 4*8
   14        BNEZ   R1, LOOP
   15        NOP
This is the original "register renaming".

Unrolled Loop that Minimizes Stalls
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16      ; 8 - 32 = -24
This loop runs in 14 clock cycles (no stalls) per iteration, or 14/4 = 3.5 cycles for each element.
Assumptions that make this possible:
- move the LDs before the SDs
- move the SD after SUBI and BNEZ
- use different registers
When is it safe for the compiler to make such changes?
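A source-level way to see the same transformation; this is only an illustrative sketch with invented names, not code from the lecture. Reusing a single temporary would recreate the WAR/WAW name dependences of the F0/F4-only version above, while giving each unrolled copy its own temporary mirrors the register renaming and leaves the scheduler free to group the loads ahead of the stores, as in the scheduled MIPS version.

    /* 4x-unrolled body with renamed temporaries t0..t3; assumes n is a
     * multiple of 4 and x[1..n] are valid, matching the slide's example. */
    void unrolled_renamed(double *x, long n, double s)
    {
        for (long i = n; i > 0; i -= 4) {
            double t0 = x[i]     + s;   /* each copy writes its own temporary, */
            double t1 = x[i - 1] + s;   /* so there are no WAR/WAW hazards     */
            double t2 = x[i - 2] + s;   /* between the unrolled copies         */
            double t3 = x[i - 3] + s;
            x[i]     = t0;              /* all loads grouped before all stores */
            x[i - 1] = t1;
            x[i - 2] = t2;
            x[i - 3] = t3;
        }
    }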
Steps Compiler Performed to Unroll
- Determine that it is OK to move the S.D after SUBUI and BNEZ, and find the amount by which to adjust the S.D offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
- Rename registers to avoid name dependencies.
- Eliminate the extra test and branch instructions and adjust the loop termination and iteration code (a source-level sketch of this step follows the next slide).
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent; this requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.

Multiple Issue
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Decrease the ideal pipeline CPI => multiple issue:
- Superscalar
  - statically scheduled (compiler techniques)
  - dynamically scheduled (Tomasulo's algorithm)
- VLIW (Very Long Instruction Word): parallelism is explicitly indicated by the instruction
- EPIC (explicitly parallel instruction computers)
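Returning to the unrolling steps listed above, here is a minimal, illustrative sketch (invented names, not lecture code) of how the loop termination code is typically adjusted when the trip count is not known to be a multiple of the unroll factor: a short cleanup loop runs the leftover iterations, then the main loop runs the unrolled body.

    void add_scalar_unrolled(double *x, long n, double s)
    {
        long i = n;
        while (i % 4 != 0) {        /* cleanup loop: n mod 4 leftover iterations */
            x[i] += s;
            i--;
        }
        for (; i > 0; i -= 4) {     /* main loop: unrolled by 4, so the test and */
            x[i]     += s;          /*   branch overhead is paid once per 4      */
            x[i - 1] += s;          /*   original iterations                     */
            x[i - 2] += s;
            x[i - 3] += s;
        }
    }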
Superscalar MIPS
Superscalar MIPS: 2 instructions per clock - 1 FP and 1 of anything else
- Fetch 64 bits per clock cycle; integer instruction on the left, FP on the right
- Can only issue the 2nd instruction if the 1st instruction issues
- Need more ports on the FP registers to do an FP load and an FP op in the same pair
    Int. instruction   IF ID EX MEM WB
    FP  instruction    IF ID EX MEM WB
    Int. instruction      IF ID EX MEM WB
    FP  instruction       IF ID EX MEM WB
    Int. instruction         IF ID EX MEM WB
    FP  instruction          IF ID EX MEM WB
Note: FP operations extend the EX cycle.

Loop Unrolling in Superscalar
    Integer instruction       FP instruction
    Loop: LD  F0, 0(R1)
          LD  F6, -8(R1)
          LD  F10, -16(R1)    ADDD F4, F0, F2
          LD  F14, -24(R1)    ADDD F8, F6, F2
          LD  F18, -32(R1)    ADDD F12, F10, F2
          SD  0(R1), F4       ADDD F16, F14, F2
          SD  -8(R1), F8      ADDD F20, F18, F2
          SD  -16(R1), F12
          SUBI R1, R1, #40
          SD  16(R1), F16
          BNEZ R1, Loop
          SD  8(R1), F20
Unrolled 5 times to avoid delays. This loop runs in 12 clock cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array.
Multiple Issue Processors
Two variations:
- Superscalar: varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  - IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
- (Very) Long Instruction Word (V)LIW: fixed number of instructions (4-16) scheduled by the compiler; operations are packed into wide templates
  - Crusoe VLIW processor
  - Intel Architecture-64 (IA-64): 64-bit address; style: Explicitly Parallel Instruction Computer (EPIC)
Anticipated success led to the use of Instructions Per Clock cycle (IPC) instead of CPI.

The VLIW Approach
- VLIWs use multiple independent functional units
- VLIWs package the multiple operations into one very long instruction
- The compiler is responsible for choosing the instructions to be issued simultaneously
    Instr. i      IF ID EX EX EX WB
    Instr. i+1       IF ID EX EX EX WB
Loop Unrolling in VLIW
    Clock  Mem ref 1         Mem ref 2         FP op 1             FP op 2             Int/branch
    1      LD F2, 0(R1)      LD F6, -8(R1)
    2      LD F10, -16(R1)   LD F14, -24(R1)
    3      LD F18, -32(R1)   LD F22, -40(R1)   ADDD F4, F30, F2    ADDD F8, F30, F6
    4      LD F26, -48(R1)                     ADDD F12, F30, F10  ADDD F16, F30, F14
    5                                          ADDD F20, F30, F18  ADDD F24, F30, F22
    6      SD 0(R1), F4      SD -8(R1), F8     ADDD F28, F30, F26
    7      SD -16(R1), F12   SD -24(R1), F16                                           SUBI R1, R1, #56
    8      SD 24(R1), F20    SD 16(R1), F24                                            BNEZ R1, Loop
    9      SD 8(R1), F28
Unrolled 7 times to avoid delays: 7 results in 9 clocks, or 1.3 clocks per element (1.8x the superscalar rate).
Average: 2.5 operations per clock, 50% efficiency.
Note: need more registers in VLIW (15 vs. 6 in SS).

Multiple Issue Challenges
- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards.
- If more instructions issue at the same time, decode and issue become harder: even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
- VLIW: trade instruction space for simple decoding
  - the long instruction word has room for many operations
  - by definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  - e.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch; with 16 to 24 bits per field the word is 7*16 = 112 bits to 7*24 = 168 bits wide
  - needs a compiling technique that schedules across several branches
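As an illustration of the "wide template" idea, one VLIW word can be modeled as a record with one slot per functional unit; the field names and widths below are invented for this sketch and are not a real ISA encoding. A slot the compiler cannot fill is simply padded with a no-op.

    #include <stdint.h>

    /* One VLIW instruction word with the five slots used in the unrolling
     * example above: two memory references, two FP operations, and one
     * integer/branch operation.  All five operations issue in one clock. */
    typedef struct {
        uint32_t mem_ref1;      /* e.g., LD F2, 0(R1)            */
        uint32_t mem_ref2;      /* e.g., LD F6, -8(R1)           */
        uint32_t fp_op1;        /* e.g., ADDD F4, F30, F2        */
        uint32_t fp_op2;        /* e.g., ADDD F8, F30, F6        */
        uint32_t int_branch;    /* e.g., SUBI / BNEZ, or a no-op */
    } vliw_word;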
When Safe to Unroll Loop?
Example: where are the data dependencies? (A, B, C are distinct and nonoverlapping)
    for (i = 0; i < 100; i = i + 1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a "loop-carried dependence": a dependence between iterations. In our prior example, each iteration was distinct.

Does a loop-carried dependence mean there is no parallelism?
Consider:
    for (i = 0; i < 8; i = i + 1) {
        A = A + C[i];    /* S1 */
    }
Could compute:
    Cycle 1: temp0 = C[0] + C[1]; temp1 = C[2] + C[3]; temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
    Cycle 2: temp4 = temp0 + temp1; temp5 = temp2 + temp3;
    Cycle 3: A = temp4 + temp5;
This relies on the associative nature of +.
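A generalization of the partial-sum idea above, as an illustrative sketch with invented names: splitting the reduction across independent accumulators exposes parallelism, but it reorders the floating-point additions, so it is only legal when + may be treated as associative.

    /* Reduce c[0..n-1] with two independent partial sums; assumes n is even. */
    double sum_reduction(const double *c, long n)
    {
        double s0 = 0.0, s1 = 0.0;      /* independent accumulators           */
        for (long i = 0; i < n; i += 2) {
            s0 += c[i];                 /* these two additions do not depend  */
            s1 += c[i + 1];             /* on each other, so they can overlap */
        }
        return s0 + s1;                 /* combine the partial sums at the end */
    }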
Another Example
Loop-carried dependences?
    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }
To overlap iteration execution:
    A[1] = A[1] + B[1];
    for (i = 1; i < 100; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

Another possibility: Software Pipelining
- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations.
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW).
[Figure: iterations 0 through 4 overlapped in time; the software-pipelined iteration is assembled from slices of several original iterations]
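Before the assembly example on the next slide, a source-level sketch of the idea (illustrative code, 1-based indexing as in the earlier examples, n >= 3 assumed): each trip through the steady-state body stores the result of one iteration, does the add for the following one, and loads the element for the one after that, with a short prologue and epilogue to fill and drain the pipeline.

    void add_scalar_swp(double *x, long n, double s)
    {
        double loaded = x[n];            /* prologue: load for iteration n    */
        double added  = loaded + s;      /* prologue: add for iteration n     */
        loaded = x[n - 1];               /* prologue: load for iteration n-1  */

        for (long i = n; i >= 3; i--) {  /* steady state: one store, one add, */
            x[i]   = added;              /*   and one load per trip           */
            added  = loaded + s;
            loaded = x[i - 2];
        }
        x[2] = added;                    /* epilogue: drain the pipeline      */
        x[1] = loaded + s;
    }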
Software Pipelining Example
Before: unrolled 3 times
    1  LD    F0, 0(R1)
    2  ADDD  F4, F0, F2
    3  SD    0(R1), F4
    4  LD    F6, -8(R1)
    5  ADDD  F8, F6, F2
    6  SD    -8(R1), F8
    7  LD    F10, -16(R1)
    8  ADDD  F12, F10, F2
    9  SD    -16(R1), F12
   10  SUBUI R1, R1, #24
   11  BNEZ  R1, LOOP
After: software pipelined
    1  SD    0(R1), F4      ; stores M[i]
    2  ADDD  F4, F0, F2     ; adds to M[i-1]
    3  LD    F0, -16(R1)    ; loads M[i-2]
    4  SUBUI R1, R1, #8
    5  BNEZ  R1, LOOP
5 cycles per iteration.
- Symbolic loop unrolling
- Maximize the result-use distance
- Less code space than unrolling
- Fill and drain the pipe only once per loop, vs. once per unrolled iteration in loop unrolling
[Figure: number of overlapped operations over time for the software-pipelined loop vs. the unrolled loop]

Things to Remember
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
- Loop unrolling to minimize stalls
- Multiple issue to minimize CPI
  - Superscalar processors
  - VLIW architectures
Statically Scheduled Superscalar
E.g., a four-issue static superscalar:
- 4 instructions make up one issue packet
- The issue logic examines each instruction in the packet in program order
- An instruction cannot be issued if it would cause a structural or data hazard, either due to an instruction earlier in the issue packet or due to an instruction already in execution
- The processor can therefore issue from 0 to 4 instructions per clock cycle (an illustrative model of this check is sketched below)

Multiple Issue with Dynamic Scheduling
[Figure: Tomasulo organization - an FP op queue fed from the instruction unit, load buffers (Load1-Load6) from memory, store buffers (Store1-Store3) to memory, FP registers, reservation stations Add1-Add3 feeding the FP adders and Mult1-Mult2 feeding the FP multipliers, all connected by the common data bus]
Issue: 2 instructions per clock cycle.
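Returning to the statically scheduled superscalar above, a minimal, illustrative model of the in-order issue-packet check; the instruction type and the hazard tests are stand-ins for logic that would live in a real pipeline model, not an actual simulator API.

    #include <stdbool.h>

    typedef struct { int opcode, dst, src1, src2; } instr;

    /* Hypothetical hazard checks, stubbed out here for illustration. */
    static bool structural_hazard(const instr *in)         { (void)in; return false; }
    static bool data_hazard_with_pipeline(const instr *in) { (void)in; return false; }
    static bool data_hazard_in_packet(const instr *a, const instr *b)
                                                            { (void)a; (void)b; return false; }

    /* Examine the 4-instruction packet in program order and return how many
     * instructions issue this clock (0 to 4): issue stops at the first
     * instruction that conflicts with one already in execution or with an
     * earlier instruction in the same packet. */
    static int issue_packet(const instr packet[4])
    {
        for (int i = 0; i < 4; i++) {
            if (structural_hazard(&packet[i]) || data_hazard_with_pipeline(&packet[i]))
                return i;
            for (int j = 0; j < i; j++)
                if (data_hazard_in_packet(&packet[j], &packet[i]))
                    return i;
        }
        return 4;
    }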
Multiple Issue with Dynamic Scheduling
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDIU R1, R1, #-8
          BNE    R1, R2, Loop
Assumptions:
- One FP and one integer operation can be issued per clock
- Resources: one integer ALU (integer ops and effective address calculation), a separate pipelined FP unit for each operation type, branch prediction hardware, one CDB
- 2 clock cycles for loads, 3 clock cycles for FP add
- Branches are single issue; branch prediction is perfect

Multiple Issue with Dynamic Scheduling (timing for the first three iterations)
    Iter.  Instruction            Issue  Exec.   Mem.    Write    Comment
                                         begins  access  at CDB
    1      L.D    F0, 0(R1)       1      2       3       4        first issue
    1      ADD.D  F4, F0, F2      1      5               8        wait for L.D
    1      S.D    0(R1), F4       2      3       9                wait for ADD.D
    1      DADDIU R1, R1, #-8     2      4               5        wait for ALU
    1      BNE    R1, R2, Loop    3      6                        wait for DADDIU
    2      L.D    F0, 0(R1)       4      7       8       9        wait for BNE
    2      ADD.D  F4, F0, F2      4      10              13       wait for L.D
    2      S.D    0(R1), F4       5      8       14               wait for ADD.D
    2      DADDIU R1, R1, #-8     5      9               10       wait for ALU
    2      BNE    R1, R2, Loop    6      11                       wait for DADDIU
    3      L.D    F0, 0(R1)       7      12      13      14       wait for BNE
    3      ADD.D  F4, F0, F2      7      15              18       wait for L.D
    3      S.D    0(R1), F4       8      13      19               wait for ADD.D
    3      DADDIU R1, R1, #-8     8      14              15       wait for ALU
    3      BNE    R1, R2, Loop    9      16                       wait for DADDIU
One iteration completes every 5 clocks (BNE executes in cycles 6, 11, 16), so despite dual issue the sustained rate is only about one instruction per clock.
Multiple Issue with Dynamic Scheduling: Resource Usage
(entries are iteration/instruction)
    Clock  Int ALU     FP ALU     Data Cache  CDB
    2      1/L.D
    3      1/S.D                  1/L.D
    4      1/DADDIU                           1/L.D
    5                  1/ADD.D                1/DADDIU
    6
    7      2/L.D
    8      2/S.D                  2/L.D       1/ADD.D
    9      2/DADDIU               1/S.D       2/L.D
    10                 2/ADD.D                2/DADDIU
    11
    12     3/L.D
    13     3/S.D                  3/L.D       2/ADD.D
    14     3/DADDIU               2/S.D       3/L.D
    15                 3/ADD.D                3/DADDIU
    16
    17
    18                                        3/ADD.D
    19                            3/S.D

Multiple Issue with Dynamic Scheduling: Improvements
- DADDIU waits for the integer ALU, which is also used for the S.D effective address calculation => add one ALU dedicated to effective address calculation
- Use 2 CDBs
- Exercise: draw the table for this dual-issue version of Tomasulo's pipeline
Multiple Issue with Dynamic Scheduling (with the extra address adder and two CDBs)
    Iter.  Instruction            Issue  Exec.   Mem.    Write    Comment
                                         begins  access  at CDB
    1      L.D    F0, 0(R1)       1      2       3       4        first issue
    1      ADD.D  F4, F0, F2      1      5               8        wait for L.D
    1      S.D    0(R1), F4       2      3       9                wait for ADD.D
    1      DADDIU R1, R1, #-8     2      3               4        executes earlier
    1      BNE    R1, R2, Loop    3      5                        wait for DADDIU
    2      L.D    F0, 0(R1)       4      6       7       8        wait for BNE
    2      ADD.D  F4, F0, F2      4      9               12       wait for L.D
    2      S.D    0(R1), F4       5      7       13               wait for ADD.D
    2      DADDIU R1, R1, #-8     5      6               7        executes earlier
    2      BNE    R1, R2, Loop    6      8                        wait for DADDIU
    3      L.D    F0, 0(R1)       7      9       10      11       wait for BNE
    3      ADD.D  F4, F0, F2      7      12              15       wait for L.D
    3      S.D    0(R1), F4       8      10      16               wait for ADD.D
    3      DADDIU R1, R1, #-8     8      9               10       executes earlier
    3      BNE    R1, R2, Loop    9      11                       wait for DADDIU
An iteration now completes every 3 clocks (BNE executes in cycles 5, 8, 11), so the dual-issue hardware is used much more effectively.

Multiple Issue with Dynamic Scheduling: Resource Usage
(entries are iteration/instruction)
    Clock  Int ALU     Adr. Adder  FP ALU     Data Cache  CDB #1     CDB #2
    2                  1/L.D
    3      1/DADDIU    1/S.D                  1/L.D
    4                                                     1/L.D      1/DADDIU
    5                              1/ADD.D
    6      2/DADDIU    2/L.D
    7                  2/S.D                  2/L.D       2/DADDIU
    8                                                     1/ADD.D    2/L.D
    9      3/DADDIU    3/L.D       2/ADD.D    1/S.D
    10                 3/S.D                  3/L.D       3/DADDIU
    11                                                    3/L.D
    12                             3/ADD.D                2/ADD.D
    13                                        2/S.D
    14
    15                                                    3/ADD.D
    16                                        3/S.D