Session: Exploiting ILP with SW Approaches
Electrical and Computer Engineering
University of Alabama in Huntsville

Outline
- Review: Basic Pipeline Scheduling and Loop Unrolling
- Multiple Issue: Superscalar, VLIW
- Software Pipelining

/0/00 UAH-

Basic Pipeline Scheduling: Example

Simple loop:
    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

Assumptions:

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

    ; R1 points to the last element in the array
    ; for simplicity, we assume that x[0] is at the address 0
    Loop: L.D   F0, 0(R1)     ; F0 = array element
          ADD.D F4, F0, F2    ; add scalar in F2
          S.D   0(R1), F4     ; store result
          SUBI  R1, R1, #8    ; decrement pointer
          BNEZ  R1, Loop      ; branch

Revised loop to minimise stalls

    Loop: LD    F0, 0(R1)
          SUBI  R1, R1, #8
          ADDD  F4, F0, F2
          stall
          BNEZ  R1, Loop      ; delayed branch
          SD    8(R1), F4     ; altered and interchanged with SUBI

- Swap BNEZ and SD by changing the address of SD.
- SUBI is moved up.
- 6 clocks per iteration (1 stall), but only 3 instructions do the actual work of processing the array (LD, ADDD, SD).
  => Unroll the loop 4 times to improve the potential for instruction scheduling.
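The cycle counts for the original and the revised loop can be checked with a small issue-time model. This is only a sketch: the function `loop_cycles`, the instruction tuples, and the latency values are assumptions (the usual textbook setting of 3, 2, 1, 0 stall cycles for the FP, load, and store pairs, plus an assumed 1-cycle integer-op-to-branch latency and a 1-cycle branch delay slot). It models total cycles only, not the exact position of each stall.

```python
# Stall cycles between a producer kind and a consumer kind (assumed values).
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,   # assumption: a branch waits one cycle for its register
}

def loop_cycles(instrs, empty_delay_slot):
    """Cycles for one iteration on a single-issue, in-order pipeline.

    instrs: list of (kind, dest_register_or_None, [source_registers]).
    empty_delay_slot: True if the branch delay slot holds no useful work.
    """
    produced = {}              # register -> (issue cycle, producer kind)
    clock = 0
    for kind, dest, sources in instrs:
        issue = clock + 1      # at best, one instruction per cycle
        for reg in sources:
            if reg in produced:
                when, producer = produced[reg]
                issue = max(issue, when + 1 + LATENCY.get((producer, kind), 0))
        clock = issue
        if dest is not None:
            produced[dest] = (clock, kind)
    return clock + (1 if empty_delay_slot else 0)

# Original order: LD, ADDD, SD, SUBI, BNEZ (delay slot left empty).
original = [
    ("load", "F0", ["R1"]),
    ("fpalu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("int", "R1", ["R1"]),
    ("branch", None, ["R1"]),
]

# Revised order: LD, SUBI, ADDD, BNEZ, SD (SD fills the delay slot).
revised = [
    ("load", "F0", ["R1"]),
    ("int", "R1", ["R1"]),
    ("fpalu", "F4", ["F0", "F2"]),
    ("branch", None, ["R1"]),
    ("store", None, ["F4", "R1"]),
]

print(loop_cycles(original, empty_delay_slot=True))   # 10
print(loop_cycles(revised, empty_delay_slot=False))   # 6
```

Under these assumed latencies the original order costs 10 cycles per iteration and the revised order 6, of which only three instructions touch the array.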
Unrolled Loop

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4      ; drop SUBI & BNEZ
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4     ; drop SUBI & BNEZ
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4    ; drop SUBI & BNEZ
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

- Each LD is followed by a 1 cycle stall, each ADDD by a 2 cycles stall.
- This loop will run 28 cc (14 stalls) per iteration: each LD has one stall, each ADDD 2, SUBI 1, BNEZ 1, plus 14 instruction issue cycles. That is 28/4 = 7 cycles for each element of the array (even slower than the scheduled version)!
  => Rewrite loop to minimize stalls

Unrolled Loop that Minimise Stalls

    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16

- This loop will run 14 cycles (no stalls) per iteration, or 14/4 = 3.5 cycles for each element!
- Assumptions that make this possible:
  - move LDs before SDs
  - move SD after SUBI and BNEZ
  - use different registers
- When is it safe for compiler to do such changes?

Superscalar MIPS

- Superscalar MIPS: 2 instructions, 1 FP & 1 anything else
- Fetch 64-bits/clock cycle; Int on left, FP on right
- Can only issue 2nd instruction if 1st instruction issues
- More ports for FP registers to do FP load & FP op in a pair

[Figure: pipeline timing of paired integer and FP instructions; horizontal axis: Time [clocks]]
Note: FP operations extend EX cycle

Loop Unrolling in Superscalar

    Integer Instr.               FP Instr.
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)     ADDD  F4, F0, F2
          LD    F14, -24(R1)     ADDD  F8, F6, F2
          LD    F18, -32(R1)     ADDD  F12, F10, F2
          SD    0(R1), F4        ADDD  F16, F14, F2
          SD    -8(R1), F8       ADDD  F20, F18, F2
          SD    -16(R1), F12
          SUBI  R1, R1, #40
          SD    16(R1), F16
          BNEZ  R1, Loop
          SD    8(R1), F20

- Unrolled 5 times to avoid delays
- This loop will run 12 cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array
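At the source level, unrolling replicates the loop body and keeps a cleanup loop for the leftover iterations. A minimal sketch, assuming an unroll factor of 4; the function names `add_scalar` and `add_scalar_unrolled` are hypothetical, and Python is used only for illustration since the slides work at the MIPS level:

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled(x, s):
    """Same computation with the loop body replicated 4 times."""
    n = len(x)
    i = 0
    # Main unrolled loop: one index update and one test per 4 elements.
    while i + 4 <= n:
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
        i += 4
    # Cleanup loop for the n mod 4 leftover elements.
    while i < n:
        x[i] = x[i] + s
        i += 1
    return x
```

The four statements in the unrolled body are independent of each other, which is what gives a scheduler room to hoist all loads ahead of the adds and stores.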
The VLIW Approach

- VLIWs use multiple independent functional units
- VLIWs package the multiple operations into one very long instruction
- Compiler is responsible to choose instructions to be issued simultaneously

[Figure: pipeline timing of VLIW instructions I_i and I_i+1 (stages IF, ID, E, W); horizontal axis: Time [clocks]]

Loop Unrolling in VLIW

    Mem. Ref 1         Mem. Ref 2         FP Op 1            FP Op 2            Int/Branch
    LD F2,0(R1)        LD F6,-8(R1)
    LD F10,-16(R1)     LD F14,-24(R1)
    LD F18,-32(R1)     LD F22,-40(R1)     ADDD F4,F0,F2      ADDD F8,F0,F6
    LD F26,-48(R1)                        ADDD F12,F0,F10    ADDD F16,F0,F14
                                          ADDD F20,F0,F18    ADDD F24,F0,F22
    SD 0(R1),F4        SD -8(R1),F8       ADDD F28,F0,F26
    SD -16(R1),F12     SD -24(R1),F16                                           SUBI R1,R1,#56
    SD 24(R1),F20      SD 16(R1),F24
    SD 8(R1),F28                                                                BNEZ R1,Loop

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per each element (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: Need more registers in VLIW (15 vs. 6 in SS)

Software Pipelining

- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)

[Figure: successive iterations of the original loop, with one software-pipelined iteration formed from instructions taken from different iterations]

Software Pipelining Example

Before: Unrolled 3 times

          LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          SUBUI R1, R1, #24
          BNEZ  R1, LOOP

After: Software Pipelined

          SD    0(R1), F4      ; Stores M[i]
          ADDD  F4, F0, F2     ; Adds to M[i-1]
          LD    F0, -16(R1)    ; Loads M[i-2]
          SUBUI R1, R1, #8
          BNEZ  R1, LOOP

- 5 cycles per iteration
- Symbolic Loop Unrolling
  - Maximize result-use distance
  - Less code space than unrolling
  - Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop unrolling

[Figure: overlapped ops versus Time for the SW Pipeline and for the Loop Unrolled version]
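The store / add / load kernel of a software-pipelined loop needs start-up code (prologue) to fill the pipeline and finish-up code (epilogue) to drain it. The sketch below shows that structure for x[i] = x[i] + s. It is an illustration under stated assumptions: the function name `add_scalar_sw_pipelined` is hypothetical, the array is walked upward rather than downward from the last element, and two scalar variables stand in for the registers that carry values between iterations.

```python
def add_scalar_sw_pipelined(x, s):
    """x[i] + s for every element, organized as prologue, kernel, epilogue."""
    n = len(x)
    if n < 3:                      # too short to fill the pipeline
        return [v + s for v in x]
    out = list(x)
    # Prologue: start the first two iterations.
    loaded = x[0]                  # load for iteration 0
    summed = loaded + s            # add for iteration 0
    loaded = x[1]                  # load for iteration 1
    # Kernel: each pass mixes work from three different iterations.
    for i in range(n - 2):
        out[i] = summed            # store (iteration i)
        summed = loaded + s        # add   (iteration i+1)
        loaded = x[i + 2]          # load  (iteration i+2)
    # Epilogue: drain the two iterations still in flight.
    out[n - 2] = summed
    out[n - 1] = loaded + s
    return out
```

Within one kernel pass the store, the add, and the load belong to different original iterations, so none of them waits on a result produced in the same pass.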
Statically Scheduled Superscalar

- E.g., four-issue static superscalar
  - 4 instructions make one issue packet
  - Fetch examines each instruction in the packet in the program order
  - An instruction is not issued if it would cause a structural or data hazard, either due to an instruction earlier in the issue packet or due to an instruction already in execution
  - Can issue from 0 to 4 instructions per clock cycle

[Figure: Tomasulo organization: FP Op Queue fed from the Instruction Unit, FP Registers, Load Buffers (from memory), Store Buffers (to memory), Reservation Stations feeding the FP adders and FP multipliers]
Issue: 2 instructions per clock cycle

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    0(R1), F4
          DADDIU R1, R1, -#8
          BNE    R1, R2, Loop

Assumptions:
- One FP and one integer operation can be issued
- Resources: ALU (int + effective address), a separate pipelined FP unit for each operation type, branch prediction hardware, 1 CDB
- 2 cc for loads, 3 cc for FP Add
- Branches single issue, branch prediction is perfect

[Table: timing of three iterations of the loop. Columns: Iter., Inst., Issue, Exe. (begins), Mem. Access, Write at CDB, Com. Comments in the table: first issue, Wait for L.D, Wait for ALU, Wait for BNE]
Resource Usage

[Table: resource usage per clock for three iterations, entries given as iteration/instruction. Columns: Clock, Int ALU (L.D, S.D, DADDIU), FP ALU (ADD.D), Data Cache (L.D, S.D), CDB (L.D, DADDIU, ADD.D)]

- DADDIU waits for ALU used by S.D
- Add one ALU dedicated to effective address calculation
- Use 2 CDBs
- Draw table for the dual-issue version of Tomasulo's pipeline

[Table: timing of three iterations of the loop with the added address adder and second CDB. Columns: Iter., Inst., Issue, Exe. (begins), Mem. Access, Write at CDB, Com. Comments in the table: first issue, Wait for L.D, Wait for BNE, Executes earlier]

Resource Usage

[Table: resource usage per clock for three iterations, entries given as iteration/instruction. Columns: Clock, Int ALU (DADDIU), Adr. Adder (L.D, S.D), FP ALU (ADD.D), Data Cache (L.D, S.D), CDB#1 (L.D, DADDIU, ADD.D), CDB#2 (DADDIU, L.D)]
What about Precise Interrupts?

- State of machine looks as if no instruction beyond the faulting instruction has issued
- Tomasulo had: in-order issue, out-of-order execution, and out-of-order completion
- Need to fix the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream

Relationship between precise interrupts and speculation

- Speculation: guess and check
- Important for branch prediction: need to take our best shot at predicting branch direction
- If we speculate and are wrong, we need to back up and restart execution at the point at which we predicted incorrectly. This is exactly the same as precise exceptions!
- Technique for both precise interrupts/exceptions and speculation: in-order completion or commit

HW support for precise interrupts

- Need HW buffer for results of uncommitted instructions: reorder buffer
  - 3 fields: instr, destination, value
  - Use reorder buffer number instead of reservation station when execution completes
  - Supplies operands between execution complete & commit
  - (Reorder buffer can be operand source => more registers, like RS)
  - Instructions commit
  - Once instruction commits, result is put into register
  - As a result, easy to undo speculated instructions on mispredicted branches or exceptions

[Figure: FP Op Queue, Reorder Buffer, FP Regs, and reservation stations feeding two FP Adders]

Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   If reservation station and reorder buffer slot are free, issue instr & send operands & reorder buffer no. for destination (this stage is sometimes called "dispatch")
2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch CDB for result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
   Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available
4. Commit: update register with reorder result
   When instr. is at head of reorder buffer & result is present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called "graduation")
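The interplay of out-of-order write-back and in-order commit can be sketched with a toy reorder buffer. This is a simplified illustration, not the hardware design from the slides: the class `ReorderBuffer`, its method names, and the dictionary-based entries are assumptions, and reservation stations and the CDB are left out.

```python
from collections import deque

class ReorderBuffer:
    """Toy reorder buffer: each entry holds instr, destination, value."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()     # left end = head = oldest instruction
        self.next_tag = 0

    def issue(self, instr, dest):
        """Allocate a slot in program order; return its tag, or None if full."""
        if len(self.entries) >= self.size:
            return None
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "instr": instr, "dest": dest,
                             "value": None, "done": False})
        return tag

    def write_result(self, tag, value):
        """Execution may finish in any order; the result waits in the buffer."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def commit(self, regs):
        """Retire finished instructions from the head only, i.e. in order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:      # branches and stores write no register
                regs[entry["dest"]] = entry["value"]
            committed.append(entry["instr"])
        return committed

    def flush_after(self, tag):
        """Mispredicted branch with this tag: drop every younger entry."""
        while self.entries and self.entries[-1]["tag"] > tag:
            self.entries.pop()

rob, regs = ReorderBuffer(size=4), {}
t0 = rob.issue("LD F0,0(R1)", "F0")
t1 = rob.issue("ADDD F4,F0,F2", "F4")
rob.write_result(t1, 3.0)          # younger instruction finishes first
first = rob.commit(regs)           # nothing commits: head is not done yet
rob.write_result(t0, 1.0)
second = rob.commit(regs)          # both commit, in program order
```

Because registers change only at commit, discarding the uncommitted tail of the buffer is enough to undo speculated work.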
What are the hardware complexities with reorder buffer (ROB)?

[Figure: FP Op Queue, Reorder Buffer with fields Dest Reg, Result, Exceptions?, Valid, Program Counter, a comparison network, Reorder Table, FP Regs, and reservation stations feeding two FP Adders]

- How do you find the latest version of a register?
  - (As specified by Smith paper) need associative comparison network
  - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
- Need as many ports on ROB as register file
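Finding the latest version of a register amounts to searching the reorder buffer from the youngest entry to the oldest for a matching destination. The sketch below is a serial stand-in for the associative comparison network, which would compare all entries in parallel. The function name `read_operand` and the entry layout (keys "tag", "dest", "value", "done") are hypothetical.

```python
def read_operand(rob_entries, regfile, reg):
    """Return ("value", v) or ("wait", tag) for register `reg`.

    rob_entries: list of dicts ordered oldest -> youngest.
    regfile: dict of committed register values.
    """
    for entry in reversed(rob_entries):            # youngest writer wins
        if entry["dest"] == reg:
            if entry["done"]:
                return ("value", entry["value"])   # forward from the ROB
            return ("wait", entry["tag"])          # result not produced yet
    return ("value", regfile[reg])                 # no pending writer

entries = [
    {"tag": 5, "dest": "F4", "value": 1.0, "done": True},
    {"tag": 6, "dest": "F0", "value": 9.0, "done": True},
    {"tag": 7, "dest": "F4", "value": None, "done": False},
]
committed = {"F0": 0.0, "F2": 2.0, "F4": 0.5}
```

A register result status table, as mentioned above, would replace the search by keeping one tag per register that always names the youngest pending writer.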