CISC 662 Graduate Computer Architecture
Lecture 6 - Hazards
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis662f07
PowerPoint lecture notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from Jelena Mirkovic (U Del) and John Kubiatowicz (UC Berkeley)
Pipelining is not quite that easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port/Structural Hazards (Figure A.4, Page A-14)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4).]
One Memory Port/Structural Hazards (similar to Figure A.5, Page A-15)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3). The stall row is drawn as five bubbles.]
How do you "bubble" the pipe?
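Speed Up Equation for Pipelining

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

For a simple RISC pipeline, Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$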
Example: Dual-port vs. Single-port
- Machine A: dual-ported memory ("Harvard Architecture")
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

SpeedUp_A = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
          = Pipeline Depth
SpeedUp_B = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
          = (Pipeline Depth / 1.4) x 1.05
          = 0.75 x Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33

Machine A is 1.33 times faster.
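The arithmetic in this example can be checked with a short script. This is a minimal sketch: the function name `pipeline_speedup` and its parameters are illustrative, and the pipeline depth of 5 is an assumption (it cancels out of the ratio).

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup of a pipelined machine over the unpipelined one.

    clock_ratio is cycle_time_unpipelined / cycle_time_pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


DEPTH = 5  # assumed pipeline depth; any positive value gives the same ratio

# Machine A: dual-ported memory, so loads cause no structural stalls.
speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)

# Machine B: 40% loads, each stalling 1 cycle, but a 1.05x faster clock.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)

ratio = speedup_a / speedup_b
print(f"A: {speedup_a:.2f}  B: {speedup_b:.2f}  A/B: {ratio:.2f}")
```

The ratio comes out to about 1.33, matching the slide: the faster clock of Machine B does not make up for the stalls caused by its single memory port.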
Data Hazard on R1 (Figure A.6, Page A-17)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles. Stages: IF, ID/RF, EX, MEM, WB. Vertical axis: instruction order.]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
Write After Read (WAR): Instr J writes an operand before Instr I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
Can't happen in the MIPS 5 stage pipeline because:
- All instructions take 5 stages, and
- Reads are always in stage 2, and
- Writes are always in stage 5
Three Generic Data Hazards
Write After Write (WAW): Instr J writes an operand before Instr I writes it
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".
Can't happen in the MIPS 5 stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5
Will see WAR and WAW in more complicated pipes.
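The three definitions reduce to comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below is illustrative only: the `Instr` tuple and `classify_hazards` are hypothetical names, and the function reports dependences that could become hazards. Whether they actually do depends on the pipeline, as noted above for MIPS.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dst: str                # destination register
    srcs: Tuple[str, ...]   # source registers


def classify_hazards(i: Instr, j: Instr) -> Set[str]:
    """Dependences from earlier instruction i to later instruction j."""
    kinds = set()
    if i.dst in j.srcs:
        kinds.add("RAW")    # j reads what i writes (true dependence)
    if j.dst in i.srcs:
        kinds.add("WAR")    # j overwrites what i reads (anti-dependence)
    if i.dst == j.dst:
        kinds.add("WAW")    # both write the same name (output dependence)
    return kinds


# The I/J pairs from the three slides.
raw = classify_hazards(Instr("add", "r1", ("r2", "r3")),
                       Instr("sub", "r4", ("r1", "r3")))
war = classify_hazards(Instr("sub", "r4", ("r1", "r3")),
                       Instr("add", "r1", ("r2", "r3")))
waw = classify_hazards(Instr("sub", "r1", ("r4", "r3")),
                       Instr("add", "r1", ("r2", "r3")))
print(raw, war, waw)
```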
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order.]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
HW Change for Forwarding (Figure A.23, Page A-37)
[Figure: datapath with forwarding paths. Labeled components: NextPC, Registers, ID/EX, EX/MEM, Data Memory, MEM/WR, Immediate, and three muxes.]
What circuit detects and resolves this hazard?
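One common textbook answer is a forwarding unit that compares the source register of the instruction in EX against the destination registers held in the later pipeline registers. The sketch below models that comparison for a single ALU operand. The function and signal names are assumptions for illustration (the MEM/WR register of the figure is called MEM/WB here), not the exact logic of Figure A.23.

```python
def forward_select(src_reg, ex_mem_regwrite, ex_mem_rd,
                   mem_wb_regwrite, mem_wb_rd):
    """Choose where one ALU operand of the instruction in EX comes from.

    Register numbers are integers; register 0 is never forwarded
    because it is hardwired to zero in MIPS.
    """
    # The most recent producer wins: check EX/MEM before MEM/WB.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


# add r1,r2,r3 ; sub r4,r1,r3 ; and r6,r1,r7
# sub is in EX while add sits in EX/MEM:
sel_sub = forward_select(1, True, 1, False, 0)
# and is in EX while sub sits in EX/MEM and add sits in MEM/WB:
sel_and = forward_select(1, True, 4, True, 1)
print(sel_sub, sel_and)
```

The returned label stands for the select input of the mux in front of the ALU; a real forwarding unit would run the same check once per ALU operand.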
Pipeline Control
Pass control signals along just like the data.

               Execution/Address Calculation      Memory access stage           Write-back stage
               stage control lines                control lines                 control lines
Instruction    RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite     RegWrite  MemtoReg
R-format       1       1       0       0          0       0        0            1         0
lw             0       0       0       1          0       1        0            1         1
sw             X       0       0       1          0       0        1            0         X
beq            X       0       1       0          1       0        0            0         X

[Figure: the Control unit decodes the instruction held in IF/ID. ID/EX carries the EX, M, and WB signal groups; EX/MEM carries M and WB; MEM/WB carries WB.]
Datapath with Control
[Figure: pipelined datapath with control signals.]
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order.]
  add r1,r2,r3
  lw  r4,0(r1)
  sw  r4,12(r1)
  or  r8,r6,r9
  xor r10,r9,r11
Data Hazard Even with Forwarding (Figure A.9, Page A-21)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order.]
  lw  r1,0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
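Data Hazard Even with Forwarding (similar to Figure A.10, Page A-21)
[Figure: pipeline diagram. Horizontal axis: time in clock cycles. Vertical axis: instruction order. One bubble is inserted in each of sub, and, and or, delaying them by one cycle.]
  lw  r1,0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
How is this detected?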
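A common textbook way to detect this case is an interlock in the ID stage: stall when the instruction currently in EX is a load whose destination matches a source register of the instruction being decoded. The sketch below shows that check under assumed signal names (`id_ex_memread`, `id_ex_rt`, `if_id_rs`, `if_id_rt`); it is an illustration, not the specific circuit of the figure.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in ID must wait one cycle for a load in EX.

    id_ex_memread: the instruction in EX is a load
    id_ex_rt:      destination register of that load
    if_id_rs/rt:   source registers of the instruction in ID
    """
    return bool(id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt))


# lw r1,0(r2) in EX, sub r4,r1,r6 in ID: the loaded r1 is needed too early.
stall_sub = load_use_stall(True, 1, 1, 6)
# lw r1,0(r2) in EX, an unrelated or r8,r6,r9 in ID: no stall.
stall_or = load_use_stall(True, 1, 6, 9)
print(stall_sub, stall_or)
```

When the check fires, the hardware holds the PC and the IF/ID register and sends a bubble (all control signals zero) down the pipeline for one cycle.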
Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:            Fast code:
  LW  Rb,b              LW  Rb,b
  LW  Rc,c              LW  Rc,c
  ADD Ra,Rb,Rc          LW  Re,e
  SW  a,Ra              ADD Ra,Rb,Rc
  LW  Re,e              LW  Rf,f
  LW  Rf,f              SW  a,Ra
  SUB Rd,Re,Rf          SUB Rd,Re,Rf
  SW  d,Rd              SW  d,Rd

Compiler optimizes for performance. Hardware checks for safety.
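The benefit of the reordering can be counted directly. The sketch below uses a simplified model, which is an assumption of this example: forwarding is available, and the only stall is one cycle whenever an instruction uses a register loaded by the instruction immediately before it. The tuple encoding and the name `count_load_use_stalls` are illustrative.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls in a list of (op, dst, srcs) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dst, _ = prev
        _, _, cur_srcs = cur
        # A load's result is not ready for the very next instruction.
        if prev_op == "LW" and prev_dst in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

slow_stalls = count_load_use_stalls(slow)
fast_stalls = count_load_use_stalls(fast)
print(slow_stalls, fast_stalls)
```

Under this model the slow sequence stalls twice (ADD right after LW Rc, SUB right after LW Rf) and the fast sequence never stalls, because each load is separated from its first use by an independent instruction.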
Control Hazard
Control Hazard on Branches: Three Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
What do you do with the 3 instructions in between?
How do you do it?
Where is the "commit"?
Branch Stall Impact
If CPI = 1, 30% branches, and each branch stalls 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9!
Two part solution:
- Determine whether the branch is taken or not sooner, AND
- Compute the taken branch address earlier
MIPS branch tests if register = 0 or ≠ 0
MIPS Solution:
- Move Zero test to ID/RF stage
- Adder to calculate new PC in ID/RF stage
- 1 clock cycle penalty for branch versus 3
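The CPI figures on this slide follow from adding the average branch stall to the base CPI. A minimal sketch, with `cpi_with_branches` as an illustrative name:

```python
def cpi_with_branches(base_cpi, branch_freq, penalty_cycles):
    """Effective CPI when a fraction of instructions pays a branch penalty."""
    return base_cpi + branch_freq * penalty_cycles


# Branch resolved late: every branch stalls 3 cycles.
cpi_three_cycle = cpi_with_branches(1.0, 0.30, 3)
# Zero test and target adder moved to ID/RF: 1 cycle penalty.
cpi_one_cycle = cpi_with_branches(1.0, 0.30, 1)
print(f"{cpi_three_cycle:.2f} {cpi_one_cycle:.2f}")
```

With the same 30% branch frequency, cutting the penalty from 3 cycles to 1 lowers the CPI from 1.9 to 1.3.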
Pipelined MIPS Datapath (Figure A.24, page A-38)
[Figure: five-stage datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Other labels: Next PC, Next SEQ PC, two Adders (one adding 4), instruction Memory (Address input), Reg File (RS1, RS2, RD), Zero? test, Sign Extend (Imm), MUXes, Data Memory, WB Data.]
Interplay of instruction set design and cycle time.
Four Branch Hazard Alternatives
Static alternatives: fixed for each branch during the entire execution
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- "Squash" instructions in pipeline if branch actually taken
- Advantage of late pipeline state update
- 47% of MIPS branches not taken on average
- PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
- 53% of MIPS branches taken on average
- But haven't calculated branch target address in MIPS
  » MIPS still incurs 1 cycle branch penalty
  » Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER a following instruction:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n
  branch target if taken
The n sequential successors form a branch delay of length n.
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- MIPS uses this
Scheduling Branch Delay Slots (Fig A.14)

A. From before branch    B. From branch target    C. From fall through
  add $1,$2,$3             sub $4,$5,$6             add $1,$2,$3
  if $2=0 then             add $1,$2,$3             if $1=0 then
    [delay slot]           if $1=0 then               [delay slot]
                             [delay slot]           sub $4,$5,$6
becomes                  becomes                  becomes
  if $2=0 then             add $1,$2,$3             add $1,$2,$3
    add $1,$2,$3           if $1=0 then             if $1=0 then
                             sub $4,$5,$6             sub $4,$5,$6

- A is the best choice: it fills the delay slot and reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, it must be okay to execute sub when the branch goes the unexpected way (not taken in B, taken in C)
Delayed Branch
Compiler effectiveness for single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
- Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
- Growth in available transistors has made dynamic approaches relatively cheaper
Evaluating Branch Alternatives

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

Assume 4% unconditional branch, 6% conditional branch-untaken, 10% conditional branch-taken

Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.60   3.1                      1.0
Predict taken        1                1.20   4.2                      1.33
Predict not taken    1                1.14   4.4                      1.40
Delayed branch       0.5              1.10   4.5                      1.45
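The table can be reproduced from the speedup formula and the stated branch mix. Two things in the sketch below are assumptions inferred from the numbers rather than stated on the slide: a pipeline depth of 5, and that "predict not taken" pays its penalty only on taken branches (unconditional plus conditional-taken, 14%), while the other schemes pay on all 20% of branches.

```python
DEPTH = 5  # assumed pipeline depth (classic 5 stage MIPS)

freq = {"uncond": 0.04, "cond_untaken": 0.06, "cond_taken": 0.10}
all_branches = sum(freq.values())
taken_branches = freq["uncond"] + freq["cond_taken"]

# scheme -> (branch penalty in cycles, fraction of instructions paying it)
schemes = {
    "Stall pipeline": (3, all_branches),
    "Predict taken": (1, all_branches),
    "Predict not taken": (1, taken_branches),
    "Delayed branch": (0.5, all_branches),
}

results = {}
for name, (penalty, fraction) in schemes.items():
    cpi = 1 + fraction * penalty
    results[name] = {"cpi": cpi, "vs_unpipelined": DEPTH / cpi}

stall_cpi = results["Stall pipeline"]["cpi"]
for name, r in results.items():
    r["vs_stall"] = stall_cpi / r["cpi"]
    print(f"{name:18s} CPI {r['cpi']:.2f}  "
          f"v. unpipelined {r['vs_unpipelined']:.2f}  "
          f"v. stall {r['vs_stall']:.2f}")
```

Rounded to the precision of the table, the computed values agree with each row.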