Topics. Pipelining Lessons (Task, Resource) Pipelining Lessons (Time cycles) Pipelining is Natural! Laundry Example

Size: px

Start display at page:

Download "Topics. Pipelining Lessons (Task, Resource) Pipelining Lessons (Time cycles) Pipelining is Natural! Laundry Example"

Steven Harrington
5 years ago
Views:

Lecture : Pipelining opics Introduction to pipelining Perfmance Pipelined datapath Design issues Hazards in pipeline ypes F a program with billion instructions, Execution ime =?

processed over multiple steps (called pipeline stages segments ) using cresponding tools in parallel Pipelining doesn t help latency of single task; it helps throughput of entire wkloads Potential

1 Lecture : Pipelining opics Introduction to pipelining Perfmance Pipelined datapath Design issues Hazards in pipeline ypes F a program with billion instructions, Execution ime =? Pipelining Lessons (ask, Resource) A way to parallelize: a load consists of multiple -tasks operating simultaneously using different resources (hardware equipments) Pipelining in parallel: a load is processed over multiple steps (called pipeline stages segments ) using cresponding tools in parallel Pipelining doesn t help latency of single task; it helps throughput of entire wkloads Potential speedup = Number of pipe-stages (Pipeline depth) Ideal could be: 8hrs/hrs= ime to fill pipeline time to drain reduce speedup: Speedup in this example: 8hrs/3.5hrs=.3 {Wash, Dry, Fold, Stash} 5 Pipelining is Natural! Laundry Example Pipelining Lessons (ime cycles) Use case scenario Ann, Brian, Cathy, Dave each has one load of clothes to wash, dry, fold, put away Load parameters Step Device Cycle Wash Washer 3 min Dry Dryer 3 min 3 Fold Folder 3 min Put away Stasher (put clothes into drawers) Efficient ganisation? 3 min wash Washer dry put away Dryer fold Folder Stasher 3 min a s k O r d e r A B C D ime 6 PM Suppose new Folder takes minutes, new Stasher takes minutes. How much faster is pipeline? /* see the last slide f answer */ Pipeline rate is limited by the slowest pipeline stage Unbalanced lengths of pipe stages reduces speedup 6 Sequential Laundry Steps in Executing MIPS instruction instruction stages hardware resources Sequential laundry takes 8 hours f loads 3 ) IF: uction (PC -> memy) Hardware: uction () ) ID: uction (isters) Hardware: ister [read] () 3) EX: (Perfm Operation) Hardware: ) MEM: Access (f data) Hardware: () Load: Read from Ste: to IF ID EX 5) WB: Back ( to ister) Hardware: ister [write] () WB o simplify pipeline, every instruction takes the same pipeline depth (even if not needed; unused stages are NOP.); every stage has the same length. 7 Pipelined Laundry: Start wk ASAP Pipelined Execution Representation Single-Cycle Sequential uction uction uction Read / Read/ uction Pipelined Read/ Read / So the latency is 5 ps; ime between instructions is 5 ps ime (ps) /* see the Revision quiz slide f answer */ 3 uction Read/ Pipelined laundry takes 3.5 hours f loads! So the latency is ps; ime between instructions is ps Pipeline stages not balanced perfectly: instructions take the same pipeline depth; some instructions have NOP stages; every stage has the same length as the longest. 8

2 5: :6 :6 5: 5: 5: :6 :6 5: 5: Pipelining Abstraction instruction stages hardware resources Pipeline Hazards (conflicts) the nd half of the st half of the cycle the cycle lw $s, ($) lw $ $s add $s3, $t, $t $s, $s, $s5 $s5, $t5, $t6 sw $s6, ($s) add $t $t $s $s5 $s3 - $t5 $t6 sw $s $s $s5 $s6 ime (cycles) : uction : ister : Limits to pipelining: hazards prevent next instruction from executing during its designated clock cycle [intermediate input] Structural hazards [decision making] uction depends on the result of pri instruction that is still in the pipeline Attempt to use the same resource in two different ways at the same time; hardware cannot suppt a particular combination of instructions. Pipelining of branches other control instructions stall the pipeline until the next value of PC is known $s7,, $t $t $s7 o simplify pipeline, every instruction takes the same pipeline depth (even if not needed; unused stages are NOP.); every stage has the same length. 3 Pipelined path Single is a Structural Hazard PC' PC In s tru c tio n W E 3 A RD A RD ister W D 3 SrcA SrcB Zero Result W rite D a ta W E W D Read W rite R e g : PCPlus SignImm << PCBranch Pipeline registers Result O utw PC' PCF In strd In s tru c tio n W E 3 A RD A RD ister W D 3 SrcBE W ritee W ritee : W E OutM W ritem W D ReadW SignIm m E << PCPlusF PCPlusD PCPlusE Mem y W riteback State transition: need f propagating state at the end of every stage to push the instruction through the pipeline. Pipeline registers: add registers between every two stages (used as state elements) Each instruction is fetched from memy 3% of instructions are load ste Not possible to access same memy twice in same clock pulse PC' PCF PCPlusF Pipeline registers IF/ID ID/EX EX/MEM MEM/WB D Ins truc tio n WE3 5: A RD :6 A RD ister WD3 :6 5: SignImmE << 5: PCPlusD PCPlusE SrcBE W ritee E : OutW WE OutM ReadW W ritem WD M : W ritew : W riteback Pipeline registers separate pipeline stages Names of registers: names of two stages they separate IF/ID, ID/EX, EX/MEM, MEM/WB. Why no register WB/IF? Wide enough to ste all data ( control signals) passed between stages Structural Hazards Solutions Example: Each instruction is fetched from memy = memy access 3% of instructions are load ste = memy access otal no. of accesses expected per cycle = otal no. of accesses allowed per cycle = no duplicated rsc Solution: level cache is split into: instruction cache: I-Cache, each instruction needs one read from I-cache data cache: D-Cache 3% instructions need access to D-cache we can maintain CPI of /* see the Revision quiz slide f answer */ 5 Problems f Pipelining A B C D Limits to pipelining: hazards prevent next instruction from executing during its designated clock cycle $s add $s $s $s $s $s hazard: B, C, D depends on the result of A. Possible solutions to data hazard : Static Insert s in code at compile time Rearrange code at compile time $s $t $s $s5 $t - Dynamic $t ime (cycles) Fward data at run time Stall the process/pipeline at run time Later instructions use the results of the earlier instructions (attempt to use an item befe it is ready) Solution: Compile-ime Hazard Elimination Insert enough s f result to be ready Or find other instructions which will logically fit between add sequent instructions (move independent instructions fward) $s add $s $s $s $s $s $s $t $s $s5 ime (cycles) $t - $t 6

3:6 5: 5: :6 5: :6 5: 5: in : RAW, WAR Solution: Dynamic Hazard Elimination $s add $s3 fwarding: the result can be passed directly to the functional unit which needs it.

3 3:6 5: 5: :6 5: :6 5: 5: in : RAW, WAR Solution: Dynamic Hazard Elimination $s add $s3 fwarding: the result can be passed directly to the functional unit which needs it $s $s $s $s $s $t $s $s5 $t - Issue: he sequent instructions need this value $s earlier ADD gets value of $s after which stage? $t ime (cycles) AND needs befe its stage ; OR in stage ; SUB in stage /* see the Revision quiz slide f answer */ RAW: may not complete befe read is attempted his is the most common type of data hazard It is created by the logic of the program Obviously a value must be computed befe it is used. Fwarding can be used to overcome this hazard WAR: the location is overwritten befe read is completed his type of data hazard is rare (Antidependences) his can only occur if results are written early in the pipeline opers are read late in the pipeline. (only complex addressing modes require opers to be read late in the pipeline.) ister renaming can be used to overcome this hazard A large no of architecture registers reduces name dependencies. 7 Fwarding (rd) (rs) (rt) SrcA SrcB Fward result to EX stage of sequent instructions from either: PC' stage back stage of ADD PCF A RD uction PCPlusF D Control Unit Op Funct A A WD3 PCPlusD Sign Extend WE3 W rited MemtoD MemD ControlD : SrcD DstD BranchD ister RD RD SignImmD RsD RtD RdD W ritee M W ritew MemtoE MemtoM MemtoW MemE MemM ControlE : SrcE DstE PCSrcM BranchE BranchM WE OutM ReadW A RD SrcBE W ritee W ritem WD RsE OutW W ritee : M : W ritew : Hazard Unit FwardAE FwardBE SignImmE << PCPlusE M W 8 uctions in the pipeline are fetched in der of the program his flow would change when there is a branch, a procedure call Branch penalty: he pipeline has to stall until the address of the next instruction is determined Resolve the condition of the branch conditional branch only Calculate branch address conditional branch unconditional branch return from procedure In our pipeline When is branch address calculated? ID stage (stage ) When is condition evaluated? EX stage (stage 3) What resources are used? [Adder], Fwarding can fail where result is not ready lw gets value of $s in the which stage? ime (cycles) $ $s lw $s, ($) lw rouble! $s $t $s $s $t $s $s $t $s5 - Must delay/stall instruction ( pipeline bubble ) ime (cycles) $ $s lw $s, ($) lw $s $s $t $s $s $s $t $s Stall $s $t $s5 - [Stall when branch decoded]: 3 cycles lost per branch! (too much waste as there are ~3% branch instructions) Delayed branches: execute instructions not dependent on branch direction befe the address is resolved Predict Branch Not aken (e.g. f selections): 7% branches not taken on average Execution continued down the sequential path If branch taken, squash/cancel instructions in pipeline Predict Branch aken (e.g. f loops): 53% MIPS branches taken on average Can t proceed until we know the branch address (ID stage) Sophisticated branch prediction: execute the me likely stream of instructions, undo if guessed wrongly 3 in So far we considered data hazards limited to registers here may also be data hazards in memy A cache miss may delay load ste instruction Ste load instructions on the same address too close in time Name dependency: only happens to use the same name hree categies: Classification depending on der of reads writes: RAW read after write WAR write after read WAW write after write What about RAR read after read? some machines enfce memy accesses strictly in program der this affects pipeline perfmance some allow a different der in general me hardware is required f control Delayed branches Compilers move instruction that has no conflict with branch into delay slot Fills about 6% of branch delay slots About 8% of instructions executed in branch delay slots useful in computation 8% (6% x 8%) of slots usefully filled Where to get branch delay slot instructions? Befe branch instruction From fall through only valuable when branch-not-taken From the target address only valuable when branch-taken Original sequence add $ $5 $6 beq $ $ Redered to this beq $ $ add $ $5 $6

ime (cycles) States in -bit prediction scheme Predict-not-taken branch success instructions in sequence PC already calculated, so use it to get next instruction his scheme is simplest to implement

instructions A loop of iterations befe exit: // f (i=; i<; i) a[i] = a[i] *.

4 ime (cycles) States in -bit prediction scheme Predict-not-taken branch success instructions in sequence PC already calculated, so use it to get next instruction his scheme is simplest to implement Squash instructions in pipeline if branch taken the instructions in IF, ID EX stage are discarded nothing has been written so far $t beq $t, $t, $t - lw $s $s $s $s 8 Flush these instructions A loop of iterations befe exit: // f (i=; i<; i) a[i] = a[i] *.; aken N N Feedback Not aken Perfmance f loops one misprediction on the last execution of the loop one misprediction on the first execution of the loop (previous execution of this loop ended with branch not taken) f example if the loop executed times: 8% accuracy F less regular branches much lower accuracy may even be zero if the branch is alternately: not taken - taken C $s $s5-3 $s 6 slt, $s, $s3 slt $s3 slt 5 States in -bit prediction scheme Predict-taken branch (Early Branch Resolution) 8 C Can t proceed until we know the branch address (ID stage) MIPS still incurs one cycle branch penalty [Penalty now is only one lost cycle] Makes sense if branch conditions are me complex take longer to calculate $t beq $t, $t, lw $t $s $s 3 $s 6 slt, $s, $s3 slt $s3 slt ime (cycles) Flush this instruction Finite state automaton the -bit value is a saturating counter increment on taken decrement on not taken Use bits per entry instead of one Prediction must be wrong twice befe it is changed if predicted taken start execution at target as soon as target is known if predicted not-taken continue sequential execution until direction Change prediction only if getting misprediction twice 6 3 Pipelined Perfmance Example he rest is f self-study Static branch prediction can be perfmed by the compiler by hinting all fward branches not-taken all backward branches (loops) as taken unrolling loops to decrease the frequency of branches can be based on program traces f frequently executed programs we can determine which branches are me likely to be taken, which me likely to be not taken Dynamic branch prediction Prediction is based on the run-time behaviour of the program Behaviour of the branch Behaviour of the preceding branches Requires additional hardware to ste this infmation Branch Prediction Buffer Branch Histy able 7 SPECIN benchmark: 5% loads % stes % branches % jumps 5% R-type Suppose: % of loads used by next instruction 5% of branches mispredicted All jumps flush next instruction What is the average CPI? Load/Branch CPI = when no stalling, when stalling. hus, CPI lw = (.6) (.) =. CPIb eq = (.75) (.5) =.5 F a program with billion instructions, Execution ime =? Average CPI = (.5)(.) (.)() (.)(.5) (.)() (.5)() =.5 3 Branch Prediction Buffer [x8] beq $t,$t,l Pipelined Perfmance Example A table indexed by the lower ption of the address of the branch instruction Contains one bit per entry: recently taken not taken like a cache but every access is a hit he prediction MAY NO be f the branch currently executed it depends how many entries in the table me entries -> me hardware -> slower start branch execution accding to the prediction if prediction crect continue if prediction increct invert the prediction bit ste in the table cancel partially executed instructions start the proper sequence of instructions Pipelined process critical path: c = (t read t mux t eq t AND t mux t setup ) = [ ] ps = 55 ps Pipelined process critical path ) t pcq t mem t setup ) (t read t mux t eq t AND t mux t setup ) *half cycle c = max 3) t pcq t mux t mux t t setup ) t pcq t memwrite t setup 5) back (t pcq t mux t write ) *half cycle 8 3

Pipelined Perfmance Example F a program with billion instructions executing on a pipelined MIPS process, CPI =.5 c = 55 ps Execution ime = (# instructions) CPI c = ( )(.

5 Pipelined Perfmance Example F a program with billion instructions executing on a pipelined MIPS process, CPI =.5 c = 55 ps Execution ime = (# instructions) CPI c = ( )(.5)(55 - ) = 63 seconds What is speedup? 33 Summary Pipelining is a fundamental concept Multiple steps using distinct resources Exploiting parallelism in instructions What makes it easy? (MIPS vs. 8x86) All instructions are the same length simple instruction fetch Just a few instruction fmats read registers befe decode instruction opers only in loads stes fewer pipeline stages aligned memy access / load; memy access / ste What makes it hard? Structural hazards: suppose we had only one cache? Need me HW resources : need to wry about branch instructions? Branch prediction, delayed branch : an instruction depends on results of a previous instruction? need fwarding, compiler scheduling 3 Revision quiz How much faster is pipeline (slide 6)? Answer: min So the latency is 5 ps; ime between instructions is ps 5 List the steps f executing MIPS instructions through pipelining datapath. In addition, what functional hardware components are used? here are three types of pipeline hazards, what they are? answer (slide 5): answer (slide 7): With the delayed branch strategy, getting instructions from fall through is one solution. his solution is valuable only when branchnot-taken. ) rue ) False 35 Recommended readings ext readings are listed in eaching Schedule Learning Guide PH: ebook is available PH5: no ebook on library site yet PH5: companion materials (e.g. online sections f further readings) /?ISBN=

Topics. Lecture 12: Pipelining. Introduction to pipelining. Pipelined datapath. Hazards in pipeline. Performance. Design issues.

Topics. Lecture 12: Pipelining. Introduction to pipelining. Pipelined datapath. Hazards in pipeline. Performance. Design issues. Lecture 2: Pipelining Topics Introduction to pipelining Performance Pipelined datapath Design issues Hazards in pipeline Types Solutions Pipelining is Natural! Laundry Example Use case scenario Ann, Brian,