Dynamic Control Hazard Avoidance
- Consider the effects of increasing the ILP
  - control dependencies rapidly become the limiting factor
  - they tend not to get optimized by the compiler
  - more instructions/sec ==> more control instructions per sec
  - control stall penalties go up as machines get faster, because of their larger impact on low-CPI machines (e.g., multiple issue)
- Branch prediction definitely helps, if we get it right
  - use hardware to dynamically predict the branch outcome: the prediction changes as the actual branch outcome changes
- The key thing is to know the cost of a branch when the prediction is correct and when the prediction is incorrect
Chapter 4 page 44
Prediction Based on History
- Branch Prediction Buffer (BPB)
  - 1 bit per branch instruction, recording the last branch outcome
  - uses the low-order address bits of the branch as an index into the buffer
- Many problems
  - the usual cache alias problem: 2 branches with the same index bits will end up predicting each other
    - can use the usual cache strategies to resolve this alias problem
  - always mispredicts twice for every loop
    - once is unavoidable, since the exit is always a surprise
    - however, the previous exit will also cause a mispredict on the first loop iteration
    - example: a loop executing 10 iterations has a prediction accuracy of only 80%
- n-bit predictor (just a simple n-bit saturating counter)
  - increment on taken, decrement on not taken
  - if the counter is in the upper half of its range, the prediction is taken
  - statistically, 2 bits perform about as well as more bits
  - same example: the prediction accuracy improves to 90% using 2-bit predictors. Why?
Chapter 4 page 45
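The 80% vs. 90% claim can be checked with a small simulation (the function below is an illustrative sketch, not from the slides): a saturating n-bit counter that increments on taken, decrements on not taken, and predicts taken in the upper half of its range.

```python
def mispredicts(n_bits, outcomes):
    """Saturating-counter branch predictor: increment on taken,
    decrement on not taken, predict taken in the upper half of the range."""
    top = (1 << n_bits) - 1
    counter = top                    # start strongly taken
    threshold = 1 << (n_bits - 1)
    misses = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            misses += 1
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return misses

# a 10-iteration loop (9 taken branches + 1 not-taken exit), executed 100 times
history = ([True] * 9 + [False]) * 100
```

Over these 1000 branches the 1-bit predictor misses 199 times (about 80% accuracy: the exit plus the first iteration of each subsequent execution), while the 2-bit predictor misses only the 100 loop exits (90% accuracy).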
Note: Prediction Accuracy
- 4K 2-bit entries, misprediction rates on SPEC89:
  - nasa7 = 1%
  - matrix300 = 0% (it's all loops, so no surprise)
  - tomcatv = 1%
  - doduc = 5%
  - spice = 9%
  - fpppp = 9%
  - gcc = 12% (the compiler does a lot of IFs, so also no surprise)
  - espresso = 5%
  - eqntott = 18% (ouch!)
  - li = 10%
- With an infinite buffer the values are the same, except: gcc goes to 11%, nasa7 & tomcatv go to 0%
  - this is a very large buffer, hence little collision (aliasing)
  - a useful technique to show potential
Chapter 4 page 46
Improve Prediction Strategy by Correlating Branches
- Consider the worst-case eqntott code fragment:
    if (aa==2) aa=0;
    if (bb==2) bb=0;
    if (aa!=bb) then whatever
  - the third branch is completely determined by the first two: if both assignments execute, then aa == bb == 0, so the third condition always fails
  - single-level predictors can never get this case
- Correlating or 2-level predictors
  - correlation = what happened on the last branch, i.e., T or NT
    - note that the last branch may not always be the same static branch
  - predictor = which way to go, i.e., taken (T) or not taken (NT)
  - 4 possibilities from combining (correlation) x (predictor): (last-taken, last-not-taken) X (predict-taken, predict-not-taken)
Chapter 4 page 47
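Why a single-level predictor "can never get this case" can be seen with a toy stream of perfectly correlated outcomes (the functions and the alternating pattern below are illustrative assumptions, not from the slide):

```python
def last_outcome_predictor(outcomes):
    """1-bit predictor: always predict the previous outcome."""
    pred, misses = True, 0
    for taken in outcomes:
        misses += pred != taken
        pred = taken
    return misses

def correlating_1_1(outcomes):
    """(1,1) predictor: a separate 1-bit predictor for each
    direction the previous branch went."""
    table = {True: True, False: True}
    last, misses = True, 0
    for taken in outcomes:
        misses += table[last] != taken
        table[last] = taken
        last = taken
    return misses

# a perfectly correlated stream: each outcome is the opposite of the last
stream = [True, False] * 50
```

On these 100 outcomes the last-outcome predictor misses 99 times, while the (1,1) predictor misses only once during warm-up: the correlation is exactly what it captures.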
In General: the (m,n) Branch Prediction Buffer (BPB)
- use the last m branches = global branch history
- use n bits per predictor (counter), i.e., if counter > threshold then predict taken; otherwise not taken
- use p bits as index bits to access the BPB
- Total bits needed in the buffer: 2^m x n x 2^p = total memory bits required
- Organization
  - 2^m banks of memory, selected by the global branch history (which is just a shift register), e.g., as a column address
  - use the p index bits to select a row
  - the n predictor bits in the selected entry make the decision
Chapter 4 page 48
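The total-bits formula can be written out directly; the parameter values below are the two 8K-bit configurations compared on the next slide (the function name is mine):

```python
def bpb_bits(m, n, p):
    """Total storage for an (m, n) BPB indexed by p address bits:
    2^m banks x 2^p rows x n predictor bits per entry."""
    return (2 ** m) * n * (2 ** p)

# the two 8K-bit designs compared on the next slide:
four_k_simple = bpb_bits(0, 2, 12)   # 4K-entry (0,2) buffer
one_k_correlating = bpb_bits(2, 2, 10)  # 1K-row (2,2) correlating buffer
```

Both come out to 8192 bits, which is what makes the accuracy comparison on the next slide a fair one.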
How Well Does It Work?
- Let us check it at 8K bits: 4K-entry (0,2) vs. 1K-row (2,2), i.e., a 2-bit predictor BPB vs. a (2,2) correlating BPB
  - nasa7 = 1% vs. 1%
  - matrix300 = 0% vs. 0%
  - tomcatv = 1% vs. 1%
  - doduc = 5% vs. 5%
  - spice = 9% vs. 5%
  - fpppp = 9% vs. 5%
  - gcc = 12% vs. 11%
  - espresso = 5% vs. 4%
  - eqntott = 18% vs. 6% (big win in the worst case)
  - li = 10% vs. 5%
- The open question is whether the application actually has correlating branches
Chapter 4 page 49
Branch Target Buffer (BTB)
- To eliminate the branch penalty, we need to know the target address by the end of IF
  - but the instruction isn't even decoded until ID
  - this implies we have to wait a cycle, and perhaps take a penalty of 1
- Can we use the instruction address rather than wait for decode?
  - if prediction works, the penalty goes to 0!
- The BTB idea
  - use a cache to store taken branches (no need to store untaken ones)
  - the match tag is the PC of the fetched instruction
  - the data field is the predicted branch-target address
  - can also add a predictor field if desired, to avoid the 2 misses on every loop execution
    - but this adds complexity, since we now have to track untaken branches as well
Chapter 4 page 50
Changes in DLX to Incorporate the BTB (flowchart, rendered as a decision flow)
- Send the PC to memory and to the BTB
- Found in BTB?
  - YES: send out the predicted PC
    - branch actually taken? YES: prediction correct, continue with no penalty
    - NO: mispredict; kill the fetched instruction, restart fetch at the other target, delete the entry from the BTB
  - NO: normal execution
    - branch taken? YES: enter the branch address and next PC into the BTB
Chapter 4 page 51
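A minimal sketch of this flowchart's logic (class and method names are mine; the penalty values follow Table 1 on the next slide):

```python
class BTB:
    """Toy branch-target buffer modeling the DLX flowchart.
    Tags are full PCs; a hit implies a predict-taken entry."""
    def __init__(self):
        self.entries = {}              # pc -> predicted target address

    def fetch(self, pc):
        """Return the predicted next PC, or None on a BTB miss."""
        return self.entries.get(pc)

    def resolve(self, pc, taken, target, predicted):
        """Apply the flowchart once the branch resolves; return the penalty."""
        if predicted is not None:      # hit: we sent out the predicted PC
            if taken:
                return 0               # prediction correct, no penalty
            del self.entries[pc]       # mispredict: delete the entry
            return 2                   # kill fetched instruction, restart
        if taken:                      # miss, but the branch was taken
            self.entries[pc] = target  # enter branch addr and next PC
            return 2
        return 0                       # miss and not taken: no penalty

btb = BTB()
```

A loop branch at 0x40 targeting 0x100 pays 2 cycles the first time (miss, taken), then 0 cycles on later hits until the exit mispredicts.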
Penalties for the BTB Approach

Table 1:
  Instruction in buffer?   Prediction   Actual branch   Penalty cycles
  yes                      taken        taken           0
  yes                      taken        not taken       2
  no                       --           taken           2

Chapter 4 page 52
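Table 1 yields an expected penalty per branch once you assume a BTB hit rate, a prediction accuracy, and a taken fraction; the 90%/90%/60% figures below are illustrative assumptions, not numbers from the slide:

```python
def btb_branch_penalty(hit_rate, accuracy, taken_frac):
    """Expected penalty cycles per branch, from Table 1:
    - in buffer, predicted taken, actually not taken: 2 cycles
    - not in buffer but actually taken: 2 cycles"""
    hit_mispredict = hit_rate * (1 - accuracy) * 2
    miss_taken = (1 - hit_rate) * taken_frac * 2
    return hit_mispredict + miss_taken

# assumed numbers: 90% hit rate, 90% accuracy, 60% of branches taken
penalty = btb_branch_penalty(0.90, 0.90, 0.60)
```

With these assumptions the average cost is 0.30 cycles per branch, far below the 1-cycle wait-for-decode alternative.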
Further Improvements on the BTB
- Store instructions rather than the target address
  - increases entry size, but removes Ifetch time
  - permits the BTB to run slower and therefore be larger
  - permits branch folding
    - a branch's job is to change the PC and get the real instruction
    - if you already have the instruction, the branch can be folded out of the way (discarded)
    - result: 0-cycle unconditional branches and 0-cycle properly predicted branches
- Predicting indirect jumps
  - the major source is procedure returns
  - the obvious model is to use a stack
  - note this can be combined with the above to get jump folding
- Approach for reducing misprediction penalties
  - fetch into the instruction buffer from both the taken and not-taken paths; can reduce stalls if the prediction is wrong
Chapter 4 page 53
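The "obvious model" for procedure returns, a return-address stack, can be sketched as follows (the class name, depth, and addresses are assumptions for illustration):

```python
class ReturnAddressStack:
    """Toy return-address-stack predictor for indirect jumps
    caused by procedure returns."""
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        """Push the return address when a call is fetched."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the predicted target when a return is fetched."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x100)   # outer call, return address 0x100
ras.on_call(0x204)   # nested call, return address 0x204
```

Returns pop in LIFO order, matching call nesting, which is why a stack predicts returns far better than a BTB entry keyed on the return instruction's PC.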
Going Beyond the Ideal CPI = 1
- 2 approaches for multiple issue
- Superscalar
  - issues varying numbers of instructions per clock cycle
  - constrained by hazards
  - made possible by multiple functional pipelines
  - scheduling: static (by the compiler) or dynamic (HW supports some form of scoreboarding)
- VLIW (very long instruction word)
  - a long instruction contains several real instructions
  - hence must be statically scheduled by the compiler: the hardware makes no dynamic issuing decisions
Chapter 4 page 54
Consider a Superscalar 2-issue DLX (very similar to the HP 7100)
- Lots of issues even for a 2-issue machine
- Which instructions? (they must be independent)
  - 1 integer: load, store, branch, or integer ALU instruction
  - and 1 floating-point (FPU) instruction
- Need to keep decoding simple: we now deal with 64 bits instead of just 32
  - could require that instructions be paired and aligned on a double-word boundary, which keeps the pair in a cache line (the 7100 has a 4-word I-cache line size)
  - also require the integer instruction to be first: avoids a dynamic swap requirement and much more complicated hazard-interlock control
  - also require that the FP instruction issues only if the INT instruction does
- Seems simple
  - each pipe has its own register set anyway
  - independence of data type means little can go wrong
  - but what about the longer latency of the FP EX pipe?
Chapter 4 page 55
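The pairing rules above can be sketched as a simple issue check (the opcode names and classes are illustrative, not the actual DLX encodings):

```python
# hypothetical opcode classes for the 2-issue pairing rule
INT_OPS = {"load", "store", "branch", "alu"}
FP_OPS = {"fadd", "fmul", "fdiv"}

def issued(first, second):
    """How many of the fetched pair issue this cycle: the integer
    instruction must come first, and the FP instruction issues
    only alongside a legal integer partner."""
    if first in INT_OPS and second in FP_OPS:
        return 2        # legal dual-issue pair
    return 1            # otherwise fall back to single issue
```

Note how cheap the check is: one class lookup per slot, which is exactly why requiring a fixed INT-then-FP order keeps the decode logic simple.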
Dealing with Long FP Latencies: Two Options Again
- Pipeline the FP units (FPUs)
  - can still launch an FP instruction every cycle
  - long latency may cause out-of-order completion w.r.t. the INT pipe
  - increases complexity to prevent hazards, e.g., scoreboarding
  - stalls will still occur, but with reasonable independence they can be minimized
- Use multiple FPUs
  - the circuitry for MULT, DIV, SQRT, and ADD/SUB tends to be different anyway
  - could further replicate units to match typical instruction frequencies
  - must do so carefully, since FPUs are expensive
  - issue can be based on a normal structural-hazard check
- Doing both is the more common approach
Chapter 4 page 56
Problems So Far
- Look at the opcodes to see if the pair is an appropriate issue pair
- Some integer ops are a problem
  - e.g., FP register loads, since the other instruction may be dependent, hence a stall will result. Options?
    - force FP loads, stores, and moves to issue by themselves: safe, but suboptimal since the other instruction may be independent
    - OR add more ports to the FP register file (separate read and write ports): still must stall the 2nd instruction if it is dependent
- Other issues
  - hazard detection is similar to the normal pipeline model
  - the 1-cycle load delay now covers 3 instruction slots
  - the 1-instruction branch delay now holds 3 instructions as well
  - so instruction scheduling becomes more important
Chapter 4 page 57
Advantages of Superscalar over the VLIW Option
- Old codes still run
  - like those tools you have that came only as binaries
  - HW detects whether the instruction pair is a legal dual-issue pair; if not, the two run sequentially
- Little impact on code density
  - no need to fill all of the "can't issue here" slots with NOPs
- Compiler issues are very similar
  - still need to do instruction scheduling anyway
  - one new goal: try hard to create dual-issue pairs
  - the hardware is there, so the compiler doesn't have to be too conservative
Chapter 4 page 58
How Well Does It Work?
- Text example (Fig. 4.27), a scalar-vector sum: shows a 50% improvement over scheduled single issue
- HP 7100 experience
  - getting this simple 2-way design right is not that hard
  - most applications show a 50%-70% speedup
  - no applications slow down; however, code containing a lot of branches doesn't speed up much
- What did slow down? Compiler execution
  - no floating point, so the best the compiler itself could do is single issue anyway
  - and the compilers got more complex trying to schedule for the dual-issue option
- Still basically a win
  - got 1/2-3/4 of the ideal speedup
  - too bad this can't happen for n-issue
Chapter 4 page 59
Dynamic Scheduling
- Scoreboarding required, so pick a dataflow basis: Tomasulo's algorithm
  - issue, and let the reservation stations sort it out
  - but still cannot issue a dependent pair
- 2 options for fixing the dependent-pair problem
  - pipeline the IF/ID stage and run it twice as fast as the EX... stages
    - this isn't that hard, since IF and ID are pretty simple for RISCs
  - decoupling: provide queues as destinations for loads, moves, and stores
    - sort of a virtual-register/renaming style approach
    - the scoreboard becomes more complex, but performance is likely to be enhanced
- The compiler still plays a major role
  - e.g., do the best you can with static scheduling, and then do a little better with the dynamic back-up
Chapter 4 page 60
Limitations on Multiple Issue
- How much ILP can be found in the application?
  - the most fundamental of the problems
  - requires deep unrolling, hence the big focus on loops
  - compiler complexity goes way up
  - deep unrolling needs lots of registers, hence the need for renaming and lots of additional registers in the machine for rename targets
- Increased HW cost
  - more ports on the register files
  - cost of scoreboarding and forwarding paths
  - memory bandwidth requirement goes up
    - most have gone with separate I and D ports already
    - the newest approaches go for multiple D ports as well: big-time expense!!
  - branch prediction HW is an absolute must
- Still, multiple issue seems to be the trend today
Chapter 4 page 61
Compiler Support for ILP
- The trick is to find and remove dependencies
  - simple for single-variable accesses
  - more complex for pointers, array references, etc.
- Mostly it's about loops and trying to unroll them
  - helps if dependencies are non-cyclic, or if the recurrence dependence distance is large
  - can effectively test dependence between two references to the same array element in a loop, e.g., a GCD test on affine index functions a*i + b, where a and b are constants
- Things that make analysis difficult
  - references via pointers rather than array indices
  - indirect references (e.g., through another array, as in sparse-array representations)
  - false dependency: for some values a dependence may exist, but those values don't really occur; requires run-time checks to determine
- In general: an NP-hard problem
  - specific cases can be done precisely
  - current problem: lots of special cases that don't apply often
Chapter 4 page 62
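The GCD test mentioned above works as follows: for a write to A[a*i + b] and a read of A[c*i + d], a loop-carried dependence can exist only if gcd(a, c) divides (d - b). A sketch (the function name is mine):

```python
from math import gcd

def may_depend(a, b, c, d):
    """GCD dependence test for a loop writing A[a*i + b] and
    reading A[c*i + d], with a, b, c, d constant.
    Returns False only when a dependence is provably impossible."""
    return (d - b) % gcd(a, c) == 0

# A[2*i] written vs. A[2*i + 1] read: even vs. odd elements, no dependence
# A[i] written vs. A[i - 1] read: a genuine recurrence, may depend
```

Note the asymmetry: a True result only means a dependence is possible, which is why the slide calls the general problem NP-hard and the precise answers limited to special cases.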
Software Pipelining
- a.k.a. symbolic loop unrolling
- the idea is to separate the dependences in the original loop body
- code structure: startup, fully pipelined section, cleanup
- register management can be tricky, but the idea is to turn the code into a single loop body
- in practice, both unrolling and software pipelining will be necessary, due to register limitations
Chapter 4 page 63
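The startup / steady-state / cleanup structure can be illustrated in scalar code; the b[i] = a[i] + c loop below is the classic textbook example, not one from the slide, and the names are mine:

```python
def add_scalar_pipelined(a, c):
    """Software-pipelined sketch of: for i: b[i] = a[i] + c.
    Each steady-state pass holds the store of iteration i,
    the add of iteration i+1, and the load of iteration i+2."""
    n = len(a)
    b = [0] * n
    if n < 3:                  # too short to pipeline
        for i in range(n):
            b[i] = a[i] + c
        return b
    # startup: fill the pipeline
    r = a[0]                   # load for iteration 0
    s = r + c                  # add  for iteration 0
    r = a[1]                   # load for iteration 1
    # steady state: one store, one add, one load per pass
    for i in range(n - 2):
        b[i] = s               # store iteration i
        s = r + c              # add   iteration i+1
        r = a[i + 2]           # load  iteration i+2
    # cleanup: drain the pipeline
    b[n - 2] = s
    b[n - 1] = r + c
    return b
```

The steady-state body contains no load-to-use dependence within a single pass, which is exactly the stall the original loop body suffers; the price is the extra startup and cleanup code the slide mentions.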
Trace Scheduling
- Looking for the critical path across conditional branches
- Two separate processes
  - trace selection
    - predict branches to give long sequences of instructions
    - each possible sequence is a separate trace
    - selection will depend on which way the conditions really fall
  - trace compaction
    - global instruction scheduling over the entire trace
    - the trick is to effectively move instructions across predicted branches
    - this is speculation based on prediction, so exceptions become an issue: a high exception probability blocks motion of, e.g., memory references (which may cause a page fault)
    - also, on a misprediction you have to clean things up, which is a penalty
Chapter 4 page 64
Some Things to Notice
- SW pipelining, loop unrolling, and trace scheduling are not totally independent techniques
  - all try to avoid dependence-induced stalls
  - each has a slightly different primary focus
- Primary focus
  - unrolling: reduce the loop overhead of index modification and branch
  - SW pipelining: reduce single-body dependence stalls
  - trace scheduling: reduce the impact of branches
- Most advanced compilers attempt all 3
  - the result is a hybrid which blurs the differences
  - lots of special-case analysis changes the hybrid mix
- All tend to fail if branch prediction is unreliable
Chapter 4 page 65