
Module: Branch Prediction
Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)

Reading for this Module
- Branch Prediction: Appendix A.2 (pg. A-21 to A-26), Section 4.2, Section 3.4
- Branch Target Buffers: Section 3.5
- Performance of Branch Prediction Schemes: Section 3.8, pg. 245-248
- The Trace Cache: Section 4.4, pg. 332-338
- Additional Ref: E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996.

ECE 4100/6100

Control Dependencies

(Figure: processor datapath with I-fetch, execution core, and retire stages, annotated with the dependency classes: structural, data (anti, output), name, and control.)

Control dependencies determine the execution order of instructions; instructions may be control dependent on a branch:

  DADD R5, R6, R7
  BNE  R4, R2, CONTINUE
  DMUL R4, R2, R5
  DSUB R4, R9, R5

Goal: maximize instruction-fetch bandwidth via branch prediction. How do we improve prediction accuracy and reduce branch penalties?

The Problem with Branches

Branches introduce pipeline bubbles and therefore performance degradation. There are two types of branches: unconditional and conditional. In addition, some branch instructions calculate the branch target by adding (or subtracting) a constant to the PC, while others necessitate reading the register file; most of the branches encountered in program execution belong to the former.

Conditional Branches

  DADD R5, R6, R7
  BNE  R4, R1, L1
  ...
L1: ...

(Figure: pipeline with IF, ID, instruction issue, and EX INT / EX FP / EX BR, MEM, WB stages.)

For general pipelines, penalties occur because of the timing of:
- Branch target address generation: PC-relative address generation can occur after instruction fetch
- Branch condition resolution: in what cycle is the condition known?

Handling Branches

(Figure: five-stage pipeline with IF/ID, ID/EX, EX/MEM, MEM/WB latches and stages for instruction decode & register read, ALU, memory operation, and register writeback; the instruction is decoded as a branch, e.g. BNE R1, R2, Loop, and the figure marks where the branch outcome becomes known.)

Simple solution: stall the pipeline. What determines the branch penalty?

Branch Bubbles

(Figure: pipeline timing diagram over clocks 1-9 for the program-order sequence S1: br K1; S2; S3; S4; K1. The sequentially fetched instructions S2-S4 occupy the pipeline until the branch resolves and must be squashed before K1 enters the pipeline.)

Unconditional Branches

Unconditional branch handling should be moved to the earliest possible part of the pipeline, i.e., the DECODE stage, to reduce the number of bubbles inserted. However, some unconditional branches require register reads, e.g., procedure return.

Unconditional Branch

(Figure: pipeline timing diagram for S1: br K1; S2; K1. With the branch handled in DECODE, only one bubble is incurred.)

Branch Delay Slots

  ADD.D F0, F2, F4
  BNEZ  R3, L1
  <delay slot>
  ...
L1: ADD.D F6, F8, F10

Compiler scheduling moves instructions into the branch delay slot: from before the branch (e.g., hoisting the ADD.D into the slot), from the branch target, or from the fall-through path. The compiler must account for side effects when the branch goes the unexpected way.

Branch Prediction

Purpose: to steer the PC as accurately and as early as possible with respect to conditional branches. There are four possible prediction/outcome pairs:
- T/T: predicted taken, and the branch was indeed taken
- NT/NT: predicted not taken, and the branch was not taken
- T/NT: predicted taken, but the branch was not taken
- NT/T: predicted not taken, but the branch was taken

The latter two are the branch misprediction pairs.

Branch Prediction Strategies

There are three major classes of branch prediction strategies: static, semi-static, and dynamic.

Static Prediction

Static predictors are simple, hardwired strategies. The two main examples are: always predict taken, and always predict not taken. Both are very ineffective.

Static Branch Prediction

(Figure: pipeline with IF, ID, instruction issue, and EX INT / EX FP / EX BR, MEM, WB stages.)

The total number of stalls can be reduced, but performance is very dependent on a priori understanding of branch behavior:
- Based on extensive profiling
- Based on the instruction opcode (Motorola 88110)
- Based on the relative offset (IBM RS/6000), e.g., backward (negative-offset) branches predicted taken

Semi-static Strategies

Sometimes the programmer or the compiler can do a fairly good job of predicting branches. A bit in each branch instruction indicates whether the branch is likely to be taken; this is sometimes called the Take/Don't-Take Bit (TDTB). The DECODE stage can steer instruction fetch accordingly.

TDTB in Action (1)

(Figure: pipeline timing diagram for S1: br K1 with TDTB=1. Fetch is resteered to K1 after decode and the prediction is confirmed later in the pipeline; only one bubble is incurred on a correct prediction with TDTB=1.)

TDTB in Action (2)

(Figure: pipeline timing diagram for a mispredicted branch: the speculatively fetched instructions K1 and K2 are cancelled when the prediction is confirmed wrong, and the PC is resteered to S2. Penalty of 3 cycles.)

Dynamic Branch Prediction Strategies

Use past behavior to predict the future: branches show surprisingly good correlation with one another; they are not totally random events. A general model captures history in an n-bit shift register, recording the last branch behavior (taken or not taken), and uses it to make predictions. From: Modern Processor Design: Fundamentals of Superscalar Processors, J. Shen and M. Lipasti.

Schemes Based on Local History

A simple scheme is to maintain a single T bit for each branch instruction. If T = 0, predict the current instantiation of the branch as not taken; if T = 1, predict it as taken. When the branch is confirmed, set T to 0 or 1 depending on whether it was confirmed not taken or taken, respectively. This works well when the last behavior is a good indicator of future behavior.

Single-bit Branch Predictor

(Figure: a table of 2^K 1-bit predictors indexed by the K least significant bits of the PC; the taken/not-taken confirmation updates the indexed entry. Two branches whose PCs share their low K bits alias to the same entry.)
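The single-T-bit scheme above can be sketched in a few lines of Python. This is a minimal illustration, not hardware; the class and parameter names are invented for the example, and the table is indexed by the K least significant bits of the PC as in the slide.

```python
class OneBitPredictor:
    """Last-outcome (1-bit) branch predictor with a 2^K-entry table."""

    def __init__(self, k=4):
        self.k = k
        self.table = [0] * (1 << k)  # T bit per entry: 0 = not taken, 1 = taken

    def index(self, pc):
        return pc & ((1 << self.k) - 1)  # K least significant bits of the PC

    def predict(self, pc):
        return self.table[self.index(pc)] == 1  # True = predict taken

    def confirm(self, pc, taken):
        # On confirmation, overwrite T with the actual outcome
        self.table[self.index(pc)] = 1 if taken else 0
```

Note that two PCs sharing their low K bits update the same entry, so one hot branch can disturb another's prediction: this is the aliasing problem noted in the slide.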

Two-Bit Predictors

The 1-bit predictor is quite ineffective. Instead, use the result of the last two branches to change the prediction: a four-state machine with two "predict taken" states and two "predict not taken" states, where a single outcome in the opposite direction moves between states with the same prediction, and two consecutive opposite outcomes flip the prediction. This scheme can be generalized to n-bit counters.

Bimodal Branch Predictor

The PC is used to index a set of counters rather than single branch prediction bits. Each counter is an n-bit unsigned integer. If the branch is confirmed taken, the counter is incremented by one; if it is confirmed not taken, the counter is decremented by one. The counters are saturating: if the value is zero, decrement operations are ignored; if the value is at the maximum, increment operations are ignored.

Bimodal Branch Predictor

(Figure: a table of 2^K n-bit saturating counters indexed by the K least significant bits of the PC; the taken/not-taken confirmation updates the indexed counter.)

Prediction: if the most significant bit of the counter is one, predict taken, else predict not taken. This can tolerate occasional changes in branch direction. The problem of aliasing (two PCs mapping to the same counter) depends on the size of the table. The predictor is useful when the branch address is known before the branch condition is known, so as to support prefetching.
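The bimodal counter table above can be sketched as follows; a minimal illustration with invented names, using the slide's rules: saturating increment/decrement on confirmation, and "predict taken" when the most significant counter bit is set.

```python
class BimodalPredictor:
    """Bimodal predictor: 2^K n-bit saturating counters indexed by the PC."""

    def __init__(self, k=12, n=2):
        self.k, self.n = k, n
        self.max = (1 << n) - 1
        self.counters = [0] * (1 << k)

    def index(self, pc):
        return pc & ((1 << self.k) - 1)  # K least significant bits of the PC

    def predict(self, pc):
        # Predict taken iff the counter's most significant bit is one
        return self.counters[self.index(pc)] >= (1 << (self.n - 1))

    def confirm(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max)  # saturate up
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)  # saturate down
```

With n = 2 this implements the four-state machine of the previous slide: a branch that has saturated at 3 still predicts taken after one not-taken outcome, which is how the predictor tolerates occasional direction changes.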

Performance Comparison

The size and resolution of predictors are established empirically. (Figure: performance improvement beyond 4K-entry buffer sizes is minimal.) Performance is a function of accuracy, branch penalties, and branch frequency.

Branch Prediction

(Figure: misprediction rates for 4K entries with 2-bit predictors.) Improvements in accuracy beyond counter resolution are required. Note: integer programs have higher branch frequency.

Global Branch Predictor

The bimodal predictor does not take other branches into consideration. A global predictor uses a shift register (GR) to store the history of the last k branches, with a table of counters maintained in the same way as in the bimodal predictor. When a branch is confirmed, a 1 or 0 (depending on whether the branch was taken or not) is shifted into GR, and the new value of GR is used as the address of the counter in the next prediction.

Basic Idea

(Figure: a control-flow graph of blocks B1-B7 with taken/fall-through edges; each path through it, e.g. 111 or 101, corresponds to a distinct shift-register value.) The shift register captures the path through the program; for each unique path a counter is maintained, and prediction is based on the behavior history of each path. The shift register length determines the program region size.

Global Branch Predictor

(Figure: a table of counters indexed directly by GR; the taken/not-taken confirmation updates the indexed counter.)

Tackling Double Loops

The global predictor can tackle double loops with short inner loops:

  for (I=0; I<100; I++)
    for (J=0; J<3; J++)
      ...

The history distinguishes each iteration of the inner-loop conditional C2 from the others, even though the PC is the same:

  Conditional   GR     Result
  C2 (J=1)      1101   taken
  C2 (J=2)      1011   taken
  C2 (J=3)      0111   not taken
  C1            1110   taken
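The GR mechanism above can be sketched as a small Python class; an illustration with invented names, assuming 2-bit saturating counters (as in the bimodal table) indexed directly by the global history register.

```python
class GlobalPredictor:
    """Global predictor: 2-bit counters indexed by a global history register."""

    def __init__(self, hist_bits=4):
        self.hist_bits = hist_bits
        self.gr = 0  # global history shift register
        self.counters = [0] * (1 << hist_bits)

    def predict(self):
        # The current GR value addresses the counter used for the prediction
        return self.counters[self.gr] >= 2

    def confirm(self, taken):
        c = self.counters[self.gr]
        self.counters[self.gr] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the confirmed outcome into GR; the new GR value indexes
        # the counter consulted for the next prediction
        self.gr = ((self.gr << 1) | (1 if taken else 0)) & ((1 << self.hist_bits) - 1)
```

With hist_bits = 4 this is exactly the setup of the double-loop example: after warm-up, the histories 1101, 1011, and 0111 each address their own counter, so the J=3 "not taken" iteration no longer disturbs the prediction for J=1 and J=2.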

The gselect Predictor

(Figure: the counter-table index is formed by concatenating n bits of GR with m bits of the PC, giving an (m+n)-bit index.) The PC is used to select from among otherwise identical GR patterns, overcoming the aliasing problem of the pure (GR-only) global predictor.

Another View

Instead of having one predictor per branch, keep a predictor for each recent history of branch decisions: e.g., a 4-bit branch address selects a row of predictors, and a 3-bit global history across the last three branches selects which n-bit predictor in that row to use. This captures history for a specific branch instruction under each history pattern.
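The gselect index can be written down directly; a minimal sketch with invented names, showing only the concatenation of PC bits and history bits (the counter table itself is the same as in the bimodal predictor).

```python
def gselect_index(pc, gr, pc_bits=4, gr_bits=4):
    """gselect: concatenate the low PC bits with the global-history bits.

    The PC bits select a 'bag' of counters; within it, the GR bits pick
    one, so two branches with the same history no longer alias unless
    their low PC bits also match.
    """
    pc_part = pc & ((1 << pc_bits) - 1)  # m low bits of the PC
    gr_part = gr & ((1 << gr_bits) - 1)  # n bits of global history
    return (pc_part << gr_bits) | gr_part  # (m+n)-bit table index
```

For example, with 4 bits of each, PC bits 1011 and history 0110 concatenate to the 8-bit index 10110110.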

The gshare Predictor

(Figure: the counter-table index is the XOR of n bits of the PC with n bits of GR.) This is a further combination (hashing) of the PC and GR: the counters are not distinguishable by PC or GR alone; they are shared.

Combining Predictors

Idea: combine a local with a global predictor, and use whichever is more accurate for a particular branch. A separate table tracks which predictor is more accurate. The approach can be extended to incorporate a number of different predictors. In experiments it has been shown to be 98% accurate; a variant is used in the Compaq Alpha 21264 processor.
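The gshare hash is even simpler than gselect; a minimal sketch with invented names. Unlike concatenation, the XOR lets all n bits of both the PC and the history contribute to every index bit, so the full index width is available to both sources.

```python
def gshare_index(pc, gr, bits=12):
    """gshare: XOR-hash the PC with the global history register.

    The resulting index is shared: a given counter is not attributable
    to a single PC or a single history pattern alone.
    """
    mask = (1 << bits) - 1
    return (pc ^ gr) & mask
```

Two branches map to the same counter only when their PC and history bits happen to XOR to the same value, which in practice spreads the counters more evenly than gselect for the same table size.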

Multi-level Predictors

Use multiple predictors and choose between them, employing predictors based on both local and global information; this is the state of the art. (Figure: a four-state selector, two "use predictor 1" states and two "use predictor 2" states, whose transitions are determined by the accuracy of the individual predictors.)

The Alpha 21264 Multi-Level Predictor

(Figure: a global predictor with a 12-bit history indexing 4K 2-bit counters; a two-level local predictor in which the branch address indexes a 1024-entry table of 10-bit local histories, each of which indexes 3-bit saturating counters; and a 4K-entry table of 2-bit selectors choosing between the two.)

A Combined Predictor

(Figure: a gshare predictor (PC XOR GR indexing a table of counters) producing prediction P1 and a bimodal predictor producing P2; a predictor-selection table of counters, addressed by the PC, chooses between "use P1" and "use P2".)

The Combined Predictor

The table of counters that determines whether P1 or P2 is used is updated as follows:

  P1 correct?   P2 correct?   P1c - P2c
  0             0              0
  0             1             -1
  1             0             +1
  1             1              0

The P1c - P2c value is added to the counter addressed by the PC. This gives flexible, dynamic use of whichever predictor is more accurate.
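The selection-table update rule above can be written directly from the truth table; a minimal sketch with invented names, assuming 2-bit saturating selection counters whose most significant bit picks the predictor.

```python
def selector_update(counter, p1_correct, p2_correct, max_val=3):
    """Add (P1 correct) - (P2 correct) to the saturating selection counter."""
    delta = int(p1_correct) - int(p2_correct)  # +1, 0, or -1 per the table
    return min(max(counter + delta, 0), max_val)

def use_p1(counter):
    # With 2-bit counters, the most significant bit selects the predictor
    return counter >= 2
```

When both predictors agree in correctness the counter is unchanged; repeated runs of "P1 right, P2 wrong" saturate the counter high and lock the selection onto P1 for that PC, and vice versa.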

Combined Predictor Performance

(Figure: prediction accuracy of selecting between a 2-bit local predictor and gshare.)

Misprediction Recovery

What actions must be taken on a misprediction? Remove the predicted instructions and start fetching from the correct branch target. What information is necessary to recover from a misprediction?
- Address information for the non-predicted branch target
- Identification of the instructions that were fetched under the prediction, so they can be invalidated and prevented from completing
- The association between predicted instructions and the specific branch, so that when that branch is mispredicted only those instructions are squashed

Branch Target Buffer

A cache that contains three pieces of information:
- The addresses of branch instructions: the BTB is managed like a cache, and the addresses of branch instructions are kept for lookup purposes
- The branch target address: to avoid re-computation of the branch target address where possible
- Prediction statistics: different strategies are possible for maintaining this portion of the BTB

Branch Target Buffers

(Figure: the PC is searched against a table of prior branch PCs, each paired with the corresponding target PC and prediction info.) The BTB is accessed in parallel with the instruction cache. A hit produces the corresponding branch target address to fetch from; a miss means no action.

Branch Target Buffer Operation

(Flowchart: on instruction fetch, check for a BTB hit. On a miss, decode; if the instruction is a branch, update the BTB with the new branch, otherwise execute normally. On a hit, fetch from the predicted target; if the prediction turns out wrong, perform misprediction recovery and update the BTB, otherwise continue normal execution.)

Branch Target Buffers: Operation

Couple speculative generation of the branch target address with branch prediction: continue to fetch while resolving the branch condition, and take appropriate action if wrong. Any of the preceding history-based techniques can be used for branch condition speculation; the prediction information (e.g., n-bit predictors) is stored along with the BTB entry. Branch folding optimization: store the target instruction rather than the target address.
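The lookup/update flow above can be sketched as a tiny model; a deliberately simplified illustration with invented names (a real BTB is a fixed-size, set-associative structure, whereas this sketch uses an unbounded dictionary keyed by branch PC).

```python
class BranchTargetBuffer:
    """Sketch of a BTB: branch PC -> (target address, taken prediction)."""

    def __init__(self):
        self.entries = {}  # simplification: unbounded, fully associative

    def lookup(self, pc):
        # Hit: the fetch unit can resteer to the stored target next cycle.
        # Miss: returns None; fetch falls through sequentially.
        return self.entries.get(pc)

    def update(self, pc, target, taken):
        # Called when a new branch is decoded, or on misprediction recovery
        self.entries[pc] = (target, taken)
```

The lookup happens with only the PC in hand, before the instruction is even decoded; that is exactly why the BTB must store branch addresses for matching, as the slide notes.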

Return Address Predictors

Prediction mechanisms are needed for indirect jumps, that is, for addresses generated at run time, such as return addresses. Return addresses are pushed onto a stack in the fetch unit; when the fetch unit sees a return in its instruction stream, it immediately pops the return stack and fetches from the popped address. BTB accuracy can be degraded by procedures called from multiple locations, which a return stack handles naturally.

An Integrated Solution

(Figure: a fetch unit combining branch prediction, a return address stack (RAS), and a branch target buffer; BTB logic selects the target address supplied to an interleaved instruction cache, and instructions flow to the CPU.) Concurrently check whether each entry in an I-cache line is a branch, and perform multi-branch prediction using a variant of global history. Based on the design reported in E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996.
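The fetch-unit return stack can be sketched as follows; an illustration with invented names, assuming a fixed depth where overflow silently drops the oldest entry (real designs vary in their overflow and repair policies).

```python
class ReturnAddressStack:
    """Sketch of a fetch-unit return address stack (RAS)."""

    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        # Push the fall-through PC when a call is seen in the fetch stream
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: drop the oldest on overflow
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted return target; None if the stack has underflowed
        return self.stack.pop() if self.stack else None
```

Because each call site pushes its own fall-through address, a procedure called from many locations still returns to the right place, which is precisely the case that degrades a BTB entry keyed only on the return instruction's PC.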

Further Analysis

Even with accurate branch prediction, we must fetch instructions from multiple targets. What is the effect on instruction bandwidth and pipeline performance, and how can we increase instruction fetch bandwidth to compensate? Assume perfect branch prediction: the instructions of consecutive basic blocks (B1-B6) are located in different cache lines. Exploit instruction locality plus branch prediction!

Challenges to Increasing Fetch Bandwidth

- Pipeline latency vs. bandwidth
- Instruction alignment in the I-cache
- Predicting multiple branches per cycle as ILP grows

What happens as ILP increases? Consider the impact on memory bandwidth (especially fetch bandwidth), on branch predictor throughput, and on the I-cache.

Some Branch Statistics

(Figure: branch statistics from E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996.)

A Program Trace

An instruction trace is the actual sequence of executed instructions. We can block the trace by #instructions/block and treat the trace blocks as units of prediction, based on predictions of the individual branches in a block. Creating program regions for analysis is a common technique. Note that the I-cache stores a static program description, whereas a trace through the control-flow graph (blocks B1-B7) is dynamic.

The Trace Cache: Principle

(Figure: an instruction trace, e.g. B1 B3 ... B1 B4 B2 B6, partitioned at branch instructions.) Store recurring sequences of basic blocks: these form a contiguous sequence of instructions, a "big basic block", from which multiple instructions can be issued at a high rate. The trace length is determined by limits on #instructions and #branches. Predict and fetch traces, rather than lines in the instruction cache.

The Trace Cache: The Problem

Fetching multiple blocks per cycle from the instruction cache alone would need to identify multiple blocks (some form of branch target table?), a multi-ported instruction cache to fetch from multiple targets, and instruction alignment to feed the decoder; it would most likely add a pipeline stage after instruction fetch.

The Trace Cache

Basic idea: fetch the trace selected by a multiple-branch predictor. The trace cache complements a core (standard) fetch unit and reconstructs traces in parallel with it. Ref: E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996. The following figures are from this reference.

(Figure: the first time through the path B1 B2 B5 B6 of the control-flow graph B1-B7, the trace cache records the trace instructions; the second time through, the whole trace B1 B2 B5 B6 is accessed at once and sent to the decoder.)

The Trace Cache

(Figure: the fetch address indexes the trace cache and the BTB/I-cache in parallel; a trace-cache line holds a tag, branch flags, a branch mask, fall-through and target addresses, and up to n instructions; a RAS and a multiple-branch predictor feed the trace logic, which captures and fills the trace history.) Trace length is determined by the dispatch bandwidth and the branch prediction bandwidth. Lookup of the trace history proceeds in parallel with the instruction cache, using the first address plus branch prediction bits to index the cache.

Data Structure

- Valid bit: indicates whether the trace is valid
- Tag: identifies the starting address
- Branch flags: record the branching behavior of the trace

Data Structure (continued)

- Branch mask: indicates the number of branches in the trace and whether the trace ends in a branch
- Trace fall-through address: the next fetch address if the last branch in the trace is predicted not taken
- Trace target address: the fetch address if the last branch is taken

On a Hit

There is a hit in the trace cache if the fetch address matches the tag and the branch predictions match the branch flags. On a hit, instructions come from the trace cache; otherwise they come from the core fetch unit.
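The hit condition above is a pair of comparisons and can be written directly; a minimal sketch with invented field names, modeling a trace-cache line as a dictionary with the fields listed in the data-structure slides.

```python
def trace_cache_hit(entry, fetch_addr, predicted_flags):
    """Trace-cache hit test: the line must be valid, the fetch address must
    match the tag, and the predicted branch directions must match the
    branch flags recorded when the trace was built."""
    return (entry["valid"]
            and entry["tag"] == fetch_addr
            and entry["branch_flags"] == predicted_flags)
```

Note that a valid line starting at the right address still misses if the multiple-branch predictor disagrees with the recorded directions, because the stored trace then describes a path the predictor does not expect to execute; in that case the core fetch unit supplies instructions instead.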

On a Miss

The core fetch unit takes over the responsibility of supplying instructions, and the trace cache uses what the core fetch unit supplies to build up its cache line.

Performance

(Figure: trace cache performance results.)

Performance

(Figure: further trace cache performance results.)

The Real Deal: P4 Microarchitecture

The front-end branch target buffer (4K entries) is used when a fetch misses in the trace BTB. If no BTB entry is found, static prediction is used: backward branches are predicted taken, with static prediction applying an offset threshold. There is also an indirect branch predictor. From "The Microarchitecture of the Intel Pentium 4 Processor," Intel Technology Journal, February 2004 & February 2001.

P4 Microarchitecture Front-end

(Figure: the front-end BTB (4K entries) and I-TLB feed the prefetcher and instruction decoder; decoded µops fill the trace cache (12K µops), which has its own 512-entry BTB; up to 3 µops/cycle are delivered to the µop queue.) The trace cache holds 6 µops per trace line (a trace may span many lines), has its own branch predictor, and has its own 512-entry BTB with a 16-entry return address stack.

The Real Deal: Power5

A two-level prediction scheme is shared by the two threads: one bimodal and one path-correlated predictor, plus a third that predicts which of the first two is correct. Function/procedure returns and predicted targets are handled separately. The branch instruction queue (BIQ) stores recovery information for mispredictions; branches are retired in program order. R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, March/April 2004.

Power5 Execution Pipeline

(Figure: the Power5 execution pipeline; the two threads share instruction fetch, the instruction cache, and branch prediction.) High branch prediction throughput: all branches can be predicted. R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, March/April 2004.

Some Research Questions

- Quality of branch prediction
- Improving branch predictor throughput
- Power efficiency of branch prediction logic
- Speculative execution
- Is the focus moving up a level to multi-core, many-core, any core? Has the era of ILP stabilized?

Concluding Remarks

Handling control flow is a challenge to keeping the execution core fed; prediction and recovery mechanisms are key to keeping the pipeline active and avoiding performance degradation. Superscalar datapaths apply increased pressure, pushing for better, more innovative techniques to keep pace with the technology-enabled appetite for instruction-level parallelism. What next?

Study Guide

- What are the basic approaches to branch prediction? What properties of the program does each predictor rely upon?
- Given a predictor, describe program structures/behaviors for which this predictor will work well or work poorly.
- Behavior of branch predictors: given a program trace (including taken/not-taken outcomes of branches), be able to trace through the states of a predictor.
- Behavior of the BTB: given a program trace, be able to show the BTB contents at each point in the fetch sequence, and trace pipeline operation on a BTB miss.

Study Guide (continued)

- What is a trace cache? Describe its basic operation.
- What are the properties of programs for which a trace cache is a good solution?
- Compare and contrast each type of branch predictor.
- Given a set of program behaviors/statistics, design a branch prediction strategy and implementation.