Dynamic Hardware Prediction. Basic Branch Prediction Buffers. N-bit Branch Prediction Buffers


Chester Barrett
6 years ago

Dynamic Hardware Prediction

Importance of control dependences
  Branches and jumps are frequent
  Limiting factor as ILP increases (Amdahl's law)
Schemes to attack control dependences
  Static
    Basic (stall the pipeline)
    Predict-not-taken and predict-taken
    Delayed branch and canceling branch
  Dynamic predictors
Effectiveness of dynamic prediction schemes
  Accuracy
  Cost of a correctly predicted branch
  Cost of an incorrectly predicted branch

Basic Branch Prediction Buffers

a.k.a. Branch History Table (BHT)
  Small direct-mapped cache of T/NT bits

[Figure: BHT lookup. Inputs: IR (branch instruction) and PC. T (predict taken) selects the branch target; NT (predict not taken) selects PC + 4.]

N-bit Branch Prediction Buffers

Use an n-bit saturating counter
  Only the loop exit causes a misprediction
  2-bit predictor almost as good as any general n-bit predictor

[Figure: 2-bit predictor state diagram with two "predict taken" states and two "predict not taken" states; transitions are labeled taken / not taken.]
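The claim that only the loop exit causes a misprediction can be checked with a small simulation. The following is a minimal sketch, not taken from the slides: the class name, the loop shape (9 taken iterations followed by an exit, run 3 times), and the initial counter values are all illustrative assumptions.

```python
class SaturatingCounter:
    """n-bit saturating counter; predicts taken in the upper half of its range."""

    def __init__(self, bits=2, initial=0):
        self.max_value = (1 << bits) - 1   # e.g. 3 for a 2-bit counter
        self.threshold = 1 << (bits - 1)   # e.g. 2 for a 2-bit counter
        self.value = initial

    def predict(self):
        """Return True for 'predict taken', False for 'predict not taken'."""
        return self.value >= self.threshold

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at both ends."""
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def count_mispredictions(counter, outcomes):
    """Predict each outcome, then train the counter; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken 9 times, not taken at the exit, loop run 3 times.
loop_outcomes = ([True] * 9 + [False]) * 3

one_bit_misses = count_mispredictions(SaturatingCounter(bits=1, initial=1), loop_outcomes)
two_bit_misses = count_mispredictions(SaturatingCounter(bits=2, initial=3), loop_outcomes)
print("1-bit misses:", one_bit_misses)
print("2-bit misses:", two_bit_misses)
```

Under these assumptions the 1-bit predictor misses at each loop exit and again on re-entry, while the 2-bit counter misses only at the exits.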

Correlating Predictors

a.k.a. Two-level Predictors
  Use recent behavior of other (previous) branches

[Figure: BHT lookup as before (IR, PC, branch target, PC + 4), extended with a 1-bit global branch history that stores the behavior of the previous branch (T/NT) and selects between the T and NT predictions.]

Example

    if (d == 0) d = 1;
    if (d == 1) whatever;

          BNEZ  R1, L1        ; branch b1 (d != 0)
          ADDI  R1, R0, #1
    L1:   SUBUI R3, R1, #1
          BNEZ  R3, L2        ; branch b2 (d != 1)
          ...
    L2:

Basic one-bit predictor

    d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
    2    NT       T          T            NT       T          T
    0    T        NT         NT           T        NT         NT
    2    NT       T          T            NT       T          T
    0    T        NT         NT           T        NT         NT

One-bit predictor with one-bit correlation

    d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
    2    NT/NT    T          T/NT         NT/NT    T          NT/T
    0    T/NT     NT         T/NT         NT/T     NT         NT/T
    2    T/NT     T          T/NT         NT/T     T          NT/T
    0    T/NT     NT         T/NT         NT/T     NT         NT/T

(m, n) Predictors

Use behavior of the last m branches
  2^m n-bit predictors for each branch
Simple implementation
  Use m-bit shift register to record the behavior of the last m branches

[Figure: the m-bit global branch history (GBH) is combined with the PC to index the (m,n) BPF, which supplies an n-bit predictor.]
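The two tables can be reproduced with a short simulation of branches b1 and b2 as d alternates between 2 and 0. This is a sketch under stated assumptions: the function name is hypothetical, every predictor starts as not taken, the global history starts as not taken, and a pair X/Y is read as "prediction if the last branch was not taken / prediction if it was taken".

```python
def run_b1_b2(d_values, correlated):
    """Count mispredictions for b1 (d != 0) and b2 (d != 1) with one-bit predictors.

    With correlated=True each branch keeps two one-bit predictors and selects
    one with a single bit of global history (outcome of the previous branch).
    With correlated=False only slot 0 is used, giving a basic one-bit predictor.
    """
    # table[branch][history] holds the predicted outcome (True = taken).
    table = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # one-bit global branch history, assumed to start as NT
    misses = 0
    for d in d_values:
        b1_taken = d != 0        # BNEZ R1, L1
        if not b1_taken:
            d = 1                # ADDI R1, R0, #1 on the fall-through path
        b2_taken = d != 1        # BNEZ R3, L2 with R3 = d - 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            index = int(last_taken) if correlated else 0
            if table[name][index] != taken:
                misses += 1
            table[name][index] = taken  # a one-bit predictor records the last outcome
            last_taken = taken
    return misses


basic_misses = run_b1_b2([2, 0, 2, 0], correlated=False)
correlated_misses = run_b1_b2([2, 0, 2, 0], correlated=True)
print("basic one-bit misses:", basic_misses)
print("one-bit with one-bit correlation misses:", correlated_misses)
```

With these assumptions the basic predictor mispredicts every branch in the sequence, while the correlating version mispredicts only in the first iteration, which is what the tables show.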

Size of the Buffers

Number of bits in a (m,n) predictor
  2^m x n x number of entries in the table
Example: assume 8K bits in the BHT
  (0,1): 8K entries
  (0,2): 4K entries
  (2,2): 1K entries
  (12,2): 1 entry!
    Does not use the branch address
    Relies only on the global branch history

Performance of 2-bit Predictors

[Figure: frequency of mispredictions on the SPEC89 benchmarks (nasa7, matrix300, tomcatv, doduc, spice, fpppp, gcc, espresso, eqntott, li) for three predictors: (0,2) with 4K entries, (0,2) with 1M entries, and (2,2) with 1K entries.]

Branch-Target Buffers

Further reduce control stalls (hopefully to 0)
Store the predicted address in the buffer
Access the buffer during IF

[Figure: the PC is looked up in the buffer; each entry holds a predicted address and a T/NT field. On a match (=) the instruction is a branch; otherwise the instruction is not a branch.]
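The entry counts in the 8K-bit example follow directly from the size formula. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def table_entries(total_bits, m, n):
    """Entries that fit in an (m, n) predictor: total_bits / (2**m * n)."""
    bits_per_entry = (2 ** m) * n   # 2^m n-bit predictors per entry
    return total_bits // bits_per_entry


BUDGET_BITS = 8 * 1024  # the 8K-bit BHT from the example

for m, n in [(0, 1), (0, 2), (2, 2), (12, 2)]:
    print(f"({m},{n}): {table_entries(BUDGET_BITS, m, n)} entries")
```

For (12,2) a single entry already consumes 2^12 x 2 = 8192 bits, which is why that configuration cannot be indexed by the branch address at all.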

Prediction with BTF (branch-target buffer)

IF: Send PC to memory and BTF. Entry found in BTF?
  No entry found
    ID: Is instr a taken branch?
    EX: If so, update BTF
  Entry found
    Send out predicted address
    ID: Taken branch?
    EX: If not, kill fetched instr; restart fetch at other target; delete entry from BTF

Target Instruction Buffers

Store target instructions instead of addresses
Advantages
  BTB access can take longer than the time between IFs, and the BTB can be larger
  Branch folding
    Zero-cycle unconditional branches
      Replace branch with target instruction
    Zero-cycle conditional branches
      Condition codes preset

Procedure Return Predictors

Use buffer (stack) of return addresses

[Figure: misprediction rate versus number of entries in the return stack, for gcc, li, and fpppp.]
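The lookup and update steps above can be sketched in a few lines. This is an illustrative model only: the class and method names are hypothetical, the buffer is assumed to have unlimited capacity, instructions are assumed to be 4 bytes, and only taken branches are kept in the buffer.

```python
class BranchTargetBuffer:
    """Toy branch-target buffer mapping a branch PC to its predicted target."""

    def __init__(self):
        self.entries = {}  # branch PC -> predicted target address

    def predict_next_pc(self, pc):
        """IF stage: a hit predicts 'taken' to the stored target; a miss gives pc + 4."""
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """After the branch resolves: update the buffer, report if the fetch was right."""
        predicted = self.predict_next_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.entries[pc] = target      # enter or refresh the entry
        else:
            self.entries.pop(pc, None)     # mispredicted as taken: delete the entry
        return predicted == actual


btb = BranchTargetBuffer()
first = btb.resolve(0x100, taken=True, target=0x40)    # miss in the buffer, branch taken
second = btb.resolve(0x100, taken=True, target=0x40)   # hit, correct target
third = btb.resolve(0x100, taken=False, target=0x40)   # hit, but branch not taken
print(first, second, third)
```

A wrong result from `resolve` corresponds to the "kill fetched instr; restart fetch at other target" path; a correct one means fetch continued without a stall.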

Performance Issues

Limitations of branch prediction schemes
  Prediction accuracy (80% - 95%)
    Type of program
    Size of buffer
  Penalty of misprediction
Fetch from both directions to reduce penalty
  Memory system should:
    Be dual-ported
    Have an interleaved cache
    Fetch from one path and then from the other

Approaches to Improve Performance

Goal so far: achieve CPI = 1
  Eliminate structural, data, and control stalls
Additional performance improvements
  Make clock rate faster
    Improve manufacturing process
    Increase the number of stages
      Superpipelining
  Multiple issue of instructions
    Superscalar
    VLIW
    IPC instead of CPI!

Superscalar Processors

Issue more than one instruction per cycle
  Duplication of functional units
Constraints
  Structural
  Data dependencies
  Control dependencies
Scheduling of instructions
  Static
  Dynamic
Sound familiar?
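To see how prediction accuracy and misprediction penalty combine, a common back-of-the-envelope estimate is CPI = base CPI + branch frequency x misprediction rate x penalty. The sketch below applies it to the 80% and 95% accuracy figures above. The branch frequency (20%) and penalty (3 cycles) are hypothetical values chosen for illustration, not numbers from the slides, and correctly predicted branches are assumed to cost no extra cycles.

```python
def effective_cpi(base_cpi, branch_frequency, accuracy, penalty_cycles):
    """Estimate CPI when only mispredicted branches add stall cycles."""
    misprediction_rate = 1.0 - accuracy
    return base_cpi + branch_frequency * misprediction_rate * penalty_cycles


BRANCH_FREQUENCY = 0.20   # assumed fraction of instructions that are branches
PENALTY_CYCLES = 3        # assumed cost of one misprediction

cpi_low = effective_cpi(1.0, BRANCH_FREQUENCY, 0.80, PENALTY_CYCLES)
cpi_high = effective_cpi(1.0, BRANCH_FREQUENCY, 0.95, PENALTY_CYCLES)
print(f"accuracy 80%: CPI = {cpi_low:.2f}")
print(f"accuracy 95%: CPI = {cpi_high:.2f}")
```

Under these assumed parameters the estimate ranges from about 1.12 down to about 1.03, so the gap to the CPI = 1 goal shrinks as accuracy improves or the penalty is reduced.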
