


1 Instruction Fetch and Branch Prediction
CprE 581 Computer Systems Architecture
Readings: Textbook (4th ed. 2.3, 2.9); (5th ed. 3.3)

2 Frontend and Backend
Pipeline: Fetch, Rename, Wakeup/select (schedule), Regfile/bypass, FU (execute), D-cache, Commit
Frontend:
- Keep fetching n insts per cycle (in-order)
- Predict next PC based on current PC and past history of branch targets and directions
- Special handling of function returns
Backend:
- Execute instructions out-of-order
- Provide feedback info to frontend
- Mis-prediction affects performance but not correctness
Feedback:
- Prediction correct or not, update info
- Incorrect: correct next PC

3 Instruction Flow
Instruction flow must be continuous.
- Branch target prediction: What is the target PC? Must be done at the fetch stage.
- Branch prediction: What direction does a branch take? Usually done at fetching.
- Return address prediction: Special target prediction for return instructions; may be done at fetching or decoding.

4 Instruction Flow
[Figure: single-cycle loop in which the PC feeds instruction memory and the target, branch, and RA predictors; fetched instructions go to Decode/Rename, which sends feedback to the predictors.]
Design questions:
- What would happen if branch prediction is done after the fetching stage, e.g. at decoding?
- At the fetch stage, how to know if an inst is a branch or not?
- How to know if an inst is a return inst?

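5 Branch Prediction Buffer
[Figure: in the IF stage of the IF/ID/EX/M/WB pipeline, the PC is sent to the I-cache and to the branch prediction buffer (BPB). An associative lookup over branch addresses A0 ... A(k-1) is expensive; instead, log k bits of the PC index a k-entry table of prediction bits.]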

6 Branch Target Buffer (BTB)
- Look up the PC of the instruction to fetch in the BTB. Each entry holds a branch PC and its predicted PC, optionally with bits recording whether the branch is predicted taken or untaken.
- No match: the instruction is not predicted to be a branch. Proceed normally.
- Yes: the instruction is a branch, and the predicted PC should be used as the next PC.

7 Branch Prediction Steps
IF: Send PC to memory and branch-target buffer. Is an entry found in the branch-target buffer?
- No (BTB miss):
  - ID: Is the instruction a taken branch?
  - No: normal instruction execution.
  - Yes: in EX, enter the branch address and next PC into the BTB.
- Yes (BTB hit):
  - ID: Send out the predicted PC.
  - EX: Is the branch taken?
  - No: mispredicted branch. Kill the fetched inst, restart fetch at the other target, and delete the entry from the BTB.
  - Yes: branch predicted correctly. Continue execution with no stalls.
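The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's design: the class name, the direct-mapped organization, the 16-entry size, the 4-byte instruction assumption, and the handling of a wrong stored target are all assumptions made for the example.

```python
class BranchTargetBuffer:
    """Hypothetical direct-mapped BTB: branch PC -> predicted target PC."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}  # index -> (branch_pc, predicted_target)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def lookup(self, pc):
        """IF stage: return the predicted next PC, or None on a BTB miss."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def resolve(self, pc, taken, target):
        """EX stage: update the BTB; return True if fetch went the right way."""
        predicted = self.lookup(pc)
        index = self._index(pc)
        if predicted is None:
            if taken:
                # BTB miss on a taken branch: enter branch PC and target.
                self.table[index] = (pc, target)
            return not taken
        if not taken:
            # Predicted taken but not taken: delete the entry.
            del self.table[index]
            return False
        if predicted != target:
            # Assumption: a wrong stored target is simply overwritten.
            self.table[index] = (pc, target)
            return False
        return True


btb = BranchTargetBuffer()
first = btb.resolve(0x40, taken=True, target=0x80)   # miss, entry created
second = btb.resolve(0x40, taken=True, target=0x80)  # hit, correct target
```

In this sketch only taken branches are ever stored, so a BTB hit doubles as a "predict taken" decision, which matches the flow on the slide.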

8 Branch Folding
Optimization:
- Larger branch-target buffer
- Add target instruction into buffer to deal with longer decoding time required by larger buffer
- This is called "branch folding"
Adv. Techniques for Instruction Delivery and Speculation. Copyright 2012, Elsevier Inc. All rights reserved.

9 Mis-prediction Recovery
- Pipeline flushing: a mis-prediction is detected when a branch is resolved. The processor may wait until the branch is about to commit and then flush the pipeline.
- Selective flushing: immediately and selectively flush the misfetched instructions.
- Fetch stage flushing: special cases, e.g.
  - A branch target was wrongly predicted; the correct branch target is known at decoding for most branches.
  - An unconditional branch (jump) was predicted as not-taken.

10 Branch Prediction
Predict branch direction: taken or not taken (T/NT)
        BNE R1, R2, L1    ; taken: continue at L1; not taken: fall through
        ...
    L1: ...
Static prediction: compilers decide the direction
Dynamic prediction: hardware decides the direction using dynamic information
1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction Buffer
4. Tournament Branch Predictor
5. and more

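11 Predictor for a Single Branch
General form:
1. Access: the PC selects the predictor state
2. Predict: output T/NT
3. Feedback: the actual outcome T/NT updates the state
1-bit prediction: two states, 1 (Predict Taken) and 0 (Predict Not Taken). Feedback T moves to or stays in state 1; feedback NT moves to or stays in state 0.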

12 1-bit BHT Accuracy
Example: in a loop, a 1-bit BHT will cause 2 mispredictions.
Consider a loop of 10 iterations before exit:
    for ( ) {
        for (i=0; i<10; i++)
            a[i] = a[i] * 2.0;
    }
Two mispredictions: the first loop iteration and the last loop iteration. Only 80% accuracy.

13 1-Bit Prediction Drawbacks
    LOOP: Inst 1
          Inst 2
          Inst 3
          ...
          Inst k
          Branch        ; back to LOOP, 10 iterations, inside an outer loop
Taken: 9 times. Not taken: 1 time.
1-bit prediction mispredicts twice per pass through the loop: 20% misprediction rate.
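The 20% figure can be checked with a short simulation. This is a sketch under stated assumptions: the function name is made up for the example, the outer loop is assumed to run the inner loop 100 times, and the predictor is assumed to start in the "not taken" state.

```python
def one_bit_mispredictions(outcomes, state=False):
    """Count mispredictions of a 1-bit predictor (True = taken)."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # a 1-bit predictor just remembers the last outcome
    return misses


# Inner-loop branch: taken 9 times, then not taken once on loop exit.
inner_loop = [True] * 9 + [False]
outcomes = inner_loop * 100  # assumed: outer loop repeats the inner loop 100 times
misses = one_bit_mispredictions(outcomes)
rate = misses / len(outcomes)
```

Each pass through the inner loop costs one miss on exit and one miss on re-entry, because the exit flips the single state bit.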

14 Branch History Table of 1-bit Predictor
- The BHT is also called the Branch Prediction Buffer in the textbook
- Can use only one 1-bit predictor, but accuracy is low
- BHT: use a table of simple predictors, indexed by bits from the PC
- Similar to a direct-mapped cache
- More entries: more cost, but fewer conflicts and higher accuracy
- A BHT can contain complex predictors
[Figure: k bits of the branch address index a table of 2^k entries, which outputs the prediction.]

15 2-bit Saturating Counter
Solution: a 2-bit scheme that changes the prediction only after two mispredictions in a row (Figure 3.7, p. 249).
[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; T edges move toward the taken side, NT edges toward the not-taken side. Blue: stop, not taken. Gray: go, taken.]
Adds hysteresis to the decision-making process.
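A minimal sketch of the 2-bit saturating counter, run on the same 10-iteration loop branch as the 1-bit example. The encoding (0 and 1 predict not taken, 2 and 3 predict taken), the function name, and the 100 repetitions of the inner loop are assumptions for illustration.

```python
def two_bit_mispredictions(outcomes, counter=3):
    """Count mispredictions of one 2-bit saturating counter (True = taken)."""
    misses = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        if predict_taken != taken:
            misses += 1
        # Saturate at 0 and 3 so one odd outcome cannot flip a strong state.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


inner_loop = [True] * 9 + [False]
outcomes = inner_loop * 100  # assumed: inner loop executed 100 times
misses = two_bit_mispredictions(outcomes)
cold_misses = two_bit_mispredictions(outcomes, counter=0)
```

Starting from the strongly taken state, only the loop exit is mispredicted, so the rate drops from 20% to 10% on this pattern.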

16 Correlating Branches
- Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch.
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table.
- In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters.
- The old 2-bit BHT is then a (0,2) predictor.

17 Correlating Branches
    if (d==0)
        d=1;
    if (d==1)

          BNEZ  R1, L1        ; Branch B1
          ADDI  R1, R0, #1
    L1:   SUBUI R3, R1, #1
          BNEZ  R3, L2        ; Branch B2
    L2:
Are B1 and B2 correlated? If B1 is not taken, then B2 is not taken.

18 Correlating Branch Predictor
- Idea: the taken/not taken outcome of recently executed branches is related to the behavior of the next branch (as well as the history of that branch's own behavior).
- The behavior of recent branches then selects between, say, 2 predictions of the next branch, updating just that prediction.
- (1,1) predictor: 1-bit global, 1-bit local.
[Figure: the branch address (4 bits) selects a row of 1-bit per-branch local predictors; the 1-bit global branch history (0 = not taken) selects which of the row's entries supplies the prediction.]

19 Correlating Branch Predictor
- General form: (m, n) predictor, with m bits for global history and n bits for local history
- Records correlation between m+1 branches
- Simple implementation: the global history can be stored in a shift register
- Example: (2,2) predictor, 2-bit global, 2-bit local
[Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit global branch history (01 = not taken, then taken) selects which of the row's entries supplies the prediction.]


21 Correlating Branch Example
    Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
    0                  | Yes   | Not taken | 1                    | Yes   | Not taken
    1                  | No    | Taken     | 1                    | Yes   | Not taken
    2                  | No    | Taken     | 2                    | No    | Taken
Assume d alternates between 2 and 0.
    d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
    2   | NT            | T         | T                 | NT            | T         | T
    0   | T             | NT        | NT                | T             | NT        | NT
    2   | NT            | T         | T                 | NT            | T         | T
    0   | T             | NT        | NT                | T             | NT        | NT
The 1-bit predictor mispredicts every branch!

22 Correlating Branch Example
    Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
    NT/NT           | Not taken                           | Not taken
    NT/T            | Not taken                           | Taken
    T/NT            | Taken                               | Not taken
    T/T             | Taken                               | Taken
Initial prediction: NT/NT
    d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
    2   | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
    0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
    2   | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
    0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
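The two traces (plain 1-bit versus the (1,1) predictor) can be reproduced with a short simulation. This is a sketch under assumptions: the function names are hypothetical, all predictions start as not taken, and the global history bit is assumed to start as "last branch not taken".

```python
def branch_outcomes(d):
    """Outcomes of b1 (taken if d != 0) and b2 (taken if d != 1) for one pass."""
    b1_taken = d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1
    return (("b1", b1_taken), ("b2", b2_taken))


def one_bit_misses(d_values):
    """One 1-bit predictor per branch, no correlation."""
    state = {"b1": False, "b2": False}
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            misses += state[name] != taken
            state[name] = taken
    return misses


def one_one_misses(d_values):
    """(1,1) predictor: the last branch outcome picks one of two 1-bit entries."""
    table = {"b1": [False, False], "b2": [False, False]}
    last_taken = False  # assumed initial global history
    misses = 0
    for d in d_values:
        for name, taken in branch_outcomes(d):
            slot = int(last_taken)
            misses += table[name][slot] != taken
            table[name][slot] = taken  # update only the selected entry
            last_taken = taken
    return misses, table


d_values = [2, 0, 2, 0]
plain = one_bit_misses(d_values)
correlated, final_table = one_one_misses(d_values)
```

Under these assumptions the plain 1-bit predictors miss all eight branches, while the (1,1) predictor misses only the first b1 and b2 and ends with b1 = T/NT and b2 = NT/T, as in the table.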

23 Correlating Branches
(2,2) predictor: the behavior of recent branches selects between, say, four predictions of the next branch, updating just that prediction.
[Figure: the branch address selects a row of four 2-bit per-branch predictors; the 2-bit global branch history selects which of the four supplies the prediction.]

24 Gselect and Gshare predictors
- Keep a global register (GR) with the outcome of the last k branches
- Use it in conjunction with the PC to index into a table containing 2-bit predictors
- Gselect: concatenate
- Gshare: XOR (better)
[Figure: the branch result (taken/not taken) is shifted into the global branch history register (GBHR); the GBHR combined with the PC indexes the PHT, whose 2-bit entry is decoded into the prediction (taken/not taken).]
ECE Adapted from Patterson, Katz and Culler Copyright UCB 2007 CAM Copyright 2001 UCB & Morgan Kaufmann
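The difference between the two indexing schemes is easiest to see in code. The sketch below is illustrative only: the function names, the bit widths, and the 4-byte instruction alignment are assumptions, not values from the slide.

```python
def gselect_index(pc, history, pc_bits=4, hist_bits=4):
    """Gselect: concatenate low PC bits with the global history bits."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)  # assumes 4-byte instructions
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part


def gshare_index(pc, history, index_bits=8):
    """Gshare: XOR the PC bits with the global history bits."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ history) & mask


def shift_history(history, taken, hist_bits=8):
    """Shift the newest branch outcome into the global history register."""
    return ((history << 1) | int(taken)) & ((1 << hist_bits) - 1)


pc, history = 0x4C, 0b1010
select_idx = gselect_index(pc, history)
share_idx = gshare_index(pc, history)
next_history = shift_history(history, taken=True)
```

For the same table size, gselect splits the index between PC bits and history bits, while gshare lets all index bits carry both, which is the usual explanation for why the slide marks XOR as better.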

25 Accuracy of Different Schemes (Figure 2.7, page 87)
[Figure: bar chart of the frequency of mispredictions (0% to 20%) on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li for three predictors: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

26 Re-evaluating Correlation
Several SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:
    program  | branch % | static | # = 90%
    compress | 14%      | …      | …
    eqntott  | 25%      | …      | …
    gcc      | 15%      | …      | …
    mpeg     | 10%      | …      | …
    real gcc | 13%      | …      | …
- Real programs + OS are more like gcc
- Small benefits of correlation beyond benchmarks?
- Mispredict because either:
  - Wrong guess for that branch
  - Got the branch history of the wrong branch when indexing the table
- For SPEC92, 4096 entries are about as good as an infinite table
- Misprediction is mostly due to wrong prediction
- Can we improve using global history?

27 Estimate Branch Penalty
Example: the BHT correct rate is 95% and the BTB hit rate is 95%. The average miss penalty is 1 cycle on a BTB miss and 6 cycles on a BHT misprediction. What is the average branch penalty?
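One way to set up the estimate is sketched below. The slide does not say how the two events combine, so this is only one plausible model: it assumes a direction misprediction always costs the BHT penalty, that the BTB penalty applies only when the direction was predicted correctly, and that the two events are independent. The function name is hypothetical.

```python
def branch_penalty(bht_accuracy, btb_hit_rate, bht_miss_cycles, btb_miss_cycles):
    """Expected penalty cycles per branch under the assumptions stated above."""
    # Direction mispredicted: pay the full BHT penalty.
    direction_wrong = (1 - bht_accuracy) * bht_miss_cycles
    # Direction right, but the target was not in the BTB.
    target_missing = bht_accuracy * (1 - btb_hit_rate) * btb_miss_cycles
    return direction_wrong + target_missing


penalty = branch_penalty(0.95, 0.95, bht_miss_cycles=6, btb_miss_cycles=1)
```

Under this model the result is about 0.35 cycles per branch (0.05 x 6 plus 0.95 x 0.05 x 1); a model that charges BTB misses only on taken branches would give a smaller number.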

28 Return Address (RA) Prediction
- Return: a special register-indirect branch
- Register-indirect branches are hard to predict
  - Many callers, one callee
  - Jump to multiple return addresses from a single address (no PC-target correlation)
- SPEC89: 85% of such branches are procedure returns
- Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries lead to a small miss rate
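A return address stack is small enough to sketch directly. The class and method names are hypothetical, and the overflow policy (drop the oldest entry) is an assumption; the slide only gives the 8-to-16-entry size range.

```python
class ReturnAddressStack:
    """Hypothetical fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: overwrite the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for return_pc in (0x104, 0x204, 0x304):  # three nested calls, depth of two
    ras.on_call(return_pc)
predictions = [ras.predict_return() for _ in range(3)]
```

With a depth of two, the innermost two returns are predicted and the outermost one is lost, which is the kind of miss a deeper stack avoids.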

29 Accuracy of Return Address Predictor
[Figure not recovered.]

30 Tournament Predictors
- Motivation for correlating branch predictors: the 2-bit local predictor failed on important branches; adding global information improved performance
- Tournament predictors: use two predictors, one based on global information and one based on local information, and combine them with a selector
- The hope is to select the right predictor for the right branch (or the right context of a branch)

31 Tournament Branch Predictor
- Used in the Alpha 21264: tracks both local and global history
- Intended for mixed types of applications
- Global history: T/NT history of the past k branches, e.g. (NT T NT T NT T)
[Figure: the PC feeds the local predictor; the global history feeds the global predictor and the choice predictor; the choice predictor drives a mux that selects the final NT/T prediction.]

32 Tournament Predictor in Alpha 21264
- 4K 2-bit counters choose between a global predictor and a local predictor
- The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
- 12-bit pattern: ith bit is 0 => ith prior branch not taken; ith bit is 1 => ith prior branch taken
[Figure: state diagram of the 2-bit choice counter, whose states select one predictor or the other and whose transitions are labeled with correctness pairs such as 00, 01, 10, 11; the counters form a 4K x 2-bit table.]

33 Tournament Predictor in Alpha 21264
The local predictor is a 2-level predictor:
- Top level: a local history table of 1K 10-bit entries; each 10-bit entry holds the 10 most recent branch outcomes for that entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table indexes a table of 1K 3-bit saturating counters, which provide the local prediction.
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)

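34 Tournament Branch Predictor
- Local predictor: uses 10-bit local history and shared 3-bit counters. The PC indexes the local history table (1Kx10); the 10-bit history indexes the counters (1Kx3), which output 1 bit (NT/T).
- Global and choice predictors: the 12-bit global history indexes the global counters (4Kx2), which output 1 bit (NT/T), and the choice counters (4Kx2), which output 1 bit (local/global). The choice bit selects the final NT/T prediction.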
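The structure above can be sketched as follows. The table sizes come from the slides; everything else is an assumption for illustration: the class and helper names, the initial counter values, the PC hashing (4-byte instructions), and the rule that the choice counter is trained only when the two components disagree.

```python
def bump(counter, up, maximum):
    """Saturating increment or decrement of a counter in [0, maximum]."""
    return min(counter + 1, maximum) if up else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.local_history = [0] * 1024    # 1K x 10-bit histories
        self.local_counters = [3] * 1024   # 1K x 3-bit; >= 4 predicts taken
        self.global_counters = [1] * 4096  # 4K x 2-bit; >= 2 predicts taken
        self.choice = [1] * 4096           # 4K x 2-bit; >= 2 selects global
        self.global_history = 0            # outcomes of the last 12 branches

    def _components(self, pc):
        history = self.local_history[(pc >> 2) % 1024]
        local = self.local_counters[history] >= 4
        glob = self.global_counters[self.global_history] >= 2
        return local, glob

    def predict(self, pc):
        local, glob = self._components(pc)
        return glob if self.choice[self.global_history] >= 2 else local

    def update(self, pc, taken):
        local, glob = self._components(pc)
        slot = (pc >> 2) % 1024
        lh, gh = self.local_history[slot], self.global_history
        if local != glob:
            # Assumed rule: move toward whichever component was right.
            self.choice[gh] = bump(self.choice[gh], glob == taken, 3)
        self.local_counters[lh] = bump(self.local_counters[lh], taken, 7)
        self.global_counters[gh] = bump(self.global_counters[gh], taken, 3)
        self.local_history[slot] = ((lh << 1) | int(taken)) & 0x3FF
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF


def count_misses(predictor, pc, outcomes):
    """Predict, compare, then train on each outcome; return the miss count."""
    misses = 0
    for taken in outcomes:
        misses += predictor.predict(pc) != taken
        predictor.update(pc, taken)
    return misses


predictor = TournamentPredictor()
warmup_misses = count_misses(predictor, 0x40, [True, False] * 50)
steady_misses = count_misses(predictor, 0x40, [True, False] * 50)
```

A strictly alternating branch defeats a plain 2-bit counter but is learned here once the history registers fill, so the second batch of 100 outcomes is predicted without a miss in this sketch.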

35 % of predictions from local predictor in Tournament Prediction Scheme
    Benchmark | % from local predictor
    nasa7     | 98%
    matrix300 | 100%
    tomcatv   | 94%
    doduc     | 90%
    spice     | 55%
    fpppp     | 76%
    gcc       | 72%
    espresso  | 63%
    eqntott   | 37%
    li        | 69%

36 Accuracy of Branch Prediction (fig 3.40)
    Benchmark | Profile-based | 2-bit counter | Tournament
    tomcatv   | 99%           | 99%           | 100%
    doduc     | 95%           | 84%           | 97%
    fpppp     | 86%           | 82%           | 98%
    li        | 88%           | 77%           | 98%
    espresso  | 86%           | 82%           | 96%
    gcc       | 70%           | 88%           | 94%
Profile: branch profile from the last execution (static in that it is encoded in the instruction, but based on a profile).

Branch Prediction Performance
[Figure not recovered.]

38 Patt-Yeh Predictor
The correlating branch predictors we have just studied work by combining local with global information. However, it is also possible to do quite well considering only information about the current branch (local information).

39 Patt-Yeh Predictor
[Figure not recovered.]

40 Patt-Yeh Predictor
Example branch outcomes:
    A: T N T N T N T N
    B: T T T T T T T T


42 Patt-Yeh Predictor
PT entries 01 and 10 are trained for A, and 11 is trained for B.
In general, the Yeh-Patt predictor provides 96% to 98% accuracy for integer code.
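The training of the pattern table (PT) entries can be reproduced with a small local two-level predictor. This is a sketch under assumptions, since the slide figures are not available: 2-bit per-branch history registers, one shared pattern table of 2-bit counters that starts weakly not taken, branch A alternating T/N, branch B always taken, and hypothetical names and addresses throughout.

```python
class LocalTwoLevelPredictor:
    """Per-branch history registers indexing a shared pattern table (PT)."""

    def __init__(self, history_bits=2):
        self.history_bits = history_bits
        self.histories = {}  # branch PC -> recent outcomes of that branch
        self.pattern_table = [1] * (1 << history_bits)  # 2-bit counters

    def predict(self, pc):
        return self.pattern_table[self.histories.get(pc, 0)] >= 2

    def update(self, pc, taken):
        history = self.histories.get(pc, 0)
        counter = self.pattern_table[history]
        self.pattern_table[history] = (
            min(counter + 1, 3) if taken else max(counter - 1, 0)
        )
        mask = (1 << self.history_bits) - 1
        self.histories[pc] = ((history << 1) | int(taken)) & mask


PC_A, PC_B = 0x100, 0x200  # hypothetical branch addresses
predictor = LocalTwoLevelPredictor()
missed = []
for i in range(16):
    # A alternates T, N, T, N, ...; B is always taken.
    for pc, taken in ((PC_A, i % 2 == 0), (PC_B, True)):
        missed.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
```

After a short warm-up, entry 01 predicts not taken and entry 10 predicts taken (both trained by A), while entry 11 predicts taken (trained by B), so neither branch is mispredicted again in this run.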

43 Branch Predictors
- Smith (bimodal) predictor
- Pattern-based predictors
  - Two-level, gshare, bi-mode, gskewed, Agree, ...
- Predictors based on alternative contexts
  - Alloyed history, path history, loop counting, ...
- Hybrid predictors
  - Multiple component predictors + selection/fusion
  - Tournament, multihybrid, prediction fusion, ...
Reference book: Ch. 9, Advanced Instruction Flow Techniques

44 Branch Decoupling
    Loop: LD   F0,0(R1)    ;F0=vector element
          ADDD F4,F0,F2    ;add scalar from F2
          SD   0(R1),F4    ;store result
          SUBI R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ R1,Loop     ;branch R1!=zero
Say 100 iterations. Can the branch be pre-computed for each loop iteration?

45 Branch Determining Instructions (BDIs)
          LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,8     ;BDI
          BNEZ R1,Loop

46 Branch Decoupling
Original loop:
    Loop:  LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8
           BNEZ R1,Loop
Branch stream:
    BLoop: SUBI R1,R1,8
           BNEZ R1,BLoop,PLoop
Program stream:
    PLoop: LD   F0,0(R1)
           ADDD F4,F0,F2
           SD   0(R1),F4
           SUBI R1,R1,8

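47 Branch Decoupled Microarchitecture
[Figure: a P-Processor with its own P-Reg File, I-cache, and D-cache, and a B-Processor with its own B-Reg File, I-cache, and B-PC. The B-Processor places (target, block size) pairs into the PPC Queue, which feeds the PPC and the Block Counter of the P-Processor.]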

48 PPC Control
    if (Block counter != 0)
        Decrement Block counter and increment PPC.
    else when (PPCQ not empty)
        Dequeue a (target, block size) entry from PPCQ.
        PPC <- target; Block counter <- block size;
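The control rule above can be transcribed literally into a small trace generator. This is a sketch under assumptions: the function name is hypothetical, PPC is assumed to advance by a 4-byte instruction size, one fetch happens per cycle, and the block counter is read as the number of instructions that follow the first one in a block.

```python
from collections import deque


def ppc_trace(queue_entries, cycles, inst_size=4):
    """Return the sequence of PPC values fetched over the given cycles."""
    ppcq = deque(queue_entries)  # (target, block_size) pairs from the B-Processor
    ppc, block_counter = None, 0
    fetched = []
    for _ in range(cycles):
        if block_counter != 0:
            # Still inside the current block: step to the next instruction.
            block_counter -= 1
            ppc += inst_size
        elif ppcq:
            # Block finished: take the next block from the queue.
            ppc, block_counter = ppcq.popleft()
        else:
            continue  # queue empty: the P-Processor waits this cycle
        fetched.append(ppc)
    return fetched


trace = ppc_trace([(0x100, 2), (0x200, 1)], cycles=6)
```

The P-Processor never sees a branch instruction in this scheme; it simply follows the blocks that the branch stream has already resolved and stalls only when the queue runs dry.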

49 Branch Prediction With n-way Issue
1. Branches will arrive up to n times faster in an n-issue processor
2. Amdahl's Law => the relative impact of control stalls will be larger with the lower potential CPI of an n-issue processor

50 Modern Design: Frontend and Backend
Frontend: instruction fetch and dispatch
- To supply high-quality instructions to the backend
- Instructions flow in program order
Backend: schedule/execute, writeback, and commit
- Instructions are processed out-of-order
Frontend enhancements:
- Instruction prefetch: fetch ahead to deliver multiple instructions per cycle
- To handle multiple branches: may access multiple cache lines in one cycle; use prefetch to hide the cost
- Target and branch predictions may be integrated with the instruction cache, e.g. the Intel P4 trace cache

51 Pitfall: Sometimes bigger and dumber is better
- The 21264 uses a tournament predictor (29 Kbits)
- An earlier processor uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
- SPEC95 benchmarks: the 21264 outperforms the earlier processor
  - 21264: avg. … mispredictions per 1000 instructions
  - Earlier processor: avg. … mispredictions per 1000 instructions
- Reversed for transaction processing (TP)!
  - 21264: avg. 17 mispredictions per 1000 instructions
  - Earlier processor: avg. 15 mispredictions per 1000 instructions
- TP code is much larger, and the earlier processor holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
