COSC3330 Computer Architecture Lecture 14. Branch Prediction


1 COSC3330 Computer Architecture Lecture 14. Branch Prediction Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

2 Topic: Out-of-Order Execution, Branch Prediction

3 Superscalar Terminology. Superscalar: able to issue > 1 instruction / cycle. Superpipelined: deep, but not superscalar, pipeline. Issue width: number of instructions issued per cycle. Out-of-order: able to execute instructions out of program order. Register renaming: able to dynamically assign physical registers to instructions. Speculative execution: able to run instructions speculatively (branch predictions).

4 A Dynamic Superscalar Processor. [Figure: in-order front end (IF, ID, RD) feeding a dispatch buffer; out-of-order execution (EX) through the ALU, MEM1-MEM2, FP1-FP3 and BR pipelines; a reorder buffer in front of the in-order WB stage.]

5 Remember the Toll Booth? One-at-a-time = 45s (5s, 5s, 30s, 5s): one driver hands the toll-booth agent a $100 bill, and it takes a while to count the change. With a 4-issue toll booth (lanes L1, L2, L3, L4): OOO = 30s. We'll add the equivalent of the shoulder to the CPU: the Re-Order Buffer (ROB).

6 Re-Order Buffer (ROB). Separates architected vs. physical registers. Tracks program order of all in-flight instructions. Enables in-order completion or commit.

7 Hardware Organization. [Figure: instruction buffers, the RAT and the Architected Register File; reservation stations (fields: op, Qj, Qk, Vj, Vk) in front of the Add and Mult ALUs; the ROB (fields: type, dest, value, fin) with its head pointer.]

8 Circular Ring Buffer

9 Issue. Read inst from inst buffer. Check if resources are available: appropriate RS entry, ROB entry. Read RAT, read (available) sources, update RAT. Write to RS and ROB. Stall issue if any needed resource is not available. [Figure: instruction buffers, RAT, Architected Register File, reservation stations (op, Qj, Qk, Vj, Vk) for the Add and Mult ALUs, and the ROB (type, dest, value, fin).]

10 Exec. Same as before: wait for all operands to arrive, compete to use a functional unit, execute!

11 Write Result. Broadcast result on the CDB (any dependents will grab the value). Write result back to your ROB entry: the ARF holds the official register state, which we will only update in program order. Mark the ready/finished bit in the ROB (note that this inst has completed execution).

12 New: Commit. When an inst is the oldest in the ROB, i.e. ROB-head points to it: write result (if ready/finished bit is set). If register-producing instruction: write to architected register file. If store: write to memory. Advance ROB-head to the next instruction. This is what the outside world sees, and it's all in-order.
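The issue / write-result / commit steps can be sketched in a few lines of Python. This is a toy model, not the lecture's hardware: the class name, the dictionary fields, and the use of a `deque` in place of a true circular ring buffer with head and tail pointers are all illustrative assumptions.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: entries enter in program order and leave only from the head."""

    def __init__(self, size):
        self.size = size
        self.entries = deque()   # entries[0] plays the role of ROB-head
        self.arf = {}            # architected register file
        self.memory = {}         # committed stores

    def issue(self, tag, kind, dest):
        """Allocate an entry at the tail; return False (stall) if the ROB is full."""
        if len(self.entries) == self.size:
            return False
        self.entries.append({"tag": tag, "type": kind, "dest": dest,
                             "value": None, "fin": False})
        return True

    def write_result(self, tag, value):
        """Out-of-order completion: store the value and set the finished bit."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["fin"] = value, True
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.entries and self.entries[0]["fin"]:
            entry = self.entries.popleft()
            target = self.memory if entry["type"] == "store" else self.arf
            target[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

rob = ReorderBuffer(size=4)
for tag, dest in [("A", "r1"), ("B", "r2"), ("C", "r3")]:
    rob.issue(tag, "reg", dest)
rob.write_result("C", 30)   # C finishes first, out of order
print(rob.commit())         # [] because A is still unfinished at the head
rob.write_result("A", 10)
print(rob.commit())         # ['A'] because B now blocks C
rob.write_result("B", 20)
print(rob.commit())         # ['B', 'C']
```

Even though C finishes execution first, the architected state only ever changes in the order A, B, C.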

13 Commit Illustrated. Make instruction execution visible to the outside world: commit the changes to the architected state. [Figure: ROB holding instructions A B C D E F G H J K; WB results flow into the ARF; the outside world "sees": A executed, B executed, C executed, D executed, E executed.] Instructions execute out of program order, but the outside world still believes it's in-order.

14 James E. Smith. Eckert-Mauchly Award, 1999, for fundamental contributions to high performance micro-architecture, including saturating counters for branch prediction, reorder buffers for precise exceptions, ...

15 Loose Ends. Up to now: techniques for handling register-related dependencies. Register renaming for WAR, WAW. Tomasulo's algorithm for scheduling RAW. Still need to address: control dependencies.

16 Branch Prediction/Speculative Execution. When we hit a branch, guess if it's T or N. Keep scheduling and executing instructions as if the branch didn't even exist. Sometime later, if we messed up, just throw it all out and fetch the correct instructions. [Figure: control-flow graph of blocks A, B, C, D, Q, R with a guessed branch and T/N edges, next to the scheduled instruction stream: ADD, Branch, LOAD, SUB, LOAD, DIV, XOR, ADD, STORE, LOAD, Branch, ADD, ADD, SUB, STORE, MUL.]

17 Branches Kill! Branches are very frequent: approx. 20% of all instructions. We cannot afford to wait until we know where a branch goes. Long pipelines: branch outcome known after B cycles; no scheduling past the branch until the outcome is known. Superscalars (e.g. 4-way): a branch every cycle or so! One cycle of work, then bubbles for ~B cycles?

18 Categorizing Branches. Frequency of branch instructions (source: H&P, using Alpha):

    Type                 SPEC2000INT   SPEC2000FP
    Conditional branch   75%           82%
    Jump                 6%            10%
    Call/Return          19%           8%

19 Surviving Branches: Prediction. Predict branches, and predict them well! Fetch, decode, etc. on the predicted path. Option 1: no execute until the branch is resolved. Option 2: execute anyway (speculation). Recover from mispredictions: restart fetch from the correct path. [Figure: control-flow graph of blocks A, B, C, D, Q, R with T/N edges.]

20 Branch Misprediction. [Figure: single-issue pipeline with stages PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve; a "Mispredict" marker on the pipeline.]

21 Branch Misprediction. [Figure: the same single-issue pipeline (PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve) on a mispredict: flush entailed instructions and refetch.]

22 Branch Misprediction. [Figure: the same pipeline (PC, Next PC, Fetch, Drive, Alloc, Rename, Queue, Schedule, Dispatch, Reg File, Exec, Flags, Br Resolve) on a mispredict, comparing single issue with an 8-issue superscalar processor (worst case).]

23 Intel Quad Core

24 A9 (Apple A5)

25 Importance of Branches. Instruction window for ILP. If the misprediction rate equals 50%, and 1 in 5 insts is a branch, then the number of useful instructions that we can fetch is $5 \times \left(1 + \tfrac{1}{2} + \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^3 + \cdots\right) = 10$. If we halve the miss rate down to 25%: $5 \times \left(1 + \tfrac{3}{4} + \left(\tfrac{3}{4}\right)^2 + \left(\tfrac{3}{4}\right)^3 + \cdots\right) = 20$. Halving the miss rate doubles the number of useful instructions that we can try to extract ILP from.
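The two sums are geometric series, so the count of useful instructions reduces to (instructions per branch) / (misprediction rate). A small Python check, assuming as the slide does that each block of 5 instructions is useful only if every earlier branch was predicted correctly (the function names are illustrative):

```python
def useful_instructions(misp_rate, insts_per_branch=5, terms=1000):
    """Sum insts_per_branch * (1 + p + p^2 + ...) with p = 1 - misp_rate."""
    p = 1.0 - misp_rate
    return sum(insts_per_branch * p ** k for k in range(terms))

def closed_form(misp_rate, insts_per_branch=5):
    """Limit of the geometric series: insts_per_branch / misp_rate."""
    return insts_per_branch / misp_rate

for rate in (0.5, 0.25):
    print(rate, round(useful_instructions(rate), 6), closed_form(rate))
```

The truncated sum and the closed form agree at about 10 useful instructions for a 50% misprediction rate and about 20 for 25%.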

26 Branch Prediction. Need to know two things: whether the branch is taken or not (direction), and the target address if it is taken (target). Direct jumps, function calls: direction known (always taken), target easy to compute. Conditional branches (typically PC-relative): direction difficult to predict, target easy to compute. Indirect jumps, function returns: direction known (always taken), target difficult.

27 Branch Prediction: Direction. Needed for conditional branches; most branches are of this type. Many, many kinds of predictors for this. Static: fixed rule, or compiler annotation (e.g. BEQL is "branch if equal likely"). Dynamic: hardware prediction. Dynamic prediction is usually history-based. Example: predict that the direction is the same as the last time this branch was executed.

28 Why Branch Direction is Predictable?

    for (i=0; i<100; i++) { ... }

          addi r10, r0, 100
          add  r1, r0, r0       # i = 0
    L1:   ...
          addi r1, r1, 1
          bne  r1, r10, L1

    if (aa==2) aa = 0;
    if (bb==2) bb = 0;
    if (aa!=bb) ...

          addi r2, r0, 2
          bne  r10, r2, L_bb
          xor  r10, r10, r10
          j    L_exit
    L_bb: bne  r11, r2, L_xx
          xor  r11, r11, r11
          j    L_exit
    L_xx: beq  r10, r11, L_exit
          ...
    L_exit:

29 Static Branch Prediction. Uni-directional: always predict taken (or not taken). Backward taken, forward not taken: needs offset information. Compiler hints with branch annotation. When will the info be available? Post-decode?
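The backward-taken, forward-not-taken rule needs only the sign of the branch offset. A minimal Python sketch, assuming a signed offset measured in instructions (the function names and the offset value -2 are illustrative, not from the lecture):

```python
def predict_btfn(offset):
    """Backward taken, forward not taken: predict from the sign of the offset."""
    return offset < 0

def static_accuracy(offset, outcomes):
    """Fraction of outcomes that the fixed BTFN prediction gets right."""
    prediction = predict_btfn(offset)
    return sum(prediction == taken for taken in outcomes) / len(outcomes)

# Loop-closing branch (backward, so a negative offset) in a 4-iteration loop:
# taken three times, then it falls through.
loop_outcomes = [True, True, True, False]
print(static_accuracy(-2, loop_outcomes))   # 0.75
```

A loop-closing branch is mispredicted only on loop exit, which is why the rule works well for loops with many iterations.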

30 FSM of the Simplest Predictor. A 2-state machine that changes its mind fast. State 0: predict not taken. State 1: predict taken. Transitions: go to state 1 if the branch is taken, go to state 0 if the branch is not taken.

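31 Example using 1-bit branch history table.

    for (i=0; i<4; i++) { ... }

          addi r10, r0, 4
          add  r1, r0, r0       # i = 0
    L1:   addi r1, r1, 1
          bne  r1, r10, L1

[Table: Pred vs. Actual for the loop branch; the Actual row repeats T T T N; the Pred row and the accuracy percentage did not survive extraction.]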
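The 1-bit scheme is easy to simulate. The sketch below assumes the predictor starts in state 0 (predict not taken) and that the 4-iteration loop is entered repeatedly; both are assumptions, so the printed accuracy is not necessarily the figure from the original slide.

```python
def one_bit_accuracy(outcomes, state=0):
    """1-bit predictor: state 1 predicts taken, state 0 predicts not taken."""
    correct = 0
    for taken in outcomes:
        correct += (state == 1) == taken   # was the prediction right?
        state = 1 if taken else 0          # remember only the last outcome
    return correct / len(outcomes)

# bne outcomes for one pass over the 4-iteration loop: T T T N
loop = [True, True, True, False]
print(one_bit_accuracy(loop * 10))   # 0.5
```

Each pass costs two mispredictions: one at loop exit, and one more on re-entry because the exit flipped the bit.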

32 2-bit Saturating Up/Down Counter Predictor. MSB: direction bit. LSB: hysteresis bit. States: 11/ST (Strongly Taken), 10/WT (Weakly Taken), 01/WN (Weakly Not Taken), 00/SN (Strongly Not Taken). States 11 and 10 predict taken; states 01 and 00 predict not taken. [State diagram: Taken and Not Taken edges between the four states.]

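33 2-bit Counter Predictor (Another Scheme). States: 11/ST (Strongly Taken), 10/WT (Weakly Taken), 01/WN (Weakly Not Taken), 00/SN (Strongly Not Taken). States 11 and 10 predict taken; states 01 and 00 predict not taken. [State diagram: an alternative arrangement of the Taken and Not Taken edges.]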

34 Example using 2-bit up/down counter.

    for (i=0; i<4; i++) { ... }

          addi r10, r0, 4
          add  r1, r0, r0       # i = 0
    L1:   addi r1, r1, 1
          bne  r1, r10, L1

[Table: Pred vs. Actual for the loop branch, with counter states 11/ST, 10/WT, 01/WN, 00/SN; the Actual row repeats T T T N.] 80% accuracy.
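A 2-bit saturating counter can be simulated the same way. The initial state (10, weakly taken) and the ten-outcome sequence below are assumptions chosen for illustration; with them the simulation happens to give 80%, but the slide's exact sequence is not recoverable from the transcript.

```python
def two_bit_accuracy(outcomes, counter=2):
    """2-bit saturating up/down counter: values 2 and 3 predict taken."""
    correct = 0
    for taken in outcomes:
        correct += (counter >= 2) == taken
        # Count up on taken, down on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

T, N = True, False
outcomes = [T, T, T, N, T, T, T, N, T, T]   # assumed: loop branch seen ten times
print(two_bit_accuracy(outcomes))           # 0.8
```

The hysteresis bit means a single loop exit moves the counter from 11 to 10 only, so the next loop entry is still predicted taken.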

35 Bimodal Branch Prediction. $2^N$ entries addressed by N-bit PC. Each entry keeps a counter (2-bit or more) for prediction. Counter update: the same as the 2-bit counter. [Figure: N bits of the PC address index a table of $2^N$ entries (each entry has a 2-bit counter); the selected counter gives the prediction; the FSM update logic combines it with the actual outcome to produce the table update.]
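A bimodal table is just an array of such counters indexed by low-order PC bits. In this hedged sketch the class name, the initial counter value (01, weakly not taken), and the choice to drop the two alignment bits of a 32-bit instruction address before indexing are assumptions.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, index_bits):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # assumed start: weakly not taken

    def _index(self, pc):
        return (pc >> 2) & self.mask           # assumed: skip the 2 alignment bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = BimodalPredictor(index_bits=4)
bp.update(0x400, True)
print(bp.predict(0x400))   # True: the counter moved from 01 to 10
print(bp.predict(0x440))   # True: 0x440 aliases to the same entry
print(bp.predict(0x404))   # False: different entry, still at 01
```

The second lookup shows the cost of a finite table: two branches whose addresses differ only above the index bits share one counter.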

36 Global vs. Local Branch History. Local behavior: what is the predicted direction of Branch A given the outcomes of previous instances of Branch A? Global behavior: what is the predicted direction of Branch Z given the outcomes of all* previous branches A, B, ..., X and Y? (* The number of previous branches tracked is limited by the history length.)

37 Code Snippet.

    if (aa==2)        // b1
        aa = 0;
    if (bb==2)        // b2
        bb = 0;
    if (aa!=bb) {     // b3
        ...
    }

Branch directions are not independent: branch correlation. The direction of b3 is correlated to the path taken. Example: on path 1-1, the direction of b3 can be surely known beforehand. Track the path using a 2-bit register (1 = T, 0 = N).

    Path   b1-b2   aa      bb
    A      1-1     aa=0    bb=0
    B      1-0     aa=0    bb≠2
    C      0-1     aa≠2    bb=0
    D      0-0     aa≠2    bb≠2

38 Global Branch History Register.

    if (aa==2)        // b1
        aa = 0;
    if (bb==2)        // b2
        bb = 0;
    if (aa!=bb) {     // b3
        ...
    }

Actual outcomes T T N give BHR = 110. An N-bit shift register. Shift in branch outcomes: 1 = taken, 0 = not taken. First-in, first-out. A BHR can be global or local (per-address).
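The shift-register behaviour is easy to see in code. A minimal Python sketch, assuming the newest outcome enters at the least significant bit and the oldest falls off the top (the class name is illustrative):

```python
class HistoryRegister:
    """N-bit branch history shift register: 1 = taken, 0 = not taken."""

    def __init__(self, bits):
        self.bits = bits
        self.value = 0

    def shift_in(self, taken):
        mask = (1 << self.bits) - 1
        # Shift left, insert the newest outcome, drop the oldest bit.
        self.value = ((self.value << 1) | int(taken)) & mask

    def __str__(self):
        return format(self.value, f"0{self.bits}b")

bhr = HistoryRegister(bits=3)
for taken in (True, True, False):   # outcomes T, T, N
    bhr.shift_in(taken)
print(bhr)                          # 110
bhr.shift_in(True)
print(bhr)                          # 101: the oldest T has been shifted out
```

A global BHR is one such register shared by all branches; a local scheme keeps one register per branch address.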

39 Local Branch History Register.

    for (i=0; i<4; i++) { ... }

          addi r10, r0, 4
          add  r1, r0, r0       # i = 0
    L1:   addi r1, r1, 1
          bne  r1, r10, L1

[Table: Actual outcomes of the loop branch, repeating T T T N.]

40 Two-level Branch Predictor [YehPatt91,92,93]. Generalized correlated branch predictor. The 1st level keeps branch history in the Branch History Register (BHR), which shifts left when updated; Rc is the actual branch outcome. The 2nd level segregates pattern history in the Pattern History Table (PHT). [Figure: the BHR bits Rc-k ... Rc form the branch history pattern that indexes the PHT entries; the selected entry's current state gives the prediction, and the FSM update logic produces the PHT update.]

41 Correlated Branch Predictor [PanSoRahmeh 92]. 2-bit sat. counter scheme: a hash of the branch PC selects one 2-bit counter, which gives the prediction. (2,2) correlation scheme: a hash of the branch PC selects a row of 2-bit counters, and a 2-bit shift register (global branch history), fed with subsequent branch directions, selects which counter in the row gives the prediction. (M,N) correlation scheme: M is the shift register size (# bits); N means an N-bit counter.

42 Pattern History Table. $2^N$ entries addressed by N-bit BHR. Each entry keeps a counter (2-bit or more) for prediction. Counter update: the same as the 2-bit counter. Can be initialized in alternate patterns (01, 10, 01, 10, ...). Alias (or interference) problem.

43 Two-level Branch Prediction. The 2 LSBs are insignificant for 32-bit instructions. [Figure: PC = 0x C, the BHR and the PHT; the selected PHT counter has MSB = 1: predict taken.]
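The two levels can be combined in a short simulation. This sketch simplifies the scheme to a single global BHR that directly indexes a PHT of 2-bit counters; the class name, the initial counter value (01), and the 3-bit history length are assumptions.

```python
class TwoLevelPredictor:
    """Level 1: global BHR. Level 2: PHT of 2-bit counters indexed by the BHR."""

    def __init__(self, history_bits):
        self.history_bits = history_bits
        self.bhr = 0
        self.pht = [1] * (1 << history_bits)   # assumed start: weakly not taken

    def predict(self):
        return self.pht[self.bhr] >= 2         # counter MSB = 1 means taken

    def update(self, taken):
        c = self.pht[self.bhr]
        self.pht[self.bhr] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask   # shift left on update

def run(predictor, outcomes):
    """Predict, then update, for each outcome; return the accuracy."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct / len(outcomes)

loop = [True, True, True, False]      # the 4-iteration loop branch: T T T N
p = TwoLevelPredictor(history_bits=3)
run(p, loop * 5)                      # warm-up passes
print(run(p, loop * 5))               # 1.0
```

With three history bits, every position in the T T T N pattern sees a distinct history (011, 111, 110, 101), so after warm-up even the loop exit is predicted correctly, which a single 2-bit counter cannot do.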

44 PHT Indexing. Inputs: branch address and global history. Gselect: tradeoff between more history bits and address bits. Too many bits needed in Gselect: sparse table entries. [Figure label: 4/ Insufficient History.]

45 Gshare Branch Predictor [McFarling93]. Inputs: branch address and global history. Gselect: tradeoff between more history bits and address bits; too many bits needed in Gselect: sparse table entries. Gshare: not to lose global history bits. Ex: AMD Athlon, MIPS R12000, Sun MAJC, Broadcom SiByte's SB-1. Gselect 4/4: index the PHT by concatenating the low-order 4 bits of both. Gshare 8/8: index the PHT by {branch address XOR global history}.
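The two index functions differ by one operator. In this hedged sketch, branch addresses and histories are treated as plain 8-bit patterns, and placing the address bits above the history bits in the gselect index is an assumption; the function names are illustrative.

```python
def gselect_index(branch_addr, history, addr_bits=4, hist_bits=4):
    """Gselect: concatenate low-order address bits with low-order history bits."""
    addr = branch_addr & ((1 << addr_bits) - 1)
    hist = history & ((1 << hist_bits) - 1)
    return (addr << hist_bits) | hist

def gshare_index(branch_addr, history, bits=8):
    """Gshare: XOR the branch address with the global history."""
    return (branch_addr ^ history) & ((1 << bits) - 1)

cases = [
    (0b00000000, 0b00000001),
    (0b00000000, 0b00000000),
    (0b11111111, 0b00000000),
    (0b11111111, 0b10000000),
]
for addr, hist in cases:
    print(f"{addr:08b} {hist:08b} "
          f"gselect={gselect_index(addr, hist):08b} "
          f"gshare={gshare_index(addr, hist):08b}")
```

In the last two cases gselect 4/4 maps both histories to index 11110000 because it keeps only four history bits, while gshare 8/8 separates them (11111111 vs. 01111111) with the same table size.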

46 Gshare Branch Predictor. [Figure: the PC address and the global BHR are combined to index the PHT; the selected counter has MSB = 0: predict not taken.]

Spring 2010 Prof. Hyesoon Kim. Thanks to Prof. Loh & Prof. Prvulovic
