Control Dependence, Branch Prediction

Dominick Powell
5 years ago

1 Control Dependence, Branch Prediction

2 Outline: Control dependences. Branch evaluation delay. Branch delay slot. Branch prediction: static; dynamic (correlating, local, global).

3 Control Dependences. Program correctness depends on preserving two properties: data flow and exception behaviour. Example: DADDU R2, R3, R4; BEQZ R2, L1; LW R1, 0(R2); L1: ... LW has no data dependence on BEQZ, so it could be moved before BEQZ. But what if a memory access exception occurs?

4 Control Dependences. Program correctness: data flow and exception behaviour. R1 has 2 producers; control dependence decides which one feeds the OR instruction. DADDU R1, R2, R3; BEQZ R4, L1; DSUBU R1, R5, R6; L1: ...; OR R7, R1, R8

5 Control Dependences. Program correctness: data flow and exception behaviour. Software speculation and liveness. DADDU R1, R2, R3; BEQZ R12, L1; DSUBU R4, R5, R6; DADDU R5, R4, R9; L1: OR R7, R8, R9. DSUBU can be moved before BEQZ (a control dependence violation) if R4 dies after the second DADDU, i.e. R4 is not live at L1.

6 Branch Delay. The clock cycles needed to ascertain whether NPC (the next sequential PC) is to be used, or the address obtained after the effective address calculation.

7 Branch Delay. Code: ADD R2, R3, R4; J loop; SUB R5, R5, R4; ADD R6, R8, R2; XOR R1, R3, R3. Multiple-issue pipelines; pipeline frontend. [Figure: pipeline timing diagram (time in clock cycles; stages labelled ID, EX, MEM, WB) for ADD, J, SUB and the jump successor.]

8 Branch Hazards. [Figure: pipeline timing diagram (time in clock cycles; stages IF, ID, EX, MEM, WB) with rows Branch, Instruction i + 1, Branch Target, Branch Target + 1.] 1 stall cycle for every branch yields a performance loss of 10% to 30%!

9 Branch Hazards. [Figure: pipeline timing diagram (time in clock cycles; stages IF, ID, EX, MEM, WB) with rows Branch, Instruction i + 1, Branch Target, Branch Target + 1.] Branch delay slot. From the MIPS ISA Manual: "The transfer of control takes place only following the instruction immediately after the control transfer instruction."

10 Reducing Pipeline Branch Penalties. Freeze the pipeline. Static prediction: predict taken, predict untaken. Delayed branch: fill the branch delay slot.

11 Predict Untaken Scheme. [Figure: two pipeline timing diagrams (time in clock cycles; stages IF, ID, EX, MEM, WB). Untaken case, rows: Untaken Branch, Instruction i + 1, Instruction i + 2, Instruction i + 3. Taken case, rows: Taken Branch, Instruction i + 1, Branch Target, Branch Target + 1.]

12 Branch Delay Slot Predict taken Predict untaken

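13 Control Hazards: Performance. $\text{Speedup}_{\text{pipelining}} = \dfrac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$. $\text{Stall cycles}_{\text{branches}} = \text{Branch frequency} \times \text{Branch penalty}$. Hence $\text{Speedup}_{\text{pipelining}} = \dfrac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$.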
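As a rough worked example of the speedup formula on this slide, the sketch below plugs in hypothetical numbers: a 5-stage pipeline and a 20% branch frequency, neither of which comes from the slides. It assumes branch stalls are the only stalls. The function name `pipeline_speedup` is illustrative.

```python
def pipeline_speedup(depth, branch_freq, branch_penalty):
    """Speedup = depth / (1 + branch frequency * branch penalty).

    Assumes branches are the only source of pipeline stalls.
    """
    stall_cycles_per_instruction = branch_freq * branch_penalty
    return depth / (1.0 + stall_cycles_per_instruction)


if __name__ == "__main__":
    # Hypothetical machine: 5 stages, 20% of instructions are branches.
    for penalty in (0, 1, 2, 3):
        speedup = pipeline_speedup(5, 0.20, penalty)
        print(f"penalty={penalty} cycles -> speedup={speedup:.2f}")
```

With these assumed numbers, each extra cycle of branch penalty lowers the achievable speedup, which is why the later slides focus on predicting branches rather than stalling.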

14 Branch Predictors. [Figure: pipeline without a branch predictor compared with a pipeline with a branch predictor.]

15 Static Branch Prediction

16 Dynamic Branch Prediction. Branch prediction buffer (BPB), or branch history table. Single-bit predictors (1-bit bimodal predictor): change the prediction to follow the branch's behaviour. Example: while(1) { for(i=0; i<count; i++) { } }, where the loop branch (labelled BRANCH) is at address 0x0100. Branch instruction behaviour: T T T T N T T T T T T T T T T T T. No. of wrong predictions? [Table: branch prediction buffer with columns PC, Prediction, Target; the PC column holds the addresses of branches in the program; entries as extracted: 0x x0090, 0x x0200, 0x x.] Can we do better?
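To make the "No. of wrong predictions?" question concrete, here is a minimal sketch of a 1-bit predictor for a single branch. It assumes the predictor starts out predicting taken and that the loop-closing branch is taken on every iteration except the last; the helper names and the values `count=5`, `executions=3` are hypothetical.

```python
def loop_branch_outcomes(count, executions):
    """Outcomes of a loop-closing branch: taken count-1 times, then not taken."""
    return ([True] * (count - 1) + [False]) * executions


def one_bit_mispredictions(outcomes, initial_prediction=True):
    """Count wrong predictions of a 1-bit predictor (predict the last outcome)."""
    prediction = initial_prediction
    wrong = 0
    for taken in outcomes:
        if prediction != taken:
            wrong += 1
        prediction = taken  # a single bit remembers only the latest outcome
    return wrong


if __name__ == "__main__":
    outcomes = loop_branch_outcomes(count=5, executions=3)
    print(one_bit_mispredictions(outcomes))  # prints 5
```

Under these assumptions the predictor is wrong twice per loop execution after the first: once at loop exit and once more on re-entry, because the exit flipped the bit.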

17 Dynamic Branch Prediction 2-bit predictors 2-bit Bimodal Saturating Counter 0010 Branch Prediction Buffer

18 Dynamic Branch Prediction 2-bit predictors... T T T N T T T T T T T T...

24 Dynamic Branch Prediction 2-bit predictors Entries = No. of bits in the BPB? n-bit saturating counters
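The behaviour of an n-bit saturating counter can be sketched as follows. This is an illustrative model for one branch, assuming the counter starts saturated at "strongly taken" and predicts taken when it is in the upper half of its range; the function names and the 4096-entry figure in the usage lines are assumptions, not values from the slide.

```python
def saturating_counter_mispredictions(outcomes, bits=2, counter=None):
    """Count wrong predictions of one n-bit saturating counter."""
    top = (1 << bits) - 1          # saturation value, e.g. 3 for 2 bits
    threshold = 1 << (bits - 1)    # predict taken when counter >= threshold
    if counter is None:
        counter = top              # assumption: start at "strongly taken"
    wrong = 0
    for taken in outcomes:
        if (counter >= threshold) != taken:
            wrong += 1
        # Move toward the observed outcome, saturating at 0 and top.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return wrong


def bpb_bits(entries, bits):
    """Storage for the counters alone: one n-bit counter per entry."""
    return entries * bits


if __name__ == "__main__":
    # Sequence shown on the 2-bit predictor slides: T T T N T T T T T T T T
    seq = [True, True, True, False] + [True] * 8
    print(saturating_counter_mispredictions(seq, bits=1))  # prints 2
    print(saturating_counter_mispredictions(seq, bits=2))  # prints 1
    print(bpb_bits(4096, 2))                               # prints 8192
```

The 2-bit counter absorbs the single not-taken outcome without flipping its prediction, which is the improvement over the 1-bit scheme.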

25 Paper Reading Scott McFarling, Combining Branch Predictors, WRL Technical Note TN-36, June 1993.

26 Branch Prediction Buffer: BIMODAL PREDICTOR. 12 bits of branch PC A index a table of counters and target PCs. Two different branch PCs (A and B) may map to the same entry in the BPB. Given limited buffer space, how can one maximize buffer entries while minimizing aliases?
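The aliasing described on this slide can be illustrated with a small indexing sketch. It assumes 4-byte aligned instructions, so the two low PC bits are dropped before taking the 12 index bits; that offset, the function name `bpb_index`, and the example addresses are assumptions.

```python
INDEX_BITS = 12  # from the slide: 12 PC bits select a BPB entry


def bpb_index(pc, index_bits=INDEX_BITS, offset_bits=2):
    """Select a BPB entry from the low-order PC bits.

    offset_bits=2 assumes 4-byte aligned instructions (an assumption here).
    """
    return (pc >> offset_bits) & ((1 << index_bits) - 1)


if __name__ == "__main__":
    pc_a = 0x0100
    pc_b = 0x4100  # differs from pc_a only above the indexed bits
    print(hex(bpb_index(pc_a)), hex(bpb_index(pc_b)))  # prints 0x40 0x40
```

Both branches land on entry 0x40, so they share one counter and can disturb each other's predictions.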

27 Observations? Dynamic Branch Prediction

28 Correlating Branch Predictors. eqntott code: if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { Two-level predictors. A (1,2) predictor uses the outcome of the previous branch (X) to select among 2-bit counters in the branch prediction buffer. Yeh and Patt, Alternative implementations of two-level adaptive branch prediction, ISCA, 1992.

29 Correlating Branch Predictors. if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { The outcomes of the previous 2 branches (XX) select among the counters in the branch prediction buffer. A 4096-bit (2,2) buffer supports how many branch instructions? For an (m,n) predictor, BPB bits $= 2^m \times n \times$ (No. of prediction entries). Yeh and Patt, Alternative implementations of two-level adaptive branch prediction, ISCA, 1992.
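The sizing question on this slide follows directly from the (m,n) formula: each prediction entry holds 2^m counters of n bits. A minimal sketch, with the hypothetical helper name `prediction_entries`:

```python
def prediction_entries(total_bits, m, n):
    """Entries in an (m,n) predictor: total_bits / (2**m * n)."""
    bits_per_entry = (2 ** m) * n  # 2**m counters of n bits per entry
    return total_bits // bits_per_entry


if __name__ == "__main__":
    print(prediction_entries(4096, m=2, n=2))  # prints 512
    print(prediction_entries(4096, m=0, n=2))  # prints 2048
```

So a 4096-bit (2,2) buffer has 512 entries, a quarter of what the same storage gives a plain 2-bit predictor with no global history (m = 0).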

30 Correlating Branch Predictors Yeh and Patt, ISCA, 1992.

31 BPB: GLOBAL PREDICTOR. 10 bits of branch PC A are concatenated with 2 bits of global history to form a 12-bit index into the table of counters and target PCs. Aliases may appear.

32 BPB Aliases: GLOBAL PREDICTOR. 12 bits of branch PC A are XORed with 12 bits of global history to index the table of counters and target PCs. This may reduce aliases.
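The two indexing schemes on the global predictor slides can be compared with a short sketch. The bit widths (10 + 2 concatenated, 12 XOR 12) come from the slides; the order of concatenation, the dropped 2-bit instruction offset, the function names, and the example addresses are assumptions.

```python
def concat_index(pc, history, pc_bits=10, hist_bits=2):
    """12-bit index: 10 PC bits concatenated with 2 global history bits."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)   # assume 4-byte instructions
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part    # assumed order: PC bits high


def xor_index(pc, history, bits=12):
    """12-bit index: 12 PC bits XORed with 12 global history bits."""
    mask = (1 << bits) - 1
    return ((pc >> 2) ^ history) & mask


if __name__ == "__main__":
    history = 0b11
    pc_a, pc_b = 0x0100, 0x1100  # hypothetical branch addresses
    # Concatenation keeps only 10 PC bits, so these two branches collide.
    print(hex(concat_index(pc_a, history)), hex(concat_index(pc_b, history)))
    # XOR uses all 12 PC bits, so the same pair maps to different entries.
    print(hex(xor_index(pc_a, history)), hex(xor_index(pc_b, history)))
```

XOR indexing does not eliminate aliasing: two different (PC, history) pairs can still XOR to the same value. It only spreads branches over the table using more PC and history bits for the same table size.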

33 Local Predictors. A history of each branch's behaviour is recorded. The branch prediction buffer holds one predictor for each possible combination of outcomes for the last n occurrences of this branch, and is indexed by the previous outcomes of the same branch (XXXXXXXXXX). K entries.

34 Local Predictors: Example. while(1) { for(i=0; i<count; i++) { } }, where the loop branch (labelled BRANCH) is at address 0x0100. Branch instruction behaviour: T T N T T T T N T T T T N T T. [Table: local history against prediction; predictions as extracted: NT NT T T T T T T T T NT NT.]

35 BPB: LOCAL PREDICTOR. [Figure: branch PC A (6 bits), local history (12b), and a table of counters and target PCs; entry counts not recoverable.]

36 BPB: LOCAL PREDICTOR. [Figure: branch PC A (12 bits and 6 bits), local history (12b), and a table of counters and target PCs; entry counts not recoverable.] May reduce aliases.
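A two-level local predictor for a single branch can be sketched as below. It is a simplified model, not the exact structure in the figure: one local history register selects among 2-bit counters, all counters start at "strongly taken", and the history length of 2 bits and the repeating T T N pattern are hypothetical choices made to keep the trace short.

```python
def local_predictor_mispredictions(outcomes, history_bits=2):
    """One branch: its own recent outcomes select a 2-bit counter."""
    mask = (1 << history_bits) - 1
    history = 0                          # last outcomes, 1 = taken
    counters = [3] * (1 << history_bits) # assumption: start strongly taken
    wrong = 0
    for taken in outcomes:
        counter = counters[history]
        if (counter >= 2) != taken:
            wrong += 1
        counters[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return wrong


def two_bit_mispredictions(outcomes, counter=3):
    """Baseline: a single 2-bit saturating counter with no history."""
    wrong = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            wrong += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return wrong


if __name__ == "__main__":
    pattern = [True, True, False] * 10   # hypothetical repeating T T N
    print(local_predictor_mispredictions(pattern))  # prints 2
    print(two_bit_mispredictions(pattern))          # prints 10
```

Once the counter selected by history "T T" has learned that the next outcome is not taken, the local predictor stops mispredicting the loop exit, whereas the history-free counter keeps missing it once per period.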

37 Tournament Predictors. Use multiple predictors: global, local, or a mix. Combine them with a selector: a 2-bit saturating counter selects the right predictor for the branch (global vs. local). SPEC INT: the global predictor is chosen 40% of the time. SPEC FP: the global predictor is chosen 15% of the time. [Figure: the branch PC feeds a local predictor and a global predictor; a predictor selector chooses the final prediction.]
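The selector logic can be sketched as a 2-bit saturating counter that moves toward whichever predictor was right when the two disagree in correctness. The encoding (values 2 and 3 mean "use global"), the starting value, and the function names are assumptions made for illustration.

```python
def choose(selector, local_prediction, global_prediction):
    """Pick a prediction: selector >= 2 means trust the global predictor."""
    return global_prediction if selector >= 2 else local_prediction


def update_selector(selector, local_correct, global_correct):
    """Move toward the predictor that was right; no change on a tie."""
    if global_correct and not local_correct:
        return min(selector + 1, 3)
    if local_correct and not global_correct:
        return max(selector - 1, 0)
    return selector


if __name__ == "__main__":
    selector = 1  # assumption: start at "weakly local"
    # (local_correct, global_correct) for three successive executions
    for local_ok, global_ok in [(False, True), (False, True), (True, True)]:
        selector = update_selector(selector, local_ok, global_ok)
    print(selector)                      # prints 3
    print(choose(selector, False, True)) # prints True
```

In a full predictor there would be one such selector per table entry, indexed by branch PC, so different branches can settle on different component predictors.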

38 Tournament Predictors

39 Intel Core i7 Branch Predictor. Uses 2 tournament predictors: a smaller first-level predictor and a larger second-level backup predictor. Each predictor combines a simple 2-bit predictor, a global history predictor, and a loop exit predictor (which counts loop iterations). One of the three predictors is chosen per branch.

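41 References. Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th ed., Chapter 3. Scott McFarling, Combining Branch Predictors, WRL Technical Note TN-36, June 1993.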

42 Branch Delay. Code: ADD R2, R3, R4; J loop; SUB R5, R5, R4; ADD R6, R8, R2; XOR R1, R3, R3. [Figure: pipeline timing diagram (time in clock cycles) for ADD, J, SUB, ADD, XOR and the jump successor.]

43 Branch Delay. Code: ADD R2, R3, R4; J loop; SUB R5, R5, R4; ADD R6, R8, R2; XOR R1, R3, R3. What is the CPI? What is the throughput of this pipeline? [Figure: pipeline timing diagram (time in clock cycles; stages labelled ID, EX, MEM, WB) for ADD, J, SUB, ADD, XOR and the jump successor.]

44 Branch Delay. Code: ADD R2, R3, R4; J loop; SUB R5, R5, R4; ADD R6, R8, R2; XOR R1, R3, R3. Branch delay: the clock cycles needed to ascertain whether NPC is to be used, or the address obtained after the effective address calculation. [Figure: pipeline timing diagram (time in clock cycles; stages labelled ID, EX, MEM, WB) for ADD, J, SUB, ADD, XOR and the jump successor.]
