Quiz 5 Mini project #1 solution Mini project #2 assigned Stalling recap Branches!

Control Hazards

Today: Quiz 5, Mini project #1 solution, Mini project #2 assigned, stalling recap, branches!

Key Points: Control Hazards. Control hazards occur when we don't know which instruction to execute next. They are mostly caused by branches. Strategies for dealing with them: stall, or guess! Guessing leads to speculation and to flushing the pipeline, plus strategies for making better guesses. Understand the difference between a stall and a flush.

Ideal operation. [Pipeline diagram: one instruction enters per cycle and every stage stays busy.]

Stalling for Load
Load $s1, 0($s1)
Addi $t1, $s1, 4
[Pipeline diagram: the load proceeds through Fetch, Decode, Exec, Mem, and Write back; the addi is held in Decode.] Only four stages are occupied. What's in Mem? All stages of the pipeline earlier than the stall stand still. To stall, we insert a noop in place of the instruction and freeze the earlier stages of the pipeline.

Inserting Noops
Load $s1, 0($s1)
Addi $t1, $s1, 4
[Pipeline diagram: an inserted noop occupies Mem and then Write back while the addi repeats Decode.] The noop is in Mem. To stall, we insert a noop in place of the instruction and freeze the earlier stages of the pipeline.

Control Hazards
add $s1, $s3, $s2
sub $s6, $s5, $s2
beq $s6, $s7, somewhere (computing the new PC)
and $s2, $s3, $s1

Computing the PC. For a non-branch instruction, PC = PC + 4. When is the PC ready? No hazard: PC + 4 can be computed during Fetch.

Computing the PC. Branch instructions: bne $s1, $s2, offset means if ($s1 != $s2) { PC = PC + offset; } else { PC = PC + 4; }. When is the value ready?

Solution #1: Stall on branches. Stalling worked for loads and ALU ops. But wait! When do we know whether the instruction is a branch? Not until decode, so we would have to stall on every instruction.

There is a constant control hazard: we don't even know what kind of instruction we have until decode. What do we do?

Smart ISA design
add $s0, $t0, $t1
sub $t2, $s0, $t3
Make it very easy to tell whether an instruction is a branch -- maybe a single bit or just a couple. Decoding these bits is nearly trivial. In MIPS the branches and jumps are opcodes 0-7, so if the high-order bits of the opcode are zero, it's a control instruction.
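This decode shortcut can be sketched in a few lines of Python. The opcode values are standard MIPS encodings; the helper name is ours:

```python
# A sketch of the early branch-detect check described above: the MIPS
# opcode is the top 6 bits of the instruction word, and the control
# instructions (j, jal, beq, bne, blez, bgtz, plus the R-type group that
# contains jr) have opcodes 0-7, so the top 3 opcode bits are all zero.
def might_be_control(instr_word: int) -> bool:
    opcode = (instr_word >> 26) & 0x3F   # top 6 bits of the 32-bit word
    return (opcode >> 3) == 0            # true for opcodes 0 through 7

# bne has opcode 5 (0b000101); addi has opcode 8 (0b001000)
print(might_be_control(0b000101 << 26))  # True
print(might_be_control(0b001000 << 26))  # False
```

Checking three bits is far cheaper than a full decode, which is why fetch can afford to do it.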

Dealing with Branches: Option 1 -- stall
sll $s4, $t6, $t5
bne $t2, $s0, somewhere
add $s0, $t0, $t1
and $s4, $t0, $t1
[Pipeline diagram: the instructions after the bne repeat Fetch until the branch resolves; while stalled, no instructions are in Decode or Execute.] What does this do to our CPI? Speedup?

Performance impact of stalling
ET = I * CPI * CT. Branches are about 1 in 5 instructions. What's the CPI for branches? 1 + 2 = 3. Amdahl's law: speedup = 1/(.2/(1/3) + .8) = 0.714. ET = 1 * (.2*3 + .8*1) * 1 = 1.4
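The arithmetic above can be checked with a few lines of Python (the variable names are ours):

```python
# ET = I * CPI * CT with I = 1 and CT = 1. Branches are 1 in 5
# instructions, and a stalled branch takes 3 cycles (1 + 2 stall cycles).
branch_frac = 0.2
cpi = branch_frac * 3 + (1 - branch_frac) * 1
et = 1 * cpi * 1
print(round(cpi, 3), round(1 / et, 3))   # 1.4 0.714
```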

Option 2: The compiler. Use branch delay slots: the next N instructions after a branch are always executed, much like load-delay slots. Good: simple hardware. Bad: N cannot change. MIPS has one branch delay slot. It's a big pain!

Delay slots
bne $t2, $s0, somewhere (taken)
add $t2, $s4, $t1 (branch delay slot)
...
somewhere: sub $t2, $s0, $t3
[Pipeline diagram: the delay-slot add completes Mem and Write back even though the branch is taken; the sub at the target is fetched next.]

Option 3: Simple Prediction. Can a processor tell the future? For not-taken branches, the new PC is ready immediately, so let's just assume the branch is not taken. This is also called branch prediction or control speculation. What if we are wrong?

Predict Not-taken
bne $t2, $s0, somewhere
sll $s4, $t6, $t5
bne $t2, $s4, else
and $s4, $t0, $t1
add $s0, $t0, $t1
...
else: sub $t2, $s0, $t3
[Pipeline diagram: the and and the add are squashed once the second bne resolves taken, and the sub at else is fetched.] We start the add and the and, and then, when we discover the branch outcome, we squash them -- we turn each into a noop. These two instructions become noops.

Simple Static Prediction. "Static" means decided before run time. Many prediction schemes are possible. Predict taken -- pros? Loops are common. Predict not-taken -- pros? Not all branches are for loops. Backward taken / forward not taken: the best of both worlds.

Implementing Backward Taken / Forward Not Taken
[Datapath diagram, garbled in transcription: the standard pipeline datapath (PC, instruction memory, register file, ALU, data memory) with the next-PC selection logic for the branch highlighted.]

Implementing Backward Taken / Forward Not Taken
[Datapath diagram: the sign bit of the offset and the branch comparison result feed the next-PC logic; the target is computed with a sign extend, a shift left by 2, and an adder; a bubble is inserted to flush on a misprediction.]

Implementing Backward Taken / Forward Not Taken: changes in control. New inputs to the control unit: the sign of the offset and the result of the branch. New output from control: the flush signal, which inserts noop bits into the datapath and control.

Performance Impact
ET = I * CPI * CT. Backward taken / forward not taken is 80% accurate, branches are 20% of instructions, and changing the front end increases the cycle time by 10%. What is the speedup of Bt/Fnt compared to just stalling on every branch?
Btfnt: CPI = 0.2*0.2*(1 + 2) + (1 - 0.2*0.2)*1 = 1.08; CT = 1.1; ET = 1.188
Stall: CPI = .2*3 + .8*1 = 1.4; CT = 1; ET = 1.4
Speedup = 1.4/1.188 = 1.18

The Importance of Pipeline Depth. Two parameters of the pipeline determine the impact of branches on performance: branch decode time -- how many cycles it takes to identify a branch (in our case, less than 1) -- and branch resolution time -- cycles until the real branch outcome is known (in our case, 2 cycles).

Pentium 4 pipeline (Willamette). Branches take 19 cycles to resolve. Identifying a branch takes 4 cycles, and the P4 fetches 3 instructions per cycle, so stalling is not an option. Pentium 4 pipelines peaked at 31 stages! Current CPUs have about 12-14 stages.

Performance Impact
ET = I * CPI * CT. Backward taken / forward not taken is 80% accurate, branches are 20% of instructions, and changing the front end increases the cycle time by 10%. What is the speedup of Bt/Fnt compared to just stalling on every branch?
Btfnt: CPI = .2*.2*(1 + 2) + (1 - .2*.2)*1 = 1.08; CT = 1.1; ET = 1.188
Stall: CPI = .2*3 + .8*1 = 1.4; CT = 1; ET = 1.4
Speedup = 1.4/1.188 = 1.18. What if the resolution time were 20 cycles?

Performance Impact
ET = I * CPI * CT. With a 20-cycle branch resolution time:
Btfnt: CPI = .2*.2*(1 + 20) + (1 - .2*.2)*1 = 1.8; CT = 1.1; ET = 1.98
Stall: CPI = .2*(1 + 20) + .8*1 = 5; CT = 1; ET = 5
Speedup = 5/1.98 = 2.53
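Both comparisons (2-cycle and 20-cycle resolution) can be reproduced with a small model. The function names and the parameterization are ours; correctly predicted branches are charged 1 cycle, as in the slides:

```python
# Execution time per instruction for predict backward-taken/forward-not-
# taken (80% accurate, 10% slower clock) versus stalling on every branch.
def et_btfnt(penalty, branch_frac=0.2, accuracy=0.8, ct=1.1):
    mis = branch_frac * (1 - accuracy)           # mispredicted fraction of insts
    return (mis * (1 + penalty) + (1 - mis) * 1) * ct

def et_stall(penalty, branch_frac=0.2):
    return branch_frac * (1 + penalty) + (1 - branch_frac) * 1

print(round(et_stall(2) / et_btfnt(2), 2))    # 1.18 with a 2-cycle penalty
print(round(et_stall(20) / et_btfnt(20), 2))  # 2.53 with a 20-cycle penalty
```

The longer the pipe, the more a mispredict costs, so prediction pays off more as resolution time grows.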

Dynamic Branch Prediction. Long pipes demand higher accuracy than static schemes can deliver. Instead of making the guess once, make it every time we see the branch: predict future behavior based on past behavior.

Predictable control. Use previous branch behavior to predict future branch behavior. When is branch behavior predictable?
Loops -- for(i = 0; i < 10; i++) {} gives 9 taken branches and 1 not-taken branch. All 10 are pretty predictable.
Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }. The branch is always taken or always not taken.
Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}.
Function calls -- LibraryFunction() converts to a jr (jump register) instruction, but the target is always the same. BaseClass *t; t->SomeVirtualFunction(); t is usually of the same subclass, so the call will usually go to the same function.

Dynamic Predictor 1: The Simplest Thing. Predict that this branch will go the same way as the previous one. Pros? Dead simple: keep one bit in the fetch stage. It works OK for simple loops, and the compiler might be able to arrange things to make it work better. Cons? An unpredictable branch in a loop will mess everything up, and it can't tell the difference between branches.

Dynamic Prediction 2: A table of bits. Give each branch its own bit in a table, and look up the prediction bit for the branch. How big does the table need to be? Infinite! Bigger is better, but don't mess with the cycle time. Index into it using the low-order bits of the PC. Pros: it can differentiate between branches, so bad behavior by one won't mess up others... mostly (why not always?). Cons: accuracy is still not great.

Dynamic Prediction 2: A table of bits
for(i = 0; i < 10; i++) { for(j = 0; j < 4; j++) { } }
What's the accuracy for the inner loop's branch?

iteration  actual     prediction  new prediction
1          taken      not taken   taken
2          taken      taken       taken
3          taken      taken       taken
4          not taken  taken       not taken
1          taken      not taken   taken
2          taken      taken       taken
3          taken      taken       taken

50% misprediction, or 2 mispredicts per execution of the loop.

Dynamic Prediction 3: A table of counters. Instead of a single bit, keep two. This gives four possible states. Taken branches move the state to the right; not-taken branches move it to the left.

state  meaning             prediction
00     strongly not taken  not taken
01     weakly not taken    not taken
10     weakly taken        taken
11     strongly taken      taken

The net effect is that we wait a bit before changing our mind.
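One such counter can be sketched directly from the state table; the class name and starting state are our choices:

```python
# A minimal sketch of one 2-bit saturating counter. States 0-1 predict
# not taken, states 2-3 predict taken; a taken outcome moves the state
# right (saturating at 3), a not-taken outcome moves it left (saturating
# at 0).
class TwoBitCounter:
    def __init__(self, state=2):          # start weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Fed the inner loop's taken, taken, taken, not-taken pattern from a weakly-taken start, it gets 3 of every 4 branches right, matching the 25% misprediction rate worked out below.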

Dynamic Prediction 3: A table of counters
for(i = 0; i < 10; i++) { for(j = 0; j < 4; j++) { } }
What's the accuracy for the inner loop's branch? (Start in weakly taken.)

iteration  actual     state           prediction  new state
1          taken      weakly taken    taken       strongly taken
2          taken      strongly taken  taken       strongly taken
3          taken      strongly taken  taken       strongly taken
4          not taken  strongly taken  taken       weakly taken
1          taken      weakly taken    taken       strongly taken
2          taken      strongly taken  taken       strongly taken
3          taken      strongly taken  taken       strongly taken

25% misprediction, or 1 mispredict per execution of the loop.

Two-bit Prediction. The two-bit prediction scheme is used very widely and in many ways. Make a table of 2-bit predictors, devise a map that associates a 2-bit predictor with each dynamic branch, and use that predictor to make each prediction. The map does not need to be perfect: it's OK if multiple branches share a single 2-bit predictor. These collisions will reduce accuracy, but they won't affect correctness. In the previous example we associated predictors with branches using the PC. We'll call this per-PC prediction.

Mapping Branches to Predictors
[Diagram: the branch prediction inputs (program counter, branch history, branch outcome, weather report) feed a predictor selector; the selector computes a predictor index into the predictor table(s); the chosen predictor supplies the prediction, and the resolved branch outcome later updates both the predictor and the pipeline registers.]

Associating Predictors with Branches: using the low-order PC bits. When is branch behavior predictable, and how does per-PC prediction do?
Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch; all 10 are pretty predictable. OK -- we miss one per loop.
Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }: the branch is always taken or always not taken. Good.
Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}. Poor -- no help.
Function calls -- LibraryFunction() converts to a jr, but the target is always the same; BaseClass *t; t->SomeVirtualFunction() will usually call the same function. Not applicable.

Predicting Loop Branches Revisited
for(i = 0; i < 10; i++) { for(j = 0; j < 4; j++) { } }

iteration  actual
1          taken
2          taken
3          taken
4          not taken
(the pattern repeats)

What's the pattern we need to identify?

Dynamic Prediction 4: Global branch history. Instead of using the PC to choose the predictor, use a bit vector made up of the previous branch outcomes. In steady state:

iteration          actual     history  prediction
outer loop branch  taken      11110    taken
1                  taken      11101    taken
2                  taken      11011    taken
3                  taken      10111    taken
4                  not taken  01111    not taken
(the pattern repeats)

Nearly perfect.

Dynamic Prediction 4: Global branch history. How long should the history be? Infinite is a bad choice: we would learn nothing. Long is not always good either: we might get confused by extraneous information. Imagine N bits of history and a loop that executes K iterations. If K <= N, history will do well. If K > N, history will do poorly, since the history register will always be all 1s for the last K - N iterations, and we will mispredict the last branch.
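The scheme can be sketched as a history shift register indexing a table of 2-bit counters. The 5-bit history matches the example above; the names and the weakly-taken initialization are our choices:

```python
# Sketch of a global-history predictor: a shift register of the last N
# branch outcomes selects a 2-bit counter, so each distinct recent
# history gets its own prediction.
N = 5
history = 0                       # last N outcomes, newest in the low bit
counters = [2] * (1 << N)         # 2-bit counters, initialized weakly taken

def predict():
    return counters[history] >= 2

def update(taken):
    global history
    if taken:
        counters[history] = min(3, counters[history] + 1)
    else:
        counters[history] = max(0, counters[history] - 1)
    history = ((history << 1) | int(taken)) & ((1 << N) - 1)
```

Trained on the 4-iteration inner loop, each of the four steady-state histories is always followed by the same outcome, which is why the table converges to near-perfect prediction.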

Global History and Correlated Control
a = 11; b = <something usually larger than a>
if (a > 10) {}
if (b > 10) {}
Global history can notice the correlation and predict the second branch. Another example: if (rare case) {} ... if (rare case) {} ... if (rare case) {}

Associating Predictors with Branches: global history. When is branch behavior predictable, and how does global history do?
Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch; all 10 are pretty predictable. Pretty good, as long as the history is not too long (or too short).
Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }: the branch is always taken or always not taken. Not so great.
Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}. Good.
Function calls -- LibraryFunction() converts to a jr, but the target is always the same; BaseClass *t; t->SomeVirtualFunction() will usually call the same function. Not applicable.

Other Ways of Identifying Branches: local branch history. Use a table of history registers (say 128), indexed by the low-order bits of the PC. Also use the PC to choose between 128 predictor tables, each indexed by the history for that branch.
[Diagram: 7 bits of the PC select one of the 128 history registers and one of predictor tables 0 through 127; the selected history indexes within that table to produce the prediction.]

Local History. For some loops local history is better: Foo() { for(i = 0; i < 10; i++) { <unpredictable branches> } }. If Foo is called from many places, or the loop body has unpredictable control, the global history will be polluted. Local history gives the loop's branch its own history.
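The two-level local scheme can be sketched the same way as the global one; the table sizes follow the slide (128 entries), while the 5-bit history length and the names are our assumptions:

```python
# Sketch of the per-branch ("local") history scheme: the low 7 PC bits
# pick both a history register and a predictor table; the history then
# indexes a 2-bit counter within that table.
HBITS = 5
histories = [0] * 128
tables = [[2] * (1 << HBITS) for _ in range(128)]   # 2-bit counters

def local_predict(pc):
    i = (pc >> 2) & 0x7F          # low-order PC bits (word-aligned)
    return tables[i][histories[i]] >= 2

def local_update(pc, taken):
    i = (pc >> 2) & 0x7F
    h = histories[i]
    tables[i][h] = min(3, tables[i][h] + 1) if taken else max(0, tables[i][h] - 1)
    histories[i] = ((h << 1) | int(taken)) & ((1 << HBITS) - 1)
```

Because each branch owns its history register, another branch's outcomes cannot shuffle it, which is exactly the pollution problem described above.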

Other Ways of Identifying Branches. All these schemes have different pros and cons and will work better or worse for different branches. How do we get the best of all possible worlds? Build them all, and have a predictor decide which one to use on a given branch: for each branch, make all the different predictions and keep track of which predictor is most often correct, then use that predictor's prediction for future instances of the branch. Do this on a per-branch basis. This has been studied to death, for good reason: without good branch prediction, performance drops by at least 90%.
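The "predictor of predictors" can itself be a table of 2-bit counters. This sketch assumes two component predictors, A and B, and a 128-entry chooser; all names are ours:

```python
# Sketch of a tournament chooser: per-branch 2-bit "meta" counters track
# which of two component predictors has been right more often lately.
meta = [2] * 128                  # >= 2 means "trust predictor A"

def choose(pc, pred_a, pred_b):
    return pred_a if meta[pc & 0x7F] >= 2 else pred_b

def train_meta(pc, pred_a, pred_b, taken):
    i = pc & 0x7F
    if pred_a == taken and pred_b != taken:   # only move when they disagree
        meta[i] = min(3, meta[i] + 1)
    elif pred_b == taken and pred_a != taken:
        meta[i] = max(0, meta[i] - 1)
```

The meta counter only moves when the two predictors disagree, so a branch both handle well leaves the choice untouched.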

Other Prediction Schemes. The ACM Portal (an enormous index of CS research papers) lists 9,000 papers on branch prediction: neural networks, data compression, tournament-based predictors, predictors that use analog circuits to make predictions, variable-length history prediction, confidence prediction...

Predicting Function Calls: Branch Target Buffers (BTB). The name is unfortunate, since it's really a jump target buffer. Use a table, indexed by PC, that stores the last target of the jump. When you fetch a jump, start executing at the address in the BTB. Update the BTB when you find out the correct destination.
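A BTB can be sketched as a small direct-mapped table. The 64-entry size, the dict standing in for SRAM, and the fall-through default are all our assumptions:

```python
# Sketch of a BTB: fetch looks up the current PC; a hit supplies a
# predicted target immediately, and the entry is (re)written once the
# jump's real destination is known.
BTB_SIZE = 64
btb = {}                                  # index -> (tag, target)

def btb_lookup(pc):
    entry = btb.get((pc >> 2) % BTB_SIZE)
    if entry and entry[0] == pc:          # tag check avoids aliased PCs
        return entry[1]                   # predicted target
    return pc + 4                         # no prediction: fall through

def btb_update(pc, target):
    btb[(pc >> 2) % BTB_SIZE] = (pc, target)
```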

Associating Predictors with Branches: BTB. When is branch behavior predictable, and how does the BTB do?
Loops -- for(i = 0; i < 10; i++) {}: 9 taken branches, 1 not-taken branch; all 10 are pretty predictable. Not applicable.
Run-time constants -- Foo(int v) { for (i = 0; i < 1000; i++) { if (v) {...} } }: the branch is always taken or always not taken. Not applicable.
Correlated control -- a = 10; b = <something usually larger than a>; if (a > 10) {} if (b > 10) {}. Not applicable.
Function calls -- LibraryFunction() converts to a jr (jump register) instruction, but the target is always the same; BaseClass *t; t->SomeVirtualFunction() will usually call the same function. Pretty good.

Predicting Returns. Function call returns are hard to predict with a BTB: the return address is different for every call site, so the BTB will do a poor job, since it's indexed by PC. Instead, maintain a return stack predictor: keep a stack of return targets; jal pushes $ra onto the stack; fetch predicts the target of a return instruction by popping an address off the stack. This doesn't work directly in MIPS, because there is no dedicated return instruction. It's also not always correct, because the return stack might not be large enough.
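A return address stack is only a few lines. The depth of 8 is an arbitrary choice, and this sketch simply drops pushes when full (real designs typically wrap and overwrite):

```python
# Sketch of a return address stack: calls push their return address, and
# fetch predicts a return's target by popping. A fixed depth is why deep
# call chains and recursion can defeat it.
RAS_DEPTH = 8
ras = []

def on_call(pc):
    # In MIPS, jal writes pc + 8 into $ra (the instruction after the
    # delay slot); we mirror that here.
    if len(ras) < RAS_DEPTH:
        ras.append(pc + 8)

def predict_return():
    return ras.pop() if ras else None     # None: fall back to the BTB
```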

Interference. Our schemes for associating branches with predictors are imperfect: different branches may map to the same predictor and pollute it. This is called destructive interference. Using larger tables will (typically) reduce this effect.

Performance Impact of Short History. A loop has 5 instructions, including the branch. The misprediction penalty is 7 cycles and the baseline CPI is 1. What is the speedup of the global-history predictor vs. the per-PC predictor if the loop executes 4 iterations and we keep 5 history bits? If it executes 40 iterations and we keep 41 history bits?

4 iterations: the per-PC misprediction rate is 25%, since per-PC prediction mispredicts once per execution of the loop. 1/5 of the dynamic instructions are branches, so the fraction of instructions that cause a mispredict is F = 1/5 * 0.25 = 0.05. CPI = 1*(1 - F) + (1 + MissPenalty)*F = 1*(1 - 0.05) + (1 + 7)*(0.05) = 1.35. The global-history misprediction rate is 0%, so its CPI is 1 and the speedup is 1.35.
40 iterations: the per-PC misprediction rate is 2.5%, again one mispredict per execution of the loop. F = 1/5 * 0.025 = 0.005. CPI = 1*(1 - 0.005) + (1 + 7)*(0.005) = 1.035. The global-history misprediction rate is 0%, so the speedup is 1.035.
With more iterations, the benefit of history decreases, so a shorter history is OK.
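The arithmetic above fits in one small function (names are ours):

```python
# A 5-instruction loop body (1/5 of dynamic instructions are branches),
# a 7-cycle mispredict penalty, baseline CPI of 1, and one per-PC
# mispredict per execution of the loop.
def per_pc_cpi(iters, penalty=7, branch_frac=1 / 5):
    mispredict_rate = 1 / iters              # one miss per loop execution
    f = branch_frac * mispredict_rate        # fraction of insts mispredicted
    return 1 * (1 - f) + (1 + penalty) * f

print(round(per_pc_cpi(4), 3), round(per_pc_cpi(40), 3))   # 1.35 1.035
```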