ECE 571 Advanced Microprocessor-Based Design Lecture 7 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 9 February 2016
HW2 Grades Ready Announcements HW3 Posted be careful when cut and pasting with the backslashes 1
The Branch Problem With a pipelined processor, you may have to stall waiting until branch outcome known before you can correctly fetch the next instruction Conditional branches are common. 5th instruction [cite?] On average every What can you do to speed things up? 2
Branch Prediction One solution is speculative execution. Guess which way branch goes. Good branch predictors have a 95% or higher hit rate Downsides? If wrong, in-flight thrown out, have to replay. Speculation wastes power 3
Branch Predictor Implementations 4
Static Prediction of Conditional Branches Backward taken Forward not taken Can be used as fallback when nothing else is available 5
Common Access Patterns For Loop for (i =0;i <100; i ++) SOMETHING ; mov r1,#0 label : SOMETHING add r1,r1,#1 cmp r1,#100 bne label 6
for Branch Behavior No branch predictor 100 stalls Other way to avoid problem (branch delay slot on MIPS) Static prediction BTFN 99 times predicted right, 1 time wrong (exit) 99% correct predict rate 7
Common Access Patterns While Loop x =0; while (x <100) { SOMETHING ; x ++;} mov r1,#0 label : cmp r1,#100 bge done SOMETHING add r1,r1,#1 b label done : 8
while Branch Behavior No branch predictor 100 stalls (unless branch delay slot) Static BTFN prediction 99 times predicted right, 1 time wrong (exit) 99% correct predict rate Optimizing compiler may translate this to a for loop (why?) 9
Common Access Patterns Do/While Loop x =0; do { SOMETHING ; x ++;} while (x <100); mov r1,#0 label : SOMETHING add r1,r1,#1 cmp r1,#100 blt label b done : label 10
while Do/While Behavior No branch predictor 100 stalls (unless branch delay slot) Static BTFN prediction 99 times predicted right, 1 time wrong (exit) 99% correct predict rate 11
Notes Optimizing compiler will optimize all above to same for loop (tried it). Why? Because loop unrolling becomes possible? 12
Common Access Patterns If/Then if (x) { FOO } else { BAR } cmp r1,#0 beq else then : FOO b done else : BAR done : ARM : cmp r1,#0 FOOeq BARne 13
Common Access Patterns If/Then Behavior If x is true, static = 100%, if x is false, 0% Assuming completely random, average 50% miss rate ARM can use conditional execution/predication to avoid this in simple scenarios 14
How can we Improve Things? 15
Branch Prediction Hints Give compiler (or assembler) hints likely() (maps to builtin expect()) unlikely() on some processors, (p4) hint for static others, just move unlikely blocks out of way for better L1I$ performance 16
Dynamic Branch Prediction 17
One-bit Branch History Table Table, likely indexed by lower bits of instruction (Why low bits and not high bits?) an have more advanced indexing to avoid aliasing no-tag bits, unlike caches aliasing does not affect correct program execution One-bit indicating what the branch did last time Update when a branch miss happens 18
0x1000 0003 : bne PC+45 0 1 2 3 taken... f 19
Branch History Table Behavior Two misses for each time through loop. Wrong at exit of loop, then wrong again when restarts. (so actually worse than static on loops) If/Then potentially better than static if long runs of true/false but can be worse if completely random. 20
Aliasing Is it bad? Good? Does the O/S need to save on context switch? Do you need to save if entering low-power state? 21
Two-bit saturating counter Use saturating 2-bit counter If 3/2, predict taken, if 1,0 not-taken. Takes two misses or hits to switch from one extreme to the next, letting loops take only one mispredict. Needs to be updated on every branch, not just for a mispredict 22
Taken NT 00 Strong NT NT 01 Weak NT Taken Taken NT 11 Strong Taken NT 10 Weak Taken Taken 23
Local vs Global History Can use branch history as index into tables Use a shift register to hold history Global: history is all branches Local: store branch history on a branch by branch basis 24
Global History Global Predictor N N T N 2 bit Counters 1 1 Taken... 25
Local Predictor 0x8000 0001 : bne PC+45 2 bit Counters N N T N 0 0 Not Taken...... 26
Correlating / Two Level Predictors Take history into account. Break branch prediction for a branch out into multiple locations based on history. 0x1000 0002 : bge PC+45 2 bit saturating counters............ 1 1 Taken Global History T N 27
gshare Xors the global history with the address bits to get which line to use. Benefits of 2-level without the extra circuitry 0x1000 000e : bgt PC+45 1110 2 bit Counters XOR N N Not Taken 1111... T T T T Global History 28
Tournament Predictors Which to use? Local or global? Have both. How to know which one to use? Predict it! 2-bit counter remembers which was best. 29
Perceptron There are actually Branch Prediction Competitions The winner the past few times has been a Perceptron predictor Neural Networks 30
Comparing Predictors Branch miss rate not enough Usually the total number of bits needed is factored in May also need to keep track of logic needed if it is complex. 31
Branch Target Buffer Predicts the actual destination of addresses. Indexed by whole PC. May be looking up before even know it is a branch instruction. Only need to store predicted-taken branches. Because not-taken fall through as per normal). (Why? 32
Return Address Stack Function calls can confuse BTB. Multiple locations branching to same spot. Which return address should be predicted? Keep a stack of return addresses for function calls Playing games with size optimization and fallthrough/tail optimization can confuse. 33
Adjusting Predictor on the Fly Some processors let you configure predictor at runtime. MIPS R12000 let you ARM possibly does. Why is this useful? In theory if you have a known workload you can pick the one that works best. Also if realtime you want something that is deterministic, like static prediction. Also Good for simulator validation 34
From the Manual: Cortex A9 Branch Predictor two-level prediction mechanism, comprising: a two-way BTAC of 512 entries organized as two-way x 256 entries a Global History Buffer (GHB) with 4096 2-bit predictors a return stack with eight 32-bit entries. It is also capable of predicting state changes from ARM to Thumb, and from Thumb to ARM. 35
Example Code in perf event validation tests for generic events. http://web.eece.maine.edu/~vweaver/projects/perf_events/validation/ 36
Example Results 37
Part 1 Testing a loop with 1500000 branches (100 times): On a simple loop like this, miss rate should be very small. Adjusting domain to 0,0,0 for ARM Average number of branch misses: 685 Part 2 Adjusting domain to 0,0,0 for ARM Testing a function that branches based on a random number The loop has 7710798 branches. 500000 are random branches; 250699 of those were taken Adjusting domain to 0,0,0 for ARM Out of 7710798 branches, 291081 were mispredicted Assuming a good random number generator and no freaky luck The mispredicts should be roughly between 125000 and 375000 Testing branch-misses generalized event... PASSED 38
Value Prediction Can we use this mechanism to help other performance issues? What about caches? Can we predict values loaded from memory? Load Value Prediction. You can, sometimes with reasonable success, but apparently not worth trouble as no vendors have ever implemented it. 39