CISC 662 Graduate Computer Architecture
Lecture 11 - Hardware Speculation Branch Predictions
Michela Taufer
http://www.cis.udel.edu/~taufer/teaching/cis6627
Powerpoint Lecture Notes from John Hennessy and David Patterson's Computer Architecture, 4th edition
Additional teaching material from: Jelena Mirkovic (U Del), John Kubiatowicz (UC Berkeley), and Soner Oender (Michigan Technological University)

Reducing Branch Penalty
  Branch penalty in dynamically scheduled processors: cycles wasted flushing the pipeline on mispredicted branches
  To reduce the branch penalty:
    Predict branch/jump instructions AND branch direction (taken or not taken)
    Predict branch/jump target address (for taken branches)
    Speculatively execute instructions along the predicted path

What to Use and What to Predict
  Available info:
    Current predicted PC
    Past branch history (direction and target)
  What to predict:
    Conditional branch inst: branch direction and target address
    Jump inst: target address
    Procedure call/return: target address
    May need instruction pre-decoded
  [Figure: the PC indexes the instruction memory (IM) and the predictors; the predictors take PC and instruction, produce pred_pc and prediction info, and receive feedback]
Misprediction Detections and Feedbacks
  Detections:
    At the end of decoding
      Target address known at decoding, and does not match
      Flush fetch stage
    At commit (most cases)
      Wrong branch direction, or target address does not match
      Flush the whole pipeline (at EE: MIPS R10000)
  Feedbacks:
    Any time a misprediction is detected
    At a branch's commit (at EE: called speculative update)
  [Figure: pipeline stages FETCH, RENAME, REB/ROB, SCHD, EE, WB, COMMIT, with feedback to the predictors]

Branch Direction Prediction
  Predict branch direction: taken or not taken (T/NT)
      BNE R1, R2, L1    (taken: continue at L1; not taken: fall through)
      ...
    L1: ...
  Static prediction: compilers decide the direction
  Dynamic prediction: hardware decides the direction using dynamic information
    1. 1-bit Branch-Prediction Buffer
    2. 2-bit Branch-Prediction Buffer
    3. Correlating Branch Prediction Buffer
    4. Tournament Branch Predictor
    5. and more

Predictor for a Single Branch
  General form:
    1. Access (by PC)
    2. Predict: output T/NT
    3. Feedback: actual T/NT
  [Figure: 1-bit prediction state machine with states 1 and 0, switching on T and NT outcomes]

Branch History Table of 1-bit Predictor
  The BHT is also called Branch Prediction Buffer in the textbook
  A single 1-bit predictor can be used for all branches, but accuracy is low
  BHT: use a table of simple predictors, indexed by bits from the PC
  Similar to a direct-mapped cache
  More entries: more cost, but fewer conflicts and higher accuracy
  A BHT can contain complex predictors
  [Figure: K bits of the branch address index a table of 2^K entries, which supplies the prediction]
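The table of 1-bit predictors described above can be sketched in a few lines. This is a minimal illustration, not the textbook's design: the class name `OneBitBHT`, the 4-bit index, the 4-byte instruction size, and the addresses are all assumptions chosen for the example.

```python
class OneBitBHT:
    """Branch history table of 1-bit predictors, indexed by low PC bits."""

    def __init__(self, index_bits=4, initial_taken=False):
        self.mask = (1 << index_bits) - 1
        self.table = [initial_taken] * (1 << index_bits)

    def _index(self, pc):
        # Assumes 4-byte instructions: drop the 2 offset bits, then keep
        # index_bits bits, like a direct-mapped cache without tags.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        # A 1-bit predictor simply remembers the last outcome.
        self.table[self._index(pc)] = taken


bht = OneBitBHT(index_bits=4)
bht.update(0x40, True)            # hypothetical branch at 0x40 was taken
same_entry = bht.predict(0x80)    # 0x80 maps to the same entry (a conflict)
other_entry = bht.predict(0x44)   # different index, still at its initial value
```

Because there are no tags, the branch at 0x80 inherits the prediction trained by the branch at 0x40. Adding index bits reduces such conflicts at the cost of a larger table.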
1-bit BHT Weakness
  Example: in a loop, a 1-bit BHT will cause 2 mispredictions
  Consider a loop of 9 iterations before exit:
      for ( ){
        for (i=0; i<9; i++)
          a[i] = a[i] * 2.0;
      }
    End of loop case, when it exits instead of looping as before
    First time through the loop on the next time through the code, when it predicts exit instead of looping
  Only 80% accuracy even if the branch loops 90% of the time

2-bit Saturating Counter
  Solution: a 2-bit scheme that changes the prediction only after two mispredictions in a row (Figure 3.7, p. 249)
  [Figure: state diagram with four states 11, 10, 01, 00, each labeled Predict Taken or Predict Not Taken, with transitions on T and NT. Blue: stop, not taken. Gray: go, taken.]
  Adds hysteresis to the decision-making process

Correlating Branches
  Code example showing the potential:
      if (d==0)
        d=1;
      if (d==1)
        ...
  Assembly code:
          BNEZ   R1, L1
          DADDIU R1,R0,#1
      L1: DADDIU R3,R1,#-1
          BNEZ   R3, L2
          ...
      L2:
  Observation: if the first BNEZ is not taken (d==0, so d is set to 1), then R3 is 0 and the second BNEZ is not taken either

Correlating Branch Predictor
  Idea: the taken/not taken outcome of recently executed branches is related to the behavior of the next branch (as well as to the history of that branch's own behavior)
  The behavior of recent branches then selects between, say, 2 predictions for the next branch, and only the selected prediction is updated
  (1,1) predictor: 1-bit global, 1-bit local
  [Figure: the branch address (4 bits) selects a pair of 1-bit local predictors; the 1-bit global branch history (0 = not taken) selects which of the pair supplies the prediction]
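The 80% versus 90% figures for the loop branch above can be checked with a short simulation. This is a sketch under stated assumptions: the branch outcome stream is nine "loop" outcomes followed by one "exit" per pass, the function names are made up for the example, and the 2-bit predictor is modeled as a plain up/down saturating counter, which may differ in individual transitions from the state diagram in the textbook figure.

```python
def run_one_bit(outcomes, state=False):
    """Count mispredictions of a 1-bit predictor (state = last outcome)."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken
    return misses


def run_two_bit(outcomes, counter=3):
    """Count mispredictions of a 2-bit saturating counter (values 0..3)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2      # upper half predicts taken
        if predicted_taken != taken:
            misses += 1
        # Count up on taken, down on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


passes = 100
loop_branch = ([True] * 9 + [False]) * passes   # 9 x loop, then exit

one_bit_misses = run_one_bit(loop_branch)
two_bit_misses = run_two_bit(loop_branch)
one_bit_accuracy = 1 - one_bit_misses / len(loop_branch)
two_bit_accuracy = 1 - two_bit_misses / len(loop_branch)
```

With these starting states the 1-bit predictor misses twice per pass (at the exit and again on re-entry), while the 2-bit counter misses only at the exit.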
Correlating Branch Predictor
  General form: (m, n) predictor
    m bits for global history, n bits for local history
    Records correlation between m+1 branches
    Simple implementation: the global history can be stored in a shift register
  Example: (2,2) predictor, 2-bit global, 2-bit local
  [Figure: the branch address (4 bits) selects a set of 2-bit local predictors; the 2-bit global branch history (01 = not taken then taken) selects which one supplies the prediction]

Accuracy of Different Schemes (Figure 3.15, p. 206)
  [Figure: frequency of mispredictions for three schemes: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT]

Accuracy of Return Address Predictor
  [Figure]

Branch Target Buffer
  Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch target address (if taken)
  Note: must now check for a branch match, since a wrong branch address cannot be used
  Example: BTB combined with BHT
  [Figure: in FETCH, the PC of the instruction is compared (=?) against the Branch PC field of the BTB; each entry also holds the Predicted PC and extra prediction state bits.
    No: branch not predicted, proceed normally (Next PC = PC+4)
    Yes: instruction is a branch; use the predicted PC as the next PC]
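As a sketch of the (m, n) correlating predictor described above, the following model keeps an m-bit global shift register and, per branch-address index, one n-bit saturating counter for each history pattern. The class and function names, the 4-bit index, the branch addresses, and the alternating values of d are assumptions made for illustration. It is driven by the if (d==0) / if (d==1) example from the Correlating Branches slide.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m-bit global history, n-bit counters per branch."""

    def __init__(self, m, n, index_bits=4):
        self.history_mask = (1 << m) - 1
        self.max_count = (1 << n) - 1
        self.threshold = 1 << (n - 1)       # upper half predicts taken
        self.pc_mask = (1 << index_bits) - 1
        self.history = 0                    # newest outcome in bit 0
        self.table = [[0] * (1 << m) for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.table[(pc >> 2) & self.pc_mask]

    def predict(self, pc):
        return self._row(pc)[self.history] >= self.threshold

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.history]               # only the selected counter changes
        row[self.history] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_mispredictions(predictor, d_values):
    b1_pc, b2_pc = 0x10, 0x18               # hypothetical branch addresses
    misses = 0
    for d in d_values:
        b1_taken = d != 0                   # BNEZ R1, L1
        if not b1_taken:
            d = 1                           # DADDIU R1,R0,#1
        b2_taken = d != 1                   # BNEZ R3, L2 with R3 = d - 1
        for pc, taken in ((b1_pc, b1_taken), (b2_pc, b2_taken)):
            if predictor.predict(pc) != taken:
                misses += 1
            predictor.update(pc, taken)
    return misses


d_values = [2, 0] * 10                      # d alternates between 2 and 0
plain_misses = count_mispredictions(CorrelatingPredictor(0, 1), d_values)
corr_misses = count_mispredictions(CorrelatingPredictor(1, 1), d_values)
```

With m = 0 the model degenerates to a plain 1-bit BHT, which mispredicts every branch in this alternating pattern. The (1,1) configuration mispredicts only during the first pass, because the outcome of the first branch selects the right prediction for the second.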
Hardware Based Speculation
  Exploiting more ILP requires that we overcome the limitation of control dependence:
    With branch prediction we allowed the processor to continue issuing instructions past a branch based on a prediction:
      Those fetched instructions do not modify the processor state.
      These instructions are squashed if the prediction is incorrect.
    We now allow the processor to execute these instructions before we know whether it is OK to execute them:
      We need to correctly restore the processor state if such an instruction should not have been executed.
      We need to pass the results from these instructions to future instructions as if the program were simply following that path.

Hardware Based Speculation
  [Figure: control-flow graph with two branches, B1 (x < y?) and B2 (< z), each with N and T successors. The blocks contain the assignments A=b+c, C=0, C=c-1, A=0, B=b+1, C=a, A=a+1, D=a+b+c, followed by a use of d.]
  Assume the processor predicts B1 to be taken (T) and executes. What will happen if the prediction was wrong?
  What value of each variable should be used if the processor predicts B1 and B2 taken (T) and executes instructions along the way?

Hardware Based Speculation
  In order to execute instructions speculatively, we need to provide means:
    To roll back the values of both the registers and the memory to their correct values upon a misprediction.
    To communicate speculatively calculated values to the new uses of those values.
  Both can be provided by a simple structure called the Reorder Buffer (ROB).
Reorder Buffer
  It is a simple circular array with a head and a tail pointer:
    Each new instruction is allocated a position at the tail, in program order.
    Each entry provides a location for storing the instruction's result.
    New instructions look for their source values by searching from the tail back.
    When the instruction at the head completes and becomes non-speculative, its values are committed and the instruction is removed from the buffer.
  [Figure: circular buffer with Tail and Head pointers]

Reorder Buffer
  3 fields: instr, destination, value
  Can be an operand source => more registers, like RS
  Supplies operands between execution complete & commit
  Use the reorder buffer number instead of the reservation station when execution completes
  Once the result commits, it is put into the register
  As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions

Steps of Speculative Tomasulo Algorithm
  1. Issue [get instruction from FP Op Queue]
    1. Check if the reorder buffer is full.
    2. Check if a reservation station is available.
    3. Access the register file and the reorder buffer for the current values of the source operands.
    4. Send the instruction, its reorder buffer slot number, and the source operands to the reservation station.

Steps of Speculative Tomasulo Algorithm
  2. Execute [operate on operands (E)]
    When both operands are ready and a functional unit is available, the instruction executes.
    This step checks RAW hazards; as long as the operands are not ready, it watches the CDB for results.
    Once issued, the instruction stays in the reservation station until it gets both operands.
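The circular-array organization of the reorder buffer described above can be sketched as follows. This is an illustrative model only: the class `ReorderBuffer`, its method names, and the dictionary-based entries and register file are assumptions, and reservation stations, the CDB, and exceptions are left out.

```python
class ReorderBuffer:
    """Circular array with head (oldest) and tail (next free) pointers."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0
        self.tail = 0
        self.count = 0

    def is_full(self):
        return self.count == len(self.entries)

    def allocate(self, instr, dest):
        """Issue: take the slot at the tail, in program order."""
        if self.is_full():
            raise RuntimeError("ROB full: issue must stall")
        slot = self.tail
        self.entries[slot] = {"instr": instr, "dest": dest,
                              "value": None, "done": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return slot

    def write_result(self, slot, value):
        """Write result: store the value in the instruction's own entry."""
        self.entries[slot]["value"] = value
        self.entries[slot]["done"] = True

    def find_producer(self, reg):
        """Search from the tail back for the newest entry that writes reg."""
        for k in range(1, self.count + 1):
            slot = (self.tail - k) % len(self.entries)
            if self.entries[slot]["dest"] == reg:
                return slot
        return None                 # no entry in flight: use the register file

    def commit(self, regfile):
        """Retire the head entry if its result is present."""
        if self.count == 0 or not self.entries[self.head]["done"]:
            return False
        entry = self.entries[self.head]
        regfile[entry["dest"]] = entry["value"]
        self.entries[self.head] = None
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return True


rob = ReorderBuffer(4)
regs = {}
first = rob.allocate("LD F6, 34(R2)", "F6")
second = rob.allocate("ADDD F6, F8, F2", "F6")
newest = rob.find_producer("F6")    # the later writer of F6
rob.write_result(second, 2.5)       # finishes out of order
blocked = rob.commit(regs)          # head (the load) is not done yet
rob.write_result(first, 1.0)
rob.commit(regs)
rob.commit(regs)
```

The younger instruction finishes first but cannot update F6 until the load at the head has committed, so the register file always reflects program order.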
Steps of Speculative Tomasulo Algorithm
  3. Write result [finish execution (WB)]
    Write on the Common Data Bus to all awaiting FUs and the reorder buffer.
    Mark the reservation station available.

Steps of Speculative Tomasulo Algorithm
  4. Commit [update register file with reorder result]
    When the instruction reaches the head of the reorder buffer,
      the result is present, and
      no exceptions are associated with the instruction,
    the instruction becomes non-speculative:
      Update the register file with the result (or store to memory)
      Remove the instruction from the reorder buffer.
    A mispredicted branch flushes the reorder buffer.

MIPS FP Unit
  [Figure]

Recall: Four Steps of Speculative Tomasulo Algorithm
  1. Issue: get instruction from FP Op Queue
    If a reservation station and a reorder buffer slot are free, issue the instr & send operands & reorder buffer no. for destination (this stage is sometimes called "dispatch")
  2. Execution: operate on operands (E)
    When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
  3. Write result: finish execution (WB)
    Write on the Common Data Bus to all awaiting FUs & reorder buffer; mark the reservation station available.
  4. Commit: update register with reorder result
    When the instr. is at the head of the reorder buffer & the result is present, update the register with the result (or store to memory) and remove the instr from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
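The commit step, including the flush on a mispredicted branch, can be illustrated with a small sketch. The function name, the dictionary fields, and the wrong-path instruction are hypothetical. For brevity the sketch retires as many entries per call as are ready, and it models the flush by clearing the whole buffer once the mispredicted branch reaches the head.

```python
from collections import deque


def commit_in_order(rob, regfile):
    """Commit from the head until an unfinished entry or a flush."""
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()               # head of the reorder buffer
        committed.append(entry["instr"])
        if entry.get("mispredicted"):
            rob.clear()                     # squash all younger instructions
            break
        if entry["dest"] is not None:
            regfile[entry["dest"]] = entry["value"]
    return committed


regs = {"R1": 0, "R3": 0}
rob = deque([
    {"instr": "DADDIU R1,R0,#1", "dest": "R1", "value": 1, "done": True},
    {"instr": "BNEZ R3,L2", "dest": None, "value": None, "done": True,
     "mispredicted": True},
    # Hypothetical wrong-path instruction that already has a result.
    {"instr": "wrong-path write to R3", "dest": "R3", "value": 99,
     "done": True},
])
committed = commit_in_order(rob, regs)
```

The wrong-path instruction had a result ready, but because results reach the register file only at commit, discarding its entry is enough to undo it.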
Tomasulo With Reorder Buffer
  [Figure: FP Op Queue; Reorder Buffer with entries ROB1 (oldest) to ROB7 (newest) and fields Dest., Value, Instruction type, Done?; Registers; Reservation Stations with Dest fields feeding the FP adders and FP multipliers; paths to and from memory]

Example 1
  1. LD F6, 34(R2)
  2. LD F2, 45(R3)
  3. MULTD, F2, F4
  4. SUBD F8, F6, F2
  5. DIVD F10,, F6
  6. ADDD F6, F8, F2
Example 1, cycle by cycle
  [Figure: reservation station and reorder buffer contents at each time step]
  Each ROB line lists, per entry: entry number, busy flag where shown ("no" = entry no longer busy), state, destination, value.

Time = 1: First load is issued.
  ROB: 1 Issue F6
Time = 2: First load executes. Second load is issued.
  ROB: 1 Execute F6 | 2 Issue F2
Time = 3: First load executes. Second load executes. Mul is issued.
  ROB: 1 Execute F6 | 2 Execute F2 | 3 Issue
Time = 4: First load writes result. Second load executes. Sub is issued.
  ROB: 1 Write result | 2 Execute F2 | 3 Stalled in issue | 4 Issue F8
Time = 5: First load commits. Second load writes result. Div is issued.
  ROB: 1 no Commit | 2 Write result | 3 Stalled in issue | 4 Stalled in issue F8 | 5 Issue F10
Time = 6: Second load commits. Mul (1/10) and sub (1/2) execute. Add is issued.
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Execute F8 | 5 Stalled in issue F10 | 6 Issue F6
Time = 7: Mul (2/10) and sub (2/2) execute.
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Execute F8 | 5 Stalled in issue F10 | 6 Issue F6
Time = 8: Mul executes (3/10). Sub writes result ().
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Write result F8 | 5 Stalled in issue F10 | 6 Stalled in issue F6
Time = 9: Mul executes (4/10). Add executes (1/2).
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Waiting to commit F8 | 5 Stalled in issue F10 | 6 Execute F6
Time = 10: Mul executes (5/10). Add executes (2/2).
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Waiting to commit F8 | 5 Stalled in issue F10 | 6 Execute F6
Time = 11: Mul executes (6/10). Add writes result (Y).
  ROB: 1 no Commit | 2 no Commit | 3 Execute | 4 Waiting to commit F8 | 5 Stalled in issue F10 | 6 Write result F6 Y

Faster than light computation (skip a couple of cycles)

Time = 16: Mul writes result (Z).
  ROB: 1 no Commit | 2 no Commit | 3 Write result Z | 4 Waiting to commit F8 | 5 Stalled in issue F10 | 6 Waiting to commit F6 Y
Time = 17: Mul commits. Div is executed (1/40).
  ROB: 1 no Commit | 2 no Commit | 3 no Commit Z | 4 Waiting to commit F8 | 5 Execute F10 | 6 Waiting to commit F6 Y
Time = 18: Sub commits. Div is executed (2/40).
  ROB: 1 no Commit | 2 no Commit | 3 no Commit Z | 4 no Commit F8 | 5 Execute F10 | 6 Waiting to commit F6 Y

Faster than light computation (skip a couple of cycles)

Time = 57: Div writes result (W).
  ROB: 1 no Commit | 2 no Commit | 3 no Commit Z | 4 no Commit F8 | 5 Write result F10 W | 6 Waiting to commit F6 Y
Time = 58: Div commits.
  ROB: 1 no Commit | 2 no Commit | 3 no Commit Z | 4 no Commit F8 | 5 no Commit F10 W | 6 Waiting to commit F6 Y
Time = 59: Add commits.
  ROB: 1 no Commit | 2 no Commit | 3 no Commit Z | 4 no Commit F8 | 5 no Commit F10 W | 6 no Commit F6 Y
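The commit times in this example follow from in-order commit: an instruction commits no earlier than the cycle after it writes its result, and no earlier than the cycle after the previous instruction commits. The sketch below applies that rule to the write-result times from the example (4, 5, 16, 8, 57, 11). The function name and the one-commit-per-cycle assumption are illustrative and not stated in the slides.

```python
def commit_times(write_result_times):
    """In-order commit, assuming at most one commit per cycle."""
    commits = []
    previous = 0
    for written in write_result_times:
        # Wait for our own result and for the older instruction to commit.
        cycle = max(written + 1, previous + 1)
        commits.append(cycle)
        previous = cycle
    return commits


# Write-result cycles, in program order: LD, LD, MULTD, SUBD, DIVD, ADDD.
write_times = [4, 5, 16, 8, 57, 11]
times = commit_times(write_times)
```

The add writes its result at time 11 but commits at time 59, because it sits behind the 40-cycle divide in the reorder buffer.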