Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines



6.823, L15: Branch Prediction & Speculative Execution
Asanovic, Laboratory for Computer Science, M.I.T.

Branch Penalties in Modern Pipelines

UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 1 GHz, 2000):

A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional Units
R  Register File Read
E  Integer Execute
   Remainder of execute pipeline (+ another 6 stages)

Branch penalty: Cycles? Instructions?

Branch Prediction

Motivation: branch penalties limit the performance of deeply pipelined processors. Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly.

Required hardware support:
- Prediction structures: branch history tables, branch target buffers, etc.
- Mispredict recovery mechanisms:
  - In-order machines: kill instructions following the branch in the pipeline
  - Out-of-order machines: shadow registers and memory buffers for each speculated branch

DLX Branches and Jumps

Instruction   Taken known?       Target known?
BEQZ/BNEZ     After Reg. Fetch   After Inst. Fetch
J             Always Taken       After Inst. Fetch
JR            Always Taken       After Reg. Fetch

To execute a branch or jump, the machine must know (or guess) both the target address and whether the branch is taken.

Static Branch Prediction
(Encode prediction as part of branch instruction)

Probability a branch is taken (~67% overall):
- backward branches: 90%
- forward branches: 50%

- Can predict all taken, or backwards taken/forward not-taken.
- ISA can attach additional semantics to branches about preferred direction, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken).
- ISA can allow arbitrary choice of statically predicted direction (HP PA-RISC, Intel Itanium).

Dynamic Branch Prediction
(Learning based on past behavior)

- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at the next execution.
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution).
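The backwards-taken/forward-not-taken rule above can be sketched in a few lines. This is an illustrative sketch only: the function name `predict_static` and the choice to treat a branch to its own address as backward are assumptions, not part of the lecture.

```python
def predict_static(branch_pc: int, target_pc: int) -> bool:
    """Backwards taken / forward not-taken (hypothetical helper).

    A branch whose target is at or below its own address is treated as a
    loop-closing backward branch and predicted taken; anything else is
    predicted not-taken.
    """
    return target_pc <= branch_pc


# A loop-closing branch at 0x1020 jumping back to 0x1000 is predicted taken.
loop_branch = predict_static(0x1020, 0x1000)
# A forward skip over an "if" body is predicted not-taken.
skip_branch = predict_static(0x1020, 0x1040)
```

The rule needs only a sign check on the branch offset, which is why it works without any prediction state.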

Branch Prediction Bits

- Assume 2 BP bits per instruction.
- Change the prediction after two consecutive mistakes!
- BP state: (predict take/¬take) x (last prediction right/wrong)

[Figure: state diagram with four states (take right, take wrong, ¬take wrong, ¬take right) and transitions labeled taken/¬taken.]

Branch History Table

[Figure: the fetch PC addresses the I-cache; k bits of the PC above the low-order 00 index a 2^k-entry BHT with 2 bits/entry. The instruction's opcode answers "Branch?", its offset is added to the PC to form the target PC, and the BHT supplies Taken/¬Taken?]

- 4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
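The two-bit scheme and the BHT indexing described above can be sketched as follows. This is one reading of the slide, not a definitive model: the class names, the initial state (predict not-taken) and the untagged indexing are assumptions made for illustration.

```python
class TwoBitEntry:
    """BP state: (predict take/not-take) x (last prediction right/wrong)."""

    def __init__(self, predict_taken: bool = False) -> None:
        self.predict_taken = predict_taken  # assumed initial prediction
        self.last_wrong = False

    def update(self, taken: bool) -> None:
        wrong = taken != self.predict_taken
        if wrong and self.last_wrong:
            # Second consecutive mistake: change the prediction.
            self.predict_taken = taken
        self.last_wrong = wrong


class BranchHistoryTable:
    """2^k entries, 2 bits each, indexed by PC bits above the low-order 00."""

    def __init__(self, k: int = 12) -> None:  # k = 12 gives a 4K-entry BHT
        self.mask = (1 << k) - 1
        self.entries = [TwoBitEntry() for _ in range(1 << k)]

    def index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.entries[self.index(pc)].predict_taken

    def update(self, pc: int, taken: bool) -> None:
        self.entries[self.index(pc)].update(taken)


def count_mispredicts(bht: BranchHistoryTable, pc: int, outcomes) -> int:
    """Replay a sequence of branch outcomes and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong


# A loop branch taken 9 times and then falling through, run 10 times over.
loop_outcomes = ([True] * 9 + [False]) * 10
mispredicts = count_mispredicts(BranchHistoryTable(), 0x4000, loop_outcomes)
```

Because a single loop exit does not flip the prediction, the steady-state cost in this sketch is one mispredict per loop execution. There are no tags, so two branches whose PCs share the same k index bits share one entry.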

Exploiting Spatial Correlation (Yeh and Patt)

    if (x[i] < 7) then y += 1;
    if (x[i] < 5) then c -= 4;

If the first condition is false, the second condition is also false.

- History bit: H records the direction of the last branch executed by the processor.
- Two sets of BP bits (BP0 & BP1) per branch instruction:
  - H = 0 (taken): consult BP0
  - H = 1 (not taken): consult BP1

Two-Level Branch Predictor

Pentium Pro uses the result from the last two branches to select one of the four sets of BP bits (~95% correct).

[Figure: k bits of the fetch PC (above the low-order 00) select a BHT entry; a global branch history shift register, which shifts in the Taken/¬Taken result of each branch, selects which set of BP bits supplies Taken/¬Taken?]
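A two-level predictor of the kind described above can be sketched as a table of counters selected by both the PC and a global history register. Several details here are assumptions made for the sketch rather than anything stated in the lecture: the class name, the use of 2-bit saturating counters (a common alternative to the right/wrong state machine), the history encoding (1 = taken) and the initial counter value.

```python
class TwoLevelPredictor:
    """Global history selects one of 2^h sets of BP bits per BHT entry."""

    def __init__(self, k: int = 10, history_bits: int = 2) -> None:
        self.pc_mask = (1 << k) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0  # global shift register; assumed encoding: 1 = taken
        # One saturating counter (0..3) per (BHT index, history pattern).
        # Initial value 1 means "weakly not-taken" (an assumption).
        self.counters = [[1] * (1 << history_bits) for _ in range(1 << k)]

    def _row(self, pc: int):
        return self.counters[(pc >> 2) & self.pc_mask]

    def predict(self, pc: int) -> bool:
        return self._row(pc)[self.history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the resolved direction into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


# A branch that strictly alternates taken / not-taken.
predictor = TwoLevelPredictor()
wrong_flags = []
for i in range(100):
    outcome = (i % 2 == 0)
    wrong_flags.append(predictor.predict(0x8000) != outcome)
    predictor.update(0x8000, outcome)
```

After a short warm-up the history pattern identifies which half of the alternation comes next, so the alternating branch is predicted correctly, which a single set of BP bits per branch cannot do.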

Limitations of BHTs

A BHT cannot redirect the fetch stream until after the branch instruction is fetched and decoded, and the target address determined.

[Figure: UltraSPARC-III fetch pipeline, stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc/Begin Decode), I (Complete Decode), J (Steer Instructions to Functional Units), R (Register File Read), E (Integer Execute).]

Correctly predicted taken branch penalty: Cycles? Instructions?

What about JR instructions?

Branch Target Buffer

[Figure: the PC addresses the I-cache and, in parallel, k bits of the PC index a 2^k-entry Branch Target Buffer. Each entry holds an entry PC, a valid bit, and a predicted target PC. A match on the entry PC with the valid bit set selects the predicted target.]

- Keep both the branch PC and target PC in the BTB.
- PC+4 is fetched if the match fails.
- Only taken branches and jumps are held in the BTB.
- The next PC is determined before the branch is fetched and decoded.
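The BTB lookup described above can be sketched as a small direct-mapped table. The class and method names, the direct-mapped organization with a full-PC tag, and the policy of evicting an entry when its branch resolves not-taken are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


class BranchTargetBuffer:
    """2^k entries; each valid entry holds (entry PC, predicted target PC)."""

    def __init__(self, k: int = 9) -> None:
        self.mask = (1 << k) - 1
        # None models an entry whose valid bit is clear.
        self.entries: List[Optional[Tuple[int, int]]] = [None] * (1 << k)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def next_pc(self, pc: int) -> int:
        """Predict the next fetch PC using only the current PC."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]  # match and valid: fetch the predicted target
        return pc + 4        # match fails: fetch sequentially

    def update(self, pc: int, taken: bool, target_pc: int) -> None:
        """Called once the branch or jump has resolved."""
        idx = self._index(pc)
        if taken:
            self.entries[idx] = (pc, target_pc)
        elif self.entries[idx] is not None and self.entries[idx][0] == pc:
            # Only taken branches are held, so drop a not-taken one (assumed).
            self.entries[idx] = None


btb = BranchTargetBuffer(k=9)
cold = btb.next_pc(0x2000)            # nothing learned yet
btb.update(0x2000, True, 0x3000)      # branch resolved taken
warm = btb.next_pc(0x2000)
alias = btb.next_pc(0x2000 + (4 << 9))  # same index, different entry PC
```

Storing the branch PC alongside the target is what lets the aliasing PC fall back to PC+4 instead of being redirected to another branch's target.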

Combining BTB and BHT
(scheme used in PowerPC620)

- BTB entries are considerably more expensive than BHT entries, but can redirect fetches at an earlier stage in the pipeline and can accelerate indirect branches (JR).
- BHT can hold many more entries and is more accurate.
- BHT in a later pipeline stage corrects when the BTB misses a predicted taken branch.
- BTB/BHT are only updated after the branch resolves in the E stage.

[Figure: fetch pipeline stages A (PC Generation/Mux), P (Instruction Fetch Stage 1), F (Instruction Fetch Stage 2), B (Branch Address Calc/Begin Decode), I (Complete Decode), J (Steer Instructions to Functional Units), R (Register File Read), E (Integer Execute), annotated with the positions of the BTB and BHT.]

Subroutine Return Stack

Main uses of register indirect jumps (JR):
- Switch statements (jump to address of matching case)
- Dynamic function call (jump to run-time function address)
- Subroutine returns (jump to return address)

The return stack has k entries (typically k = 8-16):
- Push the call address when a function call is executed.
- Pop the return address when a subroutine return is decoded.

A subroutine call/return address stack predicts return addresses more accurately than a BTB. Why?
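The push/pop behavior of the return stack above can be sketched with a bounded deque. The names are hypothetical, and two details are assumptions: the pushed value is taken to be the address after the call (call PC + 4, fixed 4-byte instructions, no delay slot), and on overflow the oldest entry is silently dropped.

```python
from collections import deque
from typing import Optional


class ReturnAddressStack:
    """k-entry stack of predicted return addresses."""

    def __init__(self, k: int = 8) -> None:
        # A bounded deque discards the oldest entry when a push overflows.
        self.stack = deque(maxlen=k)

    def on_call(self, call_pc: int) -> None:
        # Assumed: the return address is the instruction after the call.
        self.stack.append(call_pc + 4)

    def on_return(self) -> Optional[int]:
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(k=8)
ras.on_call(0x100)          # main calls f
ras.on_call(0x200)          # f calls g
first = ras.on_return()     # g returns into f
second = ras.on_return()    # f returns into main
```

One way to see the advantage over a BTB: a BTB entry for the return instruction remembers only the last target, so a subroutine called from two different sites in turn is mispredicted each time, while the stack tracks the call nesting itself.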

Speculating Both Directions

An alternative to branch prediction is to execute both directions of a branch speculatively:
- The resource requirement is proportional to the number of concurrent speculative executions.
- Only half the resources engage in useful work when both directions of a branch are executed speculatively.
- Branch prediction takes fewer resources than speculative execution of both paths.

With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction.

Mispredict Recovery

In-order execution machines:
- Assume no instruction issued after the branch can write back before the branch resolves.
- Kill all instructions in the pipeline behind the mispredicted branch.

Out-of-order execution?
- Multiple instructions following the branch in program order can complete before the branch resolves.

In-Order Commit

- Instructions are fetched, decoded, and placed into the reorder buffer in order.
- Execution is out-of-order (=> out-of-order completion).
- Commit (write-back to architectural state, regfile + memory) is in order.

[Figure: Fetch -> Decode -> Reorder Buffer -> Commit, with Execute and Complete attached to the reorder buffer. Fetch/decode is the in-order front end, execution is out-of-order, and commit is in-order.]

Temporary storage is needed to hold results before commit (shadow registers and store buffers).

Extensions for Speculation

Instruction reorder buffer entry fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data. Two pointers index the buffer: ptr 2 (next to commit) and ptr 1 (next available).

- Add <pd, dest, data> fields in the instruction template.
- Commit instructions to reg file and memory in program order, so buffers can be maintained circularly.
- On wrong speculation, roll back the "next available" pointer.
- No speculative stores.

Branch Instructions

- Branch instructions are entered into the ROB normally, except that the predicted branch direction is also recorded.
- When the branch is resolved and the prediction is incorrect:
  - ptr 1 is rewound to just after the speculated branch in the instruction template
  - use-bits of all incorrectly speculated instructions after the failed branch are reset
  - the fetch pipeline is flushed and the fetch stream redirected
- When the branch is committed, the branch predictors (BTBs, BHTs, etc.) are updated.

Branch Execution

[Figure: PC -> Fetch -> Decode -> Reorder Buffer -> Commit, with Execute attached to the reorder buffer. Branch prediction feeds the PC, branch resolution comes from execute, and commit updates the predictors.]

- Can have multiple unresolved branches in the ROB.
- Can resolve branches out-of-order.
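The circular reorder buffer with its two pointers, and the rewind on a mispredicted branch, can be sketched as follows. This is a simplified software model under stated assumptions: the class and method names are hypothetical, each entry keeps only an opcode string and a done flag, and clearing a slot stands in for resetting its use bit.

```python
class ReorderBuffer:
    """Circular buffer: next_avail is "ptr 1", next_commit is "ptr 2"."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        self.slots = [None] * size  # None models a cleared use bit
        self.next_commit = 0        # ptr 2: next to commit
        self.next_avail = 0         # ptr 1: next available
        self.count = 0

    def allocate(self, op: str) -> int:
        """Enter an instruction in program order and return its ROB tag."""
        if self.count == self.size:
            raise RuntimeError("ROB full: decode must stall")
        tag = self.next_avail
        self.slots[tag] = {"op": op, "done": False}
        self.next_avail = (tag + 1) % self.size
        self.count += 1
        return tag

    def complete(self, tag: int) -> None:
        """Mark an instruction finished; this may happen out of order."""
        self.slots[tag]["done"] = True

    def commit(self) -> list:
        """Retire finished instructions from the head, in program order."""
        retired = []
        while self.count and self.slots[self.next_commit]["done"]:
            retired.append(self.slots[self.next_commit]["op"])
            self.slots[self.next_commit] = None
            self.next_commit = (self.next_commit + 1) % self.size
            self.count -= 1
        return retired

    def rollback_after(self, branch_tag: int) -> None:
        """Mispredict: rewind ptr 1 to just after the branch."""
        idx = (branch_tag + 1) % self.size
        while idx != self.next_avail:
            self.slots[idx] = None  # reset the use bit of a wrong-path entry
            self.count -= 1
            idx = (idx + 1) % self.size
        self.next_avail = (branch_tag + 1) % self.size


rob = ReorderBuffer()
tags = [rob.allocate(op) for op in ("add", "beq", "mul", "sub")]
rob.complete(tags[2])             # mul finishes first, out of order
nothing_yet = rob.commit()        # add is still at the head and not done
rob.complete(tags[0])
first_commit = rob.commit()       # add retires; beq blocks the head
rob.rollback_after(tags[1])       # beq was mispredicted: drop mul and sub
```

Because results reach architectural state only at commit, discarding the wrong-path entries is enough to undo them in this model.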


Rollback and Renaming

[Figure: a register file (now holding only committed state) and a reorder buffer with entries t1, t2, ..., tn (fields: Ins#, use, exec, op, p1, src1, p2, src2, pd, dest, data) feed the Load unit, FUs, and Store unit. Results return as <t, result>, and commit writes the register file.]

The register file does not contain renaming tags any more. How does the decode stage find the tag of a source register?

Renaming Table

[Figure: the same datapath with a rename table added in front of the register file, mapping architectural registers r1, r2, ... to tags ti, tj, ...]

The renaming table caches the register name lookup. The machine takes a snapshot of the table at each predicted branch, and recovers the earlier snapshot if the branch is mispredicted.
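The snapshot-and-recover behavior of the renaming table can be sketched as below. This is a simplified model: the class and method names are hypothetical, tags are plain strings, and the handling of committed tags inside saved snapshots is one possible policy chosen for the sketch, not something the lecture specifies.

```python
from typing import Dict, List, Optional, Tuple


class RenameTable:
    """Maps an architectural register to the tag of its newest in-flight writer."""

    def __init__(self) -> None:
        self.table: Dict[int, str] = {}
        # Snapshots in program order: (branch tag, saved copy of the table).
        self.snapshots: List[Tuple[str, Dict[int, str]]] = []

    def lookup(self, reg: int) -> Optional[str]:
        """Tag to wait on, or None if the committed register file is current."""
        return self.table.get(reg)

    def rename_dest(self, reg: int, tag: str) -> None:
        self.table[reg] = tag

    def snapshot(self, branch_tag: str) -> None:
        """Taken when a predicted branch is decoded."""
        self.snapshots.append((branch_tag, dict(self.table)))

    def recover(self, branch_tag: str) -> None:
        """Mispredict: restore the snapshot and drop all younger snapshots."""
        for i, (tag, saved) in enumerate(self.snapshots):
            if tag == branch_tag:
                self.table = saved
                del self.snapshots[i:]
                return
        raise KeyError(branch_tag)

    def commit(self, reg: int, tag: str) -> None:
        """On commit the value lives in the register file; retire the mapping."""
        for mapping in [self.table] + [s for _, s in self.snapshots]:
            if mapping.get(reg) == tag:
                del mapping[reg]


rt = RenameTable()
rt.rename_dest(1, "t1")   # r1 <- ... (older than the branch)
rt.snapshot("b1")         # predicted branch decoded
rt.rename_dest(1, "t2")   # wrong-path writes
rt.rename_dest(2, "t3")
before = rt.lookup(1)
rt.recover("b1")          # branch turned out to be mispredicted
after = rt.lookup(1)
```

Restoring a whole saved table makes recovery a single step, at the cost of storing one copy of the table per unresolved branch.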

Physical Register File

[Figure: a rename table (r1, r2, ... mapped to ti, tj, ...) with snapshots for mispredict recovery sits in front of a single register file holding t1, t2, ..., tn, which feeds the Load unit, FUs, and Store unit. Results return as <t, result>. The ROB is not shown.]

- One regfile for both committed and speculative values (no data in ROB).
- During decode, the instruction result is allocated a new physical register, and source regs are translated to physical regs through the rename table.
- The instruction reads data from the regfile at the start of execute (not in decode).
- Write-back updates reg. busy bits on instructions in the ROB (assoc. search).
- Snapshots of the rename table are taken at every branch to recover from mispredicts.
- On exception, renaming is undone in reverse order of issue (MIPS R10000).

Speculative Loads / Stores

Just like register updates, stores should not modify memory until after the instruction is committed.
- A store buffer entry must carry a speculation bit and the tag of the corresponding store instruction.
- If the instruction is committed, the speculation bit of the corresponding store buffer entry is cleared.
- If the instruction is killed, the corresponding store buffer entry is freed.

Loads work normally, except that older store buffer entries need to be searched before accessing the memory or the cache.
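The store buffer rules above can be sketched as follows. The names are hypothetical and the model is deliberately simple: memory is a dictionary, every buffered store is assumed to be older than the load that searches the buffer, and an explicit `drain` step stands in for the hardware writing committed stores to memory.

```python
class StoreBuffer:
    """Holds stores until commit; each entry carries a speculation bit and a tag."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory  # stand-in for the cache/memory (assumption)
        self.entries = []     # program order, oldest first

    def store(self, tag: str, addr: int, value: int) -> None:
        self.entries.append(
            {"tag": tag, "addr": addr, "value": value, "speculative": True}
        )

    def commit(self, tag: str) -> None:
        """The store instruction committed: clear its speculation bit."""
        for e in self.entries:
            if e["tag"] == tag:
                e["speculative"] = False

    def kill(self, tag: str) -> None:
        """The store instruction was killed: free its entry."""
        self.entries = [e for e in self.entries if e["tag"] != tag]

    def drain(self) -> None:
        """Write committed stores to memory, oldest first, stopping at a speculative one."""
        while self.entries and not self.entries[0]["speculative"]:
            e = self.entries.pop(0)
            self.memory[e["addr"]] = e["value"]

    def load(self, addr: int) -> int:
        """Search buffered stores, youngest first, before going to memory."""
        for e in reversed(self.entries):
            if e["addr"] == addr:
                return e["value"]
        return self.memory.get(addr, 0)


mem = {0x10: 1}
sb = StoreBuffer(mem)
sb.store("s1", 0x10, 5)
forwarded = sb.load(0x10)     # the load sees the buffered store
sb.drain()
still_old = mem[0x10]         # memory is untouched while s1 is speculative
sb.commit("s1")
sb.drain()
now_new = mem[0x10]
```

Searching youngest-first means a load picks up the most recent buffered store to its address, and a killed store leaves no trace in memory because it never left the buffer.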

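Datapath: Branch Prediction and Speculative Execution

[Figure: PC -> Fetch -> Decode & Rename -> Reorder Buffer -> Commit -> Reg. File. The reorder buffer feeds the Branch unit, ALU, and MEM in Execute; MEM connects to the Store Buffer and D$. Branch prediction feeds the PC, branch resolution comes from the Branch unit, and commit updates the predictors.]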
