Topics. Digital Systems Architecture EECE EECE Predication, Prediction, and Speculation


Digital Systems Architecture EECE EECE
Predication, Prediction, and Speculation
Dr. William H. Robinson
February 25

Topics
  "Aha, now I see," said the blind man to the deaf man who couldn't hear him anyway.
  (Politically incorrect statement from elementary school)
  - Administrative stuff
    - Return Homework #2 and Reading Assignment #3
    - Project 1? (still grading)
    - No office hours today
  - Importance of the Reorder Buffer (ROB)

Instruction Fetch with Branch Prediction
  Problem:
  [Figure: Fetch Unit sends a stream of instructions to execute to the Out-Of-Order Execution Unit; correctness feedback on branch results returns to the Fetch Unit]
  - Instruction fetch decoupled from execution
  - Often issue logic (plus rename) included with Fetch

Branches Must Be Resolved Quickly For Loop Overlap!
  - In our loop-unrolling example, we relied on the fact that branches were under control of the fast integer unit in order to get overlap!
      Loop: LD    F0, 0(R1)
            MULTD F4, F0, F2
            SD    F4, 0(R1)
            SUBI  R1, R1, #8
            BNEZ  R1, Loop
  - What happens if the branch depends on the result of the MULTD?
    - We completely lose all of our advantages!
    - Need to be able to predict the branch outcome.
    - If we were to predict that the branch was taken, this would be right most of the time.
  - Problem much worse for superscalar machines!

Predicated Execution
  - Avoid branch prediction by turning branches into conditionally executed instructions:
      if (x) then A = B op C else NOP
    - If false, then neither store result nor cause exception
    - Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction
    - IA-64: 64 1-bit condition fields selected, so conditional execution of any instruction
    - This transformation is called "if-conversion"
  - Drawbacks to conditional instructions
    - Still takes a clock even if annulled
    - Stall if condition evaluated late
    - Complex conditions reduce effectiveness; condition becomes known late in pipeline
  [Figure: condition x guarding A = B op C]

Dynamic Branch Prediction Problem
  [Figure: incoming branches {Address} enter the Branch Predictor, which holds history information and emits a prediction {Address, Value}; corrections {Address, Value} are fed back]
  - Incoming stream of addresses
  - Fast outgoing stream of predictions
  - Correction information returned from pipeline

Need Address at Same Time as Prediction
  - Branch Target Buffer (BTB): address of branch used as index to get prediction AND branch address (if taken)
  - Note: must check for branch match now, since can't use wrong branch address (Figure 3.19, p. 210)
  [Figure: PC of instruction in FETCH is compared (=?) with the stored Branch PC; a match yields the Predicted PC and a taken/untaken prediction]

Branch (Pattern) History Table
  [Figure: Branch PC indexes a table of entries Predictor 0, Predictor 1, ..., Predictor 7]
  - BHT is a table of "Predictors"
    - Usually 2-bit, saturating counters
    - Indexed by PC address of branch, without tags
  - In Fetch stage of branch:
    - BTB identifies branch
    - Predictor from BHT used to make prediction
  - When branch completes
    - Update corresponding Predictor
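The BTB slide says the fetch PC must be matched against a stored branch PC before the predicted PC can be used. A minimal sketch of that check follows. The class name, the direct-mapped organization, the 8-entry size, and the 4-byte instruction width are all assumptions made for illustration; they are not taken from the slides or from Figure 3.19.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry holds (branch_pc, predicted_pc)."""

    def __init__(self, num_entries=8, instr_bytes=4):
        # num_entries and instr_bytes are illustrative assumptions.
        self.num_entries = num_entries
        self.instr_bytes = instr_bytes
        self.entries = [None] * num_entries

    def _index(self, pc):
        # Low-order bits of the instruction address select the entry.
        return (pc // self.instr_bytes) % self.num_entries

    def lookup(self, fetch_pc):
        """Return the predicted PC only if the stored branch PC matches."""
        entry = self.entries[self._index(fetch_pc)]
        if entry is not None and entry[0] == fetch_pc:
            return entry[1]
        return None  # no match: not a known taken branch

    def record_taken(self, branch_pc, target_pc):
        """Install or overwrite the entry once a taken branch resolves."""
        self.entries[self._index(branch_pc)] = (branch_pc, target_pc)


def next_fetch_pc(btb, fetch_pc):
    """Use the BTB target on a hit, otherwise fall through sequentially."""
    target = btb.lookup(fetch_pc)
    return target if target is not None else fetch_pc + btb.instr_bytes


btb = BranchTargetBuffer()
btb.record_taken(0x40, 0x10)
print(hex(next_fetch_pc(btb, 0x40)))  # hit: redirected to the stored target
print(hex(next_fetch_pc(btb, 0x60)))  # same index, different PC: falls through
```

In this sketch addresses 0x40 and 0x60 map to the same entry, so the second lookup shows why the match check matters: without it, a non-branch instruction would be redirected to another branch's target.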

Dynamic Branch Prediction (Jim Smith, 1981)
  - Solution: 2-bit scheme where change prediction only if get misprediction twice (Figure 3.7, p. 198)
  [State diagram: four states (Strong Predict Taken, Weak Predict Taken, Weak Predict Not Taken, Strong Predict Not Taken) with transitions on taken (T) and not-taken (NT) outcomes]
  - Red: stop, not taken
  - Green: go, taken
  - Adds hysteresis to decision making process

Correlating Branches
  - Hypothesis: recent branches are correlated
    - Behavior of recently executed branches affects prediction of current branch
  - Two possibilities; current branch depends on:
    - Last m most recently executed branches anywhere in program. Produces a "GA" (for "global adaptive") in the Yeh and Patt classification (e.g. GAg)
    - Last m most recent outcomes of same branch. Produces a "PA" (for "per-address adaptive") in same classification (e.g. PAg)
  - Little "g" means "global" pattern history table

Correlating Branches: Yeh and Patt Classification
  - Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table (BHT) entry
    - A single history table shared by all branches (appends a "g" at end), indexed by history value.
    - Address is used along with history to select table entry (appends a "p" at end of classification)
    - If only portion of address used, often appends an "s" to indicate "set-indexed" tables (i.e. GAs)
  [Figure: GAg: GBHR indexes GPHT; PAg: PABHR indexes GPHT; PAp: PABHR indexes PAPHT]
  - GAg: Global History Register, Global History Table
  - PAg: Per-Address History Register, Global History Table
  - PAp: Per-Address History Register, Per-Address History Table
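To make the difference between a single 2-bit predictor and a GAg predictor concrete, here is a small sketch. It assumes plain saturating counters (the variant the BHT slide mentions; the exact transitions in Figure 3.7 may differ), counters that start in the strong not-taken state, and a single branch in the instruction stream. All class and function names are hypothetical.

```python
class SaturatingCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state  # assumption: start at strong not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


class GAgPredictor:
    """Global history register of m bits indexing one shared table of counters."""

    def __init__(self, history_bits):
        self.history_bits = history_bits
        self.history = 0  # last m outcomes, newest in the low bit
        self.table = [SaturatingCounter() for _ in range(1 << history_bits)]

    def predict(self):
        return self.table[self.history].predict()

    def update(self, taken):
        # Train the counter selected by the history, then shift in the outcome.
        self.table[self.history].update(taken)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def count_mispredictions(predictor, outcomes):
    """Run predict-then-update over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


alternating = [True, False] * 50  # a branch that strictly alternates
print(count_mispredictions(SaturatingCounter(), alternating))
print(count_mispredictions(GAgPredictor(history_bits=2), alternating))
```

On the alternating pattern the lone counter never leaves the not-taken half and so misses every taken outcome, while the history-indexed table separates the two contexts and settles after a few warm-up misses. That is the correlation effect the hypothesis above relies on.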

Other Global Variants: Try to Avoid Aliasing
  [Figure: GAs: GBHR and branch address select an entry in PAPHT; GShare: GBHR and branch address are combined to index GPHT]
  - GAs: Global History Register, Per-Address (Set Associative) History Table
  - Gshare: Global History Register, Global History Table with simple attempt at anti-aliasing

What are Important Metrics?
  - Clearly, accuracy matters
    - Even 1% can be important when above 90% accuracy
  - Speed: does this affect cycle time?
  - Space: clearly total space matters!
    - Papers which do not try to normalize across different options are playing fast and loose with data
    - Try to get best performance for the cost

Calculating Number of State Bits
  - General case: (m,n) predictor
    - Uses behavior of last m branches
    - Selects among 2^m branch predictors
    - Each predictor is n bits
  - Number of bits in (m,n) predictor:
      2^m × n × (number of prediction entries selected by branch address)
  - Example: (0,2) predictor with 4096 entries = 8192 bits
  - Example: (2,2) predictor with 1024 entries = 8192 bits

Accuracy of Different Schemes
  [Bar chart: frequency of mispredictions (0% to 18%) for the benchmarks nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, and eqntott, comparing three predictors: 4,096 entries with 2 bits per entry; unlimited entries with 2 bits per entry; 1,024 entries (2,2)]
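The two examples on the state-bits slide can be checked directly from the formula. The function name below is hypothetical; the arithmetic is the formula as stated on the slide.

```python
def predictor_state_bits(m, n, entries):
    """Bits in an (m,n) predictor: 2^m predictors of n bits per address-selected entry."""
    return (2 ** m) * n * entries


print(predictor_state_bits(0, 2, 4096))  # (0,2) predictor with 4096 entries
print(predictor_state_bits(2, 2, 1024))  # (2,2) predictor with 1024 entries
```

Both configurations use the same storage, which is why the accuracy chart compares them: the (2,2) predictor spends its bits on history instead of on more address-selected entries.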

Discussion of Papers
  - "A Comparative Analysis of Schemes for Correlated Branch Prediction"
    Cliff Young, Nicolas Gloy, and Michael D. Smith
    "Modern high-performance architectures require extremely accurate branch prediction to overcome the performance limitations of conditional branches."
  - "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work?"
    Marius Evers, Sanjay J. Patel, Robert S. Chappell, and Yale N. Patt
    "To build high performance microprocessors, accurate branch prediction is required."

Dynamic Scheduling
  - HW exploitation of ILP
    - Works when can't know dependence at compile time
    - Code for one machine runs well on another
  - Scoreboard (e.g. CDC 6600 in 1963)
    - Centralized control structure
    - No register renaming, no forwarding
    - Pipeline stalls for WAR and WAW hazards
    - Are these fundamental limitations??? (No)
  - Reservation stations (e.g. IBM 360/91 in 1966)
    - Distributed control structures
    - Implicit renaming of registers (dispatched pointers)
    - WAR and WAW hazards eliminated by register renaming
    - Results broadcast to all reservation stations for RAW

Tomasulo's Organization
  [Figure: FP Op Queue and FP Registers feed the reservation stations Add1-Add3 and Mult1-Mult2; Load Buffers Load1-Load6 bring data from memory; Store Buffers send data to memory; results return on the Common Data Bus (CDB)]

Three Stages of Tomasulo's Algorithm
  1. Issue: get instruction from FP Op Queue
     If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)
  2. Execution: operate on operands (EX)
     When both operands ready then execute; if not ready, watch Common Data Bus for result
  3. Write result: finish execution (WB)
     Write on Common Data Bus to all awaiting units; mark reservation station available
  - Normal data bus: data + destination ("go to" bus)
  - Common data bus: data + source ("come from" bus)
    - 64 bits of data + 4 bits of Functional Unit source address
    - Write if matches expected Functional Unit (produces result)
    - Does the broadcast
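The write-result stage above ("come from" bus, write if the source matches the expected functional unit) can be sketched as follows. The field names Vj, Vk, Qj, Qk follow the usual reservation-station notation; the class, the `broadcast` helper, and the operand values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    op: str
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # names of the units that will produce them
    qk: Optional[str] = None

    def ready(self):
        """Both operands are present when no producer tag is outstanding."""
        return self.qj is None and self.qk is None

    def snoop(self, source, value):
        """Capture a broadcast value if it comes from a unit we are waiting on."""
        if self.qj == source:
            self.vj, self.qj = value, None
        if self.qk == source:
            self.vk, self.qk = value, None


def broadcast(stations, source, value):
    """Common Data Bus: every waiting station sees (source, value)."""
    for station in stations:
        station.snoop(source, value)


# Two multiplies waiting on two different loads (values are made up).
mult1 = ReservationStation("Mult1", "MULTD", vk=2.0, qj="Load1")
mult2 = ReservationStation("Mult2", "MULTD", vk=2.0, qj="Load2")
broadcast([mult1, mult2], "Load1", 3.0)
print(mult1.ready(), mult2.ready())
```

Because the bus carries the source tag and not a destination register, one broadcast can satisfy any number of waiting stations, and register names never need to be compared.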

Review: Loop Example Cycle 9

  Instruction status:
    ITER  Instruction        Issue  Exec Comp  Write Result
    1     LD    F0, 0(R1)    1      9
    1     MULTD F4, F0, F2   2
    1     SD    F4, 0(R1)    3
    2     LD    F0, 0(R1)    6
    2     MULTD F4, F0, F2   7
    2     SD    F4, 0(R1)    8

  Load and store buffers:
    Name    Busy  Addr  Fu
    Load1   Yes   80
    Load2   Yes   72
    Load3   No
    Store1  Yes   80    Mult1
    Store2  Yes   72    Mult2
    Store3  No

  Reservation stations:
    Time  Name   Busy  Op     Vj  Vk     Qj     Qk
          Add1   No
          Add2   No
          Add3   No
          Mult1  Yes   Multd      R(F2)  Load1
          Mult2  Yes   Multd      R(F2)  Load2

  Code:
    Loop: LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    F4, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

  Register result status:
    Register  F0     F4
    Fu        Load2  Mult2

  - Dataflow graph constructed completely in hardware
  - Renaming detaches early iterations from registers

What about Precise Exceptions/Interrupts?
  - Both Scoreboard and Tomasulo have: in-order issue, out-of-order execution, out-of-order completion
  - Recall: an interrupt or exception is precise if there is a single instruction for which:
    - All instructions before that have committed their state
    - No following instructions (including the interrupting instruction) have modified any state.
  - Need way to resynchronize execution with instruction stream (i.e. with issue-order)
    - Easiest way is with in-order completion (i.e. reorder buffer)
    - Other techniques (Smith paper): Future File, History Buffer

Discussion of Paper
  - "Implementation of Precise Interrupts in Pipelined Processors"
    James Smith and Andrew Pleszkun
  - From Smith's ISCA Retrospective: "In retrospect, although neither of us thought it very significant at the time, the reorder buffer has probably turned out to be the main contribution of the paper."

HW Support for Precise Interrupts
  - Concept of Reorder Buffer (ROB):
    - Holds instructions in FIFO order, exactly as they were issued
      - Each ROB entry contains PC, dest reg, result, exception status
    - When instructions complete, results placed into ROB
      - Supplies operands to other instructions between execution complete & commit (more registers, like RS)
      - Tag results with ROB buffer number instead of reservation station
    - Instructions commit: values at head of ROB placed in registers
    - As a result, easy to undo speculated instructions on mispredicted branches or on exceptions
  [Figure: FP Op Queue, Reorder Buffer, FP Regs, and reservation stations feeding the FP adders, with a commit path from the Reorder Buffer to the registers]

Four Steps of Speculative Tomasulo Algorithm
  1. Issue: get instruction from FP Op Queue
     If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called "dispatch")
  2. Execution: operate on operands (EX)
     When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called "issue")
  3. Write result: finish execution (WB)
     Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available.
  4. Commit: update register with reorder result
     When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer.

Hardware Complexities with Reorder Buffer (ROB)?
  [Figure: reorder table with fields Destination Register, Result, Exceptions?, Valid, Program Counter; a compare network connects it to the FP Op Queue, Reorder Buffer, FP Regs, and the reservation stations feeding the FP adders]
  - How do you find the latest version of a register?
    - As specified by Smith paper, need associative comparison network
    - Could use future file or just use the register result status buffer to track which specific reorder buffer has received the value
  - Need as many ports on ROB as register file

Tomasulo with Reorder Buffer (step 1)
  Reorder buffer (ROB1 oldest, ROB7 newest):
    Entry  Dest  Instruction     Done?
    ROB1   F0    LD F0,10(R2)    N
  Load buffer (dest ROB entry, address):
    1  10+R2

Tomasulo with Reorder Buffer (step 2)
  Reorder buffer (ROB1 oldest, ROB7 newest):
    Entry  Dest  Instruction      Done?
    ROB2   F10   ADDD F10,F4,F0   N
    ROB1   F0    LD F0,10(R2)     N
  Reservation stations (dest ROB entry, op, operands):
    2  ADDD  R(F4), ROB1
  Load buffer (dest ROB entry, address):
    1  10+R2
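The commit step and the flush on a mispredicted branch can be sketched with a FIFO. This is a simplified illustration under stated assumptions: one commit per call, no stores or exceptions, a hypothetical `mispredicted` flag set when a branch writes its result, and a 7-entry buffer chosen to match the ROB1-ROB7 labels in the figures. None of the names come from the slides.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest: Optional[str]          # destination register, None for a branch
    value: Optional[float] = None
    done: bool = False
    mispredicted: bool = False   # hypothetical flag, set only by branches


class ReorderBuffer:
    def __init__(self, size=7):
        self.size = size
        self.entries = deque()   # head (oldest) is on the left

    def issue(self, dest):
        """Allocate an entry in program order; None signals a structural stall."""
        if len(self.entries) >= self.size:
            return None
        entry = RobEntry(dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Results may arrive out of order; they wait in the ROB."""
        entry.value = value
        entry.done = True
        entry.mispredicted = mispredicted

    def commit(self, regs):
        """Retire the head if its result is present; return True if it retired."""
        if not self.entries or not self.entries[0].done:
            return False
        head = self.entries.popleft()
        if head.mispredicted:
            self.entries.clear()  # discard everything issued after the branch
        elif head.dest is not None:
            regs[head.dest] = head.value
        return True


rob, regs = ReorderBuffer(), {}
ld = rob.issue("F0")
add = rob.issue("F10")
rob.write_result(add, 5.0)   # the younger instruction finishes first
print(rob.commit(regs))      # head (the load) is not done, so nothing commits
rob.write_result(ld, 1.0)
rob.commit(regs)
rob.commit(regs)
print(regs)
```

Architectural registers change only in `commit`, in issue order, so discarding the younger entries is all that is needed to undo speculative work in this sketch.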

Tomasulo with Reorder Buffer (step 3)
  Reorder buffer (ROB1 oldest, ROB7 newest):
    Entry  Dest  Instruction      Done?
    ROB3   F2    DIVD F2,F10,F6   N
    ROB2   F10   ADDD F10,F4,F0   N
    ROB1   F0    LD F0,10(R2)     N
  Reservation stations (dest ROB entry, op, operands):
    2  ADDD  R(F4), ROB1
    3  DIVD  ROB2, R(F6)
  Load buffer (dest ROB entry, address):
    1  10+R2

Tomasulo with Reorder Buffer (step 4)
  Reorder buffer (ROB1 oldest, ROB7 newest):
    Entry  Dest  Instruction      Done?
    ROB6   F0    ADDD F0,F4,F6    N
    ROB5   F4    LD F4,0(R3)      N
    ROB4         BNE F2,< >       N
    ROB3   F2    DIVD F2,F10,F6   N
    ROB2   F10   ADDD F10,F4,F0   N
    ROB1   F0    LD F0,10(R2)     N
  Reservation stations (dest ROB entry, op, operands):
    2  ADDD  R(F4), ROB1
    6  ADDD  ROB5, R(F6)
    3  DIVD  ROB2, R(F6)
  Load buffers (dest ROB entry, address):
    1  10+R2
    5  0+R3

Relationship Between Precise Interrupts and Speculation
  - Speculation is a form of guessing
    - Branch prediction, data prediction
    - If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly
    - This is exactly the same as precise exceptions!
  - Branch prediction is very important!
    - Need to take our best shot at predicting branch direction.
    - If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:
      - Consider 4 instructions per cycle
      - If take single cycle to decide on branch, waste from 4 to 7 instruction slots!
  - Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
    - This is why reorder buffers are in all new processors

Summary
  - Branch prediction is necessary for high performance in modern processors
  - Important metrics for branch prediction schemes include accuracy, speed, and size/cost
  - Hardware-based speculation key ideas
    - Dynamic branch prediction to choose which instructions to execute
    - Speculation to allow the execution of instructions before resolving control dependencies, with ability to undo incorrect sequences
    - Dynamic scheduling to deal with different combinations of basic blocks
