Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 [Figure: trace-cache fetch organization: trace table (block_ids, trace_id), branch history hash, rename/fill unit, block cache, final collapse, fetch buffer, execution core, I-cache] Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin. Slide 1
Announcements Wenisch 2007 -- Portions Austin, Brehob, Falsafi HW #3 (due 10/12) Project proposal (due 10/12) Review meetings will be Monday (10/15) Sign up for timeslots in discussion on Friday Midterm (10/17) Slide 2
Readings For Today: Rotenberg et al., "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching" Slide 3
Flow Path Model of Superscalars I-cache Branch Predictor FETCH Instruction Buffer Instruction Flow DECODE Integer Floating-point Media Memory Register Data Flow Reorder Buffer (ROB) Store Queue EXECUTE COMMIT D-cache Memory Data Flow Slide 4
Two-level Predictor Update Global BHR 10101010 Pattern History Table (PHT) NT T When do we update BHR and PHT? Need updated BHR for next prediction (speculative update) May or may not update PHT speculatively Must undo/fix updates on mispredictions! Slide 5
Correlated Predictor Design Space Design choice I: one global BHR or one per PC (local)? Each one captures different kinds of patterns Global is better: it also captures local patterns for tight loop branches Design choice II: how many history bits (BHR size)? + Given unlimited resources, longer BHRs are better, but BHT utilization decreases Many history patterns are never seen Many branches are history independent BHR length < log2(BHT size) Predictor takes longer to train Typical length: 8 to 12 Design choice III: one global PHT or per-PC pattern tables? Different branches want different predictions for the same pattern Storage cost of each PHT is high, and only a few patterns matter Slide 6
Hybrid Predictor Hybrid (tournament) predictor [McFarling] Attacks correlated predictor BHT utilization problem Idea: combine two predictors Simple BHT predicts history-independent branches Correlated predictor predicts only branches that need history Chooser assigns branches to one predictor or the other Branches start in simple BHT, move to correlated predictor past a misprediction threshold + Correlated predictor can be made smaller, handles fewer branches + 90-95% accuracy Alpha 21264: Hybrid of gshare & 2-bit saturating counters [Figure: PC and BHR index the BHTs; chooser selects between them] Slide 7
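The gshare component mentioned above can be sketched in a few lines. This is a toy model with made-up sizes, not the 21264's actual configuration: global history is XORed with the branch PC to index a table of 2-bit saturating counters.

```python
# Minimal gshare sketch (illustrative parameters). Hardware would update
# the BHR speculatively at predict time (see the previous slide); for
# simplicity this model updates it when the branch resolves.

class Gshare:
    def __init__(self, hist_bits=8):
        self.hist_bits = hist_bits
        self.bhr = 0                        # global branch history register
        self.pht = [1] * (1 << hist_bits)   # 2-bit counters, init weakly NT

    def _index(self, pc):
        return (pc ^ self.bhr) & ((1 << self.hist_bits) - 1)

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)   # saturate up
        else:
            self.pht[i] = max(0, self.pht[i] - 1)   # saturate down
        self.bhr = ((self.bhr << 1) | int(taken)) & ((1 << self.hist_bits) - 1)
```

Training a single always-taken branch quickly drives the relevant counters to "strongly taken", after which the predictor agrees with the branch.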
Implementation Challenges Big hybrid predictors: Accurate, but slow May take multiple cycles to make a prediction! Challenge I: Need a prediction right away Overriding predictors Use fast, simple predictor for initial prediction Confirm/fix with slower, more accurate predictor Use misprediction recovery mechanism if predictors disagree Challenge II: Pipelining predictors BHR needs immediate update for next prediction Slide 8
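The overriding-predictor idea can be sketched as follows. The interface is hypothetical, and the multi-cycle latency of the slow predictor is collapsed into a single call; the point is only the control flow: fetch proceeds on the fast answer, and a disagreement triggers a cheap front-end redirect.

```python
# Sketch of an overriding predictor pair (hypothetical interface).
# `fast` models a 1-cycle bimodal predictor, `slow` a multi-cycle hybrid;
# `redirect` models squashing wrong-path fetch and restarting.

def fetch_direction(pc, fast, slow, redirect):
    quick = fast(pc)            # available immediately; fetch follows this
    accurate = slow(pc)         # arrives cycles later in real hardware
    if accurate != quick:
        redirect(pc, accurate)  # predictors disagree: redirect the front end
    return accurate             # the slow predictor has the final word
```

This costs a few bubbles on disagreement, which is far cheaper than a full pipeline flush on an actual misprediction.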
Deep Speculation & Recovery [Figure: branch tree with speculative paths tagged tag1, tag2, tag3] Leading Speculation Tag speculative instructions Advance branch and following instructions Buffer addresses of speculated branch instructions Slide 9
Mis-speculation Recovery [Figure: branch tree with tags tag1, tag2, tag3] Eliminate Incorrect Path Must ensure that the mis-speculated instructions produce no side effects Start New Correct Path Must have remembered the alternate (non-predicted) path Slide 10
Trailing Confirmation [Figure: branch tree with tags tag1, tag2, tag3] Trailing Confirmation When branch is resolved, remove/deallocate speculation tag Permit completion of branch and following instructions Slide 11
Mis-speculation Recovery Eliminate Incorrect Path Clean up all instructions younger than the mispredicted branch Want to clean up ASAP (can't use slow exception rewind) Must clean up ROB, LSQ, map table, RS, fetch/dispatch buffers How expensive is a misprediction? Start New Correct Path Prediction was NT: PC = computed branch target Prediction was T: PC = next sequential address Can speculate again when encountering a new branch How soon can you restart? Slide 12
Fast Branch Recovery Key Ideas: For branches, keep a copy of all state needed for recovery Branch stack stores recovery state For all instructions, keep track of the pending branches they depend on Branch mask register tracks which stack entries are in use Branch masks in RS/FU pipeline indicate all older pending branches [Figure: branch stack (recovery PC, ROB & LSQ tails, BP repair, free list), b-mask register, RS entries with op, tags, and b-mask fields] Slide 13
Fast Branch Recovery Dispatch Stage Branches: If branch stack is full, stall Allocate stack entry, set b-mask bit Take snapshot of map table, free list, ROB & LSQ tails Save PC & details needed to fix BP All instructions: Copy b-mask to RS entry [Figure: branch stack entry allocated, b-mask reg = 1000; RS entries: mul with b-mask 0000 (older than the branch), br and add with b-mask 1000] Slide 14
Fast Branch Recovery Branch Resolution - Mispredict Fix ROB & LSQ: Set tail pointers from branch stack Fix Map Table & free list: Restore from checkpoint Fix RS & FU pipeline entries: Squash if b-mask bit for branch == 1 Clear branch stack entry, b-mask bit Can handle nested mispredictions! [Figure: branch stack entry freed, b-mask reg = 0000; RS after squash of br and add (b-mask 1000), mul (b-mask 0000) survives] Slide 15
Fast Branch Recovery Branch Resolution - Correct Prediction Free branch stack entry Clear bit in b-mask reg Flash-clear b-mask bit in RS & pipeline: Frees b-mask bit for immediate reuse Branches may resolve out of order! B-mask bits keep track of unresolved control dependencies [Figure: b-mask reg = 0000; RS entries with the resolved branch's b-mask bit cleared] Slide 16
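The b-mask mechanism across slides 13-16 can be modeled as bit manipulation. The class below is an illustrative sketch (names and the list-of-masks RS model are made up for the example): each dispatched branch takes a stack slot and sets its bit in the global b-mask, every dispatched instruction copies the current b-mask, a mispredict squashes RS entries whose mask has that bit set, and a correct resolve flash-clears the bit everywhere.

```python
# Toy model of branch-stack allocation and b-mask recovery.
# The RS is modeled as a list of b-mask values, one per entry.

class BranchStack:
    def __init__(self, entries=4):
        self.free = (1 << entries) - 1   # bitmap of free stack slots
        self.bmask = 0                   # bits of currently pending branches

    def dispatch_branch(self):
        if self.free == 0:
            return None                  # branch stack full: stall dispatch
        bit = self.free & -self.free     # grab lowest free slot
        self.free &= ~bit                # (checkpoint of map table, free
        self.bmask |= bit                #  list, ROB/LSQ tails saved here)
        return bit

    def dispatch_insn(self):
        return self.bmask                # RS entry records pending branches

    def resolve(self, bit, mispredict, rs):
        if mispredict:
            # squash every entry dependent on this branch
            rs[:] = [m for m in rs if not (m & bit)]
        else:
            # correct prediction: flash-clear the bit in all entries
            rs[:] = [m & ~bit for m in rs]
        self.bmask &= ~bit               # free the stack slot for reuse
        self.free |= bit
```

Because each entry carries its own mask, nested branches can resolve out of order, matching the slide's note.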
Wide Instruction Fetch Issues Average Basic Block Size integer code: 4 to 6 instructions floating point code: 6 to 10 instructions Three Major Challenges: Multiple Branch Prediction Multiple Fetch Groups Alignment and Collapsing Cannot be solved with just larger cache blocks [Figure: branch prediction + fetch, instruction cache, instruction buffer, decode, dispatch] Slide 17
Wide Fetch - Sequential Instructions [Figure: I$ + BP fetching blocks 1020-1023] What is involved in fetching multiple instructions per cycle? In same cache block? no problem Favors larger block size (independent of hit rate) Compilers align basic blocks to I$ lines (pad with nops) Reduces I$ capacity + Increases fetch bandwidth utilization (more important) In multiple blocks? Fetch block A and A+1 in parallel Banked I$ + combining network May add latency (add pipeline stages to avoid slowing down clock) Slide 18
Wide Fetch - Non-sequential Two related questions How many branches predicted per cycle? Can we fetch from multiple taken branches per cycle? Simplest, most common organization: 1 and No One prediction, discard post branch insns if prediction is Taken Lowers effective fetch width and IPC Average number of instructions per taken branch? Assume: 20% branches, 50% taken ~10 instructions Consider a 10 instruction loop body with an 8 issue processor Without smarter fetch, ILP is limited to 5 (not 8) Compiler can help Unroll loops, reduce taken branch frequency Slide 19
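The slide's numbers can be checked with quick arithmetic (assuming branch outcomes are independent, which is a simplification):

```python
# Back-of-envelope check of the fetch-bandwidth numbers above.

branch_frac = 0.20   # 20% of instructions are branches
taken_frac = 0.50    # 50% of branches are taken

# 10% of instructions are taken branches, so the expected run length
# between taken branches is 1 / 0.10 = 10 instructions.
insns_per_taken_branch = 1 / (branch_frac * taken_frac)

# A 10-instruction loop body on an 8-wide machine that stops fetching
# at each taken back-edge: 8 insns in cycle 1, then 2 in cycle 2,
# averaging 5 per cycle rather than 8.
avg_fetch_width = 10 / 2
```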
Multiple Branch Predictions Issues with multiple branch predictions: Latency resulting from sequential predictions Later predictions based on stale/speculative history Don't forget, 0.95 x 0.95 x 0.95 ≈ 0.86 [Figure: fetch address indexing three BTBs, one per fetched block] Slide 20
Examples of Multi-Branch Predictors [Figure: BHSR bits b_n..b_0 index a PHT producing predictions p0, p1, p2] How do you update this thing after a branch resolves? Slide 21
Examples of Multi-Branch Predictors [Figure: BHSR bits b_n..b_0; the older bits b_n..b_2 index a 2^(n-2) x 4-entry PHT; the newest bits b_1, b_0 select p0, and speculative shifts of p0 and p1 select p1 and p2] Slide 22
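The selection step in the figure above can be sketched directly. This is an illustrative simplification with a 4-counter group: one PHT access (indexed by the older history bits) returns four 2-bit counters, the newest real history bits pick the first prediction, and each later prediction shifts the earlier speculative predictions into the history used for selection.

```python
# Three predictions from one PHT read (toy model; real multi-branch
# predictors vary in group size and indexing).

def predict3(group, b1, b0):
    # group: 4 two-bit counters selected by the older history bits
    p0 = group[(b1 << 1) | b0] >= 2               # newest real history
    p1 = group[(b0 << 1) | int(p0)] >= 2          # shift in speculative p0
    p2 = group[(int(p0) << 1) | int(p1)] >= 2     # shift in p0, p1
    return p0, p1, p2
```

This makes concrete why later predictions rest on speculative history: p1 and p2 are only as good as p0.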
Multiple Predicted Taken Branches Issues with multiple taken branches: Long latency with multiple sequential I-cache accesses or, multi-ported I-cache with slower access latency or, multi-banked I-cache to approximate multi-porting [Figure: three fetch addresses into a multi-ported I-cache, producing instructions for blocks 1, 2, and 3] Slide 23
Instruction Alignment and Collapsing Issues with alignment and collapsing: Misalignment between fetch group and cache line Packing of variable-sized blocks into fetch buffer [Figure: I-cache ports 1-3 feeding the fetch buffer; how do you know where each block starts?] Slide 24
Mapping CFG to Linear Instruction Sequence [Figure: control-flow graph with blocks A, B, C, D and alternative linear orderings of the blocks] Slide 25
Trace Cache Motivation [Figure: CFG blocks A-G; paths A-B-C and D-F-G are 10% of static code but 90% of dynamic execution; I-cache line boundaries vs. trace-cache line boundaries] Storing traces (ABC, DFG) improves code density and fetch continuity Slide 26
Trace Cache Trace cache (T$) [Peleg+Weiser, Rotenberg+] Overcomes serialization of prediction and fetch by combining them New kind of I$ that stores dynamic, not static, insn sequences Blocks can contain statically non-contiguous insns Tag: PC of first insn + N/T of embedded branches Coupled with trace predictor (TP) Predicts next trace, not next branch Trace identified by initial address & internal branch outcomes Slide 27
Trace Cache Example Traditional instruction cache: Tag 0: addi, beq #4, ld, sub; Tag 4: st, call #32, ld, add Cycle 1: addi and beq fetch; st and call stall Cycle 2: st and call fetch Trace cache: Tag 0:T: addi, beq #4, st, call #32 Cycle 1: all four insns fetch together Slide 28
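The tagging scheme from the previous slide (start PC plus the taken/not-taken outcomes of embedded branches) can be sketched as a lookup table. This is a simplified model: real designs add associativity, line-size limits, and partial matching.

```python
# Minimal trace cache model: a trace is identified by its starting PC
# and the outcomes of its embedded branches, so the same static code
# can appear in several different traces.

class TraceCache:
    def __init__(self):
        self.traces = {}   # (start_pc, branch_outcomes) -> insn sequence

    def fill(self, start_pc, outcomes, insns):
        # fill unit builds the trace at completion time
        self.traces[(start_pc, tuple(outcomes))] = insns

    def fetch(self, start_pc, predicted_outcomes):
        # hit only if both the PC and the predicted path match
        return self.traces.get((start_pc, tuple(predicted_outcomes)))
```

With the example trace from slide 28 filled, a fetch that predicts the beq taken hits and returns all four instructions in one access; predicting it not-taken misses and falls back to the I-cache.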
A Typical Trace Cache Organization [Figure: predicted PC feeds the next-trace predictor (via branch history hash) and indexes both the trace cache and the I-cache; hits fill the fetch buffer for the execution core; the fill unit constructs traces at completion] Slide 29
Aside: Multiple-issue CISC How do we apply superscalar techniques to CISC (e.g., x86)? Break macro-ops into micro-ops Also called μops or RISC-ops A typical CISCy instruction add [r1], [r2] -> [r3] becomes: Load [r1] -> t1 (t1 is a temp. register, not visible to software) Load [r2] -> t2 Add t1, t2 -> t3 Store t3 -> [r3] However, conversion is expensive (latency, area, power) Solution: cache converted instructions in trace cache Used by Pentium 4 Internal pipeline manipulates only these RISC-like instructions Slide 30
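The cracking step on this slide can be written out as a tiny translation function. The textual μop format here is invented for illustration, not Intel's actual encoding.

```python
# Toy cracking of the slide's memory-to-memory add into RISC-like
# micro-ops using architecturally invisible temporaries t1..t3.

def crack(insn):
    if insn == "add [r1], [r2] -> [r3]":
        return [
            "load [r1] -> t1",    # t1..t3 are hidden temp registers
            "load [r2] -> t2",
            "add t1, t2 -> t3",
            "store t3 -> [r3]",
        ]
    return [insn]   # simple register-to-register ops map to one μop
```

Caching the output of this function rather than re-running it every fetch is exactly the Pentium 4 trace-cache argument: the expensive decode happens once per trace fill, not once per execution.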
Intel P4 Trace Cache A 12K-uop trace cache replaces the L1 I-cache 6 uops per trace line, can include branches Trace cache returns 3 uops per cycle IA-32 decoder can be simpler and slower Only needs to decode one IA-32 instruction per cycle [Figure: front-end BTB (4K entries), ITLB & prefetcher, L2 interface, IA-32 decoder, trace cache BTB (512 entries), 12K-uop trace cache] Slide 31
Wide-Fetch I-cache vs. T-cache Enhanced Instruction Cache: Fetch = 1. Multiple-branch prediction, 2. Instruction cache fetch, 3. Instruction alignment & collapsing; Completion = Multiple-branch predictor update Proposed Trace Cache: Fetch = 1. Next trace prediction, 2. Trace cache fetch; Completion = Trace construction and fill Slide 32
Trace Cache Trade-offs Trace cache: Pros: Moves complexity to backend Cons: Inefficient instruction storage (redundancy) Enhanced instruction cache: Pros: Efficient instruction storage Cons: Complexity at fetch time Slide 33
As Machines Get Wider (and Deeper) Fetch: 1. Eliminate stages 2. Relocate work to the backend Slide 34