Wide Instruction Fetch


Wide Instruction Fetch
EECS 470, Fall 2007, Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs470
[Figure: block-cache fetch organization — branch history hash, trace table (block_ids, trace_id), fill unit, block cache, final collapse, fetch buffer, execution core, I-cache, rename, completion]
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Shen, Smith, Sohi, Tyson, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Michigan, and University of Wisconsin. Slide 1

Announcements HW #3 (due 10/12). Project proposal (due 10/12). Review meetings will be Monday (10/15); sign up for timeslots in discussion on Friday. Midterm (10/17). Slide 2

Readings For today: Rotenberg et al., Trace Cache. Slide 3

Flow Path Model of Superscalars [Figure: I-cache and branch predictor feed FETCH and the instruction buffer (instruction flow); DECODE; integer, floating-point, media, and memory pipelines with the reorder buffer (ROB) (register data flow); store queue, EXECUTE, COMMIT, and D-cache (memory data flow)] Slide 4

Two-level Predictor Update [Figure: global BHR 10101010 indexing a Pattern History Table; NT/T outcomes] When do we update the BHR and PHT? Need the updated BHR for the next prediction (speculative update). May or may not update the PHT speculatively. Must undo/fix speculative updates on mispredictions! Slide 5
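The update policy above can be sketched behaviorally: the BHR is updated speculatively at predict time (with a checkpoint saved for repair), while the PHT counters are updated only at resolve time. This is an illustrative model, assuming a gshare-style index; the checkpoint/restore scheme is one possible repair mechanism, not a specific machine's.

```python
# Behavioral sketch of a speculatively updated two-level predictor.
# Assumptions: gshare-style (BHR XOR PC) indexing, 2-bit PHT counters,
# non-speculative PHT update at branch resolution.

BHR_BITS = 8
PHT_SIZE = 1 << BHR_BITS

class TwoLevelPredictor:
    def __init__(self):
        self.bhr = 0                          # global branch history register
        self.pht = [1] * PHT_SIZE             # 2-bit counters, weakly not-taken

    def predict(self, pc):
        idx = (self.bhr ^ pc) & (PHT_SIZE - 1)
        taken = self.pht[idx] >= 2
        checkpoint = self.bhr                 # saved for misprediction repair
        # Speculative update: the *next* prediction must see this outcome
        self.bhr = ((self.bhr << 1) | taken) & (PHT_SIZE - 1)
        return taken, idx, checkpoint

    def resolve(self, idx, checkpoint, predicted, actual):
        # PHT trained non-speculatively, once the real outcome is known
        if actual:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        if predicted != actual:
            # Undo the speculative BHR update; shift in the real outcome
            self.bhr = ((checkpoint << 1) | actual) & (PHT_SIZE - 1)
```

On a misprediction the BHR is rebuilt from the checkpoint plus the resolved outcome, which is exactly the "undo/fix updates" requirement on the slide.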

Correlated Predictor Design Space
Design choice I: one global BHR or one per PC (local)? Each captures different kinds of patterns. Global is better; it also captures local patterns for tight-loop branches.
Design choice II: how many history bits (BHR size)? + Given unlimited resources, longer BHRs are better, but BHT utilization decreases: many history patterns are never seen, and many branches are history-independent, so keep BHR length < log2(BHT size). A longer BHR also takes longer to train. Typical length: 8-12.
Design choice III: per-PC pattern history tables (PHTs)? Different branches want different predictions for the same pattern, but the storage cost of each PHT is high and only a few patterns matter. Slide 6

Hybrid Predictor Hybrid (tournament) predictor [McFarling]. Attacks the correlated predictor's BHT-utilization problem. Idea: combine two predictors. A simple BHT predicts history-independent branches; the correlated predictor predicts only the branches that need history; a chooser assigns branches to one predictor or the other. Branches start in the simple BHT and move when they cross a misprediction threshold. + Correlated predictor can be made smaller, handles fewer branches + 90-95% accuracy. Alpha 21264: hybrid of gshare & 2-bit saturating counters. [Figure: PC and BHR index the simple BHT, the correlated BHT, and the chooser] Slide 7
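The chooser can be sketched as its own table of 2-bit counters trained toward whichever component predictor was right. This is a minimal illustration of the tournament idea; the table size and PC-modulo indexing are invented for the example, and the two component predictors are passed in as already-computed predictions.

```python
# Sketch of a McFarling-style tournament chooser: per-entry 2-bit counters
# select between a simple (bimodal) prediction and a correlated (e.g.,
# gshare) prediction. Counter >= 2 favors the correlated predictor.

class Chooser:
    def __init__(self, size=1024):
        self.table = [1] * size   # start weakly favoring the simple BHT

    def choose(self, pc, simple_pred, corr_pred):
        idx = pc % len(self.table)
        return corr_pred if self.table[idx] >= 2 else simple_pred

    def update(self, pc, simple_pred, corr_pred, actual):
        # Train toward whichever component was right (no change on a tie)
        idx = pc % len(self.table)
        if corr_pred == actual and simple_pred != actual:
            self.table[idx] = min(3, self.table[idx] + 1)
        elif simple_pred == actual and corr_pred != actual:
            self.table[idx] = max(0, self.table[idx] - 1)
```

A branch thus "migrates" to the correlated predictor only after the simple predictor has missed it often enough, matching the misprediction-threshold behavior on the slide.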

Implementation Challenges Big hybrid predictors: accurate, but slow. May take multiple cycles to make a prediction! Challenge I: need a prediction right away. Overriding predictors: use a fast, simple predictor for the initial prediction; confirm/fix with a slower, more accurate predictor; use the misprediction-recovery mechanism if the predictors disagree. Challenge II: pipelining predictors. The BHR needs an immediate update for the next prediction. Slide 8
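The overriding arrangement can be sketched as follows. This is only a conceptual model, assuming both predictors are visible as callables; in hardware the slow prediction arrives cycles later and triggers a cheap front-end re-steer rather than a full recovery.

```python
# Sketch of an "overriding" predictor pair: a fast single-cycle predictor
# steers fetch immediately; a slower, more accurate predictor checks it
# and overrides on disagreement (re-steering fetch at a small penalty).

def fetch_direction(pc, fast_predict, slow_predict):
    """Return (initial_direction, override) for the branch at pc.
    override is None when the predictors agree, else the corrected direction."""
    fast = fast_predict(pc)        # available immediately
    slow = slow_predict(pc)        # modeled here as arriving later
    return fast, (slow if slow != fast else None)
```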

Deep Speculation & Recovery [Figure: branch tree with NT/T edges; speculation tags tag1, tag2, tag3 mark successive speculated branches] Leading Speculation: tag speculative instructions; advance the branch and following instructions; buffer the addresses of speculated branch instructions. Slide 9

Mis-speculation Recovery [Figure: branch tree with tags tag1-tag3; the mispredicted subtree is pruned] Eliminate the incorrect path: must ensure that the mis-speculated instructions produce no side effects. Start the new correct path: must have remembered the alternate (non-predicted) path. Slide 10

Trailing Confirmation [Figure: branch tree with tags tag1-tag3; a confirmed branch's tag is removed] Trailing confirmation: when a branch is resolved, remove/deallocate its speculation tag and permit completion of the branch and following instructions. Slide 11

Mis-speculation Recovery Eliminate the incorrect path: clean up all instructions younger than the mispredicted branch. Want to clean up ASAP (can't use the exception rewind mechanism). Must clean up the ROB, LSQ, map table, RS, and fetch/dispatch buffers. How expensive is a misprediction? Start the new correct path: if the prediction was NT, PC = computed branch target; if the prediction was T, PC = next sequential address. Can speculate again when encountering a new branch. How soon can you restart? Slide 12

Fast Branch Recovery Key ideas: For branches, keep a copy of all state needed for recovery; a branch stack stores the recovery state. For all instructions, keep track of the pending branches they depend on: a branch-mask (b-mask) register tracks which stack entries are in use, and branch masks in the RS/FU pipeline indicate all older pending branches. [Figure: branch stack with recovery PC, destination tag T+, ROB & LSQ tails, BP repair info, and free list; b-mask register; RS entries carry op, T, T1+, T2+, and a b-mask field] Slide 13

Fast Branch Recovery - Dispatch Stage Branches: if the branch stack is full, stall. Otherwise allocate a stack entry and set its b-mask bit; take a snapshot of the map table, free list, and ROB/LSQ tails; save the PC & the details needed to fix the branch predictor. All instructions: copy the current b-mask into the RS entry. [Figure: one branch stack entry allocated; b-mask register = 1000; RS entries mul (b-mask 0000), br and add (b-mask 1000)] Slide 14

Fast Branch Recovery - Branch Resolution: Mispredict Fix the ROB & LSQ: set the tail pointers from the branch stack. Fix the map table & free list: restore from the checkpoint. Fix RS & FU pipeline entries: squash any entry whose b-mask bit for this branch == 1. Clear the branch stack entry and its b-mask bit. Can handle nested mispredictions! [Figure: branch stack entry freed; b-mask register = 0000; the RS entries br and add (b-mask 1000) are squashed] Slide 15

Fast Branch Recovery - Branch Resolution: Correct Prediction Free the branch stack entry and clear its bit in the b-mask register. Flash-clear the b-mask bit in all RS & pipeline entries; this frees the b-mask bit for immediate reuse. Branches may resolve out of order! B-mask bits keep track of unresolved control dependencies. [Figure: branch stack entry freed; b-mask register = 0000; RS entries mul, br, add now carry b-mask 0000] Slide 16
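The three slides above (dispatch, mispredict, correct prediction) can be pulled together in one behavioral sketch. The structure and method names are illustrative; real hardware performs the squash and flash-clear in parallel, single-cycle logic rather than by walking a list.

```python
# Behavioral sketch of the branch-stack / b-mask recovery scheme.
# Assumptions: a 4-entry branch stack, the "checkpoint" is an opaque
# snapshot (map table, free list, ROB/LSQ tails, BP-repair PC).

STACK_ENTRIES = 4

class Core:
    def __init__(self):
        self.bmask_reg = 0     # which branch-stack entries are in use
        self.stack = {}        # entry index -> checkpoint snapshot
        self.rs = []           # RS entries as [op, bmask] pairs

    def dispatch_branch(self, checkpoint):
        free = [i for i in range(STACK_ENTRIES)
                if not (self.bmask_reg >> i) & 1]
        if not free:
            return None                    # branch stack full: stall
        entry = free[0]
        self.bmask_reg |= 1 << entry       # set the b-mask bit
        self.stack[entry] = checkpoint
        return entry

    def dispatch_insn(self, op):
        # every instruction carries the b-mask of all older pending branches
        self.rs.append([op, self.bmask_reg])

    def resolve(self, entry, mispredict):
        bit = 1 << entry
        restored = None
        if mispredict:
            # squash everything dependent on this branch; restore checkpoint
            self.rs = [e for e in self.rs if not e[1] & bit]
            restored = self.stack[entry]
        # either way: flash-clear the bit everywhere and free the entry
        for e in self.rs:
            e[1] &= ~bit
        self.bmask_reg &= ~bit
        del self.stack[entry]
        return restored
```

Because each stack entry has its own bit, entries can be freed in any order, which is why out-of-order branch resolution and nested mispredictions both fall out naturally.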

Wide Instruction Fetch Issues Average basic block size: integer code, 4-6 instructions; floating-point code, 6-10 instructions. Three major challenges: multiple branch prediction; multiple fetch groups; alignment and collapsing. Cannot be solved with just larger cache blocks. [Figure: branch prediction and the instruction cache feed fetch, then the instruction buffer, decode, and dispatch] Slide 17

Wide Fetch - Sequential Instructions [Figure: I$ and branch predictor; a fetch group spanning addresses 1020-1023] What is involved in fetching multiple instructions per cycle? In the same cache block? No problem. Favors a larger block size (independent of hit rate). Compilers align basic blocks to I$ lines (padding with nops): reduces I$ capacity, but + increases fetch bandwidth utilization (more important). In multiple blocks? Fetch blocks A and A+1 in parallel: banked I$ + combining network. May add latency (add pipeline stages to avoid slowing down the clock). Slide 18
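The banked-I$ case can be sketched as follows: with even-indexed blocks in bank 0 and odd-indexed blocks in bank 1, blocks A and A+1 always land in different banks and can be read in the same cycle. Block size and function names are invented for the example.

```python
# Illustrative sketch of a 2-banked I-cache serving a fetch group that
# spans two consecutive cache blocks, followed by a combining network.

BLOCK_WORDS = 4   # instructions per cache block (example value)

def fetch_group(icache_banks, start_addr, width):
    """Read `width` sequential instruction slots starting at start_addr."""
    block = start_addr // BLOCK_WORDS
    offset = start_addr % BLOCK_WORDS
    # Blocks A and A+1 live in different banks, so both reads proceed
    # "in parallel" (modeled here as two lookups)
    line_a = icache_banks[block % 2][block // 2]
    line_b = icache_banks[(block + 1) % 2][(block + 1) // 2]
    combined = line_a + line_b            # combining network
    return combined[offset:offset + width]
```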

Wide Fetch - Non-sequential Two related questions: How many branches are predicted per cycle? Can we fetch from multiple taken branches per cycle? Simplest, most common organization: 1 and No. One prediction; discard post-branch insns if the prediction is Taken. Lowers effective fetch width and IPC. Average number of instructions per taken branch? Assume 20% branches, 50% taken: ~10 instructions. Consider a 10-instruction loop body on an 8-issue processor: without smarter fetch, ILP is limited to 5 (not 8). The compiler can help: unroll loops to reduce taken-branch frequency. Slide 19
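The slide's two numbers can be checked directly: the average run length between taken branches, and the fetch-limited IPC of the loop example (fetch stops at the taken loop-back branch, so each 10-instruction iteration costs two fetch cycles on an 8-wide front end).

```python
# Working out the arithmetic from the slide.

branch_freq, taken_frac = 0.20, 0.50
insns_per_taken_branch = 1 / (branch_freq * taken_frac)
print(insns_per_taken_branch)              # 10.0

loop_body, fetch_width = 10, 8
# Fetch stops at the taken branch: 8 insns in cycle 1, the last 2 in cycle 2
cycles_per_iter = -(-loop_body // fetch_width)   # ceiling division -> 2
print(loop_body / cycles_per_iter)         # 5.0 sustainable IPC, not 8
```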

Multiple Branch Predictions Issues with multiple branch predictions: latency resulting from sequential predictions; later predictions are based on stale/speculative history. Don't forget: 0.95 x 0.95 x 0.95 ≈ 0.86, so per-branch accuracy compounds quickly. [Figure: the fetch address indexes three chained BTBs, producing blocks 1-3] Slide 20
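The compounding effect is worth seeing numerically: a fetch group is only entirely correct if every one of its predictions is correct, so group-level accuracy falls geometrically with the number of predicted branches.

```python
# Compounding of per-branch accuracy across multiple predictions per fetch.

per_branch = 0.95
for n in range(1, 4):
    print(n, round(per_branch ** n, 3))
# with 3 predictions per group, 0.95**3 = 0.857...:
# roughly 1 in 7 triple-prediction groups contains at least one mispredict
```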

Examples of Multi-Branch Predictors [Figure: a BHSR (bits b_n ... b_0) indexes a PHT that yields three predictions p_0, p_1, p_2] How do you update this thing after a branch resolves? Slide 21

Examples of Multi-Branch Predictors [Figure: the BHSR (b_n ... b_0) indexes a 2^(n-2) x 4-entry PHT; shifted index slices (b_n:2, b_n-1:1, b_n-2:0) select the predictions p_0, p_1, p_2] Slide 22
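The organization in these figures can be approximated with a simple model: one BHSR lookup returns predictions for up to three consecutive branches, and all three predicted outcomes are shifted into the BHSR speculatively. The widened-entry PHT below is an illustrative stand-in, not a specific published design.

```python
# Sketch of a multi-branch predictor: each PHT entry holds three 2-bit
# counters, one per predicted branch slot, so a single BHSR-indexed
# lookup yields three predictions per cycle.

BHSR_BITS = 8

class MultiBranchPredictor:
    def __init__(self):
        self.pht = [[1, 1, 1] for _ in range(1 << BHSR_BITS)]
        self.bhsr = 0

    def predict3(self):
        entry = self.pht[self.bhsr]
        preds = [c >= 2 for c in entry]
        # speculatively shift all three predicted outcomes into the BHSR;
        # this is the stale/speculative history the previous slide warns about
        for t in preds:
            self.bhsr = ((self.bhsr << 1) | t) & ((1 << BHSR_BITS) - 1)
        return preds
```

Note that the second and third predictions are made from history that already includes the first (still-unresolved) prediction, which is exactly the update question slide 21 poses.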

Multiple Predicted Taken Branches Issues with multiple taken branches: long latency with multiple sequential I-cache accesses; or a multi-ported I-cache with slower access latency; or a multi-banked I-cache to approximate multiple ports. [Figure: three fetch addresses (blocks 1-3) access a multi-ported I-cache, producing instructions for all three blocks] Slide 23

Instruction Alignment and Collapsing Issues with alignment and collapsing: misalignment between the fetch group and the cache line; packing variable-sized blocks into the fetch buffer. [Figure: three I-cache ports feed the fetch buffer; how do you know where each block lands?] Slide 24

Mapping CFG to Linear Instruction Sequence [Figure: a control-flow graph A -> {B, C} -> D laid out linearly; the same blocks appear in different orders depending on the path taken] Slide 25

Trace Cache Motivation [Figure: CFG with blocks A-G; the paths A-B-C-D and D-F-G account for 90% of dynamic execution but cross static I-cache line boundaries, while the 10% paths fall on the line boundaries; trace-cache line boundaries instead follow the dynamic paths] Storing traces (ABC, DFG) improves code density and fetch continuity. Slide 26

Trace Cache Trace cache (T$) [Peleg+Weiser, Rotenberg+]: overcomes the serialization of prediction and fetch by combining them. A new kind of I$ that stores dynamic, not static, insn sequences; blocks can contain statically non-contiguous insns. Tag: PC of the first insn + N/T of the embedded branches. Coupled with a trace predictor (TP): predicts the next trace, not the next branch. A trace is identified by its initial address & internal branch outcomes. Slide 27
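The tag scheme above can be sketched in a few lines: a trace is keyed by its start PC plus the taken/not-taken outcomes of its embedded branches, so the same static code can legitimately live in multiple traces (one per dynamic path). Method names here are illustrative.

```python
# Minimal sketch of trace-cache lookup: hit only when the start PC and
# the predicted path through the embedded branches both match.

class TraceCache:
    def __init__(self):
        self.traces = {}   # (start_pc, branch_outcomes) -> list of insns

    def fill(self, start_pc, branch_outcomes, insns):
        # fill unit constructs the trace as instructions retire
        self.traces[(start_pc, tuple(branch_outcomes))] = insns

    def fetch(self, start_pc, predicted_outcomes):
        return self.traces.get((start_pc, tuple(predicted_outcomes)))
```

This mirrors the "0:T" tag in the example on the next slide: the trace starting at PC 0 with its embedded branch predicted taken.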

Trace Cache Example
Traditional instruction cache:
Tag 0: addi, beq #4, ld, sub
Tag 4: st, call #32, ld, add
0: addi r1,4,r1   F  D
1: beq r1,#4      F  D
4: st r1,4(sp)    f* F
5: call #32       f* F
(the taken beq forces a second fetch cycle before st/call can proceed)
Trace cache:
Tag 0:T: addi, beq #4, st, call #32
0: addi r1,4,r1   F  D
1: beq r1,#4      F  D
4: st r1,4(sp)    F  D
5: call #32       F  D
(all four insns fetch together from one trace line) Slide 28

A Typical Trace Cache Organization [Figure: a branch-history hash feeds the next-trace predictor; the predicted PC indexes the trace cache (with the I-cache as fallback) into the fetch buffer and execution core; at completion, the fill unit constructs new traces] Slide 29

Aside: Multiple-issue CISC How do we apply superscalar techniques to CISC (e.g., x86)? Break macro-ops into micro-ops (also called uops or RISC-ops). A typical CISCy instruction add [r1], [r2] -> [r3] becomes:
Load [r1] -> t1 (t1 is a temp. register, not visible to software)
Load [r2] -> t2
Add t1, t2 -> t3
Store t3 -> [r3]
However, conversion is expensive (latency, area, power). Solution: cache the converted instructions in a trace cache. Used by Pentium 4; the internal pipeline manipulates only these RISC-like instructions. Slide 30
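The cracking step can be written out as a toy function. The instruction encoding and temporary-register naming below are invented for the example; real decoders emit machine-specific uop formats.

```python
# Toy illustration of macro-op cracking: expand a memory-to-memory add
# into load/add/store uops through software-invisible temporaries.

def crack(insn):
    """Crack ('add', [srcA, srcB], dst) into RISC-like uops."""
    op, srcs, dst = insn
    assert op == "add"
    uops, temps = [], []
    for i, src in enumerate(srcs):
        t = f"t{i + 1}"                 # architecturally invisible temp
        uops.append(("load", src, t))
        temps.append(t)
    uops.append(("add", temps[0], temps[1], "t3"))
    uops.append(("store", "t3", dst))
    return uops
```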

Intel P4 Trace Cache A 12K-uop trace cache replaces the L1 I-cache: 6 uops per trace line, which can include branches; the trace cache returns 3 uops per cycle. The IA-32 decoder can be simpler and slower, since it only needs to decode one IA-32 instruction per cycle. [Figure: front-end BTB (4K entries), ITLB & prefetcher, and L2 interface feed the IA-32 decoder, which fills the 12K-uop trace cache; the trace cache has its own 512-entry BTB] Slide 31

Wide-Fetch I-cache vs. T-cache
                Enhanced Instruction Cache              Proposed Trace Cache
Fetch:          1. Multiple-branch prediction           1. Next-trace prediction
                2. Instruction cache fetch              2. Trace cache fetch
                3. Instruction alignment & collapsing
Completion:     1. Multiple-branch predictor update     1. Trace construction and fill
Slide 32

Trace Cache Trade-offs Trace cache: Pros - moves complexity to the back end. Cons - inefficient instruction storage (instruction-storage redundancy). Enhanced instruction cache: Pros - efficient instruction storage. Cons - fetch-time complexity. Slide 33

As Machines Get Wider (and Deeper) [Figure: fetch pipeline before and after] 1. Eliminate stages. 2. Relocate work to the back end. Slide 34