Lecture 8: Instruction Fetch, ILP Limits. Today: advanced branch prediction, limits of ILP (Sections 3.4-3.5, 3.8-3.14)


1-Bit Prediction
- For each branch, keep track of what happened last time and use that outcome as the prediction
- What are the prediction accuracies for branches 1 and 2 below? (a simulation sketch follows)
    while (1) {
      for (i=0;i<10;i++) { branch-1 }
      for (j=0;j<20;j++) { branch-2 }
    }
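Below is a minimal simulation sketch (not from the slide) of 1-bit prediction on these two loops. It assumes each loop-closing branch is taken on every iteration of a pass except the last, and that the outer while (1) keeps repeating the pattern; the function name and pass count are made up.

# 1-bit predictor: remember only the last outcome of each branch.

def one_bit_accuracy(pattern, passes=100):
    """pattern: outcomes of one pass of the loop (True = taken)."""
    last = True              # single bit of state for this branch
    correct = total = 0
    for _ in range(passes):
        for outcome in pattern:
            correct += (last == outcome)   # predict the previous outcome
            total += 1
            last = outcome                 # update the bit
    return correct / total

branch1 = [True] * 9 + [False]     # for (i=0;i<10;i++): taken 9x, not taken once
branch2 = [True] * 19 + [False]    # for (j=0;j<20;j++): taken 19x, not taken once
print(one_bit_accuracy(branch1))   # roughly 0.8
print(one_bit_accuracy(branch2))   # roughly 0.9

With 1-bit state, each pass of a loop costs two mispredictions: the final (not-taken) iteration and the first iteration of the next pass, which is why branch-1 lands near 80% and branch-2 near 90%.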

2-Bit Prediction
- For each branch, maintain a 2-bit saturating counter:
    if the branch is taken: counter = min(3, counter+1)
    if the branch is not taken: counter = max(0, counter-1)
  If (counter >= 2), predict taken, else predict not taken
- Advantage: a few atypical branches will not influence the prediction (a better measure of the common case)
- Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
- Can be easily extended to N bits (in most processors, N=2)
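As a concrete illustration of the rule above, here is a small sketch of a bimodal predictor: a table of 2-bit saturating counters indexed by low-order branch PC bits. The table size, the initial counter value, and the pc >> 2 shift (to drop the byte offset) are assumptions for the example, not values from the slide.

# Bimodal predictor: one 2-bit saturating counter per table entry.

class BimodalPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [2] * entries      # start weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries    # drop the byte offset, then index

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2    # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)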

Correlating Predictors
- Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters); captures the recent common case for each branch
- Can we take advantage of additional information?
    If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
    If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?
- Hence, build correlating predictors

Local/Global Predictors
- Instead of maintaining a counter for each branch to capture the common case, maintain a counter for each branch and surrounding pattern
- If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
- If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor

Global Predictor
- A single register that keeps track of recent history for all branches
- (Figure) The 8-bit global history register (e.g., 00110101) and 6 bits of the branch PC together index into a table of 16K entries of 2-bit saturating counters
- Also referred to as a two-level predictor
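A sketch of this organization, reusing the 2-bit counter update from the previous slide. The 8-bit history, 6 PC bits, and 16K-entry counter table follow the figure; the class layout and initial values are assumptions, and the figure's concatenation of history and PC bits is used (a gshare-style predictor would XOR them instead).

# Two-level global predictor: one shared history register, 2^(8+6) = 16K counters.

class GlobalPredictor:
    def __init__(self, hist_bits=8, pc_bits=6):
        self.hist_bits, self.pc_bits = hist_bits, pc_bits
        self.history = 0                                   # global history register
        self.counters = [2] * (1 << (hist_bits + pc_bits))

    def _index(self, pc):
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (self.history << self.pc_bits) | pc_part   # concatenate history and PC bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into the history shared by all branches
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.hist_bits) - 1)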

Local Predictor
- (Figure) Use 6 bits of the branch PC to index into a local history table: 64 entries of 14-bit histories (e.g., 10110111011001), each for a single branch
- The 14-bit history indexes into the next level: a table of 16K entries of 2-bit saturating counters
- Also a two-level predictor, but one that only uses local histories at the first level
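A matching sketch of the local predictor: 6 PC bits select one of 64 per-branch 14-bit histories, and that history indexes a table of 16K 2-bit counters. Aliasing between branches that map to the same history slot is not handled specially here.

# Two-level local predictor: per-branch histories at level 1, shared counters at level 2.

class LocalPredictor:
    def __init__(self, pc_bits=6, hist_bits=14):
        self.pc_bits, self.hist_bits = pc_bits, hist_bits
        self.histories = [0] * (1 << pc_bits)        # level 1: 64 local histories
        self.counters = [2] * (1 << hist_bits)       # level 2: 16K counters

    def _hist_slot(self, pc):
        return (pc >> 2) & ((1 << self.pc_bits) - 1)

    def predict(self, pc):
        return self.counters[self.histories[self._hist_slot(pc)]] >= 2

    def update(self, pc, taken):
        slot = self._hist_slot(pc)
        i = self.histories[slot]
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # shift the outcome into this branch's own history
        self.histories[slot] = ((self.histories[slot] << 1) | int(taken)) & ((1 << self.hist_bits) - 1)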

Tournament Predictors
- A local predictor might work well for some branches or programs, while a global predictor might work well for others
- Provide one of each and maintain another predictor (itself a table of 2-bit saturating counters) to identify which predictor is best for each branch; a mux picks the selected component's prediction
- (Figure) Alpha 21264: local predictor with 1K entries in level-1 and 1K entries in level-2; global predictor with 4K entries and a 12-bit global history; tournament predictor with 4K entries
- Total capacity: ?
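A sketch of the tournament organization, reusing the GlobalPredictor and LocalPredictor classes above plus a chooser table of 2-bit counters. For simplicity the chooser is indexed by the branch PC here; the 21264 indexes its choice table with the global history. Table sizes and initial values are assumptions.

# Tournament predictor: a chooser learns which component to trust per entry.

class TournamentPredictor:
    def __init__(self, entries=4096):
        self.local = LocalPredictor()
        self.glob = GlobalPredictor()
        self.entries = entries
        self.chooser = [2] * entries       # >= 2 means "prefer the global prediction"

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        l, g = self.local.predict(pc), self.glob.predict(pc)
        return g if self.chooser[self._index(pc)] >= 2 else l

    def update(self, pc, taken):
        l, g = self.local.predict(pc), self.glob.predict(pc)
        i = self._index(pc)
        # move the chooser toward whichever component was correct
        if g == taken and l != taken:
            self.chooser[i] = min(3, self.chooser[i] + 1)
        elif l == taken and g != taken:
            self.chooser[i] = max(0, self.chooser[i] - 1)
        self.local.update(pc, taken)
        self.glob.update(pc, taken)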

Predictor Comparison
- Note that predictors of equal capacity must be compared
- Sizes of each level have to be selected to optimize prediction accuracy
- Influencing factors: degree of interference between branches, whether the program is likely to benefit from local/global history

Branch Target Prediction
- In addition to predicting the branch direction, we must also predict the branch target address
- Branch PC indexes into a predictor table; indirect branches might be problematic
- Most common indirect branch: return from a procedure, which can be easily handled with a stack of return addresses (see the sketch below)
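A sketch of such a return address stack, assuming a fixed depth with wrap-around overwrite on overflow (a common hardware choice); the depth and interface names are made up. On a call, the address of the instruction after the call is pushed; on a return, the popped address is used as the predicted target, falling back to the BTB when the stack is empty.

# Return address stack (RAS) for predicting procedure-return targets.

class ReturnAddressStack:
    def __init__(self, depth=16):
        self.stack = [0] * depth
        self.top = 0                       # count of pushes minus pops
        self.depth = depth

    def push(self, return_pc):             # on a predicted call
        self.stack[self.top % self.depth] = return_pc
        self.top += 1                      # old entries are silently overwritten on overflow

    def pop(self):                         # on a predicted return
        if self.top == 0:
            return None                    # empty: fall back to the BTB prediction
        self.top -= 1
        return self.stack[self.top % self.depth]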

Multiple Instruction Issue
- The out-of-order processor implementation can be easily extended to have multiple instructions in each pipeline stage
- Increased complexity (lower clock speed!):
    more reads and writes per cycle to register map table
    more read and write ports in issue queue
    more tags being broadcast to issue queue every cycle
    higher complexity for bypassing/forwarding among FUs
    more register read and write ports
    more ports in the LSQ
    more ports in the data cache
    more ports in the ROB

ILP Limits
- The perfect processor:
    Infinite registers (no WAW or WAR hazards)
    Perfect branch direction and target prediction
    Perfect memory disambiguation
    Perfect instruction and data caches
    Single-cycle latencies for all ALUs
    Infinite ROB size (window of in-flight instructions)
    No limit on number of instructions in each pipeline stage
- The last instruction may be scheduled in the first cycle
- The only constraint is a true dependence (register or memory RAW hazards) (with value prediction, how would the perfect processor behave?)
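Because the only remaining constraint is a true dependence, the perfect processor's ILP is just the dataflow limit of the program. The sketch below estimates that limit from a register-level trace by scheduling every instruction one cycle after its last producer; the trace format and register names are made up, and memory dependences are ignored.

# Dataflow-limit ILP: every instruction issues as soon as its RAW inputs are ready.

def ideal_ilp(trace):
    ready = {}                       # register -> cycle at which its value is produced
    depth = 0
    for dest, sources in trace:
        start = max([ready.get(r, 0) for r in sources], default=0)
        finish = start + 1           # single-cycle latency for everything
        ready[dest] = finish
        depth = max(depth, finish)
    return len(trace) / depth        # instructions per cycle at the dataflow limit

trace = [("r1", []), ("r2", ["r1"]), ("r3", ["r1"]), ("r4", ["r2", "r3"]),
         ("r5", []), ("r6", ["r5"])]
print(ideal_ilp(trace))              # 6 instructions over a 3-cycle critical path -> 2.0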

Infinite Window Size and Issue Rate

Effect of Window Size
- Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.
- We will use a window size of 2K instrs and a max issue rate of 64 for subsequent experiments

Imperfect Branch Prediction
- Note: no branch mispredict penalty; a branch mispredict restricts window size
- Assume a large tournament predictor for subsequent experiments

Effect of Name Dependences
- More registers mean fewer WAR and WAW constraints (usually register file size goes hand in hand with in-flight window size)
- 256 int and fp registers for subsequent experiments

Memory Dependences

Limits of ILP Summary
- Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size
- We have not yet examined the effect of branch mispredict penalty and imperfect caching
- All of the studied factors have relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation
- Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution

Pentium III (P6 Microarchitecture) Case Study
- 14-stage pipeline: 8 for fetch/decode/dispatch, 3+ for o-o-o, 3 for commit; branch mispredict penalty of 10-15 cycles
- Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations
- Each x86 instruction gets converted into RISC-like micro-ops: on average, one CISC instr becomes 1.37 micro-ops
- Three instructions in each pipeline stage; 3 instructions can simultaneously leave the pipeline, so ideal CPµI = 0.33 and ideal CPI = 0.33 × 1.37 ≈ 0.45

Branch Prediction
- 512-entry global two-level branch predictor and 512-entry BTB: 20% combined mispredict rate
- For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)
- Mispredict penalty is 10-15 cycles

Where is Time Lost?
- Branch mispredict stalls
- Cache miss stalls (dominated by L1D misses)
- Instruction fetch stalls (these happen often because subsequent stages are stalled, and occasionally because of an I-cache miss)

CPI Performance
- Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but then recover with multi-instruction commits (31% of all cycles); average CPI = 1.15 (Int) and 2.0 (FP)
- Because different stalls overlap, CPI is not the sum of the individual stalls
- IPC is also an attractive metric
