Lecture 7: Static ILP, Branch prediction. Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections 2.2-2.6)

Lecture 7: Static ILP, Branch Prediction
Topics: static ILP wrap-up; bimodal, global, and local branch prediction (Sections 2.2-2.6)

Predication
- A branch within a loop can be problematic to schedule
- Control dependences are a problem because of the need to re-fetch on a mispredict
- For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

Predicated or Conditional Instructions
- The instruction has an additional operand that determines whether the instruction completes or is converted into a no-op
- Example: lwc R1, 0(R2), R3 (load-word-conditional) loads the word at address 0(R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
- Replaces a control dependence with a data dependence (the branches disappear); may need register copies for the condition or for values used by both directions

  if (R1 == 0)         R7 = !R1
    R2 = R2 + R4       R8 = R2
  else                 R2 = R2 + R4   (predicated on R7)
    R6 = R3 + R5       R6 = R3 + R5   (predicated on R1)
    R4 = R2 + R3       R4 = R8 + R3   (predicated on R1)
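The transformation above can be sketched behaviorally in Python: each predicated instruction computes unconditionally, but its result is committed only when its predicate holds. This is a sketch of the idea, not hardware semantics; the function names are made up for illustration, and the register names follow the slide:

```python
def if_else(R1, R2, R3, R4, R5, R6):
    # original control-dependent code
    if R1 == 0:
        R2 = R2 + R4
    else:
        R6 = R3 + R5
        R4 = R2 + R3
    return R2, R4, R6

def predicated(R1, R2, R3, R4, R5, R6):
    # predicated version: the branch disappears, every instruction issues
    R7 = (R1 == 0)   # R7 = !R1, the predicate for the then-side
    R8 = R2          # register copy: the old R2 is needed by both directions
    R2 = (R8 + R4) if R7 else R2        # predicated on R7
    R6 = (R3 + R5) if not R7 else R6    # predicated on R1
    R4 = (R8 + R3) if not R7 else R4    # predicated on R1, uses the copy R8
    return R2, R4, R6
```

For any inputs, both versions leave the registers in the same architectural state.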

Complications
- Each instruction has one more input operand → more register ports/bypassing
- If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
- Some implementations let the instruction proceed without the branch condition and squash/complete it later in the pipeline → wasted work
- Increases register pressure and activity on the functional units
- Does not help if the branch condition takes a while to evaluate

Support for Speculation
- In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences
- However, we need hardware support (i) to ensure that an exception is raised at the correct point and (ii) to ensure that we do not violate memory dependences
- (figure: a load speculatively hoisted above a store and a branch)

Detecting Exceptions
- Some exceptions require that the program be terminated (e.g., a memory protection violation), while others require execution to resume (e.g., page faults)
- For a speculative instruction, servicing a resumable exception only implies a potential performance loss
- For a terminate-class exception, you want to defer servicing until you are sure the instruction is not speculative
- Note that a speculative instruction needs a special opcode to indicate that it is speculative

Program-Terminate Exceptions
- When a speculative instruction experiences such an exception, instead of servicing it, it writes a special NotAThing (NAT) value into the destination register
- If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (this imprecision can be undesirable: the error is caused by an array access, but the segfault surfaces two procedures later)
- Alternatively, an instruction (the sentinel) placed at the speculative instruction's original location checks the register value and initiates recovery
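The NAT mechanism can be sketched as a poison value. This is a toy model of the idea above, not IA-64 semantics: `NAT`, `spec_load`, and `nonspec_use` are hypothetical names, and a dictionary miss stands in for a protection violation:

```python
NAT = object()   # "NotAThing": poison value written instead of trapping

def spec_load(mem, addr):
    # speculative load: on a terminate-class fault, write NAT into the
    # destination register instead of terminating the program
    try:
        return mem[addr]
    except KeyError:                  # stands in for a protection violation
        return NAT

def nonspec_use(val):
    # a non-speculative consumer of a NAT flags the deferred exception
    if val is NAT:
        raise MemoryError("deferred exception from a speculative access")
    return val + 1

ok = nonspec_use(spec_load({4: 10}, 4))   # speculation was safe: ok == 11
poisoned = spec_load({4: 10}, 99)         # faulting speculative load -> NAT
```

If the speculated path turns out not to execute, the NAT is simply never consumed and no exception is raised.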

Memory Dependence Detection
- If a load is moved above a preceding store, we must ensure that the store writes to a non-conflicting address; otherwise, the load has to re-execute
- When the speculative load issues, it records its address in a table (the Advanced Load Address Table, or ALAT, in the IA-64)
- If a store finds its address in the ALAT, it marks that a violation occurred for that address
- A special instruction (the sentinel) at the load's original location checks whether the address had a violation and re-executes the load if necessary
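The ALAT protocol above can be modeled with a set of addresses. A minimal sketch of the mechanism, not the IA-64 encoding; class and method names are made up:

```python
class ALAT:
    # Advanced Load Address Table: records addresses of loads hoisted above stores
    def __init__(self):
        self.entries = set()

    def advanced_load(self, mem, addr):
        self.entries.add(addr)            # speculative load records its address
        return mem.get(addr, 0)

    def store(self, mem, addr, val):
        mem[addr] = val
        self.entries.discard(addr)        # conflicting entry -> violation recorded

    def check(self, mem, addr, value):
        # the sentinel at the load's original location: if the entry survived,
        # the speculation was safe; otherwise, re-execute the load
        if addr in self.entries:
            self.entries.discard(addr)
            return value
        return mem.get(addr, 0)

mem = {8: 1, 12: 5}
alat = ALAT()
v1 = alat.advanced_load(mem, 8)    # load hoisted above the store below
v2 = alat.advanced_load(mem, 12)   # hoisted load to an unrelated address
alat.store(mem, 8, 42)             # store to address 8 -> violation for v1
v1 = alat.check(mem, 8, v1)        # re-executes the load: sees 42
v2 = alat.check(mem, 12, v2)       # speculation was safe: keeps 5
```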

Dynamic vs. Static ILP
Static ILP:
+ The compiler finds the parallelism → no extra hardware → higher clock speeds and lower power
+ The compiler knows what comes next → better global schedule
- The compiler cannot react to dynamic events (e.g., cache misses)
- Instructions cannot be re-ordered unless hardware and extra instructions are provided to detect violations (eats into the low-complexity/power argument)
- Static branch prediction is poor → even statically scheduled processors use hardware branch predictors
- Building an optimizing compiler is easier said than done
A comparison of the Alpha, the Pentium 4, and the Itanium (the statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power

Control Hazards
- In the 5-stage in-order processor: assume always taken or always not-taken; if the branch goes the other way, squash the mis-fetched instructions (momentarily, forget about branch delay slots)
- Modern in-order and out-of-order processors use dynamic branch prediction: instead of a default not-taken assumption, either predict not-taken, or predict taken-to-X, or predict taken-to-Y
- Branch predictor: a cache of recent branch outcomes

Pipeline without Branch Predictor
(figure: the PC feeds IF; the next PC defaults to PC + 4; the branch resolves after register read and compare, which produce the branch target)
- In the 5-stage pipeline, a branch completes in two cycles
- If the branch went the wrong way, one incorrect instruction is fetched
- One stall cycle per incorrect branch

Pipeline with Branch Predictor
(figure: the PC feeds IF and a branch predictor, which supplies the next PC; the branch still resolves after register read and compare)
- In the 5-stage pipeline, a branch completes in two cycles
- If the branch was predicted the wrong way, one incorrect instruction is fetched
- One stall cycle per mispredicted branch

Branch Mispredict Penalty
- Assume: no data or structural hazards, only control hazards; every 5th instruction is a branch; branch predictor accuracy is 90%
- Slowdown = 1 / (1 + stalls per instruction)
- Stalls per instruction = %branches × %mispredicts × penalty = 20% × 10% × 1 = 0.02
- Slowdown = 1/1.02; if penalty = 20, slowdown = 1/1.4
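The arithmetic above is easy to check in a few lines (relative performance = ideal CPI / actual CPI, with an ideal CPI of 1):

```python
def relative_perf(branch_frac, mispredict_rate, penalty):
    # stalls per instruction = %branches x %mispredicts x penalty cycles
    stalls = branch_frac * mispredict_rate * penalty
    return 1.0 / (1.0 + stalls)

slow1 = relative_perf(0.20, 0.10, 1)     # the slide's 1-cycle penalty -> 1/1.02
slow20 = relative_perf(0.20, 0.10, 20)   # a 20-cycle penalty -> 1/1.4
```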

1-Bit Bimodal Prediction
- For each branch, keep track of what happened last time and use that outcome as the prediction
- What are the prediction accuracies for branch-1 and branch-2 below?

  while (1) {
    for (i=0;i<10;i++) { }   // branch-1
    for (j=0;j<20;j++) { }   // branch-2
  }
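One way to answer the question for branch-1 is to simulate it. Assuming the inner loop compiles to a backward branch that is taken nine times and falls through once per pass, a 1-bit predictor mispredicts twice per pass in steady state (the loop exit, and the first iteration of the next pass):

```python
def one_bit_accuracy(outcomes):
    # 1-bit bimodal: predict whatever the branch did last time
    state, correct = True, 0          # assume the initial prediction is "taken"
    for taken in outcomes:
        correct += (state == taken)
        state = taken                 # remember only the last outcome
    return correct / len(outcomes)

# branch-1: taken 9 times, not-taken once, per pass of the outer loop
stream = ([True] * 9 + [False]) * 100
acc = one_bit_accuracy(stream)        # 0.801: two mispredicts per pass
```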

2-Bit Bimodal Prediction
- For each branch, maintain a 2-bit saturating counter:
  if the branch is taken: counter = min(3, counter+1)
  if the branch is not taken: counter = max(0, counter-1)
- If (counter >= 2), predict taken; else predict not-taken
- Advantage: a few atypical branches will not influence the prediction (a better measure of the "common case")
- Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
- Can easily be extended to N bits (in most processors, N = 2)
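Re-running the branch-1 simulation with the slide's 2-bit counter shows the advantage: in steady state only the loop-exit branch mispredicts, so accuracy rises to 90% (sketch; the counter is initialized to weakly-taken):

```python
def two_bit_accuracy(outcomes, counter=2):
    # 2-bit saturating counter, exactly as defined on the slide:
    # taken -> min(3, c+1); not taken -> max(0, c-1); predict taken if c >= 2
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / len(outcomes)

# branch-1 again: taken 9 times, not-taken once, per pass
stream = ([True] * 9 + [False]) * 100
acc = two_bit_accuracy(stream)        # 0.9: one atypical outcome per pass
                                      # no longer flips the prediction
```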

Correlating Predictors
- Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch-PC bits to index into one of 1024 counters) → captures the recent common case for each branch
- Can we take advantage of additional information?
  - If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
  - If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?
- Hence, build correlating predictors

Global Predictor
- A single register keeps track of the recent history of all branches, e.g., 00110101
- (figure: 8 bits of the branch PC are concatenated with 6 bits of global history to index a table of 16K 2-bit saturating counters)
- Also referred to as a two-level predictor
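A minimal model of this predictor (a sketch, with the slide's index widths: PC bits concatenated with the global history register select one of the 16K counters):

```python
class GlobalPredictor:
    def __init__(self, pc_bits=8, hist_bits=6):
        self.hist_bits = hist_bits
        self.history = 0                          # global history register
        self.pc_mask = (1 << pc_bits) - 1
        self.table = [0] * (1 << (pc_bits + hist_bits))  # 16K 2-bit counters

    def index(self, pc):
        # concatenate PC bits with the global history
        return ((pc & self.pc_mask) << self.hist_bits) | self.history

    def predict(self, pc):
        return self.table[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.hist_bits) - 1)

# an alternating branch (T, N, T, N, ...) becomes fully predictable:
# its two history values index two different counters, each learning one outcome
p = GlobalPredictor()
late_wrong = 0
for t in range(1000):
    taken = (t % 2 == 0)
    if t >= 900 and p.predict(0x40) != taken:
        late_wrong += 1            # mispredicts after warm-up
    p.update(0x40, taken)
```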

Local Predictor
- (figure: 6 bits of the branch PC index a local history table of 64 entries, each a 14-bit history of a single branch, e.g., 10110111011001; the 14-bit history then indexes the next level, a table of 16K 2-bit saturating counters)
- Also a two-level predictor, but one that uses only local histories at the first level
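A scaled-down sketch of the two-level local predictor (the history table is indexed by PC bits, and the per-branch history indexes the counter table; widths are shrunk from the slide's 14 bits for brevity):

```python
class LocalPredictor:
    def __init__(self, pc_bits=4, hist_bits=6):
        self.hist_bits = hist_bits
        self.pc_mask = (1 << pc_bits) - 1
        self.histories = [0] * (1 << pc_bits)    # per-branch local histories
        self.counters = [0] * (1 << hist_bits)   # 2-bit saturating counters

    def predict(self, pc):
        return self.counters[self.histories[pc & self.pc_mask]] >= 2

    def update(self, pc, taken):
        h = self.histories[pc & self.pc_mask]
        self.counters[h] = min(3, self.counters[h] + 1) if taken else max(0, self.counters[h] - 1)
        self.histories[pc & self.pc_mask] = ((h << 1) | int(taken)) & ((1 << self.hist_bits) - 1)

# a branch with the repeating pattern T, T, N: once the local history
# distinguishes the three positions in the pattern, every outcome is predicted
p = LocalPredictor()
pattern = [True, True, False]
late_wrong = 0
for t in range(999):
    taken = pattern[t % 3]
    if t >= 900 and p.predict(0x7) != taken:
        late_wrong += 1            # mispredicts after warm-up
    p.update(0x7, taken)
```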

Local/Global Predictors
- Instead of maintaining one counter per branch to capture the common case, maintain a counter for each combination of branch and surrounding pattern
- If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
- If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor

Tournament Predictors
- A local predictor might work well for some branches or programs, while a global predictor might work well for others
- Provide one of each, and maintain another predictor to identify which component is best for each branch
- (figure: the local and global predictions feed a mux; the tournament predictor, a branch-PC-indexed table of 2-bit saturating counters, selects between them)
- Alpha 21264: local predictor with 1K entries at level 1 and 1K entries at level 2; global predictor with 4K entries and a 12-bit global history; tournament chooser with 4K entries
- Total capacity: ?
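The chooser mechanism can be sketched independently of the component predictors. A toy model: the components here are deliberately trivial ("always not-taken" vs. "always taken"), and the per-branch 2-bit chooser counter is updated only when the components disagree:

```python
class Tournament:
    # chooser: a per-branch 2-bit counter; >= 2 means "use predictor p1"
    def __init__(self, p0, p1, entries=1024):
        self.p0, self.p1 = p0, p1
        self.choice = [2] * entries   # start weakly favoring p1

    def predict(self, pc):
        if self.choice[pc % len(self.choice)] >= 2:
            return self.p1(pc)
        return self.p0(pc)

    def update(self, pc, taken):
        i = pc % len(self.choice)
        ok0, ok1 = self.p0(pc) == taken, self.p1(pc) == taken
        if ok0 != ok1:   # move toward whichever component was right
            self.choice[i] = min(3, self.choice[i] + 1) if ok1 else max(0, self.choice[i] - 1)

# toy components: "always not-taken" vs. "always taken"
t = Tournament(lambda pc: False, lambda pc: True)
wrong = 0
for taken in ([True] * 9 + [False]) * 50:   # a mostly-taken branch
    wrong += (t.predict(0x8) != taken)
    t.update(0x8, taken)
# the chooser locks onto the always-taken component: one mispredict per pass
```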

Branch Target Prediction
- In addition to predicting the branch direction, we must also predict the branch target address
- The branch PC indexes into a predictor table; indirect branches might be problematic
- The most common indirect branch, the return from a procedure, can be easily handled with a stack of return addresses
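The return-address stack mentioned above is simple enough to sketch directly (an idealized, unbounded stack; real predictors use a small circular buffer that can overflow and corrupt predictions):

```python
class ReturnAddressStack:
    # predicts the target of procedure returns, the most common indirect branch
    def __init__(self):
        self.stack = []

    def on_call(self, pc, instr_size=4):
        # push the address the matching return should come back to
        self.stack.append(pc + instr_size)

    def predict_return(self):
        # pop the predicted target for the return
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)             # outer call
ras.on_call(0x2000)             # nested call
inner = ras.predict_return()    # inner return predicted first: 0x2004
outer = ras.predict_return()    # then the outer return: 0x1004
```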
