Dynamic Control Hazard Avoidance


Dynamic Control Hazard Avoidance

Consider the effects of increasing the ILP:
- Control dependences rapidly become the limiting factor
  - they tend not to get optimized away by the compiler
  - more instructions/sec ==> more control instructions per sec
  - control stall penalties will go up as machines go faster, because stalls have a larger impact on low-CPI machines (e.g., multiple-issue)
- Branch prediction definitely helps -- if we get it right
  - use hardware to dynamically predict the branch outcome: the prediction changes as the actual branch behavior changes
- The key thing is to know the cost of a branch both when the prediction is correct and when it is incorrect

Chapter 4 page 44

Prediction Based on History

Branch Prediction Buffer (BPB) using 1 bit per branch instruction:
- records the last outcome of the branch
- uses the low-order address bits of the branch as the index into the buffer

Many problems:
- the usual cache alias problem: 2 branches with the same index bits will end up predicting each other
  - the usual cache strategies can be used to resolve this alias problem
- always mispredicts twice for every loop
  - one mispredict is unavoidable, since the exit is always a surprise
  - but the previous exit also causes a mispredict on the first iteration of the next loop execution
- Example: a loop executing 10 iterations has a prediction accuracy of only 80%

n-bit predictor (just a simple saturating n-bit counter):
- increment on taken, decrement on not taken
- if the counter is above half its range, predict taken
- statistically, 2 bits perform about as well as more bits
- Same example: prediction accuracy improves to 90% with 2-bit predictors. Why?

Chapter 4 page 45
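The loop example above can be checked with a minimal simulation sketch (the predictor starts in the strongly-taken state; that initial state is an assumption, though it only affects the very first prediction):

```python
def simulate(n_bits, outcomes):
    """Simulate a saturating n-bit counter predictor on a stream of branch
    outcomes (True = taken); return the prediction accuracy."""
    max_val = (1 << n_bits) - 1
    threshold = 1 << (n_bits - 1)        # predict taken if counter >= half range
    counter = max_val                    # assumption: start strongly taken
    correct = 0
    for taken in outcomes:
        predict_taken = counter >= threshold
        if predict_taken == taken:
            correct += 1
        counter = min(counter + 1, max_val) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

# A 10-iteration loop run 100 times: the loop branch is taken 9 times,
# then not taken once at the exit.
stream = ([True] * 9 + [False]) * 100
print(simulate(1, stream))   # 1-bit: two mispredicts per loop execution (~80%)
print(simulate(2, stream))   # 2-bit: only the exit mispredicts (90%)
```

The 2-bit counter answers the slide's "Why?": the exit only knocks the counter from strongly- to weakly-taken, so the first iteration of the next execution is still predicted taken.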

Note: Prediction Accuracy

Misprediction rates on SPEC89 with 4K 2-bit entries:

  nasa7     =  1%
  matrix300 =  0%   (it's all loops, so no surprise)
  tomcatv   =  1%
  doduc     =  5%
  spice     =  9%
  fpppp     =  9%
  gcc       = 12%   (the compiler does a lot of IFs, so also no surprise)
  espresso  =  5%
  eqntott   = 18%   (ouch!)
  li        = 10%

With an infinite buffer the values are the same, except: gcc goes to 11%, and nasa7 & tomcatv go to 0%. An infinite buffer is a very large buffer, hence little collision (aliasing) -- a useful technique to show the potential.

Chapter 4 page 46

Improve Prediction Strategy by Correlating Branches

Consider the worst-case eqntott code fragment -- if the first two branches fail, then the third will always be taken:

  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) { whatever }

- single-level predictors can never get this case right
- correlating (2-level) predictors can:
  - correlation = what happened on the last branch, i.e., T or NT
    - note that the last branch may not always be the same static branch
  - predictor = which way to go, i.e., taken (T) or not taken (NT)
  - 4 possibilities from combining (correlation) x (predictor):
    (last-taken, last-not-taken) x (predict-taken, predict-not-taken)

Chapter 4 page 47
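At the source level the correlation is easy to see: whenever the first two conditions both hold, aa and bb are both zeroed, so the third condition must be false. A minimal check:

```python
def third_branch_taken(aa, bb):
    """Source-level model of the eqntott fragment: returns whether the
    third condition (aa != bb) holds after the first two ifs run."""
    if aa == 2:
        aa = 0
    if bb == 2:
        bb = 0
    return aa != bb

print(third_branch_taken(2, 2))   # both earlier conditions held -> False
print(third_branch_taken(1, 2))   # only one held -> True
```

A single-level predictor indexed by the third branch's address alone cannot exploit this, because the outcome depends on the two *previous* branches, not on the third branch's own history.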

In General: the (m,n) Branch Prediction Buffer (BPB)

- use the last m branches = global branch history
- use n bits for the predictor (a saturating counter): if counter > threshold, predict taken; otherwise, not taken
- use p bits of the branch address as index bits to access the BPB

Total bits needed in the buffer:

  2^m x n x 2^p = total memory bits required

- 2^m banks of memory, selected by the global branch history (which is just a shift register) -- e.g., a column address
- use the p index bits to select the row
- the n predictor bits in the selected entry make the decision

Chapter 4 page 48
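The structure above can be sketched directly (a hedged model: the strongly-taken initial state and the raw low-order-bit indexing are simplifying assumptions):

```python
class CorrelatingPredictor:
    """(m, n) BPB sketch: m bits of global history select one of 2**m
    banks; p low-order PC bits index the bank; each entry is an n-bit
    saturating counter."""
    def __init__(self, m, n, p):
        self.m, self.n, self.p = m, n, p
        self.history = 0                            # m-bit global shift register
        self.table = [[(1 << n) - 1] * (1 << p)     # counters start strongly taken
                      for _ in range(1 << m)]

    def total_bits(self):
        return (2 ** self.m) * self.n * (2 ** self.p)

    def predict(self, pc):
        ctr = self.table[self.history][pc & ((1 << self.p) - 1)]
        return ctr >= (1 << (self.n - 1))           # above half the range -> taken

    def update(self, pc, taken):
        idx = pc & ((1 << self.p) - 1)
        ctr = self.table[self.history][idx]
        max_ctr = (1 << self.n) - 1
        self.table[self.history][idx] = (min(ctr + 1, max_ctr) if taken
                                         else max(ctr - 1, 0))
        # shift the outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

# A (2,2) predictor with 10 index bits: 2^2 * 2 * 2^10 = 8192 bits
print(CorrelatingPredictor(2, 2, 10).total_bits())
```

With m = 0 this degenerates to the plain n-bit predictor of the previous slide, which is what makes the 8K-bit comparison on the next slide an apples-to-apples one.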

How Well Does It Work?

Check it with 8K bits of state: a 4K-entry (0,2) BPB (plain 2-bit predictors) vs. a 1K-entry (2,2) correlating BPB:

  nasa7     =  1% vs.  1%
  matrix300 =  0% vs.  0%
  tomcatv   =  1% vs.  1%
  doduc     =  5% vs.  5%
  spice     =  9% vs.  5%
  fpppp     =  9% vs.  5%
  gcc       = 12% vs. 11%
  espresso  =  5% vs.  4%
  eqntott   = 18% vs.  6%   (big win in the worst case)
  li        = 10% vs.  5%

The clear question is whether the application has correlating branches.

Chapter 4 page 49

Branch Target Buffer (BTB)

To eliminate the branch penalty we need to know the target address by the end of IF:
- but the instruction isn't even decoded until ID
- this implies that we have to wait a cycle, and perhaps take a penalty of 1

Can we use the instruction address rather than wait for decode?
- if the prediction works, the penalty goes to 0!

The BTB idea:
- use a cache to store taken branches (no need to store untaken ones)
- the match tag is the PC of the fetched instruction
- the data field is the predicted branch-taken address
- can also add a predictor field if desired, to avoid the 2 misses on every loop execution
  - but this adds complexity, since we now have to track untaken branches as well

Chapter 4 page 50

Changes in DLX to Incorporate a BTB

1. Send the PC to memory and to the BTB.
2. Found in BTB?
   - YES: send out the predicted PC.
     - Branch actually taken? Prediction correct -- continue with no penalty.
     - Branch not taken? Mispredict -- kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB.
   - NO: normal execution.
     - If the branch turns out taken, enter the branch address and next PC into the BTB.

Chapter 4 page 51
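The decision flow above can be modeled in a few lines (a hedged sketch: the `pc + 4` fall-through, 4-byte instructions, and a flat 2-cycle restart penalty are illustrative assumptions):

```python
class BranchTargetBuffer:
    """BTB sketch: a map from branch PC (the tag) to the predicted
    taken-target (the data).  Only taken branches are entered; a miss
    means 'predict not taken'."""
    def __init__(self):
        self.entries = {}                       # tag (branch PC) -> target

    def fetch(self, pc):
        """During IF: return the predicted next PC."""
        return self.entries.get(pc, pc + 4)     # miss: predict fall-through

    def resolve(self, pc, taken, target):
        """After the branch executes: update the BTB, return penalty cycles."""
        predicted = self.fetch(pc)
        actual = target if taken else pc + 4
        if taken:
            self.entries[pc] = target           # enter/refresh the taken branch
        else:
            self.entries.pop(pc, None)          # mispredicted taken: delete entry
        return 0 if predicted == actual else 2  # kill + restart on mispredict

btb = BranchTargetBuffer()
print(btb.resolve(100, True, 200))   # first encounter: BTB miss, 2-cycle penalty
print(btb.resolve(100, True, 200))   # now predicted correctly: 0 penalty
```

Note how the model reproduces the flowchart: a hit with the correct outcome costs nothing; a hit with the wrong outcome deletes the entry and pays the restart cost.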

Penalties for the BTB Approach

Table 1:

  Instruction in buffer?   Prediction   Actual branch   Penalty cycles
  yes                      taken        taken           0
  yes                      taken        not taken       2
  no                       --           taken           2

Chapter 4 page 52
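The table turns directly into an expected-penalty formula; the numbers plugged in below (90% hit rate, 90% prediction accuracy, 60% of branches taken) are assumed illustrative values, not measurements from this text:

```python
def avg_btb_penalty(hit_rate, accuracy, taken_frac,
                    mispredict_cycles=2, miss_taken_cycles=2):
    """Expected penalty per branch from the table: a buffer hit that
    mispredicts costs 2 cycles; a buffer miss on a taken branch costs 2;
    everything else is free."""
    return (hit_rate * (1 - accuracy) * mispredict_cycles
            + (1 - hit_rate) * taken_frac * miss_taken_cycles)

# Assumed numbers: 90% hit rate, 90% accuracy, 60% taken
print(avg_btb_penalty(0.9, 0.9, 0.6))   # about 0.3 cycles per branch
```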

Further Improvements on the BTB

Store instructions rather than the target address:
- increases entry size but removes I-fetch time
- permits the BTB to run slower and therefore be larger
- permits branch folding:
  - a branch's job is to change the PC and get the real instruction
  - if you already have that instruction, the branch can be folded out of the way (discarded)
  - result: 0-cycle unconditional branches and 0-cycle correctly predicted branches

Predicting indirect jumps:
- the major source is procedure returns
- the obvious model is to use a stack of return addresses
- note this can be combined with the above to get jump folding

Approach for reducing misprediction penalties:
- fetch into the instruction buffer from both the taken and not-taken paths -- can reduce stalls if the prediction is wrong

Chapter 4 page 53
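The "obvious model" for procedure returns is a small hardware stack; a minimal sketch (the depth of 8 and discard-oldest overflow policy are assumptions):

```python
class ReturnAddressStack:
    """Sketch of a return-address predictor: push the return address on a
    call, pop it to predict the target of the matching return."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)                  # overflow: discard the oldest
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.call(0x104)                    # call at 0x100: push the return address
ras.call(0x204)                    # nested call
print(hex(ras.predict_return()))   # innermost return predicted first: 0x204
```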

Going Beyond the Ideal CPI = 1

Two approaches for multiple issue:

Superscalar:
- issues varying numbers of instructions per clock cycle
- constrained by hazards
- made possible by multiple functional pipelines
- scheduling: static (by the compiler) or dynamic (HW supports some form of scoreboarding)

VLIW (very long instruction word):
- a long instruction contains several real instructions
- hence it must be statically scheduled by the compiler
- the hardware makes no dynamic issuing decisions

Chapter 4 page 54

Consider a Superscalar 2-Issue DLX (very similar to the HP-7100)

Lots of issues even for a 2-issue machine. Which instructions? (They must be independent.)
- 1 integer: load, store, branch, or integer ALU instruction, and
- 1 float: floating-point FPU instruction

Need to keep decoding simple -- it now has to deal with 64 bits instead of just 32:
- could require that instructions be paired and aligned on a double-word boundary -- e.g., this keeps the pair within a cache line (the 7100 has a 4-word I-cache line size)
- also require the integer instruction to be first -- avoids a dynamic swap requirement and much more complicated hazard interlock control
- also require that the FP instruction issues only if the integer one does

Seems simple:
- each pipe has its own register set anyway
- independence of data type means little can go wrong
- but what about the longer latency of the FP EX pipe?

Chapter 4 page 55
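The three issue restrictions above can be sketched as a single check (the instruction-class sets are hypothetical labels, not real DLX opcodes):

```python
# Hedged sketch of the stated dual-issue rules: the pair must be
# double-word aligned, the integer instruction must come first, and the
# FP instruction issues only alongside a legal integer instruction.
INT_OPS = {"load", "store", "branch", "alu"}   # assumed integer classes
FP_OPS = {"fadd", "fmul", "fdiv"}              # assumed FP classes

def can_dual_issue(pc, first_op, second_op):
    aligned = pc % 8 == 0                      # pair on a double-word boundary
    return aligned and first_op in INT_OPS and second_op in FP_OPS

print(can_dual_issue(0x1000, "alu", "fmul"))   # True: legal pair
print(can_dual_issue(0x1000, "fmul", "alu"))   # False: integer must be first
print(can_dual_issue(0x1004, "alu", "fmul"))   # False: not double-word aligned
```

A real implementation would also need the independence check (no shared destination/source registers between the two), which is the subject of the next slides.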

Dealing with Long FP Latencies -- Two Options Again

Pipeline the FP units (FPUs):
- can still launch FP instructions every cycle
- long latency may cause out-of-order completion w.r.t. the INT pipe
- causes increased complexity to prevent hazards -- e.g., scoreboarding
- stalls will still occur, but with reasonable independence they can be minimized

Use multiple FPUs:
- the circuitry for MULT, DIV, SQRT, and ADD/SUB tends to be different anyway
- could replicate further to match typical instruction frequencies
- must do so carefully, since FPUs are expensive
- issue can be based on a normal structural hazard check

Both are common approaches.

Chapter 4 page 56

Problems So Far

Look at the opcodes to see if the pair is an appropriate issue pair.

Some integer ops are a problem -- e.g., FP register loads, since the other instruction may be dependent on them, hence a stall will result. Options:
- force FP loads, stores, and moves to issue by themselves -- safe but suboptimal, since the other instruction may be independent, OR
- add more ports to the FP register file (separate read and write ports) -- but we must still stall the 2nd instruction if it is dependent

Other issues -- hazard detection is similar to the normal pipeline model, but:
- the 1-cycle load delay now covers 3 instruction slots
- the 1-instruction branch delay now holds 3 instructions as well
- so instruction scheduling becomes more important

Chapter 4 page 57

Advantages of Superscalar over the VLIW Option

Old codes still run:
- like those tools you have that came as binaries
- HW detects whether the instruction pair is a legal dual-issue pair; if not, the two run sequentially

Little impact on code density:
- no need to fill all of the "can't issue here" slots with NOPs

Compiler issues are very similar:
- still need to do instruction scheduling anyway
- one new thing: try hard to produce dual-issue pairs
- the hardware is there, so the compiler doesn't have to be too conservative

Chapter 4 page 58

How Well Does It Work?

Text example (Fig. 4.27) -- a scalar/vector sum -- shows a 50% improvement over scheduled single issue.

HP-7100 experience:
- getting this simple 2-way design right is not that hard
- most applications show a 50%-70% speedup; no applications slow down
- however, code containing a lot of branches doesn't speed up much

What did slow down? Compiler execution:
- no floating point, so the best it could do is single issue anyway
- and the compilers got more complex trying to schedule for the dual-issue option

Still basically a win: 1/2 to 3/4 of the ideal speedup. Too bad this doesn't carry over to n-issue.

Chapter 4 page 59

Dynamic Scheduling

Scoreboarding is required anyway, so pick a dataflow basis = Tomasulo's algorithm:
- issue and let the reservation stations sort it out
- but we still cannot issue a dependent pair

Two options for fixing the dependent-pair problem:
- pipeline the IF/ID stage and run it twice as fast as the EX... stages
  - this isn't that hard, since IF and ID are pretty simple for RISCs
- decoupling: provide queues as destinations for loads, moves, and stores
  - sort of a virtual-register/renaming style approach
  - the scoreboard becomes more complex, but performance is likely to be enhanced

The compiler still plays a major role: do the best you can with static scheduling, and then do a little better with the dynamic backup.

Chapter 4 page 60

Limitations on Multiple Issue

How much ILP can be found in the application?
- the most fundamental of the problems
- requires deep unrolling -- hence the big focus on loops
- compiler complexity goes way up
- deep unrolling needs lots of registers, hence the need for renaming and lots of additional registers in the machine as rename targets

Increased HW cost:
- more ports on the register files
- cost of scoreboarding and forwarding paths
- the memory bandwidth requirement goes up
  - most designs have gone with separate I and D ports already
  - the newest approaches go for multiple D ports as well -- a big-time expense!
- branch prediction HW is an absolute must

Still, multiple issue seems to be the trend today.

Chapter 4 page 61

Compiler Support for ILP

The trick is to find and remove dependences:
- simple for single-variable accesses
- more complex for pointers, array references, etc.
- mostly it's about loops and trying to unroll them -- this helps if the recurrence (loop-carried) dependence distance is large
- dependence between two references to the same array element in a loop can be tested effectively (e.g., a GCD test on affine index functions a*i + b, where a and b are constants)

Things that make the analysis difficult:
- references via pointers rather than array indices
- indirect references (e.g., through another array -- a common representation for sparse arrays)
- false dependences: for some values a dependence may exist, but those values never actually occur -- run-time checks are required to find out

In general this is an NP-hard problem:
- specific cases can be solved precisely
- the current problem: lots of special cases that don't apply often

Chapter 4 page 62
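The GCD test mentioned above is short enough to state exactly: for a loop that writes A[a1*i + b1] and reads A[a2*i + b2], a dependence can exist only if gcd(a1, a2) divides (b2 - b1). A minimal sketch (conservative: True means "a dependence may exist"):

```python
from math import gcd

def gcd_test(a1, b1, a2, b2):
    """GCD dependence test for affine array indices a1*i + b1 (write)
    and a2*i + b2 (read).  Returns False only when no dependence is
    possible; True means a dependence may exist."""
    g = gcd(a1, a2)
    return (b2 - b1) % g == 0

# A[2*i] = ...; ... = A[2*i + 1]: gcd(2,2) = 2 does not divide 1
print(gcd_test(2, 0, 2, 1))   # False: the references can never collide
# A[2*i] = ...; ... = A[2*i + 2]: gcd(2,2) = 2 divides 2
print(gcd_test(2, 0, 2, 2))   # True: a dependence may exist
```

This illustrates the slide's point about precision: the test is exact in one direction only (a False answer proves independence; a True answer still needs further analysis).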

Software Pipelining (a.k.a. Symbolic Loop Unrolling)

The idea is to separate the dependences in the original loop body into:
- startup (prologue)
- fully pipelined section (kernel)
- cleanup (epilogue)

Register management can be tricky, but the idea is to turn the code into a single loop body. In practice, both unrolling and software pipelining will be necessary due to register limitations.

Chapter 4 page 63
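The startup/kernel/cleanup split can be sketched on a toy two-stage body (compute, then store). This is only an illustration of the transformation's shape, not of a real compiler pass; in assembly, each kernel line would be an independent instruction from a different original iteration:

```python
def plain_loop(a):
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * 2          # load a[i]; multiply; store out[i]
    return out

def software_pipelined(a):
    out = [0] * len(a)
    if not a:
        return out
    computed = a[0] * 2            # startup (prologue): iteration 0's compute
    for i in range(1, len(a)):     # kernel: one loop body, two iterations live
        out[i - 1] = computed      # store for iteration i-1
        computed = a[i] * 2        # compute for iteration i
    out[-1] = computed             # cleanup (epilogue): last store
    return out

assert software_pipelined([1, 2, 3, 4]) == plain_loop([1, 2, 3, 4])
```

The variable `computed` plays the role of the extra register the slide warns about: each stage overlapped in the kernel needs its own value carried between iterations, which is why register pressure limits how far this can go.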

Trace Scheduling

Looks for the critical path across conditional branches. Two separate processes:

Trace selection:
- predict branches to give long sequences of instructions
- each possible sequence is a separate trace
- which trace gets selected depends on which way the conditions really fall

Trace compaction:
- global instruction scheduling over the entire trace
- the trick is to effectively move instructions across predicted branches
- this causes speculation based on the prediction, so exceptions become an issue -- a high exception probability blocks motion, e.g., for memory references (which may cause a page fault)
- also, on a misprediction you have to clean things up, which is a penalty

Chapter 4 page 64

Some Things to Notice

SW pipelining, loop unrolling, and trace scheduling are not totally independent techniques:
- all try to avoid dependence-induced stalls
- each has a slightly different primary focus:
  - unrolling: reduce the loop overhead of index modification and the branch
  - SW pipelining: reduce dependence stalls within a single loop body
  - trace scheduling: reduce the impact of branches

Most advanced compilers attempt all 3:
- the result is a hybrid that blurs the differences
- lots of special-case analysis changes the hybrid mix

All tend to fail if branch prediction is unreliable.

Chapter 4 page 65