Appendix A.2 (pg. A-21 to A-26), Section 4.2, Section 3.4: Performance of Branch Prediction Schemes


1 Module: Branch Prediction Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech). Reading for this Module: Branch Prediction, Appendix A.2 (pg. A-21 to A-26), Section 4.2, Section 3.4; Branch Target Buffers, Section 3.5; Performance of Branch Prediction Schemes, Section 3.8; The Trace Cache, Section 4.4. Additional Ref: E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996. ECE 4100/6100 (2)

2 Control Dependencies (Figure: processor datapath with I-Fetch, Execution Core, and Retire stages, and the taxonomy of dependencies: structural, data, name (anti, output), and control.) Control dependencies determine the execution order of instructions; instructions may be control dependent on a branch, e.g., DADD R5, R6, R7 / BNE R4, R2, CONTINUE / DMUL R4, R2, R5 / DSUB R4, R9, R5. Goal: maximize I-Fetch bandwidth via branch prediction. How do we improve prediction accuracy and reduce branch penalties? ECE 4100/6100 (3) The Problem with Branches Branches introduce pipeline bubbles and therefore performance degradation. Two types of branches: unconditional and conditional. In addition, some branch instructions calculate the branch target by adding a (possibly negative) constant to the PC, while others necessitate reading the register file; most of the branches encountered in program execution belong to the former. ECE 4100/6100 (4)

3 Conditional Branches Example: DADD R5, R6, R7 / BNE R4, R1, L1 / ... / L1: ... (Figure: pipeline with IF, ID, instruction issue, and EX INT / EX FP / EX BR, MEM, WB stages.) For general pipelines, penalties occur because of the timing of: branch target address generation (PC-relative address generation can occur after instruction fetch) and branch condition resolution (in what cycle is the condition known?). ECE 4100/6100 (5) Handling Branches (Figure: five-stage pipeline, Instruction Fetch, Decode & Register Read, ALU, Memory Operation, Register Writeback, with latches IF/ID, ID/EX, EX/MEM, MEM/WB; for BNE R1, R2, Loop, the instruction is decoded as a branch in ID and the branch outcome becomes known later in the pipeline.) Simple solution: stall the pipeline. What determines the branch penalty? ECE 4100/6100 (6)

4 Branch Bubbles (Figure: pipeline timing diagram over Clk1-Clk9, in program execution order; the branch S1: br K1 and the sequential instructions S2, S3, S4 occupy the Instr Fetch, Decode & Read Reg, ALU, Memory Operation, and Reg Writeback stages, inserting bubbles before the target K1 can be fetched.) ECE 4100/6100 (7) Unconditional Branches Branch resolution should be moved to the earliest possible part of the pipeline, the DECODE stage, to reduce the number of bubbles inserted. However, some unconditional branches require register reads, e.g., procedure return. ECE 4100/6100 (8)

5 Unconditional Branch (Figure: pipeline timing diagram over Clk1-Clk9; with the branch S1: br K1 resolved in DECODE, the target K1 is fetched one cycle later, so only 1 bubble is incurred.) ECE 4100/6100 (9) Branch Delay Slots (Figure: compiler scheduling of the branch delay slot, shown with ADD.D / BNEZ R3, L1 code sequences and the alternative placements of an instruction into the slot.) Instructions are moved to fill branch delay slots; the compiler must account for side effects when the branch is mispredicted. ECE 4100/6100 (10)

6 Branch Prediction Purpose: to steer the PC as accurately and as early as possible with respect to conditional branches Four possible outcomes in prediction-outcome pairs T/T predicted as taken and branch was indeed taken NT/NT predicted as not taken and branch was not taken T/NT predicted as taken but branch was not taken NT/T predicted as not taken but branch was taken The latter two are the branch misprediction pairs ECE 4100/6100 (11) Branch Prediction Strategies Three major classes of branch prediction strategies Static Semi-static Dynamic ECE 4100/6100 (12)

7 Static Prediction Static predictions are simple, hardwired strategies. Two main examples: always assume taken, always assume not taken. Very ineffective. ECE 4100/6100 (13) Static Branch Prediction (Figure: pipeline with IF, ID, instruction issue, and EX INT / EX FP / EX BR, MEM, WB stages.) The total number of stalls can be reduced, but performance is very dependent on an a priori understanding of branch behavior: based on extensive profiling, based on the instruction opcode (Motorola 8810), or based on the relative offset (IBM RS/6000), e.g., negative offsets. ECE 4100/6100 (14)

8 Semi-static Strategies Sometimes, the programmer or the compiler can do a fairly good job of predicting branches. A bit in each branch instruction indicates whether the branch is likely to be taken or not; this is sometimes called the Take-Don't-Take Bit (TDTB). The DECODE stage can steer instruction fetch accordingly. ECE 4100/6100 (15) TDTB in Action (1) (Figure: pipeline timing diagram over Clk1-Clk9; the branch S1: br K1 with TDTB=1 is confirmed later in the pipeline, and the target K1 is fetched right after it, so only 1 bubble is incurred on a correct prediction with TDTB = 1.) ECE 4100/6100 (16)

9 TDTB in Action (2) (Figure: pipeline timing diagram over Clk1-Clk9 for a mispredicted branch; the speculatively fetched instructions are cancelled, the PC is resteered, and a penalty of 3 cycles is incurred.) ECE 4100/6100 (17) Dynamic Branch Prediction Strategies How do we capture this history? (Figure: an n-bit shift register, bits n-1 down to 0, recording the last branch behavior, i.e., taken or not taken, and feeding a prediction. From Ref: Modern Processor Design: Fundamentals of Superscalar Processors, J. Shen and M. Lipasti.) Use past behavior to predict the future. Branches show surprisingly good correlation with one another; they are not totally random events. A general model (shown above) captures history and uses it to make predictions. ECE 4100/6100 (18)

10 Schemes Based on Local History A simple scheme is to maintain a single T bit for each branch instruction. If T = 0, predict the current instantiation of the branch as not taken; if T = 1, predict it as taken. When the branch is confirmed, set T to 0 or 1 depending on whether it is confirmed not taken or taken, respectively. Works well when the last behavior is a good indicator of future behavior. ECE 4100/6100 (19) Single-bit Branch Predictor (Figure: a table of 1-bit predictors, entries 0 through 2^K - 1, indexed using the K least significant bits of the PC; the taken / not-taken confirmation updates the entry that produced the prediction. Aliasing?) ECE 4100/6100 (20)
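As a concrete illustration (a sketch under assumed sizes and names, not code from the slides), the single-bit scheme can be written in C as a table of 2^K one-bit entries indexed by the K least significant bits of the branch PC; distinct branches whose PCs share those bits alias to the same entry.

```c
#include <stdint.h>
#include <stdbool.h>

#define K 12                       /* illustrative index width: 2^K = 4096 entries */
static uint8_t t_bit[1 << K];      /* one T bit per entry, 0 = not taken */

/* Index with the K low-order bits of the branch PC (dropping the byte-offset
 * bits, assuming 4-byte instructions).  Aliasing: two branches whose PCs
 * share these bits hit the same entry. */
static unsigned t_index(uint32_t pc) { return (pc >> 2) & ((1u << K) - 1); }

static bool predict_1bit(uint32_t pc) {
    return t_bit[t_index(pc)] != 0;
}

/* Update once the branch is confirmed taken (1) or not taken (0). */
static void update_1bit(uint32_t pc, bool taken) {
    t_bit[t_index(pc)] = taken ? 1 : 0;
}
```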

11 Two-Bit Predictors The 1-bit predictor is quite ineffective. Use the result of the last two branches to change the prediction. (Figure: four-state diagram with two "Predict Branch Taken" states and two "Predict Branch Not Taken" states, with taken (T) and not-taken (NT) outcomes moving between them.) This scheme can be generalized to n-bit counters. ECE 4100/6100 (21) Bimodal Branch Predictor The PC is used to index a set of counters rather than branch prediction bits. Each counter is an n-bit unsigned integer. If the branch is confirmed taken, the counter is incremented by one; if it is confirmed not taken, the counter is decremented by one. Counters are saturating: if the value is zero, decrement operations are ignored; if the value is at the maximum, increment operations are ignored. ECE 4100/6100 (22)

12 Bimodal Branch Predictor (Figure: a table of n-bit counters, entries 0 through 2^K - 1, indexed using the K least significant bits of the PC; the taken / not-taken confirmation updates the indexed counter, which supplies the prediction.) ECE 4100/6100 (23) Bimodal Branch Predictor Prediction: if the most significant bit of the counter is one, predict taken, else predict not taken. Can tolerate occasional changes in branch direction. The problem of aliasing (two PCs mapping to the same counter) depends on the size of the table. Useful when the branch address is known before the branch condition is known, so as to support prefetching. ECE 4100/6100 (24)
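A minimal C sketch of the bimodal predictor described above, assuming 2-bit saturating counters and a 4K-entry table (illustrative choices, not taken from the slides): predict on the counter's most significant bit, and saturate the update at 0 and 3.

```c
#include <stdint.h>
#include <stdbool.h>

#define K 12
static uint8_t ctr[1 << K];        /* 2-bit saturating counters, values 0..3 */

static unsigned b_index(uint32_t pc) { return (pc >> 2) & ((1u << K) - 1); }

/* Predict taken when the most significant bit of the 2-bit counter is set. */
static bool bimodal_predict(uint32_t pc) {
    return ctr[b_index(pc)] >= 2;
}

/* Saturating update: increment on taken, decrement on not taken. */
static void bimodal_update(uint32_t pc, bool taken) {
    uint8_t *c = &ctr[b_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}
```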

13 Performance Comparison The size and resolution of predictors are established empirically; the performance improvement beyond 4K-entry buffer sizes is minimal. Performance is a function of accuracy, branch penalties, and branch frequency. ECE 4100/6100 (25) Branch Prediction Improvements in accuracy beyond counter resolution are required. Note: integer programs have a higher branch frequency. (Figure: 4K entries with 2-bit predictors.) ECE 4100/6100 (26)
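As a rough back-of-the-envelope model (an assumption added here, not a formula from the slides), the three factors above combine as

```latex
\mathrm{CPI}_{\mathrm{eff}} \;=\; \mathrm{CPI}_{\mathrm{base}}
  \;+\; f_{\mathrm{branch}} \times (1 - a) \times P_{\mathrm{mispredict}}
```

where f_branch is the branch frequency, a the prediction accuracy, and P_mispredict the misprediction penalty in cycles. For example, with 20% branches, 90% accuracy, and a 3-cycle penalty, branches add roughly 0.2 x 0.1 x 3 = 0.06 to the CPI.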

14 Global Branch Predictor The bimodal predictor does not take other branches into consideration. A global predictor uses a shift register (GR) to store the history of the last k branches; the table of counters is maintained in the same way as in the bimodal predictor. When a branch is confirmed, a 1 or 0 (depending on whether the branch was taken or not) is shifted into GR. The new value of GR is used as the address of the counter in the next prediction. ECE 4100/6100 (27) Basic Idea (Figure: a control-flow graph of blocks B1-B7 with taken/fall-through edges; history patterns such as 111 and 101 identify particular paths.) The shift register captures the path through the program. For each unique path a counter is maintained, and the prediction is based on the behavior history of each path. The shift register length determines the program region size. ECE 4100/6100 (28)

15 Global Branch Predictor (Figure: GR indexes a table of counters; the taken / not-taken confirmation updates the indexed counter, which supplies the prediction.) ECE 4100/6100 (29) Tackling Double Loops The global predictor can tackle double loops with short inner loops, e.g., for (I=0; I<100; I++) for (J=0; J<3; J++)... (Table: each evaluation of the inner-loop condition C2 at J = 0, 1, 2 sees a different GR value, so the history distinguishes this particular iteration from the others even though the PC is the same; the inner branch is taken at J = 0 and J = 1, not taken at J = 2, and then the outer condition C1 is taken.) ECE 4100/6100 (30)
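A hedged C sketch of the pure global scheme just described, with all counters indexed only by GR (history length and table size are illustrative assumptions): the outcome is shifted into GR after each confirmed branch, and the new GR value addresses the counter used for the next prediction.

```c
#include <stdint.h>
#include <stdbool.h>

#define H 12                                 /* history length k (illustrative) */
static uint16_t gr;                          /* global history shift register    */
static uint8_t  gctr[1 << H];                /* 2-bit counters indexed by GR     */

static bool global_predict(void) {
    return gctr[gr & ((1u << H) - 1)] >= 2;  /* MSB of the 2-bit counter */
}

static void global_update(bool taken) {
    uint8_t *c = &gctr[gr & ((1u << H) - 1)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* Shift the confirmed outcome into GR for the next prediction. */
    gr = (uint16_t)(((gr << 1) | (taken ? 1u : 0u)) & ((1u << H) - 1));
}
```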

16 The gselect Predictor (Figure: m bits of GR are concatenated with n bits of the PC to form an (m+n)-bit index into the table of counters; the taken / not-taken confirmation updates the indexed counter, which supplies the prediction.) The PC is used to select from a bag of identical GR patterns, overcoming the aliasing problem of the pure-GR global predictor. ECE 4100/6100 (31) Another View (Figure: a 4-bit branch address selects a predictor, and a 3-bit global history across three branches selects the corresponding prediction, capturing history for a specific branch instruction.) Instead of having a predictor for a single branch, have a predictor for the most recent history of branch decisions: for each branch history sequence, use an n-bit predictor; see the index sketch after the gshare slide. ECE 4100/6100 (32)

17 The gshare Predictor (Figure: n bits of GR are XORed with n bits of the PC to form an n-bit index into the table of counters; the taken / not-taken confirmation updates the indexed counter, which supplies the prediction.) A further combination (hashing) of the PC and GR: counters are not distinguishable by PC or GR alone; they are shared. ECE 4100/6100 (33) Combining Predictors Idea: combine a local with a global predictor and use the one that is more accurate for a particular branch. A separate table tracks which predictor is more accurate. Can be extended to incorporate a number of different predictors. In experiments, shown to be 98% accurate; a variant is used in the Compaq Alpha processor. ECE 4100/6100 (34)
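The gselect and gshare slides differ only in how the counter-table index is formed from the PC and GR. A small C sketch of the two index functions (bit widths, the byte-offset shift, and names are illustrative assumptions, not from the slides):

```c
#include <stdint.h>

/* gselect: concatenate some low-order PC bits with some GR bits, giving an
 * index that is pc_bits + gr_bits wide. */
static unsigned gselect_index(uint32_t pc, uint32_t gr,
                              int pc_bits, int gr_bits) {
    uint32_t pc_part = (pc >> 2) & ((1u << pc_bits) - 1);
    uint32_t gr_part = gr        & ((1u << gr_bits) - 1);
    return (pc_part << gr_bits) | gr_part;
}

/* gshare: XOR n PC bits with n GR bits, so PC and GR share the index bits. */
static unsigned gshare_index(uint32_t pc, uint32_t gr, int n) {
    return ((pc >> 2) ^ gr) & ((1u << n) - 1);
}
```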

18 Multi-level Predictors Use multiple predictors and choose between them. (Figure: a four-state selector with states "Use predictor 1" and "Use predictor 2"; transitions between the states are determined by the accuracy of the individual predictors.) Employ predictors based on local and global information (the state of the art). ECE 4100/6100 (35) The Alpha Multi-Level Predictor (Figure: a global predictor with a 12-bit history indexing 4K 2-bit entries; a two-level local predictor in which the local branch address indexes 1024 entries of 10-bit history that in turn select 3-bit saturating counters; and a 4K-entry 2-bit selector choosing between them.) ECE 4100/6100 (36)

19 A Combined Predictor (Figure: a gshare predictor (GR XORed with n bits of the PC) and a bimodal predictor produce the two predictions P1 and P2; a predictor-selection table of counters, updated with P1c-P2c on the taken / not-taken confirmation, chooses whether to use P1 or P2.) ECE 4100/6100 (37) The Combined Predictor The table of counters that determines whether P1 or P2 is to be used is updated as follows: P1c-P2c is +1 when only P1 was correct, -1 when only P2 was correct, and 0 when both or neither were correct. The P1c-P2c value is added to the counter addressed by the PC. This gives flexible and dynamic use of whichever is the more accurate predictor. ECE 4100/6100 (38)
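A brief C sketch of the selector update just described, assuming a 2-bit saturating chooser per entry where values of 2 or 3 mean "use P1" (that convention and the table size are assumptions; the P1c-P2c update rule follows the slide):

```c
#include <stdint.h>
#include <stdbool.h>

#define K 12
static uint8_t choice[1 << K];   /* 2-bit chooser counters: >= 2 means "use P1" */

/* Add P1c - P2c (+1, 0, or -1) to the counter addressed by the PC, saturating
 * at 0 and 3.  p1_correct / p2_correct report whether each component
 * predictor was right for this branch. */
static void choice_update(uint32_t pc, bool p1_correct, bool p2_correct) {
    uint8_t *c = &choice[(pc >> 2) & ((1u << K) - 1)];
    int delta = (int)p1_correct - (int)p2_correct;
    if (delta > 0 && *c < 3) (*c)++;
    else if (delta < 0 && *c > 0) (*c)--;
    /* delta == 0: both right or both wrong, chooser unchanged */
}
```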

20 Combined Predictor Performance (Figure: performance of selection between a 2-bit local predictor and gshare.) ECE 4100/6100 (39) Misprediction Recovery What actions must be taken on a misprediction? Remove the predicted instructions and start fetching from the correct branch target(s). What information is necessary to recover from a misprediction? Address information for the non-predicted branch target; identification of those instructions that were predicted, so they can be invalidated and prevented from completion; and the association between predicted instructions and a specific branch, so that when that branch is mispredicted only those instructions are squashed. ECE 4100/6100 (40)

21 Branch Target Buffer A cache that contains three pieces of information: (1) the address of branch instructions (the BTB is managed like a cache, and the addresses of branch instructions are kept for lookup purposes); (2) the branch target address (to avoid re-computation of the branch target address where possible); (3) prediction statistics (different strategies are possible to maintain this portion of the BTB). ECE 4100/6100 (41) Branch Target Buffers (Figure: the fetch PC is searched against the PCs of prior branches; each entry holds the PC of the corresponding target and prediction info. Hit: use the corresponding target address. Miss: no action.) The BTB is accessed in parallel with the instruction cache; a hit produces the branch target address. ECE 4100/6100 (42)

22 Branch Target Buffer Operation (Flowchart: on instruction fetch, check for a BTB hit; at instruction decode, check whether the instruction is a branch. No hit and not a branch: normal execution. No hit but a branch: update the BTB with the new branch. Hit but mispredicted: update the BTB and perform misprediction recovery. Hit and correctly predicted: continue normal execution.) ECE 4100/6100 (43) Branch Target Buffers: Operation Couple speculative generation of the branch target address with branch prediction; continue to fetch and resolve the branch condition, and take appropriate action if the prediction was wrong. Any of the preceding history-based techniques can be used for branch condition speculation: store prediction information, e.g., n-bit predictors, along with the BTB entry. Branch folding optimization: store the target instruction rather than the target address. ECE 4100/6100 (44)
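A simplified, direct-mapped BTB sketch in C that follows the lookup/update flow above (entry count, indexing, and the 2-bit counter are illustrative assumptions; real BTBs are typically set-associative):

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512

struct btb_entry {
    bool     valid;
    uint32_t branch_pc;     /* tag: address of the branch instruction */
    uint32_t target;        /* cached branch target address            */
    uint8_t  ctr;           /* 2-bit prediction counter                */
};
static struct btb_entry btb[BTB_ENTRIES];

static unsigned btb_index(uint32_t pc) { return (pc >> 2) % BTB_ENTRIES; }

/* Lookup during fetch, in parallel with the I-cache.  Returns true (and the
 * predicted target) only on a hit whose counter predicts taken. */
static bool btb_lookup(uint32_t pc, uint32_t *target) {
    struct btb_entry *e = &btb[btb_index(pc)];
    if (e->valid && e->branch_pc == pc && e->ctr >= 2) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Update after the branch resolves: allocate on a newly seen branch, train
 * the counter, and refresh the target (it may change for indirect branches). */
static void btb_update(uint32_t pc, uint32_t target, bool taken) {
    struct btb_entry *e = &btb[btb_index(pc)];
    if (!e->valid || e->branch_pc != pc) {
        *e = (struct btb_entry){ .valid = true, .branch_pc = pc,
                                 .target = target, .ctr = taken ? 2 : 1 };
        return;
    }
    e->target = target;
    if (taken) { if (e->ctr < 3) e->ctr++; } else { if (e->ctr > 0) e->ctr--; }
}
```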

23 Return Address Predictors There is a need for prediction mechanisms for indirect jumps, that is, for addresses generated at run time such as return addresses. Return addresses are pushed onto a stack in the fetch unit; if the fetch unit sees a return in its instruction stream, it immediately pops the return stack and fetches from the popped address. BTB accuracy for returns can be degraded by calls from multiple locations. ECE 4100/6100 (45) An Integrated Solution (Figure: branch prediction, the return address stack (RAS), and the Branch Target Buffer feed the BTB logic, which produces the target address for an interleaved-access I-cache; each entry in an I-cache line is concurrently checked for being a branch, multi-branch prediction uses a variant of global history, and the fetched instructions go to the CPU.) Based on the design reported in E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996. ECE 4100/6100 (46)
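A small C sketch of the return address stack described above, assuming a fixed-depth circular stack in the fetch unit (the depth and wrap-around overflow behavior are illustrative assumptions): calls push the return point, returns pop it and redirect fetch.

```c
#include <stdint.h>

#define RAS_DEPTH 16
static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;           /* wraps around: oldest entries are overwritten */

/* On a predicted call: push the fall-through address (the return point). */
static void ras_push(uint32_t return_pc) {
    ras[ras_top] = return_pc;
    ras_top = (ras_top + 1) % RAS_DEPTH;
}

/* On a predicted return: pop and redirect fetch to the popped address. */
static uint32_t ras_pop(void) {
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras[ras_top];
}
```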

24 Further Analysis Even with accurate branch prediction, we must fetch instructions from multiple targets. What is the effect on instruction bandwidth and pipeline performance, and how can we increase instruction fetch bandwidth to compensate? Assume perfect branch prediction: the instructions of blocks B1-B6 are located in different cache lines. Exploit instruction locality + branch prediction! ECE 4100/6100 (47) Challenges to Increasing BW? Pipeline latency, instruction alignment in the I-cache, and predicting multiple branches all limit fetch bandwidth. What happens as ILP increases? Impact on memory bandwidth, especially fetch bandwidth? Impact on branch predictor throughput? Impact on the I-cache? ECE 4100/6100 (48)

25 Some Branch Statistics (Figure: branch statistics from E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996.) ECE 4100/6100 (49) A Program Trace An instruction trace is the actual sequence of executed instructions. We can block the trace into blocks of a fixed number of instructions per block and treat the trace blocks as units of prediction, based on the prediction of the individual branches in a block. Creating program regions for analysis is a common technique. Note that the I-cache stores a static program description. (Figure: block sequence B4, B2, B6, B5, B1, B7, B3.) ECE 4100/6100 (50)

26 The Trace Cache: Principle (Figure: an instruction trace through blocks such as B1, B3, B1, B4, B2, B6, with branch instructions separating the blocks.) Store recurring sequences of basic blocks; these form a contiguous sequence of instructions, a big basic block. Issue multiple instructions from this big basic block for a high issue rate. Trace length is determined by (#instructions, #branches). Predict and fetch traces rather than lines in the instruction cache; multiple instructions are issued from the trace. ECE 4100/6100 (51) The Trace Cache: The Problem Need to identify multiple blocks in the cache (some form of branch target table?), a multi-ported instruction cache (to fetch from multiple targets), and instruction alignment (to feed the decoder). Most likely this will add a pipeline stage after instruction fetch. ECE 4100/6100 (52)

27 The Trace Cache Basic idea: fetch the trace according to a multiple-path predictor. It complements a core (standard) fetch unit; the trace cache reconstructs the trace in parallel. Ref: E. Rotenberg, S. Bennett, and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," 29th Annual International Symposium on Microarchitecture, Dec. 1996. The following figures are from this reference. ECE 4100/6100 (53) The Trace Cache (Figure: the first time through, the trace of blocks such as B1, B2, B5, B6 is recorded (its instructions) into the trace cache as the core fetch unit supplies them; the second time through, the trace is accessed directly from the trace cache and sent to the decoder.) ECE 4100/6100 (54)

28 The Trace Cache (Figure: the fetch address indexes the trace cache, the BTB, and the I-cache in parallel; a trace-cache line holds a tag, branch flags, a branch mask, the fall-through (FT) address, the target address, and n instructions; the trace logic, RAS, and branch predictor select the n instructions supplied to the CPU, while the capture/fill path records trace history.) Trace length is determined by dispatch bandwidth and branch prediction bandwidth. Parallel look-up of trace history and instruction cache: the first address plus branch prediction bits index the cache. ECE 4100/6100 (55) Data Structure Valid bit: to indicate if the trace is valid. Tag: to identify the starting address. Branch flags: to indicate the branching behavior of the trace. ECE 4100/6100 (56)

29 Data Structure Branch mask: to indicate the number of branches in the trace and whether the trace ends in a branch. Trace fall-through address: the next fetch address if the last branch in the trace is predicted as not taken. Trace target address: the fetch address if the last branch is taken. ECE 4100/6100 (57) On a Hit There is a hit in the trace cache if the fetch address matches the tag and the branch predictions match the branch flags. When there is a hit, instructions come from the trace cache; otherwise they come from the core fetch unit. ECE 4100/6100 (58)
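A hedged C sketch of a trace-cache line holding the fields just listed, together with the hit test from the slide (field widths, the branch limit, and names are illustrative assumptions, not the paper's exact layout):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_BRANCHES 3                 /* illustrative limit on branches per line */

struct trace_line {
    bool     valid;                    /* valid bit                               */
    uint32_t tag;                      /* starting fetch address of the trace     */
    unsigned num_branches;             /* encoded by the branch mask              */
    bool     ends_in_branch;           /* also encoded by the branch mask         */
    uint8_t  branch_flags;             /* taken/not-taken path embedded in trace  */
    uint32_t fall_through_addr;        /* next fetch if last branch not taken     */
    uint32_t target_addr;              /* next fetch if last branch taken         */
    /* ... the trace's instructions themselves would be stored here ... */
};

/* Hit test: the fetch address must match the tag, and the multiple-branch
 * prediction must match the branch flags recorded for this trace. */
static bool trace_hit(const struct trace_line *t, uint32_t fetch_pc,
                      uint8_t predicted_flags) {
    if (!t->valid || t->tag != fetch_pc)
        return false;
    uint8_t mask = (uint8_t)((1u << t->num_branches) - 1);
    return (predicted_flags & mask) == (t->branch_flags & mask);
}
```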

30 On a Miss The core fetch unit takes over the responsibility of supplying instructions; the trace cache uses what is supplied by the core fetch unit to build up its cache line. ECE 4100/6100 (59) Performance ECE 4100/6100 (60)

31 Performance ECE 4100/6100 (61) The Real Deal: P4 Micro-architecture Front-end Branch Target Buffer: 4K entries, used when a lookup misses in the trace-cache BTB. If no BTB entry is found, use static prediction: backward branches are predicted taken, and the static branch prediction uses a threshold. There is also an indirect branch predictor. ECE 4100/6100 (62) From "The Microarchitecture of the Intel Pentium 4 Processor," Intel Technology Journal, February 2004 & February 2001

32 P4 Microarchitecture (Figure: the front-end BTB (4K entries), I-TLB, prefetcher, and instruction decoder feed the Trace Cache (12K µops), which has its own BTB (512 entries) and supplies up to 3 µops/cycle to the µop queue.) Trace Cache: 6 µops per trace line (many lines per trace); it has its own branch predictor and its own 512-entry BTB with a 16-entry return address stack. ECE 4100/6100 (63) The Real Deal: Power5 (Figure: structures for function/procedure returns and predicted targets.) A two-level prediction scheme is shared by the two threads: one bimodal predictor, one path-correlated predictor, and one to predict which of the first two is correct. Branch instruction queue (BIQ): stores recovery information for mispredictions; entries are retired in program order. ECE 4100/6100 (64) R. Kalla, B. Sinharoy, J. Tendler, "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, March/April 2004

33 Power5 Execution Pipeline Threads share IF-IC-BP High branch prediction throughput: all branches can be predicted R. Kalla, B. Sinharoy, J. Tendler, IBM POWER5 CHIP: A Dual-Core Multithreaded Processor, IEEE Micro, March/April, 2004 ECE 4100/6100 (65) Some Research Questions? Quality of branch prediction Improving branch predictor throughput Power efficiency of branch prediction logic Speculative execution Is the focus moving up a level to multi-core, manycore, any core? Has the era of ILP stabilized? ECE 4100/6100 (66)

34 Concluding Remarks Handling control flow is a challenge to keeping the execution core fed. Prediction and recovery mechanisms are key to keeping the pipeline active and avoiding performance degradation. Superscalar datapaths add pressure, pushing for better, more innovative techniques to keep pace with the technology-enabled appetite for instruction-level parallelism. What next? ECE 4100/6100 (67) Study Guide What are the basic approaches to branch prediction? What properties of the program does each predictor rely upon? Given a predictor, describe program structures/behaviors for which this predictor will work well or work poorly. Behavior of branch predictors: given a program trace (including taken/not-taken conditions of branches), be able to trace through the states of a predictor. Behavior of the BTB: given a program trace, be able to show the BTB contents at each point in the fetch sequence, and trace pipeline operation on a BTB miss. ECE 4100/6100 (68)

35 Study Guide What is a trace cache? Basic operation What are the properties of programs for which a trace cache is a good solution? Compare and contrast each type of branch predictor Given a set of program behaviors/statistics design a branch prediction strategy and implementation ECE 4100/6100 (69)
