Instruction Level Parallelism (ILP)


1 / 26 Instruction Level Parallelism (ILP) ILP: the simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further, applying more aggressive techniques to execute the instructions of the instruction stream in parallel.

2 / 26 Exploiting ILP Two basic approaches: rely on hardware to discover and exploit parallelism dynamically, and rely on software to restructure programs to facilitate parallelism statically. These techniques are complementary: they can be, and are, used in concert to improve performance.

3 / 26 Program Challenges and Opportunities for ILP Basic block: a straight-line code sequence with a single entry point and a single exit point. Remember, the average branch frequency is 15%–25%; thus there are only 3–6 instructions on average between a pair of branches. Loop-level parallelism: the opportunity to use multiple iterations of a loop to expose additional parallelism. Example techniques to exploit loop-level parallelism: vectorization, data-parallel execution, loop unrolling (see the loop sketch below).
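
A minimal C sketch of loop-level parallelism, with hypothetical arrays and loop bodies: every iteration of the first loop is independent and could run in parallel, while the second loop carries a true data dependence from one iteration to the next and cannot.

```c
/* Loop-level parallelism: illustrative only; x, y, z, and n are hypothetical. */
void parallel_loop(double *x, const double *y, const double *z, int n) {
    /* Iterations are independent: x[i] never feeds a later iteration,
     * so the loop can be vectorized or unrolled freely. */
    for (int i = 0; i < n; i++)
        x[i] = y[i] + z[i];
}

void serial_loop(double *x, int n) {
    /* Loop-carried dependence: iteration i reads the value written by
     * iteration i-1 (a true data dependence across iterations). */
    for (int i = 1; i < n; i++)
        x[i] = x[i - 1] + 1.0;
}
```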

4 / 26 Dependencies and Hazards Three types of dependencies: data dependencies (or true data dependencies), name dependencies, and control dependencies. Dependencies are artifacts of programs; hazards are artifacts of pipeline organization. Not all dependencies become hazards: whether a dependence turns into a hazard depends on the architecture of the pipeline.

5 / 26 Data Dependencies An instruction j is data dependent on instruction i if: instruction i produces a result that may be used by instruction j; or instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

6 / 26 Name Dependencies (not true dependencies) Occur when two instructions use the same register or memory location (the name), but there is no flow of data between the instructions associated with that name. There are two types of name dependencies between an instruction i that precedes instruction j in program order: antidependence: instruction j writes a register or memory location that instruction i reads. output dependence: instruction i and instruction j write the same register or memory location. Because these are not true dependencies, instructions i and j could potentially execute in parallel if the name dependence is somehow removed (by using distinct registers or memory locations).

7 / 26 Data Hazards A hazard exists whenever there is a name or data dependence between two instructions and they are close enough that their overlapped execution would violate the program's dependence order. Possible data hazards: RAW (read after write), WAW (write after write), WAR (write after read). RAR (read after read) is not a hazard. (See the sketch below.)
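
A minimal C sketch of the hazard classes acting on a single name (the variables are hypothetical); the comments mark which instruction pairs a pipeline must keep ordered.

```c
/* Hypothetical straight-line code illustrating the dependence classes. */
int hazards_demo(int b, int c) {
    int a;
    a = b + c;   /* i: writes a                                        */
    int d = a;   /* j: RAW on a: j must see the result produced by i   */
    a = c * 2;   /* k: WAR on a w.r.t. j (j must read a before k writes)
                    and WAW on a w.r.t. i (both write a).
                    Renaming a in k would remove both name hazards.    */
    return d + a;
}
```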

8 / 26 Control Dependencies A control dependence ties an instruction to the sequential flow of execution and preserves the branch (or any flow-altering operation) behavior of the program. In general, two constraints are imposed by control dependencies: An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch. An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.

9 / 26 Relaxed Adherence to Dependencies Strictly enforcing dependence relations is not entirely necessary as long as we preserve the correctness of the program. Two properties critical to program correctness (and normally preserved by maintaining both data and control dependences) are: exception behavior: reordering instructions must not alter the visible exception behavior of the program (transparent exceptions, such as page faults, may be affected). data flow: the data sources feeding each consuming instruction must be preserved. Liveness: data that is still needed is called live; data that is no longer used is called dead.

10 / 26 Compiler Techniques for Exposing ILP: loop unrolling and instruction scheduling (a sketch of unrolling follows below).
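
A minimal C sketch of loop unrolling by a factor of four (the daxpy-style loop and the cleanup handling are illustrative assumptions): unrolling amortizes loop overhead across iterations and exposes independent operations that the scheduler can interleave.

```c
/* Hypothetical daxpy-style loop, unrolled by 4. */
void daxpy_unrolled(double *y, const double *x, double a, int n) {
    int i = 0;
    /* Main unrolled body: one branch per 4 elements instead of per 1,
     * and 4 independent multiply-adds available for scheduling. */
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    /* Cleanup loop handles the remaining n % 4 elements. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```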

11 / 26 Advanced Branch Prediction Correlating branch predictors (or two-level predictors): use the outcomes of the most recent branches to make the prediction (a sketch follows below). Tournament predictors: run multiple predictors and a tournament between them; use the most successful one.
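
A minimal C sketch of a gshare-style correlating predictor (the table size and XOR indexing are assumptions, not taken from the lecture): a global history register is combined with PC bits to index a table of 2-bit saturating counters, so the prediction depends on the outcomes of the most recent branches.

```c
#include <stdint.h>
#include <stdbool.h>

#define PHT_BITS 12                    /* assumed: 4096-entry table      */
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t  pht[PHT_SIZE];         /* 2-bit saturating counters, 0..3 */
static uint32_t ghist;                 /* global branch history register  */

/* Predict taken when the counter is in a "taken" state (2 or 3). */
bool predict(uint32_t pc) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    return pht[idx] >= 2;
}

/* Train on the actual outcome and shift it into the history. */
void update(uint32_t pc, bool taken) {
    uint32_t idx = (pc ^ ghist) & (PHT_SIZE - 1);
    if (taken  && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
```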

12 / 26 Measured Misprediction Rates: SPEC89 [Figure: conditional branch misprediction rate (0%–8%) versus total predictor size (0–512), comparing local 2-bit predictors, correlating predictors, and tournament predictors.]

13 / 26 End of Sections 3.1–3.3. Lecture break; Sections 3.4–3.9 continue next class.

14 / 26 Dynamic Scheduling Designing the hardware so that it can dynamically rearrange instruction execution to reduce stalls while maintaining data flow and exception behavior. Two techniques: scoreboarding (discussed in Appendix C), which uses a centralized scoreboard, and Tomasulo's algorithm (discussed in Chapter 3), which uses distributed reservation stations.

15 / 26 Dynamic Scheduling Instructions are issued to the pipeline in order but executed and completed out of order. Out-of-order execution leads to the possibility of out-of-order completion. Out-of-order execution also introduces the possibility of WAR and WAW hazards, which do not occur in statically scheduled pipelines. Out-of-order completion creates major complications in exception handling.

16 / 26 Key Components of Tomasulo's Algorithm Register renaming: eliminates WAW and WAR hazards (caused by name dependencies). Reservation stations: buffer operands for instructions waiting to execute. Operands are fetched as soon as they are available (from the reservation station/instruction producing the operand). The register specifiers of issued instructions are renamed to reservation-station tags (this achieves register renaming). Thus hazard detection and execution control are distributed: the targeted functional unit and its reservation stations determine when an instruction can execute (see the issue-time renaming sketch below). Common data bus (also called the result bus): carries results past the reservation stations (where they are captured) and back to the register file. Sometimes multiple buses are used.
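
A minimal C sketch of the renaming step at issue time, modeled loosely on the textbook's Qj/Vj convention (the structures and field names are assumptions): the register status table maps each architectural register to the reservation station that will produce its value, so a source operand is either read immediately or tagged to wait for a broadcast on the common data bus.

```c
#include <stdint.h>

#define NUM_REGS 32
#define NUM_RS   8

typedef struct {
    int     busy;
    int     qj, qk;      /* producing RS tags; 0 = value already in vj/vk */
    int64_t vj, vk;      /* source operand values once available          */
} ResStation;

static ResStation rs[NUM_RS + 1];       /* tag 0 is reserved for "no producer" */
static int     reg_producer[NUM_REGS];  /* 0, or tag of the RS that will write */
static int64_t regfile[NUM_REGS];

/* Issue: rename sources to RS tags and claim the destination register.
 * r is the reservation station (1..NUM_RS) allocated to this instruction. */
void issue(int r, int src1, int src2, int dst) {
    rs[r].busy = 1;
    int q1 = reg_producer[src1];
    if (q1) { rs[r].qj = q1; }                          /* wait on the CDB  */
    else    { rs[r].qj = 0; rs[r].vj = regfile[src1]; } /* read value now   */
    int q2 = reg_producer[src2];
    if (q2) { rs[r].qk = q2; }
    else    { rs[r].qk = 0; rs[r].vk = regfile[src2]; }
    reg_producer[dst] = r;   /* later readers of dst wait on tag r; this
                                rename is what removes WAR/WAW on dst     */
}
```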

17 / 26 Hardware Speculation Execute (completely) the instructions that are predicted to occur after a branch without knowing the branch outcome; that is, speculate that those instructions will be executed. Instruction commit: the point at which the results of a speculated instruction (operand writes/exceptions) are made permanent. The key to speculation is to allow instructions to execute out of order but force them to commit in order. This is generally achieved with a reorder buffer (ROB), which holds completed instructions and retires them in order (see the commit sketch below). Thus the ROB holds/buffers register/memory operands, which are also used by functional units as source operands.
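
A minimal C sketch of in-order commit through a reorder buffer (sizes and field names are assumptions): instructions may finish execution in any order, but their results update the register file only when they reach the head of the ROB, so a mispredicted branch can simply flush the entries behind it.

```c
#include <stdint.h>

#define ROB_SIZE 32

typedef struct {
    int     valid;      /* entry is allocated                       */
    int     done;       /* result has been produced (executed)      */
    int     dest;       /* architectural destination register       */
    int64_t value;      /* result waiting to be committed           */
} RobEntry;

static RobEntry rob[ROB_SIZE];
static int head, tail;     /* circular buffer: allocate at tail (not shown),
                              retire at head                               */
extern int64_t regfile[32];

/* Commit stage: retire finished instructions strictly in order.
 * A speculated instruction's write becomes architectural state only here. */
void commit(void) {
    while (rob[head].valid && rob[head].done) {
        regfile[rob[head].dest] = rob[head].value;  /* instruction commit */
        rob[head].valid = 0;
        head = (head + 1) % ROB_SIZE;
    }
}
```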

18 / 26 Multiple-Issue Processors Attempting to reduce the CPI below one. 1. Statically scheduled superscalar processors 2. VLIW (very long instruction word) processors 3. Dynamically scheduled superscalar processors superscalar: scheduling multiple instructions for execution at the same time.

19 / 26 Advanced Instruction Delivery & Speculation Branch-target buffer or branch-target cache: predicts the destination PC for a branch instruction; can also store the target instruction (instead of, or in addition to, the destination address). Branch folding: overwrite the branch instruction with the destination instruction. Return address prediction. Integrated instruction fetch units: autonomously executing units that fetch instructions. Comments on speculation: how much? through multiple branches? energy costs? Value prediction: not yet feasible; not sure why it's covered.

20 / 26 Branch-Target Buffer The PC of the instruction being fetched is looked up in the branch-target buffer. On a miss, the instruction is not predicted to be a branch and fetch proceeds normally. On a hit, the instruction is predicted to be a branch, and the predicted PC is used as the next PC, together with the prediction of taken or untaken. (A minimal lookup sketch follows below.)
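
A minimal C sketch of a direct-mapped branch-target buffer (the entry count and tag scheme are assumptions): the fetch PC indexes the table, and only a tag match yields a predicted next PC; otherwise fetch falls through to pc + 4.

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_BITS 10                      /* assumed: 1024 entries        */
#define BTB_SIZE (1u << BTB_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;         /* upper PC bits identifying the branch        */
    uint32_t target;      /* predicted destination PC                    */
} BtbEntry;

static BtbEntry btb[BTB_SIZE];

/* Returns true on a hit (instruction predicted to be a branch) and
 * writes the predicted next PC; on a miss, fetch proceeds to pc + 4. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
    uint32_t idx = (pc >> 2) & (BTB_SIZE - 1);       /* word-aligned PCs */
    if (btb[idx].valid && btb[idx].tag == (pc >> (2 + BTB_BITS))) {
        *next_pc = btb[idx].target;
        return true;
    }
    *next_pc = pc + 4;
    return false;
}
```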

21 / 26 End of Sections 3.4–3.9. Lecture break; Sections 3.10–3.16 continue next class.

22 / 26 The Limits of ILP How much ILP is available? Is there a limit? Consider the ideal processor: 1. An infinite number of registers for register renaming. 2. Perfect branch prediction. 3. Perfect jump prediction. 4. Perfect memory-address alias analysis. 5. All memory accesses complete in one cycle. This effectively removes all control dependencies and all but true data dependencies; that is, every instruction can be scheduled as early as its data dependences allow.

23 / 26 Unlimited Issue/Single-Cycle Execution Instruction issues per cycle achieved by the ideal processor on six SPEC benchmarks:

Benchmark   Issues per cycle
gcc                55
espresso           63
li                 18
fpppp              75
doduc             119
tomcatv           150

The Power7 currently issues up to six instructions per clock.

24 / 26 Limited Issue-Window Size/Single-Cycle Execution Assumptions: up to 64 instructions issued per cycle; tournament predictor, the best available in 2011 (not a primary bottleneck); perfect memory-address analysis; 64 integer and 64 FP extra registers available for renaming; single-cycle execution. Instruction issues per cycle by window size:

Benchmark   Infinite   256   128    64    32
gcc               10    10    10     9     8
espresso          15    15    13    10     8
li                12    12    11    11     9
fpppp             52    47    35    22    14
doduc             17    16    15    12     9
tomcatv           56    45    34    22    14

25 / 26 Hardware/Software Speculation Professor rambles on about this topic for a few moments.

26 / 26 Multithreading Fine-grained multithreading: switch threads at each clock cycle. Examples: Sun Niagara processor (8 threads), Nvidia GPUs. Coarse-grained multithreading: switch threads at major stalls such as L2/L3 misses. Examples: no commercial ventures. Simultaneous multithreading: dispatch instructions from multiple threads to the common functional units simultaneously. Examples: Intel Hyper-Threading (two threads), IBM Power7 (four threads).