Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

- ILP vs. Parallel Computers
- Dynamic Scheduling (Sections 3.4 and 3.5)
- Dynamic Branch Prediction (Sections 3.3 and 3.9, and Appendix C)
- Hardware Speculation and Precise Interrupts (Section 3.6)
- Multiple Issue (Section 3.7)
- Static Techniques (Section 3.2, Appendix H)
- Limits and Benefits of ILP (older editions and Section 3.12)
- Multithreading (Section 3.11)
- Putting It Together (mini-projects)

Limits of ILP
How much can ILP buy us? Limit studies make optimistic assumptions to find an upper bound on ILP, but they may miss the impact of the compiler and of future advances. A highly optimistic study [Wall 93] assumes:
- an infinite number of physical registers (no register WAW or WAR hazards)
- an infinite number of in-flight instructions
- perfect branch prediction
- perfect memory address alias analysis
- single-cycle functional units
- single-cycle memory (perfect caches)
A sketch of an oracle measurement under these assumptions follows.
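To make the oracle concrete: with renaming removing WAW/WAR hazards and with perfect prediction and alias analysis, an instruction waits only for its true (RAW) inputs, so measuring ILP reduces to scheduling each instruction one cycle after its last input is produced. Below is a minimal Python sketch of such a limit measurement; the `Insn` and `measure_ilp` names are illustrative, not from Wall's study.

```python
# Oracle ILP-limit measurement in the spirit of [Wall 93]: under perfect
# assumptions, an instruction is delayed only by true (RAW) dependences.
from dataclasses import dataclass, field

@dataclass
class Insn:
    dst: str                                   # value (register) written
    srcs: list = field(default_factory=list)   # values read

def measure_ilp(trace):
    """Average instructions per cycle under oracle assumptions."""
    ready = {}        # value name -> cycle at which it becomes available
    last_cycle = 0
    for insn in trace:
        # Each instruction takes one cycle and waits only for its inputs.
        start = max((ready.get(s, 0) for s in insn.srcs), default=0)
        finish = start + 1
        ready[insn.dst] = finish   # renaming removes WAW/WAR hazards
        last_cycle = max(last_cycle, finish)
    return len(trace) / last_cycle

# A tiny trace: two independent chains of length 2 -> ILP of 2.0.
trace = [Insn("r1"), Insn("r2"), Insn("r3", ["r1"]), Insn("r4", ["r2"])]
print(measure_ilp(trace))   # 4 instructions in 2 cycles -> 2.0
```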

Limits of ILP (contd.)
(This and the next four figures are from an older edition of the book.)
Figure 3.35: ILP available in a perfect processor for six of the SPEC92 benchmarks. The first three programs are integer programs, and the last three are floating-point programs. The floating-point programs are loop-intensive and have large amounts of loop-level parallelism.
* This figure has been taken from Computer Architecture: A Quantitative Approach, 3rd Edition, Copyright 2003 by Elsevier Inc. All rights reserved. It has been used with permission from Elsevier Inc.

Limits of ILP: Impact of Optimistic Assumptions
Limiting the instruction window size: finding dependences among n instructions requires on the order of n^2 comparisons, so a 2000-instruction window implies roughly 4 million comparisons! The following figures assume a 2K-entry window and a 64-instruction issue limit. (A back-of-the-envelope check of the comparison count follows the caption below.)
Figure 3.36: The effects of reducing the size of the window. The window is the group of instructions from which an instruction can execute. The start of the window is the earliest uncompleted instruction (remember that instructions complete in one cycle), and the last instruction in the window is determined by the window size. The instructions in the window are obtained by perfectly predicting branches and selecting instructions until the window is full.
* This figure has been taken from Computer Architecture: A Quantitative Approach, 3rd Edition, Copyright 2003 by Elsevier Inc. All rights reserved. It has been used with permission from Elsevier Inc.
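As a back-of-the-envelope check of the comparison count quoted above, a sketch assuming two source operands per instruction, each compared against every earlier instruction's destination:

```python
# Why dependence checking is quadratic in window size: each of the n
# instructions compares its (assumed two) source operands against all
# earlier destinations in the window. Purely illustrative arithmetic.
def dependence_comparisons(n, srcs_per_insn=2):
    return sum(i * srcs_per_insn for i in range(n))

print(dependence_comparisons(2000))   # 3,998,000 -- the ~4 million on the slide
```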

Limits of ILP: Impact of Optimistic Assumptions
Realistic branch prediction (but no cycle penalty is charged for mispredictions). The following figure uses a tournament predictor. (A sketch of the simpler 2-bit scheme mentioned in the caption follows.)
Figure 3.38: The effect of branch-prediction schemes. This graph shows the impact of going from a perfect model of branch prediction (all branches predicted correctly arbitrarily far ahead), to various dynamic predictors (selective and 2-bit), to compile-time, profile-based prediction, and finally to using no predictor. The predictors are described precisely in the text.
* This figure has been taken from Computer Architecture: A Quantitative Approach, 3rd Edition, Copyright 2003 by Elsevier Inc. All rights reserved. It has been used with permission from Elsevier Inc.
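For concreteness, here is a minimal sketch of the classic 2-bit saturating-counter scheme, one of the dynamic predictors in the figure. The table size and the PC-indexing scheme are illustrative assumptions, not from the text.

```python
# 2-bit saturating-counter branch predictor: counter values 0,1 predict
# not-taken; 2,3 predict taken. One outlier outcome costs one
# misprediction, not two -- the point of using two bits instead of one.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries          # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A mostly-taken branch: the single not-taken hiccup causes only one miss.
bp = TwoBitPredictor()
outcomes = [True, True, True, False, True, True]
hits = sum((bp.predict(0x40) == t, bp.update(0x40, t))[0] for t in outcomes)
print(f"{hits}/{len(outcomes)} correct")   # 4/6
```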

Limits of ILP: Impact of Optimistic Assumptions
Finite registers: the following figure uses 256 integer and 256 FP registers for renaming. (A sketch of renaming against a finite physical register pool follows the caption.)
Figure 3.41: The effect of finite numbers of registers available for renaming. Both the number of FP registers and the number of GP registers are increased by the number shown on the x-axis. The effect is most dramatic on the FP programs, although having only 32 extra GP and 32 extra FP registers has a significant impact on all the programs. As stated earlier, we assume a window size of 2K entries and a maximum issue width of 64 instructions. "None" implies no extra registers are available.
* This figure has been taken from Computer Architecture: A Quantitative Approach, 3rd Edition, Copyright 2003 by Elsevier Inc. All rights reserved. It has been used with permission from Elsevier Inc.
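Below is a minimal sketch of register renaming against a finite physical register pool, the resource varied on the figure's x-axis. The register counts and names are illustrative, and freeing registers at commit is omitted for brevity.

```python
# Register renaming with a finite physical register pool: each write gets
# a fresh physical register (removing WAW/WAR hazards); when the free
# list is empty, rename -- and hence issue -- must stall.
class Renamer:
    def __init__(self, arch_regs=32, phys_regs=64):
        self.map = {f"r{i}": f"p{i}" for i in range(arch_regs)}
        self.free = [f"p{i}" for i in range(arch_regs, phys_regs)]

    def rename(self, dst, srcs):
        srcs = [self.map[s] for s in srcs]   # read current mappings
        if not self.free:
            return None                      # out of registers: stall
        phys = self.free.pop()
        self.map[dst] = phys                 # dst gets a fresh register
        return phys, srcs                    # (commit-time freeing omitted)

r = Renamer()
print(r.rename("r1", ["r2", "r3"]))   # ('p63', ['p2', 'p3'])
print(r.rename("r1", ["r1"]))         # reads p63, writes a new register
```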

Limits of ILP: Impact of Optimistic Assumptions
Imperfect memory alias analysis. (The two extremes are contrasted in the sketch after the caption.)
Figure 3.43: The effect of various alias analysis techniques on the amount of ILP. Anything less than perfect analysis has a dramatic impact on the amount of parallelism found in the integer programs, and global/stack analysis is perfect (and unrealizable) for the FORTRAN programs. As we said earlier, we assume a maximum issue width of 64 instructions and a window of 2K instructions.
* This figure has been taken from Computer Architecture: A Quantitative Approach, 3rd Edition, Copyright 2003 by Elsevier Inc. All rights reserved. It has been used with permission from Elsevier Inc.
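A toy contrast between the figure's two extremes, perfect analysis and no analysis; the addresses and the `may_alias` helper are illustrative:

```python
# With no alias analysis, every store must be assumed to conflict with
# every later load, serializing them; perfect (oracle) analysis compares
# the actual addresses and lets independent loads issue early.
def may_alias(store_addr, load_addr, mode="none"):
    if mode == "perfect":
        return store_addr == load_addr   # oracle: exact addresses known
    return True                          # conservative: always conflict

# Store to 0x1000 followed by a load from 0x2000:
print(may_alias(0x1000, 0x2000, "perfect"))  # False -> load can reorder
print(may_alias(0x1000, 0x2000, "none"))     # True  -> load must wait
```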

But Limit Studies May Be Pessimistic!
Even the most optimistic study still charges for:
- WAR and WAW hazards through memory
- unnecessary dependences (e.g., on the loop iteration count)
- the data-flow limit, which value prediction could overcome (see the sketch below)
The more realistic studies additionally omit:
- address value prediction and speculation
- speculating down multiple paths
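As a concrete illustration of value prediction, here is a sketch of a simple last-value predictor: guess that an instruction will produce the same value it produced last time, and squash and replay on a wrong guess. The class and its interface are illustrative, not a specific published design.

```python
# Last-value prediction: dependent instructions can start speculatively
# with the predicted value, breaking the data-flow limit when correct.
class LastValuePredictor:
    def __init__(self):
        self.last = {}              # pc -> last value produced

    def predict(self, pc):
        return self.last.get(pc)    # None -> no prediction yet

    def update(self, pc, value):
        self.last[pc] = value

vp = LastValuePredictor()
for actual in [7, 7, 7, 8]:        # a value that is stable, then changes
    guess = vp.predict(0x80)
    print(guess, "correct" if guess == actual else "wrong or no guess")
    vp.update(0x80, actual)
```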

Real Systems: Two Out-of-order i7 Processors
Figure 3.39: The buffers and queues in the first-generation i7 and the latest-generation i7. Nehalem used a reservation station plus reorder buffer organization. In later microarchitectures, the reservation stations serve as scheduling resources, and register renaming is used rather than the reorder buffer; the reorder buffer in the Skylake microarchitecture serves only to buffer control information. The choices of the sizes of the various buffers and renaming registers, while appearing sometimes arbitrary, are likely based on extensive simulation.
© 2019 Elsevier Inc. All rights reserved.

Two Out-of-order i7 Processors
Figure 3.40: The CPI for the SPECCPUint2006 benchmarks on the i7 6700 and the i7 920. The data in this section were collected by Professor Lu Peng and PhD student Qun Liu, both of Louisiana State University.
© 2019 Elsevier Inc. All rights reserved.

Out-of-order i7 vs. In-order Atom
Figure 3.43: The relative performance and energy efficiency for a set of single-threaded benchmarks shows that the i7 920 is 4 to over 10 times faster than the Atom 230, but that it is about 2 times less power-efficient on average! Performance is shown in the columns as i7 relative to Atom, which is execution time (i7)/execution time (Atom). Energy is shown with the line as Energy (Atom)/Energy (i7). The i7 never beats the Atom in energy efficiency, although it is essentially as good on four benchmarks, three of which are floating point. The data shown here were collected by Esmaeilzadeh et al. (2011). The SPEC benchmarks were compiled with optimization using the standard Intel compiler, while the Java benchmarks use the Sun (Oracle) HotSpot Java VM. Only one core is active on the i7, and the rest are in deep power-saving mode. Turbo Boost is used on the i7, which increases its performance advantage but slightly decreases its relative energy efficiency.
© 2019 Elsevier Inc. All rights reserved.

Multithreading: Instruction + Thread-Level Parallelism
Superscalar instruction slots are often wasted. Why not use them for other threads? Multithreading comes in three flavors (contrast each with multiprocessing):
- coarse-grained
- fine-grained
- simultaneous multithreading (SMT), also known as hyperthreading
A small sketch contrasting fine-grained issue with SMT issue follows.
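A toy model of the issue-slot picture in the figure on the next slide: fine-grained multithreading gives all of a cycle's slots to one thread (rotating each cycle), while SMT lets several threads share the slots within a single cycle. The 4-wide issue width and the per-cycle issue counts are made-up inputs for illustration.

```python
# Issue-slot utilization under two multithreading schemes. Each thread is
# a list of how many instructions it could issue in each cycle (0 = stalled).
WIDTH = 4   # assumed superscalar issue width

def fine_grained(threads, cycles):
    # Round-robin: one thread owns all slots in a given cycle.
    used = sum(min(WIDTH, threads[c % len(threads)][c]) for c in range(cycles))
    return used / (WIDTH * cycles)

def smt(threads, cycles):
    # Slots within one cycle may be filled from several threads.
    used = 0
    for c in range(cycles):
        avail = WIDTH
        for t in threads:
            take = min(avail, t[c])
            used += take
            avail -= take
    return used / (WIDTH * cycles)

# Two threads, each often unable to fill a 4-wide machine alone.
t0 = [2, 0, 3, 1]
t1 = [1, 2, 0, 3]
print(fine_grained([t0, t1], 4))   # 0.625 utilization
print(smt([t0, t1], 4))            # 0.75  -- SMT fills more slots
```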

Multithreading: Instruction + Thread-Level Parallelism (vs. Multiprocessing?)
Figure 3.31: How four different approaches use the functional unit execution slots of a superscalar processor. The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors. Black is also used to indicate the occupied issue slots in the case of the superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use SMT. The T2 has 8 threads, the Power7 has 4, and the Intel i7 has 2. In all existing SMTs, instructions issue from only one thread at a time. The difference in SMT is that the subsequent decision to execute an instruction is decoupled and could execute the operations coming from several different instructions in the same clock cycle.
© 2019 Elsevier Inc. All rights reserved.

SMT Speedup & Energy Efficiency: 1 vs. 4 Threads
Figure 3.33: The speedup from using multithreading on one core of an i7 processor averages 1.28 for the Java benchmarks and 1.31 for the PARSEC benchmarks (using an unweighted harmonic mean, which implies a workload where the total time spent executing each benchmark in the single-threaded base set was the same). The energy efficiency averages 0.99 and 1.07, respectively (using the harmonic mean). Recall that anything above 1.0 for energy efficiency indicates that the feature reduces execution time by more than it increases average power. Two of the Java benchmarks experience little speedup and have significant negative energy efficiency because of this issue. Turbo Boost is off in all cases. These data were collected and analyzed by Esmaeilzadeh et al. (2011) using the Oracle (Sun) HotSpot build 16.3-b01 Java 1.6.0 Virtual Machine and the gcc v4.4.1 native compiler.
© 2019 Elsevier Inc. All rights reserved.
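For reference, the unweighted harmonic mean used in the caption can be computed as below; the per-benchmark speedup values are made up for illustration, not the measured data.

```python
# Unweighted harmonic mean of per-benchmark speedups: this weighting
# corresponds to a workload where each benchmark gets equal
# single-threaded execution time in the base set.
def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

speedups = [1.5, 1.1, 1.4, 1.25]          # hypothetical SMT speedups
print(round(harmonic_mean(speedups), 2))  # 1.29
```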