Lecture 9: More ILP
Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14)

ILP Limits
The perfect processor:
- Infinite registers (no WAW or WAR hazards)
- Perfect branch direction and target prediction
- Perfect memory disambiguation
- Perfect instruction and data caches
- Single-cycle latencies for all ALUs
- Infinite ROB size (window of in-flight instructions)
- No limit on the number of instructions in each pipeline stage
- The last instruction may be scheduled in the first cycle
What is the only constraint in this processor? The only constraint is a true dependence (register or memory RAW hazards). (With value prediction, how would the perfect processor behave?)
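To make that concrete, here is a minimal sketch (hypothetical trace format and dependence lists, not from the lecture): when every resource is ideal, an instruction can execute one cycle after its last true-dependence producer, so the schedule depth equals the longest RAW chain and ideal IPC is the instruction count divided by that chain length.

```python
# Sketch: on a "perfect" processor, schedule depth = longest RAW chain.
# Each instruction lists the indices of the instructions whose results it reads.
def perfect_schedule(deps):
    cycle = []                        # cycle in which each instruction executes
    for srcs in deps:
        # ready one cycle after the last true-dependence producer; else cycle 1
        cycle.append(1 + max((cycle[s] for s in srcs), default=0))
    return cycle

# Hypothetical 6-instruction trace: instr 2 reads 0 and 1, instr 5 reads 4, ...
trace = [[], [], [0, 1], [], [2, 3], [4]]
sched = perfect_schedule(trace)
print(sched)                          # [1, 1, 2, 1, 3, 4]
print(len(trace) / max(sched))        # ideal IPC = 6 instrs / 4-cycle RAW chain
```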

Infinite Window Size and Issue Rate

Effect of Window Size
- Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.
- We will use a window size of 2K instructions and a max issue rate of 64 for subsequent experiments

Imperfect Branch Prediction
- Note: there is no branch mispredict penalty here; a mispredict only restricts the window size
- Assume a large tournament predictor for subsequent experiments

Effect of Name Dependences
- More registers → fewer WAR and WAW constraints (usually register file size goes hand in hand with in-flight window size)
- 256 int and fp registers for subsequent experiments

Memory Dependences

Limits of ILP Summary
- Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size
- We have not yet examined the effect of branch mispredict penalty and imperfect caching
- All of the studied factors have a relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation
- Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution

Pentium III (P6 Microarchitecture) Case Study
- 14-stage pipeline: 8 for fetch/decode/dispatch, 3+ for out-of-order execution, 3 for commit → branch mispredict penalty of 10-15 cycles
- Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations
- Each x86 instruction gets converted into RISC-like micro-ops: on average, one CISC instruction → 1.37 micro-ops
- Three instructions in each pipeline stage → 3 instructions can simultaneously leave the pipeline → ideal CPµI = 0.33 → ideal CPI = 0.45
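As a back-of-the-envelope check of those numbers (a simple calculation, not from the slides): a 3-wide pipeline retires at most 3 micro-ops per cycle, so ideal CPµI ≈ 0.33, and multiplying by the 1.37 micro-ops per x86 instruction gives the quoted ideal CPI ≈ 0.45.

```python
# Rough check of the P6 ideal-CPI figures quoted above.
issue_width = 3              # micro-ops retired per cycle, best case
uops_per_instr = 1.37        # average micro-ops per x86 instruction

cpu_i = round(1 / issue_width, 2)          # ideal cycles per micro-op -> 0.33
cpi = round(cpu_i * uops_per_instr, 2)     # ideal cycles per x86 instr -> 0.45
print(cpu_i, cpi)
```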

Branch Prediction
- 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate
- For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)
- Mispredict penalty is 10-15 cycles
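A naive first-order estimate of what that mispredict rate costs (the branch frequency below is a hypothetical value, not from the slides, and the next slide explains why simply adding stall terms overestimates real CPI, since stalls overlap):

```python
# Naive additive model: CPI ~ ideal CPI + mispredict stalls per instruction.
ideal_cpi = 0.45
branch_fraction = 0.20     # hypothetical: roughly one branch per five instructions
mispredict_rate = 0.20     # combined mispredict rate from the slide
penalty_cycles = 12.5      # midpoint of the quoted 10-15 cycle penalty

stall_per_instr = branch_fraction * mispredict_rate * penalty_cycles
print(ideal_cpi + stall_per_instr)   # ~0.95 CPI before cache and window stalls
```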

CPI Performance
- Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but it then recovers with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP)
- Because different stalls overlap, CPI is not the sum of the individual stalls → IPC is also an attractive metric

Alternative Designs: Tomasulo's Algorithm
- When an instruction is decoded and dispatched, it is assigned to a reservation station
- The reservation station has an ALU and space for storing operands; ready operands are copied from the register file into the reservation station
- If an operand is not ready, the reservation station keeps track of which reservation station will produce it → this is a form of register renaming
- Instructions are dispatched in order (dispatch stalls if a reservation station is not available), but instructions begin execution out of order

Tomasulo's Algorithm
[Figure: instruction fetch feeds decode & issue, which stalls if no functional unit is available; the register file and a common data bus feed the in1/in2 operand slots of four reservation-station/ALU pairs; register R1 is successively remapped to reservation stations RS3 and RS4.]
Instructions read register operands immediately (avoids WAR), wait for results produced by other ALUs (more names), and then execute; the earlier write to R1 never happens (avoids WAW)
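A minimal sketch of the renaming idea described above (hypothetical data structures and a made-up three-instruction sequence, not the processor's actual implementation): at dispatch, ready operands are copied out of the register file, pending operands are recorded as a tag naming the producing reservation station, and the destination register is remapped to the new station, so a later write to R1 never conflicts with an earlier one.

```python
# Sketch of Tomasulo-style dispatch: operand capture + register renaming.
regfile = {"R1": 10, "R2": 3, "R3": 7}   # architectural register values
rename = {}          # register name -> reservation station that will produce it
stations = []        # dispatched-but-unfinished reservation stations

def dispatch(dest, srcs):
    """Allocate a reservation station; capture ready operands, tag pending ones."""
    rs = {"id": f"RS{len(stations)}", "dest": dest, "ops": []}
    for r in srcs:
        if r in rename:                    # value still being computed elsewhere
            rs["ops"].append(("wait", rename[r]))
        else:                              # value ready: copy it now (avoids WAR)
            rs["ops"].append(("val", regfile[r]))
    rename[dest] = rs["id"]                # later readers wait on this station (avoids WAW)
    stations.append(rs)
    return rs

dispatch("R1", ["R2", "R3"])   # R1 <- R2 op R3
dispatch("R4", ["R1"])         # reads the *new* R1: waits on RS0
dispatch("R1", ["R2"])         # re-renames R1 to RS2; RS0's older write is superseded
for rs in stations:
    print(rs["id"], rs["dest"], rs["ops"])
```

The printed stations show the second instruction waiting on RS0's result while the third remaps R1 to a fresh station, which is exactly the renaming that removes the WAR and WAW hazards.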

Improving Performance
What is the best strategy for improving performance?
- Take a single design and keep pipelining
- Increase capacity: use a larger branch predictor, larger cache, larger ROB/register file
- Increase bandwidth: more instructions fetched/decoded/issued/committed per cycle

Deep Pipelining
- If instructions are independent, deep pipelining allows an instruction to leave the pipeline sooner
- If instructions are dependent, the gap between them (in nanoseconds) widens → there is an optimal pipeline depth
- Some structures are hard to pipeline → register files
- Pipelining can increase bypassing complexity
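To see why an optimal depth exists, here is a toy model with made-up parameters (none of these numbers come from the lecture): a deeper pipeline shortens the clock, but per-stage latch overhead keeps the cycle from shrinking proportionally and every mispredict flushes more stages, so time per instruction follows a U-shaped curve.

```python
# Toy model: time per instruction vs. pipeline depth (all parameters hypothetical).
logic_ns = 20.0          # total combinational logic delay to be split into stages
latch_ns = 0.5           # per-stage latch/clock-skew overhead
miss_per_instr = 0.05    # branch mispredicts per instruction

def time_per_instr(depth):
    cycle = logic_ns / depth + latch_ns      # clock period shrinks with depth
    flush = miss_per_instr * depth * cycle   # mispredict flush cost grows with depth
    return cycle + flush                     # ns per instruction, assuming 1 IPC otherwise

for depth in (5, 10, 20, 40, 80):
    print(depth, round(time_per_instr(depth), 1))   # U-shaped: best near depth ~30 here
```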

Capacity/Bandwidth
- Even if we design a 6-wide processor, most units go under-utilized (average IPC of 2.0!) → hence, increased bandwidth is not going to buy much
- Higher capacity (being able to examine 500 speculative instructions) can increase the chances of finding work and boost IPC → what is the big bottleneck?
- Higher capacity (and higher bandwidth) increases the complexity of each structure and its access time → for example, access times: 32KB cache in 1 cycle, 128KB cache in 2 cycles

Future Microprocessors
- By increasing branch predictor size, window size, number of ALUs, and pipeline stages, IPC and clock speed can improve → however, this is a case of diminishing returns!
- For example, with a window size of 400 and with 10 ALUs, we are likely to find fewer than four instructions to issue every cycle → under-utilization, wasted work, low throughput per watt consumed
- Hence, a more cost-effective solution: build four simple processors in the same area → each processor executes a different thread → high thread throughput, but probably poorer single-application performance

Thread-Level Parallelism
- Motivation: a single thread leaves a processor under-utilized for most of the time → by doubling processor area, single-thread performance barely improves
- Strategies for thread-level parallelism:
  - multiple threads share the same large processor → reduces under-utilization, efficient resource allocation → Simultaneous Multi-Threading (SMT)
  - each thread executes on its own mini processor → simple design, low interference between threads → Chip Multi-Processing (CMP)
