EE382A Lecture 3: Superscalar and Out-of-order Processor Basics


EE382A Lecture 3: Superscalar and Out-of-order Processor Basics Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 3-1

Announcements HW1 is due today: hand it to Davide at the end of lecture or send it by email ASAP; we will contact you about the results within 1-2 days. Required paper assigned for Lecture 3: submit your summary by Wed 9/30; check the instructions on the class webpage. Email for John Shen: jpshen@stanford.edu Lecture 3-2

Dynamic-Static Interface DSI = ISA = a contract between the program and the machine. Lecture 3-3

Lecture 3 Outline 1. From Scalar to Superscalar Pipelines 2. Limits of Instruction-Level Parallelism 3. Superscalar Microprocessor Landscapes Lecture 3-4

1. From Scalar to Superscalar Pipelines Lecture 3-5

Instruction Pipeline Design Uniform sub-computations... NOT! Balancing pipeline stages: stage quantization to yield balanced pipe stages; minimize internal fragmentation (some waiting stages). Identical computations... NOT! Unifying instruction types: coalescing instruction types into one multi-function pipe; minimize external fragmentation (some idling stages). Independent computations... NOT! Resolving pipeline hazards: inter-instruction dependence detection and resolution; minimize performance loss due to pipeline stalls. Lecture 3-6

Scalar Pipelined Processors The 6-stage TYP pipeline, with the stage each instruction type uses:
  Stage 1 IF: I-cache access and PC update (all types)
  Stage 2 ID: decode (all types)
  Stage 3 OF/RD: read registers (all types)
  Stage 4 EX: ALU op (ALU); address generation (LOAD, STORE, BRANCH)
  Stage 5 MEM: read memory (LOAD)
  Stage 6 OS/WB: write register (ALU, LOAD); write memory (STORE); write PC (BRANCH)
Lecture 3-7

6-stage TYP Pipeline [Figure: datapath of the 6-stage pipeline: IF (I-cache access, PC update via adder), ID (instruction decode), RD (register file read), ALU, MEM (address add, D-cache access), WB (register file write)] Lecture 3-8

Pipeline Interface to Register File [Figure: the instruction add R1 <= R2 + R3 flowing through IF, ID, RD, ALU, MEM, WB; the register file has one write port (WAdd, WData, W/R) used in WB and two read ports (RAdd1/RData1, RAdd2/RData2) used in RD] Lecture 3-9

6-stage TYP Pipeline Operation [Figure: the instruction load R3 <= M[R4 + R5] in the datapath: R4 and R5 are read in RD (example values x80 and x04), the ALU forms the effective address x84, the D-cache is read in MEM, and the loaded data x99 is written to R3 in WB] Lecture 3-10

3 Major Penalty Loops of Pipelining [Figure: three feedback loops over the IF-ID-RD-ALU-MEM-WB pipeline: the ALU penalty loop, the load penalty loop, and the branch penalty loop] Performance Objective: Reduce CPI to 1. Lecture 3-11
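The cost of these penalty loops can be sketched numerically. The instruction mix and penalty cycles below are illustrative assumptions, not figures from the lecture:

```python
# Effective CPI of a scalar pipeline with stall penalties (illustrative numbers).
# Base CPI is 1; each hazard type adds (frequency x penalty) stall cycles.

def effective_cpi(base_cpi, hazards):
    """hazards: list of (fraction_of_instructions, stall_cycles) pairs."""
    return base_cpi + sum(freq * penalty for freq, penalty in hazards)

# Assumed mix: 20% loads with a 1-cycle load-use penalty,
# 15% branches with a 3-cycle branch penalty.
cpi = effective_cpi(1.0, [(0.20, 1), (0.15, 3)])
print(cpi)  # 1.65 -- stalls push CPI well above the objective of 1
```

Even these modest penalties cost 65% over the ideal CPI of 1, which is why the three loops dominate pipeline design.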

Limitations of Scalar Pipelined Processors Upper bound on scalar pipeline throughput (limited by IPC = 1), addressed by parallel pipelines. Inefficient unification into a single pipeline (long latency for each instruction; hazards and associated stalls), addressed by diversified pipelines. Performance lost due to the in-order pipeline (unnecessary stalls), addressed by dynamic pipelines. Lecture 3-12

Parallel Pipelines [Figure: (a) no parallelism, (b) temporal parallelism, (c) spatial parallelism, (d) parallel pipeline] Lecture 3-13

Intel Pentium Parallel Pipeline [Figure: two pipelines, the U-pipe and the V-pipe, each with stages IF, D1, D2, EX, WB] Lecture 3-14

Diversified Pipelines [Figure: common IF, ID, RD stages feeding four execution pipelines (ALU; MEM1-MEM2; FP1-FP2-FP3; BR), all merging at WB] Lecture 3-15

Power4 Diversified Pipelines [Figure: I-cache and PC feed a fetch queue with branch scan and branch prediction; decode feeds the BR/CR, FX/LD 1, FX/LD 2, and FP issue queues plus the reorder buffer; execution units are FX1, FX2, LD1, LD2, FP1, FP2, CR, and BR, with a store queue in front of the D-cache] Lecture 3-16

Diversified Pipelines Separate execution pipelines: simple integer, memory, FP, etc. Advantages: reduced instruction latency (each instruction reaches WB as soon as possible, eliminating the need for some forwarding paths) and elimination of some unnecessary stalls (e.g. a slow FP instruction does not block independent integer instructions). Disadvantages?? Lecture 3-17

In-order Issue into Diversified Pipelines [Figure: an in-order instruction stream at the RD stage; each instruction Fn (RS, RT) carries a destination register, a function unit, and source registers; function units are INT, LD/ST, Fadd1-Fadd2, and Fmult1-Fmult3] Issue stage needs to check: 1. structural dependence, 2. RAW hazard, 3. WAW hazard, 4. WAR hazard. Lecture 3-18
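The four issue-stage checks amount to comparing a candidate instruction against the instructions already in flight. The tuple encoding (destination register, source registers, function unit) and the helper below are hypothetical, for illustration only:

```python
# Sketch of the four issue-stage checks between a candidate instruction and
# instructions already in flight. The (dest, srcs, unit) encoding is made up
# for illustration; it is not the TYP pipeline's actual format.

def issue_checks(cand, in_flight, busy_units):
    dest, srcs, unit = cand
    hazards = []
    if unit in busy_units:
        hazards.append("structural")   # needed function unit is occupied
    for fdest, fsrcs, _ in in_flight:
        if fdest in srcs:
            hazards.append("RAW")      # we read what an older inst. writes
        if fdest == dest:
            hazards.append("WAW")      # we write what an older inst. writes
        if dest in fsrcs:
            hazards.append("WAR")      # we write what an older inst. reads
    return hazards

# Older instruction in flight: F1 <= F2 x F3 (Fmult).
# Candidate: F1 <= F4 + F5 (Fadd). Both write F1, so this is a WAW hazard.
print(issue_checks(("F1", ("F4", "F5"), "Fadd"),
                   [("F1", ("F2", "F3"), "Fmult")],
                   busy_units=set()))  # ['WAW']
```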

Dynamic Pipelines [Figure: IF, ID, RD feed a dispatch buffer (entered in order, drained out of order); the out-of-order execution pipelines (ALU; MEM1-MEM2; FP1-FP3; BR) feed a reorder buffer (entered out of order, drained in order) ahead of WB] Lecture 3-19

Designs of Inter-stage Buffers [Figure: three designs between stage i and stage i+1: (1) the scalar pipeline buffer, a simple register, in order; (2) in-order parallel buffers, a wide register or FIFO with n entries between n-wide stages; (3) out-of-order pipeline stages, a buffer of >= n entries built from multiported SRAM and CAM, entered and drained in any order] Lecture 3-20

The Challenges of Out-of-Order [Figure: diversified pipeline with out-of-order writeback] Program order: Ia: F1 <= F2 x F3, then Ib: F1 <= F4 + F5. With out-of-order writeback, Ib writes F1 before Ia does. What is the value of F1? WAW!!! Lecture 3-21

Dynamic-Static Interface DSI = ISA = a contract between the program and the machine. Architectural state vs. microarchitecture state. Architectural state requirements: support sequential instruction execution semantics; support precise servicing of exceptions and interrupts. Buffering is needed between arch and uarch states (the ROB): allow uarch state to deviate from arch state, and be able to undo speculative uarch state if needed. Lecture 3-22
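The buffering idea can be sketched as a toy reorder buffer: speculative writes deviate from architectural state, retire in program order, and can be undone. This is a minimal illustrative model, not a real ROB design:

```python
# Toy model of ROB-style buffering between uarch and arch state (illustrative).

class ToyROB:
    def __init__(self):
        self.arch_regs = {}   # architectural register state
        self.entries = []     # in-order speculative writes: (reg, value)

    def writeback(self, reg, value):
        # uarch state deviates from arch state: result is buffered, not committed
        self.entries.append((reg, value))

    def commit_oldest(self):
        # retire in program order: oldest result becomes architectural
        reg, value = self.entries.pop(0)
        self.arch_regs[reg] = value

    def squash(self):
        # undo all speculative uarch state (e.g. on an exception)
        self.entries.clear()

rob = ToyROB()
rob.writeback("R1", 42)
rob.writeback("R2", 7)
rob.commit_oldest()        # R1 becomes architectural
rob.squash()               # R2's speculative write is undone
print(rob.arch_regs)       # {'R1': 42}
```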

Modern Superscalar Processor [Figure: Fetch into an instruction/decode buffer, Decode, and Dispatch through a dispatch buffer, all in order; Issue from reservation stations, Execute, and Finish, all out of order; Complete from the reorder/completion buffer and Retire through the store buffer, in order] Lecture 3-23

Impediments to Superscalar Performance [Figure: instruction flow (I-cache, branch predictor, and instruction buffer in the FETCH and DECODE stages) limited by the branch penalty; register data flow (integer, floating-point, media, and memory pipelines in EXECUTE) limited by the ALU penalty; memory data flow (reorder buffer, store queue, and D-cache in COMMIT) limited by the load penalty] Lecture 3-24

2. Limits of Instruction-Level Parallelism Lecture 3-25

Amdahl's Law [Figure: N processors are busy for the vectorizable fraction f of the time, one processor for the remaining 1 - f] f = fraction of time that is vectorizable code; (1 - f) = fraction of time in serial code; N = speedup for f. Overall speedup: Speedup = 1 / ((1 - f) + f/N) Lecture 3-26

Revisit Amdahl's Law The sequential bottleneck: even if N is infinite, lim (N -> inf) 1 / ((1 - f) + f/N) = 1 / (1 - f), so performance is limited by the non-vectorizable portion (1 - f). Lecture 3-27
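The sequential bottleneck is easy to see numerically; a minimal sketch using the slide's f = 0.8:

```python
# Amdahl's Law: overall speedup for vectorizable fraction f and N-way speedup.

def amdahl(f, n):
    return 1.0 / ((1.0 - f) + f / n)

print(amdahl(0.8, 6))       # 3.0
print(amdahl(0.8, 10**9))   # ~5.0: even as N -> infinity, bounded by 1/(1-f)
```

With f = 0.8 the speedup can never exceed 1/(1 - 0.8) = 5, no matter how many processors are added.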

Pipelined Processor Performance Model [Figure: a pipeline of depth N is full for fraction g of the time and empty for the remaining 1 - g] g = fraction of time the pipeline is filled; 1 - g = fraction of time the pipeline is not filled (stalled). Lecture 3-28

Pipelined Processor Performance Model The tyranny of Amdahl's Law [Bob Colwell]: when g is even slightly below 100%, a big performance hit will result. Stalled cycles in the pipeline are the key adversary and must be minimized as much as possible. Can we somehow fill the pipeline bubbles (stalled cycles)? Lecture 3-29
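The sensitivity to g can be sketched by applying the same Amdahl-style formula to the pipeline-fill model (pipeline depth N in place of processor count, g in place of f):

```python
# Pipeline-fill model: speedup of an N-deep pipeline that is filled
# a fraction g of the time (Amdahl's Law with f = g).

def pipeline_speedup(g, n):
    return 1.0 / ((1.0 - g) + g / n)

# Even a small drop in g costs a lot -- the "tyranny of Amdahl's Law":
print(pipeline_speedup(1.00, 10))  # 10.0
print(pipeline_speedup(0.95, 10))  # ~6.9
print(pipeline_speedup(0.90, 10))  # ~5.3
```

Dropping g from 100% to 95% already costs about 30% of the ideal 10x speedup, which is why stalled cycles are the key adversary.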

Motivation for Superscalar [Agerwala and Cocke] [Figure: speedup vs. vectorizability f for N = 4, 6, 12, and 100, with the typical range of f marked; the curve for N = 6, s = 2 lies above the N = 6, s = 1 curve] Speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when s = 2 instead of s = 1 (scalar). Lecture 3-30

Superscalar Proposal Moderate the tyranny of Amdahl's Law: ease the sequential bottleneck; more generally applicable; robust (less sensitive to f). Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/N) Lecture 3-31
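A quick check of the revised formula against the numbers on the previous slide (N = 6, f = 0.8):

```python
# Revised Amdahl's Law: the non-vectorizable fraction also runs at speedup s
# (e.g. a superscalar core issuing s instructions per cycle on serial code).

def revised_amdahl(f, n, s):
    return 1.0 / ((1.0 - f) / s + f / n)

print(revised_amdahl(0.8, 6, s=1))  # 3.0 (scalar baseline)
print(revised_amdahl(0.8, 6, s=2))  # ~4.3
```

Speeding up the serial fraction (s = 2) lifts the overall speedup from 3 to about 4.3, exactly the jump quoted on the motivation slide.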

Limits on Instruction Level Parallelism (ILP) Weiss and Smith [1984]: 1.58; Sohi and Vajapeyam [1987]: 1.81; Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck); Tjaden and Flynn [1973]: 1.96; Uht [1986]: 2.00; Smith et al. [1989]: 2.00; Jouppi and Wall [1988]: 2.40; Johnson [1991]: 2.50; Acosta et al. [1986]: 2.79; Wedig [1982]: 3.00; Butler et al. [1991]: 5.8; Melvin and Patt [1991]: 6; Wall [1991]: 7 (Jouppi disagreed); Kuck et al. [1972]: 8; Riseman and Foster [1972]: 51 (no control dependences); Nicolau and Fisher [1984]: 90 (Fisher's optimism). Lecture 3-32

The Ideas Behind Modern Processors Superscalar or wide instruction issue: ideal IPC = n (CPI = 1/n). Diversified pipelines: different instructions go through different pipe stages, and only the stages they need. Out-of-order or data-flow execution: stall only on RAW hazards and structural hazards. Speculation: overcome (some) RAW hazards through prediction. And it all relies on Instruction Level Parallelism (ILP): independent instructions within sequential programs. Lecture 3-33

Architectures for Instruction-Level Parallelism Scalar pipeline (baseline): instruction parallelism = D, operation latency = 1, peak IPC = 1. [Figure: successive instructions through IF, DE, EX, WB over time in cycles of the baseline machine] Lecture 3-34

Superpipelined Processors Superpipelined execution: instruction parallelism = D x M, operation latency = M minor cycles, peak IPC = 1 per minor cycle (M per baseline cycle); one major cycle = M minor cycles. [Figure: IF, DE, EX, WB subdivided into minor cycles] Lecture 3-35

Superscalar Processors Superscalar (pipelined) execution: instruction parallelism = D x N, operation latency = 1 baseline cycle, peak IPC = N per baseline cycle. [Figure: N instructions entering IF, DE, EX, WB each cycle] Lecture 3-36

Superscalar and Superpipelined Superscalar parallelism: operation latency 1, issuing rate N, superscalar degree (SSD) = N (determined by issue rate). Superpipeline parallelism: operation latency M, issuing rate 1, superpipelined degree (SPD) = M (determined by operation latency). [Figure: IFetch, Dcode, Execute, Writeback timing of both machines over time in cycles of the base machine] Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC. Lecture 3-37

3. Superscalar Microprocessor Landscapes Lecture 3-38

Iron Law of Processor Performance 1/Performance = Time/Program. Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle) = inst. count x CPI x cycle time. Hence Processor Performance = (IPC x GHz) / inst. count. Lecture 3-39
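The Iron Law makes frequency/IPC trade-offs concrete; the design points below are illustrative assumptions, not figures from the lecture:

```python
# Iron Law: time/program = inst_count x CPI x cycle_time = inst_count x CPI / f.

def exec_time(inst_count, cpi, freq_hz):
    return inst_count * cpi / freq_hz

# Same 10^9-instruction program on two hypothetical designs:
t_deep = exec_time(1e9, cpi=1.5, freq_hz=3e9)   # deeper pipeline, faster clock
t_wide = exec_time(1e9, cpi=0.8, freq_hz=2e9)   # wider issue, slower clock
print(t_deep, t_wide)  # 0.5 s vs. 0.4 s -- the wide design wins here
```

Neither GHz nor IPC alone determines performance; only their product does, which is the point of the landscape plots that follow.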

Landscape of Microprocessor Families (SPECint92) Lecture 3-40

Landscape of Microprocessor Families (SPECint95) [Figure: SPECint95/MHz vs. frequency (MHz) for Intel x86 (Pentium, PPro, PII, PIII), AMD x86 (Athlon), and Alpha (21064, 21164, 21264). Data source: www.spec.org] Lecture 3-41

Landscape of Microprocessor Families (SPECint2000) [Figure: SPECint2000/MHz vs. frequency (MHz) for Intel x86 (PIII-Xeon, Pentium 4), AMD x86 (Athlon), Alpha (21264A, 21264B, 21264C), PowerPC (604e), Sparc (Sparc-III), and IPF (Itanium). Data source: www.spec.org] Lecture 3-42

Frequency vs. Parallelism Increase frequency (GHz): deeper pipelines, increased overall latency, lower IPC. Increase instruction parallelism (IPC): wider pipelines, increased complexity, lower GHz. Lecture 3-43

Deeper and Wider Pipelines [Figure: a Fetch-Decode-Dispatch-Execute-Memory-Retire pipeline shown both deeper and wider; the branch mispredict penalty loop spans from Execute back to Fetch] Lecture 3-44

Front-End Pipe-Depth Penalty [Figure: in the Fetch-Decode-Dispatch-Execute-Memory-Retire pipeline, front-end contraction shortens the stages before Execute while back-end optimization targets the stages after it] Lecture 3-45

Alleviate Pipe-Depth Penalty Front-end contraction: code re-mapping and caching; trace construction, caching, and optimization; leverage back-end optimizations. Back-end optimization: multiple-branch, trace, and stream prediction; code reordering, alignment, and optimization; pre-decode, pre-rename, and pre-scheduling; memory pre-fetch prediction and control. Lecture 3-46

Execution Core Improvement [Figure: the Execute stage of the Fetch-Decode-Dispatch-Execute-Memory-Retire pipeline highlighted] Super-pipelined ALU design; very high-speed arithmetic units; speculative OoO execution; criticality-based data caching; aggressive data pre-fetching. Lecture 3-47

Next Lecture Superscalar Pipeline Implementation: Instruction fetch Instruction decode Instruction dispatch Instruction execute Instruction complete and retire Instruction Flow Techniques Lecture 3-48