EE382A Lecture 3: Superscalar and Out-of-order Processor Basics
Department of Electrical Engineering, Stanford University
http://eeclass.stanford.edu/ee382a
Lecture 3-1
Announcements
- HW1 is due today: hand it to Davide at the end of lecture or send it by email ASAP. We will contact you about the results within 1-2 days.
- Required paper assigned for Lecture 3: submit your summary by Wed 9/30. Check the instructions on the class webpage.
- Email for John Shen: jpshen@stanford.edu
Lecture 3-2
Dynamic-Static Interface DSI = ISA = a contract between the program and the machine. Lecture 3-3
Lecture 3 Outline
1. From Scalar to Superscalar Pipelines
2. Limits of Instruction-Level Parallelism
3. Superscalar Microprocessor Landscapes
Lecture 3-4
1. From Scalar to Superscalar Pipelines Lecture 3-5
Instruction Pipeline Design
- Uniform sub-computations... NOT! Balancing pipeline stages: stage quantization to yield balanced pipe stages; minimize internal fragmentation (some waiting stages).
- Identical computations... NOT! Unifying instruction types: coalescing instruction types into one multi-function pipe; minimize external fragmentation (some idling stages).
- Independent computations... NOT! Resolving pipeline hazards: inter-instruction dependence detection and resolution; minimize performance loss due to pipeline stalls.
Lecture 3-6
Scalar Pipelined Processors
The 6-stage TYP pipeline, per instruction type:
  Stage        ALU         LOAD        STORE       BRANCH
  1 IF:        I-cache/PC  I-cache/PC  I-cache/PC  I-cache/PC
  2 ID:        decode      decode      decode      decode
  3 RD (OF):   rd. reg.    rd. reg.    rd. reg.    rd. reg.
  4 ALU (EX):  ALU op.     addr. gen.  addr. gen.  addr. gen.
  5 MEM:       -           rd. mem.    wr. mem.    wr. PC
  6 WB (OS):   wr. reg.    wr. reg.    -           -
Lecture 3-7
6-stage TYP Pipeline
[Datapath diagram: PC update and I-cache access (IF), instruction decode (ID), register file read (RD), ALU (ALU), D-cache access (MEM), register file writeback (WB).]
Lecture 3-8
Pipeline Interface to Register File
[Diagram: "add R1 <= R2 + R3" flowing down the pipeline; the ID/RD stages drive the register file read ports (RAdd1/RAdd2, RData1/RData2) into the ALU's S1/S2 inputs, and the WB stage drives the write port (WAdd, WData, W/R).]
Lecture 3-9
6-stage TYP Pipeline Operation
[Diagram: "load R3 <= M[R4 + R5]" in flight; RD reads R4 = x80 and R5 = x04, the ALU stage generates address x84, MEM reads the D-cache, and WB writes the loaded value x99 into R3.]
Lecture 3-10
3 Major Penalty Loops of Pipelining
[Diagram: the IF, ID, RD, ALU, MEM, WB pipeline with its three penalty loops: the ALU (forwarding) penalty loop around the ALU stage, the load penalty loop from MEM back to RD, and the branch penalty loop from MEM back to IF.]
Performance Objective: Reduce CPI to 1.
Lecture 3-11
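The performance cost of these penalty loops can be folded into a back-of-the-envelope effective-CPI model. A minimal sketch; the instruction mix and penalty counts below are invented for illustration, not measurements from any real machine:

```python
# Effective CPI = base CPI + sum(frequency_i * stall_penalty_i) over the
# stall-causing event types. All numbers below are hypothetical.
def effective_cpi(base_cpi, events):
    """events: list of (fraction_of_instructions, stall_cycles) pairs."""
    return base_cpi + sum(freq * penalty for freq, penalty in events)

# Hypothetical mix: 20% loads with a 1-cycle load penalty,
# 15% branches with a 3-cycle branch penalty, no ALU-forwarding stalls.
cpi = effective_cpi(1.0, [(0.20, 1), (0.15, 3)])
print(round(cpi, 2))  # 1.65
```

Even modest per-event penalties push CPI well above the objective of 1, which is why the loops are called "penalty loops."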
Limitations of Scalar Pipelined Processors
- Upper bound on scalar pipeline throughput: limited to IPC = 1  →  Parallel Pipelines
- Inefficient unification into a single pipeline: long latency for each instruction; hazards and associated stalls  →  Diversified Pipelines
- Performance lost due to in-order pipeline: unnecessary stalls  →  Dynamic Pipelines
Lecture 3-12
Parallel Pipelines
[Diagram: (a) no parallelism, (b) temporal parallelism, (c) spatial parallelism, (d) parallel pipeline (both combined).]
Lecture 3-13
Intel Pentium Parallel Pipeline
[Diagram: the five-stage pipeline (IF, D1, D2, EX, WB) with two parallel execution pipes, the U-pipe and the V-pipe.]
Lecture 3-14
Diversified Pipelines
[Diagram: a common IF, ID, RD front end issuing into execution pipes of different lengths: ALU (1 stage), MEM1-MEM2, FP1-FP3, and BR, all converging at WB.]
Lecture 3-15
Power4 Diversified Pipelines
[Diagram: PC and I-cache feed the fetch queue with branch scan and branch predict; decode fills the BR/CR, FX/LD 1, FX/LD 2, and FP issue queues; execution units (FX1, FX2, LD1, LD2, FP1, FP2, CR, BR) plus a store queue and D-cache; a reorder buffer tracks all in-flight instructions.]
Lecture 3-16
Diversified Pipelines
Separate execution pipelines: integer simple, memory, FP, ...
Advantages:
- Reduce instruction latency: each instruction goes to WB as soon as possible; eliminates the need for some forwarding paths.
- Eliminate some unnecessary stalls: e.g., a slow FP instruction does not block independent integer instructions.
Disadvantages??
Lecture 3-17
In-order Issue into Diversified Pipelines
[Diagram: an in-order instruction stream at RD, each instruction Fn(RS, RT) carrying a destination register, function unit, and source registers, issuing into the INT, LD/ST, Fadd1-Fadd2, and Fmult1-Fmult3 pipes.]
The issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
Lecture 3-18
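The four issue checks can be sketched as a simple predicate. The register names and the bookkeeping sets below are illustrative; a real scoreboard tracks this state per functional unit and per register:

```python
# Sketch of the in-order issue checks: an instruction may issue only if it
# passes the structural, RAW, WAW, and WAR tests against older in-flight
# instructions. All state here is a simplified illustration.
def can_issue(inst, busy_units, pending_writes, pending_reads):
    """inst = (func_unit, dest_reg, src_regs).
    busy_units: units occupied by older instructions (structural).
    pending_writes: regs an older instruction will still write.
    pending_reads: regs an older instruction still needs to read."""
    unit, dest, srcs = inst
    if unit in busy_units:                      # 1. structural dependence
        return False
    if any(s in pending_writes for s in srcs):  # 2. RAW: a source is still in flight
        return False
    if dest in pending_writes:                  # 3. WAW: older write to dest pending
        return False
    if dest in pending_reads:                   # 4. WAR: older read of dest pending
        return False
    return True

# An Fadd targeting F1 stalls because an older instruction still writes F2 (RAW).
print(can_issue(("Fadd", "F1", ["F2", "F3"]),
                busy_units=set(), pending_writes={"F2"}, pending_reads=set()))  # False
```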
Dynamic Pipelines
[Diagram: IF, ID, RD feed a dispatch buffer (in order in, out of order out); the diversified ALU, MEM1-MEM2, FP1-FP3, and BR pipes feed a reorder buffer (out of order in, in order out) before WB.]
Lecture 3-19
Designs of Inter-stage Buffers
- Scalar pipeline buffer (simple register): stage i → buffer (1) → stage i+1, in order.
- In-order parallel buffers (wide register or FIFO): stage i → buffer (n) → stage i+1, in order.
- Out-of-order pipeline stages (multiported SRAM and CAM): stage i → buffer (≥ n) → stage i+1, any order.
Lecture 3-20
The Challenges of Out-of-Order
Program order:
  Ia: F1 <= F2 x F3
  ...
  Ib: F1 <= F4 + F5
[Diagram: Ia issues into the three-stage Fmult pipe and Ib into the two-stage Fadd pipe, so they write back out of order:]
Out-of-order WB:
  Ib: F1 <= F4 + F5
  ...
  Ia: F1 <= F2 x F3
What is the value of F1? WAW!!!
Lecture 3-21
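A standard fix for this WAW hazard is register renaming: give each write a fresh physical register, so Ia and Ib no longer target the same location. A minimal sketch, with all names illustrative:

```python
# Minimal register-renaming sketch: every architectural destination gets a
# fresh physical register, so out-of-order writeback can no longer clobber
# the wrong value. Physical register names "P0", "P1", ... are invented.
def rename(program, num_phys=32):
    mapping = {}                    # architectural reg -> current physical reg
    free = iter(range(num_phys))    # free list of physical registers
    out = []
    for dest, srcs in program:
        phys_srcs = [mapping.get(s, s) for s in srcs]  # read current mappings
        new_dest = f"P{next(free)}"                    # allocate a fresh reg
        mapping[dest] = new_dest
        out.append((new_dest, phys_srcs))
    return out

# Ia: F1 <= F2 x F3 and Ib: F1 <= F4 + F5 now write P0 and P1 respectively,
# so the WAW dependence on F1 disappears.
renamed = rename([("F1", ["F2", "F3"]), ("F1", ["F4", "F5"])])
print(renamed)  # [('P0', ['F2', 'F3']), ('P1', ['F4', 'F5'])]
```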
Dynamic-Static Interface
DSI = ISA = a contract between the program and the machine.
Architectural state vs. microarchitecture state.
Architectural state requirements:
- Support sequential instruction execution semantics.
- Support precise servicing of exceptions and interrupts.
Buffering needed between arch and uarch states (ROB):
- Allow uarch state to deviate from arch state.
- Be able to undo speculative uarch state if needed.
Lecture 3-22
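The ROB buffering described above can be sketched as a toy data structure: results finish out of order, architectural state is updated only at in-order commit, and a flush discards speculative state. This is an illustrative sketch, not any real machine's design:

```python
# Toy reorder buffer: uarch results are buffered in program order; only
# in-order commit updates the architectural registers, and flush() undoes
# speculative state by discarding uncommitted entries.
from collections import deque

class ROB:
    def __init__(self):
        self.entries = deque()   # entries in program order
        self.arch_regs = {}      # architectural state

    def dispatch(self, dest):
        self.entries.append({"dest": dest, "value": None, "done": False})
        return len(self.entries) - 1

    def finish(self, idx, value):        # out-of-order completion
        self.entries[idx]["value"] = value
        self.entries[idx]["done"] = True

    def commit(self):                    # in-order retirement from the head
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.arch_regs[e["dest"]] = e["value"]

    def flush(self):                     # undo speculative uarch state
        self.entries.clear()

rob = ROB()
i0 = rob.dispatch("R1"); i1 = rob.dispatch("R2")
rob.finish(i1, 7)        # the younger instruction finishes first...
rob.commit()
print(rob.arch_regs)     # {}: R2 cannot commit past the unfinished R1
rob.finish(i0, 3)
rob.commit()
print(rob.arch_regs)     # {'R1': 3, 'R2': 7}
```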
Modern Superscalar Processor
[Pipeline diagram: Fetch → instruction/decode buffer → Decode → Dispatch (in order); dispatch buffer → Issue → reservation stations → Execute → Finish (out of order); reorder/completion buffer → Complete → store buffer → Retire (in order).]
Lecture 3-23
Impediments to Superscalar Performance
[Diagram: FETCH (I-cache, branch predictor, instruction buffer), DECODE, EXECUTE (integer, floating-point, media, and memory units with the ALU, load, and branch penalty loops), and COMMIT (reorder buffer, store queue, D-cache), annotated with the three limiters: instruction flow, register data flow, and memory data flow.]
Lecture 3-24
2. Limits of Instruction-Level Parallelism Lecture 3-25
Amdahl's Law
[Diagram: N processors active for the fraction f of execution time, 1 processor for the fraction 1 - f.]
- f = fraction that is vectorizable code
- (1 - f) = fraction of time in serial code
- N = speedup for f
Overall speedup:
  Speedup = 1 / ((1 - f) + f/N)
Lecture 3-26
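The formula above is easy to check numerically; a minimal sketch:

```python
# Amdahl's Law: f is the vectorizable fraction, N the speedup applied to it.
def amdahl(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# Even N = 4 applied to 80% of the work yields only 2.5x overall.
print(round(amdahl(0.8, 4), 2))  # 2.5
```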
Revisit Amdahl's Law
Sequential bottleneck: even if N is infinite,
  lim (N→∞) 1 / ((1 - f) + f/N) = 1 / (1 - f)
Performance is limited by the non-vectorizable portion (1 - f).
[Diagram: N processors active for the fraction f of execution time, 1 processor for the fraction 1 - f.]
Lecture 3-27
Pipelined Processor Performance Model
[Diagram: a pipeline of depth N, filled for the fraction g of the time.]
- g = fraction of time the pipeline is filled
- 1 - g = fraction of time the pipeline is not filled (stalled)
Lecture 3-28
Pipelined Processor Performance Model
Tyranny of Amdahl's Law [Bob Colwell]:
- When g is even slightly below 100%, a big performance hit will result.
- Stalled cycles in the pipeline are the key adversary and must be minimized as much as possible.
- Can we somehow fill the pipeline bubbles (stalled cycles)?
Lecture 3-29
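The "big performance hit" can be quantified with the same algebra as Amdahl's Law, with g playing the role of f; a small sketch:

```python
# Speedup of an N-deep pipeline that is filled only a fraction g of the time;
# same form as Amdahl's Law with f = g.
def pipeline_speedup(g, n):
    return 1.0 / ((1.0 - g) + g / n)

# Just 5% of stalled cycles costs a 10-deep pipeline about 30% of its speedup.
print(round(pipeline_speedup(1.00, 10), 2))  # 10.0
print(round(pipeline_speedup(0.95, 10), 2))  # 6.9
```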
Motivation for Superscalar [Agerwala and Cocke]
[Graph: speedup vs. vectorizability f for n = 100, 12, 6, 4, with the typical range of f marked. Speedup jumps from 3 to 4.3 for N = 6, f = 0.8 when s = 2 instead of s = 1 (scalar).]
Lecture 3-30
Superscalar Proposal
- Moderate the tyranny of Amdahl's Law: ease the sequential bottleneck.
- More generally applicable; robust (less sensitive to f).
Revised Amdahl's Law:
  Speedup = 1 / ((1 - f)/s + f/N)
Lecture 3-31
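The revised formula reproduces the jump quoted on the Agerwala and Cocke slide (N = 6, f = 0.8, with s = 2 instead of s = 1):

```python
# Revised Amdahl's Law: the non-vectorizable fraction (1 - f) now runs at
# rate s rather than 1.
def revised_amdahl(f, n, s):
    return 1.0 / ((1.0 - f) / s + f / n)

print(round(revised_amdahl(0.8, 6, 1), 1))  # 3.0  (scalar baseline, s = 1)
print(round(revised_amdahl(0.8, 6, 2), 1))  # 4.3  (s = 2)
```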
Limits on Instruction Level Parallelism (ILP)
  Weiss and Smith [1984]       1.58
  Sohi and Vajapeyam [1987]    1.81
  Tjaden and Flynn [1970]      1.86  (Flynn's bottleneck)
  Tjaden and Flynn [1973]      1.96
  Uht [1986]                   2.00
  Smith et al. [1989]          2.00
  Jouppi and Wall [1988]       2.40
  Johnson [1991]               2.50
  Acosta et al. [1986]         2.79
  Wedig [1982]                 3.00
  Butler et al. [1991]         5.8
  Melvin and Patt [1991]       6
  Wall [1991]                  7     (Jouppi disagreed)
  Kuck et al. [1972]           8
  Riseman and Foster [1972]    51    (no control dependences)
  Nicolau and Fisher [1984]    90    (Fisher's optimism)
Lecture 3-32
The Ideas Behind Modern Processors
- Superscalar or wide instruction issue: ideal IPC = n (CPI = 1/n).
- Diversified pipelines: different instructions go through different pipe stages; instructions go through needed stages only.
- Out-of-order or data-flow execution: stall only on RAW hazards and structural hazards.
- Speculation: overcome (some) RAW hazards through prediction.
And it all relies on Instruction Level Parallelism (ILP): independent instructions within sequential programs.
Lecture 3-33
Architectures for Instruction-Level Parallelism
Scalar pipeline (baseline):
- Instruction parallelism = D
- Operation latency = 1
- Peak IPC = 1
[Diagram: successive instructions vs. time in cycles (of the baseline machine); one instruction enters the IF, DE, EX, WB pipeline each cycle.]
Lecture 3-34
Superpipelined Processors
Superpipelined execution:
- IP = D x M
- OL = M minor cycles
- Peak IPC = 1 per minor cycle (M per baseline cycle)
- Major cycle = M minor cycles
[Diagram: instructions issuing every minor cycle through the IF, DE, EX, WB stages.]
Lecture 3-35
Superscalar Processors
Superscalar (pipelined) execution:
- IP = D x N
- OL = 1 baseline cycle
- Peak IPC = N per baseline cycle
[Diagram: N instructions entering the IF, DE, EX, WB pipeline each cycle.]
Lecture 3-36
Superscalar and Superpipelined
Superscalar parallelism:
- Operation latency: 1
- Issuing rate: N
- Superscalar degree (SSD): N (determined by issue rate)
Superpipelined parallelism:
- Operation latency: M
- Issuing rate: 1
- Superpipelined degree (SPD): M (determined by operation latency)
[Diagram: superscalar vs. superpipelined execution (IFetch, Dcode, Execute, Writeback) over time in cycles of the base machine.]
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e., if N = M then both have about the same IPC.
Lecture 3-37
3. Superscalar Microprocessor Landscapes Lecture 3-38
Iron Law of Processor Performance
  1 / Processor Performance = Time / Program
  Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
                 (inst. count)            (CPI)                  (cycle time)
  Processor Performance = (IPC x GHz) / inst. count
Lecture 3-39
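The Iron Law can be exercised with a small numeric sketch; the instruction counts, CPIs, and cycle times below are invented for illustration:

```python
# Iron Law: time/program = (instructions/program) x CPI x (time/cycle).
def exec_time_ns(inst_count, cpi, cycle_time_ns):
    return inst_count * cpi * cycle_time_ns

# A higher-frequency design can still lose if its CPI rises enough.
a = exec_time_ns(1_000_000, 1.0, 1.0)   # 1 GHz, CPI 1.0
b = exec_time_ns(1_000_000, 1.6, 0.7)   # ~1.43 GHz, CPI 1.6
print(a < b)  # True: the lower-frequency machine finishes first here
```

This is the frequency-vs-IPC tradeoff revisited two slides later.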
Landscape of Microprocessor Families (SPECint92) Lecture 3-40
Landscape of Microprocessor Families (SPECint95)
[Scatter plot: SPECint95/MHz vs. frequency (MHz) for the AMD-x86, Intel-x86, and Alpha families (Pentium, PPro, PII, PIII, Athlon, Alpha 21064/21164/21264), with iso-SPECint95 curves from 5 to 60.]
** Data source: www.spec.org
Lecture 3-41
Landscape of Microprocessor Families (SPECint2000)
[Scatter plot: SPECint2000/MHz vs. frequency (MHz) for the Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, and IPF families (604e, PIII-Xeon, Alpha 21264A/B/C, Itanium, Sparc-III, Athlon, Pentium 4), with iso-SPECint2000 curves from 25 to 800.]
** Data source: www.spec.org
Lecture 3-42
Frequency vs. Parallelism
- Increase frequency (GHz): deeper pipelines → increased overall latency, lower IPC.
- Increase instruction parallelism (IPC): wider pipelines → increased complexity, lower GHz.
Lecture 3-43
Deeper and Wider Pipelines
[Diagram: a shallow Fetch, Dec., Disp., Exec., Mem., Retire pipeline vs. a deeper, wider Fetch, Decode, Dispatch, Execute, Memory, Retire pipeline; the branch mispredict penalty spans all the stages from fetch to execute.]
Lecture 3-44
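The mispredict penalty's dependence on pipe depth can be captured in a toy model (the depths and mispredict rates below are invented for illustration):

```python
# Toy model: each mispredict wastes roughly the front-end depth (the stages
# between fetch and where the branch resolves) in cycles, so the CPI it adds
# scales linearly with that depth.
def mispredict_cpi_add(frontend_depth, mispredicts_per_inst):
    return frontend_depth * mispredicts_per_inst

print(round(mispredict_cpi_add(5, 0.01), 2))   # 0.05 extra CPI
print(round(mispredict_cpi_add(20, 0.01), 2))  # 0.2: 4x deeper front end, 4x the penalty
```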
Front-End Pipe-Depth Penalty
[Diagram: contracting the front-end stages (Fetch, Decode, Dispatch) and optimizing the back-end stages (Execute, Memory, Retire) shortens the mispredict loop.]
Lecture 3-45
Alleviate Pipe-Depth Penalty
Front-end contraction:
- Code re-mapping and caching
- Trace construction, caching, optimization
- Leverage back-end optimizations
Back-end optimization:
- Multiple-branch, trace, and stream prediction
- Code reordering, alignment, optimization
- Pre-decode, pre-rename, pre-scheduling
- Memory pre-fetch prediction and control
Lecture 3-46
Execution Core Improvement
[Pipeline diagram: Fetch, Decode, Dispatch, Execute, Memory, Retire, with an optimize step around the execution core.]
- Super-pipelined ALU design; very high-speed arithmetic units
- Speculative OoO execution
- Criticality-based data caching
- Aggressive data pre-fetching
Lecture 3-47
Next Lecture
Superscalar Pipeline Implementation:
- Instruction fetch
- Instruction decode
- Instruction dispatch
- Instruction execute
- Instruction complete and retire
Instruction Flow Techniques
Lecture 3-48