EE382A Lecture 3: Superscalar and Out-of-order Processor Basics



1 EE382A Lecture 3: Superscalar and Out-of-order Processor Basics Department of Electrical Engineering Stanford University Lecture 3-1

2 Announcements
- HW1 is due today
  - Hand to Davide at the end of lecture or send ASAP
  - Will contact you about the results within 1-2 days
- Required paper assigned for Lecture 3
  - Submit summary by Wed 9/30
  - Check instructions on the class webpage
- John Shen: jpshen@stanford.edu

3 Dynamic-Static Interface DSI = ISA = a contract between the program and the machine.

4 Lecture 3 Outline
1. From Scalar to Superscalar Pipelines
2. Limits of Instruction-Level Parallelism
3. Superscalar Microprocessor Landscapes

5 1. From Scalar to Superscalar Pipelines

6 Instruction Pipeline Design
Uniform Sub-computations... NOT! Balancing pipeline stages
- Stage quantization to yield balanced pipe stages
- Minimize internal fragmentation (some waiting stages)
Identical Computations... NOT! Unifying instruction types
- Coalescing instruction types into one multi-function pipe
- Minimize external fragmentation (some idling stages)
Independent Computations... NOT! Resolving pipeline hazards
- Inter-instruction dependence detection and resolution
- Minimize performance loss due to pipeline stalls

7 Scalar Pipelined Processors
The 6-stage TYPICAL (TYP) pipeline: IF (1), ID (2), RD (3), ALU (4), MEM (5), WB (6)
Generic sub-computations: IF, ID, OF, EX, OS
Operations by instruction type, in order:
ALU:    I-cache/PC, decode, rd. reg., ALU op., wr. reg.
LOAD:   I-cache/PC, decode, rd. reg., addr. gen., rd. mem., wr. reg.
STORE:  I-cache/PC, decode, rd. reg., addr. gen., wr. mem.
BRANCH: I-cache/PC, decode, rd. reg., addr. gen., wr. PC

8 6-stage TYP Pipeline
[Figure: datapath of the 6-stage TYP pipeline. Stages: IF, ID, RD, ALU, MEM, WB. Blocks: I-Cache, Instruction Decode, Register File, ALU, D-Cache, Update PC.]

9 Pipeline Interface to Register File
[Figure: register file ports (RAdd1, RAdd2, RData1, RData2, WAdd, WData, W/R) connected to the pipeline stages IF, ID, RD, ALU, MEM, WB. Example: add R1 <= R2 + R3; values shown: x0123, x0123, x0246.]

10 6-stage TYP Pipeline Operation
[Figure: the TYP datapath executing load R3 <= M[R4 + R5]; values shown: x80, x04, x84, x99.]

11 3 Major Penalty Loops of Pipelining
[Figure: pipeline stages IF, ID, RD, ALU, MEM, WB with three feedback loops marked]
- ALU penalty
- Load penalty
- Branch penalty
Performance Objective: Reduce CPI to 1.
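
One way to see why the three penalty loops matter is to fold them into an average CPI. The sketch below is only an illustration: the event frequencies and penalty cycles are made-up assumptions, not figures from the lecture, and `effective_cpi` is a hypothetical helper name.

```python
# Hypothetical model: CPI = 1 + sum(events per instruction * penalty cycles).
# All frequencies and penalties below are illustrative assumptions.

def effective_cpi(events):
    """events: iterable of (events_per_instruction, penalty_cycles) pairs."""
    return 1.0 + sum(freq * penalty for freq, penalty in events)

penalty_loops = {
    "ALU penalty":    (0.10, 1),  # assumed: 10% of instructions stall 1 cycle
    "load penalty":   (0.05, 2),  # assumed: 5% of instructions stall 2 cycles
    "branch penalty": (0.04, 4),  # assumed: 4% of instructions stall 4 cycles
}

cpi = effective_cpi(penalty_loops.values())
print(f"CPI = {cpi:.2f}")  # prints CPI = 1.36
```

Under these assumed numbers the branch loop contributes the most, even though it is the rarest event, because its penalty spans the most stages.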

12 Limitations of Scalar Pipelined Processors
- Upper bound on scalar pipeline throughput (addressed by Parallel Pipelines)
  - Limited by IPC = 1
- Inefficient unification into a single pipeline (addressed by Diversified Pipelines)
  - Long latency for each instruction
  - Hazards and associated stalls
- Performance lost due to in-order pipeline (addressed by Dynamic Pipelines)
  - Unnecessary stalls

13 Parallel Pipelines
[Figure panels: (a) No Parallelism, (b) Temporal Parallelism, (c) Spatial Parallelism, (d) Parallel Pipeline]

14 Intel Pentium Parallel Pipeline
[Figure: Pentium pipeline stages IF, D1, D2, EX, WB with the U-Pipe and V-Pipe]

15 Diversified Pipelines
[Figure: IF, ID, RD, then an EX stage split into ALU, MEM1-MEM2, FP1-FP2-FP3, and BR pipes, then WB]

16 Power4 Diversified Pipelines
[Figure: I-Cache, PC, Fetch Q, BR Scan, BR Predict, Decode, Reorder Buffer; issue queues: FP Issue Q, FX/LD 1 Issue Q, FX/LD 2 Issue Q, BR/CR Issue Q; units: FX1, FX2, LD1, LD2, FP1, FP2, CR, BR; StQ, D-Cache]

17 Diversified Pipelines
Separate execution pipelines
- Integer simple, memory, FP, ...
Advantages:
- Reduce instruction latency
  - Each instruction goes to WB asap
- Eliminate need for forwarding paths
- Eliminate some unnecessary stalls
  - E.g. slow FP instruction does not block independent integer instructions
Disadvantages??

18 In-order Issue into Diversified Pipelines
[Figure: in-order instruction stream of the form RD <= Fn (RS, RT), with destination register, functional unit, and source registers, issued to the INT, Fadd1-Fadd2, Fmult1-Fmult3, and LD/ST pipes]
Issue stage needs to check:
1. Structural Dependence
2. RAW Hazard
3. WAW Hazard
4. WAR Hazard
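
The four issue-stage checks can be sketched as a small function. This is a minimal sketch under stated assumptions, not the lecture's design: `Inst`, `issue_hazards`, and the unit names are hypothetical, and the WAR check is deliberately conservative (it assumes an older in-flight instruction may not have read its sources yet).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Inst:
    unit: str     # functional unit needed, e.g. "INT", "Fadd", "Fmult", "LD/ST"
    dest: str     # destination register
    srcs: tuple   # source registers

def issue_hazards(candidate, in_flight, busy_units):
    """Return the set of hazards that block `candidate` from issuing.

    in_flight: older instructions that were issued but have not written back.
    busy_units: functional units that cannot accept a new instruction.
    """
    hazards = set()
    if candidate.unit in busy_units:
        hazards.add("structural")
    for older in in_flight:
        if older.dest in candidate.srcs:
            hazards.add("RAW")   # candidate would read a value not yet written
        if older.dest == candidate.dest:
            hazards.add("WAW")   # the two writes could complete out of order
        if candidate.dest in older.srcs:
            hazards.add("WAR")   # conservative: older may not have read yet
    return hazards

fmult = Inst("Fmult", "F1", ("F2", "F3"))   # slow multiply, still in flight
fadd = Inst("Fadd", "F1", ("F4", "F5"))     # younger add to the same register
print(issue_hazards(fadd, [fmult], busy_units=set()))  # prints {'WAW'}
```

An in-order issue stage would stall the candidate until the returned set is empty.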

19 Dynamic Pipelines
[Figure: IF, ID, RD, then a Dispatch Buffer (in order in, out of order out), EX pipes ALU, MEM1-MEM2, FP1-FP2-FP3, and BR, then a Reorder Buffer (out of order in, in order out), then WB]

20 Designs of Inter-stage Buffers
- Scalar pipeline buffer (simple register): Stage i -> Buffer (1) -> Stage i + 1, in order
- In-order parallel buffers (wide register or FIFO): Stage i -> Buffer (n) -> Stage i + 1, in order
- Out-of-order pipeline stages (multiported SRAM and CAM): Stage i -> Buffer (>= n) -> Stage i + 1, any order

21 The Challenges of Out-of-Order
[Figure: IF, ID, RD, then EX pipes INT, Fadd1-Fadd2, Fmult1-Fmult3, and LD/ST, then WB]
Program order:
  Ia: F1 <= F2 x F3
  Ib: F1 <= F4 + F5
Out-of-order WB:
  Ib: F1 <= F4 + F5
  Ia: F1 <= F2 x F3
What is the value of F1? WAW!!!

22 Dynamic-Static Interface
DSI = ISA = a contract between the program and the machine.
[Figure: Architectural State vs. Microarchitecture State]
Architectural state requirements:
- Support sequential instruction execution semantics.
- Support precise servicing of exceptions and interrupts.
Buffering needed between arch and uarch states (ROB):
- Allow uarch state to deviate from arch state.
- Able to undo speculative uarch state if needed.
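
A toy reorder buffer shows how buffering lets results finish out of order while the architectural state still changes in program order. This is a minimal sketch, not a real design: the class, its method names, and the result values are all assumptions for illustration. It replays the Ia/Ib case from the out-of-order slide.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: results finish in any order, arch state updates in program order."""

    def __init__(self):
        self.entries = deque()   # program order, oldest first
        self.arch_regs = {}      # architectural register file

    def dispatch(self, tag, dest):
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def finish(self, tag, value):
        # Execution units report results out of order; only uarch state changes.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True
                return
        raise KeyError(tag)

    def retire(self):
        # Commit from the head only while the oldest instruction is done.
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired

rob = ReorderBuffer()
rob.dispatch("Ia", "F1")   # Ia: F1 <= F2 x F3 (slow multiply)
rob.dispatch("Ib", "F1")   # Ib: F1 <= F4 + F5 (fast add)
rob.finish("Ib", 9.0)      # Ib finishes first (9.0 is an assumed result)
first = rob.retire()       # nothing retires: Ia is still pending
rob.finish("Ia", 6.0)      # assumed result of the multiply
second = rob.retire()      # Ia, then Ib, retire in program order
print(first, second, rob.arch_regs)  # prints [] ['Ia', 'Ib'] {'F1': 9.0}
```

F1 ends up holding Ib's result, which is what sequential execution semantics require, even though Ia finished last.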

23 Modern Superscalar Processor
[Figure: Fetch -> Instruction/Decode Buffer -> Decode -> Dispatch Buffer -> Dispatch -> Reservation Stations -> Issue -> Execute -> Finish -> Reorder/Completion Buffer -> Complete -> Store Buffer -> Retire. The front end is in order, execution is out of order, and the back end is in order.]

24 Impediments to Superscalar Performance
[Figure: superscalar pipeline (FETCH, DECODE, EXECUTE, COMMIT) annotated with three flows and the three penalty loops (branch, ALU, load):
- Instruction Flow: I-cache, Branch Predictor, Instruction Buffer
- Register Data Flow: Integer, Floating-point, Media, and Memory units, Reorder Buffer (ROB)
- Memory Data Flow: Store Queue, D-cache]

25 2. Limits of Instruction-Level Parallelism

26 Amdahl's Law
[Figure: number of processors (1 to N) vs. time, with serial fraction 1 - f and vectorizable fraction f]
f = fraction that is vectorizable code
(1 - f) = fraction of time in serial code
N = speedup for f
Overall speedup:
$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{N}}$$

27 Revisit Amdahl's Law
[Figure: number of processors (1 to N) vs. time, with serial fraction 1 - f and vectorizable fraction f]
Sequential bottleneck: even if N is infinite,
$$\lim_{N \to \infty} \frac{1}{(1 - f) + \frac{f}{N}} = \frac{1}{1 - f}$$
Performance limited by non-vectorizable portion (1 - f)
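
Amdahl's Law and its sequential-bottleneck limit are easy to check numerically. The sketch below uses f = 0.8 purely as an example value; the function names are hypothetical.

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f / n).

    f: fraction of time in vectorizable code, n: speedup of that fraction.
    """
    return 1.0 / ((1.0 - f) + f / n)

def amdahl_limit(f):
    """Limit of the speedup as n grows without bound: 1 / (1 - f)."""
    return 1.0 / (1.0 - f)

f = 0.8  # example value only
for n in (2, 6, 100, 10**6):
    print(n, round(amdahl_speedup(f, n), 3))
print("limit", round(amdahl_limit(f), 3))
```

With f = 0.8 the speedup never reaches 5 no matter how large n becomes, because the serial 20% is untouched.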

28 Pipelined Processor Performance Model
[Figure: pipeline depth (1 to N) vs. time, with fractions 1 - g and g]
g = fraction of time pipeline is filled
1 - g = fraction of time pipeline is not filled (stalled)

29 Pipelined Processor Performance Model
[Figure: pipeline depth (1 to N) vs. time, with fractions 1 - g and g]
Tyranny of Amdahl's Law [Bob Colwell]
- When g is even slightly below 100%, a big performance hit will result
- Stalled cycles in the pipeline are the key adversary and must be minimized as much as possible
- Can we somehow fill the pipeline bubbles (stalled cycles)?
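
To get a feel for how sharply performance falls as g drops, the sketch below assumes the pipelined model takes the same form as Amdahl's Law, with pipeline depth N in place of the processor count and g in place of f. That formula is an assumption drawn from the analogy in the figure, and the depth of 6 is just an example.

```python
def pipeline_speedup(g, depth):
    # Assumed by analogy with Amdahl's Law: the stalled fraction (1 - g) runs
    # at speedup 1, the filled fraction g runs at speedup `depth`.
    return 1.0 / ((1.0 - g) + g / depth)

depth = 6  # example depth
for g in (1.0, 0.99, 0.95, 0.90, 0.80):
    print(f"g = {g:.2f}: speedup = {pipeline_speedup(g, depth):.2f}")
```

Under this model, a pipeline that is filled 90% of the time delivers a speedup of 4 rather than 6, which is the kind of loss the slide calls a big performance hit.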

30 Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, but s = 2 instead of s = 1 (scalar)
[Figure: speedup vs. vectorizability f, with curves for n = 4, n = 6, n = 12, n = 100, and n = 6, s = 2; the typical range of f is marked]

31 Superscalar Proposal
- Moderate the tyranny of Amdahl's Law
  - Ease sequential bottleneck
  - More generally applicable
  - Robust (less sensitive to f)
- Revised Amdahl's Law:
$$\text{Speedup} = \frac{1}{\frac{1 - f}{S} + \frac{f}{N}}$$
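
The revised law, Speedup = 1 / ((1 - f)/S + f/N), can be checked against the numbers quoted on the Agerwala and Cocke slide (N = 6, f = 0.8, S = 1 versus S = 2). The function name is a hypothetical one chosen for this sketch.

```python
def revised_speedup(f, n, s=1):
    """Speedup = 1 / ((1 - f) / s + f / n); s = 1 reduces to Amdahl's Law."""
    return 1.0 / ((1.0 - f) / s + f / n)

scalar = revised_speedup(0.8, 6, s=1)
superscalar = revised_speedup(0.8, 6, s=2)
print(round(scalar, 1), round(superscalar, 1))  # prints 3.0 4.3
```

Doubling the rate of the non-vectorizable portion raises the overall speedup from 3 to about 4.3, matching the slide, without changing N or f.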

32 Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7 (Jouppi disagreed)
Kuck et al. [1972]            8
Riseman and Foster [1972]     51 (no control dependences)
Nicolau and Fisher [1984]     90 (Fisher's optimism)

33 The Ideas Behind Modern Processors
- Superscalar or wide instruction issue
  - Ideal IPC = n (CPI = 1/n)
- Diversified pipelines
  - Different instructions go through different pipe stages
  - Instructions go through needed stages only
- Out-of-order or data-flow execution
  - Stall only on RAW hazards and structural hazards
- Speculation
  - Overcome (some) RAW hazards through prediction
And it all relies on: Instruction Level Parallelism (ILP)
- Independent instructions within sequential programs

34 Architectures for Instruction-Level Parallelism
Scalar Pipeline (baseline)
- Instruction Parallelism = D
- Operation Latency = 1
- Peak IPC = 1
[Figure: successive instructions vs. time in cycles (of baseline machine) for a pipeline of depth D with stages IF, DE, EX, WB]

35 Superpipelined Processors
Superpipelined Execution
- IP = D x M
- OL = M minor cycles
- Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: stages IF, DE, EX, WB; major cycle = M minor cycles]

36 Superscalar Processors
Superscalar (Pipelined) Execution
- IP = D x N
- OL = 1 baseline cycle
- Peak IPC = N per baseline cycle
[Figure: N instructions per cycle through stages IF, DE, EX, WB]

37 Superscalar and Superpipelined
Superscalar Parallelism
- Operation Latency: 1
- Issuing Rate: N
- Superscalar Degree (SSD): N (determined by issue rate)
Superpipeline Parallelism
- Operation Latency: M
- Issuing Rate: 1
- Superpipelined Degree (SPD): M (determined by operation latency)
[Figure: superscalar vs. superpipelined execution; key: IFetch, Dcode, Execute, Writeback; x-axis: time in cycles (of base machine)]
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.
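
The "roughly the same performance" claim can be illustrated with an idealized cycle count. The sketch assumes fully independent instructions, no stalls, and a D-stage pipeline; the function names and the choice of 12 instructions and 4 stages are assumptions for the example.

```python
import math

def superscalar_cycles(k, depth, n):
    """Baseline cycles to run k independent instructions, issuing n per cycle."""
    return depth + math.ceil(k / n) - 1

def superpipelined_cycles(k, depth, m):
    """Baseline cycles when each stage is split into m minor cycles and one
    instruction issues per minor cycle."""
    minor_cycles = depth * m + k - 1
    return minor_cycles / m

k, depth = 12, 4  # example: 12 independent instructions, stages IF/DE/EX/WB
for degree in (1, 2, 3):
    print(degree,
          superscalar_cycles(k, depth, degree),
          superpipelined_cycles(k, depth, degree))
```

At degree 3 this model gives 7 baseline cycles for the superscalar machine and about 7.7 for the superpipelined one; the small gap comes from the longer startup of the superpipelined machine.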

38 3. Superscalar Microprocessor Landscapes

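39 Iron Law of Processor Performance
$$\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
where the three factors are (inst. count) x (CPI) x (cycle time).
$$\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}$$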
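
The two forms of the Iron Law can be checked against each other with a few lines of arithmetic. The instruction count, CPI, and clock rate below are illustrative numbers only, and the function names are hypothetical.

```python
def time_per_program_ns(inst_count, cpi, cycle_time_ns):
    """Iron Law: time/program = inst. count x CPI x cycle time."""
    return inst_count * cpi * cycle_time_ns

def performance(inst_count, ipc, ghz):
    """Programs per nanosecond: (IPC x GHz) / inst. count."""
    return ipc * ghz / inst_count

# Illustrative numbers only (not from the lecture).
inst_count, cpi, ghz = 1_000_000, 1.25, 2.0
t_ns = time_per_program_ns(inst_count, cpi, 1.0 / ghz)
perf = performance(inst_count, 1.0 / cpi, ghz)
print(t_ns)         # execution time in nanoseconds
print(perf * t_ns)  # close to 1.0: performance is the reciprocal of time
```

Because IPC = 1/CPI and GHz = 1/(cycle time in ns), the second form is just the reciprocal of the first.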

40 Landscape of Microprocessor Families (SPECint92)

41 Landscape of Microprocessor Families (SPECint95)
[Figure: SPECint95/MHz vs. frequency (MHz), with SPECint95 reference curves; families: Alpha, AMD-x86, Intel-x86; labeled points: Pentium, PPro, PII, PIII, Athlon]
** Data source

42 Landscape of Microprocessor Families (SPECint2000)
[Figure: SPECint2000/MHz vs. frequency (MHz), with SPECint2000 reference curves; families: Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, IPF; labeled points: PIII-Xeon, 264A, B, C, Itanium, Sparc-III, Athlon, Pentium 4]
** Data source

43 Frequency vs. Parallelism
Increase Frequency (GHz)
- Deeper pipelines
- Increased overall latency
- Lower IPC
Increase Instruction Parallelism (IPC)
- Wider pipelines
- Increased complexity
- Lower GHz

44 Deeper and Wider Pipelines
[Figure: a short pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) next to a deeper and wider pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire), with the branch mispredict penalty marked]

45 Front-End Pipe-Depth Penalty
[Figure: pipeline Fetch, Decode, Dispatch, Execute, Memory, Retire shown twice, illustrating Front-End Contraction and Back-End Optimization (an added Optimize stage)]

46 Alleviate Pipe-Depth Penalty
Front-End Contraction
- Code Re-mapping and Caching
- Trace Construction, Caching, Optimization
- Leverage Back-End Optimizations
Back-End Optimization
- Multiple-Branch, Trace, Stream, Prediction
- Code Reordering, Alignment, Optimization
- Pre-decode, Pre-rename, Pre-scheduling
- Memory Pre-fetch Prediction and Control

47 Execution Core Improvement
- Super-pipelined ALU design
- Very high-speed arithmetic units
- Speculative OoO execution
- Criticality-based data caching
- Aggressive data pre-fetching
[Figure: pipeline Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize]

48 Next Lecture
Superscalar Pipeline Implementation:
- Instruction fetch
- Instruction decode
- Instruction dispatch
- Instruction execute
- Instruction complete and retire
Instruction Flow Techniques
