High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

1 High-Performance Microarchitecture Techniques
  John Paul Shen, Director of Microarchitecture Research, Intel Labs
  October 29, 2002, Microprocessor Research Forum

2 Intel's Microarchitecture Research Labs
  - USA: California, Oregon, Texas (John Shen)
      High Frequency Superscalar Processors
      Helper Threads for SMT and CMP Machines
      Future Enterprise Server Processors
  - Israel: Haifa (Ronny Ronen)
      Low Power Microarchitecture Techniques
      Future Mobile High-performance Processors
  - Spain: Barcelona (Antonio Gonzalez)
      Speculative Multithreading for SMT and CMP
      Clustered Microarchitecture Techniques

3 Microprocessor Performance Growth in Perspective
  - Doubling every 18 months (... 2000): Total of 3,200X
      Cars travel at 176,000 MPH; get 64,000 miles/gal.
      Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
      Wheat yield: 320,000 bushels per acre
  - Doubling every 24 months ( ): Total of 36,000X
      Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
      Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
      Wheat yield: 3,600,000 bushels per acre
  Unmatched by any other industry!
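As a rough check on the growth figures in slide 3, a total gain G corresponds to log2(G) doublings, so the time span implied by a doubling period follows directly. The sketch below is illustrative only; the function names are not from the talk, and the time spans it prints are derived from the stated totals, not quoted from the slide.

```python
import math


def doublings(total_gain: float) -> float:
    """Number of doublings needed to reach a total gain (e.g. 3200 for 3,200X)."""
    return math.log2(total_gain)


def years_needed(total_gain: float, months_per_doubling: float) -> float:
    """Years required to reach total_gain when performance doubles every N months."""
    return doublings(total_gain) * months_per_doubling / 12.0


if __name__ == "__main__":
    # Totals and doubling periods as stated on the slide.
    for gain, months in ((3_200, 18), (36_000, 24)):
        print(f"{gain:>6}X at one doubling per {months} months: "
              f"{doublings(gain):.1f} doublings, {years_needed(gain, months):.1f} years")
```

Under these assumptions, 3,200X at 18 months per doubling spans roughly 17.5 years, and 36,000X at 24 months per doubling spans roughly 30.3 years.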

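4 Iron Law of Microprocessor Performance
\[
\frac{1}{\text{Processor Performance}}
  = \frac{\text{Time}}{\text{Program}}
  = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst. count}}
    \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}}
    \times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}
\]
\[
\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}
\]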
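The Iron Law can be evaluated directly. The following is a minimal sketch with made-up workload numbers (one billion instructions, CPI of 1.25, a 2 GHz clock); the function names and the sample values are assumptions for illustration, not data from the talk.

```python
def execution_time_s(inst_count: float, cpi: float, cycle_time_s: float) -> float:
    """Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)."""
    return inst_count * cpi * cycle_time_s


def performance(ipc: float, ghz: float, inst_count: float) -> float:
    """Programs per second = IPC x frequency / instruction count (frequency in GHz)."""
    return ipc * ghz * 1e9 / inst_count


if __name__ == "__main__":
    inst_count = 1e9          # hypothetical program: one billion instructions
    cpi = 1.25                # hypothetical cycles per instruction (IPC = 0.8)
    ghz = 2.0                 # hypothetical clock frequency
    cycle_time_s = 1.0 / (ghz * 1e9)

    t = execution_time_s(inst_count, cpi, cycle_time_s)
    p = performance(1.0 / cpi, ghz, inst_count)
    print(f"execution time: {t:.3f} s, performance: {p:.2f} programs/s")
    # The two forms of the law are reciprocals of each other.
    print(f"time x performance = {t * p:.6f}")
```

Both forms describe the same quantity: lowering instruction count, CPI, or cycle time each shortens execution time by the same proportion.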

5 Performance Improvement Techniques
  - Increase GHz
      Process Technology
      Circuit Techniques
      Pipelining and Caches
  - Increase IPC (Reduce CPI)
      Superscalar Pipelines
      Out-of-order Execution
      Cache Miss Reduction
  - Decrease Instruction Count
      Compiler Optimization
      Architecture Extensions
      Microarchitecture Techniques

6 SPECint92 Landscape

7 P6 vs. Pentium 4 Pipelines
  Basic P6 Pipeline (intro at 733 MHz, 0.18µ):
    Fetch | Fetch | Decode | Decode | Decode | Rename | ROB Rd | Rdy/Sch | Dispatch | Exec
  Basic Pentium 4 Processor Pipeline (intro at 1.5 GHz, 0.18µ):
    TC Nxt IP | TC Fetch | Drive | Alloc | Rename | Que | Sch | Sch | Sch | Disp | Disp | RF | RF | Ex | Flgs | Br Ck | Drive

8 Hyper Pipelined
  [Figure: Frequency versus Introduction Time for three microarchitectures. Labels: P5 Micro-Architecture (5), 60MHz, 233MHz; P6 Micro-Architecture (10), 166MHz, 1GHz; Netburst Micro-Architecture (20), intro 1.5 GHz.]

9 Deeper and Wider Pipelines
  [Figure: a shallow pipeline (Fetch, Dec., Disp., Exec., Mem., Retire) beside a deeper and wider pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire), each annotated with Branch Penalty, Load Penalty, and ALU Penalty loops.]

10 Pipelining Penalty Loops
  - Branch Penalty: Branch predictor
      CPI overhead: Branch% x Misprediction% x PipeDepth
      Performance lost: CPI overhead x PipeWidth
  - Load Penalty: Cache hierarchy
      CPI overhead: Load% x AvgLoadLatency
      Average Load Latency: Σ Cache(i)Hit% x Cache(i)Latency
  - ALU Penalty: Forwarding paths and super-pipelining
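The penalty formulas on slide 10 are simple products and can be sketched as below. All input values (branch and load fractions, misprediction rate, pipeline depth and width, the two-level hierarchy) are hypothetical, chosen only to exercise the formulas; the function names are likewise illustrative.

```python
def branch_cpi_overhead(branch_frac: float, mispredict_rate: float, pipe_depth: int) -> float:
    """CPI overhead = Branch% x Misprediction% x PipeDepth."""
    return branch_frac * mispredict_rate * pipe_depth


def performance_lost(cpi_overhead: float, pipe_width: int) -> float:
    """Issue slots lost per instruction = CPI overhead x PipeWidth."""
    return cpi_overhead * pipe_width


def average_load_latency(levels) -> float:
    """Sum over cache levels of hit fraction x latency; levels is [(hit_frac, cycles), ...]."""
    return sum(hit_frac * cycles for hit_frac, cycles in levels)


def load_cpi_overhead(load_frac: float, avg_latency: float) -> float:
    """CPI overhead = Load% x AvgLoadLatency."""
    return load_frac * avg_latency


if __name__ == "__main__":
    # Hypothetical machine: 20% branches, 5% mispredicted, 20-stage, 3-wide pipeline.
    b = branch_cpi_overhead(0.20, 0.05, 20)
    print(f"branch CPI overhead: {b:.2f}, issue slots lost: {performance_lost(b, 3):.2f}")

    # Hypothetical two-level hierarchy: 90% hit at 2 cycles, 10% at 20 cycles.
    lat = average_load_latency([(0.90, 2), (0.10, 20)])
    print(f"average load latency: {lat:.1f} cycles, "
          f"load CPI overhead: {load_cpi_overhead(0.25, lat):.2f}")
```

The branch term scales linearly with pipeline depth, and the lost-performance term scales with width, which is why deeper and wider pipelines raise the cost of each misprediction.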

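11 Branch Prediction
  [Figure: superscalar pipeline with branch prediction at fetch. The Branch Predictor (using the BTB) supplies a speculative condition prediction and a speculative target to the FA-mux, which selects between the predicted target and npc(seq.) = PC+4 and sends npc to the I-cache. Stages shown: Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Branch, Execute, Finish, Completion Buffer. The branch unit feeds a BTB update (target addr. and history).]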

12 Branch Prediction Technology
  - Basic 2-bit Local History Predictor
      ~80% prediction accuracy
      ~25 instructions/mispredict
      ~5 cycles/25 instructions (0.2 CPI)
  - Two-Level Correlated Predictor (P6)
      ~90% prediction accuracy
      ~50 instructions/mispredict
      ~10 cycles/50 instructions (0.2 CPI)
  - Current State of the Art (Pentium 4)
      ~95% prediction accuracy
      ~100 instructions/mispredict
      ~20 cycles/100 instructions (0.2 CPI)
  - Current Research Challenge (2008)
      ~98% prediction accuracy
      ~250 instructions/mispredict
      ~25 cycles/250 instructions (0.1 CPI)
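To make the numbers on slide 12 concrete, here is a minimal sketch of a predictor built from 2-bit saturating counters indexed by branch address, a common textbook design; whether it matches the exact predictor the slide has in mind is an assumption. The helper `instructions_per_mispredict` assumes that 20% of instructions are branches, a value not stated on the slide but one that reproduces its figures (25, 50, 100, and 250 instructions per mispredict). The loop trace and table size are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters; counter values 2 and 3 predict taken."""

    def __init__(self, table_bits: int = 10):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)  # start in "weakly not-taken"

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask  # drop byte offset, keep low-order bits

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, trace) -> float:
    """Fraction of (pc, taken) outcomes predicted correctly, training as it goes."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


def instructions_per_mispredict(prediction_accuracy: float, branch_frac: float = 0.20) -> float:
    """Average instruction distance between mispredictions (branch_frac is assumed)."""
    return 1.0 / (branch_frac * (1.0 - prediction_accuracy))


def mispredict_cpi(prediction_accuracy: float, penalty_cycles: float,
                   branch_frac: float = 0.20) -> float:
    """CPI lost to mispredictions = penalty cycles / instructions per mispredict."""
    return penalty_cycles / instructions_per_mispredict(prediction_accuracy, branch_frac)


if __name__ == "__main__":
    # Hypothetical loop branch: taken 9 times, then not taken, repeated 100 times.
    trace = [(0x400, i % 10 != 9) for i in range(1000)]
    print(f"2-bit counter accuracy on loop trace: {accuracy(TwoBitPredictor(), trace):.3f}")

    # Accuracy and penalty pairs as listed on the slide.
    for acc, penalty in ((0.80, 5), (0.90, 10), (0.95, 20), (0.98, 25)):
        print(f"{acc:.0%}: {instructions_per_mispredict(acc):.0f} instr/mispredict, "
              f"{mispredict_cpi(acc, penalty):.2f} CPI")
```

The second loop shows why the CPI cost stays near 0.2 across the first three generations: each gain in accuracy is offset by a proportionally longer misprediction penalty.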

13 Data Cache and Prefetching
  [Figure: superscalar pipeline with data prefetching. Front end: Branch Predictor, I-cache, Decode Buffer, Decode, Dispatch Buffer, Dispatch. Reservation Stations feed branch, integer, integer, floating point, store, and load units. Memory Reference Prediction drives a Prefetch Queue into the Data Cache. Back end: Completion Buffer, Complete, Store Buffer, Data Cache, Main Memory.]

14 Cache Hierarchy Technology
  - Current Commercial Workload (6 cycles/load)
      L1 Hits: 80% x 2 cycles   = 1.6
      L2 Hits: 15% x 10 cycles  = 1.5
      L3 Hits:  4% x 30 cycles  = 1.2
      Memory:   1% x 150 cycles = 1.5
  - Future Commercial Workload (17 cycles/load)
      L1 Hits: 80% x 4 cycles   = 3.2
      L2 Hits: 15% x 20 cycles  = 3.0
      L3 Hits:  4% x 60 cycles  = 2.4
      Memory:   1% x 800 cycles = 8.0
  - Current Research Challenge (5 cycles/load)
      Efficient and judicious caches
      Load partitioning and specialized caching
      Aggressive memory prefetching
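The per-level contributions on slide 14 can be summed to recover the quoted averages (5.8 and 16.6 cycles, which the slide rounds to 6 and 17). The sketch below uses the slide's hit fractions and latencies; the function names and the "memory share" metric are added here for illustration.

```python
# (level, fraction of loads served, latency in cycles), as listed on the slide.
CURRENT = [("L1", 0.80, 2), ("L2", 0.15, 10), ("L3", 0.04, 30), ("Memory", 0.01, 150)]
FUTURE = [("L1", 0.80, 4), ("L2", 0.15, 20), ("L3", 0.04, 60), ("Memory", 0.01, 800)]


def average_load_latency(levels) -> float:
    """Weighted average latency: sum of (fraction served x latency) over all levels."""
    return sum(frac * cycles for _, frac, cycles in levels)


def memory_share(levels) -> float:
    """Fraction of the average load latency contributed by main memory accesses."""
    memory_cycles = sum(frac * cycles for name, frac, cycles in levels if name == "Memory")
    return memory_cycles / average_load_latency(levels)


if __name__ == "__main__":
    for label, levels in (("current", CURRENT), ("future", FUTURE)):
        print(f"{label}: {average_load_latency(levels):.1f} cycles/load, "
              f"memory share {memory_share(levels):.0%}")
```

Although only 1% of loads reach main memory in both cases, that 1% grows from about a quarter of the average latency to nearly half, which is the motivation for the prefetching work listed under the research challenge.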

15 Memory Latency Bottleneck
  [Figure: Cache Latency (Clocks), shown as Instruction Cost, for L1, L2, L3, and External Memory across the Pentium, Pentium Pro, Pentium III, and Future Processors; the External Memory Latency axis is marked from 1 to 400.]
  Cache Prefetching:
    - Hardware: Limited by predictable patterns
    - Software: Limited by single control flow
    - Research Challenge: Pointer-intensive code

16 Frequency vs. Parallelism
  - Increase Frequency (GHz)
      Deeper Pipelines
      Increases Branch/Load penalties
      Lowers IPC
  - Increase Instruction Parallelism (IPC)
      Wider Pipelines
      Increases Complexity
      Lowers GHz

17 Front-End Pipe-Depth Penalty
  [Figure: two pipelines (Fetch, Decode, Dispatch, Execute, Memory, Retire). The second adds an Optimize stage after Retire, illustrating Front-End Contraction and Back-End Optimization.]

18 Alleviate Pipe-Depth Penalty
  - Front-End Contraction
      Code Re-mapping and Caching
      Trace Construction, Caching, Optimization
      Leverage Back-End Optimizations
  - Back-End Optimization
      Multiple-Branch, Trace, Stream, Prediction
      Code Reordering, Alignment, Optimization
      Pre-decode, Pre-rename, Pre-scheduling
      Memory Pre-fetch Prediction and Control

19 Execution Core Improvement
  [Figure: pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize) with annotations on the execution core.]
    - Super-pipelined ALU design
    - Very high-speed arithmetic units
    - Speculative OoO execution
    - Criticality-based data caching
    - Aggressive data pre-fetching

20 How Deep Can You Go?
  [Figure: Frequency, CPI, Performance, and Power plotted against Pipeline Depth (surviving axis labels: "15 57?"). Source: Intel Corporation. Ed Grochowski, 7/6/01.]

21 How Much ILP Is There?
  Weiss and Smith [1984]       1.58
  Sohi and Vajapeyam [1987]    1.81
  Tjaden and Flynn [1970]      1.86
  Tjaden and Flynn [1973]      1.96
  Uht [1986]                   2.00
  Smith et al. [1989]          2.00
  Jouppi and Wall [1988]       2.40
  Johnson [1991]               2.50
  Acosta et al. [1986]         2.79
  Wedig [1982]                 3.00
  Butler et al. [1991]         5.8
  Melvin and Patt [1991]       6
  Wall [1991]                  7
  Kuck et al. [1972]           8
  Riseman and Foster [1972]    51
  Nicolau and Fisher [1984]    90

22 SPECint95 Landscape
  [Figure: "Landscape of Microprocessor Families". SPECint95/MHz (axis mark 0.08) versus Frequency (MHz), with SPECint95 contours. Data-point labels: P, PPro, PII, 164, PIII, Athlon. Legend: Alpha, AMD-x, Intel-x86. Credit: Bryan Black. ** Data source]

23 SPECint2000 Landscape
  [Figure: "Landscape of Microprocessor Families". SPECint2000/MHz (axis mark 1) versus Frequency (MHz), with SPECint2000 contours. Data-point labels: PIII-Xeon, 264A, 264B, 264C, Sparc-III, Athlon, Itanium, P4. Legend: Intel-x86, AMD-x86, Alpha, PowerPC, Sparc, IPF. Credit: Bryan Black. ** Data source]

24 Parallelism in Transition
  [Figure: MIPS growth across processor generations, moving from the Era of Instruction Parallelism to the Era of Thread Parallelism.]
    - Pentium Architecture: Super Scalar
    - Pentium Pro Architecture: Speculative Out of Order
    - Pentium 4 Architecture: Trace Cache
    - Future Xeon Architecture: Multi-Threaded
    - Multi-Threaded, Multi-Core

25 Summary
  Performance Demand Continues
    - 5-10 billion transistors by 2010
    - GHz by 2010
  Challenge Is Power and Efficiency
    - Power dissipation, delivery, density
    - New clever/efficient implementations
  New Frontiers to Explore
    - Synergism of ILP, TLP, and MLP
    - Semi-Custom Microarchitectures
