High-Performance Microarchitecture Techniques
John Paul Shen, Director of Microarchitecture Research, Intel Labs
Microprocessor Research Forum, October 29, 2002
Intel's Microarchitecture Research Labs
- USA: California, Oregon, Texas (John Shen)
  - High-Frequency Superscalar Processors
  - Helper Threads for SMT and CMP Machines
  - Future Enterprise Server Processors
- Israel: Haifa (Ronny Ronen)
  - Low-Power Microarchitecture Techniques
  - Future Mobile High-Performance Processors
- Spain: Barcelona (Antonio Gonzalez)
  - Speculative Multithreading for SMT and CMP
  - Clustered Microarchitecture Techniques
Microprocessor Performance Growth in Perspective
- Doubling every 18 months (1982-2000): total of 3,200X
  - Cars travel at 176,000 MPH; get 64,000 miles/gal.
  - Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
  - Wheat yield: 320,000 bushels per acre
- Doubling every 24 months (1971-2001): total of 36,000X
  - Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  - Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
  - Wheat yield: 3,600,000 bushels per acre
- Unmatched by any other industry!
Iron Law of Microprocessor Performance

  1 / Processor Performance = Time / Program

  Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
                 = inst. count x CPI x cycle time

  Processor Performance = (IPC x GHz) / inst. count
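The Iron Law arithmetic can be sketched directly in code. The workload numbers below (1e9 instructions, CPI 1.25, 1.5 GHz) are made-up illustrative values, not figures from the talk:

```python
# Iron Law sketch: execution time is the product of instruction count,
# cycles per instruction (CPI), and cycle time; performance is its inverse.

def exec_time_seconds(inst_count: float, cpi: float, freq_ghz: float) -> float:
    """Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)."""
    cycle_time_s = 1.0 / (freq_ghz * 1e9)   # 1.5 GHz -> ~0.667 ns per cycle
    return inst_count * cpi * cycle_time_s

def performance(inst_count: float, ipc: float, freq_ghz: float) -> float:
    """Processor Performance = (IPC x GHz) / inst. count."""
    return ipc * freq_ghz * 1e9 / inst_count

# Example: 1e9 instructions, CPI = 1.25 (i.e. IPC = 0.8), 1.5 GHz clock.
t = exec_time_seconds(1e9, 1.25, 1.5)
p = performance(1e9, 1.0 / 1.25, 1.5)
print(t)      # ~0.833 seconds
print(p * t)  # ~1.0 -- performance really is the reciprocal of time
```

This makes the slide's point concrete: the same performance gain can come from fewer instructions, a lower CPI, or a faster clock, since the three factors simply multiply.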
Performance Improvement Techniques
- Increase GHz
  - Process Technology
  - Circuit Techniques
  - Pipelining and Caches
- Increase IPC (Reduce CPI)
  - Superscalar Pipelines
  - Out-of-Order Execution
  - Cache Miss Reduction
- Decrease Instruction Count
  - Compiler Optimization
  - Architecture Extensions
  - Microarchitecture Techniques
SPECint92 Landscape
P6 vs. Pentium 4 Pipelines
- Basic P6 pipeline (intro at 733 MHz, 0.18µ), 10 stages:
  Fetch | Fetch | Decode | Decode | Decode | Rename | ROB Rd | Rdy/Sch | Dispatch | Exec
- Basic Pentium 4 processor pipeline (intro at 1.5 GHz, 0.18µ), 20 stages:
  TC Nxt IP (1-2) | TC Fetch (3-4) | Drive (5) | Alloc (6) | Rename (7-8) | Que (9) | Sch (10-12) | Disp (13-14) | RF (15-16) | Ex (17) | Flgs (18) | Br Ck (19) | Drive (20)
Hyper-Pipelined Technology
[Chart: pipeline depth vs. introduction time]
- P5 microarchitecture: 5 stages (introduced at 60 MHz, reaching 233 MHz)
- P6 microarchitecture: 10 stages (introduced at 166 MHz, reaching 1 GHz)
- NetBurst microarchitecture: 20 stages (introduced at 1.5 GHz)
Deeper and Wider Pipelines
[Diagram: two pipelines (Fetch, Decode, Dispatch, Execute, Memory, Retire), one deeper and one wider, each annotated with a branch-penalty loop back to fetch, an ALU penalty within execute, and a load-penalty loop from memory back to execute]
Pipelining Penalty Loops
- Branch penalty (branch predictor)
  - CPI overhead: Branch% x Misprediction% x PipeDepth
  - Performance lost: CPI overhead x PipeWidth
- Load penalty (cache hierarchy)
  - CPI overhead: Load% x AvgLoadLatency
  - Average load latency: Σ Cache(i)Hit% x Cache(i)Latency
- ALU penalty
  - Forwarding paths and super-pipelining
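The two penalty-loop formulas are simple enough to evaluate directly. The workload mix below (20% branches, 5% mispredictions, a 20-stage pipe) is an illustrative assumption, not data from the talk:

```python
# Penalty-loop sketch, following the slide's formulas.

def branch_cpi_overhead(branch_frac: float, mispredict_rate: float,
                        pipe_depth: int) -> float:
    """CPI overhead = Branch% x Misprediction% x PipeDepth."""
    return branch_frac * mispredict_rate * pipe_depth

def load_cpi_overhead(load_frac: float, levels) -> float:
    """CPI overhead = Load% x AvgLoadLatency,
    where AvgLoadLatency = sum of Cache(i)Hit% x Cache(i)Latency."""
    avg_load_latency = sum(hit_frac * latency for hit_frac, latency in levels)
    return load_frac * avg_load_latency

# Assumed mix: 20% branches, 5% mispredicted, 20-stage pipeline.
print(branch_cpi_overhead(0.20, 0.05, 20))           # 0.2 CPI of branch overhead

# Assumed mix: 30% loads over a two-level hierarchy (90% hit in 2 cycles,
# 10% go to a 20-cycle L2).
print(load_cpi_overhead(0.30, [(0.9, 2), (0.1, 20)]))  # 1.14 CPI of load overhead
```

Note how the branch term scales linearly with pipe depth: doubling the depth from 10 to 20 stages doubles the misprediction overhead, which is exactly the deep-pipeline tension the following slides explore.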
Branch Prediction
[Diagram: fetch/decode/dispatch/execute pipeline in which a branch predictor and BTB feed an FA-mux that selects the next PC sent to the I-cache — either the speculative predicted target (from the speculative condition and target prediction) or the sequential npc = PC+4; the BTB is updated with the target address and history as branches resolve in the branch reservation station and complete]
Branch Prediction Technology
- Basic 2-bit local history predictor
  - ~80% prediction accuracy
  - ~25 instructions/mispredict
  - ~5 cycles/25 instructions (0.2 CPI)
- Two-level correlated predictor (P6)
  - ~90% prediction accuracy
  - ~50 instructions/mispredict
  - ~10 cycles/50 instructions (0.2 CPI)
- Current state of the art (Pentium 4)
  - ~95% prediction accuracy
  - ~100 instructions/mispredict
  - ~20 cycles/100 instructions (0.2 CPI)
- Current research challenge (2008)
  - ~98% prediction accuracy
  - ~250 instructions/mispredict
  - ~25 cycles/250 instructions (0.1 CPI)
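The "basic" scheme on the slide, a 2-bit saturating counter per branch, is small enough to sketch in full. Table size, the PC-modulo indexing, and the loop-branch example below are illustrative choices, not details from the talk:

```python
# Minimal 2-bit saturating-counter branch predictor (one counter per
# table entry, indexed by PC). Counters: 0-1 predict not-taken, 2-3 taken.

class TwoBitPredictor:
    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.table = [2] * entries          # start weakly taken

    def predict(self, pc: int) -> bool:
        return self.table[pc % self.entries] >= 2   # True = predict taken

    def update(self, pc: int, taken: bool) -> None:
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch (taken 9x, then not-taken at loop exit) is mispredicted
# only once per exit: the 2nd bit of hysteresis absorbs the single
# not-taken outcome without flipping the prediction.
bp = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3   # three trips through a 10-iteration loop
hits = sum(bp.predict(0x400) == taken or bp.update(0x400, taken) or False
           for taken in outcomes if (bp.update(0x400, taken) is None) or True)
```

A cleaner way to run it:

```python
bp = TwoBitPredictor()
outcomes = ([True] * 9 + [False]) * 3
hits = 0
for taken in outcomes:
    hits += (bp.predict(0x400) == taken)
    bp.update(0x400, taken)
print(hits / len(outcomes))   # 0.9 -- one miss per loop exit
```

The 90% figure for this loop pattern lines up with the slide's ~80-90% range for simple local predictors on real code.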
Data Cache and Prefetching
[Diagram: out-of-order pipeline (branch, integer, floating-point, load, and store reservation stations) in which a memory reference predictor drives a prefetch queue between the data cache and main memory; stores retire through a store buffer after completion]
Cache Hierarchy Technology
- Current commercial workload (~6 cycles/load)
  - L1 hits: 80% x 2 cycles = 1.6
  - L2 hits: 15% x 10 cycles = 1.5
  - L3 hits: 4% x 30 cycles = 1.2
  - Memory: 1% x 150 cycles = 1.5
- Future commercial workload (~17 cycles/load)
  - L1 hits: 80% x 4 cycles = 3.2
  - L2 hits: 15% x 20 cycles = 3.0
  - L3 hits: 4% x 60 cycles = 2.4
  - Memory: 1% x 800 cycles = 8.0
- Current research challenge (5 cycles/load)
  - Efficient and judicious caches
  - Load partitioning and specialized caching
  - Aggressive memory prefetching
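The slide's per-level arithmetic is the average-load-latency formula from the penalty-loop slide, evaluated with the slide's own hit rates and latencies:

```python
# Average load latency = sum over levels of Hit% x Latency,
# using the hit fractions and latencies given on the slide.

def avg_load_latency(levels) -> float:
    return sum(hit_frac * latency for hit_frac, latency in levels)

current = [(0.80, 2), (0.15, 10), (0.04, 30), (0.01, 150)]   # L1, L2, L3, memory
future  = [(0.80, 4), (0.15, 20), (0.04, 60), (0.01, 800)]

print(avg_load_latency(current))   # 5.8  -> the slide's ~6 cycles/load
print(avg_load_latency(future))    # 16.6 -> the slide's ~17 cycles/load
```

The striking part is the future breakdown: the 1% of loads that go to memory contribute 8.0 of the 16.6 cycles, which is why the slide's research challenge centers on prefetching and specialized caching rather than on the L1.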
Memory Latency Bottleneck
[Chart: cache latency in clocks (log scale, 1 to 1000) for L1, L2, L3, and external memory across Pentium, Pentium Pro, Pentium III, and future processors; external memory latency grows from roughly 100 toward 400-800 clocks, dominating instruction cost]
- Cache prefetching
  - Hardware: limited by predictable patterns
  - Software: limited by single control flow
  - Research challenge: pointer-intensive code
Frequency vs. Parallelism
- Increase frequency (GHz)
  - Deeper pipelines
  - Increases branch/load penalties
  - Lowers IPC
- Increase instruction parallelism (IPC)
  - Wider pipelines
  - Increases complexity
  - Lowers GHz
Front-End Pipe-Depth Penalty
[Diagram: the pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire) reshaped by front-end contraction and back-end optimization, adding an Optimize stage after Retire]
Alleviate Pipe-Depth Penalty
- Front-end contraction
  - Code re-mapping and caching
  - Trace construction, caching, optimization
  - Leverage back-end optimizations
- Back-end optimization
  - Multiple-branch, trace, and stream prediction
  - Code reordering, alignment, optimization
  - Pre-decode, pre-rename, pre-scheduling
  - Memory prefetch prediction and control
Execution Core Improvement
[Diagram: pipeline (Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize) annotated with:]
- Super-pipelined ALU design
- Very high-speed arithmetic units
- Speculative OoO execution
- Criticality-based data caching
- Aggressive data prefetching
How Deep Can You Go?
[Chart: frequency, CPI, performance, and power vs. pipeline depth (1 to ~100 stages); performance peaks near a depth of ~57 stages while power keeps rising]
Source: Intel Corporation [Ed Grochowski, 7/6/01]
How Much ILP Is There?
  Study                          Measured ILP
  Weiss and Smith [1984]         1.58
  Sohi and Vajapeyam [1987]      1.81
  Tjaden and Flynn [1970]        1.86
  Tjaden and Flynn [1973]        1.96
  Uht [1986]                     2.00
  Smith et al. [1989]            2.00
  Jouppi and Wall [1988]         2.40
  Johnson [1991]                 2.50
  Acosta et al. [1986]           2.79
  Wedig [1982]                   3.00
  Butler et al. [1991]           5.8
  Melvin and Patt [1991]         6
  Wall [1991]                    7
  Kuck et al. [1972]             8
  Riseman and Foster [1972]      51
  Nicolau and Fisher [1984]      90
SPECint95 Landscape
[Chart: landscape of microprocessor families, SPECint95/MHz (0 to 0.08) vs. frequency (80 to 1000 MHz), with curves of constant SPECint95 (5 to 60) overlaid; Alpha (21064, 21164, 21264), AMD-x86 (Athlon), and Intel-x86 (Pentium, Pentium Pro, PII, PIII) families plotted. Bryan Black; data source: www.spec.org]
SPECint2000 Landscape
[Chart: landscape of microprocessor families, SPECint2000/MHz (0 to 1) vs. frequency (0 to 2500 MHz), with curves of constant SPECint2000 (25 to 800) overlaid; Alpha (21264A/B/C), AMD-x86 (Athlon), Intel-x86 (PIII-Xeon, P4), PowerPC (604e), Sparc (Sparc-III), and IPF (Itanium) families plotted. Bryan Black; data source: www.spec.org]
Parallelism in Transition
[Chart: MIPS (log scale, 1 to 1,000,000) vs. year (1980-2010): Pentium architecture (superscalar), Pentium Pro architecture (speculative out-of-order), Pentium 4 architecture (trace cache), future Xeon architecture (multi-threaded, multi-core); the era of instruction parallelism gives way to the era of thread parallelism]
Summary
- Performance demand continues
  - 5-10 billion transistors by 2010
  - 10-20 GHz by 2010
- Challenge is power and efficiency
  - Power dissipation, delivery, density
  - New clever/efficient implementations
- New frontiers to explore
  - Synergism of ILP, TLP, and MLP
  - Semi-custom microarchitectures