High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs


High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum

Intel's Microarchitecture Research Labs
- USA: California, Oregon, Texas (John Shen)
  - High Frequency Superscalar Processors
  - Helper Threads for SMT and CMP Machines
  - Future Enterprise Server Processors
- Israel: Haifa (Ronny Ronen)
  - Low Power Microarchitecture Techniques
  - Future Mobile High-performance Processors
- Spain: Barcelona (Antonio Gonzalez)
  - Speculative Multithreading for SMT and CMP
  - Clustered Microarchitecture Techniques

Microprocessor Performance Growth in Perspective
- Doubling every 18 months (1982-2000): total of 3,200X
  - Cars travel at 176,000 MPH; get 64,000 miles/gal.
  - Air travel: L.A. to N.Y. in 5.5 seconds (Mach 3200)
  - Wheat yield: 320,000 bushels per acre
- Doubling every 24 months (1971-2001): total of 36,000X
  - Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  - Air travel: L.A. to N.Y. in 0.5 seconds (Mach 36,000)
  - Wheat yield: 3,600,000 bushels per acre
Unmatched by any other industry!

Iron Law of Microprocessor Performance

\[
\frac{1}{\text{Processor Performance}} = \frac{\text{Time}}{\text{Program}}
= \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst. count}}
\times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}}
\times \underbrace{\frac{\text{Time}}{\text{Cycle}}}_{\text{cycle time}}
\]

\[
\text{Processor Performance} = \frac{\text{IPC} \times \text{GHz}}{\text{inst. count}}
\]
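As a rough illustration of the Iron Law, the sketch below computes execution time and performance from instruction count, CPI (or IPC), and clock frequency. The function names and the sample workload (one billion instructions at 1.5 GHz) are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
def execution_time_s(inst_count: float, cpi: float, freq_ghz: float) -> float:
    """Time/Program = inst. count x CPI x cycle time, in seconds."""
    cycle_time_s = 1.0 / (freq_ghz * 1e9)
    return inst_count * cpi * cycle_time_s


def performance(inst_count: float, ipc: float, freq_ghz: float) -> float:
    """Performance = (IPC x GHz) / inst. count, in programs per nanosecond."""
    return ipc * freq_ghz / inst_count


# Hypothetical workload: 1e9 instructions on a 1.5 GHz machine.
base_time = execution_time_s(1e9, cpi=1.0, freq_ghz=1.5)
better_time = execution_time_s(1e9, cpi=0.8, freq_ghz=1.5)

# Lowering CPI from 1.0 to 0.8 at fixed frequency and instruction count.
speedup = base_time / better_time

print(f"base: {base_time:.3f} s, improved: {better_time:.3f} s, speedup: {speedup:.2f}x")
```

Because the three factors multiply, a gain in one (for example, higher GHz) only helps if it does not cost more in another (for example, higher CPI).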

Performance Improvement Techniques
- Increase GHz
  - Process Technology
  - Circuit Techniques
  - Pipelining and Caches
- Increase IPC (Reduce CPI)
  - Superscalar Pipelines
  - Out-of-order Execution
  - Cache Miss Reduction
- Decrease Instruction Count
  - Compiler Optimization
  - Architecture Extensions
  - Microarchitecture Techniques

SPECint92 Landscape

P6 vs. Pentium 4 Pipelines
Basic P6 Pipeline (intro at 733 MHz, 0.18µ), 10 stages:
  1 Fetch | 2 Fetch | 3 Decode | 4 Decode | 5 Decode | 6 Rename | 7 ROB Rd | 8 Rdy/Sch | 9 Dispatch | 10 Exec
Basic Pentium 4 Processor Pipeline (intro at 1.5 GHz, 0.18µ), 20 stages:
  1-2 TC Nxt IP | 3-4 TC Fetch | 5 Drive | 6 Alloc | 7-8 Rename | 9 Que | 10-12 Sch | 13-14 Disp | 15-16 RF | 17 Ex | 18 Flgs | 19 Br Ck | 20 Drive

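Hyper Pipelined Technology
[Figure: frequency vs. introduction time for three micro-architectures, labeled by pipeline depth. P5 Micro-Architecture (5 stages): 60 MHz to 233 MHz. P6 Micro-Architecture (10 stages): 166 MHz to 1 GHz. NetBurst Micro-Architecture (20 stages): 1.5 GHz at introduction.]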

Deeper and Wider Pipelines
[Figure: two pipelines with stages Fetch, Decode, Dispatch, Execute, Memory, Retire, one shallow and one deeper and wider, each annotated with its Branch Penalty, Load Penalty, and ALU Penalty loops.]

Pipelining Penalty Loops
- Branch Penalty: Branch predictor
  - CPI overhead: Branch% x Misprediction% x PipeDepth
  - Performance lost: CPI overhead x PipeWidth
- Load Penalty: Cache hierarchy
  - CPI overhead: Load% x AvgLoadLatency
  - Average Load Latency: Σ Cache(i)Hit% x Cache(i)Latency
- ALU Penalty: Forwarding paths and super-pipelining
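The branch-penalty formulas on this slide can be sketched as below. This is a minimal illustration, assuming PipeDepth stands in for the misprediction penalty in cycles; the function names, the 3-wide machine, and the sample percentages are hypothetical.

```python
def branch_cpi_overhead(branch_frac: float, mispredict_rate: float, pipe_depth: int) -> float:
    """CPI overhead = Branch% x Misprediction% x PipeDepth.

    Assumption: pipe_depth approximates the cycles lost per misprediction.
    """
    return branch_frac * mispredict_rate * pipe_depth


def issue_slots_lost(cpi_overhead: float, pipe_width: int) -> float:
    """Performance lost = CPI overhead x PipeWidth (issue slots per instruction)."""
    return cpi_overhead * pipe_width


# Hypothetical inputs: 20% branches, 5% mispredicted, 20-stage pipeline, 3-wide issue.
overhead = branch_cpi_overhead(0.20, 0.05, 20)
lost = issue_slots_lost(overhead, 3)

print(f"branch CPI overhead: {overhead:.2f}, issue slots lost per instruction: {lost:.2f}")
```

The second function shows why width compounds the cost of depth: every stall cycle on a wider machine wastes more issue slots.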

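Branch Prediction
[Figure: block diagram of a speculative front end. The Branch Predictor supplies a speculative condition (prediction) and a speculative target; the fetch-address mux (FA-mux) selects the next PC (npc) sent to the I-cache, with the sequential path npc(seq.) = PC+4. Pipeline: Fetch, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations, Issue, Execute (Branch unit), Finish, Completion Buffer. The branch unit feeds a BTB update (target address and history) back to the predictor.]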

Branch Prediction Technology
- Basic 2-bit Local History Predictor
  - ~80% prediction accuracy
  - ~25 instructions/mispredict
  - ~5 cycles/25 instructions (0.2 CPI)
- Two-Level Correlated Predictor (P6)
  - ~90% prediction accuracy
  - ~50 instructions/mispredict
  - ~10 cycles/50 instructions (0.2 CPI)
- Current State of the Art (Pentium 4)
  - ~95% prediction accuracy
  - ~100 instructions/mispredict
  - ~20 cycles/100 instructions (0.2 CPI)
- Current Research Challenge (2008)
  - ~98% prediction accuracy
  - ~250 instructions/mispredict
  - ~25 cycles/250 instructions (0.1 CPI)
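The figures on this slide can be reproduced with a short sketch. It assumes one branch every five instructions (a 20% branch fraction); that value is not stated on the slide, but it is the fraction that makes the listed instructions-per-mispredict numbers come out. The function and variable names are hypothetical.

```python
def insts_per_mispredict(accuracy: float, branch_frac: float = 0.20) -> float:
    """Average instructions between mispredictions.

    Assumption: branch_frac = 0.20 is inferred from the slide's numbers, not stated there.
    """
    return 1.0 / (branch_frac * (1.0 - accuracy))


def branch_cpi(accuracy: float, penalty_cycles: int, branch_frac: float = 0.20) -> float:
    """CPI overhead = misprediction penalty / instructions per mispredict."""
    return penalty_cycles / insts_per_mispredict(accuracy, branch_frac)


# (name, prediction accuracy, misprediction penalty in cycles) as listed on the slide.
predictors = [
    ("2-bit local history", 0.80, 5),
    ("Two-level correlated (P6)", 0.90, 10),
    ("Pentium 4", 0.95, 20),
    ("Research challenge", 0.98, 25),
]

for name, accuracy, penalty in predictors:
    print(f"{name:26s} {insts_per_mispredict(accuracy):5.0f} insts/mispredict"
          f"  {branch_cpi(accuracy, penalty):.2f} CPI")
```

The pattern the slide points at is that better predictors have so far only offset deeper pipelines: accuracy rose from 80% to 95% while the CPI overhead stayed near 0.2.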

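Data Cache and Prefetching
[Figure: block diagram of a superscalar pipeline with prefetching. Front end: Branch Predictor, I-cache, Decode Buffer, Decode, Dispatch Buffer, Dispatch, Reservation Stations. Execution units: branch, integer, integer, floating point, store, load. Back end: Completion Buffer, Complete, Store Buffer, Data Cache, Main Memory. A Memory Reference Prediction unit feeds a Prefetch Queue that issues prefetches to the Data Cache.]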

Cache Hierarchy Technology
- Current Commercial Workload (6 cycles/load)
  - L1 Hits: 80% x 2 cycles = 1.6
  - L2 Hits: 15% x 10 cycles = 1.5
  - L3 Hits: 4% x 30 cycles = 1.2
  - Memory: 1% x 150 cycles = 1.5
- Future Commercial Workload (17 cycles/load)
  - L1 Hits: 80% x 4 cycles = 3.2
  - L2 Hits: 15% x 20 cycles = 3.0
  - L3 Hits: 4% x 60 cycles = 2.4
  - Memory: 1% x 800 cycles = 8.0
- Current Research Challenge (5 cycles/load)
  - Efficient and judicious caches
  - Load partitioning and specialized caching
  - Aggressive memory prefetching
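The per-load averages on this slide follow from the weighted sum Σ Cache(i)Hit% x Cache(i)Latency. A minimal sketch, using the slide's hit fractions and latencies; the function names and the 25% load fraction in the last step are hypothetical.

```python
def avg_load_latency(levels: list[tuple[float, int]]) -> float:
    """Average load latency = sum over levels of (hit fraction x latency in cycles)."""
    return sum(hit_frac * latency for hit_frac, latency in levels)


def load_cpi_overhead(load_frac: float, avg_latency: float) -> float:
    """CPI overhead = Load% x AvgLoadLatency."""
    return load_frac * avg_latency


# (hit fraction, latency in cycles) for L1, L2, L3, and memory, as listed on the slide.
current = [(0.80, 2), (0.15, 10), (0.04, 30), (0.01, 150)]
future = [(0.80, 4), (0.15, 20), (0.04, 60), (0.01, 800)]

current_latency = avg_load_latency(current)  # the slide rounds this to 6 cycles/load
future_latency = avg_load_latency(future)    # the slide rounds this to 17 cycles/load

# Hypothetical load fraction of 25%, only to show how the overhead scales.
print(f"current: {current_latency:.1f} cycles/load, "
      f"CPI overhead at 25% loads: {load_cpi_overhead(0.25, current_latency):.2f}")
print(f"future:  {future_latency:.1f} cycles/load, "
      f"CPI overhead at 25% loads: {load_cpi_overhead(0.25, future_latency):.2f}")
```

In the future workload the 1% of loads that reach memory contribute almost half of the average latency, which is why the slide singles out prefetching and specialized caching.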

Memory Latency Bottleneck
[Figure: latency in clocks (log scale, 1 to 1000) for L1, L2, L3, and external memory across Pentium, Pentium Pro, Pentium III, and future processors. Labels: Cache Latency (Clocks), Instruction Cost, External Memory Latency, with values of 400 and 800 marked.]
Prefetching:
- Hardware: Limited by predictable patterns
- Software: Limited by single control flow
- Research Challenge: Pointer-intensive code

Frequency vs. Parallelism
- Increase Frequency (GHz)
  - Deeper Pipelines
  - Increases Branch/Load penalties
  - Lowers IPC
- Increase Instruction Parallelism (IPC)
  - Wider Pipelines
  - Increases Complexity
  - Lowers GHz

Front-End Pipe-Depth Penalty
[Figure: two views of the pipeline Fetch, Decode, Dispatch, Execute, Memory, Retire. The first marks Front-End Contraction and Back-End Optimization; the second adds an Optimize stage after Retire.]

Alleviate Pipe-Depth Penalty
- Front-End Contraction
  - Code Re-mapping and Caching
  - Trace Construction, Caching, Optimization
  - Leverage Back-End Optimizations
- Back-End Optimization
  - Multiple-Branch, Trace, Stream, Prediction
  - Code Reordering, Alignment, Optimization
  - Pre-decode, Pre-rename, Pre-scheduling
  - Memory Pre-fetch Prediction and Control

Execution Core Improvement
[Figure: pipeline Fetch, Decode, Dispatch, Execute, Memory, Retire, Optimize, with the execution core highlighted.]
- Super-pipelined ALU design
- Very high-speed arithmetic units
- Speculative OoO execution
- Criticality-based data caching
- Aggressive data pre-fetching

How Deep Can You Go?
[Figure: Frequency, CPI, Performance, and Power plotted against Pipeline Depth (1 to 99 stages); a depth of 57 is marked with a question mark. Source: Intel Corporation, Ed Grochowski, 7/6/01.]

How Much ILP Is There?

Study                         Reported ILP
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7
Kuck et al. [1972]            8
Riseman and Foster [1972]     51
Nicolau and Fisher [1984]     90

SPECint95 Landscape
[Figure: Landscape of Microprocessor Families. SPECint95/MHz (0 to 0.08) vs. Frequency (80 to 980 MHz), with iso-performance curves for SPECint95 scores from 5 to 60. Families plotted: Intel-x86 (P, PPro, PII, PIII), AMD-x86 (Athlon), Alpha (064, 164, 264). Credit: Bryan Black. Data source: www.spec.org]

SPECint2000 Landscape
[Figure: Landscape of Microprocessor Families. SPECint2000/MHz (0 to 1) vs. Frequency (0 to 2500 MHz), with iso-performance curves for SPECint2000 scores from 25 to 800. Families plotted: Intel-x86 (PIII-Xeon, P4), AMD-x86 (Athlon), Alpha (264A, 264B, 264C), PowerPC (604e), Sparc (Sparc-III), IPF (Itanium). Credit: Bryan Black. Data source: www.spec.org]

Parallelism in Transition
[Figure: MIPS (log scale, 1 to 1,000,000) vs. year (1980 to 2010). Era of Instruction Parallelism: Pentium Architecture (Super Scalar), Pentium Pro Architecture (Speculative Out of Order), Pentium 4 Architecture (Trace Cache). Era of Thread Parallelism: Xeon Architecture (Multi-Threaded), Future (Multi-Threaded, Multi-Core).]

Summary
- Performance Demand Continues
  - 5-10 billion transistors by 2010
  - 10-20 GHz by 2010
- Challenge Is Power and Efficiency
  - Power dissipation, delivery, density
  - New clever/efficient implementations
- New Frontiers to Explore
  - Synergism of ILP, TLP, and MLP
  - Semi-Custom Microarchitectures