The Future Evolution of High-Performance Microprocessors


The Future Evolution of High-Performance Microprocessors. Norm Jouppi, HP Labs. © 2005 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Keynote Overview (11/15/2005): What, Why, How, and When of Evolution; Microprocessor Environmental Constraints; The Power Wall; Power: From the Transistor to the Data Center; The Evolution of Computer Architecture Ideas; Summary

Disclaimer: These views are mine, not necessarily HP's. "Never make forecasts, especially about the future" (Samuel Goldwyn)

What is Evolution? Definition: "a process of continuous change from a lower, simpler, or worse to a higher, more complex, or better state"; an unfolding

Why Evolution? Evolution is a very efficient means of building new things: reuse, recycle; a minimum of new stuff. Much easier than revolution

When Evolution? Can be categorized into eras, periods, etc.

Technology: Usually evolution, not revolution. Many revolutionary technologies have a bad history: bubble memories, Josephson junctions, anything-but-Ethernet, etc. Moore's Law has been a key force driving evolution of the technology

Moore's Law: Originally presented in 1965. Number of transistors per chip is 1.59^(year-1959) (originally 2^(year-1959)). Classical scaling theory (Dennard, 1974): with every feature-size scaling of n, you get O(n²) transistors, and they run O(n) times faster. Subsequently proposed: Moore's Design Law (Law #2), Moore's Fab Law (Law #3)
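The growth and scaling rules on this slide can be sketched numerically. This is an illustrative aid, not code from the talk; the rates are the ones quoted above.

```python
# Illustrative sketch of the slide's scaling rules; not code from the talk.

def moore_transistors(year, rate=1.59):
    """Relative transistors per chip, growing as rate^(year - 1959)."""
    return rate ** (year - 1959)

def dennard_scaling(n):
    """Classical (Dennard) scaling: with feature-size scaling of n,
    you get O(n^2) transistors that run O(n) times faster."""
    return {"transistors": n ** 2, "speed": n}

# At 1.59x/year, a decade multiplies the transistor budget by ~103x:
print(round(moore_transistors(1975) / moore_transistors(1965)))  # 103

# Halving feature size (n = 2): 4x transistors at 2x speed.
print(dennard_scaling(2))  # {'transistors': 4, 'speed': 2}
```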

Microprocessor Efficiency Eras (Jouppi): Moore's Law says the number of transistors scales as O(n²) and speed as O(n), so microprocessor performance should scale as O(n³). [Figure: log performance vs. number of transistors, marking the N³, N², and N¹ eras, with efficiency falling from N⁰ toward N⁻¹.]

N³ Era: n from device speed, n² from transistor count. 4004 to 386: expansion of data-path widths from 4 to 32 bits, basic pipelining, hardware support for complex ops (FP multiply), memory range and virtual memory. Hard to measure performance; measured in MIPS. But how many 4-bit ops = one 64-bit FP multiply? More than 1500!

N² Era: n from device speed, but only n from the n² transistors. 486 through Pentium III/IV. Era of large on-chip caches: miss rate halves per quadrupling of cache size. Superscalar issue: 2X performance from quad issue? Superpipelining. Diminishing returns
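The cache rule of thumb quoted above (miss rate halves per quadrupling of cache size) is the classic square-root rule. A minimal sketch, with the base miss rate and cache size as assumed illustration values, not measurements:

```python
# Square-root rule of thumb: miss rate ~ 1/sqrt(cache size).
# base_rate and base_kb are assumed numbers for illustration only.

def miss_rate(cache_kb, base_rate=0.10, base_kb=16):
    """Quadrupling the cache halves the miss rate."""
    return base_rate * (base_kb / cache_kb) ** 0.5

print(miss_rate(16))   # 0.05... no: 0.10 at the baseline size
print(miss_rate(64))   # 0.05  (4x the cache, half the misses)
print(miss_rate(256))  # 0.025
```

This diminishing return is one reason the N² era could spend quadratically growing transistor budgets on caches and still only gain linear performance.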

N¹ Era: n from device frequency, but only a constant factor from the n² transistor count. Very wide-issue machines: little help to many apps; need SMT to justify. Increased complexity and size cause too much slowdown: long global wires, structure access times go up, time to market

Environmental Constraints on Microprocessor Evolution. Several categories: technology scaling (economics, devices, voltage scaling) and system-level constraints (power)

Supply Economics: Moore's Fab Law. Fab cost scales as 1/(feature size). 90nm fabs currently cost 1-2 billion dollars; few can afford one by themselves (except Intel), hence fabless startups, fab partnerships (IBM/Toshiba/etc.), and large foundries. But the number of transistors scales as 1/(feature size)², so transistors are still getting cheaper, and still getting faster

Supply Economics: Moore's Design Law. The number of designers goes as O(1/feature): the 4004 (10µm) had 3 designers; recent 90nm microprocessors have ~300. Implication: design cost becomes very large, driving consolidation in the number of viable microprocessors. Microprocessor cores are often reused, since it is too much work to design from scratch, but shrinks and tweaks are becoming difficult in deep-submicron technologies

Devices: Transistors historically get faster with shrinking feature size, but they are getting much leakier: gate leakage (fix with high-k gate dielectrics) and channel leakage (dual-gate or vertical transistors?). Even CMOS has significant static power, roughly proportional to the number of transistors, and static power is approaching dynamic power. Static power also increases with chip temperature; positive feedback is bad

Voltage Scaling: High-performance MOS started out with a 12V max. Voltage scales roughly as sqrt(feature size). Power is CV²f, so lower voltage reduces power as the square, but speed goes down with lower voltage. Current high-performance microprocessors have 1.1V supplies: a (12/1.1)² = 119X power reduction over 24 years!
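The slide's voltage arithmetic follows directly from P = CV²f. A hedged check, holding C and f fixed to isolate the voltage term, and also computing the remaining headroom down to the ITRS-projected 0.7V floor mentioned on the next slide:

```python
# Arithmetic check of the slide's numbers: dynamic power P = C * V^2 * f.
# C and f are held at 1 to isolate the voltage contribution.

def dynamic_power(C, V, f):
    """Dynamic switching power in watts (C in farads, V in volts, f in Hz)."""
    return C * V ** 2 * f

# Power reduction from voltage scaling alone, 12V down to 1.1V:
reduction = dynamic_power(1, 12.0, 1) / dynamic_power(1, 1.1, 1)
print(round(reduction))  # 119

# Remaining headroom if voltage bottoms out at 0.7V:
headroom = (1.1 / 0.7) ** 2
print(round(headroom, 1))  # 2.5
```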

Limits of Voltage Scaling: Beyond a certain point of voltage scaling, transistors don't turn off. The ITRS projects a minimum voltage of 0.7V in 2018, limited by threshold-voltage variation, etc. But high-performance microprocessors are now at 1.1V, so only a (1.1/0.7)² = 2.5X reduction is left over the next 14 years!

System-Level Power: The per-chip power envelope is peaking (Itanium 2 @ 130W => Montecito @ 100W). 1U servers and blades reduce heat-sink height. The cost of power in and heat out over several years can equal the original system cost. Power is a first-class design constraint

Current Microprocessor Power Trends. [Figure source: Shekhar Borkar, "Low Power Design Challenges for the Decade," Proceedings of the 2001 Conference on Asia South Pacific Design Automation, IEEE.]

The Power Wall in Perspective: [Figure: the Memory Wall vs. the Power Wall.]

Pushing Out the Power Wall: "Waste not, want not" (Ben Franklin) for circuits; power-efficient microarchitecture; the single-threaded vs. throughput tradeoff

Waste Not, Want Not for Circuits: Lots of circuit techniques to save power are already in use: clock gating, multiple transistor thresholds, sleeper transistors, etc. Thus circuit-level power dissipation is already fairly efficient. What about architectural savings?

Power-Efficient Microarchitecture: An off-chip memory reference costs a lot of power, hence the drive to more highly-associative on-chip caches; there are limits to how far this can go. Lots of other similar examples (see the PACS workshop/conference proceedings). No silver bullet: limited benefits from each technique

Single-Threaded / Throughput Tradeoff: Reducing transistors per core can yield higher MIPS/W, a move back towards N³ scaling efficiency. Thus, expect a trend to simpler processors: narrower issue width, shallower pipelines, more in-order processors or smaller OOO windows. Back to the future. But this gives lower single-thread performance, so we can't simplify cores too quickly; tradeoffs on what to eliminate are not always obvious. Examples: speculation, multithreading

Speculation: Is speculation uniformly bad? No. Example: branch prediction. Executing down the wrong path wastes performance and power, but stalling at every branch would also hurt performance and power, since circuits leak when not switching. Predicting a branch can save power, and predictor memory takes less power/area than logic. But the current amount of speculation seems excessive

Multithreading: SMT is very useful in wide-issue OOO machines. Good news: it increases power efficiency. Bad news: wide issue is still power-inefficient. Multithreading is useful even in simple machines: during cache misses, transistors still leak, and there isn't enough time to gate power. May only need 2- or 3-thread non-simultaneous MT

Microarchitectural Research Implications: Processors need to get simpler (not more complicated) to become more efficient. More complicated microarchitectural mechanisms with little benefit are not needed

Recap: Possible Future Evolution in Terms of P = CV²f. The formula doesn't change, but the terms do. Power: ~10⁰ (already at system limits; better if lower). Voltage: ~10⁰ (only a factor of 2.5 left over the next 14 years). Clock frequency: ~10⁰ (not scaling with device speed; fewer pipestages mean higher efficiency, moving from 30 to 10 stages from 90nm to 32nm). Capacitance (~area) needs to be repartitioned for higher system performance

Number of Cores Per Die: Scale processor complexity as 1/feature, and the number of cores goes up dramatically, as n³: from 1 core at 90nm to 27 per die at 30nm! But can we efficiently use that many cores?
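The 1-to-27 jump above is cube-law arithmetic; a small sketch using the feature sizes from the slide:

```python
# Sketch of the slide's core-count arithmetic: if per-core complexity is scaled
# down as 1/feature while the transistor budget grows, cores per die grow as n^3.

def cores_per_die(old_feature_nm, new_feature_nm, base_cores=1):
    """n^3 growth in cores when shrinking from old to new feature size."""
    n = old_feature_nm / new_feature_nm
    return base_cores * n ** 3

print(round(cores_per_die(90, 30)))  # 27 cores per die at 30nm, from 1 at 90nm
```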

The Coming Golden Age: We are on the cusp of the golden age of parallel programming; every major application needs to become parallel, and necessity is the mother of invention. How do we make use of many processors effectively? Some apps are inherently parallel (Web, CFD, TPC-C); some applications are very hard to parallelize. Parallel programming is a hard problem, and we can't use large amounts of speculation in software: that just moves the power inefficiency to a higher level

Important Architecture Research Problems (not an exhaustive list). How to wire up CMPs: memory hierarchy? transactional memory? interconnection networks? How to build cores: heterogeneous CMP design, conjoined-core CMPs. Power: from the transistor through the data center

Important Architecture Research #2: Many system-level problems: cluster interconnects, manageability, availability, security. Hardware/software tradeoffs (ASPLOS) are increasingly important

Power: From the Transistor through the Data Center: CACTI 4.0; heterogeneous CMPs; conjoined cores; data-center power management

CACTI 4.0: A collaboration with David Tarjan of UVa, now with a leakage power model. Also includes: much improved technology scaling, circuit updates, scaling to much larger cache sizes, options (serial tag/data, high-speed access, etc.), and parameterized data and tag widths (including no tags). Beta version web interface: http://analog.cs.virginia.edu:81/cacti/index.y — full release coming soon (norm.jouppi@hp.com)

Heterogeneous Chip Multiprocessors: A.k.a. asymmetric, non-homogeneous, synergistic; single-ISA vs. multiple-ISA. Many benefits: power, throughput, mitigating Amdahl's Law. Open questions: the best mix of heterogeneity

Potential Power Benefits (Grochowski et al., ICCD 2004): Asymmetric CMP => 4-6X; further voltage scaling => 2-4X; more gating => 2X; controlling speculation => 1.6X

Clock Gating: Transistors still leak, and busses are still long, slow, and waste power. [Figure key: yellow blocks are inactive, red are active, orange are global busses.]

Animation for CPU scale-down: Combine the most frequently active blocks into a small heterogeneous core instead; power down the large core when not needed

[Figure: performance benefits from heterogeneity (63%).]

Mitigating Amdahl's Law: Amdahl's law says parallel speedups are limited by serial portions. Annavaram et al., ISCA 2005, basic idea: use a big core for serial portions and many small cores for parallel portions. A prototype built from a discrete 4-way SMP, running one socket at regular voltage and the other three at low voltage, achieved a 38% wall-clock speedup under a fixed power budget
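Amdahl's law and the big-core mitigation above can be sketched as follows. The asymmetric formula is an assumed simple model for illustration, not the exact formulation from the Annavaram et al. paper:

```python
# Amdahl's law for a program with serial fraction s on p processors, plus an
# assumed asymmetric variant where the serial part runs on a big core that is
# k times faster than one small core.

def amdahl_speedup(s, p):
    """Classic Amdahl: speedup = 1 / (s + (1 - s) / p)."""
    return 1.0 / (s + (1.0 - s) / p)

def asymmetric_speedup(s, p, k):
    """Serial fraction on a k-times-faster big core; parallel part on p small cores."""
    return 1.0 / (s / k + (1.0 - s) / p)

# With 10% serial code, 16 small cores cap speedup well below 16:
print(round(amdahl_speedup(0.10, 16), 2))        # 6.4
# A 4x-faster big core for the serial portion raises that noticeably:
print(round(asymmetric_speedup(0.10, 16, 4), 2)) # 12.31
```

The second number is why heterogeneity helps: speeding up only the serial fraction attacks the term that dominates as p grows.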

Conjoined Cores: Ideally, provide for the peak resource needs of any single thread without multiplying the cost by the number of cores. [Figure: baseline resource usage vs. peak resource usage.]

Conjoined-core Architecture Benefits: [Figure: % area savings per core (core-area savings, and core-area + crossbar-area savings) vs. % performance degradation, for INT and FP workloads.] Performance degradation is even less with < 8 threads!

Datacenter Power Optimizations: A rack of blades can dissipate 30kW or more!

Data Center Power is Front Page News

Data center thermal management at HP: Power density is also becoming a significant reliability issue. Use 3D modeling and measurement to understand the thermal characteristics of data centers (saving 25% today), and exploit this for dynamic resource allocation and provisioning. (Chandrakant Patel et al.)

A Theory on the Evolution of Computer Architecture Ideas. Conjecture: there is no such thing as a bad or discredited idea in computer architecture, only ideas at the wrong level, place, or time. Reuse, recycle; evolution vs. revolution. Examples: SIMD, dataflow, HLL architectures, capabilities, vectors

SIMD: An efficient way of computing, proposed in the late 60s; the 64-PE Illiac-IV was operational in 1972 but difficult to program. Intel MMX, introduced in 1996, makes efficient use of larger word sizes for small parallel data; it is used in libraries or specialized code, with only a small increase in hardware (<1%)

Dataflow: Widely researched in the late 80s, with lots of problems: complicated machines, a new programming model. Out-of-order (OOO) execution machines developed in the 90s are a limited form of dataflow: issue queues issue instructions when operands are ready, but keep the same instruction set architecture and the same programming model. Still complex internally

High-Level Language Architectures: Popular research in the 1970s and early 1980s ("closing the semantic gap"). A few attempts to implement in hardware failed: machines interpreted HLLs in hardware. Now we have Java interpreted in software, with JIT compilers and portability; a modest performance loss doesn't matter for some apps

Capabilities: A popular research topic in the 1970s. The Intel 432 implemented capabilities in hardware on every memory reference; performance was poor, and the idea died with the 432. Security is increasingly important: capabilities at the file-system level, combined with standard memory-protection models, have much less overhead. Virtual machine support

Lessons from Idea Evolution Theory: Don't be afraid to look at past ideas that didn't work out as a source of inspiration. Some ideas may be successful if reinterpreted at a different level, place, or time, when they can be made more evolutionary than revolutionary

Conclusions: The Power Wall is here. It is the wall to worry about, and it has dramatic implications for the industry, from the transistor through the data center. We need to reclaim past efficiencies: microarchitectural complexity needs to be reduced. The power wall will usher in the Golden Age of Parallel Programming. There is much open research in architecture, and it may be time to reexamine some previously discarded architecture ideas

Thanks