Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Moore's Law
Moore, "Cramming more components onto integrated circuits," Electronics, 1965.
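As a rough back-of-the-envelope illustration (my arithmetic, not from the slide; Moore's 1965 paper projected yearly doubling, later revised to roughly every two years):

```latex
% Assumed two-year doubling period; constants are illustrative.
N(t) = N_0 \cdot 2^{t/2}
% One decade (t = 10 years): N(10)/N_0 = 2^{5} = 32 times the
% transistors in the same die area.
```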

Conventional Processors Stop Scaling
[Figure: performance in ps/inst on a log scale (1e+0 to 1e+7) versus year (1980 to 2020), showing the historical ~50%-per-year improvement flattening out. Source: Bill Dally]

Multi-Core
Idea: Put multiple processors on the same die.
Technology scaling (Moore's Law) enables more transistors to be placed on the same die area.
What else could you do with the die area you dedicate to multiple processors?
Have a bigger, more powerful core
Have larger caches in the memory hierarchy
Simultaneous multithreading
Integrate platform components on chip (e.g., network interface, memory controllers)
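To make the software side of this trade concrete, here is a minimal sketch (mine, not from the slides) of the kind of explicitly threaded code a multi-core chip needs before it can show a speedup; the thread count, array size, and function names are illustrative:

```c
/* Minimal sketch (not from the slides): multi-core hardware only helps
 * if software exposes parallel threads. Sizes and names are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)
#define NTHREADS 4

static double a[N];
static double partial[NTHREADS];

static void *sum_chunk(void *arg) {
    long t = (long)arg;                          /* thread index 0..NTHREADS-1 */
    long lo = t * (N / NTHREADS), hi = lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[t] = s;                              /* each thread owns its own slot */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) a[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_chunk, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);              /* join publishes partial[t] */
        total += partial[t];
    }
    printf("sum = %f\n", total);                 /* expect N * 1.0 */
    return 0;
}
```

Each thread writes only its own partial-sum slot, so no locking is needed; the joins publish the results back to the main thread.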

Why Not a Better Single Core?
Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to the programmer and compiler
- Very difficult to design (scalable algorithms for improving single-thread performance remain elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled up. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable techniques for this remain elusive)

Large Superscalar+OoO vs. Multi-Core
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.

Multi-Core vs. Large Superscalar+OoO
Multi-core advantages:
+ Simpler cores: more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads (fewer context switches)
+ Higher system performance in parallel applications
Multi-core disadvantages:
- Requires parallel tasks/threads to improve performance (parallel programming; see the bound sketched below)
- Resource sharing can reduce single-thread performance
- Shared hardware resources need to be managed
- The number of pins limits the data supply for the increased demand
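The first disadvantage has a standard quantitative form. Amdahl's Law (not on the slide, but the usual way to bound it) limits the speedup from n cores when only a fraction p of the execution is parallelizable; the numbers below are an illustrative example:

```latex
% Amdahl's Law: speedup from n cores, parallel fraction p.
S(n) = \frac{1}{(1 - p) + p/n}
% Example (illustrative): p = 0.9, n = 8 cores:
%   S(8) = 1 / (0.1 + 0.9/8) = 1 / 0.2125 \approx 4.7
% Even with unlimited cores, S(\infty) = 1/(1 - p) = 10.
```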

Large Superscalar vs. Multi-Core
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
Technology push:
- Instruction issue queue size limits the cycle time of the superscalar, OoO processor, diminishing performance
- Quadratic increase in complexity with issue width (sketched below)
- Large, multi-ported register files to support large instruction windows and issue widths: reduced frequency or longer register-file access, diminishing performance
Application pull:
- Integer applications: little parallelism?
- FP applications: abundant loop-level parallelism
- Others (transaction processing, multiprogramming): CMP is a better fit
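A back-of-the-envelope version of the quadratic-complexity claim, in the spirit of the Palacharla et al. analysis (the constants and scaling assumptions here are mine): wakeup logic must compare every result tag broadcast in a cycle against both source tags of every waiting instruction.

```latex
% W = instruction window entries, IW = issue width,
% 2 source tags per entry (assuming a two-source ISA):
\text{tag comparators} \approx 2 \cdot W \cdot IW
% Since W is typically grown along with IW, cost rises roughly
% quadratically with issue width, and the tag-broadcast wires
% lengthen with W, pressuring cycle time.
```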

Comparison Points
[Figure: comparison of the design points, not transcribed]

Why Not Bigger Caches?
Alternative: Bigger caches
+ Improves single-thread performance transparently to the programmer and compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why? (see the rule of thumb below)
- Multiple levels complicate the memory hierarchy
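One hedged way to see the diminishing returns: miss rate often follows an empirical power law in cache capacity (sometimes called the sqrt(2) rule; the exponent is workload-dependent), so each doubling of capacity buys less:

```latex
% Empirical rule of thumb; c and \alpha (~0.5) vary by workload.
m(C) \approx c \cdot C^{-\alpha}
% Doubling capacity with \alpha = 0.5 cuts the miss rate only by
% 2^{-0.5} \approx 0.71, while area, power, and hit latency keep growing.
```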

Cache vs. Core
[Figure: number of transistors devoted to cache vs. microprocessor core over time]

Why Not Multithreading?
Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance with SMT
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: supporting many threads needs bigger register files, more functional units, and a larger issue width (and their associated costs); complexity grows with thread count (see the arithmetic below)
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing in the pipeline and memory system reduces both single-thread and parallel application performance
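A quick arithmetic illustration of the register-file pressure (the ISA and rename-register counts below are assumptions of mine, purely for illustration):

```latex
% T hardware threads, 32 architectural integer registers each,
% plus R rename registers for in-flight instructions:
\text{physical registers} \ge 32 \cdot T + R
% e.g., T = 8 and R = 96 already require 32*8 + 96 = 352 registers,
% before adding the extra ports a wider issue width demands.
```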

Why Not System on a Chip?
Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)

Why Not Clustering?
Alternative: More scalable superscalar, out-of-order engines
Clustered superscalar processors (with multithreading)
+ Simpler to design than a monolithic superscalar; more scalable than simultaneous multithreading (less resource sharing)
+ Can improve both single-thread and parallel application performance
- Diminishing performance returns on a single thread: clustering reduces IPC compared to a monolithic superscalar. Why?
- Parallel performance limited by shared fetch bandwidth
- Difficult to design

Clustering (I)
Palacharla et al., "Complexity-Effective Superscalar Processors," ISCA 1997.

Clustering (II)
Each scheduler is a FIFO:
+ Simpler
+ Can have N FIFOs (out of order with respect to each other)
+ Reduces scheduling complexity
-- More dispatch stalls
Inter-cluster bypass: results produced by an FU in Cluster 0 are not individually forwarded to each FU in another cluster.
Palacharla et al., "Complexity-Effective Superscalar Processors," ISCA 1997.

Clustering (III)
Scheduling within each cluster can be out of order.
Brown, "Reducing Critical Path Execution Time by Breaking Critical Loops," UT-Austin 2005.

Clustered Superscalar+OoO Processors
Clustering (e.g., the Alpha 21264 integer units):
Divide the scheduling window (and register file) into multiple clusters
Instructions are steered into clusters (e.g., based on dependences)
Clusters schedule instructions out of order; within-cluster scheduling can be in-order
Inter-cluster communication happens via register files (no full bypass)
+ Smaller scheduling windows, simpler wakeup algorithms
+ Fewer ports into register files
+ Faster within-cluster bypass
-- Extra delay when instructions require cross-cluster communication
-- Inherent difficulty of the steering logic (see the sketch below)
Kessler, "The Alpha 21264 Microprocessor," IEEE Micro 1999.
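Below is a minimal sketch of dependence-based steering in the spirit of Palacharla et al.; the heuristic and data structures are illustrative, not the 21264's actual logic. An instruction follows the cluster that produced one of its sources; otherwise it goes to the least-loaded cluster.

```c
/* Hedged sketch of dependence-based cluster steering (heuristic and
 * data structures are illustrative, not the Alpha 21264's exact logic). */
#include <stdio.h>

#define NCLUSTERS 2
#define NREGS 32

static int producer_cluster[NREGS]; /* cluster that last wrote each reg; -1 = none */
static int load[NCLUSTERS];         /* instructions steered to each cluster so far */

/* Steer one instruction given its source regs and destination reg (-1 = unused). */
static int steer(int src1, int src2, int dst) {
    int c = -1;
    if (src1 >= 0 && producer_cluster[src1] >= 0)
        c = producer_cluster[src1];              /* follow a dependence if we can */
    else if (src2 >= 0 && producer_cluster[src2] >= 0)
        c = producer_cluster[src2];
    if (c < 0) {                                 /* no producer known: least loaded */
        c = 0;
        for (int i = 1; i < NCLUSTERS; i++)
            if (load[i] < load[c]) c = i;
    }
    load[c]++;
    if (dst >= 0) producer_cluster[dst] = c;     /* consumers of dst prefer cluster c */
    return c;
}

int main(void) {
    for (int r = 0; r < NREGS; r++) producer_cluster[r] = -1;
    /* Tiny dependent chain: r1 = ...; r2 = f(r1); r3 = f(r2). */
    printf("i0 -> cluster %d\n", steer(-1, -1, 1));
    printf("i1 -> cluster %d\n", steer(1, -1, 2));   /* follows i0's cluster */
    printf("i2 -> cluster %d\n", steer(2, -1, 3));   /* follows i1's cluster */
    return 0;
}
```

Note how the dependent chain lands entirely in one cluster: following dependences avoids cross-cluster bypass delays but fights load balance, which is exactly the steering difficulty noted above.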

Why Not Multi-Chip Symmetric Multiprocessors?
Alternative: Traditional symmetric multiprocessors
+ Smaller die size (for the same processing core)
+ More memory bandwidth (no pin bottleneck)
+ Fewer shared resources: less contention between threads
- Long latencies between cores (need to go off chip): shared-data accesses limit performance; parallel application scalability is limited
- Worse resource efficiency due to less sharing: worse power/energy efficiency

Why Multi-Core?
Other alternatives?
Dataflow?
VLIW?
Vector processors (SIMD)?
Streaming processors?
Integrating DRAM on chip?
Reconfigurable logic? (general purpose?)

Review: Multi-Core Alternatives
Bigger, more powerful single core
Bigger caches
(Simultaneous) multithreading
Integrating platform components on chip instead
More scalable superscalar, out-of-order engines
Traditional symmetric multiprocessors
Dataflow? Vector processors (SIMD)? Integrating DRAM on chip? Reconfigurable logic? (general purpose?)
Other alternatives? Your solution?

Why Multi-Core (Cynically)?
Huge investment, and a need to show a return on it
Have to offer some kind of upgrade path
It is easy for the processor manufacturers
But, seriously:
Some easy parallelism: most general-purpose machines run multiple tasks at a time, and some (very important) apps have easy parallelism
Power is a real issue
Design complexity is very costly
Is it the right solution?

Computer Architecture Today (I)
Today is a very exciting time to study computer architecture.
Industry is in a large paradigm shift (to multi-core and beyond): many different potential system designs are possible.
Many difficult problems both motivate and are caused by the shift:
Power/energy constraints: multi-core? accelerators?
Complexity of design: multi-core?
Difficulties in technology scaling: new technologies?
Memory wall/gap
Reliability wall/issues
Programmability wall/problem: single-core?
No clear, definitive answers to these problems.

Computer Architecture Today (II)
These problems affect all parts of the computing stack if we do not change the way we design systems:
Problem
Algorithm
Program/Language
User
Runtime System (VM, OS, MM)
ISA
Microarchitecture
Logic
Circuits
Electrons
No clear, definitive answers to these problems.

Computer Architecture Today (III)
You can revolutionize the way computers are built, if you understand both the hardware and the software (and change each accordingly).
You can invent new paradigms for computation, communication, and storage.
Recommended book: Kuhn, The Structure of Scientific Revolutions (1962)
Pre-paradigm science: no clear consensus in the field
Normal science: dominant theory used to explain things (business as usual); exceptions considered anomalies
Revolutionary science: underlying assumptions re-examined

But, First...
Let's understand the fundamentals.
You can change the world only if you understand it well enough:
Especially the past and present dominant paradigms
And their advantages and shortcomings -- tradeoffs

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Backup Slides

Referenced Readings
Moore, "Cramming more components onto integrated circuits," Electronics, 1965.
Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
Tullsen et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.
Kessler, "The Alpha 21264 Microprocessor," IEEE Micro 1999.
Brown, "Reducing Critical Path Execution Time by Breaking Critical Loops," UT-Austin 2005.
Palacharla et al., "Complexity-Effective Superscalar Processors," ISCA 1997.
Kuhn, The Structure of Scientific Revolutions, 1962.

Related Videos
Multi-Core Systems and Heterogeneity:
http://www.youtube.com/watch?v=lldxt0hpl2u&list=plvngz7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=1
http://www.youtube.com/watch?v=q0zylvnzkrm&list=plvngz7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=2