How many cores are too many cores? Dr. Avi Mendelson, Intel - Mobile Processors Architecture group


How many cores are too many cores?
Dr. Avi Mendelson, Intel - Mobile Processors Architecture group (avi.mendelson@intel.com)

Disclaimer
- No Intel proprietary information is disclosed.
- Every future estimate or projection is only a speculation.
- Responsibility for all analyses, opinions, and conclusions falls on the author alone. That does not mean you cannot trust them.
Dr. Avi Mendelson - Talk in IBM seminar/HiPEAC 2007

Agenda
- What is the motivation for the new generation of parallel processors on a die
- Comparing many-core vs. multi-core vs. asymmetric-core approaches
- My conclusions and future directions

Motivation for the new trend

The good old days of computer architecture
Performance used to come from process technology and new single-threaded architectural improvements:
- Every 18-24 months a new process is announced. The target of any new process is to shrink the transistor dimensions by 0.7 (ideal shrink). As a result, on the same die area we can have more transistors, each running at a higher frequency.
- Every 5-7 years we had a new architecture generation, which used to target better single-threaded performance.

Simple process scaling rules
If optimal scaling could be achieved (0.7 shrink), we can:
- Double the number of transistors on the same area
- Improve the frequency (performance) by ~50%
- Consume the same power for the same area
- Preserve the power density

Ideal scenarios
- Ideal shrink, same μarch: 1X #transistors, 0.5X die size, 1.5X frequency, 0.5X power, 1X IPC (instr./cycle), 1.5X performance, 1X power density.
- Ideal shrink, new μarch, same die size: 2X #transistors, 1X size, 1.5X frequency, 1X power, 2X IPC, 3X performance, 1X power density.
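The ideal-shrink arithmetic above can be checked in a few lines. A 0.7x linear shrink gives roughly 0.5x area per transistor, so a fixed-size die holds about 2x the transistors, while frequency improves by roughly 1/0.7 ≈ 1.4x, which the slides round to 1.5x. This is a first-order sketch following the slide's numbers, not a detailed device model:

```python
# Ideal 0.7x linear shrink, per the slides' first-order numbers.
shrink = 0.7

area_per_transistor = shrink ** 2                # ~0.49x: each transistor needs about half the area
transistors_same_die = 1 / area_per_transistor   # ~2x transistors fit on the same die size
frequency_gain = 1 / shrink                      # ~1.43x, rounded to "1.5x" on the slide

print(f"transistors on same die: {transistors_same_die:.2f}x")
print(f"frequency: {frequency_gain:.2f}x")
```

The "same power, same power density" rows then follow from voltage also scaling down with the process (the classic Dennard-scaling assumption the next slide shows breaking down).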

Process technology: the reality
But in reality, the new process is not ideal anymore:
- New designs squeeze frequency up to 2X per process generation (Moore's law)
- New designs use more transistors (2X-3X to get 1.5X-1.7X performance)
- The die size increases
So, with every new process and architecture generation:
- Power goes up about 2X
- Power density goes up 30%-80%
This is bad, and it will get worse in future process generations:
- Voltage (Vdd) will scale down less
- No ideal shrinking anymore
- Leakage is going through the roof, and it depends heavily on temperature

Power density
[Chart: power density (W/cm²) per process generation, 1.5μ down to 0.07μ, on a log scale from 1 to 1000. The trend runs from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, crossing the "hot plate" level and heading toward "nuclear reactor" and "rocket nozzle" densities.]
* "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

Outcome
We can't build microprocessors with ever-increasing power density and die sizes. The constraint is power and power density, not manufacturability. The design of any future microprocessor must take power into consideration, and we need to distinguish between different aspects of power:
- Max power (TJ)
- Power density (hot spots)
- Energy: static + dynamic
To achieve better single-threaded performance within the same power envelope, we need new micro-architecture innovation.

Single-threaded performance is too difficult to improve within the same power envelope, so let's go parallel. Intel Core Duo (Yonah) was the first CMP processor Intel developed for the mobile market. The following processor, Core 2 Duo (Merom), is used across all the different segments. (More information can be found in the Intel Technology Journal, May 2006.)

Theoretical comparison: single core vs. CMP power at the same performance
- In theory, power increases with the cube of the frequency (2.5 is a more realistic exponent).
- If we assume that frequency approximates performance:
- Doubling performance by increasing frequency grows power super-linearly (as f^2.5 to f^3).
- Doubling performance by adding another core grows power only linearly.
Conclusion: as long as enough parallelism exists, it is always more efficient to double the number of cores rather than the frequency in order to achieve the same performance.
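The frequency-vs-cores trade-off above is easy to make concrete. The sketch below follows the slide's assumptions only (performance ∝ frequency, power ∝ f^2.5 per core, perfect parallelism); the constant k and the units are illustrative:

```python
# Power needed to double performance, under the slide's assumptions:
# perf ~ f, dynamic power ~ f**2.5 per core, perfect parallelism.

def power(freq, cores=1, k=1.0):
    """Total power of `cores` identical cores running at frequency `freq`."""
    return cores * k * freq ** 2.5

base = power(1.0)                    # one core at frequency f
double_freq = power(2.0)             # one core at 2f -> 2x performance
double_cores = power(1.0, cores=2)   # two cores at f -> 2x performance (if parallel)

print(f"baseline:      {base:.2f}")
print(f"2x frequency:  {double_freq:.2f}")   # 2**2.5 ~ 5.66x the power
print(f"2x cores:      {double_cores:.2f}")  # only 2x the power
```

With the cube law of the "in theory" case the gap is even wider: 8x power for frequency doubling versus 2x for core doubling.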

CPU architecture - multicores
[Chart: performance vs. power. Uniprocessors hit the power wall, while CMP designs keep scaling performance with power, at some MP overhead.] Uniprocessors have lower power efficiency due to higher speculation and complexity. Source: Tomer Morad, Ph.D. student, Technion.

Can we partition the system into more cores forever?
There are at least three camps in the computer-architecture community:
- Multi-cores: the system will continue to contain a small number of big cores (Intel, AMD, IBM, etc.)
- Many-cores: the system will contain a large number of small cores (Sun)
- Asymmetric cores: a combination of a small number of big cores together with a large number of small cores (IBM Cell architecture)

How many cores are too many cores? A simple analytical model
Basic assumptions:
- The performance of a core is proportional to the square root of its area (Pollack's rule).
- The temperature of each core is proportional to its power, and the overall thermal capacitance combines the per-core contributions (1/T = Σᵢ 1/Tᵢ).
- Power is work done per time unit (Watts). Active power: P = αCV²f (α: activity, C: capacitance, V: voltage, f: frequency).
- Static power is outside the scope of this model.
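The two central assumptions can be written directly in code. This is a minimal sketch of the model's building blocks; the function names and unit-free constants are mine, not from the talk:

```python
def dynamic_power(alpha, capacitance, voltage, freq):
    """Active power P = alpha * C * V^2 * f."""
    return alpha * capacitance * voltage ** 2 * freq

def pollack_perf(area):
    """Pollack's rule: core performance ~ sqrt(core area)."""
    return area ** 0.5

# A core with 4x the area yields only ~2x the single-thread performance...
assert pollack_perf(4.0) == 2 * pollack_perf(1.0)
# ...while halving the voltage cuts active power to a quarter.
assert dynamic_power(1, 1, 0.5, 1) == dynamic_power(1, 1, 1.0, 1) / 4
```

These diminishing returns from area, combined with the quadratic leverage of voltage, are exactly what makes many smaller, slower cores attractive in the next slide's derivation.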

Small cores vs. big cores
Suppose the same area A is partitioned into m smaller processors. Each core has an area of A/m, so the aggregate performance is m·sqrt(A/m) = sqrt(m)·sqrt(A), and the total power is K·(C/m)·F^2.5·m = K·C·F^2.5, the same as before. For the same power we get much better performance.
Suppose instead we want to achieve the same performance: each core can run at F/m.
P_new = m·K·(C/m)·(F/m)^2.5 = K·C·(F/m)^2.5
P_old/P_new = m^2.5
So as m grows, power drops toward 0 while performance is preserved. BUT:
- As F drops, single-thread performance drops with it
- Interconnect does not scale
- Static power matters
- If we try to build it, the small per-core die allows only simple architectures
- Memory bandwidth and latency do not scale
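The P_old/P_new = m^2.5 result can be verified numerically. A sketch under the slide's model only (per-core capacitance C/m, power ∝ C·f^2.5, aggregate performance preserved by running m cores at F/m):

```python
def cmp_power(m, freq, cap=1.0, k=1.0):
    """Total power of m cores, each with capacitance cap/m, at `freq`,
    using the slide's P ~ K * C * f**2.5 model."""
    return m * k * (cap / m) * freq ** 2.5

F = 2.0
p_old = cmp_power(1, F)             # one big core at frequency F
for m in (2, 4, 8):
    p_new = cmp_power(m, F / m)     # m cores at F/m -> same aggregate performance
    assert abs(p_old / p_new - m ** 2.5) < 1e-9   # power ratio is exactly m**2.5
```

The loop confirms the algebra; the bullets after the derivation are the reasons this idealized win does not survive contact with interconnect, leakage, and memory.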

Example: caches and I/O traffic
To reduce I/O we need to increase the local memory (caches) correspondingly, so the proportion of memory within the overall logic increases over time. Assume a hierarchical model where, at each node, half of the area is devoted to memory and half to logic, and at most 2 cores can share a memory hierarchy:
Cores: 1, 2, 4, 16, 32
Logic fraction: 1/2, 1/4, 1/4, 1/8, 1/8

Implications
When increasing the number of cores, the active (logic) area of each core is less than 1/m of the die, which further reduces the performance of each core. Another example of the slogan: "The difference between theory and practice is always greater in practice than it is in theory."
Real examples: Opteron: 2M L2, 2x64K L1. K8L: 512Kx4 L2, 4x2x64K L1, shared 2M L3, extendable.

Symmetric multiprocessor model
- Parallelism coefficient: λ
- Area of each core: a
- Number of cores: n
λ denotes the number of dynamic instructions within parallel phases, divided by the total number of dynamic instructions. Fully serial programs have λ=0; fully parallel programs have λ=1.
Source: Tomer Y. Morad, Uri C. Weiser, Avinoam Kolodny, Mateo Valero, and Eduard Ayguadé, "Performance, Power Efficiency, and Scalability of Asymmetric Cluster Chip Multiprocessors", IEEE Computer Architecture Letters, vol. 4, July 2005.
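With λ defined this way, an Amdahl-style speedup estimate for n symmetric cores of area a follows directly. This is a sketch combining λ with Pollack's rule from the earlier slide; the cited paper's exact formulation may differ:

```python
def symmetric_speedup(lam, n, area=1.0):
    """Speedup of n symmetric cores of area `area` vs. one unit-area core.
    The serial fraction (1 - lam) runs on a single core at perf sqrt(area);
    the parallel fraction lam runs on all n cores."""
    core_perf = area ** 0.5          # Pollack's rule: perf ~ sqrt(area)
    serial_time = (1 - lam) / core_perf
    parallel_time = lam / (n * core_perf)
    return 1.0 / (serial_time + parallel_time)

# With lam = 0.75, the serial quarter of the work caps the speedup at 4x,
# no matter how many cores we add.
assert symmetric_speedup(0.75, 1) == 1.0
assert symmetric_speedup(0.75, 1000) < 4.0
```

This saturation is exactly what the λ=0.75 chart on the next slide shows.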

Symmetric CMP performance (λ=0.75)
[Chart: relative performance (0-5) vs. relative power (die size, 0-25), showing the symmetric upper bound and curves for core areas a=8, a=4, a=2, and a=1.]

Asymmetric approach (single application)
- Large core area: βa (β > 1); small cores of area a
- Serial phases execute on the large core; parallel phases execute on all cores
- This model can be extended to other setups, such as GPGPU and parallel execution between the BIG and small cores
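Extending the symmetric estimate to this asymmetric split is straightforward: the serial phase now runs on the faster βa core. Again a sketch under Pollack's rule, with my own function names; the cited Morad et al. model may differ in detail:

```python
def asymmetric_speedup(lam, n, beta, a=1.0):
    """Speedup of one large core (area beta*a) plus n small cores (area a),
    vs. a single unit-area core. Serial phases run on the large core;
    parallel phases run on all cores. Perf ~ sqrt(area) (Pollack's rule)."""
    big = (beta * a) ** 0.5
    small = a ** 0.5
    serial_time = (1 - lam) / big
    parallel_time = lam / (big + n * small)
    return 1.0 / (serial_time + parallel_time)

def symmetric_speedup(lam, n, a=1.0):
    """Same-style estimate for n identical cores of area a, for comparison."""
    perf = a ** 0.5
    return 1.0 / ((1 - lam) / perf + lam / (n * perf))

# One 4x-area core plus 4 small cores vs. 8 equal small cores (both spend
# 8 area units): the big core accelerates the serial phase and wins.
assert asymmetric_speedup(0.75, 4, beta=4) > symmetric_speedup(0.75, 8)
```

This is why the ACCMP curve sits above the symmetric CMP curve in the chart that follows: for any λ < 1, speeding up the serial phase buys more than spending the same area on extra identical cores.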

CPU architecture - multicores, revisited
[Chart: the same performance-vs-power plot as before, now with an ACCMP curve above the symmetric CMP curve; uniprocessors still hit the power wall, with MP overhead relative to the ideal.] Uniprocessors have lower power efficiency due to higher speculation and complexity. Source: Tomer Morad, Ph.D. student, Technion.

Conclusions for symmetric cores
- The sequential portion of the code governs the overall performance.
- Adding cores (in a naïve way) may reduce performance: for the same area, the sequential part will execute more slowly.
- λ decreases as the number of cores increases (more sequential work as a result of the cost of synchronization primitives).
- We need almost perfect parallelism in order to gain from many-core architectures.

How to ease the problem
Reduce the serial part:
- Smart prefetch; e.g., scout threads, helper threads
- Push technology; e.g., Cell pushes the data to the right locations in parallel (DMA) and then executes
- Fast synchronization; can be done in HW or SW
- Change the programming model; e.g., transactional memory
Increase parallelism within the application:
- Easier said than done. We have been there too many times!

So, how many cores are too many cores?
- Many cores will never flourish without a breakthrough in compilers/applications/new programming models, but this can take a looooong time.
- Many cores can be very efficient for special-purpose computing and for special MIPS, since we can tailor the applications to the HW (and the HW to the software).
- Multicore has its limitations, since it cannot promise long-term longevity; but if we can keep improving single-core performance in parallel with moderately increasing the number of cores, we may bridge the time until many cores become reality.
- An asymmetric computer is most appealing as a combination of general-purpose and special-purpose computing. It can also come in the form of FPGA and GPGPU (e.g., CUDA, CTM).

Asymmetric computer - challenges
I assume that asymmetric computers will be used mainly for special purposes: graphics, multimedia, fixed functions. To achieve that, many technologies are still at the research phase:
- Interconnect: most likely NoC (network on chip)
- Memory hierarchy: NUCA, DNUCA, UMA, NUMA?
- Programming model: OpenMP, MPI, CUDA, CTM, other?
- Compilers/debuggers/development environment
- OS: do we need one scheduler, two (as today), or a unified one? Do we need a virtualization layer for that?
I believe that asymmetric computer architectures will generate many new research opportunities for at least the next decade.

Questions?

Multi-core: Intel, AMD, and IBM
- Intel and AMD have dual- and quad-core architectures. For dual core, Intel uses a shared-cache architecture, while AMD introduced a split-cache architecture for the first generation and a shared L3 for the new generation.
- For servers, analysts claim that Intel is building a 16-way Itanium-based processor for the 2009 time frame.
- Power4 has 2 cores, and Power5 has 2 cores + SMT. Analysts claim that IBM is considering moving in the near future to 2 cores + SMT for each of them. The Xbox has 3 Power4 cores.
- All three companies promise to increase the number of cores at a pace that fits the market's needs.

Many-cores: Sun SPARC T1 (Niagara)
- (a) Looking at an in-order machine, each thread has computation time followed by a LONG memory-access time.
- (b) If you put 4 of them on a die, you can overlap I/O, memory access, and computation.
- (c) You can use this approach to extend your system. (The Alewife project did this in the '80s.)

Cell architecture - IBM
[Diagram: small cores and one BIG core connected by a ring-based bus unit.]

Processor power evolution
[Chart: max power (Watts, log scale, 1-100) per process generation, 1.5μ down to 0.13μ, for the i386, i486, Pentium, Pentium w/ MMX technology, Pentium Pro, Pentium II, Pentium III, and Pentium 4.]
Traditionally:
- A new processor generation always increases power
- Using the same micro-architecture on a new process decreases its power