Slides courtesy of Yong Chen and Xian-He Sun, from the paper "Reevaluating Amdahl's Law in the Multicore Era".

Gene M. Amdahl, "Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities," 1967. Amdahl's law (Amdahl's speedup model):
\[ \mathrm{Speedup}_{\mathrm{Amdahl}} = \frac{1}{(1-f) + \frac{f}{n}}, \qquad \lim_{n \to \infty} \mathrm{Speedup}_{\mathrm{Amdahl}} = \frac{1}{1-f} \]
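A minimal sketch (not from the slides) of how the two expressions above can be evaluated; the function names and the sample values of f are illustrative.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Amdahl's fixed-size speedup for parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)

def amdahl_limit(f: float) -> float:
    """Upper bound on the speedup as the core count grows without bound."""
    return 1.0 / (1.0 - f)

if __name__ == "__main__":
    for f in (0.5, 0.9, 0.975):                  # illustrative parallel fractions
        print(f, amdahl_speedup(f, 1024), amdahl_limit(f))
```

Even at f = 0.975 the bound is 40x regardless of core count, which is the pessimistic reading the later slides push back on.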

Multicore architecture
- Integrates multiple processing units into a single chip
- Overcomes the performance limitations of the uni-processor: pipeline depth (limited ILP), frequency, power consumption
- Cost-effective architecture (Pollack's Rule)
- Adds a new dimension of parallelism
[Figure: a multicore chip; each core has a private L1 cache and cores share L2 caches]

Hill & Marty, Amdahl s Law in the Multicore Era, IEEE Computer 2008 Study the applicability of Amdahl s law to multicore architecture Assumptions n BCEs (Base Core Equivalents) A powerful perf(r) core can be built with r BCEs Analyze performance of three multicore architecture organizations Symmetric, Asymmetric, Dynamic 11/16/2011 Many-Core Computing 5 Symmetric Asymmetric Dynamic

Speedup of the symmetric architecture:
\[ \mathrm{Speedup}_{\mathrm{symmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f\,r}{\mathrm{perf}(r)\,n}} \]
Speedup of the asymmetric architecture:
\[ \mathrm{Speedup}_{\mathrm{asymmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{\mathrm{perf}(r) + n - r}} \]
Speedup of the dynamic architecture:
\[ \mathrm{Speedup}_{\mathrm{dynamic}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{n}} \]
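A small sketch of the three expressions above. Taking perf(r) = sqrt(r) follows Pollack's rule mentioned earlier, the assumption commonly paired with this model; the helper names and the sample (f, n, r) values are illustrative.

```python
import math

def perf(r: float) -> float:
    # Pollack's rule: a core built from r BCEs performs roughly sqrt(r)
    return math.sqrt(r)

def speedup_symmetric(f, n, r):
    # n/r identical perf(r) cores; the sequential part runs on one of them
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f, n, r):
    # one perf(r) core plus (n - r) base cores help with the parallel part
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def speedup_dynamic(f, n, r):
    # r BCEs fuse into one fast core for the sequential part; all n BCEs run the parallel part
    return 1.0 / ((1 - f) / perf(r) + f / n)

if __name__ == "__main__":
    f, n, r = 0.975, 256, 16                     # illustrative values
    print(speedup_symmetric(f, n, r),
          speedup_asymmetric(f, n, r),
          speedup_dynamic(f, n, r))
```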

Hill and Marty's study
- Limited speedup at large scale
- The dynamic architecture delivers a better speedup, but only in an idealized situation with f = 0.975
- Suggests that a large-scale multicore is less interesting
- Current major industry players also cite Amdahl's law

- AMD Bulldozer: 16 cores, 2011
- IBM Cell: 8 slave cores + 1 master core, 2005
- Sun T2: 8 cores, 2007
- Intel Dunnington: 6 cores, 2008

- Kilocore: 256-core prototype, by Rapport Inc.
- NVIDIA Fermi: 512-core GPUs
- GRAPE-DR chip: 512 cores, from Japan
- Quadro FX 3700M: 128 cores, by NVIDIA
[Figure: GRAPE-DR test board]

All had up to 8 processors, citing Amdahl's law:
\[ \lim_{n \to \infty} \mathrm{Speedup}_{\mathrm{Amdahl}} = \frac{1}{1-f} \]
- IBM 7030 Stretch
- IBM 7950 Harvest
- Cray Y-MP
- Cray X-MP (fastest computer 1983-1985)

The scale of today's systems is far beyond the implication of Amdahl's law.

Tacit assumptions in Amdahl's law:
- The problem size is fixed
- The speedup emphasizes time reduction

Gustafson's law, 1988 (fixed-time speedup model). The work grows from 1 = (1-f) + f to (1-f) + n·f:
\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}time}} = (1-f) + n f \]

Sun and Ni's law, 1990 (memory-bounded speedup model):
\[ \mathrm{Speedup}_{\mathrm{memory\text{-}bounded}} = \frac{\text{Sequential time of solving the scaled workload}}{\text{Parallel time of solving the scaled workload}} = \frac{(1-f) + f\,g(n)}{(1-f) + f\,g(n)/n} \]
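A minimal sketch contrasting the two scaled-speedup models; the g functions and the (f, n) values are illustrative. With g(n) = n the Sun-Ni expression reduces to Gustafson's law, and with g(n) = 1 it reduces to Amdahl's law.

```python
def gustafson_speedup(f: float, n: int) -> float:
    """Fixed-time (Gustafson) speedup: work grows from 1 to (1-f) + n*f."""
    return (1 - f) + n * f

def sun_ni_speedup(f: float, n: int, g) -> float:
    """Memory-bounded (Sun-Ni) speedup with work-scaling function g(n)."""
    return ((1 - f) + f * g(n)) / ((1 - f) + f * g(n) / n)

if __name__ == "__main__":
    f, n = 0.9, 64                                     # illustrative values
    print(gustafson_speedup(f, n))
    print(sun_ni_speedup(f, n, g=lambda n: n))         # reduces to Gustafson
    print(sun_ni_speedup(f, n, g=lambda n: n ** 1.5))  # superlinear work scaling
```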

Blue Gene/P system hierarchy (source: ANL ALCF):
- Chip: 4 processors, 850 MHz, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 GB DDR, supports 4-way SMP
- Node card (32 chips, 4x4x2): 32 compute cards, 0-2 I/O cards; 435 GF/s, 64 GB
- Rack: 32 node cards, 1024 chips, 4096 processors; 14 TF/s, 2 TB
- Cabled 8x8x16 petaflops system: 72 racks; 1 PF/s, 144 TB
- Maximum system: 256 racks; 3.5 PF/s, 512 TB
- Front-end node / service node: System p servers; Linux SLES10; HPC software: compilers, GPFS, ESSL, LoadLeveler

Multicore is a way of doing parallel computing on chip:
- Independent ALUs and FPUs
- Independent register files
- Independent pipelines
- Independent memory (private caches)
- Connected with an ultra-high-speed on-chip interconnect

The scalable computing viewpoint applies to multicore:
- Applications demand quicker and more accurate results when possible: video games (e.g. 3-D games), high-quality multi-channel audio, real-time applications (e.g. video-on-demand), scientific applications
- It is very unlikely that only a fixed-size problem will be computed when enormous computing power is available

Definition 1. The work (or workload, or problem size) is defined as the number of instructions to be executed, denoted as w, including the improvable portion fw and the non-improvable portion (1-f)w.
Definition 2. The execution time is defined as the number of cycles spent executing the instructions, whether for computation or for data access.
Definition 3. The fixed-size speedup is defined as the ratio of the execution time on the original architecture to the execution time on the enhanced architecture.

\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}size}} = \frac{T_{\mathrm{original}}}{T_{\mathrm{enhanced}}}, \qquad T_{\mathrm{original}} = \frac{w}{\mathrm{perf}(1)}, \qquad T_{\mathrm{enhanced}} = \frac{(1-f)\,w}{\mathrm{perf}(r)} + \frac{f\,w\,r}{n\,\mathrm{perf}(r)} \]
\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}size}} = \frac{w/\mathrm{perf}(1)}{\frac{(1-f)\,w}{\mathrm{perf}(r)} + \frac{f\,w\,r}{n\,\mathrm{perf}(r)}} = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f\,r}{\mathrm{perf}(r)\,n}} \]
Consistent with Hill and Marty's findings (with the base-core performance perf(1) normalized to 1).
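A quick numeric check of the simplification above (not from the slides); the (f, n, r) values are illustrative and perf(r) = sqrt(r) is the same assumption as in the earlier snippet.

```python
def perf(r):
    # Pollack's-rule assumption used in the earlier snippets
    return r ** 0.5

def fixed_size_speedup(f, n, r):
    """Closed form from the slide: 1 / ((1-f)/perf(r) + f*r/(perf(r)*n))."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

if __name__ == "__main__":
    f, n, r, w = 0.9, 256, 4, 1.0                # illustrative values, perf(1) = 1
    t_original = w / 1.0                         # w / perf(1)
    t_enhanced = (1 - f) * w / perf(r) + f * w * r / (n * perf(r))
    print(t_original / t_enhanced, fixed_size_speedup(f, n, r))  # same number
```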

[Figure: fixed-size speedup of the multicore architecture vs. number of cores (4 to 1024), for f = 0.2 to 0.98. The speedup is quickly limited by the non-improvable portion.]

Definition 4. The fixed-time speedup of the multicore architecture is defined as the ratio of the execution time of solving the scaled workload in the original mode to the execution time of solving the scaled workload in the enhanced mode, where the scaled workload is the amount of work that is finished in the enhanced mode within the same amount of time as in the original mode.

The fixed-time constraint, when the number of cores scales from r to m·r:
\[ \frac{(1-f)\,w}{\mathrm{perf}(r)} + \frac{f\,w}{\mathrm{perf}(r)} = \frac{(1-f)\,w}{\mathrm{perf}(r)} + \frac{f\,w'}{m\,\mathrm{perf}(r)} \;\Rightarrow\; w' = m\,w \]

The scaled fixed-time speedup:
\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}time}} = \frac{\text{Time of solving } w' \text{ in original mode}}{\text{Time of solving } w \text{ in original mode}} = \frac{\frac{(1-f)\,w}{\mathrm{perf}(r)} + \frac{f\,w'}{\mathrm{perf}(r)}}{\frac{w}{\mathrm{perf}(r)}} = (1-f) + m\,f \]
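A minimal sketch of the fixed-time result above; the f and m values are illustrative.

```python
def fixed_time_speedup(f: float, m: int) -> float:
    """Scaled fixed-time speedup when the core count scales by a factor m.

    The fixed-time constraint forces the improvable work to grow from f*w
    to f*w' with w' = m*w, so the speedup is (1-f) + m*f.
    """
    return (1 - f) + m * f

if __name__ == "__main__":
    for m in (4, 32, 256, 1024):                 # illustrative scaling factors
        print(m, fixed_time_speedup(0.9, m))
```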

[Figure: fixed-time speedup of the multicore architecture vs. number of cores (4 to 1024), for f = 0.2 to 0.98. The speedup scales linearly.]

Definition 5. The memory-bounded speedup of the multicore architecture is defined as the ratio of the execution time of solving the scaled workload in the original mode to the execution time of solving the scaled workload in the enhanced mode, where the scaled workload is the amount of work that is finished in the enhanced mode under a constraint on the memory capacity.

Assume the scaled workload under the memory-bounded constraint is w* = g(m)·w, where g(m) expresses the computing requirement in terms of the memory requirement, e.g. g(m) = 0.38·m^{3/2} for matrix multiplication (2N³ computation vs. 3N² memory).

The scaled memory-bounded speedup:
\[ \mathrm{Speedup}_{\mathrm{memory\text{-}bounded}} = \frac{\text{Time of solving } w^{*} \text{ in original mode}}{\text{Time of solving } w^{*} \text{ in enhanced mode}} = \frac{(1-f) + g(m)\,f}{(1-f) + \frac{g(m)\,f}{m}} \]
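A short worked note (not on the slide) on where the coefficient 0.38 plausibly comes from, assuming an N×N dense matrix multiplication that stores three N×N matrices and performs about 2N³ floating-point operations:
\[ M = 3N^2 \;\Rightarrow\; N = \sqrt{M/3}, \qquad W = 2N^3 = 2\left(\frac{M}{3}\right)^{3/2} = \frac{2}{3\sqrt{3}}\,M^{3/2} \approx 0.385\,M^{3/2} \]
Computation thus grows as the 3/2 power of the memory footprint, which is the scaling the memory-bounded model captures with g(m).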

[Figure: memory-bounded speedup of the multicore architecture vs. number of cores (4 to 1024), for f = 0.2 to 0.98, with g(m) = 0.38·m^{3/2}. The speedup scales linearly and better than the fixed-time speedup.]

[Figure: fixed-size (FS), fixed-time (FT), and memory-bounded (MB) speedup of the multicore architecture vs. number of cores (4 to 1024), for f = 0.4, 0.8, 0.98. Scalable computing shows an optimistic view.]
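A minimal sketch (not from the slides) that tabulates the three models over the core counts shown in the plots; it assumes base cores (r = 1) for the fixed-size curve, uses the slide's g(m) = 0.38·m^{3/2} for the memory-bounded curve, and the f value is illustrative.

```python
def fixed_size(f, n):
    """Fixed-size (Amdahl-style) speedup on n base cores (r = 1)."""
    return 1.0 / ((1 - f) + f / n)

def fixed_time(f, n):
    """Fixed-time (Gustafson-style) speedup when work scales with n."""
    return (1 - f) + n * f

def memory_bounded(f, n, g=lambda m: 0.38 * m ** 1.5):
    """Memory-bounded (Sun-Ni-style) speedup with the slide's g(m)."""
    return ((1 - f) + f * g(n)) / ((1 - f) + f * g(n) / n)

if __name__ == "__main__":
    f = 0.8                                      # illustrative parallel fraction
    for n in (4, 32, 64, 128, 256, 512, 1024):
        print(n, round(fixed_size(f, n), 1),
                 round(fixed_time(f, n), 1),
                 round(memory_bounded(f, n), 1))
```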

Result 1: Amdahl's law presents a limited and pessimistic view of multicore scalability.
Implication 1: Micro-architects should move beyond this pessimistic view to keep history from repeating itself.

Result 2: The scalable computing concept and the two scaled speedup models are applicable to multicore architecture.
Implication 2: Manufacturers should actively move toward building large-scale multicore processors.

Result 3: The memory-bounded model considers a realistic constraint and presents a practical and even better view.
Implication 3: Problem-size scaling should avoid extensive accesses across the memory hierarchy.

- Sequential processing capability is enormous
- Long data-access delay, a.k.a. the memory-wall problem, is the identified performance bottleneck
- Assume a task has two parts, w = w_p + w_c: data-processing work w_p and data-communication (access) work w_c

Fixed-size speedup with data-access consideration:
\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}size}} = \frac{w}{\frac{w_c}{\mathrm{perf}(r)} + \frac{w_p\,r}{\mathrm{perf}(r)\,n}} \]
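A minimal sketch of the expression above; the split between w_p and w_c, the (n, r) values, and the perf(r) = sqrt(r) assumption are all illustrative.

```python
import math

def perf(r):
    # Pollack's-rule assumption, as in the earlier snippets
    return math.sqrt(r)

def fixed_size_with_access(w_p: float, w_c: float, n: int, r: int) -> float:
    """Fixed-size speedup when work splits into processing w_p and data access w_c."""
    w = w_p + w_c
    return w / (w_c / perf(r) + w_p * r / (perf(r) * n))

if __name__ == "__main__":
    # illustrative: 90% processing, 10% data access, 256 BCEs grouped into 4-BCE cores
    print(fixed_size_with_access(w_p=0.9, w_c=0.1, n=256, r=4))
```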

Fixed-time model constraint:
\[ \frac{w_c}{\mathrm{perf}(r)} + \frac{w_p}{\mathrm{perf}(r)} = \frac{w_c}{\mathrm{perf}(r)} + \frac{w_p'}{m\,\mathrm{perf}(r)} \;\Rightarrow\; w_p' = m\,w_p \]

Fixed-time scaled speedup:
\[ \mathrm{Speedup}_{\mathrm{fixed\text{-}time}} = \frac{\frac{w_c}{\mathrm{perf}(r)} + \frac{w_p'}{\mathrm{perf}(r)}}{\frac{w_c}{\mathrm{perf}(r)} + \frac{w_p}{\mathrm{perf}(r)}} = \frac{w_c + m\,w_p}{w_c + w_p} = (1-f') + m\,f', \quad \text{where } f' = \frac{w_p}{w_c + w_p} \]

Memory-bounded scaled speedup:
- The computing requirement is generally greater than the memory requirement
- Likely to be greater than the fixed-time speedup
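A small sketch of the fixed-time expression with the data-access split; the w_p, w_c, and m values are illustrative.

```python
def fixed_time_with_access(w_p: float, w_c: float, m: int) -> float:
    """Fixed-time speedup with a non-scaling data-access part w_c: (1-f') + m*f'."""
    f_prime = w_p / (w_c + w_p)
    return (1 - f_prime) + m * f_prime

if __name__ == "__main__":
    # illustrative: data access is 10% of the original work, cores scale 64x
    print(fixed_time_with_access(w_p=0.9, w_c=0.1, m=64))
```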

Result 4: Data-access delay remains a dominant factor that determines the sustained performance of a multicore architecture. As the processor-memory performance gap grows, data-access delay has a greater impact on overall system performance.
Implication 4: Architects should not focus only on on-chip layout design to deliver high peak performance, but also on memory and storage component design to achieve high sustained performance.

Result 5: With data-access delay taken into account, the scalable computing concept is still applicable to multicore architecture design.
Implication 5: The scalable computing concept characterizes application requirements and reflects the inherent scalability constraints of multicore architecture well.

Solutions:
- Memory hierarchy: principle of locality
- Data prefetching:
  - Software prefetching techniques: adaptive, but compete for computing power and are costly
  - Hardware prefetching techniques: fixed, simple, and less powerful
- Data Access History Cache
- Server-based Push Prefetching
- Hybrid Adaptive Prefetching Architecture

- With a scalable computing viewpoint, multicore architectures can scale up well and linearly
- The scalable computing concept provides a theoretical foundation for building large-scale multicore processors
- The memory-bounded speedup model considers the memory constraint on performance and indicates the tradeoff between memory capacity and computing power
- The scalable computing view is applicable at the task level: multicore architecture is not built for a single task; explore an overall speedup improvement and exploit high-throughput computing
