Multicore Programming


Multicore Programming: Parallel Hardware and Performance. Peter Sewell, Jaroslav Ševčík, Tim Harris

Merge sort on a large array of fixed-width integers: Recurse(left); Recurse(right); merge to scratch array; copy back to input array. The two recursive calls account for ~98% of execution time, the merge and copy for roughly the remaining ~2%.

Merge sort, dual-core, same input: Recurse(left) and Recurse(right) run in parallel, one per core; the merge to the scratch array and the copy back to the input array remain sequential.
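
The dual-core version maps directly onto a fork-join structure: sort the two halves in parallel, then perform the merge and copy sequentially. A minimal C++ sketch of that structure (the function name, cut-off constant and use of std::async are mine, not from the lecture; scratch must be at least as large as a):

#include <algorithm>
#include <future>
#include <vector>

// Fork-join merge sort of a[lo, hi), using scratch as temporary storage.
// Below the cut-off we fall back to std::sort to limit task overhead.
void parallel_merge_sort(std::vector<int>& a, std::vector<int>& scratch,
                         std::size_t lo, std::size_t hi) {
    const std::size_t cutoff = 1 << 15;          // illustrative threshold
    if (hi - lo <= cutoff) {
        std::sort(a.begin() + lo, a.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    // Recurse(left) on another core, Recurse(right) on this one.
    auto left = std::async(std::launch::async,
                           [&] { parallel_merge_sort(a, scratch, lo, mid); });
    parallel_merge_sort(a, scratch, mid, hi);
    left.get();
    // Merge to scratch array, then copy back to the input array (sequential).
    std::merge(a.begin() + lo, a.begin() + mid,
               a.begin() + mid, a.begin() + hi,
               scratch.begin() + lo);
    std::copy(scratch.begin() + lo, scratch.begin() + hi, a.begin() + lo);
}

std::async is used here only to show the fork-join shape; a real implementation would hand the recursive calls to a work-stealing scheduler (Cilk-style, as discussed later in the lecture) rather than spawning raw threads.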

[Chart] Throughput per core (elements per ms) against wall-clock execution time (ms) for the merge sort on a dual-core laptop, comparing the single-core and dual-core versions.

[Chart] The same measurement on an AMD Phenom: throughput per core (elements per ms) against wall-clock execution time (ms) for increasing numbers of cores.

The power wall. Moore's law: the area of a transistor is about 50% smaller each generation, so a given chip area accommodates twice as many transistors. Dynamic power: P_dyn = α f V². Shrinking process technology (constant-field scaling) allows reducing V to partially counteract increasing f, but V cannot be reduced arbitrarily: halving V more than halves the maximum f (for a given transistor), and there are physical limits. Leakage power: P_leak = V (I_sub + I_ox). To reduce I_sub (sub-threshold leakage): turn the component off, or increase the threshold voltage (which reduces the maximum f). To reduce I_ox (gate-oxide leakage): increase the oxide thickness (but it needs to decrease with process scaling).
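
A rough worked example of why clock scaling stalled (the numbers are illustrative, not from the slides): with P_dyn = α f V², running a core about 30% faster typically also needs roughly 30% more supply voltage to keep the transistors switching in time, so dynamic power grows by about 1.3 × 1.3² ≈ 2.2x, i.e. close to f³. Within a fixed power envelope it is therefore better to spend the extra transistors on a second core at the original frequency than on one faster core.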

[Chart] The memory wall: operations per µs for CPU versus memory over time; in the 1980s a CPU clock of tens of MHz was matched against memory access times of a few hundred ns.

[Chart] The memory wall, continued: CPU operation rates have since pulled far ahead of memory, to the point where a memory access costs on the order of a few hundred CPU cycles.

The ILP wall. ILP = instruction-level parallelism: implicit parallelism between instructions in a single thread, identified by the hardware, which speculates past memory accesses and past control transfers. Diminishing returns.

Power wall + ILP wall + memory wall = brick wall. The power wall means we can't just clock processors faster any longer. The memory wall means that many workloads' performance is dominated by memory access times. The ILP wall means we can't find extra work to keep functional units busy while waiting for memory accesses.

Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

Amdahl's law. Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to reach a given target speed-up on the overall program?

[Chart] Amdahl's law, f = 70%: overall speedup against the number of cores used, showing the desired target speedup as a horizontal line and the speedup actually achieved with perfect scaling of the 70% fraction.

Amdahl's law, f = 70%. MaxSpeedup(f, c) = 1 / ((1 - f) + f/c), where f = fraction of the code the speedup applies to and c = number of cores used.
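
A small self-contained C++ sketch that evaluates this formula for the f = 70% example (the core counts chosen are only for illustration):

#include <cstdio>

// Amdahl's law: maximum overall speedup when a fraction f of the program
// scales perfectly across c cores and the remaining (1 - f) stays serial.
static double max_speedup(double f, double c) {
    return 1.0 / ((1.0 - f) + f / c);
}

int main() {
    const double f = 0.7;                       // sorting = 70% of execution time
    const int cores[] = {1, 2, 4, 8, 16, 1024};
    for (int c : cores) {
        std::printf("c = %4d  max speedup = %.2f\n", c, max_speedup(f, c));
    }
    std::printf("limit as c -> infinity: %.2f\n", 1.0 / (1.0 - f));
    return 0;
}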

[Chart] Amdahl's law, f = 70%, as before, now with the asymptote marked: as c → ∞ the speedup tends to 1/(1 - f) = 3.3.

[Chart] Amdahl's law, f = 10%: even with perfect scaling of that fraction, the Amdahl's law limit on overall speedup is just about 1.1x.

[Chart] Amdahl's law, f = 98%: overall speedup against the number of cores (up to about 128); even here the curve climbs only slowly towards the 1/(1 - f) = 50 limit.

Amdahl's law & multi-core. Suppose that the same h/w budget (space or power) can make us: 1 big core, or 4 medium cores, or 16 small cores.

[Chart] Performance of big & small cores: core performance (relative to the big core) against the share of resources dedicated to one core (from 1/16 up to the whole budget), under the assumption perf = α √resources. Total chip performance: 16 small cores give 16 × 1/4 = 4; 4 medium cores give 4 × 1/2 = 2; 1 big core gives 1.

[Chart] Amdahl's law, f = 98%: performance (relative to one big core) against the number of cores used, for the 1-big, 4-medium and 16-small configurations; with this much parallel work the 16 small cores come out ahead.

[Chart] Amdahl's law, f = 75%: the same comparison; with the smaller parallel fraction the big and medium configurations are ahead and the 16 small cores fall behind.

[Chart] Amdahl's law, f = 5%: with almost no parallel work the single big core is clearly best and the 16 small cores are worst.

Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

[Diagram] Asymmetric chips: the same budget can instead be split unevenly, e.g. one larger core together with a number of small cores on the same die.

[Chart] Amdahl's law, f = 75%: performance (relative to one big core) adding an asymmetric configuration (one larger core plus smaller cores) alongside the big, medium and 16-small options.

[Chart] Amdahl's law, f = 5%: the same comparison including the asymmetric configuration.

[Chart] Amdahl's law, f = 98%: the same comparison including the asymmetric configuration.

[Chart] Amdahl's law, f = 98%: speedup (relative to one big core) for the asymmetric chip versus 16 small cores.

[Chart] Amdahl's law, f = 98%: as before, with one curve assuming the larger core is left idle in the parallel section.

Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

[Diagrams] Merge sort task graphs for increasing numbers of cores, up to an 8-core version.

T∞ (span): critical path length.

T1 (work): time to run sequentially.

T_serial: the optimized sequential code.

A good multi-core parallel algorithm: T1/T_serial is low (what we lose on sequential performance we must make up through parallelism, and resource availability may limit our ability to do that), and T∞ grows slowly with the problem size (we tackle bigger problems by using more cores, not by running for longer).
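
Applying these two criteria to the merge sort from the start of the lecture (my reasoning, not a figure from the slides): T1 is essentially T_serial apart from a small forking overhead, so the first criterion holds; but with a sequential merge the span is T∞ = Θ(n), since the final merge touches every element on one core, so the achievable speedup T1/T∞ is only Θ(log n) no matter how many cores are available.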

Quick-sort on good input: recurse on the two halves in parallel, with a sequential partition and a sequential append. T1 = O(n log n), T∞ = O(n).

Quick-sort on good input: replace the sequential steps with a parallel partition and a parallel append around the two recursive calls.

Quick-sort on good input: with the parallel partition and parallel append, T1 = O(n log n) and T∞ = O(log² n).
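
Where the O(log² n) span comes from (a standard derivation; the slides do not spell out the partition mechanism, so assume a parallel partition built from parallel prefix sums): partitioning n elements then has span Θ(log n), and on good input each level halves the problem, giving the span recurrence S(n) = S(n/2) + Θ(log n) = Θ(log² n), while the work recurrence W(n) = 2·W(n/2) + Θ(n) stays Θ(n log n).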

[Chart] Scheduling from a DAG: execution time T_p against the number of cores for an example DAG, bounded below by T_p ≥ T1/P and T_p ≥ T∞.

[Chart] Scheduling from a DAG: the same example plotted as speedup (T1/T_p) against the number of cores; the two bounds cap the speedup at P and at T1/T∞.

In Cilk: T_p ≤ T1/p + c∞·T∞. Given abundant parallelism (T∞ << T1), the work-first principle applies: keep T1 close to T_serial, and put scheduling overhead on T∞ to keep T1 low. [Chart] Speedup against the number of cores: the T1/P bound together with the Cilk prediction for one value of c∞.

In Cilk: T_p ≤ T1/p + c∞·T∞ (as above). [Chart] The same plot comparing the predictions for two different values of c∞.
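
Plugging illustrative numbers (not from the lecture) into the bound: if T1 = 10 s, T∞ = 10 ms and c∞ = 2, then on p = 8 cores T_p ≈ 10/8 + 2 × 0.01 ≈ 1.27 s, a speedup of about 7.9x. The c∞·T∞ term only starts to matter once T1/p shrinks to the same order as the critical path, which is exactly the abundant-parallelism condition above.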

Why parallelism? Amdahl's law. Why asymmetric performance? Parallel algorithms. Multi-processing hardware.

Multi-threaded h/w: a single multi-threaded core with multiple ALUs, a small L1 cache (tens of KB) and a larger L2 cache (a few MB) in front of main memory. It helps a workload with multiple threads that have poor spatial locality and frequent memory accesses: while one thread waits on memory, another can use the core.

Multi-threaded h/w: the same multi-threaded core also helps multiple threads with synergistic resource needs (for example, one compute-bound and one memory-bound thread sharing the ALUs and caches).

Multi-core h/w, common L2: two cores, each with its own ALUs and private L1 cache, sharing a single L2 cache in front of main memory.

Multi-core h/w, common L2 (reads): when one core reads a line, it is fetched from main memory into the shared L2 and into that core's L1 in state S (shared); if the second core then reads the same line, it too gets an S copy, so both L1s and the L2 hold the line in S.

Multi-core h/w, common L2 (writes): before a core can write the line, the other copies must be invalidated; the other L1's copy goes to I (invalid) and the writer's copy becomes M (modified). If the second core then writes the same line, ownership migrates: the first core's copy is invalidated in turn and the second core ends up holding the line in M. (The E, exclusive, state marks a clean copy held by only one cache.)
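
The write-induced invalidations above are what make cache-line ping-pong expensive: two cores that repeatedly write the same line force it to bounce between their L1 caches even if they never touch the same variable (false sharing). A minimal C++ sketch of the effect, assuming 64-byte cache lines; the struct names and iteration count are illustrative:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Two adjacent counters that will typically share one cache line: every
// increment by one core invalidates the other core's copy of the line.
struct SameLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// The same two counters padded onto separate (assumed 64-byte) lines.
struct SeparateLines {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double time_increments(Counters& c) {
    const long iters = 50000000;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i)
                             c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i)
                             c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    SameLine shared;
    SeparateLines padded;
    std::printf("counters on the same line:  %.2f s\n", time_increments(shared));
    std::printf("counters on separate lines: %.2f s\n", time_increments(padded));
    return 0;
}

On a typical multi-core machine the padded version runs several times faster, even though the two threads never access the same counter.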

Multi-core h/w, separate L2: each of the two cores has its own private L1 and L2 caches in front of main memory.

Multi-core h/w, additional L3: private L1 and L2 caches per core, plus a shared L3 cache in front of main memory.

Multi-threaded multi-core h/w: each core runs multiple hardware threads on top of the same cache hierarchy (private L1/L2 per core, a shared lower-level cache, main memory).

SMP multiprocessor: multiple processors, each with its own cache hierarchy, connected to a single shared main memory.

NUMA multiprocessor: several nodes, each containing multiple cores with their caches plus a local memory and directory, connected by an interconnect; all memory is accessible from every core, but local memory is faster to reach than remote memory.

Three kinds of parallel hardware. Multi-threaded cores: increase the utilization of a core or of memory bandwidth; peak ops/cycle stays fixed. Multiple cores: increase ops/cycle, but don't necessarily scale caches and off-chip resources proportionately. Multi-processor machines: increase ops/cycle, and often scale cache & memory capacities and bandwidth proportionately.

[Chart] Repeated from earlier: merge sort throughput per core (elements per ms) against wall-clock execution time (ms) on the AMD Phenom.