Understanding Parallelism and the Limitations of Parallel Computing

Understanding Parallelism: Sequential work
[Figure: four cars processed sequentially; after 16 time steps: 4 cars finished.]

Understanding Parallelism: Parallel work
[Figure: the four cars processed in parallel; after 4 time steps: 4 cars finished, a perfect speedup.]

Understanding Parallelism: Shared resources, imbalance
[Figure: workers waiting for a shared resource and waiting for synchronization.]
Unused resources due to a resource bottleneck and imbalance!

Limitations of Parallel Computing: Amdahl's Law
Ideal world: all work is perfectly parallelizable.
Closer to reality: purely serial parts limit the maximum speedup.
Reality is even worse: communication and synchronization impede scalability even further.

Limitations of Parallel Computing: Calculating Speedup in a Simple Model ("strong scaling")
T(1) = s + p = serial compute time (= 1), with purely serial part s and parallelizable part p = 1 - s.
Parallel execution time: T(N) = s + p/N.
General formula for speedup, Amdahl's Law (1967), "strong scaling":
$S_p(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N}}$
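A minimal sketch of this model in Python (the worker counts are arbitrary examples):

```python
def amdahl_speedup(N, s):
    """Amdahl speedup S(N) = 1 / (s + (1 - s)/N) for serial fraction s."""
    return 1.0 / (s + (1.0 - s) / N)

# Even a 10% serial fraction caps the benefit of adding workers:
for N in (1, 2, 4, 8, 16):
    print(N, round(amdahl_speedup(N, s=0.1), 2))
# prints roughly: 1.0, 1.82, 3.08, 4.71, 6.4
```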

Limitations of Parallel Computing: Amdahl's Law ("strong scaling")
Reality: no task is perfectly parallelizable.
- Shared resources have to be used serially.
- Task interdependencies must be accounted for.
- Communication overhead (but that can be modeled separately).
The benefit of parallelization may be strongly limited. "Side effect": limited scalability leads to inefficient use of resources.
Metric: parallel efficiency (what percentage of the workers/processors is efficiently used):
$\varepsilon(N) = \frac{S_p(N)}{N}$, which in the Amdahl case becomes $\varepsilon(N) = \frac{1}{s(N-1) + 1}$
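A one-line check of the efficiency formula (a sketch; s and N are example values):

```python
def parallel_efficiency(N, s):
    """Amdahl-case parallel efficiency eps(N) = S(N)/N = 1 / (s*(N - 1) + 1)."""
    return 1.0 / (s * (N - 1) + 1.0)

# With s = 0.1, only 40% of a 16-worker machine is used efficiently:
print(parallel_efficiency(16, s=0.1))   # 0.4
```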

Limitations of Parallel Computing: Adding a simple communication model for strong scaling
T(1) = s + p = serial compute time, with purely serial part s and parallelizable part p = 1 - s.
Parallel execution time: T(N) = s + p/N + Nk.
Model assumption: non-overlapping communication, with a fraction k of communication time per worker.
General formula for speedup:
$S_p^k(N) = \frac{T(1)}{T(N)} = \frac{1}{s + \frac{1-s}{N} + Nk}$
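A sketch of the extended model; the closed-form optimum N ≈ sqrt((1 - s)/k) follows from setting dS/dN = 0 and is our addition, not part of the slide:

```python
import math

def speedup_with_comm(N, s, k):
    """S(N) = 1 / (s + (1 - s)/N + N*k): Amdahl plus non-overlapping communication."""
    return 1.0 / (s + (1.0 - s) / N + N * k)

# With k > 0 the speedup peaks and then decays as communication dominates.
s, k = 0.1, 0.05
best_N = max(range(1, 101), key=lambda N: speedup_with_comm(N, s, k))
print(best_N, round(speedup_with_comm(best_N, s, k), 2))   # 4 1.9
print(round(math.sqrt((1 - s) / k), 1))                    # 4.2 (analytic estimate)
```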

Limitations of Parallel Computing: Amdahl's Law ("strong scaling")
Large-N limits:
- At k = 0, Amdahl's Law predicts $\lim_{N \to \infty} S_p(N) = \frac{1}{s}$, independent of N!
- At k ≠ 0, our simple model of communication overhead yields the behaviour $S_p^k(N) \approx \frac{1}{Nk}$ for large N, i.e. the speedup eventually drops again as workers are added.
Problems in real-world programming:
- Load imbalance
- Shared resources have to be used serially (e.g. I/O)
- Task interdependencies must be accounted for
- Communication overhead

Limitations of Parallel Computing: Amdahl's Law ("strong scaling") + communication model
[Figure: speedup S(N) for 1 to 10 CPUs, for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05.]

Limitations of Parallel Computing: Amdahl's Law ("strong scaling")
[Figure: speedup S(N) for 1 to 1000 CPUs, for s = 0.01, s = 0.1, and s = 0.1 with k = 0.05; annotations mark parallel efficiencies of below 10% and about 50%.]

Limitations of Parallel Computing: How to mitigate overheads
Communication is not necessarily purely serial:
- Non-blocking crossbar networks can transfer many messages concurrently, so the factor Nk in the denominator becomes k (a technical measure).
- Sometimes communication can be overlapped with useful work (implementation, algorithm); the communication overhead may then show a more fortunate behavior than Nk (see the sketch below).
"Superlinear" speedups: the data size per CPU decreases with increasing CPU count and may fit into cache at large CPU counts.
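As an illustration of overlapping communication with useful work, here is a hedged sketch of a non-blocking halo exchange with mpi4py; the buffer size, neighbour pattern, and the split into interior and boundary work are hypothetical:

```python
# Sketch only: overlap a halo exchange with computation on interior data.
# Run with e.g. `mpirun -n 4 python overlap.py` (requires mpi4py and NumPy).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send_buf = np.full(1000, rank, dtype='d')   # hypothetical halo data
recv_buf = np.empty(1000, dtype='d')

# Post the communication first ...
reqs = [comm.Isend(send_buf, dest=right, tag=0),
        comm.Irecv(recv_buf, source=left, tag=0)]

# ... then do work that does not depend on the incoming halo ...
interior = np.sin(send_buf).sum()

# ... and wait only when the halo data is actually needed.
MPI.Request.Waitall(reqs)
boundary_contribution = recv_buf.sum()
```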

Limits of Scalability: Serial & Parallel fraction
The serial fraction s may depend on:
- Program / algorithm: non-parallelizable parts (e.g. recursive data setup), non-parallelizable I/O (e.g. reading input data), communication structure, load balancing (assumed so far: perfectly balanced)
- Computer hardware: processor (cache effects & memory bandwidth effects), parallel library, network capabilities, parallel I/O
Determine s "experimentally": measure the speedup and fit the data to Amdahl's law, but that can be more complicated than it seems (see the fit sketch below).
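A sketch of such an experimental fit using scipy; the measured speedup numbers below are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl(N, s):
    """Model speedup for serial fraction s."""
    return 1.0 / (s + (1.0 - s) / N)

# Hypothetical measurements: worker counts and observed speedups.
N_meas = np.array([1, 2, 4, 8, 16, 32])
S_meas = np.array([1.0, 1.9, 3.4, 5.5, 7.9, 9.8])

(s_fit,), _ = curve_fit(amdahl, N_meas, S_meas, p0=[0.05], bounds=(0.0, 1.0))
print(f"fitted serial fraction s = {s_fit:.3f}")
```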

Scalability data on modern multi-core systems: an example
[Figure: a cluster node with several sockets, each socket holding multiple cores (P) attached via a chipset to its local memory; scaling can be measured across cores on a socket, across sockets on a node, and across nodes.]

Scalability data on modern multi-core systems: the scaling baseline
Scalability presentations should be grouped according to the largest unit that the scaling is based on (the "scaling baseline").
[Figure: speedup (0 to 2) of a memory-bound code for 1, 2, and 4 CPUs, with intranode and internode baselines; good scaling across sockets.]
Amdahl model with communication: fit $S(N) = \frac{1}{s + \frac{1-s}{N} + kN}$ to the inter-node scalability numbers (N = number of nodes, N > 1).
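The same fit as above, extended with the communication parameter k (a sketch; the inter-node speedup numbers are invented):

```python
import numpy as np
from scipy.optimize import curve_fit

def amdahl_comm(N, s, k):
    """Speedup model with serial fraction s and communication fraction k."""
    return 1.0 / (s + (1.0 - s) / N + k * N)

# Hypothetical inter-node measurements (N = number of nodes).
N_nodes = np.array([2, 4, 8, 16, 32])
S_meas = np.array([1.9, 3.3, 5.0, 6.2, 5.8])

(s_fit, k_fit), _ = curve_fit(amdahl_comm, N_nodes, S_meas,
                              p0=[0.05, 0.001], bounds=(0.0, [1.0, 1.0]))
print(f"s = {s_fit:.3f}, k = {k_fit:.4f}")
```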

Application to accelerated computing: SIMD, GPUs, Cell SPEs, FPGAs, just about any optimization
Assume the overall (serial, un-accelerated) runtime to be $T_s = s + p = 1$.
Assume p can be accelerated to run α times faster. We neglect any additional cost (communication, ...).
To get a speedup of rα, how small must s be? Solve $r\alpha = \frac{1}{s + \frac{1-s}{\alpha}}$ for s:
$s = \frac{1/r - 1}{\alpha - 1} = \frac{1 - r}{r(\alpha - 1)}$
At α = 10 and r = 0.9 (i.e. an overall speedup of 9), we get s ≈ 0.012, i.e. you must accelerate 98.8% of the serial runtime!
Limited memory on accelerators may limit the achievable speedup.
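A quick numeric check of this relation (a sketch; the helper name is ours):

```python
def required_serial_fraction(alpha, r):
    """Largest serial fraction s that still allows an overall speedup of r*alpha
    when only the parallelizable part p runs alpha times faster."""
    return (1.0 - r) / (r * (alpha - 1.0))

s = required_serial_fraction(alpha=10, r=0.9)
print(round(s, 4), f"-> accelerate {100 * (1 - s):.1f}% of the runtime")
# 0.0123 -> accelerate 98.8% of the runtime
```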