CSE 613: Parallel Programming. Lecture 2 ( Analytical Modeling of Parallel Algorithms )

CSE 613: Parallel Programming
Lecture 2 (Analytical Modeling of Parallel Algorithms)
Rezaul A. Chowdhury
Department of Computer Science, SUNY Stony Brook
Spring 2017

Parallel Execution Time & Overhead

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition

Parallel running time on $p$ processing elements, $T_p = t_{finish} - t_{start}$, where
$t_{start}$ = starting time of the processing element that starts first,
$t_{finish}$ = termination time of the processing element that finishes last.

Parallel Execution Time & Overhead

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition

Sources of overhead (w.r.t. serial execution):
Interprocess interaction: processing elements interact and communicate data (e.g., intermediate results).
Idling: due to load imbalance, synchronization, presence of serial computation, etc.
Excess computation: the fastest serial algorithm may be difficult or impossible to parallelize, so the parallel algorithm may perform more work.

Parallel Execution Time & Overhead

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition

Overhead function or total parallel overhead, $T_O = p\,T_p - T_S$, where
$p$ = number of processing elements,
$T_S$ = time spent doing useful work (often the execution time of the fastest serial algorithm).
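As a quick worked illustration of the overhead function (the numbers here are hypothetical, not from the lecture): if the fastest serial algorithm takes $T_S = 100$ time units and $p = 4$ processing elements finish after $T_p = 30$ units, then

$$T_O = p\,T_p - T_S = 4 \times 30 - 100 = 20 \text{ units of overhead.}$$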

Speedup

Let $T_p$ = running time using $p$ identical processing elements, and $T_1$ = running time on a single processing element.
Speedup, $S_p = \frac{T_1}{T_p}$.
Theoretically, $S_p \le p$ (why?).
Perfect or linear or ideal speedup if $S_p = p$.
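One standard argument for the bound (sketched here; the slide leaves the "why?" as an exercise): a single processing element can simulate the entire parallel execution by performing the work of all $p$ processing elements one after another, so

$$T_1 \le p\,T_p \quad\Longrightarrow\quad S_p = \frac{T_1}{T_p} \le p.$$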

Speedup

Consider adding $n$ numbers using $n$ identical processing elements.
Serial runtime, $T_1 = \Theta(n)$.
Parallel runtime, $T_n = \Theta(\log n)$.
Speedup, $S_n = \Theta\!\left(\frac{n}{\log n}\right)$.
Speedup is not ideal.

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition
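For a concrete feel (the value of $n$ is chosen only for illustration): with $n = 16$ numbers and $16$ processing elements, the pairwise additions form a binary tree of depth $\log_2 16 = 4$, so, ignoring constant factors, the speedup is about $\frac{16}{4} = 4$ rather than the ideal $16$.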

Superlinear Speedup

Theoretically, $S_p \le p$.
But in practice superlinear speedup is sometimes observed, that is, $S_p > p$ (why?).
Reasons for superlinear speedup:
Cache effects
Exploratory decomposition

Superlinear Speedup (Cache Effects)

Let cache access latency = 2 ns and DRAM access latency = 100 ns. Suppose we want to solve a problem instance that executes $k$ FLOPs.

With 1 core: Suppose the cache hit rate is 80%. If the computation performs 1 FLOP per memory access, then each FLOP takes $2 \times 0.8 + 100 \times 0.2 = 21.6$ ns to execute.

With 2 cores: The cache hit rate will improve (why?). Suppose the cache hit rate is now 90%. Then each FLOP takes $2 \times 0.9 + 100 \times 0.1 = 11.8$ ns to execute. Since each core now executes only $k/2$ FLOPs,

Speedup, $S_2 = \dfrac{21.6\,k}{11.8 \times (k/2)} \approx 3.66 > 2$.

[Figure: one CPU core with its cache attached to DRAM vs. two CPU cores, each with its own cache, attached to DRAM.]
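The arithmetic above can be checked with a few lines of C. This is a minimal sketch assuming the slide's simple cost model (average access time weighted by hit rate, one FLOP per memory access); it is not a cache simulation.

```c
/* Quick numeric check of the cache-effect example above.
 * The latencies, hit rates, and "1 FLOP per memory access" model
 * are the assumptions stated on the slide, not measured values. */
#include <stdio.h>

int main(void) {
    double cache_ns = 2.0, dram_ns = 100.0;

    /* Average time per FLOP = hit_rate * cache + miss_rate * DRAM */
    double t1 = 0.8 * cache_ns + 0.2 * dram_ns;   /* 1 core, 80% hits  -> 21.6 ns */
    double t2 = 0.9 * cache_ns + 0.1 * dram_ns;   /* 2 cores, 90% hits -> 11.8 ns */

    /* k FLOPs serially vs. k/2 FLOPs per core in parallel; k cancels out. */
    double speedup = t1 / (t2 / 2.0);

    printf("per-FLOP time, 1 core : %.1f ns\n", t1);
    printf("per-FLOP time, 2 cores: %.1f ns\n", t2);
    printf("speedup               : %.2f (> 2, i.e., superlinear)\n", speedup);
    return 0;
}
```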

Superlinear Speedup (Due to Exploratory Decomposition)

Consider searching an array of $2n$ unordered elements for a specific element $x$. Suppose $x$ is located at array location $k > n$ and $k$ is odd.

Serial runtime, $T_1 = k$ (sequential search examines $A[1], A[2], \ldots, A[k]$).

Parallel running time with $n$ processing elements, $T_n = 1$ (processing element $P_i$ searches the pair $A[2i-1], A[2i]$; since $k$ is odd, $x$ is the first element examined by its owner).

Speedup, $S_n = k > n$.

Speedup is superlinear!

[Figure: sequential search scanning $A[1..2n]$ for $x$ vs. parallel search with $P_1, P_2, \ldots, P_n$ each assigned two consecutive elements.]
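A tiny C sketch that counts the steps in this example (the values of $n$ and $k$ are made up, chosen to satisfy the slide's setup of $k$ odd and $k > n$):

```c
/* Counting steps in the exploratory-search example: 2n unordered elements,
 * x at odd 1-based position k with k > n, P_i owning positions 2i-1 and 2i. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long n = 8, len = 2 * n, k = 11;   /* assumptions: k odd, k > n */
    int *A = calloc(len, sizeof(int));
    if (!A) return 1;
    A[k - 1] = 42;                     /* place x at position k */

    /* Sequential search: count elements examined until x is found. */
    long t_serial = 0;
    while (A[t_serial] != 42) t_serial++;
    t_serial++;                        /* examined A[1..k], i.e., k elements */

    /* Lockstep parallel search: in round r, every P_i examines its r-th
     * element; x sits at the first position of its owner's pair (k is odd),
     * so it is found in round 1. */
    long t_parallel = ((k - 1) % 2) + 1;

    printf("T_1 = %ld, T_n = %ld, speedup = %ld > n = %ld (superlinear)\n",
           t_serial, t_parallel, t_serial / t_parallel, n);
    free(A);
    return 0;
}
```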

Parallelism & Span Law

We defined $T_p$ = runtime on $p$ identical processing elements.
Then span, $T_\infty$ = runtime on an infinite number of identical processing elements.
Parallelism, $\frac{T_1}{T_\infty}$.
Parallelism is an upper bound on speedup, i.e., $S_p \le \frac{T_1}{T_\infty}$ (why?).

Span Law: $T_p \ge T_\infty$.
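The "why?" follows in one line from the Span Law (a sketch, not spelled out on the slide): since $T_p \ge T_\infty$,

$$S_p = \frac{T_1}{T_p} \le \frac{T_1}{T_\infty}.$$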

Work Law

The cost of solving (or work performed for solving) a problem:
On a serial computer: the cost is given by $T_1$.
On a parallel computer: the cost is given by $p\,T_p$.

Work Law: $T_p \ge \frac{T_1}{p}$.
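Taken together, the Work Law and the Span Law give a convenient lower bound on parallel running time (a direct combination of the two laws above):

$$T_p \ge \max\!\left(\frac{T_1}{p},\; T_\infty\right).$$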

Work Optimality

Let $T_S$ = runtime of the optimal or the fastest known serial algorithm.

A parallel algorithm is cost-optimal or work-optimal provided $p\,T_p = \Theta(T_S)$.

Our algorithm for adding $n$ numbers using $n$ identical processing elements is clearly not work-optimal: $p\,T_p = \Theta(n \log n)$, while $T_S = \Theta(n)$.

Adding n Numbers Work-Optimally

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition

We reduce the number of processing elements, which in turn increases the granularity of the subproblem assigned to each processing element.

Suppose we use $p$ processing elements.
First each processing element locally adds its $\frac{n}{p}$ numbers in time $\Theta\!\left(\frac{n}{p}\right)$.
Then the $p$ processing elements add these $p$ partial sums in time $\Theta(\log p)$.

Thus $T_p = \Theta\!\left(\frac{n}{p} + \log p\right)$, and $T_S = \Theta(n)$.

So the algorithm is work-optimal provided $n = \Omega(p \log p)$.
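A minimal sketch of this two-phase scheme in C with OpenMP. The lecture is model-level and does not prescribe an implementation, so OpenMP, the block layout, and the input values below are assumptions made purely for illustration: each of the $p$ threads first sums its own block of roughly $n/p$ numbers, and the $p$ partial sums are then combined pairwise in about $\log_2 p$ rounds.

```c
/* Work-optimal parallel sum sketch: Theta(n/p) local work per thread,
 * then a Theta(log p) pairwise combination of the partial sums. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long n = 1 << 20;                        /* problem size (made up) */
    double *x = malloc(n * sizeof *x);
    for (long i = 0; i < n; i++) x[i] = 1.0;

    int p = omp_get_max_threads();
    double *partial = calloc(p, sizeof *partial);

    /* Phase 1: local sums, Theta(n/p) work per processing element. */
    #pragma omp parallel num_threads(p)
    {
        int id = omp_get_thread_num();
        long lo = (long)id * n / p, hi = (long)(id + 1) * n / p;
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += x[i];
        partial[id] = s;
    }

    /* Phase 2: pairwise (tree) combination in Theta(log p) rounds. */
    for (int stride = 1; stride < p; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < p; i += 2 * stride)
            if (i + stride < p)
                partial[i] += partial[i + stride];
    }

    printf("p = %d, sum = %.0f (expected %ld)\n", p, partial[0], n);
    free(x); free(partial);
    return 0;
}
```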

Scaling Laws

Scaling of Parallel Algorithms (Amdahl's Law)

Suppose only a fraction $f$ of a computation can be parallelized.
Then parallel running time, $T_p \ge (1-f)\,T_1 + f\,\frac{T_1}{p}$.
Speedup, $S_p = \frac{T_1}{T_p} \le \frac{p}{f + (1-f)p} = \frac{1}{(1-f) + \frac{f}{p}}$.
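A worked instance (values chosen only for illustration): with $f = 0.9$ and $p = 10$,

$$S_p \le \frac{1}{(1-f) + \frac{f}{p}} = \frac{1}{0.1 + 0.09} \approx 5.26,$$

and even as $p \to \infty$ the speedup is capped at $\frac{1}{1-f} = 10$.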

Scaling of Parallel Algorithms (Amdahl's Law)

Suppose only a fraction $f$ of a computation can be parallelized.
Speedup, $S_p \le \frac{1}{(1-f) + \frac{f}{p}}$.

[Figure: speedup under Amdahl's law as a function of the number of processors, for various values of $f$. Source: Wikipedia]

Scaling of Parallel Algorithms (Gustafson-Barsis' Law)

Suppose only a fraction $f$ of a computation was parallelized.
Then serial running time, $T_1 = (1-f)\,T_p + p\,f\,T_p$.
Speedup, $S_p = \frac{T_1}{T_p} = (1-f) + f\,p = p + (1-p)(1-f)$.
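The same illustrative values under Gustafson-Barsis' law tell a more optimistic story: with $f = 0.9$ and $p = 10$,

$$S_p = (1-f) + f\,p = 0.1 + 9 = 9.1,$$

because here the problem size is assumed to grow with the number of processing elements.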

Scaling of Parallel Algorithms (Gustafson-Barsis' Law)

Suppose only a fraction $f$ of a computation was parallelized.
Speedup, $S_p = (1-f) + f\,p = p + (1-p)(1-f)$.

[Figure: speedup vs. number of processors under Gustafson-Barsis' law, for $f = 0.1, 0.2, \ldots, 0.9$. Source: Wikipedia]

Strong Scaling vs. Weak Scaling

Source: Martha Kim, Columbia University

Strong Scaling: how $T_p$ (or $S_p$) varies with $p$ when the problem size is fixed.

Weak Scaling: how $T_p$ (or $S_p$) varies with $p$ when the problem size per processing element is fixed.

Scalable Parallel Algorithms

Source: Grama et al., Introduction to Parallel Computing, 2nd Edition

Efficiency, $E = \frac{S_p}{p} = \frac{T_1}{p\,T_p}$.

A parallel algorithm is called scalable if its efficiency can be maintained at a fixed value by simultaneously increasing the number of processing elements and the problem size.

Scalability reflects a parallel algorithm's ability to utilize increasing processing elements effectively.

Scalable Parallel Algorithms

In order to keep the efficiency $E$ fixed at a constant $k$, we need

$$E = \frac{T_1}{p\,T_p} = k.$$

For the algorithm that adds $n$ numbers using $p$ processing elements: $T_1 = n$ and $T_p = \frac{n}{p} + 2\log p$, so $E = \frac{n}{n + 2p\log p}$.

So in order to keep $E$ fixed at $k$, we must have

$$\frac{n}{n + 2p\log p} = k \quad\Longrightarrow\quad n = \frac{2k}{1-k}\,p\log p.$$

Fig: Efficiency for adding n numbers using p processing elements
Source: Grama et al., Introduction to Parallel Computing, 2nd Edition
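The figure referenced above (not reproduced here) plots this efficiency; the short C program below tabulates the same formula $E = \frac{n}{n + 2p\log_2 p}$ so the trend can be inspected numerically. The particular $n$ and $p$ values are sample choices for illustration, not taken from the slide.

```c
/* Tabulate E = n / (n + 2 p log2 p) for the n-number addition algorithm. */
#include <stdio.h>
#include <math.h>

int main(void) {
    long ns[] = {64, 192, 320, 512};
    int  ps[] = {1, 4, 8, 16, 32};

    printf("%8s", "n \\ p");
    for (int j = 0; j < 5; j++) printf("%8d", ps[j]);
    printf("\n");

    for (int i = 0; i < 4; i++) {
        printf("%8ld", ns[i]);
        for (int j = 0; j < 5; j++) {
            double E = (double)ns[i] / (ns[i] + 2.0 * ps[j] * log2((double)ps[j]));
            printf("%8.2f", E);    /* efficiency drops as p grows, rises with n */
        }
        printf("\n");
    }
    return 0;
}
```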