CSC630/CSC730 Parallel & Distributed Computing


Analytical Modeling of Parallel Programs (Chapter 5)

Contents
- Sources of Parallel Overhead
- Performance Metrics
- Granularity and Data Mapping
- Scalability

Sources of Parallel Overhead
- Interprocessor communication
- Extra computation: the difference between the computation performed by the parallel program and that performed by the best serial program
- Idling: caused by load imbalance, synchronization, and sequential components

Execution Profile
Tools for analyzing parallel performance:
- gprof
- upshot
- Speedshop

Performance Visualization
Jumpshot is a Java-based visualization tool for performance analysis: http://www.mcs.anl.gov/research/projects/perfvis/software/viewers/

Performance Metrics
- Execution time: the time that elapses from the moment a parallel computation starts to the moment the last PE finishes execution. $T_S$ = serial run time, $T_P$ = parallel run time.
- Total overhead: the total time collectively spent by all the PEs over and above that required by the fastest known serial algorithm for solving the same problem on a single PE: $T_0 = p T_P - T_S$
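As a small illustration (not from the slides; the timing values are hypothetical), total overhead can be computed directly from measured run times:

```python
def total_overhead(t_serial, t_parallel, p):
    """Total overhead T_0 = p*T_P - T_S: the time all p PEs collectively
    spend beyond what the best serial algorithm needs."""
    return p * t_parallel - t_serial

# Hypothetical timings: serial run 10 s, parallel run 3 s on 4 PEs
print(total_overhead(10.0, 3.0, 4))  # 2.0 seconds of overhead
```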

Performance Metrics - Speedup
- Speedup: the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on a parallel computer with p identical PEs: $S = T_S / T_P$
- Theoretically, $S \le p$.
- Superlinear speedup ($S > p$) indicates that the serial algorithm does more work, or results from hardware features (cache effects).

Superlinear Speedup
- Superlinear speedup ($S > p$) can occur due to cache effects and algorithmic shortcuts.
- Consider performing a simple version of matrix multiplication on 1 system versus 16: the matrices might be too large for cache in the 1-system trial, but might fit quite well in the 16-system trial.
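A minimal sketch (hypothetical timings, not from the slides) for computing speedup and flagging the superlinear case:

```python
def speedup(t_best_serial, t_parallel):
    """S = T_S / T_P, where T_S is the best sequential algorithm's time."""
    return t_best_serial / t_parallel

p = 16
s = speedup(48.0, 2.5)                                   # hypothetical run times
print(s, "superlinear" if s > p else "at most linear")   # 19.2 superlinear
```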

Superlinear Effects
(figure)

Superlinear Speedup
- Exploratory decomposition: suppose there are 10 processes and the first 9 compute for 2 seconds each before determining there is no solution in their parts of the search space.
- Assume the tenth process finds a solution in 1 second.
- A serial program searching the same parts in order would take 9*2 + 1 = 19 seconds to find the solution, while the parallel program takes only 1 second.
- Speedup = 19

False Speedup
- Have we compared against the best serial algorithm? It is often not possible to find a serial algorithm that is optimal for all instances.
- Serial bubble sort takes 150 seconds; the parallel version (odd-even sort) takes 40 seconds. Is this a speedup of 150/40 = 3.75? No! You need to compare against the best serial algorithm.
- Perhaps quicksort takes 30 seconds; the real speedup is 30/40 = 0.75.

Performance Metrics - Efficiency and Cost
- Efficiency: $E = S / p$. Efficiency is also the ratio of sequential cost to parallel cost.
- Parallel cost (work, or processor-time product): the product of the parallel runtime and the number of PEs, $p T_P$.
- Cost-optimal: the parallel cost has the same asymptotic growth, as a function of the input size, as the fastest-known sequential algorithm on a single PE; equivalently, $p T_P = \Theta(T_S)$ and $E = \Theta(1)$.
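A small sketch of these metrics using the sorting figures from the slide (the processor count of 4 is my assumption, not stated on the slide):

```python
def parallel_metrics(t_best_serial, t_parallel, p):
    s = t_best_serial / t_parallel   # speedup vs. the best serial algorithm
    e = s / p                        # efficiency
    cost = p * t_parallel            # processor-time product
    return s, e, cost

# Quicksort: 30 s serial; odd-even sort: 40 s in parallel; assumed p = 4
print(parallel_metrics(30.0, 40.0, 4))  # (0.75, 0.1875, 160.0)
```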

Adding n Numbers in Parallel
(figure)

Speedup for Addition
- Using the binary-tree-based algorithm on n PEs, there are log n levels:
  $T_P = \Theta(\log n)$, $S = \Theta(n / \log n)$, $E = \Theta(1 / \log n)$, cost $= \Theta(n \log n)$
- Is the algorithm in Figure 5.2 cost-optimal? No.
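A quick sketch (my model, assuming unit-time additions and ignoring communication) that evaluates these expressions for a concrete n:

```python
import math

def tree_addition_metrics(n):
    """Adding n numbers on n PEs with a binary tree: T_S ~ n, T_P ~ log2 n."""
    t_s, t_p = n, math.log2(n)
    s = t_s / t_p          # speedup ~ n / log n
    e = s / n              # efficiency ~ 1 / log n
    cost = n * t_p         # cost ~ n log n, versus T_S ~ n: not cost-optimal
    return s, e, cost

print(tree_addition_metrics(1024))  # (102.4, 0.1, 10240.0)
```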

Importance of Cost-Optimality
- Consider a parallel sort that sorts n keys on n PEs in time $T_P = \Theta((\log n)^2)$.
- Serial cost: $T_S = \Theta(n \log n)$. Then speedup is $S = \Theta(n / \log n)$ and efficiency is $E = \Theta(1 / \log n)$.
- If we implement it with p < n processors: $T_P = \Theta(n (\log n)^2 / p)$ and $S = \Theta(p / \log n)$.
- If n = 1000000000 and p = 32, S = 32/30 = 1.07.

Granularity and Data Mapping
- Scaling down: using fewer PEs to execute a parallel algorithm.
- Design the algorithm for one input element per processor, then use fewer PEs: let p physical PEs simulate a larger number of virtual PEs, each physical PE simulating n/p virtual PEs.
- Scaling down preserves cost-optimality, but it may or may not create cost-optimality.
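A short sketch (my model, matching the numbers above) of the scaled-down, non-cost-optimal sort:

```python
import math

def scaled_down_sort_speedup(n, p):
    """Non-cost-optimal sort: T_P = (log n)^2 on n PEs, so on p < n PEs
    T_P = n*(log n)^2 / p; the best serial sort takes T_S = n*log n."""
    t_s = n * math.log2(n)
    t_p = n * math.log2(n) ** 2 / p
    return t_s / t_p       # simplifies to p / log2(n)

print(scaled_down_sort_speedup(10**9, 32))  # ~1.07, as on the slide
```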

Cost-Optimal Addition
(figure)

Cost-Optimal Addition
- Each processor adds n/p numbers locally, then the p partial sums are added on the p PEs.
- Cost is p times the parallel time: $T_P = \Theta(n/p + \log p)$, so the total cost is $\Theta(n + p \log p)$.
- As long as n dominates p log p, the cost is optimal.
- With n = 1000000 and p = 32, p log p = 160. E = ?
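A sketch answering the "E = ?" question under the slide's model (unit-time operations, communication ignored; these simplifications are my assumptions):

```python
import math

def addition_efficiency(n, p):
    """E = T_S / (p*T_P) with T_S ~ n and T_P ~ n/p + log2 p."""
    t_p = n / p + math.log2(p)
    return n / (p * t_p)

print(addition_efficiency(10**6, 32))  # ~0.99984: essentially 1, in the cost-optimal regime
```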

Scalability
- Amdahl's Law: given a serial component $W_s$ for a job of size W, speedup is limited by $W / W_s$, no matter how many PEs are used.
- As the problem size grows, the serial percentage tends to decrease for many algorithms.
- Scalability: the capability to increase speedup in proportion to the number of PEs.
- Efficiency generally drops as p increases. A scalable algorithm allows maintaining a certain level of efficiency by increasing the problem size as p increases.

FFT Speedup
(figure)
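A minimal sketch of Amdahl's law (the 5% serial fraction is a hypothetical value of W_s/W, not from the slides):

```python
def amdahl_speedup(serial_fraction, p):
    """Speedup on p PEs when a fixed fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

f = 0.05                          # hypothetical W_s / W
print(amdahl_speedup(f, 32))      # ~12.55
print(1.0 / f)                    # 20.0 = W / W_s, the limit for any p
```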

FFT Speedup
- In the FFT speedup graph, two algorithms perform much better for smaller arrays, while the third scales better.
- It is typical to run into contradictory speedup results for different problem sizes.

Scaling Characteristics
- Efficiency can be written in several ways:
  $E = \frac{S}{p} = \frac{T_S}{p T_P} = \frac{T_S}{T_S + T_0} = \frac{1}{1 + T_0 / T_S}$
- The overhead term $T_0$ is an increasing function of p, since there is some serial component during which p-1 processors are idle.
- The overhead is also a function of problem size. Often the overhead percentage becomes smaller as the problem size grows.
- As a result we can scale up to more processors effectively by increasing the problem size. A system which can maintain efficiency by scaling the problem size is called scalable.
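A quick numeric check (hypothetical timings) that the four forms of E above agree:

```python
t_s, t_p, p = 100.0, 15.0, 8      # hypothetical serial/parallel times and PE count
t_0 = p * t_p - t_s               # total overhead
s = t_s / t_p                     # speedup
print(s / p, t_s / (p * t_p), t_s / (t_s + t_0), 1.0 / (1.0 + t_0 / t_s))
# all four print ~0.8333
```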

Speedup versus PE Number
(figure)

Efficiency of Addition
The efficiency can be maintained at 0.8 on more CPUs by using larger arrays:

  N     P=1   P=4   P=8   P=16  P=32
  64    1     0.8   0.57  0.33  0.17
  192   1     0.92  0.8   0.6   0.38
  320   1     0.95  0.87  0.71  0.5
  512   1     0.97  0.91  0.8   0.62
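These values are consistent with the simple model T_P = n/p + 2*log2(p) (one addition plus one communication step per tree level); the model is my inference from the numbers, not stated on the slide. A sketch that regenerates the table:

```python
import math

def efficiency_with_comm(n, p):
    """E = T_S/(p*T_P) with T_S = n and T_P = n/p + 2*log2(p)."""
    return 1.0 if p == 1 else n / (n + 2 * p * math.log2(p))

for n in (64, 192, 320, 512):
    print(n, [round(efficiency_with_comm(n, p), 2) for p in (1, 4, 8, 16, 32)])
```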

Isoefficiency
- Isoefficiency is a metric that relates the problem size required to maintain efficiency as the number of CPUs grows.

Problem Size
- Problem size: the number of basic computation steps in the best sequential algorithm to solve the problem on a single PE.
- The problem size for matrix multiplication is $\Theta(n^3)$.
- Problem size is denoted by W, and $W = T_S$.
- Assumption: it takes unit time to perform one basic computation step of an algorithm.

Isoefficiency Function
- To maintain a given level of efficiency E, it can be shown that the problem size W must satisfy the following. Starting from $T_P = \frac{W + T_0(W, p)}{p}$ and $S = \frac{W}{T_P} = \frac{W p}{W + T_0(W, p)}$:
  $E = \frac{S}{p} = \frac{W}{W + T_0(W, p)} = \frac{1}{1 + T_0(W, p)/W}$
  $W = \frac{E}{1 - E}\, T_0(W, p) = K\, T_0(W, p)$
  where $T_0$ is the overhead function and, since E is held constant, $K = E/(1-E)$ is a constant.
- Sometimes this problem-size equation can be solved for W as p increases. Such systems are scalable, and the resulting W(p) is the isoefficiency function (a short worked sketch follows below).

CSC630/CSC730 Parallel & Distributed Computing
Questions?
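As a closing worked example (my own, using the parallel-addition model from earlier, where $T_0 = p T_P - T_S \approx 2 p \log_2 p$): the isoefficiency relation W = K*T_0 gives the array sizes needed to hold E at 0.8.

```python
import math

def isoefficiency_w(p, target_e):
    """W = K * T_0(W, p) with K = E/(1-E) and T_0 ~ 2*p*log2(p) for parallel addition."""
    k = target_e / (1.0 - target_e)
    return k * 2 * p * math.log2(p)

for p in (4, 8, 16, 32):
    print(p, isoefficiency_w(p, 0.8))
# prints 64, 192, 512, 1280 -- the first three match the E = 0.8 entries
# in the efficiency table above (N = 64 at P=4, 192 at P=8, 512 at P=16)
```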