Performance Metrics. Measuring Performance


Measuring Performance
How should the performance of a parallel computation be measured? Traditional measures like MIPS and MFLOPS really don't cut it; new ways to measure parallel performance are needed: speedup and efficiency.

Measures of Performance for Parallel Programs
Speed-up: how much faster is the parallel program than a sequential program?
Efficiency: how efficiently are the processors being utilized?

Speedup
Speedup is the most often used measure of parallel performance. If Ts(N) is the best possible serial time and T(P,N) is the time taken by a parallel algorithm of size N on P processors, then the speedup is

    S(P,N) = Ts(N) / T(P,N)
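As a quick illustration (not from the slides), the definition translates directly into code; the timing values below are made-up placeholders.

    def speedup(t_serial, t_parallel):
        """S(P, N) = Ts(N) / T(P, N): best serial time divided by parallel time."""
        return t_serial / t_parallel

    # Placeholder times in seconds, for illustration only.
    print(speedup(t_serial=76.0, t_parallel=20.0))  # -> 3.8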

Measuring
Empirically: time programs on sequential and parallel machines. Timings must be done under appropriate conditions (e.g., no time sharing).
Theoretically: use estimates of the times for different operations and build an estimate of the total time for the program. Useful for comparing different program organizations.

Read Between the Lines
Exactly what is meant by Ts(N), the time taken to run the fastest serial algorithm on one processor? One processor of the parallel computer? The fastest serial machine available? A parallel algorithm run on a single processor? Is the serial algorithm the best one? To keep things fair, Ts(N) should be the best possible time in the serial world.
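A minimal sketch of empirical timing, assuming a CPU-bound work() function of my own invention and an otherwise idle machine; it illustrates the idea rather than the course's measurement procedure.

    import time
    from multiprocessing import Pool

    def work(n):
        """CPU-bound placeholder task: sum of squares up to n."""
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        chunks = [2_000_000] * 8          # the problem, split into 8 equal pieces

        t0 = time.perf_counter()
        serial = [work(n) for n in chunks]           # serial version
        t_serial = time.perf_counter() - t0

        t0 = time.perf_counter()
        with Pool(processes=4) as pool:              # P = 4 workers
            parallel = pool.map(work, chunks)
        t_parallel = time.perf_counter() - t0

        assert serial == parallel
        print(f"S(4, N) = {t_serial / t_parallel:.2f}")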

Practical Speedup
A slightly different definition of speedup also exists: the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on P processors. However, this is misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g., the communication). The result is that the one-processor time exceeds the best serial time Ts(N), thus exaggerating the speedup.

Components of Execution Time
Execution time has three components: execution of standard instructions (arithmetic, logical, branch), input and output, and communications time:

    Ttot = Tcomp + Tio + Tcomm

Time for Computation
Computation time (Tcomp) depends on the number of instructions executed and the computation time per instruction:

    Tcomp = number of instructions x time per instruction

Although different instructions take different times, we can approximate by using averages for a given machine.

Input/Output
The importance of I/O time depends on the problem. It is critical for I/O-heavy problems, which opens an avenue of research into parallel I/O systems and storage; it is less important for computationally intensive problems such as simulations.

Communications Time
Communications time depends on the link/switch technology and on the network topology. Its components are the start-up time, i.e., the time for the smallest message to get through (latency), and the time for an additional unit of a message to get through (the rate, which is the inverse of bandwidth).

Communications Time for a Message

    Tcomm = Tl + k * Tc

where Tl is the latency (setup time), Tc is the per-unit rate, and k is the length of the message. Over a multi-hop path, Tc = d * Tr, where d is the distance and Tr is the link rate.
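A small sketch of this latency/rate model; the parameter values are hypothetical, not figures from the course.

    def comm_time(k, t_latency, t_rate, hops=1):
        """Tcomm = Tl + k * Tc, with Tc = d * Tr for a d-hop path."""
        t_c = hops * t_rate              # per-unit transfer time over the whole path
        return t_latency + k * t_c

    # Hypothetical network: 50 us start-up, 10 ns per byte per link, 3 hops.
    print(comm_time(k=4096, t_latency=50e-6, t_rate=10e-9, hops=3))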

Typical Proportions of Times
Tc is typically 1 to 10 times Tcomp. Tl is 100 to 1000 times Tcomp (distributed-memory MIMD), or worse. Tr is comparable to Tc for nearest-neighbor communications on a SIMD machine.

SIMD Communications
Nearest neighbor; distant (two hop, three hop, etc.); global (reduce). The times depend on the underlying network; the best possible for a global reduce is log P (P = number of PEs).

Factors That Limit Speedup
Software overhead: even with a completely equivalent algorithm, software overhead arises in the concurrent implementation.
Load balancing: speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work (see the sketch below).
Communication overhead: assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup.

Linear Speedup
Whichever definition is used, the ideal is to produce linear speedup: a speedup of P using P processors. In practice, however, the speedup is reduced from its ideal value of P. Superlinear speedup results when unfair values are used for Ts(N), or from differences in the nature of the hardware used.
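To illustrate the load-balancing point (my own example, not from the slides): the parallel time is set by the most heavily loaded node, so an uneven split of work wastes the other processors.

    def speedup_with_imbalance(work_per_node):
        """Parallel time = slowest node; serial time = total work (communication ignored)."""
        t_serial = sum(work_per_node)
        t_parallel = max(work_per_node)
        return t_serial / t_parallel

    print(speedup_with_imbalance([25, 25, 25, 25]))  # balanced: 4.0 on 4 nodes
    print(speedup_with_imbalance([70, 10, 10, 10]))  # imbalanced: ~1.43 on 4 nodes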

Speedup Curves
[Figure: speedup vs. number of processors, showing superlinear, linear, and typical speedup curves.]

Efficiency
Speedup does not measure how efficiently the processors are being used. Is it worth using 100 processors to get a speedup of 2? Efficiency is defined as the ratio of the speedup to the number of processors required to achieve it:

    E(P,N) = S(P,N) / P

The efficiency is bounded from above by 1.

Example

    Processors   Time (secs)   Speedup   Efficiency
    1            76            1.00      1.00
    2            38            2.00      1.00
    4            20            3.80      0.95
    5            16            4.75      0.95
    6            14            5.42      0.90
    8            11            6.90      0.86
    9            10            7.60      0.84

Speedup Curve
[Figure: speedup vs. processors (1 to 9), comparing linear speedup with the actual speedup from the table above.]
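The table follows directly from the definitions above; a short check using the times from the slide:

    times = {1: 76, 2: 38, 4: 20, 5: 16, 6: 14, 8: 11, 9: 10}  # seconds, from the table
    t_serial = times[1]

    for p, t in times.items():
        s = t_serial / t          # S(P, N) = Ts(N) / T(P, N)
        e = s / p                 # E(P, N) = S(P, N) / P
        print(f"P={p}: speedup={s:.2f}, efficiency={e:.2f}")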

Timing for Assignment 1
Let's look at FindMax and PeopleWave.

FindMax.pm

MODULE FindMax;
CONST N = 3;
CONFIGURATION grid[1..N],[1..N];
CONNECTION
  right: grid[i,j] <-> grid[i,j+1]:left;
  up:    grid[i,j] <-> grid[i+1,j]:down;
VAR
  i : INTEGER;
  value, buffer : grid OF INTEGER;

FindMax.pm (continued)

BEGIN
  value := ID(grid);
  FOR i := 1 TO N-1 DO
    buffer := MOVE.left(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  FOR i := 1 TO N-1 DO
    buffer := MOVE.down(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  i := value<:1,1:>;
  WriteInt(i,10); WriteLn;
END FindMax.

PeopleWave.pm

MODULE PeopleWave;
CONST
  GRIDSIZE = 4;
  NUMBEROFNEIGHBORS = 8;
CONFIGURATION grid[1..GRIDSIZE],[1..GRIDSIZE];
CONNECTION
  dir[0] : grid[i,j] <-> grid[i-1, j  ]:dir[4];
  dir[1] : grid[i,j] <-> grid[i-1, j+1]:dir[5];
  dir[2] : grid[i,j] <-> grid[i,   j+1]:dir[6];
  dir[3] : grid[i,j] <-> grid[i+1, j+1]:dir[7];
VAR
  i, k         : INTEGER;
  WaveElement  : grid OF INTEGER;
  Averager     : grid OF INTEGER;
  AllNeighbors : grid OF INTEGER;
  OneNeighbor  : grid OF INTEGER;

PeopleWave.pm (2)

BEGIN
  (* Initialize the wave *)
  IF DIM(grid,1) = GRIDSIZE THEN WaveElement := 1 ELSE WaveElement := 0 END;
  (* Divisor for determining new WaveElement is normally 3, but only 2 on edges *)
  IF (DIM(grid,2) = 1) OR (DIM(grid,2) = GRIDSIZE) OR
     (DIM(grid,1) = 1) OR (DIM(grid,1) = GRIDSIZE)
  THEN Averager := 2 ELSE Averager := 3 END;
  WriteInt(WaveElement, 1);

PeopleWave.pm (3)

  FOR i := 1 TO GRIDSIZE DO
    (* retrieve and average (weighted) info about neighbors *)
    AllNeighbors := WaveElement;
    FOR k := 0 TO NUMBEROFNEIGHBORS-1 DO
      OneNeighbor := 0;
      SEND.dir[(k+4) MOD 8] (WaveElement, OneNeighbor);
      IF k < 5 THEN (* apply the template *)
        AllNeighbors := AllNeighbors + OneNeighbor;
      ELSE
        AllNeighbors := AllNeighbors - OneNeighbor;
      END; (* IF *)
    END; (* FOR k *)
    WaveElement := AllNeighbors DIV Averager;

PeopleWave.pm (4)

    IF WaveElement >= 1 THEN WaveElement := 1 ELSE WaveElement := 0 END;
    WriteInt(WaveElement, 1);
  END; (* FOR i *)
END PeopleWave.

Asymptotic Analysis
Analysis as a variable increases toward infinity. Speedup S(P,N) depends on two variables, so there are three possible types of limit: fixed P with N increasing, fixed N with P increasing, or both P and N increasing in some fixed relationship.

P fixed, N increases
With a fixed number of processors and an increasing problem size, we should see more computation relative to communication, reducing overheads, and an asymptotic time similar to a single processor, thus improving efficiency. There may be exceptions if the complexity of the communications grows with problem size.

N fixed, P increases
In other words, what happens as we use more and more processors to solve a given problem? Amdahl's Law, the law of diminishing returns, is based on the observation that every problem has a part which must be computed in sequence. Let the whole problem need time 1 on a single processor. Let s be the necessarily sequential part and p the parallelizable part, so s + p = 1. Assume that the parallel part can be arbitrarily divided with no overhead or communications time. What is the speedup?

N fixed, P increases

    S(P,N) = 1 / (s + p/P)

In the limit as P --> infinity, S(P,N) --> 1/s. This says that no matter how many processors we have, the speedup is limited by 1/s.

Amdahl's Law
A parallel computation has two types of operations: those which must be executed in serial and those which can be executed in parallel. Amdahl's law states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially.

Amdahl's Law
Let the time taken to do the serial calculations be some fraction σ of the total time (0 < σ <= 1); the parallelizable portion is then 1-σ of the total. Assuming linear speedup of the parallel portion on N processors:

    Tserial   = σ * T1
    Tparallel = (1-σ) * T1 / N

By substitution,

    Speedup <= 1 / (σ + (1-σ)/N)

Consequences of Amdahl's Law
Say we have a program containing 100 operations, each of which takes 1 time unit. Suppose σ = 0.2 and we use 80 processors:

    Speedup = 100 / (20 + 80/80) = 100 / 21 < 5

A speedup of only 5 is possible no matter how many processors are available. So why bother with parallel computing? Just wait for a faster processor.
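A quick numeric sketch of the bound; the function is just the formula above, and the calls reproduce the slide's example.

    def amdahl_speedup(sigma, n_procs):
        """Upper bound on speedup with serial fraction sigma on n_procs processors."""
        return 1.0 / (sigma + (1.0 - sigma) / n_procs)

    print(amdahl_speedup(0.2, 80))       # ~4.76, i.e. 100/21 from the slide
    print(amdahl_speedup(0.2, 10**6))    # approaches the limit 1/sigma = 5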

Avoiding Amdahl
There are several ways to avoid Amdahl's law: concentrate on parallel algorithms with small serial components, and note that Amdahl's law is not complete in that it does not take into account problem size.

Amdahl's Law and Assignment 1
Let's look at PeopleWave again.

(The PeopleWave.pm listing shown earlier is repeated at this point in the slides.)

Classifying Parallel Programs
Parallel programs can be placed into broad categories based on their expected speedups.
Trivial parallelism: assumes complete parallelism with no overhead due to communication.
Divide and conquer: N / log N speedup.
Communication-bound parallelism.