Performance Metrics. Measuring Performance


Measuring Performance
How should the performance of a parallel computation be measured? Traditional measures like MIPS and MFLOPS really don't cut it; new ways to measure parallel performance are needed: speedup and efficiency.

Measures of Performance for Parallel Programs
Speed-up: how much faster is the parallel program than a sequential program?
Efficiency: how efficiently are the processors being utilized?

Speedup
Speedup is the most often used measure of parallel performance. If Ts(N) is the best possible serial time and T(P,N) is the time taken by a parallel algorithm of size N on P processors, then the speedup is

    S(P,N) = Ts(N) / T(P,N)
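As a quick illustration (not from the slides), the definition translates directly into code; the timing values below are made-up placeholders.

    def speedup(t_serial, t_parallel):
        """S(P, N) = Ts(N) / T(P, N): best serial time divided by parallel time."""
        return t_serial / t_parallel

    # Placeholder times in seconds, for illustration only.
    print(speedup(t_serial=76.0, t_parallel=20.0))  # -> 3.8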

Measuring
Empirically: time programs on sequential and parallel machines. Timings must be done under appropriate conditions (e.g., no time sharing).
Theoretically: use estimates of the times for different operations and build an estimate of the total time for the program. Useful for comparing different program organizations.

Read Between the Lines
Exactly what is meant by Ts(N), the time taken to run the fastest serial algorithm on one processor? One processor of the parallel computer? The fastest serial machine available? A parallel algorithm run on a single processor? Is the serial algorithm the best one? To keep things fair, Ts(N) should be the best possible time in the serial world.
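A minimal sketch of empirical timing, assuming a CPU-bound work() function of my own invention and an otherwise idle machine; it illustrates the idea rather than the course's measurement procedure.

    import time
    from multiprocessing import Pool

    def work(n):
        """CPU-bound placeholder task: sum of squares up to n."""
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        chunks = [2_000_000] * 8          # the problem, split into 8 equal pieces

        t0 = time.perf_counter()
        serial = [work(n) for n in chunks]           # serial version
        t_serial = time.perf_counter() - t0

        t0 = time.perf_counter()
        with Pool(processes=4) as pool:              # P = 4 workers
            parallel = pool.map(work, chunks)
        t_parallel = time.perf_counter() - t0

        assert serial == parallel
        print(f"S(4, N) = {t_serial / t_parallel:.2f}")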

Practical Speedup
A slightly different definition of speedup also exists: the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on P processors. However, this is misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g., the communication). The result is that the one-processor time exceeds the best serial time Ts(N), thus exaggerating the speedup.

Components of Execution Time
Execution time has three components: execution of standard instructions (arithmetic, logical, branch), input and output, and communications time:

    Ttot = Tcomp + Tio + Tcomm

Time for Computation
Computation time (Tcomp) depends on the number of instructions executed and the computation time per instruction:

    Tcomp = number of instructions x time per instruction

Although different instructions take different times, we can approximate by using averages for a given machine.

Input/Output
The importance of I/O time depends on the problem. It is critical for I/O-heavy problems, which opens an avenue of research into parallel I/O systems and storage; it is less important for computationally intensive problems such as simulations.

Communications Time
Communications time depends on the link/switch technology and on the network topology. Its components are the start-up time, i.e., the time for the smallest message to get through (latency), and the time for an additional unit of a message to get through (the rate, which is the inverse of bandwidth).

Communications Time for a Message

    Tcomm = Tl + k * Tc

where Tl is the latency (setup time), Tc is the per-unit rate, and k is the length of the message. Over a multi-hop path, Tc = d * Tr, where d is the distance and Tr is the link rate.
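A small sketch of this latency/rate model; the parameter values are hypothetical, not figures from the course.

    def comm_time(k, t_latency, t_rate, hops=1):
        """Tcomm = Tl + k * Tc, with Tc = d * Tr for a d-hop path."""
        t_c = hops * t_rate              # per-unit transfer time over the whole path
        return t_latency + k * t_c

    # Hypothetical network: 50 us start-up, 10 ns per byte per link, 3 hops.
    print(comm_time(k=4096, t_latency=50e-6, t_rate=10e-9, hops=3))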

Typical Proportions of Times
Tc is typically 1 to 10 times Tcomp. Tl is 100 to 1000 times Tcomp (distributed-memory MIMD), or worse. Tr is comparable to Tc for nearest-neighbor communications on a SIMD machine.

SIMD Communications
Nearest neighbor; distant (two hop, three hop, etc.); global (reduce). The times depend on the underlying network; the best possible for a global reduce is log P (P = number of PEs).

Factors That Limit Speedup
Software overhead: even with a completely equivalent algorithm, software overhead arises in the concurrent implementation.
Load balancing: speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work (see the sketch below).
Communication overhead: assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup.

Linear Speedup
Whichever definition is used, the ideal is to produce linear speedup: a speedup of P using P processors. In practice, however, the speedup is reduced from its ideal value of P. Superlinear speedup results when unfair values are used for Ts(N), or from differences in the nature of the hardware used.
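To illustrate the load-balancing point (my own example, not from the slides): the parallel time is set by the most heavily loaded node, so an uneven split of work wastes the other processors.

    def speedup_with_imbalance(work_per_node):
        """Parallel time = slowest node; serial time = total work (communication ignored)."""
        t_serial = sum(work_per_node)
        t_parallel = max(work_per_node)
        return t_serial / t_parallel

    print(speedup_with_imbalance([25, 25, 25, 25]))  # balanced: 4.0 on 4 nodes
    print(speedup_with_imbalance([70, 10, 10, 10]))  # imbalanced: ~1.43 on 4 nodes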

Speedup Curves
[Figure: speedup vs. number of processors, showing superlinear, linear, and typical speedup curves.]

Efficiency
Speedup does not measure how efficiently the processors are being used. Is it worth using 100 processors to get a speedup of 2? Efficiency is defined as the ratio of the speedup to the number of processors required to achieve it:

    E(P,N) = S(P,N) / P

The efficiency is bounded from above by 1.

Example

    Processors   Time (secs)   Speedup   Efficiency
    1            76            1.00      1.00
    2            38            2.00      1.00
    4            20            3.80      0.95
    5            16            4.75      0.95
    6            14            5.42      0.90
    8            11            6.90      0.86
    9            10            7.60      0.84

Speedup Curve
[Figure: speedup vs. processors (1 to 9), comparing linear speedup with the actual speedup from the table above.]
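The table follows directly from the definitions above; a short check using the times from the slide:

    times = {1: 76, 2: 38, 4: 20, 5: 16, 6: 14, 8: 11, 9: 10}  # seconds, from the table
    t_serial = times[1]

    for p, t in times.items():
        s = t_serial / t          # S(P, N) = Ts(N) / T(P, N)
        e = s / p                 # E(P, N) = S(P, N) / P
        print(f"P={p}: speedup={s:.2f}, efficiency={e:.2f}")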

Timing for Assignment 1
Let's look at FindMax and PeopleWave.

FindMax.pm

MODULE FindMax;
CONST N = 3;
CONFIGURATION grid[1..N],[1..N];
CONNECTION
  right: grid[i,j] <-> grid[i,j+1]:left;
  up:    grid[i,j] <-> grid[i+1,j]:down;
VAR
  i : INTEGER;
  value, buffer : grid OF INTEGER;

FindMax.pm (continued)

BEGIN
  value := ID(grid);
  FOR i := 1 TO N-1 DO
    buffer := MOVE.left(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  FOR i := 1 TO N-1 DO
    buffer := MOVE.down(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  i := value<:1,1:>;
  WriteInt(i,10); WriteLn;
END FindMax.

PeopleWave.pm

MODULE PeopleWave;
CONST
  GRIDSIZE = 4;
  NUMBEROFNEIGHBORS = 8;
CONFIGURATION grid[1..GRIDSIZE],[1..GRIDSIZE];
CONNECTION
  dir[0] : grid[i,j] <-> grid[i-1, j  ]:dir[4];
  dir[1] : grid[i,j] <-> grid[i-1, j+1]:dir[5];
  dir[2] : grid[i,j] <-> grid[i,   j+1]:dir[6];
  dir[3] : grid[i,j] <-> grid[i+1, j+1]:dir[7];
VAR
  i, k         : INTEGER;
  WaveElement  : grid OF INTEGER;
  Averager     : grid OF INTEGER;
  AllNeighbors : grid OF INTEGER;
  OneNeighbor  : grid OF INTEGER;

PeopleWave.pm (2)

BEGIN
  (* Initialize the wave *)
  IF DIM(grid,1) = GRIDSIZE THEN WaveElement := 1 ELSE WaveElement := 0 END;
  (* Divisor for determining new WaveElement is normally 3, but only 2 on edges *)
  IF (DIM(grid,2) = 1) OR (DIM(grid,2) = GRIDSIZE) OR
     (DIM(grid,1) = 1) OR (DIM(grid,1) = GRIDSIZE)
  THEN Averager := 2 ELSE Averager := 3 END;
  WriteInt(WaveElement, 1);

PeopleWave.pm (3)

  FOR i := 1 TO GRIDSIZE DO
    (* retrieve and average (weighted) info about neighbors *)
    AllNeighbors := WaveElement;
    FOR k := 0 TO NUMBEROFNEIGHBORS-1 DO
      OneNeighbor := 0;
      SEND.dir[(k+4) MOD 8] (WaveElement, OneNeighbor);
      IF k < 5 THEN (* apply the template *)
        AllNeighbors := AllNeighbors + OneNeighbor;
      ELSE
        AllNeighbors := AllNeighbors - OneNeighbor;
      END; (* IF *)
    END; (* FOR k *)
    WaveElement := AllNeighbors DIV Averager;

PeopleWave.pm (4)

    IF WaveElement >= 1 THEN WaveElement := 1 ELSE WaveElement := 0 END;
    WriteInt(WaveElement, 1);
  END; (* FOR i *)
END PeopleWave.

Asymptotic Analysis
Analysis as a variable increases toward infinity. Speedup S(P,N) depends on two variables, so there are three possible types of limit: fixed P with N increasing, fixed N with P increasing, or both P and N increasing in some fixed relationship.

P fixed, N increases
With a fixed number of processors and an increasing problem size, we should see more computation relative to communication, reducing overheads, and an asymptotic time similar to a single processor, thus improving efficiency. There may be exceptions if the complexity of the communications grows with problem size.

N fixed, P increases
In other words, what happens as we use more and more processors to solve a given problem? Amdahl's Law, the law of diminishing returns, is based on the observation that every problem has a part which must be computed in sequence. Let the whole problem need time 1 on a single processor. Let s be the necessarily sequential part and p the parallelizable part, so s + p = 1. Assume that the parallel part can be arbitrarily divided with no overhead or communications time. What is the speedup?

N fixed, P increases

    S(P,N) = 1 / (s + p/P)

In the limit as P --> infinity, S(P,N) --> 1/s. This says that no matter how many processors we have, the speedup is limited by 1/s.

Amdahl's Law
A parallel computation has two types of operations: those which must be executed in serial and those which can be executed in parallel. Amdahl's law states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially.

Amdahl's Law
Let the time taken to do the serial calculations be some fraction σ of the total time (0 < σ <= 1); the parallelizable portion is then 1-σ of the total. Assuming linear speedup of the parallel portion on N processors:

    Tserial   = σ * T1
    Tparallel = (1-σ) * T1 / N

By substitution,

    Speedup <= 1 / (σ + (1-σ)/N)

Consequences of Amdahl's Law
Say we have a program containing 100 operations, each of which takes 1 time unit. Suppose σ = 0.2 and we use 80 processors:

    Speedup = 100 / (20 + 80/80) = 100 / 21 < 5

A speedup of only 5 is possible no matter how many processors are available. So why bother with parallel computing? Just wait for a faster processor.
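A quick numeric sketch of the bound; the function is just the formula above, and the calls reproduce the slide's example.

    def amdahl_speedup(sigma, n_procs):
        """Upper bound on speedup with serial fraction sigma on n_procs processors."""
        return 1.0 / (sigma + (1.0 - sigma) / n_procs)

    print(amdahl_speedup(0.2, 80))       # ~4.76, i.e. 100/21 from the slide
    print(amdahl_speedup(0.2, 10**6))    # approaches the limit 1/sigma = 5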

Avoiding Amdahl
There are several ways to avoid Amdahl's law: concentrate on parallel algorithms with small serial components, and note that Amdahl's law is not complete in that it does not take into account problem size.

Amdahl's Law and Assignment 1
Let's look at PeopleWave again.

(The PeopleWave.pm listing shown earlier is repeated at this point in the slides.)

Classifying Parallel Programs
Parallel programs can be placed into broad categories based on their expected speedups.
Trivial parallelism: assumes complete parallelism with no overhead due to communication.
Divide and conquer: N / log N speedup.
Communication-bound parallelism.