Parallel Systems, Part 7: Evaluation of Computers and Programs (foils by Yang-Suk Kee, X. Sun, T. Fahringer)

How To Evaluate Computers and Programs?
Learning objectives: predict the performance of parallel programs on parallel computers; understand the barriers to higher performance.
Simulation-based evaluation: accurate simulators are costly to develop and verify; simulation is time-consuming; sometimes it is the only option because the machine does not yet exist.
Quantitative evaluation: a grounded engineering discipline; standard benchmarks.
Understanding parallel programs as workloads is critical!

Workload Classification
Serial type: increase throughput
Single application runs serially
Batch processing: number of jobs per time unit
I/O issue: network bandwidth >= aggregate I/O requirements
Workload management
Interactive
Transaction processing
Multi-user logins
Multi-job serial
Parametric computation

Workload Classification (Cont'd)
Parallel type: turn-around time or response time
Single application runs on multiple nodes
Workloads: workloads with large effort; workloads with minimum effort

Workloads with Large Effort
Grand Challenge Problems: PetaFLOP levels of computation; fundamental problems in science and engineering with broad application (ex: computational fluid dynamics for weather forecasting)
Academic research (thesis work)
Heavily used programs: databases, OLTP servers, Internet servers, online games, stock prediction, etc.
Aggressive parallelization effort should be justified

Workloads with Minimum Effort
Commercial transaction processing systems
Inter-transaction parallelism: multiple transactions at the same time
Intra-application parallelism: parallelism within a single database operation; learn how to express queries for best parallel execution

Performance Improvement (Speedup)
When work is fixed:
speedup(p) = performance(p) / performance(1) = time(1) / time(p)
Basic measure of multiprocessor performance:
efficiency(p) = speedup(p) / p
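To make the two measures concrete, here is a minimal Python sketch (not from the foils) that computes them from measured timings; the numbers are hypothetical.

```python
# A minimal sketch: fixed-work speedup and efficiency from wall-clock times.

def speedup(time_1, time_p):
    """speedup(p) = time(1) / time(p) for a fixed amount of work."""
    return time_1 / time_p

def efficiency(time_1, time_p, p):
    """efficiency(p) = speedup(p) / p."""
    return speedup(time_1, time_p) / p

t1, t8 = 120.0, 20.0              # hypothetical timings on 1 and 8 processors
print(speedup(t1, t8))            # 6.0
print(efficiency(t1, t8, 8))      # 0.75
```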

Scaling Problem (Small Work)
Appropriate for a small machine
Parallelism overheads begin to dominate the benefits on larger machines: load imbalance, communication-to-computation ratio
May even produce slowdowns

Scaling Problem (Large Work)
Appropriate for a big machine
Difficult to measure improvement
May not fit on a small machine: can't run, thrashes to disk, or the working set doesn't fit in cache
Fits at some p, leading to superlinear speedup

Demonstrating Scaling Problems
[figures: small Ocean problem on SGI Origin2000, where parallelism overhead limits speedup; big equation-solver problem on SGI Origin2000, showing superlinear speedup]
Users want to scale problems as machines grow!

Definitions
Scaling a machine: make a machine more powerful; machine size = <processor, memory, communication, I/O>
Scaling a machine in parallel processing: add more identical nodes
Problem size: input configuration
Data set size: the amount of storage required to run the problem on a single processor
Memory usage: the amount of memory used by the program

Two Key Issues in Problem Scaling
Under what constraints should the problem be scaled? Some properties must be fixed as the machine scales.
How should the problem be scaled? Which parameters, and how?

Constraints To Scale
Two types of constraints:
User-oriented: easy to think about and change (e.g. particles, rows, transactions)
Resource-oriented: e.g. memory, time

Resource-Oriented Constraints
Problem constrained (PC): problem size fixed
Memory constrained (MC): memory size fixed
Time constrained (TC): execution time fixed

Some Definitions
t_s: processing time of the serial part of a program (using 1 processor)
t_p(1): processing time of the parallel part of the program using 1 processor
t_p(P): processing time of the parallel part of the program using P processors
T(1): total processing time of the program, including both the serial and the parallel parts, using 1 processor = t_s + t_p(1)
T(P): total processing time of the program, including both the serial and the parallel parts, using P processors = t_s + t_p(P)

Problem Constrained Scaling: Amdahl's Law
The main objective is to produce the results as soon as possible (turnaround time); (ex) video compression, computer graphics, VLSI routing, etc.
Main usage of Amdahl's and Gustafson's laws: estimate speedup as a measure of a program's potential for parallelism
Implications: the upper bound is 1/α; decrease the serial part as much as possible; optimize the common case
A modified Amdahl's law for fixed problem size includes the overhead (see below)

Fixed-Size Speedup (Amdahl's Law, '67)
[figure: bar charts of amount of work and elapsed time for 1-5 processors; the total work W_1 + W_p stays fixed while the parallel elapsed time T_p shrinks as p grows]

Limitations of Amdahl's Law
Ignores performance overhead (e.g. communication, load imbalance)
Overestimates the achievable speedup

Enhanced Amdahl's Law
The overhead includes parallelism and interaction overheads:
Speedup_PC = T(1) / (α·T(1) + (1-α)·T(1)/p + T_overhead) ≤ 1/α
Amdahl's law: an argument against massively parallel systems
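A small Python sketch of this formula (the overhead value is an assumed input; the foils do not specify a model for it):

```python
# Enhanced (problem-constrained) Amdahl speedup:
# Speedup_PC = T(1) / (alpha*T(1) + (1 - alpha)*T(1)/p + T_overhead)

def amdahl_speedup(alpha, p, t1=1.0, t_overhead=0.0):
    return t1 / (alpha * t1 + (1.0 - alpha) * t1 / p + t_overhead)

# With alpha = 0.05 the speedup never exceeds 1/alpha = 20, however large p gets.
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p, t_overhead=0.01), 2))
```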

Amdahl Effect
Speedup_PC = T(1) / (α·T(1) + (1-α)·T(1)/p + T_overhead)
Typically T_overhead has lower complexity than (1-α)·T(1)/p
As the problem size n increases, (1-α)·T(1)/p dominates T_overhead
As the problem size n increases, speedup increases

Illustration of Amdahl Effect
[figure: speedup vs. number of processors for n = 100, n = 1,000, and n = 10,000; larger problem sizes give higher speedup curves]
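A numerical sketch of the Amdahl effect under an assumed cost model (illustrative only, not taken from the foils): the serial part grows like n, the parallel part like n^2/p, and the overhead like log2(p).

```python
import math

def speedup(n, p):
    t1 = n + n ** 2                        # time on 1 processor: serial + parallel part
    tp = n + n ** 2 / p + math.log2(p)     # time on p processors, with overhead
    return t1 / tp

for n in (100, 1_000, 10_000):
    print(n, [round(speedup(n, p), 1) for p in (4, 16, 64)])
# Larger n pushes each curve closer to the ideal speedup p.
```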

Review of Amdahl's Law
Treats problem size as a constant
Shows how execution time decreases as the number of processors increases

Another Perspective
We often use faster computers to solve larger problem instances
Let's treat time as a constant and allow the problem size to increase with the number of processors

Time Constrained Scaling: Gustafson's Law
The user wants more accurate results within a time limit; execution time is fixed as the system scales; (ex) FEM for structural analysis, FDM for fluid dynamics
Properties of a work metric: easy to measure; architecture independent; easy to model with an analytical expression; the measure of work should scale linearly with the sequential time complexity of the algorithm

Gustafson's Law (Without Overhead)
α = t_s / (t_s + t_p(p)): the serial fraction of the fixed parallel execution time
[diagram: a time bar split into α and (1-α) on p processors becomes α + (1-α)·p when run on one processor]
Speedup_TC = Work(p) / Work(1) = (α·W + (1-α)·p·W) / W = α + (1-α)·p, where p is the number of processors
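A one-line sketch of the fixed-time (scaled) speedup, swept over p (α is the serial fraction of the scaled run; the value 0.05 is arbitrary):

```python
def gustafson_speedup(alpha, p):
    # Speedup_TC = alpha + (1 - alpha) * p
    return alpha + (1.0 - alpha) * p

for p in (2, 8, 64, 1024):
    print(p, gustafson_speedup(0.05, p))   # grows almost linearly with p
```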

Fixed-Time Speedup (Gustafson)
[figure: bar charts of amount of work and elapsed time for 1-5 processors; the work W_p grows with p while the total elapsed time stays fixed]

Converting α's between Amdahl's and Gustafson's Laws
α_A = 1 / (1 + ((1-α_G)/α_G)·n) = α_G / (α_G + (1-α_G)·n)
α_G = α_A·n / (α_A·(n-1) + 1)
With these conversions, Amdahl's and Gustafson's laws predict the same speedup: they describe the same program from two perspectives.
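The conversion can be checked numerically; in this sketch the values of α_G and n are arbitrary:

```python
def amdahl(alpha_a, n):
    return 1.0 / (alpha_a + (1.0 - alpha_a) / n)

def gustafson(alpha_g, n):
    return alpha_g + (1.0 - alpha_g) * n

alpha_g, n = 0.1, 32
alpha_a = alpha_g / (alpha_g + (1.0 - alpha_g) * n)   # conversion from the foil
print(amdahl(alpha_a, n), gustafson(alpha_g, n))      # both ~28.9
```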

Memory Constrained Scaling: Sun and Ni's Law
Scale to the largest possible solution limited by the memory space; or, fix the memory usage per processor (e.g. the N-body problem)
The problem size is scaled from W to W*, where W* is the work executed under the memory limitation of a parallel computer
Speedup_MC = T(1, W*) / T(P, W*)
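The foil only states the definition Speedup_MC = T(1, W*) / T(P, W*); the sketch below uses the commonly cited closed form of Sun and Ni's law with a workload-growth factor G(p) (G(p) = 1 recovers Amdahl's law, G(p) = p recovers Gustafson's law). The growth function in the example is an assumption.

```python
def sun_ni_speedup(alpha, p, G):
    g = G(p)
    # Speedup_MC = (alpha + (1 - alpha) * G(p)) / (alpha + (1 - alpha) * G(p) / p)
    return (alpha + (1.0 - alpha) * g) / (alpha + (1.0 - alpha) * g / p)

# Assumed growth: work scales as p**1.5 when per-processor memory is kept fixed.
print(sun_ni_speedup(0.05, 64, lambda p: p ** 1.5))   # ~63.6, close to linear
```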

Memory-Bounded Speedup (Sun & Ni)
Work executed under memory limitation; hierarchical memory
[figure: bar charts of amount of work and elapsed time for 1-5 processors under memory-bounded scaling]

Parallel Performance Metrics (run-time is the dominant metric)
Run-time (execution time)
Speed: MFLOPS, MIPS
Speedup
Efficiency: E = speedup / number of processors
Scalability

Scalability: The Need for New Metrics
Comparison of performance under different workloads
Availability of massively parallel processing
Definition (scalability): the ability to maintain parallel processing gains when both the problem size and the machine size increase

Ideally Scalable
T(m·p, m·W) = T(p, W)
T: execution time; W: work executed; p: number of processors used; m: scale up by m times
Work: flop count based on the best practical serial algorithm
Fact: T(m·p, m·W) = T(p, W) if and only if the average unit speed is fixed

Definition (average unit speed): the average unit speed is the achieved speed (work executed divided by execution time) divided by the number of processors.
Definition (isospeed scalability): an algorithm-machine combination is scalable if the achieved average unit speed can remain constant as the number of processors increases, provided the problem size is increased proportionally.

Isoefficiency
Parallel system: a parallel program executing on a parallel computer
Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
A scalable system maintains efficiency as processors are added
Isoefficiency: a way to measure scalability
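A small isoefficiency sketch under an assumed overhead model (illustrative, not from the foils): writing the total overhead as T_o(W, p), efficiency is E = W / (W + T_o(W, p)), so keeping E constant forces W to grow with p.

```python
import math

def efficiency(W, p, overhead):
    return W / (W + overhead(W, p))

# Assumed overhead: T_o = p * log2(p), e.g. a reduction-style communication cost.
overhead = lambda W, p: p * math.log2(p)

# Growing W in proportion to p*log2(p) holds efficiency constant: the
# isoefficiency function of this assumed system is Theta(p log p).
for p in (4, 16, 64):
    W = 10 * p * math.log2(p)
    print(p, round(efficiency(W, p, overhead), 3))   # constant ~0.909
```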

Isospeed Scalability (Sun & Rover, '91)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Ideal case: W' = (p'/p)·W
Scalability: ψ(p, p') = (p'·W) / (p·W')
Scalability in terms of time: ψ(p, p') = T_p(W) / T_p'(W'), where T_p(W) is the time for work W on p processors and T_p'(W') is the time for work W' on p' processors
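A small Python sketch of the metric (the measurement numbers are made up):

```python
def isospeed_scalability(p, w, p_prime, w_prime):
    # psi(p, p') = (p' * W) / (p * W'); psi = 1 is the ideal case.
    return (p_prime * w) / (p * w_prime)

# Hypothetical measurement: going from 16 to 64 processors, the work had to grow
# 5x (rather than the ideal 4x) to hold the average unit speed constant.
print(isospeed_scalability(16, 1.0, 64, 5.0))   # 0.8
```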

The Relation of Scalability and Time
Higher scalability leads to smaller scaled execution time
Better initial run-time and higher scalability lead to superior run-time
The same initial run-time and the same scalability lead to the same scaled performance
Superior initial performance may not last long if scalability is low

Summary (1/3)
Performance terms: speedup, efficiency
Model of speedup: serial component, parallel component, overhead component

Summary (2/3)
What prevents linear speedup? Serial operations, communication operations, process start-up, imbalanced workloads, architectural limitations

Summary (3/3)
Analyzing parallel performance: Amdahl's Law, Gustafson's Law, Sun-Ni's Law, the isoefficiency metric