Parallel Systems Part 7: Evaluation of Computers and Programs
Foils by Yang-Suk Kee, X. Sun, T. Fahringer
How To Evaluate Computers and Programs?
Learning objectives:
Predict the performance of parallel programs on parallel computers
Understand barriers to higher performance
Simulation-based evaluation:
Accurate simulators are costly to develop and verify
Simulation is time-consuming
Sometimes done for machines that do not yet exist
Quantitative evaluation:
A grounded engineering discipline
Standard benchmarks
Understanding parallel programs as workloads is critical!
Workload Classification
Serial type: increase throughput
Single application runs serially
Batch processing: number of jobs per time unit
I/O issue: network bandwidth >= aggregate I/O requirements
Workload management
Interactive
Transaction processing
Multi-user logins
Multi-job serial
Parametric computation
Workload Classification (Cont'd)
Parallel type: turnaround time or response time
Single application run on multiple nodes
Workloads:
Workload with large effort
Workload with minimum effort
Workload with Large Efforts
Grand Challenge Problems: PetaFLOP levels of computation
Fundamental problems in science and engineering with broad application, e.g. computational fluid dynamics for weather forecasting
Academic research, thesis work
Heavily used programs: databases, OLTP servers, Internet servers, online games, stock prediction, etc.
An aggressive parallelization effort should be justified
Workload with Minimum Efforts
Commercial transaction processing systems
Inter-transaction parallelism: multiple transactions at the same time
Intra-application parallelism: parallelism within a single database operation; learn how to express queries for the best parallel execution
Performance Improvement (Speedup)
When the work is fixed:
speedup(p) = performance(p) / performance(1) = time(1) / time(p)
Basic measures of multiprocessor performance:
efficiency(p) = speedup(p) / p
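As a concrete illustration, here is a minimal Python sketch (the run times are invented for the example) that computes both measures from measured timings:

# Speedup and efficiency from measured run times (illustrative values).
def speedup(t1, tp):
    # speedup(p) = time(1) / time(p)
    return t1 / tp

def efficiency(t1, tp, p):
    # efficiency(p) = speedup(p) / p
    return speedup(t1, tp) / p

t1, tp, p = 120.0, 18.0, 8        # assumed run times in seconds on 1 and 8 processors
print(speedup(t1, tp))            # ~6.67x faster
print(efficiency(t1, tp, p))      # ~0.83, i.e. 83% of linear speedup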
Scaling Problem (Small Work)
Appropriate for a small machine
Parallelism overheads begin to dominate the benefits on larger machines:
Load imbalance
Communication-to-computation ratio
May even achieve slowdowns
Scaling Problem (Large Work)
Appropriate for a big machine
Difficult to measure improvement
May not fit on a small machine:
Can't run
Thrashing to disk
Working set doesn't fit in cache
Fits at some p, leading to superlinear speedup
Demonstrating Scaling Problems
(Figures: a small Ocean problem and a big equation-solver problem, both on the SGI Origin2000, showing parallelism overhead in the one case and superlinear speedup in the other.)
Users want to scale problems as machines grow!
Definitions
Scaling a machine: make a machine more powerful
Machine size: <processor, memory, communication, I/O>
Scaling a machine in parallel processing: add more identical nodes
Problem size:
Input configuration
Data set size: the amount of storage required to run it on a single processor
Memory usage: the amount of memory used by the program
Two Key Issues in Problem Scaling
Under what constraints should the problem be scaled? Some properties must be fixed as the machine scales
How should the problem be scaled? Which parameters? How?
Constraints To Scale
Two types of constraints:
User-oriented: quantities that are easy to think about and change, e.g. particles, rows, transactions
Resource-oriented: e.g. memory, time
Resource-Oriented Constraints
Problem constrained (PC): problem size fixed
Memory constrained (MC): memory size fixed
Time constrained (TC): execution time fixed
Some Definitions
t_s: processing time of the serial part of a program (using 1 processor)
t_p(1): processing time of the parallel part of the program using 1 processor
t_p(P): processing time of the parallel part of the program using P processors
T(1): total processing time of the program, including both the serial and the parallel parts, using 1 processor: T(1) = t_s + t_p(1)
T(P): total processing time of the program, including both the serial and the parallel parts, using P processors: T(P) = t_s + t_p(P)
Problem Constrained Scaling: Amdahl's Law
The main objective is to produce the results as soon as possible (turnaround time)
(ex) video compression, computer graphics, VLSI routing, etc.
Main usage of Amdahl's and Gustafson's laws: estimate speedup as a measure of a program's potential for parallelism
Implications:
The upper bound is 1/α
Decrease the serial part as much as possible
Optimize the common case
A modified Amdahl's law for fixed problem size includes the overhead
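For instance, with an assumed serial fraction α = 0.05 (not a value from the foils), a short Python sketch shows how the fixed-size speedup approaches the 1/α bound:

# Amdahl's law for a fixed problem size with serial fraction alpha.
def amdahl_speedup(alpha, p):
    # speedup(p) = 1 / (alpha + (1 - alpha) / p)
    return 1.0 / (alpha + (1.0 - alpha) / p)

alpha = 0.05                                  # assumed serial fraction
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(alpha, p), 2))
print("upper bound 1/alpha =", 1.0 / alpha)   # 20.0, no matter how many processors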
Fixed-Size Speedup (Amdahl's Law, 1967)
(Figure: as the number of processors p grows from 1 to 5, the amount of work W_1 stays fixed while the elapsed time T_p shrinks.)
Limitations of Amdahl's Law
Ignores performance overhead (e.g. communication, load imbalance)
Overestimates the achievable speedup
Enhanced Amdahl's Law
The overhead includes parallelism and interaction overheads:
Speedup_PC = T_1 / (α·T_1 + (1-α)·T_1/p + T_overhead)
As p → ∞, Speedup_PC → 1 / (α + T_overhead/T_1) ≤ 1/α
Amdahl's law: an argument against massively parallel systems
Amdahl Effect
Speedup_PC = T_1 / (α·T_1 + (1-α)·T_1/p + T_overhead)
Typically T_overhead has lower complexity than (1-α)·T_1/p
As the problem size n increases, (1-α)·T_1/p dominates T_overhead
As the problem size n increases, the speedup increases
Illustration of Amdahl Effect
(Figure: speedup versus number of processors for problem sizes n = 100, n = 1,000, and n = 10,000; the larger the problem, the higher the speedup curve.)
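The same effect can be reproduced numerically; the cost models below (sequential time ~ n^2, overhead ~ n·log n, α = 0.05) are assumptions chosen only for illustration:

# Amdahl effect: when T_overhead has lower complexity than the parallel term,
# speedup improves as the problem size n grows (all cost models assumed).
import math

def speedup(n, p, alpha=0.05):
    t1 = float(n * n)                 # assumed sequential time ~ n^2
    t_serial = alpha * t1
    t_parallel = (1.0 - alpha) * t1 / p
    t_overhead = n * math.log2(n)     # assumed overhead ~ n log n
    return t1 / (t_serial + t_parallel + t_overhead)

p = 64
for n in (100, 1_000, 10_000):
    print(n, round(speedup(n, p), 2))   # speedup grows with n, as in the figure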
Review of Amdahl's Law
Treats the problem size as a constant
Shows how the execution time decreases as the number of processors increases
Another Perspective
We often use faster computers to solve larger problem instances
Let's treat time as a constant and allow the problem size to increase with the number of processors
Time Constrained Scaling: Gustafson's Law
The user wants more accurate results within a time limit
The execution time is fixed as the system scales
(ex) FEM for structural analysis, FDM for fluid dynamics
Properties of a work metric:
Easy to measure
Architecture independent
Easy to model with an analytical expression
The measure of work should scale linearly with the sequential time complexity of the algorithm
Gustafson's Law (Without Overhead)
Let α be the serial fraction of the fixed execution time: α = t_s / (t_s + t_p(p))
On p processors the parallel part represents (1-α)·p times as much work, so the scaled work is α·W + (1-α)·p·W
Speedup_TC = Work(p) / Work(1) = (α·W + (1-α)·p·W) / W = α + (1-α)·p
where p is the number of processors
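A minimal sketch of this fixed-time (scaled) speedup, again with an assumed serial fraction:

# Gustafson's law: execution time fixed, work scales with p.
def gustafson_speedup(alpha, p):
    # Speedup_TC = alpha + (1 - alpha) * p
    return alpha + (1.0 - alpha) * p

alpha = 0.05                                # assumed serial fraction of the fixed time
for p in (2, 8, 64, 1024):
    print(p, gustafson_speedup(alpha, p))   # grows nearly linearly with p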
Fixed-Time Speedup (Gustafson)
(Figure: as the number of processors p grows from 1 to 5, the elapsed time stays fixed while the amount of work W_p grows with p.)
Converting α between Amdahl's and Gustafson's Laws
α_A = α_G / (α_G + (1 - α_G)·n)
Based on this observation, Amdahl's and Gustafson's laws are identical: when the serial fractions are related in this way, both laws predict the same speedup on n processors.
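A quick numerical check of the conversion (the values are illustrative):

# Converting the Gustafson fraction alpha_G into the Amdahl fraction alpha_A
# makes both laws predict the same speedup on n processors.
def amdahl(alpha_a, n):
    return 1.0 / (alpha_a + (1.0 - alpha_a) / n)

def gustafson(alpha_g, n):
    return alpha_g + (1.0 - alpha_g) * n

alpha_g, n = 0.05, 64                                  # assumed values
alpha_a = alpha_g / (alpha_g + (1.0 - alpha_g) * n)    # the conversion above
print(amdahl(alpha_a, n), gustafson(alpha_g, n))       # both ~60.85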
Memory Constrained Scaling: Sun and Ni's Law
Scale to the largest possible problem limited by the memory space, or fix the memory usage per processor
e.g. N-body problem
The problem size is scaled from W to W*, where W* is the work executed under the memory limitation of a parallel computer
Speedup_MC = T(1, W*) / T(P, W*)
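The foils give only the definition above; a commonly used closed form of Sun-Ni's memory-bounded speedup introduces a function G(p) describing how the work grows when the memory grows p-fold (G(p) = 1 recovers Amdahl, G(p) = p recovers Gustafson). A sketch under that assumption:

# Sun-Ni memory-bounded speedup, assuming work grows by G(p) when memory grows p-fold:
#   Speedup_MC = (alpha + (1 - alpha) * G(p)) / (alpha + (1 - alpha) * G(p) / p)
def sun_ni_speedup(alpha, p, G):
    g = G(p)
    return (alpha + (1.0 - alpha) * g) / (alpha + (1.0 - alpha) * g / p)

alpha = 0.05                               # assumed serial fraction
fixed_size = lambda p: 1.0                 # G(p) = 1 -> Amdahl (PC)
fixed_time = lambda p: float(p)            # G(p) = p -> Gustafson (TC)
matmul_like = lambda p: p ** 1.5           # memory O(n^2), work O(n^3) -> G(p) = p^1.5

for p in (8, 64):
    print(p, [round(sun_ni_speedup(alpha, p, G), 2)
              for G in (fixed_size, fixed_time, matmul_like)])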
Memory-Bounded Speedup (Sun & Ni)
Work executed under the memory limitation
Hierarchical memory
(Figure: amount of work W_p and elapsed time T_p as the number of processors grows from 1 to 5 under the memory constraint.)
Parallel Performance Metrics (run time is the dominant metric)
Run time (execution time)
Speed: MFLOPS, MIPS
Speedup
Efficiency: E = Speedup / Number of Processors
Scalability
Scalability: The Need for New Metrics
Comparison of performance under different workloads
Availability of massively parallel processing
Definition (scalability): the ability to maintain the parallel processing gain when both the problem size and the machine size increase
Ideally Scalable
T(m·p, m·W) = T(p, W)
T: execution time
W: work executed
p: number of processors used
m: scale-up factor (m times)
Work: flop count based on the best practical serial algorithm
Fact: T(m·p, m·W) = T(p, W) if and only if the average unit speed is fixed
Definition (average unit speed): the average unit speed is the achieved speed (work divided by execution time) divided by the number of processors
Definition (isospeed scalability): an algorithm-machine combination is scalable if the achieved average unit speed can remain constant with an increasing number of processors, provided the problem size is increased proportionally
Isoefficiency
Parallel system: a parallel program executing on a parallel computer
Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
A scalable system maintains efficiency as processors are added
Isoefficiency: a way to measure scalability
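The foils only name the metric; in the usual formulation, efficiency E = W / (W + T_o(W, p)) is held constant by growing the work as W = K·T_o(W, p) with K = E/(1-E). The overhead model below (T_o = p·log2 p, as for adding n numbers in parallel) is an assumed textbook example:

# Isoefficiency sketch: E = W / (W + T_o(W, p)); keeping E constant requires
# W = K * T_o(W, p) with K = E / (1 - E). Assumed overhead: T_o = p * log2(p).
import math

def efficiency(W, p):
    T_o = p * math.log2(p)              # assumed total overhead
    return W / (W + T_o)

E_target = 0.8
K = E_target / (1.0 - E_target)
for p in (4, 16, 64, 256):
    W = K * p * math.log2(p)            # isoefficiency function: W = Theta(p log p)
    print(p, round(W, 1), round(efficiency(W, p), 2))   # efficiency stays at 0.8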
Isospeed Scalability (Sun & Rover, '91)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Scalability: ψ(p, p') = (p'·W) / (p·W')
Ideal case: W' = (p'/p)·W, so ψ(p, p') = 1
Scalability in terms of time: ψ(p, p') = T_p(W) / T_p'(W') = (time for work W on p processors) / (time for work W' on p' processors)
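A small sketch computing ψ(p, p') from work and processor counts (the numbers are invented):

# Isospeed scalability: psi(p, p') = (p' * W) / (p * W'), where W' is the work
# needed on p' processors to keep the average unit speed W / (p * T) constant.
def isospeed_scalability(p, W, p_prime, W_prime):
    return (p_prime * W) / (p * W_prime)

# Invented numbers: going from 16 to 64 processors, the work must grow
# from 1.0e9 to 4.8e9 (flops) to hold the average unit speed.
print(isospeed_scalability(16, 1.0e9, 64, 4.8e9))   # ~0.83; the ideal value is 1.0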
The Relation of Scalability and Time
Higher scalability leads to smaller scaled execution time
Better initial run time and higher scalability lead to superior run time
The same initial run time and the same scalability lead to the same scaled performance
Superior initial performance may not last long if scalability is low
Summary (1/3)
Performance terms:
Speedup
Efficiency
Model of speedup:
Serial component
Parallel component
Overhead component
Summary (2/3)
What prevents linear speedup?
Serial operations
Communication operations
Process start-up
Imbalanced workloads
Architectural limitations
Summary (3/3)
Analyzing parallel performance:
Amdahl's Law
Gustafson's Law
Sun-Ni's Law
Isoefficiency metric