Parallel Systems, Part 7: Evaluation of Computers and Programs (foils by Yang-Suk Kee, X. Sun, T. Fahringer)

How To Evaluate Computers and Programs?
Learning objectives: predict the performance of parallel programs on parallel computers; understand the barriers to higher performance.
Simulation-based evaluation: accurate simulators are costly to develop and verify; simulation is time-consuming; sometimes it is the only option because the machine does not yet exist.
Quantitative evaluation: a grounded engineering discipline; standard benchmarks.
Understanding parallel programs as workloads is critical!

Workload Classification
Serial type: increase throughput
Single application runs serially
Batch processing: number of jobs per time unit
I/O issue: network bandwidth >= aggregate I/O requirements
Workload management
Interactive
Transaction processing
Multi-user logins
Multi-job serial
Parametric computation

Workload Classification (Cont'd)
Parallel type: turn-around time or response time
Single application runs on multiple nodes
Workloads: workloads with large effort; workloads with minimum effort

Workloads with Large Effort
Grand Challenge Problems: PetaFLOP levels of computation; fundamental problems in science and engineering with broad application (ex: computational fluid dynamics for weather forecasting)
Academic research (thesis work)
Heavily used programs: databases, OLTP servers, Internet servers, online games, stock prediction, etc.
Aggressive parallelization effort should be justified

Workloads with Minimum Effort
Commercial transaction processing systems
Inter-transaction parallelism: multiple transactions at the same time
Intra-application parallelism: parallelism within a single database operation; learn how to express queries for best parallel execution

Performance Improvement (Speedup)
When work is fixed:
speedup(p) = performance(p) / performance(1) = time(1) / time(p)
Basic measure of multiprocessor performance:
efficiency(p) = speedup(p) / p
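To make the two measures concrete, here is a minimal Python sketch (not from the foils) that computes them from measured timings; the numbers are hypothetical.

```python
# A minimal sketch: fixed-work speedup and efficiency from wall-clock times.

def speedup(time_1, time_p):
    """speedup(p) = time(1) / time(p) for a fixed amount of work."""
    return time_1 / time_p

def efficiency(time_1, time_p, p):
    """efficiency(p) = speedup(p) / p."""
    return speedup(time_1, time_p) / p

t1, t8 = 120.0, 20.0              # hypothetical timings on 1 and 8 processors
print(speedup(t1, t8))            # 6.0
print(efficiency(t1, t8, 8))      # 0.75
```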

Scaling Problem (Small Work)
Appropriate for a small machine
Parallelism overheads begin to dominate the benefits on larger machines: load imbalance, communication-to-computation ratio
May even produce slowdowns

Scaling Problem (Large Work)
Appropriate for a big machine
Difficult to measure improvement
May not fit on a small machine: can't run, thrashes to disk, or the working set doesn't fit in cache
Fits at some p, leading to superlinear speedup

Demonstrating Scaling Problems
[figures: small Ocean problem on SGI Origin2000, where parallelism overhead limits speedup; big equation-solver problem on SGI Origin2000, showing superlinear speedup]
Users want to scale problems as machines grow!

Definitions
Scaling a machine: make a machine more powerful; machine size = <processor, memory, communication, I/O>
Scaling a machine in parallel processing: add more identical nodes
Problem size: input configuration
Data set size: the amount of storage required to run the problem on a single processor
Memory usage: the amount of memory used by the program

Two Key Issues in Problem Scaling
Under what constraints should the problem be scaled? Some properties must be fixed as the machine scales.
How should the problem be scaled? Which parameters, and how?

Constraints To Scale
Two types of constraints:
User-oriented: easy to think about and change (e.g. particles, rows, transactions)
Resource-oriented: e.g. memory, time

Resource-Oriented Constraints
Problem constrained (PC): problem size fixed
Memory constrained (MC): memory size fixed
Time constrained (TC): execution time fixed

Some Definitions
t_s: processing time of the serial part of a program (using 1 processor)
t_p(1): processing time of the parallel part of the program using 1 processor
t_p(P): processing time of the parallel part of the program using P processors
T(1): total processing time of the program, including both the serial and the parallel parts, using 1 processor = t_s + t_p(1)
T(P): total processing time of the program, including both the serial and the parallel parts, using P processors = t_s + t_p(P)

Problem Constrained Scaling: Amdahl's Law
The main objective is to produce the results as soon as possible (turnaround time); (ex) video compression, computer graphics, VLSI routing, etc.
Main usage of Amdahl's and Gustafson's laws: estimate speedup as a measure of a program's potential for parallelism
Implications: the upper bound is 1/α; decrease the serial part as much as possible; optimize the common case
A modified Amdahl's law for fixed problem size includes the overhead (see below)

Fixed-Size Speedup (Amdahl's Law, '67)
[figure: bar charts of amount of work and elapsed time for 1-5 processors; the total work W_1 + W_p stays fixed while the parallel elapsed time T_p shrinks as p grows]

Limitations of Amdahl's Law
Ignores performance overhead (e.g. communication, load imbalance)
Overestimates the achievable speedup

Enhanced Amdahl's Law
The overhead includes parallelism and interaction overheads:
Speedup_PC = T(1) / (α·T(1) + (1-α)·T(1)/p + T_overhead) ≤ 1/α
Amdahl's law: an argument against massively parallel systems
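A small Python sketch of this formula (the overhead value is an assumed input; the foils do not specify a model for it):

```python
# Enhanced (problem-constrained) Amdahl speedup:
# Speedup_PC = T(1) / (alpha*T(1) + (1 - alpha)*T(1)/p + T_overhead)

def amdahl_speedup(alpha, p, t1=1.0, t_overhead=0.0):
    return t1 / (alpha * t1 + (1.0 - alpha) * t1 / p + t_overhead)

# With alpha = 0.05 the speedup never exceeds 1/alpha = 20, however large p gets.
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p, t_overhead=0.01), 2))
```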

Amdahl Effect
Speedup_PC = T(1) / (α·T(1) + (1-α)·T(1)/p + T_overhead)
Typically T_overhead has lower complexity than (1-α)·T(1)/p
As the problem size n increases, (1-α)·T(1)/p dominates T_overhead
As the problem size n increases, speedup increases

Illustration of Amdahl Effect
[figure: speedup vs. number of processors for n = 100, n = 1,000, and n = 10,000; larger problem sizes give higher speedup curves]
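A numerical sketch of the Amdahl effect under an assumed cost model (illustrative only, not taken from the foils): the serial part grows like n, the parallel part like n^2/p, and the overhead like log2(p).

```python
import math

def speedup(n, p):
    t1 = n + n ** 2                        # time on 1 processor: serial + parallel part
    tp = n + n ** 2 / p + math.log2(p)     # time on p processors, with overhead
    return t1 / tp

for n in (100, 1_000, 10_000):
    print(n, [round(speedup(n, p), 1) for p in (4, 16, 64)])
# Larger n pushes each curve closer to the ideal speedup p.
```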

Review of Amdahl's Law
Treats problem size as a constant
Shows how execution time decreases as the number of processors increases

Another Perspective
We often use faster computers to solve larger problem instances
Let's treat time as a constant and allow the problem size to increase with the number of processors

Time Constrained Scaling: Gustafson's Law
The user wants more accurate results within a time limit; execution time is fixed as the system scales; (ex) FEM for structural analysis, FDM for fluid dynamics
Properties of a work metric: easy to measure; architecture independent; easy to model with an analytical expression; the measure of work should scale linearly with the sequential time complexity of the algorithm

Gustafson's Law (Without Overhead)
α = t_s / (t_s + t_p(p)): the serial fraction of the fixed parallel execution time
[diagram: a time bar split into α and (1-α) on p processors becomes α + (1-α)·p when run on one processor]
Speedup_TC = Work(p) / Work(1) = (α·W + (1-α)·p·W) / W = α + (1-α)·p, where p is the number of processors
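A one-line sketch of the fixed-time (scaled) speedup, swept over p (α is the serial fraction of the scaled run; the value 0.05 is arbitrary):

```python
def gustafson_speedup(alpha, p):
    # Speedup_TC = alpha + (1 - alpha) * p
    return alpha + (1.0 - alpha) * p

for p in (2, 8, 64, 1024):
    print(p, gustafson_speedup(0.05, p))   # grows almost linearly with p
```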

Fixed-Time Speedup (Gustafson)
[figure: bar charts of amount of work and elapsed time for 1-5 processors; the work W_p grows with p while the total elapsed time stays fixed]

Converting α's between Amdahl's and Gustafson's Laws
α_A = 1 / (1 + ((1-α_G)/α_G)·n) = α_G / (α_G + (1-α_G)·n)
α_G = α_A·n / (α_A·(n-1) + 1)
With these conversions, Amdahl's and Gustafson's laws predict the same speedup: they describe the same program from two perspectives.
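The conversion can be checked numerically; in this sketch the values of α_G and n are arbitrary:

```python
def amdahl(alpha_a, n):
    return 1.0 / (alpha_a + (1.0 - alpha_a) / n)

def gustafson(alpha_g, n):
    return alpha_g + (1.0 - alpha_g) * n

alpha_g, n = 0.1, 32
alpha_a = alpha_g / (alpha_g + (1.0 - alpha_g) * n)   # conversion from the foil
print(amdahl(alpha_a, n), gustafson(alpha_g, n))      # both ~28.9
```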

Memory Constrained Scaling: Sun and Ni's Law
Scale to the largest possible solution limited by the memory space; or, fix the memory usage per processor (e.g. the N-body problem)
The problem size is scaled from W to W*, where W* is the work executed under the memory limitation of a parallel computer
Speedup_MC = T(1, W*) / T(P, W*)
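The foil only states the definition Speedup_MC = T(1, W*) / T(P, W*); the sketch below uses the commonly cited closed form of Sun and Ni's law with a workload-growth factor G(p) (G(p) = 1 recovers Amdahl's law, G(p) = p recovers Gustafson's law). The growth function in the example is an assumption.

```python
def sun_ni_speedup(alpha, p, G):
    g = G(p)
    # Speedup_MC = (alpha + (1 - alpha) * G(p)) / (alpha + (1 - alpha) * G(p) / p)
    return (alpha + (1.0 - alpha) * g) / (alpha + (1.0 - alpha) * g / p)

# Assumed growth: work scales as p**1.5 when per-processor memory is kept fixed.
print(sun_ni_speedup(0.05, 64, lambda p: p ** 1.5))   # ~63.6, close to linear
```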

Memory-Bounded Speedup (Sun & Ni)
Work executed under memory limitation; hierarchical memory
[figure: bar charts of amount of work and elapsed time for 1-5 processors under memory-bounded scaling]

Parallel Performance Metrics (run-time is the dominant metric)
Run-time (execution time)
Speed: MFLOPS, MIPS
Speedup
Efficiency: E = speedup / number of processors
Scalability

Scalability: The Need for New Metrics
Comparison of performance under different workloads
Availability of massively parallel processing
Definition (scalability): the ability to maintain parallel processing gains when both the problem size and the machine size increase

Ideally Scalable
T(m·p, m·W) = T(p, W)
T: execution time; W: work executed; p: number of processors used; m: scale up by m times
Work: flop count based on the best practical serial algorithm
Fact: T(m·p, m·W) = T(p, W) if and only if the average unit speed is fixed

Definition (average unit speed): the average unit speed is the achieved speed (work executed divided by execution time) divided by the number of processors.
Definition (isospeed scalability): an algorithm-machine combination is scalable if the achieved average unit speed can remain constant as the number of processors increases, provided the problem size is increased proportionally.

Isoefficiency
Parallel system: a parallel program executing on a parallel computer
Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
A scalable system maintains efficiency as processors are added
Isoefficiency: a way to measure scalability
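A small isoefficiency sketch under an assumed overhead model (illustrative, not from the foils): writing the total overhead as T_o(W, p), efficiency is E = W / (W + T_o(W, p)), so keeping E constant forces W to grow with p.

```python
import math

def efficiency(W, p, overhead):
    return W / (W + overhead(W, p))

# Assumed overhead: T_o = p * log2(p), e.g. a reduction-style communication cost.
overhead = lambda W, p: p * math.log2(p)

# Growing W in proportion to p*log2(p) holds efficiency constant: the
# isoefficiency function of this assumed system is Theta(p log p).
for p in (4, 16, 64):
    W = 10 * p * math.log2(p)
    print(p, round(efficiency(W, p, overhead), 3))   # constant ~0.909
```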

Isospeed Scalability (Sun & Rover, '91)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Ideal case: W' = (p'/p)·W
Scalability: ψ(p, p') = (p'·W) / (p·W')
Scalability in terms of time: ψ(p, p') = T_p(W) / T_p'(W'), where T_p(W) is the time for work W on p processors and T_p'(W') is the time for work W' on p' processors
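A small Python sketch of the metric (the measurement numbers are made up):

```python
def isospeed_scalability(p, w, p_prime, w_prime):
    # psi(p, p') = (p' * W) / (p * W'); psi = 1 is the ideal case.
    return (p_prime * w) / (p * w_prime)

# Hypothetical measurement: going from 16 to 64 processors, the work had to grow
# 5x (rather than the ideal 4x) to hold the average unit speed constant.
print(isospeed_scalability(16, 1.0, 64, 5.0))   # 0.8
```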

The Relation of Scalability and Time
Higher scalability leads to smaller scaled execution time
Better initial run-time and higher scalability lead to superior run-time
The same initial run-time and the same scalability lead to the same scaled performance
Superior initial performance may not last long if scalability is low

Summary (1/3)
Performance terms: speedup, efficiency
Model of speedup: serial component, parallel component, overhead component

Summary (2/3)
What prevents linear speedup? Serial operations, communication operations, process start-up, imbalanced workloads, architectural limitations

Summary (3/3)
Analyzing parallel performance: Amdahl's Law, Gustafson's Law, Sun-Ni's Law, the isoefficiency metric