Parallel Programming WS16 HOMEWORK (with solutions)

1 Performance Metrics: Basic concepts

1. Performance. Suppose we have two computers A and B. Computer A has a clock cycle of 1 ns and performs on average 2 instructions per cycle. Computer B, instead, has a clock cycle of 600 ps and performs on average 1.25 instructions per cycle. Consider the program P, which requires the execution of the same number of instructions in both computers. Which computer is faster for this program? What if P, when executed in Computer B, performs 10% more instructions than when executed in Computer A?

Computer A performs 2 instructions/cycle × 1 cycle/(10^-9 s) = 2 × 10^9 instructions per second. Computer B performs 1.25 instructions/cycle × 1 cycle/(600 × 10^-12 s) ≈ 2.08 × 10^9 instructions per second. Computer B performs more instructions per second, thus it is the faster one for this program.

Now, let n be the number of instructions required by Computer A, and 1.1 n the number of instructions required by Computer B. The program will take n/(2 × 10^9) seconds in Computer A and 1.1 n/(2.08 × 10^9) ≈ n/(1.89 × 10^9) seconds in Computer B. Therefore, in this scenario, Computer A executes the program faster.
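
As a quick sanity check of the arithmetic above, the following Python sketch (not part of the original hand-out; variable names are ours) reproduces the instruction rates and the two runtimes.

# Sketch: which computer is faster (exercise 1)?
cycle_A = 1e-9            # clock cycle of Computer A: 1 ns
cycle_B = 600e-12         # clock cycle of Computer B: 600 ps
ipc_A, ipc_B = 2.0, 1.25  # average instructions per cycle

rate_A = ipc_A / cycle_A  # instructions per second: 2.00e9
rate_B = ipc_B / cycle_B  # instructions per second: ~2.08e9
print(f"A: {rate_A:.2e} instr/s, B: {rate_B:.2e} instr/s")

# Second scenario: on B the program executes 10% more instructions than on A.
n = 1e9                   # instruction count on Computer A (any value works)
time_A = n / rate_A
time_B = 1.1 * n / rate_B
print("A is faster" if time_A < time_B else "B is faster")   # prints: A is faster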

2. Speedup. Assume the runtime of a program is 100 seconds for a problem of size 1. The program consists of an initialization phase, which lasts for 10 seconds and cannot be parallelized, and a problem-solving phase, which can be perfectly parallelized and grows quadratically with increasing problem size. What is the speedup for the program as a function of the number of processors p and the problem size n? What is the execution time and speedup of the program with problem size 1, if it is parallelized and run on 4 processors? What is the execution time of the program if the problem size is increased by a factor of 4 and it is run on 4 processors? And on 16 processors? What is the speedup of both measurements?

The application has an inherently sequential part (c_s) that takes 10 seconds, and a parallelizable part (c_p) that takes 90 seconds for problem size 1. Since the parallelizable part grows quadratically with the problem size, we can model the execution time on p processors as

T(p, n) := c_s + (c_p n^2)/p = 10 + (90 n^2)/p.

The speedup is therefore

S(p, n) := T(1, n)/T(p, n) = (10 + 90 n^2)/(10 + (90 n^2)/p).

For problem size 1 (n = 1) and 4 processors (p = 4), the execution time is 32.5 seconds. The achieved speedup is 3.08. Finally, if the problem size is increased to 4, the execution time and speedup using 4 and 16 processors are: 4 processors: 370 seconds, speedup of 3.92; 16 processors: 100 seconds, speedup of 14.5.
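
The model and the requested values can be reproduced with a few lines of Python; this sketch (our own illustration) simply encodes T(p, n) and S(p, n) as defined above.

# Sketch: execution-time model and speedup for exercise 2.
c_s, c_p = 10.0, 90.0     # sequential and parallelizable parts for size 1 (seconds)

def T(p, n):
    """Modeled execution time on p processors for problem size n."""
    return c_s + c_p * n**2 / p

def S(p, n):
    """Speedup relative to the 1-processor run."""
    return T(1, n) / T(p, n)

print(T(4, 1), S(4, 1))      # 32.5 s, speedup ~3.08
print(T(4, 4), S(4, 4))      # 370.0 s, speedup ~3.92
print(T(16, 4), S(16, 4))    # 100.0 s, speedup 14.5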

3. Amdahl's law. Assume an application where the execution of floating-point instructions on a certain processor P1 consumes 60% of the total runtime. Moreover, let us assume that 25% of the floating-point time is spent in square root calculations. Based on some initial research, the design team of the next-generation processor P2 believes that they could either improve the performance of all floating-point instructions by a factor of 1.5 or alternatively speed up the square root operation by a factor of 8. From which design alternative would the aforementioned application benefit the most? Instead of waiting for the next processor generation, the developers of the application decide to parallelize the code. What speedup can be achieved on a 16-CPU system, if 90% of the entire program can be perfectly parallelized? What fraction of the code has to be parallelized to get a speedup of 10?

Amdahl's law: S_p := 1/(β + (1 - β)/p), where β is the fraction that is not improved and p is the improvement factor of the remaining fraction.

Improvement 1 (all floating-point instructions sped up by a factor of 1.5): unimproved fraction β = 0.4, p = 1.5. The application would observe a total speedup of S = 1/(0.4 + (1 - 0.4)/1.5) = 1.25.

Improvement 2 (square root instructions sped up by a factor of 8): the non-square-root floating-point time (0.75 × 0.6 = 0.45) is not improved, so β = 0.4 + 0.45 = 0.85 and p = 8. The application would observe a total speedup of S = 1/(0.85 + (1 - 0.85)/8) ≈ 1.15.

Thus, the application would benefit the most from the first alternative.

Parallelization of the code. The speedup achieved on a 16-CPU system is S = 1/(0.1 + (1 - 0.1)/16) = 6.4. To attain a speedup of 10, 96% of the code would need to be perfectly parallelizable. This value is obtained by solving the equation 10 = 1/(β + (1 - β)/16), which gives β = 0.04.
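
A small helper makes it easy to evaluate Amdahl's law for all the cases above; the following Python sketch (our own, not part of the hand-out) also solves for the sequential fraction that yields a speedup of 10 on 16 CPUs.

# Sketch: Amdahl's law for exercise 3.
def amdahl(beta, p):
    """Overall speedup when the fraction (1 - beta) is accelerated by factor p."""
    return 1.0 / (beta + (1.0 - beta) / p)

# Design alternatives for processor P2.
print(amdahl(0.40, 1.5))   # all FP sped up by 1.5x  -> 1.25
print(amdahl(0.85, 8))     # only sqrt sped up by 8x -> ~1.15

# Parallelization on 16 CPUs with 90% of the code perfectly parallel.
print(amdahl(0.10, 16))    # -> 6.4

# Fraction needed for a speedup of 10 on 16 CPUs:
# solve 10 = 1 / (beta + (1 - beta)/16) for beta.
beta = (1.0 / 10 - 1.0 / 16) / (1.0 - 1.0 / 16)
print(beta, 1.0 - beta)    # beta = 0.04, parallel fraction = 0.96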

4. Efficiency. Consider a computer that has a peak performance of 8 GFlops/s. An application running on this computer executes 15 TFlops and takes 1 hour to complete. How many GFlops/s did the application attain? Which efficiency did it achieve?

The application attained 15 TFlops / 3600 s ≈ 4.17 GFlops/s. The achieved efficiency is 4.17 GFlops/s / 8 GFlops/s ≈ 52%.

5. Parallel efficiency. Given the data in Tab. 1, use your favorite plotting tool to plot a) the scalability of the program (speedup vs. number of processors) and b) the parallel efficiency attained (parallel efficiency vs. number of processors). In both cases plot also the ideal case, that is, scalability equal to the number of processors and parallel efficiency equal to 1, respectively.

# Processors   Best seq. (1)    2      4      8      16
# GFlops/s          4.0        7.6   14.9   23.1   35.6

Table 1: Performance attained vs. number of processors.

Table 2 includes the speedup and parallel efficiency attained. Figure 1 gives an example of the requested plots.

# Processors   Best seq. (1)    2       4       8       16
# Speedup           1.0        1.9    3.725   5.775    8.9
# Par. Eff.         1.0        0.95   0.93    0.72     0.56

Table 2: Speedup and parallel efficiency attained vs. number of processors.

[Figure 1: Scalability and parallel efficiency (speedup and parallel efficiency vs. number of processors, measured and ideal).]
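
For completeness, here is a minimal matplotlib sketch (our own choice of tool; the exercise leaves the plotting tool open) that derives Table 2 from Table 1 and draws the two panels of Figure 1.

# Sketch: derive Table 2 from Table 1 and reproduce the plots of Figure 1.
import matplotlib.pyplot as plt

procs  = [1, 2, 4, 8, 16]
gflops = [4.0, 7.6, 14.9, 23.1, 35.6]              # Table 1

speedup = [g / gflops[0] for g in gflops]          # 1, 1.9, 3.725, 5.775, 8.9
par_eff = [s / p for s, p in zip(speedup, procs)]  # 1, 0.95, 0.93, 0.72, 0.56

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.plot(procs, speedup, "o-", label="measured")
ax1.plot(procs, procs, "--", label="ideal")        # ideal scalability: S = p
ax1.set_xlabel("Number of processors")
ax1.set_ylabel("Speedup")
ax1.legend()
ax2.plot(procs, par_eff, "o-", label="measured")
ax2.plot(procs, [1.0] * len(procs), "--", label="ideal")  # ideal efficiency: 1
ax2.set_xlabel("Number of processors")
ax2.set_ylabel("Parallel efficiency")
ax2.legend()
plt.tight_layout()
plt.show()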

6. Weak scalability. Algorithm A takes an integer k as input and uses O(k^2) workspace. In terms of complexity, A is cubic in k. Let T_p(k) be the execution time of A to solve a problem of size k with p processes. It is known that T_1(8000) = 24 minutes. Keeping the ratio workspace/processor constant, and assuming perfect scalability, what is the expected execution time for k = 128000?

We have: workspace complexity O(k^2), computational complexity O(k^3), and T_1(8000) = 24 minutes. We have to find p and t in T_p(128000) = t minutes.

We start by finding the necessary number of processors. The problem size k is 16 times larger (because 128000/8000 = 16). Since the workspace complexity is quadratic, we will need 256 times more memory. To keep the ratio workspace/processor constant, we will need 256 times more processors. Therefore, p = 256.

Now, to compute T_256(128000), we can start from T_1(8000) and work our way towards the solution. First we compute T_1(128000) = 16^3 × T_1(8000) = 16^3 × 24 minutes, because the computational complexity is cubic. Then, since we have perfect scalability, T_256(128000) is simply T_1(128000)/256 = 16 × 24 = 384 minutes.

2 Real world scenarios

1. A given program was run on a node consisting of 4 multi-core processors, each of which comprises 10 cores. The peak performance of the node is 320 GFlops/s. To study the scalability of the program, it was run using 1, 2, 4, 8, 16, 24, 32 and 40 of the cores, for a fixed problem size. For this problem size, the program executes 22 × 10^12 flops. The execution time for each number of cores is collected in Tab. 3. Complete the table with the performance, speedup, efficiency, and parallel efficiency for each case. (The efficiency is computed with respect to the peak of the cores in use, i.e., 8 GFlops/s per core.)

# Cores   Time (s)    Perf. (GF/s)   Eff.   Speedup   Par. Eff.
    1     34566.23        0.64       0.08     1.00      1.00
    2     17133.64        1.28       0.08     2.02      1.01
    4      8535.92        2.58       0.08     4.05      1.01
    8      4326.95        5.08       0.08     7.99      1.00
   16      2206.02        9.97       0.08    15.67      0.98
   24      1531.88       14.36       0.07    22.56      0.94
   32      1133.53       19.41       0.08    30.49      0.95
   40       943.30       23.32       0.07    36.64      0.92

Table 3: Study of performance, speedup and efficiency.
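
The derived columns of Tab. 3 can be recomputed from the measured times. The sketch below (ours) assumes, consistently with the numbers in the table, that the efficiency column is taken against the peak of the cores in use (8 GFlops/s per core).

# Sketch: recompute the derived columns of Table 3 from the measured times.
total_flops   = 22e12        # flops executed by the program
peak_per_core = 320.0 / 40   # 8 GFlops/s per core (320 GFlops/s node peak, 40 cores)

times = {1: 34566.23, 2: 17133.64, 4: 8535.92, 8: 4326.95,
         16: 2206.02, 24: 1531.88, 32: 1133.53, 40: 943.30}

t1 = times[1]
for cores, t in times.items():
    perf    = total_flops / t / 1e9           # attained GFlops/s
    eff     = perf / (cores * peak_per_core)  # fraction of the peak of the cores in use
    speedup = t1 / t
    par_eff = speedup / cores
    print(f"{cores:2d}  {perf:6.2f} GF/s  eff {eff:.2f}  "
          f"speedup {speedup:5.2f}  par.eff {par_eff:.2f}")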

2. A program P consists of parts A, B and C. Part A is not parallelized, while parts B and C are parallelized. The program is run on a computer with 10 cores. For a given problem size, the execution of the program on 1 of the cores takes 17, 120, and 53 seconds for parts A, B, and C, respectively. Answer the following questions: Which is the minimum execution time we can attain for each part if we now execute the same program, with the same problem size, on 5 of the cores? What is the best speedup we can expect for the entire program? What if we run it using the 10 cores? In practice, the execution of the program using 5 and 10 cores leads to the times shown in Tab. 4. What is the speedup attained in each case?

# Cores   Part A (s)   Part B (s)   Part C (s)
    1         17           120          53
    5         17            33          18
   10         17            17           9

Table 4: Timings for program P using 1, 5 and 10 cores.

The minimum possible execution time for each part, for 5 and 10 cores, is given in Tab. 5.

# Cores   Part A (s)   Part B (s)   Part C (s)
    1         17           120          53
    5         17            24          10.6
   10         17            12           5.3

Table 5: Minimum possible timings for program P using 1, 5 and 10 cores.

The best speedup we can achieve in each case is:
For 5 cores: S_5 = T_1/T_5 = 190/(17 + 173/5) ≈ 3.68
For 10 cores: S_10 = T_1/T_10 = 190/(17 + 173/10) ≈ 5.54

The speedup achieved in practice is:
For 5 cores: S_5 = T_1/T_5 = 190/68 ≈ 2.79
For 10 cores: S_10 = T_1/T_10 = 190/43 ≈ 4.42
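
The best and measured speedups can be checked with the following Python sketch (our own), which takes the timings straight from Tab. 4 and assumes parts B and C are perfectly parallelizable in the best case.

# Sketch: best possible vs. measured speedup for program P.
serial   = 17.0          # part A, not parallelized
parallel = 120.0 + 53.0  # parts B and C on 1 core
t1 = serial + parallel   # 190 s on a single core

measured = {5: 17 + 33 + 18, 10: 17 + 17 + 9}   # total times from Tab. 4

for cores, t_meas in measured.items():
    t_best = serial + parallel / cores          # Tab. 5, summed over the parts
    print(f"{cores:2d} cores: best speedup {t1 / t_best:.2f}, "
          f"measured speedup {t1 / t_meas:.2f}")
# 5 cores: best 3.68, measured 2.79;  10 cores: best 5.54, measured 4.42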