Performance Metrics. Measuring Performance


1 Measuring Performance
How should the performance of a parallel computation be measured? Traditional measures like MIPS and MFLOPS really don't cut it; new ways to measure parallel performance are needed: speedup and efficiency.

2 Measures of Performance for Parallel Programs
Speed-up: how much faster than a sequential program? Efficiency: how efficiently are the processors utilized?

Speedup
Speedup is the most often used measure of parallel performance. If Ts(N) is the best possible serial time and T(P,N) is the time taken by a parallel algorithm of size N on P processors, then speedup is S(P,N) = Ts(N) / T(P,N).
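As a concrete illustration of this definition, here is a minimal Python sketch (the function name and timings are illustrative, not from the slides) that computes S(P,N) from two measured times:

    def speedup(t_serial, t_parallel):
        # S(P,N) = Ts(N) / T(P,N): best serial time divided by parallel time
        return t_serial / t_parallel

    # Example: best serial run takes 40 s, the parallel run on 8 processors takes 5 s
    print(speedup(40.0, 5.0))   # 8.0, i.e. linear speedup on 8 processors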

3 Measuring Performance
Empirically: time programs on sequential and parallel machines. Timings must be done under appropriate conditions (e.g., no time sharing).
Theoretically: use estimates of the times for different operations and build up an estimate of the total time for the program. Useful for comparing different program organizations.

Read Between the Lines
Exactly what is meant by Ts(N) (i.e., the time taken to run the fastest serial algorithm on one processor)? One processor of the parallel computer? The fastest serial machine available? A parallel algorithm run on a single processor? Is the serial algorithm the best one? To keep things fair, Ts(N) should be the best possible time in the serial world.
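A minimal sketch of the empirical approach, assuming the work can be wrapped in a Python callable (the helper name is my own, not from the slides); taking the best of several repetitions is one way to reduce interference from other activity on the machine:

    import time

    def best_wall_time(fn, repeats=5):
        # Wall-clock time of fn(), best of several runs to reduce
        # interference from time sharing and other background activity.
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        return best

    # Usage (illustrative): t_s = best_wall_time(serial_version)
    #                       t_p = best_wall_time(parallel_version)
    #                       print(t_s / t_p)   # measured speedup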

4 Practical Speedup
A slightly different definition of speedup also exists: the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on P processors. However, this can be misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g., the communication). The result is that the single-processor time in the numerator is larger than Ts(N), which exaggerates the speedup.

Components of Execution Time
Execution of standard instructions (arithmetic, logical, branch), input and output, and communications time: Ttot = Tcomp + Tio + Tcomm.
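The decomposition above can be used as a back-of-the-envelope cost model; the sketch below is purely illustrative, with hypothetical numbers chosen only to show the arithmetic:

    def total_time(t_comp, t_io, t_comm):
        # Ttot = Tcomp + Tio + Tcomm (all in seconds)
        return t_comp + t_io + t_comm

    # Hypothetical breakdown: 10 s of computation, 1 s of I/O, 2 s of communication
    print(total_time(10.0, 1.0, 2.0))   # 13.0 s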

5 Time for Computation
Computation time (Tcomp) depends on the number of instructions executed and the computation time per instruction: Tcomp = number of instructions x time per instruction. Although different instructions take different times, we can approximate by using averages for a given machine.

Input / Output
Importance depends on the problem. Critical for high-I/O problems; opens an avenue of research into parallel I/O systems and storage. Less important for computationally intensive problems and simulations.
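A small sketch of the averaging idea, assuming a hypothetical instruction mix (the fractions and per-instruction times are invented for illustration):

    # Hypothetical instruction mix: (fraction of instructions, time per instruction in ns)
    mix = [(0.6, 1.0), (0.3, 2.0), (0.1, 5.0)]    # e.g. ALU, memory, branch
    avg_ns = sum(frac * t for frac, t in mix)     # weighted average time per instruction

    num_instructions = 1e9
    t_comp = num_instructions * avg_ns * 1e-9     # Tcomp = count x average time
    print(avg_ns, t_comp)                         # 1.7 ns average, 1.7 s of computation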

6 Communications Time
Dependent on the link/switch technology and on the network topology. Components: the start-up time, or time for the smallest message to get through (latency), and the time for an additional unit of a message to get through (the rate, i.e., the inverse of bandwidth).

Communications Time for a Message
Tcomm = Tl + k*Tc, where Tl = latency (setup time), Tc = rate (time per message unit), and k = length of the message. Tc = d * Tr, where d = distance and Tr = link rate.
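A minimal sketch of this message-cost model (latency plus per-unit transfer time, with the per-unit rate growing with distance); the numbers below are hypothetical:

    def comm_time(k, t_l, t_c):
        # Tcomm = Tl + k*Tc: latency plus k message units at Tc each
        return t_l + k * t_c

    def unit_rate(d, t_r):
        # Tc = d * Tr: per-unit time grows with distance d at link rate Tr
        return d * t_r

    # Hypothetical: 50 us latency, 3-hop route at 10 ns per unit per link, 4096-unit message
    t_c = unit_rate(3, 10e-9)
    print(comm_time(4096, 50e-6, t_c))   # about 0.000173 s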

7 Typical Proportions of Times
Tc is 1 to 10 times Tcomp. Tl is 100 to 1000 times Tcomp (distributed-memory MIMD), or worse. Tr is comparable to Tc for nearest-neighbor communications on a SIMD machine.

SIMD Communications
Nearest neighbor; distant (two hop, three hop, etc.); global (reduce). Global times depend on the underlying network; the best possible is log P (P = number of PEs).

8 Factors That Limit Speedup
Software overhead: even with a completely equivalent algorithm, software overhead arises in the concurrent implementation.
Load balancing: speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work.
Communication overhead: assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup.

Linear Speedup
Whichever definition is used, the ideal is to produce linear speedup: a speedup of P using P processors. In practice, however, the speedup is reduced from its ideal value of P. Superlinear speedup results when unfair values are used for Ts(N), or from differences in the nature of the hardware used.

9 Speedup Curves
(Figure: speedup vs. number of processors, showing superlinear, linear, and typical speedup curves.)

Efficiency
Speedup does not measure how efficiently the processors are being used. Is it worth using 100 processors to get a speedup of 2? Efficiency is defined as the ratio of the speedup to the number of processors required to achieve it: E(P,N) = S(P,N) / P. The efficiency is bounded from above by 1.
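The efficiency definition, and the "100 processors for a speedup of 2" question above, as a short Python sketch:

    def efficiency(speedup, processors):
        # E(P,N) = S(P,N) / P, bounded above by 1
        return speedup / processors

    print(efficiency(2, 100))   # 0.02 -- 100 processors for a speedup of 2 is 2% efficient
    print(efficiency(8, 8))     # 1.0  -- linear speedup is 100% efficient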

10 Example
(Table: processors, time in seconds, speedup, and efficiency for a sample run; the numeric entries did not survive transcription.)

Speedup Curve
(Figure: speedup vs. number of processors, comparing linear and actual speedup.)
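Since the table's numbers are missing, here is a sketch showing how such a table is produced from timings; the (processors, time) pairs below are purely hypothetical placeholders, not the slide's data:

    # Hypothetical timings: (processors, measured time in seconds)
    runs = [(1, 100.0), (2, 55.0), (4, 30.0), (8, 18.0)]

    t1 = runs[0][1]   # serial reference time
    print(f"{'P':>3} {'Time':>8} {'Speedup':>8} {'Efficiency':>11}")
    for p, t in runs:
        s = t1 / t
        print(f"{p:>3} {t:>8.1f} {s:>8.2f} {s / p:>11.2f}")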

11 Timing for Assignment 1
Let's look at FindMax and PeopleWave.

FindMax.pm
MODULE FindMax;
CONST N = 3;
CONFIGURATION grid[1..N],[1..N];
CONNECTION
  right: grid[i,j] <-> grid[i,j+1]:left;
  up:    grid[i,j] <-> grid[i+1,j]:down;
VAR
  i : INTEGER;
  value, buffer : grid OF INTEGER;

12 FindMax.pm (continued)
BEGIN
  value := ID(grid);
  FOR i := 1 TO N-1 DO
    buffer := MOVE.left(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  FOR i := 1 TO N-1 DO
    buffer := MOVE.down(value);
    IF buffer > value THEN value := buffer END;
  END; (* FOR *)
  i := value<:1,1:>;
  WriteInt(i,10); WriteLn;
END FindMax.

PeopleWave.pm
MODULE PeopleWave;
CONST
  GRIDSIZE = 4;
  NUMBEROFNEIGHBORS = 8;
CONFIGURATION grid[1..GRIDSIZE],[1..GRIDSIZE];
CONNECTION
  dir[0] : grid[i,j] <-> grid[i-1, j  ]:dir[4];
  dir[1] : grid[i,j] <-> grid[i-1, j+1]:dir[5];
  dir[2] : grid[i,j] <-> grid[i,   j+1]:dir[6];
  dir[3] : grid[i,j] <-> grid[i+1, j+1]:dir[7];
VAR
  i, k         : INTEGER;
  WaveElement  : grid OF INTEGER;
  Averager     : grid OF INTEGER;
  AllNeighbors : grid OF INTEGER;
  OneNeighbor  : grid OF INTEGER;

13 PeopleWave.pm (2)
BEGIN
  (* Initialize the wave *)
  IF DIM(grid,1) = GRIDSIZE THEN WaveElement := 1 ELSE WaveElement := 0 END;
  (* Divisor for determining new WaveElement is normally 3, but only 2 on edges *)
  IF (DIM(grid,2) = 1) OR (DIM(grid,2) = GRIDSIZE) OR
     (DIM(grid,1) = 1) OR (DIM(grid,1) = GRIDSIZE) THEN
    Averager := 2
  ELSE
    Averager := 3
  END;
  WriteInt(WaveElement, 1);

PeopleWave.pm (3)
  FOR i := 1 TO GRIDSIZE DO
    (* retrieve and average (weighted) info about neighbors *)
    AllNeighbors := WaveElement;
    FOR k := 0 TO NUMBEROFNEIGHBORS-1 DO
      OneNeighbor := 0;
      SEND.dir[(k+4) MOD 8] (WaveElement, OneNeighbor);
      IF k < 5 THEN (* apply the template *)
        AllNeighbors := AllNeighbors + OneNeighbor;
      ELSE
        AllNeighbors := AllNeighbors - OneNeighbor;
      END; (* IF *)
    END; (* FOR k *)
    WaveElement := AllNeighbors DIV Averager;

14 PeopleWave.pm (4)
    IF WaveElement >= 1 THEN WaveElement := 1 ELSE WaveElement := 0 END;
    WriteInt(WaveElement, 1);
  END; (* FOR i *)
END PeopleWave.

Asymptotic Analysis
Analysis as a variable increases toward infinity. Speedup depends on two variables: S(P, N). Three possible types of limit: fixed P with N increasing, fixed N with P increasing, or both P and N increasing in some fixed relationship.

15 P fixed, N increases
Fixed number of processors, size of problem increases. Should see more computation relative to communication, reducing overheads. Should see asymptotic time similar to a single processor, thus improving efficiency. There may be exceptions if the complexity of the communication grows with problem size.

N fixed, P increases
In other words, what happens as we use more and more processors to solve a given problem? Amdahl's Law -- the law of diminishing returns. Based on the observation that every problem has a part which must be computed in sequence. Let the whole problem need time 1 on a single processor. Let s be the necessarily sequential part and p the parallel part, so s + p = 1. Assume that the parallel part can be arbitrarily divided with no overhead or communication time. What is the speedup?

16 N fixed, P increases
S(P, N) = 1 / (s + p/P). What is the limit as P --> infinity? lim S(P,N) = 1 / s. This says that no matter how many processors we have, the speedup is limited by 1 / s.

Amdahl's Law
A parallel computation has two types of operations: those which must be executed in serial and those which can be executed in parallel. Amdahl's law states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially.
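A quick numeric check of the limit: with the serial fraction s fixed, S(P) = 1 / (s + (1 - s)/P) flattens out at 1/s as P grows. The sketch below simply evaluates the formula for increasing P (the choice of s = 0.1 is illustrative):

    def amdahl_speedup(s, p):
        # S = 1 / (s + (1 - s)/p) for serial fraction s on p processors
        return 1.0 / (s + (1.0 - s) / p)

    s = 0.1   # 10% of the work is necessarily sequential
    for p in (1, 10, 100, 1000, 10_000):
        print(p, round(amdahl_speedup(s, p), 2))
    # Speedup approaches 1/s = 10 no matter how large p becomes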

17 Amdahl's Law
Let the time taken to do the serial calculations be some fraction σ of the total time (0 < σ <= 1); the parallelizable portion is then 1 - σ of the total. Assuming linear speedup on the parallel portion, Tserial = σT1 and Tparallel = (1 - σ)T1/N. By substitution, Speedup <= 1 / (σ + (1 - σ)/N).

Consequences of Amdahl's Law
Say we have a program containing 100 operations, each of which takes 1 time unit. Suppose σ = 0.2 and we use 80 processors: Speedup = 100 / (20 + 80/80) = 100 / 21 < 5. A speedup of at most 5 is possible no matter how many processors are available. So why bother with parallel computing? Just wait for a faster processor.
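Reproducing the 100-operation example in code (σ = 0.2, N = 80); the result matches the 100/21 figure above, and the second line shows the asymptotic limit 1/σ:

    def amdahl_speedup(sigma, n):
        # Speedup <= 1 / (sigma + (1 - sigma)/N) for serial fraction sigma on N processors
        return 1.0 / (sigma + (1.0 - sigma) / n)

    # 100 operations, 20 serial (sigma = 0.2), 80 parallel, run on 80 processors:
    print(amdahl_speedup(0.2, 80))   # 100/21, about 4.76
    print(1 / 0.2)                   # limit: 5, however many processors are used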

18 Avoiding Amdahl
There are several ways to avoid Amdahl's law: concentrate on parallel algorithms with small serial components, and note that Amdahl's law is not complete in that it does not take problem size into account.

Amdahl's Law and Assignment 1
Let's look at PeopleWave.

19-20 PeopleWave.pm
(The PeopleWave.pm listing from pages 12-14 is repeated here for reference.)

21 Classifying Parallel Programs
Parallel programs can be placed into broad categories based on their expected speedups:
Trivially parallel: assumes complete parallelism with no overhead due to communication.
Divide and conquer: N / log N speedup.
Communication-bound parallelism.
