ECE 563 Spring 2017 Exam 2

ECE 56300 Spring 2017 Exam 2

All questions are worth 5 points. For isoefficiency questions, do not worry about breaking costs down to t_c, t_w, and t_s.

Question 1. Innovative Big Machines has developed special hardware that allows their machines to do reductions on four numbers at a time. Thus the complexity of a reduction's communication cost is log_4 P instead of the typical log_2 P. Thanks to special Time Travel Logic (TTL), the cost of doing the three recurrence operations at each step (e.g., +, *, etc.) is 0. What is the isoefficiency of a program that does n^2 operations and a reduction?

Solution: A simple derivation is

    T_p = n^2/P + log_4 P
    P T_p = n^2 + P log_4 P
    T_o = P T_p - n^2 = P log_4 P
    n^2 = P log_4 P, which is the isoefficiency relation.
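For concreteness, here is a minimal MPI sketch (not part of the exam) of the kind of 4-ary tree reduction Question 1 describes: in each round a group leader combines up to three incoming partial sums with its own, so the number of communication rounds grows as log_4 P. Variable names are invented for illustration.

    /* Hedged sketch: 4-ary tree reduction of one long per rank onto rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        long sum = rank;                       /* this rank's partial result */
        for (int step = 1; step < P; step *= 4) {
            if (rank % (4 * step) == 0) {
                /* group leader: receive from up to three children this round */
                for (int k = 1; k < 4; k++) {
                    int src = rank + k * step;
                    if (src < P) {
                        long tmp;
                        MPI_Recv(&tmp, 1, MPI_LONG, src, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                        sum += tmp;            /* the three adds per step that TTL makes "free" */
                    }
                }
            } else if (rank % step == 0) {
                /* non-leader: forward the partial sum to this round's leader */
                MPI_Send(&sum, 1, MPI_LONG, rank - rank % (4 * step), 0, MPI_COMM_WORLD);
                break;
            }
        }
        if (rank == 0) printf("sum = %ld\n", sum);
        MPI_Finalize();
        return 0;
    }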

Question 2. Time Industries, the developers of the Time Travel Logic, are unable to deliver, and so the reduction is redesigned. The new reduction has P0 send its partial sum to P1, which adds its partial sum to P0's and sends this sum to P2, and so forth. What is the isoefficiency relation of this reduction? Is it better or worse, in terms of scalability, than the isoefficiency relation of Question 1?

Solution: Because of the sequential nature of the communication, the chained reduction takes O(P-1) time, which is essentially O(P) for large P. Therefore,

    T_p = n^2/P + P
    P T_p = n^2 + P^2
    T_o = P T_p - n^2 = P^2
    n^2 = P^2, which is the isoefficiency relation.

This is worse than Question 1: the problem size must now grow as P^2 rather than P log_4 P to maintain efficiency.
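A minimal sketch of the chained reduction described in Question 2 (not part of the exam; variable names are invented). The P-1 back-to-back send/receive steps are exactly what produces the O(P) communication term.

    /* Hedged sketch: rank 0 starts a running sum, each rank adds its partial
       sum and forwards it, so the last rank holds the total after P-1 steps. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        long partial = rank + 1;       /* stand-in for this rank's partial sum */
        long running = partial;

        if (rank > 0) {                /* wait for the running sum from the left */
            MPI_Recv(&running, 1, MPI_LONG, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            running += partial;
        }
        if (rank < P - 1)              /* pass it to the right */
            MPI_Send(&running, 1, MPI_LONG, rank + 1, 0, MPI_COMM_WORLD);
        else
            printf("total = %ld\n", running);

        MPI_Finalize();
        return 0;
    }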

Question 3. Bob is given the job of writing a program that will get a speedup of 3.8 on 4 processors. He makes it 95% parallel and goes home dreaming of a big pay raise. Using Amdahl's law, assuming the problem size is the same as in the serial version, and ignoring communication costs, what speedup will Bob actually get?

Solution (Amdahl's Law):

    Speedup = 1/(f + (1 - f)/p) = 1/(0.05 + 0.95/4) = 3.478

Full credit was given if you set up the fraction correctly.

Question 4. You need to decide which of two parallel programs to use. You will run on 1024 processors, but you currently only have 16 processors on which to measure performance. You collect the following information, where Ψ is the speedup and e is the experimentally determined serial fraction.

    Program A:
    p     2      4      8      16
    Ψ     1.88   3.12   4.5    5.8
    e     0.06   0.09   0.11   0.12

    Program B:
    p     2      4      8      16
    Ψ     1.79   2.9    4.3    5.7
    e     0.12   0.12   0.12   0.12

Which program will give the best performance on 1024 processors? Explain your answer in ~30 words or less.

Solution (Karp-Flatt): Program B is best because its sequential overhead e is not growing, and therefore it will have better speedups at larger numbers of processors.
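Both formulas used above can be checked with a small C helper (not from the exam; the function names amdahl_speedup and karp_flatt_e are invented): Amdahl's law gives Speedup = 1/(f + (1 - f)/p), and the Karp-Flatt metric recovers e = (1/Ψ - 1/p)/(1 - 1/p) from a speedup Ψ measured on p processors.

    #include <stdio.h>

    /* Amdahl's law: speedup with serial fraction f on p processors. */
    double amdahl_speedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    /* Karp-Flatt metric: experimentally determined serial fraction
       from a measured speedup psi on p processors. */
    double karp_flatt_e(double psi, int p) {
        return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
    }

    int main(void) {
        printf("Question 3: speedup = %.3f\n", amdahl_speedup(0.05, 4)); /* ~3.478 */
        printf("Program A, p = 16: e = %.2f\n", karp_flatt_e(5.8, 16));  /* ~0.12  */
        return 0;
    }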

Question 5. A parallel program is run on 2, 4, 8, 16, 32, and 64 cores, and we get the following speedups and efficiencies.

    P     2      4      8      16      32      64
    Ψ     1.5    4.2    7.6    15.12   30.24   60.48
    E     0.75   1.05   0.94   0.92    0.90    0.88

The speedup on 4 processors is greater than 4, and the efficiency is greater than 1. The amount of computation is the same as in the sequential version. We can conclude that the super-linear speedup is the result of (circle the correct answer):
(a) More cache misses because of poor locality
(b) Fewer cache misses because of good locality
(c) Fewer cache misses because more cache is available on four cores than on two, and the locality is sufficiently good to take advantage of this
(d) Good programming, and the program would have gotten a super-linear speedup even if the processors did not have caches

Question 6. A programmer has parallelized 99% of a program, but there is no value in increasing the problem size, i.e., the program will always be run with the same problem size regardless of the number of processors or cores used. What is the expected speedup on 20 processors? Show your work.

Solution (Amdahl's Law):

    Speedup = 1/(0.01 + (1 - 0.01)/20) = 16.81

Again, full credit was given for setting up the fraction.

Question 7. A programmer has parallelized a program such that s, the fraction of time the program spends in serial code, is 0.01. Moreover, the problem size can expand as the number of processors increases. What is the expected speedup on 20 processors? Show your work.

Solution (Gustafson-Barsis):

    Speedup = P + (1 - P)s = 20 - 19 * 0.01 = 19.81

Full credit was given for setting up the problem.
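As a quick sanity check on Questions 6 and 7, a short C snippet (not from the exam; the function name gustafson_speedup is invented) evaluates both laws for s = 0.01 and p = 20.

    #include <stdio.h>

    /* Gustafson-Barsis scaled speedup with serial fraction s on p processors. */
    double gustafson_speedup(double s, int p) {
        return p + (1 - p) * s;
    }

    int main(void) {
        printf("Question 6 (Amdahl):           %.2f\n", 1.0 / (0.01 + 0.99 / 20));    /* 16.81 */
        printf("Question 7 (Gustafson-Barsis): %.2f\n", gustafson_speedup(0.01, 20)); /* 19.81 */
        return 0;
    }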

An OpenMP program contains the following code:

    void search(struct node* root, int findvalue, int level)
    {
        if (root == NULL) return;
        if (root->data == findvalue) found++;
        if (level < 1) {
            seqsearch(root->left, findvalue);
            seqsearch(root->right, findvalue);
        } else {
            #pragma omp task
            search(root->left, findvalue, level+1);
            #pragma omp task
            search(root->right, findvalue, level+1);
        }
    }

Question 8. found is a global variable. Is there a race on the increment? Yes.

Question 9. Assume you are not sure about the answer to Question 8, and you want to be sure that there is no race. What choices do you have to ensure there is no race? Circle all that are true.
(a) Use an OpenMP critical pragma
(b) On processors where the increment operation can be done using an atomic hardware increment, use an OpenMP atomic pragma
(c) There cannot be a race because the increment is not contained within a #pragma omp parallel or a #pragma omp task inside the function
(d) The odds of two threads updating found at the same time are small, so there is no need to worry about it
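For reference, a hedged sketch (not from the exam) of what answers (a) and (b) to Question 9 look like in code. The struct layout and helper names are invented; only the protected increment of the global found is shown.

    struct node { int data; struct node *left, *right; };
    extern int found;                  /* global counter shared by all tasks */

    /* (a) critical: only one thread at a time executes the increment. */
    void count_match_critical(struct node *root, int findvalue) {
        if (root != NULL && root->data == findvalue) {
            #pragma omp critical
            found++;
        }
    }

    /* (b) atomic: use the hardware's atomic increment where available. */
    void count_match_atomic(struct node *root, int findvalue) {
        if (root != NULL && root->data == findvalue) {
            #pragma omp atomic
            found++;
        }
    }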

Questions 10-14 use the following code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void main()
    {
        int c1 = 0;
        int c2 = 0;
        int i;

        omp_set_num_threads(4);
        #pragma omp parallel
        {
            #pragma omp atomic
            c1++;
            printf("par: %d\n", omp_get_thread_num());        // L1

            #pragma omp master
            {
                #pragma omp critical
                c2++;
                printf("master: %d\n", omp_get_thread_num()); // L2
            }

            #pragma omp single
            {
                #pragma omp critical
                printf("single: %d\n", omp_get_thread_num()); // L3
                c2++;
            }
        }
        printf("c1: %d, c2: %d\n", c1, c2);                   // L4
    }

(Other answers may be correct. Let me know if you think you lost points for a correct answer.)

Question 10. What value(s) are printed for the/a thread number in L1?
(a) 0
(b) 0, 1, 2 or 3
(c) 0, 1, 2 and 3

Question 11. What value(s) are printed for the/a thread number in L2?
(a) 0
(b) 0, 1, 2 or 3
(c) 0, 1, 2 and 3
(b) is also correct since, although the master thread is usually 0, it is implementation dependent and need only always be the same thread.

Question 12. What value(s) are printed for the/a thread number in L3?
(a) 0
(b) 0, 1, 2 or 3
(c) 0, 1, 2 and 3

Question 13. What value(s) are printed for c1 in L4?
(a) 0
(b) 1
(c) 2
(d) 3
(e) 4
(f) some other value

Question 14. What value(s) are printed for c2 in L4?
(a) 0
(b) 1
(c) 2
(d) 3
(e) 4
(f) some other value

Question 15. Bob gets a job programming MPI programs. His program only uses Ssend and Srecv communication primitives and does not deadlock. He later changes these to non-blocking Isend and Irecv calls, with a correct and valid Wait immediately after each Isend and Irecv. Can his program deadlock from communication? Answer yes or no.

Questions 16-19. The processes in an MPI program have data that looks like Figure (a) below. For Questions 16 through 19, give the name (you do not need to give the full function call or arguments) of the collective communication operation that leads to the data being communicated as shown for the corresponding question.

[Figure (a) and the data-movement figures for Questions 16-19 are not reproduced in this transcription.]

16. Allreduce
17. Alltoall
18. Bcast
19. Scan
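For readers without the figures, a hedged C/MPI sketch (not from the exam; buffer names are invented) issues the four collectives named above with one int per process; the comments summarize the data movement each call performs.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, P;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int x = rank + 1, result, prefix;

        /* 16. Allreduce: every process ends up with the reduced (here, summed) value. */
        MPI_Allreduce(&x, &result, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* 17. Alltoall: process i sends its j-th element to process j. */
        int *sendbuf = malloc(P * sizeof(int));
        int *recvbuf = malloc(P * sizeof(int));
        for (int j = 0; j < P; j++) sendbuf[j] = rank * P + j;
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        /* 18. Bcast: the root's value is copied to every process. */
        MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* 19. Scan: process i receives the reduction over ranks 0..i (a prefix sum). */
        MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }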