PARALLEL AND DISTRIBUTED COMPUTING
2010/2011 1st Semester
Recovery Exam
February 2, 2011
Duration: 2h00

- No extra material allowed. This includes notes, scratch paper, calculator, etc.
- Give your answers in the available space after each question. You can use either Portuguese or English.
- Be sure to write your name and number on all pages; non-identified pages will not be graded!
- Justify all your answers.
- Don't hurry, you should have plenty of time to finish this test. Skip questions that you find less comfortable with and come back to them later on.

I. (1,5 + 0,5 + 1 + 1 + 1 = 5 val.)

1. Discuss the advantages and disadvantages of using dynamic loop scheduling versus static loop scheduling in SMP systems. In this context, argue about the reason for the guided option of the parallel for directive of OpenMP.

Number: Name: 1/10
2. Consider a three-processor shared-memory multiprocessor (with processors named A, B, and C) with a snooping cache-coherent system using an invalidate protocol. Suppose that processor A has block X (in its cache) in the exclusive state.

a) What will be the state of block X in processors B and C?

b) Describe the sequence of actions (bus activity, state transitions, etc.) that will be performed by the snooping cache coherence protocol when processor B issues a write to block X.
3. The following program was working correctly, and in order to parallelize it the pragma directive shown in the program was added.

    #define N 1000

    int i, a = 1;

    int main(int argc, char **argv) {
        int found = 0;
        int b = 3;
        #pragma omp parallel for private(b)
        for (i = 0; i < N; i++) {
            a = i * i * i;
            if (mult(a, b) % 42 == 0)
                found++;
        }
        printf("result: %i\n", found);
    }

    int mult(int x, int y) {
        int z;
        z = x * y;
        return z;
    }

a) For all the variables in this program, state which are private and which are shared in the parallel region.

b) Is this parallel implementation working correctly? Justify. If not, suggest how to correct it.
II. (1 + 0,75 + 0,75 + 0,5 + 1 + 1 = 5 val.)

1. What does Non-Uniform Memory Architecture (NUMA) mean?

2. In a distributed application, the computation of the elements of an array was evenly divided among the nodes of the system. It is now necessary to make the complete array available to all the nodes.

a) State the best way to perform this operation in MPI. You don't need to know the exact syntax of the MPI functions, but be sure to indicate the name of the functions and the essential parameters they require.
b) Analyze the asymptotic complexity of this procedure as a function of the number of nodes p and the size of the array n. Justify.

c) How would you modify the previous solution if, for some reason, the array distribution were unbalanced among the nodes?
3. Consider the following MPI program, which is executed by all nodes:

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    while (1) {
        MPI_Send(A, SIZE, MPI_DOUBLE, (me + 1) % nprocs, MTAG, MPI_COMM_WORLD);
        MPI_Recv(B, SIZE, MPI_DOUBLE, (nprocs + me - 1) % nprocs, MTAG, MPI_COMM_WORLD, &status);
        A = update_state(A, B);
    }

a) Ignoring any programming errors, briefly describe the intended general workings of this program.

b) Is there the possibility of a deadlock situation? Justify. If so, propose a solution that resolves this problem.
III. (1,5 + 1,5 + 2 = 5 val.)

1. Why is the parallel programming community much more fond of Gustafson's Law than of Amdahl's Law? What is the reasoning behind Gustafson's Law?

2. The following execution times T were obtained for a parallel program with a varying number of processors p:

    p    1      2      4      8
    T    1      0.75   0.625  0.5625

Explain how you can use this information to improve your program. (Note: 0.75 = 6/8; 0.625 = 5/8; 0.5625 = 9/16.)
3. Suppose that the serial cost is O(n), the parallel computation cost is O(n/p), and the memory requirements are O(n^2). Derive the maximum allowable parallel overhead h(p, n) for an application to scale indefinitely.
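A sketch of the efficiency argument this question builds on, in the question's own symbols (this is the textbook isoefficiency reasoning, not the official solution):

```latex
% Parallel time = computation + overhead; efficiency relative to serial cost:
T_p = O\!\left(\tfrac{n}{p}\right) + h(p,n),
\qquad
E = \frac{T_s}{p\,T_p} = \frac{n}{n + p\,h(p,n)} .
% E stays bounded away from zero as p grows iff the total overhead grows
% no faster than the serial work:
p\,h(p,n) = O(n)
\quad\Longleftrightarrow\quad
h(p,n) = O\!\left(\tfrac{n}{p}\right).
```

The memory bound O(n^2) then additionally constrains how fast n itself may grow with p if each node's memory is fixed.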
IV. (1,5 + 1,5 + 1 + 1 = 5 val.)

1. Describe what you understand by Foster's design methodology. Give a brief description of each step, indicating its main objective and how to achieve it.

2. Give an example of a piece of code in C (no more than 4 lines) that is easily ported to an efficient implementation on a GPU, and a different example for which this porting is particularly difficult. Justify.
3. Consider a problem that is being solved through a 2-dimensional finite difference method, and which has been discretized into an n × n matrix, where n is the problem size. A parallel implementation is being run on a cluster with p nodes. If λ is the fixed cost of setting up a message and β is a measure of the network bandwidth between processors, compute an estimate of the amortized communication cost, per iteration and per processor, as a function of n, p, λ and β, for both a row-wise decomposition and a checkerboard decomposition, if:

a) 1 level of ghost points is used.

b) 2 levels of ghost points are used (caution: think carefully about this one, a drawing may help).
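For reference, the communication model implied by the question's parameters is the standard latency-bandwidth model: sending one message of m words costs

```latex
T_{\mathrm{msg}}(m) = \lambda + \frac{m}{\beta}.
```

The per-iteration, per-processor estimate then follows from counting how many messages each node exchanges and how many ghost-point values each boundary carries under the chosen decomposition.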