Exam Design and Analysis of Algorithms for Parallel Computer Systems 9-15 at ÖP3

Umeå University
Department of Computing Science
Lars Karlsson, Bo Kågström and Mikael Rännar
Design and Analysis of Algorithms for Parallel Computer Systems VT2009
June 2, 2009

Exam
Design and Analysis of Algorithms for Parallel Computer Systems
9-15 at ÖP3

Maximum number of points is 60. For the grade 3 you need 30 points, for the grade 4 you need 40 points and for the grade 5 you need 50 points. The problems are in no particular order.

Allowed aids: Pocket calculator.

NB! Start each problem on a new page and write your code number on each page. Carefully explain your reasoning.

Good Luck!

Problem 1: (7=3+2+2) Performance and Scalability

a. (3) Give at least three sources of overhead in parallel programs. Explain what can be done to reduce them and what the typical trade-offs are.

b. (2) Parallel overhead can be defined analytically as $T_o(W, p) = pT_p - T_s$, where $T_s$ is the serial execution time (for problem size (= work) $W$) and $T_p$ is the parallel execution time using $p$ processors. Give an intuitive explanation of the term $pT_p$. What is the common name for $pT_p$?

c. (2) What does the so-called isoefficiency function tell about a parallel system? Show that the isoefficiency function can be obtained by solving $W = K T_o(W, p)$ for $W$ in terms of $p$. Hint: $E = \frac{T_s}{pT_p} = \frac{W}{W + T_o(W, p)}$.

Problem 2: (7=3+4) Communication Operations

a. (3) In the scatter operation, one process sends a unique message to each of the other processes. Describe (e.g., using pictures and words instead of pseudocode) an efficient algorithm for the scatter operation on a hypercube.

b. (4) Give a detailed analysis of your scatter algorithm. Find $T_p$ and the asymptotic isoefficiency function.

Problem 3: (6) Parallel Algorithm Design Principles

a. (6) Describe the following decomposition techniques for achieving concurrency and give examples for each type:
   Recursive Decomposition
   Data Decomposition
   Exploratory Decomposition
   Speculative Decomposition

Problem 4: (7=2+2+3) Dense Matrix Computations

The following tasks deal with important concepts in the design of efficient blocked and parallel algorithms for dense matrix computations.

a. (2) Give definitions of the two major types of data locality and explain in what ways they have impact on matrix computations and vice versa.

b. (2) A matrix is distributed using a 2D Block Cyclic Layout onto a $2 \times 4$ processor mesh using distribution block size [...]. The first matrix element is mapped to processor (0, 0). Which of the processors will hold at least one of the diagonal blocks of the matrix? Motivate your answer.

c. (3) During the lectures we considered a wave-front algorithm for computing the LU factorization of a matrix $A$. Explain and possibly illustrate how the use of wave-fronts can overlap communications and computations in a parallel algorithm.

Problem 5: (6=4+2) Memory Hierarchies and Recursive Blocking

These tasks deal with the management of deep memory hierarchies using recursive blocked algorithms and hybrid data structures.

a. (4) Cholesky factorization of a symmetric positive definite matrix $A = A^T$ is a special case of Gaussian elimination. Devise a recursive blocked algorithm for computing the lower triangular Cholesky factorization $A = LL^T$ where $L$ is lower triangular. Start from the $2 \times 2$ blocking of $A$ and $L$ given below and do block identification to find a sequence of Level 3 BLAS operations (e.g., TRSM triangular multiple right hand solve, SYRK symmetric rank-k update) and recursive applications of your Cholesky factor algorithm.

$$A = \begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix} = LL^T = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}$$

Explain and illustrate by a visual example how the recursive blocked algorithm has the potential to automatically adapt to all levels of a deep memory hierarchy.

b. (2) Z-Morton is one example of a recursive data structure. In the simplest case, a $2 \times 2$ matrix, the elements are mapped to memory in a Z-pattern. For larger matrices the Z-pattern is applied recursively on the blocks of a $2 \times 2$ partitioning of the matrix. Define the Z-Morton mapping (i.e., from $(i, j)$ element to memory location) recursively. You may assume that the matrix is $N \times N$ with $N$ being a power of 2.

Problem 6: (7=5+2) SIMD Aspects

a.
(5) Consider the following out-of-place scalar implementation of the odd-even permutation operation

    half = (n + 1) / 2
    ! odd
    do i = 1, half
       b(i) = a(2*i - 1)
    end do
    ! even
    do i = 1, n / 2
       b(half + i) = a(2*i)
    end do

that packs the odd-indexed elements of a into the first half of b and the even-indexed elements of a into the second half of b. As an example, take n = 7, so that half = (n + 1)/2 = 4 and b = (a(1), a(3), a(5), a(7), a(2), a(4), a(6)).

Describe an efficient SIMD vectorization of the procedure above using an architecture similar to the Cell BE (i.e., vector registers/instructions and no gather/scatter loads/stores). Points will be awarded based on the clarity of the description and the efficient use of SIMD constructs.

b. (2) On current generation hardware, do you expect the SIMD vectorization you just described to result in a speedup close to four, significantly less than four, or significantly more than four? Motivate your answer. (There is no correct answer, any reasonable motivation will be rewarded.)

Problem 7: (7=5+2) Graph Algorithms

Given a directed and weighted graph with vertices $V = \{v_1, v_2, \ldots, v_n\}$. Let $D^{(k)}$ be a matrix where element $d^{(k)}_{i,j}$ is the length of the shortest path from $v_i$ to $v_j$ using only vertices belonging to the set $\{v_1, v_2, \ldots, v_k\}$. In the matrix $D^{(0)}$, element $d^{(0)}_{i,j}$ contains the weight for the edge $\{v_i, v_j\}$, if it exists, otherwise $\infty$. The shortest path between all pairs of vertices can be computed by the algorithm in Figure 1.

    procedure FLOYD_ALL_PAIRS
    begin
        D^(0) = A
        for k := 1 to n
            for i := 1 to n
                for j := 1 to n do
                    d^(k)_{i,j} := min(d^(k-1)_{i,j}, d^(k-1)_{i,k} + d^(k-1)_{k,j})
    end

    Figure 1: Floyd's all-pairs shortest paths algorithm.

a. (5) Formulate an efficient parallel algorithm for the serial algorithm in Figure 1 when the matrix $D^{(k)}$ is 1-D block partitioned, where each processor has $n/p$ columns of $D^{(k)}$. What are the advantages and disadvantages of this algorithm compared to an algorithm that operates on a 2-D block partitioned matrix?
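For reference, the serial algorithm in Figure 1 can be sketched in Python as below. This is a minimal illustration only, not a solution to the parallel task: the function name `floyd_all_pairs`, the 4-vertex weight matrix `A`, and the convention that the diagonal of `A` is zero are assumptions made for the example. The sketch updates a single matrix in place instead of keeping separate $D^{(k-1)}$ and $D^{(k)}$, which gives the same result as long as the graph has no negative cycles, because row k and column k do not change during iteration k.

```python
import math

def floyd_all_pairs(weights):
    """Serial Floyd all-pairs shortest paths on an n x n weight matrix.

    weights[i][j] is the edge weight from vertex i to vertex j,
    math.inf if there is no edge, and 0 on the diagonal (assumption).
    """
    n = len(weights)
    d = [row[:] for row in weights]  # D^(0) = A (copy, input is not modified)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d^(k)[i][j] = min(d^(k-1)[i][j], d^(k-1)[i][k] + d^(k-1)[k][j])
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d

INF = math.inf
# Hypothetical 4-vertex directed graph used only for illustration.
A = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
D = floyd_all_pairs(A)
for row in D:
    print(row)
```

In iteration k, every update of column j needs only column j itself, column k, and row k, which is the data dependence a column-wise partitioning has to satisfy.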

b. (2) Derive expressions for $T_p$ (parallel execution time) and $S_p$ (speedup) for the parallel 1-D block partitioned algorithm in task a.

Problem 8: (7=2+2+3) Search Algorithms for DOPs

a. (2) Explain the pros and cons of the two main classes of search algorithms for discrete optimization problems (DOPs), namely Depth-First Search (DFS) and Best-First Search (BFS). In which way does the use of heuristics differ between the two classes?

b. (2) Which one of the following three schemes for load balancing is most scalable in theory: global round robin, asynchronous round robin, and random polling? Informally, what are the drawbacks of the two other schemes that render them inferior? Motivate your answer.

c. (3) In the best case, how much overhead (asymptotically) is caused by a binary tree-based scheme for termination detection? Motivate your answer.

Problem 9: (6=2+4) Sparse Matrix Computations

a. (2) Motivate, explain and illustrate two storage schemes for sparse matrices.

b. (4) Suppose we use the Jacobi method for the iterative solution of $Ax = b$, where $A$ is a diagonally dominant matrix. One way of expressing a Jacobi iteration is in terms of the residual $r_k = Ax_k - b$, where $x_k$ is the $k$-th approximation of the exact solution $x$. Let $x_k[i]$ be the $i$-th component of the vector $x_k$ and $A[i, i]$ the $i$-th diagonal element of $A$. Then $x_{k+1}[i] = -r_k[i]/A[i, i] + x_k[i]$, for $i = 0, \ldots, n-1$. Formulate a high-level algorithm that identifies the three main computations of the Jacobi method and discuss some aspects of their parallel implementations.
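The residual form of the Jacobi iteration in Problem 9b can be illustrated with a small serial Python sketch. It takes the residual as $r_k = Ax_k - b$, so the update subtracts $r_k[i]/A[i, i]$ from $x_k[i]$. The function name `jacobi`, the tolerance, the iteration limit, and the $2 \times 2$ test system are assumptions made for the example. Splitting the loop body into a matrix-vector product, a norm-based convergence test, and a vector update is one plausible reading of the "three main computations", not necessarily the intended answer.

```python
def jacobi(A, b, x0, tol=1e-10, max_iter=500):
    """Serial Jacobi iteration in residual form for a dense matrix A.

    Returns the approximate solution and the number of iterations used.
    A is assumed to be diagonally dominant so that the iteration converges.
    """
    n = len(b)
    x = list(x0)
    for iteration in range(max_iter):
        # 1) Matrix-vector product, giving the residual r_k = A x_k - b.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        # 2) Inner product for the convergence test on the 2-norm of r_k.
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            return x, iteration
        # 3) Vector update x_{k+1}[i] = x_k[i] - r_k[i] / A[i][i].
        x = [x[i] - r[i] / A[i][i] for i in range(n)]
    return x, max_iter

# Hypothetical diagonally dominant system with exact solution x = (1, 2).
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [6.0, 7.0]
x, iterations = jacobi(A, b, x0=[0.0, 0.0])
print(x, iterations)
```

In a parallel setting, the matrix-vector product needs the distributed entries of $x_k$, the norm needs a global reduction, and the update is purely local to each component.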

Dense Matrix Algorithms

Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication