THE AUSTRALIAN NATIONAL UNIVERSITY
First Semester Examination June 2011
COMP4300/6430 Parallel Systems

Study Period: 15 minutes
Time Allowed: 3 hours
Permitted Materials: Non-Programmable Calculator

This exam is worth 40% of your total course mark. Exam questions total 100 marks, with marks awarded according to the breakdown given.

Answer ALL questions. Write your answers using a black or blue pen. Your answers should be clear and concise; marks may be lost for supplying irrelevant information.
Question 1 [9 marks]

(a) [1 mark] Explain the differences between blocking and non-blocking communications.

(b) [2 marks] In the context of parallel computing, what is a superlinear speedup? Explain why you might sometimes observe such a speedup.

(c) [3 marks] For a communication network represented as an undirected graph, what is (i) the diameter and (ii) the bisection bandwidth? Why are these concepts important in designing communication networks for parallel computers?

(d) [3 marks] For what class of computing system was the Hadoop file system designed? Briefly describe two of its main features.

Question 2 [25 marks]

Four programming models/languages/libraries that are applicable to distributed and/or shared memory parallel computers are: (A) MPI; (B) Pthreads; (C) OpenMP; (D) Cilk.

(a) [20 marks] For each of (A)-(D): (i) give a brief description of what it is; (ii) mention the class of parallel computers on which it is applicable; (iii) comment on its advantages and disadvantages; (iv) give an example of an application for which it is well-suited.

(b) [3 marks] On what parallel computer architectures could MPI and OpenMP be combined? Give an example of an application where such a combination would be useful. Justify your answer.

(c) [2 marks] Which of (A)-(D) above would you recommend to a parallel programming novice? Explain your answer.
Question 3 [25 marks]

The following C code performs a binary radix sort of an array val of N non-negative integers whose maximum value is at most MxInt:

    void radixsort (int *val, int N, int MxInt)
    {
        int i, j, low, high, level;
        int *tmp;

        tmp = (int*) malloc (N*sizeof(int));
        if (tmp == NULL) {
            /* Error-handling code omitted */
        }
        for (i=1, level=0; i <= MxInt; i *= 2, level++) {
            low = high = 0;
            for (j = 0; j < N; j++) {
                if (((val[j] >> level) & 1) == 0)
                    val[low++] = val[j];
                else
                    tmp[high++] = val[j];
            }
            for (j = 0; j < high; j++)
                val[low+j] = tmp[j];
        }
        free (tmp);
    }

You can assume that the code compiles and runs correctly on a single core.

(a) [15 marks] Explain how you would parallelise this code for a uniform memory access (UMA) shared-memory system using OpenMP. You are free to use additional storage if this is necessary for your solution. You should provide pseudo-code, i.e. you are not required to write syntactically correct C code or OpenMP pragmas, but you should make your intentions clear.

(b) [6 marks] (i) Discuss how you would expect your code to perform as a function of the parameters N, MxInt, and the number of threads used. (ii) How might the performance differ on a non-uniform memory access (NUMA) machine? To be specific, consider the case of up to eight threads on a four-processor machine where each processor has two cores.

(c) [4 marks] Outline how a solution using Cilk would differ from your OpenMP solution to part (a).
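One possible approach to part (a) is sketched below in C; it is an illustration under stated assumptions, not a model answer. Each pass over one bit is split into chunks: every chunk counts its elements with a 0 bit, a short serial prefix sum turns the counts into output offsets, and every chunk then scatters its elements into a second buffer, which keeps the pass stable. The names radixsort_chunked, radix_pass and chunk_start, the chunks parameter, and the ping-pong buffering are all assumptions introduced here; the original routine sorts in place with a single temporary array.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: first index of chunk c when N items are split
   into `chunks` nearly equal pieces. */
static int chunk_start(int N, int chunks, int c)
{
    return (int)((long long)N * c / chunks);
}

/* One stable pass on bit `level`, reading src and writing dst.
   zeros[] is scratch space with one entry per chunk. */
static void radix_pass(const int *src, int *dst, int N, int level,
                       int chunks, int *zeros)
{
    int c, total = 0;

    /* Phase 1: each chunk counts its elements whose bit is 0. */
    #pragma omp parallel for
    for (c = 0; c < chunks; c++) {
        int j, z = 0;
        int lo = chunk_start(N, chunks, c);
        int hi = chunk_start(N, chunks, c + 1);
        for (j = lo; j < hi; j++)
            if (((src[j] >> level) & 1) == 0)
                z++;
        zeros[c] = z;
    }

    /* Phase 2 (serial, O(chunks)): exclusive prefix sum of the counts. */
    for (c = 0; c < chunks; c++) {
        int z = zeros[c];
        zeros[c] = total;
        total += z;
    }

    /* Phase 3: each chunk scatters into its own disjoint output ranges. */
    #pragma omp parallel for
    for (c = 0; c < chunks; c++) {
        int j;
        int lo = chunk_start(N, chunks, c);
        int hi = chunk_start(N, chunks, c + 1);
        int zpos = zeros[c];                /* zeros in earlier chunks */
        int opos = total + (lo - zeros[c]); /* all zeros, then earlier ones */
        for (j = lo; j < hi; j++) {
            if (((src[j] >> level) & 1) == 0)
                dst[zpos++] = src[j];
            else
                dst[opos++] = src[j];
        }
    }
}

/* Returns 0 on success, -1 if memory allocation fails. */
int radixsort_chunked(int *val, int N, int MxInt, int chunks)
{
    int level, *src = val, *dst, *tmp, *zeros;

    assert(chunks > 0);
    tmp = (int *) malloc((size_t)(N > 0 ? N : 1) * sizeof(int));
    zeros = (int *) malloc((size_t)chunks * sizeof(int));
    if (tmp == NULL || zeros == NULL) {
        free(tmp);
        free(zeros);
        return -1;
    }
    dst = tmp;
    /* Same number of passes as the original loop (i <= MxInt, i *= 2). */
    for (level = 0; (MxInt >> level) != 0; level++) {
        int *t;
        radix_pass(src, dst, N, level, chunks, zeros);
        t = src; src = dst; dst = t;        /* swap the two buffers */
    }
    if (src != val)
        memcpy(val, src, (size_t)N * sizeof(int));
    free(tmp);
    free(zeros);
    return 0;
}
```

A call such as radixsort_chunked(val, N, MxInt, 8) would use eight chunks. If the code is compiled without OpenMP support the pragmas are ignored and the chunks are processed serially, giving the same result.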
Question 4 [25 marks]

This question assumes a CPU (host) with attached GPU (device), programmed using CUDA. You are not required to write syntactically correct CUDA code, but you should make your intentions clear.

(a) [6 marks] In the context of a GPU programmed using CUDA, what are (i) threads; (ii) blocks; and (iii) global memory?

(b) [10 marks] The following fragment of C code performs matrix multiplication of n × n matrices A and B, and stores the result in a matrix C. The matrices are assumed to be stored in one-dimensional arrays with the usual C convention (contiguous by rows), and C must not overlap A or B.

    void MatMulOnHost (float *A, float *B, float *C, int n)
    {
        int i, j, k;
        float x, y, sum;

        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                sum = 0.0;
                for (k = 0; k < n; k++) {
                    x = A[i*n+k];    /* A[i][k] */
                    y = B[k*n+j];    /* B[k][j] */
                    sum += x*y;
                }
                C[i*n+j] = sum;      /* C[i][j] */
            }
    }

Describe how you would convert this to a routine MatMulKernel to run on a GPU, using CUDA. How would you invoke MatMulKernel from the host?

(c) [4 marks] Outline how you would allocate and free memory for the arrays A, B and C on the GPU, and how you would transfer data from and to the host CPU.

(d) [5 marks] Why is matrix multiplication in the class of problems that can be computed efficiently on a CUDA-enabled GPU? Would your routine MatMulKernel give good performance on the GPU? If not, suggest how it might be modified to give better performance.
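The decomposition that part (b) asks about can be illustrated in plain C before any CUDA is written. The sketch below is an emulation, not CUDA code: the Idx2 struct stands in for CUDA's built-in blockIdx, threadIdx and blockDim variables, and the names matmul_thread_body and matmul_emulated_launch are assumptions introduced here. It assigns one (emulated) thread to each element of C, which is one common mapping for this problem and not necessarily the intended answer.

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in 2-D index variables. */
typedef struct { int x, y; } Idx2;

/* Work that a single GPU thread would do: compute one element of C. */
static void matmul_thread_body(const float *A, const float *B, float *C,
                               int n, Idx2 blockIdx, Idx2 threadIdx,
                               Idx2 blockDim)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int k;
    float sum = 0.0f;

    /* Threads in the last blocks may fall outside the matrix. */
    if (row >= n || col >= n)
        return;
    for (k = 0; k < n; k++)
        sum += A[row * n + k] * B[k * n + col];   /* A[row][k] * B[k][col] */
    C[row * n + col] = sum;                       /* C[row][col] */
}

/* Serial emulation of a launch with a grid of block x block thread
   blocks; on a GPU all of these iterations would run as threads. */
void matmul_emulated_launch(const float *A, const float *B, float *C,
                            int n, int block)
{
    Idx2 blockDim, blockIdx, threadIdx;
    int grid;

    assert(block > 0);
    grid = (n + block - 1) / block;   /* ceiling of n / block */
    blockDim.x = blockDim.y = block;
    for (blockIdx.y = 0; blockIdx.y < grid; blockIdx.y++)
        for (blockIdx.x = 0; blockIdx.x < grid; blockIdx.x++)
            for (threadIdx.y = 0; threadIdx.y < block; threadIdx.y++)
                for (threadIdx.x = 0; threadIdx.x < block; threadIdx.x++)
                    matmul_thread_body(A, B, C, n,
                                       blockIdx, threadIdx, blockDim);
}
```

Because every emulated thread writes a different element of C and only reads A and B, the four loops carry no dependencies, which is the property that lets the iterations run concurrently on the device.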
Question 5 [16 marks]

MapReduce is a programming paradigm well-suited for embarrassingly parallel applications.

(a) [8 marks] Give an overview of the MapReduce programming model and how it implements parallelism. Comment on aspects such as task granularity, load balancing, fault tolerance, and mechanisms to achieve data locality.

(b) [2 marks] Give an example of a problem that is well-suited to be solved using MapReduce.

(c) [6 marks] Suppose that you have been given two documents with content such as the following:

    Document1: Test test Test test test
    Document2: This is a test file

Based on your experience in developing a MapReduce program for inverted index creation, give MapReduce program pseudo-code to generate a list of locations (word number in the document and identifier for the document) for each word occurrence. An identifier for each document is provided as the key to the map() function. The output generated by your program should look like:

    Test    Document1: 1, 3
    test    Document1: 2, 4, 5    Document2: 4
    This    Document2: 1
    is      Document2: 2
    a       Document2: 3
    file    Document2: 5
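The data flow asked for in part (c) can be sketched as a single-process C program; this is an illustration of the map, shuffle and reduce stages under stated assumptions, not code for Hadoop or any other MapReduce framework. The types Record and Emitted, the functions map_document and reduce_all, and the fixed limits MAX_WORD and MAX_RECS are all hypothetical names introduced here. Words are assumed to be separated by single spaces, and the output is ordered by strcmp rather than by first occurrence, so the line order differs from the sample output above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD 32    /* assumed limit on word length (illustrative) */
#define MAX_RECS 256   /* assumed limit on emitted records (illustrative) */

/* Intermediate pair: key = word, value = (document id, word number). */
typedef struct {
    char word[MAX_WORD];
    const char *doc;
    int pos;
} Record;

typedef struct {
    Record r[MAX_RECS];
    int n;
} Emitted;

/* map(): key = document id, value = document text.
   Emits one (word, (doc, position)) record per word occurrence. */
void map_document(const char *doc_id, const char *text, Emitted *out)
{
    const char *p = text;
    int pos = 0;

    for (;;) {
        size_t len;
        while (*p == ' ')
            p++;                        /* skip separators */
        len = strcspn(p, " ");
        if (len == 0)
            break;                      /* end of text */
        pos++;                          /* positions are 1-based */
        if (len < MAX_WORD && out->n < MAX_RECS) {
            Record *rec = &out->r[out->n++];
            memcpy(rec->word, p, len);
            rec->word[len] = '\0';
            rec->doc = doc_id;
            rec->pos = pos;
        }
        p += len;
    }
}

/* Ordering used by the shuffle: by word, then document, then position. */
static int cmp_record(const void *a, const void *b)
{
    const Record *x = a, *y = b;
    int c = strcmp(x->word, y->word);
    if (c == 0)
        c = strcmp(x->doc, y->doc);
    if (c == 0)
        c = x->pos - y->pos;
    return c;
}

/* Shuffle (sort by key) and reduce: one output line per word, listing
   each document followed by the positions of the word in it. */
void reduce_all(Emitted *e, char *buf, size_t size)
{
    size_t used = 0;
    int i;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    qsort(e->r, (size_t)e->n, sizeof(Record), cmp_record);
    for (i = 0; i < e->n && used < size; i++) {
        const Record *rec = &e->r[i];
        const Record *prev = (i > 0) ? &e->r[i - 1] : NULL;
        int new_word = prev == NULL || strcmp(prev->word, rec->word) != 0;
        int new_doc = new_word || strcmp(prev->doc, rec->doc) != 0;
        int w;

        if (new_word)
            w = snprintf(buf + used, size - used, "%s%s %s: %d",
                         i > 0 ? "\n" : "", rec->word, rec->doc, rec->pos);
        else if (new_doc)
            w = snprintf(buf + used, size - used, " %s: %d",
                         rec->doc, rec->pos);
        else
            w = snprintf(buf + used, size - used, ", %d", rec->pos);
        if (w < 0)
            break;
        used += (size_t)w;
    }
}
```

In a real MapReduce system the map calls would run in parallel on different documents and the framework would perform the grouping by key; here the qsort call plays the role of the shuffle so that the whole pipeline fits in one function call sequence.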