THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430 Parallel Systems


THE AUSTRALIAN NATIONAL UNIVERSITY
First Semester Examination June 2011
COMP4300/6430 Parallel Systems

Study Period: 15 minutes
Time Allowed: 3 hours
Permitted Materials: Non-Programmable Calculator

This exam is worth 40% of your total course mark. Exam questions total 100 marks, with marks awarded according to the breakdown given. Answer ALL questions. Write your answers using a black or blue pen. Your answers should be clear and concise; marks may be lost for supplying irrelevant information.

Question 1 [9 marks]

(a) [1 mark] Explain the differences between blocking and non-blocking communications.

(b) [2 marks] In the context of parallel computing, what is a superlinear speedup? Explain why you might sometimes observe such a speedup.

(c) [3 marks] For a communication network represented as an undirected graph, what are (i) the diameter and (ii) the bisection bandwidth? Why are these concepts important in designing communication networks for parallel computers?

(d) [3 marks] For what class of computing system was the Hadoop file system designed? Briefly describe two of its main features.

Question 2 [25 marks]

Four programming models/languages/libraries that are applicable to distributed and/or shared-memory parallel computers are: (A) MPI; (B) Pthreads; (C) OpenMP; (D) Cilk.

(a) [20 marks] For each of (A)-(D): (i) give a brief description of what it is; (ii) mention the class of parallel computers on which it is applicable; (iii) comment on its advantages and disadvantages; (iv) give an example of an application for which it is well suited.

(b) [3 marks] On what parallel computer architectures could MPI and OpenMP be combined? Give an example of an application where such a combination would be useful. Justify your answer.

(c) [2 marks] Which of (A)-(D) above would you recommend to a parallel programming novice? Explain your answer.

COMP4300/6430 First Semester Exam 2011 Page 2 of 5
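For Question 1(a), the distinction can be made concrete with a short MPI fragment (a sketch only: two ranks are assumed, error handling is omitted, and the function and buffer names are illustrative):

```c
#include <mpi.h>

/* Sketch: the same exchange written first with blocking calls, then
   with non-blocking calls.  Assumes exactly two ranks (0 and 1). */
void exchange(int rank, double *sendbuf, double *recvbuf, int n)
{
    int other = 1 - rank;

    /* Blocking: MPI_Send/MPI_Recv do not return until the buffer is
       safe to reuse, so no computation can overlap the transfer, and
       the send/receive order must be chosen to avoid deadlock. */
    if (rank == 0) {
        MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD);
    }

    /* Non-blocking: MPI_Isend/MPI_Irecv return immediately.  Both
       ranks can post in any order, and independent computation can be
       done before MPI_Waitall, overlapping it with communication. */
    MPI_Request req[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 1, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 1, MPI_COMM_WORLD, &req[1]);
    /* independent computation could go here */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```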

Question 3 [25 marks]

The following C code performs a binary radix sort of an array val of N non-negative integers whose maximum value is at most MxInt:

    void radixsort (int *val, int N, int MxInt)
    {
        int i, j, low, high, level;
        int *tmp;

        tmp = (int *) malloc (N * sizeof(int));
        if (tmp == NULL) {
            /* Error-handling code omitted */
        }
        for (i = 1, level = 0; i <= MxInt; i *= 2, level++) {
            low = high = 0;
            for (j = 0; j < N; j++) {
                if (((val[j] >> level) & 1) == 0)
                    val[low++] = val[j];
                else
                    tmp[high++] = val[j];
            }
            for (j = 0; j < high; j++)
                val[low+j] = tmp[j];
        }
        free (tmp);
    }

You may assume that the code compiles and runs correctly on a single core.

(a) [15 marks] Explain how you would parallelise this code for a uniform memory access (UMA) shared-memory system using OpenMP. You are free to use additional storage if this is necessary for your solution. You should provide pseudo-code, i.e. you are not required to write syntactically correct C code or OpenMP pragmas, but you should make your intentions clear.

(b) [6 marks] (i) Discuss how you would expect your code to perform as a function of the parameters N, MxInt, and the number of threads used. (ii) How might the performance differ on a non-uniform memory access (NUMA) machine? To be specific, consider the case of up to eight threads on a four-processor machine where each processor has two cores.

(c) [4 marks] Outline how a solution using Cilk would differ from your OpenMP solution to part (a).
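One possible answer to part (a), sketched as compilable C with OpenMP rather than pseudo-code: each thread counts the zero and one bits within its own chunk, an exclusive scan over the per-thread counts gives each thread its starting offsets, and the threads then scatter their chunks stably into a temporary array. The name radixsort_omp and the chunking scheme are illustrative, not the only correct decomposition:

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch: parallel binary radix sort via per-thread counting,
   exclusive scan, and stable parallel scatter.  Falls back to a
   correct serial sort if compiled without OpenMP. */
void radixsort_omp(int *val, int N, int MxInt)
{
    int *tmp = malloc(N * sizeof *tmp);
    if (tmp == NULL) return;

    for (int bit = 1, level = 0; bit <= MxInt; bit *= 2, level++) {
        int nth = 1;
#ifdef _OPENMP
        nth = omp_get_max_threads();
#endif
        int *zc = calloc(nth, sizeof *zc);  /* zeros in each chunk */
        int *oc = calloc(nth, sizeof *oc);  /* ones  in each chunk */

        /* Phase 1: each thread counts the 0/1 bits in its chunk. */
        #pragma omp parallel for
        for (int t = 0; t < nth; t++) {
            int lo = (long)N * t / nth, hi = (long)N * (t + 1) / nth;
            for (int j = lo; j < hi; j++)
                if (((val[j] >> level) & 1) == 0) zc[t]++; else oc[t]++;
        }

        /* Phase 2: exclusive scan turns counts into start offsets
           (all zeros precede all ones, chunks in order => stable). */
        int zeros = 0;
        for (int t = 0; t < nth; t++) zeros += zc[t];
        int z = 0, o = zeros;
        for (int t = 0; t < nth; t++) {
            int tz = zc[t], to = oc[t];
            zc[t] = z; oc[t] = o;
            z += tz; o += to;
        }

        /* Phase 3: each thread scatters its chunk into tmp. */
        #pragma omp parallel for
        for (int t = 0; t < nth; t++) {
            int lo = (long)N * t / nth, hi = (long)N * (t + 1) / nth;
            int zp = zc[t], op = oc[t];
            for (int j = lo; j < hi; j++)
                if (((val[j] >> level) & 1) == 0) tmp[zp++] = val[j];
                else tmp[op++] = val[j];
        }

        memcpy(val, tmp, N * sizeof *val);
        free(zc); free(oc);
    }
    free(tmp);
}
```

The extra array tmp matches the question's offer of additional storage; the scan is serial but only O(threads) long, so it does not limit scalability for realistic thread counts.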

Question 4 [25 marks]

This question assumes a CPU (host) with an attached GPU (device), programmed using CUDA. You are not required to write syntactically correct CUDA code, but you should make your intentions clear.

(a) [6 marks] In the context of a GPU programmed using CUDA, what are (i) threads; (ii) blocks; and (iii) global memory?

(b) [10 marks] The following fragment of C code performs matrix multiplication of n × n matrices A and B, and stores the result in a matrix C. The matrices are assumed to be stored in one-dimensional arrays with the usual C convention (contiguous by rows), and C must not overlap A or B.

    void MatMulOnHost (float *A, float *B, float *C, int n)
    {
        int i, j, k;
        float x, y, sum;

        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                sum = 0.0;
                for (k = 0; k < n; k++) {
                    x = A[i*n+k];    /* A[i][k] */
                    y = B[k*n+j];    /* B[k][j] */
                    sum += x*y;
                }
                C[i*n+j] = sum;      /* C[i][j] */
            }
    }

Describe how you would convert this to a routine MatMulKernel to run on a GPU, using CUDA. How would you invoke MatMulKernel from the host?

(c) [4 marks] Outline how you would allocate and free memory for the arrays A, B and C on the GPU, and how you would transfer data from and to the host CPU.

(d) [5 marks] Why is matrix multiplication in the class of problems that can be computed efficiently on a CUDA-enabled GPU? Would your routine MatMulKernel give good performance on the GPU? If not, suggest how it might be modified to give better performance.
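A sketch of one way to answer parts (b) and (c): each CUDA thread computes a single element of C, and a host routine handles allocation, transfers and the kernel launch. The name MatMulOnDevice and the 16 × 16 block size are illustrative choices, and error checking is omitted. As part (d) hints, this naive kernel re-reads A and B from global memory on every iteration; a tiled version staging sub-blocks through shared memory would perform better.

```cuda
/* Sketch: one thread per element of C. */
__global__ void MatMulKernel(const float *A, const float *B,
                             float *C, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
    int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
    if (i < n && j < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[i*n + k] * B[k*n + j];          /* A[i][k]*B[k][j] */
        C[i*n + j] = sum;                            /* C[i][j] */
    }
}

/* Host side: allocate device copies, transfer, launch, transfer back. */
void MatMulOnDevice(const float *A, const float *B, float *C, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                       /* 256 threads per block */
    dim3 grid((n + 15) / 16, (n + 15) / 16);  /* cover all of C        */
    MatMulKernel<<<grid, block>>>(dA, dB, dC, n);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```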

Question 5 [16 marks]

MapReduce is a programming paradigm well-suited for embarrassingly parallel applications.

(a) [8 marks] Give an overview of the MapReduce programming model and how it implements parallelism. Comment on aspects such as task granularity, load balancing, fault tolerance, and mechanisms to achieve data locality.

(b) [2 marks] Give an example of a problem that is well-suited to be solved using MapReduce.

(c) [6 marks] Suppose that you have been given two documents with content such as the following:

    Document1: Test test Test test test
    Document2: This is a test file

Based on your experience in developing a MapReduce program for inverted index creation, give MapReduce program pseudo-code to generate a list of locations (word number in the document and identifier for the document) for each word occurrence. An identifier for each document is provided as the key to the map() function. The output generated by your program should look like:

    Test   Document1: 1, 3
    test   Document1: 2, 4, 5   Document2: 4
    This   Document2: 1
    is     Document2: 2
    a      Document2: 3
    file   Document2: 5