
P. Hadjidoukas, C. Papadimitriou
ETH Zentrum, CTL E 13
CH-8092 Zürich

High Performance Computing for Science and Engineering II
Spring semester 2017
Exam Issued: May 29, 2017, 13:00
Hand in: May 29, 2017, 16:00

Last Name:
First Name:
Student ID:

With your signature you confirm that you:
- have read the exam directives;
- solved the exam without any unauthorized help;
- wrote your answers following the outlined directives.

Signature:

Grading table

Question       Maximum score    Score    TA 1    TA 2
Question 1     40
Question 2     5
Question 3     20
Question 4     25
Question 5     15
Question 6     10
Question 7     10
Question 8     10
Question 9     15
Question 10    10
Question 11    20
Question 12    25
Total:         205

A perfect score (6) is 180 points (out of 205 available).

Question 1: Posterior Distribution Derivation (40 points)

Let $y = (y_1, \dots, y_N)$ with $y_i \in \mathbb{R}$ be $N$ independent observations obtained at positions $x_i \in \mathbb{R}$. Assume that we want to model each observation with a linear function, $f(x; \vartheta) = \vartheta x$ for $\vartheta \in \mathbb{R}$. Moreover, assume that the observations are linked with the model through the relation

    $y = f(x; \vartheta) + \zeta$,   (1)

where $\zeta$ is a normal random variable with zero mean and variance $\sigma^2$. We want to update our knowledge on the parameter $\vartheta$ after obtaining the $N$ observations using Bayes' theorem with an improper prior, i.e. $p(\vartheta) \propto 1$. (An improper prior is a prior distribution that does not integrate to one; here it is the uniform distribution over all real numbers.)

a) Identify the posterior distribution of $\vartheta$ conditioned on the set of observations $y = y_1$, i.e., in the presence of only one observation.

b) Let $A \in \mathbb{R}^{N \times M}$, $y \in \mathbb{R}^N$, $\vartheta \in \mathbb{R}^M$, and $\hat\vartheta := (A^\top A)^{-1} A^\top y$. Verify that the following holds:

    $(y - A\vartheta)^\top (y - A\vartheta) = (\vartheta - \hat\vartheta)^\top A^\top A\, (\vartheta - \hat\vartheta) + \text{const.}$,   (2)

where the terms in const. do not depend on $\hat\vartheta$ or $\vartheta$.

c) Use the result of the previous question to show that the posterior distribution of $\vartheta \in \mathbb{R}$ given the linear model in (1) in the presence of $N$ observations is given by

    $p(\vartheta \mid y) = \mathcal{N}\big(\vartheta \mid \hat\vartheta,\ \sigma^2 (x^\top x)^{-1}\big)$,   (3)

for $x = (x_1, \dots, x_N)$.

d) Show that in the limit $N \to \infty$ the posterior distribution converges to a Dirac distribution supported on $\hat\vartheta$.
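A sketch of part a), as a reminder of the algebra involved (assuming $x_1 \neq 0$): with $p(\vartheta) \propto 1$ the posterior is proportional to the likelihood of the single observation, which is Gaussian in $\vartheta$:

\[
p(\vartheta \mid y_1) \;\propto\; \exp\!\left(-\frac{(y_1 - \vartheta x_1)^2}{2\sigma^2}\right)
= \exp\!\left(-\frac{x_1^2\,(\vartheta - y_1/x_1)^2}{2\sigma^2}\right)
\;\Longrightarrow\;
\vartheta \mid y_1 \sim \mathcal{N}\!\left(\frac{y_1}{x_1},\ \frac{\sigma^2}{x_1^2}\right).
\]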

Question 2: Metropolis-Hastings (5 points)

We want to sample a distribution with probability density function $p(x)$, $x \in \mathbb{R}$, using the Metropolis-Hastings algorithm. The distribution is shown in Fig. 1. Assuming that we start at $x_0 = 4$, describe the points $x_1$ that are necessarily accepted at the next step if the algorithm proposes them. Explain your answer.

[Figure 1: Probability density function p(x).]
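For reference, a sketch of the Metropolis-Hastings acceptance probability; the simplification on the right assumes a symmetric proposal $q$, in which case every proposed $x_1$ with $p(x_1) \ge p(x_0)$ is accepted with probability one:

\[
\alpha(x_0 \to x_1) = \min\!\left(1,\ \frac{p(x_1)\,q(x_0 \mid x_1)}{p(x_0)\,q(x_1 \mid x_0)}\right)
\;\overset{q\ \text{symmetric}}{=}\;
\min\!\left(1,\ \frac{p(x_1)}{p(x_0)}\right).
\]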

Question 3: Non-linear model (20 points)

Consider the mathematical model of a physical process represented by the equation

    $y = \vartheta^2 + \zeta$,   (4)

where $\vartheta \in \mathbb{R}$ is the uncertain parameter of the model, $y \in \mathbb{R}$ is the output quantity of interest, and $\zeta$ represents the model error, which is quantified by a Gaussian distribution $\zeta \sim \mathcal{N}(0, \sigma_e^2)$ with $\sigma_e^2$ known. The prior uncertainty in $\vartheta$ is quantified by a Gaussian distribution with mean $\mu_\pi$ and standard deviation $\sigma_\pi$. Given $N$ independent measurements $y_1, \dots, y_N$, show, using the Bayesian central limit theorem, that the posterior uncertainty in $\vartheta$ is given by

    $\hat\sigma^2 = \dfrac{\hat\vartheta\,\sigma_\pi^2\,\sigma_e^2}{\mu_\pi\,\sigma_e^2 + 4N\hat\vartheta^3\sigma_\pi^2}$,

where $\hat\vartheta$ is the most probable value of the posterior distribution.
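One possible starting point, sketched here: the Bayesian central limit theorem approximates the posterior by a Gaussian centred at the MAP estimate $\hat\vartheta$, with variance equal to the inverse curvature of the negative log-posterior:

\[
\hat\sigma^2 \approx \left[-\frac{d^2}{d\vartheta^2}\log p(\vartheta \mid y_{1:N})\Big|_{\vartheta = \hat\vartheta}\right]^{-1},
\qquad
-\log p(\vartheta \mid y_{1:N}) = \sum_{i=1}^{N}\frac{(y_i - \vartheta^2)^2}{2\sigma_e^2}
+ \frac{(\vartheta - \mu_\pi)^2}{2\sigma_\pi^2} + \text{const}.
\]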

Question 4: Optimal Experimental Design (25 points)

Consider a linear model $f(\vartheta) = \Phi\vartheta$, where $\Phi \in \mathbb{R}^{N \times M}$, $\vartheta \in \mathbb{R}^M$, $N$ is the number of states that the model can predict, $M$ is the number of model parameters, $f_i(\vartheta) = \Phi_i \vartheta$ is the prediction of the model at state $i$, and $\Phi_i \in \mathbb{R}^{1 \times M}$ is the $i$-th row of $\Phi$. Let the prediction error $e_i \in \mathbb{R}$ between the data $y_i \in \mathbb{R}$ collected at state $i$ and the model prediction $f_i$ at state $i$ satisfy the prediction error equation

    $y_i = \Phi_i \vartheta + e_i$.   (5)

Consider the case $M = 1$. It is assumed that the prediction error follows a normal distribution $e_i \sim \mathcal{N}(0, a_i^2\sigma^2)$ with mean zero and variance $a_i^2\sigma^2$, where (i) $a_i = 1$ for the case of measurement error and negligible model error, and (ii) $a_i = \phi_i$ for the case of model error and negligible measurement error. Implicitly it is assumed in case (i) that the sensor error is constant, independent of the value of the state. In case (ii), it is assumed that this error is a fraction of the value of the measured quantity. Assume a Gaussian prior for the model parameter $\vartheta$ with mean $\mu_\pi$ and standard deviation $\sigma_\pi$.

a) Show that the mutual information for the case of a single sensor, measuring at state $i$, is

    $I(\vartheta; y \mid d_i) = \frac{1}{2}\log\!\left(\dfrac{\phi_i^2}{a_i^2\sigma^2} + \sigma_\pi^{-2}\right) + c$,   (6)

where $d_i$ denotes the location of the sensor at state $i$.

b) Find the optimal location of a single sensor (the best state to measure given that one sensor is available). Consider two cases: (i) $a_i = 1$, and (ii) $a_i = \phi_i$. Give a physical justification of your findings in each case.

c) Show that the mutual information for the case of two sensors, measuring at states $i$ and $j$, is

    $I(\vartheta; y \mid d_i, d_j) = \frac{1}{2}\log\!\left(\dfrac{\phi_i^2}{a_i^2\sigma^2} + \dfrac{\phi_j^2}{a_j^2\sigma^2} + \sigma_\pi^{-2}\right) + c$,   (7)

where $d_i$ and $d_j$ denote the locations of the two sensors at states $i$ and $j$.

d) Find the optimal locations of two sensors (the best states to measure given that two sensors are available) for case (i), $a_i = 1$.

Hint: Based on the theory, the mutual information for linear models of the form $f(\vartheta) = A\vartheta$, where $A \in \mathbb{R}^{N \times M}$ and $\vartheta \in \mathbb{R}^M$ with a Gaussian prior, is given by

    $I(\vartheta; y \mid d) = \frac{1}{2}\log\left|A^\top \Sigma_e^{-1} A + \Sigma_\pi^{-1}\right| + c$,   (8)

where $\Sigma_e$ is the covariance of the prediction error, $\Sigma_\pi$ the covariance matrix of the prior, and $c$ is a constant.
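A sketch of how (6) follows from the hint in the single-sensor case with $M = 1$: $A$ reduces to the scalar $\phi_i$, $\Sigma_e$ to $a_i^2\sigma^2$, and $\Sigma_\pi$ to $\sigma_\pi^2$, so the determinant becomes a scalar:

\[
I(\vartheta; y \mid d_i)
= \frac{1}{2}\log\left|\phi_i\,(a_i^2\sigma^2)^{-1}\phi_i + \sigma_\pi^{-2}\right| + c
= \frac{1}{2}\log\!\left(\frac{\phi_i^2}{a_i^2\sigma^2} + \sigma_\pi^{-2}\right) + c.
\]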

Question 5: Parallel Speedup (15 points)

Assume you implemented a serial Monte Carlo application that computes N samples, and you want to parallelize it with MPI in the following way:
- One MPI process reads the data and distributes it to the other MPI processes.
- Each of the n_p MPI processes starts an independent Markov chain, performs some initialization (burn-in), and then computes N/n_p samples.
- At the end, the results are collected at the root process.
- All the communication is performed using MPI_Bcast and MPI_Reduce only.

The profile of the serial application gives the following results:
- Reading the data from the input files takes T_setup.
- The burn-in phase takes T_burn-in.
- Each sample takes T_sample.

a) Estimate the time needed to run the serial application.

b) How do the MPI collectives MPI_Bcast and MPI_Reduce scale with the number of MPI processes n_p?

c) Estimate the time needed to run the parallel application. Assume that any point-to-point communication takes T_comm time. Identify the serial part, the parallel part, and the communication part.

d) Estimate the weak scaling of the application. Sketch the curve and determine the asymptotic weak scaling.

e) Estimate the strong scaling of the application. Note that for n_p > N, n_p - N processes will be idle. Sketch the curve and determine the asymptotic scaling.
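A possible sketch for parts a) and c); the $\log_2(n_p)$ factor assumes tree-based implementations of MPI_Bcast and MPI_Reduce built from point-to-point messages of cost $T_{\mathrm{comm}}$:

\[
T_{\mathrm{serial}} = T_{\mathrm{setup}} + T_{\text{burn-in}} + N\,T_{\mathrm{sample}},
\qquad
T_{\mathrm{parallel}} \approx
\underbrace{T_{\mathrm{setup}}}_{\text{serial}}
+ \underbrace{2\log_2(n_p)\,T_{\mathrm{comm}}}_{\text{communication}}
+ \underbrace{T_{\text{burn-in}} + \frac{N}{n_p}\,T_{\mathrm{sample}}}_{\text{parallel}}.
\]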

Question 6: OpenMP Tasks (10 points)

The following serial code checks whether the numbers contained in the input vector are prime. The input vector numbers has been filled with uniformly distributed random numbers between lower_limit and upper_limit.

#include <cmath>
#include <vector>

template <class integer_type>
bool is_prime(integer_type i)
{
    if (i == 1) return false;
    if (i == 2) return true;
    if (i % 2 == 0) return false;

    integer_type limit = static_cast<integer_type>(std::sqrt(i));
    for (integer_type j = 3; j <= limit; j += 2)
        if (i % j == 0) return false;

    return true;
}

template <class integer_type>
std::vector<int> is_prime_serial(std::vector<integer_type>& numbers)
{
    size_t n = numbers.size();
    std::vector<int> res(n, true);

    for (size_t i = 0; i < n; ++i)
        res[i] = is_prime(numbers[i]);

    return res;
}

template <class integer_type>
std::vector<int> is_prime_parallel(std::vector<integer_type>& numbers)
{
    // TODO
}

a) Implement the function is_prime_parallel. This function should behave like is_prime_serial and be parallelized with OpenMP tasks.

b) Explain why parallelizing this code with static OpenMP loop scheduling is not a good choice.
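One possible implementation of part a), sketched here under the assumption that one task per number is acceptable (a tuned solution might group several numbers per task):

template <class integer_type>
std::vector<int> is_prime_parallel(std::vector<integer_type>& numbers)
{
    size_t n = numbers.size();
    std::vector<int> res(n, true);

    #pragma omp parallel
    #pragma omp single
    {
        for (size_t i = 0; i < n; ++i)
        {
            // one task per number; each task writes a distinct entry of res
            #pragma omp task firstprivate(i) shared(numbers, res)
            res[i] = is_prime(numbers[i]);
        }
    } // all tasks complete at the end of the single/parallel region

    return res;
}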

Question 7: OpenMP Taskloop (10 points)

The following code snippet includes two nested loops that cannot be collapsed and have been parallelized with OpenMP.

// A: matrix of size NxN

// nested parallelism is enabled
#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    // code here

    // TODO: use OpenMP tasks for this loop
    int chunksize = 10;
    #pragma omp parallel for schedule(dynamic, chunksize)
    for (int j = 0; j < N; j++)
    {
        A[i][j] = func(i, j);
    }

    // code here
}

a) Provide an equivalent parallel implementation of the above code using OpenMP tasks for the innermost loop.

b) Discuss which parallelization approach (the original or the task-based one) is more efficient and explain why.
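A possible task-based version of the inner loop for part a), sketched here with the OpenMP taskloop construct (grainsize takes over the role of the dynamic chunk size); the surrounding code is assumed unchanged:

#pragma omp parallel for
for (int i = 0; i < N; i++)
{
    // code here

    int chunksize = 10;
    // the thread owning iteration i generates tasks for the inner loop;
    // no nested parallel region is created
    #pragma omp taskloop grainsize(chunksize) shared(A) firstprivate(i)
    for (int j = 0; j < N; j++)
    {
        A[i][j] = func(i, j);
    }
    // taskloop waits for its tasks (implicit taskgroup) before continuing

    // code here
}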

Question 8: MPI Datatypes (10 points)

In the following MPI code, we want to send some elements of a square matrix from rank 0 to rank 1.

if (rank == 0)
{
    double A[N][N];
    // initialization of A

    // TODO: send the anti-diagonal elements of matrix A to rank 1
}
if (rank == 1)
{
    double buffer[N];

    // TODO: receive elements from rank 0

    // buffer can be used here
}

a) Build and use an appropriate MPI datatype to send the anti-diagonal elements of matrix A.

b) What modifications to your solution are required in order to send the diagonal elements of matrix A?
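A sketch of part a), assuming N is known to both ranks and A is stored row-major as declared above; an MPI_Type_vector with stride N-1 starting at A[0][N-1] picks out the anti-diagonal:

if (rank == 0)
{
    double A[N][N];
    // initialization of A

    // N blocks of 1 double with stride N-1 elements:
    // A[0][N-1], A[1][N-2], ..., A[N-1][0]
    MPI_Datatype antidiag;
    MPI_Type_vector(N, 1, N - 1, MPI_DOUBLE, &antidiag);
    MPI_Type_commit(&antidiag);

    MPI_Send(&A[0][N - 1], 1, antidiag, 1, 0, MPI_COMM_WORLD);
    MPI_Type_free(&antidiag);
}
if (rank == 1)
{
    double buffer[N];
    // the elements arrive as N plain doubles, so a contiguous receive suffices
    MPI_Recv(buffer, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

For part b), the same pattern with stride N+1 and starting address &A[0][0] would describe the main diagonal.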

Question 9: MPI Groups and Communicators (15 points)

Consider the function void diffusion_mpi(MPI_Comm comm, int N, int id), which solves the diffusion equation in two dimensions on an NxN grid using MPI. The communicator comm might be other than MPI_COMM_WORLD. The last argument enables users to perform multiple calls to diffusion_mpi(), as in the following example:

// MPI initialization

// TODO: perform the four calls in parallel, using np = P^2 ranks in total
diffusion_mpi(MPI_COMM_WORLD, 1024, 0);
diffusion_mpi(MPI_COMM_WORLD, 1024, 1);
diffusion_mpi(MPI_COMM_WORLD, 1024, 2);
diffusion_mpi(MPI_COMM_WORLD, 1024, 3);

// MPI shutdown

Use MPI groups and communicators to perform the 4 calls to diffusion_mpi() at once and within a single MPI application. Assume that the application uses n_p = P^2 processes and that each call must be performed by a subgroup of processes. There are 4 subgroups of equal size, and each subgroup corresponds to a square sub-grid of MPI processes. An example for n_p = 16 is depicted in the following figure:

Extend the above MPI code so as to perform the 4 calls in parallel as described above.

Note: Assume that P is even and P >= 2.
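One possible solution, sketched here with MPI_Comm_split rather than explicit MPI_Group construction; the ranks are assumed to be laid out row-major on the P x P grid, and diffusion_mpi() is assumed to be provided:

#include <math.h>
#include <mpi.h>

void diffusion_mpi(MPI_Comm comm, int N, int id);  // provided

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    int P = (int)(sqrt((double)np) + 0.5);   // np = P^2 by assumption

    // position of this rank in the P x P grid (row-major)
    int row = rank / P, col = rank % P;

    // quadrant id in {0,1,2,3}: which half of the rows/columns this rank sits in
    int id = 2 * (row >= P / 2) + (col >= P / 2);

    // ranks with the same color form one sub-communicator (one square sub-grid)
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, id, rank, &subcomm);

    // the four calls now run concurrently, one per subgroup
    diffusion_mpi(subcomm, 1024, id);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}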

Question 10: Non-blocking MPI Barrier (10 points)

MPI_Ibarrier() is a non-blocking version of MPI_Barrier(). By calling MPI_Ibarrier(), a process notifies that it has reached the barrier. The call returns immediately, independent of whether other processes have called MPI_Ibarrier(). An example of using MPI_Ibarrier() is given in the following code:

MPI_Ibarrier(comm, &request);
work(); /* computation, other MPI communications */
MPI_Wait(&request, &status);

Use OpenMP to implement an equivalent version of the above code for cases where only the blocking version of MPI_Barrier() is provided by the MPI library. State any other requirements that must be satisfied by your code.
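A sketch of one possible answer; it assumes the MPI library was initialized with MPI_Init_thread() requesting (and obtaining) MPI_THREAD_MULTIPLE, and that at least two OpenMP threads are available:

#pragma omp parallel num_threads(2)
{
    #pragma omp sections
    {
        #pragma omp section
        MPI_Barrier(comm);   // the blocking barrier runs on one thread

        #pragma omp section
        work();              // computation / other MPI communication on the other thread
    }
}   // the implicit barrier of the parallel region plays the role of MPI_Wait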

Question 11: Hybrid Quadratic Form (20 points)

The following serial code computes the quadratic form

    $Q = v^\top A\, w = \sum_{i,j} v_i A_{ij} w_j$.   (9)

#include <stdio.h>

int main(int argc, char *argv[])
{
    int n = 1024;

    double A[n][n];
    double v[n], w[n];

    // init A_ij = (i + 2.*j) / n^2
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] = (i + 2.*j) / n / n;

    // init v_i = 1. + 2. / (i + .5)
    for (int i = 0; i < n; ++i)
        v[i] = 1. + 2. / (i + 0.5);

    // init w_i = 1.*i / 3. / n
    for (int i = 0; i < n; ++i)
        w[i] = 1.*i / 3. / n;

    double result = 0.;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            result += v[i] * A[i][j] * w[j];

    printf("result: %f\n", result);

    return 0;
}

Parallelize the above code for multicore clusters using the hybrid MPI + OpenMP programming model. Using a block-row distribution, every process must initialize only the needed portion of the objects v, w, A. Assume that the size n is a multiple of the number of processes.
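A minimal hybrid sketch, written in C as in the listing above; the block-row decomposition, the flat storage of the local block of A, and the absence of error checking are simplifications:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;
    const int local_n = n / nprocs;        // rows owned by this rank (n multiple of nprocs)
    const int i0 = rank * local_n;         // first global row index of this rank

    // each rank stores only its block of rows of A and of v, plus the full w
    double *A = malloc((size_t)local_n * n * sizeof(double));
    double *v = malloc((size_t)local_n * sizeof(double));
    double *w = malloc((size_t)n * sizeof(double));

    // initialize only the needed portions (global row index i = i0 + il)
    #pragma omp parallel
    {
        #pragma omp for
        for (int il = 0; il < local_n; ++il) {
            int i = i0 + il;
            v[il] = 1. + 2. / (i + 0.5);
            for (int j = 0; j < n; ++j)
                A[(size_t)il * n + j] = (i + 2.*j) / n / n;
        }
        #pragma omp for
        for (int j = 0; j < n; ++j)
            w[j] = 1.*j / 3. / n;
    }

    // local contribution to Q, reduced over the OpenMP threads
    double local = 0.;
    #pragma omp parallel for reduction(+:local)
    for (int il = 0; il < local_n; ++il)
        for (int j = 0; j < n; ++j)
            local += v[il] * A[(size_t)il * n + j] * w[j];

    // sum the per-rank contributions on rank 0
    double result = 0.;
    MPI_Reduce(&local, &result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result: %f\n", result);

    free(A); free(v); free(w);
    MPI_Finalize();
    return 0;
}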

Question 12: GEMV on GPUs (25 points)

The code below computes a matrix-vector product (GEMV) with an $N \times N$ matrix, $a = Mb$, where

    $a_i = \sum_{j=0}^{N-1} M_{i,j}\, b_j$.   (10)

#include <vector>

#define VECSIZE 512

void random_init(std::vector<float> & v, std::vector<float> & m)
{
    // Some random initialization - this could be anything
    for(std::size_t i = 0; i < v.size(); ++i)
        v[i] = i;
    for(std::size_t i = 0; i < m.size(); ++i)
        m[i] = i;
}

void gemv_cpu(float *a, float *m, float *b, unsigned int n)
{
    for(unsigned int i = 0; i < n; ++i)
        for(unsigned int j = 0; j < n; ++j)
            a[i] += m[n*j + i] * b[j];
}

int main()
{
    unsigned int const n = VECSIZE;
    std::vector<float> a(n);
    std::vector<float> b(n);
    std::vector<float> matrix(n*n);

    random_init(b, matrix);

    //
    // TODO: replace gemv_cpu.
    // The code should perform the matrix-vector multiplication
    // on the GPU and store the result in the CPU vector a.
    //
    gemv_cpu(&a[0], &matrix[0], &b[0], n);

    return 0;
}

a) Modify the code such that the matrix-vector product is computed on the GPU. The final result should be available on the CPU.

b) Write and call a CUDA kernel (gemv_gpu1) for square matrices of arbitrary dimension N employing N GPU threads in total.

c) Write and call a CUDA kernel (gemv_gpu2) for square matrices of arbitrary dimension N employing N/2 GPU threads in total.

d) Write and call a CUDA kernel (gemv_gpu3) for square matrices of fixed dimension N = 64 employing N^2 GPU threads in total. (Hint: Use N blocks.)

Notes: Show how the kernels are called in the main() function. Checking CUDA for errors is not mandatory.
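A sketch covering parts a) and b), assuming plain CUDA (no cuBLAS); gemv_gpu1 assigns one thread per output row and keeps the same indexing m[n*j + i] as gemv_cpu, while the wrapper gemv_gpu and its launch configuration are illustrative choices, not the only valid ones:

#include <cuda_runtime.h>

// One thread per row i: a[i] = sum_j m[n*j + i] * b[j] (same layout as gemv_cpu).
__global__ void gemv_gpu1(float *a, const float *m, const float *b, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = 0.f;
        for (unsigned int j = 0; j < n; ++j)
            sum += m[n*j + i] * b[j];
        a[i] = sum;
    }
}

// Host wrapper: copies data to the device, launches the kernel, copies a back.
void gemv_gpu(float *a, const float *m, const float *b, unsigned int n)
{
    float *d_a, *d_m, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_m, n * n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaMemcpy(d_m, m, n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

    // enough blocks of 256 threads to cover the n rows (roughly N threads in total)
    unsigned int threads = 256;
    unsigned int blocks = (n + threads - 1) / threads;
    gemv_gpu1<<<blocks, threads>>>(d_a, d_m, d_b, n);

    cudaMemcpy(a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_m); cudaFree(d_b);
}

In main(), the call gemv_cpu(&a[0], &matrix[0], &b[0], n) would be replaced by gemv_gpu(&a[0], &matrix[0], &b[0], n). Kernels gemv_gpu2 and gemv_gpu3 would follow the same pattern, with two rows per thread and with one block of N threads per row (using a per-block reduction), respectively.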