Master MICS: Parallel Computing Lecture
Project C/MPI: Matrix-Vector Multiplication
Sebastien Varrette

Matrix-vector multiplication is embedded in many algorithms for solving a wide variety of problems. This assignment aims at studying three different ways to carry out the multiplication of a dense matrix $A \in M_{m,n}(K)$ by a vector $b \in K^n$ in parallel. More precisely, the purpose is to output the vector $c \in K^m$ such that:

$$A \cdot b = \begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & & & \vdots \\ a_{m-1,0} & a_{m-1,1} & \cdots & a_{m-1,n-1} \end{pmatrix} \cdot \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{pmatrix} = \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix} = c, \qquad \text{where } \forall i \in [0, m-1],\; c_i = \sum_{j=0}^{n-1} a_{i,j}\, b_j.$$

We will assume that both matrices and vectors are stored in files that respect the following format:

- Matrix $A = (a_{i,j}) \in M_{m,n}(K)$: $[m, n, a_{0,0}, a_{0,1}, \dots, a_{m-1,n-1}]$;
- Vector $v \in K^n$: $[n, v_0, \dots, v_{n-1}]$.

Exercise 1  Matrix-vector I/O and sequential algorithm

1. Implement the functions handling the allocation and freeing of matrices and vectors whose elements have a size of size bytes, i.e.

    void *vector_alloc(size_t n, size_t size);
    void  vector_free(void *v);
    void **matrix_alloc(size_t m, size_t n, size_t size);
    void  matrix_free(void **M);

A vector is allocated just like a normal array (using calloc for instance). A matrix $A \in M_{m,n}(K)$ will be allocated in three steps: (a) the memory for the $m \times n$ matrix values is allocated (associated to the pointer Astorage); (b) the memory for the m row pointers is allocated and A points to the beginning of this block of memory; (c) the values of A[0], ..., A[m-1] are initialized to &Astorage[0], ..., &Astorage[(m-1)*n] so that A[i][j] corresponds to the matrix element $a_{i,j}$. This is illustrated in Figure 1.
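A minimal sketch of this three-step allocation is given below. It assumes the starred signatures above are the intended ones, and the use of calloc/malloc is one choice among others.

    #include <stdlib.h>

    void *vector_alloc(size_t n, size_t size) { return calloc(n, size); }
    void  vector_free(void *v)                { free(v); }

    /* Step 1: one contiguous block for the m*n values (Astorage).
     * Step 2: one block of m row pointers (A).
     * Step 3: A[i] is set to the beginning of row i inside Astorage,
     *         so that A[i][j] addresses the matrix element a_{i,j}. */
    void **matrix_alloc(size_t m, size_t n, size_t size)
    {
        char  *Astorage = calloc(m * n, size);         /* step 1 */
        void **A        = malloc(m * sizeof(void *));  /* step 2 */
        if (Astorage == NULL || A == NULL) {
            free(Astorage);
            free(A);
            return NULL;
        }
        for (size_t i = 0; i < m; i++)                 /* step 3 */
            A[i] = Astorage + i * n * size;
        return A;
    }

    void matrix_free(void **M)
    {
        if (M != NULL) {
            free(M[0]);   /* releases Astorage (the block of values) */
            free(M);      /* releases the m row pointers             */
        }
    }

Freeing only M[0] and M works because all rows live in the single Astorage block allocated in step 1.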

2. Implement the following functions which store in the file f the vector v (resp. the matrix M) having n (resp. $m \times n$) elements, each of size size bytes. You may want to use fwrite in those functions.

    void store_vector(char *f, size_t n, size_t size, void *v);
    void store_matrix(char *f, size_t m, size_t n, size_t size, void **M);

3. Implement the functions which read the file f to allocate and populate the vector v (resp. the matrix M) with the n (resp. $m \times n$) elements of size size bytes found in the file. fread is probably your best friend here.

    void read_vector(char *f, size_t size, size_t *n, void **v);
    void read_matrix(char *f, size_t size, size_t *m, size_t *n, void ***M);

4. Implement the following printing functions (use type to provide the correct format and cast in printf):

    void print_vector_lf(void *v, MPI_Datatype type, size_t n);
    void print_matrix_lf(void **M, MPI_Datatype type, size_t m, size_t n);

5. Implement a program that asks the user to enter the values m and n, then generates and stores (with appropriate i, m and n):

- the base vectors $e_i \in R^n$ (file base_i_vector_n.dat), where the ith element is 1 and the others 0;
- a random vector $v \in R^n$ (file random_vector_n.dat);
- the identity matrix ID (file id_matrix_m_n.dat) such that $id_{i,i} = 1$ for $0 \le i < \min(m, n)$ and $id_{i,j} = 0$ for $i \neq j$;
- a random matrix $A \in M_{m,n}(K)$ (file random_matrix_m_n.dat).

6. Implement the sequential algorithm that operates the matrix-vector multiplication over elements of type double through the following function, which computes c = M.b:

    void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c);

Check your program using the vectors and matrices generated previously.

7. What is the complexity of the sequential matrix-vector multiplication, in the general case and when m = n?

[Figure 1: Matrix allocation in three steps (Astorage: contiguous block of m*n*size bytes holding the values; A: block of m pointers of sizeof(void*) bytes each, with A[i] pointing to &Astorage[i*n]).]
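A possible sketch of the storage/reading pair for vectors. It assumes the leading dimension is written as a raw size_t (the subject only fixes the logical format [n, v_0, ..., v_{n-1}], not the binary encoding) and that the starred signatures above are the intended ones.

    #include <stdio.h>
    #include <stdlib.h>

    /* Write the dimension n followed by the n elements of v (size bytes each). */
    void store_vector(char *f, size_t n, size_t size, void *v)
    {
        FILE *fp = fopen(f, "wb");
        if (fp == NULL) { perror(f); exit(EXIT_FAILURE); }
        fwrite(&n, sizeof(size_t), 1, fp);
        fwrite(v, size, n, fp);
        fclose(fp);
    }

    /* Read the dimension, allocate *v accordingly, then read the elements. */
    void read_vector(char *f, size_t size, size_t *n, void **v)
    {
        FILE *fp = fopen(f, "rb");
        if (fp == NULL) { perror(f); exit(EXIT_FAILURE); }
        if (fread(n, sizeof(size_t), 1, fp) != 1) exit(EXIT_FAILURE);
        *v = calloc(*n, size);
        if (fread(*v, size, *n, fp) != *n) exit(EXIT_FAILURE);
        fclose(fp);
    }

The matrix versions follow the same pattern, writing m and n first and then the contiguous Astorage block (M[0]) in a single fwrite/fread call.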

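For question 6, a sketch of the sequential reference version; the double ** layout matches the allocation of Figure 1 and the signature reconstructs the stripped pointers of the subject.

    #include <stddef.h>

    /* c = M.b : for each row i, accumulate the inner product of row i with b. */
    void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c)
    {
        for (size_t i = 0; i < m; i++) {
            c[i] = 0.0;
            for (size_t j = 0; j < n; j++)
                c[i] += M[i][j] * b[j];   /* c_i = sum_j a_{i,j} * b_j */
        }
    }

The doubly nested loop makes the cost asked about in question 7 apparent: each of the m output elements needs n multiply-add operations.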
We use a domain decomposition strategy to develop the parallel algorithms. There are three straightforward ways to decompose an $m \times n$ matrix A: rowwise block striping, columnwise block striping and the checkerboard block decomposition. Each approach (after agglomeration and mapping) is illustrated in Figure 2. We will also assume that a single process is responsible for Input/Output.

[Figure 2: Three ways to decompose and map a matrix having 8 rows and 9 columns over 4 processes: rowwise block-striped decomposition, columnwise block-striped decomposition and checkerboard block decomposition.]

Exercise 2  Block data decomposition

Whatever kind of decomposition we consider, a strategy for distributing the primitive tasks among the processors should be defined. We propose here to analyse the properties of a classical block allocation scheme. Let k be a number of elements (indexed from 0 to k-1) to be distributed over p processes (indexed from 0 to p-1). We assume that the process $P_i$ controls the elements between $\lfloor ik/p \rfloor$ and $\lfloor (i+1)k/p \rfloor - 1$.

1. Show that the process controlling a particular element j is $i = \left\lfloor \frac{p(j+1) - 1}{k} \right\rfloor$.

2. Show that the last process (i.e. $P_{p-1}$) is responsible for $\lceil k/p \rceil$ elements.

In the sequel, we will assume the following macros to be defined:

    #define BLOCK_LOW(id,p,k)   ((id)*(k)/(p))
    #define BLOCK_HIGH(id,p,k)  (BLOCK_LOW((id)+1,(p),(k))-1)
    #define BLOCK_SIZE(id,p,k)  (BLOCK_HIGH((id),(p),(k))-BLOCK_LOW((id),(p),(k))+1)
    #define BLOCK_OWNER(j,p,k)  (((p)*((j)+1)-1)/(k))

3. Use those macros in the following function, which initializes the count and displacement arrays needed to scatter/gather a buffer of k data items on/from p processes, when the number of elements sent/received to/from other processes varies:

    void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp);

In other words, after the call to this function, count[i] is the number of elements sent/received to/from process i (i.e. BLOCK_SIZE(i,p,k)) while disp[i] gives the starting point in the buffer of the elements sent/received to/from process i (i.e. disp[i-1] + count[i-1]).

4. Similarly, implement the following function that initializes the count and displacement arrays needed by a process id when it is supposed to get the same number of elements from every other process to populate its buffer:

    void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp);

In other words, after the call to this function, count[i] is the number of elements received from process i (i.e. BLOCK_SIZE(id,p,k)) while disp[i] gives the starting point in the buffer of the elements sent/received by process i (i.e. disp[i-1] + count[i-1]).
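A sketch of the two helper functions of questions 3 and 4, assuming the reconstructed signatures above; the macros are repeated so the fragment stands alone.

    /* Block-decomposition macros of Exercise 2. */
    #define BLOCK_LOW(id,p,k)   ((id)*(k)/(p))
    #define BLOCK_HIGH(id,p,k)  (BLOCK_LOW((id)+1,(p),(k))-1)
    #define BLOCK_SIZE(id,p,k)  (BLOCK_HIGH((id),(p),(k))-BLOCK_LOW((id),(p),(k))+1)

    /* Varying counts: count[i] = BLOCK_SIZE(i,p,k), displacements accumulate. */
    void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp)
    {
        count[0] = BLOCK_SIZE(0, p, k);
        disp[0]  = 0;
        for (int i = 1; i < p; i++) {
            count[i] = BLOCK_SIZE(i, p, k);
            disp[i]  = disp[i - 1] + count[i - 1];
        }
    }

    /* Uniform counts: process id receives BLOCK_SIZE(id,p,k) elements from
     * every process, so all counts are equal and the displacements are
     * regular multiples of that block size. */
    void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp)
    {
        count[0] = BLOCK_SIZE(id, p, k);
        disp[0]  = 0;
        for (int i = 1; i < p; i++) {
            count[i] = BLOCK_SIZE(id, p, k);
            disp[i]  = disp[i - 1] + count[i - 1];
        }
    }

The first variant is the one fed to MPI_Scatterv/MPI_Gatherv, while the second shows up on the receive side of the all-to-all exchange of Exercise 5.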

Exercise 3  Vector decomposition strategies

As for the vectors, there are two natural ways to distribute them among the processes: a block decomposition as proposed in Exercise 2, or a full replication, meaning that all vector elements are copied on all the processes. Explain why this second approach is acceptable in the context of matrix-vector multiplication.

Finally, we consider here three ways of partitioning the matrix and two ways of distributing the vectors, which leads to six combinations. We will limit our investigation to three of them: rowwise block striping with replicated vectors (Exercise 4), columnwise block striping with block-decomposed vectors (Exercise 5) and the checkerboard block decomposition with block-decomposed vectors over the first column of processes (Exercise 6).

[Figure 3: Operations performed by a primitive task i in a rowwise block-striped decomposition: inner product of A[i] with the replicated vector b, yielding c[i], followed by an all-gather communication that rebuilds the complete vector c.]

Exercise 4  Parallel version using Rowwise Block-Striped Decomposition

In this exercise, we associate a primitive task with each row of the matrix A. Vectors b and c are replicated among the primitive tasks, so the memory should be allocated for entire vectors on each processor. It follows that task i has all the elements required to compute the inner product leading to the element $c_i$ of c. Once this is done, the processes must concatenate their pieces of the vector c into a complete vector. This is done through the MPI_Allgatherv function (see Appendix A). The sequence of all operations is summarized in Figure 3.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many rows are assigned to each process? Which process is advised for I/O and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. For simplicity, we assume m = n and we don't take into account the reading of the matrix/vector files. Let $\chi$ be the time needed to compute a single iteration of the loop performing the inner product ($\chi$ can be determined by dividing the execution time of the sequential algorithm by $n^2$). What is the expected time for the computational portion of the parallel program?
We assume each vector element to be a double occupying 8 bytes. Each process being responsible for around n/p elements, an all-gather communication requires each process to send $\log p$ messages, the first one having length $8n/p$ bytes, the second $2 \cdot 8n/p$ bytes, etc. Let $\lambda$ be the latency of each message and $\beta$ the bandwidth, i.e. the number of data items that can be sent down a channel in one unit of time (sending a message containing k data items then requires time $\lambda + k/\beta$). Show that the expected communication time for the all-gather step is $O(\log p + n)$. Conclude by giving the complexity of this parallel algorithm. Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

3. As mentioned before, a single process j is responsible for I/O. Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size BLOCK_SIZE(id,p,*m) * *n is allocated; (c) process j reads and sends (using MPI_Send) the rows associated to each process (except itself); (d) the other processes receive their matrix part from process j (using MPI_Recv):

    void read_row_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function in which the I/O process j reads the file f, broadcasts the dimension of the vector, then reads and broadcasts the vector elements:

    void read_vector_and_replicate(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Assuming elements of type double, implement the functions that print a matrix (resp. a vector) distributed in row-striped fashion among the processes of a communicator:

    void print_row_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
    void print_row_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel algorithm operating the matrix-vector multiplication based on the rowwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.
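A sketch of the helper of question 4, assuming the reconstructed signature above; IO_RANK is a placeholder for whichever process you designate for I/O in question 1, and the binary file layout is the one assumed in Exercise 1.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define IO_RANK 0   /* placeholder: the rank you choose for I/O (question 1) */

    /* The I/O process reads the dimension and the data, then both are
     * broadcast so that every process holds a full copy of the vector. */
    void read_vector_and_replicate(char *f, MPI_Datatype dtype, size_t *n,
                                   void **v, MPI_Comm comm)
    {
        int id, type_size;
        MPI_Comm_rank(comm, &id);
        MPI_Type_size(dtype, &type_size);

        if (id == IO_RANK) {
            FILE *fp = fopen(f, "rb");
            if (fp == NULL) MPI_Abort(comm, EXIT_FAILURE);
            if (fread(n, sizeof(size_t), 1, fp) != 1) MPI_Abort(comm, EXIT_FAILURE);
            *v = malloc(*n * type_size);
            if (fread(*v, type_size, *n, fp) != *n) MPI_Abort(comm, EXIT_FAILURE);
            fclose(fp);
        }
        /* The dimension travels as raw bytes to avoid guessing an MPI type
         * matching size_t; the elements use the caller-supplied datatype. */
        MPI_Bcast(n, sizeof(size_t), MPI_BYTE, IO_RANK, comm);
        if (id != IO_RANK)
            *v = malloc(*n * type_size);
        MPI_Bcast(*v, (int)*n, dtype, IO_RANK, comm);
    }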

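The heart of the rowwise algorithm (questions 2 and 6) then boils down to local inner products followed by MPI_Allgatherv. In the sketch below, rowwise_matvec_mult_lf is a hypothetical helper name; A is assumed to hold only the BLOCK_SIZE(id,p,m) local rows (as produced by read_row_matrix), while b and c are full replicated vectors of sizes n and m.

    #include <stdlib.h>
    #include <mpi.h>

    /* BLOCK_* macros from Exercise 2. */
    #define BLOCK_LOW(id,p,k)  ((id)*(k)/(p))
    #define BLOCK_HIGH(id,p,k) (BLOCK_LOW((id)+1,(p),(k))-1)
    #define BLOCK_SIZE(id,p,k) (BLOCK_HIGH((id),(p),(k))-BLOCK_LOW((id),(p),(k))+1)

    void rowwise_matvec_mult_lf(double **A, double *b, double *c,
                                size_t m, size_t n, MPI_Comm comm)
    {
        int id, p;
        MPI_Comm_rank(comm, &id);
        MPI_Comm_size(comm, &p);

        int local_rows = BLOCK_SIZE(id, p, m);
        double *local_c = malloc(local_rows * sizeof(double));

        for (int i = 0; i < local_rows; i++) {      /* local inner products */
            local_c[i] = 0.0;
            for (size_t j = 0; j < n; j++)
                local_c[i] += A[i][j] * b[j];
        }

        /* Concatenate the blocks of c on every process; counts/displacements
         * are the "mixed" arrays of Exercise 2, computed inline here. */
        int *cnt  = malloc(p * sizeof(int));
        int *disp = malloc(p * sizeof(int));
        cnt[0] = BLOCK_SIZE(0, p, m);
        disp[0] = 0;
        for (int i = 1; i < p; i++) {
            cnt[i]  = BLOCK_SIZE(i, p, m);
            disp[i] = disp[i - 1] + cnt[i - 1];
        }
        MPI_Allgatherv(local_c, local_rows, MPI_DOUBLE,
                       c, cnt, disp, MPI_DOUBLE, comm);

        free(local_c); free(cnt); free(disp);
    }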
[Figure 4: Operations performed by a primitive task j in a columnwise block-striped decomposition: task j multiplies column j of A by b[j], obtaining a vector of partial inner products; an all-to-all exchange redistributes the partial results, and each task sums the contributions it received to obtain c[j].]

Exercise 5  Parallel version using Columnwise Block-Striped Decomposition

In this exercise, we assume m = n (for simplification) but keep the distinction in notation where possible. This hypothesis simplifies the all-to-all step and the decomposition. Here a primitive task j is associated with the jth column of the matrix A and with the element j of the vectors b and c. Recall that the elements of vector c are given by:

$$\begin{aligned}
c_0     &= a_{0,0} b_0 + a_{0,1} b_1 + \dots + a_{0,n-1} b_{n-1} \\
c_1     &= a_{1,0} b_0 + a_{1,1} b_1 + \dots + a_{1,n-1} b_{n-1} \\
        &\;\;\vdots \\
c_{m-1} &= a_{m-1,0} b_0 + a_{m-1,1} b_1 + \dots + a_{m-1,n-1} b_{n-1}
\end{aligned}$$

The computation begins with each task j multiplying its column of A by $b_j$, resulting in a vector of partial results (corresponding to $a_{i,j} b_j$ for $i \in [0, m-1]$). Note that the product $a_{i,j} b_j$ is the jth term of the inner product leading to $c_i$. In other words, the m multiplications performed by task j yield m terms, of which only the jth is relevant for the computation of $c_j$ (handled by this task). It follows that task j needs to distribute the m - 1 = n - 1 result terms it doesn't need to the other processes and to collect the n - 1 results it does need from them. This is done through an all-to-all exchange performed in MPI with the function MPI_Alltoallv (see Appendix A). Once this is done, task j is able to compute $c_j$ by adding every term of its inner product. The sequence of all operations is summarized in Figure 4.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many columns / parts of the vectors are assigned to each process? Which process is advised for I/O and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. Again, we don't take into account the reading of the matrix/vector files. Let us take again $\chi$, $\lambda$ and $\beta$ as defined in Exercise 4, and consider elements of type double occupying 8 bytes. After the initial multiplications, process j owns a vector partial_c of size n. Through the all-to-all exchange, process j receives p - 1 partial vectors from the other tasks, each of size BLOCK_SIZE(j,p,n), and performs an element-by-element addition (a reduction) to obtain the portion of c handled by this process. What is the expected time for the computational portion of the parallel program?
The algorithm performs an all-to-all exchange of partially computed portions of vector c. There are two common ways to handle such a communication: (1) in each of $\log p$ phases, all nodes exchange half of their accumulated data with the others; (2) each process sends directly to each of the other processes the elements destined to that process, which requires that each process send p - 1 messages, each of size n/p.
Show that the communication complexity of approach 1 is $\lambda \log p + \frac{n}{2\beta} \log p$.
Show that the communication complexity of approach 2 is $(p-1)\lambda + (p-1)\frac{n}{p\beta}$.
Which approach is the most suited for short messages? For long messages? Conclude by giving the complexity of this parallel algorithm in both cases. Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

3. As mentioned before, a single process j is responsible for I/O. Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size *m * BLOCK_SIZE(id,p,*n) is allocated; (c) process j reads in the matrix one row at a time and distributes each row among the processes (using MPI_Scatterv, cf. Appendix A):

    void read_col_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function in which (a) the I/O process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimension n of the vector; (b) the vector *v of size BLOCK_SIZE(id,p,*n) is allocated; (c) process j reads and sends (using MPI_Send) the portion associated to each process (except itself); (d) the other processes receive their vector part from process j (using MPI_Recv):

    void read_block_vector(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Implement the functions that print a matrix (resp. a vector) distributed in column-striped fashion among the processes of a communicator. In order to print the values of a row in the correct order, a single process must gather (using MPI_Gatherv, see Appendix A) the elements of that row from the entire set of processes (hence the dataflow of these functions is opposite that of read_col_matrix and read_block_vector):

    void print_col_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
    void print_col_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel algorithm operating the matrix-vector multiplication based on the columnwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.
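For the computational core asked for in question 6, one possible sketch is given below. Here colwise_matvec_mult_lf is a hypothetical helper name; A is assumed to hold the n x BLOCK_SIZE(id,p,n) block of columns owned by the process (as allocated by read_col_matrix), and b_block/c_block the corresponding blocks of b and c. The exercise's assumption m = n is kept.

    #include <stdlib.h>
    #include <mpi.h>

    #define BLOCK_LOW(id,p,k)  ((id)*(k)/(p))
    #define BLOCK_HIGH(id,p,k) (BLOCK_LOW((id)+1,(p),(k))-1)
    #define BLOCK_SIZE(id,p,k) (BLOCK_HIGH((id),(p),(k))-BLOCK_LOW((id),(p),(k))+1)

    void colwise_matvec_mult_lf(double **A, double *b_block, double *c_block,
                                size_t n, MPI_Comm comm)
    {
        int id, p;
        MPI_Comm_rank(comm, &id);
        MPI_Comm_size(comm, &p);

        int my_size = BLOCK_SIZE(id, p, n);

        /* partial_c[i] = sum over the locally owned columns j of a_{i,j} b_j */
        double *partial_c = calloc(n, sizeof(double));
        for (size_t i = 0; i < n; i++)
            for (int j = 0; j < my_size; j++)
                partial_c[i] += A[i][j] * b_block[j];

        /* Send to process i the slice of partial_c that belongs to i's block
         * (mixed counts); receive my_size elements from every process
         * (uniform counts), cf. the two helper functions of Exercise 2. */
        int *scnt = malloc(p * sizeof(int)), *sdisp = malloc(p * sizeof(int));
        int *rcnt = malloc(p * sizeof(int)), *rdisp = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) {
            scnt[i]  = BLOCK_SIZE(i, p, n);
            sdisp[i] = BLOCK_LOW(i, p, n);
            rcnt[i]  = my_size;
            rdisp[i] = i * my_size;
        }
        double *recv = malloc((size_t)p * my_size * sizeof(double));
        MPI_Alltoallv(partial_c, scnt, sdisp, MPI_DOUBLE,
                      recv,      rcnt, rdisp, MPI_DOUBLE, comm);

        /* Reduction: sum the p received contributions for each local element. */
        for (int k = 0; k < my_size; k++) {
            c_block[k] = 0.0;
            for (int i = 0; i < p; i++)
                c_block[k] += recv[i * my_size + k];
        }

        free(partial_c); free(scnt); free(sdisp);
        free(rcnt); free(rdisp); free(recv);
    }

The send counts follow the mixed pattern of Exercise 2.3 and the receive counts the uniform pattern of Exercise 2.4, which is exactly why both helper functions were introduced.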

Exercise 6  Parallel version using Checkerboard Block Decomposition

Here a primitive task is associated with each element of the matrix. The task responsible for $a_{i,j}$ multiplies it by $b_j$, yielding $d_{i,j}$. Each element $c_i$ of the result vector is then $\sum_{j=0}^{n-1} d_{i,j}$. The tasks are agglomerated into rectangular blocks of approximately the same size, assigned on a two-dimensional grid of processes (see Figure 2). We assume that the vector b is distributed by blocks among the tasks of the first column of processes. The first step is then to redistribute b among all processes, then to operate a matrix-vector multiplication on each process (between the owned block and the corresponding subpart of b). Finally, each row of tasks performs a sum-reduction of the result vectors, creating vector c.

[Figure 5: Operations performed by a primitive task in a checkerboard block decomposition: redistribute b, perform the local matrix-vector multiplication producing a partial c on each process, then reduce the partial results across the rows of the process grid.]

After the reduction, the result vector c is distributed among the tasks in the first column of the process grid. The sequence of all operations is summarized in Figure 5.

Vector Redistribution. Assume that the p processes are divided over a $k \times l$ grid. If $k = l = \sqrt{p}$, the redistribution of b is easier and is operated as illustrated in Figure 6(a). In the general case where $k \neq l$, the redistribution is more complicated and is detailed in Figure 6(b).

[Figure 6: Redistribution of vector b (a) when the process grid is square: task (i,0) sends its portion of b to task (0,i), then each task (0,i) broadcasts its portion to the tasks (*,i); and (b) when the process grid is not square: task (0,0) gathers b from the tasks (*,0), scatters it over the tasks (0,*), then each task (0,i) broadcasts its portion to the tasks (*,i).]

1. Parallel computation analysis. For the sake of simplicity, we assume here that m = n and that p is a square number, so that the processes are arranged on a square grid. Each process is responsible for a matrix block of size at most $\lceil n/\sqrt{p} \rceil \times \lceil n/\sqrt{p} \rceil$. What is the expected time for the computational portion of the parallel program? Again, we don't take into account the reading of the matrix/vector files. Show that the expected communication time is $O\!\left(\frac{n \log p}{\sqrt{p}}\right)$. Conclude by giving the complexity of this parallel algorithm. Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

2. The exercise requires the creation of communicators respecting a Cartesian topology. To create a virtual mesh of processes that is as close to square as possible (for maximal scalability), the first step is to determine the number of nodes in each dimension of the grid. This is handled by the function MPI_Dims_create (see Appendix A). With the result of this function, you can use the function MPI_Cart_create to create a communicator with a Cartesian topology. Note that, once it is created, the function MPI_Cart_get helps to retrieve information about the Cartesian communicator. Assuming grid_comm to be a communicator with Cartesian topology, write the functions that partition the process grid into columns (resp. rows) relative to the process id. After a call to these functions, col_comm (resp. row_comm) should be a communicator containing the calling process and all other processes in the same column (resp. row) of the process grid, but no others:

    void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm);
    void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm);
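A minimal sketch of these two functions, using MPI_Cart_coords to obtain the calling process's grid coordinates and MPI_Comm_split to group the processes sharing one coordinate; the assumption is that coords[0] indexes the row and coords[1] the column.

    #include <mpi.h>

    /* Processes with the same column coordinate end up in the same col_comm;
     * the row coordinate is used as the rank key, so ranks inside col_comm
     * follow the row order. create_row_comm is symmetric. */
    void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm)
    {
        int coords[2];
        MPI_Cart_coords(grid_comm, id, 2, coords);
        MPI_Comm_split(grid_comm, coords[1], coords[0], col_comm);
    }

    void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm)
    {
        int coords[2];
        MPI_Cart_coords(grid_comm, id, 2, coords);
        MPI_Comm_split(grid_comm, coords[0], coords[1], row_comm);
    }

MPI_Cart_sub offers a more direct way to obtain the same sub-communicators from a Cartesian grid, if you prefer to stay within the topology functions.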

3. Suppose again that grid_comm is a communicator with Cartesian topology. Write a code segment illustrating how the function read_block_vector (implemented in Exercise 5.4) can be used to open a file containing the vector b and distribute it among the first column of processes in grid_comm.
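One possible such code segment is sketched below. It assumes grid_comm was built with MPI_Cart_create, that read_block_vector has the signature reconstructed in Exercise 5.4, and it uses placeholder names (first_col_comm, the file name, etc.).

    /* Build a communicator containing only the first column of the grid and
     * let read_block_vector do the block distribution inside it. */
    int coords[2], dims[2], periods[2];
    MPI_Comm first_col_comm;
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);

    /* Processes of column 0 get color 0; all others are left out. */
    int color = (coords[1] == 0) ? 0 : MPI_UNDEFINED;
    MPI_Comm_split(grid_comm, color, coords[0], &first_col_comm);

    size_t n;
    double *b = NULL;
    if (first_col_comm != MPI_COMM_NULL)          /* only the first column */
        read_block_vector("random_vector_n.dat",  /* example file name     */
                          MPI_DOUBLE, &n, (void **)&b, first_col_comm);

Processes outside the first column pass MPI_UNDEFINED to MPI_Comm_split and therefore receive MPI_COMM_NULL, so they simply skip the call to read_block_vector.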

4. The distribution pattern of the matrix is similar to the one used for the function read_col_matrix (implemented in Exercise 5.3), except that instead of scattering each matrix row among all the processes, we must scatter it among a subset of the processes: those occupying a single row in the virtual process grid. As always, we assume a single process to be responsible for matrix I/O, yet the choice does not really matter in this case. Let's select $P_0$ for this task. Implement the following function in which (a) $P_0$ opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) each process determines the size of the submatrix it is responsible for (using MPI_Cart_get in particular) and allocates it; (c) $P_0$ reads in the matrix one row at a time and sends it to the process in the first column of the correct row of the virtual process grid; (d) after the receiving process receives the matrix row, it scatters it among the other processes belonging to the same row of the virtual process grid:

    void read_grid_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm grid_comm);

5. Write the function that redistributes the vector b among the processes following the operations illustrated in Figure 6:

    void redistribute_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm grid_comm);

Initially, vector b is distributed among the first column of the $k \times l$ virtual grid: the process at grid location (i, 0) is responsible for BLOCK_SIZE(i,k,n) elements of b, beginning with the element having index BLOCK_LOW(i,k,n). After the redistribution, every process in column j of the grid is responsible for BLOCK_SIZE(j,l,n) elements of b, beginning with the element having index BLOCK_LOW(j,l,n). You may rely on two sub-functions that deal with the specific cases where $k = l = \sqrt{p}$ and $k \neq l$.

6. Implement the printing function:

    void print_checkerboard_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

7. Finally, write a program that implements the parallel algorithm operating the matrix-vector multiplication based on the checkerboard block decomposition. Check your program using the vectors and matrices generated previously.

8. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Merry Christmas and Happy New Year!

A  MPI functions used in this exercise

See: MPI_Allgatherv, MPI_Allgather, MPI_Gatherv, MPI_Gather, MPI_Scatterv, MPI_Scatter, MPI_Alltoallv, MPI_Dims_create, MPI_Cart_create, MPI_Cart_get, MPI_Cart_rank, MPI_Cart_coords, MPI_Cart_split.
