Project C/MPI: Matrix-Vector Multiplication
Master MICS: Parallel Computing Lecture
Project C/MPI: Matrix-Vector Multiplication
Sebastien Varrette

Matrix-vector multiplication is embedded in many algorithms for solving a wide variety of problems. This assignment aims at studying three different ways to carry out, in parallel, the multiplication of a dense matrix $A \in \mathcal{M}_{m,n}(K)$ by a vector $b \in K^n$. More precisely, the purpose is to output the vector $c \in K^m$ such that $A.b = c$:

$$\begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & & & \vdots \\ a_{m-1,0} & a_{m-1,1} & \cdots & a_{m-1,n-1} \end{pmatrix} . \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{pmatrix} = \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix} \quad \text{where } \forall i \in [0, m-1],\; c_i = \sum_{j=0}^{n-1} a_{i,j}\, b_j$$

We will assume that both matrices and vectors are stored in files that respect the following format:

- Matrix $A = (a_{i,j}) \in \mathcal{M}_{m,n}(K)$: [m, n, a_{0,0}, a_{0,1}, ..., a_{m-1,n-1}]
- Vector $v \in K^n$: [n, v_0, ..., v_{n-1}]

Exercise 1  Matrix-vector I/O and sequential algorithm

1. Implement the functions handling the allocation and freeing of matrices and vectors whose elements occupy size bytes each:

      void *vector_alloc(size_t n, size_t size);
      void  vector_free(void *v);
      void **matrix_alloc(size_t m, size_t n, size_t size);
      void  matrix_free(void **M);

   A vector is allocated just like a normal array (using calloc, for instance). A matrix $A \in \mathcal{M}_{m,n}(K)$ will be allocated in three steps:

   (a) the memory for the m*n matrix values is allocated (associated with the pointer Astorage);
   (b) the memory for the m row pointers is allocated, and A points to the beginning of this block of memory;
   (c) the values of A[0], ..., A[m-1] are initialized to &Astorage[0], ..., &Astorage[(m-1)*n], so that A[i][j] corresponds to the matrix element $a_{i,j}$.

   This is illustrated in Figure 1.

2. Implement the following functions, which store in the file f the vector v (resp. the matrix M) having n (resp. m*n) elements, each of size size bytes. You may want to use fwrite in these functions.
      void store_vector(char *f, size_t n, size_t size, void *v);
      void store_matrix(char *f, size_t m, size_t n, size_t size, void **M);

3. Implement the functions which read the file f to allocate and populate the vector v (resp. the matrix M) with the n (resp. m*n) elements of size size bytes from the file. fread is probably your best friend here.

      void read_vector(char *f, size_t size, size_t *n, void **v);
      void read_matrix(char *f, size_t size, size_t *m, size_t *n, void ***M);

4. Implement the following printing functions (use type to provide the correct format and cast in printf):

      void print_vector_lf(void *v, MPI_Datatype type, size_t n);
      void print_matrix_lf(void **M, MPI_Datatype type, size_t m, size_t n);

5. Implement a program that asks the user to enter the values m and n, then generates and stores (with the appropriate i, m and n):

   - the base vectors $e_i \in \mathbb{R}^n$ (file base_i_vector_n.dat), where the i-th element is 1 and the others are 0;
   - a random vector $v \in \mathbb{R}^n$ (file random_vector_n.dat);
   - the identity matrix ID (file id_matrix_m_n.dat), such that $id_{i,i} = 1$ for $0 \le i < \min(m,n)$ and $id_{i,j} = 0$ for $i \ne j$;
   - a random matrix $A \in \mathcal{M}_{m,n}(K)$ (file random_matrix_m_n.dat).

6. Implement the sequential algorithm that performs the matrix-vector multiplication over elements of type double, through the following function which computes c = M.b:

      void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c);

   Check your program using the vectors and matrices generated previously.

7. What is the complexity of the sequential matrix-vector multiplication, in the general case and when m = n?

Figure 1: Matrix allocation in three steps.

We use a domain decomposition strategy to develop the parallel algorithm. There are three straightforward ways to decompose an m×n matrix A: rowwise
block striping, columnwise block striping and the checkerboard block decomposition. Each approach (after agglomeration and mapping) is illustrated in Figure 2. We will also assume that a single process is responsible for Input/Output.

Figure 2: Three ways to decompose and map a matrix having m = 8 rows and n = 9 columns over 4 processes: rowwise block-striped decomposition, columnwise block-striped decomposition and checkerboard block decomposition.

Exercise 2  Block data decomposition

Whatever kind of decomposition we consider, a strategy for distributing the primitive tasks among the processors should be defined. We propose here to analyse the properties of a classical block allocation scheme. Let k be a number of elements (indexed from 0 to k−1) to be distributed over p processes (indexed from 0 to p−1). We assume that process $P_i$ controls the elements between $\lfloor ik/p \rfloor$ and $\lfloor (i+1)k/p \rfloor - 1$.

1. Show that the process controlling a particular element j is $i = \left\lfloor \frac{p(j+1)-1}{k} \right\rfloor$.

2. Show that the last process (i.e. $P_{p-1}$) is responsible for $\lceil k/p \rceil$ elements.

In the sequel, we will assume the following macros to be defined:

      #define BLOCK_LOW(id,p,k)   ((id)*(k)/(p))
      #define BLOCK_HIGH(id,p,k)  (BLOCK_LOW((id)+1,p,k) - 1)
      #define BLOCK_SIZE(id,p,k)  (BLOCK_HIGH(id,p,k) - BLOCK_LOW(id,p,k) + 1)
      #define BLOCK_OWNER(j,p,k)  (((p)*((j)+1)-1)/(k))

3. Use those macros in the following function, which initializes the count and displacement arrays needed to scatter/gather a buffer of k data items on/from p processes, when the number of elements sent/received to/from the other processes varies:

      void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp);

   In other words, after the call to this function, count[i] is the number of elements sent/received to/from process i (i.e. BLOCK_SIZE(i,p,k)), while disp[i] gives the starting point in the buffer of the elements sent/received to/from process i (i.e. disp[i-1] + count[i-1]).
4. Similarly, implement the following function, which initializes the count and displacement arrays needed by a process id when it is supposed to get the same number of elements from every other process to populate a buffer of k data items:

      void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp);

   In other words, after the call to this function, count[i] is the number of elements received from process i (i.e. BLOCK_SIZE(id,p,k)), while disp[i] gives the starting point in the buffer of the elements sent/received by process i (i.e. disp[i-1] + count[i-1]).

Exercise 3  Vector decomposition strategies

As for the vectors, there are two natural ways to distribute them among the processes: a block decomposition, as proposed in Exercise 2, or a full replication, meaning that all vector elements are copied on all the processes. Explain why this second approach is acceptable in the context of matrix-vector multiplication.

Finally, we consider here three ways of partitioning the matrix and two ways of distributing the vectors, which leads to six combinations. We will limit our investigation to three of them: rowwise block striping with replicated vectors (Exercise 4), columnwise block striping with block-decomposed vectors (Exercise 5) and the checkerboard block decomposition with block-decomposed vectors over the first column of processes (Exercise 6).

Figure 3: Operations performed by a primitive task i in a rowwise block-striped decomposition: task i computes the inner product of its row A[i] with the replicated vector b, yielding c[i]; an all-gather communication then rebuilds the complete vector c on every task.

Exercise 4  Parallel version using Rowwise Block-Striped Decomposition

In this exercise, we associate a primitive task with each row of the matrix A. Vectors b and c are replicated among the primitive tasks, so the memory should
be allocated for the entire vectors on each processor. It follows that task i has all the elements required to compute the inner product leading to the element $c_i$ of c. Once this is done, the processes must concatenate their pieces of the vector c into a complete vector. This is done through the MPI_Allgatherv function (see Appendix A). The sequence of all operations is summarized in Figure 3.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many rows are assigned to each process? Which process is advised for I/O, and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. For simplicity, we assume m = n and do not take into account the reading of the matrix/vector files. Let χ be the time needed to compute a single iteration of the loop performing the inner product (χ can be determined by dividing the execution time of the sequential algorithm by $n^2$).

   What is the expected time for the computational portion of the parallel program?

   We assume each vector element to be a double occupying 8 bytes. Each process being responsible for around $n/p$ elements, an all-gather communication requires each process to send $\lceil \log p \rceil$ messages, the first one having length $8n/p$ bytes, the second $2 \cdot 8n/p$, etc. Let λ be the latency of each message and β the bandwidth, i.e. the number of data items that can be sent down a channel in one unit of time (sending a message containing k data items thus requires time $\lambda + k/\beta$).

   Show that the expected communication time for the all-gather step is $O(\log p + n)$.

   Conclude by giving the complexity of this parallel algorithm.

   Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

3. As mentioned before, a single process j is responsible for I/O.
   Implement the following function, in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size BLOCK_SIZE(id,p,*m) × *n is allocated; (c) process j reads and sends (using MPI_Send) the rows associated with each process (except itself); (d) the other processes receive their matrix part from process j (using MPI_Recv):

      void read_row_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function, in which the I/O process j reads the file f, broadcasts the dimension of the vector, then reads and broadcasts the vector elements:
      void read_vector_and_replicate(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Assuming elements of type double, implement the functions that print a matrix (resp. a vector) distributed in row-striped fashion among the processes in a communicator:

      void print_row_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
      void print_row_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel algorithm performing the matrix-vector multiplication based on rowwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Figure 4: Operations performed by a primitive task j in a columnwise block-striped decomposition: task j multiplies column j of A by b[j], producing a vector of partial products (a partial c); an all-to-all exchange redistributes the partial results; each task then sums the partial results it received (a reduction) to obtain c[j].

Exercise 5  Parallel version using Columnwise Block-Striped Decomposition

In this exercise, we assume m = n (for simplification) but keep the distinction in notation where possible. This hypothesis simplifies the all-to-all step and the decomposition. Here a primitive task j is associated with the j-th column of the matrix A and with the element j of the vectors b and c. Recall that the elements of vector c are given by:

$$\begin{aligned} c_0 &= a_{0,0}\, b_0 + a_{0,1}\, b_1 + \dots + a_{0,n-1}\, b_{n-1} \\ c_1 &= a_{1,0}\, b_0 + a_{1,1}\, b_1 + \dots + a_{1,n-1}\, b_{n-1} \\ &\;\;\vdots \\ c_{m-1} &= a_{m-1,0}\, b_0 + a_{m-1,1}\, b_1 + \dots + a_{m-1,n-1}\, b_{n-1} \end{aligned}$$

The computation begins with each task j multiplying its column of A by $b_j$, resulting in a vector of partial results (corresponding to $a_{i,j}\, b_j$ for $i \in [0, m-1]$).
Note that the product $a_{i,j}\, b_j$ is the j-th term of the inner product leading to $c_i$. In other words, the m multiplications performed by task j yield m terms, of which only the j-th is relevant to the computation of $c_j$ (handled by this task). It follows that task j needs to distribute the m − 1 = n − 1 result terms it doesn't need to the other processes, and to collect the n − 1 results it does need from them. This is done through an all-to-all exchange, performed in MPI with the function MPI_Alltoallv (see Appendix A). Once this is done, task j is able to compute $c_j$ by adding every term of the inner product. The sequence of all operations is summarized in Figure 4.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many columns / parts of the vectors are assigned to each process? Which process is advised for I/O, and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. Again, we do not take into account the reading of the matrix/vector files. Let χ, λ and β be as defined in Exercise 4, and consider elements of type double occupying 8 bytes. After the initial multiplications, process j owns a vector partial_c of size n. Through the all-to-all exchange, process j receives p − 1 partial vectors from the other tasks, each of size BLOCK_SIZE(j,p,n), and performs an element-by-element addition (a reduction) to obtain the portion of c handled by this process.

   What is the expected time for the computational portion of the parallel program?

   The algorithm performs an all-to-all exchange of partially computed portions of vector c. There are two common ways to handle such a communication: (1) in each of log p phases, all nodes exchange half of their accumulated data with the others; (2) each process sends directly to each of the other processes the elements destined to that process, which requires each process to send p − 1 messages, each of size n/p.
   Show that the communication complexity of approach 1 is $\lceil \log p \rceil\, \lambda + \lceil \log p \rceil\, \frac{n}{2\beta}$.

   Show that the communication complexity of approach 2 is $(p-1)\, \lambda + (p-1)\, \frac{n}{p\beta}$.

   Which approach is best suited to short messages? To long messages?

   Conclude by giving the complexity of this parallel algorithm in both cases.

   Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?
3. As mentioned before, a single process j is responsible for I/O. Implement the following function, in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size *m × BLOCK_SIZE(id,p,*n) is allocated; (c) process j reads in the matrix one row at a time and distributes each row among the other processes (using MPI_Scatterv, cf. Appendix A):

      void read_col_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function, in which (a) the I/O process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimension n of the vector; (b) the vector *v, of size BLOCK_SIZE(id,p,*n), is allocated; (c) process j reads and sends (using MPI_Send) the portion associated with each process (except itself); (d) the other processes receive their vector part from process j (using MPI_Recv):

      void read_block_vector(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Implement the functions that print a matrix (resp. a vector) distributed in column-striped fashion among the processes in a communicator. In order to print the values of each row in the correct order, a single process must gather (using MPI_Gatherv, see Appendix A) the elements of that row from the entire set of processes (hence the dataflow of these functions is the opposite of that of read_col_matrix and read_block_vector):

      void print_col_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
      void print_col_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel algorithm performing the matrix-vector multiplication based on columnwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Exercise 6  Parallel version using Checkerboard Block Decomposition

Here a primitive task is associated with each element of the matrix.
The task responsible for $a_{i,j}$ multiplies it by $b_j$, yielding $d_{i,j}$. Each element $c_i$ of the result vector is then $\sum_{j=0}^{n-1} d_{i,j}$. The tasks are agglomerated into rectangular blocks of approximately the same size, assigned to a two-dimensional grid of processes (see Figure 2). We assume that the vector b is distributed by blocks among the tasks of the first column of processes. The first step is then to redistribute b among all the processes, then to perform a matrix-vector multiplication on each process (between the owned block and the corresponding subpart of b). Finally, each row of tasks performs a sum-reduction of the result vectors, creating vector c.
Figure 5: Operations performed by a primitive task in a checkerboard block decomposition: the vector b is redistributed, each process multiplies its matrix block by the corresponding subpart of b (yielding a partial c), and a reduction across each row of processes builds the vector c.

After the reduction, the result vector c is distributed among the tasks in the first column of the process grid. The sequence of all operations is summarized in Figure 5.

Vector redistribution. Assume that the p processes are divided over a k × l grid. If k = l = √p, the redistribution of b is easier, and is performed as illustrated in Figure 6(a). In the general case where k ≠ l, the redistribution is more complicated, and is detailed in Figure 6(b).

1. Parallel computation analysis. For the sake of simplicity, we assume here that m = n and that p is a square number, so that the processes are arranged on a square grid. Each process is responsible for a matrix block of size at most $\lceil n/\sqrt{p} \rceil \times \lceil n/\sqrt{p} \rceil$.

   What is the expected time for the computational portion of the parallel program? Again, we do not take into account the reading of the matrix/vector files.

   Show that the expected communication time is $O\!\left(\frac{n \log p}{\sqrt{p}}\right)$.

   Conclude by giving the complexity of this parallel algorithm.

   Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

2. This exercise requires the creation of communicators respecting a Cartesian topology, in order to create a virtual mesh of processes that is as close to square as possible (for maximal scalability). The first step is to determine the number of nodes in each dimension of the grid; this is handled by the function MPI_Dims_create (see Appendix A). With the result of this function, you can use the function MPI_Cart_create to create a communicator with a Cartesian topology. Note that, once it is created, the function MPI_Cart_get helps to retrieve information about the Cartesian communicator.

   Assuming grid_comm to be a communicator with Cartesian topology, write the functions that partition the process grid into columns (resp.
rows) relative to the process id. After a call to these functions, col_comm (resp. row_comm) should be a communicator containing the calling process and all other processes in the same column (resp. row) of the process grid, but no others:

      void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm);
      void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm);

Figure 6: Redistribution of vector b (a) when the process grid is square: each task (i,0) sends its portion of b to task (0,i), then each task (0,i) broadcasts its portion of b to tasks (*,i); and (b) when the process grid is not square: task (0,0) gathers b from tasks (*,0), scatters b over tasks (0,*), then each task (0,i) broadcasts its portion of b to tasks (*,i).

3. Suppose again that grid_comm is a communicator with Cartesian topology. Write a code segment illustrating how the function read_block_vector (implemented in Exercise 5.4) can be used to open a file containing the vector b and distribute it among the first column of processes in grid_comm.

4. The distribution pattern of the matrix is similar to the one used for the function read_col_matrix (implemented in Exercise 5.3), except that instead of scattering each matrix row among all the processes, we must scatter it among a subset of the processes: those occupying a single row in the virtual process grid. As always, we assume a single process to be responsible for matrix I/O, yet the choice does not really matter in this case; let us select $P_0$ for this task. Implement the following function, in which (a) $P_0$ opens the file f, reads and broadcasts (using MPI_Bcast)
the dimensions of the matrix; (b) each process determines the size of the submatrix it is responsible for (using MPI_Cart_get in particular) and allocates it; (c) $P_0$ reads in the matrix one row at a time and sends it to the process in the first column of the correct row of the virtual process grid; (d) after the receiving process gets the matrix row, it scatters it among the other processes belonging to the same row of the virtual process grid:

      void read_grid_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm grid_comm);

5. Write the function that redistributes the vector b among the processes, following the operations illustrated in Figure 6:

      void redistribute_vector(void **v, MPI_Datatype type, size_t n, MPI_Comm grid_comm);

   Initially, vector b is distributed among the k × l virtual grid: the process at grid location (i, 0) is responsible for BLOCK_SIZE(i,k,n) elements of b, beginning with the element having index BLOCK_LOW(i,k,n). After the redistribution, every process in column j of the grid is responsible for BLOCK_SIZE(j,l,n) elements of b, beginning with the element having index BLOCK_LOW(j,l,n). You may rely on two sub-functions that deal with the specific cases k = l = √p and k ≠ l.

6. Implement the printing function:

      void print_checkerboard_matrix(void **M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

7. Finally, write a program that implements the parallel algorithm performing the matrix-vector multiplication based on checkerboard block decomposition. Check your program using the vectors and matrices generated previously.

8. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Merry Christmas and Happy New Year!

Appendix A  MPI functions used in this project

MPI_Allgatherv, MPI_Allgather, MPI_Gatherv, MPI_Gather, MPI_Scatterv, MPI_Scatter, MPI_Alltoallv, MPI_Dims_create, MPI_Cart_create, MPI_Cart_get, MPI_Cart_rank, MPI_Cart_coords, MPI_Cart_split.
More informationScalasca performance properties The metrics tour
Scalasca performance properties The metrics tour Markus Geimer m.geimer@fz-juelich.de Scalasca analysis result Generic metrics Generic metrics Time Total CPU allocation time Execution Overhead Visits Hardware
More informationChapter 8 Dense Matrix Algorithms
Chapter 8 Dense Matrix Algorithms (Selected slides & additional slides) A. Grama, A. Gupta, G. Karypis, and V. Kumar To accompany the text Introduction to arallel Computing, Addison Wesley, 23. Topic Overview
More informationMathematics and Computer Science
Technical Report TR-2006-010 Revisiting hypergraph models for sparse matrix decomposition by Cevdet Aykanat, Bora Ucar Mathematics and Computer Science EMORY UNIVERSITY REVISITING HYPERGRAPH MODELS FOR
More informationCPS 303 High Performance Computing
CPS 303 High Performance Computing Wensheng Shen Department of Computational Science SUNY Brockport Chapter 7: Communicators and topologies Communicators: a communicator is a collection of processes that
More informationCollective Communications II
Collective Communications II Ned Nedialkov McMaster University Canada SE/CS 4F03 January 2014 Outline Scatter Example: parallel A b Distributing a matrix Gather Serial A b Parallel A b Allocating memory
More informationPractical Scientific Computing: Performanceoptimized
Practical Scientific Computing: Performanceoptimized Programming Advanced MPI Programming December 13, 2006 Dr. Ralf-Peter Mundani Department of Computer Science Chair V Technische Universität München,
More informationMessage Passing with MPI
Message Passing with MPI PPCES 2016 Hristo Iliev IT Center / JARA-HPC IT Center der RWTH Aachen University Agenda Motivation Part 1 Concepts Point-to-point communication Non-blocking operations Part 2
More informationNumerical Algorithms
Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0
More informationMatrix Multiplication
Matrix Multiplication Nur Dean PhD Program in Computer Science The Graduate Center, CUNY 05/01/2017 Nur Dean (The Graduate Center) Matrix Multiplication 05/01/2017 1 / 36 Today, I will talk about matrix
More informationMPI: A Message-Passing Interface Standard
MPI: A Message-Passing Interface Standard Version 2.1 Message Passing Interface Forum June 23, 2008 Contents Acknowledgments xvl1 1 Introduction to MPI 1 1.1 Overview and Goals 1 1.2 Background of MPI-1.0
More informationCINES MPI. Johanne Charpentier & Gabriel Hautreux
Training @ CINES MPI Johanne Charpentier & Gabriel Hautreux charpentier@cines.fr hautreux@cines.fr Clusters Architecture OpenMP MPI Hybrid MPI+OpenMP MPI Message Passing Interface 1. Introduction 2. MPI
More informationThe Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing
The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task
More informationMore MPI. Bryan Mills, PhD. Spring 2017
More MPI Bryan Mills, PhD Spring 2017 MPI So Far Communicators Blocking Point- to- Point MPI_Send MPI_Recv CollecEve CommunicaEons MPI_Bcast MPI_Barrier MPI_Reduce MPI_Allreduce Non-blocking Send int MPI_Isend(
More informationMPI 5. CSCI 4850/5850 High-Performance Computing Spring 2018
MPI 5 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationL17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011!
L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! Administrative Class cancelled, Tuesday, November 15 Guest Lecture, Thursday, November 17, Ganesh Gopalakrishnan CUDA Project
More informationIntroduction to Parallel Programming Message Passing Interface Practical Session Part I
Introduction to Parallel Programming Message Passing Interface Practical Session Part I T. Streit, H.-J. Pflug streit@rz.rwth-aachen.de October 28, 2008 1 1. Examples We provide codes of the theoretical
More informationDistributed Memory Parallel Programming
COSC Big Data Analytics Parallel Programming using MPI Edgar Gabriel Spring 201 Distributed Memory Parallel Programming Vast majority of clusters are homogeneous Necessitated by the complexity of maintaining
More informationIntroduction to Parallel. Programming
University of Nizhni Novgorod Faculty of Computational Mathematics & Cybernetics Introduction to Parallel Section 9. Programming Parallel Methods for Solving Linear Systems Gergel V.P., Professor, D.Sc.,
More informationMPI Casestudy: Parallel Image Processing
MPI Casestudy: Parallel Image Processing David Henty 1 Introduction The aim of this exercise is to write a complete MPI parallel program that does a very basic form of image processing. We will start by
More informationIn the simplest sense, parallel computing is the simultaneous use of multiple computing resources to solve a problem.
1. Introduction to Parallel Processing In the simplest sense, parallel computing is the simultaneous use of multiple computing resources to solve a problem. a) Types of machines and computation. A conventional
More informationCOMMUNICATION IN HYPERCUBES
PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)
More informationHigh Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore
High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on
More informationCS 6230: High-Performance Computing and Parallelization Introduction to MPI
CS 6230: High-Performance Computing and Parallelization Introduction to MPI Dr. Mike Kirby School of Computing and Scientific Computing and Imaging Institute University of Utah Salt Lake City, UT, USA
More informationL19: Putting it together: N-body (Ch. 6)!
Administrative L19: Putting it together: N-body (Ch. 6)! November 22, 2011! Project sign off due today, about a third of you are done (will accept it tomorrow, otherwise 5% loss on project grade) Next
More informationCopyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 18. Combining MPI and OpenMP
Chapter 18 Combining MPI and OpenMP Outline Advantages of using both MPI and OpenMP Case Study: Conjugate gradient method Case Study: Jacobi method C+MPI vs. C+MPI+OpenMP Interconnection Network P P P
More informationCS 426. Building and Running a Parallel Application
CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations
More informationPart 4. Decomposition Algorithms Dantzig-Wolf Decomposition Algorithm
In the name of God Part 4. 4.1. Dantzig-Wolf Decomposition Algorithm Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Introduction Real world linear programs having thousands of rows and columns.
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More information15. The Software System ParaLab for Learning and Investigations of Parallel Methods
15. The Software System ParaLab for Learning and Investigations of Parallel Methods 15. The Software System ParaLab for Learning and Investigations of Parallel Methods... 1 15.1. Introduction...1 15.2.
More informationMessage Passing Interface
Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented
More informationStandard MPI - Message Passing Interface
c Ewa Szynkiewicz, 2007 1 Standard MPI - Message Passing Interface The message-passing paradigm is one of the oldest and most widely used approaches for programming parallel machines, especially those
More informationHomework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization
ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor
More informationLecture 4: Principles of Parallel Algorithm Design (part 4)
Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction
More informationThe Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs
1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx
More informationEFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI
EFFICIENT SOLVER FOR LINEAR ALGEBRAIC EQUATIONS ON PARALLEL ARCHITECTURE USING MPI 1 Akshay N. Panajwar, 2 Prof.M.A.Shah Department of Computer Science and Engineering, Walchand College of Engineering,
More informationBasic Communication Ops
CS 575 Parallel Processing Lecture 5: Ch 4 (GGKK) Sanjay Rajopadhye Colorado State University Basic Communication Ops n PRAM, final thoughts n Quiz 3 n Collective Communication n Broadcast & Reduction
More informationParallel Computing and the MPI environment
Parallel Computing and the MPI environment Claudio Chiaruttini Dipartimento di Matematica e Informatica Centro Interdipartimentale per le Scienze Computazionali (CISC) Università di Trieste http://www.dmi.units.it/~chiarutt/didattica/parallela
More informationHPC Parallel Programing Multi-node Computation with MPI - I
HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright
More information6.001 Notes: Section 4.1
6.001 Notes: Section 4.1 Slide 4.1.1 In this lecture, we are going to take a careful look at the kinds of procedures we can build. We will first go back to look very carefully at the substitution model,
More informationOutline. Communication modes MPI Message Passing Interface Standard
MPI THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking
More informationRecap of Parallelism & MPI
Recap of Parallelism & MPI Chris Brady Heather Ratcliffe The Angry Penguin, used under creative commons licence from Swantje Hess and Jannis Pohlmann. Warwick RSE 13/12/2017 Parallel programming Break
More information20 Dynamic allocation of memory: malloc and calloc
20 Dynamic allocation of memory: malloc and calloc As noted in the last lecture, several new functions will be used in this section. strlen (string.h), the length of a string. fgets(buffer, max length,
More informationLecture 16: Sorting. CS178: Programming Parallel and Distributed Systems. April 2, 2001 Steven P. Reiss
Lecture 16: Sorting CS178: Programming Parallel and Distributed Systems April 2, 2001 Steven P. Reiss I. Overview A. Before break we started talking about parallel computing and MPI 1. Basic idea of a
More informationLinear systems of equations
Linear systems of equations Michael Quinn Parallel Programming with MPI and OpenMP material do autor Terminology Back substitution Gaussian elimination Outline Problem System of linear equations Solve
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationParallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville
Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information
More informationParallel Numerical Algorithms
Parallel Numerical Algorithms Chapter 3 Dense Linear Systems Section 3.1 Vector and Matrix Products Michael T. Heath and Edgar Solomonik Department of Computer Science University of Illinois at Urbana-Champaign
More informationTopologies in MPI. Instructor: Dr. M. Taufer
Topologies in MPI Instructor: Dr. M. Taufer WS2004/2005 Topology We can associate additional information (beyond the group and the context) to a communicator. A linear ranking of processes may not adequately
More informationThe Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs
1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx
More informationCSE. Parallel Algorithms on a cluster of PCs. Ian Bush. Daresbury Laboratory (With thanks to Lorna Smith and Mark Bull at EPCC)
Parallel Algorithms on a cluster of PCs Ian Bush Daresbury Laboratory I.J.Bush@dl.ac.uk (With thanks to Lorna Smith and Mark Bull at EPCC) Overview This lecture will cover General Message passing concepts
More informationProject and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi
Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi Lecture - 8 Consistency and Redundancy in Project networks In today s lecture
More informationLecture 7. Revisiting MPI performance & semantics Strategies for parallelizing an application Word Problems
Lecture 7 Revisiting MPI performance & semantics Strategies for parallelizing an application Word Problems Announcements Quiz #1 in section on Friday Midterm Room: SSB 106 Monday 10/30, 7:00 to 8:20 PM
More informationContents. F10: Parallel Sparse Matrix Computations. Parallel algorithms for sparse systems Ax = b. Discretized domain a metal sheet
Contents 2 F10: Parallel Sparse Matrix Computations Figures mainly from Kumar et. al. Introduction to Parallel Computing, 1st ed Chap. 11 Bo Kågström et al (RG, EE, MR) 2011-05-10 Sparse matrices and storage
More informationLast Time. Intro to Parallel Algorithms. Parallel Search Parallel Sorting. Merge sort Sample sort
Intro to MPI Last Time Intro to Parallel Algorithms Parallel Search Parallel Sorting Merge sort Sample sort Today Network Topology Communication Primitives Message Passing Interface (MPI) Randomized Algorithms
More informationa. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU?
CPS 540 Fall 204 Shirley Moore, Instructor Test November 9, 204 Answers Please show all your work.. Draw a sketch of the extended von Neumann architecture for a 4-core multicore processor with three levels
More informationExercises: Message-Passing Programming
T H U I V R S I T Y O H F R G I U xercises: Message-Passing Programming Hello World avid Henty. Write an MPI program which prints the message Hello World. ompile and run on one process. Run on several
More informationCPS343 Parallel and High Performance Computing Project 1 Spring 2018
CPS343 Parallel and High Performance Computing Project 1 Spring 2018 Assignment Write a program using OpenMP to compute the estimate of the dominant eigenvalue of a matrix Due: Wednesday March 21 The program
More information