Master MICS: Parallel Computing Lecture Project
C/MPI: Matrix-Vector Multiplication
Sebastien Varrette <Sebastien.Varrette@uni.lu>

Matrix-vector multiplication is embedded in many algorithms solving a wide variety of problems. This assignment studies three different ways to carry out, in parallel, the multiplication of a dense matrix A ∈ M_{m,n}(K) by a vector b ∈ K^n. More precisely, the purpose is to compute the vector c ∈ K^m such that:

    A.b = [ a_{0,0}    a_{0,1}    ...  a_{0,n-1}   ] [ b_0     ]   [ c_0     ]
          [ a_{1,0}    a_{1,1}    ...  a_{1,n-1}   ] [ b_1     ] = [ c_1     ]
          [ ...                                    ] [ ...     ]   [ ...     ]
          [ a_{m-1,0}  a_{m-1,1}  ...  a_{m-1,n-1} ] [ b_{n-1} ]   [ c_{m-1} ]

where, for all i ∈ [0, m-1], c_i = Σ_{j=0}^{n-1} a_{i,j} b_j.

We will assume that both matrices and vectors are stored in files respecting the following format:

  Matrix A = (a_{i,j}) ∈ M_{m,n}(K): [m, n, a_{0,0}, a_{0,1}, ..., a_{m-1,n-1}]
  Vector v ∈ K^n: [n, v_0, ..., v_{n-1}]

Exercise 1  Matrix-vector I/O and sequential algorithm

1. Implement the functions handling the allocation and freeing of matrices and vectors whose elements have a size of size bytes, i.e.:

   void *vector_alloc(size_t n, size_t size);
   void vector_free(void *v);
   void **matrix_alloc(size_t m, size_t n, size_t size);
   void matrix_free(void **M);

   A vector is allocated just like a normal array (using calloc, for instance). A matrix A ∈ M_{m,n}(K) will be allocated in three steps:
   (a) the memory for the m × n matrix values is allocated (associated with the pointer Astorage);
   (b) the memory for the m row pointers is allocated, and A points to the beginning of this block of memory;
   (c) the values of A[0], ..., A[m-1] are initialized to &Astorage[0], ..., &Astorage[(m-1)*n], so that A[i][j] corresponds to the matrix element a_{i,j}.
   This is illustrated in Figure 1.

2. Implement the following functions, which store in the file f the vector v (resp. the matrix M) having n (resp. m × n) elements, each of size size bytes. You may want to use fwrite in those functions.
   void store_vector(char *f, size_t n, size_t size, void *v);
   void store_matrix(char *f, size_t m, size_t n, size_t size, void *M);

3. Implement the functions which read the file f to allocate and populate the vector v (resp. the matrix M) with the n (resp. m × n) elements of size size bytes from the file. fread is probably your best friend here.

   void read_vector(char *f, size_t size, size_t *n, void **v);
   void read_matrix(char *f, size_t size, size_t *m, size_t *n, void ***M);

4. Implement the following printing functions (use type to provide the correct format and cast in printf):

   void print_vector_lf(void *v, MPI_Datatype type, size_t n);
   void print_matrix_lf(void *M, MPI_Datatype type, size_t m, size_t n);

5. Implement a program that asks the user to enter the values m and n, then generates and stores (with appropriate i, m and n):
   - the base vectors e_i ∈ R^n (file base_i_vector_n.dat), where the ith element is 1 and the others 0;
   - a random vector v ∈ R^n (file random_vector_n.dat);
   - the identity matrix ID (file id_matrix_m_n.dat) such that id_{i,i} = 1 for 0 ≤ i < min(m, n) and id_{i,j} = 0 for i ≠ j;
   - a random matrix A ∈ M_{m,n}(K) (file random_matrix_m_n.dat).

6. Implement the sequential algorithm that performs the matrix-vector multiplication over elements of type double through the following function, which computes c = M.b:

   void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c);

   Check your program using the vectors and matrices generated previously.

7. What is the complexity of the sequential matrix-vector multiplication, in the general case and when m = n?

Figure 1: Matrix allocation in three steps (Astorage holds the m*n*size value bytes; A holds the m row pointers).

We use a domain decomposition strategy to develop the parallel algorithm. There are three straightforward ways to decompose an m × n matrix A: rowwise
block striping, columnwise block striping and the checkerboard block decomposition. Each approach (after agglomeration and mapping) is illustrated in Figure 2. We will also assume that a single process is responsible for Input/Output.

Figure 2: Three ways to decompose and map a matrix having m = 8 rows and n = 9 columns over 4 processes: rowwise block-striped, columnwise block-striped and checkerboard block decomposition.

Exercise 2  Block data decomposition

Whatever kind of decomposition we consider, a strategy for distributing the primitive tasks among the processors must be defined. We propose here to analyse the properties of a classical block allocation scheme. Let k be a number of elements (indexed from 0 to k − 1) to be distributed over p processes (indexed from 0 to p − 1). We assume that the process P_i controls the elements of index between ⌊ik/p⌋ and ⌊(i+1)k/p⌋ − 1.

1. Show that the process controlling a particular element j is: i = ⌊(p(j+1) − 1)/k⌋.

2. Show that the last process (i.e. P_{p−1}) is responsible for ⌈k/p⌉ elements.

In the sequel, we will assume the following macros to be defined:

   #define BLOCK_LOW(id,p,k)  ((id)*(k)/(p))
   #define BLOCK_HIGH(id,p,k) (BLOCK_LOW((id)+1,p,k)-1)
   #define BLOCK_SIZE(id,p,k) (BLOCK_HIGH(id,p,k)-BLOCK_LOW(id,p,k)+1)
   #define BLOCK_OWNER(j,p,k) (((p)*((j)+1)-1)/(k))

3. Use those macros in the following function, which initializes the count and displacement arrays needed to scatter/gather a buffer of k data items on/from p processes, when the number of elements sent/received to/from other processes varies:

   void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp);

In other words, after the call to this function, count[i] is the number of elements sent/received to/from process i (i.e. BLOCK_SIZE(i,p,k)), while disp[i] gives the starting point in the buffer of the elements sent/received to/from process i (i.e. disp[i-1] + count[i-1], with disp[0] = 0).
4. Similarly, implement the following function, which initializes the count and displacement arrays needed by a process id when it is supposed to get the same number of elements from every other process to populate a buffer of k data items:

   void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp);

In other words, after the call to this function, count[i] is the number of elements received from process i (i.e. BLOCK_SIZE(id,p,k)), while disp[i] gives the starting point in the buffer of the elements sent/received by process i (i.e. disp[i-1] + count[i-1], with disp[0] = 0).

Exercise 3  Vector decomposition strategies

As for the vectors, there are two natural ways to distribute them among the processes: a block decomposition as proposed in Exercise 2, or a full replication, meaning that all vector elements are copied on all the processes.

Explain why this second approach is acceptable in the context of matrix-vector multiplication.

Finally, we consider here three ways of partitioning the matrix and two ways of distributing the vectors, leading to six combinations. We will limit our investigation to three of them: rowwise block striping with replicated vectors (Exercise 4), columnwise block striping with block-decomposed vectors (Exercise 5) and the checkerboard block decomposition with block-decomposed vectors over the first column of processes (Exercise 6).

Figure 3: Operations performed by a primitive task i in a rowwise block-striped decomposition: inner product of A[i] with the replicated b, then an all-gather communication of the c[i] pieces.

Exercise 4  Parallel version using Rowwise Block-Striped Decomposition

In this exercise, we associate a primitive task with each row of the matrix A. Vectors b and c are replicated among the primitive tasks, so the memory should
be allocated for entire vectors on each processor. It follows that task i has all the elements required to compute the inner product leading to the element c_i of c. Once this is done, the processes must concatenate their pieces of the vector c into a complete vector. This is done through the MPI_Allgatherv function (see Appendix A). The sequence of all operations is summarized in Figure 3.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many rows are assigned to each process? Which process is advised for I/O, and why (1)?

2. Complexity analysis. For simplicity, we assume m = n and do not take into account the reading of the matrix/vector files. Let χ be the time needed to compute a single iteration of the loop performing the inner product (χ can be determined by dividing the execution time of the sequential algorithm by n^2). What is the expected time for the computational portion of the parallel program?

We assume each vector element to be a double occupying 8 bytes. Each process being responsible for around n/p elements, an all-gather communication requires each process to send ⌈log p⌉ messages, the first one having length 8n/p, the second 2 · 8n/p, etc. Let λ be the latency of each message and β be the bandwidth, i.e. the number of data items that can be sent down a channel in one unit of time (2).

Show that the expected communication time for the all-gather step is O(log p + n).

Conclude by giving the complexity of this parallel algorithm.

Bonus: Compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

3. As mentioned before, a single process j is responsible for I/O.
Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size BLOCK_SIZE(id,p,*m) × *n is allocated; (c) process j reads and sends (using MPI_Send) the rows associated with each process (except itself); (d) the other processes receive their matrix part from process j (using MPI_Recv):

   void read_row_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function in which the I/O process j reads the file f, broadcasts the dimension of the vector, then reads and broadcasts the vector elements:

(1) Hint: you may refer to the second question of Exercise 2.
(2) Sending a message containing k data items then requires time λ + k/β.
   void read_vector_and_replicate(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Assuming elements of type double, implement the functions that print a matrix (resp. a vector) distributed in row-striped fashion among the processes in a communicator:

   void print_row_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
   void print_row_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel matrix-vector multiplication algorithm based on the rowwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Figure 4: Operations performed by a primitive task j in a columnwise block-striped decomposition: n partial products of column j of A by b[j], an all-to-all exchange of the partial results, then a sum (reduction) of the received terms yielding c[j].

Exercise 5  Parallel version using Columnwise Block-Striped Decomposition

In this exercise, we assume m = n (for simplification) but keep the distinction in notation when possible. This hypothesis simplifies the all-to-all step and the decomposition. Here a primitive task j is associated with the jth column of the matrix A and the element j of the vectors b and c. Recall that the elements of vector c are given by:

   c_0     = a_{0,0} b_0   + a_{0,1} b_1   + ... + a_{0,n-1} b_{n-1}
   c_1     = a_{1,0} b_0   + a_{1,1} b_1   + ... + a_{1,n-1} b_{n-1}
   ...
   c_{m-1} = a_{m-1,0} b_0 + a_{m-1,1} b_1 + ... + a_{m-1,n-1} b_{n-1}

The computation begins with each task j multiplying its column of A by b_j, resulting in a vector of partial results (corresponding to a_{i,j} b_j for i ∈ [0, m−1]).
Note that the product a_{i,j} b_j is the jth term of the inner product leading to c_i. In other words, the m multiplications performed by task j yield m terms, of which only the jth is relevant for the computation of c_j (handled by this task). It follows that task j needs to distribute the m − 1 = n − 1 result terms it does not need to the other processes, and to collect the n − 1 results it does need from them. This is done through an all-to-all exchange, performed in MPI with the function MPI_Alltoallv (see Appendix A). Once this is done, task j is able to compute c_j by adding every term of the inner product. The sequence of all operations is summarized in Figure 4.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many columns/parts of the vectors are assigned to each process? Which process is advised for I/O, and why (3)?

2. Complexity analysis. Again, we do not take into account the reading of the matrix/vector files. Let us take again χ, λ and β as defined in Exercise 4. We also consider elements of type double occupying 8 bytes. After the initial multiplications, process j owns a vector partial_c of size n. Through the all-to-all exchange, process j receives p − 1 partial vectors from the other tasks, each of size BLOCK_SIZE(j,p,n), and performs an element-by-element addition (a reduction) to obtain the portion of c handled by this process. What is the expected time for the computational portion of the parallel program?

The algorithm performs an all-to-all exchange of partially computed portions of vector c. There are two common ways to handle such a communication: (1) in each of ⌈log p⌉ phases, all nodes exchange half of their accumulated data with the others; (2) each process sends directly to each of the other processes the elements destined to that process. This requires each process to send p − 1 messages, each of size n/p.
Show that the communication complexity of approach 1 is ⌈log p⌉ λ + ⌈log p⌉ n / (2β).

Show that the communication complexity of approach 2 is (p − 1) λ + (p − 1) n / (p β).

Which approach is the most suited for short messages? For long messages?

Conclude by giving the complexity of this parallel algorithm in both cases.

Bonus: Compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

(3) Hint: you may refer to the second question of Exercise 2.
3. As mentioned before, a single process j is responsible for I/O. Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size *m × BLOCK_SIZE(id,p,*n) is allocated; (c) process j reads in the matrix one row at a time and distributes each row among the other processes (using MPI_Scatterv, cf. Appendix A):

   void read_col_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm comm);

4. Similarly, implement the following function in which (a) the I/O process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimension n of the vector; (b) the vector *v of size BLOCK_SIZE(id,p,*n) is allocated; (c) process j reads and sends (using MPI_Send) the portion associated with each process (except itself); (d) the other processes receive their vector part from process j (using MPI_Recv):

   void read_block_vector(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Implement the functions that print a matrix (resp. a vector) distributed in column-striped fashion among the processes in a communicator. In order to print the values of each row in the correct order, a single process must gather (using MPI_Gatherv, see Appendix A) the elements of that row from the entire set of processes (4).

   void print_col_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
   void print_col_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel matrix-vector multiplication algorithm based on the columnwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Exercise 6  Parallel version using Checkerboard Block Decomposition

Here a primitive task is associated with each element of the matrix.
The task responsible for a_{i,j} multiplies it by b_j, yielding d_{i,j}. Each element c_i of the result vector is then Σ_{j=0}^{n-1} d_{i,j}. The tasks are agglomerated into rectangular blocks of approximately the same size, assigned to a two-dimensional grid of processes (see Figure 2). We assume that the vector b is distributed by blocks among the tasks of the first column of processes. The first step is then to redistribute b among all processes, then to perform a matrix-vector multiplication on each process (between the owned block and the corresponding subpart of b). Finally, each row of tasks performs a sum-reduction of the result vectors, creating vector c.

(4) Hence the dataflow of this function is opposite to that of the functions read_col_matrix and read_col_vector.
Figure 5: Operations performed by a primitive task in a checkerboard block decomposition: redistribute b, matrix-vector multiplication on each block, then a reduction across rows of the partial c vectors.

After the reduction, the result vector c is distributed among the tasks in the first column of the process grid. The sequence of all operations is summarized in Figure 5.

Vector Redistribution. Assume that the p processes are divided over a k × l grid. If k = l = √p, the redistribution of b is easier and is operated as illustrated in Figure 6(a). In the general case where k ≠ l, the redistribution is more complicated, and is detailed in Figure 6(b).

1. Parallel computation analysis. For the sake of simplicity, we assume here that m = n and that p is a square number, so that the processes are arranged on a square grid. Each process is responsible for a matrix block of size at most ⌈n/√p⌉ × ⌈n/√p⌉. What is the expected time for the computational portion of the parallel program? Again, we do not take into account the reading of the matrix/vector files.

Show that the expected communication time is O((n log p)/√p).

Conclude by giving the complexity of this parallel algorithm.

Bonus: Compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

2. This exercise requires the creation of communicators respecting a Cartesian topology, in order to create a virtual mesh of processes that is as close to square as possible (for maximal scalability). The first step is to determine the number of nodes in each dimension of the grid. This is handled by the function MPI_Dims_create (see Appendix A). With the result of this function, you can use the function MPI_Cart_create to create a communicator with a Cartesian topology. Note that, once the communicator is created, the function MPI_Cart_get helps to retrieve information about it. Assuming grid_comm to be a communicator with Cartesian topology, write the functions that partition the process grid into columns (resp.
rows) relative to the process id. After a call to these functions, col_comm (resp. row_comm) should be a communicator containing the calling process and all the other processes in the same column (resp. row) of the process grid, but no others.

   void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm);
   void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm);

Figure 6: Redistribution of vector b (a) when the process grid is square: task (i,0) sends its portion of b to task (0,i), then each task (0,i) broadcasts its portion to tasks (*,i); and (b) when the process grid is not square: task (0,0) gathers b from tasks (*,0), scatters b over tasks (0,*), then each task (0,i) broadcasts its portion to tasks (*,i).

3. Suppose again that grid_comm is a communicator with Cartesian topology. Write a code segment illustrating how the function read_block_vector (implemented in Exercise 5.4) can be used to open a file containing the vector b and distribute it among the first column of processes in grid_comm.

4. The distribution pattern of the matrix is similar to the one used for the function read_col_matrix (implemented in Exercise 5.3), except that instead of scattering each matrix row among all the processes, we must scatter it among a subset of the processes: those occupying a single row in the virtual process grid. As always, we assume a single process to be responsible for matrix I/O, yet the choice does not really matter in this case. Let us select P_0 for this task. Implement the following function in which (a) P_0 opens the file f, reads and broadcasts (using MPI_Bcast)
the dimensions of the matrix; (b) each process determines the size of the submatrix it is responsible for (using MPI_Cart_get in particular) and allocates it; (c) P_0 reads in the matrix one row at a time and sends it to the process in the first column of the correct row of the virtual process grid; (d) after the receiving process gets the matrix row, it scatters it among the other processes belonging to the same row of the virtual process grid:

   void read_grid_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void ***M, MPI_Comm grid_comm);

5. Write the function that redistributes the vector b among the processes following the operations illustrated in Figure 6:

   void redistribute_vector(void **v, MPI_Datatype type, size_t n, MPI_Comm grid_comm);

Initially, vector b is distributed among the first column of the k × l virtual process grid. The process at grid location (i, 0) is responsible for BLOCK_SIZE(i,k,n) elements of b, beginning with the element of index BLOCK_LOW(i,k,n). After the redistribution, every process in column j of the grid is responsible for BLOCK_SIZE(j,l,n) elements of b, beginning with the element of index BLOCK_LOW(j,l,n). You may rely on two sub-functions dealing with the specific cases where k = l = √p and k ≠ l.

6. Implement the printing function:

   void print_checkerboard_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

7. Finally, write a program that implements the parallel matrix-vector multiplication algorithm based on the checkerboard block decomposition. Check your program using the vectors and matrices generated previously.

8. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Merry Christmas and Happy New Year!
A  MPI functions used in this exercise

See:

   MPI_Allgatherv:  http://mpi.deino.net/mpi_functions/mpi_allgatherv.html
   MPI_Allgather:   http://mpi.deino.net/mpi_functions/mpi_allgather.html
   MPI_Gatherv:     http://mpi.deino.net/mpi_functions/mpi_gatherv.html
   MPI_Gather:      http://mpi.deino.net/mpi_functions/mpi_gather.html
   MPI_Scatterv:    http://mpi.deino.net/mpi_functions/mpi_scatterv.html
   MPI_Scatter:     http://mpi.deino.net/mpi_functions/mpi_scatter.html
   MPI_Alltoallv:   http://mpi.deino.net/mpi_functions/mpi_alltoallv.html
   MPI_Dims_create: http://mpi.deino.net/mpi_functions/mpi_dims_create.html
   MPI_Cart_create: http://mpi.deino.net/mpi_functions/mpi_cart_create.html
   MPI_Cart_get:    http://mpi.deino.net/mpi_functions/mpi_cart_get.html
   MPI_Cart_rank:   http://mpi.deino.net/mpi_functions/mpi_cart_rank.html
   MPI_Cart_coords: http://mpi.deino.net/mpi_functions/mpi_cart_coords.html
   MPI_Cart_sub:    http://mpi.deino.net/mpi_functions/mpi_cart_sub.html