Project C/MPI: Matrix-Vector Multiplication


Master MICS: Parallel Computing Lecture
Sebastien Varrette <Sebastien.Varrette@uni.lu>

Matrix-vector multiplication is embedded in many algorithms for solving a wide variety of problems. This assignment aims at studying three different ways to carry out, in parallel, the multiplication of a dense matrix $A \in \mathcal{M}_{m,n}(K)$ by a vector $b \in K^n$. More precisely, the purpose is to output the vector $c \in K^m$ such that $A \cdot b = c$:

    \begin{pmatrix}
      a_{0,0}   & a_{0,1}   & \dots  & a_{0,n-1}   \\
      a_{1,0}   & a_{1,1}   & \dots  & a_{1,n-1}   \\
      \vdots    & \vdots    & \ddots & \vdots      \\
      a_{m-1,0} & a_{m-1,1} & \dots  & a_{m-1,n-1}
    \end{pmatrix}
    \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{n-1} \end{pmatrix}
    =
    \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix}
    \qquad\text{where } \forall i \in [0, m-1],\; c_i = \sum_{j=0}^{n-1} a_{i,j}\, b_j

We will assume that both matrices and vectors are stored in files that respect the following format:

  - Matrix $A = (a_{i,j}) \in \mathcal{M}_{m,n}(K)$: [m, n, a_{0,0}, a_{0,1}, ..., a_{m-1,n-1}]
  - Vector $v \in K^n$: [n, v_0, ..., v_{n-1}]

Exercise 1  Matrix-vector I/O and sequential algorithm

1. Implement the functions handling the allocation and freeing of vectors and matrices whose elements occupy size bytes each:

       void vector_alloc(size_t n, size_t size);
       void vector_free(void *v);
       void matrix_alloc(size_t m, size_t n, size_t size);
       void matrix_free(void *M);

   A vector is allocated just like a normal array (using calloc, for instance). A matrix $A \in \mathcal{M}_{m,n}(K)$ is allocated in three steps:

   (a) the memory for the m*n matrix values is allocated (associated with the pointer Astorage);
   (b) the memory for the m row pointers is allocated, and A points to the beginning of this block of memory;
   (c) the values of A[0], ..., A[m-1] are initialized to &Astorage[0], ..., &Astorage[(m-1)*n] so that A[i][j] corresponds to the matrix element $a_{i,j}$.

   This is illustrated in Figure 1.
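As an illustration, below is a minimal sketch of this three-step allocation. The return convention (the row-pointer array is returned directly, with the contiguous storage reachable through A[0]) and the _sketch names are assumptions of this example; your own prototypes may differ.

    #include <stdlib.h>

    /* Minimal sketch of the three-step matrix allocation (Figure 1).
     * Assumption: the array of row pointers is returned and the contiguous
     * storage block is reachable through A[0] (A[0] == Astorage).          */
    void **matrix_alloc_sketch(size_t m, size_t n, size_t size)
    {
        /* Step 1: one contiguous block for the m*n matrix values. */
        void *Astorage = calloc(m * n, size);

        /* Step 2: the block of m row pointers; A points to its beginning. */
        void **A = calloc(m, sizeof(void *));

        /* Step 3: A[i] points to the start of row i inside Astorage, so that
         * (after the proper cast) A[i][j] is the matrix element a_{i,j}.     */
        for (size_t i = 0; i < m; i++)
            A[i] = (char *)Astorage + i * n * size;

        return A;
    }

    void matrix_free_sketch(void **A)
    {
        if (A != NULL) {
            free(A[0]);   /* releases Astorage */
            free(A);      /* releases the row pointers */
        }
    }

Freeing then only requires releasing the contiguous storage (reachable as A[0]) and the row-pointer array.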

2. Implement the following functions, which store in the file f the vector v (resp. the matrix M) having n (resp. m*n) elements, each of size size bytes. You may want to use fwrite in these functions.

       void store_vector(char *f, size_t n, size_t size, void *v);
       void store_matrix(char *f, size_t m, size_t n, size_t size, void *M);

3. Implement the functions which read the file f to allocate and populate the vector v (resp. the matrix M) with the n (resp. m*n) elements of size size bytes contained in the file. fread is probably your best friend here.

       void read_vector(char *f, size_t size, size_t n, void *v);
       void read_matrix(char *f, size_t size, size_t m, size_t n, void *M);

4. Implement the following printing functions (use type to provide the correct format and cast in printf):

       void print_vector_lf(void *v, MPI_Datatype type, size_t n);
       void print_matrix_lf(void *M, MPI_Datatype type, size_t m, size_t n);

5. Implement a program that asks the user to enter the values m and n, then generates and stores (with appropriate i, m and n):

   - the base vectors $e_i \in \mathbb{R}^n$ (file base_i_vector_n.dat), whose ith element is 1 and all others 0;
   - a random vector $v \in \mathbb{R}^n$ (file random_vector_n.dat);
   - the identity matrix ID (file id_matrix_m_n.dat) such that $id_{i,i} = 1$ for $0 \le i < \min(m,n)$ and $id_{i,j} = 0$ for $i \ne j$;
   - a random matrix $A \in \mathcal{M}_{m,n}(K)$ (file random_matrix_m_n.dat).

6. Implement the sequential algorithm that performs the matrix-vector multiplication over elements of type double, through the following function which computes c = M.b:

       void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c);

   Check your program using the vectors and matrices generated previously.

7. What is the complexity of the sequential matrix-vector multiplication, in the general case and when m = n?

[Figure 1: Matrix allocation in three steps.]
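For question 6, a minimal sketch of the sequential kernel is given below; it assumes M uses the row-pointer representation of Figure 1 (adapt the prototype if your conventions differ).

    #include <stddef.h>

    /* Sequential c = M.b for an m x n matrix of doubles.
     * Assumption: M is the array of row pointers built by the allocator,
     * so M[i][j] is the element a_{i,j}.                                 */
    void seq_matvec_mult_lf(double **M, double *b, size_t m, size_t n, double *c)
    {
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t j = 0; j < n; j++)     /* inner product of row i with b */
                sum += M[i][j] * b[j];
            c[i] = sum;
        }
    }

The doubly nested loop makes the operation count asked for in question 7 straightforward to read off.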

We use a domain decomposition strategy to develop the parallel algorithm. There are three straightforward ways to decompose an m x n matrix A: rowwise block striping, columnwise block striping and the checkerboard block decomposition. Each approach (after agglomeration and mapping) is illustrated in Figure 2. We will also assume that a single process is responsible for Input/Output.

[Figure 2: Three ways to decompose and map a matrix having m = 8 rows and n = 9 columns over 4 processes: rowwise block-striped decomposition, columnwise block-striped decomposition and checkerboard block decomposition.]

Exercise 2  Block data decomposition

Whatever kind of decomposition we consider, a strategy for distributing the primitive tasks among the processors must be defined. We propose here to analyse the properties of a classical block allocation scheme. Let k be a number of elements (indexed from 0 to k-1) to be distributed over p processes (indexed from 0 to p-1). We assume that process $P_i$ controls the elements with indices between $\lfloor ik/p \rfloor$ and $\lfloor (i+1)k/p \rfloor - 1$.

1. Show that the process controlling a particular element j is $i = \lfloor (p(j+1)-1)/k \rfloor$.

2. Show that the last process (i.e. $P_{p-1}$) is responsible for $\lceil k/p \rceil$ elements.

In the sequel, we will assume the following macros to be defined:

    #define BLOCK_LOW(id,p,k)   ((id)*(k)/(p))
    #define BLOCK_HIGH(id,p,k)  (BLOCK_LOW((id)+1,p,k)-1)
    #define BLOCK_SIZE(id,p,k)  (BLOCK_HIGH(id,p,k)-BLOCK_LOW(id,p,k)+1)
    #define BLOCK_OWNER(j,p,k)  (((p)*((j)+1)-1)/(k))

3. Use these macros in the following function, which initializes the count and displacement arrays needed to scatter/gather a buffer of k data items onto/from p processes when the number of elements sent/received to/from each process varies (a sketch is given after this question):

       void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp);

   In other words, after the call to this function, count[i] is the number of elements sent/received to/from process i (i.e. BLOCK_SIZE(i,p,k)) while disp[i] gives the starting position in the buffer of the elements sent/received to/from process i (i.e. disp[i-1] + count[i-1]).
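A possible sketch of this function, relying only on the macros above (count and disp are assumed to be caller-allocated arrays of p entries):

    /* count[i] = number of items in process i's block,
     * disp[i]  = offset of that block in the full buffer of k items. */
    void create_mixed_count_disp_arrays(int p, size_t k, int *count, int *disp)
    {
        for (int i = 0; i < p; i++) {
            count[i] = BLOCK_SIZE(i, p, k);
            disp[i]  = (i == 0) ? 0 : disp[i - 1] + count[i - 1];
        }
    }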

4. Similarly, implement the following function, which initializes the count and displacement arrays needed by a process id when it is supposed to get the same number of elements from every other process to populate its buffer of data items:

       void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp);

   In other words, after the call to this function, count[i] is the number of elements received from process i (i.e. BLOCK_SIZE(id,p,k)) while disp[i] gives the starting position in the buffer of the elements sent/received by process i (i.e. disp[i-1] + count[i-1]).
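The corresponding sketch only differs from the previous one by the block size used (that of the calling process id instead of that of process i):

    /* Every slot has the calling process's own block size; displacements
     * are again the running sum of the counts.                           */
    void create_count_disp_arrays(int id, int p, size_t k, int *count, int *disp)
    {
        for (int i = 0; i < p; i++) {
            count[i] = BLOCK_SIZE(id, p, k);
            disp[i]  = (i == 0) ? 0 : disp[i - 1] + count[i - 1];
        }
    }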

Exercise 3  Vector decomposition strategies

As for the vectors, there are two natural ways to distribute them among the processes: a block decomposition as proposed in Exercise 2, or a full replication, meaning that all vector elements are copied on all the processes. Explain why this second approach is acceptable in the context of matrix-vector multiplication.

Finally, we consider three ways of partitioning the matrix and two ways of distributing the vectors, which leads to six combinations. We will limit our investigation to three of them: rowwise block striping with replicated vectors (Exercise 4), columnwise block striping with block-decomposed vectors (Exercise 5) and the checkerboard block decomposition with the vectors block-decomposed over the first column of processes (Exercise 6).

[Figure 3: Operations performed by a primitive task i in a rowwise block-striped decomposition: inner product of A[i] with the replicated vector b, followed by an all-gather communication to rebuild c.]

Exercise 4  Parallel version using Rowwise Block-Striped Decomposition

In this exercise, we associate a primitive task with each row of the matrix A. Vectors b and c are replicated among the primitive tasks, so memory must be allocated for entire vectors on each processor. It follows that task i has all the elements required to compute the inner product leading to element $c_i$ of c. Once this is done, the processes must concatenate their pieces of the vector c into a complete vector. This is done with the MPI_Allgatherv function (see Appendix A). The sequence of operations is summarized in Figure 3.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many rows are assigned to each process? Which process should be in charge of I/O, and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. For simplicity, we assume m = n and do not take into account the reading of the matrix/vector files. Let $\chi$ be the time needed to compute a single iteration of the loop performing the inner product ($\chi$ can be determined by dividing the execution time of the sequential algorithm by $n^2$).

   - What is the expected time for the computational portion of the parallel program?
   - We assume each vector element to be a double occupying 8 bytes. Each process being responsible for roughly n/p elements, an all-gather communication requires each process to send $\log p$ messages, the first of length $8n/p$ bytes, the second of length $2 \cdot 8n/p$, and so on (doubling at each step). Let $\lambda$ be the latency of each message and $\beta$ the bandwidth, i.e. the number of data items that can be sent down a channel in one unit of time (sending a message containing k data items then requires time $\lambda + k/\beta$). Show that the expected communication time for the all-gather step is $O(\log p + n)$.
   - Conclude by giving the complexity of this parallel algorithm.
   - Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

3. As mentioned before, a single process j is responsible for I/O. Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size BLOCK_SIZE(id,p,*m) x *n is allocated on each process; (c) process j reads and sends (using MPI_Send) the rows associated with each process (except itself); (d) the other processes receive their part of the matrix from process j (using MPI_Recv). A sketch is given after question 4.

       void read_row_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void **M, MPI_Comm comm);

4. Similarly, implement the following function in which the I/O process j reads the file f, broadcasts the dimension of the vector, then reads and broadcasts the vector elements:

       void read_vector_and_replicate(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);
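A hedged sketch of read_row_matrix for the case dtype == MPI_DOUBLE is shown below. The choice of the last rank as I/O process (cf. question 1), the on-disk representation of the dimensions as size_t values, and the reuse of matrix_alloc_sketch and the BLOCK_* macros defined earlier are all assumptions of the sketch; error handling is omitted.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    void read_row_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n,
                         void **M, MPI_Comm comm)
    {
        int id, p;
        MPI_Comm_rank(comm, &id);
        MPI_Comm_size(comm, &p);
        int io = p - 1;                       /* assumed I/O process, cf. question 1 */
        FILE *fp = NULL;

        if (id == io) {                       /* (a) read the dimensions...          */
            fp = fopen(f, "r");
            fread(m, sizeof(size_t), 1, fp);
            fread(n, sizeof(size_t), 1, fp);
        }
        MPI_Bcast(m, sizeof(size_t), MPI_BYTE, io, comm);   /* ...and broadcast them */
        MPI_Bcast(n, sizeof(size_t), MPI_BYTE, io, comm);

        /* (b) each process allocates its BLOCK_SIZE(id,p,*m) x *n block */
        size_t local_rows = BLOCK_SIZE(id, p, *m);
        double **A = (double **)matrix_alloc_sketch(local_rows, *n, sizeof(double));
        *M = (void *)A;

        if (id == io) {
            double *row = malloc(*n * sizeof(double));
            for (int dest = 0; dest < p; dest++) {          /* (c) read and forward the rows */
                for (size_t r = 0; r < BLOCK_SIZE(dest, p, *m); r++) {
                    fread(row, sizeof(double), *n, fp);
                    if (dest == io)
                        memcpy(A[r], row, *n * sizeof(double));
                    else
                        MPI_Send(row, (int)*n, MPI_DOUBLE, dest, 0, comm);
                }
            }
            free(row);
            fclose(fp);
        } else {                                            /* (d) receive our rows */
            for (size_t r = 0; r < local_rows; r++)
                MPI_Recv(A[r], (int)*n, MPI_DOUBLE, io, 0, comm, MPI_STATUS_IGNORE);
        }
        (void)dtype;   /* this sketch handles doubles only */
    }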

5. Assuming elements of type double, implement the functions that print a matrix (resp. a vector) distributed in row-striped fashion among the processes of a communicator:

       void print_row_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
       void print_row_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel matrix-vector multiplication based on the rowwise block-striped decomposition (the computational core is sketched after question 7). Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.
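The computational core of the rowwise algorithm (question 6) can be sketched as follows. The function name and argument list are illustrative, and count/disp are assumed to come from create_mixed_count_disp_arrays applied to the m rows.

    #include <stdlib.h>
    #include <mpi.h>

    /* Rowwise block-striped multiplication, sketched for doubles.
     * local_A holds the BLOCK_SIZE(id,p,m) rows owned by this process;
     * b and c are full, replicated vectors of size n and m respectively. */
    void row_striped_matvec(double **local_A, double *b, double *c,
                            size_t m, size_t n, int *count, int *disp,
                            MPI_Comm comm)
    {
        int id, p;
        MPI_Comm_rank(comm, &id);
        MPI_Comm_size(comm, &p);

        size_t local_rows = BLOCK_SIZE(id, p, m);
        double *local_c = malloc(local_rows * sizeof(double));

        for (size_t i = 0; i < local_rows; i++) {     /* local inner products */
            local_c[i] = 0.0;
            for (size_t j = 0; j < n; j++)
                local_c[i] += local_A[i][j] * b[j];
        }

        /* concatenate the pieces of c on every process */
        MPI_Allgatherv(local_c, (int)local_rows, MPI_DOUBLE,
                       c, count, disp, MPI_DOUBLE, comm);
        free(local_c);
    }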

[Figure 4: Operations performed by a primitive task j in a columnwise block-striped decomposition: partial inner products of column j of A with b[j], an all-to-all exchange of the partial results, then a sum of the received partial results to obtain c[j].]

Exercise 5  Parallel version using Columnwise Block-Striped Decomposition

In this exercise we assume m = n (for simplification) but keep the distinction in the notation where possible. This hypothesis simplifies the all-to-all step and the decomposition. Here a primitive task j is associated with the jth column of the matrix A and with element j of the vectors b and c. Recall that the elements of vector c are given by:

    \begin{aligned}
      c_0     &= a_{0,0} b_0 + a_{0,1} b_1 + \dots + a_{0,n-1} b_{n-1} \\
      c_1     &= a_{1,0} b_0 + a_{1,1} b_1 + \dots + a_{1,n-1} b_{n-1} \\
              &\;\;\vdots \\
      c_{m-1} &= a_{m-1,0} b_0 + a_{m-1,1} b_1 + \dots + a_{m-1,n-1} b_{n-1}
    \end{aligned}

The computation begins with each task j multiplying its column of A by $b_j$, resulting in a vector of partial results (corresponding to $a_{i,j} b_j$ for $i \in [0, m-1]$). Note that the product $a_{i,j} b_j$ is the jth term of the inner product leading to $c_i$. In other words, the m multiplications performed by task j yield m terms, of which only the jth is relevant for the computation of $c_j$ (handled by this task). It follows that task j needs to distribute the m-1 = n-1 result terms it does not need to the other processes, and to collect the n-1 results it does need from them. This is done through an all-to-all exchange, performed in MPI with the function MPI_Alltoallv (see Appendix A). Once this is done, task j can compute $c_j$ by adding every term of the inner product. The sequence of operations is summarized in Figure 4, and a sketch of these phases is given after question 2.

1. Parallel algorithm analysis. Given the static number of tasks and the regular communication pattern, which mapping will you use for the parallel algorithm? How many columns / how large a part of the vectors are assigned to each process? Which process should be in charge of I/O, and why? (Hint: you may refer to the second question of Exercise 2.)

2. Complexity analysis. Again we do not take into account the reading of the matrix/vector files. Take again $\chi$, $\lambda$ and $\beta$ as defined in Exercise 4, and consider elements of type double occupying 8 bytes. After the initial multiplications, process j owns a vector partial_c of size n. Through the all-to-all exchange, process j receives p-1 partial vectors from the other tasks, each of size BLOCK_SIZE(j,p,n), and performs an element-by-element addition (a reduction) to obtain the portion of c handled by this process.

   - What is the expected time for the computational portion of the parallel program?
   - The algorithm performs an all-to-all exchange of partially computed portions of vector c. There are two common ways to handle such a communication: (1) in each of $\log p$ phases, all nodes exchange half of their accumulated data with their partner; (2) each process sends directly to each of the other processes the elements destined to that process, which requires each process to send p-1 messages, each of size n/p. Show that the communication complexity of approach 1 is $\lambda \log p + \frac{n}{2\beta} \log p$, and that of approach 2 is $(p-1)\lambda + (p-1)\frac{n}{p\beta}$. Which approach is best suited for short messages? For long messages?
   - Conclude by giving the complexity of this parallel algorithm in both cases.
   - Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?
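To fix ideas, here is a hedged sketch of the compute / exchange / reduce phases described above, for doubles and m = n. The function name is illustrative; send_cnt/send_disp are assumed to come from create_mixed_count_disp_arrays(p, n, ...) and recv_cnt/recv_disp from create_count_disp_arrays(id, p, n, ...).

    #include <stdlib.h>
    #include <mpi.h>

    /* Columnwise block-striped multiplication, sketched for doubles, m == n.
     * local_A holds n rows of BLOCK_SIZE(id,p,n) columns, local_b the matching
     * block of b, and local_c receives the block of c owned by this process.  */
    void col_striped_matvec(double **local_A, double *local_b, double *local_c,
                            size_t n, int *send_cnt, int *send_disp,
                            int *recv_cnt, int *recv_disp, MPI_Comm comm)
    {
        int id, p;
        MPI_Comm_rank(comm, &id);
        MPI_Comm_size(comm, &p);
        size_t local_cols = BLOCK_SIZE(id, p, n);

        /* 1. partial inner products over the owned columns */
        double *partial_c = calloc(n, sizeof(double));
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < local_cols; j++)
                partial_c[i] += local_A[i][j] * local_b[j];

        /* 2. all-to-all exchange: block i of partial_c goes to process i,
         *    and p partial blocks for our own rows come back              */
        double *recv_buf = malloc(p * local_cols * sizeof(double));
        MPI_Alltoallv(partial_c, send_cnt, send_disp, MPI_DOUBLE,
                      recv_buf,  recv_cnt, recv_disp, MPI_DOUBLE, comm);

        /* 3. reduction: sum the p partial contributions element by element */
        for (size_t i = 0; i < local_cols; i++) {
            local_c[i] = 0.0;
            for (int q = 0; q < p; q++)
                local_c[i] += recv_buf[(size_t)q * local_cols + i];
        }
        free(partial_c);
        free(recv_buf);
    }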

3. As mentioned before, a single process j is responsible for I/O. Implement the following function in which (a) process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) a matrix of size *m x BLOCK_SIZE(id,p,*n) is allocated on each process; (c) process j reads in the matrix one row at a time and distributes each row among the other processes (using MPI_Scatterv, cf. Appendix A).

       void read_col_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void **M, MPI_Comm comm);

4. Similarly, implement the following function in which (a) the I/O process j opens the file f, reads and broadcasts (using MPI_Bcast) the dimension n of the vector; (b) a vector of size BLOCK_SIZE(id,p,*n) is allocated on each process; (c) process j reads and sends (using MPI_Send) the portion associated with each process (except itself); (d) the other processes receive their part of the vector from process j (using MPI_Recv).

       void read_block_vector(char *f, MPI_Datatype dtype, size_t *n, void **v, MPI_Comm comm);

5. Implement the functions that print a matrix (resp. a vector) distributed in column-striped fashion among the processes of a communicator. In order to print the values of a row in the correct order, a single process must gather (using MPI_Gatherv, see Appendix A) the elements of that row from the entire set of processes. Hence the dataflow of these functions is the opposite of that of read_col_matrix and read_block_vector.

       void print_col_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm comm);
       void print_col_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

6. Finally, write a program that implements the parallel matrix-vector multiplication based on the columnwise block-striped decomposition. Check your program using the vectors and matrices generated previously.

7. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Exercise 6  Parallel version using Checkerboard Block Decomposition

Here a primitive task is associated with each element of the matrix. The task responsible for $a_{i,j}$ multiplies it by $b_j$, yielding $d_{i,j}$. Each element $c_i$ of the result vector is then $\sum_{j=0}^{n-1} d_{i,j}$. The tasks are agglomerated into rectangular blocks of approximately the same size, assigned to a two-dimensional grid of processes (see Figure 2). We assume that the vector b is initially distributed by blocks among the tasks of the first column of processes. The first step is therefore to redistribute b among all processes, then to perform a matrix-vector multiplication on each process (between the owned block and the corresponding subpart of b). Finally, each row of tasks performs a sum-reduction of the result vectors, creating vector c (a sketch of these two steps is given below). After the reduction, the result vector c is distributed among the tasks in the first column of the process grid. The sequence of operations is summarized in Figure 5.
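Once b has been redistributed, the two computation steps just described can be sketched as follows. Here row_comm is the per-row communicator built in question 2 below, it is assumed that rank 0 of each row_comm is the process in the first grid column, and all names are illustrative.

    #include <mpi.h>

    /* Checkerboard multiplication step, sketched for doubles.
     * local_A is the owned block (local_rows x local_cols), local_b the
     * matching sub-block of b (local_cols entries, already redistributed),
     * partial_c a scratch buffer and local_c the block of c (local_rows
     * entries), significant only on the first grid column after the call. */
    void checkerboard_matvec(double **local_A, double *local_b,
                             double *partial_c, double *local_c,
                             size_t local_rows, size_t local_cols,
                             MPI_Comm row_comm)
    {
        /* local matrix-vector product on the owned block */
        for (size_t i = 0; i < local_rows; i++) {
            partial_c[i] = 0.0;
            for (size_t j = 0; j < local_cols; j++)
                partial_c[i] += local_A[i][j] * local_b[j];
        }

        /* sum-reduction across the grid row; the root (rank 0 of row_comm,
         * i.e. the first grid column) obtains its block of c              */
        MPI_Reduce(partial_c, local_c, (int)local_rows, MPI_DOUBLE,
                   MPI_SUM, 0, row_comm);
    }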

[Figure 5: Operations performed in the checkerboard block decomposition: redistribute b, local matrix-vector multiplication on each block, then a sum-reduction of the partial results across each row of processes.]

Vector redistribution. Assume that the p processes are arranged on a $k \times l$ grid. If $k = l = \sqrt{p}$, the redistribution of b is easy and performed as illustrated in Figure 6(a). In the general case where $k \ne l$, the redistribution is more complicated and is detailed in Figure 6(b).

1. Parallel computation analysis. For the sake of simplicity, we assume here that m = n and that p is a square number, so that the processes are arranged on a square grid. Each process is responsible for a matrix block of size at most $\lceil n/\sqrt{p} \rceil \times \lceil n/\sqrt{p} \rceil$.

   - What is the expected time for the computational portion of the parallel program? (Again, we do not take into account the reading of the matrix/vector files.)
   - Show that the expected communication time is $O\!\left(\frac{n \log p}{\sqrt{p}}\right)$.
   - Conclude by giving the complexity of this parallel algorithm.
   - Bonus: compute the isoefficiency and the scalability function. Is this algorithm highly scalable?

2. The exercise requires the creation of communicators respecting a Cartesian topology, in order to build a virtual mesh of processes that is as close to square as possible (for maximal scalability). The first step is to determine the number of nodes in each dimension of the grid; this is handled by the function MPI_Dims_create (see Appendix A). With the result of this function, you can use MPI_Cart_create to create a communicator with a Cartesian topology. Note that, once it is created, the function MPI_Cart_get helps retrieve information about the Cartesian communicator.

   Assuming grid_comm to be a communicator with Cartesian topology, write the functions that partition the process grid into columns (resp. rows) relative to the process id. After a call to these functions, col_comm (resp. row_comm) should be a communicator containing the calling process and all other processes in the same column (resp. row) of the process grid, but no others. A possible sketch follows.

       void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm);
       void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm);
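One possible sketch splits grid_comm on the coordinates returned by MPI_Cart_coords (MPI_Cart_sub would be an equally valid alternative):

    #include <mpi.h>

    /* Processes sharing the same column coordinate end up in the same
     * col_comm, ranked by their row coordinate.                        */
    void create_col_comm(MPI_Comm grid_comm, int id, MPI_Comm *col_comm)
    {
        int coords[2];                        /* coords[0] = row, coords[1] = column */
        MPI_Cart_coords(grid_comm, id, 2, coords);
        MPI_Comm_split(grid_comm, coords[1], coords[0], col_comm);
    }

    /* Same idea per grid row: colour by the row coordinate and rank by the
     * column, so rank 0 of each row_comm is the process in the first column. */
    void create_row_comm(MPI_Comm grid_comm, int id, MPI_Comm *row_comm)
    {
        int coords[2];
        MPI_Cart_coords(grid_comm, id, 2, coords);
        MPI_Comm_split(grid_comm, coords[0], coords[1], row_comm);
    }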

[Figure 6: Redistribution of vector b (a) when the process grid is square: task (i,0) sends its portion of b to task (0,i), then each task (0,i) broadcasts its portion to tasks (*,i); (b) when the process grid is not square: task (0,0) gathers b from tasks (*,0), scatters it over tasks (0,*), then each task (0,i) broadcasts its portion to tasks (*,i).]

3. Suppose again that grid_comm is a communicator with Cartesian topology. Write a code segment illustrating how the function read_block_vector (implemented in Exercise 5.4) can be used to open a file containing the vector b and distribute it among the first column of processes in grid_comm. (A sketch is given after question 4.)

4. The distribution pattern of the matrix is similar to the one used for the function read_col_matrix (implemented in Exercise 5.3), except that instead of scattering each matrix row among all the processes, we must scatter it among a subset of the processes: those occupying a single row in the virtual process grid. As always, we assume a single process to be responsible for matrix I/O, though the choice does not really matter in this case; let us select $P_0$ for this task. Implement the following function in which (a) $P_0$ opens the file f, reads and broadcasts (using MPI_Bcast) the dimensions of the matrix; (b) each process determines the size of the submatrix it is responsible for (using MPI_Cart_get in particular) and allocates it; (c) $P_0$ reads in the matrix one row at a time and sends it to the process in the first column of the appropriate row of the virtual process grid; (d) after receiving a matrix row, that process scatters it among the other processes belonging to the same row of the virtual process grid.

       void read_grid_matrix(char *f, MPI_Datatype dtype, size_t *m, size_t *n, void **M, MPI_Comm grid_comm);
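For question 3, a hedged sketch of such a code segment is shown below. It assumes col_comm was obtained from create_col_comm (so the first-column processes share one col_comm ranked by grid row), that read_block_vector has the prototype given in Exercise 5.4, and that the file name is only an example.

    /* Fragment to be placed after grid_comm and col_comm have been created. */
    int dims[2], periods[2], coords[2];
    MPI_Cart_get(grid_comm, 2, dims, periods, coords);

    size_t n = 0;
    double *b = NULL;
    if (coords[1] == 0) {
        /* only the first grid column takes part in the block-decomposed read;
         * within col_comm its ranks follow the grid rows, so process (i,0)
         * ends up with BLOCK_SIZE(i,k,n) elements of b                       */
        read_block_vector("random_vector_n.dat", MPI_DOUBLE, &n, (void **)&b, col_comm);
    }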

5. Write the function that redistributes the vector b among the processes, following the operations illustrated in Figure 6:

       void redistribute_vector(void *v, MPI_Datatype type, size_t n, MPI_Comm grid_comm);

   Initially, vector b is distributed among the first column of the k x l virtual grid: the process at grid location (i, 0) is responsible for BLOCK_SIZE(i,k,n) elements of b, beginning with the element of index BLOCK_LOW(i,k,n). After the redistribution, every process in column j of the grid is responsible for BLOCK_SIZE(j,l,n) elements of b, beginning with the element of index BLOCK_LOW(j,l,n). You may rely on two sub-functions that deal with the specific cases $k = l = \sqrt{p}$ and $k \ne l$.

6. Implement the printing function:

       void print_checkerboard_matrix(void *M, MPI_Datatype type, size_t m, size_t n, MPI_Comm comm);

7. Finally, write a program that implements the parallel matrix-vector multiplication based on the checkerboard block decomposition. Check your program using the vectors and matrices generated previously.

8. Benchmark your program for various numbers of processors and plot your data using gnuplot.

Merry Christmas and Happy New Year!

Appendix A  MPI functions used in this exercise

See:

    MPI_Allgatherv:  http://mpi.deino.net/mpi_functions/mpi_allgatherv.html
    MPI_Allgather:   http://mpi.deino.net/mpi_functions/mpi_allgather.html
    MPI_Gatherv:     http://mpi.deino.net/mpi_functions/mpi_gatherv.html
    MPI_Gather:      http://mpi.deino.net/mpi_functions/mpi_gather.html
    MPI_Scatterv:    http://mpi.deino.net/mpi_functions/mpi_scatterv.html
    MPI_Scatter:     http://mpi.deino.net/mpi_functions/mpi_scatter.html
    MPI_Alltoallv:   http://mpi.deino.net/mpi_functions/mpi_alltoallv.html
    MPI_Dims_create: http://mpi.deino.net/mpi_functions/mpi_dims_create.html
    MPI_Cart_create: http://mpi.deino.net/mpi_functions/mpi_cart_create.html
    MPI_Cart_get:    http://mpi.deino.net/mpi_functions/mpi_cart_get.html
    MPI_Cart_rank:   http://mpi.deino.net/mpi_functions/mpi_cart_rank.html
    MPI_Cart_coords: http://mpi.deino.net/mpi_functions/mpi_cart_coords.html
    MPI_Cart_split:  http://mpi.deino.net/mpi_functions/mpi_cart_split.html