Introduction to Parallel Programming
Message Passing Interface

Practical Session, Part I

T. Streit, H.-J. Pflug (streit@rz.rwth-aachen.de)
October 28, 2008
1. Examples

We provide the codes of the theoretical part as well as serial codes for the exercises. Download and extract the codes using the commands:

    wget http://support.rz.rwth-aachen.de/public/mpi1codes.tar.gz
    tar xzvf MPI1Codes.tar.gz
    wget http://support.rz.rwth-aachen.de/public/mpi1exercises.tar.gz
    tar xzvf MPI1Exercises.tar.gz

2. Hello World

Test the hello.c example. Every process prints the "Hello World" line. Compile and run the program with 4 MPI processes using:

    $MPICC hello.c -o hello
    $MPIEXEC -n 4 hello

3. Size & Rank

Test the ranks.c example. Every process identifies itself with its rank (myrank) and the communicator size (nprocs). Compile and run the program with 4 MPI processes using:

    $MPICC ranks.c -o ranks
    $MPIEXEC -n 4 ranks
4. Array of Integers - Part I

The program numarray_serial.c creates an array of k (e.g. k=50) integer numbers between 0 and 9 and then counts the zeros. The output is:

    k=50
    Array: 3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6 0 6 2 6 1 8 7 9 2 0 2 3 7 5 9 2 2 8 9 7 3 6 1 2 9 3 1 9 4 7 8 4 5 0
    Number 0 was found 4 times in the array

Parallelize the code so that only the master creates the array. The integer numbers should be between 0 and nprocs-1. All CPUs have access to the value of k. The master then prints the array and sends it to all workers. Each worker counts how many times myrank occurs in the array. All workers send their counts back to the master, which receives and prints all results.

Example: k=50, 10 CPUs

    Array: 6 1 1 4 8 8 2 3 2 3 3 9 1 6 7 6 4 0 5 2 4 8 9 3 0 4 7 5 6 7 2 4 1 3 8 1 2 2 5 4 5 0 5 7 8 4 5 4 5 1
    Number 0 was found 8 times in the array
    Number 1 was found 12 times in the array
    Number 2 was found 15 times in the array
    Number 3 was found 15 times in the array
5. Array of Integers - Part II

Now modify your parallel code of Part I. Send an array of a size that is unknown to the workers - only the root initializes the value of k. The workers have to check the incoming size and allocate memory accordingly. You will have to use the MPI_Probe and MPI_Get_count functions. Type man MPI_Probe and man MPI_Get_count for help.

Example: k=50, 4 CPUs

    Array: 3 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2
    Number 0 was found 8 times in the array
    Process 1 received value 50
    Process 2 received value 50
    Number 1 was found 12 times in the array
    Process 3 received value 50
    Number 2 was found 15 times in the array
    Number 3 was found 15 times in the array
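The MPI_Probe / MPI_Get_count pattern can be illustrated serially: the receiver learns the message size first, then allocates a buffer of exactly that size before "receiving". This is only an analogue (the function and names are invented for illustration; the real code calls MPI_Probe, MPI_Get_count, and then MPI_Recv into the freshly allocated buffer):

```c
#include <stdlib.h>
#include <string.h>

/* Serial analogue of the probe-then-receive idea: the size is
 * discovered at receive time (here passed in as 'count', standing
 * in for what MPI_Get_count would report), the buffer is
 * allocated only then, and the data is copied in afterwards. */
int *receive_unknown_size(const int *message, int count, int *out_count) {
    /* MPI analogue: MPI_Probe(0, tag, comm, &status);
     *               MPI_Get_count(&status, MPI_INT, &count); */
    int *buf = malloc(count * sizeof *buf);   /* allocate after probing */
    if (buf == NULL)
        return NULL;
    /* MPI analogue: MPI_Recv(buf, count, MPI_INT, ...); */
    memcpy(buf, message, count * sizeof *buf);
    *out_count = count;
    return buf;
}
```

The point of the pattern is that the receiver never needs a compile-time or broadcast upper bound on k; the probe supplies the exact count before any memory is committed.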
6. Array of Integers - Part III

Modify Part I again, now to test the collective operations MPI_Scatter and MPI_Bcast. The master distributes the array using the scatter function, so that each worker only works on a small part of the array. Assume that the number of array elements is divisible by the number of CPUs. In addition, the number n to look for (now only ONE number for all CPUs) is a random number between 0 and nprocs-1; the master broadcasts it to all workers, so each worker looks for the same number. The master iteratively receives and sums up all results. For help, type man MPI_Scatter and man MPI_Bcast.

Example: k=50, 4 CPUs, n=3

    Array: 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2 1
    Process 0 received value 3
    Process 0 has found 3 times the number 3 in the part of the array
    Process 1 received value 3
    Process 1 has found 4 times the number 3 in the part of the array
    Process 2 received value 3
    Process 2 has found 4 times the number 3 in the part of the array
    Process 3 received value 3
    Process 3 has found 3 times the number 3 in the part of the array
    All processes have found 14 times the number 3 in the array
7. Array of Integers - Part IV (Homework)

Modify Part III in the following way: use MPI_Reduce to receive and automatically sum up all results. For help: man MPI_Reduce.

Example: k=50, 4 CPUs, n=3

    Array: 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2 1
    Process 0 received value 3
    Process 0 has found 3 times the number 3 in the part of the array
    Process 2 received value 3
    Process 1 received value 3
    Process 2 has found 4 times the number 3 in the part of the array
    Process 3 received value 3
    Process 1 has found 4 times the number 3 in the part of the array
    Process 3 has found 3 times the number 3 in the part of the array
    All processes have found 14 times the number 3 in the array

8. Array of Integers - Part V (Homework)

Is it possible to modify Part IV (using MPI_Scatter) so that the program works for array sizes not divisible by the number of CPUs? If yes, explain how. If not, explain why not.
9. Calculation of π with Numerical Integration

Given is a program (pi_tangent_serial.c) calculating π using a quadrature approximation of the integral formula

    π = ∫_0^1 4/(1+x^2) dx

To do the integration, the domain [0,1] is divided into n intervals, and the well-known trapezium rule (http://en.wikipedia.org/wiki/Trapezium_rule) is used. One can either use the simple trapezium rule

    ∫_0^1 f(x) dx ≈ w * [ (f(0) + f(1))/2 + Σ_{i=1}^{n-1} f(0 + i*w) ]

or (as in this code) the tangent (midpoint) rule

    ∫_0^1 f(x) dx ≈ w * Σ_{i=1}^{n} f(0 + w*(i - 0.5))

where f(x) = 4/(1+x^2) and w is the interval width, i.e. w = 1/n. The results of the two rules differ for small numbers of intervals.

Now think about a way to distribute the work among several processes, and parallelize the code. The master should read the number of intervals and broadcast it to the other processes. All processes (including the master) should get approximately the same amount of work. Compute the final sum using the collective MPI_Reduce function.

Example, single CPU:

    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 10
    The computed value of the integral is 3.142425985001098
    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 100
    The computed value of the integral is 3.141600986923125
    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 1000
    The computed value of the integral is 3.141592736923123

2 CPUs:

    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervals: 10
    calculated pi value: 3.14242598500109826531
    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervals: 100
    calculated pi value: 3.14160098692312494961
    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervals: 1000
    calculated pi value: 3.14159273692313067983
10. Matrix-Vector Multiplication

Matrix-vector multiplication c = A*b is a widely used operation in scientific computing. Given a matrix A ∈ R^(l×m) and a vector b ∈ R^m, the result is a vector c ∈ R^l. In pseudocode:

    do i=1,l
      do j=1,m
        c(i) = c(i) + A(i,j)*b(j)
      end do
    end do

Given is a serial code for matrix-vector multiplication (mxv_serial_1pointer.c). Parallelize the code so that the master distributes the vector b and the matrix rows, with an equal number of rows for each process, i.e. a row-blockwise decomposition:

    P0     P0     ...  P0
    ...
    P0     P0     ...  P0
    P1     P1     ...  P1
    ...
    P1     P1     ...  P1
    ...
    PN-1   PN-1   ...  PN-1
    ...
    PN-1   PN-1   ...  PN-1

Assume that the number of rows is divisible by the number of processes. The master also computes one of the blocks. The master collects the calculated elements in a vector and prints the result.

Example:

    Number of rows: 10
    Number of columns: 10
    A[0] = 0 1 2 3 4 5 6 7 8 9
    A[1] = 1 2 3 4 5 6 7 8 9 0
    A[2] = 2 3 4 5 6 7 8 9 0 1
    A[3] = 3 4 5 6 7 8 9 0 1 2
    A[4] = 4 5 6 7 8 9 0 1 2 3
    A[5] = 5 6 7 8 9 0 1 2 3 4
    A[6] = 6 7 8 9 0 1 2 3 4 5
    A[7] = 7 8 9 0 1 2 3 4 5 6
    A[8] = 8 9 0 1 2 3 4 5 6 7
    A[9] = 9 0 1 2 3 4 5 6 7 8
    b = 0 1 2 3 4 5 6 7 8 9
    resultvector(0) = 285
    resultvector(1) = 240
    resultvector(2) = 205
    resultvector(3) = 180
    resultvector(4) = 165
    resultvector(5) = 160
    resultvector(6) = 165
    resultvector(7) = 180
    resultvector(8) = 205
    resultvector(9) = 240

The quadratic matrix in our test example can be initialized very simply: A(i,j) = (i+j) mod rows, for i, j ∈ {0, 1, ..., rows-1}. Test your program on some large matrices.
11. Floyd's Shortest Path Algorithm (Homework)

Given is the program floyd_serial.c that computes shortest paths using the Floyd algorithm, sometimes also called the Floyd-Warshall algorithm (http://en.wikipedia.org/wiki/Floyd-Warshall_algorithm). Given a graph G = (V, E), the Floyd algorithm finds the shortest path between all pairs of nodes i, j.

Input: a distance matrix D(i,j) >= 0. We assume that v is the number of vertices and D(i,i) = 0 for all i ∈ {0, 1, ..., v-1}.

Serial Floyd algorithm:

    for k = 0 to v-1
      for i = 0 to v-1
        for j = 0 to v-1
          D(i,j) = min{ D(i,j), D(i,k) + D(k,j) }

Output: D(i,j) contains the length of the shortest path from i to j.

Parallel Floyd with rowwise distribution: a simple parallel Floyd algorithm is based on a one-dimensional, rowwise domain decomposition of the intermediate matrix D. Each of the n processors owns v/n rows.

    for k = 0 to v-1
      Processor that holds row k broadcasts it to all others.
      for i = i_local_start to i_local_end
        for j = 0 to v-1
          D(i,j) = min{ D(i,j), D(i,k) + D(k,j) }

In the kth step, each task requires, in addition to its local data, the kth row of D.

An alternative parallel version of Floyd's algorithm uses a two-dimensional (checkerboard) decomposition of the matrix. This version allows the use of up to n^2 processors and requires new row and column communicator schemes:
    for k = 0 to v-1
      Processor that holds row k broadcasts it (or parts of it) to all others.
      Processor that holds column k broadcasts it (or parts of it) to all others.
      for i = i_local_start to i_local_end
        for j = j_local_start to j_local_end
          D(i,j) = min{ D(i,j), D(i,k) + D(k,j) }

In each step, each task requires, in addition to its local data, data from the kth row and the kth column of D. Hence, communication requires two broadcast operations per step.

Implement the parallel Floyd with rowwise distribution.

Example (floydmatrix.txt)

Input:

     0  1  3 20 20 20  2  9 20 20
     9  0  9  4 20  4 20 20 20  9
    20 20  0  7  3 20 20 20  7  2
     2 20  3  0 20 20  5  8 20  4
     9  4  3  9  0 20 20 20  8 20
     6  4  9 20  3  0 20 20  5 20
    20  2  1 20  3 20  0 20 20 20
     5  2 20  8 20 20 20  0 20  7
     2 20 20 20 20 20  4 20  0 20
     3 20  4 20  7  6 20  4  8  0

Result:

     0  1  3  5  5  5  2  9 10  5
     6  0  7  4  7  4  8 12  9  8
     5  6  0  7  3  8  7  6  7  2
     2  3  3  0  6  7  4  8 10  4
     8  4  3  8  0  8 10  9  8  5
     6  4  6  8  3  0  8 12  5  8
     6  2  1  6  3  6  0  7  8  3
     5  2  8  6  9  6  7  0 11  7
     2  3  5  7  7  7  4 11  0  7
     3  4  4  8  7  6  5  4  8  0

Test your program with larger matrices. Upload your examples as well.
12. Clean Buggy Code (Homework)

We have prepared an MPI program with five errors (fixme.c). These errors violate the MPI standard.

    /*
     * This program does the same operation as an MPI_Bcast() but
     * does it using MPI_Send() and MPI_Recv().
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int nprocs;          /* the number of processes in the task */
        int myrank;          /* my rank */
        int i;
        int l = 0;
        int tag = 42;        /* tag used for all communication */
        int tag2 = 99;       /* extra tag used for whatever you want */
        int data = 0;        /* initialize all the data buffers to 0 */
        MPI_Status status;   /* status of MPI_Recv() operation */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Initialize the data for the rank 0 process only. */
        if (myrank == 0) {
            data = 399;
        }

        if (myrank == 0) {
            for (i = 1; i < nprocs; i++) {
                MPI_Send(&data, 1, MPI_BYTE, i, tag, MPI_COMM_WORLD);
            }
        }
        else {
            MPI_Recv(data, l, MPI_INT, 0, tag2, MPI_COMM_WORLD, &status);
        }

        MPI_Barrier(MPI_COMM_WORLD);

        /* Check the data everywhere. */
        if (data != 399) {
            fprintf(stdout, "Whoa! The data is incorrect\n");
        }
        else {
            fprintf(stdout, "Whoa! Got the message... \n");
        }

        return 0;
    }

Fix the errors in the code. Comment your changes.