2D Heat Distribution Prediction in Parallel


Brett D. Estrade
CS 691 Fall 2004 Final Project
Instructor: Dr. Joe Zhang

Abstract

Prediction of the temperature distribution in a 2-dimensional domain, using an iterative finite difference method to solve the governing partial differential equation, is a computational problem that can benefit greatly from parallelization. This is especially important over large domains with many nodes, because each iterative sweep has a complexity of at least O(n^2), so the serial time to solution grows at least quadratically as the number of nodes increases. Fortunately, many finite difference methods are highly parallelizable and can be programmed to take advantage of their specific task dependencies. This project investigates three iterative finite difference methods: Jacobi, Gauss-Seidel, and SOS. It implements them in parallel using MPI and examines their accuracy, speed up, and efficiency. Optimizing the parallel communications was not within the scope of this project, so all three methods use a simple block-row partitioning scheme with ghost row communication.

Table of Contents

1. Introduction
   a. Problem description
   b. Jacobi method
   c. Gauss-Seidel method
   d. SOS method
2. Parallel Algorithm Design
   a. Details
   b. Row distribution and task allocation
   c. Communications
3. Program Specifics
   a. Overview
   b. Method implementation
   c. Boundary condition enforcement
   d. User interface
   e. Output files
4. Verification and Performance Analysis
   a. Details
   b. Method comparisons
   c. Speed up and efficiency
5. Conclusions
6. Credits and Resources
Appendix I: bde_2dheat.c source code
Appendix II: compile.sh script file

1. Introduction

a. Problem description

This project required the prediction of the distribution of heat in a 2-dimensional domain using three different finite difference methods. The partial differential equation (PDE) that governs steady-state heat conduction in a 2D domain is given by:

    u_xx + u_yy = f(x,y)

In order to apply finite difference methods to this PDE, it must be approximated by a system of algebraic equations. This requires that the domain be decomposed into a rectangular mesh with evenly spaced nodes in both directions. Once the domain is decomposed into uniformly distributed discrete nodes, the finite difference methods can be applied. The methods used are iterative, so there must be a test for convergence that indicates that the heat distribution has reached equilibrium. Convergence is tested using the following criterion:

    ||u^(k+1) - u^(k)||_2 < ε

where u^(k+1) is the nodal solution for the latest iteration, u^(k) is the nodal solution for the immediately preceding iteration, and ε is the convergence tolerance. This test is implemented by summing the squared difference between the current and new value at each node and then taking the square root of the total sum; pseudo code for this test is illustrated below:

    sum = 0.0;
    for (i = 0; i < num_cols; i++) {
      for (j = 0; j < num_rows; j++) {
        sum = sum + (U_Next[i][j] - U_Curr[i][j])^2;
      }
    }
    L2_Norm = sqrt(sum);

where U_Curr represents u^(k) and U_Next represents u^(k+1). This quantity approaches 0 as the iterations proceed, but practical application requires the user to set a tolerance that signals a value low enough to indicate that the solution has reached equilibrium. This tolerance is signified by ε and was fixed for this project. Since there are no internal heat sources for this problem, f(x,y) = 0. The boundaries are fixed at 0, and the sole heat source is located at x = length/2 and y = height. The heat distribution is calculated iteratively until it reaches equilibrium within the tolerance ε.
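As a minimal, self-contained sketch of this convergence test (not the project's exact code; the row-major float** layout and the grid dimensions are assumptions), the L2 norm of the change between two iterations could be computed as:

    #include <math.h>

    /* Sketch: L2 norm of the change between two iterates of the grid. */
    float l2_norm(float **U_Curr, float **U_Next, int num_rows, int num_cols)
    {
      int i, j;
      float sum = 0.0f;
      for (j = 0; j < num_rows; j++) {
        for (i = 0; i < num_cols; i++) {
          float d = U_Next[j][i] - U_Curr[j][i];
          sum += d * d;               /* accumulate squared differences */
        }
      }
      return (float) sqrt(sum);       /* compare this against epsilon */
    }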

[Figure: the 2D domain governed by u_xx + u_yy = f(x,y), with all boundaries held at 0 and the heat source T_0 at the top center.]

Additionally, these methods are not time dependent; if the heat source did vary with time, using these methods would require full convergence at each time step. The following iterative methods were implemented in both serial and parallel.

b. Jacobi method

To compute the next value of a particular node, this method essentially takes the average of the four surrounding nodes as computed in the previous iteration. If there are any internal heat sources, f(i,j) returns non-zero values for the relevant nodes. The formulation is given by:

    u^(k+1)_i,j = (1/4) * (u^(k)_i-1,j + u^(k)_i+1,j + u^(k)_i,j-1 + u^(k)_i,j+1 - h^2 f_i,j)

Implemented as a serial algorithm, the required nested for-loop has a complexity of O(n^2), since every node in the domain must be computed. Although the computation at each node depends on the previous values of its four neighbors, the task dependency is 1, meaning an entire domain could be calculated in a single step if there were enough processors to assign one node per processor. The task dependency is illustrated below. The X's represent boundary nodes set to 0, T_0 is the heat source, and the 1 labels indicate that each node depends only on values from the previous iteration.

[Figure: task-dependency grid for the Jacobi method; boundary nodes are X, T_0 marks the heat source, and every interior node is labeled 1.]
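A minimal serial sketch of one Jacobi sweep over the interior nodes (a hypothetical helper rather than the parallel jacobi() of Appendix I; the square grid size n, the array layout, and the source array f are assumptions):

    /* Sketch: one serial Jacobi sweep. u_old holds u^(k), u_new receives u^(k+1);
       f is the source term, h the node spacing. Boundary nodes are re-imposed elsewhere. */
    void jacobi_sweep(float **u_old, float **u_new, float **f, int n, float h)
    {
      int i, j;
      for (j = 1; j < n - 1; j++) {
        for (i = 1; i < n - 1; i++) {
          u_new[j][i] = 0.25f * (u_old[j][i-1] + u_old[j][i+1]
                               + u_old[j-1][i] + u_old[j+1][i]
                               - h * h * f[j][i]);
        }
      }
    }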

The parallel algorithm is exactly the same; it requires access to the previous value of the node being solved and of its four neighbors. Optimally, each node would be calculated on a single processor. Communication calls associated with getting the values of the four neighboring nodes can be reduced to a single communication, thus requiring N communications per iteration on an N-node domain.

c. Gauss-Seidel method

This method works by using a red and black node designation scheme. A node is considered red when i + j is odd, and black when i + j is even. Since computation of the black nodes requires the newest values of the red nodes, there is a much higher task dependency than with Jacobi. This method, however, converges much faster than Jacobi, and is therefore a viable choice despite this dependency. The task dependency for each node is illustrated below. The X's represent boundary nodes set to 0, T_0 is the heat source, and the number represents the task dependency of the node if the dependent black nodes are computed concurrently with the red nodes.

[Figure: red/black task-dependency grid; boundary nodes are X, T_0 marks the heat source, and interior nodes are numbered by dependency.]

This method is given in two steps. First, the red values are calculated:

    u^(k+1)_i,j = u^(k)_i,j + (W/4) * (u^(k)_i-1,j + u^(k)_i+1,j + u^(k)_i,j-1 + u^(k)_i,j+1 - 4u^(k)_i,j - h^2 f_i,j)

Then, the black values are calculated:

    u^(k+1)_i,j = u^(k)_i,j + (W/4) * (u^(k+1)_i-1,j + u^(k+1)_i+1,j + u^(k+1)_i,j-1 + u^(k+1)_i,j+1 - 4u^(k)_i,j - h^2 f_i,j)

The red values are calculated much like those in Jacobi, using the values obtained in the previous iteration. The black values are calculated using the updated red values, which means that each red value must be computed before any black nodal calculation requires it. While black nodes can be computed concurrently once their required red nodes are available, this property does not reduce the complexity of the serial algorithm, because each node must still be computed individually. An optimized parallel algorithm could take advantage of the task dependency by computing the red nodes required by the black nodes concurrently. While this scheme is not as ideal as Jacobi's, it is still highly parallelizable and converges much faster. The W in both formulations is a relaxation factor that scales the update. For Gauss-Seidel, W = 1.
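A minimal serial sketch of one red/black sweep (a hypothetical in-place helper, unlike the two-array parallel code in Appendix I; the grid size n, array layout, source array f, and relaxation factor w are assumptions):

    /* Sketch: one serial red/black relaxation sweep with relaxation factor w.
       Reds (i+j odd) are updated first from old values; blacks (i+j even) then
       use the freshly updated reds, since the update is done in place. */
    void red_black_sweep(float **u, float **f, int n, float h, float w)
    {
      int i, j, parity;
      for (parity = 1; parity >= 0; parity--) {   /* 1 = red, 0 = black */
        for (j = 1; j < n - 1; j++) {
          for (i = 1; i < n - 1; i++) {
            if ((i + j) % 2 != parity)
              continue;
            u[j][i] += (w / 4.0f) * (u[j][i-1] + u[j][i+1] + u[j-1][i] + u[j+1][i]
                                     - 4.0f * u[j][i] - h * h * f[j][i]);
          }
        }
      }
    }

With w = 1 this reduces to Gauss-Seidel; with 1 < w < 2 it is the over-relaxed (SOS) variant described in the next section.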

d. SOS method

The SOS method is exactly the same as the Gauss-Seidel method, but converges faster because of the following constraint on W:

    1 < W < 2

2. Parallel Algorithm Design

a. Details

None of these methods were implemented in their optimized forms. Instead, they all use the same block partitioning communication scheme. Additionally, the Gauss-Seidel and SOS methods were implemented in a manner that first computes all of the red nodes, then all of the black nodes. Because of this, all three methods showed similar speed up and efficiency as the number of processors was increased. Since all three methods had to be implemented in both serial and parallel, the domain decomposition and communications were designed so that they would work for both. The steps are:

1) Rows are assigned equally to each process (block row partitioning).
2) Ghost rows are exchanged at the start of each iteration.
3) The solution method iterates until the convergence criterion is met (same as the serial algorithm).

Additionally, each processor references all nodes using a global addressing scheme. This was very helpful because it minimized the need to convert between local and global addresses on each processor. A function called global_to_local translated between global and local addressing when needed. This scheme also required several other helper functions to determine, on the fly, the number of rows on each processor and the global addresses of the first and last row on each processor. This created a slight overhead, but helped minimize the number of MPI calls. While this scheme might seem simplistic, it is very relevant for this type of exercise because it is straightforward to implement and test.

b. Row distribution

Rows are distributed so that each processor gets an even share of rows in order to balance the computational load, as sketched below. This is called block row partitioning, and no process will have more than one row more than any other process. Additionally, rows are assigned in adjacent groups to adjacent processes. This helps minimize communication costs and allows for the creation of a virtual global matrix.
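A minimal sketch of this block-row partitioning arithmetic (modeled on, but not identical to, the get_start/get_end helpers in Appendix I; nrows is the global number of rows and p the number of processes):

    /* Sketch: block-row partitioning. The first (nrows % p) ranks get one extra row,
       so no process holds more than one row more than any other. */
    int block_start(int rank, int nrows, int p)
    {
      int per_proc  = nrows / p;      /* whole rows per process */
      int remainder = nrows % p;      /* leftover rows go to the first ranks */
      if (rank < remainder)
        return rank * (per_proc + 1);
      return rank * per_proc + remainder;
    }

    int block_end(int rank, int nrows, int p)    /* inclusive last row of this rank */
    {
      int per_proc  = nrows / p;
      int remainder = nrows % p;
      return block_start(rank, nrows, p) + per_proc - (rank < remainder ? 0 : 1);
    }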

c. Communications

Parallel communications are handled using ghost rows. At the beginning of each iteration, a process receives the row immediately above and below its own block of rows. These ghost rows are communicated using a non-blocking MPI_Isend and a blocking MPI_Recv, which ensures that deadlocks do not occur. The root process, which contains the top rows, does not need a ghost row from above, and the last process, which contains the bottom rows, does not need a ghost row from below. Because of this, the total number of MPI_Isend calls is given by:

    Comm_sendrecv = I * 2(P - 1)

where I is the total number of iterations and P is the total number of PEs. Additionally, convergence is calculated once per iteration, so the total number of MPI_Reduce contributions is:

    Comm_reduce = I * (P - 1)

Therefore, the total amount of active communications (i.e., sent data) is:

    Comm_total = 3I * (P - 1)

This does not take into account the additional overhead incurred in each MPI_Isend of the ghost rows as the number of nodes in each row increases. Lastly, when an acceptable convergence is achieved, the iterative loop is immediately exited, MPI_Finalize is called to close all communications, and each processor dumps the values of its rows into a separate file. The files can easily be globalized by concatenating them in rank order using a utility such as cat. An example of this process can be seen as part of the program compilation script in Appendix II.

3. Program Specifics

a. Details

This program was implemented in C, and development time was split between a dual-processor x86 machine running Fedora Core 2 Linux and an older model Dell Inspiron laptop running FreeBSD. The MPICH implementation of the MPI standard was used on both platforms. The resulting code can be compiled and executed on both platforms, and has also been tested on an IBM SP distributed memory cluster. The serial and parallel implementations were exactly the same, and at no point is there any globalization of the solution array. The virtual global array was facilitated by communicating the ghost rows discussed in section 2c at the beginning of each iteration. This greatly minimized communications and simplified the implementation of the parallel communication scheme.
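A condensed sketch of the ghost-row exchange described in section 2c (adapted from the fuller per-method versions in Appendix I; the buffer names, message tags, and the use of MPI_Waitall instead of barriers are assumptions of this sketch):

    #include "mpi.h"

    /* Sketch: exchange ghost rows with the ranks above and below.
       top_row/bottom_row are this rank's boundary rows of length ncols;
       ghost_above/ghost_below receive the neighbors' rows. */
    void exchange_ghost_rows(float *top_row, float *bottom_row,
                             float *ghost_above, float *ghost_below,
                             int ncols, int rank, int p)
    {
      MPI_Request reqs[2];
      MPI_Status  status;
      int nreq = 0;

      /* send my bottom row down, receive a ghost row from above */
      if (rank < p - 1)
        MPI_Isend(bottom_row, ncols, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
      if (rank > 0)
        MPI_Recv(ghost_above, ncols, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, &status);

      /* send my top row up, receive a ghost row from below */
      if (rank > 0)
        MPI_Isend(top_row, ncols, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD, &reqs[nreq++]);
      if (rank < p - 1)
        MPI_Recv(ghost_below, ncols, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD, &status);

      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* complete the non-blocking sends */
    }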

Care was taken to produce reusable code, although not all communications were encapsulated inside their own functions (the ghost row exchange, for example). The finite difference method implementations, boundary condition enforcement, and nodal value requests were all self-contained so that different boundary conditions, initial conditions, internal heat sources, and finite difference methods could easily be added.

b. Method implementation

The Jacobi method is implemented as illustrated in the pseudo code below:

    while (!converged(U_Curr, U_Next)) {
      if (!ROOT) { above_row = get_above(U_Curr); }
      if (!LAST) { below_row = get_below(U_Curr); }
      /* update U_Curr, which holds the solution from the previous iteration */
      U_Curr = U_Next;
      jacobi(U_Curr, U_Next, above_row, below_row);
    }

where above_row is the ghost row from above, below_row is the ghost row from below, U_Curr represents u^(k), and U_Next represents u^(k+1). The Gauss-Seidel and SOS methods were implemented in the following manner:

    while (!converged(U_Curr, U_Next)) {
      /* update U_Curr, which holds the solution from the previous iteration */
      U_Curr = U_Next;
      if (!ROOT) { above_row = get_above(U_Curr); }
      if (!LAST) { below_row = get_below(U_Curr); }
      /* solve for red values using values from the previous iteration */
      solve_red(U_Curr, U_Next, above_row, below_row);
      /* solve for black values using the red values from this iteration */
      solve_black(U_Next, U_Next, above_row, below_row);
    }

where above_row is the ghost row from above, below_row is the ghost row from below, U_Curr represents u^(k), and U_Next represents u^(k+1).

Global convergence was calculated using an MPI_Reduce call with the MPI_SUM reduction operation, which added up all of the local squared-difference sums and sent the result to the root process. The root process then took the square root of this sum to get the global convergence value. This value was compared to ε, and the simulation ended when it was less than or equal to ε.
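A minimal sketch of this global convergence step (an illustration, not the exact code in Appendix I; local_sum_sqd is assumed to be each rank's sum of squared nodal differences):

    #include <math.h>
    #include "mpi.h"

    /* Sketch: combine the per-rank squared-difference sums into a global L2 norm
       and share it so that every rank can test it against epsilon. */
    float global_convergence(float local_sum_sqd, int rank)
    {
      float global_sum_sqd = 0.0f;
      float norm = 0.0f;

      MPI_Reduce(&local_sum_sqd, &global_sum_sqd, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
        norm = (float) sqrt(global_sum_sqd);

      /* broadcast the result so all ranks exit the iteration loop together */
      MPI_Bcast(&norm, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
      return norm;
    }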

c. Boundary condition enforcement

Boundary conditions were enforced using a two-fold strategy. The first part of this strategy was to provide a function called get_val_par to access array values instead of accessing them directly. This encapsulated the use of ghost rows for the parallel run and allowed much of the same code to be used for both the serial and parallel runs. This function is used by each of the finite difference methods to get the values required to compute each node. For example, here is an illustration of its use in the Jacobi implementation:

    for (j = my_start; j <= my_end; j++) {
      for (i = 0; i < (int)floor(WIDTH/H); i++) {
        next_ptr[j-my_start][i] = .25 * (get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i-1, j)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i+1, j)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i, j-1)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i, j+1)
                                       - (pow(H,2) * f(i,j)));
        enforce_bc_par(next_ptr, my_rank, i, j);
      }
    }

where U_Curr_Above is the ghost row from above, U_Curr_Below is the ghost row from below, H is the spacing of the nodes in all directions, my_rank is the processor's rank, i and j are the node coordinates, current_ptr represents u^(k), and next_ptr represents u^(k+1).

When get_val_par is called, the corrected value for the requested node is returned depending on whether it is the heat source (T_0), a boundary node (T = 0), in a ghost row, or in the processor's own rows. The second part of the boundary condition enforcement is a function called enforce_bc_par, which sets next_ptr[i][j] back to its correct value if it lies on the boundary. Boundary condition enforcement is done after the computation in order to better abstract the handling of boundary nodes. If there were internal heat sources, these nodes would also be accounted for within this strategy. Approaching boundary condition enforcement this way allows for more flexibility in the types of boundary conditions that can be applied. This is a very simple case, and it is tempting to hard-code the conditions in; however, doing so would not allow for easy code reuse.
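A compact serial analogue of this two-fold access strategy (a sketch with assumed grid dimensions and source temperature, not the parallel get_val_par/enforce_bc_par pair of Appendix I):

    /* Sketch: nodal accessor that hides boundary handling from the solver.
       width/height are the grid dimensions in nodes; t_src is the source temperature. */
    float get_val(float **u, int i, int j, int width, int height, float t_src)
    {
      if (j == 0 && i == width / 2 - 1)
        return t_src;                      /* the single heat source on the top edge */
      if (i <= 0 || j <= 0 || i >= width - 1 || j >= height - 1)
        return 0.0f;                       /* every other edge is held at 0 */
      return u[j][i];                      /* interior node: the stored value */
    }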

d. User interface

The interface allows the user to select one of the three finite difference methods and whether to print the global convergence for each iteration. Once the user selects the finite difference method, the selection is broadcast to all processes, and a function pointer, void (*method)();, is pointed at the function that implements the selected method:

    /* broadcast method to use */
    (void) MPI_Bcast(&meth, 1, MPI_INT, 0, MPI_COMM_WORLD);
    switch (meth) {
      case 1: method = &jacobi;       break;
      case 2: method = &gauss_seidel; break;
      case 3: method = &sos;          break;
    }

where meth is the variable that contains the method selected by the user, &jacobi is a pointer to the Jacobi implementation, &gauss_seidel is a pointer to the Gauss-Seidel implementation, and &sos is a pointer to the SOS implementation.

e. Output files

Each processor writes its own output file once the finite difference method converges. It outputs the values of its rows using the global addressing scheme, so it is easy to globalize all of the output files into one. Using the cat command, file globalization is accomplished by issuing the following command:

    % cat output/* > $$.Xp.out

where output/* refers to all of the per-processor output files in the output directory, > redirects the output of the cat command to a file called $$.Xp.out, $$ is the process id (PID) of the shell or shell script executing the command, and X is the number of processors used for the calculation.

4. Verification and Performance Analysis

a. Details

All performance tests were conducted on 8 dual-processor 2.6 GHz 32-bit Intel Xeon machines, totaling 16 processors in all. The parallel computing platform containing these computers was a 100 Mb/s switched Ethernet local area network.

Globalized output files were validated for both the serial and parallel implementations of each method, and were found to be identical for all methods on domains of 10x10, 50x50, and 100x100.

b. Method comparison

Each method was compared, and the results were all very close. The following table shows a sample of the results for rows 1 and 2 computed on a 10x10 domain containing 100 nodes:

[Table: x, y, and the corresponding Jacobi, Gauss-Seidel, and SOS nodal values for rows 1 and 2.]

As the table shows, all three methods agree to the thousandths place. Additional comparisons show the number of iterations needed to converge and how long each method took to converge when run in serial:

[Chart: iterations to converge for Jacobi, G-S, and SOS on the 50x50 and 100x100 domains.]

[Chart: serial time to converge, in seconds, for Jacobi, G-S, and SOS on the 50x50 and 100x100 domains.]

SOS was clearly the fastest-converging finite difference method in terms of both serial time and the number of iterations.

c. Speed up and efficiency

Performance evaluations for speed up and efficiency were performed for each of the methods on the 50x50 and 100x100 domains. The results are as follows:

[Chart: speed up versus number of processors on the 50x50 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: efficiency versus number of processors on the 50x50 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: speed up versus number of processors on the 100x100 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: efficiency versus number of processors on the 100x100 node domain for Jacobi, Gauss-Seidel, and SOS.]
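The plots use the conventional definitions of speed up and efficiency (stated here as an assumption, since the report does not define them explicitly):

    S(p) = T_serial / T_parallel(p)
    E(p) = S(p) / p

where T_serial is the single-processor time to convergence and T_parallel(p) is the time to convergence on p processors.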

5. Conclusions

Of the three methods, SOS converged the fastest. The parallel implementation of each method shows a slight speed up at low processor counts, but this speed up quickly tapers off, as the efficiency plots show. Speed up and efficiency could be greatly improved by taking advantage of the inherent task dependencies of each method. While Jacobi is the most highly parallelizable, even a fully optimized Jacobi implementation may still not converge fast enough to beat an optimized Gauss-Seidel or SOS implementation; the only way to know for sure is to create optimized versions of each method and repeat the performance comparisons. The similarity of the speed up and efficiency plots makes it clear that the same domain decomposition and communication scheme was used for all three finite difference methods.

One of the other goals of the project was to write the code in a way that encouraged reuse. This goal was not perfectly achieved, but the code does allow boundary conditions to be changed and finite difference methods to be added easily. Ideally, all parallel communications, domain decomposition, finite difference implementations, and boundary condition enforcement would be hidden behind clean interfaces, but this code provides only some of these features. Improving the code in this direction would not take much effort and would greatly increase its usefulness as an example of a parallel finite difference implementation.

6. Credits and Resources

Personal communications
MPICH Homepage:
FreeBSD Homepage:
Fedora Homepage:

Appendix I: bde_2dheat.c source code

/*
 * Brett D. Estrade
 * CS 691 Fall 2004
 * Final Project
 *
 * Parallel implementation of 2d heat conduction
 * finite difference over a rectangular domain using:
 *  - Jacobi
 *  - Gauss-Seidel
 *  - SOS
 *
 * The communication scheme uses shared, or "ghost", rows
 * that are used by adjacent processes. This scheme shows
 * linear speed up when the ratio #rows/#procs is close to 1,
 * but performance degrades consistently as the row distribution
 * gets closer to 1 row per processor. This is due to communication
 * overhead. The point here is that the communication scheme is not
 * optimized for any of the methods used here.
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *          X - (W/2,H)
 *   *******X*******
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*   ~ all bdy 0 except "X" at (W/2,H)
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   ***************
 *
 *   2D domain - WIDTH x HEIGHT
 *   "X" = T_SRC0
 *   "*" = 0.0
 *   "." = internal node susceptible to heating
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 */

#define WIDTH 25
#define HEIGHT 25
#define H 1.0
#define EPSILON
#define ITERMAX 1000
#define T_SRC0
#define ROOT 0

/* Includes */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

int get_start (int rank);
int get_end (int rank);
int get_num_rows (int rank);
void init_domain (float **domain_ptr, int rank);
void jacobi (float **current_ptr, float **next_ptr);
void gauss_seidel (float **current_ptr, float **next_ptr);
void sos (float **current_ptr, float **next_ptr);
float get_val_par (float *above_ptr, float **domain_ptr, float *below_ptr, int rank, int i, int j);
void enforce_bc_par (float **domain_ptr, int rank, int i, int j);
int global_to_local (int rank, int row);
float f (int i, int j);
float get_convergence_sqd (float **current_ptr, float **next_ptr, int rank);
void to_file (float **current_ptr, int iteration, int my_rank, int meth);

/* Function pointer to solver method of choice */
void (*method) ();

int main(int argc, char** argv) {
  int p, my_rank;
  /* arrays used to contain each PE's rows - specify cols, no need to spec rows */
  float **U_Curr;
  float **U_Next;
  /* helper variables */
  float convergence, convergence_sqd, local_convergence_sqd;
  /* available iterators */
  int i, j, k, m, n;
  int meth, show_conv, per_proc, remainder, my_start_row, my_end_row, my_num_rows;
  double time;

  /* initialize mpi stuff */
  MPI_Init(&argc, &argv);
  /* get number of procs */
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  /* get rank of current process */
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == ROOT) {
    printf("What finite difference method do you want to use?\n1) Jacobi\n2) Gauss-Seidel\n3) SOS\n");
    scanf("%d",&meth);
    printf("Show convergence?\n1) yes\n2) no\n");
    scanf("%d",&show_conv);
  }

  /* wait for user to input runtime params */
  MPI_Barrier(MPI_COMM_WORLD);

  /* broadcast terminal output option to use */
  (void) MPI_Bcast(&show_conv,1,MPI_INT,0,MPI_COMM_WORLD);

  /* broadcast method to use */
  (void) MPI_Bcast(&meth,1,MPI_INT,0,MPI_COMM_WORLD);
  switch (meth) {
    case 1: method = &jacobi; break;
    case 2: method = &gauss_seidel; break;
    case 3: method = &sos; break;
  }

  /* let each processor decide what row(s) it owns */
  my_start_row = get_start(my_rank);
  my_end_row = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);
  printf("proc %d contains (%d) rows %d to %d\n",my_rank,my_num_rows,my_start_row,my_end_row);
  fflush(stdout);

  /* allocate 2d array */
  U_Curr = (float**)malloc(sizeof(float*)*my_num_rows);
  U_Curr[0] = (float*)malloc(sizeof(float)*my_num_rows*(int)floor(WIDTH/H));
  for (i=1;i<my_num_rows;i++) {
    U_Curr[i] = U_Curr[i-1]+(int)floor(WIDTH/H);
  }

  /* allocate 2d array */
  U_Next = (float**)malloc(sizeof(float*)*my_num_rows);
  U_Next[0] = (float*)malloc(sizeof(float)*my_num_rows*(int)floor(WIDTH/H));
  for (i=1;i<my_num_rows;i++) {
    U_Next[i] = U_Next[i-1]+(int)floor(WIDTH/H);
  }

  /* initialize global grid */
  init_domain(U_Curr,my_rank);
  init_domain(U_Next,my_rank);

  /* iterate for solution */
  if (my_rank == ROOT) {
    time = MPI_Wtime();
  }
  k = 1;
  while (1) {
    method(U_Curr,U_Next);

    local_convergence_sqd = get_convergence_sqd(U_Curr,U_Next,my_rank);
    MPI_Reduce(&local_convergence_sqd,&convergence_sqd,1,MPI_FLOAT,MPI_SUM,ROOT,MPI_COMM_WORLD);
    if (my_rank == ROOT) {
      convergence = sqrt(convergence_sqd);
      if (show_conv == 1) {
        printf("L2 = %f\n",convergence);
      }
    }

    /* broadcast the global convergence value so every rank can test it */
    (void) MPI_Bcast(&convergence,1,MPI_FLOAT,0,MPI_COMM_WORLD);
    if (convergence <= EPSILON) {
      break;
    }

    /* copy U_Next to U_Curr */
    for (j=my_start_row;j<=my_end_row;j++) {
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Curr[j-my_start_row][i] = U_Next[j-my_start_row][i];
      }
    }
    k++;
    MPI_Barrier(MPI_COMM_WORLD);
  }

  if (my_rank == ROOT) {
    time = MPI_Wtime() - time;
    printf("estimated time to convergence in %d iterations using %d processors on a %dx%d grid is %f seconds\n",
           k,p,(int)floor(WIDTH/H),(int)floor(HEIGHT/H),time);
  }

  /* Globalize Output */
  to_file(U_Curr,k,my_rank,meth);

  MPI_Finalize();

  exit(1);
  return 0;
}

void to_file (float ** current_ptr,int iteration,int my_rank,int meth) {
  int i,j,k,p;
  FILE *OUTPUT;
  char filename[20];
  MPI_Status status;
  MPI_Request request;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  sprintf(filename,"output/meth%d.%dp.pe%d.iter%d.out",meth,p,my_rank,iteration);
  OUTPUT = fopen(filename,"w");
  /* output rows */
  for (j=get_start(my_rank);j<=get_end(my_rank);j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      fprintf(OUTPUT,"%d %d %f\n",i,j,current_ptr[j-get_start(my_rank)][i]);
    }
  }
  fflush(OUTPUT);
}

float get_convergence_sqd (float ** current_ptr,float ** next_ptr,int rank) {
  int i,j,my_start,my_end,my_num_rows;
  float sum;

  my_start = get_start(rank);
  my_end = get_end(rank);
  my_num_rows = get_num_rows(rank);

  sum = 0.0;
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      sum += pow(next_ptr[global_to_local(rank,j)][i]-current_ptr[global_to_local(rank,j)][i],2);
    }
  }
  return sum;
}

void jacobi (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* Jacobi method using global addressing */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {

      next_ptr[j-my_start][i] = .25*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                                   - (pow(H,2)*f(i,j)));
      enforce_bc_par(next_ptr,my_rank,i,j);
    }
  }
}

void gauss_seidel (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  float W = 1.0;
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* solve next reds (i+j odd) */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 != 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }

  /* solve next blacks (i+j even) ... using next reds */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 == 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }
}

void sos (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  float W = 1.5;
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* solve next reds (i+j odd) */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 != 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }

  /* solve next blacks (i+j even) ... using next reds */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 == 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }
}

void enforce_bc_par (float ** domain_ptr,int rank,int i,int j) {
  /* enforce bc's first */
  if (i == ((int)floor(WIDTH/H/2)-1) && j == 0) {
    /* This is the heat source location */
    domain_ptr[j][i] = T_SRC0;
  } else if (i <= 0 || j <= 0 || i >= ((int)floor(WIDTH/H)-1) || j >= ((int)floor(HEIGHT/H)-1)) {
    /* All edges and beyond are set to 0.0 */
    domain_ptr[global_to_local(rank,j)][i] = 0.0;
  }
}

float get_val_par (float * above_ptr,float ** domain_ptr,float * below_ptr,int rank,int i,int j) {
  float ret_val;
  int p;

  MPI_Comm_size(MPI_COMM_WORLD,&p);

  /* enforce bc's first */
  if (i == ((int)floor(WIDTH/H/2)-1) && j == 0) {
    /* This is the heat source location */
    ret_val = T_SRC0;
  } else if (i <= 0 || j <= 0 || i >= ((int)floor(WIDTH/H)-1) || j >= ((int)floor(HEIGHT/H)-1)) {
    /* All edges and beyond are set to 0.0 */
    ret_val = 0.0;
  } else {
    /* Else, return value for matrix supplied or ghost rows */
    if (j < get_start(rank)) {
      if (rank == ROOT) {
        /* not interested in above ghost row */
        ret_val = 0.0;
      } else {
        ret_val = above_ptr[i];
        /*printf("%d: Used ghost (%d,%d) row from above = %f\n",rank,i,j,above_ptr[i]); fflush(stdout);*/
      }
    } else if (j > get_end(rank)) {
      if (rank == (p-1)) {
        /* not interested in below ghost row */
        ret_val = 0.0;
      } else {
        ret_val = below_ptr[i];
        /*printf("%d: Used ghost (%d,%d) row from below = %f\n",rank,i,j,below_ptr[i]); fflush(stdout);*/
      }
    } else {
      /* else, return the value in the domain asked for */
      ret_val = domain_ptr[global_to_local(rank,j)][i];
      /*printf("%d: Used real (%d,%d) row from self = %f\n",rank,i,global_to_local(rank,j),domain_ptr[global_to_local(rank,j)][i]); fflush(stdout);*/
    }
  }
  return ret_val;
}

void init_domain (float ** domain_ptr,int rank) {
  int i,j,start,end,rows;
  start = get_start(rank);
  end = get_end(rank);
  rows = get_num_rows(rank);
  for (j=start;j<end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      domain_ptr[j-start][i] = 0.0;
    }
  }
}

int get_start (int rank) {
  /* compute row divisions to each proc */
  int p,per_proc,start_row,remainder;
  MPI_Comm_size(MPI_COMM_WORLD,&p);
  /* get initial whole divisor */
  per_proc = (int)floor(HEIGHT/H)/p;
  /* get number of remaining */
  remainder = (int)floor(HEIGHT/H)%p;
  /* if there is a remainder, distribute it to the first "remainder" procs */
  if (rank < remainder) {
    start_row = rank * (per_proc + 1);
  } else {
    start_row = rank * (per_proc) + remainder;
  }
  return start_row;
}

int get_end (int rank) {
  /* compute row divisions to each proc */
  int p,per_proc,remainder,end_row;
  MPI_Comm_size(MPI_COMM_WORLD,&p);
  per_proc = (int)floor(HEIGHT/H)/p;
  remainder = (int)floor(HEIGHT/H)%p;
  if (rank < remainder) {
    end_row = get_start(rank) + per_proc;
  } else {
    end_row = get_start(rank) + per_proc - 1;
  }
  return end_row;
}

int get_num_rows (int rank) {
  return 1 + get_end(rank) - get_start(rank);
}

int global_to_local (int rank, int row) {

  return row - get_start(rank);
}

/*
 * f - function that would be non zero if there was an internal heat source
 */
float f (int i,int j) {
  return 0.0;
}

Appendix II: compile.sh script file

#!/bin/sh
#/*
# * Brett D. Estrade
# * CS 691 Fall 2004
# * Final Project
# *
# * Parallel implementation of 2d heat conduction
# * finite difference over a rectangular domain using:
# *  - Jacobi
# *  - Gauss-Seidel
# *  - SOS
# *
# * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# *          X - (W/2,H)
# *   *******X*******
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*   ~ all bdy 0 except "X" at (W/2,H)
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   ***************
# *
# *   2D domain - WIDTH x HEIGHT
# *   "X" = T_SRC0
# *   "*" = 0.0
# *   "." = internal node susceptible to heating
# * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# */

if [ ! -d output ]; then
  mkdir output
else
  rm output/*
fi

mpicc -lm bde_2dheat.c -o a.out
mpirun -np ${1} -machinefile machinefile a.out

echo globalizing output to $$.${1}p.out
cat output/* > $$.${1}p.out

exit 0


AMath 483/583 Lecture 24. Notes: Notes: Steady state diffusion. Notes: Finite difference method. Outline: AMath 483/583 Lecture 24 Outline: Heat equation and discretization OpenMP and MPI for iterative methods Jacobi, Gauss-Seidel, SOR Notes and Sample codes: Class notes: Linear algebra software $UWHPSC/codes/openmp/jacobi1d_omp1.f90

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms )

CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms ) CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2012 2D Heat Diffusion

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

AMath 483/583 Lecture 24

AMath 483/583 Lecture 24 AMath 483/583 Lecture 24 Outline: Heat equation and discretization OpenMP and MPI for iterative methods Jacobi, Gauss-Seidel, SOR Notes and Sample codes: Class notes: Linear algebra software $UWHPSC/codes/openmp/jacobi1d_omp1.f90

More information

Programming with MPI. Pedro Velho

Programming with MPI. Pedro Velho Programming with MPI Pedro Velho Science Research Challenges Some applications require tremendous computing power - Stress the limits of computing power and storage - Who might be interested in those applications?

More information

Introduction to MPI: Part II

Introduction to MPI: Part II Introduction to MPI: Part II Pawel Pomorski, University of Waterloo, SHARCNET ppomorsk@sharcnetca November 25, 2015 Summary of Part I: To write working MPI (Message Passing Interface) parallel programs

More information

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S1 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-3 Report S1 (2/2) Develop parallel program

More information

Part One: The Files. C MPI Slurm Tutorial - Hello World. Introduction. Hello World! hello.tar. The files, summary. Output Files, summary

Part One: The Files. C MPI Slurm Tutorial - Hello World. Introduction. Hello World! hello.tar. The files, summary. Output Files, summary C MPI Slurm Tutorial - Hello World Introduction The example shown here demonstrates the use of the Slurm Scheduler for the purpose of running a C/MPI program. Knowledge of C is assumed. Having read the

More information

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:... ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure

More information

Report S1 C. Kengo Nakajima

Report S1 C. Kengo Nakajima Report S1 C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel Computing (3747-111) Problem S1-1 Report S1 Read local

More information

Point-to-Point Communication. Reference:

Point-to-Point Communication. Reference: Point-to-Point Communication Reference: http://foxtrot.ncsa.uiuc.edu:8900/public/mpi/ Introduction Point-to-point communication is the fundamental communication facility provided by the MPI library. Point-to-point

More information

Distributed Memory Programming with Message-Passing

Distributed Memory Programming with Message-Passing Distributed Memory Programming with Message-Passing Pacheco s book Chapter 3 T. Yang, CS240A Part of slides from the text book and B. Gropp Outline An overview of MPI programming Six MPI functions and

More information

Assignment 3 MPI Tutorial Compiling and Executing MPI programs

Assignment 3 MPI Tutorial Compiling and Executing MPI programs Assignment 3 MPI Tutorial Compiling and Executing MPI programs B. Wilkinson: Modification date: February 11, 2016. This assignment is a tutorial to learn how to execute MPI programs and explore their characteristics.

More information

Introduction to MPI. Table of Contents

Introduction to MPI. Table of Contents 1. Program Structure 2. Communication Model Topology Messages 3. Basic Functions 4. Made-up Example Programs 5. Global Operations 6. LaPlace Equation Solver 7. Asynchronous Communication 8. Communication

More information

MPI Lab. Steve Lantz Susan Mehringer. Parallel Computing on Ranger and Longhorn May 16, 2012

MPI Lab. Steve Lantz Susan Mehringer. Parallel Computing on Ranger and Longhorn May 16, 2012 MPI Lab Steve Lantz Susan Mehringer Parallel Computing on Ranger and Longhorn May 16, 2012 1 MPI Lab Parallelization (Calculating p in parallel) How to split a problem across multiple processors Broadcasting

More information

CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus

CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus 1. Purpose In this programming assignment, we will parallelize a sequential version

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalabissu-tokyoacjp/~reiji/pna16/ [ 5 ] MPI: Message Passing Interface Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1 Architecture

More information

Parallel Programming Using MPI

Parallel Programming Using MPI Parallel Programming Using MPI Short Course on HPC 15th February 2019 Aditya Krishna Swamy adityaks@iisc.ac.in SERC, Indian Institute of Science When Parallel Computing Helps? Want to speed up your calculation

More information

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network

More information

Collective Communication in MPI and Advanced Features

Collective Communication in MPI and Advanced Features Collective Communication in MPI and Advanced Features Pacheco s book. Chapter 3 T. Yang, CS240A. Part of slides from the text book, CS267 K. Yelick from UC Berkeley and B. Gropp, ANL Outline Collective

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer

CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer Rochester Institute of Technology, Department of Computer Engineering Instructor: Dr. Shaaban (meseec@rit.edu) TAs: Jason Lowden

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

CSE 160 Lecture 15. Message Passing

CSE 160 Lecture 15. Message Passing CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Assignment 2 Using Paraguin to Create Parallel Programs

Assignment 2 Using Paraguin to Create Parallel Programs Overview Assignment 2 Using Paraguin to Create Parallel Programs C. Ferner and B. Wilkinson Minor clarification Oct 11, 2013 The goal of this assignment is to use the Paraguin compiler to create parallel

More information

HIGH PERFORMANCE SCIENTIFIC COMPUTING

HIGH PERFORMANCE SCIENTIFIC COMPUTING ( HPSC 5576 ELIZABETH JESSUP ) HIGH PERFORMANCE SCIENTIFIC COMPUTING :: Homework / 8 :: Student / Florian Rappl 1 problem / 10 points Problem 1 Task: Write a short program demonstrating the use of MPE's

More information

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001)

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001) An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (000/001) Summary The objectives of this project were as follows: 1) Investigate iterative

More information

CMSC 714 Lecture 3 Message Passing with PVM and MPI

CMSC 714 Lecture 3 Message Passing with PVM and MPI Notes CMSC 714 Lecture 3 Message Passing with PVM and MPI Alan Sussman To access papers in ACM or IEEE digital library, must come from a UMD IP address Accounts handed out next week for deepthought2 cluster,

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Lecture 6: Parallel Matrix Algorithms (part 3)

Lecture 6: Parallel Matrix Algorithms (part 3) Lecture 6: Parallel Matrix Algorithms (part 3) 1 A Simple Parallel Dense Matrix-Matrix Multiplication Let A = [a ij ] n n and B = [b ij ] n n be n n matrices. Compute C = AB Computational complexity of

More information

Distributed Memory Programming with MPI

Distributed Memory Programming with MPI Distributed Memory Programming with MPI Part 1 Bryan Mills, PhD Spring 2017 A distributed memory system A shared memory system Identifying MPI processes n Common pracace to idenafy processes by nonnegaave

More information