2D Heat Distribution Prediction in Parallel


Brett D. Estrade
CS 691 Fall 2004 Final Project
Instructor: Dr. Joe Zhang

Abstract

Prediction of the temperature distribution in a 2-dimensional domain, using an iterative finite difference method to solve the governing partial differential equation, is a computational problem that can benefit greatly from parallelization. This is especially important over large domains with many nodes, because each iterative sweep has a complexity of at least O(n^2), so the serial time to solution grows at least quadratically as the number of nodes increases. Fortunately, many finite difference methods are highly parallelizable and can be programmed to take advantage of their specific task dependencies. This project investigates three iterative finite difference methods: Jacobi, Gauss-Seidel, and SOS. It implements them in parallel using MPI and examines their accuracy, speed up, and efficiency. Optimizing the parallel communications was not within the scope of this project, so all three methods use a simple block-row partitioning scheme with ghost row communication.

Table of Contents

1. Introduction
   a. Problem description
   b. Jacobi method
   c. Gauss-Seidel method
   d. SOS method
2. Parallel Algorithm Design
   a. Details
   b. Row distribution and task allocation
   c. Communications
3. Program Specifics
   a. Overview
   b. Method implementation
   c. Boundary condition enforcement
   d. User interface
   e. Output files
4. Verification and Performance Analysis
   a. Details
   b. Method comparisons
   c. Speed up and efficiency
5. Conclusions
6. Credits and Resources
Appendix I: bde_2dheat.c source code
Appendix II: compile.sh script file

1. Introduction

a. Problem description

This project required the prediction of the distribution of heat in a 2-dimensional domain using three different finite difference methods. The partial differential equation (PDE) that governs steady-state heat conduction in a 2D domain is given by:

    u_xx + u_yy = f(x,y)

In order to apply finite difference methods to this PDE, it must be approximated by a system of algebraic equations. This requires that the domain be decomposed into a rectangular mesh with evenly spaced nodes in both directions. Once the domain is decomposed into uniformly distributed discrete nodes, the finite difference methods can be applied. The methods used are iterative, so there must be a test for convergence that indicates that the heat distribution has reached equilibrium. Convergence is tested using the following criterion:

    ||u^(k+1) - u^(k)||_2 < ε

where u^(k+1) is the nodal solution for the latest iteration, u^(k) is the nodal solution for the immediately preceding iteration, and ε is the convergence tolerance. This test is implemented by summing the squared difference between the current and new value at each node and then taking the square root of the total sum; pseudo code for this test is illustrated below:

    sum = 0.0;
    for (i = 0; i < num_cols; i++) {
      for (j = 0; j < num_rows; j++) {
        sum = sum + (U_Next[i][j] - U_Curr[i][j])^2;
      }
    }
    L2_Norm = sqrt(sum);

where U_Curr represents u^(k) and U_Next represents u^(k+1). This quantity approaches 0 as the iterations proceed, but practical application requires the user to set a tolerance that signals a value low enough to indicate that the solution has reached equilibrium. This tolerance is signified by ε and was fixed for this project. Since there are no internal heat sources for this problem, f(x,y) = 0. The boundaries are fixed at 0, and the sole heat source is located at x = length/2 and y = height. The heat distribution is calculated iteratively until it reaches equilibrium within the tolerance ε.
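As a minimal, self-contained sketch of this convergence test (not the project's exact code; the row-major float** layout and the grid dimensions are assumptions), the L2 norm of the change between two iterations could be computed as:

    #include <math.h>

    /* Sketch: L2 norm of the change between two iterates of the grid. */
    float l2_norm(float **U_Curr, float **U_Next, int num_rows, int num_cols)
    {
      int i, j;
      float sum = 0.0f;
      for (j = 0; j < num_rows; j++) {
        for (i = 0; i < num_cols; i++) {
          float d = U_Next[j][i] - U_Curr[j][i];
          sum += d * d;               /* accumulate squared differences */
        }
      }
      return (float) sqrt(sum);       /* compare this against epsilon */
    }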

[Figure: the 2D domain governed by u_xx + u_yy = f(x,y), with all boundaries held at 0 and the heat source T_0 at the top center.]

Additionally, these methods are not time dependent; if the heat source did vary with time, using these methods would require full convergence at each time step. The following iterative methods were implemented in both serial and parallel.

b. Jacobi method

To compute the next value of a particular node, this method essentially takes the average of the four surrounding nodes as computed in the previous iteration. If there are any internal heat sources, f(i,j) returns non-zero values for the relevant nodes. The formulation is given by:

    u^(k+1)_i,j = (1/4) * (u^(k)_i-1,j + u^(k)_i+1,j + u^(k)_i,j-1 + u^(k)_i,j+1 - h^2 f_i,j)

Implemented as a serial algorithm, the required nested for-loop has a complexity of O(n^2), since every node in the domain must be computed. Although the computation at each node depends on the previous values of its four neighbors, the task dependency is 1, meaning an entire domain could be calculated in a single step if there were enough processors to assign one node per processor. The task dependency is illustrated below. The X's represent boundary nodes set to 0, T_0 is the heat source, and the 1 labels indicate that each node depends only on values from the previous iteration.

[Figure: task-dependency grid for the Jacobi method; boundary nodes are X, T_0 marks the heat source, and every interior node is labeled 1.]
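A minimal serial sketch of one Jacobi sweep over the interior nodes (a hypothetical helper rather than the parallel jacobi() of Appendix I; the square grid size n, the array layout, and the source array f are assumptions):

    /* Sketch: one serial Jacobi sweep. u_old holds u^(k), u_new receives u^(k+1);
       f is the source term, h the node spacing. Boundary nodes are re-imposed elsewhere. */
    void jacobi_sweep(float **u_old, float **u_new, float **f, int n, float h)
    {
      int i, j;
      for (j = 1; j < n - 1; j++) {
        for (i = 1; i < n - 1; i++) {
          u_new[j][i] = 0.25f * (u_old[j][i-1] + u_old[j][i+1]
                               + u_old[j-1][i] + u_old[j+1][i]
                               - h * h * f[j][i]);
        }
      }
    }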

The parallel algorithm is exactly the same; it requires access to the previous value of the node being solved and of its four neighbors. Optimally, each node would be calculated on a single processor. Communication calls associated with getting the values of the four neighboring nodes can be reduced to a single communication, thus requiring N communications per iteration on an N-node domain.

c. Gauss-Seidel method

This method works by using a red and black node designation scheme. A node is considered red when i + j is odd, and black when i + j is even. Since computation of the black nodes requires the newest values of the red nodes, there is a much higher task dependency than with Jacobi. This method, however, converges much faster than Jacobi, and is therefore a viable choice despite this dependency. The task dependency for each node is illustrated below. The X's represent boundary nodes set to 0, T_0 is the heat source, and the number represents the task dependency of the node if the dependent black nodes are computed concurrently with the red nodes.

[Figure: red/black task-dependency grid; boundary nodes are X, T_0 marks the heat source, and interior nodes are numbered by dependency.]

This method is given in two steps. First, the red values are calculated:

    u^(k+1)_i,j = u^(k)_i,j + (W/4) * (u^(k)_i-1,j + u^(k)_i+1,j + u^(k)_i,j-1 + u^(k)_i,j+1 - 4u^(k)_i,j - h^2 f_i,j)

Then, the black values are calculated:

    u^(k+1)_i,j = u^(k)_i,j + (W/4) * (u^(k+1)_i-1,j + u^(k+1)_i+1,j + u^(k+1)_i,j-1 + u^(k+1)_i,j+1 - 4u^(k)_i,j - h^2 f_i,j)

The red values are calculated much like those in Jacobi, using the values obtained in the previous iteration. The black values are calculated using the updated red values, which means that each red value must be computed before any black nodal calculation requires it. While black nodes can be computed concurrently once their required red nodes are available, this property does not reduce the complexity of the serial algorithm, because each node must still be computed individually. An optimized parallel algorithm could take advantage of the task dependency by computing the red nodes required by the black nodes concurrently. While this scheme is not as ideal as Jacobi's, it is still highly parallelizable and converges much faster. The W in both formulations is a relaxation factor that scales the update. For Gauss-Seidel, W = 1.
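A minimal serial sketch of one red/black sweep (a hypothetical in-place helper, unlike the two-array parallel code in Appendix I; the grid size n, array layout, source array f, and relaxation factor w are assumptions):

    /* Sketch: one serial red/black relaxation sweep with relaxation factor w.
       Reds (i+j odd) are updated first from old values; blacks (i+j even) then
       use the freshly updated reds, since the update is done in place. */
    void red_black_sweep(float **u, float **f, int n, float h, float w)
    {
      int i, j, parity;
      for (parity = 1; parity >= 0; parity--) {   /* 1 = red, 0 = black */
        for (j = 1; j < n - 1; j++) {
          for (i = 1; i < n - 1; i++) {
            if ((i + j) % 2 != parity)
              continue;
            u[j][i] += (w / 4.0f) * (u[j][i-1] + u[j][i+1] + u[j-1][i] + u[j+1][i]
                                     - 4.0f * u[j][i] - h * h * f[j][i]);
          }
        }
      }
    }

With w = 1 this reduces to Gauss-Seidel; with 1 < w < 2 it is the over-relaxed (SOS) variant described in the next section.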

d. SOS method

The SOS method is exactly the same as the Gauss-Seidel method, but converges faster because of the following constraint on W:

    1 < W < 2

2. Parallel Algorithm Design

a. Details

None of these methods were implemented in their optimized forms. Instead, they all use the same block partitioning communication scheme. Additionally, the Gauss-Seidel and SOS methods were implemented in a manner that first computes all of the red nodes, then all of the black nodes. Because of this, all three methods showed similar speed up and efficiency as the number of processors was increased. Since all three methods had to be implemented in both serial and parallel, the domain decomposition and communications were designed so that they would work for both. The steps are:

1) Rows are assigned equally to each process (block row partitioning).
2) Ghost rows are exchanged at the start of each iteration.
3) The solution method iterates until the convergence criterion is met (same as the serial algorithm).

Additionally, each processor references all nodes using a global addressing scheme. This was very helpful because it minimized the need to convert between local and global addresses on each processor. A function called global_to_local translated between global and local addressing when needed. This scheme also required several other helper functions to determine, on the fly, the number of rows on each processor and the global addresses of the first and last row on each processor. This created a slight overhead, but helped minimize the number of MPI calls. While this scheme might seem simplistic, it is very relevant for this type of exercise because it is straightforward to implement and test.

b. Row distribution

Rows are distributed so that each processor gets an even share of rows in order to balance the computational load, as sketched below. This is called block row partitioning, and no process will have more than one row more than any other process. Additionally, rows are assigned in adjacent groups to adjacent processes. This helps minimize communication costs and allows for the creation of a virtual global matrix.
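A minimal sketch of this block-row partitioning arithmetic (modeled on, but not identical to, the get_start/get_end helpers in Appendix I; nrows is the global number of rows and p the number of processes):

    /* Sketch: block-row partitioning. The first (nrows % p) ranks get one extra row,
       so no process holds more than one row more than any other. */
    int block_start(int rank, int nrows, int p)
    {
      int per_proc  = nrows / p;      /* whole rows per process */
      int remainder = nrows % p;      /* leftover rows go to the first ranks */
      if (rank < remainder)
        return rank * (per_proc + 1);
      return rank * per_proc + remainder;
    }

    int block_end(int rank, int nrows, int p)    /* inclusive last row of this rank */
    {
      int per_proc  = nrows / p;
      int remainder = nrows % p;
      return block_start(rank, nrows, p) + per_proc - (rank < remainder ? 0 : 1);
    }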

c. Communications

Parallel communications are handled using ghost rows. At the beginning of each iteration, a process receives the row immediately above and below its own block of rows. These ghost rows are communicated using a non-blocking MPI_Isend and a blocking MPI_Recv, which ensures that deadlocks do not occur. The root process, which contains the top rows, does not need a ghost row from above, and the last process, which contains the bottom rows, does not need a ghost row from below. Because of this, the total number of MPI_Isend calls is given by:

    Comm_sendrecv = I * 2(P - 1)

where I is the total number of iterations and P is the total number of PEs. Additionally, convergence is calculated once per iteration, so the total number of MPI_Reduce contributions is:

    Comm_reduce = I * (P - 1)

Therefore, the total amount of active communications (i.e., sent data) is:

    Comm_total = 3I * (P - 1)

This does not take into account the additional overhead incurred in each MPI_Isend of the ghost rows as the number of nodes in each row increases. Lastly, when an acceptable convergence is achieved, the iterative loop is immediately exited, MPI_Finalize is called to close all communications, and each processor dumps the values of its rows into a separate file. The files can easily be globalized by concatenating them in rank order using a utility such as cat. An example of this process can be seen as part of the program compilation script in Appendix II.

3. Program Specifics

a. Details

This program was implemented in C, and development time was split between a dual-processor x86 machine running Fedora Core 2 Linux and an older model Dell Inspiron laptop running FreeBSD. The MPICH implementation of the MPI standard was used on both platforms. The resulting code can be compiled and executed on both platforms, and has also been tested on an IBM SP distributed memory cluster. The serial and parallel implementations were exactly the same, and at no point is there any globalization of the solution array. The virtual global array was facilitated by communicating the ghost rows discussed in section 2c at the beginning of each iteration. This greatly minimized communications and simplified the implementation of the parallel communication scheme.
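A condensed sketch of the ghost-row exchange described in section 2c (adapted from the fuller per-method versions in Appendix I; the buffer names, message tags, and the use of MPI_Waitall instead of barriers are assumptions of this sketch):

    #include "mpi.h"

    /* Sketch: exchange ghost rows with the ranks above and below.
       top_row/bottom_row are this rank's boundary rows of length ncols;
       ghost_above/ghost_below receive the neighbors' rows. */
    void exchange_ghost_rows(float *top_row, float *bottom_row,
                             float *ghost_above, float *ghost_below,
                             int ncols, int rank, int p)
    {
      MPI_Request reqs[2];
      MPI_Status  status;
      int nreq = 0;

      /* send my bottom row down, receive a ghost row from above */
      if (rank < p - 1)
        MPI_Isend(bottom_row, ncols, MPI_FLOAT, rank + 1, 0, MPI_COMM_WORLD, &reqs[nreq++]);
      if (rank > 0)
        MPI_Recv(ghost_above, ncols, MPI_FLOAT, rank - 1, 0, MPI_COMM_WORLD, &status);

      /* send my top row up, receive a ghost row from below */
      if (rank > 0)
        MPI_Isend(top_row, ncols, MPI_FLOAT, rank - 1, 1, MPI_COMM_WORLD, &reqs[nreq++]);
      if (rank < p - 1)
        MPI_Recv(ghost_below, ncols, MPI_FLOAT, rank + 1, 1, MPI_COMM_WORLD, &status);

      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* complete the non-blocking sends */
    }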

Care was taken to produce reusable code, although not all communications were encapsulated inside their own functions (the ghost row exchange, for example). The finite difference method implementations, boundary condition enforcement, and nodal value requests were all self-contained so that different boundary conditions, initial conditions, internal heat sources, and finite difference methods could easily be added.

b. Method implementation

The Jacobi method is implemented as illustrated in the pseudo code below:

    while (!converged(U_Curr, U_Next)) {
      if (!ROOT) { above_row = get_above(U_Curr); }
      if (!LAST) { below_row = get_below(U_Curr); }
      /* update U_Curr, which holds the solution from the previous iteration */
      U_Curr = U_Next;
      jacobi(U_Curr, U_Next, above_row, below_row);
    }

where above_row is the ghost row from above, below_row is the ghost row from below, U_Curr represents u^(k), and U_Next represents u^(k+1). The Gauss-Seidel and SOS methods were implemented in the following manner:

    while (!converged(U_Curr, U_Next)) {
      /* update U_Curr, which holds the solution from the previous iteration */
      U_Curr = U_Next;
      if (!ROOT) { above_row = get_above(U_Curr); }
      if (!LAST) { below_row = get_below(U_Curr); }
      /* solve for red values using values from the previous iteration */
      solve_red(U_Curr, U_Next, above_row, below_row);
      /* solve for black values using the red values from this iteration */
      solve_black(U_Next, U_Next, above_row, below_row);
    }

where above_row is the ghost row from above, below_row is the ghost row from below, U_Curr represents u^(k), and U_Next represents u^(k+1).

Global convergence was calculated using an MPI_Reduce call with the MPI_SUM reduction operation, which added up all of the local squared-difference sums and sent the result to the root process. The root process then took the square root of this sum to get the global convergence value. This value was compared to ε, and the simulation ended when it was less than or equal to ε.
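A minimal sketch of this global convergence step (an illustration, not the exact code in Appendix I; local_sum_sqd is assumed to be each rank's sum of squared nodal differences):

    #include <math.h>
    #include "mpi.h"

    /* Sketch: combine the per-rank squared-difference sums into a global L2 norm
       and share it so that every rank can test it against epsilon. */
    float global_convergence(float local_sum_sqd, int rank)
    {
      float global_sum_sqd = 0.0f;
      float norm = 0.0f;

      MPI_Reduce(&local_sum_sqd, &global_sum_sqd, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
        norm = (float) sqrt(global_sum_sqd);

      /* broadcast the result so all ranks exit the iteration loop together */
      MPI_Bcast(&norm, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
      return norm;
    }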

c. Boundary condition enforcement

Boundary conditions were enforced using a two-fold strategy. The first part of this strategy was to provide a function called get_val_par to access array values instead of accessing them directly. This encapsulated the use of ghost rows for the parallel run and allowed much of the same code to be used for both the serial and parallel runs. This function is used by each of the finite difference methods to get the values required to compute each node. For example, here is an illustration of its use in the Jacobi implementation:

    for (j = my_start; j <= my_end; j++) {
      for (i = 0; i < (int)floor(WIDTH/H); i++) {
        next_ptr[j-my_start][i] = .25 * (get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i-1, j)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i+1, j)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i, j-1)
                                       + get_val_par(U_Curr_Above, current_ptr, U_Curr_Below, my_rank, i, j+1)
                                       - (pow(H,2) * f(i,j)));
        enforce_bc_par(next_ptr, my_rank, i, j);
      }
    }

where U_Curr_Above is the ghost row from above, U_Curr_Below is the ghost row from below, H is the spacing of the nodes in all directions, my_rank is the processor's rank, i and j are the node coordinates, current_ptr represents u^(k), and next_ptr represents u^(k+1).

When get_val_par is called, the corrected value for the requested node is returned depending on whether it is the heat source (T_0), a boundary node (T = 0), in a ghost row, or in the processor's own rows. The second part of the boundary condition enforcement is a function called enforce_bc_par, which sets next_ptr[i][j] back to its correct value if it lies on the boundary. Boundary condition enforcement is done after the computation in order to better abstract the handling of boundary nodes. If there were internal heat sources, these nodes would also be accounted for within this strategy. Approaching boundary condition enforcement this way allows for more flexibility in the types of boundary conditions that can be applied. This is a very simple case, and it is tempting to hard-code the conditions in; however, doing so would not allow for easy code reuse.
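A compact serial analogue of this two-fold access strategy (a sketch with assumed grid dimensions and source temperature, not the parallel get_val_par/enforce_bc_par pair of Appendix I):

    /* Sketch: nodal accessor that hides boundary handling from the solver.
       width/height are the grid dimensions in nodes; t_src is the source temperature. */
    float get_val(float **u, int i, int j, int width, int height, float t_src)
    {
      if (j == 0 && i == width / 2 - 1)
        return t_src;                      /* the single heat source on the top edge */
      if (i <= 0 || j <= 0 || i >= width - 1 || j >= height - 1)
        return 0.0f;                       /* every other edge is held at 0 */
      return u[j][i];                      /* interior node: the stored value */
    }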

d. User interface

The interface allows the user to select one of the three finite difference methods and whether to print the global convergence for each iteration. Once the user selects the finite difference method, the selection is broadcast to all processes, and a function pointer, void (*method)();, is pointed at the function that implements the selected method:

    /* broadcast method to use */
    (void) MPI_Bcast(&meth, 1, MPI_INT, 0, MPI_COMM_WORLD);
    switch (meth) {
      case 1: method = &jacobi;       break;
      case 2: method = &gauss_seidel; break;
      case 3: method = &sos;          break;
    }

where meth is the variable that contains the method selected by the user, &jacobi is a pointer to the Jacobi implementation, &gauss_seidel is a pointer to the Gauss-Seidel implementation, and &sos is a pointer to the SOS implementation.

e. Output files

Each processor writes its own output file once the finite difference method converges. It outputs the values of its rows using the global addressing scheme, so it is easy to globalize all of the output files into one. Using the cat command, file globalization is accomplished by issuing the following command:

    % cat output/* > $$.Xp.out

where output/* refers to all of the per-processor output files in the output directory, > redirects the output of the cat command to a file called $$.Xp.out, $$ is the process id (PID) of the shell or shell script executing the command, and X is the number of processors used for the calculation.

4. Verification and Performance Analysis

a. Details

All performance tests were conducted on 8 dual-processor 2.6 GHz 32-bit Intel Xeon machines, totaling 16 processors in all. The parallel computing platform containing these computers was a 100 Mb/s switched Ethernet local area network.

Globalized output files were validated for both the serial and parallel implementations of each method, and were found to be identical for all methods on domains of 10x10, 50x50, and 100x100.

b. Method comparison

Each method was compared, and the results were all very close. The following table shows a sample of the results for rows 1 and 2 computed on a 10x10 domain containing 100 nodes:

[Table: x, y, and the corresponding Jacobi, Gauss-Seidel, and SOS nodal values for rows 1 and 2.]

As the table shows, all three methods agree to the thousandths place. Additional comparisons show the number of iterations needed to converge and how long each method took to converge when run in serial:

[Chart: iterations to converge for Jacobi, G-S, and SOS on the 50x50 and 100x100 domains.]

[Chart: serial time to converge, in seconds, for Jacobi, G-S, and SOS on the 50x50 and 100x100 domains.]

SOS was clearly the fastest-converging finite difference method in terms of both serial time and the number of iterations.

c. Speed up and efficiency

Performance evaluations for speed up and efficiency were performed for each of the methods on the 50x50 and 100x100 domains. The results are as follows:

[Chart: speed up versus number of processors on the 50x50 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: efficiency versus number of processors on the 50x50 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: speed up versus number of processors on the 100x100 node domain for Jacobi, Gauss-Seidel, and SOS.]

[Chart: efficiency versus number of processors on the 100x100 node domain for Jacobi, Gauss-Seidel, and SOS.]
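The plots use the conventional definitions of speed up and efficiency (stated here as an assumption, since the report does not define them explicitly):

    S(p) = T_serial / T_parallel(p)
    E(p) = S(p) / p

where T_serial is the single-processor time to convergence and T_parallel(p) is the time to convergence on p processors.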

5. Conclusions

Of the three methods, SOS converged the fastest. The parallel implementation of each method shows a slight speed up at low processor counts, but this speed up quickly tapers off, as the efficiency plots show. Speed up and efficiency could be greatly improved by taking advantage of the inherent task dependencies of each method. While Jacobi is the most highly parallelizable, even a fully optimized Jacobi implementation may still not converge fast enough to beat an optimized Gauss-Seidel or SOS implementation; the only way to know for sure is to create optimized versions of each method and repeat the performance comparisons. The similarity of the speed up and efficiency plots makes it clear that the same domain decomposition and communication scheme was used for all three finite difference methods.

One of the other goals of the project was to write the code in a way that encouraged reuse. This goal was not perfectly achieved, but the code does allow boundary conditions to be changed and finite difference methods to be added easily. Ideally, all parallel communications, domain decomposition, finite difference implementations, and boundary condition enforcement would be hidden behind clean interfaces, but this code provides only some of these features. Improving the code in this direction would not take much effort and would greatly increase its usefulness as an example of a parallel finite difference implementation.

6. Credits and Resources

Personal communications
MPICH Homepage:
FreeBSD Homepage:
Fedora Homepage:

Appendix I: bde_2dheat.c source code

/*
 * Brett D. Estrade
 * CS 691 Fall 2004
 * Final Project
 *
 * Parallel implementation of 2d heat conduction
 * finite difference over a rectangular domain using:
 *  - Jacobi
 *  - Gauss-Seidel
 *  - SOS
 *
 * The communication scheme uses shared, or "ghost", rows
 * that are used by adjacent processes. This scheme shows
 * linear speed up when the ratio #rows/#procs is close to 1,
 * but performance degrades consistently as the row distribution
 * gets closer to 1 row per processor. This is due to communication
 * overhead. The point here is that the communication scheme is not
 * optimized for any of the methods used here.
 *
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *          X - (W/2,H)
 *   *******X*******
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*   ~ all bdy 0 except "X" at (W/2,H)
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   *.............*
 *   ***************
 *
 *   2D domain - WIDTH x HEIGHT
 *   "X" = T_SRC0
 *   "*" = 0.0
 *   "." = internal node susceptible to heating
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 */

#define WIDTH 25
#define HEIGHT 25
#define H 1.0
#define EPSILON
#define ITERMAX 1000
#define T_SRC0
#define ROOT 0

/* Includes */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

int get_start (int rank);
int get_end (int rank);
int get_num_rows (int rank);
void init_domain (float **domain_ptr, int rank);
void jacobi (float **current_ptr, float **next_ptr);
void gauss_seidel (float **current_ptr, float **next_ptr);
void sos (float **current_ptr, float **next_ptr);
float get_val_par (float *above_ptr, float **domain_ptr, float *below_ptr, int rank, int i, int j);
void enforce_bc_par (float **domain_ptr, int rank, int i, int j);
int global_to_local (int rank, int row);
float f (int i, int j);
float get_convergence_sqd (float **current_ptr, float **next_ptr, int rank);
void to_file (float **current_ptr, int iteration, int my_rank, int meth);

/* Function pointer to solver method of choice */
void (*method) ();

int main(int argc, char** argv) {
  int p, my_rank;
  /* arrays used to contain each PE's rows - specify cols, no need to spec rows */
  float **U_Curr;
  float **U_Next;
  /* helper variables */
  float convergence, convergence_sqd, local_convergence_sqd;
  /* available iterators */
  int i, j, k, m, n;
  int meth, show_conv, per_proc, remainder, my_start_row, my_end_row, my_num_rows;
  double time;

  /* initialize mpi stuff */
  MPI_Init(&argc, &argv);
  /* get number of procs */
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  /* get rank of current process */
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == ROOT) {
    printf("What finite difference method do you want to use?\n1) Jacobi\n2) Gauss-Seidel\n3) SOS\n");
    scanf("%d",&meth);
    printf("Show convergence?\n1) yes\n2) no\n");
    scanf("%d",&show_conv);
  }

  /* wait for user to input runtime params */
  MPI_Barrier(MPI_COMM_WORLD);

  /* broadcast terminal output option to use */
  (void) MPI_Bcast(&show_conv,1,MPI_INT,0,MPI_COMM_WORLD);

  /* broadcast method to use */
  (void) MPI_Bcast(&meth,1,MPI_INT,0,MPI_COMM_WORLD);
  switch (meth) {
    case 1: method = &jacobi; break;
    case 2: method = &gauss_seidel; break;
    case 3: method = &sos; break;
  }

  /* let each processor decide what row(s) it owns */
  my_start_row = get_start(my_rank);
  my_end_row = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);
  printf("proc %d contains (%d) rows %d to %d\n",my_rank,my_num_rows,my_start_row,my_end_row);
  fflush(stdout);

  /* allocate 2d array */
  U_Curr = (float**)malloc(sizeof(float*)*my_num_rows);
  U_Curr[0] = (float*)malloc(sizeof(float)*my_num_rows*(int)floor(WIDTH/H));
  for (i=1;i<my_num_rows;i++) {
    U_Curr[i] = U_Curr[i-1]+(int)floor(WIDTH/H);
  }

  /* allocate 2d array */
  U_Next = (float**)malloc(sizeof(float*)*my_num_rows);
  U_Next[0] = (float*)malloc(sizeof(float)*my_num_rows*(int)floor(WIDTH/H));
  for (i=1;i<my_num_rows;i++) {
    U_Next[i] = U_Next[i-1]+(int)floor(WIDTH/H);
  }

  /* initialize global grid */
  init_domain(U_Curr,my_rank);
  init_domain(U_Next,my_rank);

  /* iterate for solution */
  if (my_rank == ROOT) {
    time = MPI_Wtime();
  }
  k = 1;
  while (1) {
    method(U_Curr,U_Next);

    local_convergence_sqd = get_convergence_sqd(U_Curr,U_Next,my_rank);
    MPI_Reduce(&local_convergence_sqd,&convergence_sqd,1,MPI_FLOAT,MPI_SUM,ROOT,MPI_COMM_WORLD);
    if (my_rank == ROOT) {
      convergence = sqrt(convergence_sqd);
      if (show_conv == 1) {
        printf("L2 = %f\n",convergence);
      }
    }

    /* broadcast the global convergence value so every rank can test it */
    (void) MPI_Bcast(&convergence,1,MPI_FLOAT,0,MPI_COMM_WORLD);
    if (convergence <= EPSILON) {
      break;
    }

    /* copy U_Next to U_Curr */
    for (j=my_start_row;j<=my_end_row;j++) {
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Curr[j-my_start_row][i] = U_Next[j-my_start_row][i];
      }
    }
    k++;
    MPI_Barrier(MPI_COMM_WORLD);
  }

  if (my_rank == ROOT) {
    time = MPI_Wtime() - time;
    printf("estimated time to convergence in %d iterations using %d processors on a %dx%d grid is %f seconds\n",
           k,p,(int)floor(WIDTH/H),(int)floor(HEIGHT/H),time);
  }

  /* Globalize Output */
  to_file(U_Curr,k,my_rank,meth);

  MPI_Finalize();

  exit(1);
  return 0;
}

void to_file (float ** current_ptr,int iteration,int my_rank,int meth) {
  int i,j,k,p;
  FILE *OUTPUT;
  char filename[20];
  MPI_Status status;
  MPI_Request request;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  sprintf(filename,"output/meth%d.%dp.pe%d.iter%d.out",meth,p,my_rank,iteration);
  OUTPUT = fopen(filename,"w");
  /* output rows */
  for (j=get_start(my_rank);j<=get_end(my_rank);j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      fprintf(OUTPUT,"%d %d %f\n",i,j,current_ptr[j-get_start(my_rank)][i]);
    }
  }
  fflush(OUTPUT);
}

float get_convergence_sqd (float ** current_ptr,float ** next_ptr,int rank) {
  int i,j,my_start,my_end,my_num_rows;
  float sum;

  my_start = get_start(rank);
  my_end = get_end(rank);
  my_num_rows = get_num_rows(rank);

  sum = 0.0;
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      sum += pow(next_ptr[global_to_local(rank,j)][i]-current_ptr[global_to_local(rank,j)][i],2);
    }
  }
  return sum;
}

void jacobi (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* Jacobi method using global addressing */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {

      next_ptr[j-my_start][i] = .25*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                                   + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                                   - (pow(H,2)*f(i,j)));
      enforce_bc_par(next_ptr,my_rank,i,j);
    }
  }
}

void gauss_seidel (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  float W = 1.0;
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* solve next reds (i+j odd) */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 != 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }

  /* solve next blacks (i+j even) ... using next reds */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 == 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }
}

void sos (float ** current_ptr,float ** next_ptr) {
  int i,j,p,my_rank,my_start,my_end,my_num_rows;
  float U_Curr_Above[(int)floor(WIDTH/H)];  /* 1d array holding values from bottom row of PE above */
  float U_Curr_Below[(int)floor(WIDTH/H)];  /* 1d array holding values from top row of PE below */
  float U_Send_Buffer[(int)floor(WIDTH/H)]; /* 1d array holding values that are currently being sent */
  float W = 1.5;
  MPI_Request request;
  MPI_Status status;

  MPI_Comm_size(MPI_COMM_WORLD,&p);
  MPI_Comm_rank(MPI_COMM_WORLD,&my_rank);
  my_start = get_start(my_rank);
  my_end = get_end(my_rank);
  my_num_rows = get_num_rows(my_rank);

  /*
   * Communicating ghost rows - only bother if p > 1
   */
  if (p > 1) {
    /* send/receive bottom rows */
    if (my_rank < (p-1)) {
      /* populate send buffer with bottom row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[my_num_rows-1][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank > ROOT) {
      /* blocking receive */
      MPI_Recv(U_Curr_Above,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* send/receive top rows */
    if (my_rank > ROOT) {
      /* populate send buffer with top row */
      for (i=0;i<(int)floor(WIDTH/H);i++) {
        U_Send_Buffer[i] = current_ptr[0][i];
      }
      /* non blocking send */
      MPI_Isend(U_Send_Buffer,(int)floor(WIDTH/H),MPI_FLOAT,my_rank-1,0,MPI_COMM_WORLD,&request);
    }
    if (my_rank < (p-1)) {
      /* blocking receive */
      MPI_Recv(U_Curr_Below,(int)floor(WIDTH/H),MPI_FLOAT,my_rank+1,0,MPI_COMM_WORLD,&status);
    }
    MPI_Barrier(MPI_COMM_WORLD);
  }

  /* solve next reds (i+j odd) */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 != 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }

  /* solve next blacks (i+j even) ... using next reds */
  for (j=my_start;j<=my_end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      if ((i+j)%2 == 0) {
        next_ptr[j-my_start][i] = get_val_par(U_Curr_Above,current_ptr,U_Curr_Below,my_rank,i,j)
          + (W/4)*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i-1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i+1,j)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j-1)
                 + get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j+1)
                 - 4*(get_val_par(U_Curr_Above,next_ptr,U_Curr_Below,my_rank,i,j))
                 - (pow(H,2)*f(i,j)));
        enforce_bc_par(next_ptr,my_rank,i,j);
      }
    }
  }
}

void enforce_bc_par (float ** domain_ptr,int rank,int i,int j) {
  /* enforce bc's first */
  if (i == ((int)floor(WIDTH/H/2)-1) && j == 0) {
    /* This is the heat source location */
    domain_ptr[j][i] = T_SRC0;
  } else if (i <= 0 || j <= 0 || i >= ((int)floor(WIDTH/H)-1) || j >= ((int)floor(HEIGHT/H)-1)) {
    /* All edges and beyond are set to 0.0 */
    domain_ptr[global_to_local(rank,j)][i] = 0.0;
  }
}

float get_val_par (float * above_ptr,float ** domain_ptr,float * below_ptr,int rank,int i,int j) {
  float ret_val;
  int p;

  MPI_Comm_size(MPI_COMM_WORLD,&p);

  /* enforce bc's first */
  if (i == ((int)floor(WIDTH/H/2)-1) && j == 0) {
    /* This is the heat source location */
    ret_val = T_SRC0;
  } else if (i <= 0 || j <= 0 || i >= ((int)floor(WIDTH/H)-1) || j >= ((int)floor(HEIGHT/H)-1)) {
    /* All edges and beyond are set to 0.0 */
    ret_val = 0.0;
  } else {
    /* Else, return value for matrix supplied or ghost rows */
    if (j < get_start(rank)) {
      if (rank == ROOT) {
        /* not interested in above ghost row */
        ret_val = 0.0;
      } else {
        ret_val = above_ptr[i];
        /*printf("%d: Used ghost (%d,%d) row from above = %f\n",rank,i,j,above_ptr[i]); fflush(stdout);*/
      }
    } else if (j > get_end(rank)) {
      if (rank == (p-1)) {
        /* not interested in below ghost row */
        ret_val = 0.0;
      } else {
        ret_val = below_ptr[i];
        /*printf("%d: Used ghost (%d,%d) row from below = %f\n",rank,i,j,below_ptr[i]); fflush(stdout);*/
      }
    } else {
      /* else, return the value in the domain asked for */
      ret_val = domain_ptr[global_to_local(rank,j)][i];
      /*printf("%d: Used real (%d,%d) row from self = %f\n",rank,i,global_to_local(rank,j),domain_ptr[global_to_local(rank,j)][i]); fflush(stdout);*/
    }
  }
  return ret_val;
}

void init_domain (float ** domain_ptr,int rank) {
  int i,j,start,end,rows;
  start = get_start(rank);
  end = get_end(rank);
  rows = get_num_rows(rank);
  for (j=start;j<end;j++) {
    for (i=0;i<(int)floor(WIDTH/H);i++) {
      domain_ptr[j-start][i] = 0.0;
    }
  }
}

int get_start (int rank) {
  /* compute row divisions to each proc */
  int p,per_proc,start_row,remainder;
  MPI_Comm_size(MPI_COMM_WORLD,&p);
  /* get initial whole divisor */
  per_proc = (int)floor(HEIGHT/H)/p;
  /* get number of remaining */
  remainder = (int)floor(HEIGHT/H)%p;
  /* if there is a remainder, distribute it to the first "remainder" procs */
  if (rank < remainder) {
    start_row = rank * (per_proc + 1);
  } else {
    start_row = rank * (per_proc) + remainder;
  }
  return start_row;
}

int get_end (int rank) {
  /* compute row divisions to each proc */
  int p,per_proc,remainder,end_row;
  MPI_Comm_size(MPI_COMM_WORLD,&p);
  per_proc = (int)floor(HEIGHT/H)/p;
  remainder = (int)floor(HEIGHT/H)%p;
  if (rank < remainder) {
    end_row = get_start(rank) + per_proc;
  } else {
    end_row = get_start(rank) + per_proc - 1;
  }
  return end_row;
}

int get_num_rows (int rank) {
  return 1 + get_end(rank) - get_start(rank);
}

int global_to_local (int rank, int row) {

  return row - get_start(rank);
}

/*
 * f - function that would be non zero if there was an internal heat source
 */
float f (int i,int j) {
  return 0.0;
}

Appendix II: compile.sh script file

#!/bin/sh
#/*
# * Brett D. Estrade
# * CS 691 Fall 2004
# * Final Project
# *
# * Parallel implementation of 2d heat conduction
# * finite difference over a rectangular domain using:
# *  - Jacobi
# *  - Gauss-Seidel
# *  - SOS
# *
# * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# *          X - (W/2,H)
# *   *******X*******
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*   ~ all bdy 0 except "X" at (W/2,H)
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   *.............*
# *   ***************
# *
# *   2D domain - WIDTH x HEIGHT
# *   "X" = T_SRC0
# *   "*" = 0.0
# *   "." = internal node susceptible to heating
# * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# */

if [ ! -d output ]; then
  mkdir output
else
  rm output/*
fi

mpicc -lm bde_2dheat.c -o a.out
mpirun -np ${1} -machinefile machinefile a.out

echo globalizing output to $$.${1}p.out
cat output/* > $$.${1}p.out

exit 0


AMath 483/583 Lecture 24. Notes: Notes: Steady state diffusion. Notes: Finite difference method. Outline: AMath 483/583 Lecture 24 Outline: Heat equation and discretization OpenMP and MPI for iterative methods Jacobi, Gauss-Seidel, SOR Notes and Sample codes: Class notes: Linear algebra software $UWHPSC/codes/openmp/jacobi1d_omp1.f90

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms )

CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms ) CSE 590: Special Topics Course ( Supercomputing ) Lecture 6 ( Analyzing Distributed Memory Algorithms ) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2012 2D Heat Diffusion

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

AMath 483/583 Lecture 24

AMath 483/583 Lecture 24 AMath 483/583 Lecture 24 Outline: Heat equation and discretization OpenMP and MPI for iterative methods Jacobi, Gauss-Seidel, SOR Notes and Sample codes: Class notes: Linear algebra software $UWHPSC/codes/openmp/jacobi1d_omp1.f90

More information

Programming with MPI. Pedro Velho

Programming with MPI. Pedro Velho Programming with MPI Pedro Velho Science Research Challenges Some applications require tremendous computing power - Stress the limits of computing power and storage - Who might be interested in those applications?

More information

Introduction to MPI: Part II

Introduction to MPI: Part II Introduction to MPI: Part II Pawel Pomorski, University of Waterloo, SHARCNET ppomorsk@sharcnetca November 25, 2015 Summary of Part I: To write working MPI (Message Passing Interface) parallel programs

More information

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( )

Report S1 C. Kengo Nakajima Information Technology Center. Technical & Scientific Computing II ( ) Seminar on Computer Science II ( ) Report S1 C Kengo Nakajima Information Technology Center Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Problem S1-3 Report S1 (2/2) Develop parallel program

More information

Part One: The Files. C MPI Slurm Tutorial - Hello World. Introduction. Hello World! hello.tar. The files, summary. Output Files, summary

Part One: The Files. C MPI Slurm Tutorial - Hello World. Introduction. Hello World! hello.tar. The files, summary. Output Files, summary C MPI Slurm Tutorial - Hello World Introduction The example shown here demonstrates the use of the Slurm Scheduler for the purpose of running a C/MPI program. Knowledge of C is assumed. Having read the

More information

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:... ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure

More information

Report S1 C. Kengo Nakajima

Report S1 C. Kengo Nakajima Report S1 C Kengo Nakajima Technical & Scientific Computing II (4820-1028) Seminar on Computer Science II (4810-1205) Hybrid Distributed Parallel Computing (3747-111) Problem S1-1 Report S1 Read local

More information

Point-to-Point Communication. Reference:

Point-to-Point Communication. Reference: Point-to-Point Communication Reference: http://foxtrot.ncsa.uiuc.edu:8900/public/mpi/ Introduction Point-to-point communication is the fundamental communication facility provided by the MPI library. Point-to-point

More information

Distributed Memory Programming with Message-Passing

Distributed Memory Programming with Message-Passing Distributed Memory Programming with Message-Passing Pacheco s book Chapter 3 T. Yang, CS240A Part of slides from the text book and B. Gropp Outline An overview of MPI programming Six MPI functions and

More information

Assignment 3 MPI Tutorial Compiling and Executing MPI programs

Assignment 3 MPI Tutorial Compiling and Executing MPI programs Assignment 3 MPI Tutorial Compiling and Executing MPI programs B. Wilkinson: Modification date: February 11, 2016. This assignment is a tutorial to learn how to execute MPI programs and explore their characteristics.

More information

Introduction to MPI. Table of Contents

Introduction to MPI. Table of Contents 1. Program Structure 2. Communication Model Topology Messages 3. Basic Functions 4. Made-up Example Programs 5. Global Operations 6. LaPlace Equation Solver 7. Asynchronous Communication 8. Communication

More information

MPI Lab. Steve Lantz Susan Mehringer. Parallel Computing on Ranger and Longhorn May 16, 2012

MPI Lab. Steve Lantz Susan Mehringer. Parallel Computing on Ranger and Longhorn May 16, 2012 MPI Lab Steve Lantz Susan Mehringer Parallel Computing on Ranger and Longhorn May 16, 2012 1 MPI Lab Parallelization (Calculating p in parallel) How to split a problem across multiple processors Broadcasting

More information

CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus

CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus CSS 534 Program 2: Parallelizing Wave Diffusion with MPI and OpenMP Professor: Munehiro Fukuda Due date: see the syllabus 1. Purpose In this programming assignment, we will parallelize a sequential version

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalabissu-tokyoacjp/~reiji/pna16/ [ 5 ] MPI: Message Passing Interface Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1 Architecture

More information

Parallel Programming Using MPI

Parallel Programming Using MPI Parallel Programming Using MPI Short Course on HPC 15th February 2019 Aditya Krishna Swamy adityaks@iisc.ac.in SERC, Indian Institute of Science When Parallel Computing Helps? Want to speed up your calculation

More information

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM

Outline. Communication modes MPI Message Passing Interface Standard. Khoa Coâng Ngheä Thoâng Tin Ñaïi Hoïc Baùch Khoa Tp.HCM THOAI NAM Outline Communication modes MPI Message Passing Interface Standard TERMs (1) Blocking If return from the procedure indicates the user is allowed to reuse resources specified in the call Non-blocking

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA

MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MULTI GPU PROGRAMMING WITH MPI AND OPENACC JIRI KRAUS, NVIDIA MPI+OPENACC GDDR5 Memory System Memory GDDR5 Memory System Memory GDDR5 Memory System Memory GPU CPU GPU CPU GPU CPU PCI-e PCI-e PCI-e Network

More information

Collective Communication in MPI and Advanced Features

Collective Communication in MPI and Advanced Features Collective Communication in MPI and Advanced Features Pacheco s book. Chapter 3 T. Yang, CS240A. Part of slides from the text book, CS267 K. Yelick from UC Berkeley and B. Gropp, ANL Outline Collective

More information

HPC Algorithms and Applications

HPC Algorithms and Applications HPC Algorithms and Applications Dwarf #5 Structured Grids Michael Bader Winter 2012/2013 Dwarf #5 Structured Grids, Winter 2012/2013 1 Dwarf #5 Structured Grids 1. dense linear algebra 2. sparse linear

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer

CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer CMPE-655 Fall 2013 Assignment 2: Parallel Implementation of a Ray Tracer Rochester Institute of Technology, Department of Computer Engineering Instructor: Dr. Shaaban (meseec@rit.edu) TAs: Jason Lowden

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

CSE 160 Lecture 15. Message Passing

CSE 160 Lecture 15. Message Passing CSE 160 Lecture 15 Message Passing Announcements 2013 Scott B. Baden / CSE 160 / Fall 2013 2 Message passing Today s lecture The Message Passing Interface - MPI A first MPI Application The Trapezoidal

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Assignment 2 Using Paraguin to Create Parallel Programs

Assignment 2 Using Paraguin to Create Parallel Programs Overview Assignment 2 Using Paraguin to Create Parallel Programs C. Ferner and B. Wilkinson Minor clarification Oct 11, 2013 The goal of this assignment is to use the Paraguin compiler to create parallel

More information

HIGH PERFORMANCE SCIENTIFIC COMPUTING

HIGH PERFORMANCE SCIENTIFIC COMPUTING ( HPSC 5576 ELIZABETH JESSUP ) HIGH PERFORMANCE SCIENTIFIC COMPUTING :: Homework / 8 :: Student / Florian Rappl 1 problem / 10 points Problem 1 Task: Write a short program demonstrating the use of MPE's

More information

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001)

An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (2000/2001) An Investigation into Iterative Methods for Solving Elliptic PDE s Andrew M Brown Computer Science/Maths Session (000/001) Summary The objectives of this project were as follows: 1) Investigate iterative

More information

CMSC 714 Lecture 3 Message Passing with PVM and MPI

CMSC 714 Lecture 3 Message Passing with PVM and MPI Notes CMSC 714 Lecture 3 Message Passing with PVM and MPI Alan Sussman To access papers in ACM or IEEE digital library, must come from a UMD IP address Accounts handed out next week for deepthought2 cluster,

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Lecture 6: Parallel Matrix Algorithms (part 3)

Lecture 6: Parallel Matrix Algorithms (part 3) Lecture 6: Parallel Matrix Algorithms (part 3) 1 A Simple Parallel Dense Matrix-Matrix Multiplication Let A = [a ij ] n n and B = [b ij ] n n be n n matrices. Compute C = AB Computational complexity of

More information

Distributed Memory Programming with MPI

Distributed Memory Programming with MPI Distributed Memory Programming with MPI Part 1 Bryan Mills, PhD Spring 2017 A distributed memory system A shared memory system Identifying MPI processes n Common pracace to idenafy processes by nonnegaave

More information