Hybrid MPI+UPC parallel programming paradigm on an SMP cluster


Turk J Elec Eng & Comp Sci, Vol.20, No.Sup.2, 2012, © TÜBİTAK

Zeki BOZKUŞ
Department of Computer Engineering, Kadir Has University, İstanbul-TURKEY
zeki.bozkus@khas.edu.tr

Abstract

The symmetric multiprocessing (SMP) cluster system, which consists of shared memory nodes with several multicore central processing units connected to a high-speed network to form a distributed memory system, is the most widely available hardware architecture for the high-performance computing community. Today, the Message Passing Interface (MPI) is the most widely used parallel programming paradigm for SMP clusters, in which the MPI provides programming both within an SMP node and among nodes simultaneously. However, Unified Parallel C (UPC) is an emerging alternative that supports the partitioned global address space model and can likewise be employed within and across the nodes of a cluster. In this paper, we describe a hybrid parallel programming paradigm that was designed to combine the MPI and UPC programming models. This paradigm's objective is to mix the MPI's data locality control and scalability strengths with UPC's fine-grain parallelism and ease of programming to achieve multiple-level parallelism on the SMP cluster, which itself has a multilevel parallel architecture. Utilizing the proposed hybrid model and comparing it with MPI-only and UPC-only implementations, this paper presents a detailed description of a Cannon's algorithm benchmark application, together with performance results for a random-access benchmark and the Barnes-Hut N-body simulation. Experiments indicate that the hybrid MPI+UPC model can provide significant performance increases, up to 2x in comparison with the UPC-only implementation and up to 20% in comparison with the MPI-only implementation. Furthermore, an optimization was achieved that improved the hybrid performance by an additional 20%.

Key Words: Hybrid parallel programming, UPC, MPI

1. Introduction

The Message Passing Interface (MPI) is one of the most commonly used parallel programming models for parallel computing [1]. The MPI provides portability, good scalability, and significant flexibility in parallel programming. However, the MPI requires explicit communications with large granularity, which renders programming and debugging problematic. Recently, partitioned global address space (PGAS) languages such as Unified Parallel C (UPC) have made possible an alternative parallel programming model that allows for shared memory-like programming on distributed memory systems, through an ability to read and write remote memory with simple statements and without explicit communication [2].

This study aimed to exploit the complementary strengths of both models by providing a hybrid parallel programming model that combines the MPI and UPC. This hybrid model reduces the communication overhead by lowering data movement within nodes. In addition, the goal of this hybrid model is to offer the fine-grain parallelism of UPC and its benefit of simplified programming, while adding the strengths of the MPI's good scalability and coarse-grain parallelism with larger message sizes. The recent trend in high-performance computer architecture is to increase the number of cores per node and hence decrease the memory per core, which encouraged us to explore different programming paradigms, such as an MPI+UPC hybrid, on a large-scale distributed platform. In this study, we selected a funneled approach for hybrid programming, meaning that all interactions between UPC and the MPI are controlled by a master thread and only the master thread calls MPI routines. In this manner, all communication is funneled to the master thread.

This paper also provides a detailed description of our hybrid MPI+UPC model and demonstrates its effectiveness via a comparison with the performance of MPI-only and UPC-only models. We developed 3 codes to probe the efficiency and scalability of the models on distributed multicore systems with the Cannon matrix multiplication algorithm, which was chosen to exploit some of the advanced features of the MPI. The MPI virtual topology was employed to benefit from regional locality; UPC has only local or shared (global) objects, and in this way the hybrid model enhances UPC program performance with regional locality. In addition, we utilized an optimization to overlap MPI communications with UPC computations in the hybrid model, which improved benchmark performance by up to 30% for some data sets.

Dinan et al. [3] were the first researchers to define the hybrid MPI+UPC programming model, in terms of submodels that varied the level of nesting and the number of instances of the model. They classified the hybrid model into 3 categories: the flat model, the nested-funneled model, and the multiple model. The nested-funneled model is based on the fact that one member of each UPC group is able to make MPI calls; this kind of funneled approach has also been used for MPI+OpenMP and MPI+Pthread hybrids. Our proposed model is also a funneled method in which there is only one master thread in each UPC group that can participate in MPI communication. However, the main goal of the nested-funneled model [3] is to provide MPI programs with access to a large distributed shared global address space, and they set up their hybrid system, including the UPC compiler, to handle that case. On the other hand, our main purpose was to localize computations so that our hybrid system could outperform not only the plain UPC version but also the plain MPI version. In our funneled model, every UPC group has its own shared variables, which are distributed only within that UPC group; they are not distributed across all groups to enlarge the MPI memory.

Section 2 of this paper presents an overview of the MPI and UPC parallel programming models on parallel platforms, briefly describing their main strengths and weaknesses, and then examines hybrid models and describes the funneled version in further detail.
Section 3 presents the Cannon algorithm, which was used as a benchmark application, while Section 4 explains the 4 different implementations of the Cannon algorithm that we used for the experiments in Section 5. Related work and the conclusion are presented in the final sections.

2. Overview of MPI and UPC

This section provides an overview of the MPI and UPC parallel programming models and examines the advantages and disadvantages of each for the construction of a hybrid programming model.

2.1. MPI model

The MPI is the most widely used parallel programming paradigm, particularly for distributed memory parallel computers. The MPI provides the user with a programming model in which processes communicate with other processes by explicitly calling library routines to send and receive messages. The MPI-1 standard provides library routines for multiple separate processes to collaborate and to communicate with each other. These include 2-sided send/receive operations for exchanging data between process pairs, a variety of powerful collective operations, virtual topologies, and explicit grouping operations for regional locality, such as among the rows and columns of a Cartesian topology. The MPI-2 standard adds support for one-sided communication, dynamic process management, and parallel I/O. The MPI-2 defines 3 one-sided communication operations, Put, Get, and Accumulate: a write to remote memory, a read from remote memory, and a reduction operation on the same memory across a number of processes. The MPI-2 one-sided model supports remote access to a process's data without explicit help from the data's owner, which is similar to UPC's global address space programming model. However, it is more restrictive than a global address space in terms of its cache coherence and synchronization characteristics [4,5]. We believe that a hybrid programming model should exploit the complementary strengths, and not the similarities, between the MPI and UPC; for this reason, we did not consider MPI-2 one-sided communication for the hybrid development. We first analyzed the advantages and disadvantages of both the MPI and UPC models and then added the desirable features of each model. The next step was to reduce the impact of the unfavorable features of each to form a new hybrid model.

The advantages of the MPI programming model include the programmer's complete control over data distribution, process synchronization, and explicit communication, which permits the optimization of data locality [6]. This gives MPI programs high performance and scalability; however, it also has the drawback of making the MPI difficult to program and debug. Another disadvantage is that existing sequential applications require a fair amount of restructuring for MPI parallelization: the programmer cannot start with a sequential program and then incrementally change it to achieve incremental performance improvements.
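To make the two communication styles discussed above concrete, the following is a minimal sketch (ours, not from the paper) that contrasts a 2-sided send/receive pair with an MPI-2 one-sided put into a memory window; buf, n, and rank are assumed to be set up by the caller.

    #include <mpi.h>

    /* Sketch: 2-sided exchange versus MPI-2 one-sided Put (illustration only). */
    void two_styles(double *buf, int n, int rank)
    {
        /* 2-sided: both processes participate explicitly. */
        if (rank == 0)
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: rank 0 writes into rank 1's window without a matching
           receive on rank 1; the fences provide the synchronization. */
        MPI_Win win;
        MPI_Win_create(buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }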
2.2. UPC model

UPC is an extension of the C programming language that provides a uniform programming model for both shared and distributed memory systems. UPC, as an explicitly parallel language, provides the facilities for direct user specification of program parallelism and control of data distribution. The programmer is presented with a single shared and logically partitioned global address space that is physically distributed, by default in round-robin fashion, across the available memories [2]. Each variable is physically associated with a single processor, but variables may be read and written by any processor without explicit help from the data's owner [7]. With UPC, the number of threads, or the degree of parallelism, is fixed at either compile time or program startup time and does not change during execution. Multiple threads operate independently and each thread has affinity with a portion of the globally shared address space. Each thread also has a private space. The total number of threads is THREADS and each thread identifies itself using MYTHREAD. Work can be distributed conveniently using the upc_forall construct; all iterations must be independent in order to use upc_forall.

The advantage of the UPC programming model over the MPI is that it allows the programmer to perform incremental parallelization of applications. The programmer can incrementally parallelize a portion of the existing sequential code and then incrementally obtain performance enhancement. UPC makes local and global (shared) objects visible to the programmer because these objects have different performance behavior.
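As a small illustration of these constructs (our sketch, not code from the paper), the following program declares a shared array distributed round-robin across threads and updates it with upc_forall, so each iteration runs on the thread that owns the referenced element:

    #include <upc.h>
    #include <stdio.h>

    #define N 1024
    shared double x[N];                         /* default blocking: round-robin across threads */

    int main(void)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &x[i])   /* iteration i runs on the owner of x[i] */
            x[i] = MYTHREAD;                    /* each thread writes only its own elements */
        upc_barrier;
        if (MYTHREAD == 0)
            printf("THREADS = %d, x[1] = %g\n", THREADS, (double)x[1]);
        return 0;
    }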

However, one weakness in the model is that data are either local or global; the language has no regional notion. UPC does not permit thread groups to allocate distributed shared arrays on a subset of processors, while the MPI can group processes with MPI_Comm_split to create regional locality, such as the rows or columns of a virtual topology. In UPC terminology, a single unit of execution is named a UPC thread to emphasize UPC's goal of providing shared memory-like programming. However, UPC threads are operating system-level processes in the Berkeley UPC distribution, just like MPI processes, because these processes of the hybrid MPI+UPC may run on distributed memory machines located in different cluster nodes.

2.3. Hybrid MPI+UPC model

The MPI is an application programming interface-based library that can be linked with the C, C++, or Fortran programming languages. On the other hand, UPC is an extension of the C programming language. Both the MPI and UPC use a single program, multiple data (SPMD) model of computation. Thus, the UPC program calls MPI libraries to form a hybrid program with an SPMD model. The hybrid program is compiled with the UPC compiler and linked with the MPI libraries.

The objective of the hybrid MPI+UPC programming model is to combine the strengths of the MPI's locality control and scalability with UPC's fine-grain parallelism and ease of programming. The MPI demands large granularity, and small messages are expensive because every communication has a fixed startup overhead (latency). Thus, the hybrid uses the MPI for the outer parallelism and UPC for the inner parallelism. Figure 1 shows a hybrid model in which multiple UPC groups are combined with one outer MPI group. UPC threads can communicate with each other within their group, while the MPI is used for intergroup communication. There is only one master thread in each UPC group, such as the last thread (MYTHREAD == THREADS-1), which can participate in MPI communication; i.e. all communication is funneled to the master thread [3]. This model is very similar to MPI_THREAD_FUNNELED support, in which an MPI application may be multithreaded but only one thread at a time may make MPI calls. We refer to this hybrid model as a funneled model as well [8].

The hybrid MPI+UPC model is based on running independent UPC groups that are connected to each other solely by MPI communication. The programmer of the hybrid model must first decide how many independent UPC groups are to be run, and then decide how many threads should be present in each UPC group. For example, if a user wants to have 4 UPC groups and let each UPC group have 8 threads, then this is a 4 MPI × 8 UPC configuration. In the hybrid model, the hybrid code is compiled with the UPC compiler, with -fupc-threads-8 specifying that the program should have 8 THREADS, and the generated executable is then run with mpiexec -np 4, where mpiexec/mpirun starts 4 different SPMD programs and each program is an 8-threaded UPC program. One of these threads is chosen as a master thread to serve as both the MPI process and a UPC thread. This master thread can perform MPI calls, but the other threads cannot invoke any MPI routines. The UPC group can only communicate with other groups through the master thread.
The user chooses the group parameters based on his or her application and the system hardware configuration, and the hybrid model itself does not place any constraints on the number of threads.

2.4. Differences between the funneled hybrid and the nested-funneled hybrid

In this section, we would like to clarify the differences between our proposed hybrid solution and the hybrid solution of Dinan et al. [3].

Figure 1. The funneled hybrid MPI+UPC model; gray circles represent the hybrid MPI+UPC master threads and white circles represent UPC threads.

01: #include <upc.h>
02: #include <mpi.h>
03: #define N 400                       // THREADS = 4
04: shared double v1[N], v2[N];
05: shared double our_sum = 0.0;
06: shared double my_sum[THREADS];
07: shared int me, np;
08: int main(int argc, char **argv) {
09:   int i, B;
10:   double dotp;
11:   if (MYTHREAD == 0) {
12:     MPI_Init(&argc, &argv);
13:     MPI_Comm_rank(MPI_COMM_WORLD, (int*)&me);
14:     MPI_Comm_size(MPI_COMM_WORLD, (int*)&np);
15:   }
16:   upc_barrier;
17:   B = N/np;
18:   my_sum[MYTHREAD] = 0.0;
19:   upc_forall(i=me*B; i<(me+1)*B; i++; &v1[i])
20:     my_sum[MYTHREAD] += v1[i]*v2[i];
21:   upc_all_reduceD(&our_sum,
22:                   &my_sum[MYTHREAD], ...);
23:   if (MYTHREAD == 0) {
24:     MPI_Reduce(&our_sum, &dotp, 1, ...);
25:     if (me == 0) printf("dot = %f\n", dotp);
26:     MPI_Finalize();
27:   }
28:   return 0;
29: }
30:

Nested-funneled dot product from [3].

01: #include <upc.h>
02: #include <mpi.h>
03: #define N ...
04: shared double v1[N], v2[N];
05: shared double our_sum = 0.0;
06: shared double my_sum[THREADS];
07: shared int me, np;
08: int main(int argc, char **argv) {
09:   int i;
10:   double dotp;
11:   if (MYTHREAD == THREADS-1) {
12:     MPI_Init(&argc, &argv);
13:     MPI_Comm_rank(MPI_COMM_WORLD, (int*)&me);
14:     MPI_Comm_size(MPI_COMM_WORLD, (int*)&np);
15:   }
16:   upc_barrier;
17:
18:   my_sum[MYTHREAD] = 0.0;
19:   upc_forall(i=0; i<N; i++; &v1[i])
20:     my_sum[MYTHREAD] += v1[i]*v2[i];
21:   upc_all_reduceD(&our_sum,
22:                   &my_sum[MYTHREAD], ...);
23:   if (MYTHREAD == THREADS-1) {
24:     MPI_Reduce(&our_sum, &dotp, 1, ...);
25:     if (me == 0) printf("dot = %f\n", dotp);
26:     MPI_Finalize();
27:   }
28:   return 0;
29: }
30:

Our funneled dot product.

The main purpose of the hybrid model in [3] is to help memory-constrained MPI codes scale to larger problem sizes; this is stated in the abstract, introduction, and conclusion of that paper. For example, in the

conclusion of [3] it is stated: "For memory-constrained MPI codes, the hybrid model enables the processing of larger problems by aggregating the memory of several nodes into a single, shared global address space." On the other hand, our main goal is to improve performance such that the hybrid MPI+UPC outperforms both the MPI-only and UPC-only models.

Imagine that we are running this code on 4 × 4 groups, i.e. 4 MPI processes, each with 4 UPC THREADS. In the nested-funneled model proposed by Dinan et al., the partitioning across groups is done in blocks of size B, and line 17 calculates this block size. Later, upc_forall uses B and the MPI rank me to calculate the lower and upper bounds at line 19. Their main goal is to provide MPI programs with access to a large distributed shared global address space, and they set up their hybrid system, including the UPC compiler, to handle this case. Therefore, there are many caveats and difficulties involved in setting up and using their model [9]. On the other hand, our main purpose is to localize the computation so that the hybrid system can outperform not only the plain UPC version but also the plain MPI version. Please compare line 19 in both programs, in which upc_forall is the main source of parallel computation for the UPC language. Dinan et al. compare their hybrid model only with a UPC-only model, whereas we compare against both UPC-only and MPI-only implementations. Moreover, we provide an optimization to overlap the computation with the communication in the Cannon application; this optimization allows UPC to perform computations while the MPI performs communication.

3. Cannon algorithm

Matrix multiplication is a fundamental kernel that is used for the numerical solution of many problems [10]. Cannon's algorithm [11], also sometimes called Fox's algorithm [12], is one of the most efficient matrix multiplication algorithms for parallel platforms. For simplicity, we are interested in performing the multiplication C = A × B, where C, A, and B are N × N square matrices. We begin by assuming that matrices A, B, and C are identically decomposed into subblocks of a 2-dimensional tile grid.

Algorithm 1. The Cannon matrix multiplication algorithm.

    for i = 0 to (√P - 1) do                      // P is the total number of tiles
        T ← broadcast_A( appropriate A_sub along rows )
        C_sub ← C_sub + T × B_sub
        B_sub ← cshift_B( upward along columns )
    end for

Algorithm 1, indicated above, has 3 fundamental steps. The first step broadcasts the diagonal A_sub subblocks along each row of tiles; the broadcast source is shifted one position to the right along the row for the next iteration. The second step performs the submatrix multiplication. The final step performs an upward circular shift along each column of the 2-dimensional tile grid.
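As a small worked illustration (ours, not from the paper), consider a 2 × 2 tile grid, i.e. √P = 2, so the loop runs twice:

    Iteration 0:  row 0 broadcasts A00, row 1 broadcasts A11 (the diagonal blocks)
                  C00 += A00 x B00    C01 += A00 x B01
                  C10 += A11 x B10    C11 += A11 x B11
                  circular shift of B upward: tile (i,j) now holds B[(i+1) mod 2][j]
    Iteration 1:  row 0 broadcasts A01, row 1 broadcasts A10
                  C00 += A01 x B10    C01 += A01 x B11
                  C10 += A10 x B00    C11 += A10 x B01

After the 2 iterations, every Cij has accumulated both of its products Ai0 x B0j and Ai1 x B1j, which is exactly the blocked form of C = A x B.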

The Cannon algorithm runs on a 2-dimensional square grid with wraparound connections. Supposing that the number of processors is P, the array of processors will be a 2-dimensional grid of size √P × √P. We multiply the global array A[N][N] with the global array B[N][N] to calculate the global array C[N][N]. These global arrays must be distributed on the √P × √P grid, and each node on the grid will have the local A_sub[dN][dN], B_sub[dN][dN], and C_sub[dN][dN] arrays, where dN = N/√P.

The Cannon algorithm calculates a partial result using the submatrices that it currently holds. It successively performs the same calculation on new submatrices, adding the new results to the previous ones. It has 2 different types of communication: each row broadcasts A_sub from the selected source along that row, and every node in the grid concurrently performs a circular shift (send/receive) of B_sub along its column. All of these operations are performed √P times.

There are several parallel matrix multiplication algorithms using matrix decomposition based on the number of processors available, such as the Cannon algorithm used in this section; these include the systolic algorithm, parallel universal matrix multiplication (PUMMA), scalable universal matrix multiplication (SUMMA), and distribution-independent matrix multiplication (DIMMA) [13]. Basically, each of these algorithms decomposes the matrices into submatrices and calculates a partial result using the submatrices currently held by the processor. However, different matrix multiplication methods may have different computation and communication requirements. For example, the systolic algorithm performs transpose operations and send/receive communications, PUMMA uses scatter communication, and SUMMA and DIMMA use broadcast operations. In contrast, the Cannon algorithm uses both the broadcast and shift (send/receive) operations. PUMMA and the systolic algorithm are known to provide strong performance [13] on a distributed system. However, we selected the Cannon algorithm because it is easy to implement and lets us describe our funneled hybrid model without difficulty. In addition, the Cannon algorithm's performance is very compatible with both the MPI implementation and the UPC implementation.

4. Code overview

In this section, we develop 4 different implementations of the same parallel Cannon algorithm: the MPI, UPC, hybrid MPI+UPC, and optimized hybrid MPI+UPC versions, respectively.

4.1. MPI implementation

Our MPI implementation of Cannon's algorithm is based upon a 2-dimensional block decomposition, in which there are 2 collective communication operations involving a subset of the processes, such as the rows of processes and the columns of processes. In order to involve only a subset of the original process group in a collective operation, we need to create a Cartesian topology, a 2-dimensional virtual grid of processes, as shown in Figure 2. The 2-dimensional grid with wraparound connections is often simply referred to as a torus network. Figure 3 shows the Cannon algorithm implementation written for the MPI model.
The packing, unpacking, and calculation of sources and destinations for communications have been removed for simplification. The code has 2 major parts, the first of which is to construct the Cartesian topology. The second part is to

implement the algorithm in 3 steps. In the first step, the processes in the rows of the virtual process grid participate in the broadcast communication. The second step performs the submatrix multiplication. In the last step, each column of processes in the grid performs send/receive operations to execute a circular shift across the chain of processes in the grid column.

The MPI program has explicit control of data locality. Regional data locality is provided among rows and columns by using the advanced feature of the MPI's Cartesian virtual topology. The MPI can enhance UPC by providing explicit control over data locality in the hybrid programming model, and Cannon's algorithm is an ideal selection to demonstrate the importance of regional data locality.

Figure 2. An MPI Cartesian topology: a 2-dimensional virtual grid of MPI processes with wraparound connections.

UPC does not provide process groups and does not allow for the allocation of distributed shared arrays on a subset of processors. However, the MPI program has explicit control of data locality. The programmer can create process groups, such as a group of rows or columns, as in the MPI's Cartesian virtual topology. For example, each row group of the topology may perform a broadcast of different values at the same time. This is referred to as regional locality. The hybrid Cannon application uses this feature of the MPI to give UPC these groups, which is one of the benefits that the MPI contributes to the hybrid for UPC. The MPI adds other benefits to the hybrid as well, such as well-tuned collective communications. UPC, in turn, benefits the hybrid by handling the fine-grain parallelism and preparing the coarse-grain parallelism for the MPI.

4.2. UPC implementation

Figure 4 shows the UPC implementation of matrix multiplication with a block distribution. The UPC code for the matrix multiplication is almost the same size as the sequential code. This makes UPC easy to program and allows for incremental parallelization of sequential codes. The global (shared) array declaration uses the keyword shared [block-size] to distribute the shared arrays one block per thread in a round-robin fashion. UPC does

not provide a 2-dimensional virtual topology to make a group of threads for regional data locality, such as the row-wise or column-wise grouping presented in Section 4.1. UPC only differentiates between 2 kinds of data for threads, shared (global) and private (local). UPC partitions the parallel work by using the upc_forall construct, which distributes the iterations of the loop according to the affinity expression, &Aupc[i][0]. UPC will assign iterations to the thread that has affinity to the corresponding element of Aupc. The Berkeley UPC distribution provides a matrix multiplication for the Cannon algorithm, and that is the code that we utilized for the UPC matrix multiplication benchmarking. However, our MPI+UPC hybrid uses the block matrix multiplication (Figure 4) for the submatrix multiplication.

    // PART 1: Construct a Cartesian topology
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, grid_size);
    MPI_Cart_create(MPI_COMM_WORLD, 2, grid_size, periodic, 1, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_id);
    MPI_Cart_coords(grid_comm, grid_id, 2, grid_coords);
    MPI_Comm_split(grid_comm, grid_coords[0], grid_coords[1], &row_comm);
    MPI_Comm_split(grid_comm, grid_coords[1], grid_coords[0], &col_comm);

    // PART 2: Cannon algorithm
    int S = (int) sqrt(P);
    for (k = 0; k < S; k++) {
        MPI_Bcast(Atmp, dN*dN, mpitype, src, row_comm);       // STEP 1: Broadcast Asub
        matmul(Atmp, Bsub, Csub);                             // STEP 2: C = Asub x Bsub
        MPI_Sendrecv(Bsub, dN*dN, mpitype, left, tag,         // STEP 3: CSHIFT Bsub
                     Btmp, dN*dN, mpitype, right, tag, col_comm, &status);
    }

Figure 3. MPI implementation of Cannon's algorithm.

The goal of the UPC language is to provide a shared memory-like programming model on distributed memory systems. The distributed global arrays are marked with the shared keyword, and the UPC compiler automatically decomposes the shared arrays across the available processors. However, an MPI programmer must explicitly distribute the N × N matrices into local submatrices of size dN × dN, where dN = N/√P and P is the number of processors.
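For illustration, this hand-written distribution step might look like the following sketch (our assumption of a typical implementation, not the paper's code), where the process at grid coordinates (r, c) copies its own dN × dN block out of a row-major global matrix:

    /* Sketch only: extract the dN x dN block owned by grid position (r, c)
       from the global N x N matrix, where dN = N / sqrt(P). */
    void extract_block(int N, int dN, int r, int c,
                       const double *A_global,   /* full N x N matrix, row-major */
                       double *A_sub)            /* this process's dN x dN block */
    {
        for (int i = 0; i < dN; i++)
            for (int j = 0; j < dN; j++)
                A_sub[i * dN + j] = A_global[(r * dN + i) * N + (c * dN + j)];
    }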

    shared [N] double Aupc[N][N], Bupc[N][N], Cupc[N][N];

    void matmul_upc()
    {
        int i, j, k;
        double sum;
        upc_forall (i = 0; i < N; i++; &Aupc[i][0])
            for (j = 0; j < N; j++) {
                sum = 0;
                for (k = 0; k < N; k++)
                    sum += Aupc[i][k] * Bupc[k][j];
                Cupc[i][j] = sum;
            }
    }

Figure 4. UPC matrix multiplication with block distribution.

4.3. Hybrid MPI+UPC implementation

In Figure 5, the funneled hybrid MPI+UPC is formed by a careful combination of the MPI program of Figure 3 and the UPC program of Figure 4. Here again, we present simplified hybrid code due to space limitations; however, the main algorithm of the code should be clear. The hybrid program consists of 1 MPI group, and each MPI process has 1 UPC group, as shown in Figure 1. There is only one master thread in each UPC group, such as the last thread (MYTHREAD == THREADS-1), which can participate in MPI operations as an MPI process. The master thread initializes the MPI and constructs an MPI Cartesian topology in part 1 of the code. Part 2 performs Cannon's algorithm with 2 explicit levels of parallelism. The MPI manages the outer parallelism by bringing the appropriate subblocks to the master threads of each UPC group. The master threads copy their private subblocks to the shared subblocks with a copy_from_master_to_upc() routine, which performs simple copy operations and synchronizes each UPC thread group with upc_barrier. The shared subblocks are in the global address space of each UPC group. Each UPC group participates in the inner-level parallelization of the subblock matrix multiplication with a block distribution, which is given by matmul_upc() in Figure 4.

In order to obtain optimal execution time in a parallel program, the program has to be partitioned into concurrent tasks that handle the different parts and different grain sizes of the global data for the data-parallel programming model. Here, the MPI, UPC, and hybrid MPI+UPC programs all use data-parallel programming, and the choice of grain size is very important for data-parallel programming. However, selecting the optimum grain size for parallel execution is an NP-complete problem. A large grain size will limit the potential parallelism, while a small grain size will result in greater communication overheads and may degrade the execution time. The MPI performs well on coarse-grain parallelism with larger message sizes, while UPC can handle fine grain better than the MPI.

The hybrid implementation has 2 kinds of data decomposition. Supposing that we have a global matrix A[N][N], we first perform the data decomposition as A_sub[dN][dN] for the MPI processes. Later, the UPC compiler performs the second decomposition with the shared Aupc[dN][dN] declaration. The master thread, which is both an MPI process and a UPC thread, copies A_sub to Aupc at the same time.
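The copy routine referenced above is not shown in the paper's listings; a minimal sketch of what it might look like is given below, assuming the group's shared subblocks Aupc and Bupc, the master's private buffers Atmp and Bsub, and the MASTER flag of Figure 5, and assuming that they all share the dN × dN subblock extent used in the MPI calls:

    /* Sketch (our assumption): the master thread copies its private MPI buffers
       into the UPC group's shared subblocks; the barrier makes the copies
       visible to every thread before matmul_upc() starts. */
    void copy_from_master_to_upc(const double *Atmp, const double *Bsub)
    {
        int i, j;
        if (MASTER) {                        /* only the master holds the MPI data */
            for (i = 0; i < dN; i++)
                for (j = 0; j < dN; j++) {
                    Aupc[i][j] = Atmp[i * dN + j];
                    Bupc[i][j] = Bsub[i * dN + j];
                }
        }
        upc_barrier;                         /* synchronize the whole UPC group */
    }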

The first distribution results in a coarse-grain distribution, which the MPI can handle very efficiently. The second distribution performs a fine-grain distribution, which UPC can handle quite effectively.

    shared [N] double Aupc[N][N], Bupc[N][N], Cupc[N][N];
    Boolean MASTER = (MYTHREAD == THREADS-1);

    int main(int argc, char *argv[])
    {
        if (MASTER) {                       // PART 1: Construct a Cartesian topology
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &id);
            MPI_Comm_size(MPI_COMM_WORLD, &p);
            ...
            MPI_Comm_split(grid_comm, grid_coords[0], grid_coords[1], &row_comm);
            MPI_Comm_split(grid_comm, grid_coords[1], grid_coords[0], &col_comm);
        }
        for (k = 0; k < S; k++) {           // PART 2: Cannon algorithm
            // Master copied Atmp = Asub
            if (MASTER)                     // STEP 1: Broadcast Asub
                MPI_Bcast(Atmp, dN*dN, mpitype, src, row_comm);
            copy_from_master_to_upc(Atmp, Bsub);   // Copy to shared: Aupc = Atmp, Bupc = Bsub
            matmul_upc();                   // STEP 2: Cupc += Aupc x Bupc
            if (MASTER)                     // STEP 3: CSHIFT Bsub
                MPI_Sendrecv(Bsub, dN*dN, mpitype, left, tag,
                             Btmp, dN*dN, mpitype, right, tag, col_comm, &status);
        }
    }

Figure 5. Hybrid MPI+UPC Cannon's algorithm. The subblock multiplication has UPC block distribution.

The hybrid program is compiled with the UPC compiler and linked with the MPI libraries. The -fupc-threads-NUM option generates code for a fixed number NUM of UPC threads. The MPI launcher is used to start the hybrid program. Below are an example UPC compilation and an example MPI launch, in which 4 MPI processes are created and each process creates a UPC group with 4 threads. The total number of parallel threads is 4 MPI × 4 UPC = 16 threads, as shown in Figure 1.

    $ upcc -o matmul_hybrid matmul_hybrid.upc -fupc-threads-4 -O
    $ mpiexec -np 4 --hostfile hosts matmul_hybrid

4.4. Optimized hybrid MPI+UPC implementation

Although overlapping communication with computation provides the opportunity to improve the execution time of a parallel program, this parallel programming style is not widely used due to its complexity. However, the hybrid Cannon algorithm presents a good opportunity for overlapping communication with computation. In this algorithm, we only need the full synchronization of each UPC group before performing the subblock matrix multiplication, so the computation of the subblock UPC multiplication can be overlapped with the MPI's communication. However, the upc_forall construct distributes the iterations of the loop according to the affinity

expression at the fourth parameter of the construct. UPC will assign each iteration to the thread that has affinity to the corresponding element of the shared array. This implies that assignment statements of upc_forall involving shared arrays are executed exclusively by those threads that own the shared array elements. We structured our hybrid code such that the master thread does not spend excessive time on upc_forall computations, but rather dedicates its time to MPI communication, since the master thread is both a UPC thread and an MPI process.

    #define BLOCK (N*N)/(THREADS - 1)
    #if BLOCK < (64*1024)                  /* 64 K block-size limit of Berkeley UPC */
    shared [BLOCK] double Aupc[N][N];
    shared [BLOCK] double Bupc[N][N];
    shared [BLOCK] double Cupc[N][N];
    #else
    shared [ ] double Aupc[N][N];
    shared [ ] double Bupc[N][N];
    shared [ ] double Cupc[N][N];
    #endif

Figure 6. Distribution scheme to optimize the hybrid MPI+UPC Cannon algorithm.

If the hybrid code partitions the shared arrays such that the master thread has no shared array elements, or far fewer than the other threads, the master will finish upc_forall earlier than the others and reach the MPI communication operations while the other threads are still executing their portion of the upc_forall iterations. Figure 6 shows a distribution scheme in which fewer shared array elements reside in the master thread's memory space. The shared array declaration uses the keyword shared [block-size] to distribute the shared arrays one block per thread in a round-robin fashion. The shared array size is N × N in the subblock of Cannon's algorithm, so with a block size of (N × N)/(THREADS - 1) the last thread receives no block in the round-robin distribution and the master thread should have no data. For example, with THREADS = 4 and a 6 × 6 subblock, BLOCK = 36/3 = 12, so the 3 blocks of 12 elements go to threads 0, 1, and 2, and the master (thread 3) holds none. However, the Berkeley UPC implementation limits the block size to 64 K. Even if the problem size exceeds this limit, the scheme in Figure 6 still places fewer shared array elements on the last thread (the master) because of the round-robin distribution. The addition of the code in Figure 6 optimizes the hybrid MPI+UPC implementation of Figure 5 so that MPI communications overlap with UPC computations.

The Cannon algorithm has 3 steps in a loop:

Step 1: The master thread performs the MPI broadcast of A_sub.
Step 2: Every thread performs matmul_upc(), with parallelism from upc_forall.
Step 3: The master thread performs the shift communication of B_sub.

When the master thread is performing the broadcast communication in step 1, all of the other UPC threads are waiting at matmul_upc() in step 2. Once the master thread completes the broadcast, it joins the rest of the threads at matmul_upc() and every thread starts step 2. However, the master thread quickly exits from matmul_upc() to start the step 3 communication while the rest of the threads are still working on matmul_upc() in step 2. Here, the UPC computation in step 2 is overlapped with the MPI communication in step 3. Even further, the master thread completes the step 3 communication and starts the step 1 broadcast of the next iteration while the nonmaster threads are still finishing matmul_upc() in step 2. Basically, the master threads (MPI processes) spend most of their time on the communications in steps 1 and 3, while the other UPC threads are mostly busy successively performing the matrix multiplication in step 2.

4.5. Benefits of the hybrid MPI+UPC model over the MPI and UPC models

The hybrid MPI+UPC model can take advantage of the multicore nodes of the symmetric multiprocessing (SMP) cluster to access the shared memory within a node efficiently, at no communication cost. The MPI, in contrast, uses communications to access the shared memory on the multicore nodes of the SMP cluster, which adds overhead: the MPI pays a penalty even to access local data.

UPC does not provide process groups and does not allow for the allocation of distributed shared arrays on a subset of processors. The hybrid model uses MPI features to give UPC these groups, which improves the scalability of some hybrid programs over UPC programs.

The hybrid model provides 2 explicit levels of parallelism. The hybrid uses the MPI to manage the outer parallelism by bringing the appropriate data blocks to the UPC groups for processing, and the UPC groups participate in the inner level of parallelization. Neither the MPI nor UPC alone can provide this flexibility, which may be needed for some applications such as Cannon's algorithm. The hybrid implementation has 2 kinds of data decomposition: the first results in a coarse-grain distribution, which the MPI can handle very efficiently, and the second performs a fine-grain distribution, which UPC can handle quite effectively.

5. Performance evaluation

This section illustrates the impact of our proposed hybrid MPI+UPC approach through several experiments, running the 4 different implementations of the Cannon algorithm on a 16-node HP BL460c cluster located at Kadir Has University. This SMP cluster consists of 2.6-GHz Intel Xeon quad-core CPUs and 24 GB of RAM per node, with a total of 8 processing cores per node, running Linux and connected with a 20-Gbps Infiniband network. We show the performance of a baseline UPC version, a baseline MPI version, and the hybrid MPI+UPC versions for each benchmark. We performed the hybrid testing with various group sizes and selected the best-performing group for each benchmark. The experimental results were obtained as an average of 5 runs.

5.1. Cannon algorithm performance

The objective of the first experiment is to find an optimum UPC group size for the hybrid MPI+UPC model. The results in Figure 7 show the time required to perform the matrix multiplications with groups of 1, 4, 8, 16, and 64 UPC THREADS. The best performance is obtained with the group of 8 UPC threads. Each node in our system holds 8 cores per slot, so each UPC thread goes to a different core within the node. In fact, we configured UPC's GASNET_NODEFILE such that consecutive threads go to consecutive cores in the node for the hybrid runs. Similarly, we ensured that each MPI process goes to a different node by putting the slots=1 option in the MPI's hostfile, for the best performance of the hybrid model in a round-robin fashion. However, during plain MPI runs, the hostfile is configured with slots=8 to advise consecutive processes to be placed on consecutive cores for process affinity.
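For instance, the two hostfile variants described above might look like the following (hypothetical node names; slots=1 spreads MPI processes across nodes for the hybrid runs, while slots=8 packs consecutive processes onto one node's 8 cores for the plain MPI runs):

    # hostfile for hybrid runs: one MPI process (one UPC group) per node
    node01 slots=1
    node02 slots=1
    node03 slots=1
    node04 slots=1

    # hostfile for plain MPI runs: fill all 8 cores of a node before moving on
    node01 slots=8
    node02 slots=8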
Cannon's algorithm runs on a 2-dimensional square mesh constructed by the MPI. For example, in the 9 MPI × 8 UPC case, there will be 9 MPI processes used to create a 2-dimensional 3 × 3 square mesh. If

the number of MPI processes is P, the algorithm creates a √P × √P mesh. In addition, all of the matrix sizes given are those of the global matrix, not the distributed submatrices.

The second experiment was designed to reveal how well the proposed hybrids perform compared to the plain MPI and plain UPC versions. Figure 8 shows the total execution time on the vertical axis, and the horizontal axis denotes the problem size. The CPU times of the hybrids and the others are almost the same for the small problem sizes. For the other problem sizes, the hybrids consistently obtain better performance than UPC.

Figure 7. The effect of varying numbers of MPI processes and UPC threads on execution time for double-precision global matrices for Cannon's algorithm on 64 CPUs.

Figure 8. Execution time of Cannon's algorithm on 36 CPUs with varying problem sizes; the hybrids run 4 MPI × 9 UPC configurations.

The Table shows the percentage gains of the optimized hybrid compared to the MPI, UPC, and the plain hybrid. For the 10,000² problem size, the optimized hybrid shows a gain of 60.13% over UPC, and it outperforms the MPI by 46.83%. Compared to the plain hybrid, the optimized hybrid achieved approximately a 16% gain.

Table. Comparing percentage gains of the optimized hybrid MPI+UPC version to the MPI, UPC, and plain hybrid MPI+UPC of Figure 5 (rows: gain over MPI, gain over UPC, gain over hybrid; columns: data sizes up to 10,000²).

Figure 9 shows the behavior of the MPI, UPC, and optimized hybrid MPI+UPC codes on large data sizes with a fixed number of CPUs. The optimized hybrid achieved around 20% and 15% gains compared to the MPI and UPC, respectively. However, the UPC code outperforms the MPI in this scenario. Because the MPI requires copying data through shared memory, when the data size increases while the CPU count stays the same, the number of direct accesses to UPC's shared memory grows, and the MPI becomes comparatively less efficient unless the number of CPUs increases along with the data size.

5.2. Random-access benchmark performance

The random-access benchmark is defined to test the speed at which a machine can update the elements of a table spread across the global system memory. The benchmark requires that the array not be replicated, because of the memory limitation. Parallel implementations perform poorly on distributed-memory machines because the updates require numerous small point-to-point messages between processors.

The random-access benchmark is trivial to parallelize using UPC, with 20 lines of code, since UPC uses one-sided access to distributed arrays. However, parallelizing this application is not trivial when using the MPI, which requires 150 lines of code to handle the irregular communication patterns with point-to-point communication. The funneled hybrid MPI+UPC is 180 lines of code. In terms of productivity in this application, the UPC model is superior to the MPI and hybrid models because of its PGAS support. However, Figure 10 shows that the hybrid MPI+UPC outperforms both the MPI and UPC in terms of execution time.

Figure 9. Execution time of Cannon's algorithm on 100 CPUs with large problem sizes.

Figure 10. Random-access benchmark execution time (s) versus number of CPUs for the UPC, MPI, and hybrid (group size of 8) implementations.

Each process performs 1,000,000 random accesses to the distributed global array of length 1,000,000, and for each access it performs 1000 floating-point operations of computation. Each process has a constant amount of work, so the performance of this experiment does not improve when the number of processors increases. In this experiment, the funneled hybrid MPI+UPC model leads to about a 200% improvement in performance over the UPC model and a 100% improvement over the MPI model. Both the UPC and MPI execution times increase when the number of processors increases, because this also increases the number of communications. However, the funneled hybrid MPI+UPC remains flat, indicating that the processor count does not affect the execution time. The main reason for the success of the hybrid in this case is that it better localizes the random accesses. The MPI model outperforms UPC because the hand-coded communication is better than the UPC compiler-generated, one-sided communication.

We wanted to compare our result with the nested-funneled model of [3]; however, we were not able to set up their nested-funneled model on our system because at present there are numerous caveats and difficulties involved in setting up and using their model [9]. If we make a rough comparison between the plain execution time of our model and the numbers reported in [3], our model leads to about a 30% improvement in performance over the Hybrid-4 run of Dinan et al. However, our system has 2.6-GHz Intel Xeon CPUs, while Dinan et al.'s test machine has dual-core 2.6-GHz AMD Opteron processors. These systems have different floating-point units and cache characteristics in terms of cache size and latencies, which may have a great effect on the execution time. In addition, our system with a 20-Gbps Infiniband is faster than the 10-Gbps Infiniband used for Dinan et al.'s test machine.

The random-access benchmark has an irregular access pattern; thus, exchanging data via a regular communication pattern is not feasible. For that reason, the communication timing of this benchmark is made up of the latency of the Infiniband network, not its bandwidth. The end-to-end latency is 2.84 μs and 3.85 μs for messages of up to 128 bytes for the 20-Gbps and 10-Gbps Infiniband, respectively [14]. In this application, each process performs 1,000,000 random accesses. If we assume, for simplicity, that all of the accesses are remote accesses, our communication timing gains about 1 s over the case of Dinan et al. However, we conducted an experiment to measure the communication timing for this benchmark, including congestion and delays on our cluster, and found that the communication took 4 times longer than the above numbers. That means that the better network gave us a 4-s gain over the case of Dinan et al. Thus, when we take the network speed difference into account, our improvement is reduced to the 18% range.

5.3. Barnes-Hut N-body simulation performance

The classical N-body problem simulates the evolution of a system of N bodies, where the force exerting a gravitational pull on each body arises from its interaction with all of the other bodies in the system. This algorithm is frequently used in demonstrations of computational performance and is interesting for several reasons. First, the simulation of the motion of particles subject to particle-particle interactions represents a general class of algorithms with applications ranging from chemistry to astrophysics. Straightforward sequential algorithms to solve these problems typically have time complexity O(N²) per iteration, where N is the number of objects. However, the Barnes-Hut algorithm [15] was developed to reduce this complexity. It is an O(N log N) algorithm based on a hierarchical octree representation of space in 3 dimensions, which computes interactions between distant particles by means of a first-order approximation.

We undertook an experiment to show the strong scaling of the Barnes-Hut force and velocity calculation kernel. The results are shown in Figure 11 for a 150,000-body system. We show the speedup for the baseline UPC implementation, the MPI implementation, and the funneled hybrid implementation with a UPC group size of 8. The UPC implementation shows poor scaling performance because the number of nonlocal references, implemented as one-sided communication, increases proportionately to the UPC thread count. The MPI shows better speedup performance than UPC in this application. The funneled hybrid MPI+UPC model performs better still, because the hybrid model achieves a higher percentage of local data references than the baseline UPC implementation by replicating the octree of the algorithm. The funneled hybrid with group size 8 achieves almost linear scaling because the hybrid brings all of the data to the SMP multicore nodes. In comparison, the funneled hybrid with an 8-thread UPC group offers better performance than the nested-funneled Hybrid-8 of [3]. However, our best-performing hybrid with an 8-thread UPC group is comparable to the Hybrid-4 of [3], because we conducted our experiment on an SMP cluster with 8 processing cores per node, while Dinan et al. conducted their experiment on an SMP cluster with 4 cores per node.
6. Related work

An SMP cluster system with multiple SMP nodes and multicore chips is the most commonly available parallel computing hardware. MPI programming can be used both within an SMP node and among the SMP cluster's nodes. However, another programming model for this platform is a hybrid programming model, in which a parallel program is written using a thread programming library, such as Portable Operating System Interface (POSIX) threads, within an SMP node together with MPI programming among the nodes. The MPI

standard has clearly defined the interaction between the MPI and user-created threads in an MPI program. However, Gropp and Thakur [8] pointed out the issues involved in developing an efficient thread-safe MPI implementation without sacrificing too much performance.

Figure 11. Barnes-Hut N-body simulation speedup graphs for the force calculation part for the UPC, MPI, and funneled hybrid (group size of 8) implementations.

Other hybrid programming, with the MPI as the outer-level parallelism and OpenMP as the inner-level parallelism, has been extensively studied for SMP cluster systems [16-18]. The shared address space within each SMP node is suitable for OpenMP parallelization, and the MPI can be employed within and across the nodes of a cluster. The MPI/OpenMP hybrid programming model is easy to apply via automatic parallelization by the compilers with some directives for loop-level parallelism. Rabenseifner et al. showed the relation between the MPI/OpenMP hybrid programming model and the hardware architecture [17].

Recent years have seen another shift in parallel computing hardware with the introduction of general-purpose programming tools to perform high-performance computing on multicore graphics processing units (GPUs), which are attached to each node of an SMP cluster. The latest GPU programming languages, such as NVIDIA's Compute Unified Device Architecture (CUDA), provide the programmer with a high-level and flexible model. The work in [19] develops a hybrid MPI-CUDA model, adopting the CUDA programming model for fine-grain data-parallel operations within each GPU and the MPI for coarse-grain parallelization across the cluster.

A new hybrid parallel model that combines the MPI and UPC was first explored in [3] by Dinan et al., who defined the hybrid MPI+UPC parallel programming model in terms of submodels that varied the level of nesting and the number of instances of the model. They classified the hybrid model into 3 categories. The flat model permits all processes to participate in both UPC and MPI communication. In the nested-funneled model, one process per UPC group can participate in MPI communication. The nested-multiple model is the most powerful, allowing the MPI to span all UPC processes in all UPC groups; however, this added flexibility comes with greater complexity. The authors demonstrated its effectiveness and performance gains with the Barnes-Hut N-body simulation. Our funneled model is similar to the nested-funneled model in that only one member of each UPC group is able to make MPI calls. However, in our funneled model, every UPC group has its own shared variables, which are distributed only within that UPC group; they are not distributed across all groups to enlarge the MPI memory. In [3], the lower and upper bounds of upc_forall are calculated by using the MPI id and the UPC group id. In contrast, our approach simply uses the shared array index of upc_forall for the lower and upper bounds in each UPC group. In addition, we implemented the Cannon algorithm with advanced MPI features


More information

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1 Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Advanced Message-Passing Interface (MPI)

Advanced Message-Passing Interface (MPI) Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point

More information

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism

More information

Our new HPC-Cluster An overview

Our new HPC-Cluster An overview Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Chip Multiprocessors COMP Lecture 9 - OpenMP & MPI

Chip Multiprocessors COMP Lecture 9 - OpenMP & MPI Chip Multiprocessors COMP35112 Lecture 9 - OpenMP & MPI Graham Riley 14 February 2018 1 Today s Lecture Dividing work to be done in parallel between threads in Java (as you are doing in the labs) is rather

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Unified Parallel C, UPC

Unified Parallel C, UPC Unified Parallel C, UPC Jarmo Rantakokko Parallel Programming Models MPI Pthreads OpenMP UPC Different w.r.t. Performance/Portability/Productivity 1 Partitioned Global Address Space, PGAS Thread 0 Thread

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL. Elif ÖZTÜRK

DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL. Elif ÖZTÜRK DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL Elif ÖZTÜRK KADIR HAS UNIVERSITY 2011 DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL ELİF ÖZTÜRK B.S., Computer Engineering, Kadir Has

More information

Hybrid MPI and OpenMP Parallel Programming

Hybrid MPI and OpenMP Parallel Programming Hybrid MPI and OpenMP Parallel Programming Jemmy Hu SHARCNET HPTC Consultant July 8, 2015 Objectives difference between message passing and shared memory models (MPI, OpenMP) why or why not hybrid? a common

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

MPI & OpenMP Mixed Hybrid Programming

MPI & OpenMP Mixed Hybrid Programming MPI & OpenMP Mixed Hybrid Programming Berk ONAT İTÜ Bilişim Enstitüsü 22 Haziran 2012 Outline Introduc/on Share & Distributed Memory Programming MPI & OpenMP Advantages/Disadvantages MPI vs. OpenMP Why

More information

Lecture 6: Parallel Matrix Algorithms (part 3)

Lecture 6: Parallel Matrix Algorithms (part 3) Lecture 6: Parallel Matrix Algorithms (part 3) 1 A Simple Parallel Dense Matrix-Matrix Multiplication Let A = [a ij ] n n and B = [b ij ] n n be n n matrices. Compute C = AB Computational complexity of

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Point-to-Point Synchronisation on Shared Memory Architectures

Point-to-Point Synchronisation on Shared Memory Architectures Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

Compiling for GPUs. Adarsh Yoga Madhav Ramesh Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center

More information

Parallel Computers. c R. Leduc

Parallel Computers. c R. Leduc Parallel Computers Material based on B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c 2002-2004 R. Leduc Why Parallel Computing?

More information

Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 8

Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 8 Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three parallel programs, each based on a different data decomposition

More information

Project C/MPI: Matrix-Vector Multiplication

Project C/MPI: Matrix-Vector Multiplication Master MICS: Parallel Computing Lecture Project C/MPI: Matrix-Vector Multiplication Sebastien Varrette Matrix-vector multiplication is embedded in many algorithms for solving

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

Parallelization, OpenMP

Parallelization, OpenMP ~ Parallelization, OpenMP Scientific Computing Winter 2016/2017 Lecture 26 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de made wit pandoc 1 / 18 Why parallelization? Computers became faster and faster

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:... ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Parallel Computing. Lecture 17: OpenMP Last Touch

Parallel Computing. Lecture 17: OpenMP Last Touch CSCI-UA.0480-003 Parallel Computing Lecture 17: OpenMP Last Touch Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Some slides from here are adopted from: Yun (Helen) He and Chris Ding

More information

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads) Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program

More information

. Programming Distributed Memory Machines in MPI and UPC. Kenjiro Taura. University of Tokyo

. Programming Distributed Memory Machines in MPI and UPC. Kenjiro Taura. University of Tokyo .. Programming Distributed Memory Machines in MPI and UPC Kenjiro Taura University of Tokyo 1 / 57 Distributed memory machines chip (socket, node, CPU) (physical) core hardware thread (virtual core, CPU)

More information

Optimization of MPI Applications Rolf Rabenseifner

Optimization of MPI Applications Rolf Rabenseifner Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture SOTIRIOS G. ZIAVRAS and CONSTANTINE N. MANIKOPOULOS Department of Electrical and Computer Engineering New Jersey Institute

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 15 October 2015 Announcements Homework #3 and #4 Grades out soon Homework #5 will be posted

More information

Lecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp

Lecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp Lecture 36: MPI, Hybrid Programming, and Shared Memory William Gropp www.cs.illinois.edu/~wgropp Thanks to This material based on the SC14 Tutorial presented by Pavan Balaji William Gropp Torsten Hoefler

More information

Blocking SEND/RECEIVE

Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI) Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems

Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Ashay Rane and Dan Stanzione Ph.D. {ashay.rane, dstanzi}@asu.edu Fulton High Performance Computing Initiative, Arizona

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Hybrid Programming with MPI and OpenMP

Hybrid Programming with MPI and OpenMP Hybrid Programming with and OpenMP Fernando Silva and Ricardo Rocha Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2017/2018 F. Silva and R. Rocha (DCC-FCUP) Programming

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Implementation of Parallelization

Implementation of Parallelization Implementation of Parallelization OpenMP, PThreads and MPI Jascha Schewtschenko Institute of Cosmology and Gravitation, University of Portsmouth May 9, 2018 JAS (ICG, Portsmouth) Implementation of Parallelization

More information