Hybrid MPI+UPC parallel programming paradigm on an SMP cluster


Turk J Elec Eng & Comp Sci, Vol.20, No.Sup.2, 2012, © TÜBİTAK

Zeki BOZKUŞ
Department of Computer Engineering, Kadir Has University, İstanbul-TURKEY
zeki.bozkus@khas.edu.tr

Abstract

The symmetric multiprocessing (SMP) cluster system, which consists of shared memory nodes with several multicore central processing units connected to a high-speed network to form a distributed memory system, is the most widely available hardware architecture for the high-performance computing community. Today, the Message Passing Interface (MPI) is the most widely used parallel programming paradigm for SMP clusters, in which the MPI provides programming both within an SMP node and among nodes simultaneously. However, Unified Parallel C (UPC) is an emerging alternative that supports the partitioned global address space model and can likewise be employed within and across the nodes of a cluster. In this paper, we describe a hybrid parallel programming paradigm that was designed to combine the MPI and UPC programming models. This paradigm's objective is to mix the MPI's data locality control and scalability strengths with UPC's fine-grain parallelism and ease of programming to achieve multiple-level parallelism on the SMP cluster, which itself has a multilevel parallel architecture. Utilizing the proposed hybrid model and comparing it with MPI-only and UPC-only implementations, this paper presents a detailed description of a Cannon's algorithm benchmark application, together with performance results for a random-access benchmark and the Barnes-Hut N-body simulation. Experiments indicate that the hybrid MPI+UPC model can provide significant performance increases, up to 2x in comparison with the UPC-only implementation and up to 20% in comparison with the MPI-only implementation. Furthermore, an optimization was achieved that improved the hybrid performance by an additional 20%.

Key Words: Hybrid parallel programming, UPC, MPI

1. Introduction

The Message Passing Interface (MPI) is one of the most commonly used parallel programming models for parallel computing [1]. The MPI provides portability, good scalability, and significant flexibility in parallel programming. However, the MPI requires explicit communications with large granularity, which renders programming and debugging problematic. Recently, partitioned global address space (PGAS) languages such as Unified Parallel C (UPC) have made possible an alternative parallel programming model that allows for shared memory-like programming on distributed memory systems, through an ability to read and write remote memory with simple statements and without explicit communication [2].

This study aimed to exploit the complementary strengths of both models by providing a hybrid parallel programming model that combines the MPI and UPC. This hybrid model reduces the communication overhead by lowering data movement within nodes. In addition, the goal of this hybrid model is to offer the fine-grain parallelism of UPC and its benefit of simplified programming, while adding the strengths of the MPI's good scalability and coarse-grain parallelism with larger message sizes. The recent trend in high-performance computer architecture is to increase the number of cores per node and hence decrease the memory per core, which encouraged us to explore different programming paradigms, such as an MPI+UPC hybrid, on a large-scale distributed platform. In this study, we selected a funneled approach for hybrid programming, meaning that all interactions between UPC and the MPI are controlled by a master thread and only the master thread calls MPI routines. In this manner, all communication is funneled to the master thread.

This paper also provides a detailed description of our hybrid MPI+UPC model and demonstrates its effectiveness via a comparison with the performance of MPI-only and UPC-only models. We developed 3 codes to probe the efficiency and scalability of the models on distributed multicore systems with the Cannon matrix multiplication algorithm, which was chosen to exploit some of the advanced features of the MPI. The MPI virtual topology was employed to benefit from regional locality; UPC has only local or shared (global) objects, and in this way the hybrid model enhances UPC program performance with regional locality. In addition, we utilized an optimization to overlap MPI communications with UPC computations in the hybrid model, which improved benchmark performance by up to 30% for some data sets.

Dinan et al. [3] were the first researchers to define the hybrid MPI+UPC programming model, in terms of submodels that varied the level of nesting and the number of instances of the model. They classified the hybrid model into 3 categories: the flat model, the nested-funneled model, and the multiple model. The nested-funneled model is based on the fact that one member of each UPC group is able to make MPI calls; this kind of funneled approach has also been used for MPI+OpenMP and MPI+Pthread hybrids. Our proposed model is also a funneled method in which there is only one master thread in each UPC group that can participate in MPI communication. However, the main goal of the nested-funneled model [3] is to provide MPI programs with access to a large distributed shared global address space, and they set up their hybrid system, including the UPC compiler, to handle that case. On the other hand, our main purpose was to localize computations so that our hybrid system could outperform not only the plain UPC version but also the plain MPI version. In our funneled model, every UPC group has its own shared variables, which are distributed only within that UPC group; they are not distributed across all groups to enlarge the MPI memory.

Section 2 of this paper presents an overview of the MPI and UPC parallel programming models on parallel platforms, briefly describing their main strengths and weaknesses, and then examines hybrid models and describes the funneled version in further detail.
Section 3 presents the Cannon algorithm, which was used as a benchmark application, while Section 4 explains the 4 different implementations of the Cannon algorithm that we used for the experiments in Section 5. Related work and the conclusion are presented in the final sections.

2. Overview of MPI and UPC

This section provides an overview of the MPI and UPC parallel programming models and examines the advantages and disadvantages of each for the construction of a hybrid programming model.

2.1. MPI model

The MPI is the most widely used parallel programming paradigm, particularly for distributed memory parallel computers. The MPI provides the user with a programming model in which processes communicate with other processes by explicitly calling library routines to send and receive messages. The MPI-1 standard provides library routines for multiple separate processes to collaborate and to communicate with each other. These include 2-sided send/receive operations for exchanging data between process pairs, a variety of powerful collective operations, virtual topologies, and explicit grouping operations for regional locality, such as among the rows and columns of a Cartesian topology. The MPI-2 standard adds support for one-sided communication, dynamic process management, and parallel I/O. The MPI-2 defines 3 one-sided communication operations, Put, Get, and Accumulate: a write to remote memory, a read from remote memory, and a reduction operation on the same memory across a number of processes. The MPI-2 one-sided model supports remote access to a process's data without explicit help from the data's owner, which is similar to UPC's global address space programming model. However, it is more restrictive than a global address space in terms of its cache coherence and synchronization characteristics [4,5]. We believe that a hybrid programming model should exploit the complementary strengths, and not the similarities, between the MPI and UPC; for this reason, we did not consider MPI-2 one-sided communication for the hybrid development. We first analyzed the advantages and disadvantages of both the MPI and UPC models and then added the desirable features of each model. The next step was to reduce the impact of the unfavorable features of each to form a new hybrid model.

The advantages of the MPI programming model include the programmer's complete control over data distribution, process synchronization, and explicit communication, which permits the optimization of data locality [6]. This gives MPI programs high performance and scalability; however, it also has the drawback of making the MPI difficult to program and debug. Another disadvantage is that existing sequential applications require a fair amount of restructuring for MPI parallelization: the programmer cannot start with a sequential program and then incrementally change it to achieve incremental performance improvements.
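To make the two communication styles discussed above concrete, the following is a minimal sketch (ours, not from the paper) that contrasts a 2-sided send/receive pair with an MPI-2 one-sided put into a memory window; buf, n, and rank are assumed to be set up by the caller.

    #include <mpi.h>

    /* Sketch: 2-sided exchange versus MPI-2 one-sided Put (illustration only). */
    void two_styles(double *buf, int n, int rank)
    {
        /* 2-sided: both processes participate explicitly. */
        if (rank == 0)
            MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: rank 0 writes into rank 1's window without a matching
           receive on rank 1; the fences provide the synchronization. */
        MPI_Win win;
        MPI_Win_create(buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }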
2.2. UPC model

UPC is an extension of the C programming language that provides a uniform programming model for both shared and distributed memory systems. UPC, as an explicitly parallel language, provides the facilities for direct user specification of program parallelism and control of data distribution. The programmer is presented with a single shared and logically partitioned global address space that is physically distributed, by default in round-robin fashion, across the available memories [2]. Each variable is physically associated with a single processor, but variables may be read and written by any processor without explicit help from the data's owner [7]. With UPC, the number of threads, or the degree of parallelism, is fixed at either compile time or program startup time and does not change during execution. Multiple threads operate independently and each thread has affinity with a portion of the globally shared address space. Each thread also has a private space. The total number of threads is THREADS and each thread identifies itself using MYTHREAD. Work can be distributed conveniently using the upc_forall construct; all iterations must be independent in order to use upc_forall.

The advantage of the UPC programming model over the MPI is that it allows the programmer to perform incremental parallelization of applications. The programmer can incrementally parallelize a portion of the existing sequential code and then incrementally obtain performance enhancement. UPC makes local and global (shared) objects visible to the programmer because these objects have different performance behavior.
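As a small illustration of these constructs (our sketch, not code from the paper), the following program declares a shared array distributed round-robin across threads and updates it with upc_forall, so each iteration runs on the thread that owns the referenced element:

    #include <upc.h>
    #include <stdio.h>

    #define N 1024
    shared double x[N];                         /* default blocking: round-robin across threads */

    int main(void)
    {
        int i;
        upc_forall (i = 0; i < N; i++; &x[i])   /* iteration i runs on the owner of x[i] */
            x[i] = MYTHREAD;                    /* each thread writes only its own elements */
        upc_barrier;
        if (MYTHREAD == 0)
            printf("THREADS = %d, x[1] = %g\n", THREADS, (double)x[1]);
        return 0;
    }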

However, one weakness in the model is that data are either local or global; the language has no regional notion. UPC does not permit thread groups to allocate distributed shared arrays on a subset of processors, while the MPI can group processes with MPI_Comm_split to create regional locality, such as the rows or columns of a virtual topology. In UPC terminology, a single unit of execution is named a UPC thread to emphasize UPC's goal of providing shared memory-like programming. However, UPC threads are operating system-level processes in the Berkeley UPC distribution, just like MPI processes, because these processes of the hybrid MPI+UPC may run on distributed memory machines located in different cluster nodes.

2.3. Hybrid MPI+UPC model

The MPI is an application programming interface-based library that can be linked with the C, C++, or Fortran programming languages. On the other hand, UPC is an extension of the C programming language. Both the MPI and UPC use a single program, multiple data (SPMD) model of computation. Thus, the UPC program calls MPI libraries to form a hybrid program with an SPMD model. The hybrid program is compiled with the UPC compiler and linked with the MPI libraries.

The objective of the hybrid MPI+UPC programming model is to combine the strengths of the MPI's locality control and scalability with UPC's fine-grain parallelism and ease of programming. The MPI demands large granularity, and small messages are expensive because every communication has a fixed startup overhead (latency). Thus, the hybrid uses the MPI for the outer parallelism and UPC for the inner parallelism. Figure 1 shows a hybrid model in which multiple UPC groups are combined with one outer MPI group. UPC threads can communicate with each other within their group, while the MPI is used for intergroup communication. There is only one master thread in each UPC group, such as the last thread (MYTHREAD == THREADS-1), which can participate in MPI communication; i.e. all communication is funneled to the master thread [3]. This model is very similar to MPI_THREAD_FUNNELED support, in which an MPI application may be multithreaded but only one thread at a time may make MPI calls. We refer to this hybrid model as a funneled model as well [8].

The hybrid MPI+UPC model is based on running independent UPC groups that are connected to each other solely by MPI communication. The programmer of the hybrid model must first decide how many independent UPC groups are to be run, and then decide how many threads should be present in each UPC group. For example, if a user wants to have 4 UPC groups and let each UPC group have 8 threads, then this is a 4 MPI × 8 UPC configuration. In the hybrid model, the hybrid code is compiled with the UPC compiler, with -fupc-threads-8 specifying that the program should have 8 THREADS, and the generated executable is then run with mpiexec -np 4, where mpiexec/mpirun starts 4 different SPMD programs and each program is an 8-threaded UPC program. One of these threads is chosen as a master thread to serve as both the MPI process and a UPC thread. This master thread can perform MPI calls, but the other threads cannot invoke any MPI routines. The UPC group can only communicate with other groups through the master thread.
The user chooses the group parameters based on his or her application and the system hardware configuration, and the hybrid model itself does not place any constraints on the number of threads.

2.4. Differences between the funneled hybrid and the nested-funneled hybrid

In this section, we would like to clarify the differences between our proposed hybrid solution and the hybrid solution of Dinan et al. [3].

Figure 1. The funneled hybrid MPI+UPC model; gray circles represent the hybrid MPI+UPC master threads and white circles represent UPC threads.

01: #include <upc.h>
02: #include <mpi.h>
03: #define N 400                       // THREADS = 4
04: shared double v1[N], v2[N];
05: shared double our_sum = 0.0;
06: shared double my_sum[THREADS];
07: shared int me, np;
08: int main(int argc, char **argv) {
09:   int i, B;
10:   double dotp;
11:   if (MYTHREAD == 0) {
12:     MPI_Init(&argc, &argv);
13:     MPI_Comm_rank(MPI_COMM_WORLD, (int*)&me);
14:     MPI_Comm_size(MPI_COMM_WORLD, (int*)&np);
15:   }
16:   upc_barrier;
17:   B = N/np;
18:   my_sum[MYTHREAD] = 0.0;
19:   upc_forall(i=me*B; i<(me+1)*B; i++; &v1[i])
20:     my_sum[MYTHREAD] += v1[i]*v2[i];
21:   upc_all_reduceD(&our_sum,
22:                   &my_sum[MYTHREAD], ...);
23:   if (MYTHREAD == 0) {
24:     MPI_Reduce(&our_sum, &dotp, 1, ...);
25:     if (me == 0) printf("dot = %f\n", dotp);
26:     MPI_Finalize();
27:   }
28:   return 0;
29: }
30:

Nested-funneled dot product from [3].

01: #include <upc.h>
02: #include <mpi.h>
03: #define N ...
04: shared double v1[N], v2[N];
05: shared double our_sum = 0.0;
06: shared double my_sum[THREADS];
07: shared int me, np;
08: int main(int argc, char **argv) {
09:   int i;
10:   double dotp;
11:   if (MYTHREAD == THREADS-1) {
12:     MPI_Init(&argc, &argv);
13:     MPI_Comm_rank(MPI_COMM_WORLD, (int*)&me);
14:     MPI_Comm_size(MPI_COMM_WORLD, (int*)&np);
15:   }
16:   upc_barrier;
17:
18:   my_sum[MYTHREAD] = 0.0;
19:   upc_forall(i=0; i<N; i++; &v1[i])
20:     my_sum[MYTHREAD] += v1[i]*v2[i];
21:   upc_all_reduceD(&our_sum,
22:                   &my_sum[MYTHREAD], ...);
23:   if (MYTHREAD == THREADS-1) {
24:     MPI_Reduce(&our_sum, &dotp, 1, ...);
25:     if (me == 0) printf("dot = %f\n", dotp);
26:     MPI_Finalize();
27:   }
28:   return 0;
29: }
30:

Our funneled dot product.

The main purpose of the hybrid model in [3] is to help memory-constrained MPI codes scale to larger problem sizes; this is stated in the abstract, introduction, and conclusion of that paper. For example, in the

conclusion of [3] it is stated: "For memory-constrained MPI codes, the hybrid model enables the processing of larger problems by aggregating the memory of several nodes into a single, shared global address space." On the other hand, our main goal is to improve performance such that the hybrid MPI+UPC outperforms both the MPI-only and UPC-only models.

Imagine that we are running this code on 4 × 4 groups, i.e. 4 MPI processes, each with 4 UPC THREADS. In the nested-funneled model proposed by Dinan et al., the partitioning across groups is done in blocks of size B, and line 17 calculates this block size. Later, upc_forall uses B and the MPI rank me to calculate the lower and upper bounds at line 19. Their main goal is to provide MPI programs with access to a large distributed shared global address space, and they set up their hybrid system, including the UPC compiler, to handle this case. Therefore, there are many caveats and difficulties involved in setting up and using their model [9]. On the other hand, our main purpose is to localize the computation so that the hybrid system can outperform not only the plain UPC version but also the plain MPI version. Please compare line 19 in both programs, in which upc_forall is the main source of parallel computation for the UPC language. Dinan et al. compare their hybrid model only with a UPC-only model, whereas we compare against both UPC-only and MPI-only implementations. Moreover, we provide an optimization to overlap the computation with the communication in the Cannon application; this optimization allows UPC to perform computations while the MPI performs communication.

3. Cannon algorithm

Matrix multiplication is a fundamental kernel that is used for the numerical solution of many problems [10]. Cannon's algorithm [11], also sometimes called Fox's algorithm [12], is one of the most efficient matrix multiplication algorithms for parallel platforms. For simplicity, we are interested in performing the multiplication C = A × B, where C, A, and B are N × N square matrices. We begin by assuming that matrices A, B, and C are identically decomposed into subblocks of a 2-dimensional tile grid.

Algorithm 1. The Cannon matrix multiplication algorithm.

    for i = 0 to (√P - 1) do                      // P is the total number of tiles
        T ← broadcast_A( appropriate A_sub along rows )
        C_sub ← C_sub + T × B_sub
        B_sub ← cshift_B( upward along columns )
    end for

Algorithm 1, indicated above, has 3 fundamental steps. The first step broadcasts the diagonal A_sub subblocks along each row of tiles; the broadcast source is shifted one position to the right along the row for the next iteration. The second step performs the submatrix multiplication. The final step performs an upward circular shift along each column of the 2-dimensional tile grid.
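As a small worked illustration (ours, not from the paper), consider a 2 × 2 tile grid, i.e. √P = 2, so the loop runs twice:

    Iteration 0:  row 0 broadcasts A00, row 1 broadcasts A11 (the diagonal blocks)
                  C00 += A00 x B00    C01 += A00 x B01
                  C10 += A11 x B10    C11 += A11 x B11
                  circular shift of B upward: tile (i,j) now holds B[(i+1) mod 2][j]
    Iteration 1:  row 0 broadcasts A01, row 1 broadcasts A10
                  C00 += A01 x B10    C01 += A01 x B11
                  C10 += A10 x B00    C11 += A10 x B01

After the 2 iterations, every Cij has accumulated both of its products Ai0 x B0j and Ai1 x B1j, which is exactly the blocked form of C = A x B.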

The Cannon algorithm runs on a 2-dimensional square grid with wraparound connections. Supposing that the number of processors is P, the array of processors will be a 2-dimensional grid of size √P × √P. We multiply the global array A[N][N] with the global array B[N][N] to calculate the global array C[N][N]. These global arrays must be distributed on the √P × √P grid, and each node on the grid will have the local A_sub[dN][dN], B_sub[dN][dN], and C_sub[dN][dN] arrays, where dN = N/√P.

The Cannon algorithm calculates a partial result using the submatrices that it currently holds. It successively performs the same calculation on new submatrices, adding the new results to the previous ones. It has 2 different types of communication: each row broadcasts A_sub from the selected source along that row, and every node in the grid concurrently performs a circular shift (send/receive) of B_sub along its column. All of these operations are performed √P times.

There are several parallel matrix multiplication algorithms using matrix decomposition based on the number of processors available, such as the Cannon algorithm used in this section; these include the systolic algorithm, parallel universal matrix multiplication (PUMMA), scalable universal matrix multiplication (SUMMA), and distribution-independent matrix multiplication (DIMMA) [13]. Basically, each of these algorithms decomposes the matrices into submatrices and calculates a partial result using the submatrices currently held by the processor. However, different matrix multiplication methods may have different computation and communication requirements. For example, the systolic algorithm performs transpose operations and send/receive communications, PUMMA uses scatter communication, and SUMMA and DIMMA use broadcast operations. In contrast, the Cannon algorithm uses both the broadcast and shift (send/receive) operations. PUMMA and the systolic algorithm are known to provide strong performance [13] on a distributed system. However, we selected the Cannon algorithm because it is easy to implement and lets us describe our funneled hybrid model without difficulty. In addition, the Cannon algorithm's performance is very compatible with both the MPI implementation and the UPC implementation.

4. Code overview

In this section, we develop 4 different implementations of the same parallel Cannon algorithm: the MPI, UPC, hybrid MPI+UPC, and optimized hybrid MPI+UPC versions, respectively.

4.1. MPI implementation

Our MPI implementation of Cannon's algorithm is based upon a 2-dimensional block decomposition, in which there are 2 collective communication operations involving a subset of the processes, such as the rows of processes and the columns of processes. In order to involve only a subset of the original process group in a collective operation, we need to create a Cartesian topology, a 2-dimensional virtual grid of processes, as shown in Figure 2. The 2-dimensional grid with wraparound connections is often simply referred to as a torus network. Figure 3 shows the Cannon algorithm implementation written for the MPI model.
The packing, unpacking, and calculation of sources and destinations for communications have been removed for simplification. The code has 2 major parts, the first of which is to construct the Cartesian topology. The second part is to

implement the algorithm in 3 steps. In the first step, the processes in the rows of the virtual process grid participate in the broadcast communication. The second step performs the submatrix multiplication. In the last step, each column of processes in the grid performs send/receive operations to execute a circular shift across the chain of processes in the grid column.

The MPI program has explicit control of data locality. Regional data locality is provided among rows and columns by using the advanced feature of the MPI's Cartesian virtual topology. The MPI can enhance UPC by providing explicit control over data locality in the hybrid programming model, and Cannon's algorithm is an ideal selection to demonstrate the importance of regional data locality.

Figure 2. An MPI Cartesian topology: a 2-dimensional virtual grid of MPI processes with wraparound connections.

UPC does not provide process groups and does not allow for the allocation of distributed shared arrays on a subset of processors. However, the MPI program has explicit control of data locality. The programmer can create process groups, such as a group of rows or columns, as in the MPI's Cartesian virtual topology. For example, each row group of the topology may perform a broadcast of different values at the same time. This is referred to as regional locality. The hybrid Cannon application uses this feature of the MPI to give UPC these groups, which is one of the benefits that the MPI contributes to the hybrid for UPC. The MPI adds other benefits to the hybrid as well, such as well-tuned collective communications. UPC, in turn, benefits the hybrid by handling the fine-grain parallelism and preparing the coarse-grain parallelism for the MPI.

4.2. UPC implementation

Figure 4 shows the UPC implementation of matrix multiplication with a block distribution. The UPC code for the matrix multiplication is almost the same size as the sequential code. This makes UPC easy to program and allows for incremental parallelization of sequential codes. The global (shared) array declaration uses the keyword shared [block-size] to distribute the shared arrays one block per thread in a round-robin fashion. UPC does

not provide a 2-dimensional virtual topology to make a group of threads for regional data locality, such as the row-wise or column-wise grouping presented in Section 4.1. UPC only differentiates between 2 kinds of data for threads, shared (global) and private (local). UPC partitions the parallel work by using the upc_forall construct, which distributes the iterations of the loop according to the affinity expression, &Aupc[i][0]. UPC will assign iterations to the thread that has affinity to the corresponding element of Aupc. The Berkeley UPC distribution provides a matrix multiplication for the Cannon algorithm, and that is the code that we utilized for the UPC matrix multiplication benchmarking. However, our MPI+UPC hybrid uses the block matrix multiplication (Figure 4) for the submatrix multiplication.

    // PART 1: Construct a Cartesian topology
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Dims_create(p, 2, grid_size);
    MPI_Cart_create(MPI_COMM_WORLD, 2, grid_size, periodic, 1, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_id);
    MPI_Cart_coords(grid_comm, grid_id, 2, grid_coords);
    MPI_Comm_split(grid_comm, grid_coords[0], grid_coords[1], &row_comm);
    MPI_Comm_split(grid_comm, grid_coords[1], grid_coords[0], &col_comm);

    // PART 2: Cannon algorithm
    int S = (int) sqrt(P);
    for (k = 0; k < S; k++) {
        MPI_Bcast(Atmp, dN*dN, mpitype, src, row_comm);       // STEP 1: Broadcast Asub
        matmul(Atmp, Bsub, Csub);                             // STEP 2: C = Asub x Bsub
        MPI_Sendrecv(Bsub, dN*dN, mpitype, left, tag,         // STEP 3: CSHIFT Bsub
                     Btmp, dN*dN, mpitype, right, tag, col_comm, &status);
    }

Figure 3. MPI implementation of Cannon's algorithm.

The goal of the UPC language is to provide a shared memory-like programming model on distributed memory systems. The distributed global arrays are marked with the shared keyword, and the UPC compiler automatically decomposes the shared arrays across the available processors. However, an MPI programmer must explicitly distribute the N × N matrices into local submatrices of size dN × dN, where dN = N/√P and P is the number of processors.
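For illustration, this hand-written distribution step might look like the following sketch (our assumption of a typical implementation, not the paper's code), where the process at grid coordinates (r, c) copies its own dN × dN block out of a row-major global matrix:

    /* Sketch only: extract the dN x dN block owned by grid position (r, c)
       from the global N x N matrix, where dN = N / sqrt(P). */
    void extract_block(int N, int dN, int r, int c,
                       const double *A_global,   /* full N x N matrix, row-major */
                       double *A_sub)            /* this process's dN x dN block */
    {
        for (int i = 0; i < dN; i++)
            for (int j = 0; j < dN; j++)
                A_sub[i * dN + j] = A_global[(r * dN + i) * N + (c * dN + j)];
    }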

    shared [N] double Aupc[N][N], Bupc[N][N], Cupc[N][N];

    void matmul_upc()
    {
        int i, j, k;
        double sum;
        upc_forall (i = 0; i < N; i++; &Aupc[i][0])
            for (j = 0; j < N; j++) {
                sum = 0;
                for (k = 0; k < N; k++)
                    sum += Aupc[i][k] * Bupc[k][j];
                Cupc[i][j] = sum;
            }
    }

Figure 4. UPC matrix multiplication with block distribution.

4.3. Hybrid MPI+UPC implementation

In Figure 5, the funneled hybrid MPI+UPC is formed by a careful combination of the MPI program of Figure 3 and the UPC program of Figure 4. Here again, we present simplified hybrid code due to space limitations; however, the main algorithm of the code should be clear. The hybrid program consists of 1 MPI group, and each MPI process has 1 UPC group, as shown in Figure 1. There is only one master thread in each UPC group, such as the last thread (MYTHREAD == THREADS-1), which can participate in MPI operations as an MPI process. The master thread initializes the MPI and constructs an MPI Cartesian topology in part 1 of the code. Part 2 performs Cannon's algorithm with 2 explicit levels of parallelism. The MPI manages the outer parallelism by bringing the appropriate subblocks to the master threads of each UPC group. The master threads copy their private subblocks to the shared subblocks with a copy_from_master_to_upc() routine, which performs simple copy operations and synchronizes each UPC thread group with upc_barrier. The shared subblocks are in the global address space of each UPC group. Each UPC group participates in the inner-level parallelization of the subblock matrix multiplication with a block distribution, which is given by matmul_upc() in Figure 4.

In order to obtain optimal execution time in a parallel program, the program has to be partitioned into concurrent tasks that handle the different parts and different grain sizes of the global data for the data-parallel programming model. Here, the MPI, UPC, and hybrid MPI+UPC programs all use data-parallel programming, and the choice of grain size is very important for data-parallel programming. However, selecting the optimum grain size for parallel execution is an NP-complete problem. A large grain size will limit the potential parallelism, while a small grain size will result in greater communication overheads and may degrade the execution time. The MPI performs well on coarse-grain parallelism with larger message sizes, while UPC can handle fine grain better than the MPI.

The hybrid implementation has 2 kinds of data decomposition. Supposing that we have a global matrix A[N][N], we first perform the data decomposition as A_sub[dN][dN] for the MPI processes. Later, the UPC compiler performs the second decomposition with the shared Aupc[dN][dN] declaration. The master thread, which is both an MPI process and a UPC thread, copies A_sub to Aupc at the same time.
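The copy routine referenced above is not shown in the paper's listings; a minimal sketch of what it might look like is given below, assuming the group's shared subblocks Aupc and Bupc, the master's private buffers Atmp and Bsub, and the MASTER flag of Figure 5, and assuming that they all share the dN × dN subblock extent used in the MPI calls:

    /* Sketch (our assumption): the master thread copies its private MPI buffers
       into the UPC group's shared subblocks; the barrier makes the copies
       visible to every thread before matmul_upc() starts. */
    void copy_from_master_to_upc(const double *Atmp, const double *Bsub)
    {
        int i, j;
        if (MASTER) {                        /* only the master holds the MPI data */
            for (i = 0; i < dN; i++)
                for (j = 0; j < dN; j++) {
                    Aupc[i][j] = Atmp[i * dN + j];
                    Bupc[i][j] = Bsub[i * dN + j];
                }
        }
        upc_barrier;                         /* synchronize the whole UPC group */
    }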

The first distribution results in a coarse-grain distribution, which the MPI can handle very efficiently. The second distribution performs a fine-grain distribution, which UPC can handle quite effectively.

    shared [N] double Aupc[N][N], Bupc[N][N], Cupc[N][N];
    Boolean MASTER = (MYTHREAD == THREADS-1);

    int main(int argc, char *argv[])
    {
        if (MASTER) {                       // PART 1: Construct a Cartesian topology
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &id);
            MPI_Comm_size(MPI_COMM_WORLD, &p);
            ...
            MPI_Comm_split(grid_comm, grid_coords[0], grid_coords[1], &row_comm);
            MPI_Comm_split(grid_comm, grid_coords[1], grid_coords[0], &col_comm);
        }
        for (k = 0; k < S; k++) {           // PART 2: Cannon algorithm
            // Master copied Atmp = Asub
            if (MASTER)                     // STEP 1: Broadcast Asub
                MPI_Bcast(Atmp, dN*dN, mpitype, src, row_comm);
            copy_from_master_to_upc(Atmp, Bsub);   // Copy to shared: Aupc = Atmp, Bupc = Bsub
            matmul_upc();                   // STEP 2: Cupc += Aupc x Bupc
            if (MASTER)                     // STEP 3: CSHIFT Bsub
                MPI_Sendrecv(Bsub, dN*dN, mpitype, left, tag,
                             Btmp, dN*dN, mpitype, right, tag, col_comm, &status);
        }
    }

Figure 5. Hybrid MPI+UPC Cannon's algorithm. The subblock multiplication has UPC block distribution.

The hybrid program is compiled with the UPC compiler and linked with the MPI libraries. The -fupc-threads-NUM option generates code for a fixed number NUM of UPC threads. The MPI launcher is used to start the hybrid program. Below are an example UPC compilation and an example MPI launch, in which 4 MPI processes are created and each process creates a UPC group with 4 threads. The total number of parallel threads is 4 MPI × 4 UPC = 16 threads, as shown in Figure 1.

    $ upcc -o matmul_hybrid matmul_hybrid.upc -fupc-threads-4 -O
    $ mpiexec -np 4 --hostfile hosts matmul_hybrid

4.4. Optimized hybrid MPI+UPC implementation

Although overlapping communication with computation provides the opportunity to improve the execution time of a parallel program, this parallel programming style is not widely used due to its complexity. However, the hybrid Cannon algorithm presents a good opportunity for overlapping communication with computation. In this algorithm, we only need the full synchronization of each UPC group before performing the subblock matrix multiplication, so the computation of the subblock UPC multiplication can be overlapped with the MPI's communication. However, the upc_forall construct distributes the iterations of the loop according to the affinity

expression at the fourth parameter of the construct. UPC will assign each iteration to the thread that has affinity to the corresponding element of the shared array. This implies that assignment statements of upc_forall involving shared arrays are executed exclusively by those threads that own the shared array elements. We structured our hybrid code such that the master thread does not spend excessive time on upc_forall computations, but rather dedicates its time to MPI communication, since the master thread is both a UPC thread and an MPI process.

    #define BLOCK (N*N)/(THREADS - 1)
    #if BLOCK < (64*1024)                  /* 64 K block-size limit of Berkeley UPC */
    shared [BLOCK] double Aupc[N][N];
    shared [BLOCK] double Bupc[N][N];
    shared [BLOCK] double Cupc[N][N];
    #else
    shared [ ] double Aupc[N][N];
    shared [ ] double Bupc[N][N];
    shared [ ] double Cupc[N][N];
    #endif

Figure 6. Distribution scheme to optimize the hybrid MPI+UPC Cannon algorithm.

If the hybrid code partitions the shared arrays such that the master thread has no shared array elements, or far fewer than the other threads, the master will finish upc_forall earlier than the others and reach the MPI communication operations while the other threads are still executing their portion of the upc_forall iterations. Figure 6 shows a distribution scheme in which fewer shared array elements reside in the master thread's memory space. The shared array declaration uses the keyword shared [block-size] to distribute the shared arrays one block per thread in a round-robin fashion. The shared array size is N × N in the subblock of Cannon's algorithm, so with a block size of (N × N)/(THREADS - 1) the last thread receives no block in the round-robin distribution and the master thread should have no data. For example, with THREADS = 4 and a 6 × 6 subblock, BLOCK = 36/3 = 12, so the 3 blocks of 12 elements go to threads 0, 1, and 2, and the master (thread 3) holds none. However, the Berkeley UPC implementation limits the block size to 64 K. Even if the problem size exceeds this limit, the scheme in Figure 6 still places fewer shared array elements on the last thread (the master) because of the round-robin distribution. The addition of the code in Figure 6 optimizes the hybrid MPI+UPC implementation of Figure 5 so that MPI communications overlap with UPC computations.

The Cannon algorithm has 3 steps in a loop:

Step 1: The master thread performs the MPI broadcast of A_sub.
Step 2: Every thread performs matmul_upc(), with parallelism from upc_forall.
Step 3: The master thread performs the shift communication of B_sub.

When the master thread is performing the broadcast communication in step 1, all of the other UPC threads are waiting at matmul_upc() in step 2. Once the master thread completes the broadcast, it joins the rest of the threads at matmul_upc() and every thread starts step 2. However, the master thread quickly exits from matmul_upc() to start the step 3 communication while the rest of the threads are still working on matmul_upc() in step 2. Here, the UPC computation in step 2 is overlapped with the MPI communication in step 3. Even further, the master thread completes the step 3 communication and starts the step 1 broadcast of the next iteration while the nonmaster threads are still finishing matmul_upc() in step 2. Basically, the master threads (MPI processes) spend most of their time on the communications in steps 1 and 3, while the other UPC threads are mostly busy successively performing the matrix multiplication in step 2.

4.5. Benefits of the hybrid MPI+UPC model over the MPI and UPC models

The hybrid MPI+UPC model can take advantage of the multicore nodes of the symmetric multiprocessing (SMP) cluster to access the shared memory within a node efficiently, at no communication cost. The MPI, in contrast, uses communications to access the shared memory on the multicore nodes of the SMP cluster, which adds overhead: the MPI pays a penalty even to access local data.

UPC does not provide process groups and does not allow for the allocation of distributed shared arrays on a subset of processors. The hybrid model uses MPI features to give UPC these groups, which improves the scalability of some hybrid programs over UPC programs.

The hybrid model provides 2 explicit levels of parallelism. The hybrid uses the MPI to manage the outer parallelism by bringing the appropriate data blocks to the UPC groups for processing, and the UPC groups participate in the inner level of parallelization. Neither the MPI nor UPC alone can provide this flexibility, which may be needed for some applications such as Cannon's algorithm. The hybrid implementation has 2 kinds of data decomposition: the first results in a coarse-grain distribution, which the MPI can handle very efficiently, and the second performs a fine-grain distribution, which UPC can handle quite effectively.

5. Performance evaluation

This section illustrates the impact of our proposed hybrid MPI+UPC approach through several experiments, running the 4 different implementations of the Cannon algorithm on a 16-node HP BL460c cluster located at Kadir Has University. This SMP cluster consists of 2.6-GHz Intel Xeon quad-core CPUs and 24 GB of RAM per node, with a total of 8 processing cores per node, running Linux and connected with a 20-Gbps Infiniband network. We show the performance of a baseline UPC version, a baseline MPI version, and the hybrid MPI+UPC versions for each benchmark. We performed the hybrid testing with various group sizes and selected the best-performing group for each benchmark. The experimental results were obtained as an average of 5 runs.

5.1. Cannon algorithm performance

The objective of the first experiment is to find an optimum UPC group size for the hybrid MPI+UPC model. The results in Figure 7 show the time required to perform the matrix multiplications with groups of 1, 4, 8, 16, and 64 UPC THREADS. The best performance is obtained with the group of 8 UPC threads. Each node in our system holds 8 cores per slot, so each UPC thread goes to a different core within the node. In fact, we configured UPC's GASNET_NODEFILE such that consecutive threads go to consecutive cores in the node for the hybrid runs. Similarly, we ensured that each MPI process goes to a different node by putting the slots=1 option in the MPI's hostfile, for the best performance of the hybrid model in a round-robin fashion. However, during plain MPI runs, the hostfile is configured with slots=8 to advise consecutive processes to be placed on consecutive cores for process affinity.
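For instance, the two hostfile variants described above might look like the following (hypothetical node names; slots=1 spreads MPI processes across nodes for the hybrid runs, while slots=8 packs consecutive processes onto one node's 8 cores for the plain MPI runs):

    # hostfile for hybrid runs: one MPI process (one UPC group) per node
    node01 slots=1
    node02 slots=1
    node03 slots=1
    node04 slots=1

    # hostfile for plain MPI runs: fill all 8 cores of a node before moving on
    node01 slots=8
    node02 slots=8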
Cannon's algorithm runs on a 2-dimensional square mesh constructed by the MPI. For example, in the 9 MPI × 8 UPC case, there will be 9 MPI processes used to create a 2-dimensional 3 × 3 square mesh. If

the number of MPI processes is P, the algorithm creates a √P × √P mesh. In addition, all of the matrix sizes given are those of the global matrix, not the distributed submatrices.

The second experiment was designed to reveal how well the proposed hybrids perform compared to the plain MPI and plain UPC versions. Figure 8 shows the total execution time on the vertical axis, and the horizontal axis denotes the problem size. The CPU times of the hybrids and the others are almost the same for the small problem sizes. For the other problem sizes, the hybrids consistently obtain better performance than UPC.

Figure 7. The effect of varying numbers of MPI processes and UPC threads on execution time for double-precision global matrices for Cannon's algorithm on 64 CPUs.

Figure 8. Execution time of Cannon's algorithm on 36 CPUs with varying problem sizes; the hybrids run 4 MPI × 9 UPC configurations.

The Table shows the percentage gains of the optimized hybrid compared to the MPI, UPC, and the plain hybrid. For the 10,000² problem size, the optimized hybrid shows a gain of 60.13% over UPC, and it outperforms the MPI by 46.83%. Compared to the plain hybrid, the optimized hybrid achieved approximately a 16% gain.

Table. Comparing percentage gains of the optimized hybrid MPI+UPC version to the MPI, UPC, and plain hybrid MPI+UPC of Figure 5 (rows: gain over MPI, gain over UPC, gain over hybrid; columns: data sizes up to 10,000²).

Figure 9 shows the behavior of the MPI, UPC, and optimized hybrid MPI+UPC codes on large data sizes with a fixed number of CPUs. The optimized hybrid achieved around 20% and 15% gains compared to the MPI and UPC, respectively. However, the UPC code outperforms the MPI in this scenario. Because the MPI requires copying data through shared memory, when the data size increases while the CPU count stays the same, the number of direct accesses to UPC's shared memory grows, and the MPI becomes comparatively less efficient unless the number of CPUs increases along with the data size.

5.2. Random-access benchmark performance

The random-access benchmark is defined to test the speed at which a machine can update the elements of a table spread across the global system memory. The benchmark requires that the array not be replicated, because of the memory limitation. Parallel implementations perform poorly on distributed-memory machines because the updates require numerous small point-to-point messages between processors.

The random-access benchmark is trivial to parallelize using UPC, with 20 lines of code, since UPC uses one-sided access to distributed arrays. However, parallelizing this application is not trivial when using the MPI, which requires 150 lines of code to handle the irregular communication patterns with point-to-point communication. The funneled hybrid MPI+UPC is 180 lines of code. In terms of productivity in this application, the UPC model is superior to the MPI and hybrid models because of its PGAS support. However, Figure 10 shows that the hybrid MPI+UPC outperforms both the MPI and UPC in terms of execution time.

Figure 9. Execution time of Cannon's algorithm on 100 CPUs with large problem sizes.

Figure 10. Random-access benchmark execution time (s) versus number of CPUs for the UPC, MPI, and hybrid (group size of 8) implementations.

Each process performs 1,000,000 random accesses to the distributed global array of length 1,000,000, and for each access it performs 1000 floating-point operations of computation. Each process has a constant amount of work, so the performance of this experiment does not improve when the number of processors increases. In this experiment, the funneled hybrid MPI+UPC model leads to about a 200% improvement in performance over the UPC model and a 100% improvement over the MPI model. Both the UPC and MPI execution times increase when the number of processors increases, because this also increases the number of communications. However, the funneled hybrid MPI+UPC remains flat, indicating that the processor count does not affect the execution time. The main reason for the success of the hybrid in this case is that it better localizes the random accesses. The MPI model outperforms UPC because the hand-coded communication is better than the UPC compiler-generated, one-sided communication.

We wanted to compare our result with the nested-funneled model of [3]; however, we were not able to set up their nested-funneled model on our system because at present there are numerous caveats and difficulties involved in setting up and using their model [9]. If we make a rough comparison between the plain execution time of our model and the numbers reported in [3], our model leads to about a 30% improvement in performance over the Hybrid-4 run of Dinan et al. However, our system has 2.6-GHz Intel Xeon CPUs, while Dinan et al.'s test machine has dual-core 2.6-GHz AMD Opteron processors. These systems have different floating-point units and cache characteristics in terms of cache size and latencies, which may have a great effect on the execution time. In addition, our system with a 20-Gbps Infiniband is faster than the 10-Gbps Infiniband used for Dinan et al.'s test machine.

The random-access benchmark has an irregular access pattern; thus, exchanging data via a regular communication pattern is not feasible. For that reason, the communication timing of this benchmark is made up of the latency of the Infiniband network, not its bandwidth. The end-to-end latency is 2.84 μs and 3.85 μs for messages of up to 128 bytes for the 20-Gbps and 10-Gbps Infiniband, respectively [14]. In this application, each process performs 1,000,000 random accesses. If we assume, for simplicity, that all of the accesses are remote accesses, our communication timing gains about 1 s over the case of Dinan et al. However, we conducted an experiment to measure the communication timing for this benchmark, including congestion and delays on our cluster, and found that the communication took 4 times longer than the above numbers. That means that the better network gave us a 4-s gain over the case of Dinan et al. Thus, when we take the network speed difference into account, our improvement is reduced to the 18% range.

5.3. Barnes-Hut N-body simulation performance

The classical N-body problem simulates the evolution of a system of N bodies, where the force exerting a gravitational pull on each body arises from its interaction with all of the other bodies in the system. This algorithm is frequently used in demonstrations of computational performance and is interesting for several reasons. First, the simulation of the motion of particles subject to particle-particle interactions represents a general class of algorithms with applications ranging from chemistry to astrophysics. Straightforward sequential algorithms to solve these problems typically have time complexity O(N²) per iteration, where N is the number of objects. However, the Barnes-Hut algorithm [15] was developed to reduce this complexity. It is an O(N log N) algorithm based on a hierarchical octree representation of space in 3 dimensions, which computes interactions between distant particles by means of a first-order approximation.

We undertook an experiment to show the strong scaling of the Barnes-Hut force and velocity calculation kernel. The results are shown in Figure 11 for a 150,000-body system. We show the speedup for the baseline UPC implementation, the MPI implementation, and the funneled hybrid implementation with a UPC group size of 8. The UPC implementation shows poor scaling performance because the number of nonlocal references, implemented as one-sided communication, increases proportionately to the UPC thread count. The MPI shows better speedup performance than UPC in this application. The funneled hybrid MPI+UPC model performs better still, because the hybrid model achieves a higher percentage of local data references than the baseline UPC implementation by replicating the octree of the algorithm. The funneled hybrid with group size 8 achieves almost linear scaling because the hybrid brings all of the data to the SMP multicore nodes. In comparison, the funneled hybrid with an 8-thread UPC group offers better performance than the nested-funneled Hybrid-8 of [3]. However, our best-performing hybrid with an 8-thread UPC group is comparable to the Hybrid-4 of [3], because we conducted our experiment on an SMP cluster with 8 processing cores per node, while Dinan et al. conducted their experiment on an SMP cluster with 4 cores per node.
6. Related work

An SMP cluster system with multiple SMP nodes and multicore chips is the most commonly available parallel computing hardware. MPI programming can be used both within an SMP node and among the SMP cluster's nodes. However, another programming model for this platform is a hybrid programming model, in which a parallel program is written using a thread programming library, such as Portable Operating System Interface (POSIX) threads, within an SMP node together with MPI programming among the nodes. The MPI

standard has clearly defined the interaction between the MPI and user-created threads in an MPI program. However, Gropp and Thakur [8] pointed out the issues involved in developing an efficient thread-safe MPI implementation without sacrificing too much performance.

Figure 11. Barnes-Hut N-body simulation speedup graphs for the force calculation part for the UPC, MPI, and funneled hybrid (group size of 8) implementations.

Other hybrid programming, with the MPI as the outer-level parallelism and OpenMP as the inner-level parallelism, has been extensively studied for SMP cluster systems [16-18]. The shared address space within each SMP node is suitable for OpenMP parallelization, and the MPI can be employed within and across the nodes of a cluster. The MPI/OpenMP hybrid programming model is easy to apply via automatic parallelization by the compilers with some directives for loop-level parallelism. Rabenseifner et al. showed the relation between the MPI/OpenMP hybrid programming model and the hardware architecture [17].

Recent years have seen another shift in parallel computing hardware with the introduction of general-purpose programming tools to perform high-performance computing on multicore graphics processing units (GPUs), which are attached to each node of an SMP cluster. The latest GPU programming languages, such as NVIDIA's Compute Unified Device Architecture (CUDA), provide the programmer with a high-level and flexible model. The work in [19] develops a hybrid MPI-CUDA model, adopting the CUDA programming model for fine-grain data-parallel operations within each GPU and the MPI for coarse-grain parallelization across the cluster.

A new hybrid parallel model that combines the MPI and UPC was first explored in [3] by Dinan et al., who defined the hybrid MPI+UPC parallel programming model in terms of submodels that varied the level of nesting and the number of instances of the model. They classified the hybrid model into 3 categories. The flat model permits all processes to participate in both UPC and MPI communication. In the nested-funneled model, one process per UPC group can participate in MPI communication. The nested-multiple model is the most powerful, allowing the MPI to span all UPC processes in all UPC groups; however, this added flexibility comes with greater complexity. The authors demonstrated its effectiveness and performance gains with the Barnes-Hut N-body simulation. Our funneled model is similar to the nested-funneled model in that only one member of each UPC group is able to make MPI calls. However, in our funneled model, every UPC group has its own shared variables, which are distributed only within that UPC group; they are not distributed across all groups to enlarge the MPI memory. In [3], the lower and upper bounds of upc_forall are calculated by using the MPI id and the UPC group id. In contrast, our approach simply uses the shared array index of upc_forall for the lower and upper bounds in each UPC group. In addition, we implemented the Cannon algorithm with advanced MPI features


More information

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1 Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming

More information

Advanced Message-Passing Interface (MPI)

Advanced Message-Passing Interface (MPI) Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point

More information

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co-

A Comparison of Unified Parallel C, Titanium and Co-Array Fortran. The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Shaun Lindsay CS425 A Comparison of Unified Parallel C, Titanium and Co-Array Fortran The purpose of this paper is to compare Unified Parallel C, Titanium and Co- Array Fortran s methods of parallelism

More information

Our new HPC-Cluster An overview

Our new HPC-Cluster An overview Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization

More information

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Chip Multiprocessors COMP Lecture 9 - OpenMP & MPI

Chip Multiprocessors COMP Lecture 9 - OpenMP & MPI Chip Multiprocessors COMP35112 Lecture 9 - OpenMP & MPI Graham Riley 14 February 2018 1 Today s Lecture Dividing work to be done in parallel between threads in Java (as you are doing in the labs) is rather

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms Chapter 5 Vector and Matrix Products Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign CS 554 / CSE 512 Michael T. Heath Parallel

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Unified Parallel C, UPC

Unified Parallel C, UPC Unified Parallel C, UPC Jarmo Rantakokko Parallel Programming Models MPI Pthreads OpenMP UPC Different w.r.t. Performance/Portability/Productivity 1 Partitioned Global Address Space, PGAS Thread 0 Thread

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL. Elif ÖZTÜRK

DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL. Elif ÖZTÜRK DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL Elif ÖZTÜRK KADIR HAS UNIVERSITY 2011 DEVELOPMENT OF HYBRID MPI+UPC PARALLEL PROGRAMMING MODEL ELİF ÖZTÜRK B.S., Computer Engineering, Kadir Has

More information

Hybrid MPI and OpenMP Parallel Programming

Hybrid MPI and OpenMP Parallel Programming Hybrid MPI and OpenMP Parallel Programming Jemmy Hu SHARCNET HPTC Consultant July 8, 2015 Objectives difference between message passing and shared memory models (MPI, OpenMP) why or why not hybrid? a common

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

MPI & OpenMP Mixed Hybrid Programming

MPI & OpenMP Mixed Hybrid Programming MPI & OpenMP Mixed Hybrid Programming Berk ONAT İTÜ Bilişim Enstitüsü 22 Haziran 2012 Outline Introduc/on Share & Distributed Memory Programming MPI & OpenMP Advantages/Disadvantages MPI vs. OpenMP Why

More information

Lecture 6: Parallel Matrix Algorithms (part 3)

Lecture 6: Parallel Matrix Algorithms (part 3) Lecture 6: Parallel Matrix Algorithms (part 3) 1 A Simple Parallel Dense Matrix-Matrix Multiplication Let A = [a ij ] n n and B = [b ij ] n n be n n matrices. Compute C = AB Computational complexity of

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Point-to-Point Synchronisation on Shared Memory Architectures

Point-to-Point Synchronisation on Shared Memory Architectures Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore

High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore High Performance Computing Prof. Matthew Jacob Department of Computer Science and Automation Indian Institute of Science, Bangalore Module No # 09 Lecture No # 40 This is lecture forty of the course on

More information

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

Compiling for GPUs. Adarsh Yoga Madhav Ramesh Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center

More information

Parallel Computers. c R. Leduc

Parallel Computers. c R. Leduc Parallel Computers Material based on B. Wilkinson et al., PARALLEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers c 2002-2004 R. Leduc Why Parallel Computing?

More information

Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 8

Copyright The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Chapter 8 Chapter 8 Matrix-vector Multiplication Chapter Objectives Review matrix-vector multiplication Propose replication of vectors Develop three parallel programs, each based on a different data decomposition

More information

Project C/MPI: Matrix-Vector Multiplication

Project C/MPI: Matrix-Vector Multiplication Master MICS: Parallel Computing Lecture Project C/MPI: Matrix-Vector Multiplication Sebastien Varrette Matrix-vector multiplication is embedded in many algorithms for solving

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Dense Matrix Algorithms

Dense Matrix Algorithms Dense Matrix Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 2003. Topic Overview Matrix-Vector Multiplication

More information

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

Parallelization, OpenMP

Parallelization, OpenMP ~ Parallelization, OpenMP Scientific Computing Winter 2016/2017 Lecture 26 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de made wit pandoc 1 / 18 Why parallelization? Computers became faster and faster

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...

ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:... ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Parallel Computing. Lecture 17: OpenMP Last Touch

Parallel Computing. Lecture 17: OpenMP Last Touch CSCI-UA.0480-003 Parallel Computing Lecture 17: OpenMP Last Touch Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Some slides from here are adopted from: Yun (Helen) He and Chris Ding

More information

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)

Parallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads) Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program

More information

. Programming Distributed Memory Machines in MPI and UPC. Kenjiro Taura. University of Tokyo

. Programming Distributed Memory Machines in MPI and UPC. Kenjiro Taura. University of Tokyo .. Programming Distributed Memory Machines in MPI and UPC Kenjiro Taura University of Tokyo 1 / 57 Distributed memory machines chip (socket, node, CPU) (physical) core hardware thread (virtual core, CPU)

More information

Optimization of MPI Applications Rolf Rabenseifner

Optimization of MPI Applications Rolf Rabenseifner Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture

Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture Matrix Multiplication on an Experimental Parallel System With Hybrid Architecture SOTIRIOS G. ZIAVRAS and CONSTANTINE N. MANIKOPOULOS Department of Electrical and Computer Engineering New Jersey Institute

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 15 October 2015 Announcements Homework #3 and #4 Grades out soon Homework #5 will be posted

More information

Lecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp

Lecture 36: MPI, Hybrid Programming, and Shared Memory. William Gropp Lecture 36: MPI, Hybrid Programming, and Shared Memory William Gropp www.cs.illinois.edu/~wgropp Thanks to This material based on the SC14 Tutorial presented by Pavan Balaji William Gropp Torsten Hoefler

More information

Blocking SEND/RECEIVE

Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI) Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Numerical Algorithms

Numerical Algorithms Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems

Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Experiences in Tuning Performance of Hybrid MPI/OpenMP Applications on Quad-core Systems Ashay Rane and Dan Stanzione Ph.D. {ashay.rane, dstanzi}@asu.edu Fulton High Performance Computing Initiative, Arizona

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

A Comprehensive Study on the Performance of Implicit LS-DYNA

A Comprehensive Study on the Performance of Implicit LS-DYNA 12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four

More information

Hybrid Programming with MPI and OpenMP

Hybrid Programming with MPI and OpenMP Hybrid Programming with and OpenMP Fernando Silva and Ricardo Rocha Computer Science Department Faculty of Sciences University of Porto Parallel Computing 2017/2018 F. Silva and R. Rocha (DCC-FCUP) Programming

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Implementation of Parallelization

Implementation of Parallelization Implementation of Parallelization OpenMP, PThreads and MPI Jascha Schewtschenko Institute of Cosmology and Gravitation, University of Portsmouth May 9, 2018 JAS (ICG, Portsmouth) Implementation of Parallelization

More information