Translating OpenMP Programs for Distributed-Memory Systems


Johannes Thull, Sept.

Abstract

Parallel systems that support a shared memory abstraction are becoming more and more important in the HPC market. In addition, clusters of these SMP nodes are built in order to further increase the speed of the entire system. These systems present a hierarchical hardware design where intra-node communication can take place implicitly through a shared memory, while inter-node communication must be carried out explicitly through a message passing interface. Currently, OpenMP and MPI are the de-facto programming tools for shared memory and distributed memory systems, respectively. However, OpenMP is not suited for distributed-memory architectures, while MPI is, but suffers from large communication overhead. It is therefore reasonable to combine both parallelization strategies by introducing a hybrid OpenMP/MPI model. This model closely maps to the architecture of an SMP cluster, where MPI provides communication mechanisms between nodes while OpenMP exploits parallelism within shared-memory nodes. Furthermore, MPI is deemed to be the assembler language of parallel programming, and parallelizing a sequential application in MPI requires a considerable effort. Therefore, it is necessary to provide the programmer with a tool that abstracts from the detailed and extensive programming issues of MPI. This term paper presents two approaches that abstract from directly programming MPI code by automatically generating MPI or even hybrid OpenMP/MPI code.

Contents

1 Introduction
2 MPI and OpenMP: A short Introduction
  2.1 Execution Model
  2.2 Data Handling
  2.3 MPI Example Code
  2.4 OpenMP Example Code
3 The Language llc
  3.1 The OTOSP Model
  3.2 Example
  3.3 Translating llc Programs
  3.4 Experimental Results
4 Automatic generation of MPI code from OpenMP programs in GCC
  4.1 OpenMP to MPI Translation
  4.2 GCC Compilation
  4.3 GOMP: OpenMP Support for GCC
  4.4 Using GOMP to transform OpenMP Programs into MPI Programs
  4.5 Current Implementation
5 Conclusions

1 Introduction

Most systems in High Performance Computing are clusters of shared memory nodes, ranging from small clusters of multicore-CPU PCs up to the largest systems like the Earth Simulator, which consists of over 640 SMP nodes [5]. When developing applications for such systems, programmers currently have the choice between different APIs. These include Intel TBB [21], Cilk [22], Pthreads [23], OpenMP, or MPI, just to mention a few. Here, we only consider the last two APIs: OpenMP and MPI.

MPI (Message Passing Interface) is a standard specification for a message passing interface between processes on a distributed-memory system, which provides the programmer with a Single-Program-Multiple-Data (SPMD) view of the computation. Here, the programmer has to manage all the communication using special library functions, which makes MPI, on the one hand, a powerful tool. On the other hand, decomposition, development and debugging of applications can be time consuming, and significant code changes are often required. Since communications have to be carried out through a messaging system, they can produce a large overhead, and large code granularity is required to reduce latency [5].

OpenMP is an industry standard for shared memory programming which provides the programmer with a set of compiler directives, library routines and environment variables. In contrast to MPI, communication is implicit and the actual parallelization is delegated to the compiler. Parallelism is expressed through compiler pragmas, which makes applications relatively easy to implement.

Both MPI and OpenMP have their advantages and disadvantages. The main drawback of OpenMP is its restriction to shared memory architectures, whereas MPI messages also work across system boundaries (e.g. via TCP/IP). On the other hand, MPI scales poorly on fine-grain problems, where MPI applications become communication dominated. To combine the best of both worlds, several approaches [5, 9, 10] to combine shared- and distributed-memory programming have been proposed. The idea behind this hybrid model is to place MPI parallelism on top of OpenMP parallelism. For example, consider Figure 1, which shows a 2D data array that has to be processed by an SMP cluster of four nodes. In a first step, the array is partitioned into four parts and distributed to four MPI processes, each running on a separate node. On a second level, each node runs an OpenMP code which accomplishes a further partitioning and consequently also a further level of parallelization. This model closely maps to the architecture of an SMP cluster: MPI provides communication mechanisms between nodes while OpenMP exploits parallelism within shared-memory nodes.

In this term paper, I will present two strategies to bring both parallel paradigms, i.e. shared memory and message passing, together. Both approaches have in common that they abstract from directly programming MPI code and hence make parallel programming much easier for less experienced programmers like scientists and engineers. The term paper is organized as follows: Section 2 gives a short introduction to the execution and data handling models of OpenMP and MPI. In addition, I give two example codes to illustrate the usage of OpenMP and MPI, respectively. Section 3 is based on the work of Ruymán Reyes et al. [1]. In this Section I will present llc, a language that expresses parallelism through OpenMP directives and automatically generates hybrid code. In this approach, MPI handles inter-node communications, while OpenMP is used inside each SMP node.
Section 4, which is based on the work of Abdellah-Medjadji et al. [13], describes how to use the OpenMP implementation in GCC to transform OpenMP code into MPI programs.

Figure 1: Mixed mode MPI+OpenMP programming (four SMP nodes; one MPI process per node, each running a team of OpenMP threads over the node's shared memory).

2 MPI and OpenMP: A short Introduction

MPI is based on a Single-Program-Multiple-Data (SPMD) execution model. In this model, all processes execute the same program executable. Instructions that will be executed by each process are determined using control structures that depend on a special process identifier (called rank). In contrast to MPI, OpenMP uses a fork/join execution model. In this model, each program begins execution as a single thread called the master thread. The master thread is executed sequentially until a parallel construct is reached.

2.1 Execution Model

In OpenMP, when a parallel construct is encountered, the master thread creates a team of slave threads. Each thread in the team executes the statements in the dynamic extent of a parallel region (except for the work-sharing constructs). When completing the parallel construct, all threads in the team synchronize at an implicit barrier, and only the master thread continues execution [14]. In terms of MPI programs, sequential parts (outside parallel regions) will thus be redundantly executed by all MPI processes.
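To make the difference between the two execution models concrete, the following minimal sketch (my own illustration, not taken from the sources; all names are mine) combines both: the printf guarded by the rank mimics the single master thread of OpenMP, while the unguarded call to fork_join_example() is executed redundantly by every MPI process, exactly as described above.

    #include <stdio.h>
    #include <omp.h>
    #include <mpi.h>

    /* OpenMP fork/join: a team of threads exists only inside the parallel region. */
    void fork_join_example(void)
    {
        #pragma omp parallel
        {
            printf("thread %d of the team\n", omp_get_thread_num());
        }   /* implicit barrier; only the master thread continues */
    }

    /* MPI SPMD: every process executes main() from the beginning. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)                 /* rank-based guard: only one process prints */
            printf("sequential part, executed once\n");

        fork_join_example();           /* not guarded: replicated on every process  */

        MPI_Finalize();
        return 0;
    }

Built with something like mpicc -fopenmp, this is already the hybrid MPI+OpenMP style shown in Figure 1.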

2.2 Data Handling

While data communication in MPI is carried out explicitly (e.g. via MPI_Send() and MPI_Recv()), OpenMP takes advantage of the ability to directly access shared memory. To this end, it is necessary to coordinate the access to shared variables by multiple threads in order to ensure correct execution. OpenMP allows two different kinds of variable sharing attributes inside a parallel construct: private and shared. Private variables allow each thread to have its own local copy and use it as a temporary variable. A private variable is uninitialized and its value will not be kept outside the parallel region. Shared variables are visible and accessible to all threads simultaneously. By default, variables inside a parallel construct are shared.

2.3 MPI Example Code

In this section, I will not give a complete reference for all commands and arguments of MPI. I would rather like to explain them with the aid of an example code. For this purpose, consider Listing 1, which shows the basic structure of an MPI program. The routine MPI_Init() (line 12) initializes the MPI environment and must be called before any other MPI routine. Usually, it gets the command line arguments of the main routine. MPI processes can be divided into groups. A process is represented by a (group, rank) pair. Each process in a group is associated with a unique integer rank. The routine MPI_Comm_size() (line 13) determines the size of the group (i.e. the number of processes) associated with a so-called communicator. Roughly speaking, a communicator is a data structure that contains information about the participants of a communication. MPI_COMM_WORLD is a predefined communicator which contains all created processes. It is initialized when the program is started. The command MPI_Comm_rank() (line 14) determines the rank of the calling process in the communicator. The if-statement in line 17 establishes a master-slave relationship between the processes. The process with rank 0 assumes the role of the master and sends different messages to the other processes (i.e. the slaves). Sending is done using the MPI_Send() routine, which takes a pointer to the send buffer, the size of the buffer, the datatype of each send buffer element, the rank of the destination process, a message tag and a communicator. A tag is an arbitrary non-negative integer to uniquely identify a message. A receiving process will only receive a message if the message tag matches the tag specified in MPI_Recv(). The else-part of the if-statement (line 25 et seq.) contains the code for all slave processes. The routine MPI_Recv() takes a pointer to a message buffer, its length, the datatype of each element, the rank of the source process (here we only consider messages from the master process), a message tag, a communicator and a pointer to a status object (which we do not need here). MPI_Finalize() terminates the MPI execution environment and must be called by all processes before exiting.

Compilation of an MPI program is done using mpicc [11], which is an Open MPI C wrapper compiler that transparently adds the relevant compiler and linker flags to an underlying C compiler. In order to run the MPI program, you have to use a command line tool, for instance mpiexec [12]. For illustration purposes, I will run four MPI processes on a local machine using the command mpiexec -n 4 mpi_test, where the command line parameter -n specifies the number of MPI processes. Figure 2 shows the output of the MPI program.

1  #include <stdio.h>
2  #include <stdlib.h>
3  #include <string.h>
4  #include <mpi.h>
5
6  int main (int argc, char **argv)
7  {
8      int numprocs, rank, tag, i;
9      char message[32];
10     tag = 1234;
11
12     MPI_Init (&argc, &argv);
13     MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
14     MPI_Comm_rank (MPI_COMM_WORLD, &rank);
15
16     printf ("Process %d out of %d started.\n", rank, numprocs);
17     if (rank == 0)
18     {
19         for (i = 1; i < numprocs; i++)
20         {
21             sprintf (message, "Hello, there node %d", i);
22             MPI_Send (message, sizeof (message), MPI_CHAR, i, tag, MPI_COMM_WORLD);
23         }
24     }
25     else
26     {
27         MPI_Recv (message, sizeof (message), MPI_CHAR, 0, tag, MPI_COMM_WORLD, NULL);
28         printf ("Node %d received \"%s\" from node 0\n", rank, message);
29     }
30
31     MPI_Finalize ();
32     return 0;
33 }

Listing 1: MPI Example Code

Process 1 out of 4 started.
Process 2 out of 4 started.
Process 0 out of 4 started.
Process 3 out of 4 started.
Node 3 received "Hello, there node 3" from node 0
Node 1 received "Hello, there node 1" from node 0
Node 2 received "Hello, there node 2" from node 0

Figure 2: Output of the MPI Example Code.

2.4 OpenMP Example Code

This section gives a basic example of an OpenMP program that computes a sum in parallel. First, consider the for-loop in Listing 2, which iterates from 0 to 3. In order to parallelize this loop with OpenMP, the programmer has to add an OpenMP pragma omp parallel for (line 10) that tells the compiler to distribute the loop to different threads. During execution, the loop is forked to a specific number of slave threads. The clause private (id, num) indicates that each thread gets its own private copy of the variables id and num, respectively. shared (sum) expresses that the variable sum is shared by all threads. To avoid race conditions, the pragma omp critical (line 13) indicates a critical section where only one thread at a time can have access to the variable sum.

Later, in Section 4.3, Listing 6 shows a more elegant way that uses a reduction to form the sum. The function omp_get_thread_num() returns the thread number within its team. omp_get_num_threads() determines the total number of threads currently in the team. When compiling OpenMP programs with gcc, you have to use the option -fopenmp. The number of threads that will be created for parallelization is determined automatically, depending on the number of threads the processor is capable of running in parallel. This number can be changed either by setting the environment variable OMP_NUM_THREADS=n or in the source code using the function omp_set_num_threads(). Figure 3 shows the output of the OpenMP program on a dual-core machine.

1  #include <omp.h>
2  #include <stdio.h>
3  #include <stdlib.h>
4
5  int main (void)
6  {
7      int num, id, sum, i;
8
9      sum = 0;
10     #pragma omp parallel for private(id, num) shared(sum)
11     for (i = 0; i < 4; i++)
12     {
13         #pragma omp critical
14         {
15             sum++;
16         }
17
18         id = omp_get_thread_num();
19         num = omp_get_num_threads();
20
21         printf("Thread %d out of %d computes iteration %d\n", id, num, i);
22     }
23     printf("sum: %d\n", sum);
24
25     return 0;
26 }

Listing 2: OpenMP Example Code

Thread 0 out of 2 computes iteration 0
Thread 0 out of 2 computes iteration 1
Thread 1 out of 2 computes iteration 2
Thread 1 out of 2 computes iteration 3
sum: 4

Figure 3: Output of the OpenMP Example Code.
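For comparison, the reduction-based variant mentioned above (and used later in Listing 6) avoids the critical section entirely: each thread accumulates into a private copy of sum, and the copies are combined when the loop ends. This is a minimal sketch of my own, not part of the original text.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int i, sum = 0;

        /* each thread gets a private sum initialized to 0; the private copies
           are added into the shared sum at the end of the parallel loop */
        #pragma omp parallel for reduction(+: sum)
        for (i = 0; i < 4; i++)
        {
            sum++;
        }

        printf("sum: %d\n", sum);   /* prints 4, as in Figure 3 */
        return 0;
    }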

3 The Language llc

llc (read: La Laguna C, named after the La Laguna University in Tenerife, where it was developed) is a high-level parallel language that provides both OpenMP and MPI programming through a set of compiler directives. This means that it follows the simplicity of OpenMP and abstracts from the low-level aspects of MPI. Like in OpenMP, all parallelism is expressed through compiler pragmas in the code. Wherever possible, llc pragmas are compatible with those existing in OpenMP [4]. The programmer starts with a sequential code and incrementally parallelizes it by adding compiler directives. Compilation is done in a source-to-source fashion using a special compiler called llcomp, which translates the annotated code into hybrid OpenMP/MPI code. The generated code includes MPI communication handling and synchronization of data between the nodes.

The following Sections are organized as follows: Section 3.1 describes the computational model behind llc and explains how computational work is distributed in principle. Section 3.2 gives a short example of how to annotate C code with llc pragmas and describes the generated C code. In Section 3.3, I will explain the actual translation process. In Section 3.4, I will present and discuss the experimental results the authors got when running their parallel code on two parallel systems.

3.1 The OTOSP Model

The OTOSP (One Thread is one Set of Processors) model [1, 2, 3, 4] is the underlying computational model of the llc language. It describes how computational work can be distributed to a given set of processors. An OTOSP machine consists of an infinite number of processors, where each processor has its own private memory. Communication between processors takes place through a network interface. In this model, all processors are organized in sets, and in each set, all processors have the same memory state, execute the same program and have the same input data.

Consider the example algorithm in Listing 3, which has two nested loops. At the beginning, the initial set consists of all (infinitely many) processors, which execute the same sequential thread (see Figure 4). When program execution reaches a parallel construct (in this example a parallel for), the set of processors is partitioned into subsets such that each subset executes a single loop iteration. One set executes the loop body for i = 1, the second set for i = 2, and the third set for i = 3. Each processor in a set replicates the computation of the same task, which allows us to (potentially) add a further parallelization level. In order to keep memory coherent, each processor in a subset communicates its results to the processors in the complementary subsets. This requires that the programmer tells the compiler which data will be changed inside each loop iteration. This is done by using the result directive (see line 2), which takes a list of pairs (pointer expression, size) determining the memory range that is modified during iteration i. The meaning of the source code was not described in detail, but one can see that data is changed in line 9 by the outer loop. It seems that the array si holds the number of (consecutive) bytes that were changed by I_function(), and the result is stored in the array ri at position i. For instance, when processor 0 finishes the execution of iteration i = 1, it sends si[1] bytes starting at the address pointed to by ri+1 to processors 1 and 2 (see the red communication line in the 2nd row of Figure 4). In addition, processor 0 receives the results of its partners, i.e. it receives si[2] bytes from processor 1 and stores them in its local memory starting at the address pointed to by ri+2, and it receives si[3] bytes from processor 2 and stores them at the address pointed to by ri+3.

In general, let P = {p_0, p_1, ...} be an infinite set of processors. When partitioning this set into n subsets P_0 = {p_0, p_n, p_2n, ...}, P_1 = {p_1, p_(n+1), p_(2n+1), ...}, ..., P_(n-1) = {p_(n-1), p_(2n-1), p_(3n-1), ...}, processor p_(i*n) exchanges data with the processors p_(i*n+j), for all i in N_0 and 1 <= j < n. Of course, this can be done recursively, as we will see next.

Lines 5-8 contain another loop, which is executed inside each iteration i of the outer loop. To this end, the processor sets are partitioned again, according to the number of iterations of the inner loop (see the 3rd row in Figure 4). For instance, the set that executes the outer loop iteration i = 1 yields a partition of the corresponding processor set into two new subsets, one that executes the (inner) loop iteration i = 1, j = 0 and one that executes the loop iteration i = 1, j = 1. Here, too, the results are communicated according to the partnership relation between the processors. This theoretical model abstracts from the fact that under real conditions only a finite number of processors is available. Nevertheless, it describes in an easy way how the distribution of a certain task to different processors takes place.

1  #pragma omp parallel for
2  #pragma llc result(ri + i, si[i]);
3  for (i = 1; i <= 3; i++)
4  {
5      #pragma omp parallel for
6      #pragma llc result(rj + j, sj[j]);
7      for (j = 0; j <= i; j++)
8          rj[j] = compute_rj(i, j, &sj[j]);
9      ri[i] = I_function(i, &si[i]);
10 }

Listing 3: Two nested parallel loops

Figure 4: Distribution of nested loop iterations among an infinite number of processors (first the sets for i = 1, 2, 3, then their subsets for the inner iterations j).
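The partnership relation can be made concrete with a small sketch of my own (it is not llc code): for a partition into n subsets, the n consecutive processors p_(i*n), ..., p_(i*n+n-1) form one partner block and exchange their results pairwise.

    #include <stdio.h>

    /* OTOSP partnership for one partition level: with n subsets, processor p_g
       belongs to subset P_(g % n) and exchanges results with the other members
       of its block of n consecutive processors. */
    static void print_partners(int g, int n)
    {
        int block  = g / n;    /* index i of the block p_(i*n) .. p_(i*n+n-1) */
        int subset = g % n;    /* subset P_j this processor belongs to        */

        printf("p_%d (subset P_%d) exchanges with:", g, subset);
        for (int j = 0; j < n; j++)
            if (j != subset)
                printf(" p_%d", block * n + j);
        printf("\n");
    }

    int main(void)
    {
        for (int g = 0; g < 6; g++)   /* first partition level of the example: n = 3 */
            print_partners(g, 3);
        return 0;
    }

For processor p_0 this prints the partners p_1 and p_2, matching the exchange for iteration i = 1 described above.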

3.2 Example

In this Section, I will give a basic example of how the llc language looks in practice and subsequently explain the generated hybrid code.

3.2.1 Computing π

In order to illustrate the main features of llc, Listing 4 shows a simple algorithm augmented with OpenMP and llc directives. The algorithm calculates an approximation of π and is based on the fact that π equals the integral from 0 to 1 of 4/(1+x^2) dx, which can be approximated by the sum, for i = 0, ..., N-1, of 4/(N(1+((i+0.5)/N)^2)).

1  w = 1.0 / N;
2  pi = 0.0;
3
4  #pragma omp parallel for private(t) reduction(+: pi)
5  #pragma llc reduction_type(double)
6  for (i = 0; i < N; i++)
7  {
8      t = (i + 0.5) * w;
9      pi = pi + 4.0 / (1.0 + t * t);
10 }
11 pi *= w;

Listing 4: parallel approximation of π

Let us first consider the OpenMP pragma, which consists of the following clauses: the parallel for construct indicates that the loop will be distributed among different threads; private(t) specifies that each thread has its own variable t. This clause is kept only for compatibility reasons, since all storage is private in the OTOSP model. The reduction(+: pi) clause tells the compiler that all values of pi should be added at the end of the loop. The llc compiler does not support type analysis, and therefore the type of the reduction variable has to be specified with the special llc pragma reduction_type (see line 5).

3.2.2 The generated C code

Listing 5 shows some parts of the hybrid code that was generated by the compiler. Since no explanation of the code was given by the authors, I can only give a rough explanation of the generated hybrid code. The original for-loop in Listing 4 ranges from 0 to N. In order to distribute the loop among several processors, each processor gets an index which is stored in llc_grp_save. The array llc_f holds the starting points for the different loops, and the function LLC_PROCSGRP returns the corresponding number of iterations. For example, suppose a loop ranging from 0 to 99 has to be distributed to 4 processors: the array llc_f should hold the values [0, 25, 50, 75] and LLC_PROCSGRP should return the value 25. Now, processor P_0 (llc_grp_save = 0) computes the loop for(i=0; i<25; i++), processor P_1 (llc_grp_save = 1) computes the loop for(i=25; i<50; i++), etc. In addition, each loop is parallelized with OpenMP (see line 1). At the end, the processors have to communicate their results. Each processor has a name (comparable to the rank in MPI) which is stored in LLC_NAME. The master node (LLC_NAME = 0) collects all the results by iterating over each node (line 12). To this end, the MPI function MPI_Recv is called to receive the data (i.e. the value pi of each slave node). The reduction is performed in line 16, where all values of pi are added.

The pointer llc_buf_ptr is used to access the received data. In line 21, the pointer is moved sizeof(double) bytes ahead in order to access the next received value. This statement is needless in this example, since there is only one value (pi) to receive. In lines 19-22, all reduced values are broadcast to all members of the group. The slave nodes (line 26 et seq.) copy their calculated values into a buffer and send it to the master node.

1  #pragma omp parallel for private(t) reduction(+: pi)
2  for (i = (0) + llc_f[llc_grp_save]; i < (0) + llc_f[llc_grp_save] + LLC_PROCSGRP(llc_grp_save); i++)
3  {
4      {
5          t = (i + 0.5) * w;
6          pi = pi + 4.0 / (1.0 + t * t);
7      };
8  }
9
10 if (LLC_NAME == 0)
11 {
12     for (llc_i = 1; llc_i < LLC_NUMPROCESSORS; llc_i++)
13     {
14         MPI_Recv(llc_buf, llc_buf_size, MPI_BYTE, llc_i, LLC_TAG_REDUCE_DATA, *llc_currentgroup, &llc_status);
15         llc_buf_ptr = llc_buf;
16         pi += (*(double *) llc_buf_ptr);
17         llc_buf_ptr += sizeof(double);
18     }
19     llc_buf_ptr = llc_buf;
20     memcpy(llc_buf_ptr, &pi, sizeof(double));
21     llc_buf_ptr += sizeof(double);
22     MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_currentgroup);
23 }
24 else
25 {
26     llc_buf_ptr = llc_buf;
27     memcpy(llc_buf_ptr, &pi, sizeof(double));
28     llc_buf_ptr += sizeof(double);
29     MPI_Send(llc_buf, llc_buf_size, MPI_BYTE, 0, LLC_TAG_REDUCE_DATA, *llc_currentgroup);
30     MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_currentgroup);
31     llc_buf_ptr = llc_buf;
32     memcpy(&pi, llc_buf_ptr, sizeof(double));
33     llc_buf_ptr += sizeof(double);
34 }

Listing 5: The llc translation of the π approximation in Listing 4

3.3 Translating llc Programs

When observing different parallel programs, you will find several similarities between the implementations. This is due to the fact that every parallel program has to carry out certain operations like initialization, execution, communication and finalization. These code patterns can be stored in separate files serving as skeletons for further parallel implementations. Such files are independent of the compiler and can be modified without changing the compiler source code; they contain special tags that have to be replaced appropriately by the compiler. llcomp takes advantage of this feature and uses two kinds of patterns when translating llc parallel constructs:

Static Patterns. Static patterns contain code that is needed for creating processor groups and other initial operations, resource distribution, data communication, load balancing, etc. These codes are written in the target language (i.e. C with MPI statements) and are fixed at compile time, since they only depend on the parallelization paradigm that was defined using llc or OpenMP directives. With the information contained in these directives, the compiler decides which code skeleton it uses and completes the code according to the information given in the directives. As mentioned before, each pattern is stored in a file and contains special tags that have to be completed during translation. Additionally, each pattern is divided into different stages like initialization, execution, communication and finalization. Here again, each stage is stored in a separate file. This rich library of patterns also contains optimized code for common situations. If one of these situations is detected by the compiler, it uses the optimized code in order to increase the performance.

Dynamic Patterns. Since static patterns only contain code that handles the sending and receiving of buffers (i.e. streams of bytes without datatypes), additional code is needed when operating on real data. Therefore, specific code for the allocation and management of these buffers has to be inserted at specially marked places within the static code. In order to produce this code, llcomp uses dynamic patterns, which are generated at compile time and stored in temporary files. In the next step, the content of these files is inserted at the specially marked places within the static pattern code. For further optimization, the transferred data is compressed in order to reduce the communication overhead between MPI nodes.

Because of the separation between the two types of patterns, static patterns can be changed without consideration of the data. This has the advantage that new data management strategies can be introduced without changing static code patterns. For the distribution between MPI and OpenMP, llc uses one MPI process per node; the loops inside the MPI process are parallelized with OpenMP. According to [5], this programming scheme is called hybrid masteronly, where MPI is called only outside OpenMP parallel regions, i.e. by the master thread. This idea is also reflected in the generated code in Listing 5. Lines 1-8 contain the code that will run on an SMP node parallelized with OpenMP. Communications (i.e. MPI calls) are done outside of this parallel region (lines 10 et seq.). This approach takes advantage of the shared memory inside the nodes and reduces the communication overhead between the nodes.

3.4 Experimental Results

In order to evaluate the performance of the llc translation, the authors used four algorithms on two different multicore systems. The algorithms are the π-approximation introduced in Section 3.2.1, the Velvet algorithm for Molecular Dynamics simulation, a Mandelbrot Set computation, and the Conjugate Gradient algorithm, which is also part of the NAS Parallel Benchmarks. (The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers; derived from computational fluid dynamics (CFD) applications, they consist of five kernels and three pseudo-applications [6].) The four algorithms were evaluated on the following two systems:

Tajinaste, an IBM cluster with 14 dual-core Opteron processors interconnected by a Gigabit Ethernet network, and Verode, a two quad-core Opteron system. Both systems ran Linux CentOS 5.3 with kernel 2.6, and the codes were compiled using gcc 4.1 and OpenMPI.

In the first experiment, the authors implemented four versions of the Mandelbrot Set algorithm:

1. LLC-MPI: pure MPI code generated with llcomp.
2. LLC-HYB: hybrid OpenMP/MPI code generated with llcomp.
3. MPI: a pure MPI implementation.
4. HYB: an ad-hoc hybrid implementation.

Figure 5 shows the speedups achieved with the four different implementations on the Tajinaste system (speedup is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with multiple processors: Speedup = T_serial / T_parallel [8]). For the Mandelbrot Set computation, the code generated by llcomp (LLC-MPI and LLC-HYB) yields almost the same speedup as the hand-written code, but with much less coding effort.

Figure 5: Mandelbrot Set Computation on Tajinaste (speedup of the four implementations over the number of processors).

In the subsequent experiments, the authors merely compare the two llc implementations (LLC-MPI and LLC-HYB) on the remaining algorithms. Figure 6 shows the results for the Molecular Dynamics simulation and the π-approximation on the Tajinaste system. Here, no significant differences could be observed. The same holds for the experiments on the Verode system (see Figure 7). One reason for this could be the fact that there is only little data that has to be transferred between the processes. For example, consider the π-approximation algorithm (Listing 4): there is only one reduction variable (pi) that has to be communicated at the end of each (partial) loop, and therefore it takes little advantage of the shared memory, since there is only little communication. This example shows that pure MPI implementations do not always suffer from too much inter-node traffic. In contrast, the Conjugate Gradient algorithm benefits from the hybrid approach (see Figure 7c). The reason for this improvement is the granularity of the algorithm: it requires lots of matrix and vector manipulations, which are carried out by several parallelized loops. Because of the fine-grain parallelism (i.e. many parallel loops with few iterations), this algorithm takes advantage of the shared memory system, since more communications can be carried out within an SMP node.

Unfortunately, the authors did not run the Conjugate Gradient algorithm on the Tajinaste system, and thus there is no comparison between the two systems for it. The experiments have shown that hybrid programming (and especially the automated generation of hybrid code) is no silver bullet for achieving an optimal speedup. It rather depends on different factors like the granularity of the parallelization and the amount of data that has to be communicated. Nevertheless, llc seems to be a promising language that also gives less experienced programmers the ability to parallelize sequential algorithms easily and with little programming effort.

Figure 6: Speedup in Tajinaste for LLC-MPI and LLC-HYB; (a) Molecular Dynamics algorithm, (b) π-approximation.

Figure 7: Speedup in Verode for LLC-MPI and LLC-HYB; (a) Molecular Dynamics algorithm, (b) π-approximation, (c) Conjugate Gradient algorithm.

4 Automatic generation of MPI code from OpenMP programs in GCC

In this Section I present an approach to automatically transform OpenMP code into MPI programs by using the GCC compiler. GCC is free software maintained by the Free Software Foundation, and hence the authors decided to add new passes to GCC. The following Sections are organized as follows: Section 4.1 describes the OpenMP to MPI translation process. Section 4.2 gives a rough description of the compilation process of GCC, and afterwards, in Section 4.3, I will introduce GOMP, an implementation of the OpenMP standard for GCC. In Section 4.4 I will describe how GOMP is used to transform OpenMP code into MPI programs.

4.1 OpenMP to MPI Translation

When translating OpenMP programs into MPI programs, both the execution model and the memory model (see Section 2) must be adapted. OpenMP distinguishes between shared and private variables. In the SPMD model of MPI, all variables are private because they live in the address spaces of the different MPI processes. Since the programmer guarantees that statements inside an OpenMP parallel region can be executed independently, this region can also be executed in parallel by several MPI processes. Hence, declaring a variable as shared can have different meanings in a distributed memory context:

1. The variable is read-only in each task. Therefore, it may be privatized and initialized (with the same value).

2. An array variable is modified by several tasks, but each task changes only distinct elements. In this case, single elements may be privatized. When the parallel construct finishes, all tasks have to accomplish a global update (a sketch of this case is given below).

3. The variable is read and modified by each task. Because of the relaxed-consistency memory model of OpenMP, the user has to provide synchronization directives (e.g. flush, atomic or critical) in the code. In this case, the variable may be privatized, and each synchronization causes update communications within the parallel construct.
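Case 2 is the situation the prototype in Section 4.5 targets. As a hand-written sketch of that pattern (my own code, not generated output; for brevity it uses a collective MPI_Allgather, whereas the prototype described later uses a centralized send-to-master/broadcast scheme):

    #include <mpi.h>

    #define N 1000   /* assumed, for simplicity, to be a multiple of the process count */

    /* Case 2: the array is logically shared, but every MPI process holds a private
       copy. Each process writes only its own block, and a global update at the end
       of the "parallel region" makes all copies identical again. */
    void update_shared_array(double a[N])
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                   /* block owned by this process */
        int lo = rank * chunk;

        for (int i = lo; i < lo + chunk; i++)   /* distinct elements per task  */
            a[i] = 2.0 * i;

        /* global update: every process receives every block, in place */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }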

4.2 GCC Compilation

Figure 8: GCC intermediate representations.

Roughly speaking, the compilation process of GCC (see Figure 8) can be split into four phases. The first phase, which is called parsing, is specific to each language and yields an intermediate representation of the source code called GENERIC. The GENERIC representation is a tree structure where each language construct (e.g. loops, conditionals, declarations) has a specific tree representation. Each node of the tree has a certain tree code that defines the type of the tree. Depending on the tree type, there exists a different number of operands. For example, an assignment expression has two operands which correspond to the left and right hand sides of the expression [15]. Figure 9 shows a simplified graphical representation of an assignment expression. At this level, some language specific constructs may still exist.

Figure 9: Simplified tree representation of the expression a = 123; (a modify_expr node whose operands are the var_decl for a, of type int, and the integer_cst 123).

The next phase is called gimplification, which builds a further representation of the code. GIMPLE uses the same tree data structure as GENERIC, but in a language independent fashion. Therefore, all expressions are converted into a three-address representation by breaking down GENERIC expressions into tuples of no more than 3 operands (with some exceptions like function calls). Additionally, all the control structures used in GENERIC are lowered into conditional jumps [16]. A dump of the GIMPLE form in a C-like representation can be requested with the gcc flag -fdump-tree-gimple.

At a later stage, GIMPLE trees are converted into Static Single Assignment (SSA) trees, a lower representation used for low-level optimization. When a variable is assigned multiple times, new versions of that variable are created in the SSA tree. Different versions of the same variable are distinguished by a subscript determining the version number. Variables used on the right hand side of expressions are renamed so that their version number matches that of the most recent assignment [16]. A dump of the SSA related information can be requested with the gcc flag -fdump-tree-ssa. Figure 10 shows a variable (a) that is assigned multiple times; the SSA representation contains different versions of that variable, and at the end, variable b is assigned the most recent version of a. All in all, GCC performs more than 20 different optimizations on SSA trees [15].

Finally, after some optimizations, the SSA representation is translated to the Register Transfer Language (RTL), which represents an abstract machine with an infinite number of registers.
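To illustrate the gimplification step described above, the following tiny function is annotated with a hand-written approximation of its three-address form (my own sketch; the exact output of -fdump-tree-gimple differs between GCC versions):

    /* C source, with a hand-written approximation of its GIMPLE form in comments. */
    int f(int b, int c, int d)
    {
        int a = b + c * d;   /* lowered to:  D.1 = c * d;  a = b + D.1;          */
        if (a > 10)          /* lowered to:  if (a > 10) goto L1; else goto L2;  */
            a = 0;           /* L1:          a = 0;                              */
        return a;            /* L2:          return a;                           */
    }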

    a = 1;              a_1 = 1;
    a = 2;    SSA       a_2 = 2;
    a = 3;   ----->     a_3 = 3;
    b = a;              b_4 = a_3;

Figure 10: Source code and its SSA representation.

RTL can be viewed as an abstract representation of the target assembly language, i.e. the concrete assembly syntax has been discarded and only the semantics captured [17]. A dump of the RTL related information can be requested with the gcc flag -fdump-rtl-expand. Figure 11 shows a simplified RTL representation of the expression b = a + 2. In a final step, the back-end generates the assembly code for the specified target platform.

    b = a + 2;   RTL    (set (reg:si 62)
                 ----->      (plus:si (reg/v:si 60 [ a ])
                                      (const_int 2 [0x2])))

Figure 11: Simplified RTL representation of the expression b = a + 2;

4.3 GOMP: OpenMP Support for GCC

GOMP [18] is the OpenMP implementation of GCC, and currently it supports the OpenMP 3.0 specification. Nevertheless, this paper focuses on the OpenMP 2.5 specification. The code generation strategy of GOMP moves the body of parallel regions into separate functions, which are passed as arguments to the libgomp thread creation routines. Data sharing is implemented by passing the address of a local structure (which will be described later). The entire transformation (see Figure 12) is done on the GIMPLE representation of the code and mainly consists of three phases:

1. High GIMPLE form. In this phase, the parser generates the GENERIC representation of the code. To this end, each OpenMP directive and clause has a tree code defined in a special file (tree.def). This step also includes the determination of the data sharing attributes. For instance, consider the simple code in Listing 6, which computes a sum in parallel. The parallel for construct in line 9 indicates that the loop will be parallelized, and the reduction(+: sum) clause tells the compiler that all values of sum should be added at the end of the loop. For this purpose, a private copy is created and initialized on each thread. After the end of the region, the original variable (sum) is updated with the values of the private copies using the specified operator (+) [19]. Listing 7 shows the high GIMPLE form of the code. Since the parallel loop construct is just a shortcut for specifying a parallel construct containing a single loop construct [19], it is expanded to a loop nested in a parallel region.

The data sharing attribute private, which indicates that this variable is private to a thread, is added automatically (see line 11).

2. Low GIMPLE form. In the next phase, the code is linearized and special markers are inserted to identify the end of each parallel worksharing region. At the end of this phase, a data structure containing the shared data is created.

3. Final GIMPLE form. This phase consists in outlining the body of an omp parallel pragma into a separate function. Additionally, calls to libgomp thread creation routines are inserted in order to run the outlined functions. This step also contains the task partitioning, i.e. local loop bounds based on thread identifiers are computed. Listing 8 shows the final GIMPLE form. Now the loop body has been moved into a separate function (main.omp_fn.0, line 19), and thread creation (line 11) and termination (line 13) routines are also inserted into the code. This function takes an argument of type struct .omp_data_s.0 *, which contains the addresses of all shared variables contained in the parallel construct (in this example: sum).

1  #include "stdio.h"
2  #define N 100
3  #define INC 2
4
5  int main ()
6  {
7      int i;
8      int sum = 0;
9      #pragma omp parallel for reduction(+: sum)
10     for (i = 1; i <= N; i++)
11     {
12         sum += INC;
13     }
14     return sum;
15 }

Listing 6: computing the sum of 2 over i = 1, ..., 100 in parallel

1  int main () ()
2  {
3    int D.2562;
4    {
5      int i;
6      int sum;
7      sum = 0;
8      #pragma omp parallel reduction(+: sum)
9      {
10       {
11         #pragma omp for nowait private(i)
12         for (i = 1; i <= 100; i = i + 1)
13         {
14           sum = sum + 2;
15         }
16       }
17     }
18     D.2562 = sum;
19     return D.2562;
20   }
21   D.2562 = 0;
22   return D.2562;
23 }

Listing 7: High GIMPLE form

1  int main () ()
2  {
3    int sum;
4    int i;
5    int D.2562;
6    struct .omp_data_s.0 .omp_data_o.2;
7
8    <bb 2>:
9    sum = 0;
10   .omp_data_o.2.sum = sum;
11   builtin_gomp_parallel_start (main.omp_fn.0, &.omp_data_o.2, 0);
12   main.omp_fn.0 (&.omp_data_o.2);
13   builtin_gomp_parallel_end ();
14   sum = .omp_data_o.2.sum;
15   D.2562 = sum;
16   return D.2562;
17 }
18
19 void main.omp_fn.0 (void *) (.omp_data_i)
20 {
21   ...
22   <bb 2>:
23   sum = 0;
24   D.2578 = builtin_omp_get_num_threads ();
25   D.2579 = builtin_omp_get_thread_num ();
26   D.2580 = 100 / D.2578;
27   ...
28   D.2586 = MIN_EXPR <D.2585, 100>;
29   if (D.2584 >= D.2586) goto <L3>; else goto <L1>;
30   <L3>:;
31   sum.1 = (unsigned int) sum;
32   D.2572 = &.omp_data_i->sum;
33   sync_fetch_and_add_4 (D.2572, sum.1);
34   return;
35   <L1>:;
36   D.2587 = D.2584 * 1;
37   i = D.2587 + 1;
38   D.2588 = D.2586 * 1;
39   D.2589 = D.2588 + 1;
40   <L2>:;
41   sum = sum + 2;
42   i = i + 1;
43   D.2590 = i < D.2589;
44   if (D.2590) goto <L2>; else goto <L3>;
45 }

Listing 8: Final GIMPLE form
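The bound computation in lines 24-29 of Listing 8 is hard to read in GIMPLE form. The following hand-written C sketch (my simplification, not GOMP output) shows the idea behind it: every thread derives its own chunk of the 100 iterations from its thread number and then adds its partial result atomically to the shared variable, just like the sync_fetch_and_add_4 call in Listing 8.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0;

        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();
            int tid      = omp_get_thread_num();

            /* block partitioning of the 100 iterations, in the spirit of the
               MIN_EXPR-bounded chunk computed in Listing 8 */
            int chunk = (100 + nthreads - 1) / nthreads;
            int lo    = tid * chunk;
            int hi    = (lo + chunk < 100) ? lo + chunk : 100;

            int local = 0;
            for (int i = lo; i < hi; i++)
                local += 2;

            #pragma omp atomic          /* combine the partial results */
            sum += local;
        }

        printf("sum = %d\n", sum);      /* 200, the value Listing 6 returns */
        return 0;
    }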

Figure 12: GOMP transformations.

4.4 Using GOMP to transform OpenMP Programs into MPI Programs

This Section describes how GOMP can be used to generate MPI programs. An important aspect is that automatic MPI code generation requires data dependency analyses in order to generate communications. Since the MPI transformation does not require any new syntactic constructs, the compiler front-end can be used without any modification. Additionally, the automatic determination of the data sharing attributes, the code linearization, the marker insertion and the outlining of the body of a parallel construct are compatible and can be reused. However, the semantics of the parallel directive is very different in the message-passing context of MPI, i.e. there will be no task creation at runtime, since all tasks are started at the beginning of the MPI application. The parallel directive guarantees that there is no data dependency between two parallel constructs. During task partitioning, GOMP computes local loop bounds depending on thread identifiers, while MPI provides process identifiers (i.e. ranks). Hence, the GOMP thread creation code must be replaced by MPI routines for initialization, communication and finalization. Initialization and finalization routines must be added at the beginning and at the end of the main program, and communication routines must be inserted after the for worksharing construct. Figure 13 illustrates the modified transformation process in the GCC pipeline.

When generating communication, it is necessary to know which shared data is modified within a parallel worksharing construct. To this end, a list of the shared variables which are modifiable inside a parallel worksharing construct and later used in the program must be built. This list must contain the type and the size of each variable. The size can be determined easily for primitive types and statically allocated arrays. In contrast, the size of dynamically allocated arrays must be determined by analyzing the control-flow and data-flow graphs where possible. At the end of a parallel worksharing construct, all modified data will be updated on each MPI process by using MPI communications. If information about the accessed data elements is available, then only these regions are updated. I will go into that in more detail in Section 4.5.2.
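The overall shape of the transformation can be summarized with a hand-written sketch of my own (it is not output of the prototype, and it uses a collective MPI_Allreduce where the prototype described in Section 4.5 uses a centralized send-to-master/broadcast scheme): thread creation disappears, the loop bounds are derived from the MPI rank, and an update communication follows the worksharing loop.

    #include <stdio.h>
    #include <mpi.h>

    #define N   100
    #define INC 2

    int main(int argc, char **argv)
    {
        int sum = 0, global_sum = 0;
        int rank, size;

        MPI_Init(&argc, &argv);                  /* inserted at the start of main() */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* local loop bounds computed from the rank instead of the thread id */
        int chunk = (N + size - 1) / size;
        int lo = 1 + rank * chunk;
        int hi = (lo + chunk <= N + 1) ? lo + chunk : N + 1;

        for (int i = lo; i < hi; i++)            /* body of the former omp for loop */
            sum += INC;

        /* update communication inserted after the worksharing construct
           (a reduction here; modified array regions would be exchanged instead) */
        MPI_Allreduce(&sum, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %d\n", global_sum);    /* 200, as in Listing 6 */

        MPI_Finalize();                          /* inserted at the end of main()   */
        return 0;
    }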

Figure 13: GOMP and MPI transformations.

4.5 Current Implementation

The authors have developed a limited prototype that only supports transformations of for worksharing constructs. The communication only works for scalars and statically allocated arrays, and function calls inside a worksharing construct are not supported.

4.5.1 The Code Transformation Process

A parallel for loop is represented by the OMP_FOR GIMPLE tree code, which has five operands; they are listed in Table 1. The associated function for gimplification is c_gimplify_omp_for (file: gcc-4.2.0/gcc/gimplify.c). This function is called with a pointer to a tree structure which represents the body of the loop. When building the new MPI-suited loop structure, the initialization, stop condition and increment expression can be reused. The lower and upper bounds of the parallel loops have to be computed depending on the current MPI process identifier and the total number of MPI processes. To this end, the authors created two additional GIMPLE variables, lo.x and hi.x, and insert calls to the new functions step_get_loop_lo() and step_get_loop_hi() in front of the body of the parallel loop. The loop initializer (OMP_FOR_INIT) and the condition (OMP_FOR_COND) are then modified according to the new values. Finally, calls to MPI routines for communicating array variables that have been modified within the parallel loop body are inserted. The list of modified variables is built by collecting the variables that occur on the left hand sides of assignments. Note that this analysis only works for statically allocated arrays and does not follow function calls. When synchronizing data, a centralized communication scheme is implemented, where each MPI process sends its set of modified variables to the MPI master process (rank 0). Subsequently, the master synchronizes each variable and broadcasts the updated version to all other MPI processes.
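The paper does not show the bodies of step_get_loop_lo() and step_get_loop_hi(). A plausible block-distribution implementation, written purely as an assumption of mine about their semantics (inclusive bounds, as suggested by the region formula in Section 4.5.2), could look like this:

    #include <mpi.h>

    /* Illustrative assumption only, not the authors' code: distribute the
       iteration range [0, n-1] in contiguous blocks, one per MPI process,
       and return inclusive lower/upper bounds for the calling process. */
    void block_loop_bounds(int n, int *lo, int *hi)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = (n + size - 1) / size;        /* ceiling division        */
        *lo = rank * chunk;                       /* first iteration owned   */
        int last = *lo + chunk - 1;               /* last iteration owned    */
        *hi = (last < n - 1) ? last : n - 1;      /* clamp to the loop range */
    }

With the ten rows of the example in Section 4.5.2 and four processes, rank 1 would, for instance, obtain lo = 3 and hi = 5.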

Operand          Description
OMP_FOR_BODY     contains the loop body.
OMP_FOR_CLAUSES  list of clauses associated with the directive.
OMP_FOR_INIT     loop initialization code of the form VAR = N1.
OMP_FOR_COND     loop conditional expression of the form VAR {<,>,<=,>=} N2.
OMP_FOR_INCR     the loop index increment of the form VAR {+=,-=} INCR.

Table 1: OMP_FOR operands [20]

4.5.2 Communication Optimization

The authors also propose a first optimization strategy for the communications. As mentioned before, the MPI transformation process needs precise information about the modified array regions. The optimization strategy consists of determining the lowest and highest indices of the accessed array elements for each array dimension. This requires that array regions are accessed linearly within parallel loops, i.e. all array references are contiguous. In order to compute the lower and higher indices, each variable which appears in the index expression is replaced by its minimal and maximal value, respectively. For instance, consider Listing 9, which performs a parallel RGB to grayscale conversion. The array A is modified in line 16, and the corresponding index expression is j+i*W. The global iteration domain for i (ranging from 0 to H-1) is given by the outermost loop header (line 12). The iteration domain for j ranges from 0 to W-1 (line 14). Since the outermost loop will be parallelized, the actual local iteration domain is only known at runtime. In the current implementation, two variables (lo.x and hi.x) hold the minimum and maximum values of the parallel loop index. Now, array A is accessed in the interval [MIN(j)+lo.x*W, MAX(j)+hi.x*W], which becomes [lo.x*W, W-1+hi.x*W] after replacing MIN(j) with 0 and MAX(j) with W-1. At the end of the execution of the parallel loop, each process knows that the region of A which has to be sent ranges from A[lo.x*W] to A[W-1+hi.x*W]. Listing 10 shows the low GIMPLE representation of the code. Here, at the end of the parallel for loop (else-part, lines 27 et seq.), the region that has to be sent is calculated using the above formula (upper boundary: lines 27-28, lower boundary: lines 29-30). Afterwards, the data is sent to the MPI master process by calling the function step_sendregion(). At the end, the master synchronizes all variables and broadcasts the updated versions to all other MPI processes.

1  #define W 10
2  #define H 10
3
4  int main ()
5  {
6      int i, j;
7      unsigned char A[W*H];
8      unsigned char R[W*H], G[W*H], B[W*H];
9      i = 0;
10     j = 0;
11     #pragma omp parallel for
12     for (int i = 0; i < H; i++)
13     {
14         for (j = 0; j < W; j++)
15         {
16             A[j+i*W] = R[j+i*W] * 0.299 + \
17                        G[j+i*W] * 0.587 + \
18                        B[j+i*W] * 0.114;
19         }
20     }
21     return 0;
22 }

Listing 9: Parallel conversion of an RGB image to a grayscale image

1  int main () ()
2  { ...
3    {
4      { ...
5        step_get_loop_lo (&lo.0);
6        step_get_loop_hi (&hi.1);
7        i = lo.0;
8        goto <D.2237>;
9        <D.2235>:;
10       j = 0;
11       goto <D.2241>;
12       <D.2240>:;
13       D.2242 = 0*10;
14       D.2243 = D.2242 + j;
15       D.2244 = i*10;
         ... /* LoopBody */ ...
19       D.2267 = step_get_rank ();
20       if (D.2267 == 0)
21       {
22         step_recvmerge (A, 0, 99, MPI_TYPE_INT);
23         step_bcastregion (A, 0, 99, MPI_TYPE_INT);
24       }
25       else
26       {
27         D.2268 = hi.1*10;
28         D.2269 = D.2268 + 9;
29         D.2270 = lo.0*10;
30         D.2271 = D.2270 + 0;
31         step_sendregion (A, 0, D.2271, D.2269, MPI_TYPE_INT);
32         step_recvregion (A, 0, 99, MPI_TYPE_INT);
33       }
34     }
35   }
   ...
 }

Listing 10: Low GIMPLE form

5 Conclusions

Today, most systems in High Performance Computing are clusters of shared memory nodes, and therefore it makes sense to combine both worlds by placing MPI parallelism on top of OpenMP parallelism. However, parallelizing a sequential application in MPI requires a considerable effort. In order to avoid this disadvantage, we saw two approaches that try to abstract from directly programming MPI code.

First, we considered llc, a language in which parallelism is expressed using OpenMP combined with several llc-specific compiler directives. Compilation is done using a special compiler which translates llc code into hybrid OpenMP/MPI code. This code includes MPI communication handling and synchronization of data between the nodes. The OTOSP model describes how computational work can be distributed to a given set of processors. Using llc, programming parallel code for SMP clusters becomes much easier, especially for scientists and engineers. However, llc requires that all variables within a parallel for construct are private. The reason for this could be that manipulating shared variables would produce expensive communications between all SMP nodes. Hence, this approach is not applicable to programs that make use of shared variables.

The second approach in this paper considered an extension of the GOMP framework in the GCC compiler, which performs a translation of OpenMP code to pure MPI programs. This approach also abstracts from programming MPI code but, in contrast to llc, it does not require additional compiler pragmas. Nevertheless, it is not designed to produce hybrid code, and therefore it does not exploit parallelism within SMP nodes, and data must be explicitly updated via communications. The experiments in Section 3.4 have shown that hybrid programming is no silver bullet for achieving an optimal speedup. Since GCC is free software, modifications in GCC, e.g. the code transformations described here, can be directly used by everyone.


More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives

More information

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data Message Passing as a Programming Paradigm Gentle Introduction to MPI Point-to-point Communication Message Passing

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008 1 of 6 Lecture 7: March 4 CISC 879 Software Support for Multicore Architectures Spring 2008 Lecture 7: March 4, 2008 Lecturer: Lori Pollock Scribe: Navreet Virk Open MP Programming Topics covered 1. Introduction

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Shared Memory Programming OpenMP Fork-Join Model Compiler Directives / Run time library routines Compiling and

More information

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 Part 3 Parallel Programming Parallel Programming Concepts Amdahl s Law Parallel Programming Models Tools Compiler (Intel) Math Libraries (Intel)

More information

15-440: Recitation 8

15-440: Recitation 8 15-440: Recitation 8 School of Computer Science Carnegie Mellon University, Qatar Fall 2013 Date: Oct 31, 2013 I- Intended Learning Outcome (ILO): The ILO of this recitation is: Apply parallel programs

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

HPC Parallel Programing Multi-node Computation with MPI - I

HPC Parallel Programing Multi-node Computation with MPI - I HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright

More information

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 4 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism

More information

Holland Computing Center Kickstart MPI Intro

Holland Computing Center Kickstart MPI Intro Holland Computing Center Kickstart 2016 MPI Intro Message Passing Interface (MPI) MPI is a specification for message passing library that is standardized by MPI Forum Multiple vendor-specific implementations:

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

MPI Collective communication

MPI Collective communication MPI Collective communication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) MPI Collective communication Spring 2018 1 / 43 Outline 1 MPI Collective communication

More information

Shared Memory programming paradigm: openmp

Shared Memory programming paradigm: openmp IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

MPI Message Passing Interface

MPI Message Passing Interface MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information

More information

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization

More information

OpenMP 2. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 2. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

OpenMPand the PGAS Model. CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen

OpenMPand the PGAS Model. CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen OpenMPand the PGAS Model CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen LastTime: Message Passing Natural model for distributed-memory systems Remote ( far ) memory must be retrieved before use Programmer

More information

Message Passing Interface

Message Passing Interface MPSoC Architectures MPI Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr Message Passing Interface API for distributed-memory programming parallel code that runs across

More information

Session 4: Parallel Programming with OpenMP

Session 4: Parallel Programming with OpenMP Session 4: Parallel Programming with OpenMP Xavier Martorell Barcelona Supercomputing Center Agenda Agenda 10:00-11:00 OpenMP fundamentals, parallel regions 11:00-11:30 Worksharing constructs 11:30-12:00

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 15 October 2015 Announcements Homework #3 and #4 Grades out soon Homework #5 will be posted

More information

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions.

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions. 1 of 10 3/3/2005 10:51 AM Linux Magazine March 2004 C++ Parallel Increase application performance without changing your source code. Parallel Processing Top manufacturer of multiprocessing video & imaging

More information

Parallel Programming

Parallel Programming Parallel Programming OpenMP Nils Moschüring PhD Student (LMU) Nils Moschüring PhD Student (LMU), OpenMP 1 1 Overview What is parallel software development Why do we need parallel computation? Problems

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

DPHPC: Introduction to OpenMP Recitation session

DPHPC: Introduction to OpenMP Recitation session SALVATORE DI GIROLAMO DPHPC: Introduction to OpenMP Recitation session Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf OpenMP An Introduction What is it? A set of compiler directives

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Introduction to OpenMP.

Introduction to OpenMP. Introduction to OpenMP www.openmp.org Motivation Parallelize the following code using threads: for (i=0; i

More information

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013 OpenMP António Abreu Instituto Politécnico de Setúbal 1 de Março de 2013 António Abreu (Instituto Politécnico de Setúbal) OpenMP 1 de Março de 2013 1 / 37 openmp what? It s an Application Program Interface

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives n Understanding how MPI programs execute n Familiarity with fundamental MPI functions

More information

CS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010.

CS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010. Parallel Programming Lecture 18: Introduction to Message Passing Mary Hall November 2, 2010 Final Project Purpose: - A chance to dig in deeper into a parallel programming model and explore concepts. -

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming Outline OpenMP Shared-memory model Parallel for loops Declaring private variables Critical sections Reductions

More information

CS691/SC791: Parallel & Distributed Computing

CS691/SC791: Parallel & Distributed Computing CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.

More information

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1 Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/

More information

Advanced Message-Passing Interface (MPI)

Advanced Message-Passing Interface (MPI) Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP Shared Memory Programming with OpenMP (An UHeM Training) Süha Tuna Informatics Institute, Istanbul Technical University February 12th, 2016 2 Outline - I Shared Memory Systems Threaded Programming Model

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

Chapter 3. Distributed Memory Programming with MPI

Chapter 3. Distributed Memory Programming with MPI An Introduction to Parallel Programming Peter Pacheco Chapter 3 Distributed Memory Programming with MPI 1 Roadmap n Writing your first MPI program. n Using the common MPI functions. n The Trapezoidal Rule

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*

More information

Distributed Memory Programming with Message-Passing

Distributed Memory Programming with Message-Passing Distributed Memory Programming with Message-Passing Pacheco s book Chapter 3 T. Yang, CS240A Part of slides from the text book and B. Gropp Outline An overview of MPI programming Six MPI functions and

More information

Multithreading in C with OpenMP

Multithreading in C with OpenMP Multithreading in C with OpenMP ICS432 - Spring 2017 Concurrent and High-Performance Programming Henri Casanova (henric@hawaii.edu) Pthreads are good and bad! Multi-threaded programming in C with Pthreads

More information

Hybrid MPI and OpenMP Parallel Programming

Hybrid MPI and OpenMP Parallel Programming Hybrid MPI and OpenMP Parallel Programming Jemmy Hu SHARCNET HPTC Consultant July 8, 2015 Objectives difference between message passing and shared memory models (MPI, OpenMP) why or why not hybrid? a common

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads

More information

Shared Memory Programming Model

Shared Memory Programming Model Shared Memory Programming Model Ahmed El-Mahdy and Waleed Lotfy What is a shared memory system? Activity! Consider the board as a shared memory Consider a sheet of paper in front of you as a local cache

More information

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2 Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS Teacher: Jan Kwiatkowski, Office 201/15, D-2 COMMUNICATION For questions, email to jan.kwiatkowski@pwr.edu.pl with 'Subject=your name.

More information

EPL372 Lab Exercise 5: Introduction to OpenMP

EPL372 Lab Exercise 5: Introduction to OpenMP EPL372 Lab Exercise 5: Introduction to OpenMP References: https://computing.llnl.gov/tutorials/openmp/ http://openmp.org/wp/openmp-specifications/ http://openmp.org/mp-documents/openmp-4.0-c.pdf http://openmp.org/mp-documents/openmp4.0.0.examples.pdf

More information

Molecular Dynamics. Dim=3, parts=8192, steps=10. crayc (Cray T3E) Processors

Molecular Dynamics. Dim=3, parts=8192, steps=10. crayc (Cray T3E) Processors The llc language and its implementation Antonio J. Dorta, Jose Rodr guez, Casiano Rodr guez and Francisco de Sande Dpto. Estad stica, I.O. y Computación Universidad de La Laguna La Laguna, 38271, Spain

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 7, 2016 CPD (DEI / IST) Parallel and Distributed

More information

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

More information

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory

More information

An Introduction to OpenMP

An Introduction to OpenMP Dipartimento di Ingegneria Industriale e dell'informazione University of Pavia December 4, 2017 Recap Parallel machines are everywhere Many architectures, many programming model. Among them: multithreading.

More information

Shared memory parallel computing

Shared memory parallel computing Shared memory parallel computing OpenMP Sean Stijven Przemyslaw Klosiewicz Shared-mem. programming API for SMP machines Introduced in 1997 by the OpenMP Architecture Review Board! More high-level than

More information

CSE 160 Lecture 18. Message Passing

CSE 160 Lecture 18. Message Passing CSE 160 Lecture 18 Message Passing Question 4c % Serial Loop: for i = 1:n/3-1 x(2*i) = x(3*i); % Restructured for Parallelism (CORRECT) for i = 1:3:n/3-1 y(2*i) = y(3*i); for i = 2:3:n/3-1 y(2*i) = y(3*i);

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB)

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB) COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

DPHPC: Introduction to OpenMP Recitation session

DPHPC: Introduction to OpenMP Recitation session SALVATORE DI GIROLAMO DPHPC: Introduction to OpenMP Recitation session Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf OpenMP An Introduction What is it? A set

More information