Translating OpenMP Programs for Distributed-Memory Systems


Johannes Thull, Sept.

Abstract

Parallel systems that support a shared memory abstraction are becoming more and more important in the HPC market. In addition, clusters of these SMP nodes are built in order to further increase the speed of the entire system. These systems present a hierarchical hardware design where intra-node communication can take place implicitly through a shared memory, while inter-node communication must be carried out explicitly through a message passing interface. Currently, OpenMP and MPI are the de-facto programming tools for shared memory and distributed memory systems, respectively. However, OpenMP is not suited for distributed-memory architectures, while MPI is, but suffers from large communication overhead. It is therefore reasonable to combine both parallelization strategies by introducing a hybrid OpenMP/MPI model. This model closely maps to the architecture of an SMP cluster, where MPI provides communication mechanisms between nodes while OpenMP exploits parallelism within shared-memory nodes. Furthermore, MPI is deemed to be the assembler language of parallel programming, and parallelizing a sequential application in MPI requires a considerable effort. Therefore, it is necessary to provide the programmer with a tool that abstracts from the detailed and extensive programming issues of MPI. This term paper presents two approaches that abstract from directly programming MPI code by automatically generating MPI or even hybrid OpenMP/MPI code.

Contents

1 Introduction
2 MPI and OpenMP: A short Introduction
  2.1 Execution Model
  2.2 Data Handling
  2.3 MPI Example Code
  2.4 OpenMP Example Code
3 The Language llc
  3.1 The OTOSP Model
  3.2 Example
  3.3 Translating llc Programs
  3.4 Experimental Results
4 Automatic generation of MPI code from OpenMP programs in GCC
  4.1 OpenMP to MPI Translation
  4.2 GCC Compilation
  4.3 GOMP: OpenMP Support for GCC
  4.4 Using GOMP to transform OpenMP Programs into MPI Programs
  4.5 Current Implementation
5 Conclusions

1 Introduction

Most systems in High Performance Computing are clusters of shared memory nodes, ranging from small clusters of multicore-CPU PCs up to the largest systems like the Earth Simulator, which consists of over 640 SMP nodes [5]. When developing applications for such systems, programmers currently have the choice between different APIs. These include Intel TBB [21], Cilk [22], Pthreads [23], OpenMP, or MPI, just to mention a few. Here, we only consider the last two APIs: OpenMP and MPI.

MPI (Message Passing Interface) is a standard specification for a message passing interface between processes on a distributed-memory system, which provides the programmer with a Single-Program-Multiple-Data (SPMD) view of the computation. Here, the programmer has to manage all the communication using special library functions, which makes MPI, on the one hand, a powerful tool. On the other hand, decomposition, development and debugging of applications can be time consuming, and significant code changes are often required. Since communications have to be carried out through a messaging system, they can produce a large overhead, and large code granularity is required to reduce latency [5].

OpenMP is an industry standard for shared memory programming which provides the programmer with a set of compiler directives, library routines and environment variables. In contrast to MPI, communication is implicit and the actual parallelization is delegated to the compiler. Parallelism is expressed through compiler pragmas, which makes applications relatively easy to implement.

Both MPI and OpenMP have their advantages and disadvantages. The main drawback of OpenMP is its restriction to shared memory architectures, whereas MPI messages also work across system boundaries (e.g. via TCP/IP). On the other hand, MPI scales poorly on fine-grain problems, where MPI applications become communication dominated. To combine the best of both worlds, several approaches [5, 9, 10] to combine shared- and distributed-memory programming have been proposed. The idea behind this hybrid model is to place MPI parallelism on top of OpenMP parallelism. For example, consider Figure 1, which shows a 2D data array that has to be processed by an SMP cluster of four nodes. In a first step, the array is partitioned into four parts and distributed to four MPI processes, each running on a separate node. On a second level, each node runs an OpenMP code which accomplishes a further partitioning and consequently also a further level of parallelization. This model closely maps to the architecture of an SMP cluster: MPI provides communication mechanisms between nodes while OpenMP exploits parallelism within shared-memory nodes.

In this term paper, I will present two strategies to bring both parallel paradigms, i.e. shared memory and message passing, together. Both approaches have in common that they abstract from directly programming MPI code and hence make parallel programming much easier for less experienced programmers like scientists and engineers. The term paper is organized as follows: Section 2 gives a short introduction to the execution and data handling models of OpenMP and MPI. In addition, I give two example codes to illustrate the usage of OpenMP and MPI, respectively. Section 3 is based on the work of Ruymán Reyes et al. [1]. In this Section I will present llc, a language that expresses parallelism through OpenMP directives and automatically generates hybrid code. In this approach, MPI handles inter-node communications, while OpenMP is used inside each SMP node.
Section 4, which is based on the work of Abdellah-Medjadji et al. [13], describes how to use the OpenMP implementation in GCC to transform OpenMP code into MPI programs.

Figure 1: Mixed mode MPI+OpenMP programming (four SMP nodes; one MPI process per node, each running a team of OpenMP threads over the node's shared memory).

2 MPI and OpenMP: A short Introduction

MPI is based on a Single-Program-Multiple-Data (SPMD) execution model. In this model, all processes execute the same program executable. Instructions that will be executed by each process are determined using control structures that depend on a special process identifier (called rank). In contrast to MPI, OpenMP uses a fork/join execution model. In this model, each program begins execution as a single thread called the master thread. The master thread is executed sequentially until a parallel construct is reached.

2.1 Execution Model

In OpenMP, when a parallel construct is encountered, the master thread creates a team of slave threads. Each thread in the team executes the statements in the dynamic extent of a parallel region (except for the work-sharing constructs). When completing the parallel construct, all threads in the team synchronize at an implicit barrier, and only the master thread continues execution [14]. In terms of MPI programs, sequential parts (outside parallel regions) will thus be redundantly executed by all MPI processes.
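To make the difference between the two execution models concrete, the following minimal sketch (my own illustration, not taken from the sources; all names are mine) combines both: the printf guarded by the rank mimics the single master thread of OpenMP, while the unguarded call to fork_join_example() is executed redundantly by every MPI process, exactly as described above.

    #include <stdio.h>
    #include <omp.h>
    #include <mpi.h>

    /* OpenMP fork/join: a team of threads exists only inside the parallel region. */
    void fork_join_example(void)
    {
        #pragma omp parallel
        {
            printf("thread %d of the team\n", omp_get_thread_num());
        }   /* implicit barrier; only the master thread continues */
    }

    /* MPI SPMD: every process executes main() from the beginning. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)                 /* rank-based guard: only one process prints */
            printf("sequential part, executed once\n");

        fork_join_example();           /* not guarded: replicated on every process  */

        MPI_Finalize();
        return 0;
    }

Built with something like mpicc -fopenmp, this is already the hybrid MPI+OpenMP style shown in Figure 1.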

2.2 Data Handling

While data communication in MPI is carried out explicitly (e.g. via MPI_Send() and MPI_Recv()), OpenMP takes advantage of the ability to directly access shared memory. To this end, it is necessary to coordinate the access to shared variables by multiple threads in order to ensure correct execution. OpenMP allows two different kinds of variable sharing attributes inside a parallel construct: private and shared. Private variables allow each thread to have its own local copy and use it as a temporary variable. A private variable is uninitialized and its value will not be kept outside the parallel region. Shared variables are visible and accessible to all threads simultaneously. By default, variables inside a parallel construct are shared.

2.3 MPI Example Code

In this section, I will not give a complete reference for all commands and arguments of MPI. I would rather like to explain them with the aid of an example code. For this purpose, consider Listing 1, which shows the basic structure of an MPI program. The routine MPI_Init() (line 12) initializes the MPI environment and must be called before any other MPI routine. Usually, it gets the command line arguments of the main routine. MPI processes can be divided into groups. A process is represented by a (group, rank) pair. Each process in a group is associated with a unique integer rank. The routine MPI_Comm_size() (line 13) determines the size of the group (i.e. the number of processes) associated with a so-called communicator. Roughly speaking, a communicator is a data structure that contains information about the participants of a communication. MPI_COMM_WORLD is a predefined communicator which contains all created processes. It is initialized when the program is started. The command MPI_Comm_rank() (line 14) determines the rank of the calling process in the communicator. The if-statement in line 17 establishes a master-slave relationship between the processes. The process with rank 0 assumes the role of the master and sends different messages to the other processes (i.e. the slaves). Sending is done using the MPI_Send() routine, which takes a pointer to the send buffer, the size of the buffer, the datatype of each send buffer element, the rank of the destination process, a message tag and a communicator. A tag is an arbitrary non-negative integer to uniquely identify a message. A receiving process will only receive a message if the message tag matches the tag specified in MPI_Recv(). The else-part of the if-statement (line 25 et seq.) contains the code for all slave processes. The routine MPI_Recv() takes a pointer to a message buffer, its length, the datatype of each element, the rank of the source process (here we only consider messages from the master process), a message tag, a communicator and a pointer to a status object (which we do not need here). MPI_Finalize() terminates the MPI execution environment and must be called by all processes before exiting.

Compilation of an MPI program is done using mpicc [11], which is an Open MPI C wrapper compiler that transparently adds the relevant compiler and linker flags to an underlying C compiler. In order to run the MPI program, you have to use a command line tool, for instance mpiexec [12]. For illustration purposes, I will run four MPI processes on a local machine using the command mpiexec -n 4 mpi_test, where the command line parameter -n specifies the number of MPI processes. Figure 2 shows the output of the MPI program.

1  #include <stdio.h>
2  #include <stdlib.h>
3  #include <string.h>
4  #include <mpi.h>
5
6  int main (int argc, char **argv)
7  {
8      int numprocs, rank, tag, i;
9      char message[32];
10     tag = 1234;
11
12     MPI_Init (&argc, &argv);
13     MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
14     MPI_Comm_rank (MPI_COMM_WORLD, &rank);
15
16     printf ("Process %d out of %d started.\n", rank, numprocs);
17     if (rank == 0)
18     {
19         for (i = 1; i < numprocs; i++)
20         {
21             sprintf (message, "Hello, there node %d", i);
22             MPI_Send (message, sizeof (message), MPI_CHAR, i, tag, MPI_COMM_WORLD);
23         }
24     }
25     else
26     {
27         MPI_Recv (message, sizeof (message), MPI_CHAR, 0, tag, MPI_COMM_WORLD, NULL);
28         printf ("Node %d received \"%s\" from node 0\n", rank, message);
29     }
30
31     MPI_Finalize ();
32     return 0;
33 }

Listing 1: MPI Example Code

Process 1 out of 4 started.
Process 2 out of 4 started.
Process 0 out of 4 started.
Process 3 out of 4 started.
Node 3 received "Hello, there node 3" from node 0
Node 1 received "Hello, there node 1" from node 0
Node 2 received "Hello, there node 2" from node 0

Figure 2: Output of the MPI Example Code.

2.4 OpenMP Example Code

This section gives a basic example of an OpenMP program that computes a sum in parallel. First, consider the for-loop in Listing 2, which iterates from 0 to 3. In order to parallelize this loop with OpenMP, the programmer has to add an OpenMP pragma omp parallel for (line 10) that tells the compiler to distribute the loop to different threads. During execution, the loop is forked to a specific number of slave threads. The clause private (id, num) indicates that each thread gets its own private copy of the variables id and num, respectively. shared (sum) expresses that the variable sum is shared by all threads. To avoid race conditions, the pragma omp critical (line 13) indicates a critical section where only one thread at a time can have access to the variable sum.

Later, in Section 4.3, Listing 6 shows a more elegant way that uses a reduction to form the sum. The function omp_get_thread_num() returns the thread number within its team. omp_get_num_threads() determines the total number of threads currently in the team. When compiling OpenMP programs with gcc, you have to use the option -fopenmp. The number of threads that will be created for parallelization is determined automatically, depending on the number of threads the processor is capable of running in parallel. This number can be changed either by setting the environment variable OMP_NUM_THREADS=n or in the source code using the function omp_set_num_threads(). Figure 3 shows the output of the OpenMP program on a dual-core machine.

1  #include <omp.h>
2  #include <stdio.h>
3  #include <stdlib.h>
4
5  int main (void)
6  {
7      int num, id, sum, i;
8
9      sum = 0;
10     #pragma omp parallel for private(id, num) shared(sum)
11     for (i = 0; i < 4; i++)
12     {
13         #pragma omp critical
14         {
15             sum++;
16         }
17
18         id = omp_get_thread_num();
19         num = omp_get_num_threads();
20
21         printf("Thread %d out of %d computes iteration %d\n", id, num, i);
22     }
23     printf("sum: %d\n", sum);
24
25     return 0;
26 }

Listing 2: OpenMP Example Code

Thread 0 out of 2 computes iteration 0
Thread 0 out of 2 computes iteration 1
Thread 1 out of 2 computes iteration 2
Thread 1 out of 2 computes iteration 3
sum: 4

Figure 3: Output of the OpenMP Example Code.
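For comparison, the reduction-based variant mentioned above (and used later in Listing 6) avoids the critical section entirely: each thread accumulates into a private copy of sum, and the copies are combined when the loop ends. This is a minimal sketch of my own, not part of the original text.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int i, sum = 0;

        /* each thread gets a private sum initialized to 0; the private copies
           are added into the shared sum at the end of the parallel loop */
        #pragma omp parallel for reduction(+: sum)
        for (i = 0; i < 4; i++)
        {
            sum++;
        }

        printf("sum: %d\n", sum);   /* prints 4, as in Figure 3 */
        return 0;
    }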

3 The Language llc

llc (read: La Laguna C, named after the La Laguna University in Tenerife, where it was developed) is a high-level parallel language that provides both OpenMP and MPI programming through a set of compiler directives. This means that it follows the simplicity of OpenMP and abstracts from the low-level aspects of MPI. Like in OpenMP, all parallelism is expressed through compiler pragmas in the code. Wherever possible, llc pragmas are compatible with those existing in OpenMP [4]. The programmer starts with a sequential code and incrementally parallelizes it by adding compiler directives. Compilation is done in a source-to-source fashion using a special compiler called llcomp, which translates the annotated code into hybrid OpenMP/MPI code. The generated code includes MPI communication handling and synchronization of data between the nodes.

The following Sections are organized as follows: Section 3.1 describes the computational model behind llc and explains how computational work is distributed in principle. Section 3.2 gives a short example of how to annotate C code with llc pragmas and describes the generated C code. In Section 3.3, I will explain the actual translation process. In Section 3.4, I will present and discuss the experimental results the authors got when running their parallel code on two parallel systems.

3.1 The OTOSP Model

The OTOSP (One Thread is one Set of Processors) model [1, 2, 3, 4] is the underlying computational model of the llc language. It describes how computational work can be distributed to a given set of processors. An OTOSP machine consists of an infinite number of processors, where each processor has its own private memory. Communication between processors takes place through a network interface. In this model, all processors are organized in sets, and in each set, all processors have the same memory state, execute the same program and have the same input data.

Consider the example algorithm in Listing 3, which has two nested loops. At the beginning, the initial set consists of all (infinitely many) processors, which execute the same sequential thread (see Figure 4). When program execution reaches a parallel construct (in this example a parallel for), the set of processors is partitioned into subsets such that each subset executes a single loop iteration. One set executes the loop body for i = 1, the second set for i = 2, and the third set for i = 3. Each processor in a set replicates the computation of the same task, which allows us to (potentially) add a further parallelization level. In order to keep memory coherent, each processor in a subset communicates its results to the processors in the complementary subsets. This requires that the programmer tells the compiler which data will be changed inside each loop iteration. This is done by using the result directive (see line 2), which takes a list of pairs (pointer expression, size) determining the memory range that is modified during iteration i. The meaning of the source code was not described in detail, but one can see that data is changed in line 9 by the outer loop. It seems that the array si holds the number of (consecutive) bytes that were changed by I_function(), and the result is stored in the array ri at position i. For instance, when processor 0 finishes the execution of iteration i = 1, it sends si[1] bytes starting at the address pointed to by ri+1 to processors 1 and 2 (see the red communication line in the 2nd row of Figure 4). In addition, processor 0 receives the results of its partners, i.e. it receives si[2] bytes from processor 1 and stores them in its local memory starting at the address pointed to by ri+2, and it receives si[3] bytes from processor 2 and stores them at the address pointed to by ri+3.

In general, let P = {p_0, p_1, ...} be an infinite set of processors. When partitioning this set into n subsets P_0 = {p_0, p_n, p_2n, ...}, P_1 = {p_1, p_(n+1), p_(2n+1), ...}, ..., P_(n-1) = {p_(n-1), p_(2n-1), p_(3n-1), ...}, processor p_(i*n) exchanges data with the processors p_(i*n+j), for all i in N_0 and 1 <= j < n. Of course, this can be done recursively, as we will see next.

Lines 5-8 contain another loop, which is executed inside each iteration i of the outer loop. To this end, the processor sets are partitioned again, according to the number of iterations of the inner loop (see the 3rd row in Figure 4). For instance, the set that executes the outer loop iteration i = 1 yields a partition of the corresponding processor set into two new subsets, one that executes the (inner) loop iteration i = 1, j = 0 and one that executes the loop iteration i = 1, j = 1. Here, too, the results are communicated according to the partnership relation between the processors. This theoretical model abstracts from the fact that under real conditions only a finite number of processors is available. Nevertheless, it describes in an easy way how the distribution of a certain task to different processors takes place.

1  #pragma omp parallel for
2  #pragma llc result(ri + i, si[i]);
3  for (i = 1; i <= 3; i++)
4  {
5      #pragma omp parallel for
6      #pragma llc result(rj + j, sj[j]);
7      for (j = 0; j <= i; j++)
8          rj[j] = compute_rj(i, j, &sj[j]);
9      ri[i] = I_function(i, &si[i]);
10 }

Listing 3: Two nested parallel loops

Figure 4: Distribution of nested loop iterations among an infinite number of processors (first the sets for i = 1, 2, 3, then their subsets for the inner iterations j).
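The partnership relation can be made concrete with a small sketch of my own (it is not llc code): for a partition into n subsets, the n consecutive processors p_(i*n), ..., p_(i*n+n-1) form one partner block and exchange their results pairwise.

    #include <stdio.h>

    /* OTOSP partnership for one partition level: with n subsets, processor p_g
       belongs to subset P_(g % n) and exchanges results with the other members
       of its block of n consecutive processors. */
    static void print_partners(int g, int n)
    {
        int block  = g / n;    /* index i of the block p_(i*n) .. p_(i*n+n-1) */
        int subset = g % n;    /* subset P_j this processor belongs to        */

        printf("p_%d (subset P_%d) exchanges with:", g, subset);
        for (int j = 0; j < n; j++)
            if (j != subset)
                printf(" p_%d", block * n + j);
        printf("\n");
    }

    int main(void)
    {
        for (int g = 0; g < 6; g++)   /* first partition level of the example: n = 3 */
            print_partners(g, 3);
        return 0;
    }

For processor p_0 this prints the partners p_1 and p_2, matching the exchange for iteration i = 1 described above.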

3.2 Example

In this Section, I will give a basic example of how the llc language looks in practice and subsequently explain the generated hybrid code.

3.2.1 Computing π

In order to illustrate the main features of llc, Listing 4 shows a simple algorithm augmented with OpenMP and llc directives. The algorithm calculates an approximation of π and is based on the fact that π equals the integral from 0 to 1 of 4/(1+x^2) dx, which can be approximated by the sum, for i = 0, ..., N-1, of 4/(N(1+((i+0.5)/N)^2)).

1  w = 1.0 / N;
2  pi = 0.0;
3
4  #pragma omp parallel for private(t) reduction(+: pi)
5  #pragma llc reduction_type(double)
6  for (i = 0; i < N; i++)
7  {
8      t = (i + 0.5) * w;
9      pi = pi + 4.0 / (1.0 + t * t);
10 }
11 pi *= w;

Listing 4: parallel approximation of π

Let us first consider the OpenMP pragma, which consists of the following clauses: the parallel for construct indicates that the loop will be distributed among different threads; private(t) specifies that each thread has its own variable t. This clause is kept only for compatibility reasons, since all storage is private in the OTOSP model. The reduction(+: pi) clause tells the compiler that all values of pi should be added at the end of the loop. The llc compiler does not support type analysis, and therefore the type of the reduction variable has to be specified with the special llc pragma reduction_type (see line 5).

3.2.2 The generated C code

Listing 5 shows some parts of the hybrid code that was generated by the compiler. Since no explanation of the code was given by the authors, I can only give a rough explanation of the generated hybrid code. The original for-loop in Listing 4 ranges from 0 to N. In order to distribute the loop among several processors, each processor gets an index which is stored in llc_grp_save. The array llc_f holds the starting points for the different loops, and the function LLC_PROCSGRP returns the corresponding number of iterations. For example, suppose a loop ranging from 0 to 99 has to be distributed to 4 processors: the array llc_f should hold the values [0, 25, 50, 75] and LLC_PROCSGRP should return the value 25. Now, processor P_0 (llc_grp_save = 0) computes the loop for(i=0; i<25; i++), processor P_1 (llc_grp_save = 1) computes the loop for(i=25; i<50; i++), etc. In addition, each loop is parallelized with OpenMP (see line 1). At the end, the processors have to communicate their results. Each processor has a name (comparable to the rank in MPI) which is stored in LLC_NAME. The master node (LLC_NAME = 0) collects all the results by iterating over each node (line 12). To this end, the MPI function MPI_Recv is called to receive the data (i.e. the value pi of each slave node). The reduction is performed in line 16, where all values of pi are added.

The pointer llc_buf_ptr is used to access the received data. In line 21, the pointer is moved sizeof(double) bytes ahead in order to access the next received value. This statement is needless in this example, since there is only one value (pi) to receive. In lines 19-22, all reduced values are broadcast to all members of the group. The slave nodes (line 26 et seq.) copy their calculated values into a buffer and send it to the master node.

1  #pragma omp parallel for private(t) reduction(+: pi)
2  for (i = (0) + llc_f[llc_grp_save]; i < (0) + llc_f[llc_grp_save] + LLC_PROCSGRP(llc_grp_save); i++)
3  {
4      {
5          t = (i + 0.5) * w;
6          pi = pi + 4.0 / (1.0 + t * t);
7      };
8  }
9
10 if (LLC_NAME == 0)
11 {
12     for (llc_i = 1; llc_i < LLC_NUMPROCESSORS; llc_i++)
13     {
14         MPI_Recv(llc_buf, llc_buf_size, MPI_BYTE, llc_i, LLC_TAG_REDUCE_DATA, *llc_currentgroup, &llc_status);
15         llc_buf_ptr = llc_buf;
16         pi += (*(double *) llc_buf_ptr);
17         llc_buf_ptr += sizeof(double);
18     }
19     llc_buf_ptr = llc_buf;
20     memcpy(llc_buf_ptr, &pi, sizeof(double));
21     llc_buf_ptr += sizeof(double);
22     MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_currentgroup);
23 }
24 else
25 {
26     llc_buf_ptr = llc_buf;
27     memcpy(llc_buf_ptr, &pi, sizeof(double));
28     llc_buf_ptr += sizeof(double);
29     MPI_Send(llc_buf, llc_buf_size, MPI_BYTE, 0, LLC_TAG_REDUCE_DATA, *llc_currentgroup);
30     MPI_Bcast(llc_buf, llc_buf_size, MPI_BYTE, 0, *llc_currentgroup);
31     llc_buf_ptr = llc_buf;
32     memcpy(&pi, llc_buf_ptr, sizeof(double));
33     llc_buf_ptr += sizeof(double);
34 }

Listing 5: The llc translation of the π approximation in Listing 4

3.3 Translating llc Programs

When observing different parallel programs, you will find several similarities between the implementations. This is due to the fact that every parallel program has to carry out certain operations like initialization, execution, communication and finalization. These code patterns can be stored in separate files serving as skeletons for further parallel implementations. Such files are independent of the compiler and can be modified without changing the compiler source code; they contain special tags that have to be replaced appropriately by the compiler. llcomp takes advantage of this feature and uses two kinds of patterns when translating llc parallel constructs:

Static Patterns. Static patterns contain code that is needed for creating processor groups and other initial operations, resource distribution, data communication, load balancing, etc. These codes are written in the target language (i.e. C with MPI statements) and are fixed at compile time, since they only depend on the parallelization paradigm that was defined using llc or OpenMP directives. With the information contained in these directives, the compiler decides which code skeleton it uses and completes the code according to the information given in the directives. As mentioned before, each pattern is stored in a file and contains special tags that have to be completed during translation. Additionally, each pattern is divided into different stages like initialization, execution, communication and finalization. Here again, each stage is stored in a separate file. This rich library of patterns also contains optimized code for common situations. If one of these situations is detected by the compiler, it uses the optimized code in order to increase the performance.

Dynamic Patterns. Since static patterns only contain code that handles the sending and receiving of buffers (i.e. streams of bytes without datatypes), additional code is needed when operating on real data. Therefore, specific code for the allocation and management of these buffers has to be inserted at specially marked places within the static code. In order to produce this code, llcomp uses dynamic patterns, which are generated at compile time and stored in temporary files. In the next step, the content of these files is inserted at the specially marked places within the static pattern code. For further optimization, the transferred data is compressed in order to reduce the communication overhead between MPI nodes.

Because of the separation between the two types of patterns, static patterns can be changed without consideration of the data. This has the advantage that new data management strategies can be introduced without changing static code patterns. For the distribution between MPI and OpenMP, llc uses one MPI process per node; the loops inside the MPI process are parallelized with OpenMP. According to [5], this programming scheme is called hybrid masteronly, where MPI is called only outside OpenMP parallel regions, i.e. by the master thread. This idea is also reflected in the generated code in Listing 5. Lines 1-8 contain the code that will run on an SMP node parallelized with OpenMP. Communications (i.e. MPI calls) are done outside of this parallel region (lines 10 et seq.). This approach takes advantage of the shared memory inside the nodes and reduces the communication overhead between the nodes.

3.4 Experimental Results

In order to evaluate the performance of the llc translation, the authors used four algorithms on two different multicore systems. The algorithms are the π-approximation introduced in Section 3.2.1, the Velvet algorithm for Molecular Dynamics simulation, a Mandelbrot Set computation, and the Conjugate Gradient algorithm, which is also part of the NAS Parallel Benchmarks. (The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers; derived from computational fluid dynamics (CFD) applications, they consist of five kernels and three pseudo-applications [6].) The four algorithms were evaluated on the following two systems:

Tajinaste, an IBM cluster with 14 dual-core Opteron processors interconnected by a Gigabit Ethernet network, and Verode, a two quad-core Opteron system. Both systems ran Linux CentOS 5.3 with kernel 2.6, and the codes were compiled using gcc 4.1 and OpenMPI.

In the first experiment, the authors implemented four versions of the Mandelbrot Set algorithm:

1. LLC-MPI: pure MPI code generated with llcomp.
2. LLC-HYB: hybrid OpenMP/MPI code generated with llcomp.
3. MPI: a pure MPI implementation.
4. HYB: an ad-hoc hybrid implementation.

Figure 5 shows the speedups achieved with the four different implementations on the Tajinaste system (speedup is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with multiple processors: Speedup = T_serial / T_parallel [8]). For the Mandelbrot Set computation, the code generated by llcomp (LLC-MPI and LLC-HYB) yields almost the same speedup as the hand-written code, but with much less coding effort.

Figure 5: Mandelbrot Set Computation on Tajinaste (speedup of the four implementations over the number of processors).

In the subsequent experiments, the authors merely compare the two llc implementations (LLC-MPI and LLC-HYB) on the remaining algorithms. Figure 6 shows the results for the Molecular Dynamics simulation and the π-approximation on the Tajinaste system. Here, no significant differences could be observed. The same holds for the experiments on the Verode system (see Figure 7). One reason for this could be the fact that there is only little data that has to be transferred between the processes. For example, consider the π-approximation algorithm (Listing 4): there is only one reduction variable (pi) that has to be communicated at the end of each (partial) loop, and therefore it takes little advantage of the shared memory, since there is only little communication. This example shows that pure MPI implementations do not always suffer from too much inter-node traffic. In contrast, the Conjugate Gradient algorithm benefits from the hybrid approach (see Figure 7c). The reason for this improvement is the granularity of the algorithm: it requires lots of matrix and vector manipulations, which are carried out by several parallelized loops. Because of the fine-grain parallelism (i.e. many parallel loops with few iterations), this algorithm takes advantage of the shared memory system, since more communications can be carried out within an SMP node.

Unfortunately, the authors did not run the Conjugate Gradient algorithm on the Tajinaste system, and thus there is no comparison between the two systems for it. The experiments have shown that hybrid programming (and especially the automated generation of hybrid code) is no silver bullet for achieving an optimal speedup. It rather depends on different factors like the granularity of the parallelization and the amount of data that has to be communicated. Nevertheless, llc seems to be a promising language that also gives less experienced programmers the ability to parallelize sequential algorithms easily and with little programming effort.

Figure 6: Speedup in Tajinaste for LLC-MPI and LLC-HYB; (a) Molecular Dynamics algorithm, (b) π-approximation.

Figure 7: Speedup in Verode for LLC-MPI and LLC-HYB; (a) Molecular Dynamics algorithm, (b) π-approximation, (c) Conjugate Gradient algorithm.

4 Automatic generation of MPI code from OpenMP programs in GCC

In this Section I present an approach to automatically transform OpenMP code into MPI programs by using the GCC compiler. GCC is free software maintained by the Free Software Foundation, and hence the authors decided to add new passes to GCC. The following Sections are organized as follows: Section 4.1 describes the OpenMP to MPI translation process. Section 4.2 gives a rough description of the compilation process of GCC, and afterwards, in Section 4.3, I will introduce GOMP, an implementation of the OpenMP standard for GCC. In Section 4.4 I will describe how GOMP is used to transform OpenMP code into MPI programs.

4.1 OpenMP to MPI Translation

When translating OpenMP programs into MPI programs, both the execution model and the memory model (see Section 2) must be adapted. OpenMP distinguishes between shared and private variables. In the SPMD model of MPI, all variables are private because they live in the address spaces of the different MPI processes. Since the programmer guarantees that statements inside an OpenMP parallel region can be executed independently, this region can also be executed in parallel by several MPI processes. Hence, declaring a variable as shared can have different meanings in a distributed memory context:

1. The variable is read-only in each task. Therefore, it may be privatized and initialized (with the same value).

2. An array variable is modified by several tasks, but each task changes only distinct elements. In this case, single elements may be privatized. When the parallel construct finishes, all tasks have to accomplish a global update (a sketch of this case is given below).

3. The variable is read and modified by each task. Because of the relaxed-consistency memory model of OpenMP, the user has to provide synchronization directives (e.g. flush, atomic or critical) in the code. In this case, the variable may be privatized, and each synchronization causes update communications within the parallel construct.
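Case 2 is the situation the prototype in Section 4.5 targets. As a hand-written sketch of that pattern (my own code, not generated output; for brevity it uses a collective MPI_Allgather, whereas the prototype described later uses a centralized send-to-master/broadcast scheme):

    #include <mpi.h>

    #define N 1000   /* assumed, for simplicity, to be a multiple of the process count */

    /* Case 2: the array is logically shared, but every MPI process holds a private
       copy. Each process writes only its own block, and a global update at the end
       of the "parallel region" makes all copies identical again. */
    void update_shared_array(double a[N])
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;                   /* block owned by this process */
        int lo = rank * chunk;

        for (int i = lo; i < lo + chunk; i++)   /* distinct elements per task  */
            a[i] = 2.0 * i;

        /* global update: every process receives every block, in place */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);
    }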

4.2 GCC Compilation

Figure 8: GCC intermediate representations.

Roughly speaking, the compilation process of GCC (see Figure 8) can be split into four phases. The first phase, which is called parsing, is specific to each language and yields an intermediate representation of the source code called GENERIC. The GENERIC representation is a tree structure where each language construct (e.g. loops, conditionals, declarations) has a specific tree representation. Each node of the tree has a certain tree code that defines the type of the tree. Depending on the tree type, there exists a different number of operands. For example, an assignment expression has two operands which correspond to the left and right hand sides of the expression [15]. Figure 9 shows a simplified graphical representation of an assignment expression. At this level, some language specific constructs may still exist.

Figure 9: Simplified tree representation of the expression a = 123; (a modify_expr node whose operands are the var_decl for a, of type int, and the integer_cst 123).

The next phase is called gimplification, which builds a further representation of the code. GIMPLE uses the same tree data structure as GENERIC, but in a language independent fashion. Therefore, all expressions are converted into a three-address representation by breaking down GENERIC expressions into tuples of no more than 3 operands (with some exceptions like function calls). Additionally, all the control structures used in GENERIC are lowered into conditional jumps [16]. A dump of the GIMPLE form in a C-like representation can be requested with the gcc flag -fdump-tree-gimple.

At a later stage, GIMPLE trees are converted into Static Single Assignment (SSA) trees, a lower representation used for low-level optimization. When a variable is assigned multiple times, new versions of that variable are created in the SSA tree. Different versions of the same variable are distinguished by a subscript determining the version number. Variables used on the right hand side of expressions are renamed so that their version number matches that of the most recent assignment [16]. A dump of the SSA related information can be requested with the gcc flag -fdump-tree-ssa. Figure 10 shows a variable (a) that is assigned multiple times; the SSA representation contains different versions of that variable, and at the end, variable b is assigned the most recent version of a. All in all, GCC performs more than 20 different optimizations on SSA trees [15].

Finally, after some optimizations, the SSA representation is translated to the Register Transfer Language (RTL), which represents an abstract machine with an infinite number of registers.
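To illustrate the gimplification step described above, the following tiny function is annotated with a hand-written approximation of its three-address form (my own sketch; the exact output of -fdump-tree-gimple differs between GCC versions):

    /* C source, with a hand-written approximation of its GIMPLE form in comments. */
    int f(int b, int c, int d)
    {
        int a = b + c * d;   /* lowered to:  D.1 = c * d;  a = b + D.1;          */
        if (a > 10)          /* lowered to:  if (a > 10) goto L1; else goto L2;  */
            a = 0;           /* L1:          a = 0;                              */
        return a;            /* L2:          return a;                           */
    }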

    a = 1;              a_1 = 1;
    a = 2;    SSA       a_2 = 2;
    a = 3;   ----->     a_3 = 3;
    b = a;              b_4 = a_3;

Figure 10: Source code and its SSA representation.

RTL can be viewed as an abstract representation of the target assembly language, i.e. the concrete assembly syntax has been discarded and only the semantics captured [17]. A dump of the RTL related information can be requested with the gcc flag -fdump-rtl-expand. Figure 11 shows a simplified RTL representation of the expression b = a + 2. In a final step, the back-end generates the assembly code for the specified target platform.

    b = a + 2;   RTL    (set (reg:si 62)
                 ----->      (plus:si (reg/v:si 60 [ a ])
                                      (const_int 2 [0x2])))

Figure 11: Simplified RTL representation of the expression b = a + 2;

4.3 GOMP: OpenMP Support for GCC

GOMP [18] is the OpenMP implementation of GCC, and currently it supports the OpenMP 3.0 specification. Nevertheless, this paper focuses on the OpenMP 2.5 specification. The code generation strategy of GOMP moves the body of parallel regions into separate functions, which are passed as arguments to the libgomp thread creation routines. Data sharing is implemented by passing the address of a local structure (which will be described later). The entire transformation (see Figure 12) is done on the GIMPLE representation of the code and mainly consists of three phases:

1. High GIMPLE form. In this phase, the parser generates the GENERIC representation of the code. To this end, each OpenMP directive and clause has a tree code defined in a special file (tree.def). This step also includes the determination of the data sharing attributes. For instance, consider the simple code in Listing 6, which computes a sum in parallel. The parallel for construct in line 9 indicates that the loop will be parallelized, and the reduction(+: sum) clause tells the compiler that all values of sum should be added at the end of the loop. For this purpose, a private copy is created and initialized on each thread. After the end of the region, the original variable (sum) is updated with the values of the private copies using the specified operator (+) [19]. Listing 7 shows the high GIMPLE form of the code. Since the parallel loop construct is just a shortcut for specifying a parallel construct containing a single loop construct [19], it is expanded to a loop nested in a parallel region.

The data sharing attribute private, which indicates that this variable is private to a thread, is added automatically (see line 11).

2. Low GIMPLE form. In the next phase, the code is linearized and special markers are inserted to identify the end of each parallel worksharing region. At the end of this phase, a data structure containing the shared data is created.

3. Final GIMPLE form. This phase consists in outlining the body of an omp parallel pragma into a separate function. Additionally, calls to libgomp thread creation routines are inserted in order to run the outlined functions. This step also contains the task partitioning, i.e. local loop bounds based on thread identifiers are computed. Listing 8 shows the final GIMPLE form. Now the loop body has been moved into a separate function (main.omp_fn.0, line 19), and thread creation (line 11) and termination (line 13) routines are also inserted into the code. This function takes an argument of type struct .omp_data_s.0 *, which contains the addresses of all shared variables contained in the parallel construct (in this example: sum).

1  #include "stdio.h"
2  #define N 100
3  #define INC 2
4
5  int main ()
6  {
7      int i;
8      int sum = 0;
9      #pragma omp parallel for reduction(+: sum)
10     for (i = 1; i <= N; i++)
11     {
12         sum += INC;
13     }
14     return sum;
15 }

Listing 6: computing the sum of 2 over i = 1, ..., 100 in parallel

1  int main () ()
2  {
3    int D.2562;
4    {
5      int i;
6      int sum;
7      sum = 0;
8      #pragma omp parallel reduction(+: sum)
9      {
10       {
11         #pragma omp for nowait private(i)
12         for (i = 1; i <= 100; i = i + 1)
13         {
14           sum = sum + 2;
15         }
16       }
17     }
18     D.2562 = sum;
19     return D.2562;
20   }
21   D.2562 = 0;
22   return D.2562;
23 }

Listing 7: High GIMPLE form

1  int main () ()
2  {
3    int sum;
4    int i;
5    int D.2562;
6    struct .omp_data_s.0 .omp_data_o.2;
7
8    <bb 2>:
9    sum = 0;
10   .omp_data_o.2.sum = sum;
11   builtin_gomp_parallel_start (main.omp_fn.0, &.omp_data_o.2, 0);
12   main.omp_fn.0 (&.omp_data_o.2);
13   builtin_gomp_parallel_end ();
14   sum = .omp_data_o.2.sum;
15   D.2562 = sum;
16   return D.2562;
17 }
18
19 void main.omp_fn.0 (void *) (.omp_data_i)
20 {
21   ...
22   <bb 2>:
23   sum = 0;
24   D.2578 = builtin_omp_get_num_threads ();
25   D.2579 = builtin_omp_get_thread_num ();
26   D.2580 = 100 / D.2578;
27   ...
28   D.2586 = MIN_EXPR <D.2585, 100>;
29   if (D.2584 >= D.2586) goto <L3>; else goto <L1>;
30   <L3>:;
31   sum.1 = (unsigned int) sum;
32   D.2572 = &.omp_data_i->sum;
33   sync_fetch_and_add_4 (D.2572, sum.1);
34   return;
35   <L1>:;
36   D.2587 = D.2584 * 1;
37   i = D.2587 + 1;
38   D.2588 = D.2586 * 1;
39   D.2589 = D.2588 + 1;
40   <L2>:;
41   sum = sum + 2;
42   i = i + 1;
43   D.2590 = i < D.2589;
44   if (D.2590) goto <L2>; else goto <L3>;
45 }

Listing 8: Final GIMPLE form
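The bound computation in lines 24-29 of Listing 8 is hard to read in GIMPLE form. The following hand-written C sketch (my simplification, not GOMP output) shows the idea behind it: every thread derives its own chunk of the 100 iterations from its thread number and then adds its partial result atomically to the shared variable, just like the sync_fetch_and_add_4 call in Listing 8.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0;

        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();
            int tid      = omp_get_thread_num();

            /* block partitioning of the 100 iterations, in the spirit of the
               MIN_EXPR-bounded chunk computed in Listing 8 */
            int chunk = (100 + nthreads - 1) / nthreads;
            int lo    = tid * chunk;
            int hi    = (lo + chunk < 100) ? lo + chunk : 100;

            int local = 0;
            for (int i = lo; i < hi; i++)
                local += 2;

            #pragma omp atomic          /* combine the partial results */
            sum += local;
        }

        printf("sum = %d\n", sum);      /* 200, the value Listing 6 returns */
        return 0;
    }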

Figure 12: GOMP transformations.

4.4 Using GOMP to transform OpenMP Programs into MPI Programs

This Section describes how GOMP can be used to generate MPI programs. An important aspect is that automatic MPI code generation requires data dependency analyses in order to generate communications. Since the MPI transformation does not require any new syntactic constructs, the compiler front-end can be used without any modification. Additionally, the automatic determination of the data sharing attributes, the code linearization, the marker insertion and the outlining of the body of a parallel construct are compatible and can be reused. However, the semantics of the parallel directive is very different in the message-passing context of MPI, i.e. there will be no task creation at runtime, since all tasks are started at the beginning of the MPI application. The parallel directive guarantees that there is no data dependency between two parallel constructs. During task partitioning, GOMP computes local loop bounds depending on thread identifiers, while MPI provides process identifiers (i.e. ranks). Hence, the GOMP thread creation code must be replaced by MPI routines for initialization, communication and finalization. Initialization and finalization routines must be added at the beginning and at the end of the main program, and communication routines must be inserted after the for worksharing construct. Figure 13 illustrates the modified transformation process in the GCC pipeline.

When generating communication, it is necessary to know which shared data is modified within a parallel worksharing construct. To this end, a list of the shared variables which are modifiable inside a parallel worksharing construct and later used in the program must be built. This list must contain the type and the size of each variable. The size can be determined easily for primitive types and statically allocated arrays. In contrast, the size of dynamically allocated arrays must be determined by analyzing the control-flow and data-flow graphs where possible. At the end of a parallel worksharing construct, all modified data will be updated on each MPI process by using MPI communications. If information about the accessed data elements is available, then only these regions are updated. I will go into that in more detail in Section 4.5.2.
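The overall shape of the transformation can be summarized with a hand-written sketch of my own (it is not output of the prototype, and it uses a collective MPI_Allreduce where the prototype described in Section 4.5 uses a centralized send-to-master/broadcast scheme): thread creation disappears, the loop bounds are derived from the MPI rank, and an update communication follows the worksharing loop.

    #include <stdio.h>
    #include <mpi.h>

    #define N   100
    #define INC 2

    int main(int argc, char **argv)
    {
        int sum = 0, global_sum = 0;
        int rank, size;

        MPI_Init(&argc, &argv);                  /* inserted at the start of main() */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* local loop bounds computed from the rank instead of the thread id */
        int chunk = (N + size - 1) / size;
        int lo = 1 + rank * chunk;
        int hi = (lo + chunk <= N + 1) ? lo + chunk : N + 1;

        for (int i = lo; i < hi; i++)            /* body of the former omp for loop */
            sum += INC;

        /* update communication inserted after the worksharing construct
           (a reduction here; modified array regions would be exchanged instead) */
        MPI_Allreduce(&sum, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %d\n", global_sum);    /* 200, as in Listing 6 */

        MPI_Finalize();                          /* inserted at the end of main()   */
        return 0;
    }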

Figure 13: GOMP and MPI transformations.

4.5 Current Implementation

The authors have developed a limited prototype that only supports transformations of for worksharing constructs. The communication only works for scalars and statically allocated arrays, and function calls inside a worksharing construct are not supported.

4.5.1 The Code Transformation Process

A parallel for loop is represented by the OMP_FOR GIMPLE tree code, which has five operands; they are listed in Table 1. The associated function for gimplification is c_gimplify_omp_for (file: gcc-4.2.0/gcc/gimplify.c). This function is called with a pointer to a tree structure which represents the body of the loop. When building the new MPI-suited loop structure, the initialization, stop condition and increment expression can be reused. The lower and upper bounds of the parallel loops have to be computed depending on the current MPI process identifier and the total number of MPI processes. To this end, the authors created two additional GIMPLE variables, lo.x and hi.x, and insert calls to the new functions step_get_loop_lo() and step_get_loop_hi() in front of the body of the parallel loop. The loop initializer (OMP_FOR_INIT) and the condition (OMP_FOR_COND) are then modified according to the new values. Finally, calls to MPI routines for communicating array variables that have been modified within the parallel loop body are inserted. The list of modified variables is built by collecting the variables that occur on the left hand sides of assignments. Note that this analysis only works for statically allocated arrays and does not follow function calls. When synchronizing data, a centralized communication scheme is implemented, where each MPI process sends its set of modified variables to the MPI master process (rank 0). Subsequently, the master synchronizes each variable and broadcasts the updated version to all other MPI processes.
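The paper does not show the bodies of step_get_loop_lo() and step_get_loop_hi(). A plausible block-distribution implementation, written purely as an assumption of mine about their semantics (inclusive bounds, as suggested by the region formula in Section 4.5.2), could look like this:

    #include <mpi.h>

    /* Illustrative assumption only, not the authors' code: distribute the
       iteration range [0, n-1] in contiguous blocks, one per MPI process,
       and return inclusive lower/upper bounds for the calling process. */
    void block_loop_bounds(int n, int *lo, int *hi)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = (n + size - 1) / size;        /* ceiling division        */
        *lo = rank * chunk;                       /* first iteration owned   */
        int last = *lo + chunk - 1;               /* last iteration owned    */
        *hi = (last < n - 1) ? last : n - 1;      /* clamp to the loop range */
    }

With the ten rows of the example in Section 4.5.2 and four processes, rank 1 would, for instance, obtain lo = 3 and hi = 5.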

Operand          Description
OMP_FOR_BODY     contains the loop body.
OMP_FOR_CLAUSES  list of clauses associated with the directive.
OMP_FOR_INIT     loop initialization code of the form VAR = N1.
OMP_FOR_COND     loop conditional expression of the form VAR {<,>,<=,>=} N2.
OMP_FOR_INCR     the loop index increment of the form VAR {+=,-=} INCR.

Table 1: OMP_FOR operands [20]

4.5.2 Communication Optimization

The authors also propose a first optimization strategy for the communications. As mentioned before, the MPI transformation process needs precise information about the modified array regions. The optimization strategy consists of determining the lowest and highest indices of the accessed array elements for each array dimension. This requires that array regions are accessed linearly within parallel loops, i.e. all array references are contiguous. In order to compute the lower and higher indices, each variable which appears in the index expression is replaced by its minimal and maximal value, respectively. For instance, consider Listing 9, which performs a parallel RGB to grayscale conversion. The array A is modified in line 16, and the corresponding index expression is j+i*W. The global iteration domain for i (ranging from 0 to H-1) is given by the outermost loop header (line 12). The iteration domain for j ranges from 0 to W-1 (line 14). Since the outermost loop will be parallelized, the actual local iteration domain is only known at runtime. In the current implementation, two variables (lo.x and hi.x) hold the minimum and maximum values of the parallel loop index. Now, array A is accessed in the interval [MIN(j)+lo.x*W, MAX(j)+hi.x*W], which becomes [lo.x*W, W-1+hi.x*W] after replacing MIN(j) with 0 and MAX(j) with W-1. At the end of the execution of the parallel loop, each process knows that the region of A which has to be sent ranges from A[lo.x*W] to A[W-1+hi.x*W]. Listing 10 shows the low GIMPLE representation of the code. Here, at the end of the parallel for loop (else-part, lines 27 et seq.), the region that has to be sent is calculated using the above formula (upper boundary: lines 27-28, lower boundary: lines 29-30). Afterwards, the data is sent to the MPI master process by calling the function step_sendregion(). At the end, the master synchronizes all variables and broadcasts the updated versions to all other MPI processes.

1  #define W 10
2  #define H 10
3
4  int main ()
5  {
6      int i, j;
7      unsigned char A[W*H];
8      unsigned char R[W*H], G[W*H], B[W*H];
9      i = 0;
10     j = 0;
11     #pragma omp parallel for
12     for (int i = 0; i < H; i++)
13     {
14         for (j = 0; j < W; j++)
15         {
16             A[j+i*W] = R[j+i*W] * 0.299 + \
17                        G[j+i*W] * 0.587 + \
18                        B[j+i*W] * 0.114;
19         }
20     }
21     return 0;
22 }

Listing 9: Parallel conversion of an RGB image to a grayscale image

1  int main () ()
2  { ...
3    {
4      { ...
5        step_get_loop_lo (&lo.0);
6        step_get_loop_hi (&hi.1);
7        i = lo.0;
8        goto <D.2237>;
9        <D.2235>:;
10       j = 0;
11       goto <D.2241>;
12       <D.2240>:;
13       D.2242 = 0*10;
14       D.2243 = D.2242 + j;
15       D.2244 = i*10;
         ... /* LoopBody */ ...
19       D.2267 = step_get_rank ();
20       if (D.2267 == 0)
21       {
22         step_recvmerge (A, 0, 99, MPI_TYPE_INT);
23         step_bcastregion (A, 0, 99, MPI_TYPE_INT);
24       }
25       else
26       {
27         D.2268 = hi.1*10;
28         D.2269 = D.2268 + 9;
29         D.2270 = lo.0*10;
30         D.2271 = D.2270 + 0;
31         step_sendregion (A, 0, D.2271, D.2269, MPI_TYPE_INT);
32         step_recvregion (A, 0, 99, MPI_TYPE_INT);
33       }
34     }
35   }
   ...
 }

Listing 10: Low GIMPLE form

5 Conclusions

Today, most systems in High Performance Computing are clusters of shared memory nodes, and therefore it makes sense to combine both worlds by placing MPI parallelism on top of OpenMP parallelism. However, parallelizing a sequential application in MPI requires a considerable effort. In order to avoid this disadvantage, we saw two approaches that try to abstract from directly programming MPI code.

First, we considered llc, a language in which parallelism is expressed using OpenMP combined with several llc-specific compiler directives. Compilation is done using a special compiler which translates llc code into hybrid OpenMP/MPI code. This code includes MPI communication handling and synchronization of data between the nodes. The OTOSP model describes how computational work can be distributed to a given set of processors. Using llc, programming parallel code for SMP clusters becomes much easier, especially for scientists and engineers. However, llc requires that all variables within a parallel for construct are private. The reason for this could be that manipulating shared variables would produce expensive communications between all SMP nodes. Hence, this approach is not applicable to programs that make use of shared variables.

The second approach in this paper considered an extension of the GOMP framework in the GCC compiler, which performs a translation of OpenMP code to pure MPI programs. This approach also abstracts from programming MPI code but, in contrast to llc, it does not require additional compiler pragmas. Nevertheless, it is not designed to produce hybrid code, and therefore it does not exploit parallelism within SMP nodes, and data must be explicitly updated via communications. The experiments in Section 3.4 have shown that hybrid programming is no silver bullet for achieving an optimal speedup. Since GCC is free software, modifications in GCC, e.g. the code transformations described here, can be directly used by everyone.


More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system

OpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives

More information

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. UvA / SURFsara High Performance Computing and Big Data. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SURFsara High Performance Computing and Big Data Message Passing as a Programming Paradigm Gentle Introduction to MPI Point-to-point Communication Message Passing

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008 1 of 6 Lecture 7: March 4 CISC 879 Software Support for Multicore Architectures Spring 2008 Lecture 7: March 4, 2008 Lecturer: Lori Pollock Scribe: Navreet Virk Open MP Programming Topics covered 1. Introduction

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/ Outline Shared Memory Programming OpenMP Fork-Join Model Compiler Directives / Run time library routines Compiling and

More information

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007

HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 HPC Workshop University of Kentucky May 9, 2007 May 10, 2007 Part 3 Parallel Programming Parallel Programming Concepts Amdahl s Law Parallel Programming Models Tools Compiler (Intel) Math Libraries (Intel)

More information

15-440: Recitation 8

15-440: Recitation 8 15-440: Recitation 8 School of Computer Science Carnegie Mellon University, Qatar Fall 2013 Date: Oct 31, 2013 I- Intended Learning Outcome (ILO): The ILO of this recitation is: Apply parallel programs

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

HPC Parallel Programing Multi-node Computation with MPI - I

HPC Parallel Programing Multi-node Computation with MPI - I HPC Parallel Programing Multi-node Computation with MPI - I Parallelization and Optimization Group TATA Consultancy Services, Sahyadri Park Pune, India TCS all rights reserved April 29, 2013 Copyright

More information

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 4. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 4 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space.

Hybrid MPI/OpenMP parallelization. Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Hybrid MPI/OpenMP parallelization Recall: MPI uses processes for parallelism. Each process has its own, separate address space. Thread parallelism (such as OpenMP or Pthreads) can provide additional parallelism

More information

Holland Computing Center Kickstart MPI Intro

Holland Computing Center Kickstart MPI Intro Holland Computing Center Kickstart 2016 MPI Intro Message Passing Interface (MPI) MPI is a specification for message passing library that is standardized by MPI Forum Multiple vendor-specific implementations:

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

MPI Collective communication

MPI Collective communication MPI Collective communication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) MPI Collective communication Spring 2018 1 / 43 Outline 1 MPI Collective communication

More information

Shared Memory programming paradigm: openmp

Shared Memory programming paradigm: openmp IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail.

PCAP Assignment I. 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. PCAP Assignment I 1. A. Why is there a large performance gap between many-core GPUs and generalpurpose multicore CPUs. Discuss in detail. The multicore CPUs are designed to maximize the execution speed

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

MPI Message Passing Interface

MPI Message Passing Interface MPI Message Passing Interface Portable Parallel Programs Parallel Computing A problem is broken down into tasks, performed by separate workers or processes Processes interact by exchanging information

More information

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization

More information

OpenMP 2. CSCI 4850/5850 High-Performance Computing Spring 2018

OpenMP 2. CSCI 4850/5850 High-Performance Computing Spring 2018 OpenMP 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

OpenMPand the PGAS Model. CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen

OpenMPand the PGAS Model. CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen OpenMPand the PGAS Model CMSC714 Sept 15, 2015 Guest Lecturer: Ray Chen LastTime: Message Passing Natural model for distributed-memory systems Remote ( far ) memory must be retrieved before use Programmer

More information

Message Passing Interface

Message Passing Interface MPSoC Architectures MPI Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr Message Passing Interface API for distributed-memory programming parallel code that runs across

More information

Session 4: Parallel Programming with OpenMP

Session 4: Parallel Programming with OpenMP Session 4: Parallel Programming with OpenMP Xavier Martorell Barcelona Supercomputing Center Agenda Agenda 10:00-11:00 OpenMP fundamentals, parallel regions 11:00-11:30 Worksharing constructs 11:30-12:00

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 15 October 2015 Announcements Homework #3 and #4 Grades out soon Homework #5 will be posted

More information

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions.

Parallel Processing Top manufacturer of multiprocessing video & imaging solutions. 1 of 10 3/3/2005 10:51 AM Linux Magazine March 2004 C++ Parallel Increase application performance without changing your source code. Parallel Processing Top manufacturer of multiprocessing video & imaging

More information

Parallel Programming

Parallel Programming Parallel Programming OpenMP Nils Moschüring PhD Student (LMU) Nils Moschüring PhD Student (LMU), OpenMP 1 1 Overview What is parallel software development Why do we need parallel computation? Problems

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

DPHPC: Introduction to OpenMP Recitation session

DPHPC: Introduction to OpenMP Recitation session SALVATORE DI GIROLAMO DPHPC: Introduction to OpenMP Recitation session Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf OpenMP An Introduction What is it? A set of compiler directives

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Introduction to OpenMP.

Introduction to OpenMP. Introduction to OpenMP www.openmp.org Motivation Parallelize the following code using threads: for (i=0; i

More information

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013

OpenMP. António Abreu. Instituto Politécnico de Setúbal. 1 de Março de 2013 OpenMP António Abreu Instituto Politécnico de Setúbal 1 de Março de 2013 António Abreu (Instituto Politécnico de Setúbal) OpenMP 1 de Março de 2013 1 / 37 openmp what? It s an Application Program Interface

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives n Understanding how MPI programs execute n Familiarity with fundamental MPI functions

More information

CS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010.

CS4961 Parallel Programming. Lecture 18: Introduction to Message Passing 11/3/10. Final Project Purpose: Mary Hall November 2, 2010. Parallel Programming Lecture 18: Introduction to Message Passing Mary Hall November 2, 2010 Final Project Purpose: - A chance to dig in deeper into a parallel programming model and explore concepts. -

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming Outline OpenMP Shared-memory model Parallel for loops Declaring private variables Critical sections Reductions

More information

CS691/SC791: Parallel & Distributed Computing

CS691/SC791: Parallel & Distributed Computing CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.

More information

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1 Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono

OpenMP Algoritmi e Calcolo Parallelo. Daniele Loiacono OpenMP Algoritmi e Calcolo Parallelo References Useful references Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost and Ruud van der Pas OpenMP.org http://openmp.org/

More information

Advanced Message-Passing Interface (MPI)

Advanced Message-Passing Interface (MPI) Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP Shared Memory Programming with OpenMP (An UHeM Training) Süha Tuna Informatics Institute, Istanbul Technical University February 12th, 2016 2 Outline - I Shared Memory Systems Threaded Programming Model

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

Chapter 3. Distributed Memory Programming with MPI

Chapter 3. Distributed Memory Programming with MPI An Introduction to Parallel Programming Peter Pacheco Chapter 3 Distributed Memory Programming with MPI 1 Roadmap n Writing your first MPI program. n Using the common MPI functions. n The Trapezoidal Rule

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*

More information

Distributed Memory Programming with Message-Passing

Distributed Memory Programming with Message-Passing Distributed Memory Programming with Message-Passing Pacheco s book Chapter 3 T. Yang, CS240A Part of slides from the text book and B. Gropp Outline An overview of MPI programming Six MPI functions and

More information

Multithreading in C with OpenMP

Multithreading in C with OpenMP Multithreading in C with OpenMP ICS432 - Spring 2017 Concurrent and High-Performance Programming Henri Casanova (henric@hawaii.edu) Pthreads are good and bad! Multi-threaded programming in C with Pthreads

More information

Hybrid MPI and OpenMP Parallel Programming

Hybrid MPI and OpenMP Parallel Programming Hybrid MPI and OpenMP Parallel Programming Jemmy Hu SHARCNET HPTC Consultant July 8, 2015 Objectives difference between message passing and shared memory models (MPI, OpenMP) why or why not hybrid? a common

More information

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018

MPI and comparison of models Lecture 23, cs262a. Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI and comparison of models Lecture 23, cs262a Ion Stoica & Ali Ghodsi UC Berkeley April 16, 2018 MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers,

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve PTHREADS pthread_create, pthread_exit, pthread_join Mutex: locked/unlocked; used to protect access to shared variables (read/write) Condition variables: used to allow threads

More information

Shared Memory Programming Model

Shared Memory Programming Model Shared Memory Programming Model Ahmed El-Mahdy and Waleed Lotfy What is a shared memory system? Activity! Consider the board as a shared memory Consider a sheet of paper in front of you as a local cache

More information

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2 Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS Teacher: Jan Kwiatkowski, Office 201/15, D-2 COMMUNICATION For questions, email to jan.kwiatkowski@pwr.edu.pl with 'Subject=your name.

More information

EPL372 Lab Exercise 5: Introduction to OpenMP

EPL372 Lab Exercise 5: Introduction to OpenMP EPL372 Lab Exercise 5: Introduction to OpenMP References: https://computing.llnl.gov/tutorials/openmp/ http://openmp.org/wp/openmp-specifications/ http://openmp.org/mp-documents/openmp-4.0-c.pdf http://openmp.org/mp-documents/openmp4.0.0.examples.pdf

More information

Molecular Dynamics. Dim=3, parts=8192, steps=10. crayc (Cray T3E) Processors

Molecular Dynamics. Dim=3, parts=8192, steps=10. crayc (Cray T3E) Processors The llc language and its implementation Antonio J. Dorta, Jose Rodr guez, Casiano Rodr guez and Francisco de Sande Dpto. Estad stica, I.O. y Computación Universidad de La Laguna La Laguna, 38271, Spain

More information

Concurrent Programming with OpenMP

Concurrent Programming with OpenMP Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico March 7, 2016 CPD (DEI / IST) Parallel and Distributed

More information

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer

OpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.

More information

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory

More information

An Introduction to OpenMP

An Introduction to OpenMP Dipartimento di Ingegneria Industriale e dell'informazione University of Pavia December 4, 2017 Recap Parallel machines are everywhere Many architectures, many programming model. Among them: multithreading.

More information

Shared memory parallel computing

Shared memory parallel computing Shared memory parallel computing OpenMP Sean Stijven Przemyslaw Klosiewicz Shared-mem. programming API for SMP machines Introduced in 1997 by the OpenMP Architecture Review Board! More high-level than

More information

CSE 160 Lecture 18. Message Passing

CSE 160 Lecture 18. Message Passing CSE 160 Lecture 18 Message Passing Question 4c % Serial Loop: for i = 1:n/3-1 x(2*i) = x(3*i); % Restructured for Parallelism (CORRECT) for i = 1:3:n/3-1 y(2*i) = y(3*i); for i = 2:3:n/3-1 y(2*i) = y(3*i);

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB)

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell. Specifications maintained by OpenMP Architecture Review Board (ARB) COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell

COMP4300/8300: The OpenMP Programming Model. Alistair Rendell COMP4300/8300: The OpenMP Programming Model Alistair Rendell See: www.openmp.org Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapter 6 & 7 High Performance

More information

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs

The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs 1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

DPHPC: Introduction to OpenMP Recitation session

DPHPC: Introduction to OpenMP Recitation session SALVATORE DI GIROLAMO DPHPC: Introduction to OpenMP Recitation session Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf OpenMP An Introduction What is it? A set

More information