Point-to-Point Synchronisation on Shared Memory Architectures

J. Mark Bull and Carwyn Ball
EPCC, The King's Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K.
email: m.bull@epcc.ed.ac.uk

Abstract

The extensive use of barriers in OpenMP often causes synchronisation of threads which do not share data, and therefore have no data dependences to preserve. In order to make synchronisation more closely resemble the dependence structure of the data used by a thread, point-to-point synchronisation can be used. We report the results of implementing such a point-to-point synchronisation routine on a Sun Fire 15K with 32 processors and an IBM p690 with 8 processors. We show that for some common dependency patterns it is significantly more efficient than the standard OpenMP barrier.

1 Introduction

Use of barriers is common in shared memory parallel programming. The OpenMP API [2], [3], which facilitates the transformation of sequential codes for shared memory multiprocessors, not only contains explicitly called barriers, but many of its directives also contain hidden or implicit barriers. To avoid race conditions, the order of memory operations on individual shared data must be guaranteed. To guarantee the order of operations on a single shared datum, only the threads which operate on that datum need be considered: there is no need to synchronise threads which do not operate on the same data. Often, therefore, much of the synchronisation performed by a barrier is unnecessary.

A point-to-point synchronisation routine can be used to take advantage of limited thread interdependence. A point-to-point routine forces each thread to synchronise only with an individual list of threads. For example, where the value of a datum depends on the values of neighbouring data in an array, only the threads which update neighbouring data need to be considered; the thread need not synchronise with other threads which have no influence on its data.

It is hoped that the point-to-point routine will have two performance benefits over barriers. First, the time taken to communicate with a small number of threads is much less than the time taken for all threads to communicate with each other, no matter how intelligent the communication algorithm. Second, looser synchronisation can reduce the impact of non-systematic load imbalance between the threads.

Excessive barrier synchronisation is often the reason why regular domain decomposition OpenMP codes do not scale as well as MPI versions of the same code on very large shared memory systems. Another situation where barrier overheads can dominate is in legacy vector codes which contain large numbers of fine-grained parallel loops; such codes frequently perform poorly on superscalar SMP systems.

A set of experiments has been performed to investigate the potential benefits of point-to-point synchronisation. We also consider how point-to-point synchronisation might usefully be incorporated into the OpenMP API.

2 Point-to-Point Routine

The point-to-point routine, running on N threads, uses a shared array, all members of which are initialised to a constant value (zero). Each thread uses a different element of the array. Each thread passes to the routine a list of the threads on which it has a dependency. This process is not automatic: the programmer may choose from a number of pre-designed patterns (see the next section) or create their own.
It should be noted, however, that where dynamic loop scheduling is used, it is not possible to determine which thread accesses which data, so this approach cannot be applied. The list of thread numbers, known as the dependency list, is stored in a private array. Each thread waits until all of its dependent threads have entered the routine the same number of times as itself.

void sync(int me, int *neighbour, int numneighbours)
{
   int i;

   /* Increase my counter: */
   counter[me*PADDING]++;
#pragma omp flush (counter)

   /* Wait until each neighbour has arrived: */
   for (i = 0; i < numneighbours; i++) {
      while (counter[me*PADDING] > counter[neighbour[i]*PADDING]) {
#pragma omp flush (counter)
      }
   }
}

Figure 1: Point-to-point synchronisation code.

When a thread arrives at the routine, it increments its member of the shared array and busy-waits until the member of the array belonging to each of the threads on its dependency list has been incremented also. In doing so it ensures that each of the relevant threads has at least arrived at the same synchronisation episode. The relevant code fragment is shown in Figure 1.

In this routine, counter is an array of 64-bit integers (long long), to reduce the possibility of overflow. It is unlikely that a number of synchronisations larger than the maximum value of a 64-bit long long (2^63 - 1, roughly 10^19) will be required in any application. If this is necessary, however, the initialisation routine can be run again to reset the array to zero. PADDING is the array padding factor, which prevents false sharing by forcing each counter to be stored on a different cache line. On a Sun Fire 15K, for example, cache lines are 64 bytes (8 64-bit words) long, so the optimal value for PADDING is 8.

Note the use of the OpenMP flush directive to ensure that the values in the counter array are written to and read from the memory system, and not held in registers. We have also implemented system-specific versions for both the IBM and Sun systems which do not use the flush directive. On the Sun system, the code above functions correctly without the flush directives provided it is compiled without optimisation and counter is declared volatile. On the IBM, in addition to the volatile declaration, synchronisation instructions (see [1]) have to be added, as the Power4 processor has a weaker consistency model and allows out-of-order execution: an lwsync instruction is needed in place of the first flush, and an isync instruction after the while loop.
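As a concrete illustration of how the routine might be driven, the following sketch initialises the counter array and calls sync inside a parallel region, with each thread depending on its left and right neighbours in a non-cyclic one-dimensional arrangement. This is a minimal sketch under our own assumptions: the helper sync_init, the constant MAXTHREADS and the surrounding main are illustrative additions, not part of the paper's code; only sync itself (Figure 1) is the paper's routine.

#include <omp.h>

#define PADDING    8          /* one counter per 64-byte cache line (Sun Fire 15K) */
#define MAXTHREADS 64         /* assumed upper bound on the number of threads */

long long counter[MAXTHREADS*PADDING];   /* shared counter array used by sync() */

/* Hypothetical initialisation routine: reset all counters to zero
   before the threads start synchronising. */
void sync_init(int nthreads)
{
    for (int i = 0; i < nthreads*PADDING; i++) counter[i] = 0;
}

void sync(int me, int *neighbour, int numneighbours);   /* as defined in Figure 1 */

int main(void)
{
    int nthreads = omp_get_max_threads();
    sync_init(nthreads);

    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        int neighbour[2], numneighbours = 0;

        /* Non-cyclic 1-D dependency list: left and right neighbours only. */
        if (me > 0)            neighbour[numneighbours++] = me - 1;
        if (me < nthreads - 1) neighbour[numneighbours++] = me + 1;

        for (int step = 0; step < 1000; step++) {
            /* ... work on this thread's section of the data ... */
            sync(me, neighbour, numneighbours);   /* wait only for the neighbours */
        }
    }
    return 0;
}

Because every thread calls sync once per step, the counters stay in lock-step and a thread leaving the routine knows that both of its neighbours have completed the same step.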

3 Common Dependence Patterns

The dependency list of a thread contains the IDs of the threads with which it synchronises. With the synchronisation routine shown in Figure 1, it is left to the programmer to build this list. There are several dependence patterns which occur frequently in data-parallel style codes; a selection is illustrated in Figure 2. The threads are arranged into one-, two- and three-dimensional grids, in order to operate on correspondingly ordered data arrays. Cases where the dependence flow is restricted to fewer dimensions than the dimensionality of the grid (for example, where each new value depends on some function of the neighbouring values in one dimension but not the other: $H^{new}_{i,j} = H^{old}_{i\pm1,j}$) are referred to as dimensionally degenerate cases and are not considered.

One-dimensional arrays have two common patterns: the data can depend on one or on both neighbouring data points, $H^{new}_i = H^{old}_{i\pm1}$ or $H^{new}_i = H^{old}_{i+1}, H^{old}_{i-1}$.

[Figure 2: Common data flow patterns in (a) one, (b) two and (c) three dimensions. The numbers indicate the number of threads in the dependency list.]

Four common non-degenerate two-dimensional patterns are shown in Figure 2(b). The leftmost illustrates the case where a data point depends on two adjacent neighbours: $H^{new}_{i,j} = H^{old}_{i-1,j}, H^{old}_{i,j-1}$. Next, the so-called wave-fronted loop is the case where a data point depends on the value of one of the diagonally adjacent points, for example $H_{i,j} = H_{i-1,j-1}$. It is called wave-fronted because, after the top-right corner point has executed, the threads can execute in a diagonal wave through the grid. Next is the 5-pointed stencil, where a data point depends in some way on its four nearest adjacent neighbours; this is shown in the central diagram of Figure 2(b). The rightmost diagram shows a variation, the 9-pointed stencil, which can be used to obtain more accuracy. For example, where the 5-pointed stencil would correspond to a diffusion term such as

$H^{new}_{i,j} = \frac{1}{4}\left(H^{old}_{i-1,j} + H^{old}_{i+1,j} + H^{old}_{i,j-1} + H^{old}_{i,j+1}\right)$,

a more accurate result can be gained by instead using the term

$H^{new}_{i,j} = \frac{1}{5}\left(H^{old}_{i-1,j} + H^{old}_{i+1,j} + H^{old}_{i,j-1} + H^{old}_{i,j+1}\right) + \frac{1}{20}\left(H^{old}_{i-1,j-1} + H^{old}_{i+1,j-1} + H^{old}_{i-1,j+1} + H^{old}_{i+1,j+1}\right)$,

which requires a 9-pointed stencil pattern.

The three-dimensional non-degenerate patterns shown in Figure 2(c) are natural extensions into three dimensions of the two-dimensional patterns. The patterns listed above are not the only non-degenerate ones: involving all nearest neighbours in m dimensions, there are $(3^m-1)!\,/\,3\,(3^{m-1}-1)!$ potential non-degenerate patterns, including all laterally and rotationally symmetric repetitions. However, most of these represent dependence patterns which occur rarely in practice.
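To make the patterns above concrete, the following sketch shows one way the dependency lists for the one-dimensional two-neighbour pattern and the two-dimensional four-neighbour (5-pointed stencil) pattern might be generated. The function names and the row-major thread numbering are our own illustrative assumptions; the paper does not prescribe how the lists are built.

/* Illustrative builders for some of the dependency lists of Figure 2.
   Thread IDs are assumed to run 0..nthreads-1; in 2-D they are assumed
   to be laid out row-major on an nx-by-ny grid of threads. */

/* 1-D, two neighbours.  Returns the number of neighbours written into list[]. */
int deps_1d_two(int me, int nthreads, int cyclic, int *list)
{
    int n = 0;
    if (cyclic) {
        list[n++] = (me + nthreads - 1) % nthreads;   /* left  */
        list[n++] = (me + 1) % nthreads;              /* right */
    } else {
        if (me > 0)            list[n++] = me - 1;
        if (me < nthreads - 1) list[n++] = me + 1;
    }
    return n;
}

/* 2-D, four nearest neighbours (5-pointed stencil), non-cyclic boundaries. */
int deps_2d_four(int me, int nx, int ny, int *list)
{
    int n = 0;
    int ix = me % nx;      /* position of this thread in the thread grid */
    int iy = me / nx;
    if (ix > 0)      list[n++] = me - 1;     /* west  */
    if (ix < nx - 1) list[n++] = me + 1;     /* east  */
    if (iy > 0)      list[n++] = me - nx;    /* south */
    if (iy < ny - 1) list[n++] = me + nx;    /* north */
    return n;
}

Each thread would call one of these once, store the result in its private neighbour array, and then pass that array to sync on every synchronisation episode.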

4 Results

In this section we report the results of experiments which assess the performance of the point-to-point routine. We measured the overhead of the routine by timing 1,000,000 calls and computing the mean overhead per call. We ran each experiment five times and took the best result in each case.

We used two systems to test the routine: a Sun Fire 15K system, where we had access to 32 out of a total of 52 processors, and a 32-processor IBM p690, where we had access to an eight-processor logical partition. The Sun Fire 15K system contains 900 MHz UltraSPARC-III processors, each with a 64 Kbyte level 1 cache and an 8 Mbyte level 2 cache. The system has 1 Gbyte of memory per processor. We used the Forte Developer 7 C compiler. The IBM p690 system contains 1.3 GHz Power4 processors, each with a 32 Kbyte level 1 cache. Each pair of processors shares a 1.44 Mbyte level 2 cache, and all eight processors in the logical partition share 128 Mbytes of level 3 cache. We used the XLC 6.0 C compiler.

Figure 3 shows the overhead of the point-to-point routine on the Sun Fire 15K for a number of synchronisation patterns: 1 and 2 neighbours in one dimension and 4 neighbours in two dimensions, with cyclic and non-cyclic boundary conditions in each case. We also show the overhead of the OpenMP barrier directive, and the result of implementing a barrier (i.e. synchronising all threads) using the point-to-point routine.

[Figure 3: Overhead for synchronisation routines on the Sun Fire 15K — time in microseconds against number of processors (up to 32), for cyclic and non-cyclic 1-dim 1-dep, 1-dim 2-dep and 2-dim 4-dep patterns, all dependencies, and the OpenMP barrier.]
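The measurement just described might be realised with a harness along the following lines. This is a sketch under our own assumptions (the NCALLS and NRUNS constants, the per-thread neighbour arrays passed in by the caller, and the use of omp_get_wtime); it is not the paper's actual benchmark code.

#include <omp.h>

#define NCALLS 1000000     /* synchronisations timed per run */
#define NRUNS  5           /* repetitions; best result is kept */

void sync(int me, int *neighbour, int numneighbours);   /* Figure 1 */

/* Time NCALLS consecutive synchronisations and return the mean cost
   per call (in seconds), taking the best of NRUNS repetitions. */
double time_sync(int *neighbour_of[], int numneighbours_of[])
{
    double best = 1.0e30;
    for (int run = 0; run < NRUNS; run++) {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            int me = omp_get_thread_num();
            for (int call = 0; call < NCALLS; call++)
                sync(me, neighbour_of[me], numneighbours_of[me]);
        }
        double t = (omp_get_wtime() - t0) / NCALLS;   /* mean overhead per call */
        if (t < best) best = t;
    }
    return best;
}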

[Figure 4: Overhead for synchronisation routines on the IBM p690 — time in microseconds against number of processors (up to 8), for the same set of patterns and the OpenMP barrier.]

[Figure 5: Comparison of point-to-point implementations (FLUSH vs. no FLUSH) on the Sun Fire 15K — time in microseconds against number of processors (up to 32).]

[Figure 6: Comparison of point-to-point implementations (FLUSH vs. no FLUSH) on the IBM p690 — time in microseconds against number of processors (up to 8).]

We see immediately that the point-to-point routine with small numbers of dependencies has significantly less overhead than an OpenMP barrier: on 32 processors there is roughly an order of magnitude difference. The overhead increases with the number of dependencies, and is lower for non-cyclic boundary conditions than for cyclic ones. Synchronising all threads with the point-to-point routine is, predictably, very expensive for more than a small number of threads.

Figure 4 shows the results of the same experiments run on the IBM p690. On three or more processors there is about an order of magnitude difference between the point-to-point routines and the OpenMP barrier. The number of neighbours affects the overhead less here than on the Sun system, which might be an effect of the shared cache structure. Indeed, synchronising all threads with the point-to-point routine is significantly cheaper than the OpenMP barrier, suggesting some room for improvement in the implementation of the latter.

Figures 5 and 6 show the overhead of the point-to-point routine for one of the above cases (one dimension, two dependencies, cyclic boundaries) on the Sun and IBM systems respectively. In each case we compare the performance of the OpenMP version shown in Figure 1 with the system-specific versions (without the flush directive) described in Section 2. On the Sun Fire 15K the OpenMP version is significantly faster: this can be attributed to the need to compile the system-specific version with a lower optimisation level. On the IBM there is not much to choose between the versions. We can conclude that on both systems the flush directive is efficiently implemented and there is no reason to prefer a system-specific solution.

In order to assess the impact of using the point-to-point routine, we devised a small benchmark to simulate its use in a code with fine-grained loop parallelism. The kernel of this code is shown in Figure 7.

for (iter = 0; iter < niters; iter++) {
   for (i = 1; i <= n; i++) b[i] = 0.5*(a[i-1] + a[i+1]);
   for (i = 1; i <= n; i++) a[i] = 0.5*(b[i-1] + b[i+1]);
}

Figure 7: Fine-grained loop kernel.

A standard OpenMP version of this code would require two barriers per outer iteration. For the point-to-point routine, each thread only has a dependency on its two neighbouring threads in a one-dimensional grid, with non-cyclic boundaries. Figure 8 shows the speedup obtained on 8 processors of the IBM system with values of N ranging up to 10^6. We observe that for large values of N both versions give good speedups (note that there are some superlinear effects due to caching). For smaller values of N, however, the difference in speedup becomes quite marked: for the smallest values of N no speedup at all can be obtained using barriers, whereas the point-to-point routine still gives a modest but useful speedup of just under 4.

[Figure 8: Speedup of the fine-grained loop kernel on eight processors of the IBM p690 — speedup against iteration count for the barrier and point-to-point versions.]
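To illustrate how the kernel of Figure 7 can be synchronised point-to-point, the sketch below replaces the two barriers with calls to the routine of Figure 1, using nowait and relying on schedule(static) to hand each thread the same contiguous block of iterations in every loop, so that a thread only ever needs data produced by its two neighbouring threads. This is our own sketch of the idea, not the paper's benchmark code; the full flushes around each sync call are a conservative addition, since the flush inside sync covers only the counter array.

#include <omp.h>

void sync(int me, int *neighbour, int numneighbours);   /* Figure 1 */

/* Point-to-point version of the Figure 7 kernel (a sketch).
   Arrays a and b are assumed to have n+2 elements with fixed boundary values. */
void relax(double *a, double *b, int n, int niters)
{
    #pragma omp parallel
    {
        int me = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int neighbour[2], numneighbours = 0;

        /* Non-cyclic 1-D dependency list: left and right neighbour threads. */
        if (me > 0)            neighbour[numneighbours++] = me - 1;
        if (me < nthreads - 1) neighbour[numneighbours++] = me + 1;

        for (int iter = 0; iter < niters; iter++) {
            #pragma omp for schedule(static) nowait
            for (int i = 1; i <= n; i++) b[i] = 0.5*(a[i-1] + a[i+1]);
            /* Make this thread's updates visible, wait only for the two
               neighbours, then pick up their updates. */
            #pragma omp flush
            sync(me, neighbour, numneighbours);   /* replaces the first barrier */
            #pragma omp flush

            #pragma omp for schedule(static) nowait
            for (int i = 1; i <= n; i++) a[i] = 0.5*(b[i-1] + b[i+1]);
            #pragma omp flush
            sync(me, neighbour, numneighbours);   /* replaces the second barrier */
            #pragma omp flush
        }
    }
}

Because sync makes the wait mutual, a thread cannot start overwriting its block in the next loop until its neighbours have finished reading the old values, so both the flow and anti-dependences at the block boundaries are preserved.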

5 Making Point-To-Point Synchronisation Useful

The experiments revealed that, for the dependence patterns studied, point-to-point synchronisation is worth using instead of barriers. For a point-to-point routine to be useful, however, it must be straightforward to use. In this section we discuss how a point-to-point routine might fit into the OpenMP API. We ask: how much detail could be hidden by the API, while not restricting functionality? The aim of this section is to find a good compromise between simplicity and flexibility, i.e. keeping it simple, but still useful.

As discussed before, the OpenMP API provides directives and clauses which allow the programmer to inform the compiler about how variables should be treated. For example, a piece of code which should be executed on only one thread should be labelled with a single directive. This normally contains an implicit barrier, but the barrier can be suppressed with the nowait clause.

There are two places where a point-to-point routine could be useful: as a directive and as a clause. The directive form, pointtopoint(list,length), could replace a barrier in some situations where barriers are currently used. It could be particularly useful where the amount of work required for each datum at each time-step is irregular: a barrier would force all threads to wait for the last one at each time-step, whereas a point-to-point routine would allow more flexibility and reduce the overhead of the load imbalance. To make the routine easier to use, it would be advantageous to provide a number of functions which generate dependency lists for all the commonly used patterns. A programmer should also be able to generate a pattern of their own for cases which are not provided for.

A waitforthreads(list,length) clause could appear wherever a nowait clause is permitted. It would replace the implicit barrier which comes at the end of a construct: each thread waits for all the relevant neighbours to be in place before proceeding. This requires a list which depends on the situation, and we must allow for the possibility that two different sections of code may require two different lists. This would be most easily done by requiring that the programmer pass in a list.

Another useful place for a point-to-point algorithm is after a single section. The non-executing threads only need to wait for the thread which executes the code. As the executing thread is the first thread to arrive at that section, the other threads' lists would need to contain its ID. The communication required of the first thread to pass its own ID would not be significantly more than that required to inform all the others that it is the executing thread. A waitforsingle clause is not necessary: if it is more efficient to use this than an implicit barrier, then it should be used, but the programmer doesn't need to know about it.

We have noted that system-specific implementations of point-to-point synchronisation appear unnecessary. As a result, the portable synchronisation routine of Figure 1 may be used, and there may not be much merit in extending the OpenMP API for this purpose. Finally, we note that a compiler with sufficiently good dependence analysis could automatically replace implicit or explicit barriers with the required point-to-point synchronisation.

6 Conclusion

Barriers can be a significant overhead in OpenMP codes, and they may enforce more synchronisation between threads than is required by the data dependences. A point-to-point synchronisation routine allows a programmer to mould the synchronisation to fit the dependency structure. To measure the benefit, a point-to-point routine was written in C and OpenMP and timed over 1,000,000 iterations for some common data dependence patterns. The point-to-point routine was found to be significantly faster than the OpenMP barrier for the dependence patterns examined, but there was no advantage from system-specific implementations which avoid the use of the flush directive. We have considered how such a routine might be included in the OpenMP API, but the benefits of doing so are unclear. It would be interesting to test our point-to-point routine in a selection of real-world applications to discover whether there is any advantage in practice, as these results suggest.

References

[1] Hay, R.W. and G.R. Hook, POWER4 and shared memory synchronisation, April 2002, available at http://www-1.ibm.com/servers/esdd/articles/power4_mem.html

[2] OpenMP ARB, OpenMP C and C++ Application Program Interface, Version 2.0, March 2002.

[3] OpenMP ARB, OpenMP Fortran Application Program Interface, Version 2.0, November 2000.