Hybrid programming with MPI and OpenMP On the way to exascale

Institut du Développement et des Ressources en Informatique Scientifique
Hybrid programming with MPI and OpenMP
On the way to exascale

Trends of hardware evolution
- Main issue: how to deal with power consumption?
  - Simplification and multiplication of the number of cores
  - Many-core processors (like Intel Xeon Phi, IBM BG/Q)
  - Accelerators (like NVIDIA Tesla or AMD Fusion)
  - ARM-based microprocessors (see [...] for complementary information)
- Common characteristics that impact users and applications
  - Huge number of threads of execution: remember that exascale = 1 billion threads of execution!
  - Intensive use of SMT or Hyper-Threading to get good performance (at least 2 to 4 threads per core)
  - Vectorization (SIMD) is required to use the hardware efficiently; the compiler does its best, but that is not enough (yet)
  - Memory per execution thread shrinks

Introduction to hybrid MPI+OpenMP parallelization
For homogeneous architectures without accelerators, there are two well-recognized and mature standards to parallelize applications:
- OpenMP: for shared-memory architectures
  - Directive-based API supporting C/C++ and Fortran, used to create threads (via a parallel region), to choose the data-sharing attribute of variables (PRIVATE or SHARED), to share work among the threads (DO, SECTION and TASK) and to synchronize threads (BARRIER, ATOMIC, CRITICAL, FLUSH)
  - Latest official OpenMP specification: version 3.1 (July 2011)
  - Waiting for version 4.0 (Error Model, NUMA Support, Accelerators and Tasking Extensions)
- MPI: for all kinds of architectures
  - Message-passing library supporting C/C++ and Fortran, used to manage one-sided, point-to-point or collective communications between processes, to define topologies and derived datatypes, to deal with parallel I/O and to synchronize processes
  - Latest official MPI specification: MPI 3.0, released September 21, [...]

Introduction to hybrid MPI+OpenMP parallelization
- The majority of codes are parallelized with MPI or OpenMP
- Nevertheless, with some applications, this approach begins to show limitations on the latest generation of massively parallel architectures, for various reasons:
  - Granularity of the code (it is decreasing)
  - Memory consumption of the application (it no longer fits in what is available)
  - Algorithm and hardware limitations (visible only beyond a certain threshold)
  - Huge load imbalance (very hard to deal with)
  - Overheads (they increase with the number of cores)
- All this leads to disappointing performance and very limited scalability!
- Solving all these issues is far from simple

Introduction to hybrid MPI+OpenMP parallelization
- The main problem is simple: too many MPI processes to manage, with too little work to execute
- How to reduce the number of MPI processes? Replace MPI processes with OpenMP threads!
- That is what is called hybrid programming with MPI and OpenMP
  - OpenMP can be replaced by any threading library
- Take the best of both approaches:
  - MPI to exchange data between nodes
  - OpenMP to benefit from the shared memory inside a node
- Mixing MPI and OpenMP in a two-level parallelization seems natural:
  - It fits the hardware characteristics of various machines (with either fat or thin nodes)
  - It has a lot of advantages but also some drawbacks, so be careful

Introduction to hybrid MPI+OpenMP parallelization
(figure slide; the image is not available in this copy)

Thread Support in MPI
For a multithreaded MPI application, replace MPI_INIT( ) with MPI_INIT_THREAD(REQUIRED, PROVIDED, IERROR)
- MPI_THREAD_SINGLE: only one thread per MPI process; OpenMP cannot be used
- MPI_THREAD_FUNNELED: multiple threads per MPI process, but only the main thread can make MPI calls. MPI calls are made outside OpenMP parallel regions or by the main thread (the one which made the MPI_INIT_THREAD call)
- MPI_THREAD_SERIALIZED: all threads can make MPI calls, but only one at a time. In an OpenMP parallel region, MPI calls have to be made in critical sections
- MPI_THREAD_MULTIPLE: completely multithreaded without restrictions (except for MPI collective calls using the same communicator)
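The rules attached to each level can be illustrated with a small model. The following is a minimal sketch in Python, not MPI code: the names ThreadLevel, support_is_sufficient and mpi_call_allowed are hypothetical, and the numeric values are illustrative (the MPI standard only guarantees that the four levels are ordered, from MPI_THREAD_SINGLE up to MPI_THREAD_MULTIPLE). It encodes the restrictions listed above, plus the usual check that the PROVIDED level is at least the REQUIRED one.

```python
from enum import IntEnum


class ThreadLevel(IntEnum):
    """Hypothetical stand-in for the MPI thread support constants.

    The values are illustrative; only their ordering matters.
    """
    SINGLE = 0
    FUNNELED = 1
    SERIALIZED = 2
    MULTIPLE = 3


def support_is_sufficient(required, provided):
    """Return True if the PROVIDED level covers the REQUIRED level."""
    return provided >= required


def mpi_call_allowed(level, multithreaded, is_main_thread, other_thread_in_mpi):
    """Model the rules above for one MPI call attempted by one thread.

    multithreaded:       the process currently runs more than one thread
    is_main_thread:      the caller is the thread that initialized MPI
    other_thread_in_mpi: another thread is inside an MPI call right now
    """
    if level == ThreadLevel.SINGLE:
        return not multithreaded
    if level == ThreadLevel.FUNNELED:
        return is_main_thread
    if level == ThreadLevel.SERIALIZED:
        return not other_thread_in_mpi
    # MULTIPLE: no restriction in this model (the caveat about collective
    # calls on the same communicator is deliberately left out).
    return True


if __name__ == "__main__":
    required = ThreadLevel.FUNNELED
    provided = ThreadLevel.SERIALIZED
    print("level sufficient:", support_is_sufficient(required, provided))
    print("worker thread may call MPI under FUNNELED:",
          mpi_call_allowed(ThreadLevel.FUNNELED, True, False, False))
```

In a real code the comparison is made on the value returned in PROVIDED right after MPI_INIT_THREAD, and the application stops or falls back to a lower level if the comparison fails.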

Introduction to hybrid MPI+OpenMP parallelization
Drawbacks of the hybrid MPI+OpenMP approach:
- Complexity of the application (especially with MPI_THREAD_MULTIPLE) and the high level of expertise required from developers
- Performance improvements are not guaranteed: good performance and efficiency of both the MPI and the OpenMP parts is mandatory (Amdahl's law applies to both approaches)
- Memory affinity, mapping, binding, etc. have to be carefully managed (the same problems exist with flat MPI or OpenMP codes)
- Data races, deadlocks, race conditions or wrong data-sharing attributes: all the pitfalls of MPI and OpenMP are combined and lead to very complex debugging
- No really mature and robust tools exist to debug hybrid MPI+OpenMP applications at scale (or even on a small number of cores)
So, is hybrid MPI+OpenMP parallelization still worthwhile? Yes, of course. Fortunately, the advantages are even greater

Memory saving
- Memory per thread of execution is scarce, and the hybrid approach optimizes its usage. But why does it save memory?
- Hybrid programming allows adapting the code to the target architecture, which is generally composed of shared-memory nodes (SMP) linked by an interconnect network. The benefit of shared memory inside a node is that data need not be duplicated in order to be exchanged: every thread can access (read/write) SHARED data.
- The ghost or halo cells, introduced to simplify the programming of MPI codes using domain decomposition, are no longer required within the SMP node. Only the ghost cells associated with inter-node communications are mandatory.
- This saving is far from negligible. It depends heavily on the order of the method, on the domain type (2D or 3D), on the domain decomposition (along one or multiple dimensions) and on the number of cores of the SMP node.
- The memory footprint of the system buffers associated with MPI is not negligible and increases with the number of processes. For example, for an InfiniBand network with [...] MPI processes, the memory footprint of the system buffers reaches 300 MB per process, almost 20 TB in total!
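The effect of halo cells on memory can be estimated with simple arithmetic. The sketch below is a deliberately simplified model, not a measurement: it assumes a square 2D domain of n x n cells, a square process grid, a halo of fixed width on all four sides of every subdomain (including those on the physical boundary), and one MPI subdomain per process. The function names and the example sizes are hypothetical.

```python
import math


def total_cells(n, n_procs, halo):
    """Cells stored by all processes, halo included, for an n x n domain.

    Assumes n_procs is a perfect square and divides the domain evenly.
    """
    p = math.isqrt(n_procs)
    if p * p != n_procs or n % p != 0:
        raise ValueError("need a square process grid that divides n evenly")
    local = n // p  # edge length of one subdomain, without halo
    return n_procs * (local + 2 * halo) ** 2


def halo_overhead(n, n_procs, halo):
    """Fraction of extra cells stored because of the halo."""
    return total_cells(n, n_procs, halo) / (n * n) - 1.0


if __name__ == "__main__":
    n, cores, halo = 4096, 1024, 2
    for threads in (1, 4, 16):  # OpenMP threads per MPI process
        procs = cores // threads
        print(f"{procs:5d} MPI x {threads:2d} threads: "
              f"halo overhead = {halo_overhead(n, procs, halo):.2%}")
```

With these assumptions the overhead shrinks as MPI processes are replaced by threads, because fewer and larger subdomains carry proportionally fewer halo cells. A wider halo (higher-order method) or a 3D domain makes the difference larger.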

Memory saving
So much for the theory, but in real life, do we observe any gain in the memory consumption of applications?

Memory saving
Source: "Mixed Mode Programming on HECToR", Anastasios Stathopoulos, August 22, 2010, MSc in High Performance Computing, EPCC
Target machine: HECToR CRAY XT6
Results (the memory per node is expressed in MiB)
Table columns: Code | Pure MPI version (MPI processes, Mem./node) | Hybrid version (MPI x threads, Mem./node) | Memory saving
Table rows: CPMD, BQCD, SP-MZ, IRS, Jacobi (the numeric entries are missing from this copy)

Memory saving
Source: "Performance evaluations of gyrokinetic Eulerian code GT5D on massively parallel multi-core platforms", Yasuhiro Idomura and Sébastien Jolliet, SC11
Executions on 4096 cores on:
- Fujitsu BX900 with Nehalem-EP processors at 2.93 GHz (8 cores and 24 GiB per node)
- Fujitsu FX1 with SPARC64 VII processors at 2.5 GHz (4 cores and 32 GiB per node)
All sizes are given in TiB
Table columns: System | Pure MPI: Total (code+sys) | 4 threads/process: Total (code+sys), Gain | 8 threads/process: Total (code+sys), Gain
Table rows as they survive in this copy (most entries are missing):
BX ( ) 2.69 ( )
FX1 5.4 ( ) 2.83 ( ) ( )

Conclusions on memory saving
- Too often, this aspect is forgotten when talking about hybrid programming.
- However, the potential gains are very significant and could be exploited to increase the size of the problems to be simulated!
- The gap, in terms of memory usage, between the MPI and hybrid approaches will continue to grow rapidly for the next generation of machines:
  - Increase in the total number of cores
  - Rapid increase in the number of cores within an SMP node
  - General use of Hyper-Threading or SMT (the possibility to run multiple threads simultaneously on one core)
  - General use of high-order numerical methods (nearly free computational cost thanks to hardware accelerators)
- This will make the transition to hybrid programming almost mandatory

Going beyond algorithmic limitations
- Some applications are limited in terms of scalability by a physical parameter (the dimension in one direction, for example).
- In the NAS Parallel Benchmark, the problem size defines the notion of zone. The number of MPI processes cannot exceed the number of zones (limited to 1024 for the class D and 2048 for the class E problem size)
- The hybrid version of the code is still limited in terms of MPI processes, but each MPI process can manage multiple OpenMP threads
- The total number of threads of execution is the number of MPI processes times the number of OpenMP threads per MPI process.
- You can gain up to a factor of 4 on a BG/P and a factor of 16 on a BG/Q, with excellent scalability!
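The arithmetic behind this gain is short enough to write down. The sketch below is illustrative only: the function name is hypothetical, and it simply assumes that the number of MPI processes is capped by the number of zones and that each process runs a fixed number of OpenMP threads (4 and 16 are the factors quoted above for BG/P and BG/Q).

```python
def max_execution_threads(n_zones, threads_per_process):
    """Upper bound on threads of execution when MPI processes <= zones."""
    if n_zones < 1 or threads_per_process < 1:
        raise ValueError("both arguments must be positive")
    return n_zones * threads_per_process


if __name__ == "__main__":
    class_d_zones = 1024  # limit quoted above for the class D problem size
    print(max_execution_threads(class_d_zones, 1))   # pure MPI: 1024
    print(max_execution_threads(class_d_zones, 4))   # factor 4: 4096
    print(max_execution_threads(class_d_zones, 16))  # factor 16: 16384
```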

Going beyond algorithmic limitations
(figure slide; the image is not available in this copy)

Performance and scalability
Many factors contribute to increasing the performance and scalability of applications using a hybrid MPI+OpenMP parallelization:
- Better MPI granularity: hybridization allows using the same number of execution cores with a reduced number of MPI processes. Each MPI process has much more work to manage, which improves the granularity of the application
- Better load balancing: for a pure MPI application, dynamic load balancing is very complex to implement and is time consuming (it requires heavy use of message passing). For a hybrid application, dynamic load balancing is easy to manage inside the MPI process (with the DYNAMIC or GUIDED schedule for parallel loops, or directly by hand using the shared memory). Load balancing is a critical factor for massive parallelism and impacts the scalability of the code.
- Optimization of communications: reducing the number of MPI processes minimizes the number of communications and increases the size of messages. Hence, the impact of latency is reduced and the throughput of communications is improved (this matters even more for applications that use collective communications heavily)
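The benefit of a DYNAMIC schedule inside an MPI process can be illustrated with a toy simulation. This is a sketch under stated assumptions, not OpenMP code: iteration costs are known in advance, scheduling overhead is ignored, the static case hands each thread one contiguous block of nearly equal length, and the dynamic case hands the next iteration to whichever thread becomes free first. The function names and the example costs are hypothetical.

```python
import heapq


def static_makespan(costs, n_threads):
    """Finish time when each thread gets one contiguous block of iterations."""
    base, extra = divmod(len(costs), n_threads)
    finish_times = []
    start = 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        finish_times.append(sum(costs[start:start + size]))
        start += size
    return max(finish_times)


def dynamic_makespan(costs, n_threads):
    """Finish time when the next iteration goes to the first free thread."""
    free_at = [0] * n_threads  # a list of equal values is already a heap
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    costs = list(range(1, 17))  # iteration i costs i time units (imbalanced)
    print("static :", static_makespan(costs, 4))
    print("dynamic:", dynamic_makespan(costs, 4))
```

In this model the static split leaves the last thread with all the expensive iterations, while the dynamic assignment ends closer to the ideal of total cost divided by the number of threads. Real OpenMP runs also pay a scheduling overhead for each chunk handed out dynamically, which this sketch leaves out.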

Performance and scalability
- Improved convergence of certain iterative algorithms: if an iterative algorithm uses information relative to the local domain associated with an MPI process, then reducing the number of MPI processes results in bigger local domains carrying much more information. Hence the rate of convergence improves, leading to a better time to solution
- Optimization of I/O: reducing the number of MPI processes leads to fewer simultaneous disk accesses and increases the size of records. As a consequence, metadata servers are less loaded and the record size is much better suited to the disk system
- An approach that fits new architectures (many-cores, ...): with hybrid parallelization, you can naturally create and manage lots of threads, which can be used to oversubscribe cores (SMT or Hyper-Threading) and use the hardware efficiently

Performance and scalability
- The potential performance gains become more important as the number of execution cores grows
- If the hybrid parallelization is well done, the scalability limit of the hybrid version of the code, compared to the flat MPI version, can be improved by a factor of up to the number of cores of the SMP node!
- Let's have a look at a real-life application named HYDRO

Application HYDRO
- HYDRO is a 2D Computational Fluid Dynamics code (~1500 lines of Fortran90) that solves Euler's equations with a Finite Volume Method, using Godunov's scheme and a Riemann solver at each interface on a regular mesh.
- Selected as the PRACE application benchmark for the assessment of WP9 prototypes
- Thanks to many contributors, various versions of HYDRO have been developed:
  - Sequential versions: F90, C99
  - Accelerated versions: HMPP, CUDA, OpenCL
  - Parallel versions: OpenMP (fine and coarse grain), MPI, hybrid MPI+OpenMP
  - Other versions: Cilk cache-oblivious version, X10, ...

HYDRO results
Characteristics of the hybrid version of HYDRO:
- MPI_THREAD_FUNNELED level of thread support (MPI calls are made inside the parallel region, but only by the master thread)
- The MPI parallelization relies on a 2D domain decomposition, with MPI derived datatypes and synchronous communications with neighbors
- The OpenMP parallelization relies on another 2D domain decomposition (coarse-grain approach), with fine synchronization among threads managed by the FLUSH directive to cope with dependencies
We will compare the pure MPI version and the hybrid MPI+OpenMP version of HYDRO
All timings are in seconds (s) and correspond to the elapsed time of the full application

HYDRO results
Goal: is the hybrid approach worthwhile on a moderate number of execution cores?
Target architecture: 2 IBM SP6 nodes (64 cores)
The total number of threads of execution is fixed to 64. The number of OpenMP threads per MPI process varies from 1 (pure MPI version) to 32.
Table columns: MPI x OpenMP per node | Time in s on 64 execution cores
(six configurations, the first with 32 MPI processes per node; the remaining entries and all timings are missing from this copy)

HYDRO results
Goal: determine whether the hybrid approach is more scalable than pure MPI
Target architecture: IBM BG/P (10 racks)
Strong scaling on a high number of execution cores (from 4096 to [...] cores)
All timings are in seconds (s)
Table columns: Number of cores | Pure MPI | Hybrid with 4 threads per MPI process
(five core counts starting at 4096; the other core counts and all timings are missing from this copy)

HYDRO results
- Scalability limit of the pure MPI version: 8192 cores
- Scalability limit of the hybrid version: optimal 16384 cores, sub-optimal [...] cores
- On [...] cores, the hybrid version is more than 6 times faster than the pure MPI version
- The best hybrid version (on [...] cores) is 2.6 times faster than the best pure MPI version (on 8192 cores)

Conclusions
- There is no need for hybrid parallelization if you do not face any scalability or memory consumption problem with your MPI application
- It is a sustainable approach, based on recognized, mature and widely available standards (MPI and OpenMP); it is a long-term investment.
- The advantages of the hybrid approach compared to the pure MPI approach are many:
  - Significant memory saving
  - Gains in performance (on a fixed number of execution cores), through a better adaptation of the code to the target architecture
  - Gains in scalability, pushing the scalability limit of a code by a factor equal to the number of cores of the shared-memory node
- These gains are proportional to the number of cores of the shared-memory node, a number that will increase significantly in the short term (general use of multi/many-core processors)
- It is a durable solution that allows an efficient usage of the next massively parallel architectures (multi-peta, exascale, ...), but it still has to evolve to take accelerators into account (OpenCL, OpenACC, OpenMP 4.0, ...)
