Mixed MPI-OpenMP EUROBEN kernels
Filippo Spiga (on behalf of CINECA)
PRACE Workshop "New Languages & Future Technology Prototypes", March 1-2, LRZ, Germany
Outline
- Short kernel description
- MPI and OpenMP paradigms
- Objectives and porting activities
- Performance and results
- Conclusions, remarks and future work
Probably nothing here is new, but it can be a good starting point for important and relevant considerations about the current HPC ecosystem!
OBJECTIVES AND PORTING ACTIVITIES
Objectives
1. Starting from a simple (serial) C kernel, realize a parallel mixed version based on MPI and OpenMP, the two de facto standards (performance)
2. Starting from a simple (serial) C kernel, evaluate the effort of porting it to the mixed version (productivity)
The kernels were chosen because they are representative of complex computational kernels found inside real applications.
Porting activity: the path covered
- From the serial kernel to a multi-threaded version
  - using OpenMP (explicit approach)
  - using a multi-threaded library (implicit approach)
- From the serial kernel to a distributed parallel version using the Message Passing Interface (MPI)
- ... and then mixing the multi-threaded and distributed parallel versions
Porting activity: development platform
PRACE prototype INTI (provided by CEA)
- Bull cluster composed of 128 nodes (1024 cores)
- dual-socket quad-core Intel Nehalem EP @ 2.53 GHz
- 24 GBytes of memory on each node
- InfiniBand interconnect
- Intel compiler suite (v11.1.038), Math Kernel Library (v10.0.010), Open MPI 1.3.2
Porting activity: mod2am (dense matrix-matrix multiplication)
- Explicit multi-threading using OpenMP
  - inner/middle/outer parallel loop & loop exchange with unrolling
  - refinements to allow automatic compiler SSE vectorization
- Implicit multi-threading using numerical libraries
  - CBLAS (open-source and MKL)
- MPI parallelization
  - 1D and 2D (Cannon) block decompositions
  - MPI communication based on MPI_Send/MPI_Recv, MPI_Bcast, MPI_Isend/MPI_Irecv, MPI_Sendrecv with MPI_Cart_* topology routines
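As an illustration of the explicit approach, here is a minimal sketch of an outer-loop OpenMP parallelization of the dense product C = A*B (row-major storage, no unrolling or blocking); variable names are illustrative and not the kernel's actual identifiers:

```c
/* Sketch: outer-loop OpenMP parallelization of C = A*B (row-major).
 * Each thread computes a block of rows of C independently. */
#include <omp.h>

void matmul_omp(int m, int n, int k,
                const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += a[i * k + l] * b[l * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Parallelizing the middle or inner loop instead moves the pragma down one or two levels; the implicit variant simply replaces the whole loop nest with a call such as cblas_dgemm.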
Porting activity: mod2am (details)
mod2am                            --> v0 [ORIGINAL KERNEL]
mod2am_omp-unrolled4              --> v0.1 [NOT COMMITTED]
mod2am_omp-1_loop                 --> v0.2
mod2am_omp-2_loop                 --> v0.3
mod2am_omp-3_loop                 --> v0.4
mod2am_omp-nested                 --> v0.5
mod2am_omp-cblas                  --> v0.6
mod2am_mpi-1d                     --> v1.0
mod2am_mpi-1d-bcast               --> v1.1
mod2am_mpi-1d-sendrecv            --> v1.2
mod2am_mpi-1d-sendrecv-nonblock   --> v1.3
mod2am_mpi-2d-cannon              --> v2.0
mod2am_mpi-2d-cannon              --> v2.0.1 + (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock     --> v2.1
mod2am_mpi-2d-cannon-nonblock     --> v2.1.1 + (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock     --> v2.1.2 + (-D CUBLAS -D PREPOSTED_NONBLOCKING)
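For the v1.x family, a minimal sketch of the 1D row-block decomposition (assuming the number of rows divides evenly among the ranks; buffer and function names are illustrative):

```c
/* Sketch: 1D row-block MPI decomposition of C = A*B.
 * Rank 0 scatters row blocks of A, B is broadcast to all ranks,
 * every rank computes its rows of C, results are gathered on rank 0. */
#include <stdlib.h>
#include <mpi.h>

void matmul_mpi_1d(int m, int n, int k,
                   double *a, double *b, double *c, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int mloc = m / size;   /* rows of A and C owned by each rank */
    double *a_loc = malloc((size_t)mloc * k * sizeof *a_loc);
    double *c_loc = malloc((size_t)mloc * n * sizeof *c_loc);

    MPI_Scatter(a, mloc * k, MPI_DOUBLE, a_loc, mloc * k, MPI_DOUBLE, 0, comm);
    MPI_Bcast(b, k * n, MPI_DOUBLE, 0, comm);

    for (int i = 0; i < mloc; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++)
                sum += a_loc[i * k + l] * b[l * n + j];
            c_loc[i * n + j] = sum;
        }

    MPI_Gather(c_loc, mloc * n, MPI_DOUBLE, c, mloc * n, MPI_DOUBLE, 0, comm);
    free(a_loc);
    free(c_loc);
}
```

The v1.1-v1.3 variants presumably differ mainly in the communication primitives used (MPI_Bcast, MPI_Sendrecv, non-blocking MPI_Isend/MPI_Irecv), while the 2D Cannon versions circulate blocks of A and B around a process torus instead of broadcasting all of B.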
Porting activity: mod2as (sparse matrix-vector multiplication, CSR format)
- Explicit multi-threading using OpenMP, both for 0-index and 1-index CSR
- Implicit multi-threading using numerical libraries
  - Sparse BLAS (open-source* and MKL)
- MPI parallelization
  - trivial block-striped partitioning among all processors
* NIST (http://math.nist.gov/spblas), not multi-threaded
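A minimal sketch of the explicit OpenMP route for the 0-index CSR sparse matrix-vector product (array names are illustrative, not the kernel's actual identifiers):

```c
/* Sketch: OpenMP parallelization of y = A*x with A stored in 0-based CSR.
 * Rows are independent, so the outer loop is split among the threads;
 * load balance depends on how the nonzeros are distributed per row. */
#include <omp.h>

void spmv_csr_omp(int nrows, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```

The 1-index variant only shifts the row_ptr/col_idx values by one; the implicit variants call the NIST Sparse BLAS or MKL instead of this loop, and the MPI version gives each process a contiguous block of rows.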
Porting activity: mod2as (details)
mod2as                       --> v0 [ORIGINAL KERNEL]
mod2as_omp                   --> v0.1
mod2as_omp-opt               --> v0.2
mod2as_omp-opt-csr_0_index   --> v0.3.0 [0-index CSR]
mod2as_omp-opt-csr_1_index   --> v0.3.1 [1-index CSR]
mod2as_omp-sblas             --> v0.4 [Sparse BLAS library (NIST interface)]*
mod2as_omp_sblas-mkl         --> v0.5 [Sparse BLAS provided by Intel MKL]
mod2as_mpi-simple            --> v1.0 [trivial block-striped partitioning among all processors]
mod2as_mpi-sblas-mkl         --> v1.1 [local calculation using MKL and final MPI_Reduce]
Porting activity: mod2f (1D Fast Fourier Transform)
- Explicit multi-threading using OpenMP: not done
- Implicit multi-threading using numerical libraries
  - FFTW2 & FFTW3 (open-source)
  - MKL DFTI
  - MKL wrappers for FFTW2 and FFTW3
- MPI parallelization
  - MPI FFTW (not multi-threaded)
  - MKL 1D Cluster FFT (natively multi-threaded)
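A minimal sketch of the implicit route using the FFTW3 threading interface (link with -lfftw3_threads -lfftw3 -lm; the transform length is illustrative):

```c
/* Sketch: 1D complex-to-complex forward FFT with FFTW3's threaded engine.
 * fftw_plan_with_nthreads() must be called before the plan is created. */
#include <fftw3.h>
#include <omp.h>

int main(void)
{
    const int n = 1 << 20;                        /* 1D transform length */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    fftw_init_threads();                          /* enable the threaded engine */
    fftw_plan_with_nthreads(omp_get_max_threads());

    fftw_plan plan = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) {                 /* dummy input signal */
        in[i][0] = 1.0;
        in[i][1] = 0.0;
    }
    fftw_execute(plan);                           /* multi-threaded transform */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    fftw_cleanup_threads();
    return 0;
}
```

The MKL FFTW3 wrapper exposes the same API, so in principle only the link line changes, which is what makes the implicit approach so cheap in terms of effort.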
Porting activity: mod2f (details)
mod2f             --> v0 [ORIGINAL KERNEL]
mod2f_fftw        --> v0.1 [multi-threaded FFT provided by the FFTW library]
mod2f_fftw_mkl    --> v0.1.1 [multi-threaded FFT provided by the MKL FFTW wrapper]
mod2f_fftw3       --> v0.2 [multi-threaded FFT provided by the FFTW3 library]
mod2f_fftw3_mkl   --> v0.2.1 [multi-threaded FFT provided by the MKL FFTW3 wrapper]
mod2f_mkl         --> v0.3 [the same as mkl/lrz/mod2f_mkl, small modifications]
mod2f_mpi         --> v1.0 [the same as base/c-mpi... 2D transformation!]
mod2f_mpi_pfftw   --> v1.1 [1D distributed FFT using FFTW, not multi-threaded]
mod2f_mpi_mk      --> v1.2 [1D distributed FFT using MKL Cluster FFT]
Porting activity: what is missing?
mod2am
- Parallel BLAS (PBLAS)
- SUMMA: Scalable Universal Matrix Multiplication Algorithm
- DIMMA: Distribution-Independent Matrix Multiplication Algorithm
mod2as
mod2f
- extension to multi-dimensional FFT
- explicit OpenMP parallelization (but would it really be useful?)
PERFORMANCE AND RESULTS
Productivity evaluation: mod2am

               Time [hh:mm]   Effort*   SLOC**   % vs. serial
OpenMP         0:55                     150      +1.4%
MPI            ~5:00                    339      +129%
OpenMP + MPI   ~6:00

*  1 star = easy ... 5 stars = really hard (qualitative)
** number of source lines of code, without comments and blank lines
Productivity evaluation: mod2as

               Time [hh:mm]   Effort*   SLOC**   % vs. serial
OpenMP         ~2:00                    223      +4.7%
MPI            ~1:10                    238      +11.7%
OpenMP + MPI   ~3:00

*  1 star = easy ... 5 stars = really hard (qualitative)
** number of source lines of code, without comments and blank lines
Productivity evaluation: mod2f

               Time           Effort*   SLOC**   % vs. serial
OpenMP         2:00                     289      -51%
MPI            ~2 days***               279      -52%
OpenMP + MPI   ~2 days

*   1 star = easy ... 5 stars = really hard (qualitative)
**  number of source lines of code, without comments and blank lines
*** two days were spent solving one problem, thanks to the help of the Intel forum support
Performance: mod2am (1)
[Chart: performance in MFlops (log scale) vs. input dimension, for the SERIAL, 8 OMP and 4 MPI x 2 OMP configurations]
Performance: mod2am (2)
[Chart: scalability vs. number of threads (1-8), comparing the explicit (OpenMP) and implicit (library) versions]
Performance: mod2as
[Chart: performance in MFlops vs. input dimension, for the SERIAL, 8 OMP and 4 MPI x 2 OMP configurations]
Performance: mod2f
Oops, we ran out of time. However:
- Intel has recently published on its developer blog a presentation* about performance comparisons between MKL and FFTW; it covers the same strategies we followed during our porting activities
- the 1D Cluster FFT implements the distributed calculation using BLACS
- the performance comparisons for the parallel/distributed versions were made using input sets larger than ours (up to 2^23)
* URL: http://software.intel.com/en-us/articles/intel-mkl-fft-training-material/
CONCLUSIONS, REMARKS AND FUTURE WORK
General conclusions
- The MPI-OpenMP porting activities were easy and fast
- OpenMP is easy, but sometimes it is pointless to waste time trying to apply this paradigm (see mod2f)
- For well-known kernels, vendor multi-threaded libraries are usually the winning choice
- If we want to look only at performance, we need to increase the input data set (especially when we use the distributed versions of the kernels)
Remarks: integrating multi-threaded libraries
- Distributed functions may have different prototypes and different conventions: this requires knowledge of the library
- Native distributed functions are efficient and fast, but do not guarantee easy portability
- Different versions of the same library may have different requirements in terms of linking and naming conventions
- Use the library safely (and the library itself must be thread-safe): using multi-threaded libraries and OpenMP regions together requires care
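A minimal sketch of the last point, assuming MKL as the multi-threaded library (mkl_set_num_threads and omp_get_max_threads are the standard MKL/OpenMP calls; the dgemm parameters are illustrative):

```c
/* Sketch: avoiding oversubscription when a threaded library and explicit
 * OpenMP regions coexist in the same code. */
#include <omp.h>
#include <mkl.h>

void mixed_usage(int m, int n, int k, const double *a, const double *b, double *c)
{
    /* Called from serial code: let MKL use all the cores of the node. */
    mkl_set_num_threads(omp_get_max_threads());
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a, k, b, n, 0.0, c, n);

    /* Before an explicit OpenMP region that itself calls the library,
     * force the library to run sequentially inside each thread,
     * otherwise threads * library-threads workers would be spawned. */
    mkl_set_num_threads(1);
    #pragma omp parallel
    {
        /* ... per-thread, now sequential, library calls ... */
    }
}
```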
Remarks: how to realize the mixing
There are two ways:
1. Serial -> multi-threaded (OpenMP) -> parallel/distributed multi-threaded (OpenMP+MPI)
2. Serial -> parallel/distributed (MPI) -> multi-threaded parallel/distributed (MPI+OpenMP)
Q: But are there differences?
A: Of course! Different goals have to be achieved at each level.
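Whichever route is taken, the code converges on the same hybrid skeleton: MPI initialized with an explicit threading level, OpenMP regions for the local computation, MPI calls funneled through the master thread. A minimal, illustrative sketch (not the actual kernel code):

```c
/* Sketch: hybrid MPI+OpenMP skeleton with MPI_THREAD_FUNNELED,
 * i.e. only the master thread of each process makes MPI calls. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Multi-threaded local computation on this rank's block of data. */
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Communication happens back in the master-only (serial) part. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```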
(Possible) Future Work
- Replicate the porting activities using Fortran instead of C
- Performance measurements with/without Simultaneous Multi-Threading (SMT)
- (Try to) evaluate quantitatively the impact of thread affinity (OpenMP) and process placement (MPI)
- Move to other architectures
- Evaluate the effort (time) needed to support other multi-threaded libraries (from MKL to ACML, ESSL/PESSL, NAG, ...)
- Evaluate whether other (open-source) multi-threaded libraries are more or less efficient than MKL in terms of performance
- (Try to) use OpenMP to manage transparently and efficiently the workload across multiple accelerators
Last but not least
Let's start to play with real applications!
- The MPI-OpenMP paradigm is mature enough to be used by production codes, and today there are both good compilers and good libraries.
- MPI-OpenMP is increasing in importance as a programming model because many pure MPI programs do not exhibit good scalability with very large numbers of MPI tasks (up to 1024).
See "Programming models: Hybrid programming with MPI & OpenMP" (Carlo Cavazzoni, CINECA, Italy), PRACE workshop on application porting and performance tuning, CSC, Finland (11-12 June 2009).
THANK YOU FOR YOUR ATTENTION