Mixed MPI-OpenMP EUROBEN kernels


Mixed MPI-OpenMP EUROBEN kernels
Filippo Spiga (on behalf of CINECA)
PRACE Workshop "New Languages & Future Technology Prototypes", March 1-2, LRZ, Germany

Outline
- Short kernel description
- MPI and OpenMP paradigms
- Objectives and porting activities
- Performance and results
- Conclusions, remarks and future work
Probably nothing here is new, but this could be a good starting point for important and relevant considerations about the current HPC ecosystem!

OBJECTIVES AND PORTING ACTIVITIES

Objectives
1. Starting from a simple (serial) C kernel, realize a parallel mixed version based on MPI and OpenMP (two de facto standards) -> performance
2. Starting from a simple (serial) C kernel, evaluate the effort of the porting activity to the mixed version -> productivity
The kernels were chosen because they are representative of complex computational kernels inside real applications.

Porting activity: the path covered
From the serial version to a multi-threaded version:
- using OpenMP (explicit approach)
- using a multi-threaded library (implicit approach)
From the serial version to a distributed parallel version:
- using the Message Passing Interface (MPI)
and then mixing the multi-threaded and distributed parallel versions.
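As an illustration of the final mixing step, here is a minimal hybrid MPI+OpenMP skeleton in C. It is a sketch, not code from the EUROBEN kernels, and it assumes the MPI-2 MPI_Init_thread interface with the MPI_THREAD_FUNNELED level, i.e. only the master thread of each process calls MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int provided, rank, nprocs;

        /* Ask for FUNNELED support: only the master thread will call MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (provided < MPI_THREAD_FUNNELED && rank == 0)
            printf("Warning: MPI_THREAD_FUNNELED not available\n");

        /* Each MPI process spawns an OpenMP team for its local work. */
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            printf("rank %d of %d, thread %d of %d\n",
                   rank, nprocs, tid, nthreads);
        }

        /* Communication is kept outside the parallel region, which is
           consistent with the FUNNELED thread level. */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

This is the general structure behind the mixed versions: MPI decomposes the problem across processes, while each process uses OpenMP threads (or a multi-threaded library) for its local share of the work.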

Porting activity: development platform
PRACE prototype INTI (provided by CEA):
- Bull cluster composed of 128 nodes (1024 cores)
- dual-socket quad-core Intel Nehalem EP @ 2.53 GHz
- 24 GB of memory on each node
- InfiniBand interconnect
- Intel compiler suite (v11.1.038)
- Math Kernel Library (v10.0.010)
- Open MPI 1.3.2

Porting activity: mod2am
Explicit multi-threading using OpenMP:
- inner/middle/outer parallel loop & loop exchange with unrolling
- refinements to allow automatic compiler SSE vectorization
Implicit multi-threading using numerical libraries:
- CBLAS (open-source and MKL)
MPI parallelization:
- 1D and 2D (Cannon) block decomposition
- MPI communications based on MPI_Send/MPI_Recv, MPI_Bcast, MPI_Isend/MPI_Irecv, MPI_Sendrecv/MPI_Cart_*

Porting activity: mod2am (details)
mod2am --> v0 [ORIGINAL KERNEL]
mod2am_omp-unrolled4 --> v0.1 [NOT COMMITTED]
mod2am_omp-1_loop --> v0.2
mod2am_omp-2_loop --> v0.3
mod2am_omp-3_loop --> v0.4
mod2am_omp-nested --> v0.5
mod2am_omp-cblas --> v0.6
mod2am_mpi-1d --> v1.0
mod2am_mpi-1d-bcast --> v1.1
mod2am_mpi-1d-sendrecv --> v1.2
mod2am_mpi-1d-sendrecv-nonblock --> v1.3
mod2am_mpi-2d-cannon --> v2.0
mod2am_mpi-2d-cannon --> v2.0.1 (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock --> v2.1
mod2am_mpi-2d-cannon-nonblock --> v2.1.1 + (-D CUBLAS)
mod2am_mpi-2d-cannon-nonblock --> v2.1.2 + (-D CUBLAS -D PREPOSTED_NONBLOCKING)
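mod2am is a dense matrix-matrix multiplication (C = A x B). The sketch below is not the kernel's actual source; assuming row-major storage, it merely illustrates the two routes listed above: explicit OpenMP parallelization of the outermost loop, and implicit multi-threading by delegating the whole product to a threaded CBLAS such as MKL.

    #include <omp.h>
    #include <cblas.h>   /* or mkl.h when linking against MKL's CBLAS */

    /* Explicit approach: parallelize the loop over the rows of C.
       A is m x k, B is k x n, C is m x n, all row-major. */
    void mxm_omp(int m, int n, int k,
                 const double *A, const double *B, double *C)
    {
        #pragma omp parallel for
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int l = 0; l < k; l++)
                    sum += A[i*k + l] * B[l*n + j];
                C[i*n + j] = sum;
            }
    }

    /* Implicit approach: a single call to a multi-threaded BLAS. */
    void mxm_cblas(int m, int n, int k,
                   const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);
    }

The MPI versions (1D and 2D Cannon decompositions) distribute blocks of A and B among the processes and perform a local multiplication of this kind on each block.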

Porting activity: mod2as
Explicit multi-threading using OpenMP:
- both for 0-index and 1-index CSR
Implicit multi-threading using numerical libraries:
- Sparse BLAS (open-source* and MKL)
MPI parallelization:
- trivial block-striped partitioning among all processors
* NIST (http://math.nist.gov/spblas), not multi-threaded

Porting activity: mod2as (details)
mod2as --> v0 [ORIGINAL KERNEL]
mod2as_omp --> v0.1
mod2as_omp-opt --> v0.2
mod2as_omp-opt-csr_0_index --> v0.3.0 [0-index CSR]
mod2as_omp-opt-csr_1_index --> v0.3.1 [1-index CSR]
mod2as_omp-sblas --> v0.4 [Sparse BLAS library (NIST interface)]*
mod2as_omp_sblas-mkl --> v0.5 [Sparse BLAS provided by Intel MKL]
mod2as_mpi-simple --> v1.0 [trivial block-striped partitioning among all processors]
mod2as_mpi-sblas-mkl --> v1.1 [local calculation using MKL and final MPI_Reduce]
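mod2as is a sparse matrix-vector multiplication. The following sketch, which again is not the project's source, shows the explicit OpenMP approach on a 0-indexed CSR matrix (the layout referred to by v0.3.0 above); the implicit alternative replaces this loop with a Sparse BLAS call from the NIST reference implementation or from MKL.

    #include <omp.h>

    /* y = A*x with A stored in 0-indexed CSR format:
       row_ptr has nrows+1 entries, col_idx and val have row_ptr[nrows] entries. */
    void csr_matvec(int nrows,
                    const int *row_ptr, const int *col_idx, const double *val,
                    const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int jj = row_ptr[i]; jj < row_ptr[i+1]; jj++)
                sum += val[jj] * x[col_idx[jj]];
            y[i] = sum;
        }
    }

In the MPI versions, the trivial block-striped partitioning gives each process a contiguous set of rows and runs a local kernel of this kind; v1.1 does the local calculation with MKL and combines the partial results with a final MPI_Reduce.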

Porting activity: mod2f
Explicit multi-threading using OpenMP:
- not done
Implicit multi-threading using numerical libraries:
- FFTW2 & FFTW3 (open-source)
- MKL DFTI
- MKL wrappers for FFTW2 and FFTW3
MPI parallelization:
- MPI FFTW (not multi-threaded)
- MKL 1D Cluster FFT (natively multi-threaded)

Porting activity: mod2f (details)
mod2f --> v0 [ORIGINAL KERNEL]
mod2f_fftw --> v0.1 [multi-threaded FFT provided by the FFTW library]
mod2f_fftw_mkl --> v0.1.1 [multi-threaded FFT provided by the MKL FFTW wrapper]
mod2f_fftw3 --> v0.2 [multi-threaded FFT provided by the FFTW3 library]
mod2f_fftw3_mkl --> v0.2.1 [multi-threaded FFT provided by the MKL FFTW3 wrapper]
mod2f_mkl --> v0.3 [the same as mkl/lrz/mod2f_mkl, with small modifications]
mod2f_mpi --> v1.0 [the same as base/c-mpi... 2D transformation!]
mod2f_mpi_pfftw --> v1.1 [1D distributed FFT using FFTW, not multi-threaded]
mod2f_mpi_mk --> v1.2 [1D distributed FFT using MKL Cluster FFT]
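For mod2f (a 1D complex FFT), implicit multi-threading means planning the transform with a threaded FFT library. Below is a minimal sketch using the FFTW3 threads interface (one of the libraries listed above; link with -lfftw3_threads -lfftw3, or use the MKL FFTW3 wrapper); the transform length and the constant input are arbitrary example values, not the kernel's actual input set.

    #include <fftw3.h>
    #include <omp.h>

    int main(void)
    {
        const int N = 1 << 20;                  /* example transform length */

        fftw_init_threads();                    /* enable the threaded FFTW */
        fftw_plan_with_nthreads(omp_get_max_threads());

        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);
        for (int i = 0; i < N; i++) {           /* dummy input signal */
            in[i][0] = 1.0;
            in[i][1] = 0.0;
        }

        fftw_plan plan = fftw_plan_dft_1d(N, in, out,
                                          FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_execute(plan);                     /* the threads are used here */

        fftw_destroy_plan(plan);
        fftw_free(in);
        fftw_free(out);
        fftw_cleanup_threads();
        return 0;
    }

The MKL DFTI and Cluster FFT versions follow the same plan/execute/free pattern, only through a different API.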

Porting activity: what is missing?
mod2am:
- Parallel BLAS (PBLAS)
- SUMMA: Scalable Universal Matrix Multiplication Algorithm
- DIMMA: Distribution-Independent Matrix Multiplication Algorithm
mod2as:
mod2f:
- extension to multi-dimensional FFT
- explicit OpenMP parallelization (but would it really be useful?)

PERFORMANCE AND RESULTS

Productivity evaluation: mod2am

               Time [hh:mm]   Effort*   SLOC**   %
OpenMP         0:55                     150      +1.4%
MPI            ~5:00                    339      +129%
OpenMP + MPI   ~6:00

* 1 star = easy, 5 stars = really hard (at a qualitative level)
** number of source lines of code, without comments and blank lines

Productivity evaluation: mod2as

               Time [hh:mm]   Effort*   SLOC**   %
OpenMP         ~2:00                    223      +4.7%
MPI            ~1:10                    238      +11.7%
OpenMP + MPI   ~3:00

* 1 star = easy, 5 stars = really hard (at a qualitative level)
** number of source lines of code, without comments and blank lines

Productivity evaluation: mod2f

               Time [hh:mm]   Effort*   SLOC**   %
OpenMP         2:00                     289      -51%
MPI            ~2d:00:00***             279      -52%
OpenMP + MPI   ~2d:00:00

* 1 star = easy, 5 stars = really hard (at a qualitative level)
** number of source lines of code, without comments and blank lines
*** I spent two days solving one problem, with the help of the Intel forum support

Performance: mod2am (1)
[Chart: performance in MFlops (log scale, 1.0E+01 to 1.0E+05) versus input dimension, for the SERIAL, 8OMP and 4MPI*2OMP configurations.]

Performance: mod2am (2)
[Chart: scalability (0 to 7) versus number of threads (1 to 8), comparing the explicit and implicit multi-threaded versions.]

Performance: mod2as
[Chart: performance in MFlops (1.0E+02 to 2.6E+03) versus input dimension, for the SERIAL, 8OMP and 4MPI*2OMP configurations.]

Performance: mod2f
Oops, we ran out of time. However, Intel has recently published on its developer blog a presentation* about performance comparisons between MKL and FFTW. It covers the same strategies we followed during our porting activities:
- the 1D Cluster FFT implements the distributed calculation using BLACS
- the performance comparisons for the parallel/distributed version were made using input sets larger than ours (up to 2^23)
* URL: http://software.intel.com/en-us/articles/intel-mkl-fft-training-material/

CONCLUSIONS, REMARKS AND FUTURE WORK

General conclusions
- The porting activities concerning MPI-OpenMP were easy and fast.
- OpenMP is easy, but sometimes it is pointless to waste time trying to apply this paradigm (see mod2f).
- For well-known kernels, vendor multi-threaded libraries are usually the winning choice.
- If we want to look only at performance, we need to increase the input data set (especially when we use the distributed versions of the kernels).

Remarks: integrating multi-threaded libraries
- Distributed functions can have different prototypes and different conventions; this requires knowledge of the library.
- Native distributed functions are efficient and fast but do not ensure easy portability.
- Different versions of the same library can have different requirements in terms of linking and naming conventions.
- Use the library safely (and the library itself must be thread-safe): using multi-threaded libraries and OpenMP regions together requires care.
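As an example of the last point, a common pitfall is oversubscription: the application and the library each spawn a full team of threads. The sketch below shows one way to keep this under control with MKL, assuming its mkl_set_num_threads service routine; the exact controls differ between libraries and versions, which is precisely the portability issue noted above.

    #include <omp.h>
    #include <mkl.h>

    /* Sketch: controlling thread counts when OpenMP regions and a
       multi-threaded BLAS (here MKL) coexist in the same code. */
    void mixed_section(int m, int n, int k,
                       const double *A, const double *B, double *C)
    {
        /* Outside any OpenMP region: let MKL use a full team for the GEMM. */
        mkl_set_num_threads(omp_get_max_threads());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);

        /* Inside an OpenMP region: keep each library call sequential,
           otherwise every OpenMP thread may try to spawn its own MKL team. */
        mkl_set_num_threads(1);
        #pragma omp parallel for
        for (int i = 0; i < m; i++)
            cblas_dscal(n, 0.5, &C[i * n], 1);  /* one sequential BLAS call per row */
    }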

Remarks: how to realize the mixing
There are two ways:
1. Serial -> multi-threaded (OpenMP) -> parallel/distributed multi-threaded (OpenMP+MPI)
2. Serial -> parallel/distributed (MPI) -> multi-threaded parallel/distributed (MPI+OpenMP)
Q: But are there differences?
A: Of course! Different goals have to be achieved at different levels.

(Possible) future work
- Replicate the porting activities using Fortran instead of C.
- Performance measurements with/without Simultaneous Multi-Threading (SMT).
- (Try to) evaluate quantitatively the impact of thread affinity (OpenMP) and process placement (MPI).
- Move to other architectures.
- Evaluate the effort (time) needed to support other multi-threaded libraries (from MKL to ACML, ESSL/PESSL, NAG, ...).
- Evaluate whether other (open-source) multi-threaded libraries are more or less efficient in terms of performance than MKL.
- (Try to) use OpenMP to manage transparently and efficiently the workload between multiple accelerators.

Last but not least
Let's start to play with real applications!
The MPI-OpenMP paradigm is mature enough to be used by production codes, and today there are both good compilers and good libraries.
MPI-OpenMP is increasing in importance as a programming model because many pure MPI programs do not exhibit good scalability when using very large numbers (up to 1024) of MPI tasks.
See "Programming models: Hybrid programming with MPI & OpenMP" (Carlo Cavazzoni, CINECA, Italy) during the PRACE workshop on application porting and performance tuning at CSC, Finland (11-12 June, 2009).

THANK YOU FOR YOUR ATTENTION