Intel MPI Library Conditional Reproducibility

By Michael Steyer, Technical Consulting Engineer, Software and Services Group, Developer Products Division, Intel Corporation

Introduction

High performance computing (HPC) users running numerical codes may encounter cases where floating-point operations produce slightly different results from run to run. Usually this would not be considered a problem, but due to the iterative nature of such applications, small differences can quickly propagate forward through the iterations and combine into larger differences. To address these variations, the Intel Compiler offers several switches that control floating-point precision, and the Intel Math Kernel Library (Intel MKL) Conditional Numerical Reproducibility (CNR) feature [1] provides functions for obtaining reproducible floating-point results. Deterministic reduction algorithms are also available in the Intel OpenMP and Intel Threading Building Blocks (Intel TBB) runtimes. Some of the collective operations of the Intel MPI Library, however, might also lead to slight differences in their results. This article addresses methods that can be used to obtain conditionally reproducible results from the collective operations of the Intel MPI Library.
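For reference, the non-MPI controls mentioned above can be set up along the following lines. This is only a sketch assuming an Intel toolchain: MKL_CBWR is the environment variable behind the Intel MKL CNR feature and -fp-model is the compiler switch family for floating-point semantics, but the accepted values and their effect depend on the compiler and library versions in use, and app.f90 is merely a placeholder source file.

  # Sketch only: non-MPI reproducibility controls referenced in the introduction.
  export MKL_CBWR=AUTO                              # Intel MKL Conditional Numerical Reproducibility
  mpiifort -fp-model strict -o ./app.x ./app.f90    # restrict value-unsafe floating-point optimizations
                                                    # (app.f90 is a placeholder for the application source)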

Motivation

Let's have a look at a simple example with 64 MPI ranks calling an MPI_Reduce operation in which double-precision values are accumulated. Figure 1 shows the Fortran code. Each MPI rank writes a very small number (2^-60) to its local_value variable, except that rank #16 (index 15) writes +1.0 and rank #17 (index 16) writes -1.0. The local_value fields of all ranks are then accumulated into a global sum using MPI_Reduce. After the reduction operation, rank 0 writes out global_sum with up to 20 digits after the decimal point.

  program rep
      use mpi
      implicit none
      integer :: n_ranks, rank, errc
      real*8  :: global_sum, local_value

      call MPI_Init(errc)
      call MPI_Comm_size(MPI_COMM_WORLD, n_ranks, errc)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, errc)

      ! Every rank contributes a tiny value, except rank #16 (+1.0) and rank #17 (-1.0).
      local_value = 2.0 ** (-60)
      if(rank.eq.15) local_value = +1.0
      if(rank.eq.16) local_value = -1.0

      ! Accumulate all local values into a global sum on rank 0.
      call MPI_Reduce(local_value, global_sum, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, 0, MPI_COMM_WORLD, errc)

      if(rank.eq.0) write(*,'(f22.20)') global_sum

      call MPI_Finalize(errc)
  end program rep

Figure 1. Fortran 90 accumulation example

Assume we have four nodes available, each with 32 processor cores. Since we could also run our application on just two of these systems, let's consider two different distribution schemas for the MPI ranks:

A) 64 ranks across all four nodes => 16 ranks per node
B) 64 ranks on only two nodes => 32 ranks per node

Due to its highly optimized nature, the Intel MPI Library tries to leverage distributed- and shared-memory resources as efficiently as possible. Depending on the size of the run (number of MPI ranks) and the message sizes that have to be exchanged, the library can choose among several algorithms available for each collective operation. Choosing a topologically aware algorithm for the reduce operation may result in a different order of operations for cases A and B. To reduce the load on the cluster interconnect, such an algorithm accumulates the local (per-node) partial results first and then sends only these results through the cluster network to accumulate the final result:

A) Reduce( Reduce(#1 ... #16) + Reduce(#17 ... #32) + Reduce(#33 ... #48) + Reduce(#49 ... #64) )
B) Reduce( Reduce(#1 ... #32) + Reduce(#33 ... #64) )

The associative law (a + b) + c = a + (b + c) assumes exact computations and, effectively, unlimited precision; it therefore does not apply when limited-precision representations are used. Since floating-point numbers approximate values with a limited number of bits, operations on these values frequently introduce rounding errors, and for a sequence of floating-point operations the total rounding error can depend on the order in which the operations are executed [2]. As a result of the different order of operations in cases A and B, the final reduce can produce slightly different values. While the results may differ slightly, they are all valid according to the IEEE 754 floating-point standard [3].

Let's break down the distribution of ranks for cases A and B from a pure floating-point perspective, which gives a clearer picture of the actual problem:

A) ( ( ... + 2^-60 + (+1) ) + ( (-1) + 2^-60 + ... ) + ... )
B) ( ( ... + 2^-60 + (+1) + (-1) + 2^-60 + ... ) + ... )

In case A, +1 and -1 have to be accumulated together with the very small 2^-60 values; in case B, +1 and -1 cancel each other because they are combined in the same step. (In double precision, for example, 1.0 + 2^-60 rounds back to exactly 1.0, because 2^-60 is far smaller than half a unit in the last place of 1.0, so whether the tiny contributions are absorbed or preserved depends on the order of accumulation.) Depending on the Intel MPI Library runtime configuration (see the collective operation controls in Table 1), this can result in the output shown in Figure 2.

  $ cat ${machinefile_a}
  ehk248:16
  ehs146:16
  ehs231:16
  ehs145:16
  $ cat ${machinefile_b}
  ehk248:32
  ehs146:32
  ehs231:0
  ehs145:0
  $ mpiifort -fp-model strict -o ./rep.x ./rep.f90
  $ export I_MPI_ADJUST_REDUCE=3
  $ mpirun -n 64 -machinefile ${machinefile_a} ./rep.x
  0.00000000000000000000
  $ mpirun -n 64 -machinefile ${machinefile_b} ./rep.x
  0.00000000000000004163

Figure 2. Getting diverse floating-point results

Preparation

Before addressing Intel MPI Library reproducibility, we should make sure that all other parts of the application produce numerically stable results. The OpenMP standard, for example, as a frequently used threading extension to MPI, does not specify the order in which partial sums should be combined; the outcome of an OpenMP reduction can therefore vary from run to run depending on the runtime parameters. The Intel OpenMP runtime provides the environment variable KMP_DETERMINISTIC_REDUCTION, which can be used to control this behavior [4]. The Intel TBB library also supports deterministic reductions through the parallel_deterministic_reduce function [5]. Read more about using both the Intel Compiler and Intel MKL in the article "Using the Intel Math Kernel Library and Intel Compilers to Obtain Run-to-Run Numerical Reproducible Results" [6].
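For the OpenMP side of a hybrid MPI/OpenMP run, a minimal preparation sketch could look as follows. This assumes the Intel OpenMP runtime; to our understanding, the reduction order only stays fixed if the thread count (and the loop scheduling) is also kept stable between runs.

  # Sketch: stabilize OpenMP reductions of a hybrid run before tuning the MPI side.
  export OMP_NUM_THREADS=8                    # keep the number of threads constant between runs
  export KMP_DETERMINISTIC_REDUCTION=true     # request a deterministic reduction order (Intel runtime)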

Reproducibility

To set the expectations explicitly, we need to differentiate the terms repeatable and reproducible. Furthermore, whenever we use the term reproducible, we always mean conditional reproducibility.

Repeatable: Provides consistent results if the application is launched under exactly the same conditions, repeating the run on the same machine and configuration.

Reproducible (conditional): Provides consistent results even if the distribution of ranks differs, as long as the number of ranks (and the number of threads for hybrid applications) stays the same. The runtime environment, including the microarchitecture, also has to be consistent [7].

All Intel MPI Library operations guarantee repeatable results. Reproducibility of Intel MPI Library operations is guaranteed under the following conditions:

1. Do not use topologically aware algorithms inside the collective reduction operations.
2. Avoid the recursive doubling algorithm for the MPI_Allreduce operation.
3. Avoid MPI_Reduce_scatter_block as well as the MPI-3 nonblocking collective operations.

The first condition can be met by explicitly setting the corresponding collective reduction algorithm through the I_MPI_ADJUST_ family of environment variables (a short sketch follows Table 1). Detailed documentation can be found in the Collective Operation Control chapter of the Intel MPI Library Reference Manual [8]; it clearly states which algorithms are topologically aware and should therefore be avoided. Table 1 shows the five collective operations that use reductions and the corresponding Intel MPI Library environment variables. Set these accordingly in order to select the nontopologically aware algorithms (fulfilling the first condition above).

Table 1. Collective MPI operations using reductions

  Collective operation    Control environment variable     Nontopologically aware algorithms
  MPI_Allreduce           I_MPI_ADJUST_ALLREDUCE           (1) [a], 2, 3, 5, 7, 8, 9 [b]
  MPI_Exscan              I_MPI_ADJUST_EXSCAN              1
  MPI_Reduce_scatter      I_MPI_ADJUST_REDUCE_SCATTER      1, 2, 3, 4
  MPI_Reduce              I_MPI_ADJUST_REDUCE              1, 2, 5, 7
  MPI_Scan                I_MPI_ADJUST_SCAN                1

  [a] Keep in mind that while the first algorithm of MPI_Allreduce is not topologically aware, it does not guarantee conditionally reproducible results; see the second condition for details.
  [b] The Knomial algorithm (Intel MPI 5.0.2) provides reproducible results only if the I_MPI_ADJUST_<COLLECTIVE-OP-NAME>_KN_RADIX environment variable is kept stable or unmodified.
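As an illustration only, the following settings pick one nontopologically aware algorithm per operation from Table 1. The numbers are a sketch based on the Intel MPI 5.0.2 numbering referenced above and may differ in other library versions; for MPI_Allreduce, the value 2 is chosen because algorithm 1 is the recursive doubling algorithm excluded by the second condition.

  # Sketch: select nontopologically aware reduction algorithms (Table 1, Intel MPI 5.0.2 numbering).
  export I_MPI_ADJUST_ALLREDUCE=2        # nontopological, and not recursive doubling (algorithm 1)
  export I_MPI_ADJUST_EXSCAN=1
  export I_MPI_ADJUST_REDUCE_SCATTER=1
  export I_MPI_ADJUST_REDUCE=1
  export I_MPI_ADJUST_SCAN=1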

To see which algorithms are currently selected, set the environment variable I_MPI_DEBUG=6 and review the output (a short command-line sketch of this check appears below, after the discussion of the second condition). The default algorithms for collective operations can differ depending on the size of the run (number of ranks) as well as the message sizes transferred. Figure 3 shows the debug output for the collective operations used in the simple MPI reduce application introduced earlier.

  [0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 257-512
  [0] MPI startup(): Reduce_scatter: 4: 0-5 & 513-2147483647
  [0] MPI startup(): Reduce_scatter: 5: 5-307 & 513-2147483647
  [0] MPI startup(): Reduce_scatter: 1: 307-1963 & 513-2147483647
  [0] MPI startup(): Reduce_scatter: 3: 1963-2380781 & 513-2147483647
  [0] MPI startup(): Reduce_scatter: 4: 0-2147483647 & 513-2147483647
  [0] MPI startup(): Reduce: 1: 0-2147483647 & 0-2147483647
  [0] MPI startup(): Scan: 0: 0-2147483647 & 0-2147483647
  [0] MPI startup(): Scatter: 1: 1-494 & 0-32
  [0] MPI startup(): Scatter: 2: 495-546 & 0-32
  [0] MPI startup(): Scatter: 1: 547-1117 & 0-32
  [0] MPI startup(): Scatter: 3: 0-2147483647 & 0-32
  [0] MPI startup(): Scatter: 1: 1-155 & 33-2147483647

Figure 3. Example of selected collective operations

One can see that, for the MPI_Reduce collective operation, the first algorithm is selected by default across all message sizes (0-2147483647) and all numbers of MPI ranks (0-2147483647). This is why it was necessary to select a different, topology-aware algorithm (3) in the example above in order to obtain differing results for the MPI reduction (I_MPI_ADJUST_REDUCE=3).

The second condition can be met by avoiding the recursive doubling algorithm (I_MPI_ADJUST_ALLREDUCE=1) for the MPI_Allreduce operation. While that algorithm keeps the order of MPI ranks stable, the order of operands inside each MPI rank can differ due to the optimizations applied.
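As referenced above, the algorithm selection can be inspected quickly on a given system with the sketch below. The grep pattern simply filters the startup lines shown in Figure 3 and is an assumption: the exact wording of the debug output can vary between Intel MPI Library versions.

  # Sketch: show which reduce algorithms the Intel MPI Library picks for this run.
  export I_MPI_DEBUG=6
  mpirun -n 64 -machinefile ${machinefile_a} ./rep.x 2>&1 | grep "MPI startup(): Reduce"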

If, however, the operation is covered by the commutative law, a + b = b + a, even the recursive doubling algorithm can be used to achieve reproducible results.

The third condition is necessary because MPI_Reduce_scatter_block, as well as the new MPI-3 [9] nonblocking collective operations, is implemented using topology-aware algorithms. These collective operations cannot be adjusted by the Intel MPI Library user (as of version 5.0.2); the algorithms are determined only at runtime, based on certain operation parameters.

Figure 4 shows how to achieve reproducible results for the simple reduction example from the Motivation section of this article by applying a nontopology-aware collective operation algorithm in the Intel MPI Library environment. As we have seen in Figure 3, the first algorithm is already the default; another option here would have been not to set I_MPI_ADJUST_REDUCE at all and to leave the default settings intact.

  $ cat ${machinefile_a}
  ehk248:16
  ehs146:16
  ehs231:16
  ehs145:16
  $ cat ${machinefile_b}
  ehk248:32
  ehs146:32
  ehs231:0
  ehs145:0
  $ mpiifort -fp-model strict -o ./rep.x ./rep.f90
  $ export I_MPI_ADJUST_REDUCE=1
  $ mpirun -n 64 -machinefile ${machinefile_a} ./rep.x
  0.00000000000000004163
  $ mpirun -n 64 -machinefile ${machinefile_b} ./rep.x
  0.00000000000000004163

Figure 4. Getting reproducible floating-point results
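Building on Figure 4, the two rank layouts can also be run back to back and their outputs compared directly. This is only a usage sketch reusing the machinefile variables from the figures; a single line from sort -u indicates that both layouts produced identical output.

  # Sketch: verify run-to-run consistency across the two rank layouts of Figure 4.
  export I_MPI_ADJUST_REDUCE=1
  for mf in ${machinefile_a} ${machinefile_b}; do
      mpirun -n 64 -machinefile ${mf} ./rep.x
  done | sort -u     # one unique line means both layouts gave the same result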

Keep in mind that while the distribution of MPI ranks across the nodes changed, all other parameters, such as the number of ranks and the architecture used, were kept stable. This is necessary because, according to the definition of conditional reproducibility, the runtime environment has to be the same.

Intel Xeon Phi Coprocessor

When discussing conditional reproducibility for the Intel MPI Library, there is no difference between the treatment of an Intel Xeon processor and an Intel Xeon Phi coprocessor; the same considerations discussed above apply to both. This allows users to integrate the Intel Xeon Phi coprocessor transparently into their HPC solutions. Remember, however, that different microarchitectures and instruction sets also come with different hardware rounding support, which can lead to different results between the two microarchitectures. Also, as defined in the Reproducibility section of this article, the conditions have to be the same; therefore, the number of threads and the number of MPI ranks have to be stable.

Summary

In this article, we have shown several methods of directing the Intel MPI Library to use algorithms that guarantee deterministic reductions for the different collective MPI operations. We also demonstrated the impact of such algorithms using a simple MPI reduce example, moving from a repeatable to a conditionally reproducible outcome. This was achieved without any need to modify the application's source code. The Intel MPI Library offers algorithms to gather conditionally reproducible results even when the MPI rank distribution differs from run to run. It is important to understand that all other parameters, such as the number of ranks and the microarchitecture, have to be equal from run to run in order to fulfill the requirements for conditionally reproducible results.

End Notes

1. T. Rosenquist, "Introduction to Conditional Numerical Reproducibility (CNR)," Intel Corporation, 2012.
2. D. Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic," Association for Computing Machinery, Inc., 1991.
3. IEEE Standard for Binary Floating-Point Arithmetic, The Institute of Electrical and Electronics Engineers, Inc., 1985.
4. M.J. Corden and D. Kreitzer, "Consistency of Floating-Point Results using the Intel Compiler," Intel Corporation, 2012.

5. A. Katranov, "Deterministic Reduction: A New Community Preview Feature in Intel Threading Building Blocks," Intel Corporation, 2012.
6. T. Rosenquist and S. Story, "Using the Intel Math Kernel Library and Intel Compilers to Obtain Run-to-Run Numerical Reproducible Results," Intel Parallel Universe Magazine, 2012.
7. Even if the target application is compiled for a single vector instruction set such as AVX, running it on different microarchitectures such as Sandy Bridge or Haswell might cause libraries to use different vector instruction sets based on the available microarchitecture. See "Consistency of Floating-Point Results using the Intel Compiler" for more information.
8. Intel MPI Library Documentation, Intel Corporation, 2015.
9. MPI: A Message-Passing Interface Standard, Version 3.0, Message Passing Interface Forum, 2012.