Myths and reality of communication/computation overlap in MPI applications


Alessandro Fanfarillo, National Center for Atmospheric Research, Boulder, Colorado, USA (elfanfa@ucar.edu). NCAR SEA, Oct 12th, 2017.

Disclaimer

- This presentation is not meant to provide official performance results.
- It is not meant to judge, test, or blame any compiler, MPI implementation, network interconnect, MPI Forum decision, or NCAR staff.
- It is only meant to clarify some of the misconceptions behind the communication/computation overlap mechanism in MPI applications.
- Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of NCAR, UCAR, or NSF.

Short Historical Introduction to MPI

- Early 80s: the Message Passing model was dominant, and every machine/network producer had its own message-passing protocol.
- 1992: a group of researchers from academia and industry decided to design a standardized and portable message-passing system called MPI, the Message Passing Interface.
- The MPI specification provides over 200 API functions with a specific, defined behavior to which an implementation must conform.
- There are many cases where the standard is (intentionally?) ambiguous and the MPI implementer has to make a decision. Asynchronous message progress is one of them.

Common Myths on Communication/Computation Overlap

"My application overlaps communication with computation because..."
- ...it uses non-blocking MPI routines
- ...it uses one-sided (Remote Memory Access) MPI routines
- ...it runs on a network with native support for Remote Direct Memory Access (e.g. Mellanox InfiniBand)

MPI courses usually explain how to get correct results while hoping for communication/computation overlap. We need a better definition of what overlap actually means...

Overlap and MPI Progress

- Overlap is a characteristic of the network layer: it is the NIC's capability to take care of data transfers without direct involvement of the host processor, thus leaving the CPU free for computation.
- Progress is a characteristic of MPI, the software stack that resides above the network layer.
- Remote Direct Memory Access can provide overlap.
- The MPI standard defines a Progress Rule for asynchronous communication operations. There are two different interpretations, leading to different behaviors (both compliant with the standard).

Remote Direct Memory Access (1)

RDMA is a way (protocol) of moving buffers between two applications across a network. RDMA is composed of three main components:
1. Zero-copy (send/recv data without intermediate copies)
2. Kernel/OS bypass (direct access to the network hardware without OS intervention)
3. Protocol offload (RDMA and transport layer implemented in hardware)

Remote Direct Memory Access (2)

In theory, the CPU is not involved in the data transfer and the cache of the remote CPU won't be polluted. In practice, vendors decide how much protocol processing is implemented in the NIC: QLogic (Intel True Scale) encourages protocol onloading, while Mellanox InfiniBand encourages protocol offloading. Both Yellowstone and Cheyenne have Mellanox InfiniBand.

MPI Progress Rule

- Strict interpretation: it mandates non-local progress semantics for all non-blocking communication operations once they have been enabled.
- Weak interpretation: it allows a compliant implementation to require the application to make further library calls in order to achieve progress on other pending communication operations.

It is possible to support overlap without supporting independent MPI progress, or to have independent MPI progress without overlap. Some networks (e.g. Portals, Quadrics) provide MPI-like interfaces in order to support MPI functions in hardware; others do not. MPI libraries should implement in software whatever is needed and not provided by the hardware.

Strategies for Asynchronous Progress

Three possible strategies for asynchronous progress:
1. Manual progress
2. Thread-based progress (polling-based vs. interrupt-based)
3. Communication offload

Manual progress is the most portable but requires the user to call MPI functions (e.g. MPI_Test) during the computation, as sketched below. Using a dedicated thread to poke MPI requires a thread-safe implementation. Offloading may become a bottleneck because of the low performance of the embedded processor on the NIC.
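
To make manual progress concrete, here is a minimal sketch (illustrative program and names, not taken from the talk): the sender breaks its long computation into chunks and calls MPI_Test after each chunk, so the MPI progress engine gets a chance to advance the pending transfer even when no asynchronous progress is available. Run with at least two ranks.

program manual_progress
  use mpi
  implicit none
  integer, parameter :: n = 1000000, nchunks = 100
  double precision :: x(n), y(n)
  integer :: me, np, req, ierr, chunk
  integer :: status(MPI_STATUS_SIZE)
  logical :: flag

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)

  if (me == 0) then
     x = 1.0d0
     call MPI_Isend(x, n, MPI_DOUBLE_PRECISION, np-1, 0, MPI_COMM_WORLD, req, ierr)
     do chunk = 1, nchunks
        ! ... one slice of the long computation goes here ...
        call MPI_Test(req, flag, status, ierr)   ! pokes the progress engine
     end do
     call MPI_Wait(req, status, ierr)            ! x must not be reused before this
  else if (me == np-1) then
     call MPI_Irecv(y, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, req, ierr)
     call MPI_Wait(req, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program manual_progress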

Thread-based progress: the silver bullet?

Often stated as the silver bullet, but not widely used. The MPI implementation must be thread-safe (MPI_THREAD_MULTIPLE), and thread safety in MPI implementations penalizes performance (mostly latency). There are two alternatives: polling-based and interrupt-based. Polling-based progress is beneficial if dedicated cores are available for the progression threads; interrupt-based progress might also be helpful on an oversubscribed node. For more info: T. Hoefler and A. Lumsdaine, "Message Progression in Parallel Computing - To Thread or not to Thread?"
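
As a small illustration of the thread-safety prerequisite (a sketch, not from the talk): the application must request MPI_THREAD_MULTIPLE at initialization and check what the library actually provides before a dedicated progress thread may call MPI concurrently with the main thread.

program thread_level_check
  use mpi
  implicit none
  integer :: provided, ierr

  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) then
     print *, 'MPI_THREAD_MULTIPLE not provided; got level ', provided
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if
  ! ... a dedicated progress thread could now legally poll MPI_Test/MPI_Iprobe ...
  call MPI_Finalize(ierr)
end program thread_level_check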

MPI Non-blocking Example

call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
if (me == np-1) then
   call MPI_Irecv(y, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, req, ierr)
endif
call MPI_Barrier(MPI_COMM_WORLD, ierr)
if (me == 0) then
   call MPI_Isend(x, n, MPI_DOUBLE, np-1, 0, MPI_COMM_WORLD, req, ierr)
   ! Some long computation here
   call MPI_Wait(req, status, ierr)
else if (me == np-1) then
   call MPI_Wait(req, status, ierr)
   ! Data on y correctly received
endif

Modules and Flags for IntelMPI

In the demo, two MPI implementations have been used: OpenMPI/3.0.0 and IntelMPI/2017.1.134.

For compiling and running the benchmarks with IntelMPI:

module swap gnu intel
module load intel/17.0.1
module load impi
mpiicc your_benchmark.c -o your_benchmark_intel

To turn on the helper thread for asynchronous progress in IntelMPI:

export I_MPI_ASYNC_PROGRESS=1

Modules and Flags for Open-MPI

For compiling and running the benchmarks with Open-MPI on Cheyenne:

module load gnu/7.1.0
module load openmpi/3.0.0
mpicc your_benchmark.c -o your_benchmark_openmpi
mpirun -np n ./your_benchmark_openmpi

To turn off RDMA support, set btl_openib_flags to 1:

mpirun -np n --mca btl_openib_flags 1 ./your_benchmark_openmpi

Demo 1: Non-Blocking Two-Sided Progress

Two-sided vs. One-sided Communication

[Figure: message exchange with two-sided vs. one-sided communication]

In theory, one-sided communication allows overlapping communication with computation and saving idle time.
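
A minimal one-sided sketch (illustrative, not the demo code) of what "only one side names the transfer" means: rank 0 merely exposes a window, and rank 1 performs the MPI_Get; the two fences delimit the access epoch.

program onesided_get
  use mpi
  implicit none
  integer, parameter :: n = 1000
  double precision :: buf(n), local(n)
  integer :: me, win, ierr
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

  buf = dble(me)
  winsize = n * 8_MPI_ADDRESS_KIND          ! bytes exposed by every rank
  call MPI_Win_create(buf, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

  call MPI_Win_fence(0, win, ierr)           ! open access epoch
  if (me == 1) then
     disp = 0
     call MPI_Get(local, n, MPI_DOUBLE_PRECISION, 0, disp, n, &
                  MPI_DOUBLE_PRECISION, win, ierr)
  end if
  call MPI_Win_fence(0, win, ierr)           ! close epoch; the get is complete here

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program onesided_get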

Demo 2: One-Sided Get Progress

MPI One-sided and RDMA

MPI one-sided routines show great potential for applications that benefit from communication/computation overlap, and MPI one-sided maps naturally onto RDMA. Without asynchronous progress, however, MPI calls are needed on the target process to poke the MPI progress engine. Some MPI implementations have built the one-sided routines on top of the two-sided ones...
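
A hedged sketch (not the demo code) of the overlap pattern alluded to above: a request-based MPI_Rget issued inside a passive-target epoch, with the origin computing before waiting. Whether the transfer really proceeds in the background depends on RDMA support and on asynchronous progress in the MPI library; on implementations that layer one-sided over two-sided, it may only complete once the target re-enters MPI.

program rget_overlap
  use mpi
  implicit none
  integer, parameter :: n = 1000
  double precision :: buf(n), local(n)
  integer :: me, win, req, ierr
  integer :: status(MPI_STATUS_SIZE)
  integer(kind=MPI_ADDRESS_KIND) :: winsize, disp

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

  buf = dble(me)
  winsize = n * 8_MPI_ADDRESS_KIND
  call MPI_Win_create(buf, winsize, 8, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)
  call MPI_Win_lock_all(0, win, ierr)        ! passive-target epoch

  if (me == 1) then
     disp = 0
     call MPI_Rget(local, n, MPI_DOUBLE_PRECISION, 0, disp, n, &
                   MPI_DOUBLE_PRECISION, win, req, ierr)
     ! ... long computation here; with true RDMA and asynchronous progress the
     !     get proceeds in the background, otherwise it may only complete once
     !     rank 0 re-enters the MPI library ...
     call MPI_Wait(req, status, ierr)         ! completes the get at the origin
  end if

  call MPI_Win_unlock_all(win, ierr)
  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program rget_overlap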

Demo 3: One-Sided Put Progress and Memory Allocation

Conclusions

- Message progression is an intricate topic: do not assume, always verify.
- Asynchronous progress may penalize performance.
- Writing programs that exploit overlap is critical.
- Consider switching from non-blocking two-sided to one-sided communication.

Acknowledgments: Davide Del Vento, Patrick Nichols, Brian Vanderwende, Rory Kelly.

Thanks