Open MPI and ADCL: Communication Libraries for Parallel Scientific Applications. Edgar Gabriel

Open MPI and ADCL: Communication Libraries for Parallel Scientific Applications. Edgar Gabriel, Department of Computer Science, University of Houston, gabriel@cs.uh.edu

Is MPI dead? New MPI libraries released in the last three years:
- LAM/MPI 7.x: re-implementation of LAM/MPI focusing on a component architecture
- MPICH2: all-new version by ANL
- Open MPI: all-new public-domain implementation
- MVAPICH/2: public-domain MPI libraries for InfiniBand interconnects
- Intel MPI: commercial cluster MPI library based on MPICH2
- HP-MPI: MPI library for clusters
- Voltaire MPI etc.: vendor-specific derivatives of MPICH-1.2.x for InfiniBand

Open MPI Team: the project brings together the developers of PACX-MPI, LAM/MPI, LA-MPI, and FT-MPI.

Goals:
- All of MPI-2
- Thread safety (MPI_THREAD_MULTIPLE); see the sketch below
- Based on a component architecture: flexible run-time tuning; plug-ins for different capabilities (e.g., different networks)
- Optimized performance: low latency and high bandwidth; polling vs. asynchronous progress
- Production quality
- Open source with a vendor-friendly license (BSD)
- Bring together MPI-smart developers and prevent the forking problem; community / third-party involvement
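
The thread-safety goal maps onto the standard MPI-2 initialization call; the following minimal sketch shows an application requesting MPI_THREAD_MULTIPLE (plain MPI only, no Open MPI-specific API assumed):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Ask for full thread support; the library reports what it actually provides. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            printf("MPI_THREAD_MULTIPLE not available, got level %d\n", provided);
        }

        /* ... multi-threaded application code ... */

        MPI_Finalize();
        return 0;
    }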

MPI Component Architecture (MCA). Layering (figure): the user application calls the MPI API, which is implemented on top of the MCA; the MCA in turn hosts a set of component frameworks.

MCA Component Frameworks. Components are divided into three categories:
- Back-end to MPI API functions
- Run-time environment
- Infrastructure / management
Rule of thumb: if we'll ever want more than one implementation, make it a component.

MCA Component Types.
- MPI types: P2P management, P2P transport, collectives, topologies, MPI-2 one-sided, MPI-2 I/O, reduction operations
- Run-time environment types: out-of-band communication, process control, global data registry
- Management types: memory pooling, memory caching, common

Component / Module Lifecycle (figure: open, selection, initialization, normal usage / checkpoint-restart, finalization, close).
Component:
- Open: per-process initialization
- Selection: per scope, determine whether we want to use it
- Close: per-process finalization
Module:
- Initialization: once the component is selected
- Normal usage / checkpoint
- Finalization: per-scope cleanup
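
This lifecycle can be pictured as two C structures with function pointers. The sketch below is a simplified, hypothetical illustration; the names mca_component_t and mca_module_t and their fields are invented here and are not the actual Open MPI MCA base types:

    #include <stdio.h>

    /* Hypothetical component: opened and closed once per process. */
    typedef struct {
        const char *name;
        int (*open)(void);      /* per-process initialization                */
        int (*select)(void);    /* per-scope: returns 1 if we want to use it */
        int (*close)(void);     /* per-process finalization                  */
    } mca_component_t;

    /* Hypothetical module: created per scope if the component is selected. */
    typedef struct {
        int (*init)(void);      /* initialization once selected */
        int (*finalize)(void);  /* per-scope cleanup            */
    } mca_module_t;

    /* Stub callbacks standing in for a real plug-in. */
    static int my_open(void)     { puts("open");     return 0; }
    static int my_select(void)   { puts("select");   return 1; }
    static int my_close(void)    { puts("close");    return 0; }
    static int my_init(void)     { puts("init");     return 0; }
    static int my_finalize(void) { puts("finalize"); return 0; }

    int main(void)
    {
        mca_component_t comp = { "demo", my_open, my_select, my_close };
        mca_module_t    mod  = { my_init, my_finalize };

        if (comp.open() == 0) {              /* open           */
            if (comp.select()) {             /* selection      */
                mod.init();                  /* initialization */
                /* ... normal usage, possibly checkpoint/restart ... */
                mod.finalize();              /* finalization   */
            }
            comp.close();                    /* close          */
        }
        return 0;
    }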

Point-to-Point (figure): the user application calls the MPI API on top of the MCA; below sit the point-to-point frameworks PML, BML, BTL, MPool, and RCache, with component boxes OB1 and DR (PML), R2 (BML), TCP/IP, shared memory, and IB (BTL), plus gm, ib, sm, and rb in the lower layers.

Pt-2-Pt Components.
- PML (P2P Management Layer): provides the MPI point-to-point semantics, message progression, and request completion and notification; implements the internal MPI messaging protocols (eager send, rendezvous); supports various types of interconnect (send/recv, RDMA, hybrids).
- BTL (Byte Transfer Layer): the data mover; message matching; responsible for its own progress (polling or asynchronous).
- BML (BTL Management Layer): thin multiplexing layer over the BTLs; manages peer resource discovery.
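
To make the eager/rendezvous distinction concrete, here is a minimal, hypothetical sketch of the sender-side protocol choice; EAGER_LIMIT and the 4 KB threshold are assumptions for illustration, not Open MPI's internal values (such limits are typically run-time tunable):

    #include <stddef.h>
    #include <stdio.h>

    #define EAGER_LIMIT 4096   /* assumed threshold in bytes */

    /* Sketch: choose the internal protocol for a message of 'len' bytes. */
    static const char *choose_protocol(size_t len)
    {
        if (len <= EAGER_LIMIT) {
            /* Eager: header and payload are pushed immediately; buffered on
               the receiver if no matching receive has been posted yet.      */
            return "eager";
        }
        /* Rendezvous: send only a ready-to-send header; transfer the bulk of
           the data (often via RDMA) once the receiver has matched it.       */
        return "rendezvous";
    }

    int main(void)
    {
        printf("1 KB -> %s\n", choose_protocol(1024));
        printf("1 MB -> %s\n", choose_protocol(1024 * 1024));
        return 0;
    }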

Performance Results. A 3-D finite difference code; four different implementations of the occurring communication pattern were evaluated:
- fcfs: first-come first-serve using non-blocking communication and derived datatypes
- fcfs-pack: first-come first-serve using non-blocking operations and pack/unpack
- overlap: first-come first-serve using non-blocking operations, derived datatypes, and overlapping communication and computation
- ordered: blocking Send/Recv operations with derived datatypes
Tests were executed on cacau (HLRS), an EM64T cluster with InfiniBand and Gigabit Ethernet, and on phobos (ZIH), an Opteron cluster with InfiniBand. Three different MPI libraries were tested: Open MPI v1.0.1, Intel MPI 1.0, and MVAPICH 1.2.

Performance Results (I). Execution time for 200 iterations on 64 processes / 32 nodes on phobos. [Chart: comparison of Open MPI and MVAPICH2 on phobos; execution time in seconds for the 128x128x128 and 256x128x128 problem sizes with the fcfs, fcfs-pack, ordered, and overlap implementations.]

Performance Results (II). Execution time for 200 iterations on 16 processors / 16 nodes over InfiniBand. [Chart: comparison of Open MPI and Intel MPI on cacau; execution time in seconds for the 128x64x64 and 128x128x64 problem sizes with the fcfs, fcfs-pack, ordered, and overlap implementations.]

Performance Results (III). Execution time for 200 iterations on 16 processors / 16 nodes over Gigabit Ethernet. [Chart: comparison of Open MPI and Intel MPI on cacau; execution time in seconds for the 128x64x64 and 128x128x64 problem sizes with the fcfs, fcfs-pack, ordered, and overlap implementations.]

Current status (I). Current stable release: v1.0.2. Branched last week for v1.1, expected to be released May/June/July, which adds a tuned collective communication component, a one-sided communication component, and data reliability.
Supported operating systems: Linux, OS X (BSD), Solaris*, AIX* (* less frequently tested).

Current status (II).
- Supported network interconnects: TCP, shared memory, Myrinet (GM, MX), InfiniBand (mvapi, OpenIB), Portals
- Supported batch schedulers: rsh/ssh, BProc (current), PBS/Torque, SLURM, BJS (LANL BProc Clustermatic), Yod (Red Storm)

Currently ongoing work:
- Data reliability for point-to-point operations
- Definition of a new collective framework (collv2): selection on a per-function basis (instead of the per-component basis in v1)
- Coordinated checkpoint-restart capabilities
- ORTE v2, relevant for dynamic process management

ADCL - Motivation (I). Finite difference code using a regular domain decomposition. Data exchange at process boundaries is required in every iteration of the solver and is typically implemented by a sequence of point-to-point operations.
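
A typical hand-written version of this boundary exchange, and the kind of code ADCL aims to abstract, looks roughly like the following sketch; it shows a 1-D decomposition with one ghost cell per side (the 3-D case of these slides adds more neighbors and derived datatypes for the non-contiguous faces):

    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024   /* local interior size, chosen arbitrarily for this sketch */

    int main(int argc, char **argv)
    {
        int rank, size, left, right;
        double *u = calloc(N + 2, sizeof(double));   /* interior plus two ghost cells */
        MPI_Request req[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* Exchange ghost cells with both neighbors in every solver iteration. */
        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u[N + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u[N],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* ... stencil update on u[1..N] ... */

        free(u);
        MPI_Finalize();
        return 0;
    }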

Motivation (III). Execution time for 200 iterations on 32 processes/processors. [Chart: execution time in seconds for the fcfs, fcfs-pack, ordered, and overlap implementations on the 128x128x64 and 128x128x128 problem sizes over IB and TCP.]

How to implement the required communication pattern?
- Dependence on the platform: some functionality is supported (efficiently) only on certain platforms or with certain network interconnects.
- Dependence on the MPI library: does the library support all available methods? How efficiently does it overlap communication and computation? What is the quality of its support for user-defined datatypes?
- Dependence on the application: problem size; ratio of communication to computation.

Problem: how can an (average) user understand the myriad of implementation options and their impact on the performance of the application? (Honest) answer: no way. Hence the need for abstract interfaces for application-level communication operations (ADCL) and for statistical tools to detect correlations between parameters and application performance.

ADCL - Adaptive Data and Communication Library. Goals:
- Provide abstract interfaces for frequently occurring application-level communication patterns (collective operations not covered by the MPI specification)
- Provide a wide variety of implementation possibilities and decision routines that choose the fastest available implementation at runtime
ADCL does not replace MPI; it is add-on functionality that uses many features of MPI.

ADCL components (I). 1. Static (parallel) configure step:
- Exclude methods not supported by the MPI library
- Determine characteristics of the MPI library, e.g. MPI_Send vs. MPI_Isend vs. MPI_Put vs. MPI_Get, the effect of MPI_Alloc_mem, derived datatypes vs. MPI_Pack/MPI_Unpack, and the efficiency of overlapping communication and computation
- Store the characteristics as attributes of the library
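
As an illustration of the kind of micro-benchmark such a configure step could run, the sketch below times a blocking MPI_Send ping-pong against an MPI_Isend/MPI_Wait variant between two processes; the message size and iteration count are arbitrary choices, and this is not the actual ADCL probe code:

    #include <mpi.h>
    #include <stdio.h>

    #define LEN   65536
    #define ITERS 1000

    /* Ping-pong probe comparing MPI_Send and MPI_Isend; run with exactly 2 processes. */
    int main(int argc, char **argv)
    {
        static double sbuf[LEN], rbuf[LEN];
        double t_send, t_isend;
        MPI_Request req;
        int rank, peer;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;

        /* Variant 1: blocking MPI_Send / MPI_Recv. */
        MPI_Barrier(MPI_COMM_WORLD);
        t_send = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(sbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(rbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(rbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(sbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            }
        }
        t_send = MPI_Wtime() - t_send;

        /* Variant 2: the same ping-pong with MPI_Isend / MPI_Wait. */
        MPI_Barrier(MPI_COMM_WORLD);
        t_isend = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Isend(sbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
                MPI_Recv(rbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(rbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Isend(sbuf, LEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            }
        }
        t_isend = MPI_Wtime() - t_isend;

        if (rank == 0)
            printf("MPI_Send: %.3f s   MPI_Isend: %.3f s\n", t_send, t_isend);

        MPI_Finalize();
        return 0;
    }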

ADCL components (II). 2. ADCL methods and runtime library:
- Collection of all available implementations for a certain communication operation
- Runtime decision routines: matching the requirements of an implementation against the attributes set by the parallel configure step; testing at runtime
- Monitoring of the performance, used to initiate re-evaluation of a decision
3. Historic learning: input file; usage of performance skeletons (cooperation with Jaspal Subhlok)
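
The runtime decision can be illustrated as a "time each candidate for a few iterations, then commit to the fastest" loop; the sketch below is a hypothetical illustration (the function table, PROBE_ITER, and the MPI_Allreduce-based agreement are invented for this example and are not the ADCL internals):

    #include <mpi.h>
    #include <stdio.h>

    #define NUM_IMPL   3
    #define PROBE_ITER 10

    /* Candidate implementations of the same communication operation
       (bodies left empty; in reality each would perform the neighbor exchange). */
    typedef void (*impl_fn)(void);
    static void impl_fcfs(void)    { /* non-blocking + derived datatypes  */ }
    static void impl_pack(void)    { /* non-blocking + pack/unpack        */ }
    static void impl_overlap(void) { /* overlap communication/computation */ }

    int main(int argc, char **argv)
    {
        impl_fn     impls[NUM_IMPL] = { impl_fcfs, impl_pack, impl_overlap };
        const char *names[NUM_IMPL] = { "fcfs", "fcfs-pack", "overlap" };
        double      best_t = 1e30;
        int         best = 0, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Evaluation phase: time each candidate for a few solver iterations. */
        for (int i = 0; i < NUM_IMPL; i++) {
            double t = MPI_Wtime();
            for (int it = 0; it < PROBE_ITER; it++)
                impls[i]();
            t = MPI_Wtime() - t;

            /* Take the maximum over all processes so every rank picks the same winner. */
            MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
            if (t < best_t) { best_t = t; best = i; }
        }

        if (rank == 0)
            printf("selected implementation: %s\n", names[best]);

        /* Production phase: keep calling impls[best]; performance monitoring here
           could later trigger a re-evaluation of the decision. */

        MPI_Finalize();
        return 0;
    }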

Classification of implementations:
1. Data transfer primitives: blocking point-to-point operations; non-blocking point-to-point operations; persistent request operations; one-sided operations; collective operations
2. Mapping of the communication pattern to data transfer operations: direct-transfer vs. variable-transfer; single-block vs. dual-block implementations
3. Handling of non-contiguous messages: sending each element separately; pack/unpack; derived datatypes
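
The third axis of this classification, handling of non-contiguous messages, boils down to describing the layout once with a derived datatype versus copying it through MPI_Pack/MPI_Unpack; the sketch below sends one column of a row-major N x N matrix both ways (matrix size and tags chosen arbitrarily; run with at least two processes):

    #include <mpi.h>
    #include <stdlib.h>

    #define N 512

    int main(int argc, char **argv)
    {
        int rank, bufsize, pos = 0;
        double *m   = calloc(N * N, sizeof(double));   /* row-major NxN matrix  */
        double *col = calloc(N, sizeof(double));       /* one contiguous column */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Pack_size(N, MPI_DOUBLE, MPI_COMM_WORLD, &bufsize);
        char *tmp = malloc(bufsize);

        if (rank == 0) {
            /* Variant A: derived datatype describing one column (stride N). */
            MPI_Datatype column;
            MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
            MPI_Type_commit(&column);
            MPI_Send(&m[0], 1, column, 1, 0, MPI_COMM_WORLD);
            MPI_Type_free(&column);

            /* Variant B: explicit MPI_Pack into a contiguous buffer, then send. */
            for (int i = 0; i < N; i++)
                MPI_Pack(&m[i * N], 1, MPI_DOUBLE, tmp, bufsize, &pos, MPI_COMM_WORLD);
            MPI_Send(tmp, pos, MPI_PACKED, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Variant A arrives as a plain contiguous column of N doubles. */
            MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Variant B has to be unpacked explicitly on the receiver. */
            MPI_Recv(tmp, bufsize, MPI_PACKED, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Unpack(tmp, bufsize, &pos, col, N, MPI_DOUBLE, MPI_COMM_WORLD);
        }

        free(tmp); free(col); free(m);
        MPI_Finalize();
        return 0;
    }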

ADCL code sample:

    /* Describe neighborhood relations using the topology functions of MPI */
    MPI_Cart_create(comm, n, dims, period, reorder, &cart_comm);

    /* Register a data structure for communication operations */
    ADCL_Register_dense_matrix(matrix, n, matrix_dims, k, submatrix_dims,
                               num_ghostcells, distance, cart_comm,
                               &adcl_request);

    /* Start a blocking communication for the registered matrix
       on the provided communicator */
    ADCL_Start(&adcl_request);

Current status of ADCL: development is application-driven, currently by CMAQ, an air-quality code (Daewon Byun), and a multi-scale blood-flow simulation (Marc Garbey).

Available implementations for 3-D neighborhood communication (data transfer primitive / communication structure / handling of non-contiguous messages):
- ordered: blocking; direct-transfer, single-block; derived datatypes
- fcfs: non-blocking; direct-transfer, single-block; derived datatypes
- fcfs-pack: non-blocking; direct-transfer, single-block; pack/unpack
- overlap: non-blocking; direct-transfer, dual-block; derived datatypes
- alltoallw: collective; direct-transfer, single-block; derived datatypes
- get-fence: one-sided; direct-transfer, dual-block; derived datatypes
- put-fence: one-sided; direct-transfer, dual-block; derived datatypes
- get-start: one-sided; direct-transfer, dual-block; derived datatypes
- put-start: one-sided; direct-transfer, dual-block; derived datatypes
- topo: non-blocking; variable-transfer, single-block; derived datatypes
- topo-overlap: non-blocking; variable-transfer, dual-block; derived datatypes

Summary. Open MPI is a component-based, flexible implementation of the MPI-1 and MPI-2 specifications; it resolves some of the issues seen with other MPI libraries on today's clusters. ADCL is an adaptive communication library for abstracting frequently occurring application-level communication operations; it simplifies the development of portable and performant code for scientific computing.