Implementing a Hardware-Based Barrier in Open MPI - A Case Study

Torsten Hoefler(1), Jeffrey M. Squyres(2), Torsten Mehlan(1), Frank Mietke(1), and Wolfgang Rehm(1)

(1) Technical University of Chemnitz, Dept. of Computer Science, Chair of Computer Architecture, 09107 Chemnitz, Germany
    {htor,tome,mief,rehm}@cs.tu-chemnitz.de
(2) Open Systems Laboratory, Indiana University, 501 N. Morton Street, Bloomington, IN 47404, USA
    jsquyres@open-mpi.org

April 15, 2006

Abstract

Open MPI is a recent open source development project which combines features of different MPI implementations. These features include fault tolerance, multi-network support, grid support, and an architecture which ensures extensibility. The TUC Hardware Barrier is a special-purpose low-latency barrier network based on commodity hardware. We show that the Open MPI collective framework can easily be extended to support this special-purpose collective hardware without any changes to Open MPI itself.

1 Introduction

Many different MPI libraries are available to the user, and each implementation has its strengths and weaknesses. Many of them are based on the reference implementation MPICH [8] or its MPI-2 successor MPICH2. Several MPI implementations offer one specific functionality in their code base, e.g., LAM/MPI offers an easily extensible framework, FT-MPI [4] offers fault tolerance, MPICH-G2 [7] offers grid support, and LA-MPI [6] offers multiple-network-device support, but no MPI is available which combines all of these aspects. Open MPI is intended to fill this gap and combine different ideas into a single, extensible implementation (compare [5]). Many developers with an MPI background combine their knowledge to create this new framework and to add some remarkable features to Open MPI. These developers came from different directions and want to create a new design that is free from old architectural constraints. FT-MPI [4] contributes fault tolerance and a design for changing environments, LA-MPI [6] contributes multi-device support, LAM/MPI [3] contributes the framework, and PACX-MPI contributes the multi-cluster and grid support. Open MPI was started in January 2004 and was written from scratch to overcome several constraints and fit the needs of all members. Roughly two years later, version 1.0 was presented to the public.

1.1 Short Introduction to Open MPI

According to Gabriel et al. [5], the main goals of Open MPI are to keep the code open and extensible and to support MPI-2 with the maximum thread level (MPI_THREAD_MULTIPLE) for all widely deployed communication networks. Additionally, features like message fragmentation and striping across multiple interfaces, as well as fault recovery and tolerance, both fully transparent from the application's point of view, will be included in the implementation. The usability of the run-time environment (RTE) should be as comfortable as possible to support the user in executing MPI programs. Additional features, such as a data integrity check to recognize errors on internal data buses, can be enabled optionally. To reach the goals stated above, Open MPI uses a special architecture called the Modular Component Architecture (MCA), comparable to the Common Component Architecture (CCA) described in [1]. This architecture offers a framework to implement every functionality needed for MPI in a modular manner. The necessary functional domains are derived from the goals and equipped with a clear interface as a framework inside Open MPI. Each framework can host multiple modules which implement the necessary functionality to serve the interface. Each module can support different hardware or implement different algorithms, enabling the user or the framework to choose a specific module for each run without recompiling or relinking the code.

The modular structure is shown in Figure 1.

[Figure 1: Open MPI Architecture]

The three entities MCA, framework, and module are explained in the following. The MCA offers a link between the user, who is able to pass command line arguments or MCA parameters to it, and the modules which receive these parameters. Each framework is dedicated to a specific task (functional domain) and discovers, loads, unloads, and uses all modules which implement its interface. The framework also performs the module selection, which can be influenced by user (MCA) parameters. Module source code is supported and configured/built by the MCA, as are binary modules, which are loaded by the framework via GNU Libltdl. Current frameworks include the Byte Transport Layer (BTL), which acts as a data transport driver for each network, the BTL Management Layer (BML), which manages BTL modules, and the Point-to-Point Management Layer (PML), which performs higher-level tasks such as reliability, scheduling, or bookkeeping of messages. All these frameworks are directly related to the transport of point-to-point messages (e.g., MPI_SEND, MPI_RECV). Other frameworks, such as the topology framework (TOPO) and the parallel I/O framework (IO), offer additional functionality for other MPI-related domains. Helper frameworks, such as Rcache or MPool, are available to BTL implementers to support the management of registered/pinned memory. The collective framework (COLL) was used to implement a module which offers the hardware barrier support. The layering and the interaction between the different frameworks in the case of InfiniBand [10] communication are shown in Figure 2. The framework is specifically designed for HPC and optimized in all critical paths; thus, the overhead introduced compared to a static structure is less than 1% (cf. [2]).
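For example, a run could restrict the COLL framework to particular components with an MCA parameter on the mpirun command line; the hwbarr name below refers to the component described in Section 3, and the exact syntax and available component names depend on the Open MPI installation:

    mpirun --mca coll basic,hwbarr -np 4 ./app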

[Figure 2: Interaction of the frameworks inside the MCA]

2 The TUC Hardware Barrier

The TUC Hardware Barrier is a self-made barrier network built from commodity hardware. The standard parallel port is used for communication between the nodes, and a central piece of hardware controls all nodes. A schema of the pin-out of the parallel port is given in Figure 3. The current prototype supports only a single barrier operation across all nodes; thus, it is only usable for the global communicator MPI_COMM_WORLD. Each node has a single inbound and a single outbound channel, each of which can have the binary state 0 or 1 and is connected to the barrier hardware. The prototypical barrier hardware consists of an ALTERA UP1 FPGA (shown in Figure 4) and is able to control 60 nodes. The FPGA implements a simple finite state machine with two states corresponding to the two possible states of each node's input wire, 0 or 1. A transition between these two states is performed when either all output wires of all nodes are set to 1 or all output wires are set to 0. The graph representing the state transitions is shown in Figure 5. When a node reaches its MPI_BARRIER call, it reads its input wire, toggles the read bit, and writes it to its output wire. Then the node waits until its input is toggled by the barrier hardware (i.e., all nodes have toggled their bits) and leaves the barrier call.

[Figure 3: Parallel port pin-out (data port at BASE + 0, status port at BASE + 1, control port at BASE + 2; outgoing and incoming wires)]

[Figure 4: The prototypical hardware]

[Figure 5: The implemented finite state machine (the output toggles to 1 when all inputs are 1 and back to 0 when all inputs are 0)]
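The node-side part of this protocol is small enough to sketch in user-space C. The sketch below assumes a conventional x86 parallel port at base address 0x378 with the outgoing wire on bit 0 of the data port and the incoming wire on bit 6 of the status port; the paper does not specify the exact bit assignment, so these values are illustrative only.

    /* Sketch of the node-side toggle protocol (Linux/x86 user space).
     * Assumed values, not taken from the paper: parallel port base
     * 0x378, outgoing wire = bit 0 of the data port (BASE + 0),
     * incoming wire = bit 6 of the status port (BASE + 1). */
    #include <sys/io.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LPT_BASE 0x378      /* assumed parallel port base address */
    #define OUT_BIT  0x01       /* assumed outgoing bit (data port)   */
    #define IN_BIT   0x40       /* assumed incoming bit (status port) */

    static void hw_barrier(void)
    {
        /* Read the current state of the incoming wire. */
        int in = (inb(LPT_BASE + 1) & IN_BIT) ? 1 : 0;

        /* Toggle the value just read and drive it onto the outgoing
         * wire to signal "this node has reached the barrier". */
        outb(in ? 0 : OUT_BIT, LPT_BASE);

        /* Spin until the central FPGA toggles the incoming wire,
         * i.e. until every node has toggled its outgoing wire. */
        while (((inb(LPT_BASE + 1) & IN_BIT) ? 1 : 0) == in)
            ;
    }

    int main(void)
    {
        /* Request I/O permission for the three parallel port registers
         * (this is the access check mentioned in Section 3). */
        if (ioperm(LPT_BASE, 3, 1) != 0) {
            perror("ioperm");
            return EXIT_FAILURE;
        }
        hw_barrier();
        return EXIT_SUCCESS;
    }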

3 Implementing MPI_BARRIER

The barrier is implemented inside the COLL framework, which was originally introduced by Squyres et al. in [9]. The framework finds and loads all components and queries each whether it wants to run on this machine (a component is able to check for specific hardware or other prerequisites). During the creation of a new communicator, the framework initializes all runnable modules. Each module can test for all its prerequisites on this specific communicator and return a priority to the framework, which then selects the best one according to MCA parameters or the priorities returned by the modules. When a communicator is destroyed, the framework calls the uninitialize functions of the active modules so that all resources can be freed. In particular, each COLL component implements three functions:

collm_init_query() - invoked during MPI_INIT; tests whether all prerequisites are available.
collm_comm_query() - invoked for each new communicator.
collm_comm_unquery() - instructs the component to free all structures used for the communicator.

Once a component is selected for a specific communicator, it is called a module. There are two basic functions for each module: coll_module_init(), which can be used to initialize structures for each communicator, and coll_module_finalize(), which should free all used resources. Additionally, a module can implement some or all collective functions. All collective functions which are not implemented are performed by a reference implementation based on point-to-point messages. Our hwbarr component uses these functions in the following way (note that OMPI_SUCCESS indicates the successful completion of a function to the framework):

collm_init_query() - checks for the appropriate access rights (IN/OUT CPU instructions) and returns OMPI_SUCCESS if they are available.
collm_comm_query() - returns a priority if the communicator is MPI_COMM_WORLD and refuses to run otherwise.
collm_comm_unquery() - returns OMPI_SUCCESS.

The module functions are used as follows:

coll_module_init() - returns OMPI_SUCCESS.
coll_module_finalize() - returns OMPI_SUCCESS.
coll_barrier() - performs the barrier call as described in Section 2 by reading the input, writing the toggled value to the output, and waiting until the input is toggled as well.

Other collective functions have not been implemented, and the framework falls back to the standard point-to-point based implementation if they are called.
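To make this structure concrete, the sketch below shows how the hwbarr entry points could be wired together. The actual Open MPI coll interface uses framework-specific structure types and function signatures; the communicator type, the helper functions, and the priority value here are simplified placeholders for illustration, not the real API.

    /* Illustrative sketch of the hwbarr entry points described above.
     * Types and signatures are simplified placeholders, not the actual
     * Open MPI coll interface. */
    #define OMPI_SUCCESS        0
    #define OMPI_ERR_NOT_FOUND  (-1)   /* placeholder "refuse to run" code */

    typedef struct communicator communicator_t; /* stand-in for the real communicator type */

    /* Assumed helpers, see the parallel port sketch in Section 2. */
    extern int lpt_request_access(void);   /* e.g. an ioperm() wrapper   */
    extern int lpt_toggle_barrier(void);   /* the toggle protocol itself */
    extern int comm_is_world(communicator_t *comm);

    /* Called during MPI_INIT: run only if the port can be accessed. */
    int hwbarr_collm_init_query(void)
    {
        return (lpt_request_access() == 0) ? OMPI_SUCCESS : OMPI_ERR_NOT_FOUND;
    }

    /* Called for every new communicator: the prototype hardware serves
     * only MPI_COMM_WORLD, so refuse to run on any other communicator. */
    int hwbarr_collm_comm_query(communicator_t *comm, int *priority)
    {
        if (!comm_is_world(comm))
            return OMPI_ERR_NOT_FOUND;
        *priority = 75;                    /* example priority value */
        return OMPI_SUCCESS;
    }

    int hwbarr_collm_comm_unquery(communicator_t *comm)   { return OMPI_SUCCESS; }
    int hwbarr_coll_module_init(communicator_t *comm)     { return OMPI_SUCCESS; }
    int hwbarr_coll_module_finalize(communicator_t *comm) { return OMPI_SUCCESS; }

    /* The only collective provided by the module; all other collectives
     * fall back to the point-to-point based reference implementation. */
    int hwbarr_coll_barrier(communicator_t *comm)
    {
        (void)comm;                        /* only MPI_COMM_WORLD reaches here */
        return (lpt_toggle_barrier() == 0) ? OMPI_SUCCESS : OMPI_ERR_NOT_FOUND;
    }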

4 Conclusions

We showed that it is easy to add support for special collective hardware to Open MPI. Neither the application nor the Open MPI installation has to be recompiled or relinked; the module only has to implement the interface functions and be copied to the right directory in the library path. The Open MPI architecture also adds nearly no overhead, which makes the collective framework very attractive to implementers of collective operations. The TUC Hardware Barrier and the corresponding COLL component still have to be extended to support multiple communicators and to allow non-root processes to use the hardware.

References

[1] Rob Armstrong, Dennis Gannon, Al Geist, Katarzyna Keahey, Scott R. Kohn, Lois McInnes, Steve R. Parker, and Brent A. Smolinski. Toward a Common Component Architecture for High-Performance Scientific Computing. In HPDC, 1999.

[2] B. Barrett, J. M. Squyres, A. Lumsdaine, R. L. Graham, and G. Bosilca. Analysis of the Component Architecture Overhead in Open MPI. In Proceedings, 12th European PVM/MPI Users' Group Meeting, Sorrento, Italy, September 2005.

[3] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Supercomputing Symposium, pages 379-386, 1994.

[4] G. Fagg and J. Dongarra. FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World. In Lecture Notes in Computer Science: Proceedings of EuroPVM-MPI 2000, volume 1908, pages 346-353. Springer Verlag, 2000.

[5] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004.

[6] Richard L. Graham, Sung-Eun Choi, David J. Daniel, Nehal N. Desai, Ronald G. Minnich, Craig E. Rasmussen, L. Dean Risinger, and Mitchel W. Sukalski. A network-failure-tolerant message-passing system for terascale clusters. In ICS '02: Proceedings of the 16th International Conference on Supercomputing, pages 77-83, New York, NY, USA, 2002. ACM Press.

[7] N. Karonis, B. Toonen, and I. Foster. MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface. Journal of Parallel and Distributed Computing (JPDC), 63(5):551-563, 2003.

[8] MPICH2 Developers. http://www-unix.mcs.anl.gov/mpi/mpich2/.

[9] Jeffrey M. Squyres and Andrew Lumsdaine. The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms. In Proceedings, 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, St. Malo, France, July 2004.

[10] The InfiniBand Trade Association. InfiniBand Architecture Specification Volume 1, Release 1.2. InfiniBand Trade Association, 2003.