An MPI failure detector over PMPI 1

Donghoon Kim
Department of Computer Science, North Carolina State University
Raleigh, NC, USA
Email: dkim2@ncsu.edu

1 This paper was written to complete CSC548 Parallel Systems - Section 001, supervised by Dr. Frank Mueller, Department of Computer Science, North Carolina State University, Fall 2009.

Abstract

Failure detectors are valuable services that provide information about process failures in large-scale parallel systems. Many previous studies offer guidelines for implementing a failure detector, but a practical implementation remains a challenge because of the variety of parallel hardware and software environments. This study shows that a failure detector can be transparent, scalable, and portable. The experimental results show that the Fault Detector can be embedded into any MPI application with negligible overhead.

1. Introduction

Modern scientific applications on massively parallel processing systems have execution times ranging from days to months. These long-running MPI applications on clusters are prone to node or network failures as the systems scale [1]. An MPI application may make no progress after a node or network failure if it needs to exchange computation results through communication. Furthermore, the recovery overhead grows unless the fault detection service detects failures in a timely manner. On the other hand, the overhead of fault detection grows as its frequency increases in pursuit of accurate failure monitoring. Beyond failure detection itself, a fault detector also provides valuable services for system management, load balancing, and replication.

Many previous studies offer theoretical guidelines for implementing fault detection services. However, a practical implementation is another matter, because the variety of parallel hardware and software environments raises additional complications tied to the defining properties of unreliable failure detectors, namely completeness and accuracy [2].

Assuming a system model that provides certain temporal guarantees on communication and computation, called partially synchronous [3], the Fault Detector (FD) can use a time-based scheme, namely a ping-ack implementation, with the following properties:

Transparency: The FD is launched inside the MPI_Init routine through the MPI profiling interface, which creates the FD threads. The FD runs independently, using a communicator distinct from that of the application program. When the MPI application passes through MPI_Init, the FD starts running on every process without any additional effort from the user.

Scalability: The FD sends a check message sporadically, whenever the application program executes a communication routine. This does not lead to high communication overhead compared with periodic check messages, since the FD at each node suppresses redundant check messages within a defined time window.

Portability: An MPI application can be compiled with the FD by simply adding the FD source code to the same directory before compiling the application sources.

In this paper, I implement the Fault Detector over the MPI profiling layer to detect failures of MPI application processes or of the network. The experimental results show that the Fault Detector has the above properties with negligible overhead, since it uses a sporadic communication approach.

The rest of this paper is structured as follows. Section 2 describes the design of the Fault Detector and the practical methods used. Section 3 discusses implementation issues. Section 4 presents the experimental framework. Section 5 presents and analyzes the experimental results. Section 6 reviews related work on fault detection services. Section 7 concludes and outlines future directions.

2. Design

Figure 1 shows the diagram of the fault detector with sporadic communication that I have implemented so far. The arrows represent function calls from the MPI application to the Fault Detector. Suppose an MPI application such as the NAS Parallel Benchmarks (NAS PB) runs in a massively parallel processing (MPP) environment. As most MPI tools do, the MPI profiling layer (PMPI) intercepts MPI calls during application execution. When MPI_Init is called in the application, the Fault Detector (FD) is launched as well, since the code that starts the FD is inserted into MPI_Init. The FD is implemented in C, while parts of NAS PB are written in Fortran; the Wrapper Function (WF) therefore serves applications written in Fortran, linking the application's MPI calls to PMPI.
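As a concrete illustration of this launch path, the following is a minimal sketch in C of how the MPI_Init intercept could look. The names FD_COMM, fd_sender_loop, and fd_receiver_loop are illustrative assumptions, not the exact symbols of the implementation.

    #include <mpi.h>
    #include <pthread.h>

    MPI_Comm FD_COMM;                    /* separate communicator for FD traffic   */
    static pthread_t fd_sender_tid, fd_receiver_tid;

    void *fd_sender_loop(void *arg);     /* scans the Queue and sends ALIVE msgs   */
    void *fd_receiver_loop(void *arg);   /* probes FD_COMM, answers ALIVE with ACK */

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);  /* the real initialization */
        if (rc != MPI_SUCCESS)
            return rc;

        /* Give the FD its own communicator so its messages never mix with
         * application traffic.  (The FD threads call MPI themselves, so the
         * MPI library must tolerate multi-threaded use.) */
        PMPI_Comm_dup(MPI_COMM_WORLD, &FD_COMM);

        pthread_create(&fd_sender_tid,   NULL, fd_sender_loop,   NULL);
        pthread_create(&fd_receiver_tid, NULL, fd_receiver_loop, NULL);
        return rc;
    }

Because only PMPI entry points are called inside the wrapper, the FD setup stays invisible to the application, which simply calls MPI_Init as usual.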

Figure 1. The diagram of the Fault Detector

Once running, the FD executes its own routine, independent of the application. A new communicator, FD_COMM, is used for the FD communication routines so that FDs across the MPP environment can execute independently. The FD consists of two threads, the FD sender and the FD receiver.

Whenever an MPI communication routine (e.g. MPI_Send or MPI_Isend) is called in the application program, the corresponding PMPI wrapper routine is also executed. Each wrapper routine has three steps: pre-processing, the PMPI function call, and post-processing. Taking MPI_Send as an example: pre-processing registers the destination node with the current time in the Queue and signals the FD sender in case it is waiting because the Queue was empty; the PMPI function call executes the normal function, PMPI_Send, on behalf of the application; post-processing deletes the destination node registered in the Queue if PMPI_Send returned MPI_SUCCESS.

The FD tracks the status of neighbor nodes with the Queue, which is implemented as a doubly linked list holding node ID, timestamp, check-in flag, and so on. Each FD thread is created with pthread_create() and works independently, side by side with the application program. The FD uses two messages, ALIVE and ACK. An ALIVE message checks whether a destination node is alive; an ACK message is the destination node's confirmation. The FD must account for the delay between two nodes arising from both communication and computation time, and it suspects a destination node of having failed if no ACK message is received in response to an ALIVE message. The FD is designed to be integrated into the MPI environment.
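The three-step wrapper described above can be sketched in C as follows; queue_register, queue_remove, and fd_signal_sender are hypothetical helper names standing in for the Queue operations and the pthread signal.

    #include <mpi.h>

    void queue_register(int dest);   /* record dest with the current timestamp in the Queue   */
    void queue_remove(int dest);     /* drop dest from the Queue                               */
    void fd_signal_sender(void);     /* wake the FD sender if it is waiting on an empty Queue  */

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        /* 1. Pre-processing: register the destination and signal the FD sender. */
        queue_register(dest);
        fd_signal_sender();

        /* 2. The real operation, via the PMPI entry point. */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);

        /* 3. Post-processing: on success, the destination is removed again. */
        if (rc == MPI_SUCCESS)
            queue_remove(dest);
        return rc;
    }

If the send fails or hangs, the destination stays in the Queue, which is what later lets the FD sender probe that node with an ALIVE message.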

The following is a more detailed description of the FD sender and the FD receiver.

FD sender: It sends ALIVE messages as long as the Queue is not empty. Before sending an ALIVE message, the FD sender checks the timestamp of the destination node in the Queue to see whether the delay time has elapsed; the purpose of the delay time is to avoid redundant messages. If the delay time has elapsed, the FD sender sends an ALIVE message to that destination node and then flips the check-in flag, with an updated timestamp, to indicate that an ALIVE message has already been sent and an ACK is awaited from that node. When the FD sender rechecks this node on the next turn, it can suspect the node of having failed if the node still exists in the Queue. After each cycle, the FD sender sorts the Queue by the updated timestamps in ascending order.

FD receiver: It receives either ALIVE or ACK messages. The FD receiver periodically probes whether a message has arrived. On receiving a message, it takes the next action according to MPI_TAG: it replies with an ACK message to an ALIVE message, and deletes the corresponding node ID from the Queue upon an ACK message.

3. Implementation Issues

I have implemented three kinds of FDs based on different algorithms: periodic all-to-all, tree-structured, and sporadic. In this paper, I only present the sporadic FD, since it is the more reasonable and practical approach with respect to performance.

The FD uses pthread wait and signal to run continuously. A signal is conveyed to the FD sender whenever an MPI communication routine is called. The FD sender keeps working as long as the Queue holds a destination node; it also rechecks the Queue regularly, every 20 ms, when no signal arrives. The Queue is updated on every change, such as an insert, a delete, or a timestamp update. Because the Queue is maintained by many internal FD functions that require consistent updates, it is a critical section protected by a pthread lock and unlock.

The FD must keep running until the MPI application terminates. The FD could keep the processor busy and thereby degrade the performance of the MPI application, so the FD goes to sleep for some time after each cycle in both the FD sender and the FD receiver. In this implementation, the sleep time is 20 ms.

The CG benchmark is written in Fortran, so the WF is called whenever an MPI routine executes. Fortran compilers differ in the symbol name they generate: for MPI_Send, for example, one of mpi_send_, mpi_send, MPI_SEND, or MPI_Send is used; mpi_send_ is the name on opt10. All arguments in the WF are pointer arguments. Furthermore, the ierr argument at the end of the Fortran argument list is not used as such in C, because ierr is an integer carrying the same meaning as the C routine's return value.
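As an illustration of this naming and calling convention, a wrapper for the mpi_send_ symbol might be sketched as follows. The assumption here (one possible arrangement, not spelled out in the text) is that the Fortran wrapper forwards to the C profiling MPI_Send so that the FD pre- and post-processing still run; equivalently, the FD steps could be inlined here and PMPI_Send called directly.

    #include <mpi.h>

    void mpi_send_(void *buf, int *count, MPI_Fint *datatype,
                   int *dest, int *tag, MPI_Fint *comm, int *ierr)
    {
        /* Every Fortran argument arrives as a pointer; handles are Fortran
         * integers and must be converted to C handles first. */
        MPI_Datatype c_type = MPI_Type_f2c(*datatype);
        MPI_Comm     c_comm = MPI_Comm_f2c(*comm);

        /* The Fortran ierr argument receives what the C routine returns. */
        *ierr = MPI_Send(buf, *count, c_type, *dest, *tag, c_comm);
    }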

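Pulling together the pthread signalling, the 20 ms recheck, and the Queue locking described in this section, the FD sender thread might be structured roughly as follows. node_t, queue_head, fd_queue_lock, fd_queue_cond, and fd_delay are illustrative names, PMPI_Send is used so FD traffic does not re-enter the profiling wrappers, and Queue sorting and failure reporting are omitted for brevity.

    #include <mpi.h>
    #include <pthread.h>
    #include <time.h>

    #define ALIVE_TAG 1

    typedef struct node {
        int rank;                  /* destination node ID                */
        double timestamp;          /* last time this entry was touched   */
        int checked_in;            /* 1 once an ALIVE has been sent      */
        struct node *prev, *next;  /* doubly linked list                 */
    } node_t;

    extern MPI_Comm FD_COMM;
    extern node_t *queue_head;
    extern pthread_mutex_t fd_queue_lock;
    extern pthread_cond_t  fd_queue_cond;
    extern double fd_delay;        /* minimum gap between ALIVE messages */

    void *fd_sender_loop(void *arg)
    {
        int dummy = 0;
        for (;;) {
            pthread_mutex_lock(&fd_queue_lock);

            /* Wait for a signal from a wrapper, but wake up on our own every
             * 20 ms so the Queue is rechecked without keeping the CPU busy. */
            struct timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_nsec += 20 * 1000 * 1000;
            if (ts.tv_nsec >= 1000000000L) { ts.tv_sec += 1; ts.tv_nsec -= 1000000000L; }
            pthread_cond_timedwait(&fd_queue_cond, &fd_queue_lock, &ts);

            double now = MPI_Wtime();
            for (node_t *n = queue_head; n != NULL; n = n->next) {
                if (now - n->timestamp < fd_delay)
                    continue;          /* too soon: avoid a redundant ALIVE */
                if (n->checked_in) {
                    /* ALIVE already sent and still no ACK after the delay:
                     * this node would be suspected as failed (reporting omitted). */
                    continue;
                }
                PMPI_Send(&dummy, 1, MPI_INT, n->rank, ALIVE_TAG, FD_COMM);
                n->checked_in = 1;     /* now waiting for an ACK */
                n->timestamp  = now;
            }
            pthread_mutex_unlock(&fd_queue_lock);
        }
        return NULL;
    }

The matching FD receiver thread does the inverse: it probes FD_COMM, replies to an ALIVE with an ACK, and removes the sender's entry from the Queue when an ACK arrives.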
4. Experimental Framework

I conducted the performance evaluations on a local cluster. All machines are 2-way SMPs with dual-core AMD Opteron 265 processors, connected by a Gigabit Ethernet switch and running Fedora Core 5 Linux x86_64 with MPICH2 for the FD tests [4]. Because of limited resources, I used 3 nodes as the experimental cluster.

5. Experimental Results

In this section, the experimental results show how the FD affects the performance of MPI applications. The FD has been tested with the CG benchmark from the NAS Parallel Benchmark (NPB) suite as a basic test. Several test cases still have implementation issues related to Queue management with pthread lock and unlock; in those cases I make minor changes in the PMPI routines, removing the FD processing code. However, none of the basic tests shown in the graph have any such issue, so the experimental results give confidence for all cases.

5.1 Performance overhead

Figure 2 shows the performance results of NPB compared with NPB plus FD. "Normal" in the figure indicates the performance of plain NPB, while "FD" indicates the performance of NPB with the FD code. The X-axis shows the number of processes and the Y-axis the time in seconds from the NPB output.

Figure 2. The performance results with FD

DT and IS in NPB are written in C, while the others are written in Fortran, so the Fortran NPB codes call the additional wrapper functions. In DT, the right bar (red) is higher than the left bar (blue) because the execution time of DT is very short, so communication time such as round-trip time is relatively high compared with computation. Nevertheless, we can say that there is no overhead from calling the wrapper functions and no significant difference between C and Fortran. Although the three steps added to each MPI communication routine could affect overall performance, the overwhelming trend in Figure 2 indicates that there is no performance difference between Normal and FD.

5.2 Communication overhead

The other source of overhead is communication, and it too is negligible. Each MPI communication routine has the three steps described in Section 2, and post-processing removes the destination node from the Queue once MPI_SUCCESS is returned. As a result, the FD rarely sends an ALIVE message to a destination node, since in most cases a destination node is deleted soon after it is registered in the Queue. I have tested many cases, varying timing parameters such as the sleep and timedwait intervals in order to provoke high communication overhead intentionally; even in the worst case it turns out to be less than 1%. Periodic algorithms such as all-to-all and tree-structured detection can incur noticeable communication overhead, because they must send and receive messages at a given interval until the MPI application terminates. Thus, the communication overhead of the sporadic FD in this work can be considered negligible.

6. Related Work

In [2], Chandra and Toueg define eight classes of failure detectors by specifying their completeness and accuracy properties, show how to reduce the eight classes to four, and show how consensus can be solved for each class. This paper has influenced many later papers because it clearly exposes the false-positive problem of practical, asynchronous systems.

In [3], Sastry et al. address celerating environments, caused by system upgrades or periods of high stress, in which the speed of absolute time can increase or decrease. A bichronal timer composed of action clocks and real-time clocks can be used in such environments. My implementation relies only on the real-time clock of the local node.

In [5], Genaud et al. implemented a fault detector in the P2P-MPI environment using a heartbeat counter. Their paper addresses failure information sharing and a consensus phase. They report fault detection overhead because heartbeats are sent periodically; it is a practical approach for commodity systems.

7. Conclusion

In this work, I implement a sporadic Fault Detector based on ping-ack messages. Some implementation issues remain. Nevertheless, a practical Fault Detector can be implemented with the following properties: transparency, scalability, and portability. The experimental results show that the Fault Detector has negligible overhead in both communication and performance. The detector should be extended with a global view list of node failures and a way to reach consensus on differences among the global view lists of different nodes; I will implement this feature in future work [6].

8. References

[1] H. Jitsumoto, T. Endo, and S. Matsuoka, "ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-8, March 2007.
[2] T. D. Chandra and S. Toueg, "Unreliable failure detectors for reliable distributed systems," Journal of the ACM, vol. 43, no. 2, pp. 225-267, March 1996.
[3] S. Sastry, S. M. Pike, and J. L. Welch, "Crash fault detection in celerating environments," IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), pp. 1-12, 2009.
[4] http://moss.csc.ncsu.edu/~mueller/cluster/opt/
[5] S. Genaud and C. Rattanapoka, "Evaluation of Replication and Fault Detection in P2P-MPI," IPDPS 2009.
[6] http://www4.ncsu.edu/~dkim2/csc548/csc548.htm