Asynchronous Checkpointing for PVM Requires Message-Logging

Kevin Skadron
18 April 1994

Abstract

Distributed computing using networked workstations offers cost-efficient parallel computing, but the higher rate of failure requires effective fault-tolerance. Asynchronous consistent checkpointing offers a low-overhead solution. Parallel Virtual Machine (PVM) allows a heterogeneous network of UNIX workstations to serve immediately as a distributed computer by providing message-passing services implemented on top of UNIX inter-process communication. We briefly show that correct user-level support for an aggressive, asynchronous two-phase-commit checkpointing protocol for PVM's virtual circuit mode requires message logging.

1 Introduction

Networks of workstations are readily accessible, and an environment providing the necessary support for message-passing or shared memory can immediately turn such a network into a distributed computer. UNIX, being portable and the dominant operating system in research and high-performance computing, is frequently the base for such environments. While a distributed computer like this may not be as efficient as a network of workstations built directly for the purpose, using an existing network and operating system is inexpensive and convenient. The higher rate of failure in such systems, however, requires effective fault-tolerance.

Parallel Virtual Machine (PVM) [1] is the most common UNIX-based environment for distributed computing. It implements a message-passing environment for a heterogeneous network of computers. It provides two modes: direct process-to-process communication using UNIX sockets (virtual circuits), or daemon-to-daemon communication, where the message is first passed to the PVM daemon on the sender's machine, then to the daemon on the recipient's machine, and finally to the recipient. PVM lacks an efficient mechanism for fault-tolerance; Fail-safe PVM [9] uses a slower sync-and-stop protocol.
We set out to implement for PVM's virtual circuit mode the aggressive two-phase-commit checkpointing protocol described by Elnozahy, Johnson, and Zwaenepoel [5], in which computation is only suspended long enough to fork a thread; checkpoints are written concurrently with computation. The overlap permits very low failure-free overhead, typically less than 5% even with frequent checkpoints. We find, however, that a correct user-level implementation for PVM requires message logging. UNIX sockets hide acknowledgements in the kernel, but the protocol requires control of acknowledgements in the message routines. Kernel modification is not an option; one of PVM's chief virtues is that it is strictly user-level code and can be installed by any user on almost any system.

This paper describes the need for message logging in implementing the Elnozahy protocol for PVM. Section 2 gives a brief overview of the semantics of consistent checkpointing. Section 3 describes the need for logging. Section 4 describes related work, and we summarize our work in Section 5. Our study was done using PVM version 2.4 [1].

[Revised 7 February 1996; earlier version was titled "Modifications to PVM for Distributed Checkpointing under UNIX".]

2 Semantics of Consistent Checkpointing

The system is assumed to consist of a number of fail-stop processes [10] comprising a distributed application, with one or more processes at each processor node. Processes are assumed to communicate only by passing

messages. A process checkpoint consists of saving that process's state (memory contents and process context). An application checkpoint consists of one checkpoint per process comprising the application. This set of checkpoints must represent a globally consistent state: the state that would exist upon restarting the application from its checkpoints must not be a state impossible during uninterrupted operation. For example, a state in which some process has received a message which no process has sent is not allowed. When a failure is detected, the application terminates and restarts from its checkpoints, with the faulty node not used.

Checkpoints must be stored on a central server to provide maximum availability of the checkpoints; storing checkpoints on individual nodes makes their availability subject to the very node failures which are to be defended against. In the case of uninterrupted server operation (a reasonably reliable server can be attained through replication), consistent checkpointing is resilient to an arbitrary number of failures.

The individual processes can only affect each other through interprocess communication [3]. This means efficient fault-tolerance can be achieved by allowing processes to continue computing while they checkpoint. Synchronizing communication is sufficient to achieve consistency. The two disallowed states are a message which has been received but not sent (see Figure 1), and a message which has been successfully sent but not received (see Figure 2).

Figure 1: p0 has recorded its checkpoint before sending the message; if restarted from the checkpoint, it will restart in a state in which the message remains to be sent. p1, however, receives the message before checkpointing, and will restart in a state in which it has received a message that appears never to have been sent.
Figure 2: p1 has recorded its checkpoint before receiving the message; if restarted from the checkpoint, it will restart in a state in which the receive has not yet occurred. p0, however, sends the message and receives the acknowledgement before checkpointing, and will restart in a state in which it has already successfully sent the message and will not resend it.

Consistency is achieved using an asynchronous two-phase-commit protocol that restricts a process's view of messages while checkpointing, but otherwise allows computation to continue. Two-phase-commit ensures that the application's checkpoints all see a message as either successfully sent and received, or not successfully sent and not received. This is accomplished by tagging each message with a consistent checkpoint number (CCN) corresponding to the sender's current checkpointing epoch. A process must defer receipt of any message or acknowledgement tagged with a CCN higher than its own until it too reaches that epoch.
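The deferral rule can be sketched in a few lines. The following is an illustrative model only; the class and method names (Process, receive, take_checkpoint) are ours, not PVM's or the original protocol's:

```python
# Illustrative model of CCN-based message deferral (not PVM code).
# A process holds back any message tagged with a CCN higher than its
# own epoch, and re-examines the held messages after each checkpoint.
from collections import deque

class Process:
    def __init__(self):
        self.ccn = 0             # current checkpointing epoch
        self.deferred = deque()  # messages tagged with a later epoch
        self.delivered = []

    def receive(self, msg_ccn, payload):
        if msg_ccn > self.ccn:
            # sender is already in a later epoch: defer until we catch up
            self.deferred.append((msg_ccn, payload))
        else:
            self.delivered.append(payload)

    def take_checkpoint(self):
        self.ccn += 1            # enter the next epoch (phase-one trigger)
        pending, self.deferred = self.deferred, deque()
        for msg_ccn, payload in pending:
            self.receive(msg_ccn, payload)  # retry now that we advanced

p = Process()
p.receive(1, "early")       # sender already in epoch 1 -> deferred
p.receive(0, "same-epoch")  # same epoch -> delivered at once
p.take_checkpoint()         # p enters epoch 1; deferred message delivered
print(p.delivered)          # ['same-epoch', 'early']
```

Once the receiver's own CCN catches up, delivery proceeds normally, so the only cost of deferral is buffering.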

In the first phase, a distinguished checkpoint coordinator initiates a checkpoint by sending to each process a trigger message.^1 On receipt of a trigger, a process increments its CCN and takes the checkpoint. In the second phase, each process replies with a success message if its checkpoint succeeded. When success messages have been received from all processes, the coordinator sends each a commit message, which causes the tentative checkpoint to be made permanent and the previous permanent checkpoint to be discarded. If there is a failure, an abort message is sent instead, the previous checkpoint is maintained, and the tentative checkpoint is discarded; failure recovery is initiated if deemed necessary.

Use of the CCN ensures that all processes agree on the state of messages within an application checkpoint. If some process sends a message in its epoch N + 1, all recipients will defer receipt of that message until they reach their own epochs N + 1. If some process sends a message in its epoch N and the recipient acknowledges in its epoch N + 1, the recipient's checkpoint N has not seen the message, because the message has not officially been received until the acknowledgement can be sent; receipt of the acknowledgement is deferred by the sender until the sender also enters epoch N + 1, and the sender's checkpoint N sees the recipient as not having acknowledged, interpreting this on restart as a failed send. In both scenarios, on restart all processes agree the message "never happened".

3 Implementing Consistent Checkpointing for PVM

3.1 The Problem

Implementing asynchronous two-phase-commit requires access to the message facilities in order to append and check CCNs. The difficulty is that PVM's virtual circuit mode uses sockets, which are implemented using TCP/IP in the UNIX kernel.
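The coordinator's two phases described in Section 2 amount to a small state machine. A minimal sketch under an invented process interface (run_checkpoint_round, take_tentative_checkpoint, commit, and abort are our names; this is not PVM code):

```python
# Illustrative two-phase-commit checkpoint round (assumed interface).
def run_checkpoint_round(processes):
    """Phase 1: trigger every process and collect success/failure replies.
    Phase 2: commit if all succeeded, otherwise abort."""
    # list comprehension (not a generator) so every process is triggered
    ok = all([p.take_tentative_checkpoint() for p in processes])
    for p in processes:
        if ok:
            p.commit()  # tentative checkpoint becomes the permanent one
        else:
            p.abort()   # discard tentative; keep previous permanent one
    return ok

class Proc:
    """Toy process that can be told to fail its checkpoint."""
    def __init__(self, fails=False):
        self.fails, self.state = fails, "none"
    def take_tentative_checkpoint(self):
        self.state = "tentative"
        return not self.fails
    def commit(self):
        self.state = "permanent"
    def abort(self):
        self.state = "aborted"

procs = [Proc(), Proc(fails=True), Proc()]
print(run_checkpoint_round(procs))  # False: one checkpoint failed
print([p.state for p in procs])     # all aborted; previous checkpoints kept
```

The essential property is that no process discards its previous permanent checkpoint until the coordinator has heard success from everyone.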
Sockets present access to streamed communication, which appears to the user as a two-way continuous stream of bytes (hence "virtual circuit"); the sender can write any number of bytes and the receiver can read any number of bytes. Reliability is built in: acknowledgements, resending, and duplicate elimination are all handled transparently by the kernel. PVM's messaging routines, which are user-level code, thus cannot modify socket behavior to append CCNs to messages.

CCNs can easily be added to PVM's own message header, and this allows a recipient to determine when a message should be deferred. This is not sufficient, however. The kernel performs acknowledgements, and CCNs need to be appended to them, too: when doing a send, as soon as the message arrives at the destination machine, that machine's kernel sends an acknowledgement back and the sender regards the message as successfully received. This creates the potential for sent-but-not-received inconsistencies.

3.2 The Solution

The best solution is sender-based logging. The sender simply saves messages, and in the event of a restart after a failure, all processes go through a special synchronization phase in which each process announces the last message it has seen, and any messages which have been lost by that process are retransmitted. This in fact amounts to allowing the messaging routines to control acknowledgements. Garbage collection is easily achieved by exchanging vectors containing last-seen values, and this can take advantage of the fact that processes must already perform periodic synchronization in order to checkpoint. Sender-based logging requires only minor modifications to the described protocol, and because the buffer of sent messages is saved with the checkpoint, this scheme remains resilient to an arbitrary number of failures. Additional overhead during failure-free operation consists only of saving and managing the messages.
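The sender-based log just described can be sketched as follows. SenderLog and its methods are hypothetical names, and real PVM messages would carry far more framing; this is a model of the bookkeeping, not an implementation:

```python
# Illustrative sender-based message log (assumed names, not PVM code).
class SenderLog:
    def __init__(self):
        self.log = []       # [(seq, dest, payload)]
        self.next_seq = 0

    def send(self, dest, payload, transport):
        entry = (self.next_seq, dest, payload)
        self.log.append(entry)  # save the message before handing it off
        transport(entry)
        self.next_seq += 1

    def retransmit(self, dest, last_seen, transport):
        # Recovery: the restarted receiver announced the last sequence
        # number it saw; resend everything after that.
        for entry in self.log:
            seq, d, _ = entry
            if d == dest and seq > last_seen:
                transport(entry)

    def collect_garbage(self, last_seen_by_dest):
        # Drop entries the destination has already confirmed seeing
        # (the exchanged last-seen vectors mentioned above).
        self.log = [(s, d, p) for (s, d, p) in self.log
                    if s > last_seen_by_dest.get(d, -1)]

delivered = []
log = SenderLog()
for msg in ["a", "b", "c"]:
    log.send("p1", msg, delivered.append)

# p1 restarts having seen only seq 0; the lost messages are replayed
replayed = []
log.retransmit("p1", last_seen=0, transport=replayed.append)
print([p for (_, _, p) in replayed])  # ['b', 'c']
```

Because the log travels with the sender's checkpoint, a replay after restart always begins from a log that is itself consistent with the recovery line.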
4 Related Work

Koo and Toueg [8] describe consistent checkpointing using a two-phase-commit protocol, with a focus on minimizing the number of processes which must take part in a checkpoint. They prove that a resilient checkpoint algorithm must store at least two stable checkpoints.

^1 The coordinator uses some form of timer to initiate the checkpoints at regular intervals.

Elnozahy, Johnson, and Zwaenepoel [5], whose two-phase-commit protocol we study, demonstrate excellent performance for consistent checkpointing under the V-System. Their system exhibits less than 1% overhead for a wide range of compute-intensive applications representing a range of memory and communication behavior, and at most 6% overhead for any application tested; these figures are for checkpoints taken every 2 minutes. Their implementation relies on kernel support, which also facilitates incremental checkpointing, in which only the state which has changed since the last checkpoint is written to disk. While a user-level implementation such as PVM with checkpointing cannot hope to achieve comparable performance, it has the advantage of platform-independence among UNIX systems, like PVM itself.

Leon, Fisher, and Steenkiste [9] have implemented Fail-safe PVM, which takes coordinated checkpoints by issuing a global barrier and waiting for a quiescent state, thus forcing a consistent state. This sync-and-stop protocol, however, imposes a heavy penalty on failure-free performance. Our method of asynchronous checkpointing using a two-phase-commit protocol avoids such delays; the only sources of overhead are contention due to coordination, contention at the disk from writing of the checkpoint, and managing the sender logs.

Borg et al. [2] describe an implementation of UNIX, TARGON/32, which provides completely transparent fault-tolerance, but the system relies on the existence of atomic three-way message delivery. TARGON/32 provides fault-tolerance for distributed applications by syncing each primary process with an inactive backup process on a different node. It requires all processes wishing to take advantage of this service to communicate only via the supplied messaging routines. Reported overhead is 10%. Recovery is automatic if the failure does not also involve failure at the node with a needed backup process.
Elnozahy and Zwaenepoel [6] analyze the performance of consistent checkpointing, message logging, and Manetho (a hybrid of uncoordinated checkpointing and message logging) [7], and conclude that for applications where output latency is not critical, consistent checkpointing is superior to the other methods.

5 Conclusions

PVM provides a convenient and inexpensive way to turn a network of workstations into a distributed computer, but lacks an efficient means of fault-tolerance. We describe an aggressive, asynchronous two-phase-commit protocol for consistent checkpointing, originally suggested by Elnozahy et al. [5]. For correct user-level implementation under PVM, this protocol requires sender-based logging. Such a system will provide the superior performance of asynchronous consistent checkpointing in a package which can be installed by a user on any UNIX system.

6 Acknowledgements

This work was supported by the Rice Undergraduate Scholars Program, a program encouraging undergraduate research and operated by the Rice University School of Natural Sciences. Additional support was provided by the Rice University Department of Computer Science. The author would like to thank Dr. Willy Zwaenepoel of Rice's Department of Computer Science for his support and guidance, and Marina Drobnic and Larry Weiner of Rice's Department of Electrical and Computer Engineering, who did heroic work in implementing a prototype of checkpointing for PVM [4].

References

[1] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V. Sunderam. A users' guide to PVM Parallel Virtual Machine. Oak Ridge National Laboratory, ORNL/TM-11826, September 1992.

[2] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, February 1989.

[3] K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1):63-75, February 1985.

[4] M. Drobnic, L. D. Weiner, and K. Skadron. Atlas: consistent checkpointing under UNIX. Final class presentation, Computer Science 520, Rice University, December 1993.

[5] E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In Proceedings of the Eleventh Symposium on Reliable Distributed Systems, October 1992.

[6] E. N. Elnozahy and W. Zwaenepoel. On the use and implementation of message logging. To be presented in Proceedings of the 1994 Fault Tolerant Computing Systems Symposium, June 1994.

[7] E. N. Elnozahy and W. Zwaenepoel. Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit. IEEE Transactions on Computers, 41(5):526-531.

[8] R. Koo and S. Toueg. Checkpointing and rollback-recovery for distributed systems. IEEE Transactions on Software Engineering, SE-13(1):23-31, January 1987.

[9] J. Leon, A. L. Fisher, and P. Steenkiste. Fail-safe PVM: a portable package for distributed programming with transparent recovery. Technical Report CMU-CS-93-124, Carnegie Mellon University, February 1993.

[10] R. D. Schlichting and F. B. Schneider. Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1(3):222-238, August 1983.