A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations

Sébastien Monnet, IRISA, Sebastien.Monnet@irisa.fr
Christine Morin, IRISA/INRIA, Christine.Morin@irisa.fr
Ramamurthy Badrinath, Hewlett-Packard ISO, Bangalore, India, badrinar@india.hp.com

Abstract

Code coupling applications can be divided into communicating modules that may be executed on different clusters in a cluster federation. As a cluster federation comprises a large number of nodes, there is a high probability of a node failure. We propose a hierarchical checkpointing protocol that combines a synchronized checkpointing technique inside clusters with a communication-induced technique between clusters. This protocol fits the characteristics of a cluster federation (large number of nodes, high-latency and low-bandwidth networking technologies between clusters). A preliminary performance evaluation performed using a discrete event simulator shows that the protocol is suitable for code coupling applications.

Key words: Cluster Federation, Checkpointing and Recovery, Fault Tolerance, Parallel Application, Code Coupling.

1 Introduction

Cluster federations contain a large number of nodes and are heterogeneous. Nodes in a cluster are often linked by a SAN (System Area Network), while clusters are linked by LANs (Local Area Networks) or WANs (Wide Area Networks). The applications running on such architectures are often divided into communicating modules. These modules may need to run on different clusters for various reasons: security, hardware or software constraints, or because the application needs a very large number of nodes. An example of such applications is one coupling several simulation codes that sometimes need to communicate with each other. (This work was done while R. Badrinath was a visiting researcher at IRISA, on leave from IIT Kharagpur.)
The literature describes many checkpoint/restart protocols suitable for clusters, but very little work has been done to adapt these protocols to large-scale architectures such as cluster federations and to take advantage of the communication patterns of code coupling applications. We propose a hierarchical checkpointing protocol which has a limited impact on the performance of code coupling applications executed in a cluster federation. As the SANs used in clusters exhibit low latency and high throughput, a coordinated checkpointing approach can be used to efficiently checkpoint the state of the processes executing inside a cluster. As the networks interconnecting the clusters of a federation are LANs or WANs, with a much higher latency and a lower bandwidth than SANs, a coordinated checkpointing approach cannot be used at the federation level. The hierarchical protocol we propose relies on a communication-induced checkpointing strategy to build a global checkpoint of a code coupling application executed on several clusters of a cluster federation. A communication-induced checkpointing approach is reasonable for code coupling applications as inter-module communications are not very intensive. Our protocol, called the HC²I checkpointing protocol hereafter, has been simulated using a discrete event simulator. Preliminary results show that it works well with applications like code coupling.

The remainder of this paper is organized as follows. Section 2 presents the protocol design principles. Section 3 describes the hierarchical protocol combining coordinated and communication-induced checkpointing techniques in a cluster federation. Section 4 shows a sample execution with the HC²I checkpointing protocol. Section 5 gives a brief description of the discrete event simulator we used for the evaluation of the HC²I protocol and analyzes preliminary performance results. In Section 6, our work is compared with related work. Section 7 concludes.

2 Design Principles

2.1 Models and assumptions

Application model. We consider parallel applications designed using the code coupling model. Processes of this kind of application are divided into groups (modules). Processes inside the same group communicate a lot, while communications between processes belonging to different groups are limited. Inter-group communications may be pipelined as in Figure 1, or they may consist of exchanges between two modules, for example.

Figure 1. Application model

System model. We assume the following system model. As shown in Figure 2, a node is a system-level module that implements the protocol. It is able to save process states, to catch every inter-process message, and to communicate with other nodes for protocol needs.

Figure 2. System model

Architecture model and network assumptions. We assume a cluster federation is a set of clusters interconnected by a WAN, inter-cluster links being either dedicated, or even the Internet, or a LAN. Such an architecture is suitable for the code coupling application model described above. Each group of processes may run in a cluster where network links have small latencies and large bandwidths (SAN). We assume that a sent message will be received in an arbitrary but finite lapse of time. This means that the network is reliable: it does not lose messages. This assumption implies that the fault tolerance mechanism has to take care of in-transit messages, since they are assumed not to be lost.

Failure assumptions. We assume that only one fault occurs at a time. However, the protocol can be extended to tolerate simultaneous faults, as explained in Section 7. The failure model is fail-stop: when a node fails, it does not send messages anymore. The protocol takes into account neither omission nor Byzantine faults.

2.2 Checkpointing large scale applications in cluster federations

Dependencies and consistent state. The basic principle of all backward error recovery techniques is to periodically store a consistent state of the application in order to be able to restart from it in the event of a failure. A process state consists of all the data the process needs to be restarted (i.e. the virtual memory, the list of opened files, sockets, etc.). A parallel application state is defined as the set of all its process states. Consistent means that there are neither in-transit messages (sent but not received) nor ghost messages (received but not sent) in the set of stored process states. Messages generate dependencies. For example, Figure 3 presents the execution of two processes which both store their local states (S1 and S2 respectively). A message m is sent from process 1 to process 2. If the execution is restarted from the set of states S1/S2, the message m will have been received by process 2 but not sent by process 1 (ghost message). Process 1 will send m again to process 2, which is not consistent: the state S2 can depend on the content of m, which may itself depend on the state S1.

Figure 3. Dependency between two states

The most recent record of a consistent state is called the recovery line.

Checkpointing methods. The recovery line can be found at checkpoint time (i.e. when the states are stored). This method is called coordinated checkpointing: it implies a two-phase commit protocol during which application messages are frozen. With the independent checkpointing method, each process of a parallel application can store its local state without any synchronization, and the recovery line is computed at rollback time. Quasi-synchronous methods also exist: for example, application messages can be used to piggy-back additional information so that the receiver knows whether it needs to store its local state. This last method is called communication-induced checkpointing. [5] provides detailed information about these different checkpointing techniques.

Large scale checkpointing. The large number of nodes and the network performance between clusters do not allow a global synchronization. An independent checkpointing mechanism does not fit either: tracking dependencies to compute the recovery line at rollback time would be very hard, and nodes may roll back to very old checkpoints (domino effect). If we intend to log inter-cluster communications (to avoid inter-cluster dependencies), we need the piecewise deterministic (PWD) assumption. The PWD assumption means that we are able to replay a parallel execution in a cluster so that it produces exactly the same messages as the first execution. This assumption is very strong: replaying a parallel execution means detecting, logging and replaying all non-deterministic events, which is very difficult.

Hierarchical checkpointing. Inside a cluster we use a coordinated checkpointing method. It ensures that the stored state (the cluster checkpoint) is consistent. Coordinated checkpointing is possible inside a cluster as nodes are interconnected with a high performance network (low latency and large bandwidth). Coordinated checkpointing is a well-established technique [7], [], [], [] which is relatively easy to implement. A cluster level checkpoint is called a CLC hereafter.
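The intra-cluster mechanism, a two-phase commit during which application messages are frozen, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class and method names are invented, and the per-node sequence number (SN) is the one the protocol maintains cluster-wide:

```python
class Node:
    """One cluster node taking part in a cluster-level checkpoint (CLC)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.sn = 0            # CLC sequence number, identical cluster-wide
        self.frozen = False    # True between the CLC request and the commit
        self.queued = []       # application messages held while frozen
        self.delivered = []

    def on_clc_request(self):
        # Phase 1: freeze application messages and acknowledge with the
        # saved local state, so no intra-cluster dependency can form.
        self.frozen = True
        return {"node": self.node_id, "sn": self.sn}

    def on_commit(self):
        # Phase 2: bump the SN and release the messages queued meanwhile.
        self.sn += 1
        self.frozen = False
        self.delivered.extend(self.queued)
        self.queued.clear()

    def receive(self, msg):
        (self.queued if self.frozen else self.delivered).append(msg)


def take_clc(nodes):
    """Initiator side: broadcast the request, gather acks, broadcast commit."""
    cluster_checkpoint = [n.on_clc_request() for n in nodes]
    for n in nodes:
        n.on_commit()
    return cluster_checkpoint
```

Because every node increments its SN only at commit time, the SN is identical on all nodes of a cluster outside the two-phase commit.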
The assumption that the number of inter-cluster messages is low leads us to use a communication-induced method between clusters. This means that each cluster takes CLCs independently, but information is added to each inter-cluster communication. It may lead the receiver to take a CLC (called a forced CLC) to ensure the progress of the recovery line. Therefore we propose a hierarchical protocol combining coordinated and communication-induced checkpointing (HC²I).

3 Description of the HC²I Checkpointing Protocol

This section presents the HC²I checkpointing protocol; the algorithms can be found in [6].

3.1 Cluster level checkpointing

In each cluster, a traditional two-phase commit protocol is used. An initiator node broadcasts (in its cluster) a CLC request. All the cluster nodes acknowledge the request, then the initiator node broadcasts a commit. Between the request and the commit messages, application messages are queued to prevent intra-cluster dependencies. In order to be able to retrieve CLC data in case of a node failure, each node records its part of each CLC both locally and in the memory of another node in the cluster. Because of this stable storage implementation, only one simultaneous fault per cluster is tolerated.

Each CLC is numbered. Each node in a cluster maintains a sequence number (SN), incremented each time a CLC is committed. This ensures that the sequence number is the same on all the nodes of a cluster (outside the two-phase commit protocol). The SN is used for inter-cluster dependency tracking. Indeed, each cluster takes its CLCs periodically, independently from the others.

3.2 Federation level checkpointing

In our application model, communications between two processes in different clusters may appear. This generates dependencies between CLCs taken in different clusters. These dependencies need to be tracked in order to allow the application to be restarted from a consistent state.
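The receiver-side forced-CLC decision, which the protocol formalizes below with the Direct Dependencies Vector (DDV), can be sketched as follows; the function name and the DDV-as-dictionary layout are illustrative assumptions:

```python
def on_intercluster_message(ddv, sender_cluster, piggybacked_sn):
    """Return True if the incoming inter-cluster message must force a CLC.

    ddv[i] holds the last SN received from cluster i (0 if none).  A CLC is
    forced only when the sender's cluster has committed a CLC since its
    last message to us, i.e. the piggy-backed SN exceeds our DDV entry."""
    if piggybacked_sn > ddv.get(sender_cluster, 0):
        ddv[sender_cluster] = piggybacked_sn
        # In the protocol, the DDV update initiates a forced CLC, and the
        # message is delivered only once that CLC is committed.
        return True
    return False
```

With this rule, repeated messages from a cluster that has not checkpointed in between never force redundant CLCs, which is exactly the optimization motivating the forced-CLC condition.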
Forcing a CLC in the receiver's cluster for each inter-cluster application message would work, but the overhead would be huge, as it would force useless checkpoints. In Figure 4, the receiving cluster takes two forced CLCs (the filled ones) at message reception, and the application takes the messages into account only when the forced CLC is committed. The first forced CLC is useful: in the event of a failure, a rollback to the pair of CLCs preceding m1 is consistent (m1 would be sent and received again). On the other hand, the second forced CLC is useless: the sending cluster has not stored any CLC between its two message sendings, so in the event of a failure it will have to roll back to its previous CLC, which will force the receiving cluster to roll back as well. The second forced CLC would have been useful only if the sending cluster had stored a CLC after sending m1 and before sending m2.

Figure 4. Limitation of the number of forced CLCs

Thus, a CLC is forced in the receiver's cluster only when a CLC has been stored in the sender's cluster since the last communication from the sender's cluster to the receiver's cluster. This is controlled using the SN (introduced in Section 3.1). The current cluster's sequence number is piggy-backed on each inter-cluster application message. To be able to decide whether a CLC needs to be initiated, all the processes in each cluster need to keep the last received sequence number from each other cluster. All these sequence numbers are stored in a DDV (Direct Dependencies Vector, []). More formally, let DDV_j[i] be the i-th DDV entry of cluster j, and SN_i the sequence number of cluster i. For a cluster j:
if i = j, DDV_j[i] = SN_j;
if i ≠ j, DDV_j[i] = last received SN_i (0 if none).
Note that the size of the DDV is the number of clusters in the federation, not the number of nodes. In order to have the same DDV and SN on each node inside a cluster, we use the synchronization induced by the CLC two-phase commit protocol. Each time the DDV is updated, a forced CLC is initiated, which ensures that all the nodes of a cluster taking a CLC timestamp it with the same DDV at commit time.

3.3 Logs to avoid huge rollbacks

Coordinated checkpointing implies rolling back the entire cluster of a faulty node. We want to limit the number of clusters that roll back: if the sender of a message does not roll back while the receiver does, the sender's cluster does not need to be forced to roll back. When a message is sent outside a cluster, the sender logs it optimistically in its volatile memory (logged messages are used only if the sender does not roll back). The message is acknowledged with the receiver's SN, which is logged along with the message itself. The next section explains which messages are replayed in the event of a failure.

3.4 Rollback

When a node failure is detected, the cluster rolls back to its last stored CLC (the description of the failure detector is outside the scope of this paper). One node in each other cluster of the federation receives a rollback alert.
The alert contains the faulty cluster's SN, which corresponds to the CLC to which it rolls back. When a node receives such a rollback alert from another cluster with its new SN, it checks whether its own cluster needs to roll back by comparing its DDV entry corresponding to the faulty cluster to the received SN. If the former is greater than or equal to the latter, its cluster needs to roll back to the first (i.e. the oldest) CLC whose DDV entry corresponding to the faulty cluster is greater than or equal to the received SN. The node that received the alert initiates the rollback. If a cluster needs to roll back due to a received alert, it sends a rollback alert containing its own new SN to all the other clusters. This is how the recovery line is computed. Even if its cluster does not need to roll back, a node receiving a rollback alert broadcasts it in its cluster. Logged messages sent to nodes in the faulty cluster that were acknowledged with a SN greater than the alerted one (or not acknowledged at all) are then resent.

Our communication-induced mechanism implies that clusters need to keep multiple CLCs and logged messages. These need to be garbage-collected.

3.5 Garbage collection

Our protocol needs to store multiple CLCs in each cluster in order to compute the recovery line at rollback time. The memory cost may become important. Periodically, or when a node's memory saturates, a garbage collection is initiated. Our garbage collector is centralized. The initiating node asks one node in each cluster to send back the list of all the DDVs associated with its stored CLCs. It then simulates a failure in each cluster and keeps the smallest SN to which each cluster of the federation might roll back. It sends a vector containing all these smallest SNs to one node in each cluster, which broadcasts it in its cluster. Each node removes the CLCs whose own-cluster DDV entry is smaller than the smallest SN associated to its cluster (received in the vector).
Nodes also remove logged messages that are acknowledged with a SN smaller than the smallest SN of the receiver's cluster.

4 Example

Figure 5 shows a sample execution on three clusters. It is composed of three successive snapshots of the execution. On each snapshot, execution time goes from left to right and each horizontal line represents a parallel execution on a cluster. The boxes stand for the CLCs, the darker ones being forced CLCs; the corresponding DDVs are embedded in the CLC boxes. The first snapshot shows a normal execution until a failure appears in one of the clusters. Notice that each cluster stores a first CLC, which is the beginning of the application. The first message, m1, is sent along with the SN of its sender's cluster. When receiving m1, the destination cluster compares the received SN with its DDV entry for the sender's cluster; the received SN is greater, which forces it to take a CLC before delivering m1 to the application level. When receiving m2, its destination cluster does not have to initiate a CLC: the received SN is equal to its DDV entry for the sender's cluster. As for m1, we see that m3, m4 and m5 force CLCs on their respective receiving clusters. Notice that inter-cluster messages that force a CLC are acknowledged with the local SN + 1 (the message is delivered only after the forced CLC is committed). Logged messages are not represented to keep the figure easy to read.
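The rollback rule this example exercises can be sketched as follows. Because a forced CLC commits before the triggering message is delivered, the oldest stored CLC whose DDV entry for the faulty cluster reaches the alerted SN captures a state taken before any soon-to-be-invalidated message was delivered. The data layout is an illustrative assumption:

```python
def rollback_target(stored_clcs, faulty_cluster, alert_sn):
    """stored_clcs: this cluster's CLCs, oldest first, each timestamped
    with the DDV it carried at commit time.  Returns the CLC to restore,
    or None if the cluster does not depend on the rolled-back execution."""
    dependent = [clc for clc in stored_clcs
                 if clc["ddv"].get(faulty_cluster, 0) >= alert_sn]
    # Restore the first (oldest) dependent CLC: it was committed before
    # the first message with SN >= alert_sn was delivered.
    return dependent[0] if dependent else None
```

A cluster whose stored DDVs never reach the alerted SN (for instance one that never received messages from the faulty cluster, so all its entries are 0) gets None and does not roll back, exactly as in the sample execution.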

Figure 5. HC²I checkpointing protocol sample (a mark on a message line indicates that the message is delivered to the application layer only after the forced CLC is committed)

When the fault is detected, the whole faulty cluster rolls back to its last stored CLC and takes the SN of that CLC as its new SN. It then sends a rollback alert carrying this SN (second snapshot). One of the other clusters finds no DDV entry for the faulty cluster greater than or equal to the received SN among the DDVs stored with its CLCs, so it does not need to roll back. The other cluster has to roll back to the first CLC whose associated DDV contains an entry for the faulty cluster greater than or equal (equal, in this sample) to the received SN. That cluster in turn sends an alert with its own new SN (third snapshot). The cluster that has never received messages from it has all the corresponding DDV entries equal to 0, so it does not need to roll back, while the remaining cluster has to roll back to its last CLC whose DDV depends on the alerted SN. It sends a rollback alert with its new SN, but no cluster has to roll back anymore (thanks to the stored DDV lists).

5 Evaluation

To evaluate the protocol, a discrete event simulator has been implemented. We first evaluate the protocol overhead in terms of network and storage cost, then we observe what happens with different communication patterns. Finally, the garbage collector's effectiveness and cost are evaluated.

5.1 Simulator

We use the C++SIM library [] to write the simulator. This library provides generic threads, a scheduler, random flows and classes for statistical analysis. Our simulator is configurable: the user provides three files, namely a topology file, an application file and a timer file. The topology file specifies the number of clusters, the number of nodes in each cluster, the bandwidth and latency inside each cluster and between clusters (represented as a triangular matrix), and the federation MTBF (Mean Time Between Failures).
The application file contains, for each cluster, the mean computation time for each node, the communication patterns between computations (represented by probabilities between nodes) and the application's total time. Finally, the timer file contains the delays for the protocol timers of each cluster (delays between two CLCs, garbage collection, ...).

The simulator is composed of four main threads. The thread Nodes takes the identity of all the nodes, one by one. The thread Network stores the messages and computes their arrival times. The thread Timers simulates all the different timers. The thread Controller controls the other threads (launches them, displays results at the end, ...). Communication between threads is done through shared variables.

The simulator can be compiled with different trace levels. With the highest trace level, we can observe each node's time-stamped actions (sends, receives, timer interruptions, log searches, ...). The lowest simulator output is statistical data, such as message counts inside clusters and between each pair of clusters, the number of stored CLCs, the number of protocol messages, ...

5.2 Network traffic and storage cost induced by the checkpointing protocol

Evaluating network traffic and storage cost is very hard: it depends on how the protocol is tuned. If the frequency of unforced CLCs is low in a cluster, the SNs will not grow too fast, so inter-cluster messages from this cluster have a low probability of forcing CLCs. Reducing the protocol overhead then becomes easy. If no CLC is initiated, the only protocol cost consists in logging inter-cluster messages optimistically in volatile memory and transmitting an integer (the SN) with them. There is also a little overhead due to message interception (between the network interface and the application). To take advantage of the protocol, the timer that regulates the frequency of unforced CLCs in a cluster should be set to a value that is much smaller than the MTBF of this cluster.

The first simulation evaluates how many CLCs the protocol forces. The simulator simulates two clusters. In both clusters the network is Myrinet-like, while the clusters are linked by Ethernet-like links with a much higher latency and a lower bandwidth. The application's total execution time is several hours. There are lots of communications inside each cluster and few between them. This could correspond, for example, to a simulation running on cluster 1 and a trace processor on cluster 2. Table 1 displays the number of messages (intra- and inter-cluster).

Table 1. Application messages
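The tuning tradeoff just described, where the unforced-CLC timer drives the protocol overhead, can be illustrated with a small sketch of the timer behaviour. This is a hypothetical helper, not part of the simulator, assuming (as stated below for the stored-CLC bound) that the periodic timer is reset whenever a forced CLC is established:

```python
def unforced_clc_count(total_time, timer, forced_times):
    """Count the unforced CLCs committed during an execution of length
    total_time, given the delay between unforced CLCs (timer) and the
    instants at which forced CLCs are established; every CLC, forced or
    not, resets the periodic timer."""
    last, count, i = 0.0, 0, 0
    forced = sorted(forced_times)
    while True:
        nxt = last + timer
        while i < len(forced) and forced[i] < nxt:
            last = forced[i]       # a forced CLC resets the periodic timer
            nxt = last + timer
            i += 1
        if nxt > total_time:
            return count
        count += 1                 # an unforced CLC fires
        last = nxt
```

For a 600-minute run with a 60-minute timer, 10 unforced CLCs fire; a single forced CLC at minute 30 shifts the schedule so that only 9 fire, keeping the total number of CLCs below total time divided by the timer delay plus the number of forced CLCs.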
Graph 6 and 7 show the number of forced and unforced committed CLCs in each cluster according to the delay between unforced CLCs in cluster (x axis, in minutes). Cluster delay between CLCs is set to infinite. Cluster stores some forced CLCs (8) because of the communications from cluster. This number of forced CLCs is constant - there are few messages from cluster. Notice that the total number of stored CLCs is smaller than totalcomputationtime delaybetweenclcs + number of forced CLCs because the timer is reset when a Number of CLCs realy committed 9 8 7 6 5 6 8 Delay Between CLCs (timer) in Cluster Figure 6. Number of CLCs in Cluster Interval Between CLCs Influence in Cluster Unforced CLCs Forced CLCs 6 8 Delay Between CLCs (timer) in Cluster Figure 7. Number of CLCs in Cluster forced CLC is established. Clusters store a few more CLCs, but they are placed better (in time). Cluster does not store any unforced CLCs as its timer is set to infinite, but it stores some forced CLCs induced by incoming communications from cluster. The number of these forced CLCs is proportional to the number of CLCs stored in cluster, because numerous messages come from cluster. One may want to store more CLCs in cluster, if this cluster is intensively used and computation time is expensive for example. Graph 8 shows that cluster (which delay between CLCs timer is set to minutes) do not store more CLCs even if cluster timer is set to 5 minutes. This is thanks to the low number of messages from cluster to cluster. 5. Communication patterns To better understand the influence of the communications patterns on the checkpointing protocol, Graph 9 shows what happens when the number of messages from cluster to cluster increases. Both cluster delay between CLCs timers are set to minutes. The application is the same as

Number of CLCs realy committed 5 5 5 5 Increasing the Number of CLCs in Cluster Cluster Total Cluster Total Cluster Forced Cluster Cluster Cluster Cluster Before After Before After 8 8 5 5 Table. Number of stored CLCs 5 5 5 5 5 55 6 Delay Between CLCs (timer) in Cluster Figure 8. Impact of the Number of CLCs in previous section except for the number of messages from cluster to cluster that is represented on the x axis. Number of committed CLCs 7 6 5 Communication Patterns Cluster Total Cluster Total Cluster Forced Cluster Forced 5 6 7 8 9 Number of Messages from Cluster to Cluster Figure 9. Increasing Communication from Cluster to Cluster The number of forced CLCs increases fast with the number of messages from cluster to cluster. If the two clusters communicate a lot in both ways, SNs will grow very fast and most of the messages will induce a forced CLC. The overhead of our protocol will not be good in that case. 5. Garbage collection A garbage collection has got a non negligible overhead. If N is the number of clusters in the federation, each garbage collection implies: N- inter-cluster requests; N- intercluster responses which contain the list of all the DDVs associated to the stored CLCs in a cluster; N- inter cluster collect requests; broadcast in each cluster. However, our hybrid checkpointing protocol may store multiple CLCs in each cluster. They can become very numerous. It also logs every inter-cluster application message. We evaluate the efficiency of the garbage collector. For the Cluster (before) 8 5 8 Cluster (after) Cluster (before) 5 8 78 6 Cluster (after) Cluster (before) 5 8 78 6 Cluster (after) Table. Number of stored CLCs sample above, in the case of messages sent from cluster to cluster, without any garbage collection, 6 CLCs are stored in each cluster. It means that each node in the federation stores 6 local states (its own 6 local states and the ones of one of its neighbor, because of the stable storage implementation). 
If a garbage collection is launched every hours, the maximum number of stored CLCs just after a garbage collection is per cluster in this sample. Only the oldest CLCs are removed; as explained in Section .5, rollbacks will therefore not be too deep. The maximum number of logged messages during the execution in the sample above is in both clusters. Table 1 shows, for each garbage collection, the number of CLCs stored just before and just after the collection. To see what happens with more clusters, a second experiment simulates an application that runs on three clusters. Two of the clusters have the same configuration as above, and the third is a clone of one of them. Approximately messages leave and arrive in each cluster. Table 2 shows, for each garbage collection, the number of CLCs stored just before and just after the collection. After each garbage collection only CLCs are kept. Thanks to the communication-induced method, the recovery line progresses. A tradeoff has to be found between the frequency of garbage collections and the number of CLCs stored.

6 Related Work

Many papers on checkpointing methods can be found in the literature. However, most previous work is related to clusters or other small-scale architectures. Many systems are implemented at the application level, partitioning the application processes into steps. Our protocol is implemented at the system level, so programmers do not need to write specific code. Moreover, the protocol in this paper takes the cluster federation architecture into account. This section presents several works that are close to ours.

Integrating fault-tolerance techniques in grid applications. [8] does not present a protocol for fault tolerance; rather, it describes a framework that provides hooks to help developers incorporate fault tolerance algorithms. The authors have implemented several known fault tolerance algorithms, and the framework seems to fit well with large scale. However, these algorithms are implemented at the application level and are made for object-based grid applications.

MPICH-V. [3] describes a fault-tolerant implementation of MPI designed for large-scale architectures. All communications are logged and can be replayed. This avoids all dependencies, so that a faulty node rolls back but the others do not. However, this means that strong assumptions about determinism have to be made. Our protocol does not need any assumption about the application's determinism; moreover, it takes advantage of the fast network available inside the clusters.

Hierarchical coordinated checkpointing. The work presented in [9] is the closest to ours. It proposes a coordinated checkpointing method, based on the two-phase commit protocol, in which the synchronization between clusters (linked by slower links) is relaxed. In [9], it is the coordinated checkpointing mechanism itself that is relaxed between clusters; it is not a hybrid protocol like ours. Our protocol is more relaxed: it degenerates to independent checkpointing if there are no inter-cluster messages.

7 Conclusion and Future Work

This paper introduces a hierarchical checkpointing protocol suitable for code coupling applications. It relies on a hybrid method combining coordinated checkpointing inside clusters and communication-induced checkpointing between clusters.
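The inter-cluster dependency tracking that drives both the forced checkpoints and the garbage collector rests on direct dependency vectors (DDVs). A minimal sketch of how such a vector might be maintained is given below; the function name and dictionary representation are illustrative assumptions, not the paper's data structure.

```python
def update_ddv(ddv, sender_cluster, piggybacked_sn):
    """Illustrative sketch of direct dependency vector (DDV) maintenance.
    On receipt of an inter-cluster message, record the highest sequence
    number (SN) seen from the sending cluster, i.e. the latest CLC of that
    cluster on which the local state now directly depends. The garbage
    collector exchanges these vectors to decide which old CLCs can be
    discarded without losing a consistent recovery line."""
    ddv[sender_cluster] = max(ddv.get(sender_cluster, 0), piggybacked_sn)
    return ddv
```

Because only the direct dependency is recorded here, transitive dependencies are not visible to the receiver; this is the limitation behind the proposed improvement of piggybacking the whole DDV instead of a single SN.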
The protocol can be tuned according to the underlying network, the application's communication patterns and its needs. The dependency tracking mechanism could be improved by adding some transitivity (by sending the whole DDV instead of the SN) in order to take fewer forced checkpoints. The user should be able to choose the degree of replication in the stable storage implementation inside a cluster (in order to tolerate more than one fault per cluster). The protocol should tolerate simultaneous faults in different clusters (the garbage collector should take care of this). Finally, the garbage collector could be made more distributed. We also need to implement the protocol on a real system to validate it.

Acknowledgements

The authors wish to thank Gabriel Antoniu for his careful proof-reading of this paper.

References

[1] A. Agbaria and R. Friedman. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations. In The Eighth IEEE International Symposium on High Performance Distributed Computing, pages 67-76, August 1999.

[2] R. Badrinath and C. Morin. Common mechanisms for supporting fault tolerance in DSM and message passing systems. Technical report, July.

[3] G. Bosilca, A. Bouteiller, F. Cappello, S. Djailali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Proceedings of the IEEE/ACM SC Conference, Baltimore, Maryland, November 2002.

[4] M. Costa, P. Guedes, M. Sequeira, N. Neves, and M. Castro. Lightweight Logging for Lazy Release Consistent Distributed Shared Memory. In Operating Systems Design and Implementation, pages 59 7, October 1996.

[5] M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys (CSUR), 34(3):375-408, September 2002.

[6] S. Monnet, C. Morin, and R. Badrinath. A Hierarchical Checkpointing Protocol for Parallel Applications in Cluster Federations. Research report 59, IRISA, Rennes, France, January.

[7] C. Morin, A.-M. Kermarrec, M. Banâtre, and A. Gefflaut. An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures. IEEE Transactions on Computers, 49(5), May 2000.

[8] A. Nguyen-Tuong. Integrating Fault-Tolerance Techniques in Grid Applications. PhD thesis, Faculty of the School of Engineering and Applied Science, University of Virginia, August.

[9] H. Paul, A. Gupta, and R. Badrinath. Hierarchical Coordinated Checkpointing Protocol. In International Conference on Parallel and Distributed Computing Systems, November.

[10] J. Rough and A. Goscinski. Exploiting Operating System Services to Efficiently Checkpoint Parallel Applications in GENESIS. In Proceedings of the 5th IEEE International Conference on Algorithms and Architectures for Parallel Processing, October.

[11] C++SIM. http://cxxsim.ncl.ac.uk.