A SURVEY AND PERFORMANCE ANALYSIS OF CHECKPOINTING AND RECOVERY SCHEMES FOR MOBILE COMPUTING SYSTEMS


International Journal of Computer Science and Communication, Vol. 2, No. 1, January-June 2011, pp.

Ruchi Tuli (1) and Parveen Kumar (2)
(1) Yanbu University College, Royal Commission for Jubail and Yanbu, Directorate General for Yanbu, P.O. Box Madinat Yanbu Al Sinaiyah, Kingdom of Saudi Arabia, tuli.ruchi@gmail.com
(2) Meerut Institute of Engineering and Technology, Meerut (India), pk223475@yahoo.com

ABSTRACT

A checkpoint is a designated place in a program at which normal processing is interrupted to preserve status information; checkpointing is the process of saving that information. Mobile computing systems often suffer from high failure rates that are transient and independent in nature. To add reliability and high availability to such distributed systems, checkpoint-based rollback recovery is one of the most widely used techniques for applications such as scientific computing, databases, telecommunications and mission-critical applications. In this paper we discuss various checkpointing schemes for mobile computing systems that recover from system failures, which bring down running services and interrupt the computational tasks or transactions being executed. We also compare the checkpointing schemes on different parameters and discuss the issues specific to distributed mobile computing systems.

Keywords: mobile computing systems, coordinated checkpoint, rollback recovery, mobile host.

1. INTRODUCTION

A mobile computing system is a distributed system in which some of the nodes are mobile computers [1]. It consists of fixed nodes called Mobile Support Stations (MSSs) and a number of Mobile Hosts (MHs). A cell is the geographical area around an MSS in which it can support an MH. A Mobile Host can move freely from one cell to another, or even to an area covered by no cell.
All communication from one Mobile Host to another goes through an MSS. An MSS has both wired and wireless links: it communicates with Mobile Hosts over wireless links and with other MSSs over wired links.

Checkpointing is a fault-tolerance technique that allows a system to roll back to its most recent failure-free state when a failure occurs. By periodically invoking the checkpointing process, one can save the status of a program at regular intervals. A single failure can disturb the entire computation; with checkpoints, the computation can be restarted from the last checkpoint instead of from the beginning. Resuming computation from the last saved state is called rollback recovery. Because mobile devices communicate with one another, rolling back only the process that failed may leave the system inconsistent. So when a process rolls back to a previous failure-free state, the processes that depend on it must also roll back in order to reach a consistent global state.

1.1 Need of Checkpointing

Apart from recovering a system from failure, checkpointing is also used for debugging distributed programs and for migrating processes in a multiprocessor system. In debugging, checkpoints help monitor the state changes of a process at various instants during execution. To balance load in a distributed system, processes are usually moved from heavily loaded processors to lightly loaded ones; periodic checkpoints provide the state information needed to move a process from one processor to another.

1.2 Aspects of Checkpointing

The aspects to be considered in checkpointing are (a) the frequency of checkpointing, (b) the contents of a checkpoint, and (c) the method of checkpointing.
(a) Frequency of checkpointing: A checkpointing algorithm executes in parallel with the underlying computation, so the overhead it introduces should be minimized. Checkpointing should let a user recover without losing substantial computation after an error, which calls for frequent checkpoints and consequently significant overhead. The number of checkpoints should therefore be chosen so that both the cost of the work lost on a failure and the checkpointing overhead stay small. The right balance depends on the failure probability and the importance of the computation.

(b) Contents of a checkpoint: The state of a process has to be saved on stable storage so that the process can be restarted after an error. The state (context) includes the code, data and stack segments along with the environment and the register contents. The environment holds information about the files currently in use and their file pointers.

(c) Methods of checkpointing: The methodology used for checkpointing depends on the architecture of the system. Methods used in multiprocessor systems should incorporate explicit coordination. In a message-passing system, messages should be monitored and, if necessary, saved as part of the global context, because messages introduce dependencies among the processors.

2. SYSTEM MODEL

A message-passing distributed system is assumed. Some protocols assume that the communication subsystem is reliable and FIFO; other, simplified protocols assume that it is unreliable. Interaction with the outside world is modelled as interaction with a special process called the Outside World Process (OWP), which cannot fail, maintain state or participate in the recovery protocol.

3. CHECKPOINTING SCHEMES

A checkpoint is a local state of a process saved on stable storage. Since the processes in a mobile computing system do not share memory, a global state of the system is defined as a set of local states, one from each process. Taking checkpoints in a message-passing distributed system is complex because an arbitrary set of checkpoints cannot be used for recovery [2], [3], [4]: the set of checkpoints used for recovery must form a consistent global state. Upon a failure, checkpoint-based rollback recovery restores the system state to the most recent consistent set of checkpoints, i.e. the recovery line.
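The requirement that the checkpoints used for recovery form a consistent global state can be illustrated with a short sketch. It is only a model under stated assumptions: the `Message` record, the event positions and the `cut` dictionary are hypothetical names introduced here, not part of any protocol surveyed in this paper. A set of checkpoints is treated as inconsistent when some checkpoint records the receipt of a message that the sender's checkpoint does not record as sent.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    # Hypothetical record used only for this illustration.
    sender: str
    send_pos: int   # index of the send event in the sender's local history
    receiver: str
    recv_pos: int   # index of the receive event in the receiver's local history


def orphan_messages(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Return messages whose receipt lies inside the cut but whose send does not.

    cut[p] is the number of local events of process p covered by the
    checkpoint chosen for p.
    """
    return [
        m for m in messages
        if m.recv_pos <= cut[m.receiver] and m.send_pos > cut[m.sender]
    ]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """A set of checkpoints is consistent when it contains no such message."""
    return not orphan_messages(cut, messages)


# P1 sends m as its 3rd local event; P2 receives it as its 2nd local event.
m = Message("P1", 3, "P2", 2)
print(is_consistent({"P1": 3, "P2": 4}, [m]))  # send and receive both recorded
print(is_consistent({"P1": 2, "P2": 4}, [m]))  # receive recorded, send missing
```

The second call shows why an arbitrary set of checkpoints cannot be used: restoring it would leave P2 in a state that depends on a message P1 has not yet sent.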
Rollback recovery schemes are of two types: checkpoint based and log based. Checkpoint-based rollback recovery techniques fall into three categories: uncoordinated checkpointing, coordinated checkpointing and communication-induced checkpointing. Log-based recovery schemes are also of three types: pessimistic logging, optimistic logging and causal logging.

3.1 Uncoordinated Checkpointing

Uncoordinated checkpointing is also known as independent checkpointing. Processes do not coordinate their checkpointing activity, and each process records its local checkpoint independently [5], [6], [7]. Each process therefore has maximum autonomy in deciding when to take a checkpoint, i.e. it may take one whenever it is most convenient. This eliminates coordination overhead altogether; a consistent global state is formed only on recovery after a fault [5], by tracking the dependencies among checkpoints. If a failure occurs, the recovering process initiates rollback by broadcasting a dependency request message to collect the dependency information maintained by each process. Based on the global dependency information thus collected, the initiator calculates the recovery line and broadcasts a rollback request message containing it. Upon receiving this message, a process whose current state belongs to the recovery line simply resumes execution; otherwise it rolls back to the earlier checkpoint indicated by the recovery line. The recovery line is determined using one of the following graphs.

Rollback-Dependency Graph: Let $c_{i,x}$ denote the $x$-th checkpoint of process $P_i$; we call $x$ the checkpoint index. Let $I_{i,x}$ denote the checkpoint interval between checkpoints $c_{i,x-1}$ and $c_{i,x}$. We construct the dependency graph as follows. Each node represents a checkpoint, and a directed edge is drawn from $c_{i,x}$ to $c_{j,y}$ if either (a) $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or (b) $i = j$ and $y = x + 1$. The nodes corresponding to the states of the failed processes at the point of failure are marked, and a reachability analysis is performed from them. For each process, the most recent state that is unreachable from the failed states belongs to the recovery line.

Checkpoint Graph: Checkpoint graphs are very similar to rollback-dependency graphs except that, when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, the directed edge is drawn from $c_{i,x-1}$ to $c_{j,y}$ (instead of from $c_{i,x}$ to $c_{j,y}$).

Fig. 1: (a) Example Execution; (b) Rollback-dependency Graph; (c) Checkpoint Graph
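The rollback-dependency graph construction and the reachability analysis described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited protocol. It assumes that a node `(p, x)` stands for checkpoint x of process p, that index 0 is an initial checkpoint, that the highest index of each process is its current (volatile) state, and that a message is given as a pair "sent in interval x of process i, received in interval y of process j". The names `build_graph`, `recovery_line` and `last` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[str, int]  # (process, x) stands for checkpoint x of that process


def build_graph(last: Dict[str, int],
                messages: List[Tuple[Node, Node]]) -> Dict[Node, Set[Node]]:
    """last[p] is the index of p's current state; indices 0..last[p] are nodes."""
    edges: Dict[Node, Set[Node]] = {
        (p, x): set() for p, top in last.items() for x in range(top + 1)
    }
    for p, top in last.items():
        for x in range(top):
            edges[(p, x)].add((p, x + 1))  # rule (b): same process, y = x + 1
    for src, dst in messages:
        if src[0] != dst[0]:
            edges[src].add(dst)            # rule (a): message between processes
    return edges


def recovery_line(last: Dict[str, int],
                  messages: List[Tuple[Node, Node]],
                  failed: Iterable[str]) -> Dict[str, int]:
    edges = build_graph(last, messages)
    marked: Set[Node] = {(p, last[p]) for p in failed}  # states lost in the failure
    queue = deque(marked)
    while queue:                           # reachability from the failed states
        for nxt in edges[queue.popleft()]:
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)
    # Latest unmarked index per process; falls back to 0 if nothing is unmarked.
    return {
        p: max((x for x in range(top + 1) if (p, x) not in marked), default=0)
        for p, top in last.items()
    }


# m1 is sent in interval 2 of P0 and received in interval 1 of P1;
# m2 is sent in interval 1 of P1 and received in interval 1 of P0.
msgs = [(("P0", 2), ("P1", 1)), (("P1", 1), ("P0", 1))]
print(recovery_line({"P0": 2, "P1": 2}, msgs, failed=["P0"]))
```

In this example the failure of P0 invalidates checkpoint 1 of P1, which in turn invalidates checkpoint 1 of P0, so both processes fall back to their initial checkpoints.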

The main advantage of this technique is that each process may take a checkpoint when it is most convenient. For example, a process may reduce overhead by taking checkpoints when the amount of state information to be saved is small. There are, however, several disadvantages. First, the domino effect is possible: cascaded rollbacks may discard a large amount of useful work, possibly all the way back to the beginning of the computation [3], [4], [9]. Second, a process may take useless checkpoints that will never be part of a consistent global state; useless checkpoints incur overhead without advancing the recovery line [10]. Third, uncoordinated checkpointing forces each process to maintain multiple checkpoints and to invoke a garbage-collection algorithm periodically to reclaim the checkpoints that are no longer needed. Fourth, it is not suitable for applications with frequent output commits, because these require global coordination to compute the recovery line.

3.2 Coordinated Checkpointing

This scheme requires the processes to coordinate their checkpoints so that they form a consistent global state.
Coordinated checkpointing simplifies recovery and is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint after a failure. The scheme basically follows a two-phase commit structure [2], [9], [10]: in the first phase processes take tentative checkpoints, and in the second phase the tentative checkpoints are made permanent. Several variants of coordinated checkpointing have been proposed in the literature. A few of them are described below.

Variation-I Straightforward Approach: This is a two-phase protocol that blocks communication while the checkpointing protocol executes [11]. A coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to take a checkpoint. When a process receives the message, it stops its execution, flushes all its communication channels, takes a tentative checkpoint, and sends an acknowledgement back to the coordinator. After the coordinator has received acknowledgements from all processes, it broadcasts a commit message that completes the two-phase checkpoint protocol. On receiving the commit, a process converts its tentative checkpoint into a permanent one and discards its old permanent checkpoint, if any. The process is then free to resume execution and exchange messages with other processes. This method has two demerits: every process has to block for the entire duration of the protocol, and broadcasting twice (the checkpoint request and the commit message) incurs a large overhead.

Variation-II Non-blocking Checkpoint Coordination: This protocol is also known as the distributed snapshot protocol and was proposed by Chandy and Lamport [2]. The initiator takes a checkpoint and broadcasts a marker (i.e. a checkpoint request) to all processes. Each process takes a checkpoint upon receiving the first marker and rebroadcasts the marker to all the other processes before sending any application message.
The underlying assumption is that the communication channels are reliable and FIFO (First In First Out). If the channels are non-FIFO, the marker can be piggybacked on every post-checkpoint message. Alternatively, checkpoint indices can serve as markers: a checkpoint is triggered if the receiver's local checkpoint index is lower than the piggybacked checkpoint index.

Variation-III Checkpointing with Synchronized Clocks: The underlying principle is that loosely synchronized clocks can trigger the local checkpointing actions of all participating processes at approximately the same time, without a checkpoint initiator. A process takes a checkpoint and then waits for a period equal to the sum of the maximum deviation between clocks and the maximum time needed to detect a failure in another process in the system. If a failure occurs, it is detected within this period and the protocol is aborted. The drawback of this method is that all processes need to participate in every checkpoint, so scalability is an issue.

Variation-IV Minimal Checkpoint Coordination: This is also a two-phase protocol. In the first phase, the checkpoint initiator identifies all the processes with which it has communicated since its last checkpoint and sends them a request. Upon receiving the request, each process in turn identifies all the processes it has communicated with since its last checkpoint and sends them a request, and so on until no more processes can be identified. In the second phase, all the processes identified in the first phase take a checkpoint. The demerit of this method is that after a process takes a checkpoint, it cannot send any message until the second phase terminates successfully.
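The two phases of minimal checkpoint coordination can be sketched as a transitive search followed by a checkpoint step. This is a simplified, single-threaded illustration under assumptions that are not in the original text: `talked_to` is a hypothetical map from each process to the set of processes it has communicated with since its last checkpoint, `states` holds the current local states, and `storage` stands in for stable storage. Request messages, acknowledgements and the blocking of sends are not modelled.

```python
from collections import deque
from typing import Dict, Set


def identify_participants(initiator: str,
                          talked_to: Dict[str, Set[str]]) -> Set[str]:
    """Phase one: follow 'communicated with since last checkpoint' transitively."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        for peer in talked_to.get(current, set()):
            if peer not in seen:          # each process is asked only once
                seen.add(peer)
                queue.append(peer)
    return seen


def minimal_checkpoint(initiator: str,
                       talked_to: Dict[str, Set[str]],
                       states: Dict[str, int],
                       storage: Dict[str, int]) -> Set[str]:
    """Phase two: only the identified processes save their state."""
    group = identify_participants(initiator, talked_to)
    for name in group:
        storage[name] = states[name]      # checkpoint the current local state
        talked_to[name] = set()           # dependencies are tracked afresh
    return group


talked_to = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": {"E"}, "E": {"D"}}
states = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50}
storage: Dict[str, int] = {}
group = minimal_checkpoint("A", talked_to, states, storage)
print(sorted(group), storage)
```

Processes D and E have not exchanged messages with the initiator's group, so they are not asked to checkpoint. Skipping such processes is the saving this variant aims for.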

The main advantage of coordinated checkpointing is that storage overhead is reduced and the need for garbage collection is eliminated, since each process stores only one permanent checkpoint and at most one tentative checkpoint. In case of a failure, the system restarts from the last checkpointed state. A permanent checkpoint cannot be undone and guarantees that the computation will restart from the last checkpointed state and not from the beginning. A tentative checkpoint can be undone or changed into a permanent checkpoint. Coordinated checkpointing is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint after a failure.

The main disadvantage is the message overhead. The approach is efficient if the number of processes involved in the computation is small, say in the hundreds. If the number of processes is in the hundreds of thousands, it causes a large message overhead. Another disadvantage is the checkpointing overhead. In general, the number of I/O nodes is much smaller than the number of processes in the system. After a process sends a ready message, it queues its activities until it receives a proceed message from the coordinator, and during this time the process does no useful work. Furthermore, dumping the checkpoint of a process to an I/O node can cause a lot of contention, since one I/O node supports several processors. All of this contributes to the checkpointing overhead (i.e. the time spent by the processes without doing any useful work).

3.3 Communication Induced Checkpointing

This scheme is also known as quasi-synchronous checkpointing. Communication-induced checkpointing avoids the domino effect while allowing processes to take some of their checkpoints independently [12], [13], [14]. In these protocols, processes take two kinds of checkpoints, local and forced.
Local checkpoints can be taken independently, while forced checkpoints are taken to guarantee the eventual progress of the recovery line and to minimize useless checkpoints. As opposed to coordinated checkpointing, these protocols do not exchange any special coordination messages to determine when forced checkpoints should be taken. Instead, they piggyback protocol-specific information (generally checkpoint sequence numbers) on each application message; the receiver then uses this information to decide whether it should take a forced checkpoint to advance the global recovery line. The decision is based on the receiver determining whether past communication and checkpoint patterns can lead to the creation of useless checkpoints; a forced checkpoint is taken to break these patterns [10], [14]. Communication-induced checkpointing can be classified into two types.

Model-based Checkpointing: Model-based checkpointing relies on preventing patterns of communications and checkpoints that could result in inconsistent states among the existing checkpoints. A model is set up to detect, according to some heuristic, the possibility that such patterns are forming within the system. A checkpoint is usually forced to prevent the undesirable patterns from occurring.

Index-based Checkpointing: Index-based checkpointing works by assigning monotonically increasing indices to checkpoints, such that the checkpoints having the same index at different processes form a consistent state. The indices are piggybacked on application messages to help receivers decide when they should force a checkpoint.

3.4 Message-logging Based Checkpointing Protocols

Message-logging protocols (for example [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]) are popular for building systems that can tolerate process crash failures.
Message logging and checkpointing can be used to provide fault tolerance in distributed mobile systems in which all inter-process communication is through messages. Each message received by a process is saved in a message log on stable storage. No coordination is required between the checkpointing of different processes or between message logging and checkpointing. The execution of each process is assumed to be deterministic between received messages, and all processes are assumed to execute on fail-stop processors. When a process crashes, a new process is created in its place. The new process is given the appropriate recorded local state, and the logged messages are then replayed in the order in which the process originally received them.

All message-logging protocols require that once a crashed process recovers, its state is consistent with the states of the other processes [11], [29]. This consistency requirement is usually expressed in terms of orphan processes, which are surviving processes whose states are inconsistent with the recovered states of crashed processes. Thus, message-logging protocols guarantee that upon recovery no process is an orphan. This requirement can be enforced either by avoiding the creation of orphans during an execution, as pessimistic protocols do, or by taking appropriate actions during recovery to eliminate all orphans, as optimistic protocols do. Bin Yao et al. [29] describe a receiver-based message-logging protocol for mobile hosts, mobile support stations and home agents in a Mobile IP environment, which guarantees independent recovery. Checkpointing is used to limit log size and recovery latency. Log-based recovery protocols can be classified into three types: pessimistic logging, optimistic logging and causal logging.
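Before the three variants are discussed, the basic log-and-replay mechanism described at the start of this section can be illustrated with a small sketch. It is a toy model rather than any of the cited protocols: the class `LoggedProcess`, its methods and the arithmetic in `_apply` are hypothetical, and in-memory attributes stand in for stable storage. The only property it relies on is the one stated above, namely that execution is deterministic between received messages.

```python
from typing import List, Optional


class LoggedProcess:
    """Toy process whose state changes only by handling received messages."""

    def __init__(self) -> None:
        self.state: Optional[int] = 0
        self.stable_checkpoint: int = 0   # stands in for stable storage
        self.stable_log: List[int] = []   # messages received since the checkpoint

    def _apply(self, msg: int) -> None:
        # Deterministic and order-sensitive state transition.
        self.state = self.state * 2 + msg

    def receive(self, msg: int) -> None:
        self.stable_log.append(msg)       # log the message before processing it
        self._apply(msg)

    def checkpoint(self) -> None:
        self.stable_checkpoint = self.state
        self.stable_log.clear()           # older messages are no longer needed

    def crash(self) -> None:
        self.state = None                 # volatile state is lost

    def recover(self) -> None:
        self.state = self.stable_checkpoint
        for msg in self.stable_log:       # replay in the original receive order
            self._apply(msg)


p = LoggedProcess()
p.receive(1)
p.receive(2)
p.checkpoint()
p.receive(3)
p.receive(5)
before_crash = p.state
p.crash()
p.recover()
print(p.state == before_crash)
```

Taking a checkpoint truncates the log, which is the sense in which checkpointing limits log size and recovery latency.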

Pessimistic logging protocols are designed under the assumption that a failure can occur after any nondeterministic event in the computation. This assumption is pessimistic, since in reality failures are rare. In their most straightforward form, pessimistic protocols log the determinant of each nondeterministic event to stable storage before the event is allowed to affect the computation. These protocols implement the following property, often referred to as synchronous logging, which is a strengthening of the always-no-orphans condition: if an event has not been logged on stable storage, then no process can depend on it. In addition to logging determinants, processes also take periodic checkpoints to limit the amount of work that has to be repeated in execution replay during recovery. Should a failure occur, the application program is restarted from the most recent checkpoint, and the logged determinants are used during recovery to recreate the pre-failure execution. This property has four advantages:

(i) Processes can commit output to the outside world without running a special protocol.
(ii) Processes restart from their most recent checkpoint upon a failure, which limits the extent of execution that has to be replayed. The frequency of checkpoints can thus be determined by trading off the desired runtime performance against the desired protection of the ongoing execution.
(iii) Recovery is simplified because the effects of a failure are confined to the processes that fail. Functioning processes continue to operate and never become orphans, because a process always recovers to the state that included its most recent interaction with any other process or with the outside world. This is highly desirable in practical systems.
(iv) Recovery information can be garbage-collected easily.
Older checkpoints, and the determinants of nondeterministic events that occurred before the most recent checkpoint, can be reclaimed because they will never be needed for recovery [11].

In optimistic logging protocols, processes log determinants asynchronously to stable storage. These protocols make the optimistic assumption that logging will complete before a failure occurs. Determinants are kept in a volatile log, which is periodically flushed to stable storage. Thus, optimistic logging does not require the application to block while the determinants are actually written to stable storage, and therefore incurs little overhead during failure-free execution. However, this advantage comes at the expense of more complicated recovery and garbage collection, and slower output commit, than in pessimistic logging. If a process fails, the determinants in its volatile log are lost, and the state intervals that were started by the nondeterministic events corresponding to these determinants cannot be recovered. Furthermore, if the failed process sent a message during any of the state intervals that cannot be recovered, the receiver of the message becomes an orphan process and must roll back to undo the effects of receiving the message. Optimistic protocols [11] do not implement the always-no-orphans condition, and therefore permit the temporary creation of orphan processes. To perform these rollbacks correctly, optimistic logging protocols track causal dependencies during failure-free execution. Upon a failure, the dependency information is used to calculate and recover the latest global state of the pre-failure execution in which no process is an orphan.

Causal logging [11] has the failure-free performance advantages of optimistic logging while retaining most of the advantages of pessimistic logging. Like optimistic logging, it avoids synchronous access to stable storage except during output commit.
Like pessimistic logging, it allows each process to commit output independently and never creates orphans, thereby isolating processes from the effects of failures that occur in other processes. Furthermore, causal logging limits the rollback of any failed process to the most recent checkpoint on stable storage. This reduces the storage overhead and the amount of work at risk.

4. COMPARISON OF ROLLBACK RECOVERY PROTOCOLS

In this section we compare the checkpointing schemes discussed above on the following parameters.

Domino effect: When processes do not coordinate their checkpoints to form consistent states, the rollback of one process can force other processes to roll back. This cascaded rollback may continue and eventually lead to the domino effect, in which the system rolls back to the beginning of the computation in spite of all the saved checkpoints.

Orphan message: A message whose reception has been recorded, but the record of whose transmission has been lost. This situation arises when the sender rolls back to a state prior to sending the message while the receiver still has the record of its reception.

Recovery line: It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.

Output commit: Before sending output to the outside world, the system must ensure that the state from which the output is sent will be recovered despite any future failure. This is called the output commit problem.

Table 1: Comparison of Rollback Recovery Protocols

Parameter      | Uncoordinated Checkpointing | Coordinated Checkpointing    | Communication-Induced Checkpointing | Pessimistic Logging | Optimistic Logging           | Causal Logging
Domino effect  | Possible                    | No                           | No                                  | No                  | No                           | No
Orphan message | Possible                    | No                           | Possible                            | No                  | Possible                     | No
Recovery line  | Unbounded                   | Last global checkpoint       | Possibly several checkpoints        | Last checkpoint     | Possibly several checkpoints | Last checkpoint
Output commit  | Not possible                | Global coordination required | Global coordination required        | Local decision      | Global coordination required | Local decision

(The last three columns are the message-logging protocols.)

5. CHECKPOINTING ISSUES IN DISTRIBUTED MOBILE SYSTEMS

The existence of mobile nodes in a distributed system introduces new issues that need proper handling when designing a checkpointing algorithm for such systems. These issues include mobility, disconnections, finite power sources, vulnerability to physical damage and lack of stable storage [30], [31]. The location of a Mobile Host within the network, as represented by its current local Mobile Support Station, changes with time. Checkpointing schemes that send control messages to Mobile Hosts need to first locate the Mobile Host within the network, and thereby incur a search overhead [1]. Because mobile computers are vulnerable to catastrophic failures, the disk storage of a Mobile Host is not acceptably stable for storing message logs or local checkpoints. Checkpointing schemes must therefore rely on an alternative stable repository for a Mobile Host's local checkpoint [1]. Disconnection of one or more Mobile Hosts should not prevent recording the global state of an application executing on MHs. It should be noted that disconnection of a Mobile Host is a voluntary operation, and frequent disconnections of Mobile Hosts are an expected feature of mobile computing environments [1]. The battery of a Mobile Host has limited life.
To save energy, the Mobile Host can power down individual components during periods of low activity [32]. This strategy is referred to as doze mode operation. A Mobile Host in doze mode is awakened on receiving a message. Energy conservation and low-bandwidth constraints therefore require checkpointing algorithms to minimize the number of synchronization messages and the number of checkpoints. These new issues make traditional checkpointing techniques unsuitable for checkpointing mobile distributed systems [30], [33].

6. CONCLUSION

A mobile computing system consists of mobile and stationary nodes connected to each other by a communication network. The availability of such systems is on the rise due to the proliferation of portable computers and advances in communication technology. An efficient recovery mechanism for mobile computing systems is required to maintain the continuity of computation in the event of failure. In this paper we have reviewed different rollback recovery schemes for mobile computing systems with respect to a set of properties including performance overhead, storage overhead, ease of recovery, freedom from the domino effect, freedom from orphan processes, and the extent of rollback. Checkpointing protocols require the processes to take periodic checkpoints with varying degrees of coordination. Coordinated checkpointing requires the processes to coordinate their checkpoints to form globally consistent system states. It generally simplifies recovery and garbage collection, and yields good performance in practice. At the other end of the spectrum, uncoordinated checkpointing does not require the processes to coordinate their checkpoints, but it suffers from a potential domino effect, complicates recovery, and still requires coordination to perform output commit or garbage collection. Communication-induced checkpointing schemes depend on the communication patterns of the applications to trigger checkpoints.
These schemes do not suffer from the domino effect and do not require coordination. Log-based rollback recovery is often a natural choice for applications that frequently interact with the outside world. It allows efficient output commit and comes in three variants: pessimistic, optimistic and causal. Pessimistic logging simplifies recovery and output commit, and protects surviving processes from having to roll back. These advantages have made pessimistic logging attractive in commercial environments where simplicity and robustness are necessary. Causal logging reduces the overhead while still preserving the properties of fast output commit and orphan-free recovery.

REFERENCES

[1] B.R. Badrinath, A. Acharya, and T. Imielinski, Structuring Distributed Algorithms for Mobile Hosts, Proceedings of the 14th International Conference on Distributed Computing Systems, (to appear), June.
[2] Chandy K.M., and Lamport L., Distributed Snapshots: Determining Global States of Distributed Systems, ACM Transactions on Computer Systems, 3 (1), pp. 63-75, February.
[3] Randell B., System Structure for Software Fault Tolerance, IEEE Trans. on Software Engineering, 1 (2).
[4] Russell D.L., State Restoration in Systems of Communicating Processes, IEEE Trans. on Software Engineering, 6 (2).
[5] Bhargava B., and Lian S.R., Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems - An Optimistic Approach, Proceedings of 17th IEEE Symposium on Reliable Distributed Systems, pp. 3-12.
[6] Strom R., and Yemini S., Optimistic Recovery in Distributed Systems, ACM Trans. Computer Systems, Aug. 1985, pp.
[7] Weigang Ni, Susan V. Vrbsky, and Sibabrata Ray, Low-cost Coordinated Checkpointing in Mobile Computing Systems, Proceedings of the Eighth IEEE International Symposium on Computers and Communications.
[8] Zomaya A.Y.H., Parallel and Distributed Computing Handbook, (New York: McGraw-Hill), 1996.
[9] Koo R., and Toueg S., Checkpointing and Rollback Recovery for Distributed Systems, IEEE Trans. on Software Engineering, 13 (1), pp., January.
[10] Elnozahy E.N., Alvisi L., Wang Y.M., and Johnson D.B., A Survey of Rollback-Recovery Protocols in Message-Passing Systems, ACM Computing Surveys, 34 (3), pp.
[11] Tamir Y., Sequin C.H., Error Recovery in Multicomputers using Global Checkpoints, Proceedings of the International Conference on Parallel Processing, pp.
[12] Baldoni R., Hélary J-M., Mostefaoui A., and Raynal M., A Communication-Induced Checkpointing Protocol that Ensures Rollback-Dependency Trackability, Proceedings of the International Symposium on Fault-Tolerant-Computing Systems, pp., June.
[13] Hélary J.M., Mostefaoui A., and Raynal M., Communication-Induced Determination of Consistent Snapshots, Proceedings of the 28th International Symposium on Fault-Tolerant Computing, pp., June.
[14] Manivannan D., and Singhal M., Quasi-Synchronous Checkpointing: Models, Characterization, and Classification, IEEE Trans. Parallel and Distributed Systems, 10 (7), pp., July.
[15] Alvisi Lorenzo, and Marzullo Keith, Message Logging: Pessimistic, Optimistic, Causal, and Optimal, IEEE Transactions on Software Engineering, 24 (2), February 1998, pp.
[16] L. Alvisi, Hoppe B., Marzullo K., Nonblocking and Orphan-Free Message Logging Protocol, Proc. of 23rd Fault Tolerant Computing Symp., pp., June.
[17] L. Alvisi, Understanding the Message Logging Paradigm for Masking Process Crashes, Ph.D. Thesis, Cornell Univ., Dept. of Computer Science, Jan. Available as Technical Report TR.
[18] L. Alvisi, and K. Marzullo, Tradeoffs in Implementing Optimal Message Logging Protocol, Proc. 15th Symp. Principles of Distributed Computing, pp., ACM, June.
[19] A. Borg, J. Baumbach, and S. Glazer, A Message System Supporting Fault Tolerance, Proc. Symp. Operating System Principles, pp., ACM SIGOPS, Oct.
[20] Elnozahy, and Zwaenepoel W., Manetho: Transparent Rollback Recovery with Low-overhead, Limited Rollback and Fast Output Commit, IEEE Trans. Computers, 41 (5), pp., May.
[21] Elnozahy, and Zwaenepoel W., On the Use and Implementation of Message Logging, 24th Int'l Symp. Fault Tolerant Computing, pp., IEEE Computer Society, June.
[22] D. Johnson, Distributed System Fault Tolerance Using Message Logging and Checkpointing, Ph.D. Thesis, Rice Univ., Dec.
[23] M.L. Powell, and D.L. Presotto, Publishing: A Reliable Broadcast Communication Mechanism, Proc. Ninth Symp. Operating System Principles, pp., ACM SIGOPS, Oct.
[24] A.P. Sistla, and J.L. Welch, Efficient Distributed Recovery Using Message Logging, Proc. 18th Symp. Principles of Distributed Computing, pp., Aug.
[25] S. Venkatesan, and T.Y. Juang, Efficient Algorithms for Optimistic Crash Recovery, Distributed Computing, 8 (2), pp., June.
[26] S. Venkatesan, Message-Optimal Incremental Snapshots, Computer and Software Engineering, 1 (3), pp.
[27] S. Venkatesan, Optimistic Crash Recovery without Rolling Back Non-Faulty Processors, Information Sciences.
[28] S. Venkatesan, and T.T.Y. Juang, Low Overhead Optimistic Crash Recovery, Proc. 11th Int.
[29] Wang Y., and Fuchs W.K., Lazy Checkpoint Coordination for Bounding Rollback Propagation, Proc. 12th Symp. Reliable Distributed Systems, pp., Oct.
[30] Acharya A., and Badrinath B.R., Checkpointing Distributed Applications on Mobile Computers, Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, pp., September.
[31] Adnan Agbaria, William H. Sanders, Distributed Snapshots for Mobile Computing Systems, Proceedings of the Second IEEE Annual Conference on Pervasive Computing and Communications (PerCom 04), pp. 1-10.
[32] George H. Forman, and John Zahorjan, The Challenges of Mobile Computing, IEEE Computer, 27 (4), pp., April.
[33] Prakash R., and Singhal M., Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems, IEEE Transactions on Parallel and Distributed Systems, 7 (10), pp., October 1996.


More information

End-To-End Delay Optimization in Wireless Sensor Network (WSN)

End-To-End Delay Optimization in Wireless Sensor Network (WSN) Shweta K. Kanhere 1, Mahesh Goudar 2, Vijay M. Wadhai 3 1,2 Dept. of Electronics Engineering Maharashtra Academy of Engineering, Alandi (D), Pune, India 3 MITCOE Pune, India E-mail: shweta.kanhere@gmail.com,

More information

Optimistic Recovery in Distributed Systems

Optimistic Recovery in Distributed Systems Optimistic Recovery in Distributed Systems ROBERT E. STROM and SHAULA YEMINI IBM Thomas J. Watson Research Center Optimistic Recovery is a new technique supporting application-independent transparent recovery

More information

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic

More information

Parveen Kumar Deptt. of CSE, Bhart Inst of Engg. & Tech., Meerut(UP), India

Parveen Kumar Deptt. of CSE, Bhart Inst of Engg. & Tech., Meerut(UP), India Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Cluster based

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya. reduce the average performance overhead.

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya.   reduce the average performance overhead. A Case for Two-Level Distributed Recovery Schemes Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-31, U.S.A. E-mail: vaidya@cs.tamu.edu Abstract Most distributed

More information

Chapter 16: Recovery System. Chapter 16: Recovery System

Chapter 16: Recovery System. Chapter 16: Recovery System Chapter 16: Recovery System Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 16: Recovery System Failure Classification Storage Structure Recovery and Atomicity Log-Based

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Chapter 14: Recovery System

Chapter 14: Recovery System Chapter 14: Recovery System Chapter 14: Recovery System Failure Classification Storage Structure Recovery and Atomicity Log-Based Recovery Remote Backup Systems Failure Classification Transaction failure

More information

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class. Today s Class. Faloutsos/Pavlo CMU /615 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications C. Faloutsos A. Pavlo Lecture#23: Crash Recovery Part 1 (R&G ch. 18) Last Class Basic Timestamp Ordering Optimistic Concurrency

More information

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

More information

ISSN: Monica Gahlyan et al, International Journal of Computer Science & Communication Networks,Vol 3(3),

ISSN: Monica Gahlyan et al, International Journal of Computer Science & Communication Networks,Vol 3(3), Waiting Algorithm for Concurrency Control in Distributed Databases Monica Gahlyan M-Tech Student Department of Computer Science & Engineering Doon Valley Institute of Engineering & Technology Karnal, India

More information

Event List Management In Distributed Simulation

Event List Management In Distributed Simulation Event List Management In Distributed Simulation Jörgen Dahl ½, Malolan Chetlur ¾, and Philip A Wilsey ½ ½ Experimental Computing Laboratory, Dept of ECECS, PO Box 20030, Cincinnati, OH 522 0030, philipwilsey@ieeeorg

More information

Chapter 17: Distributed Systems (DS)

Chapter 17: Distributed Systems (DS) Chapter 17: Distributed Systems (DS) Silberschatz, Galvin and Gagne 2013 Chapter 17: Distributed Systems Advantages of Distributed Systems Types of Network-Based Operating Systems Network Structure Communication

More information