A SURVEY AND PERFORMANCE ANALYSIS OF CHECKPOINTING AND RECOVERY SCHEMES FOR MOBILE COMPUTING SYSTEMS


International Journal of Computer Science and Communication, Vol. 2, No. 1, January-June 2011, pp.

Ruchi Tuli (1) and Parveen Kumar (2)
(1) Yanbu University College, Royal Commission for Jubail and Yanbu, Directorate General for Yanbu, P.O. Box, Madinat Yanbu Al Sinaiyah, Kingdom of Saudi Arabia, tuli.ruchi@gmail.com
(2) Merrut Institute of Engineering and Technology, Merrut (INDIA), pk223475@yahoo.com

ABSTRACT
A checkpoint is a designated place in a program where normal processing is interrupted to preserve status information. Checkpointing is the process of saving that status information. Mobile computing systems often suffer from high failure rates that are transient and independent in nature. To add reliability and high availability to such distributed systems, checkpoint-based rollback recovery is one of the most widely used techniques for applications such as scientific computing, databases, telecommunication applications and mission-critical applications. In this paper we discuss various checkpointing schemes for mobile computing systems that recover from system failures, which otherwise cause running services, computational tasks, or executing transactions to fail. We also compare the checkpointing schemes on different parameters and discuss the various issues related to distributed mobile computing systems.

Keywords: Mobile computing systems, co-ordinated checkpoint, rollback recovery, mobile host.

1. INTRODUCTION
A mobile computing system is a distributed system where some of the nodes are mobile computers [1]. It consists of fixed nodes called Mobile Support Stations (MSSs) and a number of Mobile Hosts (MHs). A cell is the geographical area around an MSS in which it can support an MH. A Mobile Host can change its geographical position freely from one cell to another, or even to an area covered by no cell.
All communication from one Mobile Host to another goes through an MSS. An MSS has both wired and wireless links: it communicates with Mobile Hosts over wireless links and with other MSSs over wired links.

Checkpointing is a fault-tolerance technique that allows a system to roll back to the most recent failure-free state when a failure occurs in a mobile computing system. By periodically invoking the checkpointing process, one can save the status of a program at regular intervals. A single failure can disturb the entire computation. If there is a failure, computation may be restarted from the last checkpoint instead of being repeated from the beginning. The process of resuming computation from the last saved state is called rollback recovery. Because mobile devices communicate with one another, rolling back only the system that has failed may lead to inconsistency at recovery time. So when a process rolls back to some previous failure-free intermediate state, other processes that depend on the failed process must also roll back to achieve a consistent global state.

1.1 Need of Checkpointing
Apart from its use in recovering a system from failure, checkpointing is also applied to debugging distributed programs and to migrating processes in a multiprocessor system. In debugging distributed programs, checkpointing assists in monitoring the state changes of a process at various instants during execution. To balance the load of processors in a distributed system, processes are usually moved from heavily loaded processors to lightly loaded processors. Periodic checkpointing provides the information necessary to move a process from one processor to another.

1.2 Aspects of Checkpointing
Some of the aspects that need to be considered with checkpointing are (a) frequency of checkpointing, (b) contents of a checkpoint, and (c) methods of checkpointing.
(a) Frequency of checkpointing: A checkpointing algorithm executes in parallel with the underlying computation. Therefore, the overhead introduced by checkpointing should be minimized. Checkpointing should enable a user to recover without losing substantial computation in case of an error, which necessitates frequent checkpointing and consequently significant overhead. The number of checkpoints initiated should be such that the cost of information loss due to failure is small and the overhead due to checkpointing is not significant. Both depend on the failure probability and the importance of the computation.

(b) Contents of a checkpoint: The state of a process has to be saved in stable storage so that the process can be restarted in case of an error. The state/context includes the code, data and stack segments along with the environment and the register contents. The environment holds information about the various files currently in use and their file pointers.
(c) Methods of checkpointing: The methodology used for checkpointing depends on the architecture of the system. Methods used in multiprocessor systems should incorporate explicit coordination. In a message passing system, the messages should be monitored and, if necessary, saved as part of the global context, because messages introduce dependencies among the processors.

2. SYSTEM MODEL
A message passing distributed system is assumed. In some protocols the communication subsystem is assumed to be reliable and FIFO; in other, simplified protocols it is assumed to be unreliable. Interaction with the outside world is modelled as interaction with a special process called the OWP (Outside World Process), which cannot fail and which cannot maintain state or participate in the recovery protocol.

3. CHECKPOINTING SCHEMES
A checkpoint is a local state of a process saved on stable storage. In mobile computing systems, since the processes in the system do not share memory, a global state of the system is defined as a set of local states, one from each process. Taking a checkpoint in a message passing distributed system is quite complex because an arbitrary set of checkpoints cannot be used for recovery [2], [3], [4]: the set of checkpoints used for recovery must form a consistent global state. Upon a failure, checkpoint-based rollback recovery restores the system state to the most recent consistent set of checkpoints, i.e. the recovery line.
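The consistency requirement on a set of checkpoints can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each process numbers its local events and that a checkpoint is identified by the number of the last event it contains; the names `is_consistent`, `cut` and `messages` are hypothetical and do not come from any of the surveyed protocols. A set of checkpoints is treated as inconsistent when some message is recorded as received but not as sent.

```python
from typing import Dict, List, Tuple

# Hypothetical message record: (sender, send_event, receiver, receive_event).
# Event numbers are per-process counters of local events (an assumption).
Message = Tuple[str, int, str, int]


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Return True if no message is received inside the cut but sent outside it.

    cut[p] is the number of the last local event of process p that is
    included in p's checkpoint.
    """
    for sender, send_ev, receiver, recv_ev in messages:
        received_in_cut = recv_ev <= cut[receiver]
        sent_in_cut = send_ev <= cut[sender]
        if received_in_cut and not sent_in_cut:
            # The receipt is recorded but the send is not: not a consistent state.
            return False
    return True


# P1 sends a message at its event 3; P2 receives it at its event 2.
messages = [("P1", 3, "P2", 2)]
good_cut = {"P1": 4, "P2": 5}  # both the send and the receipt are recorded
bad_cut = {"P1": 2, "P2": 5}   # the receipt is recorded, the send is not

print(is_consistent(good_cut, messages))
print(is_consistent(bad_cut, messages))
```

With `bad_cut`, P2's checkpoint reflects a message that P1's checkpoint never sent, so these two checkpoints cannot both belong to a recovery line.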
Rollback recovery schemes are of two types: checkpoint based and log based. Checkpoint-based rollback recovery techniques can be classified into three categories: uncoordinated checkpointing, coordinated checkpointing and communication-induced checkpointing. Log-based recovery schemes are also of three types: pessimistic logging, optimistic logging and causal logging.

3.1 Uncoordinated Checkpointing
This scheme is also known as independent checkpointing. In uncoordinated checkpointing, processes do not coordinate their checkpointing activity and each process records its local checkpoint independently [5], [6], [7]. It allows each process the maximum autonomy in deciding when to take a checkpoint, i.e. each process may take a checkpoint when it is most convenient. It eliminates coordination overhead altogether and forms a consistent global state on recovery after a fault [5]. After a failure, a consistent global checkpoint is established by tracking the dependencies. If a failure occurs, the recovering process initiates rollback by broadcasting a dependency request message to collect all the dependency information maintained by each process. Based on the global dependency information thus collected, the initiator calculates the recovery line and broadcasts a rollback request message containing the recovery line. Upon receiving this message, a process whose current state belongs to the recovery line simply resumes execution; otherwise it rolls back to an earlier checkpoint as indicated by the recovery line. The recovery line is determined using one of the following graphs.

Rollback Dependency Graph: Let $c_{i,x}$ denote the $x$-th checkpoint of process $P_i$. We call $x$ the checkpoint index. Let $I_{i,x}$ denote the checkpoint interval between checkpoints $c_{i,x-1}$ and $c_{i,x}$. We first construct the dependency graph as follows: each node represents a checkpoint, and a directed edge is drawn from $c_{i,x}$ to $c_{j,y}$ if (a) $i \neq j$ and a message $m$ is sent from $I_{i,x}$ and received in $I_{j,y}$, or (b) $i = j$ and $y = x + 1$. The nodes corresponding to the states of the failed processes at the failure point are marked, and a reachability analysis is then performed from these marked nodes. For each process, the most recent state that is unreachable from the failed states belongs to the recovery line.

Checkpoint Graph: Checkpoint graphs are very similar to rollback-dependency graphs except that, when a message is sent from $I_{i,x}$ and received in $I_{j,y}$, a directed edge is drawn from $c_{i,x-1}$ to $c_{j,y}$ (instead of from $c_{i,x}$ to $c_{j,y}$).

Fig. 1: (a) Example Execution; (b) Rollback-dependency Graph; (c) Checkpoint Graph
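The reachability analysis on the rollback-dependency graph can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of any cited protocol: the function name `recovery_line`, the tuple encoding of messages, and the convention that every process has an initial checkpoint with index 0 and that its volatile state at the failure point is the node with the highest index are all assumptions made for the example.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]  # (i, x) stands for checkpoint c_{i,x} of process P_i


def recovery_line(last_index: Dict[int, int],
                  messages: List[Tuple[int, int, int, int]],
                  failed: List[int]) -> Dict[int, int]:
    """Compute the recovery line by reachability on the rollback-dependency graph.

    last_index[i] (assumed >= 1) is the index of P_i's volatile state at the
    failure point; indices 0 .. last_index[i] - 1 are checkpoints on stable storage.
    Each message (i, x, j, y) was sent in interval I_{i,x} and received in I_{j,y}.
    """
    edges: Dict[Node, List[Node]] = {}
    # Rule (b): edge from c_{i,x} to c_{i,x+1} within the same process.
    for i, last in last_index.items():
        for x in range(last):
            edges.setdefault((i, x), []).append((i, x + 1))
    # Rule (a): edge from c_{i,x} to c_{j,y} for a message between processes.
    for i, x, j, y in messages:
        if i != j:
            edges.setdefault((i, x), []).append((j, y))

    # Mark the volatile states of the failed processes, then everything reachable.
    marked: Set[Node] = {(i, last_index[i]) for i in failed}
    queue = deque(marked)
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in marked:
                marked.add(nxt)
                queue.append(nxt)

    # For each process, the most recent unmarked node is on the recovery line.
    return {i: max(x for x in range(last + 1) if (i, x) not in marked)
            for i, last in last_index.items()}


# P0 and P1 each hold checkpoints 0 and 1 and a volatile state with index 2.
# P0 sends a message in I_{0,2} that P1 receives in I_{1,1}; then P0 fails.
line = recovery_line({0: 2, 1: 2}, [(0, 2, 1, 1)], failed=[0])
print(line)  # {0: 1, 1: 0}
```

In this example P0 restarts from $c_{0,1}$, which precedes the send, so P1 cannot keep $c_{1,1}$ (which records the receipt) and falls back to $c_{1,0}$. A process whose volatile state stays unmarked is returned with its highest index, which corresponds to simply resuming execution.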

The main advantage of this technique is that each process may take a checkpoint when it is most convenient. For example, a process may reduce the overhead by taking checkpoints when the amount of state information to be saved is small. But there are several disadvantages. Firstly, there is a possibility of the domino effect: cascaded rollbacks may cause the loss of a large amount of useful work, possibly all the way back to the beginning of the computation [3], [4], [9]. Secondly, a process may take a useless checkpoint that will never be part of a global consistent state. Useless checkpoints incur overhead without advancing the recovery line [10]. Thirdly, uncoordinated checkpointing forces each process to maintain multiple checkpoints and to periodically invoke a garbage collection algorithm to discard the checkpoints that are no longer useful. Fourthly, it is not suitable for applications with frequent output commits because these require global coordination to compute the recovery line.

3.2 Coordinated Checkpointing
This scheme requires the processes to plan their checkpoints in order to form a consistent global state.
Coordinated checkpointing simplifies recovery and is not susceptible to the domino effect, since every process always restarts from its most recent checkpoint in case of a failure. This scheme basically follows a two-phase commit structure [2], [9], [10]. In the first phase, processes take tentative checkpoints, and in the second phase the tentative checkpoints are made permanent. Several variants of coordinated checkpointing have been proposed in the literature. A few of them are described below.

Variation-I Straight-Forward Approach: This is a two-phase protocol. The straightforward approach is to block communications while the checkpointing protocol executes [11]. A coordinator takes a checkpoint and broadcasts a request message to all processes, asking them to take a checkpoint. When a process receives the message, it stops its execution, flushes all the communication channels, takes a tentative checkpoint, and sends an acknowledgement message back to the coordinator. After the coordinator receives acknowledgements from all processes, it broadcasts a commit message that completes the two-phase checkpoint protocol. On receiving the commit, a process converts its tentative checkpoint into a permanent one and discards its old permanent checkpoint, if any. The process is then free to resume execution and exchange messages with other processes. However, this method has certain demerits. Every process has to block for the entire duration of the protocol, and a large overhead is involved in broadcasting twice (once for the checkpoint request message and once for the commit message).

Variation-II Non-blocking Checkpoint Co-ordination: This protocol is also known as the distributed snapshot protocol and was proposed by Chandy and Lamport [2]. The initiator takes a checkpoint and broadcasts a marker (i.e. a checkpoint request) to all processes. Each process takes a checkpoint upon receiving the first marker and rebroadcasts the marker to all the other processes before sending any application message.
The underlying assumption is that the communication channels are FIFO (First In First Out) and reliable. If the channels are non-FIFO, the marker can be piggybacked on every post-checkpoint message. Alternatively, checkpoint indices can serve as markers, whereby a checkpoint is triggered if the receiver's local checkpoint index is lower than the piggybacked checkpoint index.

Variation-III Checkpointing with Synchronized Clocks: The underlying principle is that loosely synchronized clocks can trigger the local checkpointing actions of all the participating processes at approximately the same time without a checkpoint initiator. A process takes a checkpoint and waits for a period that equals the sum of the maximum deviation between clocks and the maximum time to detect a failure in another process in the system. If a failure occurs, it is detected within the specified time and the protocol is aborted. The drawback of this method is that all the processes need to participate in every checkpoint, so scalability is an issue.

Variation-IV Minimal Checkpoint Coordination: This is also a two-phase protocol. In the first phase, the checkpoint initiator identifies all the processes with which it has communicated since the last checkpoint and sends them a request. Upon receiving the request, each process in turn identifies all the processes it has communicated with since the last checkpoint and sends them a request, and so on until no more processes can be identified. In phase two of the protocol, all the processes identified in the first phase take a checkpoint. However, the demerit of this method is that after a process takes a checkpoint, it cannot send any message until the second phase terminates successfully.
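The first phase of minimal checkpoint coordination amounts to a transitive closure over the "has communicated with since the last checkpoint" relation. The following Python sketch shows that idea only; it runs in a single address space, whereas a real protocol would exchange request messages, and the names `processes_to_checkpoint` and `talked_to` are hypothetical.

```python
from typing import Dict, Set


def processes_to_checkpoint(initiator: str,
                            talked_to: Dict[str, Set[str]]) -> Set[str]:
    """Identify every process that must take part in the checkpoint.

    talked_to[p] is the set of processes p has communicated with since its
    last checkpoint (an assumed, locally maintained record).
    """
    identified = {initiator}
    pending = [initiator]  # processes whose requests are still to be forwarded
    while pending:
        p = pending.pop()
        for q in talked_to.get(p, set()):
            if q not in identified:
                identified.add(q)   # q receives a checkpoint request
                pending.append(q)   # q will in turn notify its own partners
    return identified


talked_to = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},  # D and E only talked to each other
    "E": {"D"},
}
participants = processes_to_checkpoint("A", talked_to)
print(sorted(participants))  # ['A', 'B', 'C']
```

Processes D and E are not asked to checkpoint because no chain of communication links them to the initiator, which is the saving this variation aims for.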

The main advantage of coordinated checkpointing is that storage overhead is reduced and the need for garbage collection is eliminated, as only one permanent and at most one tentative checkpoint need to be stored. In case of a failure, the system restarts from the last checkpointed state. A permanent checkpoint cannot be undone and guarantees that the computation will restart from the last checkpointed state and not from the beginning. A tentative checkpoint can be undone or changed into a permanent checkpoint. Coordinated checkpointing is not susceptible to the domino effect since, upon failure, every process always restarts from its most recent checkpoint.

The main disadvantage is the message overhead. This approach is efficient if the number of processes involved in the computation is small, say in the hundreds. If the number of processes in the system is in the hundreds of thousands, this approach causes a lot of message overhead. Another disadvantage is the checkpointing overhead. In general, the number of I/O nodes is much smaller than the number of processes in the system. After a process sends a ready message, it queues its activities until it receives a proceed message from the coordinator, and during this time the process does no useful work. Furthermore, when the checkpoint of a process is dumped to an I/O node, it can cause a lot of contention for the I/O nodes, since one I/O node supports several processors. All of this contributes to checkpointing overhead (i.e., the time spent by the processes without doing any useful work).

3.3 Communication Induced Checkpointing
Another name for this scheme is quasi-synchronous checkpointing. Communication-induced checkpointing avoids the domino effect while allowing processes to take some of their checkpoints independently [12], [13], [14]. In these protocols, processes take two kinds of checkpoints, local and forced.
Local checkpoints can be taken independently, while forced checkpoints are taken to guarantee the eventual progress of the recovery line and to minimize useless checkpoints. As opposed to coordinated checkpointing, these protocols do not exchange any special coordination messages to determine when forced checkpoints should be taken. Instead, they piggyback protocol-specific information (generally checkpoint sequence numbers) on each application message; the receiver then uses this information to decide whether it should take a forced checkpoint to advance the global recovery line. This decision is based on the receiver determining whether past communication and checkpoint patterns can lead to the creation of useless checkpoints; a forced checkpoint is taken to break these patterns [10], [14]. Communication-induced checkpointing can be classified into two types.

Model-based Checkpointing: Model-based checkpointing relies on preventing patterns of communications and checkpoints that could result in inconsistent states among the existing checkpoints. A model is set up to detect the possibility that such patterns could be forming within the system, according to some heuristic. A checkpoint is usually forced to prevent the undesirable patterns from occurring.

Index-based Checkpointing: Index-based checkpointing works by assigning monotonically increasing indexes to checkpoints, such that the checkpoints having the same index at different processes form a consistent state. The indices are piggybacked on application messages to help receivers decide when they should force a checkpoint.

3.4 Message-logging Based Checkpointing Protocols
Message-logging protocols (for example [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]) are popular for building systems that can tolerate process crash failures.
Message logging and checkpointing can be used to provide fault tolerance in distributed mobile systems in which all inter-process communication is through messages. Each message received by a process is saved in a message log on stable storage. No coordination is required between the checkpointing of different processes or between message logging and checkpointing. The execution of each process is assumed to be deterministic between received messages, and all processes are assumed to execute on fail-stop processors. When a process crashes, a new process is created in its place. The new process is given the appropriate recorded local state, and then the logged messages are replayed in the order in which the process originally received them.

All message-logging protocols require that once a crashed process recovers, its state is consistent with the states of the other processes [11], [29]. This consistency requirement is usually expressed in terms of orphan processes, which are surviving processes whose states are inconsistent with the recovered states of crashed processes. Thus, message-logging protocols guarantee that upon recovery no process is an orphan. This requirement can be enforced either by avoiding the creation of orphans during an execution, as pessimistic protocols do, or by taking appropriate actions during recovery to eliminate all orphans, as optimistic protocols do. Bin Yao et al. [29] describe a receiver-based message logging protocol for mobile hosts, mobile support stations and home agents in a Mobile IP environment, which guarantees independent recovery. Checkpointing is utilized to limit log size and recovery latency. Log-based recovery protocols can be classified into three types: pessimistic logging, optimistic logging and causal logging.
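The recovery procedure described above (restore the recorded local state, then replay the logged messages in their original order) can be sketched in Python. The sketch assumes a toy deterministic process whose state is a single integer and an in-memory object standing in for stable storage; `StableStorage`, `Process` and the state-update rule are hypothetical and chosen only so that replay order visibly matters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StableStorage:
    """Stand-in for stable storage: survives the crash of the process."""
    checkpoint: int = 0
    log: List[int] = field(default_factory=list)


class Process:
    def __init__(self, storage: StableStorage) -> None:
        self.storage = storage
        self.state = storage.checkpoint  # start from the recorded local state

    def _apply(self, msg: int) -> None:
        # Deterministic and order-sensitive state transition (toy example).
        self.state = self.state * 2 + msg

    def receive(self, msg: int) -> None:
        self.storage.log.append(msg)  # save the message before it takes effect
        self._apply(msg)

    def take_checkpoint(self) -> None:
        self.storage.checkpoint = self.state
        self.storage.log.clear()  # older messages are no longer needed

    @classmethod
    def recover(cls, storage: StableStorage) -> "Process":
        """Create a replacement process and replay the logged messages in order."""
        replacement = cls(storage)
        for msg in storage.log:
            replacement._apply(msg)
        return replacement


storage = StableStorage()
p = Process(storage)
p.receive(1)
p.receive(2)
p.take_checkpoint()
p.receive(3)
p.receive(4)
state_before_crash = p.state

del p  # the process crashes; only stable storage remains
recovered = Process.recover(storage)
print(recovered.state == state_before_crash)  # True
```

Taking a checkpoint truncates the log here, which mirrors the remark that checkpointing limits log size and recovery latency.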

Pessimistic logging protocols are designed under the assumption that a failure can occur after any nondeterministic event in the computation. This assumption is pessimistic since in reality failures are rare. In their most straightforward form, pessimistic protocols log to stable storage the determinant of each nondeterministic event before the event is allowed to affect the computation. These pessimistic protocols implement the following property, often referred to as synchronous logging, which is a strengthening of the always-no-orphans condition: if an event has not been logged on stable storage, then no process can depend on it. In addition to logging determinants, processes also take periodic checkpoints to limit the amount of work that has to be repeated in execution replay during recovery. Should a failure occur, the application program is restarted from the most recent checkpoint and the logged determinants are used during recovery to recreate the pre-failure execution. Synchronous logging has four advantages:
(i) Processes can commit output to the outside world without running a special protocol.
(ii) Processes restart from their most recent checkpoint upon a failure, therefore limiting the extent of execution that has to be replayed. Thus, the frequency of checkpoints can be determined by trading off the desired runtime performance against the desired protection of the on-going execution.
(iii) Recovery is simplified because the effects of a failure are confined to the processes that fail. Functioning processes continue to operate and never become orphans, because a process always recovers to the state that included its most recent interaction with any other process or with the outside world. This is highly desirable in practical systems.
(iv) Recovery information can be garbage-collected easily.
Older checkpoints and determinants of nondeterministic events that occurred before the most recent checkpoint can be reclaimed because they will never be needed for recovery [11].

In optimistic logging protocols, processes log determinants asynchronously to stable storage. These protocols make the optimistic assumption that logging will complete before a failure occurs. Determinants are kept in a volatile log, which is periodically flushed to stable storage. Thus, optimistic logging does not require the application to block waiting for the determinants to be actually written to stable storage, and therefore incurs little overhead during failure-free execution. However, this advantage comes at the expense of more complicated recovery and garbage collection, and slower output commit, than in pessimistic logging. If a process fails, the determinants in its volatile log are lost, and the state intervals that were started by the nondeterministic events corresponding to these determinants cannot be recovered. Furthermore, if the failed process sent a message during any of the state intervals that cannot be recovered, the receiver of the message becomes an orphan process and must roll back to undo the effects of receiving the message. Optimistic protocols [11] do not implement the always-no-orphans condition, and therefore permit the temporary creation of orphan processes. To perform these rollbacks correctly, optimistic logging protocols track causal dependencies during failure-free execution. Upon a failure, the dependency information is used to calculate and recover the latest global state of the pre-failure execution in which no process is an orphan.

Causal logging [11] has the failure-free performance advantages of optimistic logging while retaining most of the advantages of pessimistic logging. Like optimistic logging, it avoids synchronous access to stable storage except during output commit.
Like pessimistic logging, it allows each process to commit output independently and never creates orphans, thereby isolating processes from the effects of failures that occur in other processes. Furthermore, causal logging limits the rollback of any failed process to the most recent checkpoint on stable storage. This reduces the storage overhead and the amount of work at risk.

4. COMPARISON OF ROLLBACK RECOVERY PROTOCOLS
In this section we compare the checkpointing schemes discussed above on the following parameters.

Domino Effect: Unless processes coordinate their checkpoints to form consistent states, the rollback of one process can force other processes to roll back as well. This cascaded rollback may continue and eventually lead to the domino effect, which causes the system to roll back to the beginning of the computation in spite of all the saved checkpoints.

Orphan Message: A message whose reception has been recorded, but whose record of transmission has been lost. This situation arises when the sender node rolls back to a state prior to sending the message while the receiver node still has the record of its reception.

Recovery Line: It is desirable to minimize the amount of lost work by restoring the system to the most recent consistent global checkpoint, which is called the recovery line.

Output Commit: Before sending output to the outside world, the system must ensure that the state from which the output is sent will be recovered despite any future failure. This is called the output commit problem.

Table 1: Comparison of Rollback Recovery Protocols

Parameters     | Uncoordinated Checkpointing | Coordinated Checkpointing    | Communication-Induced Checkpointing | Message Logging: Pessimistic | Message Logging: Optimistic  | Message Logging: Causal
Domino Effect  | Possible                    | No                           | No                                  | No                           | No                           | No
Orphan Message | Possible                    | No                           | Possible                            | No                           | Possible                     | No
Recovery Line  | Unbounded                   | Last global checkpoint       | Possibly several checkpoints        | Last checkpoint              | Possibly several checkpoints | Last checkpoint
Output Commit  | Not possible                | Global coordination required | Global coordination required        | Local decision               | Global coordination required | Local decision

5. CHECKPOINTING ISSUES IN DISTRIBUTED MOBILE SYSTEMS
The existence of mobile nodes in a distributed system introduces new issues that need proper handling while designing a checkpointing algorithm for such systems. These issues include mobility, disconnections, a finite power source, vulnerability to physical damage, and lack of stable storage [30], [31]. The location of a Mobile Host within the network, as represented by its current local Mobile Support Station, changes with time. Checkpointing schemes that send control messages to Mobile Hosts need to first locate the Mobile Host within the network, and thereby incur a search overhead [1]. Due to the vulnerability of mobile computers to catastrophic failures, the disk storage of a Mobile Host is not acceptably stable for storing message logs or local checkpoints. Checkpointing schemes must therefore rely on an alternative stable repository for a Mobile Host's local checkpoint [1]. Disconnections of one or more Mobile Hosts should not prevent recording the global state of an application executing on MHs. It should be noted that disconnection of a Mobile Host is a voluntary operation, and frequent disconnection of Mobile Hosts is an expected feature of mobile computing environments [1]. The battery at the Mobile Host has limited life.
To save energy, the Mobile Host can power down individual components during periods of low activity [32]. This strategy is referred to as doze mode operation. A Mobile Host in doze mode is awakened on receiving a message. Therefore, energy conservation and low bandwidth constraints require checkpointing algorithms to minimize the number of synchronization messages and the number of checkpoints. These new issues make traditional checkpointing techniques unsuitable for checkpointing mobile distributed systems [30], [33].

6. CONCLUSION
A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The availability of such systems is on the rise due to the proliferation of portable computers and advances in communication technology. An efficient recovery mechanism for mobile computing systems is required to maintain the continuity of computation in the event of failure. In this paper we have reviewed different rollback recovery schemes for mobile computing systems with respect to a set of properties including performance overhead, storage overhead, ease of recovery, freedom from the domino effect, freedom from orphan processes, and the extent of rollback. Checkpointing protocols require the processes to take periodic checkpoints with varying degrees of coordination. Coordinated checkpointing requires the processes to coordinate their checkpoints to form globally consistent system states. It generally simplifies recovery and garbage collection, and yields good performance in practice. At the other end of the spectrum, uncoordinated checkpointing does not require the processes to coordinate their checkpoints, but it suffers from a potential domino effect, complicates recovery, and still requires coordination to perform output commit or garbage collection. Communication-induced checkpointing schemes depend on the communication patterns of the applications to trigger checkpoints.
These schemes do not suffer from the domino effect and do not require coordination. Log-based rollback recovery is often a natural choice for applications that frequently interact with the outside world. It allows efficient output commit and comes in three forms: pessimistic, optimistic, and causal. Pessimistic logging simplifies recovery and output commit, and protects surviving processes from having to roll back. These advantages have made pessimistic logging attractive in commercial environments where simplicity and robustness are necessary. Causal logging reduces the overhead while still preserving the properties of fast output commit and orphan-free recovery.

