On Checkpoint Latency

Nitin H. Vaidya
Department of Computer Science
Texas A&M University
College Station, TX 77843-3112
E-mail: vaidya@cs.tamu.edu
Web: http://www.cs.tamu.edu/faculty/vaidya/

Abstract

Checkpointing and rollback is a technique for minimizing the loss of computation in the presence of failures. Two metrics can be used to characterize a checkpointing scheme: (i) checkpoint overhead (the increase in the execution time of the application because of a checkpoint), and (ii) checkpoint latency (the duration of time required to save the checkpoint). For many checkpointing methods, checkpoint latency is larger than checkpoint overhead. This paper evaluates an expression for the "average overhead" of a checkpointing scheme as a function of checkpoint latency and overhead. It is shown that the "average overhead" is much more sensitive to changes in checkpoint overhead than to changes in checkpoint latency. Also, for equi-distant checkpoints, the optimal checkpoint interval is shown to be independent of the checkpoint latency.

1 Introduction

Many applications (sequential and parallel) require a large amount of time to complete. Such applications can lose a significant amount of computation if a failure occurs during execution. Checkpointing and rollback recovery is a technique used to minimize this loss of computation in an environment subject to failures [1]. A checkpoint is a copy of the application's state stored on stable storage; stable storage is not subject to failures. The application periodically saves checkpoints, and recovers from a failure by rolling back to a recent checkpoint. Checkpointing can be used for sequential as well as parallel (or distributed) applications. When the application consists of more than one process, a consistent checkpointing algorithm can be used to save a consistent state of the multi-process application.

Two metrics can be used to characterize a checkpointing scheme:

- Checkpoint overhead C is the increase in the execution time of the application because of a checkpoint.
- Checkpoint latency L is the duration of time required to save the checkpoint. In many implementations, checkpoint latency is larger than checkpoint overhead. (This is illustrated in Section 2.)

In the past, a large number of researchers have analyzed checkpointing and rollback recovery schemes (e.g., [1, 2, 3, 11]). However, to our knowledge, this past work has not taken checkpoint latency into account. This paper evaluates the impact of checkpoint latency on the performance of a checkpointing scheme. The work is motivated by schemes that attempt to reduce checkpoint overhead at the cost of an increase in checkpoint latency (e.g., [5, 4, 8]).

Related work: Plank et al. [5, 4] present measurements of checkpoint latency and overhead for a few applications; however, they do not present any performance analysis. We have previously measured checkpoint latency and overhead for a few uni-process applications, and briefly analyzed the impact of checkpoint latency on the performance of "two-level" recovery schemes [7, 9].

(This work is supported in part by the National Science Foundation under CAREER Grant MIP-9502563.)

2 Checkpoint Latency

We limit the discussion to uni-process applications; due to lack of space, multi-process applications are not discussed here [10]. In this section, we illustrate the distinction between checkpoint latency and checkpoint overhead with two examples.

Sequential checkpointing is an approach for which checkpoint overhead is identical to checkpoint latency.
In this approach, when an application process wants to take a checkpoint, it pauses and saves its state on stable storage [5]. Therefore, the time required to save the checkpoint (i.e., the checkpoint latency) is practically identical to the increase in the execution time of the process (i.e., the checkpoint overhead). Figure 1 illustrates this approach. The horizontal line represents processor execution, with time increasing from left to right; the shaded box represents the checkpointing operation. The sequential checkpointing approach achieves the smallest possible checkpoint latency, but it results in a larger checkpoint overhead than other approaches.

[Figure 1: Sequential Checkpointing. The application performs no useful computation while checkpointing is in progress, so checkpoint overhead C is identical to checkpoint latency L; the checkpoint is considered "established" at the end of the checkpoint latency period.]

Forked checkpointing is an approach for which checkpoint overhead is usually much smaller than checkpoint latency. In this approach, when a process wants to take a checkpoint, it forks a child process [5]. The state of the child process is identical to that of the parent process when the fork is performed. After the fork, the parent process continues computation, while the child process saves its state on stable storage. Figure 2(a) illustrates this approach. Because computation is overlapped with stable storage access (i.e., with state saving), the checkpoint overhead is usually smaller than in the sequential checkpointing approach. Also, as the parent and the child execute in parallel, the checkpoint latency is larger than the checkpoint overhead. Figure 2(b) illustrates the interleaved execution of the child and parent processes on the same processor: useful computation performed by the parent process is interleaved with the checkpointing operation performed by the child process.

[Figure 2: Forked Checkpointing. (a) The parent process continues computation while the child saves the state; the checkpoint is "established" when the child finishes saving the state and exits. (b) On a single processor, the parent's useful computation is interleaved with the child's checkpointing; the checkpoint is considered "established" at the end of the checkpoint latency period.]
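To make the distinction concrete, the following is a minimal sketch of forked checkpointing on a POSIX system (a rough illustration only, not the implementation measured in [5]; the function and file names are assumed). The parent resumes computation almost immediately after the fork, so the overhead C is small, while the latency L lasts until the child finishes writing the snapshot.

    # Minimal forked-checkpointing sketch (assumed names; illustration only).
    import os, pickle

    def take_checkpoint(state, path="checkpoint.ckpt"):
        pid = os.fork()
        if pid == 0:                  # child: holds a copy of the state
            with open(path, "wb") as f:
                pickle.dump(state, f)
            os._exit(0)               # checkpoint "established" here
        return pid                    # parent: continue useful work

    if __name__ == "__main__":
        state = {"step": 42, "data": list(range(1000))}
        child = take_checkpoint(state)
        state["step"] += 1            # useful computation during the latency period
        os.waitpid(child, 0)          # in practice, reap the child asynchronously

On systems with copy-on-write fork(), the child's copy of the address space is created lazily, which is what keeps the overhead C small relative to the latency L.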

For future reference, note that a checkpoint is said to have been established if a future failure can be tolerated by rolling back to this checkpoint. Thus, a checkpoint is not considered established until the end of the latency period; when execution progresses past the end of the checkpoint latency period, the checkpoint is considered to have been established (refer to Figures 1 and 2). A checkpoint interval is defined as the duration between the establishment of two consecutive checkpoints; that is, a checkpoint interval begins when one checkpoint is established, and ends when the next checkpoint is established. (For brevity, the term interval will be used to denote a checkpoint interval.) We assume that the checkpoints are equi-distant, i.e., the amount of useful computation performed during each interval is identical (denoted by T).

3 Latency and Overhead

As discussed in the previous section, the checkpoint latency period contains two types of execution: (1) useful computation, and (2) execution necessary for checkpointing. The two types are usually interleaved in time. However, for modeling purposes, we can assume that they are separated in time, as shown in Figure 3: the first C units of time during the checkpoint latency period are assumed to be used for saving the checkpoint, and the remaining (L - C) units are assumed to be spent on useful computation. Although the C units of overhead are modeled as being incurred at the beginning of the checkpoint latency period, the checkpoint is considered to have been established only at the end of the latency period.

[Figure 3: Modeling checkpoint latency and overhead. The checkpoint overhead C is charged at the start of the latency period; the checkpoint is considered "established" only at the end of the checkpoint latency period.]

Although this representation of checkpoint latency and overhead is simplified, we now demonstrate that it leads to accurate analysis.
Two distinct situations may occur when a checkpoint interval is executed.

Situation 1: A failure does not occur while the interval is executed. In this case, the execution time from the beginning to the end of the interval is T + C. Of these T + C units, T units are spent on useful computation, while an overhead of C time units is incurred. As shown in Figure 4(a), (L - C) units of useful computation are performed during the checkpoint latency period. Now consider Figure 4(b). Similar to Figure 4(a), (L - C) units of useful computation are performed during the latency period, and the execution time for the interval is again T + C.

[Figure 4: Situation 1. In both (a) the accurate representation and (b) the simplified representation, the interval takes T + C units, and (L - C) units of useful computation are performed during the latency period.]

Situation 2: A failure occurs during the interval. When a failure occurs, the task must be rolled back to the previous checkpoint, incurring an overhead of R time units. In Figure 5(a), the task is rolled back to checkpoint CP1. After the rollback, the (L - C) units of useful computation performed during the latency period of CP1 must be performed again; this is necessary because the state saved during checkpoint CP1 is the state at the beginning of the latency period for CP1. In the absence of a further failure, an additional T + C units of execution are required before the completion of the interval. Thus, after a failure, R + (L - C) + (T + C) = R + T + L units of execution are required before the completion of the interval, provided additional failures do not occur.
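As a quick illustration (these parameter values are assumed here, not taken from the paper): with C = 4, L = 10, T = 100 and R = 2, a failure-free interval completes in T + C = 104 time units, whereas after a failure, R + (L - C) + (T + C) = 2 + 6 + 104 = 112 = R + T + L units of execution are needed to complete the interval.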

Now consider Figure 5(b). When the failure occurs, the system can be considered to have rolled back to the end of the "shaded portion" in the latency period of checkpoint CP1. (Note that no state change occurs during the "shaded portion".) It is now apparent that, in the absence of further failures, R + T + L units of execution are required to complete the interval. Thus, our representation of checkpoint latency and overhead yields the same conclusion as the more accurate representation in Figure 5(a).

[Figure 5: Situation 2. A failure (with rollback overhead R) forces a rollback to checkpoint CP1. In both (a) the accurate representation and (b) the simplified representation, R + T + L units of execution are required to complete the interval after the failure.]

The above discussion also applies if the failure occurs during the checkpoint latency period of checkpoint CP2: such a failure also requires a rollback to checkpoint CP1, as checkpoint CP2 is not established when the failure occurs.(1) The discussion can similarly be extended to multiple failures during a checkpoint interval. The above two cases imply that the simplified representation of checkpoint latency and overhead yields the same results as the accurate representation.

(1) Chandy et al. [1] present an analysis of checkpointing schemes that does not take checkpoint latency into account. However, for sequential checkpointing (with L = C), our analysis is similar to theirs, with one exception. An assumption made by Chandy et al. [1] implies that a failure that occurs while checkpoint CP2 (in Figure 5) is being saved only requires re-initiation of the checkpointing operation; as per their assumption, the computation during the interval preceding checkpoint CP2 need not be performed again, even if a failure occurs while checkpoint CP2 is being established (i.e., the failure occurs after checkpointing is initiated but before it is completed). For many environments this assumption is not realistic. Therefore, we make the realistic assumption that a failure that occurs while a checkpoint (say, CP2 in Figure 5) is being established requires a rollback to the previous checkpoint (checkpoint CP1).

4 Evaluating the Overhead

We assume that C, L and R are constants for a given scheme. The fault model assumed for the analysis is as follows: processor failures are governed by a Poisson process with rate λ. When a processor fails, its local state is corrupted. A processor can fail during normal operation, during checkpointing, as well as during rollback and recovery. The stable storage, which is used for storing the checkpoints, is not subject to failures.

Let G(t) denote the expected (average) amount of execution time required to perform t units of useful computation. (Useful computation excludes the time spent on checkpointing and rollback recovery.) Then, we define the overhead ratio r as

    r = lim_{t→∞} (G(t) - t) / t = lim_{t→∞} G(t)/t - 1.

(For example, r = 0.05 means that, on average, the execution takes 5% longer than the useful computation it accomplishes.) We assume that the system executes an infinite task that takes equi-distant checkpoints. The execution of the task can then be considered a series of intervals, each interval beginning immediately after a checkpoint is established, and ending when the next checkpoint is established.
Figure 6 illustrates intervals I1, I2 and I3. (The meaning of the various states marked in the figure will be explained shortly.)

[Figure 6: Intervals. Interval I1 begins after checkpoint CP0 is established and ends when CP1 is established; during interval I2, two failures each force a rollback to CP1 before CP2 is established. The state labels (0, 1, 2) correspond to the Markov chain of Figure 7.]

As shown in the figure, interval I1 begins after checkpoint CP0 is established, and ends when checkpoint CP1 is established. Interval I2 begins after checkpoint CP1 is established. Before checkpoint CP2 is established, two failures occur, each requiring a rollback to checkpoint CP1. Subsequently, the computation progresses without failure and checkpoint CP2 is established. Interval I2 is completed when checkpoint CP2 is established.

Observe that T units of useful computation are performed during each checkpoint interval. Provided no failures occur during the interval, the total time required to execute an interval is T + C. If one or more failures occur while executing an interval, then the execution time is longer than T + C. Let τ denote the expected (average) execution time of an interval. Then, it is easy to see that

    overhead ratio r = lim_{t→∞} G(t)/t - 1 = τ/T - 1.

The expected execution time τ of a single checkpoint interval can be evaluated using the Markov chain [6, 12] in Figure 7. State 0 is the initial state, entered when an interval starts execution. A transition from state 0 to state 1 occurs when the interval is completed without a failure. If a failure occurs while executing the checkpoint interval, then a transition is made from state 0 to state 2. After state 2 is entered, a transition occurs to state 1 if no further failure occurs before the next checkpoint is established; if another failure occurs after entering state 2 and before the next checkpoint is established, then a transition is made from state 2 back to state 2. When state 1 is entered, the interval has completed execution. Therefore, state 1 is a sink state; there are no transitions out of state 1.

[Figure 7: Markov chain. State 0 is the start state; state 2 is entered on each failure; state 1 is the sink state entered when the next checkpoint is established.]

Figure 6 illustrates the various states for an example execution. As shown in Figure 6, during interval I1, the task is in state 0, and enters state 1 when the interval completes without a failure. During interval I2, the task is initially in state 0, and enters state 2 when a failure occurs. When another failure occurs, the task re-enters state 2. Subsequent to the second failure, interval I2 completes without any further failures, and state 1 is entered at the end of the interval.

Each transition (X, Y) from state X to state Y in the Markov chain has an associated transition probability P_XY and a cost K_XY. The cost K_XY of a transition (X, Y) is the expected (average) time spent in state X before making the transition to state Y. It is easy to see that P_01 = e^{-λ(T+C)} and K_01 = T + C. Also, P_02 = 1 - P_01 = 1 - e^{-λ(T+C)}. Now, the cost K_02 of transition (0, 2) is the expected duration from the beginning of the interval until the time when the failure occurs, given that a failure occurs before the end of the interval. Therefore,

    K_02 = ∫_0^{T+C} t (λ e^{-λt} / (1 - e^{-λ(T+C)})) dt
         = 1/λ - (T + C) e^{-λ(T+C)} / (1 - e^{-λ(T+C)}).

After state 2 is entered, a transition is made to state 1 if no further failure occurs before the next checkpoint is established. As discussed earlier, the execution time required for the next checkpoint to be established after a failure is R + T + L. Therefore, P_21 = e^{-λ(R+T+L)} and K_21 = R + T + L. If, however, another failure occurs after entering state 2 and before the next checkpoint is established, then a transition is made from state 2 back to state 2.
It follows that P_22 = 1 - P_21 = 1 - e^{-λ(R+T+L)}, and

    K_22 = ∫_0^{R+T+L} t (λ e^{-λt} / (1 - e^{-λ(R+T+L)})) dt
         = 1/λ - (R + T + L) e^{-λ(R+T+L)} / (1 - e^{-λ(R+T+L)}).

The expected execution time τ is the expected cost of a path from state 0 to state 1. It follows that

    τ = P_01 K_01 + P_02 [ K_02 + P_22 K_22 / (1 - P_22) + K_21 ].

Substituting the expressions for the various costs and transition probabilities, and simplifying, the following expression is obtained [10]:

    τ = (1/λ) e^{λ(L-C+R)} (e^{λ(T+C)} - 1).        (1)
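The closed form in Equation 1 can be checked numerically against the Markov-chain quantities from which it was derived. The following is a small sketch of such a check (the parameter values are assumed purely for illustration):

    # Check Equation (1) against the Markov-chain quantities.
    import math

    def tau_markov(lam, T, C, L, R):
        """Expected interval time from the transition probabilities and costs."""
        P01 = math.exp(-lam * (T + C))
        K01 = T + C
        P02 = 1.0 - P01
        K02 = 1.0 / lam - (T + C) * P01 / (1.0 - P01)  # mean time to failure, given one
        P21 = math.exp(-lam * (R + T + L))
        K21 = R + T + L
        P22 = 1.0 - P21
        K22 = 1.0 / lam - (R + T + L) * P21 / (1.0 - P21)
        return P01 * K01 + P02 * (K02 + P22 * K22 / (1.0 - P22) + K21)

    def tau_closed_form(lam, T, C, L, R):
        """Equation (1): tau = (1/lam) e^{lam(L-C+R)} (e^{lam(T+C)} - 1)."""
        return math.exp(lam * (L - C + R)) * (math.exp(lam * (T + C)) - 1.0) / lam

    def overhead_ratio(lam, T, C, L, R):
        """Overhead ratio r = tau/T - 1."""
        return tau_closed_form(lam, T, C, L, R) / T - 1.0

    if __name__ == "__main__":
        lam, T, C, L, R = 1e-6, 4000.0, 10.0, 100.0, 5.0   # assumed values
        print(tau_markov(lam, T, C, L, R))      # the two tau values should agree
        print(tau_closed_form(lam, T, C, L, R))
        print(overhead_ratio(lam, T, C, L, R))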

It follows that the overhead ratio r is given by

    r = τ/T - 1 = (1/(λT)) e^{λ(L-C+R)} (e^{λ(T+C)} - 1) - 1.        (2)

4.1 Minimizing the Overhead Ratio

Consider a checkpointing scheme that achieves a certain checkpoint overhead C and checkpoint latency L. For this checkpointing scheme, the objective now is to choose an appropriate value of T so as to minimize the overhead ratio r. The optimal value of T must satisfy

    ∂r/∂T = ∂/∂T [ (1/(λT)) e^{λ(L-C+R)} (e^{λ(T+C)} - 1) - 1 ] = 0
    ⟹  e^{λ(T+C)} (1 - λT) = 1,  for T ≠ 0.        (3)

As Equation 3 does not include L or R, the optimal value of T does not depend on L or R; it does, however, depend on C. Thus, to evaluate the optimal checkpoint interval for a given checkpointing scheme, it is adequate to know the value of C. However, to evaluate the overhead ratio achieved with the optimal T, L and R must also be known.

5 Inter-Dependence Between L and C

Checkpoint latency and checkpoint overhead are dependent on each other: an attempt to reduce the checkpoint overhead typically causes an increase in the checkpoint latency. Now, from Equation 2,

    ∂r/∂L = (1/T) e^{λ(L-C+R)} (e^{λ(T+C)} - 1) > 0        (4)

    ∂r/∂C = (1/T) e^{λ(L-C+R)} > 0        (5)

From Equations 4 and 5, observe that ∂r/∂L = (e^{λ(T+C)} - 1) ∂r/∂C. Also observe that (e^{λ(T+C)} - 1) is likely to be much smaller than 1 for realistic values of λ, T and C. Thus, ∂r/∂L will typically be much smaller than ∂r/∂C; this implies that r is more sensitive to changes in C than to changes in L. This is intuitive: L does not affect the failure-free execution time, while C does. L affects the execution time only when a failure occurs, and as failures do not occur too frequently, r is less sensitive to L than to C.

In practice, a checkpointing scheme that increases L and also increases C would simply not be used; in practice, an increase in L is therefore accompanied by a decrease in C. For sequential checkpointing, checkpoint overhead and latency are identical, say C_max. A recovery scheme that attempts to achieve a smaller checkpoint overhead C than sequential checkpointing will incur a latency L larger than that of sequential checkpointing. One would not use such a scheme unless it resulted in a lower overhead ratio r than sequential checkpointing. Equations 4 and 5 imply that a recovery scheme that achieves a smaller checkpoint overhead and a larger latency than sequential checkpointing can achieve a smaller overhead ratio, provided the latency is not "too much" larger. Our objective here is to determine when the latency is not "too much" larger. More precisely, the objective is to determine a function g of C such that, for any C < C_max, the overhead ratio r is smaller than that of the sequential checkpointing scheme if L < g(C). The derivation of g(C) is omitted here [10]; it can be shown that

    g(C) = C + (1/λ) ln( (1 - λ T_c) / (1 - λ T_m) ),

where T_c is the value of T that satisfies Equation 3, and T_m is the solution of Equation 3 with C = C_max.(2) Clearly, as one would expect, g(C_max) = C_max. (Note that, when C = C_max, T_c = T_m.) The g(C) expression can be used to determine when a checkpointing scheme will perform better than the sequential checkpointing scheme.

(2) T_c is approximately sqrt(2C/λ) when C ≪ 1/λ (similarly, T_m is approximately sqrt(2 C_max/λ)). Young [11] previously obtained this expression by a somewhat different analysis.
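Equation 3 has no closed-form solution for T, but the optimal T is easy to obtain numerically, and g(C) then follows directly from it. A minimal sketch (parameter values assumed for illustration; the left-hand side of Equation 3 is strictly decreasing in T on (0, 1/λ), so bisection suffices):

    # Optimal checkpoint interval (Equation 3) and the bound g(C) (Section 5).
    import math

    def optimal_T(lam, C):
        """Solve e^{lam (T + C)} (1 - lam T) = 1 for T in (0, 1/lam) by bisection."""
        f = lambda T: math.exp(lam * (T + C)) * (1.0 - lam * T) - 1.0
        lo, hi = 1e-12, 1.0 / lam          # f(lo) > 0 and f(1/lam) < 0
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                lo = mid                   # root lies above mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def g(lam, C, C_max):
        """g(C) = C + (1/lam) ln((1 - lam T_c) / (1 - lam T_m))."""
        T_c = optimal_T(lam, C)
        T_m = optimal_T(lam, C_max)
        return C + math.log((1.0 - lam * T_c) / (1.0 - lam * T_m)) / lam

    if __name__ == "__main__":
        lam, C_max = 1e-6, 25.0
        print(optimal_T(lam, 10.0))   # close to sqrt(2 C / lam), per footnote 2
        print(g(lam, 10.0, C_max))    # above 2000; cf. the example discussed below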
Figure 8 plots g(C) for λ = 10^{-6} and λ = 10^{-4} per second (λ = 10^{-6} for curves (1)-(4), and λ = 10^{-4} for curves (5)-(8)).

[Figure 8: g(C) for various values of C_max. Curves (1)-(4) plot g(C) for λ = 10^{-6}/sec with C_max = 100, 75, 50 and 25 seconds, respectively; curves (5)-(8) plot the same values of C_max for λ = 10^{-4}/sec. Checkpoint overhead C (seconds) is on the horizontal axis; checkpoint latency or g(C) (seconds) is on the logarithmic vertical axis.]

Consider the g(C) curve for C_max = 25 and λ = 10^{-6}. The definition of g(C) implies that, if a checkpointing scheme achieves overhead and latency corresponding to a point "below" the g(C) curve for C_max = 25, then the scheme achieves a smaller overhead ratio than the corresponding sequential checkpointing scheme (with checkpoint overhead 25). For instance, if some scheme reduces C from 25 to 10, then it can achieve a smaller overhead ratio r than the sequential checkpointing scheme even if it increases the latency from 25 to as much as 2000. A comparison of the curves for λ = 10^{-6} and λ = 10^{-4} indicates that, for the same C_max, g(C) decreases as λ increases. This is intuitive, because with a larger λ it is necessary to keep the checkpoint latency smaller (to avoid an increase in the overhead ratio).

The "measured L" curve in Figure 9 plots the checkpoint overhead and latency measured for a merge sort program using four different checkpointing schemes; the data is borrowed from Li et al. [4]. (Although the data in [4] corresponds to a parallel implementation on a shared memory machine, our analysis is applicable to this implementation.) One of the four schemes in the "measured L" curve is sequential checkpointing, with overhead C_max = 31 seconds. For comparison, Figure 9 also plots g(C) for three different values of λ. Observe that, even when λ is as large as 10^{-4} per second, the measured checkpoint latency is well below the g(C) curve. This indicates that the checkpointing techniques used in practice can achieve a significantly smaller overhead ratio than the sequential checkpointing scheme.

[Figure 9: Comparison of measured checkpoint latency and g(C) for λ = 10^{-4}, 10^{-6} and 10^{-8}/sec. Checkpoint overhead C (seconds) is on the horizontal axis.]

6 Conclusions

This paper evaluates an expression for the overhead ratio of a checkpointing scheme as a function of checkpoint latency (L) and checkpoint overhead (C). Our analysis shows that, for an equi-distant checkpointing strategy, the optimal checkpoint interval does not depend on the value of L, though it does depend on the value of C. It is also observed that the overhead ratio is much more sensitive to changes in C than to changes in L. The paper uses a simple analytical model; if a different model is used, the latter observation will remain valid, although the optimal checkpoint interval may no longer be independent of L (it should, however, be less sensitive to L than to C). The paper considers only uni-process applications; the results can potentially be extended to multi-process applications as well [10].

References

[1] K. M. Chandy, J. C. Browne, C. W. Dissly, and W. R. Uhrig, "Analytic models for rollback and recovery strategies in data base systems," IEEE Trans. Softw. Eng., vol. 1, March 1975.

[2] E. Gelenbe, "On the optimum checkpointing interval," J. ACM, vol. 26, pp. 259-270, April 1979.

[3] P. L'Ecuyer and J. Malenfant, "Computing optimal checkpointing strategies for rollback and recovery systems," IEEE Trans. Computers, vol. 37, pp. 491-496, April 1988.

[4] K. Li, J. F. Naughton, and J. S. Plank, "Low-latency, concurrent checkpointing for parallel programs," IEEE Trans. Parallel Distrib. Syst., vol. 5, pp. 874-879, August 1994.

[5] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Usenix Winter 1995 Technical Conference, New Orleans, January 1995.

[6] K. S. Trivedi, Probability and Statistics with Reliability, Queueing and Computer Science Applications. Prentice-Hall, 1988.

[7] N. H. Vaidya, "Another two-level failure recovery scheme: Performance impact of checkpoint placement and checkpoint latency," Tech. Rep. 94-068, Computer Science Department, Texas A&M University, College Station, December 1994.

[8] N. H. Vaidya, "Consistent logical checkpointing," Tech. Rep. 94-051, Computer Science Department, Texas A&M University, College Station, July 1994.

[9] N. H. Vaidya, "A case for two-level distributed recovery schemes," in ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1995.

[10] N. H. Vaidya, "On checkpoint latency," Tech. Rep. 95-015, Computer Science Department, Texas A&M University, College Station, March 1995.

[11] J. W. Young, "A first order approximation to the optimum checkpoint interval," Comm. ACM, vol. 17, pp. 530-531, September 1974.

[12] A. Ziv and J. Bruck, "Analysis of checkpointing schemes for multiprocessor systems," Tech. Rep. RJ 9593, IBM Almaden Research Center, November 1993.