Using Reflection for Checkpointing Concurrent Object Oriented Programs

Mangesh Kasbekar, Chandramouli Narayanan, Chita R Das
Department of Computer Science & Engineering
The Pennsylvania State University
University Park, PA 16802
{kasbekar,cnarayan,das}@cse.psu.edu

Abstract

This paper presents a reflective approach to checkpointing concurrent object oriented programs. We describe a checkpointing and rollback library for multithreaded programs written in C++. We demonstrate some of the unique features offered by this library, such as selective checkpointing and selective rollback of the threads of a process, that are achievable only through the use of reflection.

1 Introduction

Checkpointing is one of the commonly used remedies for transient software failures. Checkpointing a running program involves saving enough of the program's state on stable storage so that, if the program crashes, it can be restarted from the saved state instead of from the beginning. Existing libraries [1, 2, 3] provide checkpointing facilities, but they do not take the multithreaded or object oriented nature of the software system into consideration. Reflection, as a method for separating the fault-tolerance mechanism from the application, has been used in software fault tolerance. But these systems assume a non-concurrent software model and implement fault tolerance through N-version programming and recovery blocks [4] or server replication [5]. Concurrent object oriented software systems are known to have transient faults [6]. In this paper we demonstrate the use of reflection for building a prototype checkpointing and recovery library, libooft [7], which addresses transient faults in concurrent object oriented systems.

The conventional methods of checkpointing are completely non-object-oriented in nature. We take an object oriented approach to checkpointing by assuming that all data of a program is in the form of objects, each of which knows how to checkpoint its own data.
This assumption allows us to develop many interesting and unique options for checkpointing and rollback in addition to the conventional ones [8, 9], especially for concurrent programs. The prominent features presented in this paper are the following. First, the library allows some threads of a process to be rolled back to their previous checkpoint while the others continue unaffected by these rollbacks. Second, it transparently separates the functional part of objects from their non-functional part (fault tolerance) with the help of reflection.

The most important property to be satisfied in rollbacks is consistency of program data. We use reflection to ensure consistency in such rollbacks. The run-time component enables optimizations during the checkpointing phase by identifying and limiting the number of objects that need to be included in a checkpoint. To keep the run-time monitoring transparent to the application programmer, we use the metaobject protocol (MOP) provided by OpenC++ [10] to analyze and translate user programs and insert the run-time support code into them. When applied to software fault tolerance, this scheme is useful for tolerating thread-level failures and for limiting the number of threads affected by the failure of other threads in the program.

2 Definitions and Consistency Requirements

We define some terms that will be used later.

Rollback object set and rollback thread set: The sets of objects and threads that need to be rolled back to ensure correctness.

Thread-object interaction information: At any given instant, the thread-object interaction information for a process can be specified by a set of 2-tuples $(T_i, O_j)\ \forall i \in (1, N_t)$, where $N_t$ is the maximum number of threads, $T_i$ is the $i$th thread and $O_j$ is the $j$th object being accessed at that instant.

Object dependency graph: At any given instant, the object dependency graph is an undirected graph of the objects in the system, with an edge connecting two objects if at least one of them has invoked a method of the other since the last checkpoint.

2.1 Ensuring Consistency After a Selective Rollback

Consider a typical program execution as shown in Fig. 1(a). A checkpoint at time T1 represents a consistent state of the program. A rollback to this checkpoint is therefore always consistent. But, observing the execution of the program after time T1, we see that threads Thread1 and Thread2 have disjoint reference data sets: each accesses its own set of objects. At time T2, if Thread1 is to be rolled back to the checkpoint, only Thread1 and the objects it accesses need be rolled back, while the objects accessed by Thread2 remain in their state at time T2. This state, shown in Fig. 1(b), is also consistent, since there is no defined ordering between the execution of Thread1 and that of Thread2, which is manifested by their disjoint reference data sets.

Even though methods to checkpoint and restore a checkpoint are best provided by the programmer, we believe that the programmer should not be burdened with the responsibility of providing this information. Moreover, with the use of various class libraries in a program, generating such information may not be possible for the user of these libraries.
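As an illustration of the disjointness argument of Section 2.1, the following C++ sketch records which objects each thread has accessed since the last checkpoint and checks whether one thread's reference data set overlaps any other thread's. The names ReferenceSets, record_access, disjoint and can_roll_back_alone are hypothetical and are not part of libooft. The check is also deliberately simplified: it ignores object-to-object method invocations, which the object dependency graph is meant to capture.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <set>
#include <vector>

// Hypothetical identifiers for this sketch; libooft's real types are not shown in the paper.
using ThreadId = int;
using ObjectId = int;

// Reference data set of each thread: the objects it has accessed since the last checkpoint.
using ReferenceSets = std::map<ThreadId, std::set<ObjectId>>;

// Record that thread t accessed object o.
void record_access(ReferenceSets& refs, ThreadId t, ObjectId o) {
    assert(t >= 0 && o >= 0);  // ids are non-negative in this sketch
    refs[t].insert(o);
}

// True if the two object sets have no element in common.
bool disjoint(const std::set<ObjectId>& a, const std::set<ObjectId>& b) {
    std::vector<ObjectId> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    return common.empty();
}

// Simplified test: thread t can be rolled back on its own if its reference
// data set is disjoint from that of every other thread.
bool can_roll_back_alone(const ReferenceSets& refs, ThreadId t) {
    const auto self = refs.find(t);
    if (self == refs.end()) return true;  // t touched nothing since the checkpoint
    for (const auto& entry : refs) {
        if (entry.first != t && !disjoint(self->second, entry.second)) return false;
    }
    return true;
}
```

With two threads whose recorded accesses do not overlap, can_roll_back_alone holds for either of them; as soon as both threads touch the same object, it no longer does.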
A generic way to generate checkpoint and restore methods saves programmers the trouble of specifying them for every program they write. The same holds for the reference data sets and the thread-object interaction information, which are required for selective rollbacks. Reflection is the process of reasoning about and acting upon the system itself. We use compile-time reflection to automatically generate the checkpoint and recovery methods for all the objects in the system. By tracking the method invocations of each object, the reference data sets for threads are built and the thread-object interaction information is generated.

Figure 1: Execution and a consistent selective rollback of a program. (a) Execution of a program. (b) Selective rollback.

3 Checkpointing and Rollback Schemes

In this section we present the blocking global checkpointing scheme and the blocking selective rollback scheme. These schemes require all the threads to be suspended while checkpointing or recovery is in progress.

3.1 Blocking Global Checkpoints

The blocking algorithm begins with a checkpointer thread sending a CHECKPOINT signal to all worker threads, which makes them generate the thread-object interaction information and suspend themselves until an END CHECKPOINT signal arrives. The checkpointer then goes through the list of objects, invoking the checkpoint method of each object, which stores the checkpoint of that object in stable storage. After every object has been checkpointed, it stores the thread-object interaction information in stable storage. Finally, it sends an END CHECKPOINT signal to all the worker threads. On receipt of this signal, each thread stores its state in stable storage and continues with its computation.

3.2 Blocking Selective Rollback

The blocking selective rollback algorithm begins with a thread sending a ROLLBACK signal to all worker threads in the process. This makes them generate the thread-object interaction information and suspend themselves until an END ROLLBACK signal arrives. The starting object for recovery ($O_s$) is the object that initiated the rollback process. It is inserted into the rollback object set. Starting from $O_s$, the dependency graph is searched to discover connected objects. As objects are discovered, they are added to the rollback object set and compared against the thread-object interaction information to discover any thread that was associated with any of them when the ROLLBACK signal was raised. Any thread discovered this way is added to the rollback thread set. At the end of the search, all objects in the rollback object set are rewound to their state at the previous global checkpoint. Finally, an END ROLLBACK signal is sent to all worker threads. After this signal, each worker thread checks whether it is included in the rollback thread set. If so, it rolls itself back to its state at the previous global checkpoint. Otherwise, it continues with its computation.

Figure 2: The blocking checkpointing and rollback schemes. (a) Checkpoint. (b) Rollback.

Fig. 2(a) and (b) illustrate the checkpointing and recovery schemes described above with a simple example. As shown in Fig. 2(b), a global checkpoint is taken at time T1 and both threads are executing when an object initiates a rollback at time T2. The selective rollback scheme analyzes the object dependency graph and the thread-object interaction information to discover the reference data set of Thread1 after T1, that of Thread2 (which includes O7), and the initiating object. It then rolls back the affected objects, including O7, and Thread2.
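The search of Section 3.2 can be sketched in C++ as a breadth-first traversal of the object dependency graph. This is a minimal sketch under stated assumptions, not the libooft implementation: the types DependencyGraph, Interaction and RollbackSets and the functions add_edge and compute_rollback_sets are hypothetical names introduced here, and signalling, suspension and the actual state restoration are left out.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <utility>

// Hypothetical identifiers for this sketch; not libooft types.
using ObjectId = int;
using ThreadId = int;

// Undirected object dependency graph: an edge means that one object has
// invoked a method of the other since the last checkpoint.
using DependencyGraph = std::map<ObjectId, std::set<ObjectId>>;

// Thread-object interaction information captured when ROLLBACK is raised:
// pairs (thread, object currently being accessed by that thread).
using Interaction = std::set<std::pair<ThreadId, ObjectId>>;

struct RollbackSets {
    std::set<ObjectId> objects;  // rollback object set
    std::set<ThreadId> threads;  // rollback thread set
};

// Add an undirected edge between objects a and b.
void add_edge(DependencyGraph& graph, ObjectId a, ObjectId b) {
    graph[a].insert(b);
    graph[b].insert(a);
}

// Breadth-first search from the initiating object O_s (here: start).
RollbackSets compute_rollback_sets(const DependencyGraph& graph,
                                   const Interaction& at_rollback,
                                   ObjectId start) {
    RollbackSets result;
    std::queue<ObjectId> pending;
    result.objects.insert(start);
    pending.push(start);
    while (!pending.empty()) {
        const ObjectId current = pending.front();
        pending.pop();
        // Any thread that was inside this object at ROLLBACK time must roll back too.
        for (const auto& entry : at_rollback) {
            if (entry.second == current) result.threads.insert(entry.first);
        }
        const auto it = graph.find(current);
        if (it == graph.end()) continue;  // object has no edges
        for (const ObjectId next : it->second) {
            // insert(...).second is true only for newly discovered objects
            if (result.objects.insert(next).second) pending.push(next);
        }
    }
    assert(result.objects.count(start) == 1);  // O_s is always rolled back
    return result;
}
```

Objects and threads that do not appear in the returned sets keep their current state, which is what lets unrelated threads continue without loss of computation.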
In the example of Fig. 2(b), Thread1 continues without any loss of computation. It can be easily proved that the resultant state is consistent.

4 Design of the Library

The library libooft has two distinct components. One consists of the routines required for saving and restoring stacks, registers and the like, which characterize this class of fault-tolerant libraries. The other is a generic metaclass that makes every object in the system reflective. It combines compile-time and run-time reflection to make every object in the system fault-tolerant and every thread selectively recoverable. Program data is saved at the object level, unlike in conventional implementations.

4.1 Compile-Time Reflection

At compile time, the OpenC++ compiler analyzes the data members and inheritance hierarchies of the base-level program and generates the checkpoint and restore functions for each class. These functions are responsible for saving (restoring) the data members of an object into (from) the checkpoint file when the object is asked to checkpoint (restore) itself. A base class that maintains information about the instances of the class is added to every class. All member functions are wrapped to generate the run-time object interaction information. We show a sample program below, before the source translation stage.

    #include <ooft.h>

    class A {
        int foo();
    private:
        int x1;
        int x2;
    };

    int A::foo() {
        // function body
    }

    int main() {
        // main body
    }

The program, after translation, becomes:

    class ooft {
        // library data members
    };

    class A : virtual public ooft {
        int foo();
    private:
        int original_foo();
        int x1;
        int x2;
        void checkpoint() {
            // Generated code
            checkpointwrite(x1, sizeof(int));
            checkpointwrite(x2, sizeof(int));
        }
        void restore() {
            // Generated code
            checkpointread(&x1, sizeof(int));
            checkpointread(&x2, sizeof(int));
        }
    };

    int A::foo() {
        // Get caller's id and make an
        // entry in object dependency graph
        return original_foo();
    }

    int A::original_foo() {
        // function body
    }

    int main() {
        // main body
    }

4.2 Runtime Component

To begin with, the object list is empty and the object dependency graph has no edges. As objects are created and invoke each other's methods, entries are made in the object list by the library class ooft and edges are added to the object dependency graph by the function wrappers. After each checkpoint, all edges are removed from the graph. Similarly, the runtime library maintains a thread list, which is kept in sync with thread creation and deletion during execution.

5 Performance

In the absence of any standard multithreaded object oriented (MT-OO) benchmarks, we used a synthetic workload to evaluate the performance of the checkpointing and recovery schemes. The first program is deterministic and consists of 10 threads and 1000 objects that are used to sort a given piece of data. Each object owns integer data of random size and a sort method for sorting the data. Objects are arranged in groups of 100, and each thread owns a group. Each thread's reference data set is the group owned by it plus a few objects from its two neighbors' groups. In each iteration of the workload, the sort method is called at least once for every object in the reference data set. To create different types of workloads, we vary the size of objects and the overlap between groups to create two workloads, Workload1 and Workload2.
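To make the grouping concrete, the following C++ sketch computes the reference data set of one thread in the first program: its own group plus a configurable number of objects from each neighboring group. The function name reference_data_set, the numbering of objects from 0, the wrap-around choice of neighbors and the choice of which neighbor objects are shared are all assumptions made for illustration; the paper does not specify them.

```cpp
#include <cassert>
#include <set>

// Sketch only: object ids run from 0 to num_threads * group_size - 1, and
// thread t owns the group [t * group_size, (t + 1) * group_size).
// Assumption: neighbours wrap around, so thread 0 borders the last thread.
// Assumption: the shared objects are the ones closest to the group boundary.
std::set<int> reference_data_set(int t, int num_threads, int group_size, int overlap) {
    assert(t >= 0 && t < num_threads);
    assert(overlap >= 0 && overlap <= group_size);
    std::set<int> refs;
    const int first = t * group_size;
    for (int i = 0; i < group_size; ++i) {
        refs.insert(first + i);  // the thread's own group
    }
    const int left = (t + num_threads - 1) % num_threads;
    const int right = (t + 1) % num_threads;
    for (int i = 0; i < overlap; ++i) {
        refs.insert(left * group_size + group_size - 1 - i);  // tail of the left group
        refs.insert(right * group_size + i);                  // head of the right group
    }
    return refs;
}
```

With 10 threads and groups of 100 objects, a small overlap value gives each thread a reference data set only slightly larger than its own group, while a larger value makes neighboring threads share more objects.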
Workload1 has small objects and small overlap, and it represents a program in which the complete reference data set is built up very quickly after a checkpoint. In Workload2, the object sizes and the overlap are bigger, and it therefore represents a program with a very slow growth of the reference data set: the complete reference data set of a thread is never fully referenced within a checkpoint interval. The second program, Workload3, consists of 1000 objects and 20 threads. Unlike the first program, it is completely random in terms of the number of objects used by a thread, the communication between objects and the time spent by each thread in each object.

5.1 Performance of Checkpointing

Execution overhead includes the cost of run-time reflection and the actual checkpoint time. We estimate the checkpoint time from the overhead and the total number of checkpoints, assuming that the cost of run-time data collection is distributed equally over the checkpoints. A study of the execution times of the workload programs shows that, in the absence of faults, the overhead due to run-time data collection is small: less than 2% for the workloads used above. In the presence of random faults, the selective recovery mechanism is always triggered, and it rolls back a subset of the program's threads and objects. The more detailed study in [7] shows that the effect of a fault is localized by the selective recovery mechanism. If the failure occurs very shortly after a checkpoint, very few objects and threads are affected by it.

6 Conclusions and Future Work

Using meta-information for checkpointing of threads and objects has a run-time overhead, but its cost is only a small part of the functional cost of the objects themselves. Our experiment with reflection has helped separate the fault-tolerance mechanism from the application and, in the process, increased the reusability of the prototype library. Transparent addition of checkpoint and recovery, and dynamic maintenance of the object dependency graph, have been possible largely due to reflection. Finally, exploiting the multithreaded nature of applications to selectively recover threads and objects has good potential for superior performance, especially for server-class applications where thread interactions do not grow rapidly with time.

To move from a synthetic workload to a realistic one, libooft will have to handle dynamic allocation by and of objects. This entails either swizzling of pointers [11] or deferring the deletion of allocated objects to consistent checkpoint times. Further, reinstatement at recovery of threads blocked on synchronization objects, such as mutexes and condition variables, is a necessary enhancement.

Acknowledgements

We would like to thank Shigeru Chiba of the University of Tsukuba for clarifying numerous questions on the use of the OpenC++ compiler, and Yennun Huang, Reinhard Klemm and Shalini Yajnik of Bell Laboratories for valuable discussions about issues in the implementation of the source code generator.

References

[1] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Conference Proceedings, Usenix Winter 1995 Technical Conference, (New Orleans, LA), January 1995.

[2] Y. Huang and C. M. R. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proceedings of Intl. Symposium on Fault-Tolerant Computing, (Toulouse, France), pp. 2-9, June 1993.

[3] H.-C. Nam, J. Kim, S. Hong, and S. Lee, "Probabilistic checkpointing," in Proceedings of Intl. Symposium on Fault-Tolerant Computing, 1997.

[4] J. Xu, B. Randell, and A. F. Zorzo, "Implementing software-fault tolerance in C++ and OpenC++: An object oriented and reflective approach," in Proceedings of International Workshop on Computer Aided Design, Test, and Evaluation for Dependability, pp. 224-229, July 1996.

[5] J.-C. Fabre and T. Perennou, "A metaobject architecture for fault tolerant distributed systems: The FRIENDS approach," in IEEE Transactions on Computers, pp. 78-95, January 1998.

[6] J. Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. Stroud, and Z. Wu, "Fault tolerance in concurrent object-oriented software through coordinated error recovery," in Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS-25), pp. 499-508, 1995.

[7] M. Kasbekar, C. R. Das, and A. Sivasubramaniam, "An object oriented approach to checkpointing." The Pennsylvania State University, University Park.

[8] G. Deconinck, J. Vounckx, R. Cuyvers, and R. Lauwereins, "Survey of checkpointing and rollback techniques," Tech. Rep..1.8 and 3.1.12, ESAT-ACCA Laboratory, Katholieke Universiteit Leuven, Belgium, June 1993.

[9] E. Elnozahy, D. Johnson, and Y. Wang, "A survey of rollback-recovery protocols in message passing systems," Tech. Rep. CMU-CS-96-144, Department of Computer Science, Carnegie Mellon University, August 1996.

[10] S. Chiba, "A metaobject protocol for C++," in Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 285-299, October 1995.

[11] J. Eliot B. Moss, "Working with persistent objects: To swizzle or not to swizzle," IEEE Transactions on Software Engineering, vol. 18, pp. 657-673, August 1992.