Using Reflection for Checkpointing Concurrent Object Oriented Programs

Mangesh Kasbekar, Chandramouli Narayanan, Chita R Das
Department of Computer Science & Engineering
The Pennsylvania State University
University Park, PA
{kasbekar,cnarayan,das}@cse.psu.edu

Abstract

This paper presents a reflective approach to checkpointing concurrent object oriented programs. We describe a checkpointing and rollback library for multithreaded programs written in C++. We demonstrate some of the unique features offered by this library, such as selective checkpointing and selective rollback of the threads of a process, that are achievable only through the use of reflection.

1 Introduction

Checkpointing is one of the commonly used remedies for transient software failures. Checkpointing a running program involves saving enough of its state on stable storage that, if the program crashes, it can be restarted from the saved state instead of from the beginning. Existing libraries [1, 2, 3] provide a checkpointing facility, but they do not take the multithreaded or object oriented nature of the software system into consideration. Reflection, as a method for separating the fault-tolerance mechanism from the application, has been used in software fault tolerance. But these systems assume a non-concurrent software model and implement fault tolerance through N-version programming and recovery blocks [4] or server replication [5]. Concurrent object oriented software systems are known to have transient faults [6]. In this paper we demonstrate the use of reflection for building a prototype checkpointing and recovery library, libooft [7], which addresses transient faults in concurrent object oriented systems.

The conventional methods of checkpointing are completely non-object-oriented in nature. We take an object oriented approach to checkpointing by assuming that all data of a program is in the form of objects, each of which knows how to checkpoint its own data.
This assumption allows us to develop many interesting and unique options for checkpointing and rollback in addition to the conventional ones [8, 9], especially for concurrent programs. The prominent features presented in this paper are the following. First, the library allows some threads of a process to be rolled back to their previous checkpoint while the others continue unaffected by these rollbacks. Second, it transparently separates the functional part of objects from their non-functional part (fault tolerance) with the help of reflection. The most important property to be satisfied in rollbacks is consistency of program data. We use reflection to ensure consistency in such rollbacks. The run-time component enables optimizations during the checkpointing phase by identifying and limiting the number of objects that must be included in a checkpoint. To keep the run-time monitoring transparent to the application programmer, we use the metaobject protocol (MOP) provided by OpenC++ [10] to analyze and translate user programs and insert the run-time support code into them. When applied to software fault tolerance, this scheme is useful for tolerating thread-level failures and for limiting the number of threads affected by the failure of other threads in the program.

2 Definitions and Consistency Requirements

We define some terms that will be used later.

Rollback object set and rollback thread set: The sets of objects and threads that need to be rolled back to ensure correctness.

Thread-object interaction information: At any given instant, the thread-object interaction information for a process can be specified by a set of 2-tuples $(T_i, O_j)\ \forall i \in (1, N_t)$, where $N_t$ is the maximum number of threads, $T_i$ is the $i$th thread and $O_j$ is the $j$th object being accessed at that instant.

Object dependency graph: At any given instant, the object dependency graph can be specified by an undirected graph of the objects in the system, with an edge connecting any two objects if at least one of the two has invoked a method of the other since the last checkpoint.

2.1 Ensuring Consistency After a Selective Rollback

Consider a typical program execution as shown in Fig. 1(a). A checkpoint at time T1 represents a consistent state of the program. A rollback to this checkpoint is therefore always consistent. But, from observation of the execution of the program after time T1, we see that threads Thread1 and Thread2 have disjoint reference datasets: Thread1 accesses three objects, and Thread2 accesses two others. At time T2, if Thread1 is to be rolled back to the checkpoint, only the three objects accessed by Thread1 need be rolled back, while the two objects accessed by Thread2 remain in their state at time T2. This state, shown in Fig. 1(b), is also consistent, since there is no defined ordering between the execution of Thread1 and Thread2, which is manifested by their disjoint reference datasets.

Even though the methods to checkpoint and restore an object are best provided by the programmer, we believe that the programmer should not be burdened with the responsibility of providing this information. Moreover, when a program uses various class libraries, the user of these libraries may not be able to generate such information.
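
The consistency argument of Section 2.1 rests on the two threads having disjoint reference datasets. The following is a minimal C++ sketch of that check, not code from libooft. It assumes that each reference dataset is available as a set of object identifiers and that only two threads are involved, as in Fig. 1. The names ObjectId, ReferenceDataset, sharedObjects and canRollBackAlone are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical identifier type; the paper does not say how libooft names objects.
using ObjectId = int;

// Reference dataset: the objects a thread has accessed since the last checkpoint.
using ReferenceDataset = std::set<ObjectId>;

// Returns the objects that appear in both reference datasets.
std::vector<ObjectId> sharedObjects(const ReferenceDataset& a,
                                    const ReferenceDataset& b) {
    std::vector<ObjectId> shared;
    // std::set iterates in sorted order, which std::set_intersection requires.
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(shared));
    // The intersection can never be larger than the smaller input.
    assert(shared.size() <= std::min(a.size(), b.size()));
    return shared;
}

// Assumption of this sketch: with two threads, the failing thread can be
// rolled back on its own exactly when the two datasets share no object.
bool canRollBackAlone(const ReferenceDataset& failing,
                      const ReferenceDataset& other) {
    return sharedObjects(failing, other).empty();
}
```

With datasets such as {1, 2, 3} and {4, 5}, canRollBackAlone returns true; as soon as one object appears in both sets, the rollback has to include the other thread as well.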
A generic way to generate the checkpoint and restore methods can save programmers the trouble of specifying them for every program they write. The same explanation holds true for the reference datasets and the thread-object interaction information, which are required for selective rollbacks.

Figure 1: Execution and a consistent selective rollback of a program. (a) Execution of a program. (b) Selective rollback.

Reflection is the process of reasoning about and acting upon the system itself. We use compile-time reflection to automatically generate the checkpoint and recovery methods for all the objects in the system. By tracking the method invocations of each object, the reference datasets for threads are built and the thread-object interaction information is generated.

3 Checkpointing and Rollback Schemes

In this section we present the blocking global checkpointing scheme and the blocking selective rollback scheme. These schemes require all the threads to be suspended while checkpointing or recovery is in progress.

3.1 Blocking Global Checkpoints

The blocking algorithm begins with a checkpointer thread sending a CHECKPOINT signal to all worker threads, which makes them generate the thread-object interaction information and suspend themselves until an END CHECKPOINT signal arrives. The checkpointer then goes through the list of objects, invoking the checkpoint method of each object, which stores the checkpoint of that object in stable storage. Once every object has been checkpointed, it stores the thread-object interaction information in stable storage. Finally, it sends an END CHECKPOINT signal to all the worker threads. After receipt of this signal,
each thread stores its state in stable storage and continues with its computation.

3.2 Blocking Selective Rollback

The blocking selective rollback algorithm begins with a thread sending a ROLLBACK signal to all worker threads in the process. This makes them generate the thread-object interaction information and suspend themselves until an END ROLLBACK signal arrives. The starting object for recovery ($O_s$) is the object that initiated the rollback process. It is inserted into the rollback object set. Starting from $O_s$, a search is made in the dependency graph to discover connected objects. As objects are discovered, they are added to the rollback object set and also compared against the thread-object interaction information to discover any thread that was associated with any of them when the ROLLBACK signal was sent. Any thread discovered this way is added to the rollback thread set. At the end of the search, all objects in the rollback object set are rewound to their state from the previous global checkpoint. Finally, the initiating thread sends an END ROLLBACK signal to all worker threads. After this signal, each worker thread checks whether it is included in the rollback thread set. If so, it rolls itself back to its state from the previous global checkpoint. Otherwise, it continues with its computation.

Figure 2: The Blocking Checkpointing and Rollback Schemes. (a) Checkpoint. (b) Rollback.

Fig. 2 (a) and (b) illustrate the checkpointing and recovery schemes described above with a simple example. As shown in Fig. 2(b), a global checkpoint is taken at time T1 and both threads are executing when an object initiates a rollback at time T2. The selective rollback scheme analyzes the object interaction graph and the thread-object interaction information to discover that the reference dataset of Thread1 after T1 consists of two objects, that of Thread2 consists of O7 and one other object, and which object initiated the rollback. It then rolls back the initiating object, O7 and Thread2.
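
The search described in Section 3.2 can be sketched in C++ as a breadth-first traversal of the object dependency graph. This is an illustrative sketch under stated assumptions, not the libooft implementation: the type names, addDependency and computeRollbackSets are hypothetical, objects and threads are identified by plain integers, and the thread set is collected after the traversal rather than during it, which yields the same sets.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <utility>

// Hypothetical identifier types; libooft's real types are not shown in the paper.
using ObjectId = int;
using ThreadId = int;

// Undirected object dependency graph: an edge joins two objects if either
// has invoked a method of the other since the last checkpoint.
using DependencyGraph = std::map<ObjectId, std::set<ObjectId>>;

// Thread-object interaction information: (thread, object) pairs recorded
// when the ROLLBACK signal is received.
using InteractionInfo = std::set<std::pair<ThreadId, ObjectId>>;

struct RollbackSets {
    std::set<ObjectId> objects;  // rollback object set
    std::set<ThreadId> threads;  // rollback thread set
};

// Records one method invocation between objects a and b.
void addDependency(DependencyGraph& graph, ObjectId a, ObjectId b) {
    graph[a].insert(b);
    graph[b].insert(a);
}

RollbackSets computeRollbackSets(const DependencyGraph& graph,
                                 const InteractionInfo& interaction,
                                 ObjectId start) {
    RollbackSets result;
    std::queue<ObjectId> frontier;
    result.objects.insert(start);  // the initiating object always rolls back
    frontier.push(start);

    // Breadth-first search for every object connected to the start object.
    while (!frontier.empty()) {
        const ObjectId current = frontier.front();
        frontier.pop();
        const auto neighbours = graph.find(current);
        if (neighbours == graph.end()) continue;  // object has no edges
        for (ObjectId next : neighbours->second) {
            if (result.objects.insert(next).second) frontier.push(next);
        }
    }

    // A thread rolls back if it was accessing any object in the rollback set.
    for (const auto& entry : interaction) {
        if (result.objects.count(entry.second) > 0) {
            result.threads.insert(entry.first);
        }
    }
    assert(result.objects.count(start) == 1);
    return result;
}
```

Threads that do not end up in the returned thread set correspond to the workers that simply resume after the END ROLLBACK signal. Suspending the workers and restoring object state are outside the scope of this sketch.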
In the example of Fig. 2(b), Thread1 continues without any loss of computation. It can easily be proved that the resulting state is consistent.

4 Design of the Library

The library libooft has two distinct components. One consists of the routines required for saving and restoring stacks, registers and the like, which characterise this class of fault-tolerant libraries. The other is a generic metaclass which makes every object in the system reflective. It combines compile-time and run-time reflection to make every object in the system fault-tolerant and every thread selectively recoverable. Program data is saved at the object level, unlike in conventional implementations.

4.1 Compile-Time Reflection

At compile time, the OpenC++ compiler analyzes the data members and inheritance hierarchies of the base-level program and generates the checkpoint and restore functions for each class. These functions are responsible for saving (restoring) the data members of objects into (from) the checkpoint file when the object is asked to checkpoint (restore) itself. A base class that maintains information about the instances of the class is added to every class. All member functions are wrapped to generate the run-time object interaction information. We show a sample program below, before the source translation stage.

    #include <ooft.h>

    class A {
    public:
        int foo();
    private:
        int x1;
        int x2;
    };

    int A::foo() {
        // function body
    }

    int main() {
        // main body
    }

The program, after translation, becomes:

    class ooft {
        // library data members
    };

    class A : virtual public ooft {
    public:
        int foo();
    private:
        int original_foo();
        int x1;
        int x2;
        void checkpoint() {
            // Generated code
            checkpointwrite(x1, sizeof(int));
            checkpointwrite(x2, sizeof(int));
        }
        void restore() {
            // Generated code
            checkpointread(&x1, sizeof(int));
            checkpointread(&x2, sizeof(int));
        }
    };

    int A::foo() {
        // Get caller's id and make an
        // entry in object dependency graph
        return original_foo();
    }

    int A::original_foo() {
        // function body
    }

    int main() {
        // main body
    }

4.2 Runtime Component

To begin with, the object list is empty and the object dependency graph has no edges. As objects are created and invoke each other's methods, entries are made in the object list by the library class ooft and edges are added to the object dependency graph by the function wrappers. After each checkpoint, all edges are removed from the graph. Similarly, the runtime library maintains a thread list, which is kept in sync with thread creation and deletion during execution.

5 Performance

In the absence of any standard multithreaded object oriented (MT-OO) benchmarks, we used a synthetic workload to evaluate the performance of the checkpointing and recovery schemes. The first program is deterministic and consists of 10 threads and 1000 objects that are used to sort a given piece of data. Each object owns integer data of random size and a sort method for sorting the data. Objects are arranged in groups of 100, and each thread owns a group. Each thread's reference dataset is the group owned by it and a few objects from its two neighbors' groups. In each iteration of the workload, the sort method is called at least once for every object in the reference dataset. To create different types of workloads, we vary the size of the objects and the overlap between groups to create two workloads, Workload1 and Workload2.
Workload1 has small objects and a small overlap, and it represents a program in which the complete reference dataset is built up very quickly after a checkpoint. In Workload2, the object sizes and the overlap are bigger, and it therefore represents a program with a very slow growth of the reference dataset: the complete reference dataset of a thread is never fully referenced within a checkpoint interval. The second program, Workload3, consists of 1000 objects and 20 threads. Unlike the first program, it is completely random in terms of the number of objects used by a thread, the communication between objects and the time spent by each thread in each object.

5.1 Performance of Checkpointing

Execution overhead includes the cost of run-time reflection and the actual checkpoint time. We estimate the checkpoint time from the overhead and the total number of checkpoints, assuming that the cost of run-time data collection is distributed equally over the checkpoints. Study of the execution times of the workload programs shows that, in the absence of faults, the overhead due to run-time data collection is small: less than 2% for the workloads used above. In the presence of random faults, the selective recovery mechanism is always triggered, and rolls back a subset of the program's threads
and objects. From the more detailed study in [7], it can be seen that the effect of the fault is localized by the selective recovery mechanism. If the failure occurs very shortly after a checkpoint, very few objects and threads are affected by it.

6 Conclusions and Future Work

Using meta-information for checkpointing of threads and objects has a run-time overhead, but its cost is only a small part of the functional cost of the objects themselves. Our experiment with reflection has helped to separate the fault-tolerance mechanism from the application and, in the process, increased the reusability of the prototype library. Transparent addition of checkpoint and recovery, and dynamic maintenance of the object dependency graph, have been possible largely due to reflection. Finally, exploiting the multithreaded nature of applications to selectively recover threads and objects has good potential for superior performance, especially for server-class applications where thread interactions do not grow rapidly with time.

To move from a synthetic workload to a realistic one, libooft will have to handle dynamic allocation by and of objects. This entails either swizzling of pointers [11] or deferring the deletion of allocated objects to consistent checkpoint times. Further, reinstatement at recovery of threads blocked on synchronization objects, such as mutexes and condition variables, is a necessary enhancement.

Acknowledgements

We would like to thank Shigeru Chiba of the University of Tsukuba for clarifying numerous questions on the use of the OpenC++ compiler, and Yennun Huang, Reinhard Klemm and Shalini Yajnik of Bell Laboratories for valuable discussions about issues in the implementation of the source code generator.

References

[1] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Conference Proceedings, Usenix Winter 1995 Technical Conference, (New Orleans, LA), January.
[2] Y. Huang and C. M. R. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proceedings of Intl. Symposium on Fault-Tolerant Computing, (Toulouse, France), pp. 2-9, June.
[3] H. chang Nam, J. Kim, S. Hong, and S. Lee, "Probabilistic checkpointing," in Proceedings of Intl. Symposium on Fault-Tolerant Computing.
[4] J. Xu, B. Randell, and A. F. Zorzo, "Implementing software-fault tolerance in C++ and OpenC++: An object oriented and reflective approach," in Proceedings of International Workshop on Computer Aided Design, Test, and Evaluation for Dependability, pp. 224-229, July.
[5] J.-C. Fabre and T. Perennou, "A metaobject architecture for fault tolerant distributed systems: The FRIENDS approach," in IEEE Transactions on Computers, pp. 78-95, January.
[6] J. Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. Stroud, and Z. Wu, "Fault tolerance in concurrent object-oriented software through coordinated error recovery," in Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS-25), pp. 499-508.
[7] M. Kasbekar, C. R. Das, and A. Sivasubramaniam, "An object oriented approach to checkpointing." The Pennsylvania State University, University Park.
[8] G. Deconinck, J. Vounckx, R. Cuyvers, and R. Lauwereins, "Survey of checkpointing and rollback techniques," Tech. Rep..1.8 and , ESAT-ACCA Laboratory, Katholieke Universiteit Leuven, Belgium, June.
[9] E. Elnozahy, D. Johnson, and Y. Wang, "A survey of rollback-recovery protocols in message passing systems," Tech. Rep. CMU-CS , Department of Computer Science, Carnegie Mellon University, August.
[10] S. Chiba, "A metaobject protocol for C++," in Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 285-299, October.
[11] J. Eliot B. Moss, "Working with persistent objects: To swizzle or not to swizzle," IEEE Transactions on Software Engineering, vol. 18, pp. 657-673, August 1992.
More informationMachine-Independent Virtual Memory Management for Paged June Uniprocessor 1st, 2010and Multiproce 1 / 15
Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures Matthias Lange TU Berlin June 1st, 2010 Machine-Independent Virtual Memory Management for Paged June
More informationConcurrent Exception Handling and Resolution in Distributed Object Systems
Concurrent Exception Handling and Resolution in Distributed Object Systems Presented by Prof. Brian Randell J. Xu A. Romanovsky and B. Randell University of Durham University of Newcastle upon Tyne 1 Outline
More informationSAMOS: an Active Object{Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group
SAMOS: an Active Object{Oriented Database System Stella Gatziu, Klaus R. Dittrich Database Technology Research Group Institut fur Informatik, Universitat Zurich fgatziu, dittrichg@ifi.unizh.ch to appear
More informationDYNAMIC SCHEDULING AND RESCHEDULING WITH FAULT TOLERANCE STRATEGY IN GRID COMPUTING
DYNAMIC SCHEDULING AND RESCHEDULING WITH FAULT TOLERANCE STRATEGY IN GRID COMPUTING Ms. P. Kiruthika Computer Science & Engineering, SNS College of Engineering, Coimbatore, Tamilnadu, India. Abstract Grid
More informationBioTechnology. An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.
[Type text] [Type text] [Type text] ISSN : 0974-7435 Volume 10 Issue 15 2014 BioTechnology An Indian Journal FULL PAPER BTAIJ, 10(15), 2014 [8768-8774] The Java virtual machine in a thread migration of
More informationBhushan Sapre*, Anup Garje**, Dr. B. B. Mesharm***
Fault Tolerant Environment Using Hardware Failure Detection, Roll Forward Recovery Approach and Microrebooting For Distributed Systems Bhushan Sapre*, Anup Garje**, Dr. B. B. Mesharm*** ABSTRACT *(Department
More informationDMTCP: Fixing the Single Point of Failure of the ROS Master
DMTCP: Fixing the Single Point of Failure of the ROS Master Tw i n k l e J a i n j a i n. t @ h u s k y. n e u. e d u G e n e C o o p e r m a n g e n e @ c c s. n e u. e d u C o l l e g e o f C o m p u
More informationRecoverable Mobile Environments: Design and. Trade-o Analysis. Dhiraj K. Pradhan P. Krishna Nitin H. Vaidya. College Station, TX
Recoverable Mobile Environments: Design and Trade-o Analysis Dhiraj K. Pradhan P. Krishna Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: (409)
More informationPeriodic Thread A. Deadline Handling Thread. Periodic Thread B. Periodic Thread C. Rate Change. Deadline Notification Port
A Continuous Media Application supporting Dynamic QOS Control on Real-Time Mach Tatsuo Nakajima Hiroshi Tezuka Japan Advanced Institute of Science and Technology 15 Asahidai, Tatsunokuchi, Ishikawa, 923-12
More informationJulep: A Framework for Reliable Distributed Computing in Java
Julep: A Framework for Reliable Distributed Computing in Java Lawrence R. Klos Golden G. Richard III Zhidong Xu Department of Computer Science University of New Orleans New Orleans, LA 70148 Abstract Julep
More informationObject Oriented Transaction Processing in the KeyKOS Microkernel
Object Oriented Transaction Processing in the KeyKOS Microkernel William S. Frantz Charles R. Landau Periwinkle Computer Consulting Tandem Computers Inc. 16345 Englewood Ave. 19333 Vallco Pkwy, Loc 3-22
More informationHardware-Supported Pointer Detection for common Garbage Collections
2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute
More informationHypervisor-based Fault-tolerance. Where should RC be implemented? The Hypervisor as a State Machine. The Architecture. In hardware
Where should RC be implemented? In hardware sensitive to architecture changes At the OS level state transitions hard to track and coordinate At the application level requires sophisticated application
More informationProviding Real-Time and Fault Tolerance for CORBA Applications
Providing Real-Time and Tolerance for CORBA Applications Priya Narasimhan Assistant Professor of ECE and CS University Pittsburgh, PA 15213-3890 Sponsored in part by the CMU-NASA High Dependability Computing
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs
More informationNFSv4 as the Building Block for Fault Tolerant Applications
NFSv4 as the Building Block for Fault Tolerant Applications Alexandros Batsakis Overview Goal: To provide support for recoverability and application fault tolerance through the NFSv4 file system Motivation:
More informationSurvey on Incremental MapReduce for Data Mining
Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,
More informationSynopsis by: Stephen Roberts, GMU CS 895, Spring 2013
Using Components for Architecture-Based Management The Self-Repair case Sylvain Sicard Université Joseph Fourier, Grenoble, France, Fabienne Boyer Université Joseph Fourier, Grenoble, France, Noel De Palma
More informationScalable In-memory Checkpoint with Automatic Restart on Failures
Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th
More informationfeatures of Python 1.5, including the features earlier described in [2]. Section 2.6 summarizes what is new in Python The class and the class
A note on reection in Python 1.5 Anders Andersen y AAndersen@ACM.Org March 13, 1998 Abstract This is a note on reection in Python 1.5. Both this and earlier versions of Python has an open implementation
More informationThe Extensible Java Preprocessor Kit. and a Tiny Data-Parallel Java. Abstract
The Extensible Java Preprocessor Kit and a Tiny Data-Parallel Java Yuuji ICHISUGI 1, Yves ROUDIER 2 fichisugi,roudierg@etl.go.jp 1 Electrotechnical Laboratory, 2 STA Fellow, Electrotechnical Laboratory
More informationRedo Log Undo Log. Redo Log Undo Log. Redo Log Tail Volatile Store. Pers. Redo Log
Recovering from Main-Memory Lapses H.V. Jagadish AT&T Research Murray Hill, NJ 07974 jag@research.att.com Avi Silberschatz Bell Laboratories Murray Hill, NJ 07974 avi@bell-labs.com S. Sudarshan Indian
More information14.7 Dynamic Linking. Building a Runnable Program
14 Building a Runnable Program 14.7 Dynamic Linking To be amenable to dynamic linking, a library must either (1) be located at the same address in every program that uses it, or (2) have no relocatable
More informationAutomatic Code Generation for Non-Functional Aspects in the CORBALC Component Model
Automatic Code Generation for Non-Functional Aspects in the CORBALC Component Model Diego Sevilla 1, José M. García 1, Antonio Gómez 2 1 Department of Computer Engineering 2 Department of Information and
More informationRecoverability. Kathleen Durant PhD CS3200
Recoverability Kathleen Durant PhD CS3200 1 Recovery Manager Recovery manager ensures the ACID principles of atomicity and durability Atomicity: either all actions in a transaction are done or none are
More informationSoftware-Controlled Multithreading Using Informing Memory Operations
Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University
More informationProcesses and Threads Implementation
Processes and Threads Implementation 1 Learning Outcomes An understanding of the typical implementation strategies of processes and threads Including an appreciation of the trade-offs between the implementation
More informationAs related works, OMG's CORBA (Common Object Request Broker Architecture)[2] has been developed for long years. CORBA was intended to realize interope
HORB: Distributed Execution of Java Programs HIRANO Satoshi Electrotechnical Laboratory and RingServer Project 1-1-4 Umezono Tsukuba, 305 Japan hirano@etl.go.jp http://ring.etl.go.jp/openlab/horb/ Abstract.
More informationDistributed File Systems. CS432: Distributed Systems Spring 2017
Distributed File Systems Reading Chapter 12 (12.1-12.4) [Coulouris 11] Chapter 11 [Tanenbaum 06] Section 4.3, Modern Operating Systems, Fourth Ed., Andrew S. Tanenbaum Section 11.4, Operating Systems Concept,
More informationConcept as a Generalization of Class and Principles of the Concept-Oriented Programming
Computer Science Journal of Moldova, vol.13, no.3(39), 2005 Concept as a Generalization of Class and Principles of the Concept-Oriented Programming Alexandr Savinov Abstract In the paper we describe a
More informationDiscount Checking: Transparent, Low-Overhead Recovery for General Applications
Discount Checking: Transparent, Low-Overhead Recovery for General Applications David E. Lowell and Peter M. Chen Computer Science and Engineering Division Department of Electrical Engineering and Computer
More informationTransparent Access to Legacy Data in Java. Olivier Gruber. IBM Almaden Research Center. San Jose, CA Abstract
Transparent Access to Legacy Data in Java Olivier Gruber IBM Almaden Research Center San Jose, CA 95120 Abstract We propose in this paper an extension to PJava in order to provide a transparent access
More informationNovel Log Management for Sender-based Message Logging
Novel Log Management for Sender-based Message Logging JINHO AHN College of Natural Sciences, Kyonggi University Department of Computer Science San 94-6 Yiuidong, Yeongtonggu, Suwonsi Gyeonggido 443-760
More informationTransparent Orthogonal Checkpointing Through User-Level Pagers
Transparent Orthogonal Checkpointing Through User-Level Pagers Espen Skoglund, Christian Ceelen, and Jochen Liedtke System Architecture Group University of Karlsruhe {skoglund,ceelen,liedtke}@ira.uka.de
More informationOptimistic Distributed Simulation Based on Transitive Dependency. Tracking. Dept. of Computer Sci. AT&T Labs-Research Dept. of Elect. & Comp.
Optimistic Distributed Simulation Based on Transitive Dependency Tracking Om P. Damani Yi-Min Wang Vijay K. Garg Dept. of Computer Sci. AT&T Labs-Research Dept. of Elect. & Comp. Eng Uni. of Texas at Austin
More informationRecovering Device Drivers
1 Recovering Device Drivers Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy University of Washington Presenter: Hayun Lee Embedded Software Lab. Symposium on Operating Systems
More informationDevelopment of Technique for Healing Data Races based on Software Transactional Memory
, pp.482-487 http://dx.doi.org/10.14257/astl.2016.139.96 Development of Technique for Healing Data Races based on Software Transactional Memory Eu-Teum Choi 1,, Kun Su Yoon 2, Ok-Kyoon Ha 3, Yong-Kee Jun
More informationSupport for Software Interrupts in Log-Based. Rollback-Recovery. J. Hamilton Slye E.N. Elnozahy. IBM Austin Research Lab. July 27, 1997.
Support for Software Interrupts in Log-Based Rollback-Recovery J. Hamilton Slye E.N. Elnozahy Transarc IBM Austin Research Lab July 27, 1997 Abstract The piecewise deterministic execution model is a fundamental
More informationTowards a Resilient Operating System for Wireless Sensor Networks
Towards a Resilient Operating System for Wireless Sensor Networks Hyoseung Kim Hojung Cha Yonsei University, Korea 2006. 6. 1. Hyoseung Kim hskim@cs.yonsei.ac.kr Motivation (1) Problems: Application errors
More informationFault Tolerance. The Three universe model
Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful
More informationApplication. Protocol Stack. Kernel. Network. Network I/F
Real-Time Communication in Distributed Environment Real-Time Packet Filter Approach 3 Takuro Kitayama Keio Research Institute at SFC Keio University 5322 Endo Fujisawa Kanagawa, Japan takuro@sfc.keio.ac.jp
More informationA Metaobject Protocol for Fault-Tolerant CORBA Applications *
A Metaobject Protocol for Fault-Tolerant CORBA Applications * Marc-Olivier Killijian 1, Jean-Charles Fabre, Juan-Carlos Ruiz-Garcia LAAS-CNRS, 7 Avenue du Colonel Roche 31077 Toulouse cedex, France Shigeru
More informationLecture 21: Logging Schemes /645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo
Lecture 21: Logging Schemes 15-445/645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo Crash Recovery Recovery algorithms are techniques to ensure database consistency, transaction
More informationResearch on the Novel and Efficient Mechanism of Exception Handling Techniques for Java. Xiaoqing Lv 1 1 huihua College Of Hebei Normal University,
International Conference on Informatization in Education, Management and Business (IEMB 2015) Research on the Novel and Efficient Mechanism of Exception Handling Techniques for Java Xiaoqing Lv 1 1 huihua
More informationThe Procedure Abstraction
The Procedure Abstraction Procedure Abstraction Begins Chapter 6 in EAC The compiler must deal with interface between compile time and run time Most of the tricky issues arise in implementing procedures
More information