Using Reflection for Checkpointing Concurrent Object Oriented Programs

Mangesh Kasbekar, Chandramouli Narayanan, Chita R. Das
Department of Computer Science & Engineering
The Pennsylvania State University
University Park, PA
{kasbekar,cnarayan,das}@cse.psu.edu

Abstract

This paper presents a reflective approach to checkpointing concurrent object oriented programs. We describe a checkpointing and rollback library for multithreaded programs written in C++. We demonstrate some of the unique features offered by this library, such as selective checkpointing and selective rollback of the threads of a process, that are achievable only through the use of reflection.

1 Introduction

Checkpointing is one of the commonly used cures against transient software failures. Checkpointing a running program involves saving enough state information of the program on stable storage so that, if the program crashes, it can be restarted from the saved state instead of from the beginning. Libraries [1, 2, 3] provide a checkpointing facility, but they do not take the multithreaded or object oriented nature of the software system into consideration. Reflection, as a method for separating the fault-tolerance mechanism from the application, has been used in software fault tolerance. But these systems assume a non-concurrent software model and implement fault tolerance through N-version programming and recovery blocks [4] or server replication [5]. Concurrent object oriented software systems are known to have transient faults [6]. In this paper we demonstrate the use of reflection for building a prototype checkpointing and recovery library, libooft [7], which addresses transient faults in concurrent object oriented systems.

The conventional methods of checkpointing are completely non-object-oriented in nature. We take an object oriented approach to checkpointing by assuming that all data of a program is in the form of objects, each of which knows how to checkpoint its own data.
This assumption allows us to develop many interesting and unique options for checkpointing and rollback in addition to the conventional ones [8, 9], especially for concurrent programs. The prominent features presented in this paper are the following. First, the library allows some threads of a process to be rolled back to their previous checkpoint while the others continue, unaffected by these rollbacks. Second, it separates the functional part of objects from their non-functional part (fault tolerance) transparently, with the help of reflection.

The most important property to be satisfied in rollbacks is consistency of program data. We use reflection to ensure consistency in such rollbacks. The run-time component enables optimizations during the checkpointing phase by identifying and limiting the number of objects that must be included in a checkpoint. In order to keep the run-time monitoring transparent to the application programmer, we use the metaobject protocol (MOP) provided by OpenC++ [10] to analyze and translate user programs and insert the run-time support code into them. When applied to software fault tolerance, this scheme is useful for tolerating thread-level failures and for limiting the number of threads affected by the failure of other threads in the program.

2 Definitions and Consistency Requirements

We define some terms that will be used later.

Rollback object set and rollback thread set: The sets of objects and threads that need to be rolled back to ensure correctness.

Thread-object interaction information: At any given instant, the thread-object interaction information for a process can be specified by a set of 2-tuples $(T_i, O_j)$, $\forall i \in (1, N_t)$, where $N_t$ is the maximum number of threads, $T_i$ is the $i$th thread, and $O_j$ is the $j$th object being accessed at that instant.

Object dependency graph: At any given instant, the object dependency graph can be specified by an undirected graph of the objects in the system, with an edge connecting any two objects if at least one of the two has invoked a method of the other since the last checkpoint.

2.1 Ensuring Consistency After a Selective Rollback

Consider a typical program execution as shown in Fig. 1(a). A checkpoint at time T1 represents a consistent state of the program. A rollback to this checkpoint is therefore always consistent. But, from observation of the execution of the program after time T1, we see that threads Thread1 and Thread2 have disjoint reference datasets: each thread accesses its own group of objects. At time T2, if Thread1 is to be rolled back to the checkpoint, only the objects accessed by Thread1 need be rolled back, while the objects accessed by Thread2 are allowed to remain at their state at time T2. This state, shown in Fig. 1(b), is also consistent, since there is no defined ordering between the execution of Thread1 and Thread2, which is manifested by their disjoint reference datasets.

Even though methods to checkpoint and restore a checkpoint are best provided by the programmer, we believe that the programmer should not be burdened with the responsibility of providing this information. Moreover, with the use of various class libraries in a program, generating such information may not be possible for the user of these libraries.
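
The definitions of Section 2 map naturally onto a few small data structures. The following is a minimal sketch, not the libooft implementation: the type names (`ThreadId`, `ObjectId`, `InteractionInfo`, `ObjectDependencyGraph`) and the helper `datasets_disjoint` are hypothetical and chosen only for illustration. It assumes threads and objects can be identified by integers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical identifiers; the paper does not prescribe concrete types.
using ThreadId = int;
using ObjectId = int;

// Thread-object interaction information: the set of (T_i, O_j) tuples,
// each meaning "thread T_i is accessing object O_j at this instant".
using InteractionInfo = std::set<std::pair<ThreadId, ObjectId>>;

// Undirected object dependency graph, stored as a set of normalized edges.
class ObjectDependencyGraph {
public:
    // Record that one of the two objects invoked a method of the other.
    void add_edge(ObjectId a, ObjectId b) {
        if (a != b) edges_.insert(normalize(a, b));
    }
    bool has_edge(ObjectId a, ObjectId b) const {
        return edges_.count(normalize(a, b)) > 0;
    }
    // Edges only describe invocations since the last checkpoint,
    // so the graph is emptied when a new checkpoint is taken.
    void clear() { edges_.clear(); }
    std::size_t edge_count() const { return edges_.size(); }

private:
    // Store (min, max) so that (a, b) and (b, a) are the same edge.
    static std::pair<ObjectId, ObjectId> normalize(ObjectId a, ObjectId b) {
        return a < b ? std::make_pair(a, b) : std::make_pair(b, a);
    }
    std::set<std::pair<ObjectId, ObjectId>> edges_;
};

// True when two reference datasets share no object, which is the
// condition under which one thread can be rolled back without the other.
bool datasets_disjoint(const std::set<ObjectId>& x, const std::set<ObjectId>& y) {
    for (ObjectId o : x) {
        if (y.count(o) > 0) return false;
    }
    return true;
}
```

Storing each edge in normalized order keeps the graph undirected without duplicating entries, so invoking a method in either direction between two objects yields a single edge.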
A generic way to generate this information can save programmers the trouble of specifying it for all the programs they write. The same holds true for the reference datasets and the thread-object interaction information, which are required for selective rollbacks.

Figure 1: Execution and a consistent selective rollback of a program. (a) Execution of a program. (b) Selective rollback.

Reflection is the process of reasoning about and acting upon the system itself. We use compile-time reflection to automatically generate the checkpoint and recovery methods for all the objects in the system. By tracking the method invocations of each object, the reference datasets for threads are built and the thread-object interaction information is generated.

3 Checkpointing and Rollback Schemes

In this section we present the blocking global checkpointing scheme and the blocking selective rollback scheme. These schemes require all the threads to be suspended while checkpointing or recovery is in progress.

3.1 Blocking Global Checkpoints

The blocking algorithm begins with a checkpointer thread sending a CHECKPOINT signal to all worker threads, which makes them generate the thread-object interaction information and suspend themselves until an END CHECKPOINT signal arrives. The checkpointer then goes through the list of objects, invoking the checkpoint method of each object, which stores the checkpoint of that object in stable storage. Once every object has been checkpointed, it stores the thread-object interaction information in stable storage. Finally, it sends an END CHECKPOINT signal to all the worker threads. After receipt of this signal, each thread stores its state in stable storage and continues with its computation.

3.2 Blocking Selective Rollback

The blocking selective rollback algorithm begins with a thread sending a ROLLBACK signal to all worker threads in the process. This makes them generate the thread-object interaction information and suspend themselves until an END ROLLBACK signal arrives. The starting object for recovery ($O_s$) is the object that initiated the rollback process. It is inserted into the rollback object set. Starting from $O_s$, a search is made in the dependency graph to discover connected objects. As objects are discovered, they are added to the rollback object set and also compared against the thread-object interaction information to discover any thread that was associated with any of them when the ROLLBACK signal was sent. Any thread discovered this way is added to the rollback thread set. At the end of the search, all objects in the rollback object set are rewound to their state from the previous global checkpoint. Finally, the initiating thread sends an END ROLLBACK signal to all worker threads. After this signal, each worker thread checks whether it is included in the rollback thread set. If so, it rolls itself back to its state from the previous global checkpoint. Otherwise, it continues with its computation.

Figure 2: The Blocking Checkpointing and Rollback Schemes. (a) Checkpoint. (b) Rollback.

Fig. 2(a) and (b) show the checkpointing and recovery schemes described above pictorially, using a simple example. As shown in Fig. 2(b), a global checkpoint is taken at time T1, and both threads are executing when a rollback is initiated at time T2. The selective rollback scheme analyzes the object dependency graph and the thread-object interaction information to discover the reference dataset of Thread1 after T1, that of Thread2 (which includes O7), and the initiating object. It then rolls back Thread2 together with O7 and the other objects connected to the initiating object.
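
The search at the heart of Section 3.2 can be sketched as a breadth-first traversal of the dependency graph. This is an illustrative sketch under stated assumptions, not the libooft code: the names `Graph`, `InteractionInfo`, `RollbackSets`, `add_edge`, and `compute_rollback_sets` are hypothetical, threads and objects are identified by integers, and signalling, suspension, and the actual state restoration are left out. The text matches threads as objects are discovered; the sketch does the matching in a second pass, which yields the same sets.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <utility>

using ThreadId = int;   // hypothetical identifier types
using ObjectId = int;

// Undirected dependency graph as adjacency sets.
using Graph = std::map<ObjectId, std::set<ObjectId>>;
// Thread-object interaction information: (thread, object) tuples.
using InteractionInfo = std::set<std::pair<ThreadId, ObjectId>>;

struct RollbackSets {
    std::set<ObjectId> objects;  // rollback object set
    std::set<ThreadId> threads;  // rollback thread set
};

// Add an undirected edge between two objects.
void add_edge(Graph& g, ObjectId a, ObjectId b) {
    g[a].insert(b);
    g[b].insert(a);
}

// Collect every object reachable from the initiating object `start`,
// then every thread that was interacting with one of those objects.
RollbackSets compute_rollback_sets(const Graph& g,
                                   const InteractionInfo& info,
                                   ObjectId start) {
    RollbackSets result;
    std::queue<ObjectId> pending;
    result.objects.insert(start);
    pending.push(start);

    while (!pending.empty()) {
        ObjectId current = pending.front();
        pending.pop();
        auto it = g.find(current);
        if (it == g.end()) continue;  // object has no recorded dependencies
        for (ObjectId neighbor : it->second) {
            // insert().second is true only for newly discovered objects
            if (result.objects.insert(neighbor).second) pending.push(neighbor);
        }
    }

    for (const auto& tuple : info) {
        if (result.objects.count(tuple.second) > 0) {
            result.threads.insert(tuple.first);
        }
    }
    return result;
}
```

With two disconnected groups of objects, each used by one thread, a rollback initiated from an object in the second group returns only that group and only the second thread, so the first thread keeps running from its current state.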
In the example of Fig. 2(b), Thread1 continues without any loss of computation. It can be easily proved that the resultant state is consistent.

4 Design of the Library

The library libooft has two distinct components. One consists of the routines required for saving and restoring stacks, registers, and the like, which characterize this class of fault-tolerant libraries. The other is a generic metaclass that makes every object in the system reflective. It combines compile-time and run-time reflection to make every object in the system fault-tolerant and every thread selectively recoverable. Saving of program data is done at the object level, unlike in conventional implementations.

4.1 Compile-Time Reflection

At compile time, the OpenC++ compiler analyzes the data members and inheritance hierarchies of the base-level program, and generates the checkpoint and restore functions for each class. These functions are responsible for saving (restoring) the data members of objects into (from) the checkpoint file when the object is asked to checkpoint (restore) itself. A base class that maintains information about the instances of the class is added to every class. All member functions are wrapped in order to generate run-time object interaction information. We show a sample program below, before the source translation stage.

    #include <ooft.h>

    class A {
        int foo();
    private:
        int x1;
        int x2;
    };

    int A::foo() {
        // function body
    }

    int main() {
        // main body
    }

The program, after translation, becomes:

    class ooft {
        // library data members
    };

    class A : virtual public ooft {
        int foo();
    private:
        int original_foo();
        int x1;
        int x2;
        void checkpoint() {
            // Generated code
            checkpointwrite(x1, sizeof(int));
            checkpointwrite(x2, sizeof(int));
        }
        void restore() {
            // Generated code
            checkpointread(&x1, sizeof(int));
            checkpointread(&x2, sizeof(int));
        }
    };

    int A::foo() {
        // Get caller's id and make an
        // entry in object dependency graph
        return original_foo();
    }

    int A::original_foo() {
        // function body
    }

    int main() {
        // main body
    }

4.2 Runtime Component

To begin with, the object list is empty and the object dependency graph has no edges. As objects are created and invoke each other's methods, entries are made in the object list by the library class ooft, and edges are added to the object dependency graph by the function wrappers. After each checkpoint, all edges are removed from the graph. Similarly, the runtime library maintains a thread list, which is kept in sync with thread creation and deletion during execution.

5 Performance

In the absence of any standard multithreaded object oriented (MT-OO) benchmarks, we used a synthetic workload to evaluate the performance of the checkpointing and recovery schemes. The first program is deterministic and consists of 10 threads and 1000 objects that are used to sort a given piece of data. Each object owns integer data of random size and a sort method for sorting the data. Objects are arranged in groups of 100, and each thread owns a group. Each thread's reference dataset is the group owned by it plus a few objects from its two neighbors' groups. In each iteration of the workload, the sort method is called at least once for every object in the reference dataset. To create different types of workloads, we vary the size of the objects and the overlap between groups, giving two workloads, Workload1 and Workload2.
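
The grouping used by the first program can be illustrated with a short sketch that builds each thread's reference dataset from its own group plus a few objects borrowed from each neighbor. Everything beyond the numbers stated above is an assumption made for illustration: the function name `build_reference_datasets`, the ring arrangement of groups (so the first and last threads are neighbors), and the choice of borrowing the objects closest to the group boundary are not specified in the text.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Build one reference dataset per thread.
//   threads    : number of worker threads (10 in the first program)
//   group_size : objects owned by each thread (100 in the first program)
//   overlap    : objects borrowed from EACH neighboring group (assumed
//                parameter; must not exceed group_size)
// Objects are numbered 0 .. threads * group_size - 1. Groups are assumed
// to form a ring, so thread 0 neighbors the last thread.
std::vector<std::set<int>> build_reference_datasets(int threads,
                                                    int group_size,
                                                    int overlap) {
    std::vector<std::set<int>> datasets(threads);
    for (int t = 0; t < threads; ++t) {
        // The thread's own group.
        for (int k = 0; k < group_size; ++k) {
            datasets[t].insert(t * group_size + k);
        }
        int left = (t + threads - 1) % threads;
        int right = (t + 1) % threads;
        for (int k = 0; k < overlap; ++k) {
            // Last objects of the left neighbor's group.
            datasets[t].insert(left * group_size + group_size - 1 - k);
            // First objects of the right neighbor's group.
            datasets[t].insert(right * group_size + k);
        }
    }
    return datasets;
}
```

Calling `build_reference_datasets(10, 100, 5)` gives each thread 110 objects. A larger `overlap` connects more threads through shared objects, which is the knob the text describes varying between the two workloads.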
Workload1 has small objects and small overlap, and it represents a program in which the complete reference dataset is built up very quickly after a checkpoint. In Workload2, the object sizes and the overlap are bigger, and it therefore represents a program with very slow growth of the reference dataset: the complete reference dataset of a thread is never completely referenced within a checkpoint interval. The second program, Workload3, consists of 1000 objects and 20 threads. Unlike the first program, it is completely random in terms of the number of objects used by a thread, the communication between objects, and the time spent by each thread in each object.

5.1 Performance of Checkpointing

Execution overhead includes the cost of run-time reflection and the actual checkpoint time. We estimate the checkpoint time from the overhead and the total number of checkpoints, assuming that the cost of run-time data collection is distributed equally over the checkpoints. Study of the execution times of the workload programs shows that, in the absence of faults, the overhead due to run-time data collection is small: less than 2% for the workloads used above. In the presence of random faults, the selective recovery mechanism is always triggered, and it rolls back a subset of the program's threads and objects. From the more detailed study in [7], it can be seen that the effect of a fault is localized by the selective recovery mechanism. If the failure occurs very shortly after a checkpoint, very few objects and threads are affected by it.

6 Conclusions and Future Work

Using meta-information for checkpointing of threads and objects has a run-time overhead, but its cost is only a small part of the functional cost of the objects themselves. Our experiment with reflection has helped separate the fault-tolerance mechanism from the application and, in the process, has increased the reusability of the prototype library. Transparent addition of checkpoint and recovery, and dynamic maintenance of the object dependency graph, have been possible largely due to reflection. Finally, exploiting the multithreaded nature of applications to selectively recover threads and objects has good potential for superior performance, especially for server-class applications in which thread interactions do not grow rapidly with time.

To transition from a synthetic workload to a realistic workload, libooft will have to handle dynamic allocation by and of objects. This entails either swizzling of pointers [11] or deferring the deletion of allocated objects to consistent checkpoint times. Further, reinstatement at recovery of threads blocked on synchronization objects, such as mutexes and condition variables, is a necessary enhancement.

Acknowledgements

We would like to thank Shigeru Chiba of the University of Tsukuba for clarifying numerous questions on the use of the OpenC++ compiler, and Yennun Huang, Reinhard Klemm, and Shalini Yajnik of Bell Laboratories for valuable discussions about issues in the implementation of the source code generator.

References

[1] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent checkpointing under Unix," in Conference Proceedings, Usenix Winter 1995 Technical Conference, (New Orleans, LA), January.

[2] Y. Huang and C. M. R. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proceedings of Intl. Symposium on Fault-Tolerant Computing, (Toulouse, France), pp. 2-9, June.

[3] H. Chang Nam, J. Kim, S. Hong, and S. Lee, "Probabilistic checkpointing," in Proceedings of Intl. Symposium on Fault-Tolerant Computing.

[4] J. Xu, B. Randell, and A. F. Zorzo, "Implementing software-fault tolerance in C++ and OpenC++: An object oriented and reflective approach," in Proceedings of International Workshop on Computer Aided Design, Test, and Evaluation for Dependability, pp. 224-229, July.

[5] J.-C. Fabre and T. Perennou, "A metaobject architecture for fault tolerant distributed systems: The FRIENDS approach," in IEEE Transactions on Computers, pp. 78-95, January.

[6] J. Xu, B. Randell, A. Romanovsky, C. M. F. Rubira, R. Stroud, and Z. Wu, "Fault tolerance in concurrent object-oriented software through coordinated error recovery," in Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS-25), pp. 499-508.

[7] M. Kasbekar, C. R. Das, and A. Sivasubramaniam, "An object oriented approach to checkpointing." The Pennsylvania State University, University Park.

[8] G. Deconinck, J. Vounckx, R. Cuyvers, and R. Lauwereins, "Survey of checkpointing and rollback techniques," Tech. Rep., ESAT-ACCA Laboratory, Katholieke Universiteit Leuven, Belgium, June.

[9] E. Elnozahy, D. Johnson, and Y. Wang, "A survey of rollback-recovery protocols in message passing systems," Tech. Rep. CMU-CS, Department of Computer Science, Carnegie Mellon University, August.

[10] S. Chiba, "A metaobject protocol for C++," in Proceedings of the ACM Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), pp. 285-299, October.

[11] J. Eliot B. Moss, "Working with persistent objects: To swizzle or not to swizzle," IEEE Transactions on Software Engineering, vol. 18, pp. 657-673, August 1992.

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme On Checkpoint Latency Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 E-mail: vaidya@cs.tamu.edu Web: http://www.cs.tamu.edu/faculty/vaidya/ Abstract

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

On Checkpoint Latency. Nitin H. Vaidya. Texas A&M University. Phone: (409) Technical Report

On Checkpoint Latency. Nitin H. Vaidya. Texas A&M University.   Phone: (409) Technical Report On Checkpoint Latency Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 E-mail: vaidya@cs.tamu.edu Phone: (409) 845-0512 FAX: (409) 847-8578 Technical Report

More information

Reflective Java and A Reflective Component-Based Transaction Architecture

Reflective Java and A Reflective Component-Based Transaction Architecture Reflective Java and A Reflective Component-Based Transaction Architecture Zhixue Wu APM Ltd., Poseidon House, Castle Park, Cambridge CB3 0RD UK +44 1223 568930 zhixue.wu@citrix.com ABSTRACT In this paper,

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information

processes based on Message Passing Interface

processes based on Message Passing Interface Checkpointing and Migration of parallel processes based on Message Passing Interface Zhang Youhui, Wang Dongsheng, Zheng Weimin Department of Computer Science, Tsinghua University, China. Abstract This

More information

Implementing Software-Fault Tolerance in C++ and Open C++: An Object-Oriented and Reflective Approach

Implementing Software-Fault Tolerance in C++ and Open C++: An Object-Oriented and Reflective Approach Implementing Software-Fault Tolerance in C++ and Open C++: An Object-Oriented and Reflective Approach Jie Xu, Brian Randell and Avelino F. Zorzo Department of Computing Science University of Newcastle

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone:

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone: Some Thoughts on Distributed Recovery (preliminary version) Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: 409-845-0512 Fax: 409-847-8578 E-mail:

More information

A Behavior Based File Checkpointing Strategy

A Behavior Based File Checkpointing Strategy Behavior Based File Checkpointing Strategy Yifan Zhou Instructor: Yong Wu Wuxi Big Bridge cademy Wuxi, China 1 Behavior Based File Checkpointing Strategy Yifan Zhou Wuxi Big Bridge cademy Wuxi, China bstract

More information

A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments

A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments 1 A Load Balancing Fault-Tolerant Algorithm for Heterogeneous Cluster Environments E. M. Karanikolaou and M. P. Bekakos Laboratory of Digital Systems, Department of Electrical and Computer Engineering,

More information

Adaptive Fault Tolerant Systems: Reflective Design and Validation

Adaptive Fault Tolerant Systems: Reflective Design and Validation 1 Adaptive Fault Tolerant Systems: Reflective Design and Validation Marc-Olivier Killijian Dependable Computing and Fault Tolerance Research Group Toulouse - France 2 Motivations Provide a framework for

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Michel Heydemann Alain Plaignaud Daniel Dure. EUROPEAN SILICON STRUCTURES Grande Rue SEVRES - FRANCE tel : (33-1)

Michel Heydemann Alain Plaignaud Daniel Dure. EUROPEAN SILICON STRUCTURES Grande Rue SEVRES - FRANCE tel : (33-1) THE ARCHITECTURE OF A HIGHLY INTEGRATED SIMULATION SYSTEM Michel Heydemann Alain Plaignaud Daniel Dure EUROPEAN SILICON STRUCTURES 72-78 Grande Rue - 92310 SEVRES - FRANCE tel : (33-1) 4626-4495 Abstract

More information

Shigeru Chiba Michiaki Tatsubori. University of Tsukuba. The Java language already has the ability for reection [2, 4]. java.lang.

Shigeru Chiba Michiaki Tatsubori. University of Tsukuba. The Java language already has the ability for reection [2, 4]. java.lang. A Yet Another java.lang.class Shigeru Chiba Michiaki Tatsubori Institute of Information Science and Electronics University of Tsukuba 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan. Phone: +81-298-53-5349

More information

Optimistic Message Logging for Independent Checkpointing. in Message-Passing Systems. Yi-Min Wang and W. Kent Fuchs. Coordinated Science Laboratory

Optimistic Message Logging for Independent Checkpointing. in Message-Passing Systems. Yi-Min Wang and W. Kent Fuchs. Coordinated Science Laboratory Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems Yi-Min Wang and W. Kent Fuchs Coordinated Science Laboratory University of Illinois at Urbana-Champaign Abstract Message-passing

More information

David B. Johnson. Willy Zwaenepoel. Rice University. Houston, Texas. or the constraints of real-time applications [6, 7].

David B. Johnson. Willy Zwaenepoel. Rice University. Houston, Texas. or the constraints of real-time applications [6, 7]. Sender-Based Message Logging David B. Johnson Willy Zwaenepoel Department of Computer Science Rice University Houston, Texas Abstract Sender-based message logging isanewlow-overhead mechanism for providing

More information

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya. reduce the average performance overhead.

A Case for Two-Level Distributed Recovery Schemes. Nitin H. Vaidya.   reduce the average performance overhead. A Case for Two-Level Distributed Recovery Schemes Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-31, U.S.A. E-mail: vaidya@cs.tamu.edu Abstract Most distributed

More information

Do! environment. DoT

Do! environment. DoT The Do! project: distributed programming using Java Pascale Launay and Jean-Louis Pazat IRISA, Campus de Beaulieu, F35042 RENNES cedex Pascale.Launay@irisa.fr, Jean-Louis.Pazat@irisa.fr http://www.irisa.fr/caps/projects/do/

More information

Distributed Recovery with K-Optimistic Logging. Yi-Min Wang Om P. Damani Vijay K. Garg

Distributed Recovery with K-Optimistic Logging. Yi-Min Wang Om P. Damani Vijay K. Garg Distributed Recovery with K-Optimistic Logging Yi-Min Wang Om P. Damani Vijay K. Garg Abstract Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world

More information

Enhanced N+1 Parity Scheme combined with Message Logging

Enhanced N+1 Parity Scheme combined with Message Logging IMECS 008, 19-1 March, 008, Hong Kong Enhanced N+1 Parity Scheme combined with Message Logging Ch.D.V. Subba Rao and M.M. Naidu Abstract Checkpointing schemes facilitate fault recovery in distributed systems.

More information

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS

MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS MESSAGE INDUCED SOFT CHEKPOINTING FOR RECOVERY IN MOBILE ENVIRONMENTS Ruchi Tuli 1 & Parveen Kumar 2 1 Research Scholar, Singhania University, Pacheri Bari (Rajasthan) India 2 Professor, Meerut Institute

More information

Experimental Evaluation of Fault-Tolerant Mechanisms for Object-Oriented Software

Experimental Evaluation of Fault-Tolerant Mechanisms for Object-Oriented Software Experimental Evaluation of Fault-Tolerant Mechanisms for Object-Oriented Software Avelino Zorzo, Jie Xu, and Brian Randell * Department of Computing Science, University of Newcastle upon Tyne, NE1 7RU,UK

More information

An Architecture for Recoverable Interaction Between. Applications and Active Databases. Eric N. Hanson Roxana Dastur Vijay Ramaswamy.

An Architecture for Recoverable Interaction Between. Applications and Active Databases. Eric N. Hanson Roxana Dastur Vijay Ramaswamy. An Architecture for Recoverable Interaction Between Applications and Active Databases (extended abstract) Eric N. Hanson Roxana Dastur Vijay Ramaswamy CIS Department University of Florida Gainseville,

More information

Priya Narasimhan. Assistant Professor of ECE and CS Carnegie Mellon University Pittsburgh, PA

Priya Narasimhan. Assistant Professor of ECE and CS Carnegie Mellon University Pittsburgh, PA OMG Real-Time and Distributed Object Computing Workshop, July 2002, Arlington, VA Providing Real-Time and Fault Tolerance for CORBA Applications Priya Narasimhan Assistant Professor of ECE and CS Carnegie

More information

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu

to automatically generate parallel code for many applications that periodically update shared data structures using commuting operations and/or manipu Semantic Foundations of Commutativity Analysis Martin C. Rinard y and Pedro C. Diniz z Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 fmartin,pedrog@cs.ucsb.edu

More information

Novel low-overhead roll-forward recovery scheme for distributed systems

Novel low-overhead roll-forward recovery scheme for distributed systems Novel low-overhead roll-forward recovery scheme for distributed systems B. Gupta, S. Rahimi and Z. Liu Abstract: An efficient roll-forward checkpointing/recovery scheme for distributed systems has been

More information

Space-Efficient Page-Level Incremental Checkpointing *

Space-Efficient Page-Level Incremental Checkpointing * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 22, 237-246 (2006) Space-Efficient Page-Level Incremental Checkpointing * JUNYOUNG HEO, SANGHO YI, YOOKUN CHO AND JIMAN HONG + School of Computer Science

More information

Design Framework for Self-Stabilizing Real- Time Systems Based on Real-Time Objects and Prototype Implementation with Analysis 1

Design Framework for Self-Stabilizing Real- Time Systems Based on Real-Time Objects and Prototype Implementation with Analysis 1 Design Framework for Self-Stabilizing Real- Time Systems Based on Real-Time Objects and Prototype Implementation with Analysis 1 Sushil S. Digewade and Albert M. K. Cheng Computer Science Department University

More information

global checkpoint and recovery line interchangeably). When processes take local checkpoint independently, a rollback might force the computation to it

global checkpoint and recovery line interchangeably). When processes take local checkpoint independently, a rollback might force the computation to it Checkpointing Protocols in Distributed Systems with Mobile Hosts: a Performance Analysis F. Quaglia, B. Ciciani, R. Baldoni Dipartimento di Informatica e Sistemistica Universita di Roma "La Sapienza" Via

More information

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

A Survey of Rollback-Recovery Protocols in Message-Passing Systems A Survey of Rollback-Recovery Protocols in Message-Passing Systems Mootaz Elnozahy * Lorenzo Alvisi Yi-Min Wang David B. Johnson June 1999 CMU-CS-99-148 (A revision of CMU-CS-96-181) School of Computer

More information

Operating System Architecture. CS3026 Operating Systems Lecture 03

Operating System Architecture. CS3026 Operating Systems Lecture 03 Operating System Architecture CS3026 Operating Systems Lecture 03 The Role of an Operating System Service provider Provide a set of services to system users Resource allocator Exploit the hardware resources

More information

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Fault-Tolerant Computer Systems ECE 60872/CS Recovery Fault-Tolerant Computer Systems ECE 60872/CS 59000 Recovery Saurabh Bagchi School of Electrical & Computer Engineering Purdue University Slides based on ECE442 at the University of Illinois taught by Profs.

More information

Nooks. Robert Grimm New York University

Nooks. Robert Grimm New York University Nooks Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Design and Implementation Nooks Overview An isolation

More information

Overhead-Free Portable Thread-Stack Checkpoints

Overhead-Free Portable Thread-Stack Checkpoints Overhead-Free Portable Thread-Stack Checkpoints Ronald Veldema and Michael Philippsen University of Erlangen-Nuremberg, Computer Science Department 2, Martensstr. 3 91058 Erlangen Germany {veldema, philippsen@cs.fau.de

More information

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint?

What is checkpoint. Checkpoint libraries. Where to checkpoint? Why we need it? When to checkpoint? Who need checkpoint? What is Checkpoint libraries Bosilca George bosilca@cs.utk.edu Saving the state of a program at a certain point so that it can be restarted from that point at a later time or on a different machine. interruption

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,

Recovering from Main-Memory Lapses. H.V. Jagadish Avi Silberschatz S. Sudarshan. AT&T Bell Labs. 600 Mountain Ave., Murray Hill, NJ 07974

Recovering from Main-Memory Lapses H.V. Jagadish Avi Silberschatz S. Sudarshan AT&T Bell Labs. 600 Mountain Ave., Murray Hill, NJ 07974 {jag,silber,sudarsha}@allegra.att.com Abstract Recovery activities,

Monitoring Script. Event Recognizer

Steering of Real-Time Systems Based on Monitoring and Checking Oleg Sokolsky, Sampath Kannan, Moonjoo Kim, Insup Lee, and Mahesh Viswanathan Department of Computer and Information Science University of

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System

Distributed Scheduling for the Sombrero Single Address Space Distributed Operating System Donald S. Miller Department of Computer Science and Engineering Arizona State University Tempe, AZ, USA Alan C.

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

CPS221 Lecture: Threads

Objectives CPS221 Lecture: Threads 1. To introduce threads in the context of processes 2. To introduce UML Activity Diagrams last revised 9/5/12 Materials: 1. Diagram showing state of memory for a process

Rollback-Recovery p Σ Σ

Uncoordinated Checkpointing Rollback-Recovery p Σ Σ Easy to understand No synchronization overhead Flexible can choose when to checkpoint To recover from a crash: go back to last checkpoint restart m 8

Reflective Design Patterns to Implement Fault Tolerance

Reflective Design Patterns to Implement Fault Tolerance Luciane Lamour Ferreira Cecília Mary Fischer Rubira Institute of Computing - IC State University of Campinas UNICAMP P.O. Box 676, Campinas, SP 3083-970

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

Julep: an Environment for the Evaluation of Distributed Process Recovery Protocols

Julep: an Environment for the Evaluation of Distributed Process Recovery Protocols Lawrence R. Klos Golden G. Richard III {lklos, golden}@cs.uno.edu Department of Computer Science University of New Orleans

Reflection and Object-Oriented Analysis

Walter Cazzola, Andrea Sosio, and Francesco Tisato. Reflection and Object-Oriented Analysis. In Proceedings of the 1st Workshop on Object-Oriented Reflection and Software Engineering (OORaSE 99), pages

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

Causes of Software Failures

Causes of Software Failures Hardware Faults Permanent faults, e.g., wear-and-tear component Transient faults, e.g., bit flips due to radiation Software Faults (Bugs) (40% failures) Nondeterministic bugs,

Architectural Blueprint

IMPORTANT NOTICE TO STUDENTS These slides are NOT to be used as a replacement for student notes. These slides are sometimes vague and incomplete on purpose to spark a class discussion Architectural Blueprint

Technical Comparison between several representative checkpoint/rollback solutions for MPI programs

Technical Comparison between several representative checkpoint/rollback solutions for MPI programs Yuan Tang Innovative Computing Laboratory Department of Computer Science University of Tennessee Knoxville,

Fault Tolerance. Distributed Systems IT332

Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

Fault-tolerant Distributed-Shared-Memory on a Broadcast-based Interconnection Network

Fault-tolerant Distributed-Shared-Memory on a Broadcast-based Interconnection Network Diana Hecht 1 and Constantine Katsinis 2 1 Electrical and Computer Engineering, University of Alabama in Huntsville,

Stackable Layers: An Object-Oriented Approach to. Distributed File System Architecture. Department of Computer Science

Stackable Layers: An Object-Oriented Approach to Distributed File System Architecture Thomas W. Page Jr., Gerald J. Popek y, Richard G. Guy Department of Computer Science University of California Los Angeles

Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures

Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures Matthias Lange TU Berlin June 1st, 2010

Concurrent Exception Handling and Resolution in Distributed Object Systems

Concurrent Exception Handling and Resolution in Distributed Object Systems Presented by Prof. Brian Randell J. Xu A. Romanovsky and B. Randell University of Durham University of Newcastle upon Tyne 1 Outline

SAMOS: an Active Object-Oriented Database System. Stella Gatziu, Klaus R. Dittrich. Database Technology Research Group

SAMOS: an Active Object-Oriented Database System Stella Gatziu, Klaus R. Dittrich Database Technology Research Group Institut für Informatik, Universität Zürich {gatziu, dittrich}@ifi.unizh.ch to appear

DYNAMIC SCHEDULING AND RESCHEDULING WITH FAULT TOLERANCE STRATEGY IN GRID COMPUTING

DYNAMIC SCHEDULING AND RESCHEDULING WITH FAULT TOLERANCE STRATEGY IN GRID COMPUTING Ms. P. Kiruthika Computer Science & Engineering, SNS College of Engineering, Coimbatore, Tamilnadu, India. Abstract Grid

BioTechnology. An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

ISSN: 0974-7435 Volume 10 Issue 15 2014 BioTechnology An Indian Journal FULL PAPER BTAIJ, 10(15), 2014 [8768-8774] The Java virtual machine in a thread migration of

Bhushan Sapre*, Anup Garje**, Dr. B. B. Mesharm***

Fault Tolerant Environment Using Hardware Failure Detection, Roll Forward Recovery Approach and Microrebooting For Distributed Systems Bhushan Sapre*, Anup Garje**, Dr. B. B. Mesharm*** ABSTRACT *(Department

DMTCP: Fixing the Single Point of Failure of the ROS Master

DMTCP: Fixing the Single Point of Failure of the ROS Master Twinkle Jain jain.t@husky.neu.edu Gene Cooperman gene@ccs.neu.edu College of Compu

Recoverable Mobile Environments: Design and. Trade-off Analysis. Dhiraj K. Pradhan P. Krishna Nitin H. Vaidya. College Station, TX

Recoverable Mobile Environments: Design and Trade-off Analysis Dhiraj K. Pradhan P. Krishna Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: (409)

Periodic Thread A. Deadline Handling Thread. Periodic Thread B. Periodic Thread C. Rate Change. Deadline Notification Port

A Continuous Media Application supporting Dynamic QOS Control on Real-Time Mach Tatsuo Nakajima Hiroshi Tezuka Japan Advanced Institute of Science and Technology 15 Asahidai, Tatsunokuchi, Ishikawa, 923-12

Julep: A Framework for Reliable Distributed Computing in Java

Julep: A Framework for Reliable Distributed Computing in Java Lawrence R. Klos Golden G. Richard III Zhidong Xu Department of Computer Science University of New Orleans New Orleans, LA 70148 Abstract Julep

Object Oriented Transaction Processing in the KeyKOS Microkernel

Object Oriented Transaction Processing in the KeyKOS Microkernel William S. Frantz Charles R. Landau Periwinkle Computer Consulting Tandem Computers Inc. 16345 Englewood Ave. 19333 Vallco Pkwy, Loc 3-22

Hardware-Supported Pointer Detection for common Garbage Collections

2013 First International Symposium on Computing and Networking Hardware-Supported Pointer Detection for common Garbage Collections Kei IDEUE, Yuki SATOMI, Tomoaki TSUMURA and Hiroshi MATSUO Nagoya Institute

Hypervisor-based Fault-tolerance. Where should RC be implemented? The Hypervisor as a State Machine. The Architecture. In hardware

Where should RC be implemented? In hardware sensitive to architecture changes At the OS level state transitions hard to track and coordinate At the application level requires sophisticated application

Providing Real-Time and Fault Tolerance for CORBA Applications

Providing Real-Time and Tolerance for CORBA Applications Priya Narasimhan Assistant Professor of ECE and CS University Pittsburgh, PA 15213-3890 Sponsored in part by the CMU-NASA High Dependability Computing

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Piccolo: Building Fast, Distributed Programs

NFSv4 as the Building Block for Fault Tolerant Applications

NFSv4 as the Building Block for Fault Tolerant Applications Alexandros Batsakis Overview Goal: To provide support for recoverability and application fault tolerance through the NFSv4 file system Motivation:

Survey on Incremental MapReduce for Data Mining

Survey on Incremental MapReduce for Data Mining Trupti M. Shinde 1, Prof.S.V.Chobe 2 1 Research Scholar, Computer Engineering Dept., Dr. D. Y. Patil Institute of Engineering &Technology, 2 Associate Professor,

Synopsis by: Stephen Roberts, GMU CS 895, Spring 2013

Using Components for Architecture-Based Management The Self-Repair case Sylvain Sicard Université Joseph Fourier, Grenoble, France, Fabienne Boyer Université Joseph Fourier, Grenoble, France, Noel De Palma

Scalable In-memory Checkpoint with Automatic Restart on Failures

Scalable In-memory Checkpoint with Automatic Restart on Failures Xiang Ni, Esteban Meneses, Laxmikant V. Kalé Parallel Programming Laboratory University of Illinois at Urbana-Champaign November, 2012 8th

features of Python 1.5, including the features earlier described in [2]. Section 2.6 summarizes what is new in Python The class and the class

A note on reflection in Python 1.5 Anders Andersen y AAndersen@ACM.Org March 13, 1998 Abstract This is a note on reflection in Python 1.5. Both this and earlier versions of Python has an open implementation

The Extensible Java Preprocessor Kit. and a Tiny Data-Parallel Java. Abstract

The Extensible Java Preprocessor Kit and a Tiny Data-Parallel Java Yuuji ICHISUGI 1, Yves ROUDIER 2 {ichisugi,roudier}@etl.go.jp 1 Electrotechnical Laboratory, 2 STA Fellow, Electrotechnical Laboratory

Redo Log Undo Log. Redo Log Undo Log. Redo Log Tail Volatile Store. Pers. Redo Log

Recovering from Main-Memory Lapses H.V. Jagadish AT&T Research Murray Hill, NJ 07974 jag@research.att.com Avi Silberschatz Bell Laboratories Murray Hill, NJ 07974 avi@bell-labs.com S. Sudarshan Indian

14.7 Dynamic Linking. Building a Runnable Program

14 Building a Runnable Program 14.7 Dynamic Linking To be amenable to dynamic linking, a library must either (1) be located at the same address in every program that uses it, or (2) have no relocatable

Automatic Code Generation for Non-Functional Aspects in the CORBALC Component Model

Automatic Code Generation for Non-Functional Aspects in the CORBALC Component Model Diego Sevilla 1, José M. García 1, Antonio Gómez 2 1 Department of Computer Engineering 2 Department of Information and

Recoverability. Kathleen Durant PhD CS3200

Recoverability Kathleen Durant PhD CS3200 1 Recovery Manager Recovery manager ensures the ACID principles of atomicity and durability Atomicity: either all actions in a transaction are done or none are

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

Processes and Threads Implementation

Processes and Threads Implementation 1 Learning Outcomes An understanding of the typical implementation strategies of processes and threads Including an appreciation of the trade-offs between the implementation

As related works, OMG's CORBA (Common Object Request Broker Architecture)[2] has been developed for long years. CORBA was intended to realize interope

HORB: Distributed Execution of Java Programs HIRANO Satoshi Electrotechnical Laboratory and RingServer Project 1-1-4 Umezono Tsukuba, 305 Japan hirano@etl.go.jp http://ring.etl.go.jp/openlab/horb/ Abstract.

Distributed File Systems. CS432: Distributed Systems Spring 2017

Distributed File Systems Reading Chapter 12 (12.1-12.4) [Coulouris 11] Chapter 11 [Tanenbaum 06] Section 4.3, Modern Operating Systems, Fourth Ed., Andrew S. Tanenbaum Section 11.4, Operating Systems Concept,

Concept as a Generalization of Class and Principles of the Concept-Oriented Programming

Computer Science Journal of Moldova, vol.13, no.3(39), 2005 Concept as a Generalization of Class and Principles of the Concept-Oriented Programming Alexandr Savinov Abstract In the paper we describe a

Discount Checking: Transparent, Low-Overhead Recovery for General Applications

Discount Checking: Transparent, Low-Overhead Recovery for General Applications David E. Lowell and Peter M. Chen Computer Science and Engineering Division Department of Electrical Engineering and Computer

Transparent Access to Legacy Data in Java. Olivier Gruber. IBM Almaden Research Center. San Jose, CA Abstract

Transparent Access to Legacy Data in Java Olivier Gruber IBM Almaden Research Center San Jose, CA 95120 Abstract We propose in this paper an extension to PJava in order to provide a transparent access

Novel Log Management for Sender-based Message Logging

Novel Log Management for Sender-based Message Logging JINHO AHN College of Natural Sciences, Kyonggi University Department of Computer Science San 94-6 Yiuidong, Yeongtonggu, Suwonsi Gyeonggido 443-760

Transparent Orthogonal Checkpointing Through User-Level Pagers

Transparent Orthogonal Checkpointing Through User-Level Pagers Espen Skoglund, Christian Ceelen, and Jochen Liedtke System Architecture Group University of Karlsruhe {skoglund,ceelen,liedtke}@ira.uka.de

Optimistic Distributed Simulation Based on Transitive Dependency. Tracking. Dept. of Computer Sci. AT&T Labs-Research Dept. of Elect. & Comp.

Optimistic Distributed Simulation Based on Transitive Dependency Tracking Om P. Damani Yi-Min Wang Vijay K. Garg Dept. of Computer Sci. AT&T Labs-Research Dept. of Elect. & Comp. Eng Uni. of Texas at Austin

Recovering Device Drivers

Recovering Device Drivers Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy University of Washington Presenter: Hayun Lee Embedded Software Lab. Symposium on Operating Systems

Development of Technique for Healing Data Races based on Software Transactional Memory

pp.482-487 http://dx.doi.org/10.14257/astl.2016.139.96 Development of Technique for Healing Data Races based on Software Transactional Memory Eu-Teum Choi 1, Kun Su Yoon 2, Ok-Kyoon Ha 3, Yong-Kee Jun

Support for Software Interrupts in Log-Based. Rollback-Recovery. J. Hamilton Slye E.N. Elnozahy. IBM Austin Research Lab. July 27, 1997.

Support for Software Interrupts in Log-Based Rollback-Recovery J. Hamilton Slye E.N. Elnozahy Transarc IBM Austin Research Lab July 27, 1997 Abstract The piecewise deterministic execution model is a fundamental

Towards a Resilient Operating System for Wireless Sensor Networks

Towards a Resilient Operating System for Wireless Sensor Networks Hyoseung Kim Hojung Cha Yonsei University, Korea 2006. 6. 1. Hyoseung Kim hskim@cs.yonsei.ac.kr Motivation (1) Problems: Application errors

Fault Tolerance. The Three universe model

Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

Application. Protocol Stack. Kernel. Network. Network I/F

Real-Time Communication in Distributed Environment Real-Time Packet Filter Approach 3 Takuro Kitayama Keio Research Institute at SFC Keio University 5322 Endo Fujisawa Kanagawa, Japan takuro@sfc.keio.ac.jp

A Metaobject Protocol for Fault-Tolerant CORBA Applications *

A Metaobject Protocol for Fault-Tolerant CORBA Applications * Marc-Olivier Killijian 1, Jean-Charles Fabre, Juan-Carlos Ruiz-Garcia LAAS-CNRS, 7 Avenue du Colonel Roche 31077 Toulouse cedex, France Shigeru

Lecture 21: Logging Schemes /645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo

Lecture 21: Logging Schemes 15-445/645 Database Systems (Fall 2017) Carnegie Mellon University Prof. Andy Pavlo Crash Recovery Recovery algorithms are techniques to ensure database consistency, transaction

Research on the Novel and Efficient Mechanism of Exception Handling Techniques for Java. Xiaoqing Lv 1 1 huihua College Of Hebei Normal University,

International Conference on Informatization in Education, Management and Business (IEMB 2015) Research on the Novel and Efficient Mechanism of Exception Handling Techniques for Java Xiaoqing Lv 1 1 huihua

The Procedure Abstraction

The Procedure Abstraction Procedure Abstraction Begins Chapter 6 in EAC The compiler must deal with interface between compile time and run time Most of the tricky issues arise in implementing procedures
