Design of Fault Tolerant Software

Andrea Bondavalli
CNUCE/CNR, via S.Maria 36, Pisa, Italy.

Abstract

In this paper we deal with structured software fault tolerance. Structured software fault tolerance techniques are those in which redundancy (both for detection and correction) is applied to individual blocks of software with the goal of masking or revealing errors internal to the block. Each technique has its own way of structuring the interactions among redundant parts and of managing the added complexity. We discuss some of the open problems of software fault tolerance structures and the issues related to their effectiveness. In particular we address generality and flexibility, which can be improved at the price of adding complexity to the design, and discuss the need for proper trade-offs between generality and flexibility on one hand and complexity on the other. The SCOP (Self-Configuring Optimal Programming) scheme [7, 24] constitutes an interesting example of such a trade-off. SCOP is a fault-tolerant system structuring method, originally intended to improve run-time efficiency, that is also very flexible and general. Together with the mechanisms used by SCOP for obtaining flexibility and generality, some recent research aimed at reducing the complexity is finally described.

1 Introduction

As computer systems are used in modern society for many critical applications, it is commonly recognised that it is necessary to improve their reliability and, in general, their dependability. Since the early seventies it has also become apparent that obtaining software dependability (i.e., coping with design faults) constitutes a major problem. The development of dependable computing consists in the combined utilisation of a large number of techniques that can be classified into fault tolerance and fault prevention. Fault prevention techniques aim at a product that is as free as possible, and likely to remain free, from internal defects (faults).
Fault tolerance techniques are intended to "tolerate", by redundancy, the effects of faults, that is, to cope with the effects of faults and avert the occurrence of failures, or at least to warn a user that errors have been introduced into the state of the system.

All the techniques known as 'software fault tolerance' are based on the concept of diversity of design [4] or data [1] (the principle of "double checking one's results" already found in Babbage's work) and are characterised by the emphasis on structuring and systematicity to make these concepts applicable in practice. Design diversity is the approach in which two or more components (variants) of a system are produced through independent designs with the aim of delivering the same service. The major advantage of design diversity is that it does not require the complete absence of design faults, but only that they should not produce similar errors in variants. Classical techniques for tolerating software faults are recovery blocks [18] and N-version programming [3], which can be seen as extreme organisations following the design diversity approach, and other intermediate or combined techniques [13, 16, 19, 23].

Recovery blocks (RB) were the first scheme designed to provide software fault tolerance. In this approach, variants are named alternates and the main part of the adjudicator is an acceptance test that is applied sequentially to the results produced by variants. The variants are usually executed serially on a single processor. The execution time of a recovery block is normally that of the first variant, the acceptance test, and the operations required to establish and discard a checkpoint. This does not impose a high run-time overhead unless an error is detected and backward recovery is required. In this regard, RB is highly efficient. Limitations of the RB method are mainly related to its acceptance test.
This test is usually derived from the semantics of a given application, and close dependency between the test and the variants may impact the dependability of the whole system. Moreover, the development of simple, effective acceptance tests is a difficult task.

The N-version programming (NVP) approach avoids the use of an acceptance test by taking advantage of parallel execution of multiple versions and comparison of their results (although sequential execution is conceptually possible, just as parallel execution of RB alternates is possible). NVP is a direct application of the hardware N-modular redundancy (NMR) approach to software. Many adjudication mechanisms, usually based on result comparison and in most cases independent of the semantics of the application, are available and can be selected to determine a single adjudication result from a set or a subset of all the results of the variants. Here the probability of common mode failure between the adjudicator and the variants is relatively low. When variants are executed in parallel, NVP may have a fixed response time, thereby guaranteeing timely responses in the presence of faults. However, it utilises redundancy in a static manner and always executes all the versions regardless of the normal or abnormal state of the system. The purpose is to tolerate the maximum number of faults that may be present in the system; but, since such a worst case rarely happens, the amount of resources consumed is often higher than necessary.

2 Open Problems in Designing Software Fault Tolerance

Academic research and practical applications of software fault tolerance have made much progress in clarifying the possibilities of such methodologies and the problems related to their application, such as complexity, which should be kept under control. At the same time, the advantages obtainable from individual software fault tolerance schemes are not clearly measurable. Design diversity still has some difficulties in ensuring a routine improvement in software dependability; [9] details this discussion. The work on using simple retry of programs to mask the effects of the faults that cause transient errors [10, 11] seems to fit practical experience, but it is less complete and its effectiveness may be a matter of luck. Using data diversity to tolerate design faults in software systems [1] might provide a more cost-effective alternative, though such a technique is not generally applicable either.

In order to improve the effectiveness of software fault tolerance some problems need to be addressed. Among them are the high costs (both the run-time overhead and the design cost), the limited ability to evaluate the impact of software fault tolerance structures, and the usually very limited flexibility of software fault tolerance designs and their consequent inability to adapt to changing run-time conditions. Some of these problems are typical of software fault tolerance while others are common also to software-implemented fault tolerance, i.e. the tolerance of hardware faults performed by software.

The cost of developing variants and adjudicator may be many times more than that of a single variant [15]. Some research activity is being undertaken with the objective of reducing the development cost of fault-tolerant software. The object-oriented programming paradigm has shown some possibilities through inheritance and polymorphism mechanisms [25]. At run time, all the fault tolerance approaches require some extra space or extra time, or both. Note that efficient use of the available resources generally requires dynamic management and conditional execution of the software variants.
This should come with a dynamic trade-off between fully parallel execution and totally sequential execution of the variants. Dynamic redundancy for the purpose of space-time trade-off is a classical idea, e.g. Duplicated Configuration with a Spare and NMR with Spares used in hardware [12]. However, the majority of software fault tolerance schemes do not provide such a dynamic space-time trade-off. In order to use redundancy in a dynamic or conditional manner, a scheme has to decide, at appropriate intermediate points of its execution, which of the following three execution states has been reached:

i) End-state E: a result exists that meets the required condition for delivery and can thus be delivered;
ii) Non end-state N: there is no result that meets the condition, but it is still possible to obtain such a result if further redundancy is employed; or
iii) Failure state F: there is no further possibility of producing a result that meets the condition.

Research to evaluate the impact of software fault tolerance structures has followed two main directions: experimental measurement and analytical estimation. Some of the experiments developed a software project, with procedures as close as possible to those that would be used if the proposed methodology were chosen for producing "real-world" software, and then tested it extensively. They have mainly provided useful insight into the problems of implementing the methodology, since very little statistical value can be given to the data collected. Other experiments were "statistically oriented": a number of variants of the same software were developed and then tested to obtain statistical data. Of course, the scope of applicability of such results is still limited. A common outcome was that good specifications are of paramount importance. The fundamental problem of coincident failures has been studied in some detail for NVP.
The experiment described in [14] has disproved (for one particular sample of variants, of course) the independence hypothesis, while PODS [5] has suggested that it may hold for small software modules. In PODS coincident failures in two variants were not normally due to similar programming errors (faults) but rather to a fault-masking effect in Boolean decision logic, a well-known phenomenon in the study of combinatorial circuits. Different faults (bugs) appeared to produce failures with independent probabilities.

Analytical estimations of fault-tolerant software have been published in a number of papers, most recently [2, 6, 8, 15, 17, 20-22]. They differ in the models and analytical tools used and in the assumptions made. The main problem in using these models is the difficulty of estimating the values of the parameters, in particular the probabilities of errors common to redundant components. This information must be obtained experimentally, but we are still far from being able to determine it with acceptable confidence.

All the structuring methods for software fault tolerance are designed such that one single condition for delivering a result is usually embedded, explicitly or implicitly, in the adjudicator. This rigid design choice limits flexibility and prevents adaptation to variations of the run-time environment or of the application requirements. Note that different conditions for delivering a result in principle have different fault coverage, though some conditions seem to be very similar. Take as an example a mission of a critical system where two phases have been identified: normal and critical. In the critical phase an application could use a more severe delivery condition that enforces very strict checks against the delivery of erroneous output, thus providing strong detection of abnormal conditions and help in triggering the safety mechanisms.
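The two-phase example can be made concrete with a minimal sketch. It assumes, purely for illustration, that the normal phase uses a majority condition and the critical phase a unanimity condition; the names majority, unanimity, DELIVERY_CONDITIONS and adjudicate are hypothetical and do not come from the paper.

```python
from collections import Counter

def majority(results):
    """Deliver a value agreed on by more than half of the results, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

def unanimity(results):
    """Deliver a value only if every result agrees, else None."""
    if results and len(set(results)) == 1:
        return results[0]
    return None

# Assumed mapping from mission phase to delivery condition (illustrative only).
DELIVERY_CONDITIONS = {"normal": majority, "critical": unanimity}

def adjudicate(results, phase):
    """Apply the delivery condition selected for the current mission phase."""
    return DELIVERY_CONDITIONS[phase](results)

results = [42, 42, 41]                               # one variant disagrees
normal_outcome = adjudicate(results, "normal")       # majority masks the error
critical_outcome = adjudicate(results, "critical")   # stricter check refuses delivery
```

On the same set of results the weaker condition masks the disagreeing variant, while the stricter one delivers nothing and so exposes the abnormal condition to whatever safety mechanism the application provides.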
Obviously the ability to adapt to variations of the run-time environment or of the application requirements calls for dynamic decisions to be taken. The price to be paid for this ability is clearly additional complexity, which must be controlled and limited as it may itself be a source of errors and defeat the method.

3 SCOP

Among the existing problems in the area of software fault tolerance, we intend here to focus our attention on generality and flexibility. These features may of course be improved, but only by designing more complex structures. Here we discuss the case of SCOP [7, 24]. It was originally proposed with the primary aim of improving run-time effectiveness, and to this purpose it applies conditional redundancy also when comparison-based adjudicators are used. On further consideration, however, SCOP can be recognised as a general dynamic scheme providing both flexibility and run-time efficiency for tolerating either software or hardware faults. The question that remains to be answered is whether the increased complexity is justified and the trade-off represented by SCOP is satisfactory. In this respect the critical features of SCOP, namely the inherent complexity of the control algorithm and the need for mechanisms to support its implementation, are discussed, and directions towards reducing complexity are identified.

3.1 Basic Description

The SCOP scheme consists of a set of software components, V = {v_1, v_2, ..., v_n}, an adjudication mechanism, a set of delivery conditions, one of which is dynamically chosen at run time, and a controller that coordinates the dynamic actions of the architecture. At run time an instance of SCOP accepts as additional parameters the selected delivery condition and, possibly, a deadline for the whole execution. First the controller decides how many phases can be performed (in order to provide a timely result), then it selects the (minimum) set of components that (if successful) could satisfy the selected delivery condition. After execution of the set of components the adjudicator verifies whether the chosen delivery condition has been met (using a syndrome that may grow as more phases are performed). This behaviour is repeated until a result can be delivered or the software components are exhausted. The behaviour of SCOP can be described more precisely by the following control algorithm, with comments on the right side.
    begin
      i := 0;                                    {index of the current phase, set to 0}
      State_mark := N;                           {set current state as non end-state}
      S_0 := {};                                 {set syndrome as empty}
      C := one of {delivery conditions};         {set required delivery condition}
      decide(max_phase);                         {based on time constraints}
      while State_mark = N and i < max_phase do  {while current state is non end-state and current phase < maximum allowed}
      begin
        i := i + 1;                              {start new phase}
        configure(C, S_{i-1}, i, V_i);           {set new Currently Active Set}
        execute(V_i, S_i);                       {execute and obtain new syndrome}
        adjudicate(C, S_i, State_mark, res);     {set new state mark and select result}
      end;
      if State_mark = E then deliver(res)        {current state is end-state or failure state?}
      else signal(failure);
    end

The decide procedure determines the maximum number max_phase of phases permitted by the specified timing constraints. Procedure configure constructs the currently active set (CAS) V_1 in phase one according to the selected delivery condition and the given application environment, and establishes the CAS V_i (i > 1) based on the syndrome S_{i-1} collected in the (i-1)th phase and the information on phases. The execution of a CAS may lead to a successful state E. Note that the software components in V_i are selected from those that have not been used in any of the previous phases, i.e. V_i is a subset of V - (V_1 ∪ V_2 ∪ ... ∪ V_{i-1}). If the i-th phase is the last, V_i contains all the remaining spare software components. The execute procedure manages the execution of the software components in the CAS and generates the syndrome S_i, where S_0 is an empty set and S_{i-1} is a subset of S_i. Procedure adjudicate implements the adjudication function using the selected condition C. It receives the syndrome S_i, sets the new State_mark and selects the result res, if one exists. The deliver procedure delivers the selected result and the signal procedure produces a failure notification.
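The control algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the delivery condition is assumed to be "at least k results agree", variants are plain functions run one after another, the syndrome is simply the list of results obtained so far, and max_phase is passed in directly instead of being derived from a deadline by decide. The function scop and its return convention are hypothetical.

```python
from collections import Counter

E, N, F = "E", "N", "F"   # end-state, non end-state, failure state

def adjudicate(k, syndrome, spare):
    """Assumed condition 'k results agree': return (state_mark, result)."""
    top = 0
    if syndrome:
        value, top = Counter(syndrome).most_common(1)[0]
        if top >= k:
            return E, value
    # N only if the unused components could still lift the best group to k.
    return (N, None) if top + spare >= k else (F, None)

def configure(k, syndrome, unused, last_phase):
    """Choose the smallest currently active set that could satisfy k."""
    if last_phase:
        return list(unused)                   # last phase: all remaining spares
    top = max(Counter(syndrome).values(), default=0)
    return list(unused[: k - top])

def scop(variants, k, max_phase, x):
    """Run the phases; return (result, number of variants executed)."""
    syndrome, unused = [], list(variants)     # S_0 is empty
    state, res, i, executed = N, None, 0, 0
    while state == N and i < max_phase:
        i += 1                                # start new phase
        cas = configure(k, syndrome, unused, i == max_phase)
        unused = unused[len(cas):]            # components are never reused
        syndrome += [v(x) for v in cas]       # S_{i-1} is a subset of S_i
        executed += len(cas)
        state, res = adjudicate(k, syndrome, len(unused))
    if state == E:
        return res, executed                  # deliver(res)
    raise RuntimeError("SCOP failure: delivery condition not met")

square = lambda x: x * x
faulty = lambda x: x * x + 1                  # hypothetical faulty variant

fault_free = scop([square, square, square], k=2, max_phase=2, x=3)
one_fault = scop([faulty, square, square], k=2, max_phase=2, x=3)
```

With three variants and k = 2, the fault-free run stops after executing two variants in phase one, whereas a static scheme would always run all three; only when the first phase yields a disagreement is the spare executed in phase two.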
3.2 SCOP characteristics

Given the above-mentioned difficulties in demonstrating the general effectiveness of design diversity for attaining software dependability, an instance of SCOP may, in order to tolerate software faults (and some hardware-related faults), employ software components according to an application-specific approach for masking the effect of faults. One such approach is obviously multiple versions of software, but diversity in data space, simple retry of programs or multiple replicas could also be used, depending largely upon specific application requirements and considerations of cost-effectiveness.

An instance of SCOP can be designed to obey multiple different delivery conditions. One of them is dynamically chosen at run time, and the selected condition may change for different executions, according to the degradation of the system or to the actual execution phase, as discussed previously. In addition, if SCOP is used for the provision of a service used by many different applications, different delivery conditions may be dynamically chosen by the different applications, according to their degrees of criticality. Since the different conditions will usually have different fault coverage, SCOP is able to provide different levels of dependability.

SCOP is very efficient and makes dynamic use of redundancy; i.e., it always tries to execute the minimum number of software components strictly necessary for providing a result that meets the stated delivery condition. To do this it organises the execution of components in phases, dynamically configuring a currently active set (CAS) V_i, a subset of V, at the beginning of the i-th phase. An adjudication is made after the execution of V_i in order to check whether the conditions for the release of a result are satisfied. Once these conditions are met, the result is output immediately and any further phases and actions are abandoned.
The initial CAS V_1 in phase one can be determined and flexibly changed with respect to different delivery conditions. Whenever it is recognised as necessary according to the selected delivery condition, the syndrome (a set of information used by an adjudicator to perform its judgement as to the correctness of a result) is accumulated in SCOP as the phases progress. All the results produced and the additional information collected so far are employed to support the selection of a correct result.

The architecture is very general, allowing several approaches for masking the effect of faults to be combined with different delivery conditions. For example, combining the design diversity approach with an acceptance test yields the recovery block behaviour, while pure replication and a majority voter (with the selection of one phase only) can be used for the design of an instance of NMR. In this way the best alternative for the specific application can be specified and designed. The mechanisms for providing flexibility (the concept of delivery condition) and efficiency (adaptive redundancy management) are basically independent of the kind of redundancy used for masking software faults, but must rely on the highly dynamic and complex control algorithm described.

3.3 Complexity

The complexity of the dynamic control algorithm and of the adjudication mechanism for SCOP could itself be a source of errors and thus defeat the proposed scheme. These components are very general and can be re-used for all the instances of the scheme. Thus it appears feasible to verify them formally and prove their correctness. Then, there is the need to develop a methodology for designing SCOP components, ideally supported by automatic tools. Given the application requirements to be fulfilled, the design of the specific SCOP instance implies the proper selection of the fault masking approach, of the degree of redundancy and of the delivery condition(s). The appropriateness of the synthesised SCOP instance can then be verified and evaluated.
The information specifying the desired behaviour, relative to this instance, of the control algorithm and the adjudication mechanism can be made read-only and recorded on stable storage if necessary. In this way, the complexity of on-line control is greatly reduced: the adjudication mechanism and control algorithm of SCOP can take run-time actions by just monitoring the execution and reading the proper information, without performing complex computation.

3.4 Supporting Mechanism

To support the SCOP methodology several OS-implemented mechanisms are necessary, which are basically the same as those used to implement NVP and RB. Control of the components may be provided by a controller (similar to the driver program used in N-version programming). The controller is responsible for:

1) a synchronisation mechanism;
2) a mechanism for ensuring an identical set of input values or the proper data representation, as the case may be, to each component;
3) a mechanism for dynamically invoking an appropriate subset of variants;
4) support for the specialisation of possible types of adjudication (both application semantics and syntax based);
5) support for logical and/or physical reconfiguration, if needed.

This set of OS-implemented mechanisms may be seen as constituting a generic SCOP run-time support. Indeed, our preliminary analysis indicates that the implementation of this support should not introduce serious technical difficulties in comparison with classical approaches such as NVP and RB. We are currently implementing a prototype of the SCOP scheme in an experimental C++ testbed (a local network environment that consists of a number of Sun-3 and Sun-4 workstations). Preliminary experimental data are promising and have given us additional confidence for adopting SCOP in practical systems.

4 Conclusions

In this paper we have dealt with some of the open problems that need to be addressed to improve the effectiveness of software fault tolerance structures.
Besides pointing out the high costs (both the run-time overhead and the design cost) and the limited ability to evaluate the impact of software fault tolerance structures, we focused on the lack of generality and the usually very limited flexibility of software fault tolerance designs, and on their consequent inability to adapt to changing run-time conditions. These characteristics can be improved at the price of adding complexity to the design, as exemplified by SCOP; thus proper trade-offs are required.

SCOP is a fault-tolerant system structuring method based on conditional usage of the available redundant components. It allows several approaches for masking the effect of faults to be combined with different, even multiple, delivery conditions. It is straightforward to design specific instances of SCOP which behave as a recovery block or an NVP, or as software-implemented techniques for tolerating hardware faults. This generality makes it possible to specify the best alternative for a specific application and to reduce the supporting mechanisms that the kernel/OS must provide to a common set.

The possibility in SCOP of defining multiple delivery conditions to be used by the same instance represents a significant novelty in software fault tolerance, in that it allows a great degree of flexibility and generality. Different applications with different criticality may thus use the same basic service equipped with multiple delivery conditions. In this way SCOP supports different integrity levels. Another case easily managed is that of a single application with a variable degree of criticality in different phases of a mission. A SCOP instance, which usually provides services with a given degree of fault tolerance, may dynamically be used to provide support for safety.
It is enough to switch to a delivery condition enforcing very strict checks against the delivery of erroneous outputs: the behaviour obtained consists in very strong detection of abnormal conditions, helping to trigger the safety mechanisms. In both cases, this flexibility allows the scheme to easily manage executions that must occur in a degraded system, where fewer resources than usual are available due to some faults and before a repair or replacement action can take place.

All these nice characteristics are obtained at the price of adding complexity, which must be controlled and limited since it may itself be a source of errors. To understand whether the increased complexity is justified and the trade-off represented by SCOP is satisfactory, the different sources of additional complexity of SCOP compared to the other schemes have been pointed out and directions towards reducing complexity identified.

References

[1] P. E. Ammann and J. C. Knight, "Data diversity: an approach to software fault tolerance," IEEE Trans. Comput., Vol. 37.
[2] J. Arlat, K. Kanoun and J. C. Laprie, "Dependability modelling and evaluation of software fault tolerant systems," IEEE Trans. Comput., Vol. 39.
[3] A. Avizienis and L. Chen, "On the implementation of N-version-programming for software fault-tolerance during execution," in Proc. Int. Conf. Comput. Soft. and Appli., New York, 1977.
[4] A. Avizienis and J. C. Laprie, "Dependable Computing: from Concepts to Design Diversity," Proc. of the IEEE, Vol. 74.
[5] P. G. Bishop and F. D. Pullen, "PODS Revisited - A Study of Software Failure Behaviour," in Proc. 18th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-18), Tokyo, Japan, 1988.
[6] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico and L. Strigini, "Dependability Analysis of Iterative Fault Tolerant Software Considering Correlation," in "Predictably Dependable Computing Systems", B. Randell, J. C. Laprie, H. Kopetz and B. Littlewood Ed., Springer-Verlag, 1995.
[7] A. Bondavalli, F. Di Giandomenico and J. Xu, "A Cost-Effective and Flexible Scheme for Software Fault Tolerance," Journal of Computer Systems Science and Engineering, Vol. 8.
[8] S. Chiaradonna, A. Bondavalli and L. Strigini, "On Performability Modeling and Evaluation of Software Fault Tolerance Structures," in Proc. 1st European Dependable Computing Conference (EDCC-1), Berlin, Germany, 1994.
[9] D. E. Eckhardt, A. K. Caglayan, J. C. Knight, L. D. Lee, D. F. McAllister, M. A. Vouk and J. P. J. Kelly, "An Experimental Evaluation of Software Redundancy as a Strategy for Improving Reliability," IEEE Trans. Soft. Eng., Vol. 17.
[10] J. Gray and A. Reuter, "Transaction Processing: Concepts and Techniques," Morgan Kaufmann.
[11] Y. Huang and C.M.R. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proc. 23rd Int. Symp. Fault Tolerant Comput. (FTCS-23), Toulouse, 1993.
[12] B. W. Johnson, "Design and Analysis of Fault Tolerant Digital Systems," Addison-Wesley Pub. Co.
[13] K. H. Kim, "Distributed execution of recovery blocks: an approach to uniform treatment of hardware and software faults," in Proc. 4th Int. Conf. Distributed Comput. Sys., 1984.
[14] J. C. Knight and N. G. Leveson, "An Experimental Evaluation of the Assumption of Independence in Multiversion Programming," IEEE Trans. Soft. Eng., Vol. SE-12.
[15] J. C. Laprie, J. Arlat, C. Beounes and K. Kanoun, "Definition and Analysis of Hardware and Software Fault-Tolerant Architecture," IEEE Computer, Vol. 23.
[16] J. C. Laprie, J. Arlat, C. Beounes, K. Kanoun and C. Hourtolle, "Hardware and Software Fault Tolerance: Definition and Analysis of Architectural Solutions," in Proc. 17th Int. Symp. Fault-Tolerant Comput., Pittsburgh, 1987.
[17] M. R. Lyu and Y. He, "Improving the N-Version Programming Process Through the Evolution of a Design Paradigm," IEEE Transactions on Reliability, Special Issue on Fault-Tolerant Software, Vol. R-42.
[18] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. Soft. Eng., Vol. SE-1.
[19] R. K. Scott, J. W. Gault and D. F. McAllister, "Fault tolerant software reliability modeling," IEEE Trans. Soft. Eng., Vol. SE-13.
[20] A. Tai, A. Avizienis and J. Meyer, "Evaluation of fault-tolerant software: a performability modeling approach," in "Dependable Computing for Critical Applications 3", C. E. Landwehr, B. Randell and L. Simoncini Ed., Springer-Verlag, 1993.
[21] A. T. Tai, "Performability-Driven Adaptive Fault Tolerance," in Proc. 24th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-24), Austin, Texas, 1994.
[22] A. T. Tai, A. Avizienis and J. F. Meyer, "Performability Enhancement of Fault-Tolerant Software," IEEE Transactions on Reliability, Special Issue on Fault-Tolerant Software, Vol. R-42.
[23] J. Xu, "The t/(n-1)-diagnosability and Its Applications to Fault Tolerance," in Proc. 21st Int. Symp. Fault-Tolerant Comput., Montreal, 1991.
[24] J. Xu, A. Bondavalli and F. Di Giandomenico, "Dynamic Adjustment of Dependability and Efficiency in Fault-Tolerant Software," in "Predictably Dependable Computing Systems", B. Randell, J. C. Laprie, H. Kopetz and B. Littlewood Ed., Springer-Verlag, 1995.
[25] J. Xu, B. Randell, C.M.F. Rubira-Calsavara and R.J. Stroud, "Toward an Object-Oriented Approach to Software Fault Tolerance," in "Fault-Tolerant Parallel and Distributed Systems", D. R. Avresky Ed., IEEE Computer Society Press, 1994.


More information

Sequential Fault Tolerance Techniques

Sequential Fault Tolerance Techniques COMP-667 Software Fault Tolerance Software Fault Tolerance Sequential Fault Tolerance Techniques Jörg Kienzle Software Engineering Laboratory School of Computer Science McGill University Overview Robust

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

Component Failure Mitigation According to Failure Type

Component Failure Mitigation According to Failure Type onent Failure Mitigation According to Failure Type Fan Ye, Tim Kelly Department of uter Science, The University of York, York YO10 5DD, UK {fan.ye, tim.kelly}@cs.york.ac.uk Abstract Off-The-Shelf (OTS)

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM

AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM AUTONOMOUS RECONFIGURATION OF IP CORE UNITS USING BLRB ALGORITHM B.HARIKRISHNA 1, DR.S.RAVI 2 1 Sathyabama Univeristy, Chennai, India 2 Department of Electronics Engineering, Dr. M. G. R. Univeristy, Chennai,

More information

Software Diversity and Fault-Tolerance: An Overview

Software Diversity and Fault-Tolerance: An Overview Software Diversity and Fault-Tolerance: An Overview Daniel Rodriguez Retamosa and Mehrdad Saadatmand Mälardalen Real-Time Research Centre (MRTC) Mälardalen University Västerås, Sweden dra05002@student.mdh.se,

More information

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki Introduction to Software Fault Tolerance Techniques and Implementation Presented By : Hoda Banki 1 Contents : Introduction Types of faults Dependability concept classification Error recovery Types of redundancy

More information

Diversely Designed Classes for Use by Multiple Tasks

Diversely Designed Classes for Use by Multiple Tasks Diversely Designed Classes for Use by Multiple Tasks Alexander Romanovsky Department of Computing Science University of Newcastle upon Tyne, NE1 7RU, UK email: alexander.romanovsky@newcastle.ac.uk tel:

More information

Chapter 9. Software Testing

Chapter 9. Software Testing Chapter 9. Software Testing Table of Contents Objectives... 1 Introduction to software testing... 1 The testers... 2 The developers... 2 An independent testing team... 2 The customer... 2 Principles of

More information

VALIDATING AN ANALYTICAL APPROXIMATION THROUGH DISCRETE SIMULATION

VALIDATING AN ANALYTICAL APPROXIMATION THROUGH DISCRETE SIMULATION MATHEMATICAL MODELLING AND SCIENTIFIC COMPUTING, Vol. 8 (997) VALIDATING AN ANALYTICAL APPROXIMATION THROUGH DISCRETE ULATION Jehan-François Pâris Computer Science Department, University of Houston, Houston,

More information

ACTIVE NETWORK MANAGEMENT FACILITATING THE CONNECTION OF DISTRIBUTED GENERATION AND ENHANCING SECURITY OF SUPPLY IN DENSE URBAN DISTRIBUTION NETWORKS

ACTIVE NETWORK MANAGEMENT FACILITATING THE CONNECTION OF DISTRIBUTED GENERATION AND ENHANCING SECURITY OF SUPPLY IN DENSE URBAN DISTRIBUTION NETWORKS ACTIVE NETWORK MANAGEMENT FACILITATING THE CONNECTION OF DISTRIBUTED GENERATION AND ENHANCING SECURITY OF SUPPLY IN DENSE URBAN DISTRIBUTION NETWORKS David OLMOS MATA Ali R. AHMADI Graham AULT Smarter

More information

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES S. SRINIVAS KUMAR *, R.BASAVARAJU ** * PG Scholar, Electronics and Communication Engineering, CRIT

More information

A Modelling and Analysis Environment for LARES

A Modelling and Analysis Environment for LARES A Modelling and Analysis Environment for LARES Alexander Gouberman, Martin Riedl, Johann Schuster, and Markus Siegle Institut für Technische Informatik, Universität der Bundeswehr München, {firstname.lastname@unibw.de

More information

Concurrent Exception Handling and Resolution in Distributed Object Systems

Concurrent Exception Handling and Resolution in Distributed Object Systems Concurrent Exception Handling and Resolution in Distributed Object Systems Presented by Prof. Brian Randell J. Xu A. Romanovsky and B. Randell University of Durham University of Newcastle upon Tyne 1 Outline

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

LabVIEW Based Embedded Design [First Report]

LabVIEW Based Embedded Design [First Report] LabVIEW Based Embedded Design [First Report] Sadia Malik Ram Rajagopal Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 malik@ece.utexas.edu ram.rajagopal@ni.com

More information

MATERIALS AND METHOD

MATERIALS AND METHOD e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Evaluation of Web Security Mechanisms

More information

ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN. F. W. Zurcher B. Randell

ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN. F. W. Zurcher B. Randell ITERATIVE MULTI-LEVEL MODELLING - A METHODOLOGY FOR COMPUTER SYSTEM DESIGN F. W. Zurcher B. Randell Thomas J. Watson Research Center Yorktown Heights, New York Abstract: The paper presents a method of

More information

What are Embedded Systems? Lecture 1 Introduction to Embedded Systems & Software

What are Embedded Systems? Lecture 1 Introduction to Embedded Systems & Software What are Embedded Systems? 1 Lecture 1 Introduction to Embedded Systems & Software Roopa Rangaswami October 9, 2002 Embedded systems are computer systems that monitor, respond to, or control an external

More information

A Framework for Reliability Assessment of Software Components

A Framework for Reliability Assessment of Software Components A Framework for Reliability Assessment of Software Components Rakesh Shukla, Paul Strooper, and David Carrington School of Information Technology and Electrical Engineering, The University of Queensland,

More information

!! An!Orthogonal!Framework!for!Fault! Tolerance!Composition!in!Software!Systems!!!

!! An!Orthogonal!Framework!for!Fault! Tolerance!Composition!in!Software!Systems!!! AnOrthogonalFrameworkforFault ToleranceCompositioninSoftwareSystems SobiaKhurshidKhan ComputingDepartment LancasterUniversity UnitedKingdom SUBMITTEDINPARTIALFULFILLMENTOFTHE REQUIREMENTFORTHEDEGREEOF

More information

Conceptual Model for a Software Maintenance Environment

Conceptual Model for a Software Maintenance Environment Conceptual Model for a Software Environment Miriam. A. M. Capretz Software Engineering Lab School of Computer Science & Engineering University of Aizu Aizu-Wakamatsu City Fukushima, 965-80 Japan phone:

More information

References: internet notes; Bertrand Meyer, Object-Oriented Software Construction; 10/14/2004 1

References: internet notes; Bertrand Meyer, Object-Oriented Software Construction; 10/14/2004 1 References: internet notes; Bertrand Meyer, Object-Oriented Software Construction; 10/14/2004 1 Assertions Statements about input to a routine or state of a class Have two primary roles As documentation,

More information

An Optimal Locking Scheme in Object-Oriented Database Systems

An Optimal Locking Scheme in Object-Oriented Database Systems An Optimal Locking Scheme in Object-Oriented Database Systems Woochun Jun Le Gruenwald Dept. of Computer Education School of Computer Science Seoul National Univ. of Education Univ. of Oklahoma Seoul,

More information

Aerospace Software Engineering

Aerospace Software Engineering 16.35 Aerospace Software Engineering Verification & Validation Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Would You...... trust a completely-automated nuclear power plant?... trust a completely-automated

More information

Reflective Design Patterns to Implement Fault Tolerance

Reflective Design Patterns to Implement Fault Tolerance Reflective Design Patterns to Implement Fault Tolerance Luciane Lamour Ferreira Cecília Mary Fischer Rubira Institute of Computing - IC State University of Campinas UNICAMP P.O. Box 676, Campinas, SP 3083-970

More information

Improving Memory Repair by Selective Row Partitioning

Improving Memory Repair by Selective Row Partitioning 200 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems Improving Memory Repair by Selective Row Partitioning Muhammad Tauseef Rab, Asad Amin Bawa, and Nur A. Touba Computer

More information

A Case Study for Fault Tolerance Oriented Programming in Multi-core Architecture

A Case Study for Fault Tolerance Oriented Programming in Multi-core Architecture Software Engineering Group Department of Computer Science Nanjing University http://seg.nju.edu.cn Technical Report No. NJU-SEG- 2009-IC-001 A Case Study for Fault Tolerance Oriented Programming in Multi-core

More information

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems

On Object Orientation as a Paradigm for General Purpose. Distributed Operating Systems On Object Orientation as a Paradigm for General Purpose Distributed Operating Systems Vinny Cahill, Sean Baker, Brendan Tangney, Chris Horn and Neville Harris Distributed Systems Group, Dept. of Computer

More information

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification and Validation 1 Objectives To introduce software verification and validation and to discuss the distinction between them To describe the program inspection process and its role in V & V To

More information

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or

NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or NOTICE WARNING CONCERNING COPYRIGHT RESTRICTIONS: The copyright law of the United States (title 17, U.S. Code) governs the making of photocopies or other reproductions of copyrighted material. Any copying

More information

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale Saurabh Hukerikar Christian Engelmann Computer Science Research Group Computer Science & Mathematics Division Oak Ridge

More information

Fault Tolerance. The Three universe model

Fault Tolerance. The Three universe model Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence

Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Introducing MESSIA: A Methodology of Developing Software Architectures Supporting Implementation Independence Ratko Orlandic Department of Computer Science and Applied Math Illinois Institute of Technology

More information

Assertions. Assertions - Example

Assertions. Assertions - Example References: internet notes; Bertrand Meyer, Object-Oriented Software Construction; 11/13/2003 1 Assertions Statements about input to a routine or state of a class Have two primary roles As documentation,

More information

Part 5. Verification and Validation

Part 5. Verification and Validation Software Engineering Part 5. Verification and Validation - Verification and Validation - Software Testing Ver. 1.7 This lecture note is based on materials from Ian Sommerville 2006. Anyone can use this

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

An Approach to Task Attribute Assignment for Uniprocessor Systems

An Approach to Task Attribute Assignment for Uniprocessor Systems An Approach to ttribute Assignment for Uniprocessor Systems I. Bate and A. Burns Real-Time Systems Research Group Department of Computer Science University of York York, United Kingdom e-mail: fijb,burnsg@cs.york.ac.uk

More information

A Robust Bloom Filter

A Robust Bloom Filter A Robust Bloom Filter Yoon-Hwa Choi Department of Computer Engineering, Hongik University, Seoul, Korea. Orcid: 0000-0003-4585-2875 Abstract A Bloom filter is a space-efficient randomized data structure

More information

The Design Space of Software Development Methodologies

The Design Space of Software Development Methodologies The Design Space of Software Development Methodologies Kadie Clancy, CS2310 Term Project I. INTRODUCTION The success of a software development project depends on the underlying framework used to plan and

More information

3.4 Data-Centric workflow

3.4 Data-Centric workflow 3.4 Data-Centric workflow One of the most important activities in a S-DWH environment is represented by data integration of different and heterogeneous sources. The process of extract, transform, and load

More information

MONIKA HEINER.

MONIKA HEINER. LESSON 1 testing, intro 1 / 25 SOFTWARE TESTING - STATE OF THE ART, METHODS, AND LIMITATIONS MONIKA HEINER monika.heiner@b-tu.de http://www.informatik.tu-cottbus.de PRELIMINARIES testing, intro 2 / 25

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

Exploiting Unused Spare Columns to Improve Memory ECC

Exploiting Unused Spare Columns to Improve Memory ECC 2009 27th IEEE VLSI Test Symposium Exploiting Unused Spare Columns to Improve Memory ECC Rudrajit Datta and Nur A. Touba Computer Engineering Research Center Department of Electrical and Computer Engineering

More information

HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES

HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES (1) Nallaparaju Sneha, PG Scholar in VLSI Design, (2) Dr. K. Babulu, Professor, ECE Department, (1)(2)

More information

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116

ISSN: [Keswani* et al., 7(1): January, 2018] Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AUTOMATIC TEST CASE GENERATION FOR PERFORMANCE ENHANCEMENT OF SOFTWARE THROUGH GENETIC ALGORITHM AND RANDOM TESTING Bright Keswani,

More information

Area Efficient Scan Chain Based Multiple Error Recovery For TMR Systems

Area Efficient Scan Chain Based Multiple Error Recovery For TMR Systems Area Efficient Scan Chain Based Multiple Error Recovery For TMR Systems Kripa K B 1, Akshatha K N 2,Nazma S 3 1 ECE dept, Srinivas Institute of Technology 2 ECE dept, KVGCE 3 ECE dept, Srinivas Institute

More information

VTV A Voting Strategy for Real-Time Systems

VTV A Voting Strategy for Real-Time Systems VTV A Voting Strategy for Real-Time Systems Hüseyin Aysan, Sasikumar Punnekkat, and Radu Dobrin Mälardalen Real-Time Research Centre, Mälardalen University, Västerås, Sweden {huseyin.aysan, sasikumar.punnekkat,

More information

Doctoral Studies and Research Proposition. Diversity in Peer-to-Peer Networks. Mikko Pervilä. Helsinki 24 November 2008 UNIVERSITY OF HELSINKI

Doctoral Studies and Research Proposition. Diversity in Peer-to-Peer Networks. Mikko Pervilä. Helsinki 24 November 2008 UNIVERSITY OF HELSINKI Doctoral Studies and Research Proposition Diversity in Peer-to-Peer Networks Mikko Pervilä Helsinki 24 November 2008 UNIVERSITY OF HELSINKI Department of Computer Science Supervisor: prof. Jussi Kangasharju

More information

SOFTWARE ENGINEERING DECEMBER. Q2a. What are the key challenges being faced by software engineering?

SOFTWARE ENGINEERING DECEMBER. Q2a. What are the key challenges being faced by software engineering? Q2a. What are the key challenges being faced by software engineering? Ans 2a. The key challenges facing software engineering are: 1. Coping with legacy systems, coping with increasing diversity and coping

More information

Metaheuristic Optimization with Evolver, Genocop and OptQuest

Metaheuristic Optimization with Evolver, Genocop and OptQuest Metaheuristic Optimization with Evolver, Genocop and OptQuest MANUEL LAGUNA Graduate School of Business Administration University of Colorado, Boulder, CO 80309-0419 Manuel.Laguna@Colorado.EDU Last revision:

More information

Software Testing. Software Testing

Software Testing. Software Testing Software Testing Software Testing Error: mistake made by the programmer/ developer Fault: a incorrect piece of code/document (i.e., bug) Failure: result of a fault Goal of software testing: Cause failures

More information

Quality Assurance in Software Development

Quality Assurance in Software Development Quality Assurance in Software Development Qualitätssicherung in der Softwareentwicklung A.o.Univ.-Prof. Dipl.-Ing. Dr. Bernhard Aichernig Graz University of Technology Austria Summer Term 2017 1 / 47 Agenda

More information

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization

Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Handling Multi Objectives of with Multi Objective Dynamic Particle Swarm Optimization Richa Agnihotri #1, Dr. Shikha Agrawal #1, Dr. Rajeev Pandey #1 # Department of Computer Science Engineering, UIT,

More information

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL Ch.Srujana M.Tech [EDT] srujanaxc@gmail.com SR Engineering College, Warangal. M.Sampath Reddy Assoc. Professor, Department

More information

Ian Sommerville 2006 Software Engineering, 8th edition. Chapter 22 Slide 1

Ian Sommerville 2006 Software Engineering, 8th edition. Chapter 22 Slide 1 Verification and Validation Slide 1 Objectives To introduce software verification and validation and to discuss the distinction between them To describe the program inspection process and its role in V

More information

Detecting Common Mode Failures in N-Version Software Using Weakest Precondition Analysis

Detecting Common Mode Failures in N-Version Software Using Weakest Precondition Analysis Detecting Common Mode Failures in N-Version Software Using Weakest Precondition Analysis Gwang Sik Yoon, Sung Deok Cha, and Yong Rae Kwon Department of Computer Science Korea Advanced Institute of Science

More information

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability CDA 5140 Software Fault-tolerance - so far have looked at reliability as hardware reliability - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

More information

A Case Study of Agreement Problems in Distributed Systems : Non-Blocking Atomic Commitment

A Case Study of Agreement Problems in Distributed Systems : Non-Blocking Atomic Commitment A Case Study of Agreement Problems in Distributed Systems : Non-Blocking Atomic Commitment Michel RAYNAL IRISA, Campus de Beaulieu 35042 Rennes Cedex (France) raynal @irisa.fr Abstract This paper considers

More information

Two-dimensional Totalistic Code 52

Two-dimensional Totalistic Code 52 Two-dimensional Totalistic Code 52 Todd Rowland Senior Research Associate, Wolfram Research, Inc. 100 Trade Center Drive, Champaign, IL The totalistic two-dimensional cellular automaton code 52 is capable

More information

Topics in Software Testing

Topics in Software Testing Dependable Software Systems Topics in Software Testing Material drawn from [Beizer, Sommerville] Software Testing Software testing is a critical element of software quality assurance and represents the

More information

TSW Reliability and Fault Tolerance

TSW Reliability and Fault Tolerance TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software

More information

GOOFI : Generic Object-Oriented Fault Injection Tool

GOOFI : Generic Object-Oriented Fault Injection Tool GOOFI : Generic Object-Oriented Fault Injection Tool Joakim Aidemark, Jonny Vinter, Peter Folkesson, and Johan Karlsson Laboratory for Dependable Computing Department of Computer Engineering Chalmers University

More information

B.H. Far

B.H. Far SENG 637 Dependability, Reliability & Testing of Software Systems Defining i Necessary Reliability (Chapter 4) Department of Electrical & Computer Engineering, University of Calgary B.H. Far (far@ucalgary.ca)

More information

Detecting Structural Refactoring Conflicts Using Critical Pair Analysis

Detecting Structural Refactoring Conflicts Using Critical Pair Analysis SETra 2004 Preliminary Version Detecting Structural Refactoring Conflicts Using Critical Pair Analysis Tom Mens 1 Software Engineering Lab Université de Mons-Hainaut B-7000 Mons, Belgium Gabriele Taentzer

More information

On UML2.0 s Abandonment of the Actors-Call-Use-Cases Conjecture

On UML2.0 s Abandonment of the Actors-Call-Use-Cases Conjecture On UML2.0 s Abandonment of the Actors-Call-Use-Cases Conjecture Sadahiro Isoda Toyohashi University of Technology Toyohashi 441-8580, Japan isoda@tutkie.tut.ac.jp Abstract. UML2.0 recently made a correction

More information

Simulink/Stateflow. June 2008

Simulink/Stateflow. June 2008 Simulink/Stateflow Paul Caspi http://www-verimag.imag.fr/ Pieter Mosterman http://www.mathworks.com/ June 2008 1 Introduction Probably, the early designers of Simulink in the late eighties would have been

More information

Petri-net-based Workflow Management Software

Petri-net-based Workflow Management Software Petri-net-based Workflow Management Software W.M.P. van der Aalst Department of Mathematics and Computing Science, Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands,

More information

Computation of Multiple Node Disjoint Paths

Computation of Multiple Node Disjoint Paths Chapter 5 Computation of Multiple Node Disjoint Paths 5.1 Introduction In recent years, on demand routing protocols have attained more attention in mobile Ad Hoc networks as compared to other routing schemes

More information

ECE 60872/CS 590: Fault-Tolerant Computer System Design Software Fault Tolerance

ECE 60872/CS 590: Fault-Tolerant Computer System Design Software Fault Tolerance ECE : Fault-Tolerant Computer System Design Software Fault Tolerance Saurabh Bagchi School of Electrical & Computer Engineering Purdue University Some material based on ECE442 at the University of Illinois

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths Mario Schölzel Department of Computer Science Brandenburg University of Technology Cottbus, Germany

More information

An Automatic Test Case Generator for Testing Safety-Critical Software Systems

An Automatic Test Case Generator for Testing Safety-Critical Software Systems An Automatic Test Case Generator for Testing Safety-Critical Software Systems Mehdi Malekzadeh Faculty of Computer Science and IT University of Malaya Kuala Lumpur, Malaysia mehdi_malekzadeh@perdana.um.edu.my

More information

Introduction to Software Engineering

Introduction to Software Engineering Introduction to Software Engineering Gérald Monard Ecole GDR CORREL - April 16, 2013 www.monard.info Bibliography Software Engineering, 9th ed. (I. Sommerville, 2010, Pearson) Conduite de projets informatiques,

More information

A CAN-Based Architecture for Highly Reliable Communication Systems

A CAN-Based Architecture for Highly Reliable Communication Systems A CAN-Based Architecture for Highly Reliable Communication Systems H. Hilmer Prof. Dr.-Ing. H.-D. Kochs Gerhard-Mercator-Universität Duisburg, Germany E. Dittmar ABB Network Control and Protection, Ladenburg,

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical

More information

Framework for replica selection in fault-tolerant distributed systems

Framework for replica selection in fault-tolerant distributed systems Framework for replica selection in fault-tolerant distributed systems Daniel Popescu Computer Science Department University of Southern California Los Angeles, CA 90089-0781 {dpopescu}@usc.edu Abstract.

More information

Basic Concepts of Reliability

Basic Concepts of Reliability Basic Concepts of Reliability Reliability is a broad concept. It is applied whenever we expect something to behave in a certain way. Reliability is one of the metrics that are used to measure quality.

More information

M. Xie, G. Y. Hong and C. Wohlin, "A Practical Method for the Estimation of Software Reliability Growth in the Early Stage of Testing", Proceedings

M. Xie, G. Y. Hong and C. Wohlin, A Practical Method for the Estimation of Software Reliability Growth in the Early Stage of Testing, Proceedings M. Xie, G. Y. Hong and C. Wohlin, "A Practical Method for the Estimation of Software Reliability Growth in the Early Stage of Testing", Proceedings IEEE 7th International Symposium on Software Reliability

More information

Verification and Validation

Verification and Validation Lecturer: Sebastian Coope Ashton Building, Room G.18 E-mail: coopes@liverpool.ac.uk COMP 201 web-page: http://www.csc.liv.ac.uk/~coopes/comp201 Verification and Validation 1 Verification and Validation

More information

MODEL FOR DELAY FAULTS BASED UPON PATHS

MODEL FOR DELAY FAULTS BASED UPON PATHS MODEL FOR DELAY FAULTS BASED UPON PATHS Gordon L. Smith International Business Machines Corporation Dept. F60, Bldg. 706-2, P. 0. Box 39 Poughkeepsie, NY 12602 (914) 435-7988 Abstract Delay testing of

More information

Defect Tolerance in VLSI Circuits

Defect Tolerance in VLSI Circuits Defect Tolerance in VLSI Circuits Prof. Naga Kandasamy We will consider the following redundancy techniques to tolerate defects in VLSI circuits. Duplication with complementary logic (physical redundancy).

More information

Approaches to Software Based Fault Tolerance A Review

Approaches to Software Based Fault Tolerance A Review Computer Science Journal of Moldova, vol.13, no.3(39), 2005 Approaches to Software Based Fault Tolerance A Review Goutam Kumar Saha Abstract This paper presents a review work on various approaches to software

More information

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1

Verification and Validation. Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification and Validation Ian Sommerville 2004 Software Engineering, 7th edition. Chapter 22 Slide 1 Verification vs validation Verification: "Are we building the product right?. The software should

More information

Fault Tolerance Against Design Faults

Fault Tolerance Against Design Faults Fault Tolerance Against Design Faults Lorenzo Strigini Abstract Centre for Software Reliability, City University Northampton Square, London EC1V OHB, U.K. E-mail: strigini@csr.city.ac.uk This chapter surveys

More information