Design of Fault Tolerant Software


Andrea Bondavalli
CNUCE/CNR, via S.Maria 36, Pisa, Italy.

Abstract

In this paper we deal with structured software fault tolerance. Structured software fault tolerance comprises those techniques in which redundancy (both for detection and for correction) is applied to individual blocks of software with the goal of masking or revealing errors internal to the block. Each technique has its own way of structuring the interactions among redundant parts and of managing the added complexity. We discuss some of the open problems of software fault tolerance structures and the issues related to their effectiveness. In particular we address generality and flexibility, which can be improved at the price of adding complexity to the design, and discuss the need for proper trade-offs between generality and flexibility on the one hand and complexity on the other. The SCOP (Self-Configuring Optimal Programming) scheme [7, 24] constitutes an interesting example of such a trade-off. SCOP is a fault-tolerant system structuring method, originally intended to improve run-time efficiency, that is also very flexible and general. Together with the mechanisms used by SCOP for obtaining flexibility and generality, some recent research aimed at reducing the complexity is described.

1 Introduction

As computer systems are used in modern society for many critical applications, it is commonly recognised that it is necessary to improve their reliability and, more generally, their dependability. Since the early seventies it has also become apparent that obtaining software dependability (i.e., coping with design faults) constitutes a major problem. The development of dependable computing relies on the combined use of a large number of techniques that can be classified into fault tolerance and fault prevention. Fault prevention techniques aim at a product that is as free as possible, and likely to remain free, from internal defects (faults).
Fault tolerance techniques aim at tolerating, through redundancy, the effects of faults: they cope with those effects so as to avert the occurrence of failures, or at least to warn a user that errors have been introduced into the state of the system.

All the techniques known as 'software fault tolerance' are based on the concept of diversity of design [4] or of data [1] (the principle of "double checking one's results", already found in Babbage's work) and are characterised by an emphasis on structuring and systematicity to make these concepts applicable in practice. Design diversity is the approach in which two or more components (variants) of a system are produced through independent designs to deliver the same service. The major advantage of design diversity is that it does not require the complete absence of design faults, but only that they should not produce similar errors in the variants. Classical techniques for tolerating software faults are recovery blocks [18] and N-version programming [3], which can be seen as extreme organisations following the design diversity approach, together with other intermediate or combined techniques [13, 16, 19, 23].

Recovery blocks (RB) were the first scheme designed to provide software fault tolerance. In this approach, variants are named alternates and the main part of the adjudicator is an acceptance test that is applied sequentially to the results produced by the variants. The variants are usually executed serially on a single processor. The execution time of a recovery block is normally that of the first variant, the acceptance test, and the operations required to establish and discard a checkpoint. This does not impose a high run-time overhead unless an error is detected and backward recovery is required. In this regard, RB is highly efficient. Limitations of the RB method are mainly related to its acceptance test.
This test is usually derived from the semantics of a given application, and close dependency between the test and the variants may adversely affect the dependability of the whole system. Moreover, the development of simple, effective acceptance tests is a difficult task.

The N-version programming (NVP) approach avoids the use of an acceptance test by taking advantage of parallel execution of multiple versions and comparison of their results (although sequential execution is conceptually possible, just as parallel execution of RB alternates is possible). NVP is a direct application of the hardware N-modular redundancy (NMR) approach to software. Many adjudication mechanisms, usually based on result comparison and in most cases independent of the semantics of the application, are available and can be selected to determine a single adjudication result from the set, or a subset, of the results of the variants. Here the probability of common-mode failure between the adjudicator and the variants is relatively low. When variants are executed in parallel, NVP may have a fixed response time, thereby guaranteeing timely responses in the presence of faults. However, it uses redundancy in a static manner and always executes all the versions regardless of the normal or abnormal state of the system. The purpose is to tolerate the maximum number of faults that may be present in the system; but, since such a worst case rarely happens, the amount of resources consumed is often higher than necessary.

2 Open Problems in Designing Software Fault Tolerance

Academic research and practical applications of software fault tolerance have made much progress in clarifying the possibilities of such methodologies and the problems related to their application, such as complexity, which should be kept under control. At the same time, the advantages obtainable from individual software fault tolerance schemes are not clearly measurable. Design diversity still has difficulty in ensuring a routine improvement in software dependability; [9] discusses this in detail. The work on using simple retry of programs to mask the effects of the faults that cause transient errors [10, 11] seems to fit practical experience, but it is less complete and its effectiveness may be a matter of luck. Using data diversity to tolerate design faults in software systems [1] might provide a more cost-effective alternative, though such a technique is not generally applicable either.

In order to improve the effectiveness of software fault tolerance some problems need to be addressed. Among them are the high costs (both run-time overhead and design cost), the ability to evaluate the impact of software fault tolerance structures, and the usually very limited flexibility of software fault tolerance designs, with their consequent inability to adapt to changing run-time conditions. Some of these problems are typical of software fault tolerance, while others are also common to software-implemented fault tolerance, i.e. the tolerance of hardware faults performed by software.

The cost of developing variants and an adjudicator may be many times that of a single variant [15]. Some research activity is being undertaken with the objective of reducing the development cost of fault-tolerant software. The object-oriented programming paradigm has shown some possibilities through its inheritance and polymorphism mechanisms [25]. At run time, all the fault tolerance approaches require some extra space or extra time, or both. Note that efficient use of the available resources generally requires dynamic management and conditional execution of the software variants.
This should come with a dynamic trade-off between fully parallel execution and totally sequential execution of the variants. Dynamic redundancy for the purpose of a space-time trade-off is a classical idea, e.g. Duplicated Configuration with a Spare and NMR with Spares used in hardware [12]. However, the majority of software fault tolerance schemes do not provide such a dynamic space-time trade-off. In order to use redundancy in a dynamic or conditional manner, a scheme has to decide, at appropriate intermediate points of its execution, which of the following three execution states has been reached:

i) End-state E: a result exists that meets the required condition for delivery and can thus be delivered;
ii) Non end-state N: there is no result that meets the condition, but it is still possible to obtain such a result if further redundancy is employed; or
iii) Failure state F: there is no further possibility of producing a result that meets the condition.

Research to evaluate the impact of software fault tolerance structures has followed two main directions: experimental measurement and analytical estimation. Some of the experiments developed a software project, with procedures as close as possible to those that would be used if the proposed methodology were chosen for producing "real-world" software, and then tested it extensively. They have mainly provided useful insight into the problems of implementing the methodology, since very little statistical value can be given to the data collected. Other experiments were "statistically oriented": a number of variants of the same software were developed and then tested to obtain statistical data. Of course, the scope of applicability of such results is still limited. A common outcome was that good specifications are of paramount importance. The fundamental problem of coincident failures has been studied in some detail for NVP.
The experiment described in [14] has disproved (for one particular sample of variants, of course) the independence hypothesis, while PODS [5] has suggested that it may hold for small software modules. In PODS, coincident failures in two variants were not normally due to similar programming errors (faults) but rather to a fault-masking effect in Boolean decision logic, a well-known phenomenon in the study of combinatorial circuits. Different faults (bugs) appeared to produce failures with independent probabilities.

Analytical estimations of fault-tolerant software have been published in a number of papers, most recently [2, 6, 8, 15, 17, 20-22]. They differ in the models and analytical tools used and in the assumptions made. The main problem in using these models is the difficulty of estimating the values of the parameters, in particular the probabilities of errors common to redundant components. This information must be obtained experimentally, but we are still far from being able to determine it with acceptable confidence.

All the structuring methods for software fault tolerance are designed such that one single condition for delivering a result is embedded, explicitly or implicitly, in the adjudicator. This rigid design choice limits flexibility and prevents adaptation to variations in the run-time environment or in the application requirements. Note that different conditions for delivering a result in principle have different fault coverage, even though some conditions seem to be very similar. Take as an example a mission of a critical system in which two phases have been identified: normal and critical. In the critical phase an application could use a more severe delivery condition, ensuring very strict checks against the delivery of erroneous output; this would provide support for strong detection of abnormal conditions and help to trigger the safety mechanisms.
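The two-phase mission example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the delivery conditions are taken to be agreement thresholds over the variants' results, and the names `agreement`, `DELIVERY_CONDITIONS` and `adjudicate` are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence

def agreement(results: Sequence[Hashable], required: int) -> Optional[Hashable]:
    """Return the most frequent result if at least `required` variants produced it."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= required else None

# Hypothetical delivery conditions: the critical phase demands more agreement.
DELIVERY_CONDITIONS = {"normal": 2, "critical": 3}

def adjudicate(results: Sequence[Hashable], mission_phase: str) -> Optional[Hashable]:
    """Apply the delivery condition selected for the current mission phase."""
    return agreement(results, DELIVERY_CONDITIONS[mission_phase])

results = [7, 7, 8]                      # outputs of three variants, one of them deviating
print(adjudicate(results, "normal"))     # two agreeing results are enough: 7 is delivered
print(adjudicate(results, "critical"))   # three are required: nothing is delivered (None)
```

The same set of results is delivered under the weaker condition and withheld under the stricter one, which is the sense in which different conditions have different fault coverage.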
Obviously, the ability to adapt to variations in the run-time environment or in the application requirements calls for decisions to be taken dynamically. The price to be paid for this ability is clearly additional complexity, which must be controlled and limited, as it may itself be a source of errors and so defeat the method.

3 SCOP

Among the existing problems in the area of software fault tolerance, we focus our attention here on generality and flexibility. These features may of course be improved, but only by designing more complex structures. Here we discuss the case of SCOP [7, 24]. It was originally proposed with the primary aim of improving run-time effectiveness, and to this purpose it applies conditional redundancy even when comparison-based adjudicators are used. On reflection, however, SCOP can be recognised as a general dynamic scheme that addresses both flexibility and run-time efficiency for tolerating either software or hardware faults. The question that remains to be answered is whether the increased complexity is justified and the trade-off represented by SCOP is satisfactory. In this respect the critical features of SCOP, namely the inherent complexity of the control algorithm and the need for mechanisms to support implementation, are discussed, and directions towards reducing complexity are identified.

3.1 Basic Description

The SCOP scheme consists of a set of software components, $V = \{v_1, v_2, \dots, v_n\}$, an adjudication mechanism, a set of delivery conditions, one of which is chosen dynamically at run time, and a controller that coordinates the dynamic actions of the architecture. At run time an instance of SCOP accepts as additional parameters the selected delivery condition and, possibly, a deadline for the whole execution. First the controller decides how many phases can be performed (in order to provide a timely result); then it selects the (minimum) set of components that, if successful, could satisfy the selected delivery condition. After execution of this set of components the adjudicator verifies whether the chosen delivery condition has been met (using a syndrome that may grow as more phases are performed). This behaviour is repeated until a result can be delivered or the software components are exhausted. The behaviour of SCOP can be described more precisely by the following control algorithm, with comments on the right side.
    begin
      i := 0;                                  {index of the current phase, set to 0}
      State_mark := N;                         {set current state as non end-state}
      S_i := {};                               {set syndrome as empty}
      C := one of { delivery conditions };     {set required delivery condition}
      decide(max_phase);                       {based on time constraints}
      while State_mark = N and i < max_phase do
                                               {while current state is non end-state and
                                                current phase < maximum allowed}
      begin
        i := i+1;                              {start new phase}
        configure(C, S_(i-1), i, V_i);         {set new Currently Active Set}
        execute(V_i, S_i);                     {execute and obtain new syndrome}
        adjudicate(C, S_i, State_mark, res);   {set new state mark and select result}
      end;
      if State_mark = E then deliver(res)      {current state is end-state or failure state?}
      else signal(failure);
    end

The decide procedure determines the maximum number max_phase of phases permitted by the specified timing constraints. Procedure configure constructs the currently active set (CAS) $V_1$ in phase one according to the selected delivery condition and the given application environment, and establishes the CAS $V_i$ ($i > 1$) based on the syndrome $S_{i-1}$ collected in the $(i-1)$th phase and on the information on phases. The execution of a CAS may lead to a successful state E. Note that the software components in $V_i$ are selected from those that have not been used in any of the previous phases, i.e. $V_i$ is a subset of $V - (V_1 \cup V_2 \cup \dots \cup V_{i-1})$. If the $i$th phase is the last, $V_i$ contains all the remaining spare software components.

The execute procedure manages the execution of the software components in the CAS and generates the syndrome $S_i$, where $S_0$ is an empty set and $S_{i-1}$ is a subset of $S_i$. Procedure adjudicate implements the adjudication function using the selected condition C. It receives the syndrome $S_i$, sets the new State_mark and selects the result res, if one exists. The deliver procedure delivers the selected result, and the signal procedure produces a failure notification.
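The phase loop above translates almost line by line into executable form. The following is a minimal sketch rather than the authors' implementation: the Python names (`scop`, `one_per_phase`, `acceptance_test`, `is_isqrt_of_16`) are hypothetical, `max_phase` is passed in instead of being computed by a decide step, the syndrome is simplified to the list of results produced so far, and variants are assumed to return a value rather than crash. The example instance runs one variant per phase and checks results with an acceptance test, which gives recovery-block-like behaviour; it does not implement the rule that the last phase uses all remaining spares.

```python
import math
from typing import Any, List

E, N, F = "E", "N", "F"  # end-state, non end-state, failure state

def scop(variants, x, condition, max_phase, configure, adjudicate):
    """Run the phase loop of the control algorithm for one input x."""
    unused = list(variants)      # components not yet executed
    syndrome: List[Any] = []     # S_0 is empty
    state, res, i = N, None, 0
    while state == N and i < max_phase:
        i += 1                                                      # start new phase
        cas = configure(condition, syndrome, i, max_phase, unused)  # V_i
        unused = [v for v in unused if v not in cas]                # never reuse a component
        syndrome = syndrome + [v(x) for v in cas]                   # S_(i-1) is contained in S_i
        state, res = adjudicate(condition, syndrome, unused)
    if state == E:
        return res                                                  # deliver(res)
    raise RuntimeError("no result meets the delivery condition")    # signal(failure)

# Hypothetical instance: one variant per phase, checked by an acceptance test.
def one_per_phase(condition, syndrome, phase, max_phase, unused):
    return unused[:1]

def acceptance_test(condition, syndrome, unused):
    for r in syndrome:
        if condition(r):
            return E, r
    return (N, None) if unused else (F, None)

def is_isqrt_of_16(r):
    """Acceptance test: r is the integer square root of 16."""
    return r * r <= 16 < (r + 1) ** 2

faulty = lambda x: x // 2    # deliberately wrong variant
correct = math.isqrt         # correct variant

# Phase 1 runs the faulty variant and is rejected; phase 2 runs the correct one.
print(scop([faulty, correct], 16, is_isqrt_of_16, 2, one_per_phase, acceptance_test))
```

Passing `configure` and `adjudicate` as parameters keeps the controller itself independent of the way redundancy is used, so a comparison-based instance only needs a different pair of functions.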
3.2 SCOP characteristics

Given the difficulties mentioned above in demonstrating the general effectiveness of design diversity for attaining software dependability, an instance of SCOP may, in order to tolerate software faults (and some hardware-related faults), employ software components according to an application-specific approach for masking the effect of faults. One such approach is obviously multiple versions of software, but diversity in data space, simple retry of programs or multiple replicas could also be used, depending largely on specific application requirements and considerations of cost-effectiveness.

An instance of SCOP can be designed to obey multiple different delivery conditions. One of them is chosen dynamically at run time, and the selected condition may change between executions, according to the degradation of the system or to the current phase of the mission, as discussed previously. In addition, if SCOP is used to provide a service used by many different applications, different delivery conditions may be chosen dynamically by the different applications, according to their degrees of criticality. Since the different conditions will usually have different fault coverage, SCOP is able to provide different levels of dependability.

SCOP is very efficient and makes dynamic use of redundancy: it always tries to execute the minimum number of software components strictly necessary for providing a result that meets the stated delivery condition. To do this it organises the execution of components in phases, dynamically configuring a currently active set (CAS) $V_i$, a subset of $V$, at the beginning of the $i$th phase. An adjudication is made after the execution of $V_i$ in order to check whether the conditions for the release of a result are satisfied. Once these conditions are met, the result is output immediately and any further phases and actions are abandoned.
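As a worked illustration of executing only the components strictly necessary, the sketch below computes, for a comparison-based delivery condition, the current state (E, N or F) and the smallest CAS that could still satisfy the condition. It assumes the condition has the form "at least `required` identical results" and that every newly executed component might agree with the largest agreeing group; the function name `next_step` and this form of condition are illustrative assumptions, not part of the scheme's definition.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

def next_step(syndrome: Sequence[Hashable], required: int, spares: int) -> Tuple[str, int]:
    """Return (state, minimum size of the next CAS) for an agreement-based condition.

    syndrome: results collected so far
    required: number of identical results the delivery condition demands
    spares:   number of components not yet executed
    """
    best = max(Counter(syndrome).values(), default=0)  # largest agreeing group so far
    if best >= required:
        return "E", 0            # a deliverable result already exists
    missing = required - best    # fewest extra results that could complete the agreement
    if missing > spares:
        return "F", 0            # even if all spares agreed, the condition cannot be met
    return "N", missing          # run exactly this many components in the next phase

print(next_step([], 2, 3))         # phase one: two components are enough to start with
print(next_step([5, 6], 2, 1))     # disagreement: one more component may settle it
print(next_step([5, 6, 7], 2, 0))  # no agreement and no spares left: failure state
```

With three components and a two-out-of-three condition, a fault-free run therefore executes only two components, and the third is used only when the first two disagree.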
The initial CAS $V_1$ in phase one can be determined and flexibly changed with respect to different delivery conditions. Whenever it is recognised as necessary according to the selected delivery condition, the syndrome (the set of information used by an adjudicator to judge the correctness of a result) is accumulated in SCOP as phases proceed. All the results produced and the additional information collected so far are employed to support the selection of a correct result.

The architecture is very general, allowing several approaches for masking the effect of faults to be combined with different delivery conditions. For example, combining the design diversity approach with an acceptance test yields recovery block behaviour, while pure replication and a majority voter (with the selection of one phase only) can be used to design an instance of NMR. In this way the alternative best suited to the specific application can be specified and designed. The mechanisms for providing flexibility (the concept of delivery condition) and efficiency (adaptive redundancy management) are basically independent of the form of redundancy used for masking software faults, but must rely on the highly dynamic and complex control algorithm described.

3.3 Complexity

The complexity of the dynamic control algorithm and of the adjudication mechanism for SCOP could itself be a source of errors and thus defeat the proposed scheme. These components are very general and can be re-used for all the instances of the scheme. Thus it appears feasible to verify them formally and prove their correctness. There is then the need to develop a methodology for designing SCOP components, ideally supported by automatic tools. Given the application requirements to be fulfilled, the design of a specific SCOP instance implies the proper selection of the fault masking approach, of the degree of redundancy and of the delivery condition(s). The appropriateness of the synthesised SCOP instance can then be verified and evaluated.
The information specifying the desired behaviour of the control algorithm and of the adjudication mechanism for this instance can be made read-only and recorded on stable storage if necessary. In this way, the complexity of on-line control is greatly reduced: the adjudication mechanism and control algorithm of SCOP can take run-time actions by just monitoring the execution and reading the proper information, without performing complex computation.

3.4 Supporting Mechanism

To support the SCOP methodology several OS-implemented mechanisms are necessary; they are basically the same as those used to implement NVP and RB. Control of the components may be provided by a controller (similar to the driver program used in N-version programming). The controller is responsible for:

1) a synchronisation mechanism;
2) a mechanism for ensuring an identical set of input values, or the proper data representation, as the case may be, to each component;
3) a mechanism for dynamically invoking an appropriate subset of variants;
4) support for the specialisation of possible types of adjudication (based on both application semantics and syntax);
5) support for logical and/or physical reconfiguration, if needed.

This set of OS-implemented mechanisms may be seen as constituting a generic SCOP run-time support. Our preliminary analysis indicates that implementing this support should not introduce serious technical difficulties in comparison with the classical approaches such as NVP and RB. We are currently implementing a prototype of the SCOP scheme in an experimental C++ testbed (a local network environment that consists of a number of Sun-3 and Sun-4 workstations). Preliminary experimental data are promising and have given us additional confidence for adopting SCOP in practical systems.

4 Conclusions

In this paper we have dealt with some of the open problems that need to be addressed to improve the effectiveness of software fault tolerance structures.
Besides pointing out the high costs (both run-time overhead and design cost) and the difficulty of evaluating the impact of software fault tolerance structures, we focused on the lack of generality and the usually very limited flexibility of software fault tolerance designs, with their consequent inability to adapt to changing run-time conditions. These characteristics can be improved at the price of adding complexity to the design, as exemplified by SCOP; thus proper trade-offs are required.

SCOP is a fault-tolerant system structuring method based on conditional usage of the available redundant components. It allows several approaches for masking the effect of faults to be combined with different, even multiple, delivery conditions. It is straightforward to design specific instances of SCOP that behave as a recovery block, as NVP, or as software-implemented techniques for tolerating hardware faults. This generality makes it possible to specify the best alternative for a specific application and to reduce the supporting mechanisms that the kernel/OS must provide to a common set.

The possibility in SCOP of defining multiple delivery conditions to be used by the same instance represents a significant novelty in software fault tolerance, in that it allows a great degree of flexibility and generality. Different applications with different criticality may thus use the same basic service equipped with multiple delivery conditions. In this way SCOP supports different integrity levels. Another case that is easily managed is that of a single application with a variable degree of criticality in different phases of a mission. A SCOP instance, which usually provides services with a given degree of fault tolerance, may dynamically be used to provide support for safety.
It is enough to switch to a delivery condition enforcing very strict checks against the delivery of erroneous outputs: the behaviour obtained consists of very strong detection of abnormal conditions, which helps to trigger the safety mechanisms. In both cases, this flexibility allows the scheme to manage easily executions that must occur in a degraded system, where fewer resources than usual are available because of faults and before a repair or replacement action can take place.

All these attractive characteristics are obtained at the price of added complexity, which must be controlled and limited since it may itself be a source of errors. To understand whether the increased complexity is justified and the trade-off represented by SCOP is satisfactory, the different sources of additional complexity of SCOP compared with the other schemes have been pointed out, and directions towards reducing complexity have been identified.

References

[1] P. E. Ammann and J. C. Knight, "Data diversity: an approach to software fault tolerance," IEEE Trans. Comput., Vol. 37, pp.
[2] J. Arlat, K. Kanoun and J. C. Laprie, "Dependability modelling and evaluation of software fault tolerant systems," IEEE Trans. Comput., Vol. 39, pp.
[3] A. Avizienis and L. Chen, "On the implementation of N-version-programming for software fault-tolerance during execution," in Proc. Int. Conf. Comput. Soft. and Appli., New York, 1977, pp.
[4] A. Avizienis and J. C. Laprie, "Dependable Computing: from Concepts to Design Diversity," Proc. of the IEEE, Vol. 74, pp.
[5] P. G. Bishop and F. D. Pullen, "PODS Revisited - A Study of Software Failure Behaviour," in Proc. 18th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-18), Tokyo, Japan, 1988, pp.
[6] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico and L. Strigini, "Dependability Analysis of Iterative Fault Tolerant Software Considering Correlation," in "Predictably Dependable Computing Systems", B. Randell, J. C. Laprie, H. Kopetz and B. Littlewood Ed., Springer-Verlag, 1995, pp.
[7] A. Bondavalli, F. Di Giandomenico and J. Xu, "A Cost-Effective and Flexible Scheme for Software Fault Tolerance," Journal of Computer Systems Science and Engineering, Vol. 8, pp.
[8] S. Chiaradonna, A. Bondavalli and L. Strigini, "On Performability Modeling and Evaluation of Software Fault Tolerance Structures," in Proc. 1st European Dependable Computing Conference (EDCC-1), Berlin, Germany, 1994, pp.
[9] D. E. Eckhardt, A. K. Caglayan, J. C. Knight, L. D. Lee, D. F. McAllister, M. A. Vouk and J. P. J. Kelly, "An Experimental Evaluation of Software Redundancy as a Strategy for Improving Reliability," IEEE Trans. Soft. Eng., Vol. 17, pp.
[10] J. Gray and A. Reuter, "Transaction Processing: Concepts and Techniques," Morgan Kaufmann.
[11] Y. Huang and C.M.R. Kintala, "Software implemented fault tolerance: Technologies and experience," in Proc. 23rd Int. Symp. Fault Tolerant Comput. (FTCS-23), Toulouse, 1993, pp.
[12] B. W. Johnson, "Design and Analysis of Fault Tolerant Digital Systems," Addison-Wesley Pub. Co.
[13] K. H. Kim, "Distributed execution of recovery blocks: an approach to uniform treatment of hardware and software faults," in Proc. 4th Int. Conf. Distributed Comput. Sys., 1984, pp.
[14] J. C. Knight and N. G. Leveson, "An Experimental Evaluation of the Assumption of Independence in Multiversion Programming," IEEE Trans. Soft. Eng., Vol. SE-12, pp.
[15] J. C. Laprie, J. Arlat, C. Beounes and K. Kanoun, "Definition and Analysis of Hardware and Software Fault-Tolerant Architecture," IEEE Computer, Vol. 23, pp.
[16] J. C. Laprie, J. Arlat, C. Beounes, K. Kanoun and C. Hourtolle, "Hardware and Software Fault Tolerance: Definition and Analysis of Architectural Solutions," in Proc. 17th Int. Symp. Fault-Tolerant Comput., Pittsburgh, 1987, pp.
[17] M. R. Lyu and Y. He, "Improving the N-Version Programming Process Through the Evolution of a Design Paradigm," IEEE Transactions on Reliability, Special Issue on Fault-Tolerant Software, Vol. R-42, pp.
[18] B. Randell, "System Structure for Software Fault Tolerance," IEEE Trans. Soft. Eng., Vol. SE-1, pp.
[19] R. K. Scott, J. W. Gault and D. F. McAllister, "Fault tolerant software reliability modeling," IEEE Trans. Soft. Eng., Vol. SE-13, pp.
[20] A. Tai, A. Avizienis and J. Meyer, "Evaluation of fault-tolerant software: a performability modeling approach," in "Dependable Computing for Critical Applications 3", C. E. Landwehr, B. Randell and L. Simoncini Ed., Springer-Verlag, 1993, pp.
[21] A. T. Tai, "Performability-Driven Adaptive Fault Tolerance," in Proc. 24th IEEE Int. Symp. on Fault-Tolerant Computing (FTCS-24), Austin, Texas, 1994, pp.
[22] A. T. Tai, A. Avizienis and J. F. Meyer, "Performability Enhancement of Fault-Tolerant Software," IEEE Transactions on Reliability, Special Issue on Fault-Tolerant Software, Vol. R-42, pp.
[23] J. Xu, "The t/(n-1)-diagnosability and Its Applications to Fault Tolerance," in Proc. 21st Int. Symp. Fault-Tolerant Comput., Montreal, 1991, pp.
[24] J. Xu, A. Bondavalli and F. Di Giandomenico, "Dynamic Adjustment of Dependability and Efficiency in Fault-Tolerant Software," in "Predictably Dependable Computing Systems", B. Randell, J. C. Laprie, H. Kopetz and B. Littlewood Ed., Springer-Verlag, 1995, pp.
[25] J. Xu, B. Randell, C.M.F. Rubira-Calsavara and R.J. Stroud, "Toward an Object-Oriented Approach to Software Fault Tolerance," in "Fault-Tolerant Parallel and Distributed Systems", D. R. Avresky Ed., IEEE Computer Society Press, 1994, pp.
