Course: Advanced Software Engineering, academic year 2011-2012
Lecture 14: Software Dependability
Lecturer: Vittorio Cortellessa, Computer Science Department, University of L'Aquila - Italy
vittorio.cortellessa@di.univaq.it - www.di.univaq.it/cortelle

Copyright Notice: The material in these slides may be freely reproduced and distributed, partially or totally, as long as an explicit reference or acknowledgement of the material's author is preserved.
BASIC CONCEPTS

Basic concepts: dependability summary
Reliability definition

(RELIABILITY) Probability of a system working within specs throughout an interval of time without system-level repair.

(RELIABILITY ON DEMAND) Probability of a system working within specs for a certain number of invocations without system-level repair.
Availability definition

Fraction of time that the system is up within specs.

Reliability terminology

» Fault - a defect that precludes the software from operating according to its specifications.
» Error - the value of the software state differs from the expected one.
» Failure - the actual software output (for some input) differs from the expected one.
Faults, Errors and Failures

Specification: a function that computes the remainder by 3 of the square of the input value, y = (s^2 mod 3).

program modthreeofsquare
begin
  read(s);
  s := 2*s;    (* fault: the specification squares s *)
  s := s mod 3;
  write(s);
end

» s=2: no error (2*2 = 2^2 = 4).
» s=3: error! (state is 6 instead of 9) - the fault has been activated.
» s=2, s=3: no failure (both outputs match the specification).
» s=4: failure! (output is 8 mod 3 = 2 instead of 16 mod 3 = 1).

Faults, Errors and Failures

A failure is usually the result of a system error that derives from a fault in the system. However:
» Faults do not necessarily result in system errors - a faulty system might never execute the faulty statement, so no error originates.
» Errors do not necessarily lead to system failures - the error can be corrected by built-in error detection and recovery, or it can be naturally masked from other system components (error propagation).
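The fault/error/failure distinction above can be checked mechanically. The following is a minimal sketch (function names are mine, not from the slides) that runs the faulty program against its specification and reports, for each input, whether an internal error occurred and whether it surfaced as a failure:

```python
def mod_three_of_square_faulty(s):
    """The slide's faulty program: doubles s where the spec squares it."""
    s = 2 * s          # FAULT: specification requires s = s * s
    return s % 3

def specification(s):
    """Reference behaviour: y = (s^2) mod 3."""
    return (s * s) % 3

for s in (2, 3, 4):
    # Error: the internal state after the faulty statement differs
    # from the state the specification implies.
    error = (2 * s) != (s * s)
    # Failure: the observable output differs from the specified one.
    failure = mod_three_of_square_faulty(s) != specification(s)
    print(f"s={s}: error={error}, failure={failure}")
```

Running it reproduces the slide's cases: s=2 gives no error, s=3 gives an error but no failure (the error is masked by the mod operation), and s=4 gives a failure.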
About the error propagation

[Figure: components C1, C2, ..., Cn, each with failure probability Ф(Ci), somehow interacting; each component's state may be correct or erroneous. The reliability of each component in isolation may not suffice to determine system reliability.]

About the error propagation

[Figure: within component i, an internal fault is activated into an error; the error propagates through component i until it reaches the component interface, turning correct service into incorrect service (component i failure); the propagated error enters component j as an input error, and may in turn propagate to the system interface, causing a (system) failure.]
Dependability achievement

» Fault avoidance - development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults.
» Fault detection and removal - verification and validation techniques that increase the probability of detecting and correcting faults before the system goes into service.
» Fault tolerance - run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures.

FAULT AND FAILURE TYPES
Fault classification

[Figure: classification including repeatable bugs, Heisenbugs, and Byzantine faults.]

» Repeatable (Bohr) bug: a repeatable bug; one that manifests reliably under a possibly unknown but well-defined set of conditions.
» Heisenbug: a bug that disappears or alters its behavior when one attempts to probe or isolate it. E.g., the use of a debugger sometimes alters a program's operating environment significantly enough that buggy code, such as code that relies on the values of uninitialized memory, behaves quite differently.
» Mandelbug: a bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic.
» Byzantine faults.
Failure classification: nature of the failure

» Hardware failure - hardware fails because of design and manufacturing errors, or because components have reached the end of their natural life.
» Software failure - software fails due to errors in its specification, design or implementation. Software failures are different from hardware failures in that software does not wear out: it can continue in operation even after an incorrect result has been produced.
» Operational failure - human operators make mistakes; now perhaps the largest single cause of system failures.

Failure classification: type of failure
Failure classification: severity of failure

Failure class    Description
Transient        Occurs only with certain inputs
Permanent        Occurs with all inputs
Recoverable      System can recover without operator intervention
Unrecoverable    Operator intervention needed to recover from failure
Non-corrupting   Failure does not corrupt system state or data
Corrupting       Failure corrupts system state or data

Reliability improvement

» Removing X% of the faults in a system will not necessarily improve the reliability by X%. A study at IBM showed that removing 60% of product defects resulted in a 3% improvement in reliability.
» Program defects may be in rarely executed sections of the code, so they may never be encountered by users. Removing these does not affect the perceived reliability.
» A program with known faults may therefore still be seen as reliable by its users.
Reliability specifications

» The required level of system reliability should be expressed quantitatively.
» Reliability is a dynamic system attribute (reliability specifications related to the source code are meaningless).
  - "No more than N faults/1000 lines" is only useful for a post-delivery process analysis, where you are trying to assess how good your development techniques are.
» An appropriate reliability metric should be chosen to specify the overall system reliability.

METRICS
Reliability metrics

» Reliability metrics are units of measurement of system reliability.
» System reliability is measured by counting the number of operational failures and, where appropriate, relating these to the demands made on the system and to the time that the system has been operational.

Dependability metrics

» POFOD (probability of failure on demand) - the likelihood that the system will fail when a service request is made. A POFOD of 0.001 means that 1 out of a thousand service requests may result in failure.
» ROCOF (rate of failure occurrence) - the frequency with which unexpected behaviour is likely to occur. A ROCOF of 2/100 means that 2 failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.
» MTTF (mean time to failure) - the average time between observed system failures. An MTTF of 500 means that 1 failure can be expected every 500 time units.
» AVAIL (availability) - the probability that the system is available for use at a given time. An availability of 0.998 means that in every 1000 time units, the system is likely to be available for 998 of these.
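Counting operational failures and relating them to demands and operational time, as described above, can be sketched directly. The failure log and totals below are illustrative numbers, not data from the slides:

```python
# Hypothetical observation period: 4 failures over 1000 operational
# time units, during which 5000 service requests were made.
failure_times = [120, 340, 610, 980]  # time units at which failures occurred
total_time = 1000                     # total observed operational time
requests = 5000                       # total service requests in the period

# POFOD: fraction of service requests that resulted in failure.
pofod = len(failure_times) / requests

# ROCOF: failures per operational time unit.
rocof = len(failure_times) / total_time

# MTTF: mean gap between successive failures (first gap measured from t=0).
gaps = [t1 - t0 for t0, t1 in zip([0] + failure_times, failure_times)]
mttf = sum(gaps) / len(gaps)

print(f"POFOD = {pofod}")  # 0.0008
print(f"ROCOF = {rocof}")  # 0.004
print(f"MTTF  = {mttf}")   # 245.0
```

Note that MTTF comes out close to 1/ROCOF (250), as expected for a stable system.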
Probability of failure on demand (POFOD)

» This is the probability that the system will fail when a service request is made. Useful when demands for service are intermittent and relatively infrequent.
» Appropriate for protection systems, where services are demanded occasionally and where there are serious consequences if the service is not delivered.

Rate of failure occurrence (ROCOF)

» Reflects the rate of occurrence of failure in the system.
» Relevant for operating systems and transaction processing systems, where the system has to process a large number of similar requests that are relatively frequent.
Mean time to failure (MTTF)

» Measure of the time between observed failures of the system. It is the reciprocal of ROCOF for stable systems.
» Relevant for systems with long transactions, i.e. where system processing takes a long time. MTTF should be longer than the transaction length.

Availability

» Measure of the fraction of the time that the system is available for use.
» Takes repair and restart time into account.
» Relevant for non-stop, continuously running systems.
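Since availability takes repair and restart time into account, it is commonly estimated with the steady-state formula AVAIL = MTTF / (MTTF + MTTR), where MTTR is the mean time to repair. This formula is a standard one, not stated explicitly in the slides; a minimal sketch:

```python
def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up,
    given mean time to failure (mttf) and mean time to repair (mttr),
    both in the same time units."""
    return mttf / (mttf + mttr)

# E.g. a failure every 500 time units and 1 time unit to repair/restart
# gives an availability of roughly 0.998 - matching the slides' example
# of a system available 998 out of every 1000 time units.
print(availability(500.0, 1.0))
```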