B.H. Far


1 SENG 637 Dependability, Reliability & Testing of Software Systems
Defining Necessary Reliability (Chapter 4)
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far (far@ucalgary.ca)

2 Contents
- Steps in defining necessary reliability
- Failure severity class (FSC)
- Failure intensity objective (FIO)
- Strategies to meet FIO
- Software fault tolerance

3 SRE: Process /1
The SRE process has 5 steps:
1. Define necessary reliability
2. Develop operational profiles
3. Prepare for test
4. Execute test
5. Apply failure data to guide decisions
[Figure: SRE process flow through the five steps, with a Fault Tolerance Computing box]

4 Chapter 4, Section 1: How to Define Necessary Reliability?

5 Reliability and Risk
- Necessary reliability depends on the risk: higher-risk software requires higher reliability.
- Necessary reliability also depends on profitability, budget, manpower, etc.
Q. "What are you going to test?"
A. "The most important things."
Q. "And how do you know what the most important things are?"
Reference: Marnie L. Hutcheson, Software Testing Fundamentals: Methods and Metrics, ISBN: X, John Wiley & Sons, 2003 (408 pages), Chapter

6 Necessary Reliability: How to
1) Define failure with failure severity classes (FSC) for the product.
2) Set a failure intensity objective (FIO) for each system to be tested.
3) Choose a common scale for all associated systems.
4) Find the developed software failure intensity objective.
5) Engineer strategies to meet the software failure intensity objective.

7 1. Failure Severity Classes
- Failures usually differ in their impact on the system.
- A failure severity class (FSC) is a set of failures that have the same per-failure impact on users under a given failure classification criterion.
- Common classification criteria: cost, system capability, human life, environment.
- Note: there are other rankings, such as MIT's ranks.
- A failure's severity is different from its complexity.
- Severity can change with the time of failure and can be subjective.

8 FSC: Common Classification
Common classification criterion: Cost
- What does this failure cost in terms of operational cost, repair cost, loss of business, disruption, etc.?
- Severity classes based on cost may be scaled by a factor of 10. Usually 4 ranges are enough.

Severity class   Definition ($)
1                > 100,000
2                10,000 to 100,000
3                1,000 to 10,000
4                < 1,000

9 FSC: Common Classification
Common classification criterion: System capability (Services)
- May include factors such as loss of data, downtime, recoverability, etc.

Severity class   Definition
1                Basic service interruption
2                Basic service degradation
3                Inconvenience, correction not deferrable
4                Minor tolerable effects, correction deferrable

10 FSC: Common Classification
Common classification criterion: Environment
- May include factors such as harm to the environment, loss of wildlife, etc.
- Applicable to the nuclear and chemical industries, etc.

Severity class   Definition
1                Severe and unrecoverable damage to environment and/or wildlife
2                Severe but partially recoverable damage to environment
3                Minor damage to environment or wildlife
4                Minor but recoverable deficiencies

11 FSC: Common Classification
Common classification criterion: Human life
- May include factors such as harm to humans or the environment, loss of human life, etc.
- Applicable to aeronautical, automotive, nuclear, health care, and military systems, etc.

Severity class   Definition
1                Possible loss of human life
2                Severe damage to human immune system or environment
3                Minor damage to human immune system or environment
4                Minor but recoverable deficiencies

12 How to Define FSC?
- Experience based: ask users, stakeholders, and developers; compare to similar products; use fault tree analysis (FTA) and/or failure mode and effects analysis (FMEA).
- List all factors that may be considered as failure severity for the project.
- Narrow the list down to the most critical and/or measurable ones.
- Some factors may be hard to measure, such as impact on company reputation.

13 FSC: Conflicting Concerns
- Conflicting viewpoints (concerns) between the software developer and the customer regarding failure severity classes (FSC) should be resolved before setting the target failure intensity objective.
- Comparing the FSC for the software with those of a similar product is usually useful.

14 Documenting FSC
User profile:
Failures: ordered list, starting with the most severe ones.

Classification criterion (type or concern)   Class 1   Class 2   Class 3   Class 4
Cost
System capability (Services)
Human life
Environment
Other (specify)

Define classes for each criterion separately.

15 2. Failure Intensity Objective (FIO)
- The failure intensity objective (FIO) reflects an estimate of the bugs allowed to remain in the product at release time.
- FIO is an alternative way of expressing system reliability.

16 Failure Intensity Objective
- Failure intensity is usually given as the number of failures per unit of time (or some other defined unit), e.g., 3 alarms per 100 hours of operation, or 5 failures per 1000 print jobs.
- The failure intensity of a system is the sum of the failure intensities of all its components (assuming no system redundancy and the exponential model).

17 How to Set FIO /1
- Mainly experience based and depends on the project.
- Depends on the trade-off among quality characteristics (development time and development cost), functionality, and technology.
- Rule of thumb: estimate the project's total cost (C), e.g., using COCOMO's Early Design Model, and set FIO to 1 over C (i.e., 1 failure per C units of operation), assuming that the cost of the highest-impact failure is roughly equal to the total development cost.

18 How to Set FIO /2
Typical FIO for various projects:

Failure impact                  Typical FIO (λ)              Time between failures (MTTF)
More than $1,000,000,000 cost   1 per 1,000,000,000 hours    114,000 years
More than $1,000,000 cost       1 per 1,000,000 hours        114 years
Around $1,000 cost              1 per 1,000 hours            6 weeks
Around $100 cost                1 per 100 hours              100 h
Around $10 cost                 1 per 10 hours               10 h
Around $1 cost                  1 per hour                   1 h

19 How to Set FIO: Reliability
Setting FIO in terms of reliability:
$\lambda = -\frac{\ln R}{t}$, or $\lambda \approx \frac{1 - R}{t}$ for $R \geq 0.95$
- $\lambda$ is failure intensity
- $R$ is reliability
- $t$ is the number of natural units (time, etc.)
For reliability around [...] for 8 hours of operation, $\lambda$ is set to [...]
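The two conversions on this slide can be sketched in a few lines of Python. This is only an illustration under the exponential model: the function names are hypothetical, and the inputs (R = 0.99 over an 8-hour mission) are assumed values chosen for the demonstration, not the figures from the slide.

```python
import math


def failure_intensity_from_reliability(reliability: float, mission_time: float) -> float:
    """Exact conversion for the exponential model: lambda = -ln(R) / t."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    if mission_time <= 0.0:
        raise ValueError("mission_time must be positive")
    return -math.log(reliability) / mission_time


def approx_failure_intensity(reliability: float, mission_time: float) -> float:
    """Linear approximation lambda ~ (1 - R) / t, reasonable when R is close to 1."""
    return (1.0 - reliability) / mission_time


# Assumed illustrative inputs: R = 0.99 for an 8-hour mission.
exact = failure_intensity_from_reliability(0.99, 8.0)
approx = approx_failure_intensity(0.99, 8.0)
print(f"exact  = {exact:.6f} failures/hour")
print(f"approx = {approx:.6f} failures/hour")
```

For reliabilities this close to 1 the two results differ by well under one percent, which is why the simpler form is often good enough for setting an objective.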

20 Reliability & Failure Intensity
[Table: reliability for a 1-hour mission time versus failure intensity, with intensities expressed in failures per hour, per day, per week, per month, per year, and per 1000 hours; the numeric entries are missing.]
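A table of this kind can be regenerated from the exponential model, R = exp(-λt). The sketch below is not the original table: the chosen intensities (one failure per hour, day, week, month, year) and the 730-hour month are assumptions made for illustration, and the function name is hypothetical.

```python
import math

# Assumed conversion factors; a month is taken as 8760 / 12 = 730 hours.
HOURS_PER = {"hour": 1.0, "day": 24.0, "week": 168.0, "month": 730.0, "year": 8760.0}


def reliability_for_mission(failures: float, per_hours: float, mission_hours: float = 1.0) -> float:
    """Reliability over a mission, given `failures` per `per_hours` hours (exponential model)."""
    failure_intensity = failures / per_hours
    return math.exp(-failure_intensity * mission_hours)


for unit, hours in HOURS_PER.items():
    r = reliability_for_mission(1, hours)
    print(f"1 failure/{unit:<5} -> R(1 h) = {r:.5f}")
```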

21 How to Set FIO: Availability
Setting FIO in terms of system availability (A) for the exponential model:
$A = \frac{1}{1 + \lambda t_m}$, or $\lambda = \frac{1 - A}{A\, t_m}$
- $\lambda$ is failure intensity
- $t_m$ is downtime per failure
E.g., if a product must be available 99% of the time and downtime is 6 min per failure, then FIO is about 1 failure per 10 hours.

22 Example
Suppose we want 99 percent availability of a human-machine team. Assume that a service interruption requires an average recovery time of 14 minutes for the person involved, since he/she must refresh his/her memory before restarting. Assume the average machine downtime at each failure is 1 minute. The total downtime is 15 minutes (0.25 hr).
$\lambda_F = \frac{1 - 0.99}{0.99 \times 0.25} = \frac{0.01}{0.2475} \approx 0.04$ failures/hr, or approximately 4 failures per 100 hr.
Example from Musa's book.
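The availability example can be checked with a short Python sketch. The function names are hypothetical; the formula is the one for the exponential model, with downtime expressed in hours.

```python
def fio_from_availability(availability: float, downtime_hours: float) -> float:
    """Failure intensity objective: lambda = (1 - A) / (A * t_m)."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    if downtime_hours <= 0.0:
        raise ValueError("downtime_hours must be positive")
    return (1.0 - availability) / (availability * downtime_hours)


def availability_from_fio(failure_intensity: float, downtime_hours: float) -> float:
    """Inverse relation: A = 1 / (1 + lambda * t_m)."""
    return 1.0 / (1.0 + failure_intensity * downtime_hours)


# Human-machine team: 14 min human recovery + 1 min machine downtime = 0.25 h.
team_fio = fio_from_availability(0.99, 15 / 60)
print(f"{team_fio * 100:.2f} failures per 100 hours")

# Product with 6 min downtime per failure and 99% availability.
product_fio = fio_from_availability(0.99, 6 / 60)
print(f"about 1 failure per {1 / product_fio:.1f} hours")
```

Feeding the resulting intensity back through `availability_from_fio` returns the original availability, which is a quick sanity check on any hand calculation.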

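23 How to Set FIO: MTTF
Using MTTF:
$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}}$, with $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$ and $\lambda = \frac{1}{\mathrm{MTTF}}$
- $\lambda$: failure intensity
- MTTR: mean time to repair
- MTTF: mean time to failure
Another definition of availability:
$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{1}{1 + \mathrm{MTTR} / \mathrm{MTTF}} = \frac{1}{1 + \lambda \cdot \mathrm{MTTR}}$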

24 How to Set FIO: Hazard Rate
- Hazard rate z(t): the probability that the component will fail in a given time interval, given that it has not failed prior to the interval.
- A hazard rate of 0.05 means that there is a 5% chance that the first failure will occur in the specified time interval and not before.
- For the exponential distribution, z(t) is constant: $z(t) = \lambda$.

25 How to Set FIO: Profitability
- Analyze experience with previous or similar systems: compare field measurements of major quality characteristics, and the degree of user satisfaction with them, against similar measurements for a previous release or a similar product.
- Compare trade-off trends between profitability and failure intensity.

26 Example
Tip: select a range that leads to the highest profit margin.
Example from Musa's book.

27 Reliability vs. Availability
- Why specify reliability when availability is better understood and has better intuitive appeal?
- Availability has a subjective appeal to the user, and there are usually workarounds that make the system available without increasing its intrinsic reliability.
- Example: using a replica server in case the domain server goes down increases the availability of the system, but it does not necessarily increase the reliability of the server software.

28 Developed Software Product
- The developed software product is usually only a part of the whole system.
- Example: a stand-alone system consists of interfaces to other systems, acquired components, developed components, OS and system software, and hardware.

29 3. Choose a Common Scale
- There may be various scales for expressing FIO for various project parts.
- Example:
  - System failure intensity objective = 30 failures/1,000,000 transactions
  - MTTF for the OS is 3,000 hours for 10 million transactions
  - Hardware failure intensity is 1 failure per 30 hours of operation
- One must define a single common scale for all FIOs.

30 FIO for Developed Product
How to compute the failure intensity objective for the developed software?
1. Set the FIO for the whole system.
2. Set a common measurement unit for failure intensity for the whole system.
3. Subtract the expected failure intensity of acquired components from the FIO.
4. Subtract the expected failure intensity of the environment (OS, interface systems) that the developed software will run on.
5. The remainder is the failure intensity objective for the developed software components.

31 Computing Developed FIO
Example 1:
- System failure intensity objective = 100 failures/1,000,000 transactions
- Failure intensity for hardware = 0.1 failure/hour
- OS failure intensity for a load of 100,000 transactions/hour = 0.4 failure/hour
- At that load, hardware contributes 1 and the OS contributes 4 failures/1,000,000 transactions.
- Therefore, developed software FIO = 100 - 1 - 4 = 95 failures/1,000,000 transactions

32 Computing Developed FIO
Example 2: Database system running on Win 2K
- System failure intensity objective = 30 failures/1,000,000 transactions
- MTTF for Win 2K is around 3,000 hours for 10 million transactions
- Average hardware failure intensity is 1 failure per 30 hours
- Failure rate for other systems is 9 failures per one million transactions
What is the FIO for the developed software?

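33 Computing Developed FIO
$\lambda_{os} = \frac{1}{\mathrm{MTTF}} = \frac{1}{3000}$ per hour, i.e., 1 failure for 10,000,000 transactions
$\lambda_{hardware} = \frac{1}{30} = \frac{100}{3000}$ per hour, i.e., 100 failures for 10,000,000 transactions
$\lambda_{other}$ = 90 failures for 10,000,000 transactions
$\lambda_{os} + \lambda_{hardware} + \lambda_{other}$ = 191 failures for 10,000,000 transactions
$\lambda_{total} = \lambda_F$ = 300 failures for 10,000,000 transactions
Therefore $\lambda_{developed\_software} = 300 - 191 = 109$ failures for 10,000,000 transactions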
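The subtraction for the Win 2K database example can be sketched in Python as follows. The helper name `per_transactions` and the variable names are hypothetical; the only modelling assumption is the one implied by the example, namely that 3,000 hours of operation correspond to 10 million transactions.

```python
def per_transactions(failures_per_hour: float, transactions_per_hour: float, scale: float) -> float:
    """Convert a failures/hour intensity into failures per `scale` transactions."""
    return failures_per_hour / transactions_per_hour * scale


SCALE = 10_000_000                 # common scale: failures per 10 million transactions
load = 10_000_000 / 3000           # transactions per hour implied by the example

system_fio = 30 * SCALE / 1_000_000              # 30 per million -> 300 per 10 million
os_fi = per_transactions(1 / 3000, load, SCALE)  # OS: MTTF of 3,000 hours
hw_fi = per_transactions(1 / 30, load, SCALE)    # hardware: 1 failure per 30 hours
other_fi = 9 * SCALE / 1_000_000                 # other systems: 9 per million

developed_fio = system_fio - (os_fi + hw_fi + other_fi)
print(f"acquired + environment: {os_fi + hw_fi + other_fi:.1f} per 10M transactions")
print(f"developed software FIO: {developed_fio:.1f} per 10M transactions")
```

Expressing every term on one scale before subtracting is the whole point of the common-scale step; mixing per-hour and per-transaction figures is the easiest way to get this calculation wrong.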

34 4. Strategies to Meet FIO
Engineer strategies to meet the failure intensity objective for the developed software. There are 4 main strategies:
- Fault prevention
- Fault removal
- Fault tolerance
- Fault/failure forecasting

35 Fault Prevention
To avoid fault occurrences by construction.
Activities:
- Requirement review
- Design review
- Clear code
- Establishing standards (ISO, etc.)
- Using CASE tools with built-in check mechanisms
Effectiveness factor: proportion of the faults remaining after prevention activities.

36 Fault Removal
To detect, by verification and validation, the existence of faults and eliminate them.
Activities:
- Reviewing code (inspection)
- Testing
Effectiveness factor: reduction of failure intensity due to code review; ratio of failure intensity after test to failure intensity before test.

37 Testing vs. Inspection
- Inspections are strict and close examinations conducted on specifications, design, code, tests, and other artifacts.

Inspections                                              Testing
Allow for defect detection, prevention, and isolation    Allows for defect detection
Start early in the life cycle                            Starts later in the life cycle

- Inspections are up to 20 times more efficient than testing.
- Code reading detects twice as many defects per hour as testing.
- 80% of development errors are usually found by inspections.
- Inspections resulted in a 10x reduction in the cost of finding errors.
SENG635 (Winter 2007)

38 Inspections or Testing?
Q. Can inspection replace testing?
A. No. Inspections cannot replace testing because not all the information revealed through testing can be obtained through inspection:
- Complex interactions in large systems (deadlocks, emergent behavior, etc.)
- Software reliability indicators
- Nonfunctional requirements: performance, usability, etc.

39 Fault Tolerance
To provide, by redundancy, service complying with the specification in spite of fault occurrences.
Activities:
- Designing and implementing redundancy
Effectiveness factor: reduction of failure intensity as a result of redundant design.

40 Fault / Failure Forecasting
To estimate, by evaluation, the presence of faults and the occurrence of failures.
Activities:
- Establishing a reliability model
- Collecting failure data
- Analysis and interpretation of results
Effectiveness factor: reduction of failure intensity as a result of applying reliability engineering.


42 Chapter 4, Section 2: Fault Tolerant Software Systems

43 Fault Tolerance Terminology
[Figure: taxonomy of fault tolerance. Recovery: backward, forward. Redundancy: hardware, software, data, and temporal redundancy. Further labels in the figure: architectural, functional, serial, parallel, sequential.]

44 Definition & Goal /1
- A fault-tolerant computing system must be capable of providing specified services in the presence of a bounded number of failures.
- It uses techniques that enable continued delivery of service during system operation.
- It is based on the principle: act during operation, define during specification and design.

45 Definition & Goal /2
- Failures can occur because faults are present either in the components of the system or in the system's design.
- Building large computing systems is a complex task; fault-tolerance requirements could make the task even more difficult unless appropriate system structuring concepts are utilized.
- Reliability growth (modeling, computation, and interpretation) of a system featuring fault tolerance is different from that of a system without such a feature.

46 Problems
- The traditional approaches to fault tolerance in hardware systems have been based on coping with the effects of well-understood failure modes of physical components.
- Conventional hardware fault tolerance methods (e.g., redundancy) are rarely powerful enough to cope with design deficiencies. E.g., designing a square wheel!
- Consequently, most hardware fault tolerance techniques cannot be applied directly in software fault tolerance, where almost all faults are design faults.
[Figure: two replicas both computing 2+2=5] Redundancy of an incorrectly designed component doesn't help!

47 History
- Defensive programming: implementing relatively ad-hoc methods to minimize the damage that could arise from the presence of residual bugs.
- Dual software technique: implementing two distinct versions of the same software and executing both. Any discrepancy in the outputs of the two versions may trigger an alarm.
- Etc.

48 Fault Tolerance Process
1. Detection: identify faults and their causes (errors).
2. Assessment: assess the extent to which the system state has been damaged or corrupted.
3. Recovery: remain operational or regain operational status.
4. Fault treatment and continued service: locate and repair the fault to prevent another occurrence.

49 Definitions
- Recovery: actions to restore the system state to a correct state. Recovery requires consistency checking.
- Redundancy: designing the system with multiple components with the same functionality.

50 Consistency Check
- A program-specific error detection mechanism that checks the results of program execution.
- Usually evaluates to either true or false.
  ensure <acceptance test> by P0 else by P1 else fail

51 Example: Consistency Check
- Checksums for program parts or split packages
- Internal check points: ABS[(SQRT(x)*SQRT(x)) - x] < E
- Exception signal when dividing by zero
- Integer overflow signal
- Interrupt signal for program loop
- Floating-point numerical failure check
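Two of the checks listed above can be sketched in Python. The function names are hypothetical, and the tolerance is scaled by the magnitude of x, which is an assumption on top of the slide's fixed bound E (a fixed absolute bound tends to fail for very large inputs).

```python
import math


def sqrt_is_consistent(x: float, tolerance: float = 1e-9) -> bool:
    """Internal check point: |sqrt(x) * sqrt(x) - x| must stay within a small bound."""
    if x < 0.0:
        return False  # no real square root exists, so the check fails
    root = math.sqrt(x)
    # Assumption: the bound scales with |x| instead of being a fixed E.
    return abs(root * root - x) <= tolerance * max(1.0, abs(x))


def checked_divide(numerator: float, denominator: float):
    """Turn the divide-by-zero exception signal into an explicit (ok, value) result."""
    try:
        return True, numerator / denominator
    except ZeroDivisionError:
        return False, None


print(sqrt_is_consistent(2.0))
print(checked_divide(6, 3))
print(checked_divide(1.0, 0.0))
```

Both checks evaluate to a plain true or false, which is what lets them act as the acceptance test in an `ensure ... by ... else by ...` structure.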

52 Example: Consistency Check
Dot product of two six-element vectors:
$x \cdot y = \sum_{i=1}^{6} x_i y_i$
The entries of x and y span many orders of magnitude (the surviving values include 1223, 3, $10^{30}$, 2, $10^{26}$, and 2111).
The correct answer is nonzero, but an ordinary implementation of this sum returns zero because of rounding and the large differences in the order of magnitude of the summands.
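The effect can be reproduced with a short Python sketch. The vectors below are an assumption: they are a widely cited six-element example whose entries agree with the values that survive on this slide, and they may not be exactly what the slide showed. Exact integer arithmetic gives the reference result; the floating-point loop is the "ordinary implementation".

```python
# Assumed vectors (a commonly quoted example, not confirmed by the slide itself).
x = [10**20, 1223, 10**24, 10**18, 3, -10**21]
y = [10**30, 2, -10**26, 10**22, 2111, 10**19]


def dot_float(xs, ys) -> float:
    """Ordinary left-to-right accumulation in double precision."""
    total = 0.0
    for a, b in zip(xs, ys):
        total += float(a) * float(b)
    return total


def dot_exact(xs, ys) -> int:
    """Reference result: Python integers have arbitrary precision, so no rounding occurs."""
    return sum(a * b for a, b in zip(xs, ys))


exact = dot_exact(x, y)
naive = dot_float(x, y)
print("exact:", exact)
print("float:", naive)
print("consistent:", abs(naive - exact) <= 1e-6 * abs(exact))
```

The small products are absorbed when added to terms of magnitude 10^50 and 10^40, so the floating-point result carries no information about the true sum. A consistency check against a higher-precision or exact computation exposes the problem.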

53 Backward Recovery
- Roll back the system to a previously saved correct state.
[Figure: rollback to a saved state when the consistency check fails. Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech House, 2001]

54 Domino Effect
- Why is backward recovery not always possible?
- Domino effect: successive rollback of communicating processes when a failure is detected in any one of the processes.
Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech House, 2001

55 Forward Recovery
- Use redundancy to recover from a failure.
Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech House, 2001

56 Forward Recovery: Pros & Cons
Advantages:
- Forward recovery is fairly efficient in terms of the overhead (time and memory) it requires. This can be crucial in real-time applications, where the time overhead of backward recovery can exceed stringent time constraints.
- If the fault is an anticipated one, such as the potential loss of data, then redundancy and forward recovery can be a useful and timely approach.
- Faults involving missed deadlines may be better recovered from using forward recovery than by introducing additional delay through rollback and recovery.
Disadvantages:
- Application-specific: it must be tailored to each situation or program.
- Can only remove predictable errors from the system state.
- Requires knowledge of the error.
- Cannot aid in recovery if the state is damaged beyond recoverability.
- Depends on the ability to accurately detect the occurrence of a fault (thus initiating the recovery actions).
Source: Laura L. Pullum, Software Fault Tolerance Techniques and Implementation, Artech House, 2001

57 Redundancy
- Redundancy: designing the system with multiple components with the same functionality.
- Redundancy techniques: implementing two (or more) distinct versions of the same software and executing them on the same set of inputs. Any discrepancy in the outputs of the versions may trigger an alarm.
- The efficiency of redundancy techniques depends on coincident and correlated faults.

58 Types of Redundancy
- Hardware redundancy: replicated and supplementary hardware added to the system to support fault tolerance.
- Software redundancy: also called program, modular, or functional redundancy; includes programs, modules, and functions used to support fault tolerance.
- Data redundancy: using additional forms of data to assist in fault tolerance.
- Temporal redundancy: using additional time to perform tasks related to fault tolerance, e.g., repeating an execution using the same software and hardware resources involved in the initial, failed execution.

59 1. Coincident Faults
- Coincident faults: two or more functionally equivalent software components fail on the same input.
- When two or more software versions give the same incorrect response, an identical-and-wrong (IAW) answer is obtained.

60 2. Correlated Faults
- Correlated faults: two faults are correlated when the measured probability of coincident failures is significantly higher than what would be expected from the individual failure probabilities.
- If $P(i\ \text{fails} \mid j\ \text{fails}) \neq P(i\ \text{fails})$, there is no failure independence.

61 Failure Scenario
- What if the software components produce doublet or triplet identical-and-wrong (IAW) responses?
[Figure: input space for each procedure (P1, P2, P3) feeding an adjudication algorithm, with regions of doublet and triplet IAW faults]

62 Adjudication by Voting
- A voter compares results from two or more functionally equivalent software components and decides which of the answers provided by those components is correct.
- Various versions of the voting algorithm:
  - Majority voting
  - Consensus voting
  - 2-of-N voting

63 Majority Voting
- Several identical components are structured in parallel and all are active.
- If the component outputs are not identical, the minority components are ignored (i.e., disabled or switched off).
- Majority voting: the agreement number is $m = \lceil (N+1)/2 \rceil$, $N > 1$, where N is the number of systems.
- System reliability ($R_{system}$) for majority voting (assuming components with identical reliability $R_c$):
$R_{system} = 1 - (1 - R_c)^m$, where $m = \lceil (N+1)/2 \rceil$

64 Consensus Voting
- If majority agreement is achieved, select this answer.
- If a unique maximum agreement is achieved but $m < \lceil (N+1)/2 \rceil$, select the unique maximum (the brackets denote the ceiling value).
- If there is a tie in the maximum agreement number, select randomly.
- System reliability ($R_{system}$) for consensus voting (assuming components with identical reliability $R_c$):
$R_{system} = 1 - (1 - R_c)^m$, where m is the number of components in the unique maximum agreement

65 2-of-N Voting
- The agreement number m can be set to 2 if the output space is large and statistical independence of variant failures can be assumed.
- System reliability ($R_{system}$) for 2-of-N voting (assuming components with identical reliability $R_c$):
$R_{system} = 1 - (1 - R_c)^2$
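The three adjudication rules can be sketched as small Python functions. This is a minimal illustration, not a reference implementation: the function names are hypothetical, outputs are assumed to be hashable and compared for exact equality, and a `None` return is used here to mean "no decision".

```python
import random
from collections import Counter


def majority_vote(outputs):
    """Return the output shared by at least ceil((N+1)/2) versions, else None."""
    if not outputs:
        return None
    agreement_number = (len(outputs) + 2) // 2  # integer form of ceil((N+1)/2)
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= agreement_number else None


def consensus_vote(outputs, rng=random):
    """Pick the unique largest agreement group; break ties between groups randomly."""
    if not outputs:
        return None
    counts = Counter(outputs)
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else rng.choice(winners)


def two_of_n_vote(outputs):
    """Accept the most common output that at least two versions agree on, else None."""
    for value, count in Counter(outputs).most_common():
        if count >= 2:
            return value
    return None


print(majority_vote([7, 7, 9]))
print(consensus_vote([5, 5, 6, 7, 8]))
print(two_of_n_vote([4, 9, 4, 8, 7]))
```

Real voters for floating-point outputs usually compare within a tolerance instead of testing exact equality; that refinement is left out here to keep the three rules easy to compare.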

66 Design Techniques
1) Robust software systems
2) Recovery blocks
3) N-version programming
4) Consensus recovery block
5) Acceptance voting
6) N-self-checking programming

67 1) Robust Software Systems /1
Robust software systems (Anderson and Lee 1981, etc.). Construction of a robust module requires:
- Exception handlers for coping with exceptions propagated from lower levels; and
- Boolean expressions for detecting exceptions arising in the module itself, and their exception handlers.
It is often possible (and desirable for the sake of simplicity) to map several exceptions onto a single handler.

68 2) Recovery Blocks (RB)
- Uses multiple versions of a software module and an acceptance test.
- The output of the 1st module is tested for acceptability; if it fails, the 2nd module is executed after backward state recovery.
- The system fails only if all modules fail their acceptance tests.
Figure from Reliability Engineering Handbook
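A recovery block can be sketched in Python as a loop over alternates with a checkpoint restored before each attempt. The names `recovery_block`, `buggy_sort`, `simple_sort`, and `accept` are all hypothetical, and the deliberately faulty primary exists only to show the fallback path.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a fresh copy of the checkpoint; return the first accepted result."""
    checkpoint = copy.deepcopy(state)  # saved state used for backward recovery
    for alternate in alternates:
        working_state = copy.deepcopy(checkpoint)  # roll back before every attempt
        try:
            result = alternate(working_state)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sort(items):
    items.pop()  # seeded fault: silently drops an element
    return sorted(items)


def simple_sort(items):
    return sorted(items)


data = [3, 1, 2]


def accept(result):
    """Partial check: same length as the input and in non-decreasing order."""
    return len(result) == len(data) and all(a <= b for a, b in zip(result, result[1:]))


print(recovery_block(data, [buggy_sort, simple_sort], accept))
print(data)  # the caller's state is untouched by the failed primary
```

The acceptance test here checks only length and ordering, not that the output is a permutation of the input, so it illustrates the usual trade-off: a cheaper test leaves some failures undetectable.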

69 3) N-Version Programming (NVP)
- Parallel execution of N independently developed, functionally equivalent modules.
- Adjudication is via voting. The voter accepts all N outputs and selects the correct one among them, i.e., the one that meets the specification.
- Advantage of NVP: no service interruption.
Figure from Reliability Engineering Handbook

70 4) Consensus Recovery Block
- Combination of N-version programming (NVP) and recovery blocks (RB).
- If NVP fails, the system reverts to RB using the same blocks.
- Advantage: highest possible system reliability.
[Figure: the input goes to NVP; on success the correct output is delivered; on failure control passes to RB, which either delivers the correct output or ends in system failure]

71 5) Acceptance Voting
- As in N-version programming (NVP), all versions are executed in parallel.
- The output of each module goes to an acceptance test.
- If the acceptance test is successful, the output goes to a voter.
Figure from Reliability Engineering Handbook
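Acceptance voting can be sketched as a filter followed by a vote. In this Python illustration the function name and the three toy versions are hypothetical, the versions run sequentially instead of in parallel for simplicity, and the voter is assumed to pick the unique largest agreement group among the accepted outputs.

```python
from collections import Counter


def acceptance_voting(versions, acceptance_test, value):
    """Run every version, keep outputs that pass the acceptance test, then vote on them."""
    accepted = []
    for version in versions:
        try:
            output = version(value)
        except Exception:
            continue  # a crashing version contributes no output
        if acceptance_test(value, output):
            accepted.append(output)
    if not accepted:
        return None  # nothing passed the acceptance test
    counts = Counter(accepted)
    top = max(counts.values())
    winners = [out for out, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None  # None when the vote is tied


# Three toy versions of "square a number"; the third has a seeded sign fault.
versions = [lambda v: v * v, lambda v: v ** 2, lambda v: -(v * v)]


def non_negative(value, output):
    """Acceptance test: a square can never be negative."""
    return output >= 0


print(acceptance_voting(versions, non_negative, 3))
```

Because the faulty output is rejected before voting, the voter sees only the two agreeing results. The number of outputs reaching the voter can vary from run to run, which is the main difference from plain NVP.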

72 6) N-Self-Check Programming
- In N-Self-Check Programming (NSCP), N modules are executed in pairs.
- Each pair's outputs can be compared or assessed for correctness.
Figure from Reliability Engineering Handbook

73 Discussion
- The capability of tolerating design faults rests largely on the coverage of run-time checks (i.e., acceptance tests) for detecting errors.
- Often, it is not possible to check completely within a procedure that the results produced conform to the specification (e.g., for a sort algorithm, checking that the output has been sorted correctly would be as complex as the sort algorithm itself).
- Hence run-time checks are often limited to checking certain critical aspects of the specification.
- This means that the possibility of undetected failures cannot be ruled out entirely.

74 Fault Tolerance: Adjudication by Voting
