
SOFTWARE ENGINEERING
SOFTWARE RELIABILITY

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

LEARNING OBJECTIVES

To differentiate between failures and faults.
To highlight the importance of execution time and calendar time.
To understand the time interval between failures.
To understand the user's perception of reliability.

DEFINITIONS OF SOFTWARE RELIABILITY

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. The key elements of the definition are the probability of failure-free operation, the length of time of failure-free operation, and the given execution environment.

Failure intensity is a measure of the reliability of a software system operating in a given environment. Example: an air traffic control system fails once in two years.

FACTORS INFLUENCING SOFTWARE RELIABILITY

A user's perception of the reliability of a software system depends upon two categories of information:
The number of faults present in the software.
The way users operate the system. This is known as the operational profile.

The fault count in a system is influenced by the following:
Size and complexity of the code.
Characteristics of the development process used.
Education, experience, and training of development personnel.
The operational environment.
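Since the operational profile is essentially a statement of how often users invoke each kind of operation, it can be treated as a probability distribution over operations. The sketch below is a minimal illustration of that idea; the operation names and probabilities are invented for the example, not taken from the text.

```python
import random

# Hypothetical operational profile: operation name -> probability of occurrence.
# These operations and probabilities are illustrative only.
operational_profile = {
    "create_record": 0.50,
    "query_record": 0.35,
    "update_record": 0.10,
    "delete_record": 0.05,
}

def draw_test_operations(profile, n):
    """Select n operations to test, in proportion to their usage probabilities."""
    operations = list(profile.keys())
    weights = list(profile.values())
    return random.choices(operations, weights=weights, k=n)

print(draw_test_operations(operational_profile, 10))
```

Drawing test cases in proportion to the profile is what allows measured failure data to reflect the reliability users will actually experience in operation.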

APPLICATIONS OF SOFTWARE RELIABILITY

The applications of software reliability include:
Comparison of software engineering technologies. What is the cost of adopting a technology? What is the return from the technology, in terms of cost and quality?
Measuring the progress of system testing. The failure intensity measure tells us about the present quality of the system: high intensity means more tests are to be performed.
Controlling the system in operation. The amount of change made to the software during maintenance affects its reliability.
Better insight into software development processes. Quantification of quality gives us a better insight into the development processes.

FUNCTIONAL AND NON-FUNCTIONAL REQUIREMENTS

System functional requirements may specify error checking, recovery features, and protection against system failure. System reliability and availability are specified as part of the non-functional requirements for the system.

SYSTEM RELIABILITY SPECIFICATION

Hardware reliability focuses on the probability that a hardware component fails. Software reliability focuses on the probability that a software component will produce an incorrect output; software does not wear out and can continue to operate after a bad result. Operator reliability focuses on the probability that a system user makes an error.

FAILURE PROBABILITIES

If there are two independent components in a system and the operation of the system depends on both of them, then the probability of system failure is

P(S) = P(A) + P(B)

(strictly 1 - (1 - P(A))(1 - P(B)) = P(A) + P(B) - P(A)P(B); the simple sum is a good approximation when both probabilities are small).

If a component is replicated n times, then the probability of system failure is

P(S) = P(A)^n

which means the system fails only if all n replicas fail at once (again assuming independent failures).
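As a rough numerical check of the formulas above, the following sketch computes the failure probability of a two-component series system (exact form and the P(A) + P(B) approximation) and of an n-way replicated component, assuming independent failures; the probability values used are arbitrary examples.

```python
def series_failure_exact(p_components):
    """System fails if any component fails (independent failures assumed)."""
    surviving = 1.0
    for p in p_components:
        surviving *= (1.0 - p)
    return 1.0 - surviving

def series_failure_approx(p_components):
    """P(A) + P(B) + ... approximation, valid when each probability is small."""
    return sum(p_components)

def replicated_failure(p, n):
    """System fails only if all n replicas fail at once (independent failures assumed)."""
    return p ** n

p_a, p_b = 0.01, 0.02
print(series_failure_exact([p_a, p_b]))   # 0.0298
print(series_failure_approx([p_a, p_b]))  # 0.03
print(replicated_failure(0.01, 3))        # 1e-06
```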

FUNCTIONAL RELIABILITY REQUIREMENTS

The system will check all operator inputs to see that they fall within their required ranges (sketched below).
The system will check all disks for bad blocks each time it is booted.
The system must be implemented using a standard implementation of Ada.

NON-FUNCTIONAL RELIABILITY SPECIFICATION

The required level of reliability must be expressed quantitatively. Reliability is a dynamic system attribute, so source-code reliability specifications (e.g. N faults per 1000 LOC) are meaningless. An appropriate metric should be chosen to specify the overall system reliability.

HARDWARE RELIABILITY METRICS

Hardware reliability metrics are not suitable for software because they are based on the notion of component failure: hardware components can wear out, whereas software failures are usually design failures, and the system is often still available after a failure has occurred.

SOFTWARE RELIABILITY METRICS

Reliability metrics are units of measure for system reliability. System reliability is measured by counting the number of operational failures and relating these to the demands made on the system at the time of failure. A long-term measurement program is required to assess the reliability of critical systems.

PROBABILITY OF FAILURE ON DEMAND

The probability that the system will fail when a service request is made. It is useful when requests are made on an intermittent or infrequent basis. It is appropriate for protection systems, where service requests may be rare and the consequences can be serious if the service is not delivered, and it is relevant for many safety-critical systems with exception handlers.

RATE OF FAULT OCCURRENCE

The rate at which failures occur in the system. It is useful when the system has to process a large number of similar requests that are relatively frequent. It is relevant for operating systems and transaction processing systems.
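Returning to the first functional reliability requirement above (checking that operator inputs fall within their required ranges), the sketch below shows one minimal way such a check might look; the field names and permitted ranges are invented for illustration.

```python
# Illustrative input-range check for operator inputs; field names and
# permitted ranges below are invented, not taken from any real system.
REQUIRED_RANGES = {
    "altitude_ft": (0, 60_000),
    "heading_deg": (0, 359),
    "speed_kts": (0, 700),
}

def check_operator_inputs(inputs):
    """Return a list of (field, value, reason) for every missing or out-of-range input."""
    violations = []
    for field, (low, high) in REQUIRED_RANGES.items():
        value = inputs.get(field)
        if value is None:
            violations.append((field, None, "missing"))
        elif not (low <= value <= high):
            violations.append((field, value, f"outside [{low}, {high}]"))
    return violations

print(check_operator_inputs({"altitude_ft": 35_000, "heading_deg": 400, "speed_kts": 450}))
# [('heading_deg', 400, 'outside [0, 359]')]
```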

RELIABILITY METRICS

Probability of Failure on Demand (PoFoD). PoFoD = 0.001 means that, on average, one in every 1000 service requests fails.

Rate of Fault Occurrence (RoCoF). RoCoF = 0.02 means two failures in every 100 operational time units.

Mean Time to Failure (MTTF). The average time between observed failures (often used interchangeably with MTBF). It measures the time between observable system failures; for stable systems MTTF = 1/RoCoF. It is relevant for systems in which individual transactions take a long time to process (e.g. CAD or word-processing systems).

Availability = MTBF / (MTBF + MTTR), where MTBF is the Mean Time Between Failures and MTTR is the Mean Time To Repair.

Reliability = MTBF / (1 + MTBF).

TIME UNITS

Raw execution time is employed for non-stop systems.
Calendar time is employed when the system has regular usage patterns.
Number of transactions is employed for demand-type transaction systems.

AVAILABILITY

Availability measures the fraction of time the system is actually available for use. It takes repair and restart times into account. It is relevant for non-stop, continuously running systems (e.g. traffic signal control).
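The sketch below shows how these metrics might be calculated from simple operational data; the counts and times used are invented purely to illustrate the formulas given above.

```python
# Illustrative reliability metric calculations; all input numbers are invented.

demands = 10_000           # service requests observed
failures_on_demand = 10    # failed requests
pofod = failures_on_demand / demands             # probability of failure on demand

operational_time = 500.0   # operational time units observed
failures_in_time = 10
rocof = failures_in_time / operational_time      # rate of failure occurrence
mttf = 1.0 / rocof                               # for a stable system, MTTF = 1/RoCoF

mtbf = 50.0                # mean time between failures (time units)
mttr = 2.0                 # mean time to repair (time units)
availability = mtbf / (mtbf + mttr)

print(f"PoFoD = {pofod:.4f}")
print(f"RoCoF = {rocof:.4f} failures per time unit, MTTF = {mttf:.1f}")
print(f"Availability = {availability:.3f}")
```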

FAILURE CONSEQUENCES (1)

Reliability does not take consequences into account. Transient faults may have no real consequences, but other faults might cause data loss or corruption. Hence it may be worthwhile to identify different classes of failure and use different metrics for each.

FAILURE CONSEQUENCES (2)

When specifying reliability, both the number of failures and the consequences of each matter. Failures with serious consequences are more damaging than those where repair and recovery is straightforward. In some cases, different reliability specifications may be defined for different failure types.

FAILURE CLASSIFICATION

Failures can be classified as follows:
Transient - only occurs with certain inputs.
Permanent - occurs on all inputs.
Recoverable - the system can recover without operator help.
Unrecoverable - the operator has to help.
Non-corrupting - the failure does not corrupt system state or data.
Corrupting - system state or data are altered.

BUILDING A RELIABILITY SPECIFICATION

Building a reliability specification involves analysing the consequences of possible system failures for each sub-system. From the failure analysis, partition the failures into appropriate classes, and for each class set out the appropriate reliability metric.

SPECIFICATION VALIDATION

It is impossible to validate high reliability specifications empirically. "No database corruption" effectively means a PoFoD of less than 1 in 200 million. If each transaction takes 1 second to verify, simulation of one day's transactions takes 3.5 days.
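To see why such validation is impractical, the following sketch works through the arithmetic using the figures quoted above (1 second to verify each transaction, a target of better than 1 failure in 200 million demands); the script itself is illustrative only.

```python
# Figures quoted in the text above; everything derived from them is simple arithmetic.
SECONDS_PER_TRANSACTION = 1            # time to verify one transaction
TARGET_DEMANDS = 200_000_000           # PoFoD better than 1 in 200 million

SECONDS_PER_DAY = 24 * 3600

# One simulated day of transactions reportedly takes 3.5 days to verify,
# which implies roughly this many transactions per day:
transactions_per_day = 3.5 * SECONDS_PER_DAY / SECONDS_PER_TRANSACTION

# Time needed just to exercise 200 million demands at 1 second each:
total_seconds = TARGET_DEMANDS * SECONDS_PER_TRANSACTION
total_days = total_seconds / SECONDS_PER_DAY

print(f"Implied workload: about {transactions_per_day:,.0f} transactions per day")
print(f"Exercising {TARGET_DEMANDS:,} demands takes about {total_days:,.0f} days "
      f"(~{total_days / 365:.1f} years)")
```

Even before considering the statistical confidence required, simply exercising that many demands at one second each takes years of continuous simulation, which is why very high reliability requirements cannot be demonstrated by testing alone.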