Course: Advanced Software Engineering. academic year: Lecture 14: Software Dependability


Course: Advanced Software Engineering
Academic year: 2011-2012
Lecture 14: Software Dependability
Lecturer: Vittorio Cortellessa
Computer Science Department, University of L'Aquila - Italy
vittorio.cortellessa@di.univaq.it
www.di.univaq.it/cortelle

Copyright Notice
» The material in these slides may be freely reproduced and distributed, partially or totally, as long as an explicit reference or acknowledgement to the material's author is preserved.

BASIC CONCEPTS

BASIC CONCEPTS: dependability summary

Reliability definition

» (RELIABILITY) Probability of a system working within specs throughout an interval of time without system-level repair
» (RELIABILITY ON DEMAND) Probability of a system working within specs for a certain number of invocations without system-level repair

Availability definition

» Fraction of time that the system is up within specs

Reliability terminology

» Fault: a feature that precludes the software from operating according to its specifications
» Error: the value of the software state differs from the expected one
» Failure: the actual software output (for some input) differs from the expected one

Faults, Errors and Failures

Specification: a function that computes the remainder by 3 of the square of the input value, y = (s^2 mod 3)

    program modthreeofsquare
    begin
      read(s);
      s := 2*s;
      s := s mod 3;
      write(s);
    end

» Fault: the statement s := 2*s doubles the input instead of squaring it
» s=2: no Error (2*2 and 2^2 are both 4)
» s=3: Error (the state is 6 instead of 9)
» s=2, s=3: no Failure (for s=3, both 6 and 9 give remainder 0)
» s=4: Failure (the output is 2 instead of 1)

Faults, Errors and Failures

» A failure is usually the result of a system error that derives from a fault in the system
» However, faults do not necessarily result in system errors
  - A faulty system might never execute the faulty statement, so the error never originates
» Errors do not necessarily lead to system failures
  - The error can be corrected by built-in error detection and recovery, or it can be naturally masked by other system components (error propagation)
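The modthreeofsquare example above can be replayed in a few lines of Python. This is only an illustrative sketch: the function names and the `classify` helper are hypothetical and not part of the slides. It runs the faulty program next to the specification and reports, for each input, whether the internal state deviates (error) and whether the output deviates (failure).

```python
def mod_three_of_square_faulty(s):
    """Faulty program from the slide: it doubles s instead of squaring it."""
    state = 2 * s              # the fault: the specification requires s * s
    return state, state % 3


def mod_three_of_square_spec(s):
    """Specification: y = s^2 mod 3."""
    state = s * s
    return state, state % 3


def classify(s):
    """Compare the faulty run with the specification for input s."""
    actual_state, actual_out = mod_three_of_square_faulty(s)
    expected_state, expected_out = mod_three_of_square_spec(s)
    return {
        "error": actual_state != expected_state,    # internal state deviates
        "failure": actual_out != expected_out,      # visible output deviates
    }


for s in (2, 3, 4):
    print(s, classify(s))
```

For s=2 neither an error nor a failure is reported, for s=3 there is an error that is masked by the mod operation, and for s=4 the error reaches the output as a failure, matching the annotations on the slide.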

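About the error propagation

[Figure: components C1, C2, ..., Cn, with reliabilities Ф(C1), Ф(C2), ..., Ф(Cn), somehow interacting; each component is either correct or erroneous]

» Reliability of each component may not suffice

About the error propagation

[Figure: error propagation between component i and component j. Fault activation turns an internal fault of component i into an error; error propagation carries it to the component interface, where the status of component i changes from correct service to incorrect service (component i failure). Component j receives it as an input error, which propagates to the system interface, where the status of component j changes from correct service to incorrect service ((system) failure)]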

Dependability achievement

» Fault avoidance
  - Development techniques are used that either minimise the possibility of mistakes or trap mistakes before they result in the introduction of system faults
» Fault detection and removal
  - Verification and validation techniques that increase the probability of detecting and correcting faults before the system goes into service
» Fault tolerance
  - Run-time techniques are used to ensure that system faults do not result in system errors and/or that system errors do not lead to system failures

FAULT AND FAILURE TYPES

Fault classification

» Heisenbugs
» Byzantine faults

Fault classification

Bohrbugs
» A repeatable bug; one that manifests reliably under a possibly unknown but well-defined set of conditions.

Fault classification

Heisenbugs
» A bug that disappears or alters its behavior when one attempts to probe or isolate it.
» E.g., the use of a debugger sometimes alters a program's operating environment significantly enough that buggy code, such as that which relies on the values of uninitialized memory, behaves quite differently.

Fault classification

Mandelbugs
» A bug whose underlying causes are so complex and obscure as to make its behavior appear chaotic or even non-deterministic.

Failure classification: nature of the failure

» Hardware failure
  - Hardware fails because of design and manufacturing errors or because components have reached the end of their natural life.
» Software failure
  - Software fails due to errors in its specification, design or implementation.
  - Software failures are different from hardware failures in that software does not wear out. It can continue in operation even after an incorrect result has been produced.
» Operational failure
  - Human operators make mistakes. Now perhaps the largest single cause of system failures.

Failure classification: type of failure

Failure classification: severity of failure

Failure class      Description
Transient          Occurs only with certain inputs
Permanent          Occurs with all inputs
Recoverable        System can recover without operator intervention
Unrecoverable      Operator intervention needed to recover from failure
Non-corrupting     Failure does not corrupt system state or data
Corrupting         Failure corrupts system state or data

Reliability improvement

» Removing X% of the faults in a system will not necessarily improve the reliability by X%. A study at IBM showed that removing 60% of product defects resulted in a 3% improvement in reliability
» Program defects may be in rarely executed sections of the code so may never be encountered by users. Removing these does not affect the perceived reliability
» A program with known faults may therefore still be seen as reliable by its users
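The point about rarely executed defects can be illustrated with a small calculation. The sketch below rests on assumptions that are not in the slides: the per-request activation probabilities are invented, faults are assumed to be activated independently, and the numbers are not the IBM data. It compares the probability of failure per request before and after removing 60% of the faults, all of them located in rarely executed code.

```python
from math import prod


def failure_probability(activation_probs):
    """Probability that at least one fault causes a failure on one request,
    assuming faults are activated independently (a simplifying assumption)."""
    return 1 - prod(1 - p for p in activation_probs)


# Hypothetical system with 10 faults:
frequent = [0.01] * 4     # 4 faults in frequently executed code
rare = [0.0001] * 6       # 6 faults in rarely executed code

before = failure_probability(frequent + rare)
after = failure_probability(frequent)        # the 6 rare faults are removed

relative_improvement = (before - after) / before
print(f"faults removed: {len(rare) / (len(frequent) + len(rare)):.0%}")
print(f"failure probability per request: {before:.5f} -> {after:.5f}")
print(f"relative improvement: {relative_improvement:.1%}")
```

Under these assumed numbers, removing 60% of the faults lowers the failure probability by only a few percent, because the removed faults were almost never activated.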

Reliability specifications

» The required level of system reliability should be expressed quantitatively.
» Reliability is a dynamic system attribute (reliability specifications related to the source code are meaningless).
  - No more than N faults/1000 lines;
  - This is only useful for a post-delivery process analysis where you are trying to assess how good your development techniques are.
» An appropriate reliability metric should be chosen to specify the overall system reliability.

METRICS

Reliability metrics

» Reliability metrics are units of measurement of system reliability.
» System reliability is measured by counting the number of operational failures and, where appropriate, relating these to the demands made on the system and the time that the system has been operational.

Dependability metrics

Metric: POFOD (Probability of failure on demand)
Explanation: The likelihood that the system will fail when a service request is made. A POFOD of 0.001 means that 1 out of a thousand service requests may result in failure.

Metric: ROCOF (Rate of failure occurrence)
Explanation: The frequency of occurrence with which unexpected behaviour is likely to occur. A ROCOF of 2/100 means that 2 failures are likely to occur in each 100 operational time units. This metric is sometimes called the failure intensity.

Metric: MTTF (Mean time to failure)
Explanation: The average time between observed system failures. An MTTF of 500 means that 1 failure can be expected every 500 time units.

Metric: AVAIL (Availability)
Explanation: The probability that the system is available for use at a given time. Availability of 0.998 means that in every 1000 time units, the system is likely to be available for 998 of these.
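The four metrics above can be estimated from simple operational counts. The following is a minimal sketch rather than a prescribed procedure: the function names and parameters are hypothetical, and the sample figures are the ones used in the explanations above, except for the three inter-failure intervals, which are invented so that their mean is 500.

```python
from statistics import mean


def pofod(failed_requests, total_requests):
    """Probability of failure on demand: failures per service request."""
    return failed_requests / total_requests


def rocof(failures, operational_time_units):
    """Rate of failure occurrence: failures per operational time unit."""
    return failures / operational_time_units


def mttf(times_between_failures):
    """Mean time to failure: average of the observed inter-failure intervals."""
    return mean(times_between_failures)


def avail(time_available, total_time):
    """Availability: fraction of time the system was available for use."""
    return time_available / total_time


print("POFOD:", pofod(1, 1000))          # 1 failure in 1000 requests
print("ROCOF:", rocof(2, 100))           # 2 failures in 100 time units
print("MTTF :", mttf([400, 600, 500]))   # assumed intervals between failures
print("AVAIL:", avail(998, 1000))        # available for 998 of 1000 time units
```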

Probability of failure on demand (POFOD)

» This is the probability that the system will fail when a service request is made. Useful when demands for service are intermittent and relatively infrequent.
» Appropriate for protection systems where services are demanded occasionally and where there are serious consequences if the service is not delivered.

Rate of failure occurrence (ROCOF)

» Reflects the rate of occurrence of failure in the system.
» Relevant for operating systems and transaction processing systems, where the system has to process a large number of similar requests that are relatively frequent

Mean time to failure (MTTF)

» Measure of the time between observed failures of the system. It is the reciprocal of ROCOF for stable systems.
» Relevant for systems with long transactions, i.e. where system processing takes a long time. MTTF should be longer than transaction length.

Availability

» Measure of the fraction of the time that the system is available for use.
» Takes repair and restart time into account
» Relevant for non-stop, continuously running systems
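The relations stated above can be sketched numerically. Two assumptions here are not in the slides: the mean time to repair (MTTR) as the measure of repair and restart time, and the common steady-state formula availability = MTTF / (MTTF + MTTR). The function names are hypothetical.

```python
def mttf_from_rocof(rocof):
    """For a stable system, MTTF is the reciprocal of ROCOF."""
    return 1.0 / rocof


def steady_state_availability(mttf, mttr):
    """Assumed steady-state formula: uptime over uptime plus repair time."""
    return mttf / (mttf + mttr)


def mttf_covers_transaction(mttf, transaction_length):
    """Check the slide's rule that MTTF should exceed the transaction length."""
    return mttf > transaction_length


mttf = mttf_from_rocof(0.002)    # 2 failures per 1000 time units
print("MTTF:", mttf)
print("Availability with MTTR = 1:", steady_state_availability(499, 1))
print("Long transaction fits:", mttf_covers_transaction(mttf, 120))
```

With the same MTTF, a longer repair time lowers availability, which is why availability rather than MTTF alone is the relevant metric for continuously running systems.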