Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Contents:
- Introduction
- Types of faults
- Dependability concept classification
- Error recovery
- Types of redundancy

Introduction:
- Safe and reliable software operation is a significant requirement for many systems
  - for instance: aircraft, medical devices, nuclear safety
  - appliance-type applications: automobiles, washing machines, temperature control
- Examples of software faults:
  - software on the space shuttle Endeavour rounded near-zero values to zero
  - problems in the Airbus A320

Introduction (cont.)
Faults:
- Hardware faults:
  - physical faults
  - can be characterized and predicted
- Software faults:
  - logical faults
  - difficult to visualize, classify, detect, and correct
- Software faults may be traced to:
  - incorrect requirements
  - an implementation (software design, coding) that does not satisfy the requirements

A few definitions:
- Cycle: failure -> fault -> error -> failure
- Software fault tolerance prevents failures by tolerating faults
Fault tolerance techniques:
- Chapter 2: various means of structuring redundancy
- Chapter 3: some programming methods
- Chapter 4: design-diverse techniques (recovery)
- Chapter 5: data-diverse techniques

Means to achieve dependable software:

Means:
- Fault avoidance or prevention: to avoid or prevent fault occurrence
- Fault removal: to detect the existence of faults and eliminate them
- Fault/failure forecasting: to estimate the presence of faults and the occurrence and consequences of failures
- Fault tolerance: to provide service complying with the specification in spite of faults

Fault avoidance or prevention:
1) System requirements specification:
- a common problem: the behavior specified in the requirements is not the expected or desired system behavior
- the software requirements specification lies at the intersection between software engineering and system engineering
2) Structured design and programming methods:
- introduce structure into the design to reduce the complexity and interdependency of components
- this reduces the overall complexity of the software, making it easier to implement and reducing the introduction of faults

Fault avoidance or prevention (cont.)
3) Formal methods:
- requirements specifications are developed mathematically
- the specification is about the same size as the program
- difficult to construct
- harder to understand than the program itself
- prone to error
- suitable for small components that are highly critical
4) Software reuse:
- savings in development cost
- software that has been well exercised is less likely to fail

Fault removal:
1) Software testing:
- Difficulties:
  - cost and complexity of exhaustive testing using all possible input sets
  - testing can show the presence, but not the absence, of faults
2) Formal inspection:
- has been widely implemented in industry
- focuses on:
  - examining source code to find faults
  - correcting the faults
  - verifying the corrections
- carried out prior to the testing phase

3) Formal design proofs:
- closely related to formal methods
- attempt to achieve a mathematical proof of correctness for programs
- a costly and complex technique

Fault/failure forecasting:
1) Reliability estimation:
- determines current software reliability
- applies statistical inference techniques to failure data
2) Reliability prediction:
- determines future software reliability
- uses different techniques, depending on the software development stage
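As one very simple illustration of applying statistical inference to failure data, the sketch below assumes a constant failure rate (an exponential model, which the slides do not specify) and uses made-up inter-failure times. The function names are hypothetical.

```python
import math


def estimate_failure_rate(inter_failure_hours):
    """Maximum-likelihood failure rate under an assumed exponential model."""
    return len(inter_failure_hours) / sum(inter_failure_hours)


def reliability(rate, mission_hours):
    """Probability of failure-free operation for mission_hours at a constant rate."""
    return math.exp(-rate * mission_hours)


# Illustrative, made-up times between observed failures (hours).
observed = [120.0, 80.0, 100.0]
rate = estimate_failure_rate(observed)   # failures per hour
r10 = reliability(rate, 10.0)            # chance of surviving a 10-hour run
```

Real reliability estimation usually relies on richer models; this only shows the shape of the calculation.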

Fault tolerance:
- when a fault occurs, fault tolerance techniques provide mechanisms to prevent system failure
- they provide protection against errors in translating the requirements and algorithms into a programming language
- they do not provide explicit protection against errors in specifying the requirements

Error recovery:
- Error detection: an erroneous state is identified
- Error diagnosis: the damage caused by the error is assessed and the cause of the error is determined
- Error containment/isolation: the error is prevented from propagating
- Error recovery: the erroneous state is replaced with an error-free state

Types of recovery:
1) Backward recovery:
- restores (rolls back) the system to a previously saved state
- checkpointing: saving the system state so that it can be restored later
- on error, the system state is restored to the last saved state
- operations continue or restart from that state
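The checkpoint-and-rollback idea can be sketched as follows. This is a minimal in-memory illustration, not a production mechanism: `CheckpointedState`, `run_with_rollback`, and `faulty_update` are hypothetical names, and a real system would save checkpoints to stable storage.

```python
import copy


class CheckpointedState:
    """Hypothetical wrapper holding a state dict and one saved checkpoint."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Deep copy so later mutations cannot corrupt the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: restore the last saved state.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_rollback(cs, operation):
    """Checkpoint, run operation on cs.state, roll back if it raises."""
    cs.checkpoint()
    try:
        operation(cs.state)
        return True
    except Exception:
        cs.rollback()
        return False


def faulty_update(state):
    state["balance"] -= 500                # partial modification...
    raise RuntimeError("error detected")   # ...then an error is detected


account = CheckpointedState({"balance": 100})
ok = run_with_rollback(account, faulty_update)
```

After the failed update, `account.state` is back to the checkpointed value, and the caller can retry or continue from there.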

Backward recovery: [figure not included in the transcription]

Advantages of backward recovery:
- can handle unpredictable errors caused by design faults
- has a uniform pattern of error detection and recovery
- the only knowledge required is that the relevant prior state is error free
- particularly suited to recovery from transient faults

Disadvantages of backward recovery:
- requires significant resources (time, computation, stable storage)
- requires the system to be halted temporarily
- the domino effect may occur

Forward recovery:
- finds a new state from which the system can continue operation
- uses redundancy
- redundant software processes are executed in parallel
- uses a voting scheme to select the correct result
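A voting scheme over redundant results might look like the sketch below. It assumes three hypothetical, independently written variants with hashable results, and calls them one after another for simplicity rather than in parallel as the slide describes.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of variants, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among variant results")
    return value


# Hypothetical variants of the same function; variant_c is deliberately faulty.
def variant_a(x):
    return x * x


def variant_b(x):
    return x ** 2


def variant_c(x):
    return x * x + 1


def run_redundant(x, variants=(variant_a, variant_b, variant_c)):
    # Sequential calls for simplicity; a real system could run these in parallel.
    return majority_vote([variant(x) for variant in variants])


result = run_redundant(4)
```

The faulty variant is outvoted, so execution continues with the majority result without rolling back.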

Advantages of forward recovery:
- efficient in terms of overhead (time and memory)
- suitable for real-time applications
- suitable for predicted faults
Disadvantages of forward recovery:
- must be tailored to each situation or program
- can only remove predictable errors
- requires knowledge of the error
Forward recovery is primarily used when there is no time for backward recovery.

Types of redundancy:
1) Hardware redundancy:
- includes replicated and supplementary hardware added to the system to support fault tolerance
- hardware faults are typically random, due to component aging and environmental effects
2) Software redundancy:
- also called program, modular, or functional redundancy
- software faults arise from specification and design errors or implementation (coding) mistakes

Software redundancy:
- software errors cannot be detected by simple replication of identical software units
- the remedy is to introduce diversity into the software replicas
Various views of redundant software:
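A small sketch of why identical replicas do not help: every copy contains the same fault, so comparing their outputs reveals nothing, while a diverse version exposes the discrepancy. The functions `mean_v1`, `mean_v2`, and `agree` are hypothetical examples.

```python
def mean_v1(values):
    # Contains a design fault: integer division truncates the result.
    return sum(values) // len(values)


def mean_v2(values):
    # Independently written version using true division.
    return sum(values) / len(values)


def agree(results):
    """True when all results are equal."""
    return len(set(results)) == 1


data = [1, 2]
identical = [mean_v1(data), mean_v1(data), mean_v1(data)]
diverse = [mean_v1(data), mean_v2(data)]

identical_agree = agree(identical)   # True: the shared fault goes unnoticed
diverse_agree = agree(diverse)       # False: the disagreement signals an error
```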

Temporal redundancy:
- repeats an execution using the same software and hardware resources involved in the initial, failed execution
- can overcome transient faults
Advantage of temporal redundancy:
- does not require redundant hardware or software
Cost of temporal redundancy:
- requires additional time to re-execute the failed process
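Re-execution can be sketched as a simple retry loop. The helper `retry` and the simulated `flaky_read` below are hypothetical; the transient fault is imitated by failing the first two calls.

```python
import time


def retry(operation, attempts=3, delay_s=0.0):
    """Re-execute operation up to `attempts` times; re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as err:
            last_error = err
            time.sleep(delay_s)  # the extra time is the price of temporal redundancy
    raise last_error


calls = {"n": 0}


def flaky_read():
    # Simulated transient fault: fails on the first two calls, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "sensor value"


value = retry(flaky_read, attempts=3)
```

A permanent fault would fail on every attempt, which is why this approach only helps with transient faults.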

Thanks for listening