Aerospace Software Engineering

16.35 Aerospace Software Engineering. Reliability, Availability, and Maintainability; Software Fault Tolerance. Prof. Kristina Lundqvist, Dept. of Aero/Astro, MIT.

Definitions. Software reliability: the probability that a system will operate without failure under given conditions for a given time interval, expressed on a scale from 0 to 1. Software availability: the probability that a system is functioning completely at a given instant in time, assuming that the required external resources are also available. A system that is completely up and running has availability 1; one that is unusable has availability 0.

Definitions. Software maintainability: the probability that, for a given condition of use, a maintenance activity can be carried out within a stated time interval and using stated procedures and resources. Ranges from 0 to 1.

MIL-STD-1629A failure severity. Catastrophic: a failure that may cause death or system loss. Critical: a failure that may cause severe injury or major system damage that results in mission loss. Marginal: a failure that may cause minor injury or minor system damage that results in delay, loss of availability, or mission degradation. Minor: a failure not serious enough to cause injury or system damage, but that results in unscheduled maintenance or repair.

Failure Data. Interfailure times (read left to right, in rows):
3 30 113 81 115 9 2 91 112 15
138 50 77 24 108 88 670 120 26 114
325 55 242 68 422 180 10 1146 600 15
36 4 0 8 227 65 176 58 457
10 1071 6150 3321 1045 648 5485 1160 1864 4116
Slide annotations: Type-1 uncertainty; Type-2 uncertainty.

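Measuring Reliability, Availability, and Maintainability. MTTF (mean time to failure): the average of the interfailure times. MTTR: mean time to repair. MTBF (mean time between failures): MTBF = MTTF + MTTR. Reliability: R = MTTF/(1 + MTTF). Availability: A = MTBF/(1 + MTBF). Maintainability: M = 1/(1 + MTTR).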
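The measures on this slide can be computed directly from failure and repair logs. The following is a minimal sketch; the function name `ram_measures` and the sample data are hypothetical and not taken from the lecture. Note that R, A, and M as defined here depend on the time unit chosen, so the same unit must be used throughout.

```python
def ram_measures(interfailure_times, repair_times):
    """Return MTTF, MTTR, MTBF and the slide's R, A, M measures."""
    mttf = sum(interfailure_times) / len(interfailure_times)  # average interfailure time
    mttr = sum(repair_times) / len(repair_times)              # average repair time
    mtbf = mttf + mttr
    return {
        "MTTF": mttf,
        "MTTR": mttr,
        "MTBF": mtbf,
        "R": mttf / (1 + mttf),   # reliability
        "A": mtbf / (1 + mtbf),   # availability
        "M": 1 / (1 + mttr),      # maintainability
    }


# Hypothetical data, in hours (not from the lecture).
measures = ram_measures([10, 20, 30], [1, 3])
print(measures)
```

Longer interfailure times push R toward 1, and shorter repair times push M toward 1, which matches the 0-to-1 scales in the definitions.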

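Reliability Stability and Growth. Reliability stability: the interfailure times stay the same. Reliability growth: the interfailure times increase. Probability density function f(t): the probability of failure between times $t_1$ and $t_2$ is $\int_{t_1}^{t_2} f(t)\,dt$. Reliability function: $R(t) = 1 - F(t)$, where F(t) is the cumulative distribution function of f(t). [Figure: plot of f(t) against t (time in seconds), with 1/86400 marked on the f(t) axis and 0 and 86400 marked on the t axis.]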
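As a worked illustration of the reliability function, the sketch below assumes a uniform failure density over one day, f(t) = 1/86400 for t between 0 and 86400 seconds. That reading is an assumption based on the axis values 1/86400, 0, and 86400; the function names are hypothetical.

```python
DAY_SECONDS = 86400.0  # assumed support of the uniform density


def f(t):
    """Assumed uniform probability density of failure at time t."""
    return 1.0 / DAY_SECONDS if 0.0 <= t <= DAY_SECONDS else 0.0


def F(t):
    """Cumulative distribution: probability of failure by time t."""
    return min(max(t, 0.0), DAY_SECONDS) / DAY_SECONDS


def R(t):
    """Reliability function R(t) = 1 - F(t)."""
    return 1.0 - F(t)


def failure_probability(t1, t2):
    """Integral of f(t) from t1 to t2, computed as F(t2) - F(t1)."""
    return F(t2) - F(t1)


print(R(0), R(43200), R(86400))
print(failure_probability(0, 3600))
```

Under this assumption the reliability falls linearly from 1 at t = 0 to 0 at t = 86400.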

Predicting next Failure Times from Past History

Reliability Prediction. The Jelinski-Moranda model (1972). Assumes no type-2 uncertainty. Assumes that fixing any fault contributes equally to improving the reliability.
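The two assumptions can be made concrete with the usual statement of the Jelinski-Moranda model: the program starts with N faults, each contributing the same amount φ to the hazard rate, and every fix removes exactly one fault. The hazard rate before the i-th failure is then φ(N - i + 1), and the expected time to that failure is its reciprocal. The sketch below uses hypothetical parameter values; in practice N and φ are estimated from observed interfailure times.

```python
def jm_hazard_rate(n_faults, phi, i):
    """Hazard rate before the i-th failure (i = 1 .. n_faults)."""
    if not 1 <= i <= n_faults:
        raise ValueError("i must be between 1 and n_faults")
    return phi * (n_faults - i + 1)  # each remaining fault contributes phi


def jm_expected_interfailure_times(n_faults, phi):
    """Expected time to each successive failure under the model."""
    return [1.0 / jm_hazard_rate(n_faults, phi, i)
            for i in range(1, n_faults + 1)]


# Hypothetical parameters: 3 initial faults, phi = 0.01 per time unit.
times = jm_expected_interfailure_times(3, 0.01)
print(times)
```

Because each fix lowers the hazard rate by the same amount, the expected interfailure times grow with every repair, which is the reliability growth described earlier.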

Software Fault Tolerance

Fault vs. Failure. How do faults occur? What is a failure? Faults represent problems that developers see. Failures are problems that users or customers see.

Handling Design Faults: prevention; removal; fault tolerance; input sequence workarounds.

Measures of Software Quality. Reliability is the probability of failure-free operation of a computer program in a specified environment for a specified period of time, where failure-free operation in the context of software is interpreted as adherence to its requirements [Pressman 97]. MTBF = MTTF + MTTR. Safety.

DO-178B: The goal of [software] fault tolerance methods is to include safety features in the software design or source code to ensure that the software will respond correctly to input data errors and prevent output and control errors. The need for error prevention or fault tolerance methods is determined by the system requirements and the system safety assessment process.

Software Fault Tolerance. The function of software fault tolerance is to prevent system accidents (or undesirable events, in general), and mask out faults if possible. Fault tolerance techniques: single version; multi-version.

Single Version Techniques: program structure and actions; error detection; exception handling; checkpoint and restart; process pairs; data diversity.

Program Structure and Actions: modular decomposition; partitioning (horizontal partitioning, vertical partitioning); visibility and connectivity; system closure; temporal structuring.

Error Detection: structured (replication, timing, reversal, coding, reasonableness, structural checks); ad hoc; fault trees.

Exception Handling: interface exceptions; local exceptions; failure exceptions.

Checkpoint and Restart. Checkpoint. Restart: static, dynamic. [Figure: block diagram with Input, Checkpoint, Checkpoint Memory, Execution, Error Detection, Retry, Output.]
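The checkpoint-and-restart structure can be sketched as follows: the state is saved before execution, an error-detection check examines the output, and on a detected error execution restarts from the saved state. This is a minimal illustration under stated assumptions; `run_with_checkpoint`, the retry limit, and the simulated transient fault are all hypothetical.

```python
import copy


def run_with_checkpoint(state, step, error_detected, max_retries=3):
    """Execute step(state); on a detected error, restart from the checkpoint."""
    checkpoint = copy.deepcopy(state)          # save state to "checkpoint memory"
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)    # restart from the saved state
        try:
            output = step(working)
        except Exception:
            continue                           # an exception counts as a detected error
        if not error_detected(output):
            return output
    raise RuntimeError("error persisted after all retries")


# Hypothetical transient fault: the first execution returns a corrupted value.
calls = {"count": 0}


def flaky_total(values):
    calls["count"] += 1
    if calls["count"] == 1:
        return -1                              # simulated corrupted result
    return sum(values)


total = run_with_checkpoint([1, 2, 3], flaky_total, lambda out: out < 0)
print(total, calls["count"])
```

Re-executing the same code on the same input only helps against transient faults; a deterministic design fault would fail identically on every retry.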

Process Pairs. Two identical versions of the software run on separate processors. [Figure: block diagram with Input, Primary Processor, Checkpoint, Secondary Processor, Error Detection, Selection Switch, Output.]

Data Diversity I. Input data re-expression; input re-expression with post-execution adjustment; re-expression via decomposition and recombination. [Figure: block diagram with Input, Checkpoint, Checkpoint Memory, Re-expression 1 through Re-expression n, Selection Switch, Execution, Error Detection, Retry, Output.]

Data Diversity II. [Figures: (1) Input, Input Re-Expression, Program Execution, Output. (2) Input, Input Re-Expression, Program Execution, Output Re-Expression, Output. (3) Input, Input Decomposition into x1, x2, ..., xn, Execution 1 through Execution n, Output Recombination, Output.]
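Data diversity with retry can be sketched as follows: the program runs on the original input, and if error detection rejects the result, the input is re-expressed and the same program runs again. The example below is a hypothetical sketch; `retry_block`, the simulated design fault at a single input value, and the small-perturbation re-expression are assumptions for illustration. The acceptance test is a reversal check, which recomputes the input from the output.

```python
import math


def retry_block(program, x, re_expressions, acceptance_test):
    """Run program on x, then on re-expressed inputs, until a result is accepted."""
    candidates = [lambda v: v] + list(re_expressions)  # original input first
    for re_express in candidates:
        try:
            y = program(re_express(x))
        except Exception:
            continue                                   # treat as a detected error
        if acceptance_test(x, y):
            return y
    raise RuntimeError("no re-expression produced an acceptable result")


def buggy_square(x):
    """Hypothetical program with a design fault at exactly one input value."""
    if x == 2.0:
        raise ValueError("simulated design fault")
    return x * x


def reversal_check(x, y):
    """Accept y if taking its square root recovers |x| within a tolerance."""
    return y >= 0 and abs(math.sqrt(y) - abs(x)) < 1e-6


# Approximate re-expression: nudge the input slightly off the failing point.
result = retry_block(buggy_square, 2.0, [lambda v: v + 1e-9], reversal_check)
print(result)
```

An approximate re-expression like this one only suits applications that tolerate a small change in the output; exact re-expressions or post-execution adjustment avoid that loss of accuracy.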

Multi-Version Techniques: recovery blocks; N-version programming; N self-checking programming; consensus recovery blocks; t/(n-1)-variant programming.

Recovery Blocks. [Figure: block diagram with Input, Checkpoint, Checkpoint Memory, Primary Version, Alternate V-1 through Alternate V-n, Selection Switch, Acceptance Test, Output.]
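A recovery block can be sketched as follows: the state is checkpointed, the primary version runs, and an acceptance test judges its result; if the test fails, the state is restored and the next alternate runs. The versions and the acceptance test below are hypothetical stand-ins, not code from the lecture.

```python
import copy
from collections import Counter


def recovery_block(data, versions, acceptance_test):
    """Try the primary and then each alternate until one passes the acceptance test."""
    checkpoint = copy.deepcopy(data)           # save the input state
    for version in versions:                   # primary first, then alternates
        working = copy.deepcopy(checkpoint)    # restore state before each attempt
        try:
            result = version(working)
        except Exception:
            continue
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")


def faulty_sort(items):
    """Hypothetical primary with a simulated fault: drops the largest element."""
    items.sort()
    return items[:-1]


def simple_sort(items):
    """Hypothetical alternate."""
    return sorted(items)


def is_sorted_permutation(original, result):
    """Acceptance test: result is ordered and has the same elements as the input."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(result) == Counter(original)


result = recovery_block([3, 1, 2], [faulty_sort, simple_sort], is_sorted_permutation)
print(result)
```

Restoring the checkpoint before each attempt matters here because the faulty primary modifies its input in place.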

N-Version Programming. [Figure: Input feeds Version 1, Version 2, ..., Version n; a Selection Switch produces the Output.]
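In N-version programming, independently developed versions all process the same input and a decision mechanism selects the output. The sketch below assumes the selection is a simple majority vote, which is one common choice and not necessarily the one intended on the slide; the three versions are hypothetical, and their outputs must be hashable for the vote to work.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on x and return the majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass                               # a crashed version casts no vote
    if outputs:
        value, votes = Counter(outputs).most_common(1)[0]
        if 2 * votes > len(versions):          # strict majority of all versions
            return value
    raise RuntimeError("no majority among version outputs")


# Three hypothetical versions of an absolute-value routine; the third is faulty.
def abs_v1(x):
    return abs(x)


def abs_v2(x):
    return x if x >= 0 else -x


def abs_v3(x):
    return x                                   # simulated design fault for x < 0


answer = n_version_execute([abs_v1, abs_v2, abs_v3], -5)
print(answer)
```

The vote masks the faulty version only while a majority of versions remain correct on the same input, so the technique depends on the versions failing independently.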

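N Self-Checking Programming. [Figures: (1) using acceptance tests: Input, Version 1 through Version N each followed by an Acceptance Test, Selection Switch, Output. (2) using comparison: Input, Version 1 and Version 2 feeding a Comparison, through Version N-1 and Version N feeding a Comparison, Selection Switch, Output.]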

Consensus Recovery Blocks. [Figure: block diagram with Input, Version 1, Version 2, ..., Version n, Selection Algorithm, Switch, Acceptance Test, Failure, Output.] t/(n-1)-Variant Programming: uses a diagnosability measure to isolate the faulty units to a subset of size at most (n-1), assuming there are at most t faulty units.
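A consensus recovery block combines the two previous ideas: the versions are first run as in N-version programming, and only if the selection algorithm finds no agreement are the individual results passed to an acceptance test, as in a recovery block. The sketch below assumes majority voting as the selection algorithm and list order as the ranking of versions; the versions and acceptance test are hypothetical.

```python
import math
from collections import Counter


def consensus_recovery_block(versions, x, acceptance_test):
    """Vote on the version outputs; fall back to the acceptance test if no majority."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass                               # a crashed version yields no output
    if outputs:
        value, votes = Counter(outputs).most_common(1)[0]
        if 2 * votes > len(versions):          # stage 1: consensus reached
            return value
    for y in outputs:                          # stage 2: acceptance test, in rank order
        if acceptance_test(x, y):
            return y
    raise RuntimeError("no consensus and no acceptable result")


# Hypothetical integer square root versions; two are faulty.
def isqrt_good(x):
    return math.isqrt(x)


def isqrt_faulty_a(x):
    return x // 3                              # simulated design fault


def isqrt_faulty_b(x):
    return x - 10                              # simulated design fault


def squares_back(x, y):
    """Acceptance test for perfect squares: y * y must equal x."""
    return y * y == x


fallback = consensus_recovery_block(
    [isqrt_faulty_a, isqrt_good, isqrt_faulty_b], 16, squares_back)
voted = consensus_recovery_block(
    [isqrt_good, isqrt_good, isqrt_faulty_b], 16, squares_back)
print(fallback, voted)
```

In the first call all three outputs disagree, so the acceptance test picks the correct one; in the second call the vote succeeds and the acceptance test is never consulted.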

Web-based subject evaluations are available; please fill them out. Wednesday: practice questions for the final exam.