Dependability tree 1

Similar documents
Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Issues in Programming Language Design for Embedded RT Systems

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

TSW Reliability and Fault Tolerance

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Fault Tolerance. The Three universe model

FAULT TOLERANT SYSTEMS

Fault Tolerance. Distributed Systems IT332

Diagnosis in the Time-Triggered Architecture

FAULT TOLERANT SYSTEMS

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems

Fault-tolerant techniques

Functional Safety and Safety Standards: Challenges and Comparison of Solutions AA309

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault-tolerant design techniques. slides made with the collaboration of: Laprie, Kanoon, Romano

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

Dep. Systems Requirements

Chapter 8 Fault Tolerance

Fault Tolerant Computing CS 530

Concurrent Exception Handling and Resolution in Distributed Object Systems

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Intel iapx 432-VLSI building blocks for a fault-tolerant computer

Part 2: Basic concepts and terminology

Area Efficient Scan Chain Based Multiple Error Recovery For TMR Systems

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Dependability. IC Life Cycle

Course: Advanced Software Engineering. academic year: Lecture 14: Software Dependability

Self-Checking Fault Detection using Discrepancy Mirrors

Causes of Software Failures

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems Fault Tolerance

Software Engineering: Integration Requirements

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

An Immune System Paradigm for the Assurance of Dependability of Collaborative Self-organizing Systems

Today: Fault Tolerance. Replica Management

Fault Tolerance. Distributed Systems. September 2002

In This Lecture. Transactions and Recovery. Transactions. Transactions. Isolation and Durability. Atomicity and Consistency. Transactions Recovery

Distributed Systems 24. Fault Tolerance

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Today: Fault Tolerance. Fault Tolerance

Last Class:Consistency Semantics. Today: More on Consistency

CHAPTER 3 RECOVERY & CONCURRENCY ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

PART II. CS 245: Database System Principles. Notes 08: Failure Recovery. Integrity or consistency constraints. Integrity or correctness of data

EE382C Lecture 14. Reliability and Error Control 5/17/11. EE 382C - S11 - Lecture 14 1

Recovery from failures

Approaches to Software Based Fault Tolerance A Review

Fault tolerance and Reliability

Crash Recovery. Hector Garcia-Molina Stijn Vansummeren. CS 245 Notes 08 1

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Test Oracles. Test Oracle

Implementation Issues. Remote-Write Protocols

Fault-Tolerant Embedded System

Sequential Fault Tolerance Techniques

FAULT TOLERANT SYSTEMS

Fault-Tolerant Embedded System

A Low-Cost Correction Algorithm for Transient Data Errors

6.033 Lecture Fault Tolerant Computing 3/31/2014

NOTES W2006 CPS610 DBMS II. Prof. Anastase Mastoras. Ryerson University

Fault Tolerance in Parallel Systems. Jamie Boeheim Sarah Kay May 18, 2006

Critical Systems. Objectives. Topics covered. Critical Systems. System dependability. Importance of dependability

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Outline. Failure Types

Eventual Consistency. Eventual Consistency

Introduction to the Service Availability Forum

Siewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., X

Responsive Roll-Forward Recovery in Embedded Real-Time Systems

RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E)

A SKY Computers White Paper

Error Detection And Correction

Today: Fault Tolerance. Failure Masking by Redundancy

Lecture 22: Fault Tolerance

Network Survivability

Middleware and Distributed Systems. Fault Tolerance. Peter Tröger

Appendix D: Storage Systems (Cont)

A Survey of Rollback-Recovery Protocols in Message-Passing Systems

Basic vs. Reliable Multicast

CSE 5306 Distributed Systems

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Today: Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance

A Fault-Tolerant Alternative to Lockstep Triple Modular Redundancy

Paxos provides a highly available, redundant log of events

Applying F(I)MEA-technique for SCADA-based Industrial Control Systems Dependability Assessment and Ensuring

FairCom White Paper Caching and Data Integrity Recommendations

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016

FAULT TOLERANT SYSTEMS

Fault Tolerance Dealing with an imperfect world

System modeling. Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei)

02 - Distributed Systems

Transcription:

Dependability tree 1

Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques to prevent the occurrence and introduction of faults rigorous developent, formal methods, testing, quality control methods,... (write free of bugs code) component screening, shielding, (prevent to insert external faults is not possible) 2. Fault Tolerance techniques deal with faults at run-time deliver correct service in presence of activated faults and errors 3. Fault Removal techniques remove faults in such a way that they are no more activated 4. Fault Forecasting techniques to estimate the present number, the future incidence, and the consequences of faults. Try to anticipate faults; do better design introducing fault tolerance techniques 2

Chain of threats: Faults-Errors-Failures From A. Avizienis, J.C. Laprie, B. Randell, C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, N. 1, 2004 3

Fault tolerance techniques 4

BASIC CONCEPT: Fault tolerance mechanisms detect error states (not faults) Fault tolerance techniques: carried out via error detection, error processing and fault treatment dormant fault Error Fault activation Error detection Error Recovery Error mitigation Normal operation Error processing Fault Treatment Protective redundancy: additional components or processes that mask/correct errors or faults inside a system so they do not become failures. Signal the problem to the user. 5

Another fundamental aspect is damage confinement Damage confinement: before we start to use fault tolerance redundancy, we isolate the compromised components Fault treatment fix the original problem, in such a way that it never occurs again Fault passivation - Deactivate a corrupted memory module in a computer - Broken computer no more used Phases of fault tolerance: Error Detection Damage Confinement Fault Treatment 6

Organisation of fault tolerance (techniques involved in fault tolerance) A. Avizienis, J.C. Laprie, B. Randell, C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, N. 1, 2004 7

Error detection 8

Error detection: Types of checks Reasonableness Checks Acceptable ranges of variables Acceptable transitions Divide by 0 Probable results.. Specification checks (use the definition of correct result ) Reversal Checks Examples Specification: find the solution of an equation Check: substitute results back into the original equation the specified function of the system is to compute a mathemathical function, output = F(input) if the function has an inverse function F, such that F (F(x))=x, we can: compute F output) and verify that F (output) = input 9

Error detection: Types of checks Replication Checks Based on copies and comparison of the results - two or more copies - a mechanism that compares them and declares an error if differ - the copies must be unlikely to be corrupted together in the same way Sys Sys comparator Do not protect from everything Assumption on faults is very important. Most of the time single fault assumption What is the fault model for sw? Same input, same bug in the software, they have a COMMON CAUSE FAILURE In case of hardware fault, we tolerate the error. Hw copies are not the same system. What is the fault model for hw? Faults are independent. 10

Error detection: Types of checks Codes add information to data in such a way that errors can be identified fault: bit flip mechanism: parity bit error detection: data do not satisfy the parity bit for each unit of data, e.g. 8 bits, add a parity bit so that the total number of 1 s in the resulting 9 bits is odd 10100000 1 byte parity bit Two bit flips are not detected communication channel 10100100 1 one bit flip Self-checking component a component that has the ability to automatically detect the existence of the fault and the detection occurs during the normal course of its operations Typically obtained using coding techniques: inputs and outputs are encoded (also different codes can be used) Applicable to small circuits: Clear error confinement Comparators, Voters, 11

Effectiveness of error detection (measured by) Coverage: probability that an error is detected conditional on its occurence Latency: time elapsing between the occurrence of an error and its detection (a random variable) how long errors remain undetected in the system Damage Confinement: error propagation path the wider the propagation, the more likely that errors will spread outside the system Preventing error propagation: - minimum priviledge - discriminating on type of use, users,.. - Protection mechanisms: message-passing versus sharing memory, hardware and time for authorization 12

System structuring principles (mutual suspicion) 1)Each component examines each request or data item from other components before acting on it For example, each software module checks legality and reasonableness of each request received - added overhead - need for providing signalling back to requestor and own strategy for dealing with erroneous requests 2) Make error confinement areas : Units of mitigation concept: error detection and error processing inside the module. At the boundery, the module can signal if it is faulty; Avoid errors spread over the system; create barriers at the interface of the faulty module 13

Error Recovery 14

Error Recovery There is an error state. We have applied error confinement. We want to recover. Forward recovery transform the erroneous state in a new state from which the system can operate correctly Backward recovery bring the system back to a state prior to the error occurrence - for example, recover from sw update by using the backup Backward and forward recovery can be combined if the error persists 15

Forward error recovery - Requires to assess the damage caused by the detected error or by errors propagated before detection - Usually ad hoc Example of application: real-time control systems, an occasional missed response to a sensor input is tolerable The system can recover by skipping its response to the missed sensor input. Backward error recovery Retry Redo with the same component 16

Backward error recovery Checkpointing A copy of the global state is called checkpoint. Checkpoints - may be taken automatically (periodically) or upon request by program - need to be correct (consistent) - need eventually to be discarded - survival of checkpoint data Consistency of checkpoint in distributed systems snapshot algorithms: determine past, consistent, global states Process A 1 5 d Process B 3 Process C b 4 x 2 a c 6 e Checkpoint domino effect x Error Message passed 17

Backward error recovery: Checkpointing save the state from time to time; restore the state if you have problems Loss: - computation time between the checkpointing and the rollback - data received during that interval Basic issues: - Checkpointing/rollback (resetting the system and process state to the state stored at the latest checkpoint) need mechanisms in run-time support - take a checkpoint before each message sent/received or other external events - take coordinated checkpointing - overhead of saving system state (minimize the amount of state information that must be saved) Class of faults for which checkpoint is useful: transient faults (disapper by themselves) used in massive parallel computing, to avoid to restart all things from the beginning (continue the computation from the checkpoint, saving the state from time to time) Class of faults for which checkpoint is not useful: harware fault; design faults (the system redo the same things) 18

Error mitigation Basic strategy for fault tolerance - Compensation (on demand / systematically ) fault masking A general method to achieve fault tolerance is to perform multiple computations through multiple channels, either sequencially or concurrently Tolerance of physical faults channels may be of identical design (we have the assumption that hardware components fail independently ) Tolerance of software faults channels must implement the same function via separate designs and implementations (design diversity) 19

Triple Modular Redundancy (TMR) fault masking Module 1 Module 2 Module 3 Voter output Triplicate the modules and perform a majority vote to determine the output of the system - 2/3 of the modules must deliver the correct results - effects of faults neutralised without notification of their occurrence - masking of a failure in any one of the three copies Sometimes some failures in two or more modules may occurr in such a way that a failure is avoided (compensating failures) Disadvantage: loss of protective redundancy Practical implementations of compensation: masking and recovery (includes error detection and fault handling) 20

Various starategies for implementing fault tolerance Choice of the strategy depends upon the underlying fault assumption that is being considered in the development process The classes of faults that can actually be tolerated depend on the fault assumption and on the independence of the redundancies with respect to the fault creation and activation 21

Interaction with the external world State of a computation - Program visible variables - Hidden variables (process descriptors, ) - External state : files, outside words (for example alarm already given to the aircraft pilot, Compensating actions may be also necessary for computing systems that interact with the outside world If an external communication is not correct, the computer may still limit or undo the damage by a compensating action Example: a cash dispensing machine gives less money compenating action: tell the bank and ask for the money back 22

Example: assume a real-time program communicated with its environment and backward error recovery is invoked the environment would not be able to recover along with the program and the system would be left in an inconsistent state. In this case, Forward recovery would help return the system to a consistent state by sending the environment a message informing it to disregard previous output from the program. 23

Observations Fault assumptions play a fundamental role Fault tolerance applies to all classes of faults Mechanisms that implements fault tolerance should be protected against the faults that might affect them 24

Fault handling 25

Fault location 1. can the error detection mechanism identify the faulty component/task with sufficient precision? - LOG and TRACES are important - diagnostic checks - codes - 2. What if diagnostic information / testing components are themselves damaged? 3. System level diagnosis: A system is a set of modules: - who tests whom is described by a testing graph - checks are never 100% certain Suppose A tests B. If B is faulty, A has a certain probability (we hope close to 100%) of finding out. But if A is faulty too, it might conclude B is OK; or says that C is faulty when it isn t 26

Fault treatment 1. Faulty components could not be left in the system - faults can add up over time 2. Reconfigure faulty components out of the system - physical reconfiguration turn off power, disable from bus access,.. - logical reconfiguration: don t talk, don t listen to it 3. Excluding faulty components will in the end exhaust available redundancy -insertion of spares -reinsertion of excluded component after thorough testing, possibly repair 4. Newly inserted components may require: - reallocation of software components - bringing the recreated components up to current state 27

Various strategies for implementing fault tolerance A. Avizienis, J.C. Laprie, B. Randell, C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, N. 1, 2004 Solid faults: permanent faults whose activation is reproducible Elusive faults: permanent faults whose activation is not systematically reproducible (e.g, conditions that occur in relation to the system load, pattern sensitive faults in semiconductor memories, ) Intermittent faults: transient physical faults + elusive development faults 28

Observations Fault tolerance uses replication for error detection and system recovery Error detection must be a trustworthy mechanism Fault tolerance relies on the independency of redundancies with respect to the process of fault creation and activations When tolerance to physical faults is foreseen, the channels may be identical, based on the assumption that hardware components fail independently When tolerance to design faults is foreseen, channels have to provide identical service through separate designs and implementation (through design diversity) Fault masking will conceal a possibly progressive and eventually fatal loss of protective redundancy. Practical implementations of masking generally involve error detection (and possibly fault handling), leading to masking and error detection and recovery. 29