Dependability Threats

1 Dependable Systems Dependability Threats Dr. Peter Tröger Operating Systems Group

2 Dependability
"Dependability is defined as the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered is the behavior as it is perceptible to users; a user is another system (human or physical) which interacts with the former." (J. C. Laprie)

3 Dependability Tree

4 Dependability Threats
- Threat: unintended event or state
- Failure (German: "Ausfall"): event in which the system no longer complies with the specification
- Error (German: "Fehler(zustand)"): part of the system state that can lead to a subsequent failure
- Fault (German: "Fehler(ursache)"): adjudged or hypothesized cause of an error
- An error is a state; faults and failures are events in time
- Treating failures is repair; treating or avoiding errors is maintenance

5 Alternative Definitions

6 Chain of Dependability Threats
- Separation of system states and the events leading to them
- Classical version according to Laprie
[Figure: state diagram of the threat chain. Labels: External Fault, Internal Fault, Normal, Dormant Fault, Activation, Active Fault / Latent Error, Detection, Detected Error, Error Handling, Restoration, Failure, Outage]

7 System?
- Integrated combination of humans, products, and processes
- Functional and non-functional specification
- In the IT world, the "products" are typically called components
  - Interacting with each other and the environment
- IT systems are organized in layers (recursive definition)
  - Hardware executes operating system
  - Operating system executes application
  - Application executes plugin

8 IT System
- Current state of a layer is either correct or incorrect
- Focus on some chosen investigated layer
- Fault tolerance: dealing with detectable incorrect states of the investigated layer (I) and the execution layer (E)
- Failure: externally visible incorrect state of the investigated layer
"The total state of a given system is the set of the following states: computation, communication, stored information, interconnection, and physical condition." (A. Avižienis)

9 Chain of Dependability Threats (with layers)
[Figure: state diagram of the threat chain over the states of the investigated layer (I) and the execution layer (E). Transition labels: Activation, Deactivation, Detection, Enabling, Disabling, Recovery, Mitigation, Restoration, FAIL. The state-set annotations are not recoverable from the transcription.]

10 Chain vs. Propagation
- The chain of dependability threats occurs in one investigated layer
  - Software bug = fault
  - May lead to wrong variable value = error
  - May lead to exception being thrown = failure
- What happens if the problem leaves the layer?
- Example: redundant RAID array with mirroring, single disk fails
  - Resulting error state in the RAID system ("red light")
  - May or may not propagate to the operating system
- Error propagation: a failure in one layer is the fault in another layer
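The RAID example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DiskMirror` class, its methods, and the `os_read` function are hypothetical names invented here, not part of the slides or of any real RAID implementation.

```python
class DiskMirror:
    """Investigated layer: a toy two-disk mirror (hypothetical model)."""

    def __init__(self):
        self.disks = [{"ok": True, "data": {}}, {"ok": True, "data": {}}]

    def break_disk(self, index):
        # Fault: the event that puts this layer into an incorrect state.
        self.disks[index]["ok"] = False

    @property
    def degraded(self):
        # Error: part of the internal state is incorrect ("red light"),
        # but the service can still be delivered.
        return any(not disk["ok"] for disk in self.disks)

    def write(self, key, value):
        for disk in self.disks:
            if disk["ok"]:
                disk["data"][key] = value

    def read(self, key):
        for disk in self.disks:
            if disk["ok"]:
                return disk["data"][key]
        # Failure: the incorrect state becomes visible at the interface.
        raise OSError("no working disk left")


def os_read(mirror, key):
    """Upper layer: a failure of the mirror is a fault for this layer."""
    try:
        return mirror.read(key)
    except OSError:
        return None  # error handling inside the upper layer
```

With one broken disk the mirror is in an error state but still answers reads, so nothing propagates. Only when the second disk breaks does the mirror fail, and that failure arrives in the upper layer as a fault it has to handle.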

11 Error Propagation

12 Fault Classification
- High diversity in possible sources and types

13 Observations on Faults
- An external fault is a design fault: inability or refusal to foresee all situations
- Design faults are created during system development, system modification, or operational procedure creation and establishment
  - Merely replacing a broken component with the same version leads to recurrent faults
- Physical faults are accidental faults
- Temporary external accidental physical faults are also called transient faults
- Temporary internal accidental faults are also called intermittent faults
  - Examples: pattern-sensitive memory hardware, system overload
  - Arbitrary concept: permanent faults with unknown activation condition
- Intentional and design faults are human-made faults, and might be malicious faults
- Hardware production defects are typically physical faults

14 Observations on Faults
- A fault is active when it produces an error
- A non-active internal fault is a dormant / passive fault
  - Origin in hardware: often cycling between dormant and active
- Many specialized versions of the term "fault", e.g. bug
  - Heisenbug: resulting error disappears by itself
  - Bohrbug: resulting error is independent from execution state
  - Mandelbug: leads only to an error under specific conditions
- Fault-tolerant system design is a contradiction
  - Design demands specification, faults are non-specified cases
  - Solution: specification for fault-free case + additional fault model

15 Fault Model
- Faults can be classified on different abstraction levels
  - Physics
  - Circuit level / switching circuit level
    - Interesting for hardware design research (not this course)
    - Investigate logical signals on connections
    - Stuck-at-zero, stuck-at-one, bridging faults, stuck-open
  - Register transfer level
  - Processor-memory-switch (PMS) level
  - Hardware system level
  - ... (Software)
- Extremely important when talking about dependability means

16 Physical Faults
- Highly energized particles from space, atmospheric, or ground radiation
- Influence of a particle that strikes a circuit: atomic displacement, direct ionization, indirect ionization created by nuclear reactions
- Smaller structures are more sensitive to ionization effects
- Single Event Upset (SEU)
  - Injected charge modifies hardware state temporarily
  - Can happen in memory and logic hardware
- Detected Unrecoverable Error (DUE) / Silent Data Corruption (SDC)
  - Problem becomes permanent
  - May be detected or undetected
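The effect of a single event upset on stored data can be illustrated with a one-bit flip in a machine word. The sketch below assumes a 32-bit unsigned word; the function name `flip_bit` is made up for this illustration.

```python
def flip_bit(word, bit, width=32):
    """Return `word` with one bit inverted, as a single bit flip would leave it."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    mask = (1 << width) - 1
    return (word ^ (1 << bit)) & mask


value = 1000
low = flip_bit(value, 0)    # least significant bit: small deviation
high = flip_bit(value, 31)  # most significant bit: huge deviation
```

The same physical event therefore produces very different error magnitudes depending on which bit is hit. Flipping the same bit a second time restores the original value, which is why overwriting the affected cell removes a temporary upset.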

17 Single Event Upset

18 Fault Model for Semiconductor Memories
- Stuck-at-1 or stuck-at-0 (hard) faults
- Transition / bit-flip faults (0->1, 1->0)
- Multiple writing: data written into more than one cell on a write attempt to one cell
- Pattern sensitivity: device does not perform reliably with certain data pattern(s)
- Write recovery: write followed by read/write at a different location results in read/write at the same location
- Sense amplifier recovery: data accessed remains the same for a number of cycles and then suddenly changes
- Bridging fault: short between cells, AND type or OR type
- State coupling fault: coupled (victim) cell is forced to 0 or 1 if coupling (aggressor) cell is in a given state
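Two of these fault types, stuck-at and state coupling, are easy to simulate. The following is a minimal sketch, not a real memory test: `FaultyMemory`, its injection methods, and `find_faulty_cells` are hypothetical names, and the test pattern (all zeros, then all ones) is the simplest possible one.

```python
class FaultyMemory:
    """Array of 1-bit cells with injectable stuck-at and state coupling faults."""

    def __init__(self, size):
        self.cells = [0] * size
        self.stuck = {}      # cell index -> forced value
        self.coupling = []   # (aggressor, aggressor_state, victim, forced_value)

    def inject_stuck_at(self, cell, value):
        self.stuck[cell] = value
        self._apply_faults()

    def inject_state_coupling(self, aggressor, state, victim, forced):
        self.coupling.append((aggressor, state, victim, forced))
        self._apply_faults()

    def _apply_faults(self):
        for cell, value in self.stuck.items():
            self.cells[cell] = value
        for aggressor, state, victim, forced in self.coupling:
            if self.cells[aggressor] == state:
                self.cells[victim] = forced

    def write(self, cell, value):
        self.cells[cell] = value
        self._apply_faults()

    def read(self, cell):
        return self.cells[cell]


def find_faulty_cells(memory):
    """Write all zeros, then all ones, and report cells that read back wrong."""
    faulty = set()
    for pattern in (0, 1):
        for cell in range(len(memory.cells)):
            memory.write(cell, pattern)
        for cell in range(len(memory.cells)):
            if memory.read(cell) != pattern:
                faulty.add(cell)
    return sorted(faulty)
```

For example, a memory with cell 2 stuck at 1 and cell 3 forced to 0 whenever cell 0 holds 1 reports cells 2 and 3. A coupling fault whose forced value happens to match the written pattern stays invisible to this simple test, which hints at why pattern-dependent faults need more elaborate test sequences.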

19 Software?
- All bugs are permanent design faults
  - Ignoring user demands
  - Ignoring special properties of the system environment
  - Incomplete specification of dependability requirements
  - Incomplete documentation
- Example for software fault model: Orthogonal Defect Classification (ODC)
  - Any requirement to change the product is a defect
  - Defect trigger: what makes the defect surface
  - Defect type: nature of the fix you put on the defect

20 ODC Security Defect Types
[Figure: chart of ODC security defect types; category labels and values are not recoverable from the transcription]

21 Errors
- An error escalates to a failure depending on
  - intentional / unintentional redundancy
  - system activity
  - specification of a failure case from user perspective (e.g. maximum outage time, acceptable delay, retransmission rate)
- System activity can reverse the error state before damage happens
- Latent (not recognized) vs. detected error resulting from an active fault
- Hardware often contains unintentional redundancy, which makes it difficult to test

22 Hardware Error Models
- Hardware faults affect state information, e.g. register values
  - Stuck-at and other hardware faults can therefore also be denoted as errors
- More interesting to investigate the resulting effects on system level
  - Single data error: program data is corrupted (in cache, memory, or register)
  - Single code error: effect on one instruction of the code
    - Type 1 / type 2: code modification without / with change of control flow
- Nature of the error state may conform to the nature of the originating fault
  - Transient vs. permanent, static vs. dynamic, single vs. multiple
  - Depends on utilized dependability means

23 Hardware Error Models
- Mapping of hardware-level single bit-flip error to other layers
- Memory data segment, processor data cache: system-level single data error
- Memory code segment, processor code cache: system-level single code error of type 1 (modification of target register) or type 2 (modification of branch target)
- Memory stack segment: system-level data error or type 2 code error
- Processor register: depending on processor architecture and register type
  - Single data error, if register holds data interpreted by the application
  - Single type 1 code error, if register holds address used by load/store operation
  - Single type 2 code error, if register holds address of a branch target
- Processor control register: everything could happen ...

24 Hardware Error Models - Code Errors
Original program:
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BNZ LOOP
Corrupted variants (one instruction changed in each):
    LOOP: SUB R1, R1    instead of LOOP: ADD R1, R1 (changed operation)
    BNZ FOOBAR          instead of BNZ LOOP (changed branch target)
    BZ LOOP             instead of BNZ LOOP (changed branch condition)
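To see what each corrupted instruction does to the result, the programs can be run in a tiny interpreter. This is a sketch under assumptions the slide does not state: the instruction semantics (`ADD`/`SUB` set a zero flag, `BNZ`/`BZ` branch on it), the step limit, and the names `run`, `ORIGINAL`, and `mutate` are all chosen here for illustration.

```python
def run(program, max_steps=1000):
    """Interpret a tiny assembly-like program and return the final registers."""
    labels, code = {}, []
    for line in program:
        if ":" in line:
            label, line = line.split(":", 1)
            labels[label.strip()] = len(code)
        op, _, rest = line.strip().partition(" ")
        code.append((op, [arg.strip() for arg in rest.split(",")]))

    regs, zero, pc = {}, False, 0
    for _ in range(max_steps):
        if pc >= len(code):
            return regs
        op, args = code[pc]
        pc += 1
        if op == "MOV":
            regs[args[0]] = int(args[1])
        elif op in ("ADD", "SUB"):
            rhs = regs[args[1]] if args[1] in regs else int(args[1])
            regs[args[0]] += rhs if op == "ADD" else -rhs
            zero = regs[args[0]] == 0   # assumed: arithmetic sets the zero flag
        elif op in ("BNZ", "BZ"):
            if zero == (op == "BZ"):
                pc = labels[args[0]]    # KeyError models a jump to a bad target
        else:
            raise ValueError("unknown instruction: " + op)
    raise RuntimeError("step limit exceeded")


ORIGINAL = ["MOV R0, 10", "MOV R1, 1", "LOOP: ADD R1, R1", "SUB R0, 1", "BNZ LOOP"]


def mutate(program, index, new_line):
    """Return a copy of the program with one instruction replaced."""
    changed = list(program)
    changed[index] = new_line
    return changed


good = run(ORIGINAL)                                     # R1 doubled ten times
wrong_op = run(mutate(ORIGINAL, 2, "LOOP: SUB R1, R1"))  # R1 is zeroed
wrong_cond = run(mutate(ORIGINAL, 4, "BZ LOOP"))         # loop left after one pass
```

Under these assumed semantics the changed operation corrupts only the data, the changed branch condition silently shortens the loop, and the changed branch target (`BNZ FOOBAR`) makes the interpreter raise `KeyError`, which stands in for a jump to an arbitrary address.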

25 Software Error Models
- Similar terminology, but completely different semantics
- Syntactic errors are handled by the compiler, semantic errors occur at runtime
- Static vs. dynamic, permanent vs. temporary errors
- Example for the C programming language
  - Errors affecting assignments (missing / wrong local variable values)
  - Errors affecting conditional instructions (wrong boolean or iteration condition)
  - Errors affecting function call / return (wrong parameters, return statement)
  - Errors affecting algorithms (missing statements or function calls, wrong operators)
- Under research in the software engineering field: field studies, automated code analysis, developer interviews

26 Error Message Occurrence
- The same fault can lead to different (detected or undetected) errors
- Errors become detected by error detection mechanisms
  - Some errors are detected by several detectors
  - Some detectors report several errors as one
  - Some errors are never uncovered
- Detected errors might not be logged if the system stops too quickly

27 Failures
- Visible non-compliance of the system with the specification
  - Failure effect: why is this failure worth investigating?
  - Failure mode: type of failure in relation to the functionality of the system
  - Failure mechanism: how can this happen?
- Failure models are well-known in distributed software systems
- Classical categorization in the onion model [Barborak, Cristian]

28 Onion Model
- Assumption: system of components sending messages to each other
  - Maps to hardware with electrical signals
  - Maps to distributed software systems
- Fail-Stop Failure: no more messaging, other components are informed
- Crash Failure: no more messaging, no information
- Omission Failure: messages are omitted for some time
- Timing Failure: reaction to a message or sending of a message is too early / late
- Computation Failure: wrong answer message to a correctly received message
- Byzantine Failure: anything
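One way to read the onion picture is as a strict nesting in which every class contains all classes listed before it. The sketch below encodes that reading; the strict nesting, the numeric ranks, and the names `FailureClass` and `covers` are assumptions made here, not statements from the slide.

```python
from enum import IntEnum


class FailureClass(IntEnum):
    """Failure classes, ordered from innermost to outermost onion layer."""
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    COMPUTATION = 5
    BYZANTINE = 6


def covers(assumed, observed):
    """True if a design that assumes class `assumed` also covers `observed`.

    Assumes strict nesting: an outer layer includes every inner layer.
    """
    return observed <= assumed


# A design built for omission failures also copes with crashes,
# but not with arbitrary (Byzantine) behavior.
crash_ok = covers(FailureClass.OMISSION, FailureClass.CRASH)
byzantine_ok = covers(FailureClass.OMISSION, FailureClass.BYZANTINE)
```

The ordering makes the cost trade-off explicit: the further out the assumed class, the fewer assumptions the design makes about faulty components, and the more expensive tolerating them tends to be.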

29 Failure Severity
- Denotes consequences of failure
- Benign failures
  - Failure costs and operational benefits are similar
  - Sometimes also umbrella term for failures only detected by inspection
  - A system with only such failures is fail-safe
- Catastrophic failures
  - Costs of failure consequences are much larger than service benefit
- Grading depends on application
  - Flying airplane: Fail-Stop is catastrophic
  - Train: Fail-Stop is benign
- Criticality: highest severity of possible failure modes in the system

30 Example: DO-178B Standard
- "Software Considerations in Airborne Systems and Equipment Certification"
- Mature document, developed for more than 20 years
- Definition of severity of failure conditions for airplane, crew, and passengers
  - Catastrophic: loss of ability to continue safe flight and landing
  - Major: reduced airplane or crew capability to cope with operating conditions
    - Reduction in safety margins and functional capabilities
    - Higher workload or physical distress for the crew
  - Minor: not significantly reduced airplane safety, slight increase in workload (example: change of flight plan)
  - No effect: failure results in no loss of operational capabilities and no increase in crew workload

31 Example: DO-178B Standard

32 Example: ISO 26262
- Functional safety of automotive systems
- Severity of failures expressed as Automotive Safety Integrity Level (ASIL)
  - Controllability: can the driver compensate?
  - Severity: how bad are the consequences?
  - Exposure: how often does that happen?
[Figure: risk diagram with the axes "Severity (S) of injuries" and "Controllability (C)", each ranging from "ok" to "bad", separating the regions "Failure risk acceptable" and "Failure risk not acceptable"]
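The slide does not reproduce the ASIL determination table. As a rough sketch, the three parameters can be combined with the widely quoted sum shortcut for the class ranges S1-S3, E1-E4, and C1-C3. The function name `asil` and the use of this shortcut are assumptions made here; for real work the table in the standard itself is authoritative.

```python
def asil(severity, exposure, controllability):
    """Estimate the ASIL from S (1-3), E (1-4), and C (1-3) class numbers.

    Uses the commonly cited sum shortcut: 10 -> D, 9 -> C, 8 -> B, 7 -> A,
    anything lower -> QM (quality management, no ASIL required).
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4
            and 1 <= controllability <= 3):
        raise ValueError("class number out of range")
    total = severity + exposure + controllability
    return {10: "D", 9: "C", 8: "B", 7: "A"}.get(total, "QM")


worst_case = asil(3, 4, 3)     # severe, frequent, hard to control
controllable = asil(3, 4, 1)   # same hazard, but easy for the driver to handle
```

The second call shows the effect of controllability: the same severe and frequent hazard drops by two levels when the driver can usually compensate.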

33 Wording and Numbers

34 Wording and Numbers

35 Observations on Failures
- Failures and system load are correlated
  - Load can lead to wear-out, so the failure probability increases
  - Higher load can activate dormant faults
  - Detected faults lead to recovery activities, which again increase the load
  - Possibility for unintended feedback effects in complex systems
- Common-cause failures: multiple parts are impacted for the same reason
  - Cascade failures through common dependency (e.g. power)
  - Secondary failures from inappropriate environment (e.g. temperature)
  - Common-mode failures from bad design (e.g. identical redundant units)

36 Example: Amazon EBS
- Failure of Amazon cloud services in 2012
  - Major web sites were down (Reddit, Netflix, Airbnb, ...)
- Report about root cause
  - Large number of cloud storage servers could no longer handle requests
  - Low-priority service with memory leak was eating all resources
  - Reason was repeated connection attempts to the monitoring server
  - Monitoring server was not reachable due to DNS misconfiguration
  - DNS change was caused by the replacement of an unrelated hardware unit
- Example of a cascade failure

37 Fail-Fast
- A common concept from system engineering, company management, ...
- Report failure and stop immediately without further action
- Discussed by Jim Gray in 1985 as part of his famous article "Why do computers stop and what can be done about it?"
- Useful when the benefit from recovery does not justify its costs, or if error propagation is highly probable
  - Single units of a redundant set
  - Deeply interwired IT system components
  - Components under heavy request load
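For the "single units of a redundant set" case, fail-fast behavior might look like the sketch below. Everything in it is hypothetical: the `Replica` class, its shadow-copy check as the error detection mechanism, and the `increment_all` helper were invented for this illustration.

```python
class ReplicaStopped(Exception):
    """Raised by a replica that has detected an error and stopped."""


class Replica:
    def __init__(self, name):
        self.name = name
        self.counter = 0
        self.shadow = 0       # redundant copy, used only for error detection
        self.stopped = False

    def increment(self):
        if self.stopped:
            raise ReplicaStopped(self.name)
        if self.counter != self.shadow:
            # Fail-fast: report and stop, no attempt to repair the state.
            self.stopped = True
            raise ReplicaStopped(self.name)
        self.counter += 1
        self.shadow += 1
        return self.counter


def increment_all(replicas):
    """Apply the request to every replica; answer from the first healthy one."""
    results = []
    for replica in replicas:
        try:
            results.append(replica.increment())
        except ReplicaStopped:
            continue
    if not results:
        raise RuntimeError("all replicas stopped")
    return results[0]
```

Because a corrupted replica stops instead of answering, its wrong value never reaches the caller, and the remaining replica keeps the service available. The price is that the stopped unit stays out of service until something outside this sketch repairs or replaces it.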

38 Literature
- Laprie, J. Dependability. Basic Concepts and Terminology. (Springer, 1998).
- Hansen, J. P. & Siewiorek, D. P. Models for time coalescence in event logs. in IEEE Proceedings of International Symposium on Fault-Tolerant Computing (FTCS-22) (1992). doi: /ftcs
- Hunny, U., Zulkernine, M. & Weldemariam, K. OSDC: Adapting ODC for Developing More Secure Software. in Proceedings of the 28th Annual ACM Symposium on Applied Computing (ACM, 2013). doi: /
- ISO. Road vehicles - Functional safety - Part 3: Concept phase (ISO ). (2011).
- Thomas K. Ferrell & Uma D. Ferrell. RTCA DO-178B/EUROCAE ED-12B. in The Avionics Handbook (CRC Press, 2001).
- Goloubeva, O., Rebaudengo, M., Reorda, M. & Violante, M. Software-Implemented Hardware Fault Tolerance. (Springer, 2010).


More information

Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters

Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters Enabling Testability of Fault-Tolerant Circuits by Means of IDDQ-Checkable Voters ECE 7502 Class Discussion Ningxi Liu 14 th Apr 2015 ECE 7502 S2015 Customer Validate Requirements Verify Specification

More information

Diagnosis in the Time-Triggered Architecture

Diagnosis in the Time-Triggered Architecture TU Wien 1 Diagnosis in the Time-Triggered Architecture H. Kopetz June 2010 Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: A physical subsystem the

More information

Distributed Systems

Distributed Systems 15-440 Distributed Systems 11 - Fault Tolerance, Logging and Recovery Tuesday, Oct 2 nd, 2018 Logistics Updates P1 Part A checkpoint Part A due: Saturday 10/6 (6-week drop deadline 10/8) *Please WORK hard

More information

Evaluation of Embedded Operating System by a Software Method *

Evaluation of Embedded Operating System by a Software Method * Jan. 2006, Volume 3, No.1 (Serial No.14) Journal of Communication and Computer, ISSN1548-7709, USA * Junjie Peng 1, Jun Ma 2, Bingrong Hong 3 (1,3 School of Computer Science & Engineering, Harbin Institute

More information

FUNCTIONAL SAFETY AND THE GPU. Richard Bramley, 5/11/2017

FUNCTIONAL SAFETY AND THE GPU. Richard Bramley, 5/11/2017 FUNCTIONAL SAFETY AND THE GPU Richard Bramley, 5/11/2017 How good is good enough What is functional safety AGENDA Functional safety and the GPU Safety support in Nvidia GPU Conclusions 2 HOW GOOD IS GOOD

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

DEPENDABLE PROCESSOR DESIGN

DEPENDABLE PROCESSOR DESIGN DEPENDABLE PROCESSOR DESIGN Matteo Carminati Politecnico di Milano - October 31st, 2012 Partially inspired by P. Harrod (ARM) presentation at the Test Spring School 2012 - Annecy (France) OUTLINE What?

More information

EDA Support for Functional Safety How Static and Dynamic Failure Analysis Can Improve Productivity in the Assessment of Functional Safety

EDA Support for Functional Safety How Static and Dynamic Failure Analysis Can Improve Productivity in the Assessment of Functional Safety EDA Support for Functional Safety How Static and Dynamic Failure Analysis Can Improve Productivity in the Assessment of Functional Safety by Dan Alexandrescu, Adrian Evans and Maximilien Glorieux IROC

More information

Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy

Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy Wei Chen, Rui Gong, Fang Liu, Kui Dai, Zhiying Wang School of Computer, National University of Defense Technology,

More information

Communication Networks for the Next-Generation Vehicles

Communication Networks for the Next-Generation Vehicles Communication Networks for the, Ph.D. Electrical and Computer Engg. Dept. Wayne State University Detroit MI 48202 (313) 577-3855, smahmud@eng.wayne.edu January 13, 2005 4 th Annual Winter Workshop U.S.

More information

Critical Systems. Objectives. Topics covered. Critical Systems. System dependability. Importance of dependability

Critical Systems. Objectives. Topics covered. Critical Systems. System dependability. Importance of dependability Objectives Critical Systems To explain what is meant by a critical system where system failure can have severe human or economic consequence. To explain four dimensions of dependability - availability,

More information

A Diversity of Duplications

A Diversity of Duplications A Diversity of Duplications David Powell Special event «Dependability of computing systems, Memories and future» in honor of Jean-Claude Laprie LAAS-CNRS, Toulouse, 16 April 2010 Duplication error Detection

More information

Safety-critical embedded systems, fault-tolerant control systems, fault detection, fault localization and isolation

Safety-critical embedded systems, fault-tolerant control systems, fault detection, fault localization and isolation Fault detection in safety-critical embedded systems nomen VERBER i, MA TJAl COLNARIC i, AND WOLFGANG A. HALANG 2 JUniversity of Maribor, Faculty of Electrical Engineering and Computer Science, 2000 Maribor,

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Chap 2. Introduction to Software Testing

Chap 2. Introduction to Software Testing Chap 2. Introduction to Software Testing 2.1 Software Testing Concepts and Processes 2.2 Test Management 1 2.1 Software Testing Concepts and Processes 1. Introduction 2. Testing Dimensions 3. Test Concepts

More information

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. SOFTWARE ENGINEERING SOFTWARE RELIABILITY Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. LEARNING OBJECTIVES

More information

System modeling. Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei)

System modeling. Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei) System modeling Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei) Budapest University of Technology and Economics Department of Measurement and Information Systems Contents Concept

More information

CIS 5373 Systems Security

CIS 5373 Systems Security CIS 5373 Systems Security Topic 1: Introduction to Systems Security Endadul Hoque 1 Why should you care? Security impacts our day-to-day life Become a security-aware user Make safe decisions Become a security-aware

More information

11. SEU Mitigation in Stratix IV Devices

11. SEU Mitigation in Stratix IV Devices 11. SEU Mitigation in Stratix IV Devices February 2011 SIV51011-3.2 SIV51011-3.2 This chapter describes how to use the error detection cyclical redundancy check (CRC) feature when a Stratix IV device is

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

VLSI Test Technology and Reliability (ET4076)

VLSI Test Technology and Reliability (ET4076) VLSI Test Technology and Reliability (ET4076) Lecture 4(part 2) Testability Measurements (Chapter 6) Said Hamdioui Computer Engineering Lab Delft University of Technology 2009-2010 1 Previous lecture What

More information

By V-cubed Solutions, Inc. Page1. All rights reserved by V-cubed Solutions, Inc.

By V-cubed Solutions, Inc.   Page1. All rights reserved by V-cubed Solutions, Inc. By V-cubed Solutions, Inc. Page1 Purpose of Document This document will demonstrate the efficacy of CODESCROLL CODE INSPECTOR, CONTROLLER TESTER, and QUALITYSCROLL COVER, which has been developed by V-cubed

More information

Automated Freedom from Interference Analysis for Automotive Software

Automated Freedom from Interference Analysis for Automotive Software Automated Freedom from Interference Analysis for Automotive Software Florian Leitner-Fischer ZF TRW 78315 Radolfzell, Germany Email: florian.leitner-fischer@zf.com Stefan Leue Chair for Software and Systems

More information

9. SEU Mitigation in Cyclone IV Devices

9. SEU Mitigation in Cyclone IV Devices 9. SEU Mitigation in Cyclone IV Devices May 2013 CYIV-51009-1.3 CYIV-51009-1.3 This chapter describes the cyclical redundancy check (CRC) error detection feature in user mode and how to recover from soft

More information

Safety Instructions 1-1 Avoid unintended Start General Description 2-2

Safety Instructions 1-1 Avoid unintended Start General Description 2-2 Contents Contents 1 Safety and precautions 1-1 Safety Instructions 1-1 Avoid unintended Start. 1-1 2 Introduction 2-1 General Description 2-2 3 Supported Configuration 3-1 Introduction 3-1 Fixed-speed

More information

Understanding and Managing Cascading Disasters A Framework for Analysis

Understanding and Managing Cascading Disasters A Framework for Analysis Understanding and Managing Cascading Disasters A Framework for Analysis David Alexander and Gianluca Pescaroli Cascading Disasters Research Group Institute for Risk and Disaster Reduction University College

More information

Soft-error Detection Using Control Flow Assertions

Soft-error Detection Using Control Flow Assertions Soft-error Detection Using Control Flow Assertions O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino, Dipartimento di Automatica e Informatica Torino, Italy Abstract Over

More information

Lecture 5 Safety Analysis FHA, HAZOP

Lecture 5 Safety Analysis FHA, HAZOP Lecture 5 Safety Analysis FHA, HAZOP Introduction While designing a safety-critical system usually several safety analysis techniques are applied The idea is to achieve completeness of safety requirements,

More information

Introduction to Robust Systems

Introduction to Robust Systems Introduction to Robust Systems Subhasish Mitra Stanford University Email: subh@stanford.edu 1 Objective of this Talk Brainstorm What is a robust system? How can we build robust systems? Robust systems

More information

Dependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8

Dependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8 TDDD7 Real-time Systems Lecture 7 Dependability & Fault tolerance Simin Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Dependability and real-time If a system

More information

Model-Based Safety Approach for Early Validation of Integrated and Modular Avionics Architectures

Model-Based Safety Approach for Early Validation of Integrated and Modular Avionics Architectures Model-Based Safety Approach for Early Validation of Integrated and Modular Avionics Architectures Marion Morel THALES AVIONICS S.A.S., 31036 Toulouse, France marion.morel@fr.thalesgroup.com Abstract. Increasing

More information

Bridge Course On Software Testing

Bridge Course On Software Testing G. PULLAIAH COLLEGE OF ENGINEERING AND TECHNOLOGY Accredited by NAAC with A Grade of UGC, Approved by AICTE, New Delhi Permanently Affiliated to JNTUA, Ananthapuramu (Recognized by UGC under 2(f) and 12(B)

More information

Driver Assistance Pushes New Flash Functionalities

Driver Assistance Pushes New Flash Functionalities Driver Assistance Pushes New Flash Functionalities Anil Gupta Technical Executive Winbond Electronics Corporation Santa Clara, CA 1 Automotive and ADAS terminology ECC use to increase reliability of Flash

More information

Error Resilience in Digital Integrated Circuits

Error Resilience in Digital Integrated Circuits Error Resilience in Digital Integrated Circuits Heinrich T. Vierhaus BTU Cottbus-Senftenberg Outline 1. Introduction 2. Faults and errors in nano-electronic circuits 3. Classical fault tolerant computing

More information

Autonomous Driving From Fail-Safe to Fail-Operational Systems

Autonomous Driving From Fail-Safe to Fail-Operational Systems Autonomous Driving From Fail-Safe to Fail-Operational Systems Rudolf Grave December 3, 2015 Agenda About EB Automotive Autonomous Driving Requirements for a future car infrastructure Concepts for fail-operational

More information

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,

More information

Review of Software Fault-Tolerance Methods for Reliability Enhancement of Real-Time Software Systems

Review of Software Fault-Tolerance Methods for Reliability Enhancement of Real-Time Software Systems International Journal of Electrical and Computer Engineering (IJECE) Vol. 6, No. 3, June 2016, pp. 1031 ~ 1037 ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9041 1031 Review of Software Fault-Tolerance Methods

More information

BUILDING CYBERSECURITY CAPABILITY, MATURITY, RESILIENCE

BUILDING CYBERSECURITY CAPABILITY, MATURITY, RESILIENCE BUILDING CYBERSECURITY CAPABILITY, MATURITY, RESILIENCE 1 WHAT IS YOUR SITUATION? Excel spreadsheets Manually intensive Too many competing priorities Lack of effective reporting Too many consultants Not

More information

Improving FPGA Design Robustness with Partial TMR

Improving FPGA Design Robustness with Partial TMR Improving FPGA Design Robustness with Partial TMR Brian Pratt, Michael Caffrey, Paul Graham, Keith Morgan, Michael Wirthlin Abstract This paper describes an efficient approach of applying mitigation to

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

EXPERIENCES FROM MODEL BASED DEVELOPMENT OF DRIVE-BY-WIRE CONTROL SYSTEMS

EXPERIENCES FROM MODEL BASED DEVELOPMENT OF DRIVE-BY-WIRE CONTROL SYSTEMS EXPERIENCES FROM MODEL BASED DEVELOPMENT OF DRIVE-BY-WIRE CONTROL SYSTEMS Per Johannessen 1, Fredrik Törner 1 and Jan Torin 2 1 Volvo Car Corporation, Department 94221, ELIN, SE-405 31 Göteborg, SWEDEN;

More information

RAID systems within Industry

RAID systems within Industry White Paper 01/2014 RAID systems within Industry Functioning, variants and fields of application of RAID systems https://support.industry.siemens.com/cs/ww/en/view/109737064 Warranty and liability Warranty

More information

Toward Monitoring Fault-Tolerant Embedded Systems

Toward Monitoring Fault-Tolerant Embedded Systems Toward Monitoring Fault-Tolerant Embedded Systems Alwyn E. Goodloe National Institute of Aerospace Lee Pike Galois, Inc Characterizing the Systems The systems we focus on must be ultra-reliable, and so

More information