Dependability Threats
- Clement Norton
Transcription
1 Dependable Systems: Dependability Threats
Dr. Peter Tröger, Operating Systems Group
2 Dependability
Dependability is defined as the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered is the behavior as it is perceptible to users; a user is another system (human or physical) which interacts with the former. (J. C. Laprie)
3 Dependability Tree
4 Dependability Threats
- Threat: Unintended event or state
- Failure (German: Ausfall): Event when the system no longer complies with the specification
- Error (German: Fehler(zustand)): Part of the system state that can lead to a subsequent failure
- Fault (German: Fehler(ursache)): Adjudged or hypothesized cause of an error
- An error is a state; faults and failures are events in time
- Treating failures is repair; treating or avoiding errors is maintenance
5 Alternative Definitions
6 Chain of Dependability Threats
- Separation of system states and the events leading to them
- Classical version according to Laprie
[Figure: chain diagram with the labels External Fault, Internal Fault, Normal, Dormant Fault, Activation, Active Fault / Latent Error, Detection, Detected Error, Error Handling, Failure, Outage, Restoration]
7 System?
- Integrated combination of humans, products and processes
- Functional and non-functional specification
- In the IT world, the "products" are typically called components
  - Interacting with each other and the environment
- IT systems are organized in layers, recursive definition
  - Hardware executes operating system
  - Operating system executes application
  - Application executes plugin
8 IT System
- Current state of a layer is either correct or incorrect
- Focus on some chosen investigated layer
- Fault tolerance: Dealing with detectable incorrect states of investigated and execution layer
- Failure: Externally visible incorrect state of investigated layer
- The total state of a given system is the set of the following states: computation, communication, stored information, interconnection, and physical condition. (A. Avižienis)
[Figure: layers labeled I and E]
9 Chain of Dependability Threats (with layers)
[Figure: state diagram of the layered threat chain; transition labels: (Activation), Deactivation, Detection, (Enabling), (Disabling), Recovery, Mitigation, Restoration; further labels: C_OFF, C_ON, EXEC_F, FAIL]
10 Chain vs. Propagation
- The chain of dependability threats occurs in one investigated layer
  - Software bug = fault
  - May lead to wrong variable value = error
  - May lead to exception being thrown = failure
- What happens if the problem leaves the layer?
- Example: Redundant RAID array with mirroring, single disk fails
  - Resulting error state in the RAID system ("red light")
  - May or may not propagate to the operating system
- Error propagation: A failure in one layer is the fault in another layer
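The fault / error / failure distinction from slide 10 can be illustrated with a small sketch. The function names, the planted bug, and the choice of "maximum value" as the delivered service are all hypothetical and only serve to show that a fault can stay dormant, produce an error that is masked, or escalate to a failure.

```python
def scale(readings, factor):
    """Intended behaviour: multiply every reading by `factor`."""
    # Fault (hypothetical bug): the last element is never scaled.
    return [r * factor for r in readings[:-1]] + readings[-1:]


def service(readings, factor):
    """Return (error, failure) for one service request."""
    state = scale(readings, factor)             # internal state of the layer
    expected = [r * factor for r in readings]   # state demanded by the spec
    error = state != expected                   # error: incorrect internal state
    output = max(state)                         # service visible to the user
    failure = output != max(expected)           # failure: visible deviation
    return error, failure


print(service([1, 2, 3], 1))  # fault stays dormant
print(service([5, 2, 1], 2))  # error exists, but is masked at the interface
print(service([1, 2, 3], 2))  # error escalates to a failure
```

In this sketch the same fault yields three different outcomes depending on the input, which is why a fault is only called active once it produces an error.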
11 Error Propagation
12 Fault Classification
- High diversity in possible sources and types
13 Observations on Faults
- An external fault is a design fault: inability or refusal to foresee all situations
- Design faults are created during system development, system modification, or the creation and establishment of operational procedures
  - Simply replacing a broken component with the same version leads to recurrent faults
- Physical faults are accidental faults
- Temporary external accidental physical faults are also called transient faults
- Temporary internal accidental faults are also called intermittent faults
  - Examples: Pattern-sensitive memory hardware, system overload
  - Arbitrary concept: Permanent faults with unknown activation condition
- Intentional and design faults are human-made faults, might be malicious faults
- Hardware production defects are typically physical faults
14 Observations on Faults
- A fault is active when it produces an error
- A non-active internal fault is a dormant / passive fault
  - Origin in hardware: Often cycling between dormant and active
- Many specialized versions of the term "fault", e.g. bug
  - Heisenbug: Resulting error disappears by itself
  - Bohrbug: Resulting error is independent from execution state
  - Mandelbug: Leads only to an error under specific conditions
- Fault-tolerant system design is a contradiction
  - Design demands specification, faults are non-specified cases
  - Solution: Specification for fault-free case + additional fault model
15 Fault Model
- Faults can be classified on different abstraction levels
  - Physics
  - Circuit level / switching circuit level
    - Interesting for hardware design research (not this course)
    - Investigate logical signals on connections
    - stuck-at-zero, stuck-at-one, bridging faults, stuck-open
  - Register transfer level
  - Processor-memory-switch (PMS) level
  - Hardware system level
  - ... (Software)
- Extremely important when talking about dependability means
16 Physical Faults
- Highly energized particles from space, atmospheric, or ground radiation
- Influence of a particle that strikes a circuit: Atomic displacement, direct ionization, indirect ionization created by nuclear reactions
- Smaller structures are more sensitive to ionization effects
- Single Event Upset (SEU)
  - Injected charge modifies hardware state temporarily
  - Can happen in memory and logic hardware
- Detected Unrecoverable Error (DUE) / Silent Data Corruption (SDC)
  - Problem becomes permanent
  - May be detected or undetected
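The difference between a detected error (DUE) and silent data corruption (SDC) after a single event upset can be sketched as follows. The single even-parity bit is an assumption chosen for brevity and is not named on the slide; real hardware typically uses stronger codes. All function names are hypothetical.

```python
def parity(word):
    """Even-parity bit of an integer word (assumed detection mechanism)."""
    return bin(word).count("1") % 2


def flip(word, bit):
    """Model a single event upset: invert exactly one bit."""
    return word ^ (1 << bit)


def classify(stored, stored_parity, original):
    """Classify the state of a stored word against its original value."""
    if stored == original:
        return "correct"
    if parity(stored) != stored_parity:
        return "DUE"   # corruption is noticed, but cannot be repaired here
    return "SDC"       # corruption passes the check unnoticed


word = 0b10110010
check = parity(word)

one_upset = flip(word, 0)
two_upsets = flip(flip(word, 0), 1)

print(classify(word, check, word))        # correct
print(classify(one_upset, check, word))   # DUE
print(classify(two_upsets, check, word))  # SDC
```

A single parity bit detects any odd number of flipped bits, so the second upset in the same word is what turns a detectable error into a silent one in this sketch.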
17 Single Event Upset
18 Fault Model for Semiconductor Memories
- Stuck-at-1 or stuck-at-0 (hard) faults
- Transition / bit-flip faults (0->1, 1->0)
- Multiple writing: Data is written into more than one cell on a write attempt to one cell
- Pattern sensitivity: Device does not perform reliably with certain data pattern(s)
- Write recovery: Write followed by read/write at a different location results in read/write at the same location
- Sense amplifier recovery: Data accessed remains the same for a number of cycles and then suddenly changes
- Bridging fault: Short between cells, AND type or OR type
- State coupling fault: Coupled (victim) cell is forced to 0 or 1 if coupling (aggressor) cell is in a given state
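Two entries of the memory fault model, stuck-at faults and state coupling faults, can be made concrete with a toy simulation. The class `FaultyMemory`, its parameters, and the naive write-then-read test are hypothetical illustrations; production memory tests are considerably more elaborate.

```python
class FaultyMemory:
    """One-bit cells with optional stuck-at and state coupling faults."""

    def __init__(self, size, stuck=None, coupling=None):
        self.cells = [0] * size
        self.stuck = stuck or {}        # address -> value the cell is stuck at
        self.coupling = coupling or {}  # (aggressor, state) -> (victim, forced value)
        self._apply_faults()

    def _apply_faults(self):
        # State coupling: victim is forced while the aggressor holds `state`.
        for (aggressor, state), (victim, forced) in self.coupling.items():
            if self.cells[aggressor] == state:
                self.cells[victim] = forced
        # Stuck-at: the cell ignores whatever was written.
        for address, value in self.stuck.items():
            self.cells[address] = value

    def write(self, address, bit):
        self.cells[address] = bit
        self._apply_faults()

    def read(self, address):
        return self.cells[address]


def failing_cells(memory):
    """Write 0, then 1, to every cell and report addresses that read back wrong."""
    bad = set()
    for value in (0, 1):
        for address in range(len(memory.cells)):
            memory.write(address, value)
            if memory.read(address) != value:
                bad.add(address)
    return sorted(bad)


print(failing_cells(FaultyMemory(4)))
print(failing_cells(FaultyMemory(4, stuck={2: 0})))
print(failing_cells(FaultyMemory(4, coupling={(0, 1): (3, 0)})))
```

Note that the coupling fault only becomes visible because cell 0 already holds 1 when cell 3 is written; a test that visited the cells in a different order or with different data could miss it, which is the pattern-sensitivity problem in miniature.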
19 Software?
- All bugs are permanent design faults
  - Ignoring user demands
  - Ignoring special properties of the system environment
  - Incomplete specification of dependability requirements
  - Incomplete documentation
- Example for software fault model: Orthogonal Defect Classification (ODC)
  - Any requirement to change the product is a defect
  - Defect trigger: What makes the defect surface
  - Defect type: Nature of the fix you put on the defect
20 ODC Security Defect Types
[Figure: bar chart of ODC security defect types; axis and category labels are not recoverable from the transcription]
21 Errors
- Escalate to failure depending on
  - ... intentional / unintentional redundancy
  - ... system activity
  - ... specification of a failure case from user perspective (e.g. maximum outage time, acceptable delay, retransmission rate)
- System activity can reverse the error state before damage happens
- Latent (not recognized) vs. detected error resulting from an active fault
- Hardware often contains unintentional redundancy, which makes it difficult to test
22 Hardware Error Models
- Hardware faults affect state information, e.g. register values
  - Stuck-at and other hardware faults can therefore also be denoted as errors
- More interesting to investigate resulting effects on system level
  - Single data error: Program data is corrupted (in cache, memory, or register)
  - Single code error: Effect on one instruction of the code
    - Type 1/2: Code modification without / with change of control flow
- Nature of error state may conform to the nature of the originating fault
  - Transient vs. permanent, static vs. dynamic, single vs. multiple
  - Depends on utilized dependability means
23 Hardware Error Models
- Mapping of hardware-level single bit-flip error to other layers
  - Memory data segment, processor data cache: System-level single data error
  - Memory code segment, processor code cache: System-level single code error of type 1 (modification of target register) or type 2 (modification of branch target)
  - Memory stack segment: System-level data error or type 2 code error
  - Processor register: Depending on processor architecture and register type
    - Single data error if register holds data interpreted by the application
    - Single type 1 code error if register holds address used by load/store operation
    - Single type 2 code error if register holds address of a branch target
  - Processor control register: Everything could happen ...
24 Hardware Error Models - Code Errors
Original program:
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BNZ LOOP
Code error 1 (ADD replaced by SUB):
    MOV R0, 10
    MOV R1, 1
    LOOP: SUB R1, R1
    SUB R0, 1
    BNZ LOOP
Code error 2 (branch target LOOP replaced by FOOBAR):
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BNZ FOOBAR
Code error 3 (BNZ replaced by BZ):
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BZ LOOP
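To see what the three code errors from slide 24 do at runtime, the listings can be fed to a tiny interpreter. The instruction semantics are assumptions, since the slide does not define them: `ADD`/`SUB` update the first operand and set a zero flag, `BNZ`/`BZ` branch on that flag, and a jump to an unknown label is treated as a crash.

```python
def run(program, max_steps=1000):
    """Interpret the toy instruction set and return the final registers."""
    labels, code = {}, []
    for line in program.strip().splitlines():
        line = line.strip()
        if ":" in line:                       # "LABEL: instruction"
            name, line = line.split(":", 1)
            labels[name.strip()] = len(code)
        op, *args = line.replace(",", " ").split()
        code.append((op, args))

    regs, zero, pc = {}, False, 0

    def value(arg):
        # Operand is either a known register or an immediate integer.
        return regs[arg] if arg in regs else int(arg)

    for _ in range(max_steps):
        if pc >= len(code):
            return regs
        op, args = code[pc]
        pc += 1
        if op == "MOV":
            regs[args[0]] = value(args[1])
        elif op in ("ADD", "SUB"):
            operand = value(args[1])
            regs[args[0]] += operand if op == "ADD" else -operand
            zero = regs[args[0]] == 0
        elif op in ("BNZ", "BZ"):
            if zero == (op == "BZ"):
                pc = labels[args[0]]          # KeyError on unknown target
        else:
            raise ValueError(f"unknown instruction {op}")
    raise RuntimeError("step limit exceeded")


def outcome(program):
    try:
        return run(program)
    except KeyError as missing:
        return f"crash: jump to undefined target {missing}"


ORIGINAL = """
MOV R0, 10
MOV R1, 1
LOOP: ADD R1, R1
SUB R0, 1
BNZ LOOP
"""

print(outcome(ORIGINAL))
print(outcome(ORIGINAL.replace("ADD R1, R1", "SUB R1, R1")))
print(outcome(ORIGINAL.replace("BNZ LOOP", "BNZ FOOBAR")))
print(outcome(ORIGINAL.replace("BNZ LOOP", "BZ LOOP")))
```

Under these assumptions the first error leaves control flow intact but yields a wrong result, while the other two change the control flow: one jumps to a nonexistent target and one leaves the loop after a single iteration.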
25 Software Error Models
- Similar terminology, but completely different semantics
- Syntactical errors are handled by the compiler, semantic errors occur at runtime
- Static vs. dynamic, permanent vs. temporary errors
- Example for C programming language
  - Errors affecting assignments (missing / wrong local variable values)
  - Errors affecting conditional instructions (wrong boolean or iteration condition)
  - Errors affecting function call / return (wrong parameters, return statement)
  - Errors affecting algorithms (missing statements or function calls, wrong operators)
- Under research in the software engineering field: field studies, automated code analysis, developer interviews
26 Error Message Occurrence
- The same fault can lead to different (detected or undetected) errors
- Errors become detected through an error detection mechanism
  - Some errors are detected by several detectors
  - Some detectors report several errors as one
  - Some errors are never uncovered
- Detected errors might not be logged if the system stops too fast
27 Failures
- Visible non-compliance of the system with the specification
  - Failure effect: Why the failure is worth investigating
  - Failure mode: Type of failure in relation to the functionality of the system
  - Failure mechanism: How the failure can happen
- Failure models are well-known in distributed software systems
  - Classical categorization in the onion model [Barborak, Cristian]
28 Onion Model
- Assumption: System of components sending messages to each other
  - Maps to hardware with electrical signals
  - Maps to distributed software systems
- Fail-Stop Failure: No more messaging, other components are informed
- Crash Failure: No more messaging, no information
- Omission Failure: Messages are omitted for some time
- Timing Failure: Reaction to a message or sending of a message is too early / late
- Computation Failure: Wrong answer message to a correctly received message
- Byzantine Failure: Anything
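One way to read the onion model is as a nesting of failure classes, where a mechanism designed for an outer class also covers the inner ones. The sketch below assumes that the order of the list on slide 28 is the nesting order from the innermost to the outermost layer; that reading, and the names `FailureMode` and `covers`, are assumptions made for illustration.

```python
from enum import IntEnum


class FailureMode(IntEnum):
    """Failure classes, numbered from the innermost to the outermost layer."""
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    COMPUTATION = 5
    BYZANTINE = 6


def covers(assumed, observed):
    """True if a design that assumes `assumed` also handles `observed`."""
    return observed <= assumed


print(covers(FailureMode.OMISSION, FailureMode.CRASH))      # True
print(covers(FailureMode.CRASH, FailureMode.TIMING))        # False
print(covers(FailureMode.BYZANTINE, FailureMode.FAIL_STOP)) # True
```

The practical consequence is that the chosen failure model is a design assumption: a system built for crash failures gives no guarantees once a component starts answering late or wrongly.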
29 Failure Severity
- Denotes consequences of failure
- Benign failures
  - Failure costs and operational benefits are similar
  - Sometimes also umbrella term for failures only detected by inspection
  - A system with only such failures is fail-safe
- Catastrophic failures
  - Costs of failure consequences are much larger than service benefit
- Grading depends on application
  - Flying airplane: Fail-Stop is catastrophic
  - Train: Fail-Stop is benign
- Criticality: Highest severity of possible failure modes in the system
30 Example: DO-178B Standard
- Software Considerations in Airborne Systems and Equipment Certification
- Mature document, developed for more than 20 years
- Definition of severity of failure conditions for airplane, crew, and passengers
  - Catastrophic: Loss of ability to continue safe flight and landing
  - Major: Reduced airplane or crew capability to cope with operating conditions
    - Reduction in safety margins and functional capabilities
    - Higher workload or physical distress for the crew
  - Minor: Not significantly reduced airplane safety, slight increase in workload (Example: Change of flight plan)
  - No effect: Failure results in no loss of operational capabilities and no increase in crew workload
31 Example: DO-178B Standard
32 Example: ISO 26262
- Functional safety of automotive systems
- Severity of failures expressed as Automotive Safety Integrity Level (ASIL)
  - Controllability: Can the driver compensate?
  - Severity: How bad are the consequences?
  - Exposure: How often does that happen?
[Figure: risk diagram with the axes Severity (S) of injuries and Controllability (C), each ranging from "ok" to "bad"; regions "Failure risk acceptable" and "Failure risk not acceptable"]
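How severity, exposure, and controllability combine into an ASIL can be sketched in a few lines. The class numbering (S1 to S3, E1 to E4, C1 to C3) and the sum shortcut used below are not part of the slide; the shortcut is a commonly cited way to reproduce the ASIL determination table, and the standard itself remains the authoritative source. The function name `asil` is hypothetical.

```python
def asil(severity, exposure, controllability):
    """Return the ASIL for classes S1-S3, E1-E4, C1-C3 (given as integers).

    Assumption: the determination table is reproduced by the sum of the
    three class indices (10 -> D, 9 -> C, 8 -> B, 7 -> A, below -> QM).
    """
    if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
        raise ValueError("expected S in 1..3, E in 1..4, C in 1..3")
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


print(asil(3, 4, 3))  # worst case in all three dimensions
print(asil(3, 4, 1))  # same hazard, but easily controllable by the driver
print(asil(1, 1, 1))  # quality management only
```

The sketch shows why controllability matters: the same severe and frequent hazard drops by two levels when the driver can reliably compensate.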
33 Wording and Numbers
34 Wording and Numbers
35 Observations on Failures
- Failures and system load are correlated
  - Load can lead to wear-out, so the failure probability increases
  - Higher load can activate dormant faults
  - Detected faults lead to recovery activities, which again increase the load
  - Possibility for unintended feedback effects in complex systems
- Common-cause failures: Multiple parts are impacted for the same reason
  - Cascade failures through common dependency (e.g. power)
  - Secondary failures from inappropriate environment (e.g. temperature)
  - Common-mode failures from bad design (e.g. identical redundant units)
36 Example: Amazon EBS
- Failure of Amazon cloud services in 2012
  - Major web sites were down (Reddit, Netflix, Airbnb, ...)
- Report about root cause
  - Large number of cloud storage servers could no longer handle requests
  - Low-priority service with memory leak was eating all resources
  - Reason was repeated connection attempts to a monitoring server
  - Monitoring server was not reachable due to DNS misconfiguration
  - DNS change was caused by the exchange of an unrelated hardware unit
- Example for cascade failure
37 Fail-Fast
- A common concept from system engineering, company management, ...
- Report failure and stop immediately without further action
- Discussed by Jim Gray in 1985 as part of his famous article "Why Do Computers Stop and What Can Be Done About It?"
- Useful when benefit from recovery is not good enough for its costs, or if error propagation is highly probable
  - Single units of a redundant set
  - Deeply interwired IT system components
  - Components under heavy request load
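The "single units of a redundant set" case can be sketched as follows: a unit that detects an internal error reports it and stops, and the caller moves on to the next unit instead of waiting for a local recovery. The names `FailFast`, `unit`, and `redundant_call`, and the doubling service, are hypothetical.

```python
class FailFast(Exception):
    """Raised by a unit that stops instead of trying to recover."""


def unit(name, healthy):
    """Build a service function for one member of the redundant set."""
    def serve(x):
        if not healthy:
            # Report and stop immediately; no local repair attempt.
            raise FailFast(f"{name}: internal error detected, stopping")
        return x * 2
    return serve


def redundant_call(units, x):
    """Ask the units in order; return the first result plus all failure reports."""
    reports = []
    for serve in units:
        try:
            return serve(x), reports
        except FailFast as report:
            reports.append(str(report))
    raise RuntimeError("all redundant units failed: " + "; ".join(reports))


result, reports = redundant_call([unit("A", False), unit("B", True)], 21)
print(result)
print(reports)
```

Because unit A stops at once, its possibly corrupted state never reaches the caller, and the cost of the failure is limited to one extra call to unit B.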
38 Literature
- Laprie, J. Dependability. Basic Concepts and Terminology. (Springer, 1998).
- Hansen, J. P. & Siewiorek, D. P. Models for time coalescence in event logs. in IEEE Proceedings of International Symposium on Fault-Tolerant Computing (FTCS-22) (1992). doi: /ftcs
- Hunny, U., Zulkernine, M. & Weldemariam, K. OSDC: Adapting ODC for Developing More Secure Software. in Proceedings of the 28th Annual ACM Symposium on Applied Computing (ACM, 2013). doi: /
- ISO. Road vehicles - Functional safety - Part 3: Concept phase (ISO ). (2011).
- Ferrell, T. K. & Ferrell, U. D. RTCA DO-178B/EUROCAE ED-12B. in The Avionics Handbook (CRC Press, 2001).
- Goloubeva, O., Rebaudengo, M., Reorda, M. & Violante, M. Software-Implemented Hardware Fault Tolerance. (Springer, 2010).
More informationSafety-critical embedded systems, fault-tolerant control systems, fault detection, fault localization and isolation
Fault detection in safety-critical embedded systems nomen VERBER i, MA TJAl COLNARIC i, AND WOLFGANG A. HALANG 2 JUniversity of Maribor, Faculty of Electrical Engineering and Computer Science, 2000 Maribor,
More informationBasic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.
Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery
More informationChap 2. Introduction to Software Testing
Chap 2. Introduction to Software Testing 2.1 Software Testing Concepts and Processes 2.2 Test Management 1 2.1 Software Testing Concepts and Processes 1. Introduction 2. Testing Dimensions 3. Test Concepts
More informationSoftware reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.
SOFTWARE ENGINEERING SOFTWARE RELIABILITY Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. LEARNING OBJECTIVES
More informationSystem modeling. Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei)
System modeling Fault modeling (based on slides from dr. István Majzik and Zoltán Micskei) Budapest University of Technology and Economics Department of Measurement and Information Systems Contents Concept
More informationCIS 5373 Systems Security
CIS 5373 Systems Security Topic 1: Introduction to Systems Security Endadul Hoque 1 Why should you care? Security impacts our day-to-day life Become a security-aware user Make safe decisions Become a security-aware
More information11. SEU Mitigation in Stratix IV Devices
11. SEU Mitigation in Stratix IV Devices February 2011 SIV51011-3.2 SIV51011-3.2 This chapter describes how to use the error detection cyclical redundancy check (CRC) feature when a Stratix IV device is
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationVLSI Test Technology and Reliability (ET4076)
VLSI Test Technology and Reliability (ET4076) Lecture 4(part 2) Testability Measurements (Chapter 6) Said Hamdioui Computer Engineering Lab Delft University of Technology 2009-2010 1 Previous lecture What
More informationBy V-cubed Solutions, Inc. Page1. All rights reserved by V-cubed Solutions, Inc.
By V-cubed Solutions, Inc. Page1 Purpose of Document This document will demonstrate the efficacy of CODESCROLL CODE INSPECTOR, CONTROLLER TESTER, and QUALITYSCROLL COVER, which has been developed by V-cubed
More informationAutomated Freedom from Interference Analysis for Automotive Software
Automated Freedom from Interference Analysis for Automotive Software Florian Leitner-Fischer ZF TRW 78315 Radolfzell, Germany Email: florian.leitner-fischer@zf.com Stefan Leue Chair for Software and Systems
More information9. SEU Mitigation in Cyclone IV Devices
9. SEU Mitigation in Cyclone IV Devices May 2013 CYIV-51009-1.3 CYIV-51009-1.3 This chapter describes the cyclical redundancy check (CRC) error detection feature in user mode and how to recover from soft
More informationSafety Instructions 1-1 Avoid unintended Start General Description 2-2
Contents Contents 1 Safety and precautions 1-1 Safety Instructions 1-1 Avoid unintended Start. 1-1 2 Introduction 2-1 General Description 2-2 3 Supported Configuration 3-1 Introduction 3-1 Fixed-speed
More informationUnderstanding and Managing Cascading Disasters A Framework for Analysis
Understanding and Managing Cascading Disasters A Framework for Analysis David Alexander and Gianluca Pescaroli Cascading Disasters Research Group Institute for Risk and Disaster Reduction University College
More informationSoft-error Detection Using Control Flow Assertions
Soft-error Detection Using Control Flow Assertions O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino, Dipartimento di Automatica e Informatica Torino, Italy Abstract Over
More informationLecture 5 Safety Analysis FHA, HAZOP
Lecture 5 Safety Analysis FHA, HAZOP Introduction While designing a safety-critical system usually several safety analysis techniques are applied The idea is to achieve completeness of safety requirements,
More informationIntroduction to Robust Systems
Introduction to Robust Systems Subhasish Mitra Stanford University Email: subh@stanford.edu 1 Objective of this Talk Brainstorm What is a robust system? How can we build robust systems? Robust systems
More informationDependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8
TDDD7 Real-time Systems Lecture 7 Dependability & Fault tolerance Simin Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Dependability and real-time If a system
More informationModel-Based Safety Approach for Early Validation of Integrated and Modular Avionics Architectures
Model-Based Safety Approach for Early Validation of Integrated and Modular Avionics Architectures Marion Morel THALES AVIONICS S.A.S., 31036 Toulouse, France marion.morel@fr.thalesgroup.com Abstract. Increasing
More informationBridge Course On Software Testing
G. PULLAIAH COLLEGE OF ENGINEERING AND TECHNOLOGY Accredited by NAAC with A Grade of UGC, Approved by AICTE, New Delhi Permanently Affiliated to JNTUA, Ananthapuramu (Recognized by UGC under 2(f) and 12(B)
More informationDriver Assistance Pushes New Flash Functionalities
Driver Assistance Pushes New Flash Functionalities Anil Gupta Technical Executive Winbond Electronics Corporation Santa Clara, CA 1 Automotive and ADAS terminology ECC use to increase reliability of Flash
More informationError Resilience in Digital Integrated Circuits
Error Resilience in Digital Integrated Circuits Heinrich T. Vierhaus BTU Cottbus-Senftenberg Outline 1. Introduction 2. Faults and errors in nano-electronic circuits 3. Classical fault tolerant computing
More informationAutonomous Driving From Fail-Safe to Fail-Operational Systems
Autonomous Driving From Fail-Safe to Fail-Operational Systems Rudolf Grave December 3, 2015 Agenda About EB Automotive Autonomous Driving Requirements for a future car infrastructure Concepts for fail-operational
More informationC 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How
CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,
More informationReview of Software Fault-Tolerance Methods for Reliability Enhancement of Real-Time Software Systems
International Journal of Electrical and Computer Engineering (IJECE) Vol. 6, No. 3, June 2016, pp. 1031 ~ 1037 ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9041 1031 Review of Software Fault-Tolerance Methods
More informationBUILDING CYBERSECURITY CAPABILITY, MATURITY, RESILIENCE
BUILDING CYBERSECURITY CAPABILITY, MATURITY, RESILIENCE 1 WHAT IS YOUR SITUATION? Excel spreadsheets Manually intensive Too many competing priorities Lack of effective reporting Too many consultants Not
More informationImproving FPGA Design Robustness with Partial TMR
Improving FPGA Design Robustness with Partial TMR Brian Pratt, Michael Caffrey, Paul Graham, Keith Morgan, Michael Wirthlin Abstract This paper describes an efficient approach of applying mitigation to
More informationFailure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems
Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements
More informationEXPERIENCES FROM MODEL BASED DEVELOPMENT OF DRIVE-BY-WIRE CONTROL SYSTEMS
EXPERIENCES FROM MODEL BASED DEVELOPMENT OF DRIVE-BY-WIRE CONTROL SYSTEMS Per Johannessen 1, Fredrik Törner 1 and Jan Torin 2 1 Volvo Car Corporation, Department 94221, ELIN, SE-405 31 Göteborg, SWEDEN;
More informationRAID systems within Industry
White Paper 01/2014 RAID systems within Industry Functioning, variants and fields of application of RAID systems https://support.industry.siemens.com/cs/ww/en/view/109737064 Warranty and liability Warranty
More informationToward Monitoring Fault-Tolerant Embedded Systems
Toward Monitoring Fault-Tolerant Embedded Systems Alwyn E. Goodloe National Institute of Aerospace Lee Pike Galois, Inc Characterizing the Systems The systems we focus on must be ultra-reliable, and so
More information