Eliminating Single Points of Failure in Software Based Redundancy
|
|
- Sophia Henderson
- 5 years ago
- Views:
Transcription
1 Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM SOFTWARE GROUP
2 Transient Hardware Faults A Growing Problem [3] (Shivakumar, 2002) Transient hardware faults (Soft-Errors)! Induced by e.g., radiation, glitches, insufficient signal integrity Increasingly affecting microcontroller logic Future hardware designs: Even more performance and parallelism On the price of being less and less reliable! 2
3 Countermeasures - Hardware Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Hardware-based countermeasures! Application-specific design or specialised hardware For example ECC, lock-step! Pragmatic approach (tackles problem right at source) Hardware costs (e.g., redundancy, checker, ) Selectivity (e.g., multi-application systems) Development costs (diverse safety concepts and HW, (re-)certification) 3
4 ! Countermeasures - Software Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs 4
5 Countermeasures - Software Sensors Safety-Critical System! SafetyCri+cal.Applica+on.(1). SafetyCri+cal.Applica+on.(2). Actuators SafetyCri+cal.Applica+on.(3). Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs Software-based triple modular redundancy (TMR)! Accepted and proven (e.g., recommended for ASIL D error handling) Selective (e.g., multi-application systems) 4
6 ! Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Isola/ondomain Sphereofreplica/on(SOR) 5
7 Software-Based Redundancy in Detail Sensors Safety-Critical System! P = Interface. P = Replica.1. Replica.2. P = Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Isola/ondomain Sphereofreplica/on(SOR) 5
8 Software-Based Redundancy in Detail Sensors Safety-Critical System! P =? Interface. P =? Replica.1. Replica.2. P =? Majority. Voter. Actuators Replica.3. Isola/ondomain Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) Sphereofreplica/on(SOR) 5
9 Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Isola/ondomain Sphereofreplica/on(SOR) Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) 5
10 Agenda Introduction! The Combined Redundancy Approach! Eliminating Vulnerabilities High-Reliability Voters Example: UAV Flight Control! CoRed Implementation Target System: I4Copter Evaluation! Experimental Setup Results Conclusion! 6
11 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7
12 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7
13 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Sphereofreplica/on(SOR) 7
14 CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Holistic Protection Approach! Input to output protection 1 Reading inputs 2 Processing 3 Distributing outputs Composability On application and system level Sphereofreplica/on(SOR) 7
15 ! Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter 8
16 Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) 8
17 Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp } X = X A + B X + D 8
18 Eliminating Input and Output Vulnerabilities SOR Y (Value) X (Value) Encode. Encode.. Z Decode. Z = X Y A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp Set of arithmetic operands (+, -, *, =, ) Tailored for efficient encoded data voting } X = X A + B X + D 8
19 High-Reliability Voter Basics (1) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Replica.3. Z Encode. Z ProviderEncodedVoter CoRed Encoded Voter! Input: variants ( X, Y, Z ) Output: Equality set (E) and winner (W) Based on operations No decoding necessary Branch decisions (equality) on encoded data! IFF difference of encoded values equals difference of static signatures X = Y X Y = B X B Y Each branch decision Unique signature! 9
20 High-Reliability Voter Basics (2) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Check.(Decode). X Replica.3. Z Encode. Z e.g.,x isthewinner ProviderEncodedVoterConsumer Correct control-flow! Valid decision Unique control-flow path Each path Unique signature Control-flow signatures! Static signature (expected value): Compile-time Used as return value E Dynamic signature (actual value): Runtime, computed from variants Applied to winner W Validation: Subsequent check (decode) 10
21 CoRed Encoded Voter Example X = Y false Y = Z false X = Z false true true true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} Control-flow monitoring! Finding quorum Static signature Reapply path specific operations Sign winner with dynamic signature Check Subsequent decode 11
22 ! CoRed Encoded Voter Example X = Y false Y = Z true false 1 X = Z true false true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 12
23 CoRed Encoded Voter Example X = Y true false Y = Z true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} false 1 X = Z true apply(x, sig dyn {X,Z}) return sig static {X,Z} false X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} 2 return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 2. Faulty jump! Voter elects X and computes W correctly Returns incorrect E Again subsequent decode will fail! 12
24 Implementation CoRed implementation! Easy-to-use C++ templates and libraries Hardware independent: Code and Encoded Voter Thin OS integration layer PXROS-HR (Industry-strength commercial RTOS) CiAO (AUTOSAR-OS compatible) CoRed artefacts Real-time tasks and jobs Runtime-environment requirements! Temporal isolation static schedule (time triggered) Spatial isolation HW-based memory protection 13
25 CoRed Protected Flight Control Infineon TriCore TC1796 Redundant Sensor Setting Target System: I4Copter quadrotor platform Industry-grade hardware and software Triple redundant sensor setting Multi-application system Flight control application! Safety-critical Model-based: MATLAB Simulink Embedded Coder C++ code 14
26 Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination 15
27 Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination Outcome: 401,592experiments Effec+ve: 67,617errors Categories:.FailSilent,Masked,HardwareDetected,ICode, ControlIFlow,SilentDataCorrup/on 15
28 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
29 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
30 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
31 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % SDC Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! CoRed: Remaining corruptions are covered 0 corruptions HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
32 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % 50 % 40 % Silent Data Corruptions 30 % Hardware Detected -Code Detected 20 % Control-flow Monitoring 10 % Masked 0 % SDC HW CFM Mask SDC HW CFM Mask Voter campaign! Plain Voter CoRed Encoded Voter 17
33 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17
34 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Encoded Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17
35 Evaluation Experimental Results (2) Replica.1. Replica.2. Replica % 60 % Data Address Voter. 80 % 100 µs Evaluation Overhead 70 % 107.1% 80 µs 50 % I4Copter Flight-Control: 7.1% overhead (compared to 40 plain % TMR) 60 µs Absolute numbers: 1,842µs (application) Silent Data Corruptions 30 % Hardware Detected WCET 20 % -Code 40 Detected Selectivity! µs Control-flow Monitoring I4Copter system 10 % CPU utilisation: 41% Masked Full replication impossible, CPU: 120% 0 % 20 µs Mission-critical replication of flight control possible with CoRed, Plain CPU: Voter 60% CoRed Encoded Voter Voter campaign! 0 µs Plain voter: CoRed Pair Plain TMR CoRed TMR CoRed Pair + Spare Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions Replication 100.0% Overhead Analysis! Voting Interface State Recovery SDC HW 103.2% 10.2µs (plain voter) vs. 77.6µs (CoRed voter) CFM Mask SDC HW 115.1% CFM 120 % 100 % Flight-Control Execution Time Mask 18
36 !!! Conclusion Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. 19
37 !! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19
38 !! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19
39 ! Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable 19
40 Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable Experimental Results! CoRed is effective Silent data corruptions can be eliminated Only 7.1% overhead (flight control example) 19
41 Thank you! SYSTEM SOFTWARE GROUP
42 References (1) International Roadmap for Semiconductors, 2001 (2) Implications of microcontroller software on safety-critical automotive systems (Infineon 2008) (3) P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, Modelling the effect of technology trends on the soft error rate of combinational logic, in DSN 02: Proceedings of the 2002 International Conference on Dependable Systems and Networks (4) Edmund B. Nightingale, John R Douceur, and Vince Orgovan, Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs, in Proceedings of EuroSys 2011, Awarded "Best Paper", ACM, April 2011 (5) M. Rebaudengo and M. S. Reorda, Evaluating the fault tolerance capabilities of embedded systems via bdm, VTS 1999 (6) Forin, Vital coded microprocessor principles and application for various transit systems,
Software-based Fault Tolerance Mission (Im)possible?
Software-based Fault Tolerance Mission Im)possible? Peter Ulbrich The 29th CREST Open Workshop on Software Redundancy November 18, 2013 System Software Group http://www4.cs.fau.de Embedded Systems Initiative
More informationSIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems)
SIListra making systems safer Coded Processing in Medical Devices Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) martin.suesskraut@se.inf.tu-dresden.de Embedded goes Medical 5./6. Oct. 2011 1 SIListra
More informationRedundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992
Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical
More informationImproving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy
Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy Wei Chen, Rui Gong, Fang Liu, Kui Dai, Zhiying Wang School of Computer, National University of Defense Technology,
More informationDependability. IC Life Cycle
Dependability Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr IC Life Cycle User s Requirements Design Re-Cycling In-field Operation Production 2 1 IC Life Cycle User s
More informationFault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO standard
Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO 26262 standard NMI Automotive Electronics Systems 2013 Event Victor Reyes Technical Marketing System
More informationRedundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992
Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical
More informationAutomated Application of Fault Tolerance Mechanisms in a Component-Based System
Automated Application of Fault Tolerance Mechanisms in a Component-Based System Isabella Thomm Michael Stilkerich Rüdiger Kapitza Daniel Lohmann Wolfgang Schröder-Preikschat {ithomm,stilkerich,rrkapitz,lohmann,wosch}@cs.fau.de
More informationCprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques
: Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.
More informationARCHITECTURE DESIGN FOR SOFT ERRORS
ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan
More informationOPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden)
OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING Björn Döbel (TU Dresden) Brussels, 02.02.2013 Hardware Faults Radiation-induced soft errors Mainly an issue in avionics+space 1 DRAM errors in large
More informationSoftware Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA
Software Techniques for Dependable Computer-based Systems Matteo SONZA REORDA Summary Introduction State of the art Assertions Algorithm Based Fault Tolerance (ABFT) Control flow checking Data duplication
More informationKESO Functional Safety and the Use of Java in Embedded Systems
KESO Functional Safety and the Use of Java in Embedded Systems Isabella S1lkerich, Bernhard Sechser Embedded Systems Engineering Kongress 05.12.2012 Lehrstuhl für Informa1k 4 Verteilte Systeme und Betriebssysteme
More informationA JVM for Soft-Error-Prone Embedded Systems
A JVM for Soft-Error-Prone Embedded Systems Isabella S)lkerich, Michael Strotz, Christoph Erhardt, Mar7n Hoffmann, Daniel Lohmann, Fabian Scheler, Wolfgang Schröder- Preikschat Department of Computer Science
More informationTU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007
TU Wien 1 Fault Isolation and Error Containment in the TT-SoC H. Kopetz TU Wien July 2007 This is joint work with C. El.Salloum, B.Huber and R.Obermaisser Outline 2 Introduction The Concept of a Distributed
More informationUsing Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Outline Introduction and Motivation Software-centric Fault Detection Process-Level Redundancy Experimental Results
More informationEvaluating the Fault Tolerance Capabilities of Embedded Systems via BDM
Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM M. Rebaudengo, M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica Torino, Italy Fault tolerant system
More informationSafety Architecture Patterns
Tutorial: Safety Architecture Patterns Philip Koopman, Ph.D. These tutorials are a simplified introduction, and are not sufficient on their own to achieve system safety. You are responsible for the safety
More informationAN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware Ute Schiffel Christof Fetzer Martin Süßkraut Technische Universität Dresden Institute for System Architecture ute.schiffel@inf.tu-dresden.de
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance
More informationECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University
Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2010 Daniel J. Sorin Duke University Definition and Motivation Outline General Principles of Available System Design
More informationCommunication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018
Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Agenda Motivation Introduction of Safety Components Introduction to ARMv8
More informationIs This What the Future Will Look Like?
Is This What the Future Will Look Like? Implementing fault tolerant system architectures with AUTOSAR basic software Highly automated driving adds new requirements to existing safety concepts. It is no
More informationAnalysis of Soft Error Mitigation Techniques for Register Files in IBM Cu-08 90nm Technology
Analysis of Soft Error Mitigation Techniques for s in IBM Cu-08 90nm Technology Riaz Naseer, Rashed Zafar Bhatti, Jeff Draper Information Sciences Institute University of Southern California Marina Del
More informationSafety and Reliability of Software-Controlled Systems Part 14: Fault mitigation
Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Prof. Dr.-Ing. Stefan Kowalewski Chair Informatik 11, Embedded Software Laboratory RWTH Aachen University Summer Semester
More informationDependability tree 1
Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques
More informationA Diversity of Duplications
A Diversity of Duplications David Powell Special event «Dependability of computing systems, Memories and future» in honor of Jean-Claude Laprie LAAS-CNRS, Toulouse, 16 April 2010 Duplication error Detection
More informationFAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)
Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy
More informationTSW Reliability and Fault Tolerance
TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software
More informationError Mitigation of Point-to-Point Communication for Fault-Tolerant Computing
Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the
More informationA Low-Cost Correction Algorithm for Transient Data Errors
A Low-Cost Correction Algorithm for Transient Data Errors Aiguo Li, Bingrong Hong School of Computer Science and Technology Harbin Institute of Technology, Harbin 150001, China liaiguo@hit.edu.cn Introduction
More informationElzar Triple Modular Redundancy using Intel AVX
Elzar Triple Modular Redundancy using Intel AVX Dmitrii Kuvaiskii Oleksii Oleksenko Pramod Bhatotia Christof Fetzer Pascal Felber Hardware Errors in the Wild Online services run in huge data centers 1
More informationIlan Beer. IBM Haifa Research Lab 27 Oct IBM Corporation
Ilan Beer IBM Haifa Research Lab 27 Oct. 2008 As the semiconductors industry progresses deeply into the sub-micron technology, vulnerability of chips to soft errors is growing In high reliability systems,
More informationDistributed Systems 24. Fault Tolerance
Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network
More informationDefect Tolerance in VLSI Circuits
Defect Tolerance in VLSI Circuits Prof. Naga Kandasamy We will consider the following redundancy techniques to tolerate defects in VLSI circuits. Duplication with complementary logic (physical redundancy).
More informationCS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:
CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online
More informationDuke University Department of Electrical and Computer Engineering
Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward
More informationDistributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013
Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware
More informationDistributed Real-Time Fault Tolerance on a Virtualized Multi-Core System
Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System Eric Missimer*, Richard West and Ye Li Computer Science Department Boston University Boston, MA 02215 Email: {missimer,richwest,liye}@cs.bu.edu
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical
More informationMixed Critical Architecture Requirements (MCAR)
Superior Products Through Innovation Approved for Public Release; distribution is unlimited. (PIRA AER200905019) Mixed Critical Architecture Requirements (MCAR) Copyright 2009 Lockheed Martin Corporation
More informationCommercial-Off-the-shelf Hardware Transactional Memory for Tolerating Transient Hardware Errors
Commercial-Off-the-shelf Hardware Transactional Memory for Tolerating Transient Hardware Errors Rasha Faqeh TU- Dresden 19.01.2015 Dresden, 23.09.2011 Transient Error Recovery Motivation Folie Nr. 12 von
More informationFault-tolerant techniques
What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques
More informationOn the Consolidation of Mixed Criticalities Applications on Multicore Architectures
J Electron Test (2017) 33:65 76 DOI 10.1007/s10836-016-5636-7 On the Consolidation of Mixed Criticalities Applications on Multicore Architectures Stefano Esposito 1 Massimo Violante 1 Received: 21 July
More informationHW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths
HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths Mario Schölzel Department of Computer Science Brandenburg University of Technology Cottbus, Germany
More informationIssues in Programming Language Design for Embedded RT Systems
CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics
More informationReliable Architectures
6.823, L24-1 Reliable Architectures Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 6.823, L24-2 Strike Changes State of a Single Bit 10 6.823, L24-3 Impact
More informationLast Class:Consistency Semantics. Today: More on Consistency
Last Class:Consistency Semantics Consistency models Data-centric consistency models Client-centric consistency models Eventual Consistency and epidemic protocols Lecture 16, page 1 Today: More on Consistency
More informationBasic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.
Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques
More informationTransient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.
Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection
More informationPart 2: Basic concepts and terminology
Part 2: Basic concepts and terminology Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved part 2, page 1 Def.: Dependability (Verlässlichkeit) is defined as the trustworthiness
More informationDep. Systems Requirements
Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small
More informationImplementation Issues. Remote-Write Protocols
Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write
More informationAUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann
AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel Alexander Züpke, Marc Bommert, Daniel Lohmann alexander.zuepke@hs-rm.de, marc.bommert@hs-rm.de, lohmann@cs.fau.de Motivation Automotive and Avionic industry
More informationDistributed Operating Systems
2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System
More informationFault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components
Fault Tolerance To avoid disruption due to failure and to improve availability, systems are designed to be fault-tolerant Two broad categories of fault-tolerant systems are: systems that mask failure it
More informationA Fault Tolerant Superscalar Processor
A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor by V. Reddy and E. Rotenberg (2008)] P R E S E N T E D B Y NAN Z
More informationUnderstanding SW Test Libraries (STL) for safetyrelated integrated circuits and the value of white-box SIL2(3) ASILB(D) YOGITECH faultrobust STL
Understanding SW Test Libraries (STL) for safetyrelated integrated circuits and the value of white-box SIL2(3) ASILB(D) YOGITECH faultrobust STL Riccardo Mariani White Paper n. 001/2014 Riccardo Mariani
More informationANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software
ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer Technische Universtät Dresden Department of Computer Science http://wwwse.inf.tu-dresden.de
More informationSoft Error Fault Tolerant Systems: CS456 Survey
Soft Error Fault Tolerant Systems: CS456 Survey Alok Garg Abstract Currently programming errors have been attributed to be the foremost cause of most system failures. But recent studies have suggested
More informationToday: Fault Tolerance
Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing
More informationSOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE
SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, and M. Violante Politecnico di Torino - Dipartimento di Automatica
More information6.033 Lecture Fault Tolerant Computing 3/31/2014
6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults
More informationDistributed Systems. Fault Tolerance. Paul Krzyzanowski
Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected
More informationDEPENDABLE PROCESSOR DESIGN
DEPENDABLE PROCESSOR DESIGN Matteo Carminati Politecnico di Milano - October 31st, 2012 Partially inspired by P. Harrod (ARM) presentation at the Test Spring School 2012 - Annecy (France) OUTLINE What?
More informationDistributed Systems COMP 212. Lecture 19 Othon Michail
Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails
More informationModule 8 Fault Tolerance CS655! 8-1!
Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!
More informationDiagnosis in the Time-Triggered Architecture
TU Wien 1 Diagnosis in the Time-Triggered Architecture H. Kopetz June 2010 Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: A physical subsystem the
More informationCompiler-Assisted Software Fault Tolerance for Microcontrollers
Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2018-03-01 Compiler-Assisted Software Fault Tolerance for Microcontrollers Matthew Kendall Bohman Brigham Young University Follow
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationTowards Transactional Memory for Safety-Critical Embedded Systems
Towards Transactional Memory for Safety-Critical Embedded Systems Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science, University of Augsburg, Germany Euro-TM Workshop on Transactional
More informationA Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems On Programmable Chips
A Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems On Programmable Chips M. Sonza Reorda, M. Violante Politecnico di Torino Torino, Italy Abstract The availability of multimillion
More informationFunctional Safety on Multicore Microcontrollers for Industrial Applications
Functional Safety on Multicore Microcontrollers for Industrial Applications Thomas Barth Department of Electrical Engineering Hochschule Darmstadt University of Applied Sciences Darmstadt, Germany thomas.barth@h-da.de
More informationError Resilience in Digital Integrated Circuits
Error Resilience in Digital Integrated Circuits Heinrich T. Vierhaus BTU Cottbus-Senftenberg Outline 1. Introduction 2. Faults and errors in nano-electronic circuits 3. Classical fault tolerant computing
More informationMultiple Event Upsets Aware FPGAs Using Protected Schemes
Multiple Event Upsets Aware FPGAs Using Protected Schemes Costas Argyrides, Dhiraj K. Pradhan University of Bristol, Department of Computer Science Merchant Venturers Building, Woodland Road, Bristol,
More informationPractical Byzantine Fault Tolerance (The Byzantine Generals Problem)
Practical Byzantine Fault Tolerance (The Byzantine Generals Problem) Introduction Malicious attacks and software errors that can cause arbitrary behaviors of faulty nodes are increasingly common Previous
More informationError Detection and Recovery for Transient Faults in Elliptic Curve Cryptosystems
Error Detection and Recovery for Transient Faults in Elliptic Curve Cryptosystems Abdulaziz Alkhoraidly and M. Anwar Hasan Department of Electrical and Computer Engineering University of Waterloo January
More informationAR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance
More informationDESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE
DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE DivyaRani 1 1pg scholar, ECE Department, SNS college of technology, Tamil Nadu, India -----------------------------------------------------------------------------------------------------------------------------------------------
More informationToday: Fault Tolerance. Fault Tolerance
Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing
More informationFunctional Safety Architectural Challenges for Autonomous Drive
Functional Safety Architectural Challenges for Autonomous Drive Ritesh Tyagi: August 2018 Topics Market Forces Functional Safety Overview Deeper Look Fail-Safe vs Fail-Operational Architectural Considerations
More informationID 025C: An Introduction to the OSEK Operating System
ID 025C: An Introduction to the OSEK Operating System Version 1.0 1 James Dickie Product Manager for Embedded Software Real-time operating systems AUTOSAR software components Software logic analyzer Experience:
More informationThe Use of Java in the Context of AUTOSAR 4.0
The Use of Java in the Context of AUTOSAR 4.0 Expectations and Possibilities Christian Wawersich cwh@methodpark.de MethodPark Software AG, Germany Isabella Thomm Michael Stilkerich {ithomm,stilkerich}@cs.fau.de
More informationSoft-error Detection Using Control Flow Assertions
Soft-error Detection Using Control Flow Assertions O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino, Dipartimento di Automatica e Informatica Torino, Italy Abstract Over
More informationPerformance Estimation of Virtual Duplex Systems on Simultaneous Multithreaded Processors
Performance Estimation of Virtual Duplex Systems on Simultaneous Multithreaded Processors Bernhard Fechner and Jörg Keller FernUniversität Hagen LG Technische Informatik II Postfach 940 58084 Hagen, Germany
More informationFault Tolerant Computing CS 530
Fault Tolerant Computing CS 530 Lecture Notes 1 Introduction to the class Yashwant K. Malaiya Colorado State University 1 Instructor, TA Instructor: Yashwant K. Malaiya, Professor malaiya @ cs.colostate.edu
More informationOutline of Presentation Field Programmable Gate Arrays (FPGAs(
FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable
More informationA Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor
A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor Jason Blome 1, Scott Mahlke 1, Daryl Bradley 2 and Krisztián Flautner 2 1 Advanced Computer Architecture
More informationIntel iapx 432-VLSI building blocks for a fault-tolerant computer
Intel iapx 432-VLSI building blocks for a fault-tolerant computer by DAVE JOHNSON, DAVE BUDDE, DAVE CARSON, and CRAIG PETERSON Intel Corporation Aloha, Oregon ABSTRACT Early in 1983 two new VLSI components
More informationRiccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist
Riccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist Internet of Things Group 2 Internet of Things Group 3 Autonomous systems: computing platform Intelligent eyes Vision. Intelligent
More informationMigration of SES to FPGA Based Architectural Concepts
Migration of SES to FPG Based rchitectural Concepts M. Steindl 1, J. Mottok 1, H. Meier 1,F. Schiller 2, M. Fruechtl 2 1 Regensburg University of pplied Sciences Department of Electronics and Information
More informationDependability Threats
Dependable Systems Dependability Threats Dr. Peter Tröger Operating Systems Group Dependability Dependability is defined as the trustworthiness of a computer system such that reliance can justifiable be
More informationFast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs
Fast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs Hamid R. Zarandi,2, Seyed Ghassem Miremadi, Costas Argyrides 2, Dhiraj K. Pradhan 2 Department of Computer Engineering, Sharif
More informationReliable Computing I
Instructor: Mehdi Tahoori Reliable Computing I Lecture 9: Concurrent Error Detection INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the
More informationPractical Byzantine Fault Tolerance
Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What
More informationTU Wien. Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12.
TU Wien 1 Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet H Kopetz TU Wien December 2008 Properties of a Successful Protocol 2 A successful real-time protocol must have the following
More informationAn Integrated ECC and BISR Scheme for Error Correction in Memory
An Integrated ECC and BISR Scheme for Error Correction in Memory Shabana P B 1, Anu C Kunjachan 2, Swetha Krishnan 3 1 PG Student [VLSI], Dept. of ECE, Viswajyothy College Of Engineering & Technology,
More informationBuilding High Reliability into Microsemi Designs with Synplify FPGA Tools
Power Matters. TM Building High Reliability into Microsemi Designs with Synplify FPGA Tools Microsemi Space Forum 2015, Synopsys 1 Agenda FPGA SEU mitigation industry trends and best practices Market trends
More information6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect
6. Fault Tolerance (a) Introduction. (b) Types of faults. (c) Fault models. (d) Fault coverage. (e) Redundancy. (f) Fault detection techniques. (g) Hardware fault tolerance. (h) Software fault tolerance.
More information