Eliminating Single Points of Failure in Software Based Redundancy

Size: px
Start display at page:

Download "Eliminating Single Points of Failure in Software Based Redundancy"

Transcription

1 Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM SOFTWARE GROUP

2 Transient Hardware Faults A Growing Problem [3] (Shivakumar, 2002) Transient hardware faults (Soft-Errors)! Induced by e.g., radiation, glitches, insufficient signal integrity Increasingly affecting microcontroller logic Future hardware designs: Even more performance and parallelism On the price of being less and less reliable! 2

3 Countermeasures - Hardware Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Hardware-based countermeasures! Application-specific design or specialised hardware For example ECC, lock-step! Pragmatic approach (tackles problem right at source) Hardware costs (e.g., redundancy, checker, ) Selectivity (e.g., multi-application systems) Development costs (diverse safety concepts and HW, (re-)certification) 3

4 ! Countermeasures - Software Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs 4

5 Countermeasures - Software Sensors Safety-Critical System! SafetyCri+cal.Applica+on.(1). SafetyCri+cal.Applica+on.(2). Actuators SafetyCri+cal.Applica+on.(3). Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs Software-based triple modular redundancy (TMR)! Accepted and proven (e.g., recommended for ASIL D error handling) Selective (e.g., multi-application systems) 4

6 ! Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Isola/ondomain Sphereofreplica/on(SOR) 5

7 Software-Based Redundancy in Detail Sensors Safety-Critical System! P = Interface. P = Replica.1. Replica.2. P = Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Isola/ondomain Sphereofreplica/on(SOR) 5

8 Software-Based Redundancy in Detail Sensors Safety-Critical System! P =? Interface. P =? Replica.1. Replica.2. P =? Majority. Voter. Actuators Replica.3. Isola/ondomain Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) Sphereofreplica/on(SOR) 5

9 Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Isola/ondomain Sphereofreplica/on(SOR) Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) 5

10 Agenda Introduction! The Combined Redundancy Approach! Eliminating Vulnerabilities High-Reliability Voters Example: UAV Flight Control! CoRed Implementation Target System: I4Copter Evaluation! Experimental Setup Results Conclusion! 6

11 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7

12 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7

13 ! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Sphereofreplica/on(SOR) 7

14 CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Holistic Protection Approach! Input to output protection 1 Reading inputs 2 Processing 3 Distributing outputs Composability On application and system level Sphereofreplica/on(SOR) 7

15 ! Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter 8

16 Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) 8

17 Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp } X = X A + B X + D 8

18 Eliminating Input and Output Vulnerabilities SOR Y (Value) X (Value) Encode. Encode.. Z Decode. Z = X Y A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp Set of arithmetic operands (+, -, *, =, ) Tailored for efficient encoded data voting } X = X A + B X + D 8

19 High-Reliability Voter Basics (1) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Replica.3. Z Encode. Z ProviderEncodedVoter CoRed Encoded Voter! Input: variants ( X, Y, Z ) Output: Equality set (E) and winner (W) Based on operations No decoding necessary Branch decisions (equality) on encoded data! IFF difference of encoded values equals difference of static signatures X = Y X Y = B X B Y Each branch decision Unique signature! 9

20 High-Reliability Voter Basics (2) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Check.(Decode). X Replica.3. Z Encode. Z e.g.,x isthewinner ProviderEncodedVoterConsumer Correct control-flow! Valid decision Unique control-flow path Each path Unique signature Control-flow signatures! Static signature (expected value): Compile-time Used as return value E Dynamic signature (actual value): Runtime, computed from variants Applied to winner W Validation: Subsequent check (decode) 10

21 CoRed Encoded Voter Example X = Y false Y = Z false X = Z false true true true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} Control-flow monitoring! Finding quorum Static signature Reapply path specific operations Sign winner with dynamic signature Check Subsequent decode 11

22 ! CoRed Encoded Voter Example X = Y false Y = Z true false 1 X = Z true false true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 12

23 CoRed Encoded Voter Example X = Y true false Y = Z true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} false 1 X = Z true apply(x, sig dyn {X,Z}) return sig static {X,Z} false X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} 2 return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 2. Faulty jump! Voter elects X and computes W correctly Returns incorrect E Again subsequent decode will fail! 12

24 Implementation CoRed implementation! Easy-to-use C++ templates and libraries Hardware independent: Code and Encoded Voter Thin OS integration layer PXROS-HR (Industry-strength commercial RTOS) CiAO (AUTOSAR-OS compatible) CoRed artefacts Real-time tasks and jobs Runtime-environment requirements! Temporal isolation static schedule (time triggered) Spatial isolation HW-based memory protection 13

25 CoRed Protected Flight Control Infineon TriCore TC1796 Redundant Sensor Setting Target System: I4Copter quadrotor platform Industry-grade hardware and software Triple redundant sensor setting Multi-application system Flight control application! Safety-critical Model-based: MATLAB Simulink Embedded Coder C++ code 14

26 Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination 15

27 Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination Outcome: 401,592experiments Effec+ve: 67,617errors Categories:.FailSilent,Masked,HardwareDetected,ICode, ControlIFlow,SilentDataCorrup/on 15

28 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16

29 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16

30 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16

31 Evaluation Experimental Results (1) Interface. Replica.1. Replica % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % SDC Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! CoRed: Remaining corruptions are covered 0 corruptions HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16

32 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % 50 % 40 % Silent Data Corruptions 30 % Hardware Detected -Code Detected 20 % Control-flow Monitoring 10 % Masked 0 % SDC HW CFM Mask SDC HW CFM Mask Voter campaign! Plain Voter CoRed Encoded Voter 17

33 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17

34 Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Encoded Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17

35 Evaluation Experimental Results (2) Replica.1. Replica.2. Replica % 60 % Data Address Voter. 80 % 100 µs Evaluation Overhead 70 % 107.1% 80 µs 50 % I4Copter Flight-Control: 7.1% overhead (compared to 40 plain % TMR) 60 µs Absolute numbers: 1,842µs (application) Silent Data Corruptions 30 % Hardware Detected WCET 20 % -Code 40 Detected Selectivity! µs Control-flow Monitoring I4Copter system 10 % CPU utilisation: 41% Masked Full replication impossible, CPU: 120% 0 % 20 µs Mission-critical replication of flight control possible with CoRed, Plain CPU: Voter 60% CoRed Encoded Voter Voter campaign! 0 µs Plain voter: CoRed Pair Plain TMR CoRed TMR CoRed Pair + Spare Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions Replication 100.0% Overhead Analysis! Voting Interface State Recovery SDC HW 103.2% 10.2µs (plain voter) vs. 77.6µs (CoRed voter) CFM Mask SDC HW 115.1% CFM 120 % 100 % Flight-Control Execution Time Mask 18

36 !!! Conclusion Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. 19

37 !! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19

38 !! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19

39 ! Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable 19

40 Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable Experimental Results! CoRed is effective Silent data corruptions can be eliminated Only 7.1% overhead (flight control example) 19

41 Thank you! SYSTEM SOFTWARE GROUP

42 References (1) International Roadmap for Semiconductors, 2001 (2) Implications of microcontroller software on safety-critical automotive systems (Infineon 2008) (3) P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, Modelling the effect of technology trends on the soft error rate of combinational logic, in DSN 02: Proceedings of the 2002 International Conference on Dependable Systems and Networks (4) Edmund B. Nightingale, John R Douceur, and Vince Orgovan, Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs, in Proceedings of EuroSys 2011, Awarded "Best Paper", ACM, April 2011 (5) M. Rebaudengo and M. S. Reorda, Evaluating the fault tolerance capabilities of embedded systems via bdm, VTS 1999 (6) Forin, Vital coded microprocessor principles and application for various transit systems,

Software-based Fault Tolerance Mission (Im)possible?

Software-based Fault Tolerance Mission (Im)possible? Software-based Fault Tolerance Mission Im)possible? Peter Ulbrich The 29th CREST Open Workshop on Software Redundancy November 18, 2013 System Software Group http://www4.cs.fau.de Embedded Systems Initiative

More information

SIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems)

SIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) SIListra making systems safer Coded Processing in Medical Devices Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) martin.suesskraut@se.inf.tu-dresden.de Embedded goes Medical 5./6. Oct. 2011 1 SIListra

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy

Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy Improving the Fault Tolerance of a Computer System with Space-Time Triple Modular Redundancy Wei Chen, Rui Gong, Fang Liu, Kui Dai, Zhiying Wang School of Computer, National University of Defense Technology,

More information

Dependability. IC Life Cycle

Dependability. IC Life Cycle Dependability Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr IC Life Cycle User s Requirements Design Re-Cycling In-field Operation Production 2 1 IC Life Cycle User s

More information

Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO standard

Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO standard Fault-Injection testing and code coverage measurement using Virtual Prototypes on the context of the ISO 26262 standard NMI Automotive Electronics Systems 2013 Event Victor Reyes Technical Marketing System

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Automated Application of Fault Tolerance Mechanisms in a Component-Based System

Automated Application of Fault Tolerance Mechanisms in a Component-Based System Automated Application of Fault Tolerance Mechanisms in a Component-Based System Isabella Thomm Michael Stilkerich Rüdiger Kapitza Daniel Lohmann Wolfgang Schröder-Preikschat {ithomm,stilkerich,rrkapitz,lohmann,wosch}@cs.fau.de

More information

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques : Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.

More information

ARCHITECTURE DESIGN FOR SOFT ERRORS

ARCHITECTURE DESIGN FOR SOFT ERRORS ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan

More information

OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden)

OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden) OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING Björn Döbel (TU Dresden) Brussels, 02.02.2013 Hardware Faults Radiation-induced soft errors Mainly an issue in avionics+space 1 DRAM errors in large

More information

Software Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA

Software Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA Software Techniques for Dependable Computer-based Systems Matteo SONZA REORDA Summary Introduction State of the art Assertions Algorithm Based Fault Tolerance (ABFT) Control flow checking Data duplication

More information

KESO Functional Safety and the Use of Java in Embedded Systems

KESO Functional Safety and the Use of Java in Embedded Systems KESO Functional Safety and the Use of Java in Embedded Systems Isabella S1lkerich, Bernhard Sechser Embedded Systems Engineering Kongress 05.12.2012 Lehrstuhl für Informa1k 4 Verteilte Systeme und Betriebssysteme

More information

A JVM for Soft-Error-Prone Embedded Systems

A JVM for Soft-Error-Prone Embedded Systems A JVM for Soft-Error-Prone Embedded Systems Isabella S)lkerich, Michael Strotz, Christoph Erhardt, Mar7n Hoffmann, Daniel Lohmann, Fabian Scheler, Wolfgang Schröder- Preikschat Department of Computer Science

More information

TU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007

TU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007 TU Wien 1 Fault Isolation and Error Containment in the TT-SoC H. Kopetz TU Wien July 2007 This is joint work with C. El.Salloum, B.Huber and R.Obermaisser Outline 2 Introduction The Concept of a Distributed

More information

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Outline Introduction and Motivation Software-centric Fault Detection Process-Level Redundancy Experimental Results

More information

Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM

Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM Evaluating the Fault Tolerance Capabilities of Embedded Systems via BDM M. Rebaudengo, M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica Torino, Italy Fault tolerant system

More information

Safety Architecture Patterns

Safety Architecture Patterns Tutorial: Safety Architecture Patterns Philip Koopman, Ph.D. These tutorials are a simplified introduction, and are not sufficient on their own to achieve system safety. You are responsible for the safety

More information

AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware

AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware Ute Schiffel Christof Fetzer Martin Süßkraut Technische Universität Dresden Institute for System Architecture ute.schiffel@inf.tu-dresden.de

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2010 Daniel J. Sorin Duke University Definition and Motivation Outline General Principles of Available System Design

More information

Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018

Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Communication Patterns in Safety Critical Systems for ADAS & Autonomous Vehicles Thorsten Wilmer Tech AD Berlin, 5. March 2018 Agenda Motivation Introduction of Safety Components Introduction to ARMv8

More information

Is This What the Future Will Look Like?

Is This What the Future Will Look Like? Is This What the Future Will Look Like? Implementing fault tolerant system architectures with AUTOSAR basic software Highly automated driving adds new requirements to existing safety concepts. It is no

More information

Analysis of Soft Error Mitigation Techniques for Register Files in IBM Cu-08 90nm Technology

Analysis of Soft Error Mitigation Techniques for Register Files in IBM Cu-08 90nm Technology Analysis of Soft Error Mitigation Techniques for s in IBM Cu-08 90nm Technology Riaz Naseer, Rashed Zafar Bhatti, Jeff Draper Information Sciences Institute University of Southern California Marina Del

More information

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Prof. Dr.-Ing. Stefan Kowalewski Chair Informatik 11, Embedded Software Laboratory RWTH Aachen University Summer Semester

More information

Dependability tree 1

Dependability tree 1 Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques

More information

A Diversity of Duplications

A Diversity of Duplications A Diversity of Duplications David Powell Special event «Dependability of computing systems, Memories and future» in honor of Jean-Claude Laprie LAAS-CNRS, Toulouse, 16 April 2010 Duplication error Detection

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

TSW Reliability and Fault Tolerance

TSW Reliability and Fault Tolerance TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

A Low-Cost Correction Algorithm for Transient Data Errors

A Low-Cost Correction Algorithm for Transient Data Errors A Low-Cost Correction Algorithm for Transient Data Errors Aiguo Li, Bingrong Hong School of Computer Science and Technology Harbin Institute of Technology, Harbin 150001, China liaiguo@hit.edu.cn Introduction

More information

Elzar Triple Modular Redundancy using Intel AVX

Elzar Triple Modular Redundancy using Intel AVX Elzar Triple Modular Redundancy using Intel AVX Dmitrii Kuvaiskii Oleksii Oleksenko Pramod Bhatotia Christof Fetzer Pascal Felber Hardware Errors in the Wild Online services run in huge data centers 1

More information

Ilan Beer. IBM Haifa Research Lab 27 Oct IBM Corporation

Ilan Beer. IBM Haifa Research Lab 27 Oct IBM Corporation Ilan Beer IBM Haifa Research Lab 27 Oct. 2008 As the semiconductors industry progresses deeply into the sub-micron technology, vulnerability of chips to soft errors is growing In high reliability systems,

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information

Defect Tolerance in VLSI Circuits

Defect Tolerance in VLSI Circuits Defect Tolerance in VLSI Circuits Prof. Naga Kandasamy We will consider the following redundancy techniques to tolerate defects in VLSI circuits. Duplication with complementary logic (physical redundancy).

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

Duke University Department of Electrical and Computer Engineering

Duke University Department of Electrical and Computer Engineering Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System

Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System Distributed Real-Time Fault Tolerance on a Virtualized Multi-Core System Eric Missimer*, Richard West and Ye Li Computer Science Department Boston University Boston, MA 02215 Email: {missimer,richwest,liye}@cs.bu.edu

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical

More information

Mixed Critical Architecture Requirements (MCAR)

Mixed Critical Architecture Requirements (MCAR) Superior Products Through Innovation Approved for Public Release; distribution is unlimited. (PIRA AER200905019) Mixed Critical Architecture Requirements (MCAR) Copyright 2009 Lockheed Martin Corporation

More information

Commercial-Off-the-shelf Hardware Transactional Memory for Tolerating Transient Hardware Errors

Commercial-Off-the-shelf Hardware Transactional Memory for Tolerating Transient Hardware Errors Commercial-Off-the-shelf Hardware Transactional Memory for Tolerating Transient Hardware Errors Rasha Faqeh TU- Dresden 19.01.2015 Dresden, 23.09.2011 Transient Error Recovery Motivation Folie Nr. 12 von

More information

Fault-tolerant techniques

Fault-tolerant techniques What are the effects if the hardware or software is not fault-free in a real-time system? What causes component faults? Specification or design faults: Incomplete or erroneous models Lack of techniques

More information

On the Consolidation of Mixed Criticalities Applications on Multicore Architectures

On the Consolidation of Mixed Criticalities Applications on Multicore Architectures J Electron Test (2017) 33:65 76 DOI 10.1007/s10836-016-5636-7 On the Consolidation of Mixed Criticalities Applications on Multicore Architectures Stefano Esposito 1 Massimo Violante 1 Received: 21 July

More information

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths

HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths HW/SW Co-Detection of Transient and Permanent Faults with Fast Recovery in Statically Scheduled Data Paths Mario Schölzel Department of Computer Science Brandenburg University of Technology Cottbus, Germany

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

Reliable Architectures

Reliable Architectures 6.823, L24-1 Reliable Architectures Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 6.823, L24-2 Strike Changes State of a Single Bit 10 6.823, L24-3 Impact

More information

Last Class:Consistency Semantics. Today: More on Consistency

Last Class:Consistency Semantics. Today: More on Consistency Last Class:Consistency Semantics Consistency models Data-centric consistency models Client-centric consistency models Eventual Consistency and epidemic protocols Lecture 16, page 1 Today: More on Consistency

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

Part 2: Basic concepts and terminology

Part 2: Basic concepts and terminology Part 2: Basic concepts and terminology Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved part 2, page 1 Def.: Dependability (Verlässlichkeit) is defined as the trustworthiness

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

Implementation Issues. Remote-Write Protocols

Implementation Issues. Remote-Write Protocols Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write

More information

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann

AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel. Alexander Züpke, Marc Bommert, Daniel Lohmann AUTOBEST: A United AUTOSAR-OS And ARINC 653 Kernel Alexander Züpke, Marc Bommert, Daniel Lohmann alexander.zuepke@hs-rm.de, marc.bommert@hs-rm.de, lohmann@cs.fau.de Motivation Automotive and Avionic industry

More information

Distributed Operating Systems

Distributed Operating Systems 2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System

More information

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components Fault Tolerance To avoid disruption due to failure and to improve availability, systems are designed to be fault-tolerant Two broad categories of fault-tolerant systems are: systems that mask failure it

More information

A Fault Tolerant Superscalar Processor

A Fault Tolerant Superscalar Processor A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor by V. Reddy and E. Rotenberg (2008)] P R E S E N T E D B Y NAN Z

More information

Understanding SW Test Libraries (STL) for safetyrelated integrated circuits and the value of white-box SIL2(3) ASILB(D) YOGITECH faultrobust STL

Understanding SW Test Libraries (STL) for safetyrelated integrated circuits and the value of white-box SIL2(3) ASILB(D) YOGITECH faultrobust STL Understanding SW Test Libraries (STL) for safetyrelated integrated circuits and the value of white-box SIL2(3) ASILB(D) YOGITECH faultrobust STL Riccardo Mariani White Paper n. 001/2014 Riccardo Mariani

More information

ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software

ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software Ute Schiffel, André Schmitt, Martin Süßkraut, and Christof Fetzer Technische Universtät Dresden Department of Computer Science http://wwwse.inf.tu-dresden.de

More information

Soft Error Fault Tolerant Systems: CS456 Survey

Soft Error Fault Tolerant Systems: CS456 Survey Soft Error Fault Tolerant Systems: CS456 Survey Alok Garg Abstract Currently programming errors have been attributed to be the foremost cause of most system failures. But recent studies have suggested

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE

SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE SOFTWARE-IMPLEMENTED HARDWARE FAULT TOLERANCE O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, and M. Violante Politecnico di Torino - Dipartimento di Automatica

More information

6.033 Lecture Fault Tolerant Computing 3/31/2014

6.033 Lecture Fault Tolerant Computing 3/31/2014 6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

DEPENDABLE PROCESSOR DESIGN

DEPENDABLE PROCESSOR DESIGN DEPENDABLE PROCESSOR DESIGN Matteo Carminati Politecnico di Milano - October 31st, 2012 Partially inspired by P. Harrod (ARM) presentation at the Test Spring School 2012 - Annecy (France) OUTLINE What?

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

Module 8 Fault Tolerance CS655! 8-1!

Module 8 Fault Tolerance CS655! 8-1! Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!

More information

Diagnosis in the Time-Triggered Architecture

Diagnosis in the Time-Triggered Architecture TU Wien 1 Diagnosis in the Time-Triggered Architecture H. Kopetz June 2010 Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: A physical subsystem the

More information

Compiler-Assisted Software Fault Tolerance for Microcontrollers

Compiler-Assisted Software Fault Tolerance for Microcontrollers Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2018-03-01 Compiler-Assisted Software Fault Tolerance for Microcontrollers Matthew Kendall Bohman Brigham Young University Follow

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

Towards Transactional Memory for Safety-Critical Embedded Systems

Towards Transactional Memory for Safety-Critical Embedded Systems Towards Transactional Memory for Safety-Critical Embedded Systems Stefan Metzlaff, Sebastian Weis, and Theo Ungerer Department of Computer Science, University of Augsburg, Germany Euro-TM Workshop on Transactional

More information

A Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems On Programmable Chips

A Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems On Programmable Chips A Low-Cost SEE Mitigation Solution for Soft-Processors Embedded in Systems On Programmable Chips M. Sonza Reorda, M. Violante Politecnico di Torino Torino, Italy Abstract The availability of multimillion

More information

Functional Safety on Multicore Microcontrollers for Industrial Applications

Functional Safety on Multicore Microcontrollers for Industrial Applications Functional Safety on Multicore Microcontrollers for Industrial Applications Thomas Barth Department of Electrical Engineering Hochschule Darmstadt University of Applied Sciences Darmstadt, Germany thomas.barth@h-da.de

More information

Error Resilience in Digital Integrated Circuits

Error Resilience in Digital Integrated Circuits Error Resilience in Digital Integrated Circuits Heinrich T. Vierhaus BTU Cottbus-Senftenberg Outline 1. Introduction 2. Faults and errors in nano-electronic circuits 3. Classical fault tolerant computing

More information

Multiple Event Upsets Aware FPGAs Using Protected Schemes

Multiple Event Upsets Aware FPGAs Using Protected Schemes Multiple Event Upsets Aware FPGAs Using Protected Schemes Costas Argyrides, Dhiraj K. Pradhan University of Bristol, Department of Computer Science Merchant Venturers Building, Woodland Road, Bristol,

More information

Practical Byzantine Fault Tolerance (The Byzantine Generals Problem)

Practical Byzantine Fault Tolerance (The Byzantine Generals Problem) Practical Byzantine Fault Tolerance (The Byzantine Generals Problem) Introduction Malicious attacks and software errors that can cause arbitrary behaviors of faulty nodes are increasingly common Previous

More information

Error Detection and Recovery for Transient Faults in Elliptic Curve Cryptosystems

Error Detection and Recovery for Transient Faults in Elliptic Curve Cryptosystems Error Detection and Recovery for Transient Faults in Elliptic Curve Cryptosystems Abdulaziz Alkhoraidly and M. Anwar Hasan Department of Electrical and Computer Engineering University of Waterloo January

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE

DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE DESIGN AND ANALYSIS OF TRANSIENT FAULT TOLERANCE FOR MULTI CORE ARCHITECTURE DivyaRani 1 1pg scholar, ECE Department, SNS college of technology, Tamil Nadu, India -----------------------------------------------------------------------------------------------------------------------------------------------

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Functional Safety Architectural Challenges for Autonomous Drive

Functional Safety Architectural Challenges for Autonomous Drive Functional Safety Architectural Challenges for Autonomous Drive Ritesh Tyagi: August 2018 Topics Market Forces Functional Safety Overview Deeper Look Fail-Safe vs Fail-Operational Architectural Considerations

More information

ID 025C: An Introduction to the OSEK Operating System

ID 025C: An Introduction to the OSEK Operating System ID 025C: An Introduction to the OSEK Operating System Version 1.0 1 James Dickie Product Manager for Embedded Software Real-time operating systems AUTOSAR software components Software logic analyzer Experience:

More information

The Use of Java in the Context of AUTOSAR 4.0

The Use of Java in the Context of AUTOSAR 4.0 The Use of Java in the Context of AUTOSAR 4.0 Expectations and Possibilities Christian Wawersich cwh@methodpark.de MethodPark Software AG, Germany Isabella Thomm Michael Stilkerich {ithomm,stilkerich}@cs.fau.de

More information

Soft-error Detection Using Control Flow Assertions

Soft-error Detection Using Control Flow Assertions Soft-error Detection Using Control Flow Assertions O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino, Dipartimento di Automatica e Informatica Torino, Italy Abstract Over

More information

Performance Estimation of Virtual Duplex Systems on Simultaneous Multithreaded Processors

Performance Estimation of Virtual Duplex Systems on Simultaneous Multithreaded Processors Performance Estimation of Virtual Duplex Systems on Simultaneous Multithreaded Processors Bernhard Fechner and Jörg Keller FernUniversität Hagen LG Technische Informatik II Postfach 940 58084 Hagen, Germany

More information

Fault Tolerant Computing CS 530

Fault Tolerant Computing CS 530 Fault Tolerant Computing CS 530 Lecture Notes 1 Introduction to the class Yashwant K. Malaiya Colorado State University 1 Instructor, TA Instructor: Yashwant K. Malaiya, Professor malaiya @ cs.colostate.edu

More information

Outline of Presentation Field Programmable Gate Arrays (FPGAs(

Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable

More information

A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor

A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor A Microarchitectural Analysis of Soft Error Propagation in a Production-Level Embedded Microprocessor Jason Blome 1, Scott Mahlke 1, Daryl Bradley 2 and Krisztián Flautner 2 1 Advanced Computer Architecture

More information

Intel iapx 432-VLSI building blocks for a fault-tolerant computer

Intel iapx 432-VLSI building blocks for a fault-tolerant computer Intel iapx 432-VLSI building blocks for a fault-tolerant computer by DAVE JOHNSON, DAVE BUDDE, DAVE CARSON, and CRAIG PETERSON Intel Corporation Aloha, Oregon ABSTRACT Early in 1983 two new VLSI components

More information

Riccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist

Riccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist Riccardo Mariani, Intel Fellow, IOTG SEG, Chief Functional Safety Technologist Internet of Things Group 2 Internet of Things Group 3 Autonomous systems: computing platform Intelligent eyes Vision. Intelligent

More information

Migration of SES to FPGA Based Architectural Concepts

Migration of SES to FPGA Based Architectural Concepts Migration of SES to FPG Based rchitectural Concepts M. Steindl 1, J. Mottok 1, H. Meier 1,F. Schiller 2, M. Fruechtl 2 1 Regensburg University of pplied Sciences Department of Electronics and Information

More information

Dependability Threats

Dependability Threats Dependable Systems Dependability Threats Dr. Peter Tröger Operating Systems Group Dependability Dependability is defined as the trustworthiness of a computer system such that reliance can justifiable be

More information

Fast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs

Fast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs Fast SEU Detection and Correction in LUT Configuration Bits of SRAM-based FPGAs Hamid R. Zarandi,2, Seyed Ghassem Miremadi, Costas Argyrides 2, Dhiraj K. Pradhan 2 Department of Computer Engineering, Sharif

More information

Reliable Computing I

Reliable Computing I Instructor: Mehdi Tahoori Reliable Computing I Lecture 9: Concurrent Error Detection INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What

More information

TU Wien. Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12.

TU Wien. Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12. TU Wien 1 Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet H Kopetz TU Wien December 2008 Properties of a Successful Protocol 2 A successful real-time protocol must have the following

More information

An Integrated ECC and BISR Scheme for Error Correction in Memory

An Integrated ECC and BISR Scheme for Error Correction in Memory An Integrated ECC and BISR Scheme for Error Correction in Memory Shabana P B 1, Anu C Kunjachan 2, Swetha Krishnan 3 1 PG Student [VLSI], Dept. of ECE, Viswajyothy College Of Engineering & Technology,

More information

Building High Reliability into Microsemi Designs with Synplify FPGA Tools

Building High Reliability into Microsemi Designs with Synplify FPGA Tools Power Matters. TM Building High Reliability into Microsemi Designs with Synplify FPGA Tools Microsemi Space Forum 2015, Synopsys 1 Agenda FPGA SEU mitigation industry trends and best practices Market trends

More information

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect

6. Fault Tolerance. CS 313 High Integrity Systems; CS M13 Critical Systems; Michaelmas Term 2009, Sect 6. Fault Tolerance (a) Introduction. (b) Types of faults. (c) Fault models. (d) Fault coverage. (e) Redundancy. (f) Fault detection techniques. (g) Hardware fault tolerance. (h) Software fault tolerance.

More information