Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM SOFTWARE GROUP http://www4.cs.fau.de
Transient Hardware Faults A Growing Problem [3] (Shivakumar, 2002) Transient hardware faults (Soft-Errors)! Induced by e.g., radiation, glitches, insufficient signal integrity Increasingly affecting microcontroller logic Future hardware designs: Even more performance and parallelism On the price of being less and less reliable! 2
Countermeasures - Hardware Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Hardware-based countermeasures! Application-specific design or specialised hardware For example ECC, lock-step! Pragmatic approach (tackles problem right at source) Hardware costs (e.g., redundancy, checker, ) Selectivity (e.g., multi-application systems) Development costs (diverse safety concepts and HW, (re-)certification) 3
! Countermeasures - Software Safety-Critical System! Sensors SafetyCri+cal.Applica+on. Actuators Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs 4
Countermeasures - Software Sensors Safety-Critical System! SafetyCri+cal.Applica+on.(1). SafetyCri+cal.Applica+on.(2). Actuators SafetyCri+cal.Applica+on.(3). Different approaches to address transient hardware faults! Hardware vs. software measures Applicability and costs Software-based triple modular redundancy (TMR)! Accepted and proven (e.g., recommended for ASIL D error handling) Selective (e.g., multi-application systems) 4
! Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Isola/ondomain Sphereofreplica/on(SOR) 5
Software-Based Redundancy in Detail Sensors Safety-Critical System! P = 1 1000 Interface. P = 998 1000 Replica.1. Replica.2. P = 1 1000 Majority. Voter. Actuators Replica.3. Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Isola/ondomain Sphereofreplica/on(SOR) 5
Software-Based Redundancy in Detail Sensors Safety-Critical System! P =? Interface. P =? Replica.1. Replica.2. P =? Majority. Voter. Actuators Replica.3. Isola/ondomain Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) Sphereofreplica/on(SOR) 5
Software-Based Redundancy in Detail Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. Isola/ondomain Sphereofreplica/on(SOR) Software-based TMR requires:! Temporal and spatial isolation (isolation domains) Interface and Majority Voter Single points of failure! No error detection Very small Certain probability Risk analysis Inherently complex Random error distribution? (Nightingale, 2011) 5
Agenda Introduction! The Combined Redundancy Approach! Eliminating Vulnerabilities High-Reliability Voters Example: UAV Flight Control! CoRed Implementation Target System: I4Copter Evaluation! Experimental Setup Results Conclusion! 6
! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7
! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ Encodedopera/on Sphereofreplica/on(SOR) 7
! CoRed Overview Holistic Protection Approach Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Sphereofreplica/on(SOR) 7
CoRed Overview Holistic Protection Approach 1 2 3 Isola/ondomain The Combined Redundancy Approach (CoRed)! Data-flow encoding TMR +{ High-reliability voters Encodedopera/on Holistic Protection Approach! Input to output protection 1 Reading inputs 2 Processing 3 Distributing outputs Composability On application and system level Sphereofreplica/on(SOR) 7
! Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter 8
Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) 8
Eliminating Input and Output Vulnerabilities SOR Y (Value) Encode. Y (EncodedValue) Decode. Y X (Value) Encode. X (EncodedValue) Decode. X A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp } X = X A + B X + D 8
Eliminating Input and Output Vulnerabilities SOR Y (Value) X (Value) Encode. Encode.. Z Decode. Z = X Y A (Prime) B X (Signature) D (Timestamp) Inter-domain data-flow protection! Checksum vs. Arithmetic code (AN code) AN Code Encoded data operations Enabler for high-reliability voter CoRed: Extended AN code ( code)! Based on VCP (Forin, 1989) Data integrity: Prime Address integrity: Per variable signature Outdated data: Timestamp Set of arithmetic operands (+, -, *, =, ) Tailored for efficient encoded data voting } X = X A + B X + D 8
High-Reliability Voter Basics (1) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Replica.3. Z Encode. Z ProviderEncodedVoter CoRed Encoded Voter! Input: variants ( X, Y, Z ) Output: Equality set (E) and winner (W) Based on operations No decoding necessary Branch decisions (equality) on encoded data! IFF difference of encoded values equals difference of static signatures X = Y X Y = B X B Y Each branch decision Unique signature! 9
High-Reliability Voter Basics (2) Replica.1 X Encode. X Replica.2. Y Encode. Y Encoded.Voter. {E, W} Check.(Decode). X Replica.3. Z Encode. Z e.g.,x isthewinner ProviderEncodedVoterConsumer Correct control-flow! Valid decision Unique control-flow path Each path Unique signature Control-flow signatures! Static signature (expected value): Compile-time Used as return value E Dynamic signature (actual value): Runtime, computed from variants Applied to winner W Validation: Subsequent check (decode) 10
CoRed Encoded Voter Example X = Y false Y = Z false X = Z false true true true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} Control-flow monitoring! Finding quorum Static signature Reapply path specific operations Sign winner with dynamic signature Check Subsequent decode 11
! CoRed Encoded Voter Example X = Y false Y = Z true false 1 X = Z true false true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} apply(x, sig dyn {X,Z}) return sig static {X,Z} X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 12
CoRed Encoded Voter Example X = Y true false Y = Z true apply(y, sig dyn {Y,Z}) return sig static {Y,Z} false 1 X = Z true apply(x, sig dyn {X,Z}) return sig static {X,Z} false X = Z false apply(x, sig dyn {X,Y}) return sig static {X,Y} 2 return sig static {} true apply(x, sig dyn {X,Y,Z}) return sig static {X,Y,Z} 1. Improper branch decision: Y Z Voter elects Y as winner (which is incorrect) Returns E and W correctly Subsequent decode will fail! sig static sig dyn 2. Faulty jump! Voter elects X and computes W correctly Returns incorrect E Again subsequent decode will fail! 12
Implementation CoRed implementation! Easy-to-use C++ templates and libraries Hardware independent: Code and Encoded Voter Thin OS integration layer PXROS-HR (Industry-strength commercial RTOS) CiAO (AUTOSAR-OS compatible) CoRed artefacts Real-time tasks and jobs Runtime-environment requirements! Temporal isolation static schedule (time triggered) Spatial isolation HW-based memory protection 13
CoRed Protected Flight Control Infineon TriCore TC1796 Redundant Sensor Setting Target System: I4Copter quadrotor platform Industry-grade hardware and software Triple redundant sensor setting Multi-application system Flight control application! Safety-critical Model-based: MATLAB Simulink Embedded Coder C++ code 14
Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination 15
Evaluation Experimental Setup FlightIControlApplica/on [Rebaudengo, 1999] Sensor System Sensor 1 Encode Sensor 2 Encode Sensor 3 Encode CoRed Encoded Tolerance Voter Decode Decode Decode Replica 1 Replica 2 Replica 3 Encode Encode Encode CoRed Encoded (Exact) Voter Network Interface Decode Actuator Remote Node FaultIInjec/on CampaignManager FaultDB ResultsDB SystemUnderTest HardwareDebugger HostComputer Fault injection Using hardware debugger! Injection of arbitrary fault patterns Minimal-intrusive Minimizing probe effects Fault list generation (Rebaudengo, 1999) Bits registers instructions Potentially huge fault space Vast majority of faults are non-effective Systematic elimination Outcome: 401,592experiments Effec+ve: 67,617errors Categories:.FailSilent,Masked,HardwareDetected,ICode, ControlIFlow,SilentDataCorrup/on 15
Evaluation Experimental Results (1) Interface. Replica.1. Replica.2. 90 % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
Evaluation Experimental Results (1) Interface. Replica.1. Replica.2. 90 % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
Evaluation Experimental Results (1) Interface. Replica.1. Replica.2. 90 % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! SDC HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
Evaluation Experimental Results (1) Interface. Replica.1. Replica.2. 90 % 80 % Data Address Replica.3. Silent Data Corruptions Hardware Detected -Code Detected Masked Distribution of Effective Faults 70 % 60 % 50 % 40 % 30 % 20 % 10 % 0 % SDC Redundant execution campaign (Interface)! Total: ~45,000 Errors Unprotected: Suffers from 3,622 corruptions! TMR: Suffers from 71 corruptions! CoRed: Remaining corruptions are covered 0 corruptions HW Mask SDC HW Unprotected Plain TMR CoRed TMR Mask SDC HW Mask 16
Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica.3. 70 % 60 % 50 % 40 % Silent Data Corruptions 30 % Hardware Detected -Code Detected 20 % Control-flow Monitoring 10 % Masked 0 % SDC HW CFM Mask SDC HW CFM Mask Voter campaign! Plain Voter CoRed Encoded Voter 17
Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica.3. 70 % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17
Evaluation Experimental Results (2) Replica.1. Replica.2. Voter. 90 % 80 % Data Address Replica.3. 70 % 60 % Silent Data Corruptions Hardware Detected -Code Detected Control-flow Monitoring Masked 50 % 40 % 30 % 20 % 10 % 0 % SDC Voter campaign! Plain voter: Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Encoded Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions HW Plain Voter CFM Mask SDC HW CFM CoRed Encoded Voter Mask 17
Evaluation Experimental Results (2) Replica.1. Replica.2. Replica.3. 90 % 60 % Data Address Voter. 80 % 100 µs Evaluation Overhead 70 % 107.1% 80 µs 50 % I4Copter Flight-Control: 7.1% overhead (compared to 40 plain % TMR) 60 µs Absolute numbers: 1,842µs (application) Silent Data Corruptions 30 % Hardware Detected WCET 20 % -Code 40 Detected Selectivity! µs Control-flow Monitoring I4Copter system 10 % CPU utilisation: 41% Masked Full replication impossible, CPU: 120% 0 % 20 µs Mission-critical replication of flight control possible with CoRed, Plain CPU: Voter 60% CoRed Encoded Voter Voter campaign! 0 µs Plain voter: CoRed Pair Plain TMR CoRed TMR CoRed Pair + Spare Total ~11,000 2,465 masked 7,245 retry 1,223 corruptions CoRed Voter: Total ~26,000 1,228 masked 24,682 retry 0 corruptions Replication 100.0% Overhead Analysis! Voting Interface State Recovery SDC HW 103.2% 10.2µs (plain voter) vs. 77.6µs (CoRed voter) CFM Mask SDC HW 115.1% CFM 120 % 100 % Flight-Control Execution Time Mask 18
!!! Conclusion Safety-Critical System! Replica.1. Sensors Interface. Replica.2. Majority. Voter. Actuators Replica.3. 19
!! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19
!! Conclusion Safety-Critical System! Decode. Replica.1. Encode. Sensors Encoded. Interface. Decode. Replica.2. Encode. Encoded Voter. Actuators Decode. Replica.3. Encode. The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection 19
! Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable 19
Conclusion The Combined Software Redundancy Approach (CoRed)! Eliminate Single Points of Failure in software-based TMR No specific application knowledge necessary Holistic approach: input-to-output protection Applicability: Flight control! I4Copter MAV Selective and composable Experimental Results! CoRed is effective Silent data corruptions can be eliminated Only 7.1% overhead (flight control example) 19
Thank you! SYSTEM SOFTWARE GROUP http://www4.cs.fau.de
References (1) International Roadmap for Semiconductors, 2001 (2) Implications of microcontroller software on safety-critical automotive systems (Infineon 2008) (3) P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi, Modelling the effect of technology trends on the soft error rate of combinational logic, in DSN 02: Proceedings of the 2002 International Conference on Dependable Systems and Networks (4) Edmund B. Nightingale, John R Douceur, and Vince Orgovan, Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs, in Proceedings of EuroSys 2011, Awarded "Best Paper", ACM, April 2011 (5) M. Rebaudengo and M. S. Reorda, Evaluating the fault tolerance capabilities of embedded systems via bdm, VTS 1999 (6) Forin, Vital coded microprocessor principles and application for various transit systems, 1989 21