Safety and Reliability of Software-Controlled Systems
Part 14: Fault mitigation
Prof. Dr.-Ing. Stefan Kowalewski
Chair Informatik 11, Embedded Software Laboratory
RWTH Aachen University
Summer Semester 2011
MITIGATION OF HARDWARE & PROGRAMMING FAULTS Part 14: Fault mitigation, Slide 2
Hardware Reliability
How do you avoid system failures due to random hardware faults?
- Fault prevention?
  - Increase the reliability of hardware components
  - Common target failure rate: $10^{-9}\,\mathrm{h}^{-1}$
  - Often not sufficient
- Fault removal?
  - Testing, verification, simulation
  - Detects production and design faults only
- Fault tolerance?
  - Make use of redundancy
  - Makes it possible to achieve the safety and/or reliability goal
Part 14: Fault mitigation, Slide 3
Hardware Fault Tolerance
- Tolerance of hardware faults by means of hardware replication?
  - Triple modular redundancy is often too expensive!
  - Restricted use of hardware redundancy
- Software-implemented hardware fault tolerance:
  - Make use of software to monitor the hardware
- Sophisticated monitoring concepts:
  - Combine hardware and software techniques
  - See e.g. the E-Gas monitoring concept
Part 14: Fault mitigation, Slide 4
Hardware Components
[Figure: block diagram of the hardware components. Blocks shown: clock, power supply, sensors, connectors, digital input, digital output, analogue input, analogue output, actuators, processing unit, serial bus interface, RAM, ROM. Legend: information source / sensor; information sink / actuator; hardware component; function (internal data flow not specified); data flow.]
Part 14: Fault mitigation, Slide 5
Fault Tolerance
Two aspects of fault tolerance:
1. Error detection: a deviation from the expected service is detected.
2. System recovery: the system is transformed into an error-free state or into a state in which the error does not occur again.
Design for safety: initiate a transition to a safe state.
Part 14: Fault mitigation, Slide 6
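The two aspects, detection followed by a transition to a safe state, can be sketched as a small latched state machine. This is only an illustration: the type names, `SAFE_CMD`, and the choice to latch the safe state until reset are assumptions, not part of the lecture.

```c
/* Hypothetical two-state controller: once an error is detected, the
 * system stays in the safe state and drives a predefined safe command. */
typedef enum { STATE_NORMAL, STATE_SAFE } system_state_t;

typedef struct {
    system_state_t state;
    int actuator_cmd;
} controller_t;

#define SAFE_CMD 0  /* assumed safe actuator value, e.g. throttle closed */

void controller_step(controller_t *c, int error_detected, int requested_cmd)
{
    if (error_detected) {
        c->state = STATE_SAFE;          /* transition to the safe state */
    }
    if (c->state == STATE_SAFE) {
        c->actuator_cmd = SAFE_CMD;     /* ignore requests while safe */
    } else {
        c->actuator_cmd = requested_cmd;
    }
}
```

Latching is a design choice: a system that returns to normal operation as soon as the error indication disappears may oscillate when the fault is intermittent.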
IEC 61508: Safe Failure Fraction
Determine for each safety-related component:
- Failure rate of safe failures: $\lambda_S$
- Failure rate of undetected dangerous failures: $\lambda_{DU}$
- Failure rate of detected dangerous failures: $\lambda_{DD}$
Safe Failure Fraction:
$$\mathrm{SFF} = \frac{\lambda_S + \lambda_{DD}}{\lambda_S + \lambda_{DD} + \lambda_{DU}}$$
Part 14: Fault mitigation, Slide 7
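As a small worked illustration, the safe failure fraction is the share of all failures that are either safe or dangerous but detected. The sketch below computes it from the three failure rates; the struct, the function name, and the error return value are assumptions made for this example.

```c
/* Failure rates of one component in 1/h (illustrative field names). */
typedef struct {
    double lambda_s;   /* safe failures */
    double lambda_dd;  /* detected dangerous failures */
    double lambda_du;  /* undetected dangerous failures */
} failure_rates_t;

/* Returns the SFF in [0, 1], or -1.0 if the total rate is not positive. */
double safe_failure_fraction(failure_rates_t r)
{
    double total = r.lambda_s + r.lambda_dd + r.lambda_du;
    if (total <= 0.0) {
        return -1.0;
    }
    return (r.lambda_s + r.lambda_dd) / total;
}
```

For example, with safe failures at 5e-7 per hour, detected dangerous failures at 4e-7 per hour and undetected dangerous failures at 1e-7 per hour, the fraction is 0.9, i.e. 90 %.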
IEC 61508: Safe Failure Fraction Part 14: Fault mitigation, Slide 8
Functional Tests of RAM Cells
Correct functioning of a RAM cell means:
- reading a 1 and a 0 correctly,
- changing a 1 into a 0 correctly and vice versa, and
- writing a 1 and a 0 correctly,
each independently of the states of the other cells.
Functional test: sequence of write and read accesses
- Complexity of a complete test of n cells: $2^n$
- Use a fault model, e.g. stuck-at faults, coupling faults
- Popular tests: March tests, Abraham test
Part 14: Fault mitigation, Slide 9
March Test
- A March test consists of a sequence of march elements.
- A march element consists of a sequence of operations applied to a cell, together with an address order:
  - Operations: w0, w1, r0, r1
  - Possible address orders: ⇑ increasing order, ⇓ decreasing order, ⇕ arbitrary order
- Example: March C-
  { ⇕(w0); ⇑(r0, w1); ⇑(r1, w0); ⇓(r0, w1); ⇓(r1, w0); ⇕(r0) }
Part 14: Fault mitigation, Slide 10
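The March C- sequence can be sketched in C as follows. In its usual form the four middle elements run in increasing, increasing, decreasing, decreasing address order. This sketch is a simplification: it treats each byte as one cell with the data backgrounds 0x00 and 0xFF, so it does not target coupling faults between bits of the same byte. The function names are assumptions, and the test is destructive (the tested region ends up zeroed).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One march element (r<expect>, w<write_val>): visit every cell in the
 * given address order, read and compare, then write the new value. */
static bool march_element(volatile uint8_t *mem, size_t n, bool ascending,
                          uint8_t expect, uint8_t write_val)
{
    for (size_t k = 0; k < n; k++) {
        size_t i = ascending ? k : n - 1 - k;
        if (mem[i] != expect) {
            return false;               /* read mismatch: fault detected */
        }
        mem[i] = write_val;
    }
    return true;
}

/* Returns true if no fault was observed in mem[0..n-1]. */
bool march_c_minus(volatile uint8_t *mem, size_t n)
{
    for (size_t i = 0; i < n; i++) {    /* (w0), arbitrary order */
        mem[i] = 0x00;
    }
    if (!march_element(mem, n, true,  0x00, 0xFF)) return false; /* up   (r0, w1) */
    if (!march_element(mem, n, true,  0xFF, 0x00)) return false; /* up   (r1, w0) */
    if (!march_element(mem, n, false, 0x00, 0xFF)) return false; /* down (r0, w1) */
    if (!march_element(mem, n, false, 0xFF, 0x00)) return false; /* down (r1, w0) */
    for (size_t i = 0; i < n; i++) {    /* (r0), arbitrary order */
        if (mem[i] != 0x00) {
            return false;
        }
    }
    return true;
}
```

The test performs ten memory operations per cell, which is why March tests remain practical where the exhaustive $2^n$ test is not.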
Classification of Hardware Faults
Classification with respect to persistency:
- Permanent faults: presence is assumed to be continuous in time
- Transient faults: presence is bounded in time
- What happens when a transient fault occurs?
  » logical 0 → logical 1
  » logical 1 → logical 0
  Called a bit flip; can lead to a soft error
- Causes?
  » Radiation
  » Crosstalk
  » Noise
Part 14: Fault mitigation, Slide 11
Detecting memory faults
- Which class of faults is detected by functional tests?
  - Permanent and transient faults
  - But useful for permanent faults only
- Concurrent fault detection? Use redundancy:
  - Parity bit
  - Block replication
  - Error correction code (ECC)
- Fault detection in invariable memory?
  - Cyclic redundancy checks (CRC)
Part 14: Fault mitigation, Slide 12
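Two of these mechanisms are easy to sketch in C: a parity bit over one byte and a CRC over an invariable memory image. The example assumes a CRC-8 with polynomial 0x07 and initial value 0x00. This choice is for illustration only, and real systems typically use wider CRCs. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Even-parity bit of one byte: 1 if the number of set bits is odd. */
uint8_t parity_bit(uint8_t byte)
{
    byte ^= byte >> 4;
    byte ^= byte >> 2;
    byte ^= byte >> 1;
    return byte & 1u;
}

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x80u) {
                crc = (uint8_t)((crc << 1) ^ 0x07u);
            } else {
                crc = (uint8_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Compare the CRC of a ROM image with the value stored at build time. */
int rom_image_ok(const uint8_t *image, size_t len, uint8_t stored_crc)
{
    return crc8(image, len) == stored_crc;
}
```

A single parity bit detects any odd number of flipped bits in the protected word but cannot correct them; the CRC check is typically run at start-up or cyclically in the background over ROM or flash.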
Dependent failures
- Condition for independent events: $P(A \text{ and } B) = P(A) \cdot P(B)$
- Condition for dependent failures: $P(\text{Failure}_A \text{ and } \text{Failure}_B) \neq P(\text{Failure}_A) \cdot P(\text{Failure}_B)$
[Figure: common cause failures (a single event causes both Failure A and Failure B) and cascading failures (Failure A causes Failure B).]
Part 14: Fault mitigation, Slide 13
Dependent Failures
Typical events or root causes:
- Common and shared resources
  - Hardware
  - Power supply
- Input data
- Specification
- Environmental factors
  - Temperature
  - Humidity
  - Electromagnetic compliance
Part 14: Fault mitigation, Slide 14
Detecting faults in the Processing Unit
- Self-test by software. Test of
  - the registers and internal RAM
  - the coding and execution, including the flag register
  - the address calculation
  - the program counter and stack pointer
- Can a processing unit determine its own state of health?
  - Common cause failures possible
- Increase fault coverage: trigger and evaluate the test by an external hardware unit
Part 14: Fault mitigation, Slide 15
Detecting faults in the Processing Unit
- Time redundancy
  - Using the same software: detects transient faults only
  - Using diverse software versions: detects transient and some permanent faults
- Control flow checking
  - Define valid program paths at design time: compute a golden signature
  - Check compliance at run time: compute a signature and check it against the golden signature
  - Implemented either exclusively in software or using a watchdog processor
Part 14: Fault mitigation, Slide 16
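A purely software-based control flow check might look like the following sketch. Each block of the valid path updates a running signature, and the result is compared with a golden signature computed at design time. The signature function (multiply by 31, add the block ID), the block IDs, and the `inject_fault` parameter are assumptions chosen to keep the example short; they are not taken from a specific scheme.

```c
#include <stdint.h>

/* Golden signature for the only valid path 1 -> 2 -> 3:
 * ((0 * 31 + 1) * 31 + 2) * 31 + 3 = 1026. */
#define GOLDEN_SIGNATURE 1026u

static uint32_t cfc_signature;

void cfc_reset(void)
{
    cfc_signature = 0u;
}

/* Called at the entry of every checked block. */
void cfc_checkpoint(uint32_t block_id)
{
    cfc_signature = cfc_signature * 31u + block_id;
}

int cfc_check(uint32_t golden)
{
    return cfc_signature == golden;
}

/* One control cycle; inject_fault simulates a skipped block. */
int control_cycle(int inject_fault)
{
    cfc_reset();
    cfc_checkpoint(1u);                 /* block 1: read inputs */
    if (!inject_fault) {
        cfc_checkpoint(2u);             /* block 2: compute outputs */
    }
    cfc_checkpoint(3u);                 /* block 3: write outputs */
    return cfc_check(GOLDEN_SIGNATURE); /* 1 = path valid, 0 = violation */
}
```

With a watchdog processor, the checkpoints would instead be reported to the external unit, which performs the comparison independently of the monitored CPU.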
E-Gas
- E-Gas: throttle-by-wire
- Drive-by-wire application: no mechanical link between the control element and the actuator
- Required computations:
  - Metering fuel
  - Adjusting the ignition timing
  - Controlling the air supply
- Possibility of increasing the power of the engine
  - Safety-critical system! Ensure the correct function
Part 14: Fault mitigation, Slide 17
Controlling the Drive Unit of a Vehicle Part 14: Fault mitigation, Slide 18 [Source: US Patent 5880568]
Controlling the Drive Unit of a Vehicle Part 14: Fault mitigation, Slide 19 [Source: US Patent 5880568]
Dual-Core Microcontroller
Two driving forces:
1. Performance
   - Same performance at 200 MHz as a single-core MCU operating at 500 MHz
   - Lower power consumption
   - Lower heat generation
2. Safety
   - Redundancy: two processors
   - Different monitoring concepts possible
Part 14: Fault mitigation, Slide 20
Dual-Core Architectures
- Homogeneous redundancy: Core 1 and Core 2 are identical
- Heterogeneous redundancy: Core 1 and Core 2 differ
- Symmetric execution: both cores execute the same program
- Asymmetric execution: Core 1 executes Program 1, Core 2 executes Program 2
Part 14: Fault mitigation, Slide 21
Dual-Core Lockstep
- Dual-core lockstep: homogeneous, synchronous dual-core architecture
- Lockstep principle: both processors respond to the same data in the same way
- Fault detection unit: comparator comparing the output data of the processors
[Figure: master and checker cores connected to the bus and peripherals; the comparator signals an error.]
Part 14: Fault mitigation, Slide 22
Dual-Core Lockstep
Disadvantages:
- No additional performance from the second core
- Detection of processor faults only: susceptible to systematic and cascading failures
- High costs: special dual-core architecture required
- Common cause failures?
Part 14: Fault mitigation, Slide 23
Software Faults
How do you avoid system failures due to software faults?
- Fault avoidance
  - Apply different techniques, e.g. (semi-)formal methods, graphical modeling, coding guidelines
- Fault removal
  - Reviewing, testing, simulation, verification
- Fault tolerance
  - Assertions
  - Plausibility checks
  - N-version programming
Part 14: Fault mitigation, Slide 24
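Assertions and plausibility checks can be illustrated with a hypothetical pedal sensor. The value range, the maximum change per cycle, and both function names below are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define PEDAL_MIN       0     /* assumed raw sensor range */
#define PEDAL_MAX       1000
#define PEDAL_MAX_STEP  200   /* assumed largest credible change per cycle */

/* Plausibility check: the value must be in range and must not jump
 * further than physically credible since the previous sample. */
bool pedal_plausible(int previous, int current)
{
    int step = current - previous;
    if (current < PEDAL_MIN || current > PEDAL_MAX) {
        return false;
    }
    if (step < 0) {
        step = -step;
    }
    return step <= PEDAL_MAX_STEP;
}

/* Assertion: documents and checks a precondition of the computation. */
int throttle_demand_percent(int pedal)
{
    assert(pedal >= PEDAL_MIN && pedal <= PEDAL_MAX);
    return pedal / 10;
}
```

The plausibility check belongs in the data path, where an implausible value triggers a defined reaction such as a substitute value or a safe state. A standard C `assert` aborts the program and is compiled out when `NDEBUG` is defined, so production code usually needs a dedicated failure handler instead.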
Choice of Programming Language For SIL 3 and 4 applies: The use of a language subset is highly recommended. Part 14: Fault mitigation, Slide 25 [IEC 61508-7, Annex C (informative)]
Why can C cause problems?
Example: if (a = b) { /* some instruction */ }
What does it refer to?
- A comparison: if (a == b) { /* some instruction */ }
- An assignment followed by a test: a = b; if (a != 0) { /* some instruction */ }
Rule: Do not use assignments in conditions!
Part 14: Fault mitigation, Slide 26
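The difference between the two readings can be made concrete with two small functions. The function names are hypothetical, and the assignment variant is written out explicitly so that it compiles without warnings.

```c
/* Behaves like `if (a = b)`: assigns b to a first, then tests whether
 * the new value of a is non-zero. */
int branch_taken_assignment(int a, int b)
{
    a = b;
    return a != 0;
}

/* Behaves like `if (a == b)`: compares without modifying a. */
int branch_taken_comparison(int a, int b)
{
    return a == b;
}
```

For a = 1, b = 2 the assignment variant takes the branch while the comparison does not; for a = 0, b = 0 it is the other way round.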
Design Recommendations Part 14: Fault mitigation, Slide 27 [IEC 61508-3, Annex B (normative)]
Coding Guidelines
Goals of coding guidelines:
- Avoid misunderstandings
- Avoid undefined behaviour
- Increase code readability
  - Avoids the introduction of defects
  - Makes debugging easier
  - Simplifies adding new features
Coding guidelines can be a controversial issue, e.g. naming conventions and style conventions.
Part 14: Fault mitigation, Slide 28
MISRA-C
- MISRA: Motor Industry Software Reliability Association
- MISRA-C: development guideline for vehicle-based software implemented in C
- Popular guidelines, not only in the automotive industry
- There are tools, e.g. PC-Lint, that offer MISRA compliance checking. However, not all rules can be checked automatically.
Part 14: Fault mitigation, Slide 29
Satisfying the Tool
Original code: if (a = b) { /* some instruction */ }
Tool reports violation: Condition should be of Boolean type.
What the programmer did: if (!!(a = b)) { /* some instruction */ }
The tool is satisfied, but the assignment in the condition is still there.
Part 14: Fault mitigation, Slide 30
IEC 61508: Techniques & measures according to SIL Part 14: Fault mitigation, Slide 31
IEC 61508: Techniques & measures according to SIL Part 14: Fault mitigation, Slide 32