Dependability Threats


1 Dependable Systems: Dependability Threats
Dr. Peter Tröger, Operating Systems Group

2 Dependability
"Dependability is defined as the trustworthiness of a computer system such that reliance can justifiably be placed on the service it delivers. The service delivered is the behavior as it is perceptible to users; a user is another system (human or physical) which interacts with the former." (J. C. Laprie)

3 Dependability Tree

4 Dependability Threats
- Threat: unintended event or state
- Failure (German: "Ausfall"): event in which the system no longer complies with its specification
- Error (German: "Fehlerzustand"): part of the system state that can lead to a subsequent failure
- Fault (German: "Fehlerursache"): adjudged or hypothesized cause of an error
- An error is a state; faults and failures are events in time
- Treating failures is repair; treating or avoiding errors is maintenance

5 Alternative Definitions


6 Chain of Dependability Threats
- Separation of system states and the events leading to them
- Classical version according to Laprie
[Figure: state diagram. Labels: External Fault, Internal Fault, Normal, Dormant Fault, Activation, Active Fault / Latent Error, Detection, Detected Error, Error Handling, Restoration, Failure, Outage.]

7 System?
- Integrated combination of humans, products, and processes
- Functional and non-functional specification
- In the IT world, the "products" are typically called components
  - Interacting with each other and the environment
- IT systems are organized in layers (recursive definition)
  - Hardware executes operating system
  - Operating system executes application
  - Application executes plugin

8 IT System
- The current state of a layer is either correct or incorrect
- Focus on one chosen investigated layer
- Fault tolerance: dealing with detectable incorrect states of the investigated and execution layer
- Failure: externally visible incorrect state of the investigated layer
- "The total state of a given system is the set of the following states: computation, communication, stored information, interconnection, and physical condition." (A. Avižienis)

9 Chain of Dependability Threats (with layers)
[Figure: state diagram over the state variables s_I and s_E. Transition labels: Activation, Deactivation, Detection, Enabling, Disabling, Recovery, Mitigation, Restoration, FAIL.]

10 Chain vs. Propagation
- The chain of dependability threats occurs within one investigated layer
  - Software bug = fault
  - May lead to a wrong variable value = error
  - May lead to an exception being thrown = failure
- What happens if the problem leaves the layer?
- Example: redundant RAID array with mirroring, a single disk fails
  - Resulting error state in the RAID system ("red light")
  - May or may not propagate to the operating system
- Error propagation: a failure in one layer is a fault in another layer
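
The chain and its propagation across layers can be illustrated with a minimal Python sketch. The functions `average` and `report`, the exception `LayerFailure`, and the deliberately planted bug are all hypothetical and not taken from the slides.

```python
class LayerFailure(Exception):
    """Externally visible deviation of a layer from its specification."""


def average(values):
    # Fault (hypothetical bug): divides by a hard-coded 10 instead of len(values).
    result = sum(values) / 10  # a wrong value here is the error (a state)
    if not (min(values) <= result <= max(values)):
        # A plausibility check detects the error; raising makes the failure visible.
        raise LayerFailure(f"average {result} outside input range")
    return result


def report(values):
    # Calling layer: a failure of `average` is a fault from this layer's view.
    try:
        return f"mean={average(values):.1f}"
    except LayerFailure:
        # The fault is tolerated here, so it does not propagate further upwards.
        return "mean=unavailable"


print(report([5] * 10))   # fault stays dormant: exactly ten inputs
print(report([1, 2, 3]))  # fault is activated, failure is contained by `report`
```

With exactly ten inputs the fault never produces an error, which is why such a bug can survive testing for a long time.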

11 Error Propagation

12 Fault Classification
- High diversity in possible sources and types

13 Observations on Faults
- An external fault is a design fault: inability or refusal to foresee all situations
- Design faults are created during system development, system modification, or the creation and establishment of operational procedures
  - Just replacing a broken component with the same version leads to recurrent faults
- Physical faults are accidental faults
- Temporary external accidental physical faults are also called transient faults
- Temporary internal accidental faults are also called intermittent faults
  - Examples: pattern-sensitive memory hardware, system overload
  - Arbitrary concept: permanent faults with unknown activation condition
- Intentional and design faults are human-made faults and might be malicious faults
- Hardware production defects are typically physical faults

14 Observations on Faults
- A fault is active when it produces an error
- A non-active internal fault is a dormant / passive fault
  - Origin in hardware: often cycling between dormant and active
- Many specialized versions of the term "fault", e.g. bug
  - Heisenbug: resulting error disappears by itself
  - Bohrbug: resulting error is independent of the execution state
  - Mandelbug: leads to an error only under specific conditions
- Fault-tolerant system design is a contradiction
  - Design demands specification, faults are non-specified cases
  - Solution: specification for the fault-free case + additional fault model
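
The difference between a dormant and an active fault can be sketched in a few lines of Python. The faulty function `is_leap_year` and the helper `fault_is_active` are hypothetical illustrations; the standard library function `calendar.isleap` serves as the specification.

```python
import calendar


def is_leap_year(year):
    # Hypothetical design fault: the century rule of the calendar is missing.
    return year % 4 == 0


def fault_is_active(year):
    """True if the fault produces an error (a wrong result) for this input."""
    return is_leap_year(year) != calendar.isleap(year)


for year in (2023, 2024, 2000, 1900):
    state = "active" if fault_is_active(year) else "dormant"
    print(year, state)
```

The fault is present in every execution, but only inputs such as 1900 activate it.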

15 Fault Model
- Faults can be classified on different abstraction levels
  - Physics
  - Circuit level / switching circuit level
    - Interesting for hardware design research (not this course)
    - Investigate logical signals on connections
    - Stuck-at-zero, stuck-at-one, bridging faults, stuck-open
  - Register transfer level
  - Processor-memory-switch (PMS) level
  - Hardware system level
  - ... (Software)
- Extremely important when talking about dependability means

16 Physical Faults
- Highly energized particles from space, atmospheric, or ground radiation
- Influence of a particle that strikes a circuit: atomic displacement, direct ionization, indirect ionization created by nuclear reactions
- Smaller structures are more sensitive to ionization effects
- Single Event Upset (SEU)
  - Injected charge modifies the hardware state temporarily
  - Can happen in memory and logic hardware
- Detected Unrecoverable Error (DUE) / Silent Data Corruption (SDC)
  - Problem becomes permanent
  - May be detected or undetected
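
A short Python sketch shows how a bit flip can end up as either a detected or a silent corruption. The use of a single even-parity bit is an assumption made for illustration; the slides do not name a detection mechanism, and the helper names are hypothetical.

```python
def parity(word):
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def flip_bit(word, position):
    """Model an upset: invert one bit of the stored word."""
    return word ^ (1 << position)


stored = 0b1011_0010
stored_parity = parity(stored)

single_upset = flip_bit(stored, 3)
print(parity(single_upset) != stored_parity)  # mismatch: detected, not correctable

double_upset = flip_bit(single_upset, 5)
print(parity(double_upset) != stored_parity)  # no mismatch: silent corruption
```

Parity can only detect an odd number of flipped bits and cannot correct any of them, so stronger codes are needed where correction is required.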

17 Single Event Upset

18 Fault Model for Semiconductor Memories
- Stuck-at-1 or stuck-at-0 (hard) faults
- Transition / bit-flip faults (0->1, 1->0)
- Multiple writing: data is written into more than one cell on a write attempt to one cell
- Pattern sensitivity: device does not perform reliably with certain data pattern(s)
- Write recovery: a write followed by a read/write at a different location results in a read/write at the same location
- Sense amplifier recovery: data accessed remains the same for a number of cycles and then suddenly changes
- Bridging fault: short between cells, AND type or OR type
- State coupling fault: coupled (victim) cell is forced to 0 or 1 if the coupling (aggressor) cell is in a given state
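
Two of these fault types can be simulated in Python to show why the fault model determines the test. The class `FaultyMemory` and the test routine `find_faulty_cells` are hypothetical; the naive write-then-read test is an illustration, not a recommended memory test.

```python
class FaultyMemory:
    """Bit-addressable memory with injectable faults (illustrative only)."""

    def __init__(self, size):
        self.cells = [0] * size
        self.stuck = {}     # cell index -> forced value (stuck-at fault)
        self.coupling = {}  # aggressor index -> (trigger state, victim index, forced value)

    def write(self, index, bit):
        self.cells[index] = bit
        rule = self.coupling.get(index)
        if rule is not None:
            trigger, victim, forced = rule
            if bit == trigger:
                self.cells[victim] = forced  # state coupling disturbs the victim

    def read(self, index):
        return self.stuck.get(index, self.cells[index])


def find_faulty_cells(memory):
    """Write 0, then 1, to every cell and report cells that read back wrong."""
    faulty = set()
    for pattern in (0, 1):
        for index in range(len(memory.cells)):
            memory.write(index, pattern)
            if memory.read(index) != pattern:
                faulty.add(index)
    return sorted(faulty)


stuck_memory = FaultyMemory(8)
stuck_memory.stuck[2] = 0
print(find_faulty_cells(stuck_memory))    # the stuck-at-0 cell is found

coupled_memory = FaultyMemory(8)
coupled_memory.coupling[1] = (1, 5, 0)    # cell 1 in state 1 forces cell 5 to 0
print(find_faulty_cells(coupled_memory))  # naive test misses the coupling fault
coupled_memory.write(1, 1)
print(coupled_memory.read(5))             # victim was silently overwritten
```

The naive test only covers stuck-at faults. Finding the coupling fault requires a test that rereads other cells after each write.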

19 Software?
- All bugs are permanent design faults
  - Ignoring user demands
  - Ignoring special properties of the system environment
  - Incomplete specification of dependability requirements
  - Incomplete documentation
- Example for a software fault model: Orthogonal Defect Classification (ODC)
  - Any requirement to change the product is a defect
  - Defect trigger: what makes the defect surface
  - Defect type: nature of the fix you put on the defect

20 ODC Security Defect Types
[Figure: chart of ODC security defect types; axis labels and values are not recoverable from the extracted text.]

21 Errors
- An error escalates to a failure depending on
  - intentional / unintentional redundancy
  - system activity
  - the specification of a failure case from the user perspective (e.g. maximum outage time, acceptable delay, retransmission rate)
- System activity can reverse the error state before damage happens
- Latent (not recognized) vs. detected error resulting from an active fault
- Hardware often contains unintentional redundancy, which makes it difficult to test

22 Hardware Error Models
- Hardware faults affect state information, e.g. register values
  - Stuck-at and other hardware faults can therefore also be denoted as errors
- More interesting to investigate the resulting effects on system level
  - Single data error: program data is corrupted (in cache, memory, or register)
  - Single code error: effect on one instruction of the code
    - Type 1 / type 2: code modification without / with change of control flow
- The nature of the error state may conform to the nature of the originating fault
  - Transient vs. permanent, static vs. dynamic, single vs. multiple
  - Depends on the utilized dependability means

23 Hardware Error Models
- Mapping of a hardware-level single bit-flip error to other layers
  - Memory data segment, processor data cache: system-level single data error
  - Memory code segment, processor code cache: system-level single code error of type 1 (modification of target register) or type 2 (modification of branch target)
  - Memory stack segment: system-level data error or type 2 code error
  - Processor register: depends on processor architecture and register type
    - Single data error, if the register holds data interpreted by the application
    - Single type 1 code error, if the register holds an address used by a load/store operation
    - Single type 2 code error, if the register holds the address of a branch target
  - Processor control register: everything could happen...

24 Hardware Error Models - Code Errors
Original program:
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BNZ LOOP
Erroneous variant, ADD replaced by SUB:
    MOV R0, 10
    MOV R1, 1
    LOOP: SUB R1, R1
    SUB R0, 1
    BNZ LOOP
Erroneous variant, branch target LOOP replaced by FOOBAR:
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BNZ FOOBAR
Erroneous variant, BNZ replaced by BZ:
    MOV R0, 10
    MOV R1, 1
    LOOP: ADD R1, R1
    SUB R0, 1
    BZ LOOP
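
To make the effect of these code errors tangible, the following Python sketch interprets the small programs. The instruction semantics are assumptions, since the slide does not define them: `ADD` and `SUB` store their result in the first operand and set a zero flag, `MOV` leaves the flag untouched, and `BNZ` / `BZ` branch on that flag. The interpreter `run` is hypothetical.

```python
def run(source, max_steps=1000):
    """Interpret the toy assembly dialect and return the final register values."""
    labels, code = {}, []
    for line in source.strip().splitlines():
        line = line.strip()
        if ":" in line:  # "LABEL: instruction"
            label, line = line.split(":", 1)
            labels[label.strip()] = len(code)
        if line.strip():
            code.append(line.strip())

    regs, zero, pc, steps = {}, False, 0, 0
    while pc < len(code):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        op, _, rest = code[pc].partition(" ")
        args = [arg.strip() for arg in rest.split(",")]
        pc += 1
        if op == "BNZ" or op == "BZ":
            taken = (not zero) if op == "BNZ" else zero
            if taken:
                pc = labels[args[0]]  # KeyError if the target does not exist
            continue
        dst = args[0]
        src = regs[args[1]] if args[1] in regs else int(args[1])
        if op == "MOV":
            regs[dst] = src
        elif op == "ADD":
            regs[dst] += src
            zero = regs[dst] == 0
        elif op == "SUB":
            regs[dst] -= src
            zero = regs[dst] == 0
        else:
            raise ValueError(f"unknown instruction {op}")
    return regs


ORIGINAL = """
MOV R0, 10
MOV R1, 1
LOOP: ADD R1, R1
SUB R0, 1
BNZ LOOP
"""

print(run(ORIGINAL))                          # R1 doubled ten times
print(run(ORIGINAL.replace("ADD", "SUB")))    # wrong operation, same control flow
print(run(ORIGINAL.replace("BNZ", "BZ")))     # inverted branch, loop body runs once
try:
    run(ORIGINAL.replace("BNZ LOOP", "BNZ FOOBAR"))
except KeyError as missing:
    print("branch to unknown target:", missing)
```

Under these assumptions, the first variant corrupts only the computed value, while the other two change the control flow of the program.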

25 Software Error Models
- Similar terminology, but completely different semantics
- Syntactic errors are handled by the compiler, semantic errors occur at runtime
- Static vs. dynamic, permanent vs. temporary errors
- Example for the C programming language
  - Errors affecting assignments (missing / wrong local variable values)
  - Errors affecting conditional instructions (wrong boolean or iteration condition)
  - Errors affecting function call / return (wrong parameters, return statement)
  - Errors affecting algorithms (missing statements or function calls, wrong operators)
- Under research in the software engineering field: field studies, automated code analysis, developer interviews


26 Error Message Occurrence
- The same fault can lead to different (detected or undetected) errors
- Errors become detected by an error detection mechanism
  - Some undetected errors are detected by several detectors
  - Some detectors report several undetected errors as one
  - Some undetected errors are never uncovered
- Detected errors might not be logged if the system stops too fast

27 Failures
- Visible non-compliance of the system with the specification
  - Failure effect: why is this failure worth investigating?
  - Failure mode: type of failure in relation to the functionality of the system
  - Failure mechanism: how can this happen?
- Failure models are well-known in distributed software systems
- Classical categorization in the onion model [Barborak, Cristian]

28 Onion Model
- Assumption: system of components sending messages to each other
  - Maps to hardware with electrical signals
  - Maps to distributed software systems
- Fail-stop failure: no more messaging, other components are informed
- Crash failure: no more messaging, no information
- Omission failure: messages are omitted for some time
- Timing failure: reaction to a message or sending of a message is too early / late
- Computation failure: wrong answer message to a correctly received message
- Byzantine failure: anything
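
One way to encode the onion model is an ordered enumeration, sketched below in Python. It assumes that the order of the list above corresponds to the layers of the onion from innermost to outermost, so that a mechanism designed for an outer class also covers all inner classes. This strict nesting is a simplification, and the names `FailureMode` and `tolerates` are hypothetical.

```python
from enum import IntEnum


class FailureMode(IntEnum):
    # Ordered from the innermost (most benign) to the outermost onion layer.
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    COMPUTATION = 5
    BYZANTINE = 6


def tolerates(assumed, observed):
    """True if a design built for failure class `assumed` covers `observed`."""
    return observed <= assumed


print(tolerates(FailureMode.OMISSION, FailureMode.CRASH))  # inner layer is covered
print(tolerates(FailureMode.CRASH, FailureMode.TIMING))    # outer layer is not
```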

29 Failure Severity
- Denotes the consequences of a failure
- Benign failures
  - Failure costs and operational benefits are similar
  - Sometimes also an umbrella term for failures only detected by inspection
  - A system with only such failures is fail-safe
- Catastrophic failures
  - Costs of failure consequences are much larger than the service benefit
- Grading depends on the application
  - Flying airplane: fail-stop is catastrophic
  - Train: fail-stop is benign
- Criticality: highest severity of the possible failure modes in the system

30 Example: DO-178B Standard
- Software Considerations in Airborne Systems and Equipment Certification
- Mature document, developed for more than 20 years
- Definition of the severity of failure conditions for airplane, crew, and passengers
  - Catastrophic: loss of ability to continue safe flight and landing
  - Major: reduced airplane or crew capability to cope with operating conditions
    - Reduction in safety margins and functional capabilities
    - Higher workload or physical distress for the crew
  - Minor: not significantly reduced airplane safety, slight increase in workload (example: change of flight plan)
  - No effect: failure results in no loss of operational capabilities and no increase in crew workload
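
The definition of criticality as the highest severity among the possible failure modes maps directly to a maximum over an ordered scale. The Python sketch below uses only the four severity classes named on this slide; the example failure modes and their grading are invented for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    NO_EFFECT = 0
    MINOR = 1
    MAJOR = 2
    CATASTROPHIC = 3


def criticality(failure_modes):
    """Highest severity among the failure modes of a system."""
    return max(failure_modes.values(), default=Severity.NO_EFFECT)


# Hypothetical failure modes of an imaginary avionics function.
modes = {
    "cabin display flickers": Severity.NO_EFFECT,
    "flight plan must be re-entered": Severity.MINOR,
    "autopilot disengages in turbulence": Severity.MAJOR,
}
print(criticality(modes).name)
```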

31 Example: DO-178B Standard


32 Example: ISO 26262
- Functional safety of automotive systems
- Severity of failures expressed as Automotive Safety Integrity Level (ASIL)
  - Controllability: can the driver compensate?
  - Severity: how bad are the consequences?
  - Exposure: how often does that happen?
[Figure: risk diagram with the axes Severity (S) of injuries and Controllability (C), each ranging from "ok" to "bad", separating the regions "failure risk acceptable" and "failure risk not acceptable".]
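
How the three factors combine into an ASIL can be sketched in Python. The slides do not reproduce the classification table, so the sketch relies on two assumptions: the class ranges S1-S3, E1-E4, and C1-C3, and the commonly cited shortcut that the sum of the three class numbers selects the level. The function `asil` is hypothetical, and the standard itself remains the authoritative source.

```python
def asil(severity, exposure, controllability):
    """Approximate ASIL from the classes S1-S3, E1-E4, and C1-C3.

    Assumption: the sum shortcut (10 -> D, 9 -> C, 8 -> B, 7 -> A, below -> QM).
    """
    if severity not in (1, 2, 3):
        raise ValueError("severity class must be 1..3")
    if exposure not in (1, 2, 3, 4):
        raise ValueError("exposure class must be 1..4")
    if controllability not in (1, 2, 3):
        raise ValueError("controllability class must be 1..3")
    total = severity + exposure + controllability
    return {7: "A", 8: "B", 9: "C", 10: "D"}.get(total, "QM")


print(asil(3, 4, 3))  # severe, frequent, hard to control
print(asil(1, 2, 1))  # light injuries, rare, easy to control
```

"QM" stands for quality management, meaning that no ASIL-specific requirements apply.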

33 Wording and Numbers

34 Wording and Numbers

35 Observations on Failures
- Failures and system load are correlated
  - Load can lead to wear-out, so the failure probability increases
  - Higher load can activate dormant faults
  - Detected faults lead to recovery activities, which again increase the load
- Possibility of unintended feedback effects in complex systems
- Common-cause failures: multiple parts are impacted for the same reason
  - Cascade failures through a common dependency (e.g. power)
  - Secondary failures from an inappropriate environment (e.g. temperature)
  - Common-mode failures from bad design (e.g. identical redundant units)

36 Example: Amazon EBS
- Failure of Amazon cloud services in 2012
  - Major web sites were down (Reddit, Netflix, Airbnb, ...)
- Report about the root cause
  - A large number of cloud storage servers could no longer handle requests
  - A low-priority service with a memory leak was eating all resources
  - The reason was repeated connection attempts to a monitoring server
  - The monitoring server was not reachable due to a DNS misconfiguration
  - The DNS change was caused by the replacement of an unrelated hardware unit
- Example of a cascade failure

37 Fail-Fast
- A common concept from system engineering, company management, ...
  - Report the failure and stop immediately without further action
- Discussed by Jim Gray in 1985 as part of his famous article "Why do computers stop and what can be done about it?"
- Useful when the benefit of recovery does not justify its costs, or if error propagation is highly probable
  - Single units of a redundant set
  - Deeply interwired IT system components
  - Components under heavy request load
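
In software, fail-fast usually means checking all preconditions first and stopping with a clear report before any state is modified. The following Python sketch shows this under invented names (`transfer`, `FailFastError`, and the account data are hypothetical).

```python
class FailFastError(RuntimeError):
    """Raised as soon as a precondition is violated."""


def transfer(balances, source, target, amount):
    # Check everything first and stop immediately, before touching any state.
    if amount <= 0:
        raise FailFastError(f"invalid amount: {amount}")
    if source not in balances or target not in balances:
        raise FailFastError("unknown account")
    if balances[source] < amount:
        raise FailFastError(f"insufficient funds in {source}")
    balances[source] -= amount
    balances[target] += amount


accounts = {"alice": 100, "bob": 20}
try:
    transfer(accounts, "alice", "bob", 500)
except FailFastError as failure:
    print("stopped:", failure)
print(accounts)  # unchanged, no partially applied transfer
```

Because the function stops before any modification, no erroneous state is left behind that could propagate to other components.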

38 Literature
- Laprie, J. Dependability. Basic Concepts and Terminology. (Springer, 1998).
- Hansen, J. P. & Siewiorek, D. P. Models for time coalescence in event logs. In IEEE Proceedings of International Symposium on Fault-Tolerant Computing (FTCS-22) (1992). doi: /ftcs
- Hunny, U., Zulkernine, M. & Weldemariam, K. OSDC: Adapting ODC for Developing More Secure Software. In Proceedings of the 28th Annual ACM Symposium on Applied Computing (ACM, 2013). doi: /
- ISO. Road vehicles - Functional safety - Part 3: Concept phase (ISO ). (2011).
- Ferrell, T. K. & Ferrell, U. D. RTCA DO-178B/EUROCAE ED-12B. In The Avionics Handbook (CRC Press, 2001).
- Goloubeva, O., Rebaudengo, M., Reorda, M. & Violante, M. Software-Implemented Hardware Fault Tolerance. (Springer, 2010).
