Dependability and real-time. TDDD07 Real-time Systems Lecture 7: Dependability & Fault tolerance. Early computer systems. Dependable systems


Jesse Watkins
5 years ago

TDDD07 Real-time Systems
Lecture 7: Dependability & Fault tolerance
Simin Nadjm-Tehrani
Real-time Systems Laboratory, Department of Computer and Information Science, Linköping University


Dependability and real-time
- If a system is to produce results within time constraints, it needs to produce results at all!
- How to justify and measure how well computer systems do their jobs?

Dependable systems
- How do things go wrong and why?
- What can we do about it?
- This lecture: Basic notions of dependability and replication in fault-tolerant systems
- Next lecture: Designing dependable real-time systems

Early computer systems
- 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system in 1951
- Short life of vacuum tubes gave mean time to failure of 2 minutes

Engineers: Fool me once, shame on you; fool me twice, shame on me.
Software developers: Fool me N times, who cares, this is complex and anyway no one expects software to work...

FT - June 6, 2004
"If you have a problem with your Volkswagen the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with."
Bernd Pischetsrieder, chief executive of Volkswagen

October 2005
Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly. (Wired 05-11-08)
The problem was found to be an embedded software bug.

February 2, 2004
Angel Eck, driving a 1997 Pontiac Sunfire, found her car racing at high speed and accelerating on Interstate 70 for 45 minutes, heading toward Denver... with no effect from trying the brakes, shifting to neutral, and shutting off the ignition.

Driver support: Volvo cars
- 1984 Anti-lock Braking System (ABS)
- 1998 Dynamic Stability and Traction Control (DSTC)
- 2002 Roll Stability Control (RSC)
- 2003 Intelligent Driver Information System (IDIS)
- 2004 Blind Spot Information System (BLIS)
- 2006 Active Bi-Xenon lights
- 2006 Adaptive Cruise Control (ACC)
- 2006 Collision warning system with brake support

Early space and avionics
- During 1955, 8 air carrier accidents in the USA (when only 2% of the public was willing to fly!)
- Today's complexity is many times higher

Airbus 380
- Integrated modular avionics (IMA), with safety-critical digital components, e.g.
  - Power-by-wire: complementing the hydraulic powered flight control surfaces
  - Cabin pressure control (implemented with a TTP operated bus)

Dependability
- How do things go wrong? Why? What can we do about it?
- Basic notions in dependable systems and replication for fault tolerance (Swedish: Pålitliga datorsystem)

What is dependability?
- Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]
- The ability to avoid service failures that are more frequent or more severe than is acceptable.

Attributes of dependability
IFIP WG 10.4 definitions:
- Safety: absence of harm to people and environment
- Availability: the readiness for correct service
- Integrity: absence of improper system alterations
- Reliability: continuity of correct service
- Maintainability: ability to undergo modifications and repairs

Reliability (Swedish: Tillförlitlighet)
- Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time.
- Typical measure in aerospace: 10^-9
- Another way of putting it: MTTF - one failure in 10^9 flight hours.

Faults, Errors & Failures
- Fault: a defect within the system or a situation that can lead to failure
- Error: manifestation (symptom) of the fault - an unexpected behaviour
- Failure: system not performing its intended function

Examples
- Year 2000 bug
- Bit flips in hardware due to cosmic radiation in space
- Loose wire
- Aircraft retracting its landing gear while on ground
- Effects in time: permanent / transient / intermittent
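The two ways of stating the aerospace reliability target (a rate of 10^-9 and one failure in 10^9 flight hours) can be related by a short calculation. This is only a sketch: it assumes a constant failure rate λ per flight hour (an exponential failure model), which the slides do not state.

```latex
\lambda = 10^{-9}\ \text{failures per flight hour}, \qquad
R(t) = e^{-\lambda t}, \qquad
\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt = \frac{1}{\lambda} = 10^{9}\ \text{flight hours}
```

Under the same assumption, the probability of surviving a 10-hour flight is R(10) = e^{-10^{-8}}, which is approximately 1 - 10^{-8}.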

Fault, Error, Failure
- Goal of system verification and validation is to eliminate faults
- Some will remain
- Goal of hazard/risk analysis is to focus on important faults
- Goal of fault tolerance is to reduce effects of errors if they appear - eliminate or delay failures

More on dependability
Four approaches [IFIP WG 10.4]:
1. Fault avoidance
2. Fault removal
(Next lecture)
3. Fault tolerance
4. Fault forecasting

Google's 100 min outage
September 2, 2009: "A small fraction of Gmail's servers were taken offline to perform routine upgrades." "We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers."
Fault forecasting?

Fault tolerance
- Means that a system provides a degraded (but acceptable) function
- Even in presence of faults
- During a period defined by certain model assumptions

External factors
- The film...
- Foreseen or unforeseen?
- Fault model describes the foreseen faults

Fault models
- Leading to node failures: crash, omission, timing, Byzantine
- Leading to channel failures: crash (and potential partitions), message loss, message delay, erroneous/arbitrary messages

On-line fault-management
- Fault detection: by program or its environment
- Fault tolerance using redundancy: software, hardware, data
- Fault containment by architectural choices

Redundancy
From D. Lardner, Edinburgh Review, year 1834:
"The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods."
* people who compute

Static redundancy
Used all the time (whether an error has appeared or not), just in case
- SW: N-version programming
- HW: Voting systems
- Data: parity bits, checksums

Dynamic redundancy
Used when an error appears and specifically aids the treatment
- SW: Recovery methods, exceptions
- HW: Switching to back-up module
- Data: Self-correcting codes
- Time: Re-computing a result

Server replication models
- Passive replication
- Active replication
(Figure: the two replication models; the legend marks which set of servers denotes a replica group)
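Two of the static redundancy examples listed above (hardware voting and parity bits) can be sketched in a few lines. This is an illustrative sketch, not code from the lecture: the function names `majority_vote`, `even_parity_bit` and `parity_ok` are made up for this example, and the voter assumes replica outputs can be compared for exact equality.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority is required; a tie or a three-way split yields None.
    return value if count > len(outputs) // 2 else None


def even_parity_bit(bits: Sequence[int]) -> int:
    """Parity bit that makes the total number of 1s (word + parity) even."""
    return sum(bits) % 2


def parity_ok(bits: Sequence[int], parity: int) -> bool:
    """True if the word plus its parity bit still has an even number of 1s."""
    return (sum(bits) + parity) % 2 == 0


# Usage: three replicas, one of them faulty, are masked by the voter.
voted = majority_vote([42, 42, 7])

# Usage: a single bit flip is detected (but not corrected) by the parity bit.
word = [1, 0, 1, 1]
p = even_parity_bit(word)
corrupted = [1, 0, 0, 1]
detected = not parity_ok(corrupted, p)
```

The voter masks a faulty replica without first detecting which one failed, which is what makes it static redundancy; the parity bit only detects an odd number of bit flips and needs a further mechanism to recover.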

Increasing availability
- Active replication. Group membership: who is up, who is down?
- Passive replication. Primary backup: bring the secondary server up to date
- What do we need to implement to support transition to operational mode?
  - Message ordering
  - Agreement among replicas

Relating states
- Needs a notion of time (precedence) in distributed systems
- Recall from lecture 4: if clocks are synchronised
  - The rate of processing at each node can be related
  - Timeouts can be used to detect faults as opposed to message delays
- Asynchronous systems: abstract time via order of events

Distributed snapshot
- Vector clocks help to synchronise at event level
- Consistent snapshots
- But reasoning about response times and fault tolerance needs quantitative bounds

Chicken and egg problem
- Replication is useful in presence of failures if there is a consistent common state among replicas
- To get consistency, processes need to communicate their state via broadcast
- But broadcast algorithms are distributed algorithms that run on every node and can be affected by failures

Desirable broadcast properties
Reliable broadcast:
- all non-crashed processes agree on messages delivered (agreement)
- no spurious messages (integrity)
- all messages broadcast by non-crashed processes are delivered (validity)
All or none!

How to implement?
- The first step is to separate the underlying network (transport) and the broadcast mechanism
- Distinguish between receipt and delivery of a message
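The remark that vector clocks order events without synchronised physical clocks can be made concrete with a small sketch. The class and function names below (`VectorClock`, `happened_before`, `concurrent`) are chosen for this example only, and the sketch assumes a fixed, known number of processes.

```python
from typing import List


class VectorClock:
    """Vector clock for process `pid` in a system of `n` processes."""

    def __init__(self, n: int, pid: int) -> None:
        self.pid = pid
        self.clock: List[int] = [0] * n

    def tick(self) -> List[int]:
        """Local event or send: advance own component, return a timestamp copy."""
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, stamp: List[int]) -> List[int]:
        """Merge the sender's timestamp component-wise, then count the receive event."""
        self.clock = [max(a, b) for a, b in zip(self.clock, stamp)]
        return self.tick()


def happened_before(a: List[int], b: List[int]) -> bool:
    """a -> b iff a <= b in every component and the stamps differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a: List[int], b: List[int]) -> bool:
    """Distinct events are concurrent when neither happened before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Usage: p0 sends a message, p1 has an independent local event, then receives.
p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
send_stamp = p0.tick()                # [1, 0]
local_stamp = p1.tick()               # [0, 1]
recv_stamp = p1.receive(send_stamp)   # [1, 2]
```

The timestamps capture precedence only; as the slide notes, they say nothing about elapsed time, so response-time and timeout reasoning still needs quantitative bounds.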

(Figure: layered view. The application layer calls Broadcast and gets Deliver from the broadcast protocol; the broadcast protocol calls Send and gets Receive from the transport layer.)

Example implementation
At every node n:
  Execute broadcast(m) by:
    adding sender(m) and a unique ID as a header to the message m (building m')
    send(m') to all neighbours including itself
  When receive(m'):
    if previously not executed deliver(m) then
      if sender(m) /= n then send(m') to all neighbours
      deliver(m)

What if n fails?
- Directly after a receipt?
- While relaying?
- After sending to some but not all neighbours?
This is where failure models come in...

Algorithms for broadcast
- Correctness: prove validity, integrity, agreement, order
- In presence of a given failure model!
- Typical assumptions:
  - no link failures leading to partition
  - send does not duplicate or change messages
  - receive does not invent messages

The consensus problem
- Assume that we have a reliable broadcast
- Processes p1, ..., pn take part in a decision
- Each pi proposes a value vi, e.g. application state info
- All correct processes decide on a common value v that is equal to one of the proposed values
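The example broadcast implementation (send to all neighbours including oneself, relay on first receipt, then deliver once) can be simulated on a small graph. The sketch below is a single-threaded simulation under simplifying assumptions that are not in the slides: crashed nodes are down from the start, the sender is live, channels are reliable, and only one message is broadcast. The name `flood_broadcast` and its parameters are hypothetical.

```python
from collections import deque
from typing import AbstractSet, Dict, List, Set


def flood_broadcast(
    neighbours: Dict[int, List[int]],
    sender: int,
    payload: str,
    crashed: AbstractSet[int] = frozenset(),
) -> Dict[int, List[str]]:
    """Simulate one broadcast; return the payloads delivered at each live node.

    Assumption: crashed nodes are down from the start and the sender is live.
    """
    header = (sender, 0)                      # sender(m) plus a unique ID (0 here)
    delivered: Dict[int, List[str]] = {
        n: [] for n in neighbours if n not in crashed
    }
    done: Set[int] = set()                    # nodes that already delivered this ID
    # broadcast(m): send to all neighbours including itself
    in_transit = deque([sender] + list(neighbours[sender]))
    while in_transit:
        node = in_transit.popleft()           # a receive event at `node`
        if node in crashed or node in done:
            continue                          # crashed nodes stay silent; duplicates dropped
        done.add(node)
        if header[0] != node:                 # relay first (the sender already sent) ...
            in_transit.extend(neighbours[node])
        delivered[node].append(payload)       # ... then deliver, exactly once
    return delivered


# Usage: on a ring, a single crashed node does not prevent agreement.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
ring_result = flood_broadcast(ring, 0, "m", crashed={2})

# Usage: on a line, the same crash partitions the network and node 3 gets nothing.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
line_result = flood_broadcast(line, 0, "m", crashed={2})
```

The line topology shows why "no link failures leading to partition" appears among the typical assumptions: once the live nodes are disconnected, flooding cannot provide agreement. The simulation does not model a node crashing midway through relaying, which is the harder case raised on the "What if n fails?" slide.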


Desired properties
- Every non-faulty process eventually decides (Termination)
- No two non-faulty processes decide differently (Agreement)
- If a process decides v then the value v was proposed by some process (Validity)
- In presence of given failure model
- Algorithms for consensus have to be proven to have these properties

Basic impossibility result
[Fischer, Lynch and Paterson 1985]
There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure.

A way around it
- Assume synchrony:
  - Distributed computations proceed in rounds initiated by pulses
  - Pulses implemented using local physical clocks, synchronised assuming bounded message delays

Can I prove any protocol correct?
- NO!
- It depends on the fault model: assumptions on how the nodes/channels can fail

Byzantine agreement
- Problem: nodes need to agree on a decision but some nodes are faulty and act in an arbitrary way (can be malicious)

Byzantine agreement protocol
- Proposed in 1980 by Pease, Shostak and Lamport
- How many faulty nodes can the algorithm tolerate?

Result from 1980
- Theorem: There is an upper bound t for the number of Byzantine node failures compared to the size of the network N: N ≥ 3t+1
- Gives a t+1 round algorithm for solving consensus in a synchronous network
- Here: we just demonstrate that 3t nodes would not be enough (the case t=1, N=3)!

Scenario 1
- The general and lieutenant L are correct, the third node is faulty
(Figure: the messages each node reports having been told; the node labels and values did not survive extraction)

Scenario 2
- The general and the third node are correct, L is faulty
(Figure: message exchange for scenario 2)

Scenario 3
- The general is faulty!
(Figure: message exchange for scenario 3)

2-round algorithm does not work with t=1, N=3!
- Seen from L, scenarios 1 and 3 are identical, so whatever L decides in scenario 1 it will also decide in scenario 3
- Similarly for the third node: whatever it decides in scenario 2 it also decides in scenario 3
- L and the third node do not agree in scenario 3!

Summary
- Dependable systems need to justify why services can be relied upon
- To tolerate faults we normally deploy replication
- Replication needs (some kind of) agreement on state
- Agreement problems are hard to prove correct in distributed systems
- They require assumptions on faults, and (some kind of) synchrony
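The bound from the 1980 result (Byzantine agreement with t faulty nodes needs N ≥ 3t+1 nodes and, in the synchronous algorithm, t+1 rounds) is easy to turn into a sizing helper. This is a sketch of the arithmetic only, not of the agreement protocol itself; the function names are invented for this example.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest t such that n >= 3t + 1, i.e. the faults n nodes can tolerate."""
    if n < 1:
        raise ValueError("a network needs at least one node")
    return (n - 1) // 3


def min_nodes(t: int) -> int:
    """Smallest network size that tolerates t Byzantine nodes."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return 3 * t + 1


def rounds_needed(t: int) -> int:
    """Rounds used by the t+1 round synchronous algorithm."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return t + 1


# Usage: three nodes tolerate no Byzantine fault, which is the case the
# three scenarios illustrate; four nodes tolerate one fault in two rounds.
three_node_tolerance = max_byzantine_faults(3)
four_node_tolerance = max_byzantine_faults(4)
```

For example, tolerating two arbitrary node failures requires at least seven nodes and three rounds under these formulas.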

Real-time communication
- Remember in a TTP bus:
  - Nodes have synchronised clocks
  - The communication infrastructure (collection of CCs and the TTP bus communication) guarantees a bounded time for new data appearing on the communication interface of a node (its CNI)
  - Byzantine faults in nodes can be detected by other nodes!
- Exercise: Which faults are detectable in a system connected with a CAN bus?

Timing and fault tolerance
- Implementing fault tolerance often relies on support for timing guarantees
- Exercise: How do faults affect support for timing?
  - How do faults affect response times in nodes?
  - How do fault tolerance mechanisms affect timing?
