Fault Tolerance
Distributed Systems
September 2002

Basics
A component provides services to clients. To provide services, the component may require services from other components, so a component may depend on some other component.
Specifically: A component C depends on C* if the correctness of C's behavior depends on the correctness of C*'s behavior.

September 2002 DoCS 2002
Dependable Systems
Dependable systems are characterized by
o Availability: % of time the system may be used immediately
o Reliability: mean time between failures
o Safety: how serious is the impact of a failure
o Maintainability: how long does it take to repair the system
o Security

Definitions
A system fails when it does not perform according to its specification.
An error is part of a system state that may lead to a failure.
A fault is the cause of an error.
Types of faults
o Transient: weather on a microwave link
o Intermittent: bad solder joint
o Permanent: coding error
Note: For distributed systems, components can be either processes or channels.
A fault tolerant system does not fail in the presence of faults.
Server Failure Models

Type of failure              Description
Crash failure                A server halts, but is working correctly until it halts
Omission failure             A server fails to respond to incoming requests
  Receive omission           A server fails to receive incoming messages
  Send omission              A server fails to send messages
Timing failure               A server's response lies outside the specified time interval
Response failure             The server's response is incorrect
  Value failure              The value of the response is wrong
  State transition failure   The server deviates from the correct flow of control
Arbitrary failure            A server may produce arbitrary responses at arbitrary times

Observation: Crash failures are the least severe; arbitrary failures are the worst.

Crash Failure (1)
Problem: Clients cannot distinguish between a crashed component and one that is just a bit slow.
Examples: Consider a server from which a client is expecting output:
o Is the server perhaps exhibiting timing or omission failures?
o Is the channel between client and server faulty (crashed, or exhibiting timing or omission failures)?
Crash Failure (2)
o Fail-silent: The component exhibits omission or crash failures; clients cannot tell what went wrong
o Fail-stop: The component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
o Fail-safe: The component exhibits arbitrary, but benign failures (they can't do any harm)

Masking Errors with Redundancy
o Information redundancy - Hamming code
o Time redundancy - timeout and retry
o Physical redundancy - triple modular redundancy
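Physical redundancy can be illustrated with a small voter. The following is a minimal sketch of triple modular redundancy, assuming three software replicas of the same function; the names majority_vote, tmr, square and faulty_square are hypothetical and only serve the illustration.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])

def square(x):
    return x * x

def faulty_square(x):
    # Hypothetical faulty replica: always off by one.
    return x * x + 1

if __name__ == "__main__":
    # One faulty replica out of three is masked by the vote.
    print(tmr([square, square, faulty_square], 4))
```

With two faulty replicas that disagree with each other and with the correct one, no value reaches a majority and the voter returns None instead of masking the fault.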
Processes
o Process groups
o Replicated processes
o Reaching agreement
Basic issue: Protect yourself against faulty processes by replicating and distributing computations in a group.

Flat and Hierarchical Groups
Flat group:
o All processes are equal
o Complex decision making
o Requires multicast
Hierarchical group:
o Communicate with coordinator
o Coordinator makes decisions
o Single point of failure
Replicated Processes
Fault tolerance can be achieved by
o Primary backup
o Replicated write
k-fault tolerant
o The system can tolerate k faulty components
o Tolerating k fail-stop faults requires k+1 components
o Tolerating k Byzantine faults requires 2k+1 components

Byzantine Generals
a) The generals announce their troop strengths (in units of 1 kilosoldier).
b) The vectors that each general assembles.
c) The vectors that each general receives in step 3.
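The exchange of troop strengths and vectors can be sketched in a few lines. This is a minimal simulation under stated assumptions: generals are numbered from 0, every loyal general takes a per-element majority over the vectors relayed by the others, and the traitor's behaviour (the hypothetical lie function, which sends a different bogus value to every receiver) is only one of many possible misbehaviours.

```python
from collections import Counter

UNKNOWN = None

def lie(sender, receiver, slot):
    # Hypothetical traitor behaviour: a different bogus value per receiver and slot.
    return f"bogus-{sender}-{receiver}-{slot}"

def agree(strengths, traitors):
    """Return, for each loyal general, the vector it decides on."""
    n = len(strengths)
    # Announcement and assembly: heard[i][j] is what general i heard from general j.
    heard = [[lie(j, i, j) if (j in traitors and j != i) else strengths[j]
              for j in range(n)] for i in range(n)]
    results = {}
    for i in range(n):
        if i in traitors:
            continue
        # Relay: every other general k passes its assembled vector to i;
        # a traitor passes an arbitrary vector instead.
        received = [[lie(k, i, j) for j in range(n)] if k in traitors else heard[k]
                    for k in range(n) if k != i]
        # Decision: per element, take the majority value or mark it UNKNOWN.
        decided = []
        for j in range(n):
            value, count = Counter(vec[j] for vec in received).most_common(1)[0]
            decided.append(value if count > len(received) / 2 else UNKNOWN)
        results[i] = decided
    return results

if __name__ == "__main__":
    print(agree([1, 2, 3, 4], traitors={2}))  # three loyal generals, one traitor
    print(agree([1, 2, 3], traitors={2}))     # two loyal generals, one traitor
```

In the first call the loyal generals agree on each other's strengths and mark only the traitor's entry as unknown. In the second call no element reaches a majority, so the two loyal generals cannot agree on anything.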
Agreement in Faulty Systems
The same as in the previous slide, except now with 2 loyal generals and one traitor.

Reliable Communication
So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels?
Error detection:
o Framing of packets to allow for bit error detection
o Use of frame numbering to detect packet loss
Error correction:
o Add so much redundancy that corrupted packets can be automatically corrected
o Request retransmission of lost, or last N packets
Observation: Most of this work assumes point-to-point communication.
Reliable RPC/RMI?
o Client unable to locate the server
  - Relatively simple: just report back to the client
o Invocation from client to server is lost
  - Just resend the message
o Server crashes after receiving request
o Reply message from server is lost
o Client crashes after sending an invocation

Server Crashes
Server crashes are harder, as you don't know what the server had already done:
o Normal operation
o Crash after execution
o Crash before execution
Possible semantics:
o At least once - retry until ack received
o At most once - report failure
o Exactly once - ideal target
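The difference between the semantics can be made concrete with a toy client and server. This is a minimal sketch, assuming requests carry a unique id and the server keeps a reply cache in memory; Server, call_with_retry and the lost_replies parameter are hypothetical names used only to simulate lost replies.

```python
class Server:
    """Toy server that filters duplicate requests by request id."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> cached reply

    def handle(self, request_id, amount):
        if request_id in self.replies:
            # Duplicate of a request that was already executed: resend the old reply.
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance

def call_with_retry(server, request_id, amount, lost_replies):
    """At-least-once client: resend the same request until a reply gets through."""
    attempts = 0
    while True:
        attempts += 1
        reply = server.handle(request_id, amount)
        if attempts > lost_replies:  # the first `lost_replies` replies are "lost"
            return reply, attempts

if __name__ == "__main__":
    server = Server()
    print(call_with_retry(server, "r1", 10, lost_replies=2))
    print(server.balance)
```

The retries alone give at-least-once behaviour; the duplicate filter keeps the operation from being executed more than once. Because the cache lives in memory, the sketch says nothing about a server that crashes between executing and replying, which is exactly the case that keeps exactly-once out of reach.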
Reliable Communication
o Client unable to locate the server
o Invocation from client to server is lost
o Server crashes after receiving request
o Reply message from server is lost
o Client crashes after sending an invocation

Reliable RPC/RMI
Detecting lost replies can be hard, because it can also be that the server has crashed. You don't know whether the server has carried out the operation.
Solution: None, except that you can try to make your operations idempotent:
o repeatable without any harm done if the operation happened to be carried out before.
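A short example may help to see what idempotent means in practice. The sketch below assumes a toy account stored in a dictionary; deposit, set_balance and apply_twice are hypothetical helpers, and applying an operation twice stands in for a client that resends a request after a lost reply.

```python
def deposit(account, amount):
    # Not idempotent: every repetition changes the state again.
    account["balance"] += amount

def set_balance(account, value):
    # Idempotent: repeating the operation leaves the same state.
    account["balance"] = value

def apply_twice(operation, account, argument):
    """Simulate a retransmitted request that the server executes twice."""
    operation(account, argument)
    operation(account, argument)
    return account["balance"]

if __name__ == "__main__":
    print(apply_twice(deposit, {"balance": 100}, 50))       # 200: money added twice
    print(apply_twice(set_balance, {"balance": 100}, 150))  # 150: no harm done
```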
Reliable RPC/RMI
Problem: The server is doing work and holding resources for nothing (called doing an orphan computation).
o The orphan is killed (or rolled back) by the client when it reboots
o The client broadcasts a new epoch number when recovering; servers kill orphans
o Require computations to complete within T time units. Old ones are simply removed.

Reliable Multicast
Multicast can be useful to groups of replicas.
What is reliable multicast?
o Message order
o Interaction with group membership
o Failing and recovering members
Basic Reliable Multicast
o Known fixed group
o No failures
o Underlying unreliable multicast

Scalable Reliable Multicasting: Feedback Suppression
Basic idea: Let a process P suppress its own feedback when it notices another process Q is already asking for a retransmission.
Assumptions:
o All receivers listen to a common feedback channel to which feedback messages are submitted
o Process P schedules its own feedback message randomly, and suppresses it when observing another feedback message
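The effect of random scheduling can be estimated with a small simulation. This is a minimal sketch under simplifying assumptions: every receiver missed the same message, each draws a uniform random delay, and a feedback message reaches all other receivers after a fixed latency. The function nacks_sent and its parameters are hypothetical.

```python
import random

def nacks_sent(num_receivers, max_delay, latency, rng):
    """Count the receivers that actually multicast feedback for one lost message."""
    delays = sorted(rng.uniform(0, max_delay) for _ in range(num_receivers))
    first = delays[0]
    # The earliest feedback arrives everywhere at first + latency. A receiver whose
    # own timer fires before that moment has not yet observed it and sends as well.
    return sum(1 for d in delays if d <= first + latency)

if __name__ == "__main__":
    rng = random.Random(42)
    for latency in (0.0, 0.1, 1.0):
        print(latency, nacks_sent(1000, max_delay=10.0, latency=latency, rng=rng))
```

The sketch shows the trade-off: suppression works best when the random delays are spread over an interval that is long compared with the propagation latency, at the cost of slower retransmission.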
Multicast Feedback
o Only negative acknowledgements (NACKs) are sent
o A multicast NACK suppresses the same NACK from other receivers
o The retransmission is also multicast

Hierarchical Feedback Control
Local coordinator
o Forwards the message to its children.
o Handles retransmission requests.
Atomic Multicast
o A message is delivered to all members of a group or to none.
o Messages are delivered in the same order to all members.
o What does it mean for a member to join or leave the group?
o Simplifies higher level applications.

Virtual Synchrony
Distinguish between message receipt and message delivery.
Virtual Synchrony
[Figure: timelines of processes P1 to P4 with the events "P1 joins", "P3 crashes" and "P3 rejoins"; the group view changes from G={P1,P2,P3,P4} to G={P1,P2,P4} and back to G={P1,P2,P3,P4}.]
Principle of virtual synchronous multicast.

Ordered Multicast
o FIFO ordering: If a correct process issues multicast(g, m1) followed by multicast(g, m2), then every correct process that delivers m2 will deliver m1 before m2.
o Causal ordering: If multicast(g, m1) happened before multicast(g, m2), then any correct process that delivers m2 will deliver m1 before m2.
o Total ordering: If a correct process delivers m1 before it delivers m2, then any other correct process that delivers m2 will deliver m1 before m2.
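FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue, which also illustrates the difference between receiving and delivering a message. The following is a minimal sketch assuming each sender numbers its multicasts from 0; FifoReceiver and its attributes are hypothetical names.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order that sender issued them."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> message received but not yet delivered
        self.delivered = []  # messages handed to the application, in delivery order

    def receive(self, sender, seq, message):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of a message that was already delivered
        self.held[(sender, seq)] = message
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected

if __name__ == "__main__":
    r = FifoReceiver()
    r.receive("P1", 1, "m2")  # received early, held back
    r.receive("P1", 0, "m1")  # fills the gap, both are delivered
    print(r.delivered)
```

The sketch orders messages per sender only; messages from different senders may still be delivered in different orders at different receivers, which is what causal and total ordering address.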
FIFO Ordering
[Figure: messages t1, t2, t3, t4 delivered over time at processes P1 to P5.]

Causal Ordering
[Figure: messages c1, c2, c3 delivered over time at processes P1 to P5.]
Total Causal Ordering
[Figure: messages c1, c2, c3 delivered over time at processes P1 to P5.]

Six Reliable Multicasts

Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
Totally Ordered Requests (ISIS)
[Figure: processes P1 to P4 exchanging 1) the message, 2) a proposed sequence number, 3) the agreed sequence number.]

Implementing Virtual Synchrony
a) Process 4 notices that process 7 has crashed and sends a view change
b) Process 6 sends out all its unstable messages, followed by a flush message
c) Process 6 installs the new view when it has received a flush message from everyone else
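The message, proposed-sequence and agreed-sequence steps of the ISIS scheme for totally ordered requests can be sketched as follows. This is a simplified, single-threaded model and not the ISIS implementation itself: it assumes proposals are (sequence number, process id) pairs so that ties break on process id, that the agreed number is the maximum proposal, and that no process fails. Member and isis_multicast are hypothetical names.

```python
class Member:
    def __init__(self, pid):
        self.pid = pid
        self.largest = 0     # largest sequence number proposed or agreed so far
        self.queue = {}      # message id -> [seq, pid, deliverable]
        self.delivered = []

    def propose(self, msg_id):
        """On receipt of a message: hold it back and propose a sequence number."""
        self.largest += 1
        self.queue[msg_id] = [self.largest, self.pid, False]
        return (self.largest, self.pid)

    def agree(self, msg_id, agreed):
        """On receipt of the agreed number: reorder and deliver from the front."""
        self.largest = max(self.largest, agreed[0])
        self.queue[msg_id] = [agreed[0], agreed[1], True]
        while self.queue:
            head = min(self.queue, key=lambda m: self.queue[m][:2])
            if not self.queue[head][2]:
                break  # the front message still waits for its agreed number
            self.delivered.append(head)
            del self.queue[head]

def isis_multicast(members, msg_id):
    """One complete round for a single message without interleaving."""
    proposals = [m.propose(msg_id) for m in members]
    agreed = max(proposals)
    for m in members:
        m.agree(msg_id, agreed)

if __name__ == "__main__":
    p1, p2 = Member(1), Member(2)
    # Two concurrent messages that reach the members in opposite orders.
    pa = [p1.propose("a")]
    pb = [p1.propose("b"), p2.propose("b")]
    pa.append(p2.propose("a"))
    for m in (p1, p2):
        m.agree("a", max(pa))
    for m in (p1, p2):
        m.agree("b", max(pb))
    print(p1.delivered, p2.delivered)
```

Although the two members received the messages in opposite orders, both deliver them in the same order, because a held-back message can only move to a larger number once its agreed value arrives.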
Two-Phase Commit (1)
a) The finite state machine for the coordinator in 2PC.
b) The finite state machine for a participant.

Two-Phase Commit (2)

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

Actions taken by a participant P when residing in state READY and having contacted another participant Q.
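The table translates directly into a lookup that a blocked participant could run over the peers it manages to contact. This is a minimal sketch; the function name decide_from_peers and the representation of states as strings are assumptions made for the illustration.

```python
def decide_from_peers(peer_states):
    """Return the state a READY participant P moves to after asking its peers.

    peer_states lists the states of the contacted participants Q in the
    order in which they answered.
    """
    for state in peer_states:
        if state == "COMMIT":
            return "COMMIT"  # Q has already seen GLOBAL_COMMIT
        if state in ("ABORT", "INIT"):
            return "ABORT"   # Q aborted, or has not voted yet and so can still abort
        # state == "READY": Q knows no more than P, so contact another participant
    return "READY"           # every peer is READY too: P remains blocked

if __name__ == "__main__":
    print(decide_from_peers(["READY", "COMMIT"]))
    print(decide_from_peers(["READY", "INIT"]))
    print(decide_from_peers(["READY", "READY"]))
```

The last case is the blocking scenario of 2PC: if all reachable participants are in READY, nobody can decide until the coordinator recovers.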
Two-Phase Commit (3)

actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

Outline of the steps taken by the coordinator in a two-phase commit protocol.

Two-Phase Commit (4)

actions by participant:
    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;  /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

Steps taken by a participant process in 2PC.
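The coordinator's decision rule can be tried out without any networking. The sketch below is a simplified model under stated assumptions: the collected votes are passed in as a dictionary, a vote of None stands for a participant that timed out, and the log is a plain list. The function name coordinator_decision is hypothetical.

```python
def coordinator_decision(votes, coordinator_vote="VOTE_COMMIT"):
    """Return (decision, log) for one run of the coordinator.

    votes maps each participant to "VOTE_COMMIT", "VOTE_ABORT",
    or None if no vote arrived before the timeout.
    """
    log = ["START_2PC"]
    if any(vote is None for vote in votes.values()):
        decision = "GLOBAL_ABORT"   # timeout while collecting votes
    elif coordinator_vote == "VOTE_COMMIT" and all(
            vote == "VOTE_COMMIT" for vote in votes.values()):
        decision = "GLOBAL_COMMIT"  # unanimous commit
    else:
        decision = "GLOBAL_ABORT"   # at least one abort vote
    log.append(decision)            # written to the log before it is multicast
    return decision, log

if __name__ == "__main__":
    print(coordinator_decision({"P1": "VOTE_COMMIT", "P2": "VOTE_COMMIT"}))
    print(coordinator_decision({"P1": "VOTE_COMMIT", "P2": "VOTE_ABORT"}))
    print(coordinator_decision({"P1": "VOTE_COMMIT", "P2": None}))
```

Recording the decision before it is sent mirrors the pseudocode: after a crash, the coordinator can read its log and repeat the multicast.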
Two-Phase Commit (5)

actions for handling decision requests:  /* executed by separate thread */
    while true {
        wait until any incoming DECISION_REQUEST is received;  /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;  /* participant remains blocked */
    }

Steps taken for handling incoming decision requests.

Three-Phase Commit
a) Finite state machine for the coordinator in 3PC.
b) Finite state machine for a participant.
Recovery
Stable Storage
a) Stable storage
b) Crash after drive 1 is updated
c) Bad spot

Checkpointing
A recovery line.
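The two failure cases listed for stable storage (a crash after drive 1 is updated, and a bad spot) can be modelled with two dictionaries and a checksum. This is a minimal sketch assuming that drive 1 is always written before drive 2 and that a CRC detects a damaged block; StableStorage, make_block and is_good are hypothetical names.

```python
import zlib

def make_block(data: bytes) -> dict:
    return {"data": data, "crc": zlib.crc32(data)}

def is_good(block) -> bool:
    return block is not None and zlib.crc32(block["data"]) == block["crc"]

class StableStorage:
    def __init__(self):
        self.drive1, self.drive2 = {}, {}

    def write(self, key, data, crash_after_drive1=False):
        self.drive1[key] = make_block(data)  # drive 1 is always updated first
        if crash_after_drive1:
            return                           # simulated crash between the two writes
        self.drive2[key] = make_block(data)

    def recover(self, key):
        b1, b2 = self.drive1.get(key), self.drive2.get(key)
        if is_good(b1):
            self.drive2[key] = dict(b1)      # drive 1 holds the newer value
        elif is_good(b2):
            self.drive1[key] = dict(b2)      # bad spot on drive 1: repair from drive 2

    def read(self, key):
        for drive in (self.drive1, self.drive2):
            if is_good(drive.get(key)):
                return drive[key]["data"]
        return None

if __name__ == "__main__":
    s = StableStorage()
    s.write("k", b"old")
    s.write("k", b"new", crash_after_drive1=True)
    s.recover("k")
    print(s.drive2["k"]["data"])
```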
Independent Checkpointing
The domino effect.

Message Logging
Incorrect replay of messages after recovery, leading to an orphan process.