Fault Tolerance. Distributed Systems IT332


2 Outline
- Introduction to fault tolerance
- Reliable client-server communication
- Distributed commit
- Failure recovery

3 Failures, Due to What?
A system is said to fail when it cannot meet its promises. Failures can happen due to a variety of reasons:
- Hardware faults
- Software bugs
- Operator errors
- Network errors/outages

4 Failures in Distributed Systems
A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure may happen when a component in a distributed system fails. This failure may affect the proper operation of other components, while at the same time leaving yet other components unaffected.

5 Goal and Fault Tolerance
An overall goal in distributed systems is to construct the system in such a way that it can automatically recover from partial failures. (Analogy: without fault tolerance, a punctured tire stops the car; with fault tolerance, the tire is punctured but the car recovers and continues.)
Fault tolerance is the property that enables a system to continue operating properly in the event of failures. For example, TCP is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communication links that are imperfect or overloaded.

6 Dependable Systems
Being fault tolerant is strongly related to what is called a dependable system. A dependable system covers the following properties:
- Availability: the system is most likely working at a given instant in time.
- Reliability: the system will most likely continue to work without interruption during a relatively long period of time.
- Safety: when the system temporarily fails to operate correctly, nothing catastrophic happens.
- Maintainability: how easily a failed system can be repaired.

7 Failure Models
- Crash failure: a server halts, but was working correctly until it stopped.
- Omission failure: a server fails to respond to incoming requests.
  - Receive omission: a server fails to receive incoming messages.
  - Send omission: a server fails to send messages.
- Timing failure: a server's response lies outside the specified time interval.
- Response failure: a server's response is incorrect.
  - Value failure: the value of the response is wrong.
  - State-transition failure: the server deviates from the correct flow of control.
- Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.

8 Fault Masking by Redundancy
The key technique for masking faults is to use redundancy:
- Information redundancy: extra bits are added to allow recovery from garbled bits.
- Software redundancy: extra processes are added to allow tolerating failed processes.
- Hardware redundancy: extra equipment is added to allow tolerating failed hardware components.
- Time redundancy: an action is performed and then, if required, it is performed again.

9 Example: Triple Modular Redundancy (TMR)
Consider a circuit with signals passing through devices A, B, and C in sequence; if one device is faulty, the final result will be incorrect.
With TMR, each device is replicated three times, and after each stage there is a triplicated voter. If two or three of a voter's inputs are the same, the output is equal to that input.
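The voter itself is just a majority function over three replicated inputs. A minimal sketch in Python (the function name tmr_vote and the use of plain values instead of circuit signals are assumptions of the example, not part of the slides):

```python
def tmr_vote(a, b, c):
    """Return the majority value of three replicated inputs.

    If at least two inputs agree, the single faulty replica is masked.
    If all three disagree, more than one replica has failed and the
    fault cannot be masked.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica is faulty")

# One faulty replica (the value 9) is outvoted by the two correct ones.
assert tmr_vote(4, 9, 4) == 4
```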

10 Reliable Client-Server Communication
How to handle communication failures? Use a reliable transport protocol (e.g., TCP) or handle them at the application layer.
Techniques for reliable communication:
- Use redundant bits to detect bit errors in packets.
- Use sequence numbers to detect packet loss.
- Mask corrupted/lost packets using acknowledgements and retransmissions.
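To illustrate the last two techniques, here is a hedged stop-and-wait sketch over UDP: each request carries a sequence number, the client retransmits on timeout, and the server is expected to acknowledge by echoing that sequence number back. The function name send_reliably, the address 127.0.0.1:9000, and the timeout and retry values are assumptions for the example only.

```python
import socket

def send_reliably(payload: bytes, seq: int, addr=("127.0.0.1", 9000),
                  timeout=1.0, max_retries=5) -> bytes:
    """Send one request and retransmit until a matching acknowledgement arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    message = seq.to_bytes(4, "big") + payload       # prepend the sequence number
    for _ in range(max_retries):
        sock.sendto(message, addr)
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            continue                                  # request or reply lost: retransmit
        if int.from_bytes(reply[:4], "big") == seq:
            return reply[4:]                          # acknowledgement for this request
    raise TimeoutError("no acknowledgement after retries")
```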

11 RPC Semantics in the Presence of Failures
- Client cannot locate the server: the RPC system informs the caller of the failure.
- Client request is lost: the client resends the request upon timeout.
- Server crashes after receiving a request.
- Server response is lost.
- Client crashes after sending a request.

12 Server Crashes
Server crashes after receiving a request: did the crash occur before or after the request was carried out? The client cannot distinguish between the two possibilities, leading to three possible semantics:
- At least once: keep trying until a reply is received; guarantees that the RPC has been carried out at least one time, but possibly more.
- Exactly once: desirable, but difficult to achieve.
- At most once: give up immediately and report back the failure; guarantees that the RPC has been carried out at most one time, but possibly none at all.
Figure: a server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

13 Server Response Lost
Upon a timeout, the client cannot tell whether the server has crashed, the reply was lost, or the request was lost.
The client can resend the request for idempotent operations (i.e., operations that can be safely repeated).
For non-idempotent operations, add sequence numbers to requests so that the server can distinguish a retransmitted request from an original request.
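A sketch of the server side of this idea: the server remembers the last sequence number and reply per client, and answers a retransmission from that cache instead of re-executing the non-idempotent operation. The class name DedupServer and its fields are invented for illustration.

```python
class DedupServer:
    """Filter duplicate (retransmitted) requests using per-client sequence numbers."""

    def __init__(self, handler):
        self.handler = handler        # the actual, possibly non-idempotent operation
        self.last_seen = {}           # client_id -> (seq, cached_reply)

    def handle(self, client_id, seq, request):
        cached = self.last_seen.get(client_id)
        if cached is not None and cached[0] == seq:
            return cached[1]          # retransmission: resend the old reply, do not re-execute
        reply = self.handler(request) # original request: execute exactly once here
        self.last_seen[client_id] = (seq, reply)
        return reply
```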

14 Distributed Commit
A distributed transaction involves multiple servers. To ensure the atomicity of transactions, all servers involved must agree on whether to commit or abort.
The process that initiates the distributed transaction acts as the coordinator; the processes participating in the distributed transaction are the participants.
The coordinator relies on a distributed commit protocol to ensure the atomicity of a distributed transaction.

15 Two-Phase Commit Protocol (2PC)
2PC ensures that a transaction commits only when all participants are ready to commit.
Phase I: Voting Phase
- Step 1: The coordinator sends a VOTE_REQUEST message to all participants.
- Step 2: When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.

16 Two-Phase Commit Protocol (2PC)
Phase II: Decision Phase
- Step 1: The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator; in that case, it sends a GLOBAL_COMMIT message to all participants. However, if even one participant has voted to abort the transaction, the coordinator also decides to abort the transaction and multicasts a GLOBAL_ABORT message.
- Step 2: Each participant that voted to commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction; otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
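The two phases, seen from the coordinator, fit in a few lines. In this minimal sketch the message names follow the slides, while the helpers send and recv_vote, the participant list, and the absence of timeouts or crash handling (covered on the next slide) are simplifying assumptions.

```python
VOTE_REQUEST, VOTE_COMMIT, VOTE_ABORT = "VOTE_REQUEST", "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"

def coordinate(participants, send, recv_vote):
    """Run two-phase commit as the coordinator and return the global decision.

    send(p, msg)  -- deliver msg to participant p (assumed helper)
    recv_vote(p)  -- wait for p's vote, VOTE_COMMIT or VOTE_ABORT (assumed helper)
    """
    # Phase I: voting.
    for p in participants:
        send(p, VOTE_REQUEST)
    votes = [recv_vote(p) for p in participants]

    # Phase II: decision. Commit only if every participant voted to commit.
    decision = GLOBAL_COMMIT if all(v == VOTE_COMMIT for v in votes) else GLOBAL_ABORT
    for p in participants:
        send(p, decision)
    return decision
```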

17 Recovering from a Crash
Processes may crash; a timeout is used when a process is waiting for a message from another process. Upon a timeout:
- The coordinator in the WAIT state sends GLOBAL_ABORT to all participants.
- A participant in the INIT state aborts the transaction.
- A participant in the READY state contacts another participant Q and examines Q's state.
- If all participants are in the READY state, they block until the coordinator recovers.
The original slide tabulates the actions taken by a participant P residing in state READY after contacting another participant Q; a hedged sketch follows.
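The table itself is not in this transcription, so the following sketch encodes the standard 2PC treatment of a READY participant's timeout and should be read as an assumption: if some Q has already committed, the coordinator must have decided commit; if some Q has aborted or is still in INIT, aborting is safe; if everyone is READY, the participant must block.

```python
def ready_participant_timeout(contact_others):
    """Decide what a participant in state READY does after a coordinator timeout.

    contact_others() yields the states of the other participants as plain
    strings ("INIT", "READY", "COMMIT", "ABORT"); this representation and the
    decision rules follow the textbook treatment, not the (elided) slide table.
    """
    for q_state in contact_others():
        if q_state == "COMMIT":
            return "COMMIT"   # the coordinator must have sent GLOBAL_COMMIT
        if q_state in ("ABORT", "INIT"):
            return "ABORT"    # no global commit can have been decided, aborting is safe
    return "BLOCK"            # all participants are READY: wait for the coordinator
```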

18 Recovery
When a failure occurs, we need to bring the system into an error-free state.
- Forward recovery: remove all errors in the system's state, thus enabling the system to proceed. Forward recovery is impossible in most cases, because it has to be known in advance which errors may occur.
- Backward recovery: bring the system back to a previous error-free state. Backward recovery is widely used in distributed systems.
Techniques for backward recovery:
- Checkpointing
- Message logging

19 Checkpointing
Each process periodically records its state, i.e., makes a checkpoint.
- A high checkpoint frequency increases the overhead.
- A low checkpoint frequency increases the recovery cost in terms of lost computation.
Consistent global state / distributed snapshot: if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message.
Upon a crash, roll back to a recovery line, i.e., the most recent consistent collection of checkpoints.
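The consistency condition can be checked mechanically: every message recorded as received must also be recorded as sent. A small sketch, assuming each checkpoint is represented as a pair of sets of message identifiers (an illustrative representation, not something the slides prescribe):

```python
def is_recovery_line(checkpoints):
    """Check whether a collection of checkpoints forms a consistent global state.

    checkpoints: iterable of (sent_ids, received_ids) pairs, one per process,
    where each element is a set of message identifiers.
    """
    all_sent, all_received = set(), set()
    for sent, received in checkpoints:
        all_sent |= sent
        all_received |= received
    # Consistent iff nothing was recorded as received without being recorded as sent.
    return all_received <= all_sent

# Q recorded sending m1 and P recorded receiving it: consistent.
assert is_recovery_line([({"m1"}, set()), (set(), {"m1"})])
# P recorded receiving m2, but no checkpoint recorded sending it: not consistent.
assert not is_recovery_line([(set(), {"m2"}), (set(), set())])
```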

20 Checkpointing
Figure: checkpoints of two processes P and Q over time, from the initial state up to a failure, with a message sent from Q to P. A set of checkpoints in which we are able to identify both the senders and the receivers of all messages jointly forms a distributed snapshot and is a recovery line; a set in which the receipt of the message is recorded but its sending is not is not a recovery line.

21 Independent Checkpoints
Each process periodically checkpoints independently of the other processes. Upon a failure, each process is rolled back to its most recent checkpoint. If the most recent checkpoints do not form a consistent global state, the processes need to keep rolling back until a consistent global state is found: a cascaded rollback.
Figure: after a failure, P and Q roll back checkpoint by checkpoint because each successive set of checkpoints is not a recovery line.
A sketch of this rollback search follows.
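A hedged sketch of the cascaded rollback: starting from everyone's most recent checkpoint, keep rolling back any process whose current checkpoint records the receipt of a message that no current checkpoint records as sent. Checkpoints are again modelled as cumulative (sent, received) sets of message identifiers, oldest first, with index 0 being the initial (empty) state; this representation is an assumption of the example.

```python
def find_recovery_line(histories):
    """Return, per process, the index of the checkpoint on the recovery line.

    histories: one list of checkpoints per process, oldest first, where each
    checkpoint is a pair (sent_ids, received_ids) of cumulative sets and the
    first checkpoint of every process is the empty initial state.
    """
    idx = [len(h) - 1 for h in histories]   # start from the most recent checkpoints
    while True:
        sent = set().union(*(h[i][0] for h, i in zip(histories, idx)))
        received = set().union(*(h[i][1] for h, i in zip(histories, idx)))
        orphans = received - sent           # received here, but not sent in this cut
        if not orphans:
            return idx                      # consistent: this is the recovery line
        for p, h in enumerate(histories):
            if h[idx[p]][1] & orphans:
                idx[p] -= 1                 # cascaded rollback of the receiving process
```

Rolling one process back may in turn orphan messages it had sent and others had recorded as received, forcing those processes to roll back too; in the worst case (the domino effect) the search ends at the initial state.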

22 Coordinated Checkpoints
Processes use the distributed snapshot algorithm to coordinate checkpointing: all processes synchronize to jointly write their state to local stable storage. The saved state is automatically globally consistent.
Upon a failure, all processes roll back and restart from the latest snapshot.

23 Message Logging
Many distributed systems combine checkpointing (expensive) with message logging (cheap).
Each process periodically records its local state and logs the messages it received after having recorded that state. When a process crashes, restore the most recently checkpointed state, and then replay the messages that have been received since.
Message logging can be of two types:
- Sender-based logging: a process logs its messages before sending them off.
- Receiver-based logging: a receiving process first logs an incoming message before delivering it to the application.
Combining infrequent checkpointing with message logging is more efficient than frequent checkpointing.
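A minimal sketch of receiver-based logging on top of a checkpoint: messages received since the last checkpoint are appended to a log before being delivered, and recovery restores the checkpoint and replays the log in order. The class LoggingProcess, the apply callback, and the use of in-memory fields in place of stable storage are assumptions for illustration.

```python
class LoggingProcess:
    """Receiver-based message logging combined with periodic checkpoints (sketch)."""

    def __init__(self, initial_state, apply):
        self.state = initial_state
        self.apply = apply               # apply(state, msg) -> new state (the application)
        self.checkpoint = initial_state  # stands in for stable storage
        self.log = []                    # messages received since the last checkpoint

    def take_checkpoint(self):
        self.checkpoint = self.state     # record the local state
        self.log = []                    # earlier messages are already reflected in it

    def receive(self, msg):
        self.log.append(msg)             # log first (receiver-based logging) ...
        self.state = self.apply(self.state, msg)  # ... then deliver to the application

    def recover(self):
        # Restore the most recent checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self.state = self.apply(self.state, msg)
```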

24 Replay of Messages and Orphan Processes
Incorrect replay of messages after recovery can lead to orphan processes; this should be avoided. An orphan process is a process that survives the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
Figure: Q receives a logged message M1 from P and an unlogged message M2, and sends M3 to R. After Q crashes and recovers, M1 is replayed, but M2 can never be replayed, and so neither will M3; the process that had already received M3 is left as an orphan.

25 Next Chapter: Distributed File Systems
Questions?