Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University


Reliable Group Communication

Reliable multicasting: a message that is sent to a process group should be delivered to each member of the group.

Assumptions for simplicity:
- An agreement exists on who is a member of the group
- Processes do not fail
- Processes do not join or leave the group while communication is going on

What does reliable multicasting mean when these assumptions do not hold? A message that is sent to a process group should be delivered to each current non-faulty member of the group.

Basic Reliable-Multicasting Schemes

[Figure: a sender with a history buffer multicasts message M25 to four receivers. Three receivers have Last = 24, accept M25, and return ACK 25; one receiver has Last = 23, detects that it missed message 24, and returns "Missed 24".]

A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.
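The scheme in the figure can be sketched in a few lines of Python. This is a minimal, illustrative sketch (class and method names are assumptions, not a real library API): the sender keeps sent messages in a history buffer, and a receiver that detects a gap in the sequence numbers asks for the missing messages before delivering.

```python
# Sketch of ACK/history-buffer reliable multicast (illustrative names).
class Sender:
    def __init__(self):
        self.history = {}     # seq -> message, kept until ACKed by everyone
        self.next_seq = 25

    def multicast(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        return seq, payload   # in a real system this goes on the network

    def retransmit(self, seq):
        return seq, self.history[seq]

class Receiver:
    def __init__(self, last):
        self.last = last      # highest sequence number delivered so far
        self.delivered = []

    def receive(self, seq, payload, sender):
        # Deliver in order; on a gap, request the missing messages first.
        while self.last + 1 < seq:
            missing = self.last + 1
            _, mpayload = sender.retransmit(missing)
            self.delivered.append(mpayload)
            self.last = missing
        self.delivered.append(payload)
        self.last = seq

s = Sender()
s.history[24] = "M24"          # M24 is still in the history buffer
r = Receiver(last=23)          # this receiver missed message 24
seq, payload = s.multicast("M25")
r.receive(seq, payload, s)
print(r.delivered)             # ['M24', 'M25']
print(r.last)                  # 25
```

The receiver with Last = 23 first pulls M24 from the sender's history buffer and only then delivers M25, exactly as in the figure.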

Scalability in Reliable Multicasting

Problem 1: the sender is flooded with ACK messages when there are many receivers (feedback implosion).
Solution: receivers return only a negative ACK (NACK) when they notice that they missed a multicast message.

Problem 2: with only negative ACKs, the sender has to keep every message in its history buffer forever (or at least for a long time).
Solution: use an expiration time on messages in the history buffer.

Nonhierarchical Feedback Control

Feedback suppression: the goal is to reduce the number of feedback messages returned to the sender.

SRM protocol:
- A process that notices a missing message multicasts its retransmission request (NACK) to the group, after waiting a random amount of time
- Receivers that miss the same message suppress their own feedback

[Figure: four receivers schedule NACK timers T = 3, T = 5, T = 1, and T = 4; the receiver whose timer fires first multicasts its NACK and the others cancel theirs.]

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others.
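The suppression idea can be shown with a small simulation. This is a sketch of the mechanism only, not the SRM protocol itself (the timer values are the ones from the figure; real SRM draws them at random and also scales them by distance from the sender):

```python
# Sketch of SRM-style NACK suppression: the earliest-scheduled NACK is
# multicast; everyone else sees it before their own timer fires and
# cancels their request.
import heapq

def simulate_nack_suppression(timers):
    """timers: dict receiver -> scheduled NACK time.
    Returns (who sends a NACK, who suppresses theirs)."""
    events = [(t, r) for r, t in timers.items()]
    heapq.heapify(events)
    _, first_receiver = heapq.heappop(events)   # smallest timer fires first
    sent = [first_receiver]
    suppressed = [r for _, r in events]         # the rest cancel
    return sent, suppressed

timers = {"R1": 3, "R2": 5, "R3": 1, "R4": 4}
sent, suppressed = simulate_nack_suppression(timers)
print(sent)                 # ['R3']
print(sorted(suppressed))   # ['R1', 'R2', 'R4']
```

With the figure's timers, only the receiver with T = 1 sends a NACK; the group hears that single multicast request instead of four.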

Hierarchical Feedback Control (1)

Essence: organize processes into subgroups and appoint a local coordinator to each subgroup. For simplicity, assume only one sender.
- Set up a tree in which the subgroup of the sender process is the root node
- The local coordinator is responsible for handling retransmission requests of receivers within its subgroup
- The local coordinator keeps a history buffer
- If the local coordinator itself misses a message, it asks the coordinator of its parent subgroup to retransmit the message

Hierarchical Feedback Control (2)

[Figure: a sender S whose subgroup forms the root of the tree; coordinators C of the child subgroups, each on its own LAN, serve the receivers R within their subgroup.]

The essence of hierarchical reliable multicasting.

Atomic Multicast

Goal: to achieve reliable multicasting in the presence of process failures.
- Guarantees that a message is delivered either to all processes or to none at all
- All messages must be delivered in the same order to all processes
- Some processes in the group may crash

In order to achieve reliable atomic multicasting, all the nonfaulty members must agree on the group membership; e.g. a crashed process is no longer a group member. When the process recovers, it is forced to join the group again; joining the group requires that the state of the process be brought up to date.

Receiving vs. Delivering Messages

The logical organization of a distributed system distinguishes between message receipt and message delivery:
- A message comes in from the network and is received by the communication layer (on top of the local OS)
- The message is buffered in the communication layer until it can be delivered to the application
- The message is delivered to the application

Message Ordering (1)

Four different orderings in multicast are distinguished:
1. Unordered (reliable) multicast
2. FIFO-ordered multicast
3. Causally-ordered multicast
4. Totally-ordered multicast

[Figure: P1 sends m1 and then m2; P2 receives m1 and then m2; P3 receives m2 and then m1.]

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering (2)

[Figure: P1 sends m1 and then m2; P4 sends m3 and then m4. P2 receives m1, m3, m2, m4; P3 receives m3, m1, m2, m4.]

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.

Message Ordering (3)

Six different versions of reliable multicasting:

Multicast                  Basic message ordering     Total-ordered delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes
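FIFO-ordered delivery, the building block of half the table above, can be sketched with per-sender sequence numbers and a hold-back queue. This is an illustrative sketch (class and field names are assumptions), not a complete multicast implementation:

```python
# Sketch of FIFO-ordered delivery: a receiver delivers messages from each
# sender strictly in that sender's sequence-number order, holding back
# anything that arrives early.
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)    # sender -> next expected seq
        self.holdback = defaultdict(dict)   # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, msg):
        self.holdback[sender][seq] = msg
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.holdback[sender]:
            s = self.next_seq[sender]
            self.delivered.append(self.holdback[sender].pop(s))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives out of order: held back
r.receive("P1", 0, "m1")   # now both can be delivered, in FIFO order
print(r.delivered)         # ['m1', 'm2']
```

Note that this constrains ordering per sender only; messages from different senders may still interleave arbitrarily, which is exactly the difference between FIFO and total ordering in the table.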

Virtual Synchrony (1)

Group view: the list of processes to which a multicast message is delivered (the delivery list), denoted G. Each process on that list should have the same group view.

A view change vc may occur (e.g. a process joins or leaves the group) during the transmission of a message m:
- The message m must be delivered to each nonfaulty process in G before the view change comes into effect
- Otherwise, the message m must not be delivered at all

Virtual Synchrony (2)

For example: a process multicasts a message m to a group of processes. Right after that, a process leaves or joins the group; another process notices this view change and multicasts a view change message (vc) to the group.
- Any message sent in view G must be delivered to each correct process before the view change message is delivered
- A reliable multicast with this property is said to be virtually synchronous
- In other words, a view change acts as a barrier across which no multicast can pass

Virtual Synchrony (3)

A message sent to view G can be delivered only to processes in G, and is discarded by successive views.

[Figure: processes P1-P4. P1 joins the group, giving view G = {P1, P2, P3, P4}, within which reliable multicasts are delivered. P3 crashes, giving view G = {P1, P2, P4}; later P3 rejoins.]

The principle of virtually synchronous multicast.

Virtual Synchrony: Examples

[Figure: four example runs with processes P, Q, and R, showing in which view G a multicast message is delivered when a view change vc overlaps the transmission.]

Implementing Virtual Synchrony (1)

The Isis system (a fault-tolerant distributed system):
- Reliable point-to-point communication facilities exist and the ordering is assumed to be FIFO (Can TCP provide reliable, FIFO-ordered point-to-point communication?)
- If a message m has been received by all members in G, m is said to be stable
- Only stable messages are allowed to be delivered; otherwise a message is kept in a buffer in the communication layer

Assume the current view is G_i and the next view G_{i+1} is to be installed, and that G_i and G_{i+1} differ by one process (without loss of generality).

Implementing Virtual Synchrony (2)

For example:
- The process that notices a view change (e.g. because a process crashes, or a process joins the group, possibly after recovery) sends a view change message to the other nonfaulty processes
- Any other process P notices the view change when it receives a view change message
- P first forwards all unstable messages in its buffer to every process in G_{i+1}, using reliable point-to-point communication
- Afterwards, it multicasts a flush message
- After P has received a flush message from every other process, it can safely install the new view
- It is also possible to elect a coordinator to forward all unstable messages

Implementing Virtual Synchrony (3)

[Figure: processes 0-7 exchanging unstable messages and flush messages during a view change.]
a) Process 4 notices that process 7 has crashed and sends a view change message
b) Process 6 sends out all its unstable messages, followed by a flush message
c) Process 6 installs the new view when it has received a flush message from everyone else

Distributed Commit

Essential issue: having an operation performed by each member of a process group, or by none at all; e.g. committing a transaction.

The distributed commit problem (a coordinator is present to initiate the commit):
- One-phase commit
- Two-phase commit
- Three-phase commit

Two-Phase Commit - 2PC (1)

Consider a distributed transaction involving a number of participating processes, each running on a different machine.

Phase 1a: The coordinator sends VOTE_REQUEST to the participants.
Phase 1b: When a participant receives VOTE_REQUEST, it returns either VOTE_COMMIT or VOTE_ABORT to the coordinator.
Phase 2a: The coordinator collects all votes; if all are VOTE_COMMIT, it sends GLOBAL_COMMIT to all participants; otherwise it sends GLOBAL_ABORT.
Phase 2b: Each participant waits for GLOBAL_COMMIT or GLOBAL_ABORT and acts accordingly.

2PC (2)

[Figure a: coordinator FSM: INIT --(Commit / send Vote-request)--> WAIT; WAIT --(Vote-abort / send Global-abort)--> ABORT; WAIT --(Vote-commit / send Global-commit)--> COMMIT.]
[Figure b: participant FSM: INIT --(Vote-request / send Vote-abort)--> ABORT or --(Vote-request / send Vote-commit)--> READY; READY --(Global-abort / ACK)--> ABORT; READY --(Global-commit / ACK)--> COMMIT.]

a) The finite state machine for the coordinator in 2PC.
b) The finite state machine for a participant.

2PC Failing Participant (1)

How does a failure affect the other participants? It depends on the state a participant is in:
- INIT: no problem
- READY: a participant P is waiting for either GLOBAL_COMMIT or GLOBAL_ABORT. If the coordinator crashes before its message reaches P, P cannot know what to do:
  1. It may block until the coordinator recovers
  2. It can ask another participant Q; the decision depends on which state Q is in:
     i. INIT: they can both abort
     ii. COMMIT: they can both commit
     iii. ABORT: they can both abort
     iv. READY: contact another participant. If all the participants contacted are in this state, they have to wait until the coordinator recovers (apparently the coordinator has failed)

2PC Failing Participant (2)

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

Actions taken by a participant P when residing in state READY and having contacted another participant Q.
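The decision table above reduces to a tiny lookup. The following is a sketch of that table only (names are illustrative), not a full 2PC implementation:

```python
# The READY-participant decision table from the slide, as a lookup.
ACTION_FOR_Q_STATE = {
    "COMMIT": "COMMIT",           # the global decision was commit
    "ABORT": "ABORT",             # the global decision was abort
    "INIT": "ABORT",              # Q never voted, so commit is impossible
    "READY": "CONTACT_ANOTHER",   # no information; try another participant
}

def ready_participant_action(q_state):
    """Action for a participant P in READY, given contacted Q's state."""
    return ACTION_FOR_Q_STATE[q_state]

print(ready_participant_action("INIT"))   # ABORT
print(ready_participant_action("READY"))  # CONTACT_ANOTHER
```

The INIT row is the interesting one: since Q has not even voted, the coordinator cannot have collected a unanimous VOTE_COMMIT, so aborting is safe.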

2PC - Steps Taken by the Coordinator

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}

2PC - Steps Taken by a Participant

write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received;  /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT
        write GLOBAL_COMMIT to local log;
    else if DECISION == GLOBAL_ABORT
        write GLOBAL_ABORT to local log;
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}

2PC - When a Participant is Asked for a Decision

actions for handling decision requests:  /* executed by a separate thread */

while true {
    wait until any incoming DECISION_REQUEST is received;  /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip;  /* participant remains blocked */
}

Steps taken for handling incoming decision requests.

2PC - Waiting for the Coordinator to Recover

All participants need to block until the coordinator recovers when:
- All participants have received and processed the VOTE_REQUEST from the coordinator (i.e. they are all in state READY), and in the meantime the coordinator has crashed
- In that case, the participants cannot cooperatively decide on the final action to take (COMMIT or ABORT)
- Assuming that not all participants can be contacted (perhaps they have crashed as well), an uncontacted participant may be in (or recover to) state INIT, ABORT, or COMMIT

This is why another protocol is needed to avoid blocking.

Three-Phase Commit - 3PC (1)

Avoids blocking processes in the presence of fail-stop crashes.

Phase 1a: The coordinator sends VOTE_REQUEST to the participants.
Phase 1b: When a participant receives VOTE_REQUEST, it returns either VOTE_COMMIT or VOTE_ABORT to the coordinator.
Phase 2a: The coordinator collects all votes; if all are VOTE_COMMIT, it sends PREPARE to all participants; otherwise it sends ABORT.
Phase 2b: Each participant waits for PREPARE or ABORT.

3PC (2)

Phase 3a (prepare to commit): The coordinator waits until all participants have acknowledged (READY_COMMIT) receipt of the PREPARE message, and then sends COMMIT to all.
Phase 3b (prepare to commit): Each participant waits for COMMIT.

The states of the coordinator and each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either the COMMIT or the ABORT state
2. There is no state in which it is not possible to make a final decision, and from which a transition to the COMMIT state can be made

3PC (3)

[Figure a: coordinator FSM: INIT --(Commit / send Vote-request)--> WAIT; WAIT --(Vote-abort / send Global-abort)--> ABORT; WAIT --(Vote-commit / send Prepare-commit)--> PRECOMMIT; PRECOMMIT --(Ready-commit / send Global-commit)--> COMMIT.]
[Figure b: participant FSM: INIT --(Vote-request / Vote-abort)--> ABORT or --(Vote-request / Vote-commit)--> READY; READY --(Global-abort / ACK)--> ABORT; READY --(Prepare-commit / Ready-commit)--> PRECOMMIT; PRECOMMIT --(Global-commit / ACK)--> COMMIT.]

a) Finite state machine for the coordinator in 3PC
b) Finite state machine for a participant

3PC Failing Participant (1)

The coordinator blocks in:
- WAIT: the coordinator sends GLOBAL_ABORT after a timeout
- PRECOMMIT: on a timeout, it concludes that one of the participants has crashed (and that participant is known to have voted COMMIT); it sends GLOBAL_COMMIT to the remaining participants

A participant P blocks in:
- INIT: abort on a timeout
- READY: on a timeout, P contacts another participant Q. If Q is still in INIT, they can safely abort (since no other participant can be in state PRECOMMIT)

3PC Failing Participant (2)

A participant P blocks (continued); in state READY, on a timeout P contacts the other participants:
1. If each of the participants P contacted is in state READY, the transaction should be aborted (an uncontacted process may still be in INIT). Even if one of the participants not contacted by P is in state PRECOMMIT, it is still safe to abort.
2. If all contacted processes are in state PRECOMMIT, the transaction can safely commit.
3. If a contacted process is in state ABORT (or COMMIT), then P moves to the corresponding state.

In state PRECOMMIT, a decision can be taken.

Recovery

Once a failure occurs, it is essential that the failing process be able to recover to a correct state.
- What does it actually mean to recover to a correct state?
- How can the state of a distributed system be recorded, and recovered to?

Methods:
- Checkpointing
- Message logging

Recovery: Background

Essence: when a failure occurs, we need to bring the system into an error-free state.
- Backward recovery: bring the system from its present erroneous state back into a previously correct state. From time to time, the system state (or at least part of it) must be recorded (checkpointing) on persistent storage.
- Forward recovery: instead of restoring a previously checkpointed state, find a correct new state from which the system can continue to execute.

In practice, backward error recovery is by and large the widely applied approach.

Forms of Recovery: Example

- Backward recovery: retransmitting a lost message
- Forward recovery: constructing the missing packets from the successfully delivered packets, using (n, k) block erasure codes

Forward recovery requires that the error types be known in advance, so that appropriate recovery mechanisms can be deployed. Backward recovery can be used as a general mechanism.
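The simplest (n, k) block erasure code makes the forward-recovery idea concrete: with n = 4 and k = 3, the fourth packet is the bitwise XOR of the three data packets, so any single lost packet can be reconstructed without a retransmission. This is a minimal sketch of the idea (function names are illustrative), not a production erasure code:

```python
# Sketch of a (4, 3) XOR parity erasure code for forward recovery.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """Append one XOR parity packet to k equal-length data packets."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def recover(received, lost_index):
    """Reconstruct the single missing packet by XOR-ing all the others."""
    present = [p for i, p in enumerate(received) if i != lost_index]
    missing = present[0]
    for p in present[1:]:
        missing = xor_bytes(missing, p)
    return missing

data = [b"abc", b"def", b"ghi"]
sent = encode(data)            # 4 packets go on the wire
print(recover(sent, 1))        # b'def' -- packet 1 was lost in transit
```

Real codes such as Reed-Solomon generalize this to tolerate the loss of any n - k packets, but the principle is the same: the receiver repairs the error on its own because the error type (packet erasure) was anticipated.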

Backward Recovery: Problems

- Restoring a previous state is a costly operation, and saving the system state is not for free either
- Loops of recovery: there is no guarantee that the same (or a similar) failure does not happen again
- Rolling back is not always possible: think of an ATM that has mistakenly handed out $1000, or a UNIX command like /bin/rm -fr *

Recovery: Stable Storage

[Figure: two mirrored drives, each holding blocks a-h; updates go to drive 1 first and then to drive 2, and recovery compares the two drives.]
a) Stable storage
b) Crash after drive 1 is updated
c) Bad spot (as a result of general wear and tear)

Checkpointing

In a fault-tolerant distributed system, backward error recovery requires that the system regularly save its (global) state onto stable storage.
- A consistent global state can be captured using a distributed snapshot algorithm
- A recovery line corresponds to the most recent distributed snapshot

[Figure: processes P1 and P2 take checkpoints over time, starting from the initial state. A consistent cut through checkpoints forms the recovery line; after a failure, rolling back to an inconsistent cut is not allowed.]

Independent Checkpointing

- Processes save their local states independently
- On a crash, each process rolls back to its most recently saved state
- If these local states jointly do not form a consistent cut, the processes have to roll back further, to earlier checkpoints: the domino effect

Coordinated Checkpointing

Essence: all processes synchronize to jointly write their state to local stable storage. The saved state is then automatically globally consistent.

Two-phase blocking protocol:
- A coordinator first multicasts a CHECKPOINT_REQUEST message to all processes
- A receiving process takes a local checkpoint, stops sending regular messages (it queues them and blocks), and tells the coordinator that it has taken the checkpoint (ACK)
- When the coordinator has received an ACK from all processes, it multicasts a CHECKPOINT_DONE message to allow the blocked processes to continue

Question: what could happen if a process did not stop sending regular messages after saving its local state?
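The two-phase blocking protocol above can be sketched in a single-threaded simulation. Message passing is replaced with direct method calls, and all names are illustrative assumptions, so this shows only the control flow, not a distributed implementation:

```python
# Sketch of the two-phase blocking checkpoint protocol (simulated).
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.checkpoint = None
        self.blocked = False
        self.outbox = []               # regular messages queued while blocked

    def on_checkpoint_request(self):
        self.checkpoint = dict(self.state)  # save state to "stable storage"
        self.blocked = True                 # stop sending regular messages
        return "ACK"

    def on_checkpoint_done(self):
        self.blocked = False                # resume; flush queued messages
        sent, self.outbox = self.outbox, []
        return sent

def coordinated_checkpoint(processes):
    # Phase 1: CHECKPOINT_REQUEST; wait for an ACK from every process.
    acks = [p.on_checkpoint_request() for p in processes]
    assert all(a == "ACK" for a in acks)
    # Phase 2: CHECKPOINT_DONE; blocked processes may continue.
    for p in processes:
        p.on_checkpoint_done()

procs = [Process("P1", {"x": 1}), Process("P2", {"y": 2})]
coordinated_checkpoint(procs)
print([p.checkpoint for p in procs])   # [{'x': 1}, {'y': 2}]
print(any(p.blocked for p in procs))   # False
```

Blocking between the two phases is what answers the slide's question: if a process kept sending regular messages after checkpointing, a peer could record the receipt of a message whose send is not in any saved state, making the joint checkpoint inconsistent.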

Message Logging (1)

Checkpointing is an expensive operation. Message logging reduces the number of checkpoints needed while still enabling recovery; message logging and checkpointing are used together.

Idea: if the transmission of messages can be replayed, we can still reach a globally consistent state, taking a checkpointed state as the starting point.

Piecewise deterministic model: the execution of each process is considered to take place as a sequence of intervals in which events occur.
- Each interval starts with a nondeterministic event (e.g. the receipt of a message)
- Execution within an interval is completely deterministic

Message Logging (2)

Conclusion (piecewise deterministic model): if we record the nondeterministic events (to replay them later), we obtain a deterministic execution model that allows a complete replay.

Problem: when should we actually log messages?
Issue: avoid orphan processes. An orphan process is a process that survives the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
Goal: devise message logging schemes in which orphans do not occur.

Orphan Process: Example

Process Q has just received m1 and m2 and subsequently delivered m3 before it crashes. Assume that m2 is not logged. When Q crashes and subsequently recovers, only m1 is going to be replayed; m2 certainly is not, and probably m3 is not either.

[Figure: P sends logged message m1 and unlogged message m2 to Q; Q sends m3 to R, then crashes and recovers, replaying only m1.]

Incorrect replay of messages after recovery, leading to an orphan process. (Question: which one is the orphan process here?)

Message Logging Schemes (1)

HDR(m): the header of message m, containing its source, destination, sequence number, and delivery number. The header contains all the information needed for resending the message and delivering it in the correct order.

A message is stable if it can no longer be lost (e.g. it has been written to stable storage, along with its header).

DEP(m): the set of processes to which message m has been delivered. It also includes the processes to which another message m', causally dependent on m, has been delivered.

COPY(m): the set of processes that have a copy of the message (and its header), but not (yet) in their local stable storage.

Message Logging Schemes (2)

The processes in COPY(m) can hand over m. If all processes in this set crash, the retransmission of m is not possible.

Using this notation: a process Q is an orphan if there is a message m such that Q is contained in DEP(m) while at the same time every process in COPY(m) has crashed. There is then no way to replay the transmission of m.

To avoid orphan processes, we can enforce that DEP(m) ⊆ COPY(m). In other words, whenever a process becomes dependent on the delivery of m, it always keeps a copy of m (i.e. the message along with its header).
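The orphan condition above is a plain set computation, which a short sketch makes explicit (the function and variable names are illustrative):

```python
# Sketch of the orphan condition: Q is an orphan for message m if
# Q is in DEP(m) but every process in COPY(m) has crashed.
def orphans(dep, copy, crashed):
    """Surviving processes in DEP(m) that can no longer have m replayed."""
    if copy <= crashed:            # every holder of m has crashed
        return dep - crashed       # the surviving dependents are orphans
    return set()                   # someone can still retransmit m

DEP = {"Q", "R"}                   # Q and R depend on delivery of m
COPY = {"P"}                       # only P ever held a copy of m
print(orphans(DEP, COPY, crashed={"P"}))         # {'Q', 'R'}

# Enforcing DEP(m) ⊆ COPY(m) prevents orphans: the dependents themselves
# hold copies, so m can always be replayed.
print(orphans(DEP, DEP | COPY, crashed={"P"}))   # set()
```

The second call shows the invariant at work: once every dependent also holds a copy, the crash of P alone can no longer strand anyone in DEP(m).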

Message Logging Schemes (3)

Pessimistic logging protocol:
- For each unstable message m, there is at most one process dependent on m, that is |DEP(m)| ≤ 1. In other words, the protocol ensures that each unstable message m is delivered to at most one process.
- A process P that receives m thereby also becomes a member of COPY(m); P is forced to write m to stable storage before sending a message to another process.
- If P crashes before it logs m, there is no problem, since no other process has become dependent on the delivery of m.

Optimistic logging protocol:
- If every process in COPY(m) has crashed, any orphan process in DEP(m) is rolled back to a state in which it no longer belongs to DEP(m).