Fault Tolerance: Basic Concepts, Process Resilience, Reliable Client-Server Communication, Reliable Group Communication, Distributed Commit


Lionel Hamilton
5 years ago

1 Fault Tolerance
o Basic Concepts
o Process Resilience
o Reliable Client-Server Communication
o Reliable Group Communication
o Distributed Commit

2 Distributed Commit
o A more general form of the atomic multicast problem
o Definition: an operation is performed by every member of a process group, or by none at all
  - In reliable multicast, the operation is the delivery of a message
o One-phase commit protocol
  - The coordinator tells the participants whether to commit or abort
  - [Figure: state diagram of participants; recoverable labels: Commit, ABORT]
  - Drawbacks?

3 2PC
o Model: the client that initiated the computation acts as coordinator; the processes required to commit are the participants
o Phase 1a: The coordinator sends vote-request to the participants (also called a pre-write)
o Phase 1b: When a participant receives vote-request, it returns either vote-commit or vote-abort to the coordinator. If it sends vote-abort, it aborts its local computation
o Phase 2a: The coordinator collects all votes; if all are vote-commit, it sends global-commit to all participants, otherwise it sends global-abort
o Phase 2b: Each participant waits for global-commit or global-abort and acts accordingly
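The four phases above can be sketched as a single-process simulation. This is a minimal illustration, not a networked implementation: the `Participant` class, its `can_commit` flag, and the state names INIT, READY, COMMIT, and ABORT are assumptions introduced here (the state names follow the usual textbook convention), and message passing is replaced by direct method calls.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "vote-commit"
    ABORT = "vote-abort"


class Decision(Enum):
    COMMIT = "global-commit"
    ABORT = "global-abort"


class Participant:
    """Hypothetical in-memory participant; `can_commit` stands in for local checks."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        # Phase 1b: reply with a vote; a vote-abort also aborts locally at once.
        if self.can_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def on_decision(self, decision):
        # Phase 2b: apply the coordinator's global decision.
        self.state = "COMMIT" if decision is Decision.COMMIT else "ABORT"


def run_two_phase_commit(participants):
    """Play the coordinator role for one commit round and return its decision."""
    # Phase 1a: send vote-request to every participant and collect the replies.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2a: global-commit only if every single vote is vote-commit.
    if all(v is Vote.COMMIT for v in votes):
        decision = Decision.COMMIT
    else:
        decision = Decision.ABORT
    for p in participants:
        p.on_decision(decision)
    return decision
```

Calling `run_two_phase_commit` with one participant whose `can_commit` is false yields `Decision.ABORT` and leaves every participant in ABORT. The sketch deliberately ignores crashes and timeouts, which the following slides address.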

4 2PC: timeline [Figure: 2PC state diagram; recoverable labels: Vote-commit, Global-commit, Vote-abort, Global-abort, ABORT]

5 2PC: Recovery from Crash
o What happens in case of a crash? How do we detect a crash?
  - Participant: if it times out in INIT, then abort.
  - Coordinator: if it times out in WAIT, then abort.
  - Participant: if it times out in READY, then it needs to find out whether the transaction was globally committed or aborted.
    - Just wait for the coordinator to recover.
    - Check with others (see slide 7).
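The timeout rules on this slide amount to a small lookup table. The sketch below assumes the usual textbook state names (INIT and READY for a participant, WAIT for the coordinator), which the slide text does not preserve; the function name and the action strings are hypothetical.

```python
# (role, state) -> action taken when a timeout fires while waiting in that state.
TIMEOUT_ACTIONS = {
    # No vote-request arrived: nothing has been promised, so abort locally.
    ("participant", "INIT"): "abort",
    # Not all votes arrived: the coordinator aborts.
    ("coordinator", "WAIT"): "abort",
    # Already voted commit: the outcome is unknown, so the participant
    # must wait for the coordinator or ask the other participants.
    ("participant", "READY"): "ask-others",
}


def on_timeout(role, state):
    """Return the action for a process that timed out in the given state."""
    try:
        return TIMEOUT_ACTIONS[(role, state)]
    except KeyError:
        raise ValueError(f"no timeout rule for {role} in state {state}") from None
```

Only the READY case cannot be resolved locally, which is why it needs the cooperative check described on slide 7.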

6 2PC: timeline [Figure: 2PC state diagram; recoverable labels: Vote-commit, Global-commit, Vote-abort, Global-abort, ABORT]

7 2PC: Recovery from Crash (Wait States)
o If blocked in READY, the participant P checks with others (here, participant Q):
  - If Q is in COMMIT, then COMMIT.
  - If Q is in ABORT, then ABORT.
  - If Q is in INIT, then P can safely ABORT.
  - If all are in READY, nothing can be done.
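The cooperative check can be written as one function over the states reported by the other participants. This is a sketch under the assumption that the states carry the usual textbook names INIT, READY, COMMIT, and ABORT; the function name and the use of `None` to signal "still blocked" are choices made here for illustration.

```python
def decide_from_peers(peer_states):
    """Decide for a participant blocked in READY, given its peers' states.

    Returns "COMMIT" or "ABORT", or None if no decision can be reached.
    """
    if "COMMIT" in peer_states:
        # Some peer already saw global-commit, so the decision was commit.
        return "COMMIT"
    if "ABORT" in peer_states:
        # Some peer already aborted, so the decision was (or will be) abort.
        return "ABORT"
    if "INIT" in peer_states:
        # That peer has not voted yet, so the coordinator cannot have
        # collected all votes and cannot have decided to commit.
        return "ABORT"
    # Every reachable peer is in READY: nobody knows the outcome.
    return None
```

For example, `decide_from_peers(["READY", "INIT"])` returns `"ABORT"`, while `decide_from_peers(["READY", "READY"])` returns `None`, the blocked case that motivates three-phase commit.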

8 2PC: timeline [Figure: 2PC state diagram including participant Q; recoverable labels: Vote-commit, Global-commit, ABORT]

9 2PC: Recovery from Crash (Wait States)
o Participant crashed during
  - INIT: after recovery, locally abort, then inform the coordinator
  - COMMIT or ABORT: after recovery, restore the original state (COMMIT/ABORT) and retransmit the decision to the coordinator
  - READY: analogous to the previous slide

10 2PC: timeline [Figure: 2PC state diagram; recoverable labels: Vote-abort, Global-abort, Vote-commit, Global-commit, ABORT]

11 2PC: timeline [Figure: 2PC state diagram with participant Q in an unknown state; recoverable labels: Vote-commit, ABORT]

12 2PC: Blocking Commit Protocol
o Consider the previous participant blocking example
  - All participants are in state READY
  - This means all participants received and processed the VOTE_REQUEST from the coordinator, but the coordinator crashed
  - Participants cannot cooperatively decide on the final action
  - Participants may be blocked until the coordinator recovers
o Solution: three-phase commit protocol

13 3PC: timeline [Figure: 3PC state diagram; recoverable labels: Vote-commit, Prepare-commit, Ready-commit, Global-commit, PRECOMMIT]

14 3PC: Three-Phase Commit
o Blocking can happen at
  - Coordinator: WAIT and PRECOMMIT (PRECOMMIT is the difference from the 2PC coordinator block)
  - Participant: READY and PRECOMMIT (PRECOMMIT is the difference from the 2PC participant block)

15 3PC: timeline [Figure: 3PC state diagram; recoverable labels: Vote-commit, Prepare-commit, Ready-commit, Global-commit, PRECOMMIT]

16 3PC: timeline [Figure: 3PC state diagram; recoverable labels: Vote-commit, Prepare-commit, Ready-commit, Global-commit, PRECOMMIT]

17 3PC: timeline [Figure: 3PC state diagram with participant Q in an unknown state; recoverable labels: Vote-commit, Prepare-commit, Ready-commit, Global-commit, PRECOMMIT, ABORT]

18 3PC: Three-Phase Commit
o Not often applied in practice, because the conditions under which 2PC blocks rarely occur
o The states of the coordinator and each participant satisfy the following two conditions:
  1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
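Condition 1 can be checked mechanically on a transition table. The sketch below assumes the usual textbook state machines for 2PC and 3PC (the slide figures are not recoverable from this text, so the four tables are assumptions), and it checks only condition 1; condition 2 depends on what a process can infer at run time and is not modeled here.

```python
def states_violating_condition_1(transitions):
    """Return the states with direct transitions to both COMMIT and ABORT."""
    return sorted(
        state
        for state, targets in transitions.items()
        if {"COMMIT", "ABORT"} <= set(targets)
    )


# Assumed textbook transition tables: state -> directly reachable states.
TWO_PC_COORDINATOR = {"INIT": ["WAIT"], "WAIT": ["COMMIT", "ABORT"]}
TWO_PC_PARTICIPANT = {"INIT": ["READY", "ABORT"], "READY": ["COMMIT", "ABORT"]}
THREE_PC_COORDINATOR = {
    "INIT": ["WAIT"],
    "WAIT": ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],
}
THREE_PC_PARTICIPANT = {
    "INIT": ["READY", "ABORT"],
    "READY": ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],
}
```

Under these assumed tables, the 2PC coordinator violates condition 1 in WAIT and the 2PC participant violates it in READY, while both 3PC tables pass: the extra PRECOMMIT state separates the path to COMMIT from the path to ABORT.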
