Distributed Systems Principles and Paradigms


Distributed Systems Principles and Paradigms
Chapter 08: Fault Tolerance (version October 5, 2007)
Maarten van Steen
Vrije Universiteit Amsterdam, Faculty of Science, Dept. Mathematics and Computer Science
Room R4.20. Tel: (020) URL: steen/

Contents:
01 Introduction
02 Architectures
03 Processes
04 Communication
05 Naming
06 Synchronization
07 Consistency and Replication
08 Fault Tolerance
09 Security
10 Distributed Object-Based Systems
11 Distributed File Systems
12 Distributed Web-Based Systems
13 Distributed Coordination-Based Systems

Chapter outline:
- Introduction: basic concepts
- Process resilience
- Reliable client-server communication
- Reliable group communication
- Distributed commit
- Recovery

Dependability

Basics: A component provides services to clients. To provide those services, the component may in turn require services from other components; that is, a component may depend on some other component. Specifically: a component C depends on C* if the correctness of C's behavior depends on the correctness of C*'s behavior.

Properties of dependability:
- Availability: readiness for usage
- Reliability: continuity of service delivery
- Safety: very low probability of catastrophes
- Maintainability: how easily a failed system can be repaired

Note: In distributed systems, components can be either processes or channels.

Terminology

- Failure: occurs when a component is not living up to its specifications
- Error: the part of a component's state that can lead to a failure
- Fault: the cause of an error

- Fault prevention: prevent the occurrence of a fault
- Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults (i.e., mask the presence of faults)
- Fault removal: reduce the presence, number, and seriousness of faults
- Fault forecasting: estimate the present number, future incidence, and consequences of faults

Failure Models

- Crash failure: a component simply halts, but behaves correctly before halting
- Omission failure: a component fails to respond
- Timing failure: the output of a component is correct, but lies outside a specified real-time interval (performance failure: too slow)
- Response failure: the output of a component is incorrect (but at least the error cannot be attributed to another component)
  - Value failure: the wrong value is produced
  - State transition failure: execution of the component's service brings it into a wrong state
- Arbitrary failure: a component may produce arbitrary output and be subject to arbitrary timing failures

Observation: Crash failures are the least severe; arbitrary failures are the worst.

Crash Failures

Problem: Clients cannot distinguish between a crashed component and one that is just a bit slow. Consider a server from which a client is expecting output:
- Is the server perhaps exhibiting timing or omission failures?
- Is the channel between client and server faulty (crashed, or exhibiting timing or omission failures)?

- Fail-silent: the component exhibits omission or crash failures; clients cannot tell what went wrong
- Fail-stop: the component exhibits crash failures, but its failure can be detected (either through announcement or timeouts)
- Fail-safe: the component exhibits arbitrary but benign failures (they cannot do any harm)

Process Resilience

Basic issue: Protect yourself against faulty processes by replicating and distributing computations in a group.

- Flat groups: good for fault tolerance, as information exchange immediately occurs with all group members; however, they may impose more overhead, as control is completely distributed (hard to implement).
- Hierarchical groups: all communication runs through a single coordinator; not really fault tolerant or scalable, but relatively easy to implement.

(Figure: (a) a flat group; (b) a hierarchical group with a coordinator and workers.)

Groups and Failure Masking (1/4)

Terminology: When a group can mask any k concurrent member failures, it is said to be k-fault tolerant (k is called the degree of fault tolerance).

Problem: How large does a k-fault tolerant group need to be? (See the sketch below.)
- Assuming crash/performance failure semantics, a total of k+1 members is needed to survive k member failures.
- Assuming arbitrary failure semantics, with group output defined by voting, a total of 2k+1 members is needed to survive k member failures.

Assumption: All members are identical and process all input in the same order; only then can we be sure that they all do exactly the same thing.
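These group sizes follow directly from the failure semantics. A minimal sketch in Python (the function and semantics labels are illustrative, not from the slides; the 3k+1 case anticipates the agreement result on the next slide):

```python
def group_size(k: int, semantics: str) -> int:
    """Minimum group size needed to mask k concurrent member failures."""
    if semantics == "crash":
        return k + 1            # one surviving correct member suffices
    if semantics == "arbitrary":
        return 2 * k + 1        # k faulty voters must be outvoted by k+1 correct ones
    if semantics == "byzantine-agreement":
        return 3 * k + 1        # reaching agreement itself (next slide)
    raise ValueError(f"unknown failure semantics: {semantics}")
```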

Groups and Failure Masking (2/4)

Assumption: Group members are not identical, i.e., we have a distributed computation.

Problem: Nonfaulty group members should reach agreement on the same value.

(Figure: (a) process 2 tells different things to different processes; (b) process 3 passes on a different value than the one it received.)

Observation: Assuming arbitrary failure semantics, we need 3k+1 group members to survive the attacks of k faulty members. This is also known as the problem of Byzantine failures. Essence: we are trying to reach a majority vote among the group of loyalists in the presence of k traitors, so we need 2k+1 loyalists.

Groups and Failure Masking (3/4)

(Figure: the Byzantine agreement algorithm for four processes, one of which is faulty: (a) what each process sends to the others; (b) the vectors each process receives in the first round, e.g. Got(1, 2, x, 4); (c) the vectors each process receives from the others in the second round, from which the correct processes derive the majority value per position.)

Groups and Failure Masking (4/4)

Issue: What are the necessary conditions for reaching agreement?

Circumstances under which distributed agreement can be reached (X = possible):

                               Synchronous processes      Asynchronous processes
                               Bounded    Unbounded       Bounded    Unbounded
  Unordered    Unicast            X
               Multicast          X
  Ordered      Unicast            X          X
               Multicast          X          X                X          X

- Process behavior: synchronous processes operate in lockstep
- Communication delay: are delays on communication bounded?
- Message ordering: are messages delivered in the order they were sent?
- Message transmission: are messages sent one-by-one (unicast), or multicast?

Failure Detection

Essence: We detect failures through timeout mechanisms.
- Setting timeouts properly is very difficult and application dependent
- You cannot distinguish process failures from network failures

We need to consider failure notification throughout the system:
- Gossiping (i.e., proactively disseminate a failure detection)
- On failure detection, pretend you failed as well
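A minimal timeout-based failure detector along these lines (the class name, API, and timeout value are assumptions for illustration; the gossiping dissemination step is omitted). Note how it can only *suspect* a process: a missed heartbeat may equally be a slow process or a failed network.

```python
import time

class FailureDetector:
    """Suspects a process when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout
        self.last_heard: dict[str, float] = {}  # process id -> last heartbeat time

    def heartbeat(self, pid: str) -> None:
        # Called whenever any message (or explicit heartbeat) arrives from pid.
        self.last_heard[pid] = time.monotonic()

    def suspected(self, pid: str) -> bool:
        # A timeout cannot distinguish a crashed process from a slow one
        # or from a broken channel; this is merely a suspicion.
        last = self.last_heard.get(pid)
        return last is None or time.monotonic() - last > self.timeout
```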

Reliable Communication

So far we have concentrated on process resilience (by means of process groups). What about reliable communication channels?

Error detection:
- Framing of packets to allow for bit error detection
- Use of frame numbering to detect packet loss

Error correction:
- Add enough redundancy that corrupted packets can be automatically corrected
- Request retransmission of lost packets, or of the last N packets

Observation: Most of this work assumes point-to-point communication.

Reliable RPC (1/3)

What can go wrong?
1. Client cannot locate server
2. Client request is lost
3. Server crashes
4. Server response is lost
5. Client crashes

[1] Relatively simple: just report back to the client.
[2] Just resend the message.

Reliable RPC (2/3)

[3] Server crashes are harder, because you don't know what the server had already done:

(Figure: three cases: (a) the normal case: receive request, execute, reply; (b) the server crashes after executing but before replying; (c) the server crashes before executing. In (b) and (c) the client sees no reply and cannot tell the cases apart.)

Problem: We need to decide what we expect from the server:
- At-least-once semantics: the server guarantees it will carry out an operation at least once, no matter what.
- At-most-once semantics: the server guarantees it will carry out an operation at most once.

Reliable RPC (3/3)

[4] Detecting lost replies can be hard, because it may also be that the server has crashed: you don't know whether the server has carried out the operation. Solution: none, except that you can try to make your operations idempotent: repeatable without any harm done if they happen to have been carried out before.

[5] Problem: the server is doing work and holding resources for nothing (called an orphan computation). Options:
- The orphan is killed (or rolled back) by the client when it reboots
- Broadcast a new epoch number when recovering; servers kill orphans from older epochs
- Require computations to complete within T time units; old ones are simply removed

Question: What is the rolling back for?
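One common way to approximate at-most-once semantics under retransmission is to tag each request with a unique id and deduplicate on the server. A minimal sketch, assuming a hypothetical lossy transport that raises TimeoutError (none of these names are the book's API):

```python
import uuid

class Server:
    def __init__(self):
        self.completed: dict[str, object] = {}   # request-id -> cached reply

    def handle(self, request_id: str, op, *args):
        # A retransmission carries the same id, so `op` is executed at most
        # once; retries are answered from the reply cache instead.
        if request_id not in self.completed:
            self.completed[request_id] = op(*args)
        return self.completed[request_id]

def call(transport, op, *args, retries: int = 3):
    request_id = str(uuid.uuid4())                # identical on every retry
    for _ in range(retries):
        try:
            return transport(request_id, op, *args)
        except TimeoutError:
            continue                              # safe: the server deduplicates
    raise RuntimeError("no reply; server may have crashed")
```

Deduplication makes any operation behave idempotently from the client's point of view, at the cost of the server keeping reply state.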

Reliable Multicasting (1/2)

Basic model: We have a multicast channel c with two (possibly overlapping) groups:
- The sender group SND(c) of processes that submit messages to channel c
- The receiver group RCV(c) of processes that can receive messages from channel c

Simple reliability: If process P ∈ RCV(c) at the time message m was submitted to c, and P does not leave RCV(c), then m should be delivered to P.

Atomic multicast: How can we ensure that a message m submitted to channel c is delivered to a process P ∈ RCV(c) only if m is delivered to all members of RCV(c)?

Reliable Multicasting (2/2)

Observation: If we can stick to a local-area network, reliable multicasting is easy.

Principle: Let the sender log the messages it submits to channel c:
- If P sends message m, m is stored in a history buffer
- Each receiver acknowledges receipt of m, or requests retransmission from P when it notices a message was lost
- Sender P removes m from the history buffer when everyone has acknowledged receipt

Question: Why doesn't this scale?
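A sketch of the sender side of this history-buffer scheme (the receiver set and the `multicast`/`unicast` helpers are illustrative assumptions; the slides only state the principle):

```python
class ReliableMulticastSender:
    def __init__(self, receivers: set[str], multicast, unicast):
        self.receivers = receivers
        self.multicast, self.unicast = multicast, unicast
        self.history: dict[int, bytes] = {}     # seqno -> message
        self.acked: dict[int, set[str]] = {}    # seqno -> who acknowledged

        self.next_seq = 0

    def send(self, msg: bytes) -> None:
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = msg                 # keep until everyone has acked
        self.acked[seq] = set()
        self.multicast(seq, msg)

    def on_ack(self, seq: int, receiver: str) -> None:
        self.acked[seq].add(receiver)
        if self.acked[seq] == self.receivers:   # all receipts confirmed
            del self.history[seq]               # purge from the history buffer
            del self.acked[seq]

    def on_retransmit_request(self, seq: int, receiver: str) -> None:
        self.unicast(receiver, seq, self.history[seq])
```

Note that the sender keeps per-receiver state and handles an ACK from every receiver for every message; this ACK implosion is exactly why the scheme does not scale to large groups.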

Scalable Reliable Multicasting: Feedback Suppression

Basic idea: Let a process P suppress its own feedback when it notices that another process Q is already asking for a retransmission.

Assumptions:
- All receivers listen to a common feedback channel to which feedback messages are submitted
- Process P schedules its own feedback message randomly, and suppresses it when it observes another feedback message for the same data

Question: Why is the random schedule so important?

(Figure: the sender receives only one NACK; receivers that had scheduled their NACK for a later time (T = 1..4) suppress their own feedback once the first NACK appears on the network.)

Scalable Reliable Multicasting: Hierarchical Solutions

Basic solution: Construct a hierarchical feedback channel in which all submitted messages are sent only to the root. Intermediate nodes aggregate feedback messages before passing them on.

(Figure: a tree of local-area networks, each with a local coordinator C; the sender S sits at the root, and coordinators are linked by (long-haul) connections.)

Question: What is the main problem with this solution?

Observation: Intermediate nodes can easily be used for retransmission purposes.
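Returning to feedback suppression: a sketch of the randomized NACK timer at a receiver, in the style of SRM (the timer range and helper names are assumptions for illustration):

```python
import random

class SuppressingReceiver:
    def __init__(self, send_nack):
        self.send_nack = send_nack
        self.pending: dict[int, float] = {}  # missing seqno -> scheduled NACK time

    def detect_loss(self, seq: int, now: float) -> None:
        # Schedule the NACK at a *random* offset; if every receiver used the
        # same delay, all NACKs would fire at once and implode at the sender.
        self.pending[seq] = now + random.uniform(0.0, 4.0)

    def on_feedback_channel(self, seq: int) -> None:
        # Someone else already asked for a retransmission of `seq`:
        self.pending.pop(seq, None)          # suppress our own NACK

    def tick(self, now: float) -> None:
        for seq, due in list(self.pending.items()):
            if now >= due:
                self.send_nack(seq)          # our timer expired first
                del self.pending[seq]
```

The random schedule matters because with identical timers all receivers would NACK simultaneously, defeating the suppression entirely.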

Atomic Multicast

Idea: Formulate reliable multicasting in the presence of process failures in terms of process groups and changes to group membership.

(Figure: reliable multicast implemented by multiple point-to-point messages. P1 joins the group, giving G = {P1,P2,P3,P4}; P3 crashes and its partial multicast is discarded, giving G = {P1,P2,P4}; P3 later rejoins, giving G = {P1,P2,P3,P4} again.)

Guarantee: A message is delivered only to the nonfaulty members of the current group. All members must agree on the current group membership.

Keyword: Virtually synchronous multicast.

Virtual Synchrony (1/2)

Essence: We consider views V ⊆ RCV(c) ∪ SND(c). Processes are added to or deleted from a view V through view changes to V*; a view change is to be executed locally by each P ∈ V ∩ V*.

(1) For each consistent state, there is a unique view on which all its members agree. Note: this implies that all nonfaulty processes see all view changes in the same order.

(2) If message m is sent to V before a view change vc to V*, then either all P ∈ V that execute vc receive m, or no process P ∈ V that executes vc receives m. Note: all nonfaulty members in the same view get to see the same set of multicast messages.

(3) A message sent to view V can be delivered only to processes in V, and is discarded by successive views.

A reliable multicast algorithm satisfying (1)-(3) is called virtually synchronous.

Virtual Synchrony (2/2)

- A sender to a view V need not be a member of V.
- If a sender S ∈ V crashes, its multicast message m is flushed before S is removed from V: m will never be delivered after the point that S leaves V. Note: messages from S may still be delivered to all, or to none, of the (nonfaulty) processes in V before they all agree on a new view to which S does not belong.
- If a receiver P fails, a message m may be lost, but it can be recovered, as we know exactly what has been received in V. Alternatively, we may decide to deliver m to the members in V - {P}.

Observation: Virtually synchronous behavior can be seen independently of the ordering of message delivery. The only issue is that messages are delivered to an agreed-upon group of receivers.

Virtual Synchrony Implementation (1/3)

- The current view is known at each P by means of a delivery list dest[P]; if P ∈ dest[Q] then Q ∈ dest[P].
- Messages received by P are queued in queue[P].
- If P fails, the group view must change, but not before all messages from P have been flushed.
- Each P attaches a (stepwise increasing) timestamp to each message it sends.
- Assume FIFO-ordered delivery; the highest-numbered message from Q that has been received by P is recorded in rcvd[P][Q].
- The vector rcvd[P][] is sent (as a control message) to all members in dest[P]; each P records rcvd[Q][] in remote[P][Q].

Virtual Synchrony Implementation (2/3)

Observation: remote[P][Q] shows what P knows about message arrival at Q.

- A message is stable if it has been received by all Q ∈ dest[P] (computed as the componentwise minimum over the rows of remote[P], the min vector).
- Stable messages can be delivered to the next layer (which may deal with ordering). Note: causal message delivery comes for free.
- As soon as all messages from a faulty process have been flushed, that process can be removed from the (local) views.

Virtual Synchrony Implementation (3/3)

Remains: What if a sender P failed and not all its messages made it to the nonfaulty members of the current view?

Solution: Select a coordinator that has all (unstable) messages from P, and let it forward those to the other group members.

Note: Member failure is assumed to be detected and subsequently multicast to the current view as a view change. That view change will not be carried out before all messages in the current view have been delivered.
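A sketch of the stability test from implementation step (2/3). Here remote[p][q][s] holds the highest sequence number that p knows q has received from sender s; all names are illustrative:

```python
def stable_upto(remote: dict, p: str, sender: str) -> int:
    """A message with sequence number n from `sender` is stable at p iff
    every member of p's view is known to have received it, i.e. iff
    n <= min over all q of remote[p][q][sender] (the "min vector")."""
    return min(row.get(sender, -1) for row in remote[p].values())

# Example: p knows q1 has messages 0..5 from s and q2 has 0..3 from s,
# so messages 0..3 are stable and may be handed to the ordering layer.
remote = {"p": {"p": {"s": 5}, "q1": {"s": 5}, "q2": {"s": 3}}}
assert stable_upto(remote, "p", "s") == 3
```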

Distributed Commit

- Two-phase commit
- Three-phase commit

Essential issue: Given a computation distributed across a process group, how can we ensure that either all processes commit to the final result, or none of them do (atomicity)?

Two-Phase Commit (1/2)

Model: The client that initiated the computation acts as coordinator; the processes required to commit are the participants.

- Phase 1a: The coordinator sends vote-request to the participants (also called a pre-write).
- Phase 1b: When a participant receives vote-request, it returns either vote-commit or vote-abort to the coordinator. If it sends vote-abort, it aborts its local computation.
- Phase 2a: The coordinator collects all votes; if all are vote-commit, it sends global-commit to all participants, otherwise it sends global-abort.
- Phase 2b: Each participant waits for global-commit or global-abort and handles the message accordingly.
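To make the message flow concrete, a minimal sketch of the coordinator side (the participant objects and their send/recv methods are hypothetical; a real coordinator must also handle timeouts and write its decision to a log before sending it):

```python
def two_phase_commit(participants) -> str:
    # Phase 1a: ask everyone to vote.
    for p in participants:
        p.send("vote-request")

    # Phase 1b/2a: collect votes; a single vote-abort aborts the transaction.
    votes = [p.recv() for p in participants]
    decision = ("global-commit"
                if all(v == "vote-commit" for v in votes)
                else "global-abort")

    # Phase 2a/2b: broadcast the decision; participants act on it and ack.
    for p in participants:
        p.send(decision)
    return decision
```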

Two-Phase Commit (2/2)

(Figure: the finite state machines. (a) Coordinator: INIT --commit/vote-request--> WAIT; WAIT --vote-abort/global-abort--> ABORT; WAIT --vote-commit/global-commit--> COMMIT. (b) Participant: INIT --vote-request/vote-abort--> ABORT; INIT --vote-request/vote-commit--> READY; READY --global-abort/ack--> ABORT; READY --global-commit/ack--> COMMIT.)

2PC Failing Participant (1/2)

Observation: Consider a participant crash in one of its states, and the subsequent recovery to that state:
- INIT state: no problem, as the participant was unaware of the protocol.
- READY state: the participant is waiting to either commit or abort. After recovery, it needs to know which state transition to make; hence, the coordinator's decision must be logged.
- ABORT state: merely make the entry into the abort state idempotent, e.g., removing the workspace of results.
- COMMIT state: also make the entry into the commit state idempotent, e.g., copying the workspace to storage.

Observation: When distributed commit is required, having participants use temporary workspaces to keep their results allows for simple recovery in the presence of failures.

2PC Failing Participant (2/2)

Alternative: When recovery to the READY state is needed, check what the other participants are doing. This approach avoids having to log the coordinator's decision. Assume the recovering participant P contacts another participant Q:

  State of Q    Action by P
  COMMIT        Make transition to COMMIT
  ABORT         Make transition to ABORT
  INIT          Make transition to ABORT
  READY         Contact another participant

Result: If all participants are in the READY state, the protocol blocks. Apparently, the coordinator is failing. Note: the protocol prescribes that we need the decision from the coordinator.

2PC Failing Coordinator

Observation: The real problem lies in the fact that the coordinator's final decision may not be available for some time (or may actually be lost).

Alternative: Let a participant P in the READY state time out when it hasn't received the coordinator's decision; P then tries to find out what the other participants know (as discussed).

Observation: The essence of the problem is that a recovering participant cannot make a local decision: it is dependent on other (possibly failed) processes.
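The termination rule in the table above can be written down directly. A sketch, where `current_state` is a hypothetical query to a peer participant:

```python
def recover_in_ready(peers) -> str:
    """Decide the fate of a participant that recovered (or timed out) in READY."""
    for q in peers:
        state = q.current_state()
        if state == "COMMIT":
            return "COMMIT"     # the coordinator must have decided global-commit
        if state in ("ABORT", "INIT"):
            # INIT means q never voted, so global-commit cannot have been sent;
            # aborting is therefore safe.
            return "ABORT"
    # Every reachable peer is READY: the decision rests with the (failed)
    # coordinator, and 2PC blocks until it recovers.
    return "BLOCKED"
```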

Three-Phase Commit (1/2)

- Phase 1a: The coordinator sends vote-request to the participants.
- Phase 1b: When a participant receives vote-request, it returns either vote-commit or vote-abort to the coordinator. If it sends vote-abort, it aborts its local computation.
- Phase 2a: The coordinator collects all votes; if all are vote-commit, it sends prepare-commit to all participants; otherwise it sends global-abort and halts.
- Phase 2b: Each participant waits for prepare-commit, or waits for global-abort, after which it halts.
- Phase 3a: The coordinator waits until all participants have sent ready-commit, and then sends global-commit to all.
- Phase 3b: Each participant waits for global-commit.

Three-Phase Commit (2/2)

(Figure: the finite state machines. (a) Coordinator: INIT --commit/vote-request--> WAIT; WAIT --vote-abort/global-abort--> ABORT; WAIT --vote-commit/prepare-commit--> PRECOMMIT; PRECOMMIT --ready-commit/global-commit--> COMMIT. (b) Participant: INIT --vote-request/vote-abort--> ABORT; INIT --vote-request/vote-commit--> READY; READY --global-abort/ack--> ABORT; READY --prepare-commit/ready-commit--> PRECOMMIT; PRECOMMIT --global-commit/ack--> COMMIT.)

3PC Failing Participant

Basic issue: Can P find out what it should do after crashing in the READY or PRECOMMIT state, even if other participants or the coordinator failed?

Essence: The coordinator and the participants, on their way to commit, never differ by more than one state transition.

Consequence: If a participant times out in the READY state, it can find out from the coordinator or the other participants whether it should abort or enter the PRECOMMIT state.

Observation: If a participant already made it to the PRECOMMIT state, it can always safely commit (but it is not allowed to do so unilaterally, for the sake of other processes that may have failed).

Observation: We may need to elect another coordinator to send off the final COMMIT.

Recovery

- Introduction
- Checkpointing
- Message logging

Recovery: Background

Essence: When a failure occurs, we need to bring the system into an error-free state:
- Forward error recovery: find a new state from which the system can continue operation
- Backward error recovery: bring the system back into a previous error-free state

Practice: Backward error recovery is used, which requires that we establish recovery points.

Observation: Recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from which to recover.

Consistent Recovery State

Requirement: Every message that has been received must also be shown as having been sent in the state of the sender.

Recovery line: Assuming processes regularly checkpoint their state, the most recent consistent global checkpoint.

(Figure: processes P1 and P2 take checkpoints over time; the recovery line is the most recent pair of checkpoints crossed by no message from the future of one into the past of the other. A later pair, cut by a message sent from P2 to P1, forms an inconsistent collection of checkpoints.)

Observation: If and only if the system provides reliable communication should sent messages also be received in a consistent state.
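The consistency requirement can be checked mechanically. A sketch, modeling each message as a (send time at the sender, receive time at the receiver) pair against one checkpoint per process (all names and the time model are illustrative assumptions):

```python
def consistent(cp_sender: float, cp_receiver: float,
               messages: list[tuple[float, float]]) -> bool:
    """A checkpoint pair is inconsistent if some message was received before
    the receiver's checkpoint but sent after the sender's checkpoint, i.e.
    the cut records a receipt with no matching send."""
    return all(not (recv <= cp_receiver and send > cp_sender)
               for send, recv in messages)

# A message sent at t=5 and received at t=6 crosses the cut (sender
# checkpoint at t=4, receiver checkpoint at t=7): inconsistent.
assert not consistent(4, 7, [(5, 6)])
assert consistent(6, 7, [(5, 6)])
```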

Cascaded Rollback

Observation: If checkpointing is done at the wrong instants, the recovery line may lie at system startup time: a cascaded rollback.

(Figure: P1 and P2 checkpoint between every exchange of messages m, so each pair of checkpoints is made inconsistent by some message; after a failure the rollback cascades all the way back to the initial state.)

Checkpointing: Stable Storage

Principle: Replicate all data on at least two disks, and keep one copy correct at all times.

(Figure: two mirrored disks with sectors a..h: (a) both copies identical; (b) one sector has a different value after a crash during an update; (c) one sector has a bad checksum.)

After a crash:
- If both disks are identical: you're in good shape.
- If one is bad (checksum) but the other is okay: choose the good one.
- If both seem okay but are different: choose the main disk.
- If neither is good: you're not in good shape.
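A sketch of that crash-recovery rule for a single sector pair (the disk model and `checksum_ok` flags are illustrative assumptions):

```python
def recover_sector(main: bytes, backup: bytes,
                   main_ok: bool, backup_ok: bool) -> bytes:
    """Return the authoritative sector value after a crash."""
    if not main_ok and not backup_ok:
        raise IOError("both copies corrupt: stable storage lost")
    if not main_ok:
        return backup          # repair main from the good backup copy
    if not backup_ok:
        return main            # repair backup from the good main copy
    # Both checksums are fine. If the contents differ, the crash hit between
    # the two writes (main is updated first), so the main disk wins.
    return main
```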

Independent Checkpointing

Essence: Each process independently takes checkpoints, with the risk of a cascaded rollback to system startup.

- Let CP[i](m) denote the m-th checkpoint of process P_i, and INT[i](m) the interval between CP[i](m-1) and CP[i](m).
- When process P_i sends a message in interval INT[i](m), it piggybacks (i, m).
- When process P_j receives that message in interval INT[j](n), it records the dependency INT[i](m) -> INT[j](n).
- The dependency INT[i](m) -> INT[j](n) is saved to stable storage when P_j takes checkpoint CP[j](n).

Observation: If process P_i rolls back to CP[i](m-1), P_j must roll back to CP[j](n-1).

Question: How can P_j find out where to roll back to?

Coordinated Checkpointing

Essence: Each process takes a checkpoint after a globally coordinated action.

Question: What advantages are there to coordinated checkpointing?

Simple solution: Use a two-phase blocking protocol (sketched below):
- A coordinator multicasts a checkpoint-request message
- When a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint
- When all checkpoints have been confirmed at the coordinator, it broadcasts a checkpoint-done message to allow all processes to continue

Observation: It is possible to consider only those processes that depend on the recovery of the coordinator, and ignore the rest.
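A sketch of the two-phase blocking checkpoint protocol, with hypothetical participant objects standing in for real messaging:

```python
def coordinated_checkpoint(participants) -> None:
    # Phase 1: ask everyone to checkpoint. After checkpointing, participants
    # stop sending application messages, so no message can cross the cut.
    for p in participants:
        p.send("checkpoint-request")
    acks = [p.recv() for p in participants]      # "checkpoint-taken" replies
    assert all(a == "checkpoint-taken" for a in acks)

    # Phase 2: the global checkpoint is now consistent; let everyone resume.
    for p in participants:
        p.send("checkpoint-done")
```

Blocking application traffic between the two phases is what guarantees consistency here, and it is also the protocol's main cost.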

Message Logging

Alternative: Instead of taking an (expensive) checkpoint, try to replay your (communication) behavior from the most recent checkpoint: store messages in a log.

Assumption: We assume a piecewise deterministic execution model:
- The execution of each process can be considered as a sequence of state intervals
- Each state interval starts with a nondeterministic event (e.g., message receipt)
- Execution within a state interval is deterministic

Conclusion: If we record the nondeterministic events (to replay them later), we obtain a deterministic execution model that allows us to do a complete replay.

Question: Why is logging only messages not enough?
Question: Is logging only nondeterministic events enough?

Message Logging and Consistency

Problem: When should we actually log messages? Issue: avoid orphan processes:
- Process Q has just received and subsequently delivered messages m1 and m2
- Assume that m2 is never logged
- After delivering m1 and m2, Q sends message m3 to process R
- Process R receives and subsequently delivers m3

(Figure: Q crashes and recovers; the unlogged m2 is never replayed, so neither is m3, leaving R in a state that can no longer arise: R has become an orphan.)

Goal: Devise message-logging schemes in which orphans do not occur.

Message-Logging Schemes (1/2)

- HDR[m]: the header of message m, containing its source, destination, sequence number, and delivery number. The header contains all information for resending a message and delivering it in the correct order (assume the data is reproduced by the application).
- A message m is stable if HDR[m] cannot be lost (e.g., because it has been written to stable storage).
- DEP[m]: the set of processes to which message m has been delivered, together with any process to which a message has been delivered that causally depends on the delivery of m.
- COPY[m]: the set of processes that have a copy of HDR[m] in their volatile memory.
- If C is a collection of crashed processes, then Q ∉ C is an orphan if there is a message m such that Q ∈ DEP[m] and COPY[m] ⊆ C.

Message-Logging Schemes (2/2)

Goal: No orphans means that for each message m, DEP[m] ⊆ COPY[m].

Pessimistic protocol: for each nonstable message m, there is at most one process dependent on m, that is, |DEP[m]| ≤ 1. Consequence: an unstable message in a pessimistic protocol must be made stable before a next message is sent.

Optimistic protocol: for each unstable message m, we ensure that if COPY[m] ⊆ C, then eventually also DEP[m] ⊆ C, where C denotes the set of processes that have been marked as faulty. Consequence: to guarantee that DEP[m] ⊆ C, we generally roll back each orphan process Q until Q ∉ DEP[m].
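The orphan condition translates directly into a set computation. A sketch, with DEP and COPY given as plain Python sets (the data layout is an illustrative assumption):

```python
def orphans(crashed: set[str], dep: dict[str, set[str]],
            copy: dict[str, set[str]]) -> set[str]:
    """Surviving processes that depend on a message whose header survives
    only in the volatile memory of crashed processes (COPY[m] subset of C)."""
    out: set[str] = set()
    for m in dep:
        if copy[m] <= crashed:            # every copy of HDR[m] has been lost
            out |= dep[m] - crashed       # its surviving dependents are orphans
    return out

# Example: m2 was delivered to Q, and R causally depends on it via m3, but
# only the crashed Q ever held HDR[m2]; R is an orphan and must roll back.
assert orphans({"Q"}, {"m2": {"Q", "R"}}, {"m2": {"Q"}}) == {"R"}
```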

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 08 (version October 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 07 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

Fault Tolerance. Distributed Software Systems. Definitions

Fault Tolerance. Distributed Software Systems. Definitions Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:

More information

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic

More information

Fault Tolerance 1/64

Fault Tolerance 1/64 Fault Tolerance 1/64 Fault Tolerance Fault tolerance is the ability of a distributed system to provide its services even in the presence of faults. A distributed system should be able to recover automatically

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

Today: Fault Tolerance. Replica Management

Today: Fault Tolerance. Replica Management Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Today: Fault Tolerance. Failure Masking by Redundancy

Today: Fault Tolerance. Failure Masking by Redundancy Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery Checkpointing

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Fault Tolerance. Fall 2008 Jussi Kangasharju

Fault Tolerance. Fall 2008 Jussi Kangasharju Fault Tolerance Fall 2008 Jussi Kangasharju Chapter Outline Fault tolerance Process resilience Reliable group communication Distributed commit Recovery 2 Basic Concepts Dependability includes Availability

More information

Today: Fault Tolerance. Reliable One-One Communication

Today: Fault Tolerance. Reliable One-One Communication Today: Fault Tolerance Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery Checkpointing Message logging Lecture 17, page 1 Reliable One-One Communication Issues

More information

Fault Tolerance. Chapter 7

Fault Tolerance. Chapter 7 Fault Tolerance Chapter 7 Basic Concepts Dependability Includes Availability Reliability Safety Maintainability Failure Models Type of failure Crash failure Omission failure Receive omission Send omission

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

MYE017 Distributed Systems. Kostas Magoutis

MYE017 Distributed Systems. Kostas Magoutis MYE017 Distributed Systems Kostas Magoutis magoutis@cse.uoi.gr http://www.cse.uoi.gr/~magoutis Message reception vs. delivery The logical organization of a distributed system to distinguish between message

More information

Module 8 Fault Tolerance CS655! 8-1!

Module 8 Fault Tolerance CS655! 8-1! Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!

More information

Module 8 - Fault Tolerance

Module 8 - Fault Tolerance Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

MYE017 Distributed Systems. Kostas Magoutis

MYE017 Distributed Systems. Kostas Magoutis MYE017 Distributed Systems Kostas Magoutis magoutis@cse.uoi.gr http://www.cse.uoi.gr/~magoutis Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known

More information

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra Today CSCI 5105 Recovery CAP Theorem Instructor: Abhishek Chandra 2 Recovery Operations to be performed to move from an erroneous state to an error-free state Backward recovery: Go back to a previous correct

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Distributed Systems

Distributed Systems 15-440 Distributed Systems 11 - Fault Tolerance, Logging and Recovery Tuesday, Oct 2 nd, 2018 Logistics Updates P1 Part A checkpoint Part A due: Saturday 10/6 (6-week drop deadline 10/8) *Please WORK hard

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

G1 m G2 Attack at dawn? e e e e 1 S 1 = {0} End of round 1 End of round 2 2 S 2 = {1} {1} {0,1} decide -1 3 S 3 = {1} { 0,1} {0,1} decide -1 white hats are loyal or good guys black hats are traitor

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

PRIMARY-BACKUP REPLICATION

PRIMARY-BACKUP REPLICATION PRIMARY-BACKUP REPLICATION Primary Backup George Porter Nov 14, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit Fault Tolerance o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication o Distributed Commit -1 Distributed Commit o A more general problem of atomic

More information

COMMUNICATION IN DISTRIBUTED SYSTEMS

COMMUNICATION IN DISTRIBUTED SYSTEMS Distributed Systems Fö 3-1 Distributed Systems Fö 3-2 COMMUNICATION IN DISTRIBUTED SYSTEMS Communication Models and their Layered Implementation 1. Communication System: Layered Implementation 2. Network

More information

Network Protocols. Sarah Diesburg Operating Systems CS 3430

Network Protocols. Sarah Diesburg Operating Systems CS 3430 Network Protocols Sarah Diesburg Operating Systems CS 3430 Protocol An agreement between two parties as to how information is to be transmitted A network protocol abstracts packets into messages Physical

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Distributed Systems Reliable Group Communication

Distributed Systems Reliable Group Communication Reliable Group Communication Group F March 2013 Overview The Basic Scheme The Basic Scheme Feedback Control Non-Hierarchical Hierarchical Atomic multicast Virtual Synchrony Message Ordering Implementing

More information

The challenges of non-stable predicates. The challenges of non-stable predicates. The challenges of non-stable predicates

The challenges of non-stable predicates. The challenges of non-stable predicates. The challenges of non-stable predicates The challenges of non-stable predicates Consider a non-stable predicate Φ encoding, say, a safety property. We want to determine whether Φ holds for our program. The challenges of non-stable predicates

More information

Distributed Transactions

Distributed Transactions Distributed Transactions Preliminaries Last topic: transactions in a single machine This topic: transactions across machines Distribution typically addresses two needs: Split the work across multiple nodes

More information

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018 TWO-PHASE COMMIT George Porter May 9 and 11, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit) CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 21: Network Protocols (and 2 Phase Commit) 21.0 Main Point Protocol: agreement between two parties as to

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

Exam 2 Review. October 29, Paul Krzyzanowski 1

Exam 2 Review. October 29, Paul Krzyzanowski 1 Exam 2 Review October 29, 2015 2013 Paul Krzyzanowski 1 Question 1 Why did Dropbox add notification servers to their architecture? To avoid the overhead of clients polling the servers periodically to check

More information

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers Replicated s, RAFT COS 8: Distributed Systems Lecture 8 Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable service Goal #: Servers should behave just like

More information

The objective. Atomic Commit. The setup. Model. Preserve data consistency for distributed transactions in the presence of failures

The objective. Atomic Commit. The setup. Model. Preserve data consistency for distributed transactions in the presence of failures The objective Atomic Commit Preserve data consistency for distributed transactions in the presence of failures Model The setup For each distributed transaction T: one coordinator a set of participants

More information

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast Coordination 2 Today l Group communication l Basic, reliable and l ordered multicast How can processes agree on an action or a value? Modes of communication Unicast 1ç è 1 Point to point Anycast 1è

More information

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De Viewstamped Replication to Practical Byzantine Fault Tolerance Pradipta De pradipta.de@sunykorea.ac.kr ViewStamped Replication: Basics What does VR solve? VR supports replicated service Abstraction is

More information

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Assignment 12: Commit Protocols and Replication Solution

Assignment 12: Commit Protocols and Replication Solution Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication

More information

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi

Three Models. 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1. DEPT. OF Comp Sc. and Engg., IIT Delhi DEPT. OF Comp Sc. and Engg., IIT Delhi Three Models 1. CSV888 - Distributed Systems 1. Time Order 2. Distributed Algorithms 3. Nature of Distributed Systems1 Index - Models to study [2] 1. LAN based systems

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Consensus in Distributed Systems. Jeff Chase Duke University

Consensus in Distributed Systems. Jeff Chase Duke University Consensus in Distributed Systems Jeff Chase Duke University Consensus P 1 P 1 v 1 d 1 Unreliable multicast P 2 P 3 Consensus algorithm P 2 P 3 v 2 Step 1 Propose. v 3 d 2 Step 2 Decide. d 3 Generalizes

More information

Parallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitionin

Parallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitionin Parallel Data Types of Parallelism Replication (Multiple copies of the same data) Better throughput for read-only computations Data safety Partitioning (Different data at different sites More space Better

More information

EECS 591 DISTRIBUTED SYSTEMS

EECS 591 DISTRIBUTED SYSTEMS EECS 591 DISTRIBUTED SYSTEMS Manos Kapritsos Fall 2018 Slides by: Lorenzo Alvisi 3-PHASE COMMIT Coordinator I. sends VOTE-REQ to all participants 3. if (all votes are Yes) then send Precommit to all else

More information

Distributed Systems Multicast & Group Communication Services

Distributed Systems Multicast & Group Communication Services Distributed Systems 600.437 Multicast & Group Communication Services Department of Computer Science The Johns Hopkins University 1 Multicast & Group Communication Services Lecture 3 Guide to Reliable Distributed

More information

Synchronization. Chapter 5

Synchronization. Chapter 5 Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is

More information

CS October 2017

CS October 2017 Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.

More information

2017 Paul Krzyzanowski 1

2017 Paul Krzyzanowski 1 Question 1 What problem can arise with a system that exhibits fail-restart behavior? Distributed Systems 06. Exam 1 Review Stale state: the system has an outdated view of the world when it starts up. Not:

More information

Causal Consistency and Two-Phase Commit

Causal Consistency and Two-Phase Commit Causal Consistency and Two-Phase Commit CS 240: Computing Systems and Concurrency Lecture 16 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Consistency

More information

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer - proposes a formal definition for the timed asynchronous distributed system model - presents measurements of process

More information

Fault Tolerance. Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure behavior

Fault Tolerance. Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure behavior Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure

More information

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Fault-Tolerant Computer Systems ECE 60872/CS Recovery Fault-Tolerant Computer Systems ECE 60872/CS 59000 Recovery Saurabh Bagchi School of Electrical & Computer Engineering Purdue University Slides based on ECE442 at the University of Illinois taught by Profs.

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions Distributed Systems Day 13: Distributed Transaction To Be or Not to Be Distributed.. Transactions Summary Background on Transactions ACID Semantics Distribute Transactions Terminology: Transaction manager,,

More information

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

CSE 486/586 Distributed Systems

CSE 486/586 Distributed Systems CSE 486/586 Distributed Systems Failure Detectors Slides by: Steve Ko Computer Sciences and Engineering University at Buffalo Administrivia Programming Assignment 2 is out Please continue to monitor Piazza

More information

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks.

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks. Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components Fault Tolerance To avoid disruption due to failure and to improve availability, systems are designed to be fault-tolerant Two broad categories of fault-tolerant systems are: systems that mask failure it

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Consensus, impossibility results and Paxos. Ken Birman

Consensus, impossibility results and Paxos. Ken Birman Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

David B. Johnson. Willy Zwaenepoel. Rice University. Houston, Texas. or the constraints of real-time applications [6, 7].

David B. Johnson. Willy Zwaenepoel. Rice University. Houston, Texas. or the constraints of real-time applications [6, 7]. Sender-Based Message Logging David B. Johnson Willy Zwaenepoel Department of Computer Science Rice University Houston, Texas Abstract Sender-based message logging isanewlow-overhead mechanism for providing

More information

C 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How.

C 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How. Recap Best Practices Distributed Systems Failure Detectors Steve Ko Computer Sciences and Engineering University at Buffalo 2 Today s Question Two Different System Models How do we handle failures? Cannot

More information

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems Transactions CS 475, Spring 2018 Concurrent & Distributed Systems Review: Transactions boolean transfermoney(person from, Person to, float amount){ if(from.balance >= amount) { from.balance = from.balance

More information

COMMENTS. AC-1: AC-1 does not require all processes to reach a decision It does not even require all correct processes to reach a decision

COMMENTS. AC-1: AC-1 does not require all processes to reach a decision It does not even require all correct processes to reach a decision ATOMIC COMMIT Preserve data consistency for distributed transactions in the presence of failures Setup one coordinator a set of participants Each process has access to a Distributed Transaction Log (DT

More information

DISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON.

DISTRIBUTED SYSTEMS. Second Edition. Andrew S. Tanenbaum Maarten Van Steen. Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON. DISTRIBUTED SYSTEMS 121r itac itple TAYAdiets Second Edition Andrew S. Tanenbaum Maarten Van Steen Vrije Universiteit Amsterdam, 7'he Netherlands PEARSON Prentice Hall Upper Saddle River, NJ 07458 CONTENTS

More information

Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all

Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all failures or predictable: exhibit a well defined failure behavior

More information

Extend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR:

Extend PB for high availability. PB high availability via 2PC. Recall: Primary-Backup. Putting it all together for SMR: Putting it all together for SMR: Two-Phase Commit, Leader Election RAFT COS 8: Distributed Systems Lecture Recall: Primary-Backup Mechanism: Replicate and separate servers Goal #: Provide a highly reliable

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 01 (version September 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20.

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 05 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information