Failure Tolerance. Distributed Systems Santa Clara University

Jasmine Caldwell

1 Failure Tolerance Distributed Systems Santa Clara University

2 Distributed Checkpointing

3 Distributed Checkpointing Capture the global state of a distributed system. Chandy and Lamport's distributed snapshot reflects a consistent global state: if process P has received a message from Q, then the global state must also show that Q sent that message to P.

4 Distributed Checkpointing A global state is represented by a cut. In a consistent cut, every message shown as received is also shown as sent, and every message shown as sent is either received or still in transit.
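
The consistency rule for cuts can be checked mechanically. The following is a minimal sketch, assuming each process numbers its own events 1, 2, 3, ... and a cut is given as the number of events included per process; the names `is_consistent` and `in_transit` and the message tuple layout are illustrative assumptions, not part of the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical message record: (sender, send_event, receiver, receive_event).
# Event numbers count events on each process's own timeline, starting at 1.
Message = Tuple[str, int, str, int]

def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """cut[p] is the number of events of process p that lie inside the cut."""
    for sender, s_idx, receiver, r_idx in messages:
        received_in_cut = r_idx <= cut[receiver]
        sent_in_cut = s_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False  # shown as received, but not shown as sent
    return True

def in_transit(cut: Dict[str, int], messages: List[Message]) -> List[Message]:
    """Messages shown as sent but not yet shown as received."""
    return [m for m in messages if m[1] <= cut[m[0]] and m[3] > cut[m[2]]]

msgs = [("P", 1, "Q", 2), ("Q", 3, "P", 4)]
good_cut = {"P": 2, "Q": 1}   # first message is sent but still in transit
bad_cut = {"P": 4, "Q": 2}    # P shows a receipt that Q never shows as sent
```

With these sample values, `good_cut` is consistent and has one message in transit, while `bad_cut` is rejected.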

5 Distributed Checkpointing

6 Distributed Checkpointing Represent the distributed system as a set of processes connected by unidirectional point-to-point communication channels.

7 Distributed Checkpointing Distributed snapshot: Any process can start a snapshot. The initiating process P records its own state and sends a marker along each of its outgoing channels. When process Q receives its first marker, it records its state, sends a marker along each of its outgoing channels, and starts recording the messages arriving on its other incoming channels. When Q receives a subsequent marker, it stops recording on the channel on which that marker arrived.

8 Distributed Checkpointing When process Q has received a marker on every incoming channel, it sends its own recorded state and the messages recorded on the monitored channels to the initiating process.
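
The marker rules can be tried out in a small single-threaded simulation. This is only a sketch under stated assumptions: two processes exchange integer amounts over FIFO channels modelled as deques, and the class `Process` and the helpers `connect`, `send`, and `deliver` are hypothetical names introduced for illustration.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.out = {}            # receiver name -> outgoing FIFO channel
        self.inc = []            # names of processes with a channel into this one
        self.recorded_state = None
        self.recording = {}      # sender name -> messages recorded so far
        self.channel_state = {}  # sender name -> final recorded channel contents

    def record_and_send_markers(self, skip=None):
        self.recorded_state = self.balance
        for src in self.inc:
            if src == skip:
                self.channel_state[src] = []  # channel the first marker came on
            else:
                self.recording[src] = []      # start recording this channel
        for channel in self.out.values():
            channel.append(MARKER)

    def receive(self, src, msg):
        if msg == MARKER:
            if self.recorded_state is None:   # first marker
                self.record_and_send_markers(skip=src)
            else:                             # subsequent marker: stop recording
                self.channel_state[src] = self.recording.pop(src)
        else:
            self.balance += msg
            if src in self.recording:
                self.recording[src].append(msg)

def connect(a, b):
    a.out[b.name] = deque()
    b.inc.append(a.name)

def send(a, b, amount):
    a.balance -= amount
    a.out[b.name].append(amount)

def deliver(a, b):
    b.receive(a.name, a.out[b.name].popleft())

p, q = Process("P", 100), Process("Q", 50)
connect(p, q)
connect(q, p)
send(p, q, 10)                 # in flight on P -> Q
send(q, p, 5)                  # in flight on Q -> P
p.record_and_send_markers()    # P initiates the snapshot
deliver(q, p)                  # the 5 arrives after P recorded: channel state
deliver(p, q)                  # the 10 arrives before Q sees the marker
deliver(p, q)                  # Q receives its first marker
deliver(q, p)                  # P receives Q's marker and stops recording
snapshot_total = (p.recorded_state + q.recorded_state
                  + sum(p.channel_state["Q"]) + sum(q.channel_state["P"]))
```

In this run the recorded process states plus the recorded channel contents add up to the initial total of 150, which is what a consistent snapshot of such a system should preserve.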

9 Distributed Checkpointing

10 Distributed Checkpointing Termination detection: Use the snapshot protocol. When Q receives a marker for the first time, the sending process becomes its predecessor. When Q is done with its part of the snapshot, it sends a DONE message to its predecessor. This alone still allows for messages in transit.

11 Distributed Checkpointing Termination detection: Need a snapshot in which all channels are empty. Q returns DONE only if (1) all of Q's successors have returned a DONE message and (2) Q has not received any message between the point at which it recorded its state and the point at which it had received the marker along each of its incoming channels. In all other cases, Q sends a CONTINUE message.

12 Distributed Checkpointing Termination detection: When the initiating process receives only DONE messages, no regular messages are in transit, and thus the computation has terminated.
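
The DONE / CONTINUE rule amounts to a small decision function. The sketch below assumes that each process counts the regular messages it received while recording; the function names and the counter argument are illustrative assumptions.

```python
DONE, CONTINUE = "DONE", "CONTINUE"

def reply_to_predecessor(successor_replies, messages_since_recording):
    """What Q reports once it has finished its part of the snapshot.

    successor_replies: replies Q collected from its successors.
    messages_since_recording: regular messages Q received between recording
    its state and receiving the marker on each incoming channel.
    """
    all_done = all(r == DONE for r in successor_replies)
    if all_done and messages_since_recording == 0:
        return DONE
    return CONTINUE

def terminated(replies_at_initiator):
    """The initiator concludes termination only if every reply is DONE."""
    return all(r == DONE for r in replies_at_initiator)
```

A process without successors that recorded empty channels reports DONE; a single CONTINUE anywhere propagates back to the initiator, which then has to start another snapshot.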

13 Failure Types Dependability consists of: Availability (the system is ready to be used), Reliability (the system can run continuously without failure), Safety (in a failure condition, nothing catastrophic happens), and Maintainability (how easily a failed system can be repaired).

14 Failure Types Dependability: A system that breaks down for a millisecond every hour has availability > 99.9999%, but its reliability is low. A system that breaks down only for two weeks every July has availability of about 96%, but its reliability is high.
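
The two availability figures follow from simple arithmetic. A short sketch, assuming availability is measured as the fraction of time the system is up (the helper `availability` is a hypothetical name):

```python
def availability(downtime, period):
    """Fraction of the period during which the system is usable."""
    return 1 - downtime / period

MS_PER_HOUR = 60 * 60 * 1000

# One millisecond of downtime per hour: almost always up, but fails every hour.
a_hourly_glitch = availability(1, MS_PER_HOUR)

# Fourteen days of downtime per 365-day year: one long outage, then ~50 weeks up.
a_july_outage = availability(14, 365)
```

The first system is down 1 part in 3.6 million, the second roughly 3.8% of the year, which is why the availability and reliability rankings of the two systems are reversed.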

15 Failure Types Failure: the system cannot meet its promises. Error: a part of the system state that may lead to a failure. Fault: the cause of an error.

16 Failure Types Transient faults occur once and then disappear. If the operation is repeated, the fault goes away. Example: a bird flies through the beam of a microwave transmitter (and possibly gets roasted).

17 Failure Types Intermittent fault: the fault occurs, goes away, and then returns.

18 Failure Types Permanent fault: the fault appears and continues to exist until the faulty component is repaired.

19 Failure Types Crash failure: the server halts, but it works correctly until it halts. Omission failure: a server fails to respond to incoming messages. Receive omission: the server fails to receive incoming messages. Send omission: the server fails to send messages.

20 Failure Types Timing failure: a server's response lies outside the specified time interval. Response failure: a server's response is incorrect. Value failure: the value of the response is wrong. State transition failure: the server deviates from the correct flow of control. Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.

21 Failure Types Fail-stop failure: the server stops producing output, and others can detect this state. Fail-silent failure: the server stops producing output, but others cannot distinguish this from a server that is slow. Fail-safe failure: the server acts arbitrarily, but other servers can recognize its output as false.

22 Failure Masking Failure masking by redundancy: erasure-correcting codes, replication.

23 Failure Masking Triple Modular Redundancy
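
Triple modular redundancy masks a single faulty replica by majority voting over three copies of a component. A minimal voter sketch, assuming replica outputs are plain comparable values (the function `vote` is a hypothetical name):

```python
from collections import Counter

def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: more than one replica is faulty")

masked = vote(42, 42, 17)   # the single deviating replica is outvoted
```

If two replicas fail in different ways, no majority exists and the voter can only report the problem, which is why TMR tolerates just one fault per stage.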

24 Process Resilience Organize processes into groups. Groups can be dynamic (members run membership protocols to join and leave) and can be hierarchical.

25 Process Resilience

26 Process Resilience Leader election with the bully algorithm: the process with the highest ID wins.
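
The outcome of a bully election can be sketched without real messaging. The code below is a simplification under stated assumptions: the set `alive` stands in for "processes that answer an ELECTION message", and the recursion follows only one of the higher processes that take over, which is enough to show who wins.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator          # nobody with a higher ID answered
    # A higher process answers, takes over, and holds its own election.
    return bully_election(min(higher), alive)

# Process 6 was the coordinator and has crashed; process 2 notices.
new_coordinator = bully_election(2, {1, 2, 3, 5})
```

Whichever process starts the election, the chain of takeovers ends at the highest ID that is still alive.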

27 Process Resilience Leader Election using a ring
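
The slide shows only a figure, so the following is a sketch of the common textbook ring election rather than the exact variant on the slide: an ELECTION message travels around the ring, every live process appends its ID, and the initiator picks the highest ID once the message returns. The function name and arguments are illustrative assumptions.

```python
def ring_election(ring, alive, initiator):
    """ring: process IDs in ring order; alive: IDs that still respond."""
    n = len(ring)
    collected = [initiator]
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:      # crashed processes are skipped
            collected.append(ring[i])
        i = (i + 1) % n
    return max(collected), collected

leader, visited = ring_election([3, 6, 1, 5, 2], {3, 1, 5, 2}, initiator=1)
```

In a full protocol the initiator would then circulate a COORDINATOR message announcing the result.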

28 Process Resilience Agreement in Faulty Systems

29 Process Resilience Byzantine generals problem: In the presence of Byzantine failures, the processes can only decide on a single value if more than 2/3 of the participants are non-faulty.

30 Process Resilience Byzantine generals problem, Lamport algorithm: Each process has to share a value with all others, but faulty processes can lie and misrepresent their value. Goal: all non-faulty processes accept the values from the non-faulty processes.

31 Process Resilience Lamport algorithm (1982): (1) Each process sends its value to all other processes. (2) Each process gathers the received values into a vector. (3) Each process sends its vector to everybody else. (4) For each vector position, every process accepts the value reported by a majority of the received vectors.
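
The vector exchange can be simulated for a small group. This is a sketch under stated assumptions: the function `agree`, the `lie` callback that models a faulty process, and the `UNKNOWN` marker are all illustrative names, and a non-faulty process simply trusts its own value for its own slot.

```python
from collections import Counter

UNKNOWN = None

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN

def agree(true_values, faulty, lie):
    """true_values: id -> value; lie(sender, receiver, slot) -> arbitrary value."""
    ids = sorted(true_values)
    # Steps 1 and 2: everybody sends its value; each receiver builds a vector.
    vector = {
        r: {s: (lie(s, r, s) if s in faulty else true_values[s]) for s in ids}
        for r in ids
    }
    result = {}
    for r in ids:
        if r in faulty:
            continue
        # Step 3: r receives one vector from every other process.
        received = [
            {slot: (lie(s, r, slot) if s in faulty else vector[s][slot])
             for slot in ids}
            for s in ids if s != r
        ]
        # Step 4: per slot, accept the value that a majority of vectors report.
        result[r] = {
            slot: true_values[r] if slot == r
            else majority([v[slot] for v in received])
            for slot in ids
        }
    return result

def lie(sender, receiver, slot):
    return f"x{receiver}{slot}"   # a different bogus value for everyone

four = agree({1: "a", 2: "b", 3: "c", 4: "d"}, faulty={3}, lie=lie)
three = agree({1: "a", 2: "b", 3: "c"}, faulty={3}, lie=lie)
```

With four processes and one traitor, every non-faulty process ends up with the same values for the non-faulty slots. With three processes and one traitor, process 1 cannot even confirm the value of process 2, which illustrates the requirement that more than 2/3 of the participants be non-faulty.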

32 Process Resilience

33 Process Resilience

34 Reliable Group Communication Problem: how to get messages to the members of a process group (reliable multicasting). Without process failures, the problem assumes that there is a join and leave protocol for processes. Often an additional requirement is that all members receive messages in exactly the same order.

35 Reliable Group Communication A simple solution exists if all receivers are known and assumed not to fail.

36 Reliable Group Communication Tradeoffs: Use explicit retransmission requests, or retransmit when acks are missing. Use multicast or point-to-point transmission for retransmissions. Use piggybacking in order to save network bandwidth.

37 Reliable Group Communication Scalability in reliable multicasting: The simple scheme cannot support large numbers of receivers. Optimization: get rid of acks and only send retransmission requests. It then becomes difficult to decide when a message can be removed from the history buffer. Use cumulative acks.

38 Reliable Group Communication Scalability in reliable multicasting: feedback suppression, implemented in Scalable Reliable Multicasting (SRM) by Floyd et al. (1997). Receivers never ack the receipt of messages. Whenever a process sends a retransmission request (NACK), it multicasts the request to everyone. Processes that receive this multicast suppress their own NACK for the same message.
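
How many NACKs actually go out depends on how the receivers' timers compare with the network latency. The following toy model is an assumption-laden sketch, not SRM itself: every receiver that missed the message waits for its own timer, and a multicast NACK reaches all others after a fixed `latency`.

```python
def nacks_sent(timers, latency):
    """timers: receiver -> delay (ms) before it would multicast its NACK.

    A receiver stays silent only if the earliest NACK reaches it
    before its own timer fires.
    """
    first = min(timers.values())
    return sorted(r for r, t in timers.items()
                  if t <= first or t < first + latency)

timers = {"A": 50, "B": 10, "C": 15, "D": 90}
senders = nacks_sent(timers, latency=10)
```

Here B fires first, C fires before B's NACK arrives and sends a redundant NACK, and A and D are suppressed. Timers that are spread too narrowly relative to the latency let many receivers send anyway.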

39 Reliable Group Communication

40 Reliable Group Communication Feedback suppression scales reasonably well. Problems: Receivers need to schedule their feedback messages accurately; otherwise, too many will send out their NACK anyway. Feedback still interrupts processes that have already received the message. Those that have not received it could form a separate multicast group, but that is difficult to do over a wide-area network.

41 Reliable Group Communication Hierarchical Feedback Control

42 Reliable Group Communication Atomic multicast (in the presence of failures) Make a distinction between receiving and delivering a message

43 Reliable Group Communication Each message is associated with a group view: the processes on the delivery list. Changes in group membership are announced by a group view change message. Problem: a message based on the old group view needs to be delivered before the group view change message is delivered.

44 Reliable Group Communication Virtual synchrony: reliable multicast in which a message multicast to a group view G is delivered to all non-faulty processes in G.

45 Reliable Group Communication

46 Reliable Group Communication There are several possibilities for ordering: unordered multicasts, FIFO-ordered multicasts, causally ordered multicasts, totally ordered multicasts.
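
FIFO-ordered delivery also illustrates the difference between receiving and delivering a message. A minimal sketch, assuming each sender numbers its multicasts 0, 1, 2, ... (the class `FifoReceiver` is a hypothetical name): messages that arrive early are held back and delivered only when all earlier ones from the same sender have been delivered.

```python
class FifoReceiver:
    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number to deliver
        self.held = {}        # sender -> {seq: payload} received but not delivered
        self.delivered = []   # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return            # duplicate of an already delivered message
        self.held.setdefault(sender, {})[seq] = payload
        while expected in self.held[sender]:
            self.delivered.append((sender, self.held[sender].pop(expected)))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("P", 1, "second")   # received, but held back
after_first_receive = list(r.delivered)
r.receive("P", 0, "first")    # now both can be delivered in order
```

Causal and total ordering need more bookkeeping (for example vector timestamps or a sequencer), which this sketch does not attempt.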

47 Reliable Group Communication Virtually synchronous reliable multicasting with totally ordered delivery of messages is called atomic multicasting.

48 Reliable Group Communication ISIS, implementing atomic multicast: Builds on TCP as a reliable point-to-point communication service. Assumes that messages sent out by a sender arrive in that order (a TCP property). Multicasting a message to a group view is the same as sending individual messages to all members of the group.

49 Reliable Group Communication Processes keep a message m until they know that every other process has received m; in that case m is stable. ONLY STABLE MESSAGES ARE DELIVERED. This is also true for view-change messages. Forwarding of messages guarantees that a message delivered to one non-faulty process is received by everyone in the group. This can require any process to send a message to all members of the group.

50 Reliable Group Communication

51 Reliable Group Communication Processing a group change: A process receives the group change message. It forwards any unstable message for the old group to all processes in the new group and marks them as stable. ISIS / TCP assumes that these messages are never lost. All messages to the old group received by one process are therefore guaranteed to be received by all non-faulty processes in the old group.

52 Reliable Group Communication When process P no longer has unstable messages, it multicasts a flush message to the new group. When P has received flush messages from all members of the new group, it installs the new view.

53 Reliable Group Communication

54 Reliable Group Communication When process Q receives a message sent to the old group: If Q still believes itself to be in the old group, it delivers the message (unless it has already received it and considers it a duplicate). If Q has received the view change message, it forwards any unstable messages and then sends a flush message to the new group.

55 Reliable Group Communication More protocol is needed in order to deal with failures during a view change. Details are in Birman's book or the papers on ISIS.

56 Checkpointing Recovery: Forward recovery brings the system to a new, failure-free state. Backward recovery brings the system back to an old, failure-free state and starts over.

57 Checkpointing Distributed snapshot to establish recovery line

58 Checkpointing Domino effect
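
The domino effect can be reproduced with independently taken checkpoints. The sketch below assumes each process numbers its own events, checkpoints are given as event numbers (0 is the initial state), and a message is a tuple (sender, send_event, receiver, receive_event); the function `recovery_line` and the sample data are illustrative assumptions. A process is rolled back whenever its checkpoint shows a receipt that the sender's checkpoint does not show as sent.

```python
def recovery_line(checkpoints, messages):
    """Return the latest checkpoint per process that forms a consistent state."""
    line = {p: cps[-1] for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, s_idx, receiver, r_idx in messages:
            if r_idx <= line[receiver] and s_idx > line[sender]:
                # Receipt is recorded but the send is not: roll the receiver
                # back to its latest checkpoint taken before the receipt.
                line[receiver] = max(c for c in checkpoints[receiver] if c < r_idx)
                changed = True
    return line

checkpoints = {"P": [0, 3, 6], "Q": [0, 2, 5]}
messages = [
    ("P", 1, "Q", 1),
    ("Q", 3, "P", 2),
    ("P", 4, "Q", 4),
    ("Q", 6, "P", 5),
]
line = recovery_line(checkpoints, messages)
```

In this example every rollback uncovers another unmatched receipt on the other side, so both processes cascade all the way back to their initial states.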

59 Checkpointing Need to do coordinated checkpointing instead of individual checkpointing. Simpler solution: a two-phase blocking protocol. The coordinator broadcasts a CHECKPOINT_REQ. Processes receiving CHECKPOINT_REQ create a local checkpoint, queue messages from the application, and block until they receive CHECKPOINT_DONE. The coordinator sends CHECKPOINT_DONE after receiving acks from everyone.
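
The two phases can be sketched in a single-threaded model. This is a minimal illustration under stated assumptions: the class `Worker`, its method names, and the `network` list standing in for the message layer are all hypothetical.

```python
class Worker:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.checkpoint = None
        self.blocked = False
        self.queued = []          # application messages held while blocked

    def on_checkpoint_req(self):
        self.checkpoint = self.state   # take the local checkpoint
        self.blocked = True
        return "ACK"

    def app_send(self, msg, network):
        if self.blocked:
            self.queued.append(msg)    # hold back until CHECKPOINT_DONE
        else:
            network.append((self.name, msg))

    def on_checkpoint_done(self, network):
        self.blocked = False
        while self.queued:
            network.append((self.name, self.queued.pop(0)))

network = []
workers = [Worker("P", 7), Worker("Q", 3)]

acks = [w.on_checkpoint_req() for w in workers]   # phase 1: CHECKPOINT_REQ
workers[0].app_send("hello", network)             # application keeps running
held_during_checkpoint = list(network)

if len(acks) == len(workers):                     # phase 2: CHECKPOINT_DONE
    for w in workers:
        w.on_checkpoint_done(network)
```

Because no application message leaves a process between its checkpoint and CHECKPOINT_DONE, no checkpoint can record the receipt of a message that was sent after the sender's checkpoint.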

60 Checkpointing Message logging is a technique used to reduce the number of checkpoints. It can lead to orphans.

61 Checkpointing Pessimistic logging protocols ensure that for each non-stable message there is at most one process depending on it. Optimistic logging protocols roll back any orphan process depending on some message until it no longer depends on the message.
