Failure Tolerance in Distributed Systems
Santa Clara University
Distributed Checkpointing
- Capture the global state of a distributed system
- Chandy and Lamport: the distributed snapshot
  - Reflects a consistent global state
  - If process P has received a message from Q, then the global state should show that Q sent that message to P
- The global state is represented by a cut
- Consistent cuts:
  - Messages shown as received are also shown as sent
  - Messages shown as sent are either received or in transit
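The consistency condition above can be stated as a one-line predicate. A minimal sketch, assuming the cut is summarized as the set of messages it shows as sent and the set it shows as received (the function name is illustrative):

```python
def is_consistent_cut(sent, received):
    """A cut is consistent iff every message shown as received is also
    shown as sent; sent-but-not-received messages are simply in transit."""
    return set(received) <= set(sent)
```

For example, a cut that shows m2 as sent but not yet received is still consistent, because m2 can be accounted for as in transit.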
- Represent the distributed system as a set of processes connected by unidirectional, point-to-point communication channels
Distributed snapshot:
- Any process can start a snapshot
- The initiating process P records its own state and sends a marker along all of its outgoing channels
- Process Q, upon receiving its first marker:
  - Records its state
  - Sends a marker to all of its neighbors
  - Starts recording on all incoming channels
- Process Q, upon receiving a subsequent marker:
  - Stops recording on the channel on which the marker arrived
- Process Q, upon receiving a marker on the last incoming channel, sends its recorded state and the recorded channel contents to the initiating process
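The marker rules above can be sketched as a single-machine simulation. This is an illustrative model, not library code: FIFO channels are Python deques, and the application state is just a running sum of received message values.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = 0                 # application state: a running sum
        self.recorded_state = None     # local state saved for the snapshot
        self.recording = {}            # sender pid -> messages seen while recording
        self.channel_snapshot = {}     # sender pid -> recorded in-transit messages
        self.in_channels = {}          # sender pid -> deque shared with the sender
        self.out_channels = {}         # receiver pid -> deque

    def connect_to(self, other):
        chan = deque()
        self.out_channels[other.pid] = chan
        other.in_channels[self.pid] = chan

    def send(self, dest_pid, msg):
        self.out_channels[dest_pid].append(msg)

    def start_snapshot(self):
        # the initiator records its own state, starts recording every
        # incoming channel, and floods markers on its outgoing channels
        self.recorded_state = self.state
        for pid in self.in_channels:
            self.recording[pid] = []
        for chan in self.out_channels.values():
            chan.append(MARKER)

    def receive(self, sender_pid):
        msg = self.in_channels[sender_pid].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                # first marker: record state, record this channel as empty,
                # start recording the others, and forward markers
                self.recorded_state = self.state
                self.channel_snapshot[sender_pid] = []
                for pid in self.in_channels:
                    if pid != sender_pid:
                        self.recording[pid] = []
                for chan in self.out_channels.values():
                    chan.append(MARKER)
            else:
                # subsequent marker: stop recording on this channel
                self.channel_snapshot[sender_pid] = self.recording.pop(sender_pid, [])
        else:
            if sender_pid in self.recording:
                self.recording[sender_pid].append(msg)
            self.state += msg
```

Because the channels are FIFO, any message that crosses the cut is either delivered before the marker (counted in the receiver's state) or recorded as in transit, which is exactly the consistency requirement.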
Termination detection: use the snapshot protocol
- If Q receives a marker for the first time, the sending process becomes its predecessor
- When Q is done with the snapshot, it sends a DONE message to its predecessor
- This still allows for messages in transit
Termination detection: we need a snapshot in which all channels are empty
- Q returns DONE only if:
  - All of Q's successors have returned a DONE message
  - Q has not received any message between the point at which it recorded its state and the point at which it received the marker along each of its incoming channels
- In all other cases, Q sends a CONTINUE message
- When the initiating process receives only DONE messages, no regular messages are in transit, so the computation has terminated
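The DONE/CONTINUE decision above boils down to one condition per process. A minimal sketch, assuming the snapshot tree is already built and each process knows its successors' replies (the function name is illustrative):

```python
def termination_reply(successor_replies, saw_message_while_recording):
    """Return DONE only if every successor said DONE and no application
    message arrived between recording the state and receiving the marker
    on each incoming channel; otherwise return CONTINUE."""
    if all(r == "DONE" for r in successor_replies) and not saw_message_while_recording:
        return "DONE"
    return "CONTINUE"
```

A leaf of the snapshot tree has no successors, so its reply depends only on whether it saw a message while recording.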
Failure Types

Dependability consists of:
- Availability: the system is ready to be used
- Reliability: the system can run continuously without failure
- Safety: when the system fails, nothing catastrophic happens
- Maintainability: how easily a failed system can be repaired
Dependability examples:
- A system that breaks down for a millisecond every hour: availability > 99.9999%, but reliability is low
- A system that breaks down only for two weeks every July: availability ~ 96%, but reliability is high
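The two examples can be worked out numerically. The helper below is illustrative, not a standard function:

```python
def availability(downtime, period):
    """Fraction of time the system is up over one period."""
    return 1.0 - downtime / period

# 1 ms down every hour: excellent availability, poor reliability (MTBF = 1 hour)
a_flaky = availability(0.001, 3600.0)   # both in seconds
# two weeks down every year: ~96% availability, high reliability (MTBF ~ 1 year)
a_yearly = availability(14.0, 365.0)    # both in days
```

The point of the comparison: availability alone says nothing about how often failures occur, only about their total duration.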
- Failure: the system cannot meet its promises
- Error: the part of the system state that may lead to a failure
- Fault: the cause of an error
- Transient faults occur once and then disappear; if the operation is repeated, the fault goes away
  - Example: a bird flies through the beam of a microwave transmitter (and possibly gets roasted)
- Intermittent fault: the fault occurs, goes away, then returns
- Permanent fault: the fault appears and continues to exist until the faulty component is repaired
- Crash failure: the server halts, but was working correctly until it halted
- Omission failure: the server fails to respond to incoming messages
  - Receive omission: the server fails to receive incoming messages
  - Send omission: the server fails to send messages
- Timing failure: the server's response lies outside the specified time interval
- Response failure: the server's response is incorrect
  - Value failure: the value of the response is wrong
  - State-transition failure: the server deviates from the correct flow of control
- Arbitrary / Byzantine failure: the server may produce arbitrary responses at arbitrary times
- Fail-stop failure: a fail-stop server stops producing output, and others can detect this state
- Fail-silent failure: a fail-silent server stops producing output, but others cannot distinguish this from a server that is merely slow
- Fail-safe failure: the server acts arbitrarily, but other servers can recognize its output as false
Failure Masking

Failure masking by redundancy:
- Erasure-correcting codes
- Replication
Triple Modular Redundancy (TMR)
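The heart of triple modular redundancy is a majority voter: one faulty replica is outvoted by the other two. A minimal sketch (the function name is illustrative):

```python
def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: all three replicas disagree")
```

TMR masks any single failure; if two replicas fail with different values, no majority exists and the voter can only signal an error.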
Process Resilience

- Organize processes into groups
- Groups can be dynamic (run membership protocols) or hierarchical
Leader election: the bully algorithm
- The process with the highest ID wins
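A sketch of the bully algorithm's outcome, assuming reliable message delivery: the initiator sends ELECTION to all live higher-numbered processes; any one that answers takes over the election, and the recursion ends at the highest live ID. This models the result, not the full message exchange:

```python
def bully_election(alive, initiator):
    """alive: dict pid -> bool. Return the pid elected coordinator."""
    higher = [p for p in alive if p > initiator and alive[p]]
    if not higher:
        return initiator                         # nobody higher answers: initiator wins
    return bully_election(alive, min(higher))    # a live higher process takes over
```

Whichever live higher process takes over first, the recursion still terminates at the highest live ID, which then broadcasts COORDINATOR to everyone.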
Leader election using a ring
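A sketch of the ring variant: the ELECTION message travels around the ring, skipping crashed processes, and each live process appends its own ID; when the message returns to the initiator, the highest collected ID becomes coordinator. The function below models one full circulation:

```python
def ring_election(ring, alive, initiator):
    """ring: pids in ring order; alive: pid -> bool. Return the coordinator."""
    collected = []
    start = ring.index(initiator)
    for step in range(len(ring)):
        pid = ring[(start + step) % len(ring)]
        if alive[pid]:
            collected.append(pid)   # each live process appends its own ID
    return max(collected)
```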
Agreement in Faulty Systems
The Byzantine generals problem:
- In the presence of Byzantine failures, the processes can only decide on a single value if more than 2/3 of the participants are non-faulty
The Byzantine generals problem, Lamport's algorithm:
- Each process has to share a value with all others
- But faulty processes can lie and misrepresent their values
- Goal: all processes accept the values of the non-faulty processes
Lamport's algorithm (1982):
- Each process sends its value to all other processes
- The received values are gathered into vectors
- Each process sends its vector to everybody else
- Every process accepts the values that appear in a majority of the vectors
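The final majority vote can be sketched as follows, assuming the vectors from the second round have already been collected (names are illustrative):

```python
from collections import Counter

def accept_values(vectors):
    """vectors[i][j]: the value process i claims process j announced.
    Return, for each process j, the majority value, or None if no
    value reaches a majority."""
    n = len(vectors)
    decision = []
    for j in range(n):
        counts = Counter(vectors[i][j] for i in range(n))
        value, freq = counts.most_common(1)[0]
        decision.append(value if freq > n // 2 else None)
    return decision
```

With four processes and one traitor, the three loyal processes' values survive the vote, while the traitor's inconsistent reports produce no majority.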
Reliable Group Communication

Problem: how to get messages to the members of a process group (reliable multicasting)
- Without process failures:
  - The problem assumes that there is a join and leave protocol for processes
  - Often, members must receive messages in exactly the same order
- A simple solution exists if all receivers are known and assumed not to fail
Trade-offs:
- Explicit retransmission requests, or retransmission when acks are missing
- Multicast or point-to-point transmission for retransmissions
- Piggybacking in order to save network bandwidth
Scalability in reliable multicasting:
- The simple scheme cannot support large numbers of receivers
- Optimization: get rid of acks and only send retransmission requests
  - It then becomes difficult to remove messages from the history buffer; use cumulative acks
Scalability in reliable multicasting: feedback suppression
- Implemented in Scalable Reliable Multicasting (SRM) by Floyd et al. (1997)
- Never ack the receipt of messages
- Whenever a process sends a retransmission request (NACK), it multicasts it to everyone
- Servers that receive this multicast suppress their own NACK message
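An idealized sketch of the suppression effect: every receiver that missed the message schedules a NACK after a random-looking delay; the first timer to fire multicasts its NACK, and, assuming that NACK reaches the others before their own timers expire, everyone else suppresses theirs. The function name and the single-NACK assumption are illustrative:

```python
def nacks_sent(delays):
    """delays: receiver -> scheduled NACK delay. Return who actually multicasts."""
    sent = []
    for r in sorted(delays, key=delays.get):
        if not sent:
            sent.append(r)   # first timer fires: this receiver multicasts the NACK
        # later timers observe that NACK and suppress their own
    return sent
```

In the ideal case only one NACK reaches the sender regardless of how many receivers missed the message; the scheduling problem discussed next is exactly about how often reality falls short of this ideal.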
Feedback suppression scales reasonably well, but has problems:
- Receivers need to schedule their feedback messages accurately; otherwise, too many will send out their NACK anyway
- Feedback still interrupts processes that already received the message
  - One could form a separate multicast group for those that have not received the message, but that is difficult to do over a wide-area network
Hierarchical feedback control
Atomic multicast (in the presence of failures):
- Make a distinction between receiving and delivering a message
- Each message is associated with a group view: the processes on the delivery list
- Changes in group membership are announced by a group-view-change message
- Problem: a message based on the old group view needs to be delivered before the view-change message is delivered
Virtual synchrony: reliable multicast in which a message multicast to a group view G is delivered to all non-faulty processes in G
Virtual synchrony allows several orderings:
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts
Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called atomic multicasting
ISIS: implementing atomic multicast
- Built on TCP as reliable point-to-point communication
- Assumes that messages sent out by a sender arrive in that order (a TCP property)
- Multicasting a message within a group view amounts to sending individual messages to all members of the group
- Processes keep a message m until they know that every other process has received it; m is then called stable
- Only stable messages are delivered; this also holds for view-change messages
- Forwarding of messages guarantees that a message delivered to one non-faulty process is received by everyone in the group
  - Any process can be required to send a message to all members of the group
Processing a group change:
- A process receives a group-change message
- It forwards any unstable messages for the old group to all processes in the new group and marks them as stable
  - ISIS/TCP assumes that these messages are never lost
  - All messages to the old group received by one process are therefore guaranteed to be received by all non-faulty processes in the old group
- When process P no longer has unstable messages, it multicasts a flush message to the new group
- When P has received flush messages from all members of the new group, it installs the new view
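The flush rule can be stated as two conditions per process. A minimal sketch, assuming the sets of unstable messages and received flushes are tracked elsewhere (the function name is illustrative):

```python
def flush_status(unstable, flushes_received, new_view):
    """A process may multicast FLUSH once it has no unstable messages,
    and installs the new view once every member of the new view has
    flushed. Return (can_flush, can_install)."""
    can_flush = len(unstable) == 0
    can_install = can_flush and set(new_view) <= set(flushes_received)
    return can_flush, can_install
```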
When process Q receives a message sent to the old group:
- If Q still believes itself to be in the old group, it delivers the message (unless it has already received it and considers it a duplicate)
- If Q has already received the view-change message, it forwards any unstable messages and then sends a flush message to the new group
More protocol is needed to deal with failures during a view change; see Birman's book or the ISIS papers for details
Checkpointing

Recovery:
- Forward recovery: bring the system to a new, failure-free state
- Backward recovery: bring the system back to an old, failure-free state and start over
- Use a distributed snapshot to establish a recovery line
- The domino effect: with uncoordinated checkpoints, a rollback can cascade across processes, past many checkpoints
We need coordinated checkpointing instead of individual checkpointing. A simple solution is a two-phase blocking protocol:
- The coordinator broadcasts a CHECKPOINT_REQ message
- Processes receiving CHECKPOINT_REQ create a local checkpoint, queue messages from the application, and block until they receive CHECKPOINT_DONE
- The coordinator sends CHECKPOINT_DONE after receiving acks from everyone
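The two phases can be sketched from the coordinator's point of view, with processes modeled as plain dicts (all names are illustrative):

```python
def coordinated_checkpoint(processes):
    """Two-phase blocking checkpoint: request, collect acks, release."""
    acks = []
    for p in processes:               # phase 1: broadcast CHECKPOINT_REQ
        p["checkpointed"] = True      # each process takes a local checkpoint,
        p["blocked"] = True           # queues application messages, and
        acks.append(p["pid"])         # acks back to the coordinator
    if len(acks) == len(processes):   # phase 2: all acks received
        for p in processes:
            p["blocked"] = False      # CHECKPOINT_DONE: resume normal operation
    return acks
```

Blocking the application between the two phases is what keeps messages from crossing the checkpoint line, so the set of local checkpoints forms a consistent global state.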
Techniques used to reduce the number of checkpoints:
- Message logging
  - Can lead to orphan processes
- Pessimistic logging protocols: ensure that for each non-stable message there is at most one process depending on it
- Optimistic logging protocols: any orphan process depending on some message is rolled back until it no longer depends on that message