Failure Tolerance in Distributed Systems
Santa Clara University
Distributed Checkpointing
- Capture the global state of a distributed system
- Chandy and Lamport: the distributed snapshot
  - Reflects a consistent global state
  - If process P has received a message from Q, then the global state should show that Q sent that message to P
- The global state is represented by a cut
- Consistent cuts:
  - Messages shown as received are also shown as sent
  - Messages shown as sent are either received or in transit
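The consistency condition above can be stated as a one-line predicate. A minimal sketch, assuming the cut is summarized as the set of messages it shows as sent and the set it shows as received (the function name is illustrative):

```python
def is_consistent_cut(sent, received):
    """A cut is consistent iff every message shown as received is also
    shown as sent; sent-but-not-received messages are simply in transit."""
    return set(received) <= set(sent)
```

For example, a cut that shows m2 as sent but not yet received is still consistent, because m2 can be accounted for as in transit.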
- Represent the distributed system as a set of processes connected by unidirectional, point-to-point communication channels
Distributed snapshot:
- Any process can start a snapshot
- The initiating process P records its own state and sends a marker along all of its outgoing channels
- Process Q, upon receiving its first marker:
  - Records its state
  - Sends a marker to all of its neighbors
  - Starts recording on all incoming channels
- Process Q, upon receiving a subsequent marker:
  - Stops recording on the channel on which the marker arrived
- Process Q, upon receiving a marker on the last incoming channel, sends its recorded state and the recorded channel contents to the initiating process
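The marker rules above can be sketched as a single-machine simulation. This is an illustrative model, not library code: FIFO channels are Python deques, and the application state is just a running sum of received message values.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.state = 0                 # application state: a running sum
        self.recorded_state = None     # local state saved for the snapshot
        self.recording = {}            # sender pid -> messages seen while recording
        self.channel_snapshot = {}     # sender pid -> recorded in-transit messages
        self.in_channels = {}          # sender pid -> deque shared with the sender
        self.out_channels = {}         # receiver pid -> deque

    def connect_to(self, other):
        chan = deque()
        self.out_channels[other.pid] = chan
        other.in_channels[self.pid] = chan

    def send(self, dest_pid, msg):
        self.out_channels[dest_pid].append(msg)

    def start_snapshot(self):
        # the initiator records its own state, starts recording every
        # incoming channel, and floods markers on its outgoing channels
        self.recorded_state = self.state
        for pid in self.in_channels:
            self.recording[pid] = []
        for chan in self.out_channels.values():
            chan.append(MARKER)

    def receive(self, sender_pid):
        msg = self.in_channels[sender_pid].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                # first marker: record state, record this channel as empty,
                # start recording the others, and forward markers
                self.recorded_state = self.state
                self.channel_snapshot[sender_pid] = []
                for pid in self.in_channels:
                    if pid != sender_pid:
                        self.recording[pid] = []
                for chan in self.out_channels.values():
                    chan.append(MARKER)
            else:
                # subsequent marker: stop recording on this channel
                self.channel_snapshot[sender_pid] = self.recording.pop(sender_pid, [])
        else:
            if sender_pid in self.recording:
                self.recording[sender_pid].append(msg)
            self.state += msg
```

Because the channels are FIFO, any message that crosses the cut is either delivered before the marker (counted in the receiver's state) or recorded as in transit, which is exactly the consistency requirement.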
Termination detection: use the snapshot protocol
- If Q receives a marker for the first time, the sending process becomes its predecessor
- When Q is done with the snapshot, it sends a DONE message to its predecessor
- This still allows for messages in transit
Termination detection: we need a snapshot in which all channels are empty
- Q returns DONE only if:
  - All of Q's successors have returned a DONE message
  - Q has not received any message between the point at which it recorded its state and the point at which it received the marker along each of its incoming channels
- In all other cases, Q sends a CONTINUE message
- When the initiating process receives only DONE messages, no regular messages are in transit, so the computation has terminated
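The DONE/CONTINUE decision above boils down to one condition per process. A minimal sketch, assuming the snapshot tree is already built and each process knows its successors' replies (the function name is illustrative):

```python
def termination_reply(successor_replies, saw_message_while_recording):
    """Return DONE only if every successor said DONE and no application
    message arrived between recording the state and receiving the marker
    on each incoming channel; otherwise return CONTINUE."""
    if all(r == "DONE" for r in successor_replies) and not saw_message_while_recording:
        return "DONE"
    return "CONTINUE"
```

A leaf of the snapshot tree has no successors, so its reply depends only on whether it saw a message while recording.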
Failure Types

Dependability consists of:
- Availability: the system is ready to be used
- Reliability: the system can run continuously without failure
- Safety: when the system fails, nothing catastrophic happens
- Maintainability: how easily a failed system can be repaired
Dependability examples:
- A system that breaks down for a millisecond every hour: availability > 99.9999%, but reliability is low
- A system that breaks down only for two weeks every July: availability ~ 96%, but reliability is high
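The two examples can be worked out numerically. The helper below is illustrative, not a standard function:

```python
def availability(downtime, period):
    """Fraction of time the system is up over one period."""
    return 1.0 - downtime / period

# 1 ms down every hour: excellent availability, poor reliability (MTBF = 1 hour)
a_flaky = availability(0.001, 3600.0)   # both in seconds
# two weeks down every year: ~96% availability, high reliability (MTBF ~ 1 year)
a_yearly = availability(14.0, 365.0)    # both in days
```

The point of the comparison: availability alone says nothing about how often failures occur, only about their total duration.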
- Failure: the system cannot meet its promises
- Error: the part of the system state that may lead to a failure
- Fault: the cause of an error
- Transient faults occur once and then disappear; if the operation is repeated, the fault goes away
  - Example: a bird flies through the beam of a microwave transmitter (and possibly gets roasted)
- Intermittent fault: the fault occurs, goes away, then returns
- Permanent fault: the fault appears and continues to exist until the faulty component is repaired
- Crash failure: the server halts, but was working correctly until it halted
- Omission failure: the server fails to respond to incoming messages
  - Receive omission: the server fails to receive incoming messages
  - Send omission: the server fails to send messages
- Timing failure: the server's response lies outside the specified time interval
- Response failure: the server's response is incorrect
  - Value failure: the value of the response is wrong
  - State-transition failure: the server deviates from the correct flow of control
- Arbitrary / Byzantine failure: the server may produce arbitrary responses at arbitrary times
- Fail-stop failure: a fail-stop server stops producing output, and others can detect this state
- Fail-silent failure: a fail-silent server stops producing output, but others cannot distinguish this from a server that is merely slow
- Fail-safe failure: the server acts arbitrarily, but other servers can recognize its output as false
Failure Masking

Failure masking by redundancy:
- Erasure-correcting codes
- Replication
Triple Modular Redundancy (TMR)
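The heart of triple modular redundancy is a majority voter: one faulty replica is outvoted by the other two. A minimal sketch (the function name is illustrative):

```python
def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: all three replicas disagree")
```

TMR masks any single failure; if two replicas fail with different values, no majority exists and the voter can only signal an error.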
Process Resilience

- Organize processes into groups
- Groups can be dynamic (run membership protocols) or hierarchical
Leader election: the bully algorithm
- The process with the highest ID wins
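A sketch of the bully algorithm's outcome, assuming reliable message delivery: the initiator sends ELECTION to all live higher-numbered processes; any one that answers takes over the election, and the recursion ends at the highest live ID. This models the result, not the full message exchange:

```python
def bully_election(alive, initiator):
    """alive: dict pid -> bool. Return the pid elected coordinator."""
    higher = [p for p in alive if p > initiator and alive[p]]
    if not higher:
        return initiator                         # nobody higher answers: initiator wins
    return bully_election(alive, min(higher))    # a live higher process takes over
```

Whichever live higher process takes over first, the recursion still terminates at the highest live ID, which then broadcasts COORDINATOR to everyone.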
Leader election using a ring
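A sketch of the ring variant: the ELECTION message travels around the ring, skipping crashed processes, and each live process appends its own ID; when the message returns to the initiator, the highest collected ID becomes coordinator. The function below models one full circulation:

```python
def ring_election(ring, alive, initiator):
    """ring: pids in ring order; alive: pid -> bool. Return the coordinator."""
    collected = []
    start = ring.index(initiator)
    for step in range(len(ring)):
        pid = ring[(start + step) % len(ring)]
        if alive[pid]:
            collected.append(pid)   # each live process appends its own ID
    return max(collected)
```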
Agreement in Faulty Systems
The Byzantine generals problem:
- In the presence of Byzantine failures, the processes can only decide on a single value if more than 2/3 of the participants are non-faulty
The Byzantine generals problem, Lamport's algorithm:
- Each process has to share a value with all others
- But faulty processes can lie and misrepresent their values
- Goal: all processes accept the values of the non-faulty processes
Lamport's algorithm (1982):
- Each process sends its value to all other processes
- The received values are gathered into vectors
- Each process sends its vector to everybody else
- Every process accepts the values that appear in a majority of the vectors
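The final majority vote can be sketched as follows, assuming the vectors from the second round have already been collected (names are illustrative):

```python
from collections import Counter

def accept_values(vectors):
    """vectors[i][j]: the value process i claims process j announced.
    Return, for each process j, the majority value, or None if no
    value reaches a majority."""
    n = len(vectors)
    decision = []
    for j in range(n):
        counts = Counter(vectors[i][j] for i in range(n))
        value, freq = counts.most_common(1)[0]
        decision.append(value if freq > n // 2 else None)
    return decision
```

With four processes and one traitor, the three loyal processes' values survive the vote, while the traitor's inconsistent reports produce no majority.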
Reliable Group Communication

Problem: how to get messages to the members of a process group (reliable multicasting)
- Without process failures:
  - The problem assumes that there is a join and leave protocol for processes
  - Often, members must receive messages in exactly the same order
- A simple solution exists if all receivers are known and assumed not to fail
Trade-offs:
- Explicit retransmission requests, or retransmission when acks are missing
- Multicast or point-to-point transmission for retransmissions
- Piggybacking in order to save network bandwidth
Scalability in reliable multicasting:
- The simple scheme cannot support large numbers of receivers
- Optimization: get rid of acks and only send retransmission requests
  - It then becomes difficult to remove messages from the history buffer; use cumulative acks
Scalability in reliable multicasting: feedback suppression
- Implemented in Scalable Reliable Multicasting (SRM) by Floyd et al. (1997)
- Never ack the receipt of messages
- Whenever a process sends a retransmission request (NACK), it multicasts it to everyone
- Servers that receive this multicast suppress their own NACK message
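An idealized sketch of the suppression effect: every receiver that missed the message schedules a NACK after a random-looking delay; the first timer to fire multicasts its NACK, and, assuming that NACK reaches the others before their own timers expire, everyone else suppresses theirs. The function name and the single-NACK assumption are illustrative:

```python
def nacks_sent(delays):
    """delays: receiver -> scheduled NACK delay. Return who actually multicasts."""
    sent = []
    for r in sorted(delays, key=delays.get):
        if not sent:
            sent.append(r)   # first timer fires: this receiver multicasts the NACK
        # later timers observe that NACK and suppress their own
    return sent
```

In the ideal case only one NACK reaches the sender regardless of how many receivers missed the message; the scheduling problem discussed next is exactly about how often reality falls short of this ideal.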
Feedback suppression scales reasonably well, but has problems:
- Receivers need to schedule their feedback messages accurately; otherwise, too many will send out their NACK anyway
- Feedback still interrupts processes that already received the message
  - One could form a separate multicast group for those that have not received the message, but that is difficult to do over a wide-area network
Hierarchical feedback control
Atomic multicast (in the presence of failures):
- Make a distinction between receiving and delivering a message
- Each message is associated with a group view: the processes on the delivery list
- Changes in group membership are announced by a group-view-change message
- Problem: a message based on the old group view needs to be delivered before the view-change message is delivered
Virtual synchrony: reliable multicast in which a message multicast to a group view G is delivered to all non-faulty processes in G
Virtual synchrony allows several orderings:
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts
Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called atomic multicasting
ISIS: implementing atomic multicast
- Built on TCP as reliable point-to-point communication
- Assumes that messages sent out by a sender arrive in that order (a TCP property)
- Multicasting a message within a group view amounts to sending individual messages to all members of the group
- Processes keep a message m until they know that every other process has received it; m is then called stable
- Only stable messages are delivered; this also holds for view-change messages
- Forwarding of messages guarantees that a message delivered to one non-faulty process is received by everyone in the group
  - Any process can be required to send a message to all members of the group
Processing a group change:
- A process receives a group-change message
- It forwards any unstable messages for the old group to all processes in the new group and marks them as stable
  - ISIS/TCP assumes that these messages are never lost
  - All messages to the old group received by one process are therefore guaranteed to be received by all non-faulty processes in the old group
- When process P no longer has unstable messages, it multicasts a flush message to the new group
- When P has received flush messages from all members of the new group, it installs the new view
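The flush rule can be stated as two conditions per process. A minimal sketch, assuming the sets of unstable messages and received flushes are tracked elsewhere (the function name is illustrative):

```python
def flush_status(unstable, flushes_received, new_view):
    """A process may multicast FLUSH once it has no unstable messages,
    and installs the new view once every member of the new view has
    flushed. Return (can_flush, can_install)."""
    can_flush = len(unstable) == 0
    can_install = can_flush and set(new_view) <= set(flushes_received)
    return can_flush, can_install
```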
When process Q receives a message sent to the old group:
- If Q still believes itself to be in the old group, it delivers the message (unless it has already received it and considers it a duplicate)
- If Q has already received the view-change message, it forwards any unstable messages and then sends a flush message to the new group
More protocol is needed to deal with failures during a view change; see Birman's book or the ISIS papers for details
Checkpointing

Recovery:
- Forward recovery: bring the system to a new, failure-free state
- Backward recovery: bring the system back to an old, failure-free state and start over
- Use a distributed snapshot to establish a recovery line
- The domino effect: with uncoordinated checkpoints, a rollback can cascade across processes, past many checkpoints
We need coordinated checkpointing instead of individual checkpointing. A simple solution is a two-phase blocking protocol:
- The coordinator broadcasts a CHECKPOINT_REQ message
- Processes receiving CHECKPOINT_REQ create a local checkpoint, queue messages from the application, and block until they receive CHECKPOINT_DONE
- The coordinator sends CHECKPOINT_DONE after receiving acks from everyone
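The two phases can be sketched from the coordinator's point of view, with processes modeled as plain dicts (all names are illustrative):

```python
def coordinated_checkpoint(processes):
    """Two-phase blocking checkpoint: request, collect acks, release."""
    acks = []
    for p in processes:               # phase 1: broadcast CHECKPOINT_REQ
        p["checkpointed"] = True      # each process takes a local checkpoint,
        p["blocked"] = True           # queues application messages, and
        acks.append(p["pid"])         # acks back to the coordinator
    if len(acks) == len(processes):   # phase 2: all acks received
        for p in processes:
            p["blocked"] = False      # CHECKPOINT_DONE: resume normal operation
    return acks
```

Blocking the application between the two phases is what keeps messages from crossing the checkpoint line, so the set of local checkpoints forms a consistent global state.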
Techniques used to reduce the number of checkpoints:
- Message logging
  - Can lead to orphan processes
- Pessimistic logging protocols: ensure that for each non-stable message there is at most one process depending on it
- Optimistic logging protocols: any orphan process depending on some message is rolled back until it no longer depends on that message