Failure Tolerance. Distributed Systems Santa Clara University


Distributed Checkpointing

Distributed Checkpointing
Capture the global state of a distributed system.
Chandy and Lamport: distributed snapshot, which reflects a consistent global state.
If process P has received a message from Q, then the global state should show that process Q sent that message to P.

Distributed Checkpointing
The global state is represented by a cut. Consistent cuts:
- Messages shown received are shown sent.
- Messages shown sent are either received or in transit.

Distributed Checkpointing

Distributed Checkpointing
Represent the distributed system as a system of processes connected by unidirectional point-to-point channels.

Distributed Checkpointing
Distributed snapshot:
- Any process can start a snapshot.
- The initiating process P records its own state.
- P sends a marker along all of its outgoing channels.
- Process Q, upon receiving its first marker: records its state, sends a marker to all of its neighbors, and starts recording on all incoming channels.
- Process Q, upon receiving a subsequent marker: stops recording on the channel on which the marker arrived.
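The marker rules above can be sketched as a small simulation. This is a minimal illustration, not real snapshot code: the `MARKER` sentinel, the `Process` class, the channel names, and the two-process driver are all assumptions made for the example.

```python
# Simulation of the Chandy-Lamport marker rules for two processes P and Q
# connected by FIFO channels P->Q and Q->P (plain lists used as queues).
MARKER = object()

class Process:
    def __init__(self, name, state=0):
        self.name = name
        self.state = state              # application state (a running sum here)
        self.recorded_state = None      # set once the snapshot rule fires
        self.recording = {}             # channel -> messages recorded so far
        self.channel_logs = {}          # channel -> finished channel recording

    def start_snapshot(self, outgoing):
        """Initiator rule: record own state, send a marker on every channel."""
        self.recorded_state = self.state
        for ch in outgoing:
            ch.append((self.name, MARKER))

    def receive(self, channel, msg, outgoing):
        sender, payload = msg
        if payload is MARKER:
            if self.recorded_state is None:
                # First marker: record state, relay markers, and from now on
                # record every other incoming channel; this one is empty.
                self.recorded_state = self.state
                for ch in outgoing:
                    ch.append((self.name, MARKER))
                self.channel_logs[channel] = []
            else:
                # Subsequent marker: stop recording on this channel.
                self.channel_logs[channel] = self.recording.pop(channel, [])
        else:
            if self.recorded_state is not None and channel not in self.channel_logs:
                self.recording.setdefault(channel, []).append(payload)
            self.state += payload       # normal application processing

pq, qp = [], []                         # channel P->Q and channel Q->P
P, Q = Process("P"), Process("Q")
pq.append(("P", 5))                     # application message already in transit
P.start_snapshot([pq])                  # P initiates the snapshot

while pq:                               # Q drains P->Q: data first, then marker
    Q.receive("P->Q", pq.pop(0), [qp])
while qp:                               # P drains Q->P: Q's relayed marker
    P.receive("Q->P", qp.pop(0), [pq])

# Consistent cut: P recorded state 0, Q recorded state 5 (after the receive),
# and both channels were recorded as empty.
assert (P.recorded_state, Q.recorded_state) == (0, 5)
```

Because the data message was in the channel before P's marker, Q received it before recording, so the cut shows the receive together with the send, exactly the consistency condition from the earlier slide.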

Distributed Checkpointing
Process Q, upon receiving the last marker: sends its own state and the messages recorded on the monitored channels to the initiating process.

Distributed Checkpointing

Distributed Checkpointing
Termination detection: use the snapshot protocol.
- If Q receives a marker for the first time, the sending process becomes its predecessor.
- When Q is done with the snapshot, it sends a DONE message to its predecessor.
- This still allows for messages in transit.

Distributed Checkpointing
Termination detection needs a snapshot in which all channels are empty. Q returns DONE only if:
- all of Q's successors have returned a DONE message, and
- Q has not received any message between the point at which it recorded its state and the point at which it received the marker along each of its incoming channels.
In all other cases, Q sends a CONTINUE message.

Distributed Checkpointing
Termination detection: when the initiating process receives only DONE messages, no regular messages are in transit, and thus the computation has terminated.
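The DONE/CONTINUE rule above can be sketched as a check over the marker-induced predecessor tree. The `successors` map and the per-process `quiet` flag ("saw no message between recording its state and receiving all markers") are illustrative assumptions.

```python
# Sketch of the termination-detection rule: a process answers DONE only if
# all of its snapshot successors answered DONE and it was quiet itself.
def answer(proc, successors, quiet):
    """Return "DONE" or "CONTINUE" for process `proc`."""
    children = [answer(s, successors, quiet) for s in successors.get(proc, [])]
    if quiet[proc] and all(c == "DONE" for c in children):
        return "DONE"                   # subtree is silent: nothing in transit
    return "CONTINUE"                   # something was still in flight

successors = {"init": ["Q", "R"], "Q": ["S"]}
quiet = {"init": True, "Q": True, "R": True, "S": True}
assert answer("init", successors, quiet) == "DONE"      # computation terminated

quiet["S"] = False                      # S received a message in the window
assert answer("init", successors, quiet) == "CONTINUE"  # run another round
```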

Failure Types
Dependability consists of:
- Availability: the system is ready to be used.
- Reliability: the system can run continually without failure.
- Safety: in a failure condition, nothing catastrophic happens.
- Maintainability: how easily a failed system can be repaired.

Failure Types
Dependability examples:
- A system that breaks down for a millisecond every hour: availability > 99.9999%, but reliability is low.
- A system that breaks down only for two weeks every July: availability ~96%, but reliability is high.
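The slide's two figures follow directly from availability = 1 - downtime / total time; a quick check of the arithmetic:

```python
# Checking the slide's availability figures. The downtimes are the slide's
# own examples; only the arithmetic is added here.
hour_ms = 3600 * 1000
avail_millisecond = 1 - 1 / hour_ms     # down 1 ms every hour
assert avail_millisecond > 0.999999     # > 99.9999 %, yet reliability is low

avail_july = 1 - 14 / 365               # down two weeks every year
assert 0.95 < avail_july < 0.97         # ~96 %, yet reliability is high
```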

Failure Types
- Failure: the system cannot meet its promises.
- Error: the part of the system state that may lead to a failure.
- Fault: the cause of an error.

Failure Types
Transient faults occur once and then disappear; if the operation is repeated, the fault goes away. Example: a bird flies through the beam of a microwave transmitter (and possibly gets roasted).

Failure Types
Intermittent fault: the fault occurs, goes away, then returns.

Failure Types
Permanent fault: the fault appears and continues to exist until the faulty component is repaired.

Failure Types
- Crash failure: the server halts, but it was working correctly until it halted.
- Omission failure: a server fails to respond to incoming messages.
  - Receive omission: the server fails to receive incoming messages.
  - Send omission: the server fails to send messages.

Failure Types
- Timing failure: a server's response lies outside the specified time interval.
- Response failure: a server's response is incorrect.
  - Value failure: the value of the response is wrong.
  - State-transition failure: the server deviates from the correct flow of control.
- Arbitrary / Byzantine failure: a server may produce arbitrary responses at arbitrary times.

Failure Types
- Fail-stop failure: the server stops producing output, and others can detect this state.
- Fail-silent failure: the server stops producing output, but others cannot distinguish this from a server that is merely slow.
- Fail-safe failure: the server acts arbitrarily, but other servers can recognize its output as false.

Failure Masking
Failure masking by redundancy:
- Erasure-correcting codes
- Replication

Failure Masking Triple Modular Redundancy
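Triple Modular Redundancy can be sketched as three replicas feeding a majority voter, so a single faulty replica is outvoted. The replica functions below are illustrative assumptions.

```python
# A majority voter for Triple Modular Redundancy: three replicas compute the
# same function and the voter takes the majority output.
from collections import Counter

def tmr_vote(replicas, x):
    outputs = [f(x) for f in replicas]
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:                       # no majority: more than one replica bad
        raise RuntimeError("TMR cannot mask more than one fault")
    return value

good = lambda x: x * x
faulty = lambda x: x * x + 1            # one replica computes a wrong value
assert tmr_vote([good, good, faulty], 3) == 9    # the single fault is masked
```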

Process Resilience
Organize processes into groups. Groups can be dynamic (they run membership protocols) or hierarchical.

Process Resilience

Process Resilience
Leader election: bully algorithm. The process with the highest ID wins.
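The bully election can be sketched as follows: the process that notices the coordinator's failure challenges every higher-ID process; if none is alive it wins, otherwise a responder takes over the election. The `alive` table is an illustrative assumption.

```python
# Sketch of the bully algorithm: the highest alive ID ends up as coordinator.
def bully_election(initiator, processes, alive):
    """Return the ID of the elected coordinator (highest alive ID)."""
    higher = [p for p in processes if p > initiator and alive[p]]
    if not higher:
        return initiator                # nobody higher answered: initiator wins
    return bully_election(min(higher), processes, alive)   # responder takes over

procs = [1, 2, 3, 4, 5]
alive = {1: True, 2: True, 3: True, 4: True, 5: False}     # old leader 5 crashed
assert bully_election(2, procs, alive) == 4                # highest alive ID wins
```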

Process Resilience Leader Election using a ring
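Ring-based election can be sketched in the same style: an ELECTION message circulates around the ring, every alive process appends its ID, and when the message returns to the initiator the highest collected ID becomes coordinator. The ring layout and `alive` table are illustrative assumptions.

```python
# Sketch of ring election: one trip around the ring collects the alive IDs.
def ring_election(ring, initiator, alive):
    collected = []
    start = ring.index(initiator)
    for step in range(len(ring)):       # one full trip around the ring
        p = ring[(start + step) % len(ring)]
        if alive[p]:
            collected.append(p)         # dead processes are skipped over
    return max(collected)               # back at the initiator: highest ID wins

ring = [3, 6, 1, 5, 2]
alive = {3: True, 6: False, 1: True, 5: True, 2: True}
assert ring_election(ring, 3, alive) == 5
```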

Process Resilience Agreement in Faulty Systems

Process Resilience
Byzantine generals problem: in the presence of Byzantine failures, the processes can only decide on a single value if more than 2/3 of the participants are non-faulty.

Process Resilience
Byzantine generals problem, Lamport's algorithm: each process has to share a value with all others, but faulty processes can lie and misrepresent their value. Goal: all processes accept the values from the non-faulty processes.

Process Resilience
Lamport's algorithm (1982):
- Each process sends its value to all other processes.
- The received values are gathered into vectors.
- Each process sends its vector to everybody else.
- Every process accepts the values that have a majority.
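The final step can be sketched as majority voting per slot over the gathered vectors. The concrete values and the traitor's behavior below are illustrative assumptions; with n = 4 and one traitor, more than 2/3 of the participants are non-faulty, as the earlier slide requires.

```python
# Sketch of the accept step: per process slot, keep the majority value.
from collections import Counter

def accept_majorities(gathered):
    """gathered[i][j] = value that process i ended up holding for process j."""
    n = len(gathered)
    decided = []
    for j in range(n):
        column = [gathered[i][j] for i in range(n)]
        value, count = Counter(column).most_common(1)[0]
        decided.append(value if count > n // 2 else None)   # majority or unknown
    return decided

# Three loyal processes (values 1, 2, 3) and one traitor (index 3) that
# reports different values to different peers.
gathered = [
    [1, 2, 3, 9],
    [1, 2, 3, 7],
    [1, 2, 3, 8],
    [0, 0, 0, 5],   # the traitor's arbitrary vector
]
assert accept_majorities(gathered)[:3] == [1, 2, 3]   # loyal values survive
```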

Process Resilience

Process Resilience

Reliable Group Communication
Problem: how to get messages to the members of a process group (reliable multicasting). Without process failures, the problem assumes that there is a join-and-leave protocol for processes. Often, members must receive messages in exactly the same order.

Reliable Group Communication
There is a simple solution if all receivers are known and assumed not to fail.

Reliable Group Communication
Tradeoffs:
- Explicit retransmission requests, or retransmission when acks are missing.
- Use multicast or point-to-point transmission for retransmissions.
- Use piggybacking in order to save network bandwidth.

Reliable Group Communication
Scalability in reliable multicasting: the simple scheme cannot support large numbers of receivers. Optimization: get rid of acks and only send retransmission requests. But then it is difficult to know when messages can be removed from the history buffer; use cumulative acks.

Reliable Group Communication
Scalability in reliable multicasting: feedback suppression, implemented in Scalable Reliable Multicasting (SRM) by Floyd (97).
- Never ack the receipt of messages.
- Whenever a process sends a retransmission request (NACK), it multicasts it to everyone.
- Receivers that see this multicast suppress their own NACK message.
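The suppression idea can be sketched with randomized NACK timers: every receiver missing a message schedules a NACK after a random delay, the earliest timer fires and is multicast, and everyone else cancels. The delay model is an illustrative assumption.

```python
# Sketch of SRM-style feedback suppression with randomized NACK timers.
import random

def nack_round(missing, rng):
    delays = {r: rng.random() for r in missing}    # scheduled NACK timers
    first = min(delays, key=delays.get)            # fires first, multicasts NACK
    suppressed = [r for r in missing if r != first]
    return first, suppressed                       # only one NACK on the wire

rng = random.Random(42)
firing, suppressed = nack_round(["A", "B", "C", "D"], rng)
assert len(suppressed) == 3                        # three NACKs were suppressed
```

This is exactly why the next slide's caveat matters: if the timers are scheduled poorly, several timers expire before the first NACK propagates, and the suppression is lost.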

Reliable Group Communication

Reliable Group Communication
Feedback suppression scales reasonably well. Problems:
- Receivers need to schedule their feedback messages accurately; otherwise, too many will send out their NACK anyway.
- Feedback still interrupts processes that already received the message. One could form a separate multicast group for those that have not received it, but that is difficult to do over a wide-area network.

Reliable Group Communication Hierarchical Feedback Control

Reliable Group Communication
Atomic multicast (in the presence of failures): make a distinction between receiving and delivering a message.

Reliable Group Communication
Each message is associated with a group view: the processes on the delivery list. Changes in group membership are announced by a group-view-change message. Problem: a message based on the old group view needs to be delivered before the view-change message is delivered.

Reliable Group Communication
Virtual synchrony: reliable multicast where a message multicast to a group view G is delivered to all non-faulty processes in G.

Reliable Group Communication

Reliable Group Communication
This gives several possibilities for ordering:
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts
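One of these orderings, FIFO-ordered delivery, can be sketched with per-sender sequence numbers: messages from the same sender are delivered in send order, and gaps are buffered. The class and message shapes are illustrative assumptions.

```python
# Sketch of FIFO-ordered delivery with per-sender sequence numbers.
class FifoReceiver:
    def __init__(self):
        self.expected = {}              # per-sender next sequence number
        self.buffer = {}                # (sender, seq) -> payload
        self.delivered = []

    def receive(self, sender, seq, payload):
        self.buffer[(sender, seq)] = payload
        nxt = self.expected.get(sender, 0)
        while (sender, nxt) in self.buffer:         # deliver any ready run
            self.delivered.append(self.buffer.pop((sender, nxt)))
            nxt += 1
        self.expected[sender] = nxt

r = FifoReceiver()
r.receive("P", 1, "b")                  # arrives out of order: buffered
r.receive("P", 0, "a")                  # gap filled: both delivered in order
assert r.delivered == ["a", "b"]
```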

Reliable Group Communication
Virtually synchronous reliable multicasting with totally-ordered delivery of messages is called atomic multicasting.

Reliable Group Communication
ISIS: implementing atomic multicast.
- Built on TCP as reliable point-to-point communication.
- Assumes that messages sent out by a sender arrive in that order (a TCP property).
- Multicasting a message within a group view is the same as sending individual messages to all members of the group.

Reliable Group Communication
- Processes keep a message m until they know that every other process has received m; in that case m is stable.
- ONLY STABLE MESSAGES ARE DELIVERED. This is also true for view-change messages.
- Forwarding of messages guarantees that a message delivered to one non-faulty process is received by everyone in the group: any process can be required to send the message to all members of the group.

Reliable Group Communication

Reliable Group Communication
Processing a group change: a process receives the group-change message, forwards any unstable messages for the old group to all processes in the new group, and marks them as stable. ISIS/TCP assumes that these messages are never lost; all messages to the old group received by one process are therefore guaranteed to be received by all non-faulty processes in the old group.

Reliable Group Communication
When process P no longer has unstable messages, it multicasts a flush message to the new group. When P has received flush messages from all members of the new group, it installs the new view.
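The view-installation condition above reduces to a simple predicate: a process may install the new view only after it has forwarded its unstable messages and has received a flush from every member of the new view. The function and parameter names are illustrative assumptions.

```python
# Sketch of the ISIS view-installation condition.
def may_install_view(new_view, flushes_received, unstable_forwarded):
    return unstable_forwarded and set(new_view) <= set(flushes_received)

assert not may_install_view({"P", "Q", "R"}, {"P", "Q"}, True)   # R not flushed
assert may_install_view({"P", "Q", "R"}, {"P", "Q", "R"}, True)  # install now
```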

Reliable Group Communication

Reliable Group Communication
When process Q receives a message sent to the old group:
- If Q still believes itself to be in the old group, it delivers the message (unless it has already received it and considers it a duplicate).
- If Q has already received the view-change message, it forwards any unstable messages and then sends a flush message to the new group.

Reliable Group Communication
More protocol is needed in order to deal with failures during a view change; details are in Birman's book or in the papers on ISIS.

Checkpointing
Recovery:
- Forward recovery: bring the system to a new, failure-free state.
- Backward recovery: bring the system back to an old, failure-free state and start over.

Checkpointing
Use a distributed snapshot to establish a recovery line.

Checkpointing
Domino effect: with uncoordinated checkpoints, a rollback can cascade from process to process until no consistent recovery line is found.

Checkpointing
We need to do coordinated checkpointing instead of individual checkpointing. A simpler solution is a two-phase blocking protocol:
- The coordinator broadcasts a CHECKPOINT_REQ message.
- Processes receiving CHECKPOINT_REQ create a local checkpoint, queue messages from the application, and block until they receive CHECKPOINT_DONE.
- The coordinator sends CHECKPOINT_DONE after receiving acks from everyone.
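The two-phase blocking protocol above can be sketched as follows. Message passing is simulated with direct method calls; the class and method names are illustrative assumptions.

```python
# Sketch of the two-phase blocking checkpoint protocol.
class Participant:
    def __init__(self):
        self.checkpointed = False
        self.blocked = False
        self.queued = []                # application messages held while blocked

    def on_checkpoint_req(self):
        self.checkpointed = True        # take the local checkpoint
        self.blocked = True             # start queueing application messages
        return "ACK"

    def on_checkpoint_done(self):
        self.blocked = False            # unblock and release queued messages
        released, self.queued = self.queued, []
        return released

def coordinated_checkpoint(participants):
    acks = [p.on_checkpoint_req() for p in participants]   # phase 1: broadcast
    assert all(a == "ACK" for a in acks)                   # wait for all acks
    for p in participants:                                 # phase 2: release
        p.on_checkpoint_done()

group = [Participant(), Participant(), Participant()]
coordinated_checkpoint(group)
assert all(p.checkpointed and not p.blocked for p in group)
```

Blocking between the two phases is what guarantees that no application message crosses the checkpoint line, so the local checkpoints together form a consistent global state.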

Checkpointing
Techniques used to reduce the number of checkpoints: message logging. This can lead to orphans.

Checkpointing
- Pessimistic logging protocols: ensure that for each non-stable message there is at most one process depending on it.
- Optimistic logging protocols: any orphan process depending on some message is rolled back until it no longer depends on the message.