Distributed Systems COMP 212. Lecture 19 Othon Michail


Fault Tolerance 2/31

What is a Distributed System? 3/31

Distributed vs Single-machine Systems
A key difference: partial failures
- One component fails
- Some components are affected
- Others are not
In single-machine systems, failures are often total: the whole system is affected.
Important and challenging goal: design distributed systems that can mask partial failures and automatically recover from them without seriously affecting overall performance. This is the concept of fault tolerance. 4/31

Basic Concepts
Fault tolerance is closely related to the notion of dependability. For a DS to be dependable, it should satisfy:
- Availability: the system is ready to be used immediately
- Reliability: the system can run continuously without failure
- Safety: if the system fails, nothing catastrophic will happen
- Maintainability: when the system fails, it can be repaired easily and quickly (and, sometimes, without its users noticing the failure)
- Security: the system is protected from various threats 5/31

But, What Is a Failure?
A definition: a system is said to fail when it cannot meet its promises.
Failures are caused by errors in the system:
- a damaged packet
- a lost packet
- a process not responding
- a server sending unexpected responses
Errors, in turn, are caused by faults. We often detect an error but cannot easily say what caused it:
- a bad transmission medium damaging packets
- an overheated CPU making a server behave in unexpected ways
A fault-tolerant system can provide its services even in the presence of faults. 6/31

Main Types of Faults
- Transient fault: occurs once and then disappears. Example: a bird flying through the beam of a microwave transmitter; some bits might get lost, but a retransmission will probably work.
- Intermittent fault: may reappear again and again. Example: a loose contact on a connector.
- Permanent fault: continues to exist until the faulty component is replaced. Examples: burnt-out chips, software bugs, disk head crashes. 7/31

Main Types of Failures
- Crash failure: a server halts, but was working correctly until it halted
- Omission failure: a server fails to respond to incoming requests
  - Receive omission: a server fails to receive incoming messages
  - Send omission: a server fails to send outgoing messages
- Timing failure: a server's response lies outside the specified time interval
- Response failure: a server's response is incorrect
  - Value failure: the value of the response is wrong
  - State-transition failure: the server deviates from the correct flow of control
- Arbitrary (aka Byzantine) failure: a server may produce arbitrary responses at arbitrary times (even maliciously) 8/31

Failure Masking by Redundancy
Strategy: if we cannot avoid failures, we had better hide them from other processes and/or users using redundancy. Three main types:
1. Information redundancy: add extra bits to allow for error detection/recovery (e.g., parity bits, Hamming codes)
2. Time redundancy: perform an operation and, if required, perform it again; well suited to transient and intermittent faults
3. Physical redundancy: add extra (duplicate) hardware and/or software components to the system (replication) 9/31
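Information redundancy can be illustrated with the simplest possible code: a single even-parity bit, which detects (but cannot correct) any one-bit error. A minimal sketch, with illustrative function names; real systems use stronger codes such as Hamming codes or CRCs:

```python
# Information redundancy sketch: one even-parity bit detects a 1-bit error.
# Illustrative only; real systems use Hamming codes, CRCs, etc.

def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def check_parity(word):
    """Return True if the word (data + parity bit) is consistent."""
    return sum(word) % 2 == 0

data = [1, 0, 1, 1]
word = add_parity(data)          # -> [1, 0, 1, 1, 1]
assert check_parity(word)        # no error: parity checks out

word[2] ^= 1                     # a fault flips one bit in transit
assert not check_parity(word)    # the error is detected (not corrected)
```

A single parity bit cannot say which bit flipped; codes with more redundant bits (Hamming) can also locate and correct the error, which is what makes forward recovery of damaged packets possible.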

Example: Physical Redundancy
A ubiquitous technique: nature (two eyes, ears, lungs), aircraft engines, sports referees.
An example from circuits is triple modular redundancy: each device is replicated three times. If at least two of a voter's three inputs agree, its output equals that input (majority wins). Note that a voter itself might be faulty; hence there is a separate set of voters at each stage. 10/31
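The voting step of triple modular redundancy can be sketched in a few lines: run three replicas of the same computation and output the majority of their results. The `tmr` function and the deliberately faulty replica below are illustrative, not part of the lecture material:

```python
# Triple modular redundancy sketch: three replicas compute, a voter takes
# the majority of their outputs, masking one faulty replica.

from collections import Counter

def tmr(replicas, x):
    """Apply each replica to x and return the majority output."""
    outputs = [f(x) for f in replicas]
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

good = lambda x: x * 2
faulty = lambda x: x * 2 + 1     # a replica with a permanent fault

assert tmr([good, good, good], 5) == 10
assert tmr([good, faulty, good], 5) == 10   # one fault is masked
```

With one faulty replica the fault is masked; with two, no majority exists, which mirrors why TMR tolerates exactly one failure per stage.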

DS Fault Tolerance Topics
- Process resilience: protecting against process failure by replicating processes into groups. Design issue: how to achieve agreement within a group when not all members can be trusted?
- Reliable client/server communication: masking crash and omission failures
- Reliable group communication: e.g., what happens if a process joins the group during communication?
- Distributed COMMIT: an operation is performed by all group members, or none at all
- Recovery strategies: recovery from an error is fundamental to fault tolerance 11/31

Process Resilience
Key approach to tolerating a faulty process: organise several identical processes into a group.
Key property of groups: a message sent to the group is received by all copies of the process (the group members). If one process fails, some other process will hopefully still be able to function (and service any pending request or operation).
Analogy: sending an email to a department's address, which is automatically forwarded to a number of people so that whoever is available will handle it. 12/31

Flat vs. Hierarchical Groups
a) Communication in a flat group: all processes are equal and decisions are made collectively.
   Note: no single point of failure. However: decision making is complicated, as consensus is required.
b) Communication in a simple hierarchical group: one of the processes is elected coordinator, which selects another process (a worker) to perform the operation.
   Note: single point of failure. However: decisions are made easily and quickly by the coordinator, without first having to reach consensus. 13/31

Reliable Client/Server Comms.
In addition to process failures, a communication channel may exhibit crash, omission, timing, and/or arbitrary failures. In practice, the focus is on masking crash and omission failures.
For example, point-to-point TCP masks omission failures by guarding against lost messages using ACKs and retransmissions. However, it performs poorly when a crash occurs (although a DS may try to mask a TCP crash by automatically re-establishing the lost connection). 14/31
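The ACK-and-retransmit idea can be sketched without a real network: a simulated channel drops the first few transmissions, and the sender retries until it gets an acknowledgement or exhausts its retry budget (at which point it presumes the peer crashed). The channel and retry limit below are illustrative, not TCP itself:

```python
# Retransmission sketch, in the spirit of TCP's ACK/timeout mechanism.
# The channel is simulated: it drops the first two transmissions.

attempts = {"n": 0}

def lossy_send(msg):
    """Simulated channel: returns an ACK, or None if the message is lost."""
    attempts["n"] += 1
    if attempts["n"] <= 2:
        return None                  # omission failure: message lost
    return ("ACK", msg)

def reliable_send(msg, max_retries=10):
    """Retransmit until an ACK arrives or the retry budget runs out."""
    for attempt in range(max_retries):
        ack = lossy_send(msg)
        if ack is not None:
            return attempt + 1       # transmissions needed
    raise TimeoutError("peer presumed crashed")

assert reliable_send("hello") == 3   # two losses were masked by retries
```

Note the limits of the technique: retries mask omission failures, but when the retry budget is exhausted the sender can only conclude that the peer crashed, which is exactly where TCP's own masking gives up.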

Reliable Group Communication
Reliable multicast services guarantee that all messages are delivered to all members of a process group. This sounds simple, but is surprisingly tricky, as multicasting services tend to be inherently unreliable.
For a small group, multiple reliable point-to-point channels will do the job; however, such a solution scales poorly as the group membership grows.
Also: what happens if a process joins the group during communication? Worse: what happens if the sender of the multiple reliable point-to-point channels crashes halfway through sending the messages? 15/31

What is Reliable Group Communication?
Simplest case: all processes operate correctly, and no process joins or leaves the group. Then the message should be delivered to every current group member.
When processes may fail: all non-faulty members must receive the message, which requires agreeing on what the group looks like before a message can be delivered (plus other constraints). 16/31

Atomic Multicasting
There is often a requirement that the system ensure all processes get a message, or that none of them gets it. An additional requirement is that all messages arrive at all processes in the same sequential order. This is known as the atomic multicast problem. 17/31
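One common way to realise the ordering half of this requirement (a technique not spelled out on the slide) is a central sequencer: every message gets a sequence number, and each member delivers messages strictly in sequence order, holding back any that arrive early. The sketch below shows only this ordering mechanism, ignoring failures and membership changes; the `Sequencer`/`Member` classes are illustrative:

```python
# Ordering sketch for atomic multicast: a sequencer stamps messages, and
# members deliver strictly in stamp order, so all members see the same order.
# Failures and membership changes are deliberately ignored here.

import heapq

class Sequencer:
    def __init__(self):
        self.next_seq = 0
    def stamp(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return (seq, msg)

class Member:
    def __init__(self):
        self.pending = []        # min-heap of messages that arrived early
        self.expected = 0        # next sequence number to deliver
        self.delivered = []
    def receive(self, stamped):
        heapq.heappush(self.pending, stamped)
        while self.pending and self.pending[0][0] == self.expected:
            _, msg = heapq.heappop(self.pending)
            self.delivered.append(msg)
            self.expected += 1

seq = Sequencer()
a, b = Member(), Member()
m1, m2 = seq.stamp("update-1"), seq.stamp("update-2")
a.receive(m1); a.receive(m2)     # arrives in order
b.receive(m2); b.receive(m1)     # arrives out of order: m2 is held back
assert a.delivered == b.delivered == ["update-1", "update-2"]
```

The "all or none" half of atomicity is the harder part and needs a commit-style agreement among the members, which is where the distributed commit protocols below come in.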

Example
Consider a replicated database constructed on top of a distributed system: the DS offers reliable multicasting, and the DB is a group of processes, one process per replica.
Suppose a replica crashes and then recovers. Some updates may have been missed; bringing the replica back requires knowing which updates were missed and in what order (costly).
With atomic multicasting: updates are only sent to the processes that are alive, and before the crashed replica rejoins the group it has to explicitly synchronise its state. 18/31

Distributed COMMIT
General goal: we want an operation to be performed by all group members, or none at all. (In atomic multicasting, the operation is the delivery of the message.)
There are three types of commit protocol:
- single-phase commit
- two-phase commit
- three-phase commit 19/31

The Two-Phase Commit Protocol
Summarised: GET READY, OK, GO AHEAD.
1. The coordinator sends a VOTE_REQUEST message to all group members.
2. Each group member returns VOTE_COMMIT if it can commit locally, otherwise VOTE_ABORT.
3. The coordinator collects all votes. If every group member voted to commit, it sends GLOBAL_COMMIT; if any member voted to abort, it sends GLOBAL_ABORT.
4. The group members then COMMIT or ABORT based on the last message received from the coordinator. 20/31
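The message flow above can be sketched with in-process objects standing in for the participants. This is the failure-free path only: timeouts, logging, and crash recovery (the hard parts discussed next) are omitted, and the class and message names are illustrative:

```python
# Two-phase commit sketch (failure-free path only).
# Phase 1: VOTE_REQUEST / votes.  Phase 2: global decision.

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "INIT"
    def vote(self):                          # phase 1: answer VOTE_REQUEST
        self.state = "READY"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"
    def decide(self, decision):              # phase 2: act on global decision
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"

def coordinator(participants):
    votes = [p.vote() for p in participants]         # collect all votes
    decision = ("GLOBAL_COMMIT"
                if all(v == "VOTE_COMMIT" for v in votes)
                else "GLOBAL_ABORT")
    for p in participants:                           # inform everyone
        p.decide(decision)
    return decision

group = [Participant(True), Participant(True), Participant(True)]
assert coordinator(group) == "GLOBAL_COMMIT"

group = [Participant(True), Participant(False)]      # one member votes abort
assert coordinator(group) == "GLOBAL_ABORT"
```

The danger window is visible even in this sketch: after `vote()` a participant sits in state READY, and if the coordinator disappears before `decide()` reaches it, the participant is blocked, which is exactly the problem the next two slides discuss.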

Problems with Failures
Participants wait for messages. What if the sender has crashed? Timeouts are used: e.g., if some participants do not reply to VOTE_REQUEST within a time limit, the coordinator sends GLOBAL_ABORT to all participants.
But what if a participant P sent VOTE_COMMIT and did not get a message from the coordinator within the limit? The coordinator may have crashed after asking P for a vote: either immediately, or while asking the others, or while informing participants of the global decision. P cannot simply abort! 21/31

Blocking in Two-Phase Commit
Either wait until the coordinator recovers (and informs everyone of the global decision again), or try to get the decision from a neighbour:
- If another process Q was never asked to vote, ABORT.
- If Q committed, COMMIT.
- But what if Q is also waiting for the final decision? If the coordinator crashed just after asking everyone for votes, no decision can be reached before the coordinator recovers. 22/31

Recovery Strategies
Once a failure has occurred, it is essential that the process where the failure happened recovers to a correct state. Recovery from an error is fundamental to fault tolerance. Two main forms of recovery:
1. Backward recovery: return the system to some previous correct state (using checkpoints), then continue executing.
2. Forward recovery: bring the system into a new correct state, from which it can then continue to execute. 23/31
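Backward recovery can be sketched as "save a checkpoint, and on error restore it and carry on". The counter, the deep-copied checkpoint, and the deliberately failing operation below are all illustrative:

```python
# Backward-recovery sketch: checkpoint the state, and on error roll back
# to the last checkpoint and continue from there.

import copy

class CheckpointedCounter:
    def __init__(self):
        self.state = {"total": 0}
        self.checkpoint = copy.deepcopy(self.state)
    def take_checkpoint(self):
        """Record the current state as the last known-correct state."""
        self.checkpoint = copy.deepcopy(self.state)
    def apply(self, delta):
        """An operation that fails on bad input (our stand-in for an error)."""
        if delta < 0:
            raise ValueError("faulty operation")
        self.state["total"] += delta
    def rollback(self):
        """Backward recovery: return to the last checkpoint."""
        self.state = copy.deepcopy(self.checkpoint)

c = CheckpointedCounter()
c.apply(5)
c.take_checkpoint()              # last known-correct state: total == 5
try:
    c.apply(-1)                  # an error occurs
except ValueError:
    c.rollback()                 # roll back and continue
assert c.state["total"] == 5
```

The sketch also makes the costs discussed on the next slides concrete: every `take_checkpoint` copies the whole state, and any work done after the checkpoint is lost on rollback.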

Forward and Backward Recovery (1)
Disadvantage of forward recovery: in order to work, all potential errors need to be accounted for up front. When an error occurs, the recovery mechanism must already know how to bring the system forward to a correct state. 24/31

Forward and Backward Recovery (2)
Disadvantages of backward recovery:
- Checkpointing can be very expensive, especially when errors are very rare.
- There is no guarantee that we won't meet the same error again.
- Some operations cannot be rolled back.
Despite the cost, backward recovery is implemented more often. 25/31

Recovery Example
Consider reliable communications as an example:
- Retransmission of a lost/damaged packet is an example of a backward recovery technique.
- Recovery of a damaged packet from the others (e.g., using Hamming codes) is an example of a forward recovery technique. 26/31

Backward Recovery and Logging
Many fault-tolerant DSs combine checkpointing with logging:
- Sender-based logging: log a message before sending it.
- Receiver-based logging: log a message before executing it. 27/31
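Receiver-based logging can be sketched in a few lines: every message is appended to stable storage before it is executed, so a recovering process can rebuild its state by replaying the log. The list standing in for stable storage and the toy "execute" step are illustrative:

```python
# Receiver-based logging sketch: log each message to stable storage
# before executing it, then replay the log to recover after a crash.

stable_log = []                      # stands in for stable storage

def deliver(msg, state):
    """Log the message first, then execute it against the state."""
    stable_log.append(msg)           # 1. log before executing
    state.append(msg.upper())        # 2. execute (a toy state update)
    return state

state = []
for m in ["a", "b", "c"]:
    deliver(m, state)

# Crash: volatile state is lost. Recover by replaying the stable log.
recovered = []
for m in stable_log:
    recovered.append(m.upper())
assert recovered == state == ["A", "B", "C"]
```

Replaying only works because the execute step is deterministic; combining the log with periodic checkpoints (as on the slide) keeps the replay short.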

Checkpointing
Independent checkpointing may lead to inconsistencies and cascaded rollback (the domino effect): if P2 has received message m but P1 has no record of sending it, both processes must roll back to earlier states. 28/31

Coordinated Checkpointing
All processes synchronise to jointly write their state to local stable storage, so there are no cascaded rollbacks.
Algorithm: two-phase blocking protocol
1. The coordinator multicasts CHECKPOINT_REQUEST.
2. When a process receives this message, it takes a local checkpoint, stops accepting incoming messages, queues outgoing messages handed to it by the application, and acknowledges the coordinator.
3. When everybody has acknowledged, the coordinator sends CHECKPOINT_DONE (unblocking the processes). 29/31
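The two-phase blocking protocol above can be sketched with in-process objects: the coordinator asks every process to checkpoint and block, and unblocks them once all acknowledgements are in. Everything here is in-process and failure-free, and the names are illustrative:

```python
# Two-phase blocking checkpoint sketch: request + ack, then done.

class Process:
    def __init__(self, state):
        self.state = state
        self.blocked = False
        self.saved = None
    def checkpoint_request(self):
        self.saved = self.state       # take a local checkpoint
        self.blocked = True           # stop accepting application messages
        return "ACK"                  # acknowledge the coordinator
    def checkpoint_done(self):
        self.blocked = False          # resume normal operation

def coordinated_checkpoint(processes):
    # Phase 1: multicast CHECKPOINT_REQUEST and collect acknowledgements.
    acks = [p.checkpoint_request() for p in processes]
    assert all(a == "ACK" for a in acks)
    # Phase 2: everybody acknowledged; multicast CHECKPOINT_DONE.
    for p in processes:
        p.checkpoint_done()

procs = [Process(i) for i in range(3)]
coordinated_checkpoint(procs)
assert all(p.saved == p.state and not p.blocked for p in procs)
```

Because every process is blocked between its checkpoint and CHECKPOINT_DONE, no message can be received after one checkpoint but sent before another, which is precisely what rules out the domino effect of independent checkpointing.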

Summary (1 of 2)
Fault tolerance: the characteristic by which a system can mask the occurrence of failures and recover from them. A system is fault tolerant if it can continue to operate even in the presence of failures.
Types of failures:
- Crash (system halts)
- Omission (incoming request ignored)
- Timing (responding too soon or too late)
- Response (the answer or the flow of control is wrong)
- Arbitrary/Byzantine (indeterminate, unpredictable) 30/31

Summary (2 of 2)
Fault tolerance is generally achieved through redundancy and reliable multicasting protocols. Processes, client/server communication, and group communication can all be enhanced to tolerate faults in a distributed system. Commit protocols allow for fault-tolerant multicasting, with two-phase commit the most popular type. Recovery from errors within a distributed system tends to rely heavily on backward recovery techniques that employ some type of checkpointing or logging mechanism, although forward recovery is also possible. 31/31