Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Similar documents
Chapter 8 Fault Tolerance

Distributed Systems Fault Tolerance

Fault Tolerance. Basic Concepts

CSE 5306 Distributed Systems

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

Fault Tolerance. Distributed Systems. September 2002

CSE 5306 Distributed Systems. Fault Tolerance

Fault Tolerance. Chapter 7

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Software Systems. Definitions

Failure Tolerance. Distributed Systems Santa Clara University

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Today: Fault Tolerance

Distributed Systems COMP 212. Lecture 19 Othon Michail

Today: Fault Tolerance. Replica Management

Dep. Systems Requirements

Today: Fault Tolerance. Fault Tolerance

Networked Systems and Services, Fall 2018 Chapter 4. Jussi Kangasharju Markku Kojo Lea Kutvonen

Module 8 Fault Tolerance CS655! 8-1!

Fault Tolerance 1/64

Distributed Systems Principles and Paradigms

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Module 8 - Fault Tolerance

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Fault Tolerance. Fall 2008 Jussi Kangasharju

Chapter 8 Fault Tolerance

Today: Fault Tolerance. Failure Masking by Redundancy

Distributed Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

Fault Tolerance Dealing with an imperfect world

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms

MODELS OF DISTRIBUTED SYSTEMS

Consensus and related problems

MODELS OF DISTRIBUTED SYSTEMS

To do. Consensus and related problems. q Failure. q Raft

Implementation Issues. Remote-Write Protocols

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Last Class:Consistency Semantics. Today: More on Consistency

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems (ICE 601) Fault Tolerance

Basic Protocols and Error Control Mechanisms

Distributed Systems 24. Fault Tolerance

COMMUNICATION IN DISTRIBUTED SYSTEMS

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Issues in Programming Language Design for Embedded RT Systems

Distributed Systems 23. Fault Tolerance

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Fault Tolerance. Goals: transparent: mask (i.e., completely recover from) all failures, or predictable: exhibit a well defined failure behavior

Distributed Systems (5DV147)

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Eventual Consistency. Eventual Consistency

Middleware for Transparent TCP Connection Migration

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How

TSW Reliability and Fault Tolerance

Network Protocols. Sarah Diesburg Operating Systems CS 3430

Fault Tolerance Causes of failure: process failure machine failure network failure Goals: transparent: mask (i.e., completely recover from) all

C 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How.

Recall: Primary-Backup. State machine replication. Extend PB for high availability. Consensus 2. Mechanism: Replicate and separate servers

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

Distributed Systems COMP 212. Revision 2 Othon Michail

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems

RMI: Design & Implementation

Today: Fault Tolerance. Reliable One-One Communication

Consensus in Distributed Systems. Jeff Chase Duke University

Failures, Elections, and Raft

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Today CSCI Communication. Communication in Distributed Systems. Communication in Distributed Systems. Remote Procedure Calls (RPC)

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CSE 486/586 Distributed Systems

Causal Consistency and Two-Phase Commit

Distributed Operating Systems

Global atomicity. Such distributed atomicity is called global atomicity A protocol designed to enforce global atomicity is called commit protocol

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

02 - Distributed Systems

02 - Distributed Systems

PRIMARY-BACKUP REPLICATION

CS 347 Parallel and Distributed Data Processing

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Fault Tolerance in Distributed Systems: An Introduction

Dependability tree 1

Introduction to Distributed Systems Seif Haridi

CS 425 / ECE 428 Distributed Systems Fall 2017

DISTRIBUTED COMPUTER SYSTEMS

Paxos and Raft (Lecture 21, cs262a) Ion Stoica, UC Berkeley November 7, 2016

Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov

Byzantine Fault Tolerance

Paxos provides a highly available, redundant log of events

Distributed Systems Consensus

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms

Distributed Systems Exam 1 Review. Paul Krzyzanowski. Rutgers University. Fall 2016

INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad

Transcription:

Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1

Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed commit Recovery from failure 2

Basic Concepts Concept of partial failure A distributed system consists of components (processes, communication channels, etc) Partial failure takes place when a component fails Goal is to continue in the presence of partial failures. and to automatically recover from partial failures without seriously affecting the overall performance Fault tolerance is closely related to dependability: A dependable system must tolerate partial failures 3

Dependability Dependability includes the following requirements: 1. Availability The system is ready to be used immediately 2. Reliability A property that a system can run continuously without failure 3. Safety When a system temporarily fails to operate correctly, nothing catastrophic happens 4. Maintainability How easy a failed system can be repaired. 5. Security 4

Terminology A system fails when it is not living up to its specifications An error is a a part of system s state that may lead to a failure The cause of an error is called a fault. Dependable system is where you can manage faults Fault prevention: prevent the occurrence of a fault Fault tolerance: build a system that can provide its services in the presence of faults. Fault removal: reduce the presence, number, and seriousness of faults. Fault forecasting: estimate the present number, future incidence, and the consequences of faults 5

Transient faults Fault Types Occur once and then disappear. If the operation is repeated, the fault goes away Intermittent faults Occur, then vanishes of its own accord, then reappears, and so on Difficult to diagnose Permanent faults continues to exist until the faulty component is repaired 6

Failure Models Different types of failures. Type of failure Crash failure Omission failure Receive omission Send omission Timing failure Response failure Value failure State transition failure Arbitrary failure (Byzantine failure) Description A server halts, but is working correctly until it halts A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages A server's response lies outside the specified time interval The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control A server may produce arbitrary responses at arbitrary times 7

Crash Failures Problem: Clients cannot distinguish between a crashed and slow server Servers don t announce they are going to crash Fail-silent: The server exhibits omission or crash failures; the client cannot tell what went wrong Fail-stop: It is easy to detect that a server crashed; (e.g. it announces) Fail-safe: The server exhibits arbitrary, but benign failures (they can t do any harm; clients can easily understand they are junk). 8

Failure Masking by Redundancy Information redundancy: Extra bits are added to allow recovery from corrupted bits during transmission Time redundancy: An action is repeated if needed. Retransmitting Helpful when the faults are transient or intermittent Physical redundancy Extra equipment or processes are added to tolerate the loss or malfunctioning of some components 9

Triple Modular Redundancy. signal A B C voter A1 B1 C1 V1 V4 V7 A2 B2 C2 V2 V5 V8 A3 B3 C3 V3 V6 V9 10

Process Resilience Basic approach Organize several identical processes into a group; if a process in the group fails the others take over. Group of processes act as if they were a single entity Flat groups: All processes are equal; all decisions are made collectively No single point of failure; hard to implement May impose overhead as control is completely distributed Hierarchical groups: All communications are done through a single coordinator; easy to implement Loss of the coordinator brings the entire group to a halt 11

Flat Groups versus Hierarchical Groups a) Communication in a flat group. b) Communication in a simple hierarchical group 12

Groups & Failure Masking Assumption All requests arrive at all servers in the same order (atomic multicast problem) When a group can mask any k concurrent member failures, it is said to be k-fault tolerant (k is called degree of fault tolerance) Question How large should a k-fault tolerant group be? Answer partly depends on the fault type If k processes stop silently, then we need k+1 processes to provide k-fault tolerant system If processes exhibit Byzantine failures, we need 2k+1 processes to achieve k degree of fault tolerance if the client base its decisions on majority 13

Agreement in Faulty Systems (1) Problem: To reach an agreement in face of process failures For example, electing a coordinator, deciding whether or not to commit a transaction, dividing up tasks among workers, and synchronization The problem is more intricate since reaching an agreement needs majority of votes among correctly functioning processes. Another issue is to reach this consensus within a finite number of steps. Lamport proved that in a system with k faulty processes, agreement can be reached only if 2k+1 correctly functioning processes are present, for a total of 3k+1 14

Agreement in Faulty Systems (2) Byzantine generals problem: There are n blue army generals who want to attack a red army They want to exchange troop strengths After finite number of steps each general will have a vector of length n corresponding to all blue armies But, k generals are known to be traitors Traitors lies to everyone Traitors do not cooperate since it is unknown who the traitors are There is reliable point-to-point communication channels between two generals. 15

Agreement With Four Processes Traitor honest honest 2 1 1 y x 2 2 4 1 2 3 4 z 4 1 4 honest 1 Got(1,2,x,4) 2 Got(1,2,y,4) 3 Got(1,2,3,4) 4 Got(1,2,z,4) 1 Got (1,2,y,4) (1,b,c,d) (1,2,z,4) Result:(1,2,U,4) 2 Got (1,2,x,4) (e,2,g,h) (1,2,z,4) Result:(1, 2, U, 4) 4 Got (1,2,x,4) (1,2,y,4) (i,j,z,4) Result:(1, 2, U, 4) 16

Agreement With Three Processes Traitor x honest 1 1 1 2 3 2 y 2 honest Originally, they have 1000, 2000, and 3000 soldiers, respectively P1: got(1,2,x) P2: got(1,2,y) P3: got(1,2,3) P1 Got (1,2,y) (1,b,c) Result: (1, 2, U) P2 Got (1,2,x) (d,2,f) Result: (1, 2, U) 17

So far: Reliable C/S Communication Concentrated on process resilience. What about reliable communication channels? A communication channel can crash (e.g. a TCP connection is abruptly broken) Omission failures (a message can get lost) Timing failures Arbitrary failures (e.g. duplicate messages) 18

Point-to-Point Communication Point-to-point communication is established by making use of a reliable transport protocol, such as TCP. TCP masks omission failure (i.e. lost messages) by acknowledgements and retransmissions Crash failures are not masked (TCP connection is abruptly broken) Distributed system lets the client know that the channel has crashed and takes care of problem by setting up a new connection 19

Reliable RPC (RMI) Transparency may be lost when an error occurs. What can go wrong? 1. The client is unable to locate the server 2. The request message from the client to the server is lost 3. The server crashes after receiving a request 4. The reply message from the server to the client is lost 5. The client crashes after sending the request What to do? [1:] Relatively simple just report back to client (raise an exception or use signals) [2]: Just resend the message (if an acknowledgement does not arrive after a certain time period elapsed) 20

Server Crashes (1) server server server REQ Receive REQ Receive REQ Receive REP Execute Reply No REP Execute Crash No REP Crash A server in client-server communication a) Normal case b) Crash after execution (report failure to the client) c) Crash before execution (retransmit the request) 21

Server Crashes (2) Three schools of thought exist on what to do when the server crashes 1. At least once semantics: guarantees that the RPC has been carried out at least once, possibly more 2. At most once semantics: guarantees that the RPC has been carried out at most one time, possibly none at all. 3. No guarantee at all. What we want is exactly once semantics. 22

Server Crashes: Print Example (1) Example: The client sends a (print) request to a server The server crashes and subsequently recovers The server announces it is up and running again Client is not sure if its request has been carried out Server s Strategies: 1. Send ACK after delivering the request to the printer 2. Send ACK after the text is printed out 23

Server Crashes: Print Example (2) Client s strategies 1. Never reissue the request 2. Always reissue the request 3. Reissue the request only if it did not receive an acknowledgement client assumes that server crashed before the request is delivered to the printer 4. Reissue the request only if it has received an acknowledgement client assumes that server sent acknowledgement, and crashed before sending the document for printing 24

Server Crashes: Print Example (3) Situation: Two strategies for server and four for clients, there are eight combinations. None of them offers a satisfactory solution for every case There are three events that can happen at the server in different orders (M) : send the completion message (P) : print the message (C): crash 25

Server Crashes: Print Example (4) 1.M P C: a crash occurs after sending completion message and printing the text 2.M C( P): a crash happens after sending the completion message, but before the text could be printed 3.P M C: a crash occurs after sending the completion message and printing the text 4.P C (M): the text printed, after which a crash occurs before the completion message could be sent 5.C (P M): a crash happens before the server could do anything 6.C (M P): a crash happens before the server could do anything 26

Server Crashes: Print Example (5) Client Server Strategy M P Strategy P M Reissue strategy MPC MC(P) C(MP) PMC PC(M) C(PM) Always DUP OK OK DUP DUP OK Never OK ZERO ZERO OK OK ZERO Only when ACKed DUP OK ZERO DUP OK ZERO Only when not ACKed OK ZERO OK OK DUP OK Different combinations of client and server strategies in the presence of server crashes. 27

[4:] Lost Reply Messages Detecting lost replies can be hard; the client cannot know why there is no answer: the reply message is lost or the server crashed or it is just slow. i.e. the client is not sure whether the request is carried out or not Solution The client resend the request and does nothing to prevent duplicate execution if the request is an idempotent Otherwise; it can assign each request a unique sequence number. Or it indicates if the request is original or a reissue. The last two solutions put extra burden on the server 28

[5:] Client Crashes (1) The problem The server is processing the request and tying up the valuable resources for nothing (orphan computation) Solutions 1.Extermination: the client keeps a log on persistent storage that survives the crashes; when the client recovers, it checks the log file and kills explicitly the orphan computation 2.Reincarnation: divide time up into sequentially numbered epochs when a client recovers, it announces the starting of a new epoch; when announcement is received, the orphan computations are killed on behalf of the client. If the network is partitioned some orphans can survive 29