Fault Tolerance. Distributed Systems IT332


2 Outline
- Introduction to fault tolerance
- Reliable client-server communication
- Distributed commit
- Failure recovery

3 Failures, Due to What?
A system is said to fail when it cannot meet its promises. Failures can happen due to a variety of reasons:
- Hardware faults
- Software bugs
- Operator errors
- Network errors/outages

4 Failures in Distributed Systems
A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure may happen when a component in a distributed system fails. This failure may affect the proper operation of other components, while at the same time leaving yet other components unaffected.

5 Goal and Fault Tolerance
An overall goal in distributed systems is to construct the system in such a way that it can automatically recover from partial failures. (Analogy: without fault tolerance, a punctured tire stops the car; with fault tolerance, the tire is punctured but the car recovers and continues.)
Fault tolerance is the property that enables a system to continue operating properly in the event of failures. For example, TCP is designed to allow reliable two-way communication in a packet-switched network, even in the presence of communication links that are imperfect or overloaded.

6 Dependable Systems
Being fault tolerant is strongly related to what is called a dependable system. A dependable system covers the following properties:
- Availability: the system is most likely working at a given instant in time.
- Reliability: the system will most likely continue to work without interruption during a relatively long period of time.
- Safety: when the system temporarily fails to operate correctly, nothing catastrophic happens.
- Maintainability: how easily a failed system can be repaired.

7 Failure Models
- Crash failure: a server halts, but was working correctly until it stopped.
- Omission failure: a server fails to respond to incoming requests.
  - Receive omission: a server fails to receive incoming messages.
  - Send omission: a server fails to send messages.
- Timing failure: a server's response lies outside the specified time interval.
- Response failure: a server's response is incorrect.
  - Value failure: the value of the response is wrong.
  - State-transition failure: the server deviates from the correct flow of control.
- Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.

8 Fault Masking by Redundancy
The key technique for masking faults is to use redundancy:
- Information redundancy: extra bits are added to allow recovery from garbled bits.
- Software redundancy: extra processes are added to allow tolerating failed processes.
- Hardware redundancy: extra equipment is added to allow tolerating failed hardware components.
- Time redundancy: an action is performed and then, if required, it is performed again.

9 Example: Triple Modular Redundancy (TMR)
Consider a circuit with signals passing through devices A, B, and C in sequence; if one device is faulty, the final result will be incorrect.
With TMR, each device is replicated three times, and after each stage there is a triplicated voter. If two or three of a voter's inputs are the same, the output is equal to that input.
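The voter itself is just a majority function over three replicated inputs. A minimal sketch in Python (the function name tmr_vote and the use of plain values instead of circuit signals are assumptions of the example, not part of the slides):

```python
def tmr_vote(a, b, c):
    """Return the majority value of three replicated inputs.

    If at least two inputs agree, the single faulty replica is masked.
    If all three disagree, more than one replica has failed and the
    fault cannot be masked.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: more than one replica is faulty")

# One faulty replica (the value 9) is outvoted by the two correct ones.
assert tmr_vote(4, 9, 4) == 4
```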

10 Reliable Client-Server Communication
How to handle communication failures? Use a reliable transport protocol (e.g., TCP) or handle them at the application layer.
Techniques for reliable communication:
- Use redundant bits to detect bit errors in packets.
- Use sequence numbers to detect packet loss.
- Mask corrupted/lost packets using acknowledgements and retransmissions.
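To illustrate the last two techniques, here is a hedged stop-and-wait sketch over UDP: each request carries a sequence number, the client retransmits on timeout, and the server is expected to acknowledge by echoing that sequence number back. The function name send_reliably, the address 127.0.0.1:9000, and the timeout and retry values are assumptions for the example only.

```python
import socket

def send_reliably(payload: bytes, seq: int, addr=("127.0.0.1", 9000),
                  timeout=1.0, max_retries=5) -> bytes:
    """Send one request and retransmit until a matching acknowledgement arrives."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    message = seq.to_bytes(4, "big") + payload       # prepend the sequence number
    for _ in range(max_retries):
        sock.sendto(message, addr)
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            continue                                  # request or reply lost: retransmit
        if int.from_bytes(reply[:4], "big") == seq:
            return reply[4:]                          # acknowledgement for this request
    raise TimeoutError("no acknowledgement after retries")
```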

11 RPC Semantics in the Presence of Failures
- Client cannot locate the server: the RPC system informs the caller of the failure.
- Client request is lost: the client resends the request upon timeout.
- Server crashes after receiving a request.
- Server response is lost.
- Client crashes after sending a request.

12 Server Crashes
Server crashes after receiving a request: did the crash occur before or after the request was carried out? The client cannot distinguish between the two possibilities, leading to three possible semantics:
- At least once: keep trying until a reply is received; guarantees that the RPC has been carried out at least one time, but possibly more.
- Exactly once: desirable, but difficult to achieve.
- At most once: give up immediately and report back the failure; guarantees that the RPC has been carried out at most one time, but possibly none at all.
Figure: a server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

13 Server Response Lost
Upon a timeout, the client cannot tell whether the server has crashed, the reply was lost, or the request was lost.
The client can resend the request for idempotent operations (i.e., operations that can be safely repeated).
For non-idempotent operations, add sequence numbers to requests so that the server can distinguish a retransmitted request from an original request.
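A sketch of the server side of this idea: the server remembers the last sequence number and reply per client, and answers a retransmission from that cache instead of re-executing the non-idempotent operation. The class name DedupServer and its fields are invented for illustration.

```python
class DedupServer:
    """Filter duplicate (retransmitted) requests using per-client sequence numbers."""

    def __init__(self, handler):
        self.handler = handler        # the actual, possibly non-idempotent operation
        self.last_seen = {}           # client_id -> (seq, cached_reply)

    def handle(self, client_id, seq, request):
        cached = self.last_seen.get(client_id)
        if cached is not None and cached[0] == seq:
            return cached[1]          # retransmission: resend the old reply, do not re-execute
        reply = self.handler(request) # original request: execute exactly once here
        self.last_seen[client_id] = (seq, reply)
        return reply
```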

14 Distributed Commit
A distributed transaction involves multiple servers. To ensure the atomicity of transactions, all servers involved must agree on whether to commit or abort.
The process that initiates the distributed transaction acts as the coordinator; the processes participating in the distributed transaction are the participants.
The coordinator relies on a distributed commit protocol to ensure the atomicity of a distributed transaction.

15 Two-Phase Commit Protocol (2PC)
2PC ensures that a transaction commits only when all participants are ready to commit.
Phase I: Voting Phase
- Step 1: The coordinator sends a VOTE_REQUEST message to all participants.
- Step 2: When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.

16 Two-Phase Commit Protocol (2PC)
Phase II: Decision Phase
- Step 1: The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator; in that case, it sends a GLOBAL_COMMIT message to all participants. However, if even one participant has voted to abort the transaction, the coordinator also decides to abort the transaction and multicasts a GLOBAL_ABORT message.
- Step 2: Each participant that voted to commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction; otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
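The two phases, seen from the coordinator, fit in a few lines. In this minimal sketch the message names follow the slides, while the helpers send and recv_vote, the participant list, and the absence of timeouts or crash handling (covered on the next slide) are simplifying assumptions.

```python
VOTE_REQUEST, VOTE_COMMIT, VOTE_ABORT = "VOTE_REQUEST", "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"

def coordinate(participants, send, recv_vote):
    """Run two-phase commit as the coordinator and return the global decision.

    send(p, msg)  -- deliver msg to participant p (assumed helper)
    recv_vote(p)  -- wait for p's vote, VOTE_COMMIT or VOTE_ABORT (assumed helper)
    """
    # Phase I: voting.
    for p in participants:
        send(p, VOTE_REQUEST)
    votes = [recv_vote(p) for p in participants]

    # Phase II: decision. Commit only if every participant voted to commit.
    decision = GLOBAL_COMMIT if all(v == VOTE_COMMIT for v in votes) else GLOBAL_ABORT
    for p in participants:
        send(p, decision)
    return decision
```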

17 Recovering from a Crash
Processes may crash; a timeout is used when a process is waiting for a message from another process. Upon a timeout:
- The coordinator in the WAIT state sends GLOBAL_ABORT to all participants.
- A participant in the INIT state aborts the transaction.
- A participant in the READY state contacts another participant Q and examines Q's state.
- If all participants are in the READY state, they block until the coordinator recovers.
The original slide tabulates the actions taken by a participant P residing in state READY after contacting another participant Q; a hedged sketch follows.
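The table itself is not in this transcription, so the following sketch encodes the standard 2PC treatment of a READY participant's timeout and should be read as an assumption: if some Q has already committed, the coordinator must have decided commit; if some Q has aborted or is still in INIT, aborting is safe; if everyone is READY, the participant must block.

```python
def ready_participant_timeout(contact_others):
    """Decide what a participant in state READY does after a coordinator timeout.

    contact_others() yields the states of the other participants as plain
    strings ("INIT", "READY", "COMMIT", "ABORT"); this representation and the
    decision rules follow the textbook treatment, not the (elided) slide table.
    """
    for q_state in contact_others():
        if q_state == "COMMIT":
            return "COMMIT"   # the coordinator must have sent GLOBAL_COMMIT
        if q_state in ("ABORT", "INIT"):
            return "ABORT"    # no global commit can have been decided, aborting is safe
    return "BLOCK"            # all participants are READY: wait for the coordinator
```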

18 Recovery
When a failure occurs, we need to bring the system into an error-free state.
- Forward recovery: remove all errors in the system's state, thus enabling the system to proceed. Forward recovery is impossible in most cases, because it has to be known in advance which errors may occur.
- Backward recovery: bring the system back to a previous error-free state. Backward recovery is widely used in distributed systems.
Techniques for backward recovery:
- Checkpointing
- Message logging

19 Checkpointing
Each process periodically records its state, i.e., makes a checkpoint.
- A high checkpoint frequency increases the overhead.
- A low checkpoint frequency increases the recovery cost in terms of lost computation.
Consistent global state / distributed snapshot: if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message.
Upon a crash, roll back to a recovery line, i.e., the most recent consistent collection of checkpoints.
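The consistency condition can be checked mechanically: every message recorded as received must also be recorded as sent. A small sketch, assuming each checkpoint is represented as a pair of sets of message identifiers (an illustrative representation, not something the slides prescribe):

```python
def is_recovery_line(checkpoints):
    """Check whether a collection of checkpoints forms a consistent global state.

    checkpoints: iterable of (sent_ids, received_ids) pairs, one per process,
    where each element is a set of message identifiers.
    """
    all_sent, all_received = set(), set()
    for sent, received in checkpoints:
        all_sent |= sent
        all_received |= received
    # Consistent iff nothing was recorded as received without being recorded as sent.
    return all_received <= all_sent

# Q recorded sending m1 and P recorded receiving it: consistent.
assert is_recovery_line([({"m1"}, set()), (set(), {"m1"})])
# P recorded receiving m2, but no checkpoint recorded sending it: not consistent.
assert not is_recovery_line([(set(), {"m2"}), (set(), set())])
```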

20 Checkpointing
Figure: checkpoints of two processes P and Q over time, from the initial state up to a failure, with a message sent from Q to P. A set of checkpoints in which we are able to identify both the senders and the receivers of all messages jointly forms a distributed snapshot and is a recovery line; a set in which the receipt of the message is recorded but its sending is not is not a recovery line.

21 Independent Checkpoints
Each process periodically checkpoints independently of the other processes. Upon a failure, each process is rolled back to its most recent checkpoint. If the most recent checkpoints do not form a consistent global state, the processes need to keep rolling back until a consistent global state is found: a cascaded rollback.
Figure: after a failure, P and Q roll back checkpoint by checkpoint because each successive set of checkpoints is not a recovery line.
A sketch of this rollback search follows.
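A hedged sketch of the cascaded rollback: starting from everyone's most recent checkpoint, keep rolling back any process whose current checkpoint records the receipt of a message that no current checkpoint records as sent. Checkpoints are again modelled as cumulative (sent, received) sets of message identifiers, oldest first, with index 0 being the initial (empty) state; this representation is an assumption of the example.

```python
def find_recovery_line(histories):
    """Return, per process, the index of the checkpoint on the recovery line.

    histories: one list of checkpoints per process, oldest first, where each
    checkpoint is a pair (sent_ids, received_ids) of cumulative sets and the
    first checkpoint of every process is the empty initial state.
    """
    idx = [len(h) - 1 for h in histories]   # start from the most recent checkpoints
    while True:
        sent = set().union(*(h[i][0] for h, i in zip(histories, idx)))
        received = set().union(*(h[i][1] for h, i in zip(histories, idx)))
        orphans = received - sent           # received here, but not sent in this cut
        if not orphans:
            return idx                      # consistent: this is the recovery line
        for p, h in enumerate(histories):
            if h[idx[p]][1] & orphans:
                idx[p] -= 1                 # cascaded rollback of the receiving process
```

Rolling one process back may in turn orphan messages it had sent and others had recorded as received, forcing those processes to roll back too; in the worst case (the domino effect) the search ends at the initial state.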

22 Coordinated Checkpoints
Processes use the distributed snapshot algorithm to coordinate checkpointing: all processes synchronize to jointly write their state to local stable storage. The saved state is automatically globally consistent.
Upon a failure, all processes roll back and restart from the latest snapshot.

23 Message Logging
Many distributed systems combine checkpointing (expensive) with message logging (cheap).
Each process periodically records its local state and logs the messages it received after having recorded that state. When a process crashes, restore the most recently checkpointed state, and then replay the messages that have been received since.
Message logging can be of two types:
- Sender-based logging: a process logs its messages before sending them off.
- Receiver-based logging: a receiving process first logs an incoming message before delivering it to the application.
Combining infrequent checkpointing with message logging is more efficient than frequent checkpointing.
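A minimal sketch of receiver-based logging on top of a checkpoint: messages received since the last checkpoint are appended to a log before being delivered, and recovery restores the checkpoint and replays the log in order. The class LoggingProcess, the apply callback, and the use of in-memory fields in place of stable storage are assumptions for illustration.

```python
class LoggingProcess:
    """Receiver-based message logging combined with periodic checkpoints (sketch)."""

    def __init__(self, initial_state, apply):
        self.state = initial_state
        self.apply = apply               # apply(state, msg) -> new state (the application)
        self.checkpoint = initial_state  # stands in for stable storage
        self.log = []                    # messages received since the last checkpoint

    def take_checkpoint(self):
        self.checkpoint = self.state     # record the local state
        self.log = []                    # earlier messages are already reflected in it

    def receive(self, msg):
        self.log.append(msg)             # log first (receiver-based logging) ...
        self.state = self.apply(self.state, msg)  # ... then deliver to the application

    def recover(self):
        # Restore the most recent checkpoint, then replay the logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self.state = self.apply(self.state, msg)
```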

24 Replay of Messages and Orphan Processes
Incorrect replay of messages after recovery can lead to orphan processes; this should be avoided. An orphan process is a process that survives the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
Figure: Q receives a logged message M1 from P and an unlogged message M2, and sends M3 to R. After Q crashes and recovers, M1 is replayed, but M2 can never be replayed, and so neither will M3; the process that had already received M3 is left as an orphan.

25 Next Chapter: Distributed File Systems
Questions?