Fault Tolerance. Distributed Software Systems. Definitions

Similar documents
Chapter 8 Fault Tolerance

Fault Tolerance. Basic Concepts

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

Failure Tolerance. Distributed Systems Santa Clara University

Fault Tolerance. Distributed Systems. September 2002

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Distributed Systems Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems

Dep. Systems Requirements

Today: Fault Tolerance. Replica Management

Fault Tolerance. Chapter 7

Chapter 8 Fault Tolerance

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

MYE017 Distributed Systems. Kostas Magoutis

Today: Fault Tolerance

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Fault Tolerance. Distributed Systems IT332

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Failure Masking by Redundancy

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Reliable Group Communication

Distributed Systems COMP 212. Lecture 19 Othon Michail

MYE017 Distributed Systems. Kostas Magoutis

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance. Fall 2008 Jussi Kangasharju

Fault Tolerance 1/64

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms

Distributed Systems (ICE 601) Fault Tolerance

Implementation Issues. Remote-Write Protocols

Module 8 Fault Tolerance CS655! 8-1!

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Module 8 - Fault Tolerance

Process groups and message ordering

Last Class:Consistency Semantics. Today: More on Consistency

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Fault Tolerance Dealing with an imperfect world

To do. Consensus and related problems. q Failure. q Raft

Eventual Consistency. Eventual Consistency

Consensus and related problems

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

System Models for Distributed Systems

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus

Failures, Elections, and Raft

Distributed Systems

Specifying and Proving Broadcast Properties with TLA

Introduction to Distributed Systems Seif Haridi

2017 Paul Krzyzanowski 1

Consensus in Distributed Systems. Jeff Chase Duke University

Reliable Distributed System Approaches

Network Protocols. Sarah Diesburg Operating Systems CS 3430

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Middleware and Distributed Systems. System Models. Dr. Martin v. Löwis

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms

Basic vs. Reliable Multicast

2. System Models Page 1. University of Freiburg, Germany Department of Computer Science. Distributed Systems. Chapter 2 System Models

Today: Fault Tolerance. Reliable One-One Communication

The Design and Implementation of a Fault-Tolerant Media Ingest System

Distributed Operating Systems

Distributed Systems 11. Consensus. Paul Krzyzanowski

Coordination 2. Today. How can processes agree on an action or a value? l Group communication l Basic, reliable and l ordered multicast

System models for distributed systems

Distributed Systems (5DV147)

Coordination 1. To do. Mutual exclusion Election algorithms Next time: Global state. q q q

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)


Global atomicity. Such distributed atomicity is called global atomicity A protocol designed to enforce global atomicity is called commit protocol

Arvind Krishnamurthy Fall Collection of individual computing devices/processes that can communicate with each other

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

C 1. Recap. CSE 486/586 Distributed Systems Failure Detectors. Today s Question. Two Different System Models. Why, What, and How.

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How

Distributed Systems. Multicast and Agreement

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer

The UNIVERSITY of EDINBURGH. SCHOOL of INFORMATICS. CS4/MSc. Distributed Systems. Björn Franke. Room 2414

BYZANTINE GENERALS BYZANTINE GENERALS (1) A fable: Michał Szychowiak, 2002 Dependability of Distributed Systems (Byzantine agreement)

Middleware for Transparent TCP Connection Migration

Distributed Systems. coordination Johan Montelius ID2201. Distributed Systems ID2201


Initial Assumptions. Modern Distributed Computing. Network Topology. Initial Input

MODELS OF DISTRIBUTED SYSTEMS

Today. A methodology for making fault-tolerant distributed systems.

Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Deadlock

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Distributed Algorithms Benoît Garbinato

Distributed Systems (5DV020) Group communication. Fall Group communication. Fall Indirect communication

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Efficient Buffering in Reliable Multicast Protocols

Distributed Systems Exam 1 Review. Paul Krzyzanowski. Rutgers University. Fall 2016

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

Consensus, impossibility results and Paxos. Ken Birman

MODELS OF DISTRIBUTED SYSTEMS

Distributed Systems 24. Fault Tolerance

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

Transcription:

Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety: failure to operate correctly does not lead to catastrophic failures Maintainability: ability to easily repair a failed system

Failure Models Different types of failures. Type of failure Crash failure Omission failure Receive omission Send omission Timing failure Response failure Value failure State transition failure Arbitrary failure Description A server halts, but is working correctly until it halts A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages A server's response lies outside the specified time interval The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control A server may produce arbitrary responses at arbitrary times Failure Masking by Redundancy Triple modular redundancy. 2

Agreement in Faulty Systems Many things can go wrong Communication Message transmission can be unreliable Time taken to deliver a message is unbounded Adversary can intercept messages Processes Can fail or team up to produce wrong results Agreement very hard, sometime impossible, to achieve! Two-Army Problem Two blue armies need to simultaneously attack the white army to win; otherwise they will be defeated. The blue army can communicate only across the area controlled by the white army which can intercept the messengers. What is the solution? 3

Byzantine Agreement [Lamport et al. (982)] Goal: Each process learn the true values sent by correct processes Assumptions: Every message that is sent is delivered correctly The receiver knows who sent the message Message delivery time is bounded Byzantine Agreement Result In a system with m faulty processes agreement can be achieved only if there are 2m+ functioning correctly Note: This result only guarantees that each process receives the true values sent by correct processors, but it does not identify the correct processes! 4

Byzantine General Problem: Example Phase : Generals announce their troop strengths to each other Byzantine General Problem: Example Phase : Generals announce their troop strengths to each other 2 2 2 5

Byzantine General Problem: Example Phase : Generals announce their troop strengths to each other 4 4 4 Byzantine General Problem: Example Phase 2: Each general construct a vector with all troops 2 x 4 2 y 4 x y z 2 z 4 6

Byzantine General Problem: Example Phase 3: Generals send their vectors to each other and compute majority voting 2 y 4 a b c d 2 z 4 (a, b, c, d) (, 2,?, 4) (e, f, g, h) 2 x 4 e f g h 2 z 4 (, 2,?, 4) (h, i, j, k) h 2 2 i x y j 4 4 k (, 2,?, 4) Flat Groups versus Hierarchical Groups a) Communication in a flat group. b) Communication in a simple hierarchical group 7

Reliable Group Communication Reliable multicast: all nonfaulty processes which do not join/leave during communication receive the message Atomic multicast: all messages are delivered in the same order to all processes Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known and are assumed not to fail a) Message transmission b) Reporting feedback 8

Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others. Hierarchical Feedback Control The essence of hierarchical reliable multicasting. a) Each local coordinator forwards the message to its children. b) A local coordinator handles retransmission requests. 9

Group communication Role of group membership service Provide an interface for group membership changes Implement a failure detector Notify members of group membership changes Perform group address expansion Services provided for process groups Group address expansion Group send Multicast communication Leave Fail Group membership management Join Process group 0

View delivery A view reflects current membership of group A view is delivered when a membership change occurs and the application is notified of the change Receiving a view is different from delivering a view All members have to agree to the delivery of a view View-synchronous group communication the delivery of a new view draws a conceptual line across the system and every message is either delivered on side or the other of that line View-synchronous group communication a (allowed). p p crashes b (allowed). p p crashes q q r r view (p, q, r) view (q, r) view (p, q, r) view (q, r) c (disallowed). p q r p crashes d (disallowed). p crashes p q r view (p, q, r) view (q, r) view (p, q, r) view (q, r)

Atomic Multicast All messages are delivered in the same order to all processes Group view: the set of processes known by the sender when it multicast the message Virtual synchronous multicast: a message multicast to a group view G is delivered to all nonfaulty processes in G If sender fails after sending the message, the message may be delivered to no one Virtual Synchronous Multicast 2

Virtual Synchrony Implementation [Birman et al 99] The logical organization of a distributed system to distinguish between message receipt and message delivery Virtual Synchrony Implementation: [Birman et al., 99] Only stable messages are delivered Stable message: a message received by all processes in the message s group view Assumptions (can be ensured by using TCP): Point-to-point communication is reliable Point-to-point communication ensures FIFO-ordering 3

Virtual Synchrony Implementation: Example G i = {,,,, P5} P5 fails detects that P5 has failed send a view change message to every process in G i+ = {,,, } change view P5 Virtual Synchrony Implementation: Example Every process Send each unstable message m from G i to members in G i+ Marks m as being stable Send a flush message to mark that all unstable messages have been sent unstable message P5 flush message 4

Virtual Synchrony Implementation: Example Every process After receiving a flush message from any process in G i+ installs G i+ P5 Message Ordering Discussed last week how we can implement an ordered multicast FIFO-order: messages from the same process are delivered in the same order they were sent Causal-order: potential causality between different messages is preserved Total-order: all processes receive messages in the same order Total ordering does not imply causality or FIFO! Atomicity is orthogonal to ordering 5

Message Ordering and Atomicity Multicast Reliable multicast FIFO multicast Causal multicast Atomic multicast FIFO atomic multicast Causal atomic multicast Basic Message Ordering None FIFO-ordered delivery Causal-ordered delivery None FIFO-ordered delivery Causal-ordered delivery Total-ordered Delivery? No No No Yes Yes Yes 6