Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter Outline
- Fault tolerance
- Process resilience
- Reliable group communication
- Distributed commit
- Recovery

Basic Concepts
Dependability includes:
- Availability
- Reliability
- Safety
- Maintainability

Fault, Error, Failure
[Figure: a fault inside the server causes an error in its state, which the client observes as a failure: fault => error => failure]

Failure Model
- Challenge: independent failures
- Detection
  - which component?
  - what went wrong?
- Recovery
  - failure dependent
  - ignorance increases complexity => a taxonomy of failures is needed

Fault Tolerance
- Detection
- Recovery
  - mask the error, OR
  - fail predictably
- The designer must decide:
  - which failure types are possible?
  - what is the recovery action for each possible failure type?
- A fault classification:
  - transient (disappears)
  - intermittent (disappears and reappears)
  - permanent

Failure Models

  Type of failure            Description
  Crash failure              A server halts, but is working correctly until it halts
  Omission failure           A server fails to respond to incoming requests
    Receive omission         A server fails to receive incoming messages
    Send omission            A server fails to send messages
  Timing failure             A server's response lies outside the specified time interval
  Response failure           The server's response is incorrect
    Value failure            The value of the response is wrong
    State-transition failure The server deviates from the correct flow of control
  Arbitrary failure          A server may produce arbitrary responses at arbitrary times

Crash variants: fail-stop, fail-safe (detectable), fail-silent (seems to have crashed).

Failure Masking (1): Detection
- Redundant information
  - error-detecting codes (parity, checksums)
  - replicas
- Redundant processing
  - group work and comparison
- Control functions
  - timers
  - acknowledgements
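
Detection through redundant information can be made concrete with a checksum. A minimal sketch in Python, where zlib.crc32 stands in for the parity/checksum codes named above and the frame format is illustrative:

    import zlib

    def send_with_checksum(payload: bytes) -> bytes:
        # Append a CRC-32 checksum so the receiver can detect corruption.
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def verify_checksum(frame: bytes) -> bytes:
        # Recompute the checksum; a mismatch means an error was detected.
        payload, crc = frame[:-4], frame[-4:]
        if zlib.crc32(payload).to_bytes(4, "big") != crc:
            raise ValueError("checksum mismatch: corruption detected")
        return payload

    frame = send_with_checksum(b"hello")
    print(verify_checksum(frame))                      # b'hello'
    # Flipping one bit makes verification fail:
    # verify_checksum(bytes([frame[0] ^ 0x01]) + frame[1:])  # ValueError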

Failure Masking (2): Recovery
- Redundant information
  - error-correcting codes
  - replicas
- Redundant processing
  - time redundancy
    - retry
    - recomputation (checkpoint, log)
  - physical redundancy
    - group work and voting
    - tightly synchronized groups

Example: Physical Redundancy
[Figure: triple modular redundancy - each component is triplicated, and a majority voter after each stage masks the failure of any single component]
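
The voting idea behind triple modular redundancy can be sketched in a few lines of Python (the replica functions and the injected fault are hypothetical):

    from collections import Counter

    def majority_vote(results):
        # The voter outputs whatever a majority of replicas produced;
        # with three replicas, any single faulty result is masked.
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: more than one replica failed")
        return value

    def faulty_square(x):
        return x * x + 1        # hypothetical replica with a permanent fault

    replicas = [lambda x: x * x, lambda x: x * x, faulty_square]
    print(majority_vote([f(7) for f in replicas]))   # 49: the fault is masked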

Failure Masking (3)
- Failure models vs. implementation issues: if a (sub)system is known to belong to a failure class, certain failures do not occur => easier detection and recovery
- A point of view: forward vs. backward recovery
- Issues:
  - process resilience
  - reliable communication

Process Resilience (1)
- Redundant processing: groups
  - Tightly synchronized
    - flat group: voting
    - hierarchical group: a primary and a hot standby (execution-level synchrony)
  - Loosely synchronized
    - hierarchical group: a primary and a cold standby (checkpoint, log)
- Technical basis
  - the group as a single abstraction
  - reliable message passing

Flat and Hierarchical Groups (1)
[Figure: communication in a flat group vs. communication in a simple hierarchical group]
Group management: a group server OR distributed management

Flat and Hierarchical Groups (2)
- Flat groups
  - symmetrical
  - no single point of failure
  - complicated decision making
- Hierarchical groups
  - the opposite properties
- Group management issues
  - join, leave
  - crash (no notification is given)

Process Groups
- Communication vs. management
  - application communication: message passing
  - group management: message passing
  - synchronization requirement: each group communication operation takes place in a stable group
- Failure masking
  - k-fault-tolerant: tolerates k faulty members
    - fail-silent members: k + 1 components needed
    - Byzantine members: 2k + 1 components needed
  - a precondition: atomic multicast
  - in practice: the probability of a failure must be small enough

Agreement in Faulty Systems (1)
Requirement: an agreement, within a bounded time.
Scenario: Alice and Bob exchange e-mail to agree on meeting, on a rainy day, in front of the cafe La Tryste. With faulty data communication, no agreement is possible:
- Alice -> Bob: "Let's meet at noon in front of La Tryste."
- Alice <- Bob: "OK!"
- Alice: "If Bob doesn't know that I received his message, he will not come."
- Alice -> Bob: "I received your message, so it's OK."
- Bob: "If Alice doesn't know that I received her message, she will not come."
- ... and so on: no finite number of acknowledgements lets both parties know that the agreement holds.

Agreement in Faulty Systems (2)
Reliable data communication, but unreliable nodes.
[Figure: the Byzantine generals problem for 3 loyal generals and 1 traitor]
a) The generals announce their troop strengths (in units of 1 kilosoldier).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

Agreement in Faulty Systems (3)
[Figure: the same as on the previous slide, except now with 2 loyal generals and 1 traitor - the loyal generals can no longer reach agreement]
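
The vector-exchange rounds of this example can be simulated. A runnable Python sketch, assuming N = 4 generals with at most 1 traitor (the traitor index and troop strengths are illustrative); each loyal general takes an element-wise majority over the relayed vectors:

    import random
    from collections import Counter

    N, TRAITOR = 4, 3                  # 4 generals, general 3 is the traitor
    strengths = [1, 2, 4, 0]           # announced strengths (kilosoldiers)

    def report(sender, value):
        # A traitor reports arbitrary values; loyal generals tell the truth.
        return random.randint(0, 9) if sender == TRAITOR else value

    # Step 1: every general announces its strength; each i builds a vector.
    heard = {i: [report(j, strengths[j]) for j in range(N)] for i in range(N)}

    # Step 2: every general relays its whole vector to every other general.
    relayed = {i: {j: [report(j, v) for v in heard[j]]
                   for j in range(N) if j != i} for i in range(N)}

    # Step 3: element-wise majority over the relayed vectors.
    for i in (g for g in range(N) if g != TRAITOR):
        decision = []
        for k in range(N):
            votes = Counter(vec[k] for vec in relayed[i].values())
            value, count = votes.most_common(1)[0]
            decision.append(value if count > (N - 1) // 2 else "unknown")
        print(f"general {i} decides {decision}")
    # Each loyal general recovers the loyal values 1, 2, 4 in positions 0-2;
    # in the traitor's position the random reports typically yield no majority.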

Reliable Group Communication
- Lower-level data communication support
  - unreliable multicast (LAN)
  - reliable point-to-point channels
  - unreliable point-to-point channels
- Group communication
  - individual point-to-point message passing
  - implemented in middleware or in the application
- Reliability
  - acks: lost messages, lost members
  - communication consistency?

Reliability of Group Communication?
- A sent message is received by all members (acks from all => ok)
- Problem: during a multicast operation
  - an old member disappears from the group
  - a new member joins the group
- Solution: membership changes synchronize with multicasting => during a multicast operation there are no membership changes
- An additional problem: the sender disappears
  (remember: multicast ~ for (all Pi in G) { send m to Pi })

Basic Reliable-Multicasting Scheme
[Figure: message transmission and the reporting of feedback]
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.
Scalability? Feedback implosion!
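
A minimal sketch of the scheme in Python (the lossy network is simulated by hypothetical in-memory receivers): the sender retransmits until every known receiver has acknowledged, which is exactly why a large group drowns the sender in feedback:

    class Receiver:
        def __init__(self, name):
            self.name, self.delivered, self.lost_once = name, [], False
        def receive(self, seq, msg):
            if not self.lost_once:             # simulate one lost delivery
                self.lost_once = True
                return None                    # no ACK comes back
            if seq not in self.delivered:
                self.delivered.append(seq)
            return ("ACK", self.name, seq)     # positive acknowledgement

    def reliable_multicast(receivers, seq, msg, max_rounds=10):
        pending = {r.name for r in receivers}  # receivers missing an ACK
        for _ in range(max_rounds):
            for r in receivers:
                if r.name in pending and r.receive(seq, msg):
                    pending.discard(r.name)
            if not pending:
                return True                    # everyone has acknowledged
        return False                           # a receiver is presumed failed

    group = [Receiver(f"p{i}") for i in range(4)]
    print(reliable_multicast(group, seq=1, msg="m1"))   # True, after one retry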

Scalability: Feedback Suppression
1. Never acknowledge successful delivery.
2. Multicast negative acknowledgements => redundant NACKs are suppressed.
Problem: detection of lost messages and of lost group members.
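
A sketch of NACK-based feedback with suppression (assuming Python; the random backoff window is illustrative): a receiver that spots a gap in the sequence numbers schedules a NACK, but cancels it if some other receiver multicasts the same NACK first:

    import random

    class NackReceiver:
        def __init__(self):
            self.expected = 0
            self.pending = {}                 # missing seq -> backoff deadline

        def on_message(self, seq):
            for missing in range(self.expected, seq):   # gap => losses
                # Schedule a NACK after a random delay (suppression window).
                self.pending.setdefault(missing, random.uniform(0.0, 1.0))
            self.expected = max(self.expected, seq + 1)

        def on_nack_seen(self, seq):
            # Another receiver already multicast this NACK: suppress ours.
            self.pending.pop(seq, None)

        def due_nacks(self, now):
            due = sorted(s for s, t in self.pending.items() if t <= now)
            for s in due:
                del self.pending[s]
            return due                        # these NACKs would be multicast

    r = NackReceiver()
    r.on_message(0); r.on_message(3)          # messages 1 and 2 were lost
    r.on_nack_seen(1)                         # someone else was faster for 1
    print(sorted(r.pending))                  # [2]: only one NACK remains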

Hierarchical Feedback Control
[Figure: the essence of hierarchical reliable multicasting]
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.

Basic Multicast
Guarantee: the message will eventually be delivered to all members of the group (during the multicast: a fixed membership).
Group view: G = {pi}, the delivery list.
Implementation of Basic_multicast(G, m):
1. for each pi in G: send(pi, m) (a reliable one-to-one send)
2. on receive(m) at pi: deliver(m) at pi

Message Delivery
Delivery of messages:
- a new message is placed in the hold-back queue (HBQ)
- decision making:
  - delivery order
  - to deliver or not to deliver?
- once the message is allowed to be delivered: HBQ => delivery queue (DQ)
- when at the head of the DQ: message => application (application: receive)
[Figure: the application on top of the message-passing system, which contains the hold-back queue and the delivery queue]
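
A minimal sketch of a hold-back queue in Python, assuming per-sender FIFO sequence numbers as the delivery-order rule (one of several possible rules):

    import heapq

    class FifoDelivery:
        def __init__(self):
            self.next_seq = {}                  # sender -> next expected seq
            self.hbq = {}                       # sender -> heap of (seq, msg)

        def on_receive(self, sender, seq, msg):
            # New message => hold-back queue.
            heapq.heappush(self.hbq.setdefault(sender, []), (seq, msg))
            deliverable, heap = [], self.hbq[sender]
            # Move messages to the delivery queue while they are in order.
            while heap and heap[0][0] == self.next_seq.get(sender, 0):
                s, m = heapq.heappop(heap)
                self.next_seq[sender] = s + 1
                deliverable.append(m)           # head of DQ => application
            return deliverable

    d = FifoDelivery()
    print(d.on_receive("p1", 1, "b"))   # []: held back, seq 0 still missing
    print(d.on_receive("p1", 0, "a"))   # ['a', 'b']: delivered in order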

Reliable Multicast and Group Changes
Assume:
- reliable point-to-point communication
- group G = {pi}; each pi holds the group view
Reliable_multicast(G, m): if a message is delivered to one process in G, then it is delivered to all processes in G.
- A group change (join, leave) => a change of the group view
- A change of the group view is propagated as a multicast vc
- Concurrent group_change and multicast => concurrent messages m and vc
- Virtual synchrony: all nonfaulty processes see m and vc in the same order

Virtually Synchronous Reliable MC (1)
Group change: Gi => Gi+1
Virtual synchrony: all processes see m and vc in the same order
- m, vc => m is delivered to all nonfaulty processes in Gi (alternative: this order is not allowed!)
- vc, m => m is delivered to all processes in Gi+1 (what is the difference?)
Problem: the sender fails during the multicast (why is it a problem?)
Alternative solutions:
- m is delivered to all other members of Gi (=> ordering m, vc)
- m is ignored by all other members of Gi (=> ordering vc, m)

Virtually Synchronous Reliable MC (2)
The principle of virtually synchronous multicast:
- a reliable multicast, and
- if the sender crashes: the message may be delivered to all, or ignored by each

Implementing Virtual Synchrony
- Communication: reliable, order-preserving, point-to-point
- Requirement: all messages are delivered to all nonfaulty processes in G
- Solution (sketched in code below):
  - each pi in G keeps a message in the hold-back queue until it knows that all pj in G have received it
  - a message received by all is called stable
  - only stable messages are allowed to be delivered
  - on a view change Gi => Gi+1:
    - multicast all unstable messages to all pj in Gi+1
    - multicast a flush message to all pj in Gi+1
    - after having received a flush message from all: install the new view Gi+1
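
A much-simplified, single-threaded Python sketch of the flush protocol (the Network class, process IDs, and direct method calls are illustrative; a real implementation must also handle failures during the flush itself):

    class VsProcess:
        def __init__(self, pid, view, net):
            self.pid, self.view, self.net = pid, set(view), net
            self.unstable, self.delivered = [], set()
            self.flushes, self.next_view = set(), None

        def on_message(self, msg):
            # Hold every message back until it is known to be stable.
            if msg not in self.delivered and msg not in self.unstable:
                self.unstable.append(msg)

        def on_view_change(self, new_view):
            # Re-multicast unstable messages into the new view, then flush.
            self.next_view = set(new_view)
            for msg in list(self.unstable):
                self.net.multicast(self.next_view, "MSG", msg)
            self.net.multicast(self.next_view, "FLUSH", self.pid)

        def on_flush(self, sender):
            self.flushes.add(sender)
            # Install the new view once a flush arrived from every member;
            # the re-multicast messages are now stable and can be delivered.
            if self.next_view and self.flushes >= self.next_view:
                self.delivered.update(self.unstable)
                self.unstable, self.flushes = [], set()
                self.view, self.next_view = self.next_view, None

    class Network:
        def __init__(self): self.procs = {}
        def multicast(self, pids, kind, payload):
            for pid in sorted(pids):
                p = self.procs[pid]
                (p.on_message if kind == "MSG" else p.on_flush)(payload)

    net = Network()
    view = {"p1", "p2", "p3"}
    net.procs = {pid: VsProcess(pid, view, net) for pid in view}
    net.multicast(view, "MSG", "m1")          # m1 arrives but is unstable
    for pid in ("p1", "p2"):                  # p3 is detected as crashed
        net.procs[pid].on_view_change({"p1", "p2"})
    print(sorted(net.procs["p1"].view), sorted(net.procs["p1"].delivered))
    # ['p1', 'p2'] ['m1']: the new view is installed and m1 was not lost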

Implementing Virtual Synchrony
a) Process 4 notices that process 7 has crashed and sends a view change.
b) Process 6 sends out all its unstable messages, followed by a flush message.
c) Process 6 installs the new view when it has received a flush message from everyone else.

Ordered Multicast
Need: all messages are delivered in the intended order.
If p: multicast(g, m) and, for any m',
- FIFO: multicast(g, m) < multicast(g, m') at the same sender p
- causal: multicast(g, m) -> multicast(g, m') (happened-before)
- total: at any q: deliver(m) < deliver(m')
then at all q in G: deliver(m) < deliver(m').
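
Total order is often implemented with a sequencer process, a technique not spelled out on the slide. A minimal sketch (assuming Python; the class names are illustrative): all receivers deliver in the sequencer's numbering, so they all see the same order:

    class Sequencer:
        # One process assigns a global sequence number to every multicast.
        def __init__(self): self.next = 0
        def order(self, msg):
            seq, self.next = self.next, self.next + 1
            return seq, msg

    class TotalOrderReceiver:
        def __init__(self): self.expected, self.hbq = 0, {}
        def on_sequenced(self, seq, msg):
            self.hbq[seq] = msg                  # hold back out-of-order msgs
            out = []
            while self.expected in self.hbq:     # deliver strictly by number
                out.append(self.hbq.pop(self.expected))
                self.expected += 1
            return out

    seq = Sequencer()
    r1, r2 = TotalOrderReceiver(), TotalOrderReceiver()
    m1, m2 = seq.order("m1"), seq.order("m2")
    print(r1.on_sequenced(*m2), r1.on_sequenced(*m1))  # [] ['m1', 'm2']
    print(r2.on_sequenced(*m1), r2.on_sequenced(*m2))  # ['m1'] ['m2']
    # Both receivers deliver m1 before m2: the same total order everywhere.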

Reliable FIFO-Ordered Multicast

  Process P1   Process P2    Process P3    Process P4
  sends m1     receives m1   receives m3   sends m3
  sends m2     receives m3   receives m1   sends m4
               receives m2   receives m2
               receives m4   receives m4

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.

Virtually Synchronous Multicasting

  Multicast                  Basic message ordering    Total-ordered delivery?
  Reliable multicast         None                      No
  FIFO multicast             FIFO-ordered delivery     No
  Causal multicast           Causal-ordered delivery   No
  Atomic multicast           None                      Yes
  FIFO atomic multicast      FIFO-ordered delivery     Yes
  Causal atomic multicast    Causal-ordered delivery   Yes

Six different versions of virtually synchronous reliable multicasting.
- virtually synchronous: everybody or nobody (among the members of the group); if the sender fails: either everybody else or nobody
- atomic multicast: virtually synchronous reliable multicast with totally-ordered delivery

Recovery
- Fault tolerance: recovery from an error (erroneous state => error-free state)
- Two approaches:
  - backward recovery: roll back into a previous correct state
  - forward recovery:
    - detect that the new state is erroneous
    - bring the system into a correct new state
    - challenge: the possible errors must be known in advance
- Costs:
  - forward: a continuous need for redundancy
  - backward:
    - expensive when needed
    - recovery after a failure is not always possible

Recovery: Stable Storage
[Figure: (a) stable storage on two drives; (b) a crash after drive 1 is updated; (c) a bad spot]

Implementing Stable Storage
- Careful block operations (fault tolerance against transient faults)
  - careful_read: { get_block; check_parity; on error => N retries }
  - careful_write: { write_block; get_block; compare; on error => N retries }
  - irrecoverable failure => report to the client
- Stable storage operations (fault tolerance against data storage errors)
  - stable_get: { careful_read(replica_1); if failure then careful_read(replica_2) }
  - stable_put: { careful_write(replica_1); careful_write(replica_2) }
  - error/failure recovery: read both replicas and compare
    - both good and the same => ok
    - both good but different => replace replica_2 with replica_1
    - one good, one bad => replace the bad block with the good one
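
A minimal runnable sketch of these operations in Python, assuming two local files stand in for the two drives and a CRC-32 checksum stands in for the parity check:

    import zlib

    def careful_write(path, data, retries=3):
        # Write block + checksum, read back and compare; retry on mismatch.
        for _ in range(retries):
            with open(path, "wb") as f:
                f.write(zlib.crc32(data).to_bytes(4, "big") + data)
            if careful_read(path) == data:
                return
        raise IOError(f"irrecoverable write failure on {path}")

    def careful_read(path):
        # Read block and verify checksum; None signals a bad/missing block.
        try:
            with open(path, "rb") as f:
                frame = f.read()
        except OSError:
            return None
        crc, data = frame[:4], frame[4:]
        return data if zlib.crc32(data).to_bytes(4, "big") == crc else None

    def stable_put(data):
        careful_write("replica_1", data)
        careful_write("replica_2", data)

    def stable_get():
        data = careful_read("replica_1")
        return data if data is not None else careful_read("replica_2")

    def recover():
        # After a crash: read both replicas, compare, repair divergence.
        r1, r2 = careful_read("replica_1"), careful_read("replica_2")
        if r1 is not None and r1 != r2:
            careful_write("replica_2", r1)   # replica_1 wins on disagreement
        elif r1 is None and r2 is not None:
            careful_write("replica_1", r2)   # replace the bad block

    stable_put(b"account balance: 42")
    print(stable_get())                      # b'account balance: 42'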

Checkpointing
Needed: a consistent global state to be used as a recovery line.
A recovery line: the most recent distributed snapshot.

Independent Checkpointing
- Each process records its local state from time to time => it is difficult to find a recovery line
- If the most recently saved states do not form a recovery line: roll back to a previous saved state (threat: the domino effect)
- A solution: coordinated checkpointing

Coordinated Checkpointing (1)
- Nonblocking checkpointing
  - see: distributed snapshot
- Blocking checkpointing (sketched in code below)
  - coordinator: multicast CHECKPOINT_REQ
  - partner:
    - take a local checkpoint
    - acknowledge the coordinator
    - wait (and queue any subsequent messages)
  - coordinator:
    - wait for all acknowledgements
    - multicast CHECKPOINT_DONE
  - coordinator, partner: continue
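
A minimal single-threaded sketch of the blocking variant (assuming Python; real message passing is reduced to direct method calls):

    class Partner:
        def __init__(self, name):
            self.name, self.blocked, self.queued = name, False, []
        def on_checkpoint_req(self):
            state = f"checkpoint({self.name})"   # take a local checkpoint
            self.blocked = True                  # wait: queue later messages
            return state                         # acknowledge the coordinator
        def on_checkpoint_done(self):
            self.blocked = False                 # resume normal operation
            pending, self.queued = self.queued, []
            return pending                       # process queued messages

    class Coordinator:
        def checkpoint(self, partners):
            # Multicast CHECKPOINT_REQ and wait for all acknowledgements;
            # while everyone is blocked, the local checkpoints are consistent.
            acks = [p.on_checkpoint_req() for p in partners]
            for p in partners:                   # multicast CHECKPOINT_DONE
                p.on_checkpoint_done()
            return acks

    group = [Partner(f"p{i}") for i in range(3)]
    print(Coordinator().checkpoint(group))
    # ['checkpoint(p0)', 'checkpoint(p1)', 'checkpoint(p2)']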

Coordinated Checkpointing (2)
[Figure: message sequence chart for P1, P2, P3 - the checkpoint request, local checkpoints, acks, checkpoint done, and application messages]

Message Logging
- Improving efficiency: combine checkpointing with message logging
- Recovery: restore the most recent checkpoint, then replay the logged messages
- Problem: incorrect replay of messages after recovery may lead to orphan processes

Chapter Summary
- Fault tolerance
- Process resilience
- Reliable group communication
- Distributed commit
- Recovery