Dep. Systems Requirements

Size: px
Start display at page:

Download "Dep. Systems Requirements"

Transcription

1 Dependable Systems

2 Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small relative to MTTF, we can achieve good availability, even if MTTF is not large e.g. Windows Reliability the system can run continuously without failure R(t) = probability system will not fail by time t Lim t->infty R(t) = 0 Safety if a system fails, nothing catastrophic will happen. Maintainability when a system fails, it can be repaired easily and quickly (and, sometimes, without its users noticing the failure). Security...

3 Some Definitions Failure When a system cannot meet its promises When DS cannot provide one of its services Error Part of system state which may lead to failure e.g., packet corruption Fault Cause of an error e.g., bad transmission medium Transient, intermittent, permanent

4 Fault Tolerance System provides services even when faults exist Which means we expect faults in the system Except we cannot detect faults We can detect errors, since they are a product of the system state An error represents the failure of a component, but not of the system On detecting the error we will perform corrective action to prevent failure Thus we are tolerant of the existence of the fault We need to know failure types, since components fail when errors occur

5 Types of Failure Type of failure Description Crash failure Omission failure Receive omission Send omission A server halts, but is working correctly until it halts. Fail silent / Fail stop: most benign form of failure A server fails to respond to incoming requests. - A server fails to receive incoming messages. - A server fails to send outgoing messages. To a client, indistinguishable from fail silent Timing failure Response failure Value failure State transition failure A server's response lies outside the specified time interval. Performance failure critical problem in real-time systems The server's response is incorrect. - The value of the response is wrong. - The server deviates from the correct flow of control. Arbitrary failure A server may produce arbitrary responses at arbitrary times. Also known as Byzantine failure

6 Failure Masking by Redundancy Strategy: hide the occurrence of failure from other processes using redundancy. Three main types: Information Redundancy add extra bits to allow for error detection/recovery e.g., Hamming codes Time Redundancy perform operation and, if needs be, perform it again. e.g., Transactions: in the event of ABORT, retry the transaction Physical Redundancy add extra (duplicate) hardware and/or software to the system e.g., replicated DNS servers.

7 Physical Redundancy Redundancy Hardware e.g., Triple Modular Redundancy Software e.g., replicate processes A1 V1 B1 V4 C1 V7 A2 V2 B2 V5 C2 V8 A3 V3 B3 V6 C3 V9

8 Process Groups Processes can be made fault tolerant by arranging to have a group of processes, with each member of the group being identical. A message sent to the group is delivered to all of the copies of the process (the group members), and then only one of them performs the required service. Deal with a collection of processes as a single abstraction; need not know how many processes there are or where they are. If one of the processes fails, it is assumed that one of the others will still be able to function (and service any pending request or operation). Flat versus hierarchical groups As long as N/2+1 communicating, then we have quorum That is, we know no other processes can be part of such a group. Thus we can claim majority rule

9 Passive Replication Two servers: one active, one passive Client accesses server through perfect switch, which selects the server based on health status Health-status monitoring must also be perfect Or proactive State is typically shared between the servers by means of raid array This amounts to a second interface Client to server Server to server Which in turn makes servers invoking other servers more complex No ability to load share in this approach Requires different code for takeover from normal operation Client has to react to server fail-over Very widely deployed RAID Health Status Server 1 Server 2 Switch Client

10 Active Replication All processes execute identical code, processing all messages in whatever sequence is necessary to ensure that they remain in approximately the same state. Neither clients nor servers require modification, provided the message middleware ensures any causal ordering required to guarantee matching end states at the servers. This extends to servers invoking other servers. (Presuming that we never lose contact with N/2+1 nodes) Picture courtesy of Chris Hobbs

11 Virtual Synchrony VS is one approach to active replication. It produces the same outcome as if the system were synchronous i.e. all messages and replies were atomic Except that approach is very expensive, hence the desire to be virtually synchronous, but not actually synchronous Actual synchronous are less error prone, but operate at the speed of the slowest processor We avoid the need for slow down by use of quorums of processes Picture courtesy of Chris Hobbs A reliable multicast is virtually synchronous if a message multicast to a group is delivered to each nonfaulty process. If sender crashes during the multicast, the message may be delivered to all remaining processes, or ignored by each one of them

12 The Goal of Agreement Algorithms To have all non-faulty processes reach concensus on some issue (quickly). Even with non-faulty processes, agreement between even two processes is not possible in the face of unreliable (or un-time-bounded) communication. Except this does not mean what you think it means. When it says not possible what it means is not always possible Specifically, the communication system can always maliciously defeat any attempt at agreement. If we have a message system with a p% failure rate, although we can send sufficient messages to overcome this loss rate with arbitrary degree of assurance, we can never prove that the messages did actually get through. See the two-army problem In practice, failure does not occur in the ways required of the proofs Process failure need to be guarded against The network can (subject to partitioning) be made arbitrary reliable Sufficient that our engineering judgement will recognize that other problems will occur before we have to worry about message loss

13 The Byzantine Generals Problem n generals, all of which head armies m of which are traitors; trying to prevent loyal generals from reaching agreement by feeding them incorrect and contradictory information Can loyal generals reach agreement? (need to exchange troops strength) A similar occurance can surface in a distributed system, and is known as byzantine (or arbitrary) failure. Though it may not be malicious. It may simply be defective code Question: how do we deal with such malicious group members within a distributed system? We will assume communication is correct

14 Agreement in Faulty Systems a) The generals announce their troop strengths (in units of 1 kilosoldiers) to the other members of the group by sending a message. b) The vectors that each general assembles based on (a), each general knows their own strength. They then send their vectors to all the other generals. c) The vectors that each general receives in step 3. It is clear to all that General 3 is the traitor. d) Each general examines the ith element of the received vectors. Looks for majority

15 Agreement in Faulty Systems Why 3N+1? The same algorithm as in previous slide, except now with 2 loyal generals and 1 traitor. Note: It is no longer possible to determine the majority value in each column, and the algorithm has failed to produce agreement. It has been shown that for the algorithm to work properly, more than two-thirds of the processes have to be working correctly. That is: if there are M faulty processes, we need 2M + 1 functioning processes to reach agreement. * [Lamport et al. 82] In a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present * [Fisher et al. 85] if messages cannot be delivered within a known, finite time, no agreement possible evn if one process is faulty

16 Distributed COMMIT Atomic multicasting is a special case of distributed atomic action, where the action is the delivery of the message. Distributed commit also satisfies the requirement for distributed atomic action, and has other uses. There are three types of commit protocol single-phase two-phase three-phase commit.

17 One-Phase Commit An elected co-ordinator tells all the other processes to perform the operation in question. But, what if a process cannot perform the operation? There s no way to tell the coordinator! Whoops

18 Two-Phase Commit First developed in 1978 Summarized: GET READY, OK, GO AHEAD. The coordinator sends a VOTE_REQUEST message to all group members. The group member returns VOTE_COMMIT if it can commit locally, otherwise VOTE_ABORT. All votes are collected by the coordinator. A GLOBAL_COMMIT is sent if all the group members voted to commit. If one group member voted to abort, a GLOBAL_ABORT is sent. The group members then COMMIT or ABORT based on the last message received from the coordinator.

19 Two-Phase Commit Finite State Machines a) The finite state machine for the coordinator. b) The finite state machine for a participant (group member). c) Note: all state transitions are logged to stable storage

20 Problem with Two-Phase Commit It can lead to both the coordinator and the group members blocking, which may lead to deadlock. If the coordinator crashes, the group members will not be able to reach a final decision, and they may, therefore, block until the coordinator recovers Note that co-ordinator failure is likely rare We already know that process, machine, and network failure should be treated as abnormal conditions, and so we should not cause our system undue cost under normal operating conditions to ensure a problem (that can be solved by occassional manual intervention) does not occur. Note also that there is a failure small window during which this failure has to occur: after INIT and before final COMMIT Note also that all co-ordinator operations will be logged, so we can have a passive failover with the operations written to a twinned RAID. Thus, the passive server can take over the operation. What if another node crashes? That said, Two-Phase Commit is known as a blocking-commit protocol for reason of co-ordinator crash leading to blocking. The (alternate) solution? The Three-Phase Commit Protocol.

21 Three-Phase Commit a) Finite state machine for the coordinator. b) Finite state machine for a group member. Main point: although 3PC is generally regarded as better than 2PC, it is not applied often in practice, as the conditions under which 2PC blocks rarely occur. 2PC is often not implemented because it is too costly?? 3PC can itself block under various circumstances

22 Error Recovery Once a error has been detected, it is essential that the system recovers to a correct state. Recovery from an error is fundamental to fault tolerance Without it, the system will fail Two main forms of recovery: Backward Recovery: Return the system to some previous correct state (using checkpoints), then continue executing. Forward Recovery: Bring the system into a correct state, from which it can then continue to execute. Typically a measure, dependent on damage assessment

23 Forward and Backward Recovery Disadvantage of Backward Recovery: Checkpointing can be very expensive especially when errors are very rare Despite the cost, backward recovery is implemented more often than forward recovery. This is because it is a general mechanism, and thus easily usable Forward recovery is often not generalizable Disadvantage of Forward Recovery: In order to work, all potential errors need to be accounted for upfront. i.e., during the design When an error occurs, the recovery mechanism then knows what to do to bring the system forward to a correct state. There is no general abstraction for forward recovery as exists for backward recovery

24 Recovery Example Consider as an example: Reliable Communications. Retransmission of a lost/damaged packet is an example of a backward-recovery mechanism. When a lost/damaged packet can be reconstructed as a result of the receipt of other successfully delivered packets, then this is known as Erasure Correction. This is an example of a forward-recovery mechanism.

25 Rebuild from redundant data e.g., database indexes Backward Recovery State restoration e.g., Watchdog timer causing reboot Recovery Line, composed of several Recovery points Synchronization of creation of recovery points is critical And/or capturing of messages (a la global snapshot algorithm)

26 Recovery Measures Recoverable vs. unrecoverable Some things are inherently unrecoverable e.g., printer output There may be compensation for unrecoverable actions Techniques for maintaining recovery data Checkpoint Audit trail (See 'message logging' in Textbook) Careful replacement Differential file Recovery cache

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Today: Fault Tolerance. Replica Management

Today: Fault Tolerance. Replica Management Today: Fault Tolerance Failure models Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery

More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

Today: Fault Tolerance. Failure Masking by Redundancy

Today: Fault Tolerance. Failure Masking by Redundancy Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery Checkpointing

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

Fault Tolerance. Distributed Software Systems. Definitions

Fault Tolerance. Distributed Software Systems. Definitions Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Fault Tolerance. Chapter 7

Fault Tolerance. Chapter 7 Fault Tolerance Chapter 7 Basic Concepts Dependability Includes Availability Reliability Safety Maintainability Failure Models Type of failure Crash failure Omission failure Receive omission Send omission

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic

More information

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Module 8 Fault Tolerance CS655! 8-1!

Module 8 Fault Tolerance CS655! 8-1! Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

Fault Tolerance 1/64

Fault Tolerance 1/64 Fault Tolerance 1/64 Fault Tolerance Fault tolerance is the ability of a distributed system to provide its services even in the presence of faults. A distributed system should be able to recover automatically

More information

Module 8 - Fault Tolerance

Module 8 - Fault Tolerance Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

Fault Tolerance. Fall 2008 Jussi Kangasharju

Fault Tolerance. Fall 2008 Jussi Kangasharju Fault Tolerance Fall 2008 Jussi Kangasharju Chapter Outline Fault tolerance Process resilience Reliable group communication Distributed commit Recovery 2 Basic Concepts Dependability includes Availability

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

MYE017 Distributed Systems. Kostas Magoutis

MYE017 Distributed Systems. Kostas Magoutis MYE017 Distributed Systems Kostas Magoutis magoutis@cse.uoi.gr http://www.cse.uoi.gr/~magoutis Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known

More information

Distributed Systems

Distributed Systems 15-440 Distributed Systems 11 - Fault Tolerance, Logging and Recovery Tuesday, Oct 2 nd, 2018 Logistics Updates P1 Part A checkpoint Part A due: Saturday 10/6 (6-week drop deadline 10/8) *Please WORK hard

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

MYE017 Distributed Systems. Kostas Magoutis

MYE017 Distributed Systems. Kostas Magoutis MYE017 Distributed Systems Kostas Magoutis magoutis@cse.uoi.gr http://www.cse.uoi.gr/~magoutis Message reception vs. delivery The logical organization of a distributed system to distinguish between message

More information

Today: Fault Tolerance. Reliable One-One Communication

Today: Fault Tolerance. Reliable One-One Communication Today: Fault Tolerance Reliable communication Distributed commit Two phase commit Three phase commit Failure recovery Checkpointing Message logging Lecture 17, page 1 Reliable One-One Communication Issues

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit

Fault Tolerance. o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication. o Distributed Commit Fault Tolerance o Basic Concepts o Process Resilience o Reliable Client-Server Communication o Reliable Group Communication o Distributed Commit -1 Distributed Commit o A more general problem of atomic

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 07 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Implementation Issues. Remote-Write Protocols

Implementation Issues. Remote-Write Protocols Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

Distributed Systems 23. Fault Tolerance

Distributed Systems 23. Fault Tolerance Distributed Systems 23. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 4/20/2011 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors

More information

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

Fault Tolerance. The Three universe model

Fault Tolerance. The Three universe model Fault Tolerance High performance systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Last Class:Consistency Semantics. Today: More on Consistency

Last Class:Consistency Semantics. Today: More on Consistency Last Class:Consistency Semantics Consistency models Data-centric consistency models Client-centric consistency models Eventual Consistency and epidemic protocols Lecture 16, page 1 Today: More on Consistency

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

(Pessimistic) Timestamp Ordering. Rules for read and write Operations. Read Operations and Timestamps. Write Operations and Timestamps

(Pessimistic) Timestamp Ordering. Rules for read and write Operations. Read Operations and Timestamps. Write Operations and Timestamps (Pessimistic) stamp Ordering Another approach to concurrency control: Assign a timestamp ts(t) to transaction T at the moment it starts Using Lamport's timestamps: total order is given. In distributed

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus CIS 505: Software Systems Lecture Note on Consensus Insup Lee Department of Computer and Information Science University of Pennsylvania CIS 505, Spring 2007 Concepts Dependability o Availability ready

More information

G1 m G2 Attack at dawn? e e e e 1 S 1 = {0} End of round 1 End of round 2 2 S 2 = {1} {1} {0,1} decide -1 3 S 3 = {1} { 0,1} {0,1} decide -1 white hats are loyal or good guys black hats are traitor

More information

(Pessimistic) Timestamp Ordering

(Pessimistic) Timestamp Ordering (Pessimistic) Timestamp Ordering Another approach to concurrency control: Assign a timestamp ts(t) to transaction T at the moment it starts Using Lamport's timestamps: total order is given. In distributed

More information

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components

Fault Tolerance. it continues to perform its function in the event of a failure example: a system with redundant components Fault Tolerance To avoid disruption due to failure and to improve availability, systems are designed to be fault-tolerant Two broad categories of fault-tolerant systems are: systems that mask failure it

More information

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018

TWO-PHASE COMMIT ATTRIBUTION 5/11/2018. George Porter May 9 and 11, 2018 TWO-PHASE COMMIT George Porter May 9 and 11, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 08 (version October 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 08 (version October 5, 2007) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Eventual Consistency. Eventual Consistency

Eventual Consistency. Eventual Consistency Eventual Consistency Many systems: one or few processes perform updates How frequently should these updates be made available to other read-only processes? Examples: DNS: single naming authority per domain

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment.

Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. SOFTWARE ENGINEERING SOFTWARE RELIABILITY Software reliability is defined as the probability of failure-free operation of a software system for a specified time in a specified environment. LEARNING OBJECTIVES

More information

Distributed Operating Systems

Distributed Operating Systems 2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System

More information

Problem: if one process cannot perform its operation, it cannot notify the. Thus in practise better schemes are needed.

Problem: if one process cannot perform its operation, it cannot notify the. Thus in practise better schemes are needed. Committing Transactions T 1 T T2 2 T T3 3 Clients T n Transaction Manager Transaction Manager (Coordinator) Allocation of transaction IDs (TIDs) Assigning TIDs with Coordination of commitments, aborts,

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

System Models for Distributed Systems

System Models for Distributed Systems System Models for Distributed Systems INF5040/9040 Autumn 2015 Lecturer: Amir Taherkordi (ifi/uio) August 31, 2015 Outline 1. Introduction 2. Physical Models 4. Fundamental Models 2 INF5040 1 System Models

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Holger Karl Computer Networks Group Universität Paderborn Goal of this chapter Apart from issues in distributed time and resulting

More information

CS 347: Distributed Databases and Transaction Processing Notes07: Reliable Distributed Database Management

CS 347: Distributed Databases and Transaction Processing Notes07: Reliable Distributed Database Management CS 347: Distributed Databases and Transaction Processing Notes07: Reliable Distributed Database Management Hector Garcia-Molina CS 347 Notes07 1 Reliable distributed database management Reliability Failure

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Fault-Tolerant Computer Systems ECE 60872/CS Recovery

Fault-Tolerant Computer Systems ECE 60872/CS Recovery Fault-Tolerant Computer Systems ECE 60872/CS 59000 Recovery Saurabh Bagchi School of Electrical & Computer Engineering Purdue University Slides based on ECE442 at the University of Illinois taught by Profs.

More information

Practical Byzantine Fault

Practical Byzantine Fault Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

CSE 486/586 Distributed Systems

CSE 486/586 Distributed Systems CSE 486/586 Distributed Systems Failure Detectors Slides by: Steve Ko Computer Sciences and Engineering University at Buffalo Administrivia Programming Assignment 2 is out Please continue to monitor Piazza

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 6: Reliability Reliable Distributed DB Management Reliability Failure models Scenarios CS 347 Notes 6 2 Reliability Correctness Serializability

More information

Distributed Computing. CS439: Principles of Computer Systems November 19, 2018

Distributed Computing. CS439: Principles of Computer Systems November 19, 2018 Distributed Computing CS439: Principles of Computer Systems November 19, 2018 Bringing It All Together We ve been studying how an OS manages a single CPU system As part of that, it will communicate with

More information

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra

Today CSCI Recovery techniques. Recovery. Recovery CAP Theorem. Instructor: Abhishek Chandra Today CSCI 5105 Recovery CAP Theorem Instructor: Abhishek Chandra 2 Recovery Operations to be performed to move from an erroneous state to an error-free state Backward recovery: Go back to a previous correct

More information

Distributed Systems Question Bank UNIT 1 Chapter 1 1. Define distributed systems. What are the significant issues of the distributed systems?

Distributed Systems Question Bank UNIT 1 Chapter 1 1. Define distributed systems. What are the significant issues of the distributed systems? UNIT 1 Chapter 1 1. Define distributed systems. What are the significant issues of the distributed systems? 2. What are different application domains of distributed systems? Explain. 3. Discuss the different

More information

Consensus in Distributed Systems. Jeff Chase Duke University

Consensus in Distributed Systems. Jeff Chase Duke University Consensus in Distributed Systems Jeff Chase Duke University Consensus P 1 P 1 v 1 d 1 Unreliable multicast P 2 P 3 Consensus algorithm P 2 P 3 v 2 Step 1 Propose. v 3 d 2 Step 2 Decide. d 3 Generalizes

More information

ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases

ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases We talked about transactions and how to implement them in a single-node database. We ll now start looking into how to

More information

RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E)

RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E) RECOVERY CHAPTER 21,23 (6/E) CHAPTER 17,19 (5/E) 2 LECTURE OUTLINE Failures Recoverable schedules Transaction logs Recovery procedure 3 PURPOSE OF DATABASE RECOVERY To bring the database into the most

More information

Synchronization. Chapter 5

Synchronization. Chapter 5 Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Fundamentals Fall 2013 1 basics 2 basics Single process int i; i=i+1; 1 CPU - Steps are strictly sequential - Program behavior & variables state determined by sequence of operations

More information

TSW Reliability and Fault Tolerance

TSW Reliability and Fault Tolerance TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software

More information

Distributed System. Gang Wu. Spring,2018

Distributed System. Gang Wu. Spring,2018 Distributed System Gang Wu Spring,2018 Lecture4:Failure& Fault-tolerant Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the

More information

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques : Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.

More information

CSE 5306 Distributed Systems. Synchronization

CSE 5306 Distributed Systems. Synchronization CSE 5306 Distributed Systems Synchronization 1 Synchronization An important issue in distributed system is how processes cooperate and synchronize with one another Cooperation is partially supported by

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Synchronization Jia Rao http://ranger.uta.edu/~jrao/ 1 Synchronization An important issue in distributed system is how process cooperate and synchronize with one another Cooperation

More information

Distributed Computing. CS439: Principles of Computer Systems November 20, 2017

Distributed Computing. CS439: Principles of Computer Systems November 20, 2017 Distributed Computing CS439: Principles of Computer Systems November 20, 2017 Last Time Network Programming: Sockets End point of communication Identified by (IP address : port number) pair Client-Side

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Dependability tree 1

Dependability tree 1 Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques

More information

The Design and Implementation of a Fault-Tolerant Media Ingest System

The Design and Implementation of a Fault-Tolerant Media Ingest System The Design and Implementation of a Fault-Tolerant Media Ingest System Daniel Lundmark December 23, 2005 20 credits Umeå University Department of Computing Science SE-901 87 UMEÅ SWEDEN Abstract Instead

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures

More information