Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

1 Fault Tolerance Part II
CS403/534 Distributed Systems
Erkay Savas, Sabanci University

2 Reliable Group Communication
Reliable multicasting: a message sent to a process group should be delivered to each member of the group.
Assumptions for simplicity:
- There is agreement on who is a member of the group.
- Processes do not fail.
- Processes do not join or leave the group while communication is going on.
What does reliable multicasting mean when these assumptions do not hold?
- A message sent to a process group should be delivered to each current nonfaulty member of the group.

3 Basic Reliable-Multicasting Schemes
[Figure, two panels. In the first, the sender keeps M25 in its history buffer and multicasts it; three receivers have Last = 24 and one has Last = 23. In the second, the receivers that now have Last = 25 return ACK 25, while the receiver with Last = 23 reports that it missed message 24.]
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.

4 Scalability in Reliable Multicasting
Problem 1: The sender is flooded with ACK messages when there are many receivers (feedback implosion).
- Solution: Receivers return only a negative ACK (NACK), and only when they notice that they missed a message.
Problem 2: With NACK-only feedback the sender cannot tell when every receiver has a message, so it has to keep each message in its history buffer forever (or at least for a long time).
- Solution: Put an expiration time on messages in the history buffer.
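The NACK-plus-expiration idea can be illustrated with a minimal sketch. The class names `Sender` and `Receiver`, the `ttl` parameter, and the explicit `now` timestamps are assumptions made for illustration; they do not come from the slides.

```python
class Sender:
    """Keeps sent messages in a history buffer until they expire."""

    def __init__(self, ttl):
        self.ttl = ttl          # how long a message stays retransmittable
        self.history = {}       # sequence number -> (payload, time sent)
        self.next_seq = 0

    def multicast(self, payload, now):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (payload, now)
        return seq, payload

    def handle_nack(self, seq, now):
        # Drop expired entries first, then retransmit if still buffered.
        self.history = {s: (p, t) for s, (p, t) in self.history.items()
                        if now - t < self.ttl}
        entry = self.history.get(seq)
        return None if entry is None else (seq, entry[0])


class Receiver:
    """Sends no ACKs; reports only the sequence numbers it is missing."""

    def __init__(self):
        self.last = -1          # highest sequence number delivered in order
        self.pending = {}       # out-of-order messages waiting for a gap to fill
        self.delivered = []

    def receive(self, seq, payload):
        """Return the sequence numbers to NACK (empty list if none)."""
        if seq <= self.last:
            return []           # duplicate
        self.pending[seq] = payload
        while self.last + 1 in self.pending:
            self.last += 1
            self.delivered.append(self.pending.pop(self.last))
        return [s for s in range(self.last + 1, seq) if s not in self.pending]
```

A receiver that sees message 2 right after message 0 would return `[1]`, and the sender can answer that NACK only while message 1 is still within its expiration time.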

5 Nonhierarchical Feedback Control
Feedback suppression: the goal is to reduce the number of feedback messages returned to the sender.
SRM (Scalable Reliable Multicasting) protocol:
- A process that notices a missing message multicasts its feedback (a NACK) to the group after waiting for a random amount of time.
- Receivers that see this NACK suppress their own feedback.
[Figure: four receivers schedule a NACK with timers T = 3, T = 5, T = 1 and T = 4.]
Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others.
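As a rough sketch of feedback suppression, the function below takes the timer values each receiver has drawn and reports who actually ends up sending a NACK. The function name `nack_senders` and the `propagation_delay` parameter are hypothetical; the delay models the fact that a receiver whose timer fires before the first NACK reaches it cannot be suppressed.

```python
def nack_senders(timers, propagation_delay=0.0):
    """timers maps receiver name -> scheduled NACK time.

    The earliest timer fires first. Any receiver whose own timer fires
    before that NACK has propagated to it still sends a NACK as well.
    """
    first_fire = min(timers.values())
    return sorted(r for r, t in timers.items()
                  if t == first_fire or t < first_fire + propagation_delay)


# Timer values taken from the figure on this slide.
timers = {"R1": 3, "R2": 5, "R3": 1, "R4": 4}
print(nack_senders(timers))                          # only the earliest receiver
print(nack_senders(timers, propagation_delay=2.5))   # suppression is imperfect
```

With instantaneous propagation a single NACK is sent; with a nonzero delay, duplicates are possible, which is why suppression reduces rather than eliminates feedback.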

6 Hierarchical Feedback Control (1)
Essence: Organize processes into subgroups and appoint a local coordinator for each subgroup.
- For simplicity, assume there is only one sender.
- Set up a tree in which the subgroup containing the sender is the root.
- The local coordinator handles retransmission requests from receivers within its subgroup.
- The local coordinator keeps a history buffer.
- If the local coordinator itself misses a message, it asks the coordinator of its parent subgroup to retransmit the message.

7 Hierarchical Feedback Control (2)
[Figure: a tree of local-area networks. The sender S is in the root subgroup; each subgroup has a coordinator C and receivers R.]
The essence of hierarchical reliable multicasting.

8 Atomic Multicast
Goal: To achieve reliable multicasting in the presence of process failures.
- Guarantees that a message is delivered either to all processes or to none at all.
- All messages must be delivered in the same order to all processes.
- Some processes in the group may crash.
- To achieve reliable atomic multicasting, all nonfaulty members must agree on the group membership; e.g. a crashed process is no longer a group member.
- When the process recovers, it is forced to join the group again.
- Joining the group requires that the state of the process be brought up to date.

9 Receiving vs. Delivering Messages
The logical organization of a distributed system to distinguish between message receipt and message delivery:
- Application: the message is delivered to the application.
- Communication layer: the message is received by this layer and buffered here until it can be delivered to the application.
- Local OS and network: the message comes in from the network.

10 Message Ordering (1)
Four different orderings in multicast are distinguished:
1. Unordered (reliable) multicast
2. FIFO-ordered multicast
3. Causally-ordered multicast
4. Totally-ordered multicast

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

11 Message Ordering (2)

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
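FIFO-ordered delivery can be sketched with a per-sender hold-back queue in the communication layer: a message is received as soon as it arrives but delivered only once all earlier messages from the same sender have been delivered. The class name `FifoDelivery` and the per-sender sequence numbers are assumptions for illustration.

```python
from collections import defaultdict


class FifoDelivery:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_expected = defaultdict(int)   # sender -> next seq to deliver
        self.held = defaultdict(dict)           # sender -> {seq: message}
        self.delivered = []                     # what the application sees

    def receive(self, sender, seq, message):
        # Buffer the message, then deliver as many in-order messages as possible.
        self.held[sender][seq] = message
        while self.next_expected[sender] in self.held[sender]:
            n = self.next_expected[sender]
            self.delivered.append(self.held[sender].pop(n))
            self.next_expected[sender] = n + 1


# P3 from the table above: m3 and m4 come from P4, m1 and m2 from P1.
p3 = FifoDelivery()
for sender, seq, msg in [("P4", 0, "m3"), ("P1", 0, "m1"),
                         ("P1", 1, "m2"), ("P4", 1, "m4")]:
    p3.receive(sender, seq, msg)
print(p3.delivered)
```

Note that FIFO ordering constrains only messages from the same sender, so P2 and P3 may interleave m1 and m3 differently, as the table shows.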

12 Message Ordering (3)
Six different versions of reliable multicasting.

Multicast                  Basic message ordering     Total-ordered delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes

13 Virtual Synchrony (1)
Group view: the list of processes to which a multicast message is delivered (the delivery list), denoted G.
- Each process on that list should have the same group view.
- A view change vc may occur (e.g. a process joins or leaves the group) during the transmission of a message m.
- The message m must be delivered to each nonfaulty process in G before the view change comes into effect.
- Otherwise, the message m must not be delivered at all.

14 Virtual Synchrony (2)
For example:
- A process multicasts a message m to a group of processes.
- Right after that, a process leaves or joins the group.
- Another process notices the view change and multicasts a view change message (vc) to the group.
- Any message sent in view G must be delivered to each correct process before the view change message is delivered.
A reliable multicast with this property is said to be virtually synchronous.
In other words, a view change acts as a barrier across which no multicast can pass.

15 Virtual Synchrony (3)
A message sent to view G can be delivered only to processes in G, and is discarded by successive views.
[Figure: timelines of P1 to P4. P1 joins the group; reliable multicasts are exchanged in view G = {P1, P2, P3, P4}; P3 crashes, giving G = {P1, P2, P4}; P3 later rejoins.]
The principle of virtually synchronous multicast.

16 Virtual Synchrony: Examples
[Figure: four timeline diagrams of processes P, Q and R, each showing multicasts relative to a view change (vc) between successive views G.]

17 Implementing Virtual Synchrony (1)
Isis system (a fault-tolerant distributed system):
- Reliable point-to-point communication facilities exist, and ordering is assumed to be FIFO.
- Question: Can TCP provide reliable FIFO-ordered point-to-point communication?
- If a message m has been received by all members in G, m is said to be stable.
- Only stable messages are allowed to be delivered. An unstable message is kept in a buffer in the communication layer.
Assume:
- The current view is G_i and the next view G_{i+1} is to be installed.
- G_i and G_{i+1} differ by one process (without loss of generality).

18 Implementing Virtual Synchrony (2)
For example:
- The process that notices a view change (e.g. a process crashes, or a process joins the group, probably after recovery) sends a view change message to the other nonfaulty processes.
- Any other process P notices the view change when it receives a view change message.
- P first forwards all unstable messages in its buffer to every process in G_{i+1}, using reliable point-to-point communication.
- Afterwards, it multicasts a flush message.
- After P has received a flush message from every other process, it can safely install the new view.
- It is also possible to elect a coordinator to forward all unstable messages.
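The flush steps described on this slide can be sketched as a small in-memory simulation. It assumes reliable, instantaneous delivery between surviving processes; the names `Member` and `install_view` are hypothetical and are not taken from Isis.

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.unstable = set()    # messages not yet known to be received by all
        self.delivered = set()
        self.flushes = set()     # names of processes whose flush has arrived
        self.view = None


def install_view(members, new_view):
    """Run the flush protocol among the members that belong to new_view."""
    survivors = [m for m in members if m.name in new_view]
    # Step 1: every survivor forwards its unstable messages to the new view.
    for p in survivors:
        for q in survivors:
            q.delivered |= p.unstable
    # Step 2: every survivor multicasts a flush message.
    for p in survivors:
        for q in survivors:
            if q is not p:
                q.flushes.add(p.name)
    # Step 3: install the view once a flush from every other member arrived.
    for p in survivors:
        if p.flushes == set(new_view) - {p.name}:
            p.unstable.clear()
            p.view = frozenset(new_view)


p4, p6, p7 = Member("P4"), Member("P6"), Member("P7")
p6.unstable.add("m")                 # P6 holds a message P4 never received
install_view([p4, p6, p7], {"P4", "P6"})   # P7 has crashed
print(p4.delivered, p4.view)
```

After the run, P4 has delivered m before installing the new view, which is the barrier property virtual synchrony requires.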

19 Implementing Virtual Synchrony (3)
[Figure, three panels; arrows mark unstable messages, flush messages and the view change (vc).]
a) Process 4 notices that process 7 has crashed and sends a view change.
b) Process 6 sends out all its unstable messages, followed by a flush message.
c) Process 6 installs the new view when it has received a flush message from everyone else.

20 Distributed Commit
Essential issue: an operation is performed either by every member of a process group or by none at all (e.g. committing a transaction).
Distributed commit problem:
- A coordinator is present to initiate the commit.
- One-phase commit
- Two-phase commit
- Three-phase commit

21 Two-Phase Commit - 2PC (1)
Consider a distributed transaction involving a number of processes, each running on a different machine.
- Phase 1a: The coordinator sends VOTE_REQUEST to the participants.
- Phase 1b: When a participant receives VOTE_REQUEST, it returns either VOTE_COMMIT or VOTE_ABORT to the coordinator.
- Phase 2a: The coordinator collects all votes; if all are VOTE_COMMIT it sends GLOBAL_COMMIT to all participants; otherwise it sends GLOBAL_ABORT.
- Phase 2b: Each participant waits for GLOBAL_COMMIT or GLOBAL_ABORT and acts accordingly.
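The two phases can be sketched from the coordinator's point of view as follows. This is a simplified illustration without real networking: the function name `run_2pc` is hypothetical, and a vote of `None` stands in for a participant that did not answer before the timeout.

```python
def run_2pc(votes):
    """votes maps participant -> "VOTE_COMMIT", "VOTE_ABORT", or None (timeout).

    Returns the global decision, the coordinator's log, and the final
    state each participant moves to.
    """
    log = ["START_2PC"]
    # Phase 1: VOTE_REQUEST has been sent; `votes` holds the replies.
    # Phase 2: commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)            # log the decision before multicasting it
    final = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"
    return decision, log, {p: final for p in votes}


print(run_2pc({"A": "VOTE_COMMIT", "B": "VOTE_COMMIT"})[0])
print(run_2pc({"A": "VOTE_COMMIT", "B": "VOTE_ABORT"})[0])
```

A single VOTE_ABORT or a single missing vote is enough to abort the whole transaction.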

22 2PC (2)
a) The finite state machine for the coordinator in 2PC (transitions are labelled message received / message sent):
- INIT -> WAIT: Commit / Vote-request
- WAIT -> ABORT: Vote-abort / Global-abort
- WAIT -> COMMIT: Vote-commit / Global-commit
b) The finite state machine for a participant:
- INIT -> READY: Vote-request / Vote-commit
- INIT -> ABORT: Vote-request / Vote-abort
- READY -> ABORT: Global-abort / ACK
- READY -> COMMIT: Global-commit / ACK

23 2PC Failing Participant (1)
How does this affect other participants?
INIT: No problem.
READY: A participant P is waiting for either GLOBAL_COMMIT or GLOBAL_ABORT. If the coordinator crashes before its message reaches P, P cannot know what to do.
1. It may block until the coordinator recovers.
2. It can ask another participant Q. The decision depends on which state Q is in:
   i. INIT: both can abort.
   ii. COMMIT: both can commit.
   iii. ABORT: both abort.
   iv. READY: contact another participant. If all the participants it contacts are in this state, they have to wait until the coordinator recovers (apparently the coordinator has failed).

24 2PC Failing Participant (2)

State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant

Actions taken by a participant P when residing in state READY and having contacted another participant Q.
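The table translates directly into a lookup. In this sketch the names `ACTION_BY_Q_STATE` and `resolve_ready` are hypothetical, and `"BLOCK"` is used to mean that every contacted participant was in READY, so P has to wait for the coordinator.

```python
# State of the contacted participant Q -> action taken by P (None: ask another).
ACTION_BY_Q_STATE = {
    "COMMIT": "COMMIT",
    "ABORT": "ABORT",
    "INIT": "ABORT",
    "READY": None,
}


def resolve_ready(states_of_others):
    """Decide what P (in state READY) does after contacting other participants."""
    for state in states_of_others:
        action = ACTION_BY_Q_STATE[state]
        if action is not None:
            return action
    return "BLOCK"      # everyone is in READY: wait for the coordinator


print(resolve_ready(["READY", "INIT"]))
print(resolve_ready(["READY", "READY"]))
```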

25 2PC - Steps Taken by Coordinator
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

26 2PC - Steps Taken by a Participant
    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;  /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }

27 2PC - When a Participant is Asked for a Decision
Actions for handling decision requests:  /* executed by separate thread */
    while true {
        wait until any incoming DECISION_REQUEST is received;  /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;  /* participant remains blocked */
    }
Steps taken for handling incoming decision requests.

28 2PC Wait for the Coordinator to Recover
All participants need to block until the coordinator recovers when:
- All participants have received and processed the VOTE_REQUEST from the coordinator (i.e. they are all in state READY), and in the meantime the coordinator has crashed.
In that case, the participants cannot cooperatively decide on the final action to take (COMMIT or ABORT).
- Not all participants can necessarily be contacted (perhaps some have crashed as well), and an uncontacted participant may be in (or recover to) state INIT, ABORT or COMMIT.
This is why another protocol is needed to avoid blocking.

29 Three-Phase Commit (3PC)
Avoids blocking processes in the presence of fail-stop crashes.
- Phase 1a: The coordinator sends VOTE_REQUEST to the participants.
- Phase 1b: When a participant receives VOTE_REQUEST, it returns either VOTE_COMMIT or VOTE_ABORT to the coordinator.
- Phase 2a: The coordinator collects all votes; if all are VOTE_COMMIT it sends PREPARE to all participants; otherwise it sends ABORT.
- Phase 2b: Each participant waits for PREPARE or ABORT.

30 3PC (2)
- Phase 3a (prepare to commit): The coordinator waits until all participants have ACKed (READY-COMMIT) receipt of the PREPARE message, and then sends COMMIT to all.
- Phase 3b (prepare to commit): Each participant waits for COMMIT.
The states of the coordinator and of each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either the COMMIT or the ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a transition to the COMMIT state can be made.

31 3PC (3)
[Figure: two state diagrams with states INIT, WAIT, PRECOMMIT, ABORT and COMMIT; transitions are labelled with Commit, Vote-request, Vote-commit, Vote-abort, Prepare-commit, Ready-commit, Global-commit, Global-abort and ACK.]
a) Finite state machine for the coordinator in 3PC.
b) Finite state machine for a participant.

32 3PC Failing Participant (1)
Coordinator blocks:
- WAIT: The coordinator sends GLOBAL_ABORT after a timeout.
- PRECOMMIT: On a timeout, it concludes that one of the participants has crashed (and that participant is known to have voted COMMIT); it sends GLOBAL_COMMIT to the remaining participants.
Participant P blocks:
- INIT: Abort on a timeout.
- READY: On a timeout, P contacts another participant Q. If Q is still in INIT, they can safely abort (since no other participant can be in state PRECOMMIT).

33 3PC Failing Participant (2)
Participant P blocks (cont.):
READY: On a timeout, P contacts other participants.
1. If each of the participants P contacted is in state READY, the transaction should be aborted (an uncontacted process may be in INIT). Even if one of the participants not contacted by P is in state PRECOMMIT, it can still abort.
2. If all contacted processes are in state PRECOMMIT, the transaction can safely commit.
3. If a contacted process is in state ABORT (or COMMIT), then P moves to the corresponding state.
PRECOMMIT: A decision can be taken.
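The rules for a participant that times out in state READY can be written down as a small decision function. The name `terminate_ready` is hypothetical, and the function returns `None` for combinations the slides do not cover (for example a mix of READY and PRECOMMIT); that fallback is an assumption of this sketch rather than part of the protocol as presented.

```python
def terminate_ready(contacted_states):
    """Decision for a participant P in READY, given the states it could observe."""
    states = set(contacted_states)
    if "COMMIT" in states:
        return "COMMIT"          # rule 3: follow a process that already decided
    if "ABORT" in states or "INIT" in states:
        return "ABORT"           # rule 3, and the INIT case from the previous slide
    if states == {"READY"}:
        return "ABORT"           # rule 1
    if states == {"PRECOMMIT"}:
        return "COMMIT"          # rule 2
    return None                  # not covered by the rules listed on the slides


print(terminate_ready(["READY", "READY"]))
print(terminate_ready(["PRECOMMIT", "PRECOMMIT"]))
```

Unlike the 2PC table, the all-READY case leads to a decision instead of blocking, which is the point of the extra PRECOMMIT state.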

34 Recovery
Once a failure occurs, it is essential that the failing process be able to recover to a correct state.
- What does recovering to a correct state actually mean?
- How can the state of a distributed system be recorded and recovered?
Methods:
- Checkpointing
- Message logging

35 Recovery: Background
Essence: When a failure occurs, we need to bring the system into an error-free state.
- Backward recovery: Bring the system from its present erroneous state back into a previously correct state. From time to time, the system state (or at least part of it) must be recorded (checkpointed) in persistent storage.
- Forward recovery: Instead of returning to a previously checkpointed state, find a correct new state from which the system can continue to execute.
In practice: Backward error recovery is by far the more widely applied.

36 Forms of Recovery: Example
- Backward recovery: Retransmitting a lost message.
- Forward recovery: Constructing the missing packets from successfully delivered packets, using (n, k) block erasure codes.
Forward recovery requires that error types be known in advance so that appropriate recovery mechanisms can be deployed.
Backward recovery can be used as a general mechanism.
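The simplest instance of an (n, k) erasure code is a single XOR parity packet, giving n = k + 1. The sketch below is only that special case, not a general erasure code; the function names are hypothetical and all packets are assumed to have equal length.

```python
from functools import reduce


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets):
    """Append one parity packet (XOR of the k data packets): a (k+1, k) code."""
    return list(packets) + [reduce(xor_bytes, packets)]


def recover(received):
    """received holds the n packets, with at most one replaced by None (lost).

    Returns the k original data packets without any retransmission.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    received = list(received)
    if missing:
        present = [p for p in received if p is not None]
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]


data = [b"ab", b"cd", b"ef"]
coded = encode(data)
coded[1] = None                      # packet 1 is lost in transit
print(recover(coded) == data)
```

The receiver repairs the loss on its own, which is forward recovery; the backward-recovery alternative would be to ask the sender to retransmit packet 1.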

37 Backward Recovery: Problems
- Restoring a previous state is a costly operation.
- Saving system state does not come for free.
- Loop of recovery: there is no guarantee that the same (or a similar) failure will not happen again.
- Rolling back is not always possible:
  - Think of an ATM mistakenly handing out $1000.
  - Imagine a UNIX command like /bin/rm -fr *

38 Recovery: Stable Storage
[Figure, three panels: two drives, drive 1 and drive 2, each holding the same blocks a to h. Updates are applied to drive 1 first; recovery copies blocks between the drives.]
a) Stable storage.
b) Crash after drive 1 is updated.
c) Bad spot (as a result of general wear and tear).
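The recovery step for the two failure cases in the figure can be sketched as follows. Blocks are modelled as `(data, checksum)` pairs and `zlib.crc32` is used as the checksum; both choices, and the function names, are assumptions made for illustration.

```python
import zlib


def make_block(data):
    return (data, zlib.crc32(data))


def is_valid(block):
    data, checksum = block
    return zlib.crc32(data) == checksum


def recover(drive1, drive2):
    """Compare the drives block by block and repair any difference in place."""
    for i in range(len(drive1)):
        ok1, ok2 = is_valid(drive1[i]), is_valid(drive2[i])
        if ok1 and not ok2:
            drive2[i] = drive1[i]        # bad spot on drive 2
        elif ok2 and not ok1:
            drive1[i] = drive2[i]        # bad spot on drive 1
        elif ok1 and ok2 and drive1[i] != drive2[i]:
            # Crash between the two writes: drive 1 is always written first,
            # so it holds the newer value.
            drive2[i] = drive1[i]


d1 = [make_block(b"a-new"), (b"b", 0)]            # second block has a bad checksum
d2 = [make_block(b"a-old"), make_block(b"b")]
recover(d1, d2)
print(d1 == d2)
```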

39 Checkpointing
In a fault-tolerant distributed system, backward error recovery requires that the system regularly save its (global) state onto stable storage.
- A consistent global state can be captured using a distributed snapshot algorithm.
- A recovery line corresponds to the most recent distributed snapshot.
[Figure: timelines of P1 and P2 from their initial state, with checkpoints, a failure, a consistent cut (the recovery line) and an inconsistent cut.]

40 Independent Checkpointing
- Processes save their local state independently.
- On a crash, each process rolls back to its most recently saved state.
- If these local states do not jointly form a consistent cut, the processes have to roll back further, to an earlier checkpoint.
- This cascading rollback is known as the domino effect.
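The domino effect can be reproduced with a small sketch that searches for a recovery line. It assumes a global time axis purely for illustration, and that every process has an initial checkpoint at time 0; the function name `recovery_line` and the data layout are hypothetical.

```python
def recovery_line(checkpoints, messages):
    """checkpoints: process -> sorted checkpoint times, starting with 0.
    messages: list of (sender, send_time, receiver, receive_time).

    Starts from the latest checkpoints and rolls a receiver back whenever
    its checkpoint records a receipt that the sender's checkpoint never sent.
    """
    cut = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            if received <= cut[receiver] and sent > cut[sender]:
                earlier = [t for t in checkpoints[receiver] if t < received]
                cut[receiver] = earlier[-1]
                changed = True
    return cut


checkpoints = {"P1": [0, 30, 70], "P2": [0, 50, 90]}
messages = [("P1", 80, "P2", 85), ("P2", 60, "P1", 65),
            ("P1", 40, "P2", 45), ("P2", 10, "P1", 20)]
print(recovery_line(checkpoints, messages))
```

In this example each rollback invalidates the other process's checkpoint in turn, so both processes end up back at their initial state.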

41 Coordinated Checkpointing
Essence: All processes synchronize to jointly write their state to local stable storage. The saved state is automatically globally consistent.
Two-phase blocking protocol:
- A coordinator first multicasts a CHECKPOINT_REQUEST message to all processes.
- A receiving process takes a local checkpoint, stops sending messages (it queues them and blocks), and tells the coordinator that it has taken the checkpoint (ACK).
- When the coordinator has received an ACK from all processes, it multicasts a CHECKPOINT_DONE message to allow the blocked processes to continue.
Question: What could happen if a process did not stop sending regular messages after saving its local state?
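A minimal in-memory sketch of the two-phase blocking protocol follows. The class `Process` and the function `coordinated_checkpoint` are hypothetical names, and message passing is replaced by direct method calls.

```python
class Process:
    def __init__(self, name, state=0):
        self.name = name
        self.state = state
        self.checkpoint = None
        self.blocked = False
        self.outbox = []        # messages queued while blocked
        self.sent = []          # messages actually put on the network

    def send(self, message):
        (self.outbox if self.blocked else self.sent).append(message)

    def on_checkpoint_request(self):
        self.checkpoint = self.state    # take the local checkpoint
        self.blocked = True             # stop sending until CHECKPOINT_DONE
        return "ACK"

    def on_checkpoint_done(self):
        self.blocked = False
        self.sent.extend(self.outbox)   # release the queued messages
        self.outbox.clear()


def coordinated_checkpoint(processes):
    acks = [p.on_checkpoint_request() for p in processes]   # phase 1
    if all(a == "ACK" for a in acks):
        for p in processes:                                  # phase 2
            p.on_checkpoint_done()
        return True
    return False
```

Queueing outgoing messages between the two phases is what prevents a message sent after one process's checkpoint from being received before another process's checkpoint, which would make the saved global state inconsistent.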

42 Message Logging (1)
- Checkpointing is an expensive operation.
- Message logging reduces the number of checkpoints needed while still enabling recovery.
- Message logging and checkpointing are used together.
Idea: If the transmission of messages can be replayed, we can still reach a globally consistent state. A checkpointed state is taken as the starting point.
Piecewise deterministic model:
- The execution of each process is considered to take place as a sequence of intervals in which events occur.
- Each interval starts with a nondeterministic event (e.g. the receipt of a message).
- Execution within an interval is completely deterministic.

43 Message Logging (2)
Conclusion (piecewise deterministic model): If we record nondeterministic events (to replay them later), we obtain a deterministic execution model that allows a complete replay.
Problem: When should we actually log messages?
Issue: Avoid orphan processes.
- An orphan process is a process that survives the crash of another process, but whose state is inconsistent with the crashed process after that process recovers.
Goal: Devise message logging schemes in which orphans do not occur.
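Replay under the piecewise deterministic model can be sketched with a process whose state depends only on the messages it receives. The handler `handle` is an arbitrary deterministic function chosen for illustration, and `LoggedProcess` is a hypothetical name.

```python
def handle(state, message):
    """Deterministic state transition triggered by one received message."""
    return state * 31 + message


class LoggedProcess:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []            # messages received since the last checkpoint

    def deliver(self, message):
        self.log.append(message)             # log the nondeterministic event
        self.state = handle(self.state, message)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log = []

    def recover(self):
        """Restart from the checkpoint and replay the logged messages in order."""
        state = self.checkpoint
        for message in self.log:
            state = handle(state, message)
        self.state = state
        return state


p = LoggedProcess()
p.deliver(1); p.deliver(2)
p.take_checkpoint()
p.deliver(3); p.deliver(4)
before_crash = p.state
p.state = None                   # crash: volatile state is lost
print(p.recover() == before_crash)
```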

44 Orphan Process: Example
Process Q has received m1 and m2 and subsequently sent m3 before it crashes.
Assume that m2 is not logged. When Q crashes and subsequently recovers, only m1 is replayed; m2 is certainly not, and m3 is probably not sent again.
[Figure: timelines of P, Q and R. P sends m1 to Q (logged message), R sends m2 to Q (unlogged message), and Q sends m3 to R. Q crashes and recovers; only m1 is replayed.]
Incorrect replay of messages after recovery, leading to an orphan process. (Question: which one is the orphan process here?)

45 Message Logging Schemes (1)
- HDR(m): The header of message m, containing its source, destination, sequence number and a delivery number. The header contains all the information needed to resend a message and deliver it in the correct order.
- A message is stable if it can no longer be lost (e.g. it has been written to stable storage, along with its header).
- DEP(m): The set of processes to which the message m has been delivered. It also includes the processes to which another message m', which is causally dependent on m, has been delivered.
- COPY(m): The set of processes that have a copy of the message (and its header), but not (yet) in their local stable storage.

46 Message Logging Schemes (2)
- The processes in COPY(m) can hand over m. If all processes in this set crash, the retransmission of m is not possible.
- Using this notation, a process Q is an orphan if there is a message m such that Q is contained in DEP(m) while at the same time every process in COPY(m) has crashed. There is then no way to replay the transmission of m.
- To avoid orphan processes, we can enforce that DEP(m) ⊆ COPY(m). In other words, whenever a process becomes dependent on the delivery of m, it always keeps a copy of m (i.e. the message along with its header).
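The orphan condition and the safety rule can be checked mechanically with set operations. In this sketch `dep` and `copy` map each message to DEP(m) and COPY(m); the function names are hypothetical.

```python
def orphans(dep, copy, crashed):
    """Processes in some DEP(m) for which every process in COPY(m) has crashed."""
    result = set()
    for m, dependents in dep.items():
        if copy[m] <= crashed:               # nobody left who can hand over m
            result |= dependents - crashed
    return result


def no_orphans_possible(dep, copy):
    """The rule DEP(m) ⊆ COPY(m): every dependent process keeps a copy of m."""
    return all(dep[m] <= copy[m] for m in dep)


# R depends on m2 (through a later message from Q), but only Q holds a copy.
dep = {"m2": {"Q", "R"}}
copy = {"m2": {"Q"}}
print(orphans(dep, copy, crashed={"Q"}))
print(no_orphans_possible(dep, copy))
```

If R had also kept a copy of m2, the subset rule would hold and Q's crash could not leave R orphaned.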

47 Message Logging Schemes (3)
Pessimistic logging protocol:
- For each unstable message m, there is at most one process dependent on m, that is, |DEP(m)| ≤ 1.
- In other words, this protocol ensures that each unstable message m is delivered to at most one process.
- A process P becomes a member of COPY(m) after receiving m.
- P is forced to write m to stable storage before sending a message to another process.
- If P crashes before it logs m, there is no problem, since no other process is dependent on the delivery of m.
Optimistic logging protocol:
- If each process in COPY(m) has crashed, any orphan process in DEP(m) is rolled back to a state in which it no longer belongs to DEP(m).
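The key rule of pessimistic logging, that a process must make received messages stable before it sends anything, can be sketched as follows. The class name `PessimisticProcess` and the list standing in for the network are assumptions for illustration.

```python
class PessimisticProcess:
    def __init__(self):
        self.stable_log = []     # messages written to stable storage
        self.unlogged = []       # delivered messages held only in memory

    def deliver(self, message):
        self.unlogged.append(message)

    def send(self, payload, network):
        # Before any send, force every delivered message to stable storage,
        # so that no other process can come to depend on an unstable message.
        self.stable_log.extend(self.unlogged)
        self.unlogged.clear()
        network.append(payload)


network = []
p = PessimisticProcess()
p.deliver("m")
p.send("reply", network)
print(p.stable_log, network)
```

The price is a synchronous write to stable storage on the send path; optimistic logging avoids that cost and instead rolls orphans back after a crash.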


More information

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part I. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part I CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Overview Basic concepts Process resilience Reliable client-server communication Reliable group communication Distributed

More information
