The Totem System. L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, R. K. Budhia, C. A. Lingley-Papadopoulos, T. P. Archambault


Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA

Supported by the National Science Foundation, Grant No. NCR , by the Advanced Research Projects Agency, Grant No. N K-0097, and by Rockwell CMC through the State of California MICRO program, Grant No

Abstract

The Totem system supports fault-tolerant applications in which distributed processes cooperate to perform a common task and in which replicated data must be updated consistently in the presence of asynchrony and faults. Reliable totally ordered delivery of messages to processes within process groups is provided on a single local-area network or over multiple local-area networks interconnected by gateways. Message ordering is consistent across the entire network, despite processor and communication faults, without requiring all processes to deliver all messages. The Totem system handles processor failure and recovery, as well as network partitioning and remerging, and provides membership and topology maintenance services.

1 Introduction

The Totem system, developed at the University of California, Santa Barbara, supports applications in which information must be replicated to guard against faults and in which the consistency of information must be maintained as it is updated in the presence of faults. Totem provides reliable totally ordered delivery of messages to processes within process groups. This total order of messages simplifies the application programming needed to maintain the consistency of information and, thus, reduces the risk of errors in programming fault-tolerant applications. Easier programming results in lower costs, shorter development times, and higher reliability.

Causally ordered and unordered message delivery were previously advocated for distributed applications because of the poor performance of previous total ordering protocols, but are rendered unnecessary by the high throughput and low latency of Totem's total ordering protocol. The exceptional performance of Totem results from effective flow control mechanisms and from exploiting the locality of process groups by filtering messages at the gateways.

Applications that can benefit from Totem's totally ordered message delivery service include air traffic control, industrial automation, transaction processing, banking, stock market trading, intelligent highway, medical monitoring, and replicated database applications. Other reliable ordered message delivery systems with similar objectives for similar applications include Isis [5, 7], Psync [16], Trans and Total [12, 14], Transis [3], Amoeba [9], Delta-4 [18] and Horus [17].

The Totem system operates on a single local-area network or over multiple local-area networks interconnected by gateways. Superimposed on each local-area network is a logical token-passing ring. The fields of the circulating token provide reliable delivery and total ordering of messages, confirmation that messages have been received by all processors, effective flow control, and detection of faults. Consistency of message ordering is provided across the entire system without the need for all processes to deliver all messages.

The membership and topology maintenance services provided by Totem handle processor failure and recovery, as well as network partitioning and remerging. Configuration Change and Topology Change messages delivered to each application process define a sequence of process group configurations. Each process receives all messages multicast to the process group by a member of a configuration, and timestamped within that configuration.

The Totem system hierarchy is shown in Figure 1. The bottom layer of the hierarchy is a collection of local-area networks with best-effort hardware broadcasts or multicasts.
The single-ring protocol converts the best-effort multicasts into the service of reliable totally ordered delivery of messages and, in addition, provides fault detection and recovery on a single local-area network. The multiple-ring protocol uses the single-ring protocol to provide system-wide total ordering of messages, as well as system-wide topology and membership services. The multiple-ring protocol, using information from the process group interface above it, forwards messages through the gateways to the rings on which they are required. The process group interface delivers messages to application processes in the appropriate process groups, and provides process group membership services.

[Figure 1: The Totem system hierarchy. Layers, top to bottom: Application Layer, Process Group Interface, Multiple-Ring Protocol, Single-Ring Protocol, Physical Medium. Service labels: ordered multicast to process group, globally ordered reliable multicast, locally ordered reliable multicast, best-effort multicast. Fault-reporting labels: process group membership changes, network-wide topology changes, local configuration changes, absence of messages and timeouts.]

Detailed descriptions of the protocols that comprise the Totem system can be found in [2, 11], and descriptions of earlier versions of the protocols can be found in [1, 4, 13].

The Totem system has been implemented in the C programming language on Sun IPC workstations running SunOS 4.1, and on Sun SPARCstation 20s running Solaris 2.3, over a 10 Mbit/sec Ethernet. It uses standard UNIX facilities, in particular UNIX UDP sockets to broadcast messages and to transfer the token. The implementation has been ported to several other types of workstations with little modification to the source code.

2 The Model

The Totem system supports fault-tolerant applications within a distributed system in which processors are connected by local-area networks, possibly several local-area networks interconnected by gateways. Totem is designed to protect against communication faults, including message loss and network partitioning.
It also protects against processor faults, including crash, omission, and timing faults but not Byzantine or software faults.

In Totem, we distinguish between receipt and delivery of a message as follows: a message is received from the next lower layer of the protocol hierarchy and is delivered in order to the next higher layer. When messages are received, they may not be in order and, thus, the lower layer may need to reorder them before delivering them to the upper layer.

Totem provides two levels of message delivery, agreed and safe, selected by the originator of the message. Delivery in agreed order for a configuration requires that (1) messages are delivered in a total order for that configuration, (2) messages are delivered in the same total order by all processors in the configuration, and (3) every message that precedes a message in the total order for the configuration is delivered before that message. The total order on messages respects Lamport's causal order [10]. Delivery in safe order for a configuration is delivery in agreed order for that configuration and, in addition, requires that (4) a processor knows that the message has been received by all of the other processors in the configuration. An important use of safe delivery is that it allows a processor to reclaim buffer space, because a message that is safe will never need to be retransmitted subsequently.

Continued operation when faults occur poses substantial challenges to maintaining the consistency of message ordering. When a processor fails or the system partitions, it is impossible to be certain which messages were delivered by the processor before it failed or by processors in other components of the partitioned system. Virtual synchrony [6] ensures that processors that are members of the same consecutive configurations deliver the same sequence of messages and configuration changes, but does not constrain the behavior of faulty or isolated processors.
For systems in which faulty processors can be repaired and resume operation with stable storage intact and for systems in which the network can partition and remerge, we have introduced the concept of extended virtual synchrony [4, 15]. Extended virtual synchrony requires the properties of delivery in agreed and safe order within a configuration, even if a processor fails and restarts or if the network partitions and remerges. More importantly, it requires that the total order of messages for a configuration is a subset of a global total order on all messages generated in the system. Extended virtual synchrony allows two processors in different components of a partitioned system to deliver different messages, but does not allow them to deliver the same messages in different orders. When faults occur, extended virtual synchrony is achieved by introducing a transitional configuration with a reduced membership, all members of which are able to honor the agreed and safe message delivery guarantees.

3 The Totem Single-Ring Protocol

The Totem single-ring protocol provides reliable totally ordered delivery of messages using a logical token-passing ring superimposed on a local-area network, such as an Ethernet. The token circulates around the ring as a point-to-point message; only the processor holding the token can broadcast messages. A sequence number field in the token provides a single sequence of strictly increasing sequence numbers for all messages broadcast on the ring; messages are delivered in sequence number order. The single-ring protocol also provides membership services to handle processor failure and recovery, as well as network partitioning and remerging. To guard against token loss, a token retransmission mechanism has been implemented.

3.1 Message Ordering

The sequence numbers of the messages are derived from a sequence number field in the token, called seq. The seq field is incremented as each new message is broadcast. Processors recognize missing messages by detecting gaps in the sequence of message sequence numbers, and request retransmissions by inserting the sequence numbers of the missing messages into a retransmission request (rtr) field of the token. If a processor has received a message and all of its predecessors, as indicated by the message sequence numbers, it can deliver that message in agreed order. The token also contains an all-received-up-to (aru) field which enables a processor to determine a sequence number such that all messages with lower sequence numbers have been received by all processors on the ring. Messages with sequence numbers less than or equal to this sequence number can be delivered in safe order.
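The receive-side rules of Section 3.1 can be illustrated with a minimal Python sketch. It is not the Totem implementation (which is written in C); the `Token` and `Receiver` classes, their method names, and the `ring_aru` parameter are hypothetical names introduced here. In particular, the sketch assumes the ring-wide all-received-up-to value has already been determined from the token and simply passes it in, since the text does not detail how that value is computed.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Hypothetical layout: only the three token fields named in the text.
    seq: int = 0                            # last sequence number assigned on the ring
    aru: int = 0                            # all-received-up-to value carried by the token
    rtr: set = field(default_factory=set)   # sequence numbers requested for retransmission


class Receiver:
    """One processor's view of the messages broadcast on a single ring."""

    def __init__(self):
        self.received = {}     # sequence number -> payload
        self.my_aru = 0        # every message with seq <= my_aru has been received here
        self.agreed_upto = 0   # highest sequence number delivered in agreed order
        self.safe_upto = 0     # highest sequence number delivered in safe order

    def on_message(self, seq, payload):
        # Receipt may be out of order; advance my_aru over any contiguous prefix.
        self.received[seq] = payload
        while self.my_aru + 1 in self.received:
            self.my_aru += 1

    def on_token(self, token):
        # Gaps below token.seq are missing messages; request their retransmission.
        gaps = {s for s in range(self.my_aru + 1, token.seq + 1)
                if s not in self.received}
        token.rtr |= gaps
        return gaps

    def deliver_agreed(self):
        # A message is deliverable in agreed order once all predecessors are received.
        batch = [self.received[s]
                 for s in range(self.agreed_upto + 1, self.my_aru + 1)]
        self.agreed_upto = self.my_aru
        return batch

    def deliver_safe(self, ring_aru):
        # ring_aru is assumed to be a value known to hold for every processor on the ring.
        limit = min(ring_aru, self.my_aru)
        batch = [self.received[s] for s in range(self.safe_upto + 1, limit + 1)]
        self.safe_upto = max(self.safe_upto, limit)
        return batch


r = Receiver()
token = Token(seq=4)
for seq, payload in [(1, "a"), (2, "b"), (4, "d")]:
    r.on_message(seq, payload)
gaps = r.on_token(token)            # message 3 is missing
first = r.deliver_agreed()          # only the contiguous prefix is deliverable
r.on_message(3, "c")                # the retransmission arrives
second = r.deliver_agreed()
safe = r.deliver_safe(ring_aru=2)
```

Keeping separate agreed and safe counters reflects the distinction drawn in the text: agreed delivery depends only on what this processor has received, while safe delivery also depends on what the other processors are known to have received.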
The Totem single-ring protocol provides effective flow control and, thereby, achieves high throughput and low latency. The flow control mechanisms are based on two limits: the number of messages that can be broadcast by any one processor during a single token visit and the total number of messages that can be broadcast by all processors during a single token rotation. The token also provides information about the aggregate message backlog of the various processors on the ring, which allows a fairer allocation of bandwidth to processors than is achieved by simpler schemes, such as FDDI.

Measurements for the Totem single-ring protocol, with low message loss rates, show a throughput that is two to five times higher than the throughput achieved by competing ordered multicast protocols using similar equipment, and that is comparable to the throughput achieved by TCP/IP for point-to-point communication. Low latency, from message origination to delivery, is maintained even under high message transmission rates.

3.2 Processor Membership

To provide fault tolerance, the Totem single-ring ordering protocol is integrated with a membership protocol that provides membership services to reconfigure the system, including addition of new and recovered processors, deletion of faulty processors, handling of network partitioning, and remerging of components of a partitioned network. Timeouts are used to detect processor faults. New or restarted processors are detected by the appearance of messages on the local-area network from processors that are not members of the current ring.

The membership protocol uses heuristics based on timeouts to identify faulty processors, with a bias toward preserving the current membership. The protocol ensures consensus in that every member of the configuration agrees on the membership of the configuration, and termination in that every processor installs some configuration with an agreed membership within a bounded time unless it fails within that time.
Subject to these consensus and termination requirements, the membership protocol aims to form a membership that is as large as possible. It then constructs a new ring on which the ordering protocol can resume operation, generates a new token, and recovers messages that have not been received by some of the processors when the fault occurred.

For each change in the membership within the local-area network, Totem delivers two Configuration Change messages to the application, rather than the one message that might have been expected. When a processor fails or the network partitions, the first Configuration Change message introduces a transitional configuration of reduced size that excludes the faulty or inaccessible processors. Delivery of this message informs the application that the delivery guarantees now apply only to the smaller transitional configuration. Within the transitional configuration, the remaining messages of the old configuration are delivered. After these messages are delivered, the second Configuration Change message is delivered, which introduces the new regular configuration.

These message ordering and membership services provide reliable totally ordered delivery of messages to the multiple-ring protocol described below.

4 The Totem Multiple-Ring Protocol

The Totem multiple-ring protocol is layered on top of the single-ring protocol, and delivers messages and topology changes to the application processes in timestamp order. Timestamp order guarantees global consistency of message ordering. The ability of a processor to deliver a message in timestamp order depends, however, on that processor's knowing that it has already received and delivered all relevant messages, from all of the connected rings, with timestamps that are less than the timestamp of the message to be ordered. The multiple-ring protocol exploits the services provided by the single-ring protocol for this knowledge.

4.1 Message Ordering

On each individual ring, messages are generated with increasing sequence numbers and timestamps. The single-ring protocol provides reliable delivery of messages in sequence number order on an individual ring. As the single-ring protocol delivers messages to a gateway, the gateway forwards the messages, in order, onto the other ring. On the new ring, a message retains its original timestamp but acquires a new sequence number so that it can be reliably delivered in sequence number order on that ring. Since the messages are forwarded in order, when the multiple-ring protocol receives a message that was originated on another ring, it has already received all relevant messages originated on that ring with earlier timestamps.

Each processor maintains a ring table with an entry for each ring in the network, containing a recv_msgs list of messages received from that ring. A processor can deliver a message in agreed order if the message is the lowest entry in all of the recv_msgs lists and if each recv_msgs list is nonempty.
If the recv_msgs list of some ring is empty, then no further messages can be delivered until a message from that ring has been received, because the next message from that ring may have a lower timestamp than the other messages in those lists.

Messages, called Guarantee Vector messages, are broadcast periodically for each ring by the gateways to ensure that processors can continue to deliver messages in agreed order even if, for some ring, no processor on that ring originated a regular message recently. The Guarantee Vector messages also report which messages have been received on that ring and, thus, allow other processors to determine which messages can be delivered in safe order.

A filtering mechanism at each gateway ensures that messages addressed to a process group are forwarded only if there are members of that process group in the direction of the forwarding. This enables the Totem multiple-ring protocol to exploit process group locality and to operate efficiently in large networks, using the system-wide total ordering of messages to provide strict consistency of message delivery. The operation of the Totem message ordering protocol at a gateway is shown in Figure 2.

[Figure 2: The operation of Totem at a gateway. Labels: Application, Process Group Interface, Multiple-Ring Protocol, Single-Ring Protocol, ring table, totally ordered delivery in timestamp order, reliable delivery in sequence number order, Local-Area Network 1, Local-Area Network 2.]

4.2 Topology Maintenance

The message ordering protocol described above depends on knowledge of the network topology. If messages are originated on a ring of which a processor is unaware, that processor will not wait for such messages during the ordering and may prematurely deliver other messages with higher timestamps. Similarly, if a ring becomes inaccessible and a processor is not informed, that processor will wait forever for a message from that ring and message ordering will stop.
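The ring-table delivery rule of Section 4.1, and the reason an uninformed processor can stall when a ring becomes inaccessible, can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Totem code: the `RingTable` class and its methods are hypothetical, ties between equal timestamps are broken by ring identifier (an assumption; the text does not state the tie-breaking rule), and removing a ring simply discards its pending entries for brevity.

```python
from collections import deque


class RingTable:
    """Hypothetical ring table: one recv_msgs queue per known ring."""

    def __init__(self, ring_ids):
        self.recv_msgs = {ring: deque() for ring in ring_ids}

    def receive(self, ring_id, timestamp, payload):
        # Messages originated on one ring are assumed to arrive in timestamp order.
        self.recv_msgs[ring_id].append((timestamp, ring_id, payload))

    def remove_ring(self, ring_id):
        # Simplification: stop waiting for a ring that has become inaccessible.
        self.recv_msgs.pop(ring_id, None)

    def deliver(self):
        # Deliver while every list is nonempty; always take the lowest head entry.
        delivered = []
        while self.recv_msgs and all(self.recv_msgs.values()):
            ring = min(self.recv_msgs, key=lambda r: self.recv_msgs[r][0][:2])
            delivered.append(self.recv_msgs[ring].popleft())
        return delivered


table = RingTable(["A", "B"])
table.receive("A", 5, "a1")
table.receive("A", 9, "a2")
blocked = table.deliver()       # ring B's list is empty, so nothing can be delivered
table.receive("B", 7, "b1")
ordered = table.deliver()       # delivers timestamps 5 and 7, then blocks on B again
table.remove_ring("B")
rest = table.deliver()          # with B removed, the message with timestamp 9 is delivered
```

In this model a Guarantee Vector message would play the same role as the message from ring B: it gives the processor a lower bound on future timestamps from that ring so that ordering can proceed.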
Each gateway maintains a data structure, called topology, which contains its view of the current topology of the network, represented as a graph with a node corresponding to a ring and an edge to a gateway. Gateways use topology for several purposes, including to decide which messages should be forwarded. In the event of a topology change, a processor receives the necessary topology information from the gateways on its ring.

Processor faults and network partitioning are detected by the single-ring protocol, which generates a Configuration Change message to report the change to the local ring. The gateways analyze the Configuration Change message to determine its effect on the network topology. When a ring becomes inaccessible, a processor or gateway generates a Topology Change message and removes that ring from its ring table, ending the need to wait for messages from that ring which will never arrive and allowing messages from other rings to be ordered.

A topology change must have the same effect for each of the processors that were previously able to, and can still, communicate with each other. Even though the processors learn of the topology change at different physical times, all agree on a common logical time for the topology change and on the same sets of messages delivered before and after the topology change. To achieve this, Configuration Change and Topology Change messages are timestamped, and are delivered to the application in timestamp order, along with the other messages. The Totem multiple-ring protocol is thus able to maintain, across a network of many rings, the extended virtual synchrony guarantees for agreed and safe delivery.

5 The Totem Process Group Interface

The Totem system allows the fault-tolerant application to be structured as a set of process groups [8]. Each process group consists of a set of processes that cooperate to perform some task of the application. Messages within the Totem system are addressed to one or more process groups and are delivered to all of the processes in those groups. A typical application is structured as multiple process groups, where each process may be a member of several process groups. Maintaining the consistency of message ordering when process groups can intersect is a challenging problem.
The only efficient and effective method known to us for solving this problem is to require a global total order over all messages for all process groups in the system.

The process group interface provides the services of creating, joining and leaving process groups and of sending and receiving messages. The interface establishes a socket for each application process through which the process communicates with Totem and which the process can poll to determine whether any messages are pending. The process group interface passes messages from the application processes down to the multiple-ring protocol, which breaks large messages into small messages (packets). The process group interface receives messages from the multiple-ring protocol, which constructs large messages from small messages, and then determines the application processes, if any, to which to deliver the messages. Since messages are delivered to the process group interface in order, the interface does not need to be concerned with message ordering.

On each processor, the process group interface also maintains the current membership of any process group of which at least one process on that processor is a member. When a process joins or leaves a group, this fact is disseminated throughout the network to all of the other members of the group by the process group membership protocol.

6 Demonstration

The demonstration of the Totem system consists of a simulated air traffic control application on a network of workstations. In the airspace displayed on each of the workstation screens, aircraft travel from airfield to airfield, maintaining safe separation in flight and safe sequencing for takeoff and landing. Each aircraft is controlled by one of the workstations, as indicated by the color of the aircraft. The flight plans of the aircraft are replicated on all of the workstations, exploiting Totem's total ordering of messages to maintain the consistency of these replicas.
The aircraft control process on each workstation periodically generates a new flight, with a flight plan represented as a sequence of times and positions. The flight plan for the new flight is broadcast. On ordering this message, the aircraft control process checks the flight plan for conflicts with flight plans already in the database. If no conflict is detected, the flight plan is inserted into the database. The workstation controlling a flight, recorded in the database, periodically broadcasts the aircraft position to the display processes. The display process on each workstation displays the aircraft on the workstation screen.

Workstations can be stopped and restarted at arbitrary times, demonstrating the fault detection and reconfiguration capabilities of Totem. When a workstation fails, a Configuration Change message informs the aircraft control processes of the new membership and, thus, which of the workstations has failed. The remaining workstations assume control of the aircraft of the failed workstations in round-robin order and, on the workstation display, the colors of the aircraft change accordingly.
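The way the demonstration relies on total ordering can be sketched in Python. The sketch is a toy model, not the demonstration's code: the `FlightDatabase` class, the workstation names, and the conflict rule (two plans conflict if they share a time and position) are all assumptions made for illustration. The point is that each replica applies the same messages in the same order with a deterministic rule, so every replica accepts and rejects the same flight plans and computes the same round-robin reassignment.

```python
class FlightDatabase:
    """Hypothetical replica of the flight-plan database on one workstation."""

    def __init__(self):
        self.plans = {}   # flight id -> (controlling workstation, list of (time, position))

    def apply(self, flight_id, controller, plan):
        # Called once per flight-plan message, in the total order of delivery.
        occupied = {wp for _, waypoints in self.plans.values() for wp in waypoints}
        if any(wp in occupied for wp in plan):
            return False                      # conflict: every replica rejects the plan
        self.plans[flight_id] = (controller, list(plan))
        return True

    def reassign(self, failed, survivors):
        # Called on a Configuration Change; sorting keeps the replicas in agreement.
        survivors = sorted(survivors)
        orphans = sorted(f for f, (c, _) in self.plans.items() if c == failed)
        for i, flight_id in enumerate(orphans):
            new_controller = survivors[i % len(survivors)]
            self.plans[flight_id] = (new_controller, self.plans[flight_id][1])


messages = [
    ("F1", "ws1", [(0, "A"), (1, "B")]),
    ("F2", "ws2", [(1, "B"), (2, "C")]),   # conflicts with F1 at time 1, position B
    ("F3", "ws2", [(2, "C")]),
    ("F4", "ws2", [(3, "D")]),
]
replica1, replica2 = FlightDatabase(), FlightDatabase()
results1 = [replica1.apply(*m) for m in messages]
results2 = [replica2.apply(*m) for m in messages]

# Workstation ws2 fails; each replica learns the same new membership.
replica1.reassign("ws2", ["ws3", "ws1"])
replica2.reassign("ws2", ["ws1", "ws3"])
```

If the two replicas applied F1 and F2 in different orders, they would accept different plans; the total order is what makes the deterministic check sufficient.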

7 Conclusion

The Totem system enables fault-tolerant applications in distributed systems to maintain the consistency of replicated information by providing reliable totally ordered multicasting of messages to processes within process groups. A hierarchy of protocols allows operation over a single local-area network or over multiple local-area networks interconnected by gateways. The message ordering strategy of Totem employs timestamps to define a total order on messages system-wide and sequence numbers to ensure reliable delivery by determining whether all messages with lower timestamps have been received on a ring. The strategy is computationally inexpensive and results in excellent performance.

Many issues remain to be considered, including more effective flow control for multiple-ring networks and better coupling between the routing and process group mechanisms. We are also planning to implement Totem over faster communication media, such as 100 Mbit/sec Ethernet and 155 Mbit/sec ATM.

References

[1] D. A. Agarwal, P. M. Melliar-Smith, and L. E. Moser, "Totem: A protocol for message ordering in a wide-area network," Proceedings of the First International Conference on Computer Communications and Networks, San Diego, CA, pp. 1-5, June.
[2] D. A. Agarwal, Totem: A Reliable Ordered Delivery Protocol for Interconnected Local-Area Networks. PhD Thesis, University of California, Santa Barbara, August.
[3] Y. Amir, D. Dolev, S. Kramer, and D. Malki, "Transis: A communication sub-system for high availability," Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, Boston, MA, pp. 76-84, July.
[4] Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella, "Fast message ordering and membership using a logical token-passing ring," Proceedings of the 13th IEEE International Conference on Distributed Computing Systems, Pittsburgh, PA, pp. 551-560, May.
[5] K. P. Birman and T. A. Joseph, "Reliable communication in the presence of failures," ACM Transactions on Computer Systems, vol. 5, no. 1, pp. 47-76, February.
[6] K. P. Birman and T. A. Joseph, "Exploiting virtual synchrony in distributed systems," Proceedings of the 11th Annual ACM Symposium on Operating Systems Principles, pp. 123-138, November.
[7] K. P. Birman, A. Schiper, and P. Stephenson, "Lightweight causal and atomic group multicast," ACM Transactions on Computer Systems, vol. 9, no. 3, pp. 272-314, August.
[8] D. R. Cheriton and W. Zwaenepoel, "Distributed process groups in the V kernel," ACM Transactions on Computer Systems, vol. 3, no. 2, pp. 77-107, May.
[9] M. F. Kaashoek and A. S. Tanenbaum, "Group communication in the Amoeba distributed operating system," Proceedings of the 11th IEEE International Conference on Distributed Computing Systems, Arlington, TX, pp. 882-891, May.
[10] L. Lamport, "Time, clocks, and the ordering of events in a distributed system," Communications of the ACM, vol. 21, no. 7, pp. 558-565, July.
[11] C. A. Lingley-Papadopoulos, The Totem Process Group Membership and Interface, Master's Thesis, University of California, Santa Barbara, August.
[12] P. M. Melliar-Smith, L. E. Moser, and V. Agrawala, "Broadcast protocols for distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 1, pp. 17-25, January.
[13] P. M. Melliar-Smith, L. E. Moser, and D. A. Agarwal, "Ring-based ordering protocols," Proceedings of the IEE International Conference on Information Engineering, Singapore, pp. 882-891, December.
[14] L. E. Moser, P. M. Melliar-Smith, and V. Agrawala, "Processor membership in asynchronous distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 5, pp. 459-473, May.
[15] L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agarwal, "Extended virtual synchrony," Proceedings of the 14th IEEE International Conference on Distributed Computing Systems, Poznan, Poland, pp. 56-65, June.
[16] L. L. Peterson, N. C. Buchholz, and R. D. Schlichting, "Preserving and using context information in interprocess communication," ACM Transactions on Computer Systems, vol. 7, no. 3, pp. 217-246, August.
[17] R. van Renesse, T. M. Hickey, and K. P. Birman, "Design and performance of Horus: A lightweight group communications system," Technical Report, Cornell University, Department of Computer Science, August.
[18] P. Verissimo, L. Rodrigues, and J. Rufino, "The Atomic Multicast protocol (AMp)," in D. Powell, ed., Delta-4: A Generic Architecture for Dependable Distributed Computing, pp. 267-294, Springer-Verlag, 1991.


More information

Specication and Design of a Fault Recovery Model for the Reliable. Multicast Protocol. Todd Montgomery, John R. Callahan, Brian Whetten

Specication and Design of a Fault Recovery Model for the Reliable. Multicast Protocol. Todd Montgomery, John R. Callahan, Brian Whetten Specication and Design of a Fault Recovery Model for the Reliable Multicast Protocol Todd Montgomery, John R. Callahan, Brian Whetten ftmont,callahang@cerc.wvu.edu, whetten@tenet.cs.berkeley.edu NASA/West

More information

Multimedia Multicast Transport Service for Groupware

Multimedia Multicast Transport Service for Groupware Multimedia Multicast Transport Service for Groupware Chockler, Gregory V., Huleihel, Nabil, Keidar, Idit, and Dolev, Danny, The Hebrew University of Jerusalem, Jerusalem, Israel 1.0 Abstract Reliability

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

End-to-End Latency of a Fault-Tolerant CORBA Infrastructure Λ

End-to-End Latency of a Fault-Tolerant CORBA Infrastructure Λ End-to-End Latency of a Fault-Tolerant CORBA Infrastructure Λ W. Zhao, L. E. Moser and P. M. Melliar-Smith Department of Electrical and Computer Engineering University of California, Santa Barbara, CA

More information

Fault Tolerance Middleware for Cloud Computing

Fault Tolerance Middleware for Cloud Computing 2010 IEEE 3rd International Conference on Cloud Computing Fault Tolerance Middleware for Cloud Computing Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Cleveland,

More information

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a

Kevin Skadron. 18 April Abstract. higher rate of failure requires eective fault-tolerance. Asynchronous consistent checkpointing oers a Asynchronous Checkpointing for PVM Requires Message-Logging Kevin Skadron 18 April 1994 Abstract Distributed computing using networked workstations oers cost-ecient parallel computing, but the higher rate

More information
