A Lightweight Fault Tolerance Framework for Web Services

Web Intelligence and Agent Systems: An International Journal 0 (2008), IOS Press

A Lightweight Fault Tolerance Framework for Web Services

Wenbing Zhao*, Honglei Zhang and Hua Chai
Department of Electrical and Computer Engineering, Cleveland State University, 2121 Euclid Ave, Cleveland, OH 44115, USA
{w.zhao1,h.zhang105,h.chai}@csuohio.edu

(This work was supported by Department of Energy Contract DE-FC26-06NT42853, and by Cleveland State University through a Faculty Research Development award. An earlier version of this paper was presented at the 2007 IEEE/WIC/ACM International Conference on Web Intelligence [33]. *Corresponding author: wenbing@ieee.org.)

Abstract. In this paper, we present the design and implementation of a lightweight fault tolerance framework for Web services. With our framework, a Web service can be rendered fault tolerant by replicating it across several nodes. A consensus-based algorithm is used to ensure total ordering of incoming application requests to the replicated Web service, and to ensure a consistent membership view among the replicas. The framework is built by extending an open-source implementation of the WS-ReliableMessaging specification, and all reliable message exchanges in our framework conform to the specification. As such, our framework does not depend on any proprietary messaging and transport protocols, which is consistent with the Web services design principles. Our performance evaluation shows that our implementation is nearly optimal and that the framework incurs only moderate runtime overhead.

Keywords: Fault Tolerance, Web Services, Distributed Consensus, Reliable Messaging, Replication

1. Introduction

Many Web intelligence systems offer their services in the form of Web services, and some of the core services must be made highly available and reliable to accomplish their missions. In fact, the capability of automatically reconfiguring themselves for continuous operation in the presence of component failures should be an essential element of any intelligence system. However, designing a sound fault tolerance solution for Web services is not trivial. It is tempting to perform a relatively straightforward translation of many existing fault tolerance mechanisms from older generations of distributed computing platforms, such as those described in FT-CORBA [23], to Web services. We argue against such an approach for several reasons. As pointed out by many researchers, Web services technology is drastically different from the older generation of distributed computing technologies [25,32]: Web services are designed for Web-based computing over the Internet and adopt a message-based approach for maximum interoperability, while the older technologies were not designed for the Internet and focused primarily on Application Programming Interface (API) based interactions. Furthermore, Web services advocate flexibility, composability, and technology independence. Hence, a fault tolerance solution for Web services must take an approach that is consistent with the design principles of Web services. Secondly, the FT-CORBA standard [23], one of the major outcomes of fault tolerance research for CORBA, contains a great number of APIs for replication and fault management, and many sophisticated mechanisms, which have been considered too heavyweight even for CORBA applications, let alone for Web services. These observations prompted us to design a novel, lightweight fault tolerance framework for Web services.
The framework has the following features:

- It does not rely on any proprietary communication protocol for the interactions between the clients and server replicas, or among the server replicas. All the messaging required for replication is defined in the Web Services Description Language (WSDL), and carried out on top of the standard Web services transport protocol, i.e., SOAP. (SOAP once stood for Simple Object Access Protocol. Since SOAP version 1.2, the acronym has been dropped because SOAP has evolved far beyond the initial objective of enabling simple object invocations via the HyperText Transport Protocol (HTTP) on the Internet.) This decision leads us to adopt a consensus-based algorithm [20], rather than a group communication system, to perform state-machine based replication. The algorithm ensures the total ordering of all incoming application requests to the replicated Web service, and a consistent membership view of the replicas (which is crucial to avoid the split-brain syndrome [6]).

- The framework is backward compatible with the WS-ReliableMessaging [4] specification, which ensures reliable point-to-point communication for Web services. A Web service using our framework can be protected against failures by replication when needed; otherwise, it runs as a WS-ReliableMessaging implementation. The switch between the replication and non-replication modes can happen dynamically at runtime. Unlike other fault tolerance frameworks, our framework does not incur any extra overhead when running in the non-replication mode (i.e., with a single replica).

- Our framework is lightweight in that it does not impose sophisticated replication and fault management requirements, as FT-CORBA does. Configuration is carried out through a simple property file, and fault detection is incorporated into the replication mechanisms.

- The framework requires minimal changes to the Web services and their clients. On the service side, only two additional operations are introduced, to retrieve and to restore the service state (a minimal sketch of such a state interface appears at the end of this section). On the client side, the application must specify our module as one of the options to the SOAP engine. All other changes happen in the configuration files used by the SOAP engine.

We have implemented our framework using Apache Axis2 [2] (the latest generation of the open-source SOAP engine) and Sandesha2 [3] (an open-source implementation of the WS-ReliableMessaging specification on top of Axis2). The consensus-based replication algorithm is adapted from the BFT algorithm [9]; it is essentially an implementation of the Paxos algorithm [20]. The performance of the framework has been carefully characterized and optimized. The runtime overhead is quite moderate considering our all-Web-services-technology approach.
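The two service-side operations for retrieving and restoring the service state could take a form similar to the following minimal Java sketch. The interface name, the method signatures, and the byte-array encoding of the state are illustrative assumptions on our part, not the framework's actual API.

// Hypothetical sketch of the two additional service-side operations.
// The names and the byte[] encoding of the state are assumptions only.
public interface ReplicatedServiceState {
    // Invoked by the replication framework to capture the service state,
    // e.g., when taking a checkpoint or serving a state transfer.
    byte[] getState();

    // Invoked by the replication framework to restore a previously captured
    // state, e.g., on a recovering or newly joined replica.
    void setState(byte[] checkpoint);
}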
2. Related Work

A considerable number of high availability solutions for Web services have been proposed in recent years. Two of them, namely WS-Replication [28] and Thema [26], are most closely related to this work because they both ensure strong replica consistency for Web services. Similar to our work, WS-Replication achieves consistent replication of Web services by totally ordering all incoming requests to the replicated Web service. Even though the interfaces to the client application and the replicated Web services conform to the Web services standards, the actual transport is carried out using JGroup [18], which is a proprietary group communication system. JGroup does offer a SOAP transport; however, the performance is poor when that transport is used. Consequently, proprietary message serialization is used to achieve decent performance. Unfortunately, such a move violates the Web services design principles, which insist on the use of standard Internet-based transport protocols. The use of a proprietary group communication system is also problematic, because the clients and all replicas become strongly coupled to a single technology, which poses interoperability problems. From the implementation perspective, WS-Replication uses separate proxy and dispatcher processes, respectively, to capture and multicast clients' requests, and to receive multicast messages from JGroup and forward the requests to the replicated Web services, which is inefficient. Our framework avoids the above problems by using standard Web services transport and messaging protocols for all interactions between clients and the Web services, and among the replicas. Furthermore, in our framework, clients communicate directly with the replicated Web services.

Thema [26] reported a Byzantine fault tolerant [19] framework for Web services. Even though it is also constructed on a consensus-based replication algorithm like ours, an adaptor is used to interface with an existing implementation of the algorithm [9], which is based on UDP multicast rather than the standard SOAP/HTTP transport; as such, it suffers from the same problem as WS-Replication [28]. It does, however, use a much weaker fault model [19].

Other work [5,10,13,14,15,16,22,27] either uses a different approach, such as checkpointing and replay, or is still at the conceptual stage. Some of this work ignored the consistency issues that arise when performing replication and failure detection over the Internet, which may be problematic because the Internet is largely an asynchronous system. In the following, we briefly summarize each piece of related work we are aware of.

Birman et al. [5] outlined a high availability architecture for Web services. A few core fault tolerance mechanisms were introduced, such as fault monitoring and TCP endpoint sharing. However, no working prototype was reported, and many of the mechanisms described are not specific to Web services.

Chan et al. [10] reported an analysis and experimental results on the evaluation of different fault tolerance approaches for Web services. However, the methods described were generic and did not consider the unique features of Web services. Also, no description was provided of the fault tolerance methods experimented with.

Dialani et al. [13] proposed a high availability architecture for Web services based on checkpointing and replay. A number of mechanisms were introduced to ensure that the system can roll back to a consistent state among several inter-related processes. It is a very different approach from ours, which relies on active replication to achieve fault tolerance.

Dobson [14] proposed to use WS-BPEL as an implementation technique to build fault tolerant Web services. The idea is to use WS-BPEL to provide a single interface for a group of similar Web services so that when one Web service fails, the request can be rerouted to an equivalent Web service. However, such an approach relies on a reliable failure detector, which is not attainable in asynchronous systems such as the Internet.

Erradi and Maheshwari [15] proposed a broker-based architecture for fault tolerant Web services interactions. The focus is on building a message bus that can mediate the interactions reliably, rather than a replication framework. As such, their work is complementary to ours.

Fang et al. [16] and Santos et al. [29] reported a similar fault tolerance architecture for Web services and its implementations. Their architecture is apparently based on the FT-CORBA specification. The focus is on replication and fault management, rather than on ensuring replica and membership consistency.

Looker et al. [22] described a framework that relies on the n-version model and a voting mechanism to ensure fault tolerance of a Web service. There was no description of how to ensure the total ordering of requests or replica membership consistency.

Moser et al. [27] provided a general discussion of fault tolerance techniques that could be used to build fault tolerant Web services. No concrete system was built.

3. System Models

We consider a Web service and its clients interacting over the Internet. When considering the safety of our replication algorithm, we use the asynchronous distributed system model. However, to ensure liveness, certain synchrony must be assumed. Similar to [9], we assume that the message transmission and processing delay has an asymptotic upper bound.
This bound is probed dynamically in our algorithm: each time a view change occurs, the timeout for the new view is doubled.

We assume a crash fault model, i.e., a Web service replica might fail due to hardware or software faults, but once it fails, it stops emitting any messages. In particular, neither the clients nor the replicas behave maliciously. We assume that the network may incur transient faults, but that they can be promptly repaired; i.e., we assume that network partitions do not occur. The Web service is replicated using a state-machine based approach, and hence we assume that the Web service operates deterministically. We are aware that most practical Web services contain some degree of nondeterminism. How to fully cope with such nondeterminism in a systematic manner is beyond the scope of this paper, but we do elaborate on how we address some of the replica nondeterminism we have encountered in Sandesha2, on which this framework is built.

We assume that 2f + 1 replicas are available, among which at most f can be faulty. Similar to [9], each replica is assigned a unique id i, where i varies from 0 to 2f. For view v, the replica whose id i satisfies i = v mod (2f + 1) serves as the primary. Views start from 0. For each view change, the view number is incremented by one and a new primary is selected.
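The primary-selection rule and the view-change timeout policy just described can be captured in a few lines of Java. The following is a simplified illustration under the stated model (2f + 1 replicas with ids 0 to 2f); the class and field names are ours and not part of the framework.

// Illustrative bookkeeping for views and primaries. Only the two rules taken
// from the text (primary id = view mod (2f + 1); timeout doubled on each
// view change) are authoritative; everything else is an assumption.
public class ViewState {
    private final int f;               // maximum number of tolerated crash faults
    private long view = 0;             // views start from 0
    private long viewChangeTimeoutMs;  // timeout used to detect a stalled primary

    public ViewState(int f, long initialTimeoutMs) {
        this.f = f;
        this.viewChangeTimeoutMs = initialTimeoutMs;
    }

    // Replica id of the primary for the current view: i = v mod (2f + 1).
    public int primaryId() {
        return (int) (view % (2L * f + 1));
    }

    public boolean isPrimary(int myReplicaId) {
        return primaryId() == myReplicaId;
    }

    // A view change increments the view number and doubles the timeout,
    // probing the (unknown) bound on transmission and processing delay.
    public void advanceView() {
        view++;
        viewChangeTimeoutMs *= 2;
    }

    public long currentView()      { return view; }
    public long currentTimeoutMs() { return viewChangeTimeoutMs; }
}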

4. Replication Algorithm

In this section, we present our replication algorithm. We first provide a short summary of the original Paxos algorithm [20]. We then show how to adapt the Paxos algorithm for replication. We optimize the performance of the replication algorithm by separating it into a sub-algorithm for normal operation and a sub-algorithm for view changes. We also provide a sketch of the proof of correctness of our replication algorithm. Our replication algorithm ensures the following safety and liveness conditions:

Safety: If an application request r is delivered at a replica in some total order, then no other replica delivers r in a different order.

Liveness: An application request r will eventually be delivered at the replicas according to some total order as long as the system is sufficiently synchronous.

Note that the safety condition guarantees that even if a replica fails right after the delivery of a request, the request will be delivered in the same total order at the other replicas.

4.1. The Paxos Algorithm

Before describing our replication algorithm, it is instructive to summarize the Paxos algorithm [20] and its application to replication. In the original Paxos algorithm, three kinds of agents are used: proposers, acceptors, and learners. The proposers are those who propose values. To differentiate different proposals, each proposal must carry a unique, monotonically increasing proposal number v. The acceptors are those who accept (or reject) the proposals. If the majority of the acceptors have accepted a proposal with a value d, then the value d is said to have been chosen (by the group of acceptors). The learners are those who must find out whether a value has been chosen.

The Paxos algorithm operates in two phases. In phase one, a proposer sends a prepare request with a proposal number v to the acceptors. In response to the prepare request, an acceptor sends the proposer (1) a promise that it will not accept any more proposals numbered less than v, and (2) the highest-numbered proposal, if any, that it has accepted, provided that it has not responded to a higher-numbered proposal. During phase two, the proposer sends an accept request to the acceptors with the proposal number v and a value d, provided that it has collected responses to its prepare request from the majority of the acceptors. The value d is determined to be the value in the highest-numbered proposal among the responses (to the prepare request), or any value selected by the proposer if no acceptor has accepted any proposal previously. An acceptor accepts the accept request with v and d provided that it has not responded to a prepare request with a higher proposal number. After a value has been chosen, there must be a way for the learners to find this out. A simple way to achieve this is for the proposer to disseminate the chosen value to the learners.

[Fig. 1. The Paxos algorithm in the context of replication: a client REQUEST triggers a prepare phase (PREPARE/PREPARE_ACK), an accept phase (ACCEPT/ACCEPT_ACK), a COMMIT, execution, and the REPLY.]

To apply the Paxos algorithm to the replication problem, we assume that each replica may act as all three agents, and the value to be chosen is the total ordering of each application request (in later text, we say the ordering for an application request is committed when the majority of the replicas have agreed on the ordering).
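The two acceptor-side rules above (promise not to accept lower-numbered proposals after answering a prepare request, and report the highest-numbered proposal accepted so far) can be sketched as a single-instance acceptor in Java. This is a textbook illustration only, with names of our choosing; it is not the framework's code.

// Minimal single-instance Paxos acceptor illustrating the two rules in the
// text. No networking or persistence is shown; all names are ours.
public class PaxosAcceptor {

    public record PrepareReply(boolean promised, long acceptedNumber, Object acceptedValue) {}

    private long promisedNumber = -1;    // highest prepare number responded to
    private long acceptedNumber = -1;    // number of the accepted proposal, if any
    private Object acceptedValue = null; // value of the accepted proposal, if any

    // Phase one: handle a prepare request with proposal number n. If the
    // promise is given, the reply also carries the highest-numbered proposal
    // accepted so far (or none).
    public synchronized PrepareReply onPrepare(long n) {
        if (n > promisedNumber) {
            promisedNumber = n;   // promise: no proposals numbered below n will be accepted
            return new PrepareReply(true, acceptedNumber, acceptedValue);
        }
        return new PrepareReply(false, acceptedNumber, acceptedValue);
    }

    // Phase two: accept the proposal (n, value) unless a higher-numbered
    // prepare request has been answered in the meantime.
    public synchronized boolean onAccept(long n, Object value) {
        if (n >= promisedNumber) {
            promisedNumber = n;
            acceptedNumber = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }
}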
During normal operation, a single replica, i.e., the primary, acts as the leader, and only the leader proposes the ordering for each application request. However, it may occur that two or more replicas believe that they are the leader. The Paxos algorithm ensures the safety condition even if this happens. To guarantee liveness, we do need a unique leader to exist among the (majority of) replicas for a sufficiently long period, so that the total ordering for a request can be established. An illustration of the Paxos algorithm in the context of replication is shown in Figure 1.

The essence of the prepare phase in the Paxos algorithm is to ensure that the history is propagated from one proposer to another, so that if a proposal v with value d has been chosen by the acceptors, all future proposers that propose with a higher proposal number select the same value d. The accept phase ensures agreement on the chosen value among the acceptors: a value d is not chosen unless the majority of the acceptors have accepted d. As pointed out in [31], when there is a unique leader, the prepare phase is not needed to reach a consensus.

[Fig. 2. Normal operation of the replication algorithm: client REQUEST, accept phase (ACCEPT/ACCEPT_ACK), execution, COMMIT, and REPLY.]

In the context of replication, the condition for omitting the prepare phase can be relaxed further: as long as the majority of the replicas agree on the same leader, the total ordering of application requests can be established without running the prepare phase. This observation prompted us to decompose the Paxos algorithm into two sub-algorithms, one for normal operation while the majority of the replicas agree on the same leader, and one for the abnormal situation when the leader is suspected by other replicas, which usually leads to the election of a new leader. The change of leadership is referred to as a view change in this paper, and the proposal number in the original Paxos algorithm is referred to as the view number in our replication algorithm. Note that this decomposition is possible because we assume that the initial membership of the replicas, including the leader selection, is established a priori. The benefit of this decomposition is obvious: the normal operation overhead of the replication algorithm is significantly reduced compared with that of the original Paxos algorithm, because the prepare phase is moved out of the critical execution path.

4.2. Normal Operation

The normal operation of the algorithm is shown in Figure 2. When the client issues a request to a replicated Web service, the request is multicast to all replicas. The request has the form <REQUEST, s, m, o>, where s is a unique sequence id, m is the message number within the sequence s, and o is the operation to be invoked on the Web service, together with the necessary parameters. On receiving a client's request, a replica checks whether it is a duplicate. If the request is a duplicate, the primary retrieves the corresponding response from its log (if one can be found) and sends it to the client; the duplicate request is then dropped. The backups simply drop the duplicate without resending the response, for efficiency reasons. Note that the message format described here captures only the essential information needed for total ordering; the actual message is an XML document encoded according to the SOAP standard.

The concept of a sequence is introduced in WS-ReliableMessaging [4]. When the client sends its first request to a Web service via WS-ReliableMessaging, a unique sequence is established between the client and the Web service. Every reliable message sent over the sequence is assigned a message number, which starts at 1 and increases by 1 for each subsequent message sent. A sequence forms a unidirectional reliable channel between two communicating endpoints; therefore, another sequence is established for the Web service to send the replies back to the client. The mechanisms for establishing and terminating a sequence are elaborated in the WS-ReliableMessaging specification [4], and hence they are not repeated here.

When a replica accepts a client's request (to distinguish them from the control messages used to establish total ordering, clients' requests are referred to as application requests from now on), and it is the next expected message in its sequence, the replica starts a view change timer. The timeout is set to allow the consensus on the ordering of the message to be reached.
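The duplicate handling just described (the primary re-sends the logged reply, while backups silently drop the duplicate) could be organized along the following lines. The class, method names, and keying scheme are illustrative assumptions, not the framework's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-replica handling of an incoming <REQUEST, s, m, o> message,
// following the duplicate-handling rules in the text. Names are hypothetical.
public class RequestDispatcher {

    public enum Disposition { ORDER, RESEND_REPLY, DROP_DUPLICATE }

    // Replies already produced, keyed by "sequenceId:messageNumber".
    private final Map<String, byte[]> replyLog = new ConcurrentHashMap<>();
    private final boolean isPrimary;

    public RequestDispatcher(boolean isPrimary) { this.isPrimary = isPrimary; }

    public Disposition classify(String sequenceId, long messageNumber) {
        if (!replyLog.containsKey(key(sequenceId, messageNumber))) {
            // New request: hand it to the replication engine for total ordering
            // (and start the view change timer for it, as described in the text).
            return Disposition.ORDER;
        }
        // Duplicate: the primary re-sends the logged reply; backups drop it.
        return isPrimary ? Disposition.RESEND_REPLY : Disposition.DROP_DUPLICATE;
    }

    public byte[] loggedReply(String sequenceId, long messageNumber) {
        return replyLog.get(key(sequenceId, messageNumber));
    }

    public void logReply(String sequenceId, long messageNumber, byte[] reply) {
        replyLog.put(key(sequenceId, messageNumber), reply);
    }

    private static String key(String s, long m) { return s + ":" + m; }
}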
When the primary p (replica 0 in the figure) is ready to order this message, it assigns the message a monotonically increasing sequence number n (not to be confused with the sequence concept in WS-ReliableMessaging) together with its current view number v, and multicasts an accept request to all replicas (the one addressed to itself is not actually sent over the network; it is stored in the local data structure). The accept message has the form <ACCEPT, v, n, s, m>, where v is the current view number and n is the global sequence number assigned by the primary to the application request identified by s and m.

A backup accepts an accept message provided that it is in view v and has not accepted an accept request with the same or a higher global sequence number in view v. If it receives an accept message for a newer view, a replica contacts the primary of that view for any missing state and messages. Accept messages that belong to an older view are discarded. Note that a backup might receive an accept request ahead of the application request being ordered. As long as the sequence between the backup and the client is open, the backup will eventually receive the request. If the sequence has been terminated due to a premature timeout at the backup, the backup re-establishes the sequence and asks the primary for a retransmission of the message.
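A backup's decision on an incoming accept message therefore reduces to a small check over its current view and the highest global sequence number it has accepted so far. The following is a hedged sketch of that check; the class, method, and outcome names are ours.

// Sketch of the backup-side check for an <ACCEPT, v, n, s, m> message,
// following the acceptance conditions in the text. All names are hypothetical.
public class AcceptHandler {

    public enum Outcome { ACCEPT, DISCARD, FETCH_STATE_FROM_NEWER_PRIMARY }

    private long currentView;
    private long highestAcceptedSeq = -1;   // highest n accepted in currentView

    public AcceptHandler(long initialView) { this.currentView = initialView; }

    public synchronized Outcome onAccept(long v, long n) {
        if (v < currentView) {
            return Outcome.DISCARD;          // accept message from an older view
        }
        if (v > currentView) {
            // The sender is already in a newer view: contact that view's primary
            // for the missing state and messages before handling further accepts.
            return Outcome.FETCH_STATE_FROM_NEWER_PRIMARY;
        }
        if (n <= highestAcceptedSeq) {
            return Outcome.DISCARD;          // same or higher n already accepted in this view
        }
        highestAcceptedSeq = n;              // record the accepted ordering
        return Outcome.ACCEPT;               // the caller then sends <ACCEPT_ACK, v, n>
    }
}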

The absence of the application request being ordered does not prevent a backup from accepting the accept request. When a backup accepts the accept request, it stores the message in its data structure and sends an accept response to the primary. The accept response has the form <ACCEPT_ACK, v, n>. At this point, we say that the replica has accepted the ordering for the application request with sequence number n in view v.

When the primary receives an accept response, it verifies that the response contains a sequence number and view number matching the accept request it has sent, and logs a valid accept response in its data structure. When the accept responses from different replicas, together with its own accept request, form a quorum, i.e., the total number of such messages is equal to f + 1, the primary knows that the ordering for the application message has been committed by the replicas. The application request can then be delivered, provided that all previous requests have been delivered to the Web service.

Before a backup can deliver and execute the application request, however, it must be sure that a quorum of replicas have agreed on the ordering for the message. This requires the primary to disseminate a commit message <COMMIT, v, n> to all backups when it has collected f accept responses from different backups. The commit message is acknowledged at the transport level (by the WS-ReliableMessaging mechanisms) rather than at the algorithm level. On receiving the commit message, a backup knows that the ordering for the application request with sequence number n is committed, and it is ready to deliver the application request being ordered if it has delivered all previous requests to the Web service. When the primary finishes executing the application request, it logs the corresponding reply and sends it to the client. For performance reasons, a backup only logs the reply and does not actually send it over the network, unless the replica becomes the new primary after a view change. The logged replies are garbage collected when the clients acknowledge them.

4.3. Garbage Collection and Checkpointing

A replica must keep the application requests and their ordering information in its log until all non-faulty replicas have delivered them. To avoid holding on to these messages forever, each replica periodically takes a checkpoint of its state according to a deterministic algorithm (say, one checkpoint for every 100 requests executed). After taking a checkpoint, a replica multicasts a checkpoint message to all other replicas. The checkpoint message has the form <CHECKPOINT, n, i>, where n is the sequence number of the last application request executed before the checkpoint was taken, and i is the replica id. If a replica has collected a quorum (i.e., f + 1) of checkpoint messages for n from different replicas (including the message it has sent itself), the checkpoint for n is said to have become stable, and the replica garbage collects all logged messages up to n, together with the associated control messages (accept, commit, etc.). It also deletes all previous checkpoints.
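The stability rule for checkpoints (a quorum of f + 1 matching checkpoint announcements, counting the replica's own) and the garbage collection that follows could be tracked as below. This is an illustrative sketch with hypothetical names; the calls into actual storage are omitted.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of checkpoint bookkeeping: a checkpoint for sequence number n becomes
// stable once f + 1 distinct replicas (including this one) have announced it,
// after which everything up to n can be garbage collected. Names are ours.
public class CheckpointTracker {
    private final int f;
    // For each checkpoint sequence number n, the ids of replicas that reported it.
    private final Map<Long, Set<Integer>> reporters = new HashMap<>();
    private long lastStableCheckpoint = -1;

    public CheckpointTracker(int f) { this.f = f; }

    // Handles a <CHECKPOINT, n, i> message; returns true if n just became stable.
    public synchronized boolean onCheckpoint(long n, int replicaId) {
        if (n <= lastStableCheckpoint) {
            return false;                            // already stable; nothing to do
        }
        Set<Integer> ids = reporters.computeIfAbsent(n, k -> new HashSet<>());
        ids.add(replicaId);
        if (ids.size() >= f + 1) {                   // quorum of checkpoint messages
            lastStableCheckpoint = n;
            garbageCollectUpTo(n);
            return true;
        }
        return false;
    }

    private void garbageCollectUpTo(long n) {
        // Here the real framework would discard logged application requests,
        // accept/commit records, and older checkpoints up to n.
        reporters.keySet().removeIf(seq -> seq <= n);
    }

    public synchronized long lastStable() { return lastStableCheckpoint; }
}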
A backup might lag behind and need an application request that has already been garbage collected by the primary, in which case it asks the primary for a state transfer instead.

4.4. View Change

If a backup i cannot advance to the committed state on expiration of the view change timer, it initiates a view change by sending a view change message to all other replicas, as shown in Figure 3.

[Fig. 3. Sequence diagram showing the steps of the view change algorithm: the primary for view v is suspected, the replicas exchange VIEW_CHANGE messages, and the primary for view v+1 installs the new view with a NEW_VIEW message.]

The view change message has the form <VIEW_CHANGE, v + 1, l, P, i>, where l is the sequence number of the last stable checkpoint known to i, and P is a set of accepted records for all application requests whose ordering has been accepted by replica i. Each accepted record is a tuple <view, n, s, m>, where view is v or smaller. To ensure liveness, on receiving a view change message a replica also suspects the primary and multicasts its own view change message, provided that the received view change message is for a future view. Once a replica suspects the primary, it stops participating in the message ordering process and accepts only checkpointing and view change related messages, until a new view is installed.
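The content of a view change message is thus essentially the replica's last stable checkpoint plus its accepted records. A minimal sketch of how such a message might be assembled is shown below; the record types and names are illustrative assumptions.

import java.util.Collection;
import java.util.List;

// Illustrative construction of a <VIEW_CHANGE, v+1, l, P, i> message from a
// replica's local state when its view change timer expires. Names are ours.
public class ViewChangeBuilder {

    // One accepted record <view, n, s, m>, as described in the text.
    public record AcceptedRecord(long view, long seq, String sequenceId, long messageNumber) {}

    // The view change message itself: <VIEW_CHANGE, newView, lastStableCheckpoint, P, replicaId>.
    public record ViewChangeMessage(long newView, long lastStableCheckpoint,
                                    List<AcceptedRecord> accepted, int replicaId) {}

    public static ViewChangeMessage build(long currentView,
                                          long lastStableCheckpoint,
                                          Collection<AcceptedRecord> acceptedRecords,
                                          int replicaId) {
        // Copy the accepted records into the message so that the new primary can
        // re-propose their orderings in the new view.
        return new ViewChangeMessage(currentView + 1, lastStableCheckpoint,
                                     List.copyOf(acceptedRecords), replicaId);
    }
}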

When the primary of view v + 1 has collected f + 1 view change messages for view v + 1, including the one it would have sent itself, it installs the new view and notifies the backups with a new view message <NEW_VIEW, v + 1, O>, where O is a set of accept messages. The accept messages included in O are determined in the following way. If the new primary received an accepted record <view, n, s, m> in a view change message (including the one it would have sent), it constructs a new accept message <ACCEPT, v, n, s, m>. There might be a gap between the sequence number of the last known checkpoint and the smallest sequence number of an accepted record, or a gap between two accepted records, in which case an accept message is created for a null application request, i.e., <ACCEPT, v, n, s_null, m_null>. The execution of a null application request is a no-op (i.e., there is no actual execution for the null message).

When a backup receives the new view message, it accepts the message if it has not installed a newer view. If the replica accepts the new view message, it installs the new view and processes the accept requests included in the new view message as usual. If an application request has already been executed in an older view, it is not re-executed in the new view. The view change algorithm ensures that if the ordering of a request has been committed at any replica in a view, the associated accept record is propagated to the new view.

4.5. Proof of Correctness

We now sketch the proof of the safety and liveness properties of our replication algorithm.

Safety: If an application request r is delivered at a replica in some total order, then no other replica delivers r in a different order.

Proof. The theorem follows from the two lemmas below.

Lemma 1: All replicas that commit an application request r in the same view v agree on the same sequence number n.

Proof. We prove by contradiction. Assume that two replicas i and j committed the same application request r with two different sequence numbers m and n, respectively. For replica i to commit the request with m, it must have received a commit request from the primary (or have sent a commit request if it is the primary itself). This means that a quorum R1 of replicas have accepted the assignment of sequence number m to r. Similarly, because j committed r with a different sequence number n, a quorum R2 of replicas must have accepted the assignment of sequence number n to r. By the definition of a quorum, R1 and R2 must intersect in at least one non-faulty replica, which implies that this replica has accepted two different sequence numbers for the same request r. This contradicts our algorithm, because a non-faulty replica accepts only one sequence number for each request in a single view. Therefore, Lemma 1 holds.

Lemma 2: Replicas that commit an application request r in different views agree on the same sequence number.

Proof. We prove by contradiction. Assume that replica i committed r with sequence number m in view v, and replica j committed r with a different sequence number n in view u. Without loss of generality, assume u > v. Since i committed r with m in view v, a quorum R3 of replicas have accepted the sequence number assignment for r. To install the new view u, the new primary must have collected view change messages from a quorum R4 of replicas. R3 and R4 must intersect in at least one non-faulty replica.
Since this replica has accepted the sequence number assignment for r, it would have included the accepted tuple in its view change message sent to the new primary, and the new primary must have constructed an accept message using the sequence number m for r. If j committed r in view u, it must have accepted the accept message for the binding of m and r, which contradicts our assumption. Therefore, Lemma 2 holds.

Liveness: An application request r will eventually be delivered at the replicas according to some total order as long as the system is sufficiently synchronous.

Proof. The replication algorithm ensures liveness only during periods of synchrony. To prove liveness, we first show that if the primary is not faulty during the period of synchrony, the request will be ordered and delivered at the correct replicas, and the client (if it does not fail) will receive the corresponding reply. We then show that if the request does not complete at all replicas during the current view, then a view change occurs.

If the primary is not faulty, then during the period of synchrony it will order the request, multicast an accept message to all backups, and collect at least f accept responses from the backups (which, together with its own accept, form a quorum of f + 1). The primary then commits the message and delivers it once all previously ordered messages have been delivered. After the primary has processed the request, the reply is sent to the client.

If the request cannot complete at all replicas in the current view, each correct replica multicasts a view change message on expiration of its view change timer. Since there are at most f faulty replicas, at least f + 1 replicas will perform this action on expiration of their view change timers. The view change messages sent by these f + 1 replicas lead to the installation of a new view according to our view change algorithm. Liveness might be hampered if a replica mistakenly suspects the primary, because once a replica suspects the primary, it stops participating in the ordering of application requests. To address this issue, a replica also multicasts a view change message upon receiving one from another replica, even if it has not itself suspected the primary. This mechanism guarantees that as long as a non-faulty replica suspects the primary (even if by mistake), a new view will eventually be installed.

5. Membership Management

The replication algorithm described in the previous section assumes a static membership, i.e., the server replicas are predetermined and the composition of the replica group does not change over time. This is clearly very restrictive, because some replicas might fail and need to be repaired and restarted, and the degree of replication might have to be adjusted due to changing quality of service requirements. In this section, we introduce a set of mechanisms that can be used to perform membership management. We do not intend to support arbitrarily dynamic membership formation. Besides the initial static membership configuration, we assume that changes of the replication degree are carried out in the following planned manner:

- For each expansion, two new replicas are added so that the number of tolerated faults increases by one.
- For each reduction, two existing replicas are removed so that the number of tolerated faults decreases by one.

[Fig. 4. Steps of expanding the membership size: each new replica's JOIN request is ordered via ACCEPT/ACCEPT_ACK/COMMIT; the first new replica is marked, and the switch from f=1 to f=2 takes place when the second new replica joins.]

For an expansion, the new replicas are informed of the endpoints of the existing replicas so that they can initiate the join process. For a reduction, the replicas to be removed are informed and they initiate the leave requests.

5.1. Rejoin

A long-running replicated service requires a mechanism that allows a previously failed replica to rejoin the existing replicas once it has been repaired. Our consensus-based replication algorithm enables the temporary suspension of a replica for repair and its subsequent rejoin without any change of membership formation. If the failed replica was the primary of its view, a view change takes place, but the rejoin of the replica does not cause any view change, because another replica has taken over the primary role in the new view since the failure. To rejoin, the recovering replica simply multicasts a state transfer request, and the primary of the current view distributes the latest checkpoint, and all requests received since that checkpoint, to the recovering replica.
The recovering replica can participate in the normal operation of the replication algorithm as soon as it is started.
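On the primary, answering a state transfer request from a recovering (or newly joined) replica amounts to packaging the latest stable checkpoint together with the requests ordered since it. A possible sketch, with hypothetical types and names, is given below.

import java.util.List;

// Sketch of the primary's response to a state transfer request, per the rejoin
// mechanism in the text. All types and names here are illustrative assumptions.
public class StateTransferService {

    public record OrderedRequest(long seq, byte[] request) {}
    public record StateTransferReply(long checkpointSeq, byte[] checkpoint,
                                     List<OrderedRequest> requestsSinceCheckpoint) {}

    // Minimal storage abstractions so the sketch is self-contained.
    public interface CheckpointStore { long latestStableSeq(); byte[] latest(); }
    public interface RequestLog { List<OrderedRequest> after(long seq); }

    private final CheckpointStore checkpoints;
    private final RequestLog requestLog;

    public StateTransferService(CheckpointStore checkpoints, RequestLog requestLog) {
        this.checkpoints = checkpoints;
        this.requestLog = requestLog;
    }

    public StateTransferReply onStateTransferRequest() {
        long n = checkpoints.latestStableSeq();
        byte[] snapshot = checkpoints.latest();
        // Ship the checkpoint plus every request ordered after it, so that the
        // recovering replica can catch up and rejoin normal operation.
        List<OrderedRequest> tail = requestLog.after(n);
        return new StateTransferReply(n, snapshot, tail);
    }
}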

[Fig. 5. Without totally ordering the join/leave requests with respect to other requests, the safety condition might be violated: replicas may order different requests j and k with the same sequence number n under different membership views.]

5.2. Increasing Replication Degree

When increasing the replication degree, two new replicas are added to the membership, one after another, according to the mechanism described in Figure 4. The new replica multicasts a join request of the form <JOIN, i, t>, where i is the replica id and t is the timestamp of the join request. The join request is totally ordered with respect to all application requests, using the same agreement algorithm. The total ordering is needed so that the switch-over to a higher replication degree can be carried out by all replicas from the same replication state; this is important to ensure the safety and liveness of the replication algorithm. Figure 5 illustrates a scenario in which safety is violated if the total ordering of the membership change is not enforced. Upon joining the membership, the new replica must retrieve the latest checkpoint, and all application requests since that checkpoint, from the existing replicas. To alleviate the burden on the primary, the new replica can ask for this information from any existing replica in the same view.

A tricky issue is when to switch to the new replication degree. It is not a good idea to switch, i.e., to go from 2f + 1 to 2(f + 1) + 1, immediately after the first new replica joins. Switching at this point does not increase the resiliency, because one more replica is needed to raise the fault tolerance from f to f + 1 faulty replicas. Therefore, after the first new replica joins, the new membership is marked, and the new replica can obtain the latest state from the existing replicas, but it is not allowed to participate in the message ordering process. The switch to the new replication degree takes place when the second new replica has joined, as shown in Figure 4.

[Fig. 6. Steps of reducing the replication degree: each LEAVE request is ordered via ACCEPT/ACCEPT_ACK/COMMIT; the first leaving replica is marked, and both replicas leave the group once the second leave request is processed.]

5.3. Decreasing Replication Degree

When decreasing the replication degree, two replicas are removed from the membership, one after another, according to the mechanism shown in Figure 6. The replica to be removed multicasts a leave request of the form <LEAVE, i, t>. Similar to the join request, the leave request is totally ordered with respect to all application requests; the reason for the total ordering is the same as for the join process. Note that if one of the two replicas to be removed is the primary of the current view, the primary initiates a view change prior to the start of the leave process. When the primary itself initiates a view change, the new view is installed immediately.

The first replica to leave is marked but not removed from the membership until the second leave request has been processed. In the meantime, this replica must participate in all ordering tasks, as it is needed to maintain the fault tolerance degree. Consider the following scenario with an initial replication degree of 5 (tolerating up to 2 faulty replicas): the primary fails right after the first leave request is honored.
If the first replica to leave were removed from the membership right away, there would be only three remaining replicas, including the second replica to be removed from the membership.

If this second replica becomes faulty before the leave process finishes, liveness is lost, because the remaining two replicas cannot complete the consensus algorithm (with a fault tolerance degree of f = 2). Note that it is not an option to reduce the fault tolerance degree from f + 1 to f right after the removal of the first replica, because the remaining replicas might then reach different decisions, thereby violating the safety property.

6. Implementation and Performance Evaluation

6.1. Fault Tolerance Framework Architecture

We have implemented our fault tolerance framework by extending Sandesha2. The architecture of the server-side framework is shown in Figure 7.

[Fig. 7. The architecture of the server-side framework: incoming messages flow through the In Handler into the In Msg Queue and are delivered to the Web service by the Total Order Invoker under the control of the Replication Engine; outgoing messages flow through the Out Handler into the Out Msg Queue and are sent by the Sender.]

The In-Order Invoker component in Sandesha2 is replaced by a Total Order Invoker, which is responsible for delivering requests to the replicated Web service in a total order. At each replica, the Total Order Invoker polls the Replication Engine for the next application request to be delivered and then fetches the corresponding application message from the In Msg Queue, which stores all incoming application requests.

The Sender component in Sandesha2 is replaced by a multicast-capable Sender. On the server side, the multicast is used only by the primary, and by the backups for the checkpoint and view change messages. To perform the multicast, multiple threads are launched to concurrently send the same message to different destinations using Axis2. Each thread is responsible for sending the message point-to-point to a distinct destination; no proprietary reliable multicast tool is used. If the destination is temporarily unreachable, the thread retries the sending a few times. When the sending succeeds, or when it still fails after the retries, the thread reports the status to the Sender component for the corresponding destination. The Sender performs further retransmissions to that destination according to the WS-ReliableMessaging mechanisms.

The In Handler in Sandesha2 is augmented to handle the control messages used by the replication algorithm (i.e., accept, accept response, commit, checkpoint, view change, and new view messages). All incoming messages are placed in the In Msg Queue after preliminary processing, before they are delivered to the Web service by the Total Order Invoker. The Out Handler in Sandesha2 is largely left intact. All outgoing messages are placed in the Out Msg Queue before they are sent out by the Sender component. The sending of the replication control messages is carried out using the normal Axis2 interface, which means that such messages are treated as application messages by the Sandesha2 mechanisms, except that some of them are multicast by the Sender component. The Sender knows which messages to multicast by examining the SOAP action property included in each message. If a multicast is needed, the Replication Engine is consulted to obtain the multicast destinations.

The Replication Engine is the core addition to the Sandesha2 framework. This component drives the execution of the replication algorithm. At the primary, when an application request becomes the next message to be delivered in its sequence according to WS-ReliableMessaging, it is assigned a global sequence number and an accept message is sent for it. The request will not be delivered to the Web service until it has been totally ordered and all previously ordered messages have been delivered. The Replication Engine uses its own storage to log the replication control messages and the checkpoints.
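The multicast fan-out performed by the Sender, as described above, sends the same message to every destination on its own thread and retries a few times before deferring to WS-ReliableMessaging-level retransmission. The following simplified sketch uses a plain thread pool and a placeholder transport call; the real component drives Axis2, and all names here are ours.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified sketch of the Sender's multicast: the same message is sent
// point-to-point to each destination concurrently, with a few retries per
// destination. The Transport interface stands in for the Axis2 send path.
public class MulticastSender {
    private static final int RETRIES = 3;

    public interface Transport { boolean send(byte[] message, String destinationUrl); }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Transport transport;

    public MulticastSender(Transport transport) { this.transport = transport; }

    public void multicast(byte[] message, List<String> destinationUrls) {
        for (String destination : destinationUrls) {
            pool.submit(() -> sendWithRetry(message, destination));
        }
    }

    private void sendWithRetry(byte[] message, String destination) {
        for (int attempt = 1; attempt <= RETRIES; attempt++) {
            if (transport.send(message, destination)) {
                reportStatus(destination, true);
                return;
            }
        }
        // Give up here; WS-ReliableMessaging-level retransmission takes over.
        reportStatus(destination, false);
    }

    private void reportStatus(String destination, boolean delivered) {
        // In the real framework this feeds the Sender's per-destination
        // retransmission bookkeeping; here it is only a placeholder.
        System.out.println(destination + (delivered ? ": delivered" : ": deferred to WS-RM"));
    }
}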
The client-side architecture is similar, except that the application replies are not totally ordered (they are, however, FIFO ordered within each sequence according to WS-ReliableMessaging). On the client side, the Replication Engine simply keeps the server-side configuration information so that the Sender knows where to multicast the application requests.

6.2. Optimization

In our implementation, we use the following two optimizations. The first optimization reduces the number of communication steps for each invocation from four to three, and therefore reduces the end-to-end latency significantly. The second optimization batches application requests for total ordering, which improves the system throughput.

Latency Optimization. To reduce the end-to-end latency, we employ the tentative execution mechanism introduced in [9].

[Fig. 8. Tentative execution of an invocation: the reply is sent to the client as soon as a replica has tentatively executed the request, while the ACCEPT/ACCEPT_ACK/COMMIT exchange completes in the background.]

As shown in Figure 8, as soon as the primary has ordered an application request and has executed all application requests ordered previously, it tentatively delivers and executes the message, and the reply is sent to the client. Similarly, as soon as a backup has accepted an accept message for an application request, it tentatively delivers and executes the request and sends the reply to the client, provided that it has executed the previous requests. To enable this optimization, the client cannot deliver the first reply it receives right away. Instead, it must wait until it has collected f + 1 matching replies from different replicas (a sketch of this client-side voting appears at the end of this subsection). If the primary fails, the client might not be able to collect f + 1 matching replies, in which case it abandons the incomplete reply set. In case of a primary failure, a backup that has tentatively executed a request might have to be rolled back to its last checkpoint and re-execute the request in a potentially different order, as instructed by the new primary.

Throughput Optimization. Even though in the description of the replication algorithm each application request is assigned its own sequence number, doing so would be very inefficient. Similar to the BFT framework [9], we incorporate a batching mechanism to improve the system throughput. The batching mechanism works as follows. The primary does not immediately order an application request when it becomes FIFO-ordered within its sequence; instead, it postpones doing so if there are already k batches of messages being ordered, where k is a tunable parameter that is usually set to 1. When the primary is ready to order a new batch of messages, it assigns the next sequence number to a group of application requests, at most one per sequence, and the requests ordered must have been FIFO ordered within their own sequences.
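With tentative execution, the client delivers a result only after it has collected f + 1 matching replies from distinct replicas. The following is a rough sketch of such a client-side voter; the names and the digest-based comparison of replies are our own assumptions.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the client-side voter required by tentative execution: a reply is
// delivered only once f + 1 matching replies from distinct replicas have been
// collected. Matching is approximated here by comparing a reply digest.
public class ReplyVoter {
    private final int f;
    // digest of the reply content -> ids of the replicas that returned it
    private final Map<String, Set<Integer>> votes = new HashMap<>();

    public ReplyVoter(int f) { this.f = f; }

    // Returns true when this reply completes a quorum of f + 1 matching replies.
    public synchronized boolean onReply(int replicaId, String replyDigest) {
        Set<Integer> voters = votes.computeIfAbsent(replyDigest, k -> new HashSet<>());
        voters.add(replicaId);
        return voters.size() >= f + 1;
    }

    // Called when the invocation is abandoned, e.g., after a primary failure.
    public synchronized void reset() {
        votes.clear();
    }
}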
6.3. Performance Evaluation

Our performance evaluation is carried out on a testbed consisting of 12 Dell SC440 servers connected by a 100 Mbps Ethernet. Each server is equipped with a single Pentium D 2.8 GHz processor and 1 GB of memory, and runs SuSE 10.2 Linux. In this subsection, we report two types of experimental results. We first report the runtime overhead of our replication algorithm during normal operation. We then present the experimental results for membership management related tasks.

A backup failure has virtually no effect on the operation of the algorithm, and hence we see no noticeable degradation of runtime performance. However, when the primary fails, the client sees a significant delay if it has a request pending to be ordered, due to the timeout value for view changes. The timeout is usually set to 2 seconds in our experiments, which are carried out in a LAN environment; in an Internet environment, the timeout would be set to a higher value. If there are consecutive primary failures, the delay is even longer.

Micro-benchmark for Normal Operation. An echo test application is used to micro-benchmark the runtime overhead. The client sends a request to the replicated Web service and waits for the corresponding reply, in a loop without any think time between two consecutive calls. The request (and the reply) contains an XML document with a varying number of elements, encoded using AXIOM (the AXis Object Model) [1]. At the replicated Web service, the request is parsed and a nearly identical reply XML document is returned to the client. In each run, 1000 samples are obtained. The end-to-end latency of the echo operation is measured at the client. The latency of the application processing (parsing the request and generating the reply) and the throughput are measured at the replicated Web service. In our experiments, we vary the number of replicas, the request size in terms of the number of elements in each request, and the number of concurrent clients.

Figure 9 shows the end-to-end latency and throughput measurement results. In Figure 9(a), the end-to-end latency of the echo operation is reported for replication degrees of 1, 3 and 5. Note that when there is only a single replica, our framework falls back to the Sandesha2 implementation without incurring any additional overhead.

[Fig. 9. End-to-end latency and throughput measurement results. (a) End-to-end latency of the echo operation for replication degrees of 1, 3 and 5. (b)-(d) System throughput with and without replication, for request sizes of 100, 500, and 1000 elements per call, respectively, as a function of the number of concurrent clients.]

As can be seen, the latency incurred by our replication algorithm for three-way replication is about 50 ms, consistent with our expectation that only one additional communication step is incurred in our fault tolerance framework compared with the non-replicated case. The throughput results are shown in Figure 9(b)-(d). As can be seen, for short requests the throughput degradation when replication is enabled is significant, especially when there are many concurrent clients. This is not surprising considering the complexity of the replication algorithm. Even with optimal batching for 6 concurrent clients, the primary must send 2 control messages and receive 4 control messages (2 of them at the transport level) to order the 6 application requests. The approximately 30% reduction in throughput is nearly optimal (note that the control messages are much shorter than the application requests with 100 elements). A 2/3 reduction in throughput is reported in [28] when the standard SOAP protocol and Web services transports are used, which is much less efficient due to their architecture. When the application request complexity is increased, the throughput reduction becomes smaller, as shown in Figure 9(c) and (d).

Membership Management Experimental Results. The latency measurement results for the rejoin of a repaired replica and for the expansion of the replication degree with various state sizes (in terms of the number of data items in the state) are summarized in Figure 10. The inset shows the relationship between the number of data items and the encoded checkpoint size of the state. The latency for the rejoin of a repaired replica is dominated by the cost of the state transfer from the primary to the recovering replica. The expansion of the replication degree takes longer because (1) the join request must be totally ordered with respect to other requests


More information

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Appears as Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, June 999 Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Miguel Castro and Barbara Liskov Laboratory

More information

Byzantine Fault Tolerant Coordination for Web Services Atomic Transactions

Byzantine Fault Tolerant Coordination for Web Services Atomic Transactions Cleveland State University EngagedScholarship@CSU Electrical Engineering & Computer Science Faculty Publications Electrical Engineering & Computer Science Department 2007 Byzantine Fault Tolerant Coordination

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004)

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004) Cheap Paxos Leslie Lamport and Mike Massa Appeared in The International Conference on Dependable Systems and Networks (DSN 2004) Cheap Paxos Leslie Lamport and Mike Massa Microsoft Abstract Asynchronous

More information

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast HariGovind V. Ramasamy Christian Cachin August 19, 2005 Abstract Atomic broadcast is a communication primitive that allows a group of

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008 381 Incompatibility Dimensions and Integration of Atomic Commit Protocols Yousef Al-Houmaily Department of Computer

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Appears in the Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999 Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Laboratory

More information

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 Network Working Group Request for Comments: 969 David D. Clark Mark L. Lambert Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 1. STATUS OF THIS MEMO This RFC suggests a proposed protocol

More information

Key-value store with eventual consistency without trusting individual nodes

Key-value store with eventual consistency without trusting individual nodes basementdb Key-value store with eventual consistency without trusting individual nodes https://github.com/spferical/basementdb 1. Abstract basementdb is an eventually-consistent key-value store, composed

More information

Asynchronous Reconfiguration for Paxos State Machines

Asynchronous Reconfiguration for Paxos State Machines Asynchronous Reconfiguration for Paxos State Machines Leander Jehl and Hein Meling Department of Electrical Engineering and Computer Science University of Stavanger, Norway Abstract. This paper addresses

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What

More information

Fault Tolerance. Distributed Software Systems. Definitions

Fault Tolerance. Distributed Software Systems. Definitions Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Pawel Jurczyk and Li Xiong Emory University, Atlanta GA 30322, USA {pjurczy,lxiong}@emory.edu Abstract. The continued advances

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Exam 2 Review. October 29, Paul Krzyzanowski 1

Exam 2 Review. October 29, Paul Krzyzanowski 1 Exam 2 Review October 29, 2015 2013 Paul Krzyzanowski 1 Question 1 Why did Dropbox add notification servers to their architecture? To avoid the overhead of clients polling the servers periodically to check

More information

Byzantine Fault Tolerance and Consensus. Adi Seredinschi Distributed Programming Laboratory

Byzantine Fault Tolerance and Consensus. Adi Seredinschi Distributed Programming Laboratory Byzantine Fault Tolerance and Consensus Adi Seredinschi Distributed Programming Laboratory 1 (Original) Problem Correct process General goal: Run a distributed algorithm 2 (Original) Problem Correct process

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

AS distributed systems develop and grow in size,

AS distributed systems develop and grow in size, 1 hbft: Speculative Byzantine Fault Tolerance With Minimum Cost Sisi Duan, Sean Peisert, Senior Member, IEEE, and Karl N. Levitt Abstract We present hbft, a hybrid, Byzantine fault-tolerant, ted state

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Replicated State Machine in Wide-area Networks

Replicated State Machine in Wide-area Networks Replicated State Machine in Wide-area Networks Yanhua Mao CSE223A WI09 1 Building replicated state machine with consensus General approach to replicate stateful deterministic services Provide strong consistency

More information

The UNIVERSITY of EDINBURGH. SCHOOL of INFORMATICS. CS4/MSc. Distributed Systems. Björn Franke. Room 2414

The UNIVERSITY of EDINBURGH. SCHOOL of INFORMATICS. CS4/MSc. Distributed Systems. Björn Franke. Room 2414 The UNIVERSITY of EDINBURGH SCHOOL of INFORMATICS CS4/MSc Distributed Systems Björn Franke bfranke@inf.ed.ac.uk Room 2414 (Lecture 13: Multicast and Group Communication, 16th November 2006) 1 Group Communication

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

ABSTRACT. Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed

ABSTRACT. Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed ABSTRACT Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed processing over the internet. Trustworthy coordination of transactions is essential to ensure proper running

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Proactive Recovery in a Byzantine-Fault-Tolerant System

Proactive Recovery in a Byzantine-Fault-Tolerant System Proactive Recovery in a Byzantine-Fault-Tolerant System Miguel Castro and Barbara Liskov Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139

More information

Evaluating BFT Protocols for Spire

Evaluating BFT Protocols for Spire Evaluating BFT Protocols for Spire Henry Schuh & Sam Beckley 600.667 Advanced Distributed Systems & Networks SCADA & Spire Overview High-Performance, Scalable Spire Trusted Platform Module Known Network

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

Paxos Made Simple. Leslie Lamport, 2001

Paxos Made Simple. Leslie Lamport, 2001 Paxos Made Simple Leslie Lamport, 2001 The Problem Reaching consensus on a proposed value, among a collection of processes Safety requirements: Only a value that has been proposed may be chosen Only a

More information

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Practical Byzantine Consensus CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Scenario Asynchronous system Signed messages s are state machines It has to be practical CS 138

More information

Practical Byzantine Fault Tolerance Using Fewer than 3f+1 Active Replicas

Practical Byzantine Fault Tolerance Using Fewer than 3f+1 Active Replicas Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems San Francisco, California, pp 241-247, September 24 Practical Byzantine Fault Tolerance Using Fewer than 3f+1

More information

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports Server failures are the common case in data centers

More information

Lecture XII: Replication

Lecture XII: Replication Lecture XII: Replication CMPT 401 Summer 2007 Dr. Alexandra Fedorova Replication 2 Why Replicate? (I) Fault-tolerance / High availability As long as one replica is up, the service is available Assume each

More information

Coordination and Agreement

Coordination and Agreement Coordination and Agreement Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Introduction 2. Distributed Mutual Exclusion 3. Elections 4. Multicast Communication 5. Consensus and related problems

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Assignment 12: Commit Protocols and Replication Solution

Assignment 12: Commit Protocols and Replication Solution Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

Introduction to Distributed Systems Seif Haridi

Introduction to Distributed Systems Seif Haridi Introduction to Distributed Systems Seif Haridi haridi@kth.se What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Paxos and Replication. Dan Ports, CSEP 552

Paxos and Replication. Dan Ports, CSEP 552 Paxos and Replication Dan Ports, CSEP 552 Today: achieving consensus with Paxos and how to use this to build a replicated system Last week Scaling a web service using front-end caching but what about the

More information

Byzantine Fault Tolerant Coordination for Web Services Business Activities

Byzantine Fault Tolerant Coordination for Web Services Business Activities Byzantine Fault Tolerant Coordination for Web s Business Activities Wenbing Zhao and Honglei Zhang Department of Electrical and Computer Engineering Cleveland State University, 2121 Euclid Ave, Cleveland,

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

COMMUNICATION IN DISTRIBUTED SYSTEMS

COMMUNICATION IN DISTRIBUTED SYSTEMS Distributed Systems Fö 3-1 Distributed Systems Fö 3-2 COMMUNICATION IN DISTRIBUTED SYSTEMS Communication Models and their Layered Implementation 1. Communication System: Layered Implementation 2. Network

More information

Practical Byzantine Fault Tolerance and Proactive Recovery

Practical Byzantine Fault Tolerance and Proactive Recovery Practical Byzantine Fault Tolerance and Proactive Recovery MIGUEL CASTRO Microsoft Research and BARBARA LISKOV MIT Laboratory for Computer Science Our growing reliance on online services accessible on

More information

CS603: Distributed Systems

CS603: Distributed Systems CS603: Distributed Systems Lecture 2: Client-Server Architecture, RPC, Corba Cristina Nita-Rotaru Lecture 2/ Spring 2006 1 ATC Architecture NETWORK INFRASTRUCTURE DATABASE HOW WOULD YOU START BUILDING

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen Beyond FLP Acknowledgement for presentation material Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen Paper trail blog: http://the-paper-trail.org/blog/consensus-protocols-paxos/

More information

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Holger Karl Computer Networks Group Universität Paderborn Goal of this chapter Apart from issues in distributed time and resulting

More information

416 practice questions (PQs)

416 practice questions (PQs) 416 practice questions (PQs) 1. Goal: give you some material to study for the final exam and to help you to more actively engage with the material we cover in class. 2. Format: questions that are in scope

More information

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1 AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1 Jennifer Ren, Michel Cukier, and William H. Sanders Center for Reliable and High-Performance Computing Coordinated Science Laboratory

More information

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Guanfeng Liang and Nitin Vaidya Department of Electrical and Computer Engineering, and Coordinated Science Laboratory University

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

A Reservation-Based Extended Transaction Protocol for Coordination of Web Services

A Reservation-Based Extended Transaction Protocol for Coordination of Web Services A Reservation-Based Extended Transaction Protocol for Coordination of Web Services Wenbing Zhao Dept. of Electrical and Computer Engineering Cleveland State University Cleveland, OH 44115 wenbing@ieee.org

More information

Module 8 - Fault Tolerance

Module 8 - Fault Tolerance Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced

More information

Fast Paxos. Leslie Lamport

Fast Paxos. Leslie Lamport Distrib. Comput. (2006) 19:79 103 DOI 10.1007/s00446-006-0005-x ORIGINAL ARTICLE Fast Paxos Leslie Lamport Received: 20 August 2005 / Accepted: 5 April 2006 / Published online: 8 July 2006 Springer-Verlag

More information

Consensus on Transaction Commit

Consensus on Transaction Commit Consensus on Transaction Commit Jim Gray and Leslie Lamport Microsoft Research 1 January 2004 revised 19 April 2004, 8 September 2005, 5 July 2017 MSR-TR-2003-96 This paper appeared in ACM Transactions

More information