A Lightweight Fault Tolerance Framework for Web Services

Web Intelligence and Agent Systems: An International Journal 0 (2008), IOS Press

A Lightweight Fault Tolerance Framework for Web Services

Wenbing Zhao*, Honglei Zhang and Hua Chai
Department of Electrical and Computer Engineering, Cleveland State University, 2121 Euclid Ave, Cleveland, OH 44115, USA
{w.zhao1,h.zhang105,h.chai}@csuohio.edu

(This work was supported by Department of Energy Contract DE-FC26-06NT42853, and by Cleveland State University through a Faculty Research Development award. An earlier version of this paper was presented at the 2007 IEEE/WIC/ACM International Conference on Web Intelligence [33]. *Corresponding author: wenbing@ieee.org.)

Abstract. In this paper, we present the design and implementation of a lightweight fault tolerance framework for Web services. With our framework, a Web service can be rendered fault tolerant by replicating it across several nodes. A consensus-based algorithm is used to ensure total ordering of incoming application requests to the replicated Web service, and to ensure a consistent membership view among the replicas. The framework is built by extending an open-source implementation of the WS-ReliableMessaging specification, and all reliable message exchanges in our framework conform to the specification. As such, our framework does not depend on any proprietary messaging and transport protocols, which is consistent with the Web services design principles. Our performance evaluation shows that our implementation is nearly optimal and that the framework incurs only moderate runtime overhead.

Keywords: Fault Tolerance, Web Services, Distributed Consensus, Reliable Messaging, Replication

1. Introduction

Many Web intelligence systems offer their services in the form of Web services, and some of the core services must be made highly available and reliable to accomplish their missions. In fact, the capability of automatically reconfiguring themselves for continuous operation in the presence of component failures should be an essential element of any intelligence system. However, designing a sound fault tolerance solution for Web services is not trivial. It is tempting to perform a relatively straightforward translation of many existing fault tolerance mechanisms from older generations of distributed computing platforms, such as those described in FT-CORBA [23], to Web services. We argue against such an approach for several reasons. As pointed out by many researchers, Web services technology is drastically different from the older generation of distributed computing technologies [25,32]: Web services are designed for Web-based computing over the Internet and adopt a message-based approach for maximum interoperability, while the older technologies were not designed for the Internet and focused primarily on Application Programming Interface (API) based interactions. Furthermore, Web services advocate flexibility, composability, and technology independence. Hence, a fault tolerance solution for Web services must take an approach that is consistent with the design principles of Web services. Secondly, the FT-CORBA standard [23], one of the major outcomes of fault tolerance research for CORBA, contains a great number of APIs for replication and fault management, and many sophisticated mechanisms, which have been considered too heavyweight even for CORBA applications, let alone for Web services. These observations prompted us to design a novel, lightweight fault tolerance framework for Web services.
The framework has the following features:

- It does not rely on any proprietary communication protocol for the interactions between the clients and server replicas, or among the server replicas. All the messaging required for replication is defined in the Web Services Description Language (WSDL), and carried out on top of the standard Web services transport protocol, i.e., SOAP. (SOAP once stood for Simple Object Access Protocol. Since SOAP version 1.2, the acronym has been dropped because SOAP has evolved far beyond the initial objective of enabling simple object invocations via the HyperText Transport Protocol (HTTP) on the Internet.) This decision leads us to adopt a consensus-based algorithm [20], rather than a group communication system, to perform state-machine based replication. The algorithm ensures the total ordering of all incoming application requests to the replicated Web service, and a consistent membership view of the replicas (which is crucial to avoid the split-brain syndrome [6]).

- The framework is backward compatible with the WS-ReliableMessaging [4] specification, which ensures reliable point-to-point communication for Web services. A Web service using our framework can be protected against failures by replication when needed; otherwise, it runs as a WS-ReliableMessaging implementation. The switch between the replication and non-replication modes can happen dynamically at runtime. Unlike other fault tolerance frameworks, our framework does not incur any extra overhead when running in the non-replication mode (i.e., with a single replica).

- Our framework is lightweight in that it does not impose sophisticated replication and fault management requirements, as FT-CORBA does. Configuration is carried out through a simple property file, and fault detection is incorporated into the replication mechanisms.

- The framework requires minimal changes to the Web services and their clients. On the service side, only two additional operations are introduced, to retrieve and to restore the service state (a minimal sketch of such a state interface appears at the end of this section). On the client side, the application must specify our module as one of the options to the SOAP engine. All other changes happen in the configuration files used by the SOAP engine.

We have implemented our framework using Apache Axis2 [2] (the latest generation of the open-source SOAP engine) and Sandesha2 [3] (an open-source implementation of the WS-ReliableMessaging specification on top of Axis2). The consensus-based replication algorithm is adapted from the BFT algorithm [9]; it is essentially an implementation of the Paxos algorithm [20]. The performance of the framework has been carefully characterized and optimized. The runtime overhead is quite moderate considering our all-Web-services-technology approach.
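The two service-side operations for retrieving and restoring the service state could take a form similar to the following minimal Java sketch. The interface name, the method signatures, and the byte-array encoding of the state are illustrative assumptions on our part, not the framework's actual API.

// Hypothetical sketch of the two additional service-side operations.
// The names and the byte[] encoding of the state are assumptions only.
public interface ReplicatedServiceState {
    // Invoked by the replication framework to capture the service state,
    // e.g., when taking a checkpoint or serving a state transfer.
    byte[] getState();

    // Invoked by the replication framework to restore a previously captured
    // state, e.g., on a recovering or newly joined replica.
    void setState(byte[] checkpoint);
}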
2. Related Work

A considerable number of high availability solutions for Web services have been proposed in recent years. Two of them, namely WS-Replication [28] and Thema [26], are most closely related to this work because they both ensure strong replica consistency for Web services. Similar to our work, WS-Replication achieves consistent replication of Web services by totally ordering all incoming requests to the replicated Web service. Even though the interfaces to the client application and the replicated Web services conform to the Web services standards, the actual transport is carried out using JGroup [18], which is a proprietary group communication system. JGroup does offer a SOAP transport; however, the performance is poor when that transport is used. Consequently, proprietary message serialization is used to achieve decent performance. Unfortunately, such a move violates the Web services design principles, which insist on the use of standard Internet-based transport protocols. The use of a proprietary group communication system is also problematic, because the clients and all replicas become strongly coupled to a single technology, which poses interoperability problems. From the implementation perspective, WS-Replication uses separate proxy and dispatcher processes, respectively, to capture and multicast clients' requests, and to receive multicast messages from JGroup and forward the requests to the replicated Web services, which is inefficient. Our framework avoids the above problems by using standard Web services transport and messaging protocols for all interactions between clients and the Web services, and among the replicas. Furthermore, in our framework, clients communicate directly with the replicated Web services.

Thema [26] reported a Byzantine fault tolerant [19] framework for Web services. Even though it is also constructed on a consensus-based replication algorithm like ours, an adaptor is used to interface with an existing implementation of the algorithm [9], which is based on UDP multicast rather than the standard SOAP/HTTP transport; as such, it suffers from the same problem as WS-Replication [28]. It does, however, use a much weaker fault model [19].

Other work [5,10,13,14,15,16,22,27] either uses a different approach, such as checkpointing and replay, or is still at the conceptual stage. Some of this work ignored the consistency issues that arise when performing replication and failure detection over the Internet, which may be problematic because the Internet is largely an asynchronous system. In the following, we briefly summarize each piece of related work we are aware of.

Birman et al. [5] outlined a high availability architecture for Web services. A few core fault tolerance mechanisms were introduced, such as fault monitoring and TCP endpoint sharing. However, no working prototype was reported, and many of the mechanisms described are not specific to Web services.

Chan et al. [10] reported an analysis and experimental results on the evaluation of different fault tolerance approaches for Web services. However, the methods described were generic and did not consider the unique features of Web services. Also, no description was provided of the fault tolerance methods experimented with.

Dialani et al. [13] proposed a high availability architecture for Web services based on checkpointing and replay. A number of mechanisms were introduced to ensure that the system can roll back to a consistent state among several inter-related processes. It is a very different approach from ours, which relies on active replication to achieve fault tolerance.

Dobson [14] proposed to use WS-BPEL as an implementation technique to build fault tolerant Web services. The idea is to use WS-BPEL to provide a single interface for a group of similar Web services so that when one Web service fails, the request can be rerouted to an equivalent Web service. However, such an approach relies on a reliable failure detector, which is not attainable in asynchronous systems such as the Internet.

Erradi and Maheshwari [15] proposed a broker-based architecture for fault tolerant Web services interactions. The focus is on building a message bus that can mediate the interactions reliably, rather than a replication framework. As such, their work is complementary to ours.

Fang et al. [16] and Santos et al. [29] reported a similar fault tolerance architecture for Web services and its implementations. Their architecture is apparently based on the FT-CORBA specification. The focus is on replication and fault management, rather than on ensuring replica and membership consistency.

Looker et al. [22] described a framework that relies on the n-version model and a voting mechanism to ensure fault tolerance of a Web service. There was no description of how to ensure the total ordering of requests or replica membership consistency.

Moser et al. [27] provided a general discussion of fault tolerance techniques that could be used to build fault tolerant Web services. No concrete system was built.

3. System Models

We consider a Web service and its clients interacting over the Internet. When considering the safety of our replication algorithm, we use the asynchronous distributed system model. However, to ensure liveness, certain synchrony must be assumed. Similar to [9], we assume that the message transmission and processing delay has an asymptotic upper bound.
This bound is probed dynamically in our algorithm: each time a view change occurs, the timeout for the new view is doubled.

We assume a crash fault model, i.e., a Web service replica might fail due to hardware or software faults, but once it fails, it stops emitting any messages. In particular, neither the clients nor the replicas behave maliciously. We assume that the network may incur transient faults, but that they can be promptly repaired; i.e., we assume that network partitions do not occur. The Web service is replicated using a state-machine based approach, and hence we assume that the Web service operates deterministically. We are aware that most practical Web services contain some degree of nondeterminism. How to fully cope with such nondeterminism in a systematic manner is beyond the scope of this paper, but we do elaborate on how we address some of the replica nondeterminism we have encountered in Sandesha2, on which this framework is built.

We assume that 2f + 1 replicas are available, among which at most f can be faulty. Similar to [9], each replica is assigned a unique id i, where i varies from 0 to 2f. For view v, the replica whose id i satisfies i = v mod (2f + 1) serves as the primary. Views start from 0. For each view change, the view number is incremented by one and a new primary is selected.
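The primary-selection rule and the view-change timeout policy just described can be captured in a few lines of Java. The following is a simplified illustration under the stated model (2f + 1 replicas with ids 0 to 2f); the class and field names are ours and not part of the framework.

// Illustrative bookkeeping for views and primaries. Only the two rules taken
// from the text (primary id = view mod (2f + 1); timeout doubled on each
// view change) are authoritative; everything else is an assumption.
public class ViewState {
    private final int f;               // maximum number of tolerated crash faults
    private long view = 0;             // views start from 0
    private long viewChangeTimeoutMs;  // timeout used to detect a stalled primary

    public ViewState(int f, long initialTimeoutMs) {
        this.f = f;
        this.viewChangeTimeoutMs = initialTimeoutMs;
    }

    // Replica id of the primary for the current view: i = v mod (2f + 1).
    public int primaryId() {
        return (int) (view % (2L * f + 1));
    }

    public boolean isPrimary(int myReplicaId) {
        return primaryId() == myReplicaId;
    }

    // A view change increments the view number and doubles the timeout,
    // probing the (unknown) bound on transmission and processing delay.
    public void advanceView() {
        view++;
        viewChangeTimeoutMs *= 2;
    }

    public long currentView()      { return view; }
    public long currentTimeoutMs() { return viewChangeTimeoutMs; }
}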

4. Replication Algorithm

In this section, we present our replication algorithm. We first provide a short summary of the original Paxos algorithm [20]. We then show how to adapt the Paxos algorithm for replication. We optimize the performance of the replication algorithm by separating it into a sub-algorithm for normal operation and a sub-algorithm for view changes. We also provide a sketch of the proof of correctness of our replication algorithm. Our replication algorithm ensures the following safety and liveness conditions:

Safety: If an application request r is delivered at a replica in some total order, then no other replica delivers r in a different order.

Liveness: An application request r will eventually be delivered at the replicas according to some total order as long as the system is sufficiently synchronous.

Note that the safety condition guarantees that even if a replica fails right after the delivery of a request, the request will be delivered in the same total order at the other replicas.

4.1. The Paxos Algorithm

Before describing our replication algorithm, it is instructive to summarize the Paxos algorithm [20] and its application to replication. In the original Paxos algorithm, three kinds of agents are used: proposers, acceptors, and learners. The proposers are those who propose values. To differentiate different proposals, each proposal must carry a unique, monotonically increasing proposal number v. The acceptors are those who accept (or reject) the proposals. If the majority of the acceptors have accepted a proposal with a value d, then the value d is said to have been chosen (by the group of acceptors). The learners are those who must find out whether a value has been chosen.

The Paxos algorithm operates in two phases. In phase one, a proposer sends a prepare request with a proposal number v to the acceptors. In response to the prepare request, an acceptor sends the proposer (1) a promise that it will not accept any more proposals numbered less than v, and (2) the highest-numbered proposal, if any, that it has accepted, provided that it has not responded to a higher-numbered proposal. During phase two, the proposer sends an accept request to the acceptors with the proposal number v and a value d, provided that it has collected responses to its prepare request from the majority of the acceptors. The value d is determined to be the value in the highest-numbered proposal among the responses (to the prepare request), or any value selected by the proposer if no acceptor has accepted any proposal previously. An acceptor accepts the accept request with v and d provided that it has not responded to a prepare request with a higher proposal number. After a value has been chosen, there must be a way for the learners to find this out. A simple way to achieve this is for the proposer to disseminate the chosen value to the learners.

[Fig. 1. The Paxos algorithm in the context of replication: a client REQUEST triggers a prepare phase (PREPARE/PREPARE_ACK), an accept phase (ACCEPT/ACCEPT_ACK), a COMMIT, execution, and the REPLY.]

To apply the Paxos algorithm to the replication problem, we assume that each replica may act as all three agents, and the value to be chosen is the total ordering of each application request (in later text, we say the ordering for an application request is committed when the majority of the replicas have agreed on the ordering).
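The two acceptor-side rules above (promise not to accept lower-numbered proposals after answering a prepare request, and report the highest-numbered proposal accepted so far) can be sketched as a single-instance acceptor in Java. This is a textbook illustration only, with names of our choosing; it is not the framework's code.

// Minimal single-instance Paxos acceptor illustrating the two rules in the
// text. No networking or persistence is shown; all names are ours.
public class PaxosAcceptor {

    public record PrepareReply(boolean promised, long acceptedNumber, Object acceptedValue) {}

    private long promisedNumber = -1;    // highest prepare number responded to
    private long acceptedNumber = -1;    // number of the accepted proposal, if any
    private Object acceptedValue = null; // value of the accepted proposal, if any

    // Phase one: handle a prepare request with proposal number n. If the
    // promise is given, the reply also carries the highest-numbered proposal
    // accepted so far (or none).
    public synchronized PrepareReply onPrepare(long n) {
        if (n > promisedNumber) {
            promisedNumber = n;   // promise: no proposals numbered below n will be accepted
            return new PrepareReply(true, acceptedNumber, acceptedValue);
        }
        return new PrepareReply(false, acceptedNumber, acceptedValue);
    }

    // Phase two: accept the proposal (n, value) unless a higher-numbered
    // prepare request has been answered in the meantime.
    public synchronized boolean onAccept(long n, Object value) {
        if (n >= promisedNumber) {
            promisedNumber = n;
            acceptedNumber = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }
}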
During normal operation, a single replica, i.e., the primary, acts as the leader, and only the leader proposes the ordering for each application request. However, it may occur that two or more replicas believe that they are the leader. The Paxos algorithm ensures the safety condition even if this happens. To guarantee liveness, we do need a unique leader to exist among the (majority of) replicas for a sufficiently long period, so that the total ordering for a request can be established. An illustration of the Paxos algorithm in the context of replication is shown in Figure 1.

The essence of the prepare phase in the Paxos algorithm is to ensure that the history is propagated from one proposer to another, so that if a proposal v with value d has been chosen by the acceptors, all future proposers that propose with a higher proposal number select the same value d. The accept phase ensures agreement on the chosen value among the acceptors: a value d is not chosen unless the majority of the acceptors have accepted d. As pointed out in [31], when there is a unique leader, the prepare phase is not needed to reach a consensus.

[Fig. 2. Normal operation of the replication algorithm: client REQUEST, accept phase (ACCEPT/ACCEPT_ACK), execution, COMMIT, and REPLY.]

In the context of replication, the condition for omitting the prepare phase can be relaxed further: as long as the majority of the replicas agree on the same leader, the total ordering of application requests can be established without running the prepare phase. This observation prompted us to decompose the Paxos algorithm into two sub-algorithms, one for normal operation while the majority of the replicas agree on the same leader, and one for the abnormal situation when the leader is suspected by other replicas, which usually leads to the election of a new leader. The change of leadership is referred to as a view change in this paper, and the proposal number in the original Paxos algorithm is referred to as the view number in our replication algorithm. Note that this decomposition is possible because we assume that the initial membership of the replicas, including the leader selection, is established a priori. The benefit of this decomposition is obvious: the normal operation overhead of the replication algorithm is significantly reduced compared with that of the original Paxos algorithm, because the prepare phase is moved out of the critical execution path.

4.2. Normal Operation

The normal operation of the algorithm is shown in Figure 2. When the client issues a request to a replicated Web service, the request is multicast to all replicas. The request has the form <REQUEST, s, m, o>, where s is a unique sequence id, m is the message number within the sequence s, and o is the operation to be invoked on the Web service, together with the necessary parameters. On receiving a client's request, a replica checks whether it is a duplicate. If the request is a duplicate, the primary retrieves the corresponding response from its log (if one can be found) and sends it to the client; the duplicate request is then dropped. The backups simply drop the duplicate without resending the response, for efficiency reasons. Note that the message format described here captures only the essential information needed for total ordering; the actual message is an XML document encoded according to the SOAP standard.

The concept of a sequence is introduced in WS-ReliableMessaging [4]. When the client sends its first request to a Web service via WS-ReliableMessaging, a unique sequence is established between the client and the Web service. Every reliable message sent over the sequence is assigned a message number, which starts at 1 and increases by 1 for each subsequent message sent. A sequence forms a unidirectional reliable channel between two communicating endpoints; therefore, another sequence is established for the Web service to send the replies back to the client. The mechanisms for establishing and terminating a sequence are elaborated in the WS-ReliableMessaging specification [4], and hence they are not repeated here.

When a replica accepts a client's request (to distinguish them from the control messages used to establish total ordering, clients' requests are referred to as application requests from now on), and it is the next expected message in its sequence, the replica starts a view change timer. The timeout is set to allow the consensus on the ordering of the message to be reached.
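The duplicate handling just described (the primary re-sends the logged reply, while backups silently drop the duplicate) could be organized along the following lines. The class, method names, and keying scheme are illustrative assumptions, not the framework's actual code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of per-replica handling of an incoming <REQUEST, s, m, o> message,
// following the duplicate-handling rules in the text. Names are hypothetical.
public class RequestDispatcher {

    public enum Disposition { ORDER, RESEND_REPLY, DROP_DUPLICATE }

    // Replies already produced, keyed by "sequenceId:messageNumber".
    private final Map<String, byte[]> replyLog = new ConcurrentHashMap<>();
    private final boolean isPrimary;

    public RequestDispatcher(boolean isPrimary) { this.isPrimary = isPrimary; }

    public Disposition classify(String sequenceId, long messageNumber) {
        if (!replyLog.containsKey(key(sequenceId, messageNumber))) {
            // New request: hand it to the replication engine for total ordering
            // (and start the view change timer for it, as described in the text).
            return Disposition.ORDER;
        }
        // Duplicate: the primary re-sends the logged reply; backups drop it.
        return isPrimary ? Disposition.RESEND_REPLY : Disposition.DROP_DUPLICATE;
    }

    public byte[] loggedReply(String sequenceId, long messageNumber) {
        return replyLog.get(key(sequenceId, messageNumber));
    }

    public void logReply(String sequenceId, long messageNumber, byte[] reply) {
        replyLog.put(key(sequenceId, messageNumber), reply);
    }

    private static String key(String s, long m) { return s + ":" + m; }
}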
When the primary p (replica 0 in the figure) is ready to order this message, it assigns the message a monotonically increasing sequence number n (not to be confused with the sequence concept in WS-ReliableMessaging) together with its current view number v, and multicasts an accept request to all replicas (the one addressed to itself is not actually sent over the network; it is stored in the local data structure). The accept message has the form <ACCEPT, v, n, s, m>, where v is the current view number and n is the global sequence number assigned by the primary to the application request identified by s and m.

A backup accepts an accept message provided that it is in view v and has not accepted an accept request with the same or a higher global sequence number in view v. If it receives an accept message for a newer view, a replica contacts the primary of that view for any missing state and messages. Accept messages that belong to an older view are discarded. Note that a backup might receive an accept request ahead of the application request being ordered. As long as the sequence between the backup and the client is open, the backup will eventually receive the request. If the sequence has been terminated due to a premature timeout at the backup, the backup re-establishes the sequence and asks the primary for a retransmission of the message.
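A backup's decision on an incoming accept message therefore reduces to a small check over its current view and the highest global sequence number it has accepted so far. The following is a hedged sketch of that check; the class, method, and outcome names are ours.

// Sketch of the backup-side check for an <ACCEPT, v, n, s, m> message,
// following the acceptance conditions in the text. All names are hypothetical.
public class AcceptHandler {

    public enum Outcome { ACCEPT, DISCARD, FETCH_STATE_FROM_NEWER_PRIMARY }

    private long currentView;
    private long highestAcceptedSeq = -1;   // highest n accepted in currentView

    public AcceptHandler(long initialView) { this.currentView = initialView; }

    public synchronized Outcome onAccept(long v, long n) {
        if (v < currentView) {
            return Outcome.DISCARD;          // accept message from an older view
        }
        if (v > currentView) {
            // The sender is already in a newer view: contact that view's primary
            // for the missing state and messages before handling further accepts.
            return Outcome.FETCH_STATE_FROM_NEWER_PRIMARY;
        }
        if (n <= highestAcceptedSeq) {
            return Outcome.DISCARD;          // same or higher n already accepted in this view
        }
        highestAcceptedSeq = n;              // record the accepted ordering
        return Outcome.ACCEPT;               // the caller then sends <ACCEPT_ACK, v, n>
    }
}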

The absence of the application request being ordered does not prevent a backup from accepting the accept request. When a backup accepts the accept request, it stores the message in its data structure and sends an accept response to the primary. The accept response has the form <ACCEPT_ACK, v, n>. At this point, we say that the replica has accepted the ordering for the application request with sequence number n in view v.

When the primary receives an accept response, it verifies that the response contains a sequence number and view number matching the accept request it has sent, and logs a valid accept response in its data structure. When the accept responses from different replicas, together with its own accept request, form a quorum, i.e., the total number of such messages is equal to f + 1, the primary knows that the ordering for the application message has been committed by the replicas. The application request can then be delivered, provided that all previous requests have been delivered to the Web service.

Before a backup can deliver and execute the application request, however, it must be sure that a quorum of replicas have agreed on the ordering for the message. This requires the primary to disseminate a commit message <COMMIT, v, n> to all backups when it has collected f accept responses from different backups. The commit message is acknowledged at the transport level (by the WS-ReliableMessaging mechanisms) rather than at the algorithm level. On receiving the commit message, a backup knows that the ordering for the application request with sequence number n is committed, and it is ready to deliver the application request being ordered if it has delivered all previous requests to the Web service. When the primary finishes executing the application request, it logs the corresponding reply and sends it to the client. For performance reasons, a backup only logs the reply and does not actually send it over the network, unless the replica becomes the new primary after a view change. The logged replies are garbage collected when the clients acknowledge them.

4.3. Garbage Collection and Checkpointing

A replica must keep the application requests and their ordering information in its log until all non-faulty replicas have delivered them. To avoid holding on to these messages forever, each replica periodically takes a checkpoint of its state according to a deterministic algorithm (say, one checkpoint for every 100 requests executed). After taking a checkpoint, a replica multicasts a checkpoint message to all other replicas. The checkpoint message has the form <CHECKPOINT, n, i>, where n is the sequence number of the last application request executed before the checkpoint was taken, and i is the replica id. If a replica has collected a quorum (i.e., f + 1) of checkpoint messages for n from different replicas (including the message it has sent itself), the checkpoint for n is said to have become stable, and the replica garbage collects all logged messages up to n, together with the associated control messages (accept, commit, etc.). It also deletes all previous checkpoints.
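The stability rule for checkpoints (a quorum of f + 1 matching checkpoint announcements, counting the replica's own) and the garbage collection that follows could be tracked as below. This is an illustrative sketch with hypothetical names; the calls into actual storage are omitted.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of checkpoint bookkeeping: a checkpoint for sequence number n becomes
// stable once f + 1 distinct replicas (including this one) have announced it,
// after which everything up to n can be garbage collected. Names are ours.
public class CheckpointTracker {
    private final int f;
    // For each checkpoint sequence number n, the ids of replicas that reported it.
    private final Map<Long, Set<Integer>> reporters = new HashMap<>();
    private long lastStableCheckpoint = -1;

    public CheckpointTracker(int f) { this.f = f; }

    // Handles a <CHECKPOINT, n, i> message; returns true if n just became stable.
    public synchronized boolean onCheckpoint(long n, int replicaId) {
        if (n <= lastStableCheckpoint) {
            return false;                            // already stable; nothing to do
        }
        Set<Integer> ids = reporters.computeIfAbsent(n, k -> new HashSet<>());
        ids.add(replicaId);
        if (ids.size() >= f + 1) {                   // quorum of checkpoint messages
            lastStableCheckpoint = n;
            garbageCollectUpTo(n);
            return true;
        }
        return false;
    }

    private void garbageCollectUpTo(long n) {
        // Here the real framework would discard logged application requests,
        // accept/commit records, and older checkpoints up to n.
        reporters.keySet().removeIf(seq -> seq <= n);
    }

    public synchronized long lastStable() { return lastStableCheckpoint; }
}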
A backup might lag behind and need an application request that has already been garbage collected by the primary, in which case it asks the primary for a state transfer instead.

4.4. View Change

If a backup i cannot advance to the committed state on expiration of the view change timer, it initiates a view change by sending a view change message to all other replicas, as shown in Figure 3.

[Fig. 3. Sequence diagram showing the steps of the view change algorithm: the primary for view v is suspected, the replicas exchange VIEW_CHANGE messages, and the primary for view v+1 installs the new view with a NEW_VIEW message.]

The view change message has the form <VIEW_CHANGE, v + 1, l, P, i>, where l is the sequence number of the last stable checkpoint known to i, and P is a set of accepted records for all application requests whose ordering has been accepted by replica i. Each accepted record is a tuple <view, n, s, m>, where view is v or smaller. To ensure liveness, on receiving a view change message a replica also suspects the primary and multicasts its own view change message, provided that the received view change message is for a future view. Once a replica suspects the primary, it stops participating in the message ordering process and accepts only checkpointing and view change related messages, until a new view is installed.
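The content of a view change message is thus essentially the replica's last stable checkpoint plus its accepted records. A minimal sketch of how such a message might be assembled is shown below; the record types and names are illustrative assumptions.

import java.util.Collection;
import java.util.List;

// Illustrative construction of a <VIEW_CHANGE, v+1, l, P, i> message from a
// replica's local state when its view change timer expires. Names are ours.
public class ViewChangeBuilder {

    // One accepted record <view, n, s, m>, as described in the text.
    public record AcceptedRecord(long view, long seq, String sequenceId, long messageNumber) {}

    // The view change message itself: <VIEW_CHANGE, newView, lastStableCheckpoint, P, replicaId>.
    public record ViewChangeMessage(long newView, long lastStableCheckpoint,
                                    List<AcceptedRecord> accepted, int replicaId) {}

    public static ViewChangeMessage build(long currentView,
                                          long lastStableCheckpoint,
                                          Collection<AcceptedRecord> acceptedRecords,
                                          int replicaId) {
        // Copy the accepted records into the message so that the new primary can
        // re-propose their orderings in the new view.
        return new ViewChangeMessage(currentView + 1, lastStableCheckpoint,
                                     List.copyOf(acceptedRecords), replicaId);
    }
}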

When the primary of view v + 1 has collected f + 1 view change messages for view v + 1, including the one it would have sent itself, it installs the new view and notifies the backups with a new view message <NEW_VIEW, v + 1, O>, where O is a set of accept messages. The accept messages included in O are determined in the following way. If the new primary received an accepted record <view, n, s, m> in a view change message (including the one it would have sent), it constructs a new accept message <ACCEPT, v, n, s, m>. There might be a gap between the sequence number of the last known checkpoint and the smallest sequence number of an accepted record, or a gap between two accepted records, in which case an accept message is created for a null application request, i.e., <ACCEPT, v, n, s_null, m_null>. The execution of a null application request is a no-op (i.e., there is no actual execution for the null message).

When a backup receives the new view message, it accepts the message if it has not installed a newer view. If the replica accepts the new view message, it installs the new view and processes the accept requests included in the new view message as usual. If an application request has already been executed in an older view, it is not re-executed in the new view. The view change algorithm ensures that if the ordering of a request has been committed at any replica in a view, the associated accept record is propagated to the new view.

4.5. Proof of Correctness

We now sketch the proof of the safety and liveness properties of our replication algorithm.

Safety: If an application request r is delivered at a replica in some total order, then no other replica delivers r in a different order.

Proof. The theorem follows from the two lemmas below.

Lemma 1: All replicas that commit an application request r in the same view v agree on the same sequence number n.

Proof. We prove by contradiction. Assume that two replicas i and j committed the same application request r with two different sequence numbers m and n, respectively. For replica i to commit the request with m, it must have received a commit request from the primary (or have sent a commit request if it is the primary itself). This means that a quorum R1 of replicas have accepted the assignment of sequence number m to r. Similarly, because j committed r with a different sequence number n, a quorum R2 of replicas must have accepted the assignment of sequence number n to r. By the definition of a quorum, R1 and R2 must intersect in at least one non-faulty replica, which implies that this replica has accepted two different sequence numbers for the same request r. This contradicts our algorithm, because a non-faulty replica accepts only one sequence number for each request in a single view. Therefore, Lemma 1 holds.

Lemma 2: Replicas that commit an application request r in different views agree on the same sequence number.

Proof. We prove by contradiction. Assume that replica i committed r with sequence number m in view v, and replica j committed r with a different sequence number n in view u. Without loss of generality, assume u > v. Since i committed r with m in view v, a quorum R3 of replicas have accepted the sequence number assignment for r. To install the new view u, the new primary must have collected view change messages from a quorum R4 of replicas. R3 and R4 must intersect in at least one non-faulty replica.
Since this replica has accepted the sequence number assignment for r, it would have included the accepted tuple in its view change message sent to the new primary, and the new primary must have constructed an accept message using the sequence number m for r. If j committed r in view u, it must have accepted the accept message for the binding of m and r, which contradicts our assumption. Therefore, Lemma 2 holds.

Liveness: An application request r will eventually be delivered at the replicas according to some total order as long as the system is sufficiently synchronous.

Proof. The replication algorithm ensures liveness only during periods of synchrony. To prove liveness, we first show that if the primary is not faulty during the period of synchrony, the request will be ordered and delivered at the correct replicas, and the client (if it does not fail) will receive the corresponding reply. We then show that if the request does not complete at all replicas during the current view, then a view change occurs.

If the primary is not faulty, then during the period of synchrony it will order the request, multicast an accept message to all backups, and collect at least f accept responses from the backups (which, together with its own accept, form a quorum of f + 1). The primary then commits the message and delivers it once all previously ordered messages have been delivered. After the primary has processed the request, the reply is sent to the client.

If the request cannot complete at all replicas in the current view, each correct replica multicasts a view change message on expiration of its view change timer. Since there are at most f faulty replicas, at least f + 1 replicas will perform this action on expiration of their view change timers. The view change messages sent by these f + 1 replicas lead to the installation of a new view according to our view change algorithm. Liveness might be hampered if a replica mistakenly suspects the primary, because once a replica suspects the primary, it stops participating in the ordering of application requests. To address this issue, a replica also multicasts a view change message upon receiving one from another replica, even if it has not itself suspected the primary. This mechanism guarantees that as long as a non-faulty replica suspects the primary (even if by mistake), a new view will eventually be installed.

5. Membership Management

The replication algorithm described in the previous section assumes a static membership, i.e., the server replicas are predetermined and the composition of the replica group does not change over time. This is clearly very restrictive, because some replicas might fail and need to be repaired and restarted, and the degree of replication might have to be adjusted due to changing quality of service requirements. In this section, we introduce a set of mechanisms that can be used to perform membership management. We do not intend to support arbitrarily dynamic membership formation. Besides the initial static membership configuration, we assume that changes of the replication degree are carried out in the following planned manner:

- For each expansion, two new replicas are added so that the number of tolerated faults increases by one.
- For each reduction, two existing replicas are removed so that the number of tolerated faults decreases by one.

[Fig. 4. Steps of expanding the membership size: each new replica's JOIN request is ordered via ACCEPT/ACCEPT_ACK/COMMIT; the first new replica is marked, and the switch from f=1 to f=2 takes place when the second new replica joins.]

For an expansion, the new replicas are informed of the endpoints of the existing replicas so that they can initiate the join process. For a reduction, the replicas to be removed are informed and they initiate the leave requests.

5.1. Rejoin

A long-running replicated service requires a mechanism that allows a previously failed replica to rejoin the existing replicas once it has been repaired. Our consensus-based replication algorithm enables the temporary suspension of a replica for repair and its subsequent rejoin without any change of membership formation. If the failed replica was the primary of its view, a view change takes place, but the rejoin of the replica does not cause any view change, because another replica has taken over the primary role in the new view since the failure. To rejoin, the recovering replica simply multicasts a state transfer request, and the primary of the current view distributes the latest checkpoint, and all requests received since that checkpoint, to the recovering replica.
The recovering replica can participate in the normal operation of the replication algorithm as soon as it is started.
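On the primary, answering a state transfer request from a recovering (or newly joined) replica amounts to packaging the latest stable checkpoint together with the requests ordered since it. A possible sketch, with hypothetical types and names, is given below.

import java.util.List;

// Sketch of the primary's response to a state transfer request, per the rejoin
// mechanism in the text. All types and names here are illustrative assumptions.
public class StateTransferService {

    public record OrderedRequest(long seq, byte[] request) {}
    public record StateTransferReply(long checkpointSeq, byte[] checkpoint,
                                     List<OrderedRequest> requestsSinceCheckpoint) {}

    // Minimal storage abstractions so the sketch is self-contained.
    public interface CheckpointStore { long latestStableSeq(); byte[] latest(); }
    public interface RequestLog { List<OrderedRequest> after(long seq); }

    private final CheckpointStore checkpoints;
    private final RequestLog requestLog;

    public StateTransferService(CheckpointStore checkpoints, RequestLog requestLog) {
        this.checkpoints = checkpoints;
        this.requestLog = requestLog;
    }

    public StateTransferReply onStateTransferRequest() {
        long n = checkpoints.latestStableSeq();
        byte[] snapshot = checkpoints.latest();
        // Ship the checkpoint plus every request ordered after it, so that the
        // recovering replica can catch up and rejoin normal operation.
        List<OrderedRequest> tail = requestLog.after(n);
        return new StateTransferReply(n, snapshot, tail);
    }
}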

[Fig. 5. Without totally ordering the join/leave requests with respect to other requests, the safety condition might be violated: replicas may order different requests j and k with the same sequence number n under different membership views.]

5.2. Increasing Replication Degree

When increasing the replication degree, two new replicas are added to the membership, one after another, according to the mechanism described in Figure 4. The new replica multicasts a join request of the form <JOIN, i, t>, where i is the replica id and t is the timestamp of the join request. The join request is totally ordered with respect to all application requests, using the same agreement algorithm. The total ordering is needed so that the switch-over to a higher replication degree can be carried out by all replicas from the same replication state; this is important to ensure the safety and liveness of the replication algorithm. Figure 5 illustrates a scenario in which safety is violated if the total ordering of the membership change is not enforced. Upon joining the membership, the new replica must retrieve the latest checkpoint, and all application requests since that checkpoint, from the existing replicas. To alleviate the burden on the primary, the new replica can ask for this information from any existing replica in the same view.

A tricky issue is when to switch to the new replication degree. It is not a good idea to switch, i.e., to go from 2f + 1 to 2(f + 1) + 1, immediately after the first new replica joins. Switching at this point does not increase the resiliency, because one more replica is needed to raise the fault tolerance from f to f + 1 faulty replicas. Therefore, after the first new replica joins, the new membership is marked, and the new replica can obtain the latest state from the existing replicas, but it is not allowed to participate in the message ordering process. The switch to the new replication degree takes place when the second new replica has joined, as shown in Figure 4.

[Fig. 6. Steps of reducing the replication degree: each LEAVE request is ordered via ACCEPT/ACCEPT_ACK/COMMIT; the first leaving replica is marked, and both replicas leave the group once the second leave request is processed.]

5.3. Decreasing Replication Degree

When decreasing the replication degree, two replicas are removed from the membership, one after another, according to the mechanism shown in Figure 6. The replica to be removed multicasts a leave request of the form <LEAVE, i, t>. Similar to the join request, the leave request is totally ordered with respect to all application requests; the reason for the total ordering is the same as for the join process. Note that if one of the two replicas to be removed is the primary of the current view, the primary initiates a view change prior to the start of the leave process. When the primary itself initiates a view change, the new view is installed immediately.

The first replica to leave is marked but not removed from the membership until the second leave request has been processed. In the meantime, this replica must participate in all ordering tasks, as it is needed to maintain the fault tolerance degree. Consider the following scenario with an initial replication degree of 5 (tolerating up to 2 faulty replicas): the primary fails right after the first leave request is honored.
If the first replica to leave were removed from the membership right away, there would be only three remaining replicas, including the second replica to be removed from the membership.

If this second replica becomes faulty before the leave process finishes, liveness is lost, because the remaining two replicas cannot complete the consensus algorithm (with a fault tolerance degree of f = 2). Note that it is not an option to reduce the fault tolerance degree from f + 1 to f right after the removal of the first replica, because the remaining replicas might then reach different decisions, thereby violating the safety property.

6. Implementation and Performance Evaluation

6.1. Fault Tolerance Framework Architecture

We have implemented our fault tolerance framework by extending Sandesha2. The architecture of the server-side framework is shown in Figure 7.

[Fig. 7. The architecture of the server-side framework: incoming messages flow through the In Handler into the In Msg Queue and are delivered to the Web service by the Total Order Invoker under the control of the Replication Engine; outgoing messages flow through the Out Handler into the Out Msg Queue and are sent by the Sender.]

The In-Order Invoker component in Sandesha2 is replaced by a Total Order Invoker, which is responsible for delivering requests to the replicated Web service in a total order. At each replica, the Total Order Invoker polls the Replication Engine for the next application request to be delivered and then fetches the corresponding application message from the In Msg Queue, which stores all incoming application requests.

The Sender component in Sandesha2 is replaced by a multicast-capable Sender. On the server side, the multicast is used only by the primary, and by the backups for the checkpoint and view change messages. To perform the multicast, multiple threads are launched to concurrently send the same message to different destinations using Axis2. Each thread is responsible for sending the message point-to-point to a distinct destination; no proprietary reliable multicast tool is used. If the destination is temporarily unreachable, the thread retries the sending a few times. When the sending succeeds, or when it still fails after the retries, the thread reports the status to the Sender component for the corresponding destination. The Sender performs further retransmissions to that destination according to the WS-ReliableMessaging mechanisms.

The In Handler in Sandesha2 is augmented to handle the control messages used by the replication algorithm (i.e., accept, accept response, commit, checkpoint, view change, and new view messages). All incoming messages are placed in the In Msg Queue after preliminary processing, before they are delivered to the Web service by the Total Order Invoker. The Out Handler in Sandesha2 is largely left intact. All outgoing messages are placed in the Out Msg Queue before they are sent out by the Sender component. The sending of the replication control messages is carried out using the normal Axis2 interface, which means that such messages are treated as application messages by the Sandesha2 mechanisms, except that some of them are multicast by the Sender component. The Sender knows which messages to multicast by examining the SOAP action property included in each message. If a multicast is needed, the Replication Engine is consulted to obtain the multicast destinations.

The Replication Engine is the core addition to the Sandesha2 framework. This component drives the execution of the replication algorithm. At the primary, when an application request becomes the next message to be delivered in its sequence according to WS-ReliableMessaging, it is assigned a global sequence number and an accept message is sent for it. The request will not be delivered to the Web service until it has been totally ordered and all previously ordered messages have been delivered. The Replication Engine uses its own storage to log the replication control messages and the checkpoints.
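The multicast fan-out performed by the Sender, as described above, sends the same message to every destination on its own thread and retries a few times before deferring to WS-ReliableMessaging-level retransmission. The following simplified sketch uses a plain thread pool and a placeholder transport call; the real component drives Axis2, and all names here are ours.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified sketch of the Sender's multicast: the same message is sent
// point-to-point to each destination concurrently, with a few retries per
// destination. The Transport interface stands in for the Axis2 send path.
public class MulticastSender {
    private static final int RETRIES = 3;

    public interface Transport { boolean send(byte[] message, String destinationUrl); }

    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Transport transport;

    public MulticastSender(Transport transport) { this.transport = transport; }

    public void multicast(byte[] message, List<String> destinationUrls) {
        for (String destination : destinationUrls) {
            pool.submit(() -> sendWithRetry(message, destination));
        }
    }

    private void sendWithRetry(byte[] message, String destination) {
        for (int attempt = 1; attempt <= RETRIES; attempt++) {
            if (transport.send(message, destination)) {
                reportStatus(destination, true);
                return;
            }
        }
        // Give up here; WS-ReliableMessaging-level retransmission takes over.
        reportStatus(destination, false);
    }

    private void reportStatus(String destination, boolean delivered) {
        // In the real framework this feeds the Sender's per-destination
        // retransmission bookkeeping; here it is only a placeholder.
        System.out.println(destination + (delivered ? ": delivered" : ": deferred to WS-RM"));
    }
}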
The client-side architecture is similar, except that the application replies are not totally ordered (they are, however, FIFO ordered within each sequence according to WS-ReliableMessaging). On the client side, the Replication Engine simply keeps the server-side configuration information so that the Sender knows where to multicast the application requests.

6.2. Optimization

In our implementation, we use the following two optimizations. The first optimization reduces the number of communication steps for each invocation from four to three, and therefore reduces the end-to-end latency significantly. The second optimization batches application requests for total ordering, which improves the system throughput.

Latency Optimization. To reduce the end-to-end latency, we employ the tentative execution mechanism introduced in [9].

[Fig. 8. Tentative execution of an invocation: the reply is sent to the client as soon as a replica has tentatively executed the request, while the ACCEPT/ACCEPT_ACK/COMMIT exchange completes in the background.]

As shown in Figure 8, as soon as the primary has ordered an application request and has executed all application requests ordered previously, it tentatively delivers and executes the message, and the reply is sent to the client. Similarly, as soon as a backup has accepted an accept message for an application request, it tentatively delivers and executes the request and sends the reply to the client, provided that it has executed the previous requests. To enable this optimization, the client cannot deliver the first reply it receives right away. Instead, it must wait until it has collected f + 1 matching replies from different replicas (a sketch of this client-side voting appears at the end of this subsection). If the primary fails, the client might not be able to collect f + 1 matching replies, in which case it abandons the incomplete reply set. In case of a primary failure, a backup that has tentatively executed a request might have to be rolled back to its last checkpoint and re-execute the request in a potentially different order, as instructed by the new primary.

Throughput Optimization. Even though in the description of the replication algorithm each application request is assigned its own sequence number, doing so would be very inefficient. Similar to the BFT framework [9], we incorporate a batching mechanism to improve the system throughput. The batching mechanism works as follows. The primary does not immediately order an application request when it becomes FIFO-ordered within its sequence; instead, it postpones doing so if there are already k batches of messages being ordered, where k is a tunable parameter that is usually set to 1. When the primary is ready to order a new batch of messages, it assigns the next sequence number to a group of application requests, at most one per sequence, and the requests ordered must have been FIFO ordered within their own sequences.
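With tentative execution, the client delivers a result only after it has collected f + 1 matching replies from distinct replicas. The following is a rough sketch of such a client-side voter; the names and the digest-based comparison of replies are our own assumptions.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the client-side voter required by tentative execution: a reply is
// delivered only once f + 1 matching replies from distinct replicas have been
// collected. Matching is approximated here by comparing a reply digest.
public class ReplyVoter {
    private final int f;
    // digest of the reply content -> ids of the replicas that returned it
    private final Map<String, Set<Integer>> votes = new HashMap<>();

    public ReplyVoter(int f) { this.f = f; }

    // Returns true when this reply completes a quorum of f + 1 matching replies.
    public synchronized boolean onReply(int replicaId, String replyDigest) {
        Set<Integer> voters = votes.computeIfAbsent(replyDigest, k -> new HashSet<>());
        voters.add(replicaId);
        return voters.size() >= f + 1;
    }

    // Called when the invocation is abandoned, e.g., after a primary failure.
    public synchronized void reset() {
        votes.clear();
    }
}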
6.3. Performance Evaluation

Our performance evaluation is carried out on a testbed consisting of 12 Dell SC440 servers connected by a 100 Mbps Ethernet. Each server is equipped with a single Pentium D 2.8 GHz processor and 1 GB of memory, and runs SuSE 10.2 Linux. In this subsection, we report two types of experimental results. We first report the runtime overhead of our replication algorithm during normal operation. We then present the experimental results for membership management related tasks.

A backup failure has virtually no effect on the operation of the algorithm, and hence we see no noticeable degradation of runtime performance. However, when the primary fails, the client sees a significant delay if it has a request pending to be ordered, due to the timeout value for view changes. The timeout is usually set to 2 seconds in our experiments, which are carried out in a LAN environment; in an Internet environment, the timeout would be set to a higher value. If there are consecutive primary failures, the delay is even longer.

Micro-benchmark for Normal Operation. An echo test application is used to micro-benchmark the runtime overhead. The client sends a request to the replicated Web service and waits for the corresponding reply, in a loop without any think time between two consecutive calls. The request (and the reply) contains an XML document with a varying number of elements, encoded using AXIOM (the AXis Object Model) [1]. At the replicated Web service, the request is parsed and a nearly identical reply XML document is returned to the client. In each run, 1000 samples are obtained. The end-to-end latency of the echo operation is measured at the client. The latency of the application processing (parsing the request and generating the reply) and the throughput are measured at the replicated Web service. In our experiments, we vary the number of replicas, the request size in terms of the number of elements in each request, and the number of concurrent clients.

Figure 9 shows the end-to-end latency and throughput measurement results. In Figure 9(a), the end-to-end latency of the echo operation is reported for replication degrees of 1, 3 and 5. Note that when there is only a single replica, our framework falls back to the Sandesha2 implementation without incurring any additional overhead.

[Fig. 9. End-to-end latency and throughput measurement results. (a) End-to-end latency of the echo operation for replication degrees of 1, 3 and 5. (b)-(d) System throughput with and without replication, for request sizes of 100, 500, and 1000 elements per call, respectively, as a function of the number of concurrent clients.]

As can be seen, the latency incurred by our replication algorithm for three-way replication is about 50 ms, consistent with our expectation that only one additional communication step is incurred in our fault tolerance framework compared with the non-replicated case. The throughput results are shown in Figure 9(b)-(d). As can be seen, for short requests the throughput degradation when replication is enabled is significant, especially when there are many concurrent clients. This is not surprising considering the complexity of the replication algorithm. Even with optimal batching for 6 concurrent clients, the primary must send 2 control messages and receive 4 control messages (2 of them at the transport level) to order the 6 application requests. The approximately 30% reduction in throughput is nearly optimal (note that the control messages are much shorter than the application requests with 100 elements). A 2/3 reduction in throughput is reported in [28] when the standard SOAP protocol and Web services transports are used, which is much less efficient due to their architecture. When the application request complexity is increased, the throughput reduction becomes smaller, as shown in Figure 9(c) and (d).

Membership Management Experimental Results. The latency measurement results for the rejoin of a repaired replica and for the expansion of the replication degree with various state sizes (in terms of the number of data items in the state) are summarized in Figure 10. The inset shows the relationship between the number of data items and the encoded checkpoint size of the state. The latency for the rejoin of a repaired replica is dominated by the cost of the state transfer from the primary to the recovering replica. The expansion of the replication degree takes longer because (1) the join request must be totally ordered with respect to other requests


More information

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography

Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Appears as Technical Memo MIT/LCS/TM-589, MIT Laboratory for Computer Science, June 999 Authenticated Byzantine Fault Tolerance Without Public-Key Cryptography Miguel Castro and Barbara Liskov Laboratory

More information

Byzantine Fault Tolerant Coordination for Web Services Atomic Transactions

Byzantine Fault Tolerant Coordination for Web Services Atomic Transactions Cleveland State University EngagedScholarship@CSU Electrical Engineering & Computer Science Faculty Publications Electrical Engineering & Computer Science Department 2007 Byzantine Fault Tolerant Coordination

More information

Today: Fault Tolerance. Fault Tolerance

Today: Fault Tolerance. Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004)

Cheap Paxos. Leslie Lamport and Mike Massa. Appeared in The International Conference on Dependable Systems and Networks (DSN 2004) Cheap Paxos Leslie Lamport and Mike Massa Appeared in The International Conference on Dependable Systems and Networks (DSN 2004) Cheap Paxos Leslie Lamport and Mike Massa Microsoft Abstract Asynchronous

More information

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast

Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast Parsimonious Asynchronous Byzantine-Fault-Tolerant Atomic Broadcast HariGovind V. Ramasamy Christian Cachin August 19, 2005 Abstract Atomic broadcast is a communication primitive that allows a group of

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

Incompatibility Dimensions and Integration of Atomic Commit Protocols

Incompatibility Dimensions and Integration of Atomic Commit Protocols The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008 381 Incompatibility Dimensions and Integration of Atomic Commit Protocols Yousef Al-Houmaily Department of Computer

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance Fault Tolerance Basic Concepts Being fault tolerant is strongly related to what

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Appears in the Proceedings of the Third Symposium on Operating Systems Design and Implementation, New Orleans, USA, February 1999 Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Laboratory

More information

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985

Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 Network Working Group Request for Comments: 969 David D. Clark Mark L. Lambert Lixia Zhang M. I. T. Laboratory for Computer Science December 1985 1. STATUS OF THIS MEMO This RFC suggests a proposed protocol

More information

Key-value store with eventual consistency without trusting individual nodes

Key-value store with eventual consistency without trusting individual nodes basementdb Key-value store with eventual consistency without trusting individual nodes https://github.com/spferical/basementdb 1. Abstract basementdb is an eventually-consistent key-value store, composed

More information

Asynchronous Reconfiguration for Paxos State Machines

Asynchronous Reconfiguration for Paxos State Machines Asynchronous Reconfiguration for Paxos State Machines Leander Jehl and Hein Meling Department of Electrical Engineering and Computer Science University of Stavanger, Norway Abstract. This paper addresses

More information

Today: Fault Tolerance

Today: Fault Tolerance Today: Fault Tolerance Agreement in presence of faults Two army problem Byzantine generals problem Reliable communication Distributed commit Two phase commit Three phase commit Paxos Failure recovery Checkpointing

More information

Practical Byzantine Fault Tolerance

Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Robert Grimm New York University (Partially based on notes by Eric Brewer and David Mazières) The Three Questions What is the problem? What is new or different? What

More information

Fault Tolerance. Distributed Software Systems. Definitions

Fault Tolerance. Distributed Software Systems. Definitions Fault Tolerance Distributed Software Systems Definitions Availability: probability the system operates correctly at any given moment Reliability: ability to run correctly for a long interval of time Safety:

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Distributed Systems (ICE 601) Fault Tolerance

Distributed Systems (ICE 601) Fault Tolerance Distributed Systems (ICE 601) Fault Tolerance Dongman Lee ICU Introduction Failure Model Fault Tolerance Models state machine primary-backup Class Overview Introduction Dependability availability reliability

More information

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications

Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Adapting Commit Protocols for Large-Scale and Dynamic Distributed Applications Pawel Jurczyk and Li Xiong Emory University, Atlanta GA 30322, USA {pjurczy,lxiong}@emory.edu Abstract. The continued advances

More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Exam 2 Review. October 29, Paul Krzyzanowski 1

Exam 2 Review. October 29, Paul Krzyzanowski 1 Exam 2 Review October 29, 2015 2013 Paul Krzyzanowski 1 Question 1 Why did Dropbox add notification servers to their architecture? To avoid the overhead of clients polling the servers periodically to check

More information

Byzantine Fault Tolerance and Consensus. Adi Seredinschi Distributed Programming Laboratory

Byzantine Fault Tolerance and Consensus. Adi Seredinschi Distributed Programming Laboratory Byzantine Fault Tolerance and Consensus Adi Seredinschi Distributed Programming Laboratory 1 (Original) Problem Correct process General goal: Run a distributed algorithm 2 (Original) Problem Correct process

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

AS distributed systems develop and grow in size,

AS distributed systems develop and grow in size, 1 hbft: Speculative Byzantine Fault Tolerance With Minimum Cost Sisi Duan, Sean Peisert, Senior Member, IEEE, and Karl N. Levitt Abstract We present hbft, a hybrid, Byzantine fault-tolerant, ted state

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Replicated State Machine in Wide-area Networks

Replicated State Machine in Wide-area Networks Replicated State Machine in Wide-area Networks Yanhua Mao CSE223A WI09 1 Building replicated state machine with consensus General approach to replicate stateful deterministic services Provide strong consistency

More information

The UNIVERSITY of EDINBURGH. SCHOOL of INFORMATICS. CS4/MSc. Distributed Systems. Björn Franke. Room 2414

The UNIVERSITY of EDINBURGH. SCHOOL of INFORMATICS. CS4/MSc. Distributed Systems. Björn Franke. Room 2414 The UNIVERSITY of EDINBURGH SCHOOL of INFORMATICS CS4/MSc Distributed Systems Björn Franke bfranke@inf.ed.ac.uk Room 2414 (Lecture 13: Multicast and Group Communication, 16th November 2006) 1 Group Communication

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

ABSTRACT. Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed

ABSTRACT. Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed ABSTRACT Web Service Atomic Transaction (WS-AT) is a standard used to implement distributed processing over the internet. Trustworthy coordination of transactions is essential to ensure proper running

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Proactive Recovery in a Byzantine-Fault-Tolerant System

Proactive Recovery in a Byzantine-Fault-Tolerant System Proactive Recovery in a Byzantine-Fault-Tolerant System Miguel Castro and Barbara Liskov Laboratory for Computer Science, Massachusetts Institute of Technology, 545 Technology Square, Cambridge, MA 02139

More information

Evaluating BFT Protocols for Spire

Evaluating BFT Protocols for Spire Evaluating BFT Protocols for Spire Henry Schuh & Sam Beckley 600.667 Advanced Distributed Systems & Networks SCADA & Spire Overview High-Performance, Scalable Spire Trusted Platform Module Known Network

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

Paxos Made Simple. Leslie Lamport, 2001

Paxos Made Simple. Leslie Lamport, 2001 Paxos Made Simple Leslie Lamport, 2001 The Problem Reaching consensus on a proposed value, among a collection of processes Safety requirements: Only a value that has been proposed may be chosen Only a

More information

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Practical Byzantine Consensus CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Scenario Asynchronous system Signed messages s are state machines It has to be practical CS 138

More information

Practical Byzantine Fault Tolerance Using Fewer than 3f+1 Active Replicas

Practical Byzantine Fault Tolerance Using Fewer than 3f+1 Active Replicas Proceedings of the 17th International Conference on Parallel and Distributed Computing Systems San Francisco, California, pp 241-247, September 24 Practical Byzantine Fault Tolerance Using Fewer than 3f+1

More information

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports Server failures are the common case in data centers

More information

Lecture XII: Replication

Lecture XII: Replication Lecture XII: Replication CMPT 401 Summer 2007 Dr. Alexandra Fedorova Replication 2 Why Replicate? (I) Fault-tolerance / High availability As long as one replica is up, the service is available Assume each

More information

Coordination and Agreement

Coordination and Agreement Coordination and Agreement Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Introduction 2. Distributed Mutual Exclusion 3. Elections 4. Multicast Communication 5. Consensus and related problems

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Assignment 12: Commit Protocols and Replication Solution

Assignment 12: Commit Protocols and Replication Solution Data Modelling and Databases Exercise dates: May 24 / May 25, 2018 Ce Zhang, Gustavo Alonso Last update: June 04, 2018 Spring Semester 2018 Head TA: Ingo Müller Assignment 12: Commit Protocols and Replication

More information

Failures, Elections, and Raft

Failures, Elections, and Raft Failures, Elections, and Raft CS 8 XI Copyright 06 Thomas W. Doeppner, Rodrigo Fonseca. All rights reserved. Distributed Banking SFO add interest based on current balance PVD deposit $000 CS 8 XI Copyright

More information

Introduction to Distributed Systems Seif Haridi

Introduction to Distributed Systems Seif Haridi Introduction to Distributed Systems Seif Haridi haridi@kth.se What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send

More information

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015

Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Agreement in Distributed Systems CS 188 Distributed Systems February 19, 2015 Page 1 Introduction We frequently want to get a set of nodes in a distributed system to agree Commitment protocols and mutual

More information

Paxos and Replication. Dan Ports, CSEP 552

Paxos and Replication. Dan Ports, CSEP 552 Paxos and Replication Dan Ports, CSEP 552 Today: achieving consensus with Paxos and how to use this to build a replicated system Last week Scaling a web service using front-end caching but what about the

More information

Byzantine Fault Tolerant Coordination for Web Services Business Activities

Byzantine Fault Tolerant Coordination for Web Services Business Activities Byzantine Fault Tolerant Coordination for Web s Business Activities Wenbing Zhao and Honglei Zhang Department of Electrical and Computer Engineering Cleveland State University, 2121 Euclid Ave, Cleveland,

More information

A Reliable Broadcast System

A Reliable Broadcast System A Reliable Broadcast System Yuchen Dai, Xiayi Huang, Diansan Zhou Department of Computer Sciences and Engineering Santa Clara University December 10 2013 Table of Contents 2 Introduction......3 2.1 Objective...3

More information

COMMUNICATION IN DISTRIBUTED SYSTEMS

COMMUNICATION IN DISTRIBUTED SYSTEMS Distributed Systems Fö 3-1 Distributed Systems Fö 3-2 COMMUNICATION IN DISTRIBUTED SYSTEMS Communication Models and their Layered Implementation 1. Communication System: Layered Implementation 2. Network

More information

Practical Byzantine Fault Tolerance and Proactive Recovery

Practical Byzantine Fault Tolerance and Proactive Recovery Practical Byzantine Fault Tolerance and Proactive Recovery MIGUEL CASTRO Microsoft Research and BARBARA LISKOV MIT Laboratory for Computer Science Our growing reliance on online services accessible on

More information

CS603: Distributed Systems

CS603: Distributed Systems CS603: Distributed Systems Lecture 2: Client-Server Architecture, RPC, Corba Cristina Nita-Rotaru Lecture 2/ Spring 2006 1 ATC Architecture NETWORK INFRASTRUCTURE DATABASE HOW WOULD YOU START BUILDING

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen

Beyond FLP. Acknowledgement for presentation material. Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen Beyond FLP Acknowledgement for presentation material Chapter 8: Distributed Systems Principles and Paradigms: Tanenbaum and Van Steen Paper trail blog: http://the-paper-trail.org/blog/consensus-protocols-paxos/

More information

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms

Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Verteilte Systeme/Distributed Systems Ch. 5: Various distributed algorithms Holger Karl Computer Networks Group Universität Paderborn Goal of this chapter Apart from issues in distributed time and resulting

More information

416 practice questions (PQs)

416 practice questions (PQs) 416 practice questions (PQs) 1. Goal: give you some material to study for the final exam and to help you to more actively engage with the material we cover in class. 2. Format: questions that are in scope

More information

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1

AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1 AN ADAPTIVE ALGORITHM FOR TOLERATING VALUE FAULTS AND CRASH FAILURES 1 Jennifer Ren, Michel Cukier, and William H. Sanders Center for Reliable and High-Performance Computing Coordinated Science Laboratory

More information

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Guanfeng Liang and Nitin Vaidya Department of Electrical and Computer Engineering, and Coordinated Science Laboratory University

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

A Reservation-Based Extended Transaction Protocol for Coordination of Web Services

A Reservation-Based Extended Transaction Protocol for Coordination of Web Services A Reservation-Based Extended Transaction Protocol for Coordination of Web Services Wenbing Zhao Dept. of Electrical and Computer Engineering Cleveland State University Cleveland, OH 44115 wenbing@ieee.org

More information

Module 8 - Fault Tolerance

Module 8 - Fault Tolerance Module 8 - Fault Tolerance Dependability Reliability A measure of success with which a system conforms to some authoritative specification of its behavior. Probability that the system has not experienced

More information

Fast Paxos. Leslie Lamport

Fast Paxos. Leslie Lamport Distrib. Comput. (2006) 19:79 103 DOI 10.1007/s00446-006-0005-x ORIGINAL ARTICLE Fast Paxos Leslie Lamport Received: 20 August 2005 / Accepted: 5 April 2006 / Published online: 8 July 2006 Springer-Verlag

More information

Consensus on Transaction Commit

Consensus on Transaction Commit Consensus on Transaction Commit Jim Gray and Leslie Lamport Microsoft Research 1 January 2004 revised 19 April 2004, 8 September 2005, 5 July 2017 MSR-TR-2003-96 This paper appeared in ACM Transactions

More information