Replication Using Group Communication Over a Partitioned Network


Thesis submitted for the degree Doctor of Philosophy by Yair Amir. Submitted to the Senate of the Hebrew University of Jerusalem (1995).

This work was carried out under the supervision of Professor Danny Dolev.

Acknowledgments

I am deeply grateful to Danny Dolev, my advisor and mentor. I thank Danny for believing in my research, for spending so many hours on it, and for giving it the theoretical touch. His warm support and patient guidance helped me through. I hope I managed to adopt some of his professional attitude and integrity.

I thank Dalia Malki for her help during the early stages of the Transis project. Thanks to Idit Keidar for helping me sharpen some of the issues of the replication server. I enjoyed my collaboration with Ofir Amir on developing the coloring model of the replication server. Many thanks to Roman Vitenberg for his valuable insights regarding the extended virtual synchrony model and the replication algorithm. I benefited a lot from many discussions with Ahmad Khalaila regarding distributed systems and other issues. My thanks go to David Breitgand, Gregory Chokler, Yair Gofen, Nabil Huleihel and Rimon Orni, for their contribution to the Transis project and to my research.

I am grateful to Michael Melliar-Smith and Louise Moser from the Department of Electrical and Computer Engineering, University of California, Santa Barbara. During two summers, several mutual visits and extensive electronic correspondence, Louise and Michael were involved in almost every aspect of my research, and unofficially served as my co-advisors. The work with Deb Agarwal and Paul Ciarfella on the Totem protocol contributed a lot to my understanding of high-speed group communication.

Ken Birman and Robbert van Renesse from the Computer Science Department at Cornell University were always willing to contribute their valuable advice to my research. Spending last summer with them was an educational experience for me. For that I thank them both. Special thanks to Ken for convincing me to pursue an academic position.

Thanks to Eldad Zamler for first introducing me to what became my research problem, ten years ago. I thank Yaacov Ben-Yaacov and Gidi Kuperstein for six years of collaboration in building a working system and delivering it to the customer. They are all special friends.

I would like to thank my parents Shulamit and Reuven, for their love, encouragement and constant support. I thank my brother Yaron, my brother Ofir, Amira and Lee, for always being there for me. Last, but not least, I am grateful to my wife and my partner Michal, for her unending support. My success is the product of her wisdom, confidence, and love.

Contents

1. Introduction
   1.1 Problem Description
   1.2 Solution Highlights
   1.3 Thesis Organization
   1.4 Related Work
       Group Communication Protocols
       Group Communication Semantics
       Replication Protocols
2. The Model
   2.1 The Service Model
   2.2 The Failure Model
   2.3 Replication Requirements
3. The Architecture
4. Extended Virtual Synchrony
   4.1 Extended Virtual Synchrony Semantics
       Basic Delivery
       Delivery of Configuration Changes
       Self Delivery
       Failure Atomicity
       Causal Delivery
       Agreed Delivery
       Safe Delivery
   4.2 An Example of Configuration Changes and Message Delivery
   4.3 Discussion
5. Group Communication Layer
   5.1 The Transis System
   5.2 The Ring Reliable Multicast Protocol
       Message Ordering
       Membership State Machine
       Achieving Extended Virtual Synchrony
   5.3 Performance
6. Replication Layer
   6.1 The Concept
       Conceptual Algorithm
       Selecting a Primary Component
       Propagation by Eventual Path
   6.2 The Algorithm
   6.3 Proof of Correctness
       Safety
       Liveness
7. Customizing Services for Applications
   7.1 Strict Consistency
   7.2 Weak Consistency Query
   7.3 Dirty Query
   7.4 Timestamps and Commutative Updates
   7.5 Discussion
8. Conclusions

Abstract

In systems based on the client-server model, a single server may serve many clients, and the heavy load on the server may adversely affect the response time. In such circumstances, replicating data or servers may improve performance. Replication may also improve the availability of information when processors crash or the network partitions.

Existing replication methods are often needlessly expensive. They sometimes use point-to-point communication when multicast communication is available; they typically pay the full price of end-to-end acknowledgments for all of the participants for every update; they may claim locks, and therefore may be vulnerable to faults that can unnecessarily block the system for long periods of time.

This thesis presents a new architecture and algorithms for replication over a partitioned network. The architecture is structured into two layers: a replication server and a group communication layer. Each of the replication servers maintains a private copy of the database. Actions (queries and updates) requested by the application are globally ordered by the replication servers in a symmetric way. Ordered actions are applied to the database and result in a state change and in a reply to the application.

We provide a group communication package, named Transis, to serve as the group communication layer. Transis utilizes the available non-reliable hardware multicast for efficient dissemination of messages to a group of processes. The replication servers use Transis to multicast actions and to learn about changes in the membership of the currently connected servers, in a consistent manner. Transis locally orders messages sent within the currently connected servers. The replication servers use this order to construct a long-term global total order of actions.

Since the system is subject to partitioning, we must ensure that two detached components do not reach contradictory decisions regarding the global order. Therefore, the replication servers use dynamic linear voting to select, at most, one primary component that continues to order actions.

The architecture is non-blocking: actions can be generated by the application at any time. While in a primary component, queries are immediately replied to in a consistent manner. While in a non-primary component, the user can choose to wait for a consistent reply (that will arrive as soon as the network is repaired) or to get an immediate, though not necessarily consistent, reply.

High performance of the architecture is achieved because:

- End-to-end acknowledgments are not needed on a regular basis. They are used only after membership change events such as processor crashes and recoveries, and network partitions and merges.
- Synchronous disk writes are almost eliminated, without compromising consistency.
- Hardware multicast is used where possible.

Chapter 1

1. Introduction

In systems based on the client-server model, a single server may serve many clients, and the heavy load on the server may adversely affect the response time. In such circumstances, replicating data or servers may improve performance. Replication may also improve the availability of information when processors crash or the network partitions.

Existing replication methods are often needlessly expensive. They sometimes use point-to-point communication when multicast communication is available. They typically pay the full price of end-to-end acknowledgment for all of the participants for every update, or even of several rounds of end-to-end acknowledgments. They may claim locks, and therefore may be vulnerable to faults that can unnecessarily block the system for long periods of time.

This thesis ends a ten-year professional journey. It started with my involvement in the design and implementation of a large and geographically distributed control system. The requirements of that system demanded a non-blocking solution with maximal availability. Each of the control stations had to be autonomous, to work despite network partitions, and to survive power failures.

To meet the requirements, we constructed a data replication scheme to function over an unreliable communication medium in a dynamic environment. We managed to limit the update semantics to commutative updates. Hence, the replica control problem was reduced to implementing a guaranteed delivery of actions to all of the replicas. This was done by constructing point-to-point stable queues. The concept was proven adequate and is still operational today, maintaining consistent replication of several tens of databases.

However, the use of point-to-point communication and the extensive use of synchronous disk writes, as well as the limitation imposed on the update semantics, left me with a feeling that a better replication concept could be found. My Ph.D. research was motivated by this belief.

Together with Danny Dolev, Dalia Malki and Shlomo Kramer, we initiated the Transis system, targeted at building tools for highly available distributed systems. We gave Transis its name to acknowledge the innovation of both the Trans protocol [MMA90] and the ISIS system [BvR94]. Transis was aimed at providing group communication services using the non-reliable hardware multicast available in most local area networks, tolerating network partitions and merges as well as processor crashes and recoveries.

On top of Transis, we designed a replication server that eliminates the need for synchronous disk writes per update without compromising consistency. Avoiding disk writes on the critical path and utilizing hardware multicast render our replication architecture highly efficient and more scalable than previous solutions.

1.1 Problem Description

The problem tackled in this thesis is how to construct an efficient and robust long-term replication architecture, within a fixed set of servers. Each server maintains a private copy of the database. The initial state of the database is identical at all of the servers. Typically, each server runs on a different processor.

The replication architecture is required to handle network partitioning. We explicitly assume that the network may partition into several components. Some or all of the partitioned components may subsequently re-merge. The architecture is also required to handle server crashes and recoveries.

It is assumed that the underlying communication supports some form of non-reliable multicast service (this service can be mimicked by unreliable point-to-point transmission). The architecture is required to overcome message omissions.

We assume no message corruption. We rely on error detection and error correction protocols to eliminate corrupted messages. Corrupted messages have the effect of omitted messages.

We do not handle malicious faults. We assume that all the servers are running their protocols faithfully.

1.2 Solution Highlights

We present a new architecture and algorithms for active replication over a partitioned network. Active replication is a symmetric approach where each of the replicas is guaranteed to invoke the same set of actions in the same order. This approach requires the next state of the database to be determined by the current state and the next action, and it guarantees that all of the replicas reach the same database state. Other factors, such as the passage of time, should not have any bearing on the next database state.

The architecture, presented in Figure 1.1, is structured into two layers: a replication server and a group communication layer. Each of the replication servers maintains a private copy of the database. Actions (queries and updates) requested by the application are globally ordered by the replication servers in a symmetric way. Ordered actions are applied to the database and result in a state change and in a reply to the application.

The replication servers use the group communication layer to efficiently disseminate actions, and to learn about changes in the membership of the currently connected servers in a consistent manner. The group communication layer locally orders messages disseminated within the currently connected group.

When a new component is formed by merging two or more components, the servers exchange information about actions and about the order of actions in the system. Actions missed by at least one of the servers are multicast, and the connected servers reach a common state. This way, actions are propagated as soon as possible. We call this method propagation by eventual path.

[Figure 1.1: The Architecture. Diagram labels: Application, DB, Replication Server, Group Communication, Network; Request, Reply, Apply; Global order of actions (replication servers), Local order of messages (group communication).]

Since the system may partition, we must ensure that two different components do not reach contradictory decisions regarding the global order of actions. Hence, we need to identify at most one component, the primary component, that may continue ordering actions. We employ dynamic linear voting [JM90], which is generally accepted as the best technique when certain restrictions hold.

We define a new semantics, extended virtual synchrony, for the group communication service. The significance of extended virtual synchrony is that, during network partitioning and re-merging and during process crash and recovery, it maintains a consistent relationship between the delivery of messages and the delivery of configuration change notifications across all processes in the system.

Prior group communication protocols have focused on totally ordering messages at the group communication level. That service, although useful for some applications, is not enough to guarantee complete consistency at the application level without additional end-to-end acknowledgments, as has been noted by Cheriton and Skeen [CS93]. Extended virtual synchrony specifies the safe delivery service, which provides an additional level of knowledge within the group communication protocol.

The strict semantics of extended virtual synchrony and its safe delivery service are exploited by the replication servers to eliminate the need for end-to-end acknowledgment on a per-action basis without compromising consistency. End-to-end acknowledgment is only required when the membership of connected servers changes, e.g., in case of network partitions, merges, server crashes and recoveries. This leads to high performance of the architecture.
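As an illustration of the primary-component rule mentioned above, the following is a minimal sketch of dynamic linear voting in its textbook form, not the algorithm developed in this thesis. It assumes that a component may become primary if it holds a strict majority of the members of the last primary component, and that an exact half is resolved in favor of the side containing the member ranked first in a fixed linear order. The names `is_primary`, `last_primary` and `order` are hypothetical.

```python
def is_primary(component, last_primary, order):
    """Decide whether `component` may become the next primary component.

    component    -- set of currently connected servers
    last_primary -- set of members of the most recent primary component
    order        -- list fixing a linear order over all servers (assumption:
                    it contains every member of last_primary)
    """
    if not last_primary:
        raise ValueError("last_primary must not be empty")

    # Only members of the last primary component count toward the quorum.
    present = set(component) & set(last_primary)

    # Strict majority of the last primary component: always sufficient.
    if 2 * len(present) > len(last_primary):
        return True

    # Exactly half: the linear order breaks the tie, so at most one of
    # the two halves can win.
    if 2 * len(present) == len(last_primary):
        distinguished = min(last_primary, key=order.index)
        return distinguished in present

    return False
```

For example, with `last_primary = {"a", "b", "c", "d"}` and `order = ["a", "b", "c", "d"]`, the component `{"a", "b"}` qualifies while its complement `{"c", "d"}` does not, so two detached halves cannot both continue ordering actions.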
In the general case, when the membership of connected servers is stable, the throughput and latency of actions are determined by the performance of the group communication and not so much by other factors such as the number of replicas and the performance of synchronous disk writes.

The architecture is non-blocking: actions can be generated by the application at any time. While in a primary component, queries are immediately replied to in a consistent manner. While in a non-primary component, the user can choose to wait for a consistent reply (that will arrive as soon as the network is repaired) or to get an immediate, though not necessarily consistent, reply. Two different, well-defined semantics are available for immediate replies in a non-primary component.

The key contributions of this Ph.D. research are:

- Defining an efficient architecture for replication.
- Constructing a highly efficient reliable multicast protocol that tolerates partitions, and implementing it in a general Unix environment. The symmetric protocol provides reliable message ordering and membership services. The protocol's exceptional performance is achieved by utilizing a non-reliable multicast service where possible.
- Defining the extended virtual synchrony semantics for group communication services. Extended virtual synchrony, among other things, strictly defines message delivery semantics in the presence of network partitions and re-merges, as well as process crashes and recoveries.
- Constructing the propagation by eventual path technique for efficient information dissemination in a dynamic network. This method utilizes group communication to propagate knowledge as soon as possible between servers. The strengths of the propagation by eventual path method are most evident when the membership of connected servers is dynamically changing.
- Eliminating the need for end-to-end acknowledgments and for synchronous disk writes on a per-action basis. Instead, end-to-end acknowledgments and synchronous disk writes are needed once, just after a change in the membership of the connected servers.
- Tailoring and optimizing replication services for different kinds of applications.

1.3 Thesis Organization

The rest of the thesis is organized as follows:

- The next subsection presents previous research in group communication protocols, group communication semantics, and replication protocols.
- Chapter 2 presents the theoretical model and defines the correctness criteria of the solution.

- Chapter 3 presents the overall replication architecture.
- Chapter 4 defines the extended virtual synchrony semantics.
- Chapter 5 presents Transis, our group communication layer, which provides extended virtual synchrony. We describe the logical ring protocol, one of the two reliable multicast protocols operational in Transis. Throughput and latency measurements of Transis, over a network of Pentium machines running Unix, are provided.
- Chapter 6 details our replication server. The replication protocol demonstrates how extended virtual synchrony is exploited to provide an efficient long-term replication service.
- Chapter 7 customizes services for different kinds of applications.
- Chapter 8 concludes this thesis.

A reader interested in an overview of this thesis beyond the introduction may read Chapter 3, Chapter 5 Section 1 and Section 3, and Chapter 6 Section 1. A reader interested in the practical aspects of this thesis and in implementation details may want to focus on Chapter 3, Chapter 5 Section 2 and Section 3, Chapter 6 Section 2, and Chapter 7.

Additional information including a copy of this thesis, a slide show, relevant published papers and more, can be obtained from: http:// http:// or by writing to yairamir@cs.jhu.edu or

1.4 Related Work

Much work has been done in the area of group communication and in the area of replication. We relate our work to three research areas: group communication protocols, group communication semantics, and replication protocols.

1.4.1 Group Communication Protocols

The ISIS toolkit [BJ87, BCJM+90, BvR94] is one of the first general purpose group communication systems. ISIS provides a group communication session service, where processes can join process groups, multicast messages to groups, and receive messages sent to groups. Two multicast primitives are provided: The CBCAST service guarantees causally ordered message delivery (see [Lam78]) across overlapping groups. CBCAST is implemented using vector timestamps that are piggybacked on each message. The ABCAST service extends the causal order to a total order using a central group coordinator that emits ordering decisions. ISIS also provides membership notifications when the group membership is changed. Group membership changes due to processes voluntarily joining or leaving the group, or due to process failures. Network partitions and re-merges, as well as process recoveries, are not supported. The novelty of ISIS is in guaranteeing a formal and rigorous service semantics named virtual synchrony. ISIS protocols are implemented using point-to-point communication. Although much better protocols exist today, and despite the lack of support for network partitions, ISIS is the most mature general purpose system available today. The ISIS system is commercially available from ISIS Distributed Systems LTD.

The V system [CZ85] provides group communication services at the operating system level. It was the first to utilize hardware multicast to implement process group communication. However, only non-reliable, best-effort, unordered delivery service is provided. Similar services for wide area networks are provided by the IP-multicast [Dee89] protocol.

The Chang and Maxemchuk reliable broadcast and ordering protocol [CM84] uses a token-passing strategy, where the processor holding the token acknowledges messages. All the participating processors can broadcast messages at any time. The protocol also provides membership and token recovery algorithms. Typically, between two and three messages are required to order a message in an optimally loaded system. The protocol does not provide a mechanism for flow control.

The TPM protocol [RM89] uses a token on a logical ring of processors for broadcasting and retransmission of messages. The token is circulated along a known token list in order to serialize message transmission. The token contains the next sequence number to be stamped on new messages. TPM starts by circulating the token to multicast a set of messages. Then, the token is used to retransmit messages belonging to the set that are missed by some of the processors. When no message is missed by any of the processors, the whole set is delivered to the application and a new set of messages can be introduced. TPM also provides a dynamic membership and token regeneration algorithm. If the network partitions, the component with the majority of the members (if such exists) is allowed to continue.

The Delta-4 [Pow91] system provides tools for building distributed, fault-tolerant real-time systems. As part of Delta-4, a reliable multicast protocol, xAMp [RV92], and a membership protocol [RVR93] are implemented. The protocols utilize the non-reliable multicast or broadcast primitive of local area networks. The Delta-4 protocols assume fail-stop behavior and, as such, do not support network partitions and re-merges. The membership protocol provides low-level processor membership so that a higher level process group membership can be built on top of it in a simple way. Our experience in Transis indicates that this two-level architecture is better than solving the membership problem at the process level. Delta-4 is more real-time oriented than Transis, and it uses special hardware for message ordering and failure detection. This seems to be a strong limitation on the project's usability.
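The token-based sequencing idea described above for TPM, where the token carries the next sequence number to be stamped on new messages, can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: there is no message loss, retransmission or membership change, and the function name `circulate_token` and its parameters `max_per_visit` and `rotations` are hypothetical rather than part of any of the cited protocols.

```python
from collections import deque


def circulate_token(pending, max_per_visit=None, rotations=1):
    """Simulate a token circulating on a logical ring of processors.

    pending       -- dict mapping processor id to its list of unsent messages;
                     the dict's insertion order is taken as the ring order
    max_per_visit -- optional cap on messages stamped per token visit
    rotations     -- number of full trips the token makes around the ring

    Returns a list of (sequence_number, processor, message) tuples, which is
    the single total order every processor would observe.
    """
    # Work on copies so the caller's queues are left untouched.
    queues = {proc: deque(msgs) for proc, msgs in pending.items()}
    next_seq = 0  # the value carried inside the token
    ordered = []

    for _ in range(rotations):
        for proc, queue in queues.items():
            sent = 0
            # Only the token holder may stamp and multicast messages.
            while queue and (max_per_visit is None or sent < max_per_visit):
                ordered.append((next_seq, proc, queue.popleft()))
                next_seq += 1
                sent += 1
    return ordered
```

Because only the token holder assigns sequence numbers, no two messages ever receive the same number, and receivers can detect a missed message as a gap in the sequence.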

The Amoeba distributed operating system uses the Flip high performance reliable multicast protocol [KvRvST93] to support high level services such as a fault-tolerant directory service [KTV93]. In Amoeba, members of the group send point-to-point messages to a distinct member called the sequencer. The sequencer stamps each message with a sequence number and broadcasts it to the group. A member that detects a gap in the message sequence sends a point-to-point retransmission request to the sequencer. The Amoeba system is resilient to any pre-defined number of failed processors, but its performance degrades as the number of allowed failures is increased.

The Trans and Total protocols [MMA90, MMA93, MM93] provide reliable ordered broadcast delivery in an asynchronous environment. The Trans protocol uses positive and negative acknowledgments piggybacked onto broadcast messages and exploits the transitivity of positive acknowledgments to reduce the number of acknowledgments required. The Total protocol, layered on top of the Trans protocol, converts the partial order into a total order. The Trans and Total protocols maintain causality and ensure that operational processors continue to order messages even though other processors have failed, provided that a resiliency constraint is met. A membership protocol [MMA94] is implemented on top of Total. If a processor suspects another processor, it sends a fault message for the suspected processor. When that message is ordered, the membership is changed to exclude this processor. The limitation of that architecture is that if Total cannot order the membership messages (e.g., because the resiliency constraint is not met), the system is blocked.

The Psync protocol [PBS89] builds a context graph that represents the causal partial order on messages. This order can be extended into a total order by determining complete waves of causally concurrent messages and by ordering the messages of a wave using some deterministic order. Based on the causal order provided by Psync, a membership algorithm is constructed [MPS91]. Using this algorithm, processors reach eventual agreement on membership changes. The algorithm handles processor faults and allows a processor to join a pre-existing group asymmetrically. Network partitions and re-merges are not supported.

The Newtop protocol [MES93, Mac94] replaces the context graph of Psync by the notion of causal blocks. Each causal block defines a set of messages. All the messages within a block are causally independent. The blocks are totally ordered. The messages in a block are delivered together, in some deterministic order. In this way, Newtop provides totally ordered delivery similar to the wave technique of Psync and the all-ack mechanism of Lansis [ADKM92a], but with much less bookkeeping. Newtop causal delivery is less efficient than Psync or Trans because the causal information represented in causal blocks is not accurate and is more pessimistic than needed (though more compact). Moreover, using causal blocks eliminates the ability to use faster algorithms (e.g., TOTO [DKM93]) that use the full context graph to reach a fast decision on total order. Newtop implements a membership service that handles processor crashes and network partitions. However, process recoveries and network re-merges are not addressed. The most interesting point of Newtop is its service semantics, presented in the next section.

The Horus project [vrbfhk95] implements group communication services, providing unreliable or reliable FIFO, causal, or total multicast services. Horus is extensively layered and highly configurable, allowing applications to pay only for the overhead of services they use. The layers include the COM layer which provides basic non-reliable multicast, the NAK layer which provides reliable FIFO multicast, the MBRSHIP layer that provides membership maintenance, the STABLE layer which provides message stability, the FC layer which provides flow control, the CAUSAL and TOTAL layers, the LWG layer which maintains process groups, the EVS layer which maintains extended virtual synchrony (see below), and many more. Advanced memory management techniques are used in order to avoid the full cost of layering.

The Transis project, described in Section 5.1, provides group communication services in a partitionable network. Three multicast primitives are provided according to the extended virtual synchrony semantics: Causal multicast, Agreed multicast for total order delivery, and Safe multicast that provides even stronger guarantees. Two different reliable multicast protocols are implemented in Transis. Lansis [ADKM92a], the earlier protocol, uses a directed acyclic graph (DAG) representing the causal relation on messages to provide reliable multicast. The DAG is derived from negative and positive acknowledgments piggybacked on messages. The causal order mechanism in Lansis is derived from the Trans protocol with several important modifications that adapt it for practical use. Two total order algorithms extend the causal order to a total, agreed order. The first is the all-ack algorithm, which is similar to the algorithm used in Psync, and the second is the TOTO early delivery algorithm [DKM93]. Both compute the total order based on the DAG structure without exchange of additional messages. While TOTO is more efficient than the all-ack protocol, it cannot maintain extended virtual synchrony.

The membership algorithm of Transis [ADKM92b] is a symmetric protocol that was the first to handle network partitions and re-merges. Although operational in an asynchronous environment, the algorithm ensures termination in a bounded time. The basic idea of this membership algorithm was adopted by Totem and Horus. Excellent reading about Transis and its membership algorithm is found in [Mal94]. The second reliable multicast protocol in Transis is the Ring protocol, detailed in Section 5.2. The Ring protocol was developed while the author was visiting the Totem project.

The Totem system [Aga94] provides reliable multicast and membership services across a collection of local-area networks. The Totem system is composed of a hierarchy of two protocols. The bottom layer is the Ring protocol [AMMAC93, AMMAC95], which provides reliable multicast and processor membership services within a broadcast domain. The upper layer is the Multiple-Rings protocol [Aga94], which provides reliable delivery and ordering across the entire network. Gateways are responsible for forwarding messages and configuration changes between broadcast domains. Each gateway interconnects two broadcast domains, and participates in the Ring protocol for each of them. Each domain may contain several gateways connecting it to several other domains. Extended virtual synchrony was first implemented in the Totem system [AMMAC93].

1.4.2 Group Communication Semantics

It is highly important for a group communication service to maintain a well-defined service semantics. The application builder can rely on that semantics when designing correct applications using this group communication service. The semantics must specify both the assumptions taken and the guarantees provided.

The ISIS system defines and maintains the virtual synchrony semantics [BvR94, BJ87, SS93]. Virtual synchrony ensures that all the processes belonging to a process group perceive configuration changes as occurring at the same logical time. Moreover, all processes belonging to a configuration deliver the same set of messages for that configuration. A message is guaranteed to be delivered, at all the processes that deliver it, in the same configuration in which it was multicast. The delivery of a CBCAST message maintains causality. The delivery of an ABCAST message, in addition, occurs at the same logical time at all the processes.

Virtual synchrony assumes message omission faults and fail-stop process faults, i.e., a process that fails can never (or is not allowed to) recover. When network partitioning occurs, virtual synchrony ensures that processes in at most one connected component of the network, the primary component, are able to make progress; processes in other components become blocked. Unfortunately, before a process fails or before it detects that it has been partitioned from the primary component, ISIS may deliver messages to it in an order inconsistent with the order determined at the primary component (if a database is maintained by the detached process, these messages may result in an inconsistent database state). Therefore, if a process recovers after a crash, or can merge again with the primary component, it must come back with a different process identifier and it is considered a new process. If this process maintains stable storage (e.g., a database), this storage has to be erased.

Unable to cope with network partitions and re-merges, and with process recoveries, virtual synchrony has limited practical value. Nevertheless, the virtual synchrony model emphasized the importance of a rigorous semantics for group communication services. To overcome these drawbacks, we extended the definition of virtual synchrony. This extension, extended virtual synchrony [MAMA94], is detailed in Chapter 4.

Valuable work done at the Newtop project [Mac94], separately from the work done in Transis and Totem, defines another group communication semantics which extends virtual synchrony to support partitions. Newtop semantics specifies several properties regarding the delivery of messages and configuration changes. It generalizes the primary component model of virtual synchrony to support several partitioned components without the need to block non-primary components (the application is, of course, free to block operation in non-primary components if it prefers). Newtop semantics is weaker than the extended virtual synchrony semantics. In particular, since Newtop does not support network re-merges, weaker requirements are specified for totally ordered delivery. This weakness allows the total order determined at a process to vary, and to contain holes, when compared to the total order determined at another process that just partitioned. Moreover, Newtop semantics does not specify the safe delivery property of extended virtual synchrony, whose importance is made clear in Chapter 6 of this thesis.

A recent work by Cristian and Schmuck on group membership in an asynchronous environment [CS95] defines the timed synchronous system model. In contrast to the theoretical asynchronous model that has no notion of time, the timed synchronous model assumes that processors have local clocks that allow them to measure the passage of time. Local clocks may drift with some (small) bounded rate. Each processor also contains a stable storage. Processor crashes introduce partial-amnesia behavior where the state of stable storage is the same as before the crash, while the state of the volatile storage is reinitialized. The model allows for message omission or performance (delay) faults, processor crashes and recoveries, and network partitions and re-merges.

The unique aspect of [CS95] lies in bounding the local time up to which certain guarantees of the group membership service will hold at each of the processors. While the membership algorithms developed in Transis and Totem do maintain the requirements presented in [CS95], they are not required to do so by the extended virtual synchrony model (which leaves local time out of the model). Combining ideas from the timed synchronous model with extended virtual synchrony might lead to a model which guarantees stronger liveness properties (that are provided anyway by the implementations of Transis and Totem). This, in turn, might lead to the ability to prove stronger liveness properties (with bounded local time) for protocols that currently use extended virtual synchrony to reason about their behavior. For example, it might be possible to prove a better liveness property for the replication protocol described in Chapter 6 than the liveness property required by the correctness criteria.

1.4.3 Replication Protocols

Much work has been done in the area of replication.
Traditionally, a replicated database is considered correct if it behaves as if there is only one copy of it, as far as the user can tell. This property is called one-copy equivalence. In a one-copy database, the system should ensure serializability, i.e. interleaved execution of user transactions is equivalent to some serial execution of these transactions. Thus, a replicated database is considered correct if it is one-copy serializable [BHG87], i.e. it ensures serializability and one-copy equivalence.

Two-phase-commit protocols [EGLT76] are the main tool for providing serializability in a distributed database system when transactions may span several sites. The same protocols can be used to maintain one-copy serializability in a replicated database. In a typical protocol of this kind [Gra78], one of the servers, the transaction coordinator, sends a request to prepare to commit to all of the participating servers. Each server replies with either a ready-to-commit or an abort. If any of the servers votes to abort, all of them abort. The transaction coordinator collects all the responses and informs the servers of the decision. Between the two phases, each server keeps the local database locked, waiting for the final word from the transaction coordinator. If a server fails before its vote reaches the transaction coordinator, it is usually assumed to vote abort. If the transaction coordinator fails, all the servers remain blocked indefinitely, unable to resolve the transaction. Even though blocking preserves consistency, it is highly undesirable because the locks cannot be relinquished, rendering the data inaccessible to other requests at operational servers. Clearly, a protocol of this kind imposes a substantial additional communication cost on each transaction.

Three-phase-commit protocols [Ske82] try to overcome some of the availability problems of two-phase-commit protocols, paying the price of an additional communication round and, therefore, of additional latency. In case of server crashes or network partitions, a three-phase-commit protocol allows a majority or a quorum to resolve the transaction. If failures cascade, however, a majority can be connected and still remain blocked, as is shown in [KD95]. That work presents an improved version of three-phase-commit that always allows a connected majority to proceed, regardless of past failures.

In the available copy protocols [BHG87], update operations are applied at all of the available servers, while a query accesses any server. Correct execution of these protocols requires that the network never partition; otherwise they block.

Voting protocols are based on quorums. The basic quorum scheme uses majority voting [Tho79] or weighted majority voting [Gif79]. Using voting protocols, each site is assigned a number of votes. The database can be updated in a partition only if that partition contains more than half of the votes.

The Accessible Copies algorithms [ESC85, ET86] maintain an approximate view of the connected servers, called a virtual partition. A data item can be read/written within a virtual partition only if this virtual partition (which is an approximation of the current connected component) contains a majority of its read/write votes.
If this is the case, the data item is considered accessible and read/write operations can be done by collecting sub-quorums in the current component. The maintenance of virtual partitions greatly complicates the algorithm. When the view changes, the servers need to execute a protocol to agree on the new view, as well as to recover the most up-to-date item state. Moreover, although view decisions are made only when the membership of connected servers changes, each update requires the full end-to-end acknowledgment from the sub-quorum.

Dynamic linear voting [JM87, JM90] is a more advanced approach that defines the quorum in an adaptive way. When a network partition (or re-merge) occurs, if a majority of the last installed quorum is connected, a new quorum is established and updates can be performed within this partition. Dynamic linear voting generally outperforms the static schemes, as shown by [PL88].

Epsilon serializability [PL91] applies an extension to the serializability correctness criterion. Epsilon serializability introduces a tradeoff between consistency and availability. It allows inconsistent data to be seen, but requires that data will eventually converge to a consistent (one-copy serializable) state. The user can control the degree of inconsistency. In the limit, strict one-copy serializability can be enforced. Several replica control protocols are suggested in [PL91]. One of these protocols limits the transactional model to commutative operations (COMMU) and another limits it to read-independent timestamped updates (RITU). In contrast, the ordered updates (ORDUP) protocol does not limit the transactional model. ORDUP executes transactions asynchronously, but in the same order at all of the replicas. Update transactions are disseminated and are applied to the database when they are totally ordered. The replication protocol presented in Chapter 6 of this thesis complies with the ORDUP model. Optimizations for the COMMU and RITU update models are presented in Chapter 7 of this thesis.

Lazy replication [LLSG90, LLSG92] is a replication method that overcomes network partitions and re-merges. It relaxes the constraints on operation ordering by exploiting the semantics of the service's operations. The client application can specify exactly what causal relations should be enforced between operations. Using this approach, unrelated operations do not incur any latency delay due to communication. By using a gossip method to propagate operations, lazy replication ensures reliable eventual delivery of all the operations to all of the replicas. However, the loose control on operation transmissions between replicas is a serious drawback of lazy replication. An operation might be transmitted from one replica to another many times, even when it is already known at the other replica.

The timestamped anti-entropy replication technique [Gol90] provides eventual weak consistency. This method also ensures the eventual delivery of each action to each of the replication servers using an epidemic technique: pairs of servers periodically contact each other to exchange actions that one of them has and the other misses. This exchange is called an anti-entropy session. When the network partitions and subsequently re-merges, servers from different components exchange actions generated at the disconnected component using anti-entropy sessions. A total order on the actions can be placed using a method similar to [AAD93]. The anti-entropy technique used to propagate actions is far more efficient than the gossip technique of [LLSG90].
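The pairwise exchange performed in an anti-entropy session can be illustrated with a small sketch. It is not the timestamped protocol of [Gol90]: it assumes, purely for illustration, that every action carries a unique identifier (a hypothetical `(origin, seq)` pair) and that a session copies to each side exactly the actions it is missing. The names `Replica` and `anti_entropy_session` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A server and the set of actions it knows about (hypothetical model)."""
    name: str
    # Maps a unique action id (origin, seq) to the action payload.
    actions: dict = field(default_factory=dict)

    def create(self, seq, payload):
        """Record a locally generated action."""
        self.actions[(self.name, seq)] = payload


def anti_entropy_session(a, b):
    """Two-way exchange: each side receives only the actions it misses.

    Returns how many actions were sent to b and to a, respectively.
    """
    missing_at_b = {k: v for k, v in a.actions.items() if k not in b.actions}
    missing_at_a = {k: v for k, v in b.actions.items() if k not in a.actions}
    b.actions.update(missing_at_b)
    a.actions.update(missing_at_a)
    return len(missing_at_b), len(missing_at_a)


if __name__ == "__main__":
    x, y = Replica("x"), Replica("y")
    x.create(1, "deposit 10")
    x.create(2, "deposit 5")
    y.create(1, "withdraw 3")
    print(anti_entropy_session(x, y))   # (2, 1)
    print(x.actions == y.actions)       # True
```

After one session both replicas hold the same set of actions; a second session between them transfers nothing. The sketch says nothing about ordering the actions, which is a separate step.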
In prior research [AAD93], we described an architecture that uses the Transis group communication layer to achieve consistent replication. The architecture handles network partitions and re-merges, as well as server crashes and recoveries. It constructs a highly efficient epidemic technique, using the configuration change notification provided by Transis to keep track of the membership of the currently connected servers. Upon a reconfiguration, the currently connected servers efficiently exchange state information. Each action known to one of the servers and missed by at least one server is sent exactly once. The replication servers do not need to worry about message omissions because the group communication layer (Transis) guarantees reliable multicast. This technique is more efficient than the anti-entropy technique because a multi-way exchange is used instead of a two-way exchange of knowledge and actions. Moreover, the exchange takes place exactly when it is needed (i.e. after a membership change) rather than periodically. The serious inefficiency of [AAD93] is the method of global total ordering, which uses a Lamport clock and requires an eventual path from every server to order an action.

A valuable work by Keidar [Kei94] uses the architecture of [AAD93] but replaces its global total ordering method. The novel ordering algorithm in [Kei94] always allows a connected majority of the servers to make progress, regardless of past failures. As in [AAD93], it always allows servers to initiate actions (even when they are not part of a connected majority). Thus, actions can eventually become totally ordered even if their initiator is never a member of a majority component.

Both [Kei94] and [AAD93] use the flow control and multicast properties of group communication, but both still need end-to-end acknowledgments between servers on a per-action basis to allow global ordering of a message. This diminishes the performance advantages gained by using group communication.

The replication server, described in [ADMM94] and detailed in Chapter 6 of this thesis, eliminates the need for end-to-end acknowledgment at the server level without compromising consistency. End-to-end acknowledgment is still needed just after the membership of the connected servers changes. Thus, the performance gain is substantial, and is determined by the performance provided by the group communication. The price to pay (compared to [Kei94]) is that there exist rare scenarios in which multiple servers in the primary component crash or become disconnected within a window of time so short that the membership algorithm could not be completed anywhere. In these scenarios, if none of the servers is certain about which actions were ordered within that primary component (e.g. due to a global crash), then the recovery of, and communication with, every server of the last primary component is required before the next primary component can be formed.

Chapter 2

2. The Model

2.1 The Service Model

A Database is a collection of organized, related data that can be accessed and manipulated. An Action defines a transition from the current state of the database to the next state; the next state is completely determined by the current state and the action. Each Action contains an optional query part and an optional update part. The update part of an action defines a modification to be made to the database, and the query part returns a value.

A replication service maintains a replicated database in a distributed system. The replication service is provided by a known finite set of processes, called the servers group. The individual processes within the servers group are called replication servers or simply servers, each of which has a unique identifier. Each server within the servers group maintains a private copy of the database on stable storage. The initial state of the database is identical at all of the servers. Typically, each server runs on a different processor. Processes to which the service is provided are called clients. The number of clients in the system is unlimited.

We introduce the following notation:

$S$ is the servers group.

$a_{s,i}$ is the $i$th action performed by server $s$.

$D_{s,i}$ is the state of the database at server $s$ after actions $1..i$ have been performed by server $s$.

$\mathit{stable\_system}(s, r)$ is a predicate that denotes the existence of a set of servers containing $s$ and $r$, and a time, from which on, that set does not face any communication or server failure. Note that this predicate is only defined to reason about the liveness of certain protocols. It does not imply any limitation on our practical protocol.
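As a minimal sketch of the service model, an action can be represented as an optional update part plus an optional query part, performed by a deterministic function of the current state and the action. The names `Action` and `perform`, and the use of a dictionary as the database, are assumptions made for illustration. Evaluating the query on the state after the update is also an arbitrary choice of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Database = dict  # hypothetical stand-in for the database state


@dataclass(frozen=True)
class Action:
    # Update part: maps the current state to the next state (no side effects).
    update: Optional[Callable[[Database], Database]] = None
    # Query part: computes a return value from the database state.
    query: Optional[Callable[[Database], Any]] = None


def perform(db, action):
    """Return (next_state, reply); both depend only on db and action."""
    next_db = action.update(db) if action.update else db
    reply = action.query(next_db) if action.query else None
    return next_db, reply


if __name__ == "__main__":
    deposit = Action(
        update=lambda db: {**db, "balance": db.get("balance", 0) + 10},
        query=lambda db: db["balance"],
    )
    state, reply = perform({}, deposit)
    print(state, reply)  # {'balance': 10} 10
```

Because `perform` reads nothing but its two arguments, two servers that start from the same state and perform the same actions in the same order end in the same state.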

2.2 The Failure Model

The system is subject to message omission, server crashes and network partitions. We assume no message corruption and no malicious faults. A server or a processor may crash and may subsequently recover after an arbitrary amount of time. A server recovers with its stable storage intact, is aware of its recovery, and retains its old identifier. The network may partition into a finite number of components. The servers in a component can receive messages generated by other servers in the same component, but servers in two different components are unable to communicate with each other. Two or more components may subsequently merge to form a larger component. A message which is multicast within a component may get lost by some or even all of the processors.

2.3 Replication Requirements

According to the service model, the initial state of the database is identical at all of the servers.

$$\forall s, r \in S:\quad D_{s,0} = D_{r,0}$$

Also, the next state of the database is completely determined by the current state and the performed action.

$$\forall s \in S:\quad D_{s,i} = \mathit{function}(D_{s,i-1},\, a_{s,i})$$

The correctness criteria for the solution are defined as follows:

Safety. If server $s$ performs the $i$th action and server $r$ performs the $i$th action, then these actions are identical.

$$\exists\, a_{s,i},\, a_{r,i} \;\Rightarrow\; a_{s,i} = a_{r,i}$$

Note that if the servers perform the same set of actions in the same order then they reach an identical state. For databases that comply with our service model (where the next database state is completely determined by the current state and the performed action), our safety criterion translates to one-copy serializability (see [BHG87]). One-copy serializability requires that concurrent execution of actions on a replicated database be equivalent to some serial execution of these actions on a non-replicated database.
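The safety criterion can be checked mechanically on recorded histories. The sketch below assumes that each server keeps a list of the actions it has performed, in order (a hypothetical log). `satisfies_safety` compares two logs position by position wherever both servers have performed an action, and `replay` illustrates that equal action sequences lead to equal states when the transition function is deterministic. Both function names are invented for this example.

```python
from functools import reduce


def satisfies_safety(log_s, log_r):
    """True if the i-th actions agree wherever both servers performed one."""
    # zip stops at the shorter log, so only common positions are compared.
    return all(a == b for a, b in zip(log_s, log_r))


def replay(initial_state, log, function):
    """Compute D_i = function(D_{i-1}, a_i) for every action in the log."""
    return reduce(function, log, initial_state)


if __name__ == "__main__":
    # Illustrative only: actions are integers added to a counter.
    add = lambda state, action: state + action
    s_log, r_log = [5, 3, 2], [5, 3]
    print(satisfies_safety(s_log, r_log))                      # True
    print(satisfies_safety([5, 3], [3, 5]))                    # False
    print(replay(0, s_log[:2], add) == replay(0, r_log, add))  # True
```

Note that a server may lag behind another without violating safety: the criterion constrains only positions that both servers have reached.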

Liveness. If server $s$ performs an action and there exists a set of servers containing $s$ and $r$, and a time, from which on, that set does not face any communication or process failures, then server $r$ eventually performs the action.

$$\left(\exists\, a_{s,i} \,\wedge\, \mathit{stable\_system}(s, r)\right) \;\Rightarrow\; \Diamond\, \exists\, a_{r,i}$$

Our liveness criterion only admits protocols that propagate actions between any two servers, while it excludes protocols that rely on a central server, or on some specific servers, to propagate actions.

Chapter 3

3. The Architecture

Two main approaches for replication are known in the literature: the first is the primary-backup approach, and the second is active replication. In the primary-backup approach, one of the replication servers, the primary, is the only server allowed to respond to application requests (actions). The other servers, the backups, update their copy of the database after the primary informs them of the action. If the primary crashes, one of the backups takes over and becomes the new primary. Some primary-backup architectures allow backups to respond to queries in order to increase system performance.

Active replication, in contrast, is a symmetric approach where each of the replication servers is guaranteed to invoke the same set of actions in the same order. This approach requires the next database state to be determined by the current state and the next action. Other factors, such as the passage of time, have no bearing on the next state. Some active replication architectures replicate only the updates, while queries are answered locally.

This work takes the approach of active replication. As can be seen in Figure 3.1, our replication architecture is a symmetric architecture which is structured into two layers: a replication server layer and a group communication layer. Typically, each replication server is a process that runs on a different processor that hosts a copy of the database. The group communication layer is another process running on the same processor and communicating with the replication server via inter-process communication mechanisms. Alternatively, it can be implemented as a library which is linked within the replication server process.

Each of the replication servers maintains a private copy of the database. The client application requests an action from one of the replication servers. The client-server interaction is done via some communication mechanism such as RPC, IPC, or even via the group communication layer.
The replication servers agree on the order of actions to be performed on the replicated database. As soon as a replication server knows the final order of an action, it applies this action to the database. If the action contains a query part, a reply is returned to the client application from the database copy maintained by the original server that got the request. The replication servers use the group communication layer to disseminate the actions among the servers group and to help reach an agreement about the final global order of the set of actions.

In a typical operation, when an application requests an action from a replication server, this server generates a message containing the action. The message is then passed to the local group communication layer, which sends the message over the communication medium. Each of the currently connected group communication layers finally receives the message and then delivers it, in the same order, to its replication server. We say that these servers are currently connected.

If the system partitions into several components, the replication servers identify at most one component as the primary component. The replication servers in a primary component determine the final global total order of actions according to the order provided by the group communication layer. As soon as the final order of an action is determined, this action is applied to the database. In the primary component, new actions can be ordered, and be applied to the database, immediately upon delivery by the group communication layer. In non-primary components, actions must be delayed until communication is restored and the servers learn of the order determined by the primary component.

[Figure 3.1: Detailed Architecture. Elements shown: Application (Request, Reply), Replication Server (Apply to DB, Generate Actions), Group Communication (Deliver, Send and Receive Messages), Medium.]

The group communication layer provides reliable multicast and membership services according to the extended virtual synchrony model specified in Chapter 4. This layer overcomes message omission faults and notifies the replication server of changes in the membership of the currently connected servers. This notification corresponds to server crashes and recoveries and to network partitions and re-merges. The Transis system, which is an implementation of such a group communication layer, is described in Chapter 5.
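The different handling of actions in primary and non-primary components can be sketched as follows. This is a simplified illustration and not the replication protocol of Chapter 6: the class `ReplicationServer`, its methods, and the way the primary's order is learned (`learn_order`) are hypothetical, and a list of applied actions stands in for the database. The sketch also assumes that the order decided by the primary component always extends what this server has already applied.

```python
class ReplicationServer:
    def __init__(self):
        self.in_primary = True
        self.database = []   # applied actions, in the final global order
        self.pending = []    # delivered actions whose final order is unknown

    def on_membership_change(self, in_primary):
        """Called when the group communication layer reports a new membership."""
        self.in_primary = in_primary

    def on_deliver(self, action):
        """Called for each action delivered by the group communication layer."""
        if self.in_primary:
            self.database.append(action)  # order is final: apply immediately
        else:
            self.pending.append(action)   # delay until the order is known

    def learn_order(self, primary_order):
        """Adopt the order decided by the primary component after a re-merge.

        primary_order lists every action ordered so far by the primary
        component; by assumption it starts with what is already applied here.
        """
        for action in primary_order[len(self.database):]:
            self.database.append(action)
            if action in self.pending:
                self.pending.remove(action)


if __name__ == "__main__":
    p, q = ReplicationServer(), ReplicationServer()
    for server in (p, q):
        server.on_deliver("a1")
    q.on_membership_change(in_primary=False)  # q is partitioned away
    p.on_deliver("a2")                        # ordered in the primary
    q.on_deliver("b1")                        # delayed at q
    q.learn_order(p.database)                 # the components re-merge
    print(q.database, q.pending)              # ['a1', 'a2'] ['b1']
```

In this sketch the action `b1` stays pending after the re-merge; how such actions eventually receive a place in the global order is left out.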
