Ordering events in distributed systems: A review


Yannic Bonenberger
TU Kaiserslautern, Germany

The concept of time is fundamental to our way of thinking. However, our intuitive concept of a total ordering of events is not well suited to managing temporal order in large, distributed systems. In this paper, we introduce two methods for designing distributed systems that handle the ordering of events gracefully. In the second part of the paper, we present concrete use-cases for these methods and describe how they were implemented in these applications.

1 Introduction

The concept of time is a very important part of human thinking. We use it every day to describe the order in which events occur, or to describe the duration of a single event. For example, we say that an event occurred at 3:15 if the clock we look at shows 3:15 and has not yet shown 3:16. We might also say that a given event took 15 minutes if it started at 3:15 and ended at 3:30. However, this intuitive concept of time is not very accurate and comes with some inherent problems when we try to use it in distributed systems. The limited accuracy of this way of thinking about time has two rather obvious causes. Firstly, by the time the clock shows 3:15, the exact moment at which it was 3:15 has already passed; this means that we always use the past to describe time. Secondly, we cannot know whether the clock we use to describe when an event happened actually shows the correct time. We may notice when the time on a given clock is completely wrong, but we cannot detect a small difference between two arbitrary clocks. While this does not seem like an issue in everyday life, it can be important in some situations. Let us assume we sell tickets for a festival, and all sales have to be done by telephone.
As usual, the tickets are sold First come, First served, which means we have to order the requests by time. Let us also say that it is a big festival, and a lot of people are calling at the same time. Since we do not want our visitors to wait for a long time, we hire a big call center with multiple offices to handle the requests. Every time someone calls to order a ticket, we write down the current time and hand the request to a central station, which orders the requests by time and then assigns tickets to the earliest callers. As you can imagine, this total ordering of events is not very accurate, because there might be a small, but noticeable, difference between the clocks used to record when the tickets were ordered. If one clock already shows 3:16 while another one still shows 3:15, we may violate our First come, First served premise. These limitations of accuracy and scalability become more severe when we talk about distributed systems.

Submitted to: ES Seminar 2018

A distributed system is a collection of distinct processes which can communicate with each other by sending messages. An example of such a system is the internet, which consists of independent hosts which communicate by exchanging messages. A single computer can also be viewed as a distributed system: the central control unit, the memory units, and the input-output channels are separate processes which communicate by sending messages over a central bus. Modern computers also usually have more than one processor core, and can execute more than one hardware thread concurrently. These independent threads can communicate with other threads through shared memory, or by sending messages. However, transmission time within a single computer is usually negligible compared to the time between events in a process. Therefore, we will concern ourselves primarily with systems of spatially separated computers, although most of the concepts apply more generally. Let us go back to the example we used to demonstrate the limitations of accuracy and scalability of the intuitive concept of time. If we replace the manual process of ordering tickets through a call center with an automated system to handle these requests, we still have the same issues to solve. Due to scalability constraints, we cannot use a single process to handle a potentially very large number of concurrent requests. This means that we still have to order all events by time to fulfill the requests on a First come, First served basis. However, it is not strictly necessary to have a total ordering across all requests, as long as we can guarantee that, once the request for the last ticket has been fulfilled, no request that happened after it is fulfilled.
In this paper, we will formally define physical time, introduce the concept of logical time, and present some reasoning why true physical time is not required in many applications. After we formally define the relationship of events, namely what it means when we say that an event e_i happened before another event e_j (e_i → e_j), or what it means when we say two events e_i, e_j happened concurrently, we introduce two methods to handle time in distributed systems, Lamport Time [4] and Vector Time [2], and compare them. In the end, we present two distributed applications that require an ordering of events, and show how they use the methods presented earlier to solve their problem [3, 6].

2 Physical Clocks

Let us introduce physical clocks into our model. Let C_i(t) denote the reading of clock C_i at time t. For mathematical consistency, we assume a clock running continuously rather than a clock with discrete ticks (a clock with discrete ticks can be modeled by a clock running continuously, plus a reading error of up to 1/2 tick). More precisely, we assume that C_i(t) is a continuous, differentiable function of t except for isolated jump discontinuities where the clock is reset. Then dC_i(t)/dt represents the rate at which the clock is running at time t. For C_i to be a true physical clock, it is crucial that dC_i(t)/dt ≈ 1 for all t. More precisely, the following condition must always be satisfied for some x ≪ 1:

∀i: |dC_i(t)/dt − 1| < x. (1)

Typical crystal-controlled clocks achieve very small values of x. However, to have true physical clocks, all clocks must not only individually run at approximately the correct rate, they must also be synchronized so that C_i(t) ≈ C_j(t) for all i, j. To be more precise, for some sufficiently small ε:

∀i, j: |C_i(t) − C_j(t)| < ε. (2)

Assuming we have a system as presented in Figure 1, we can consider the vertical distance between events to represent physical time. In this model, (2) states that the variation of the tick lines is less than ε if the clocks are sufficiently synchronized. However, two different clocks tend to drift further and further apart, because it is almost impossible for them to run at exactly the same rate. Therefore, we must employ an algorithm to ensure that condition (2) always holds. As we stated earlier in this section, as well as in Section 1, using true physical time to describe the ordering of a set of events presents several challenges. Fortunately, most systems do not need true physical time to order events. Instead, it is often enough to assign a strictly increasing number to each event to create a sufficient ordering.
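To make conditions (1) and (2) concrete, the following is a minimal sketch (our own illustration, not from the paper) of two imperfect clocks modeled as linear functions of true time. The rates and bounds are invented for the example; it shows that clocks satisfying (1) still drift apart, so (2) eventually fails without resynchronization.

```python
# Hypothetical illustration: two physical clocks with slightly wrong rates.

def clock_reading(rate, offset, t):
    """Reading C_i(t) of a clock running at a constant (slightly wrong) rate."""
    return offset + rate * t

# Condition (1): each clock's rate stays within x of the true rate 1.
x = 1e-4
rates = [1.0 + 5e-5, 1.0 - 5e-5]
assert all(abs(r - 1) < x for r in rates)

# Condition (2): without resynchronization, the readings drift apart, so
# |C_1(t) - C_2(t)| < eps fails for any fixed eps after long enough.
eps = 0.01
t = 0.0
while abs(clock_reading(rates[0], 0, t) - clock_reading(rates[1], 0, t)) < eps:
    t += 1.0
print(f"clocks are more than eps apart after t = {t}")
```

The while-loop terminates precisely because condition (2) cannot hold forever for free-running clocks, which is why a synchronization algorithm is needed.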
3 Logical Clocks

Now that we know physical clocks, we can introduce the concept of logical clocks into our system. Let us begin with an abstract point of view where a clock simply assigns numbers to events. We can think of these numbers as the time at which the event occurred. More precisely, we define an arbitrary clock C_k for each process P_k to be a simple function which assigns a number C_k(e_i) to every event e_i in that process. The entire system is represented by the function C which assigns to any event e_j the number C(e_j), where C(e_j) = C_l(e_j) if e_j is an event in process P_l. For now, we make no assumption about the relationship of the numbers C_k(e_i) to physical time,

Figure 1: Three independent processes P, Q, R processing events p_i, q_i, r_i, and sending messages to each other.

so we can think of the clocks C_i as logical rather than physical clocks. They may be implemented by counters with no actual timing mechanism. Now that we have formally introduced the concept of logical clocks, we must define what it means for such a system to be correct. Since we cannot introduce true physical clocks keeping real physical time into our system, and therefore also cannot base our definition of correctness on physical time, we must base our definition purely on the order in which events occur. The strongest reasonable condition for the correctness of the proposed timing mechanism is that if an event e_i occurs before another event e_j, then e_i happens at an earlier time than e_j. If we consider the vertical distance of events in Figure 1 to represent logical time, then an event e_i happened before e_j if and only if the time i at which e_i occurs is strictly smaller than the time j at which e_j occurs (e_i → e_j ⟺ i < j).
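A logical clock implemented as a counter with no timing mechanism, as described above, can be sketched in a few lines (our own illustrative naming, not from the paper):

```python
# Minimal sketch: a logical clock is just a per-process counter that
# assigns an increasing number C_k(e) to every local event e.

class LogicalClock:
    def __init__(self):
        self.counter = 0

    def tick(self, event):
        """Assign the next number to an event and return it."""
        self.counter += 1
        return self.counter

clock_p = LogicalClock()
stamps = [clock_p.tick(e) for e in ["p1", "p2", "p3"]]
print(stamps)  # successive events in one process receive increasing numbers
assert stamps == sorted(stamps)
```

Note that this only orders events within one process; Sections 5 and 6 show how to extend it across processes.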

4 Ordering Events

To compare a set of events, we have to define a relation → between two events e_i, e_j so that e_i → e_j means "e_i happened before e_j". In this section, we will formally introduce the concepts of total order and partial order, as well as some general concepts which apply when ordering events. Order theory is a branch of mathematics which investigates the notion of order using binary relations. It provides a formal foundation to describe statements such as "this is less than that", or "this happened before that". We can distinguish between two different, but related, kinds of orders: partial order and total order. Let E be a set of events and ≤ be a relation on E. Then ≤ is called a partial order if and only if it is reflexive, transitive, and antisymmetric.

Figure 2: Hasse diagram of the set of all divisors of 60, partially ordered by divisibility [1].
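The divisibility order of Figure 2 can be checked programmatically. The following sketch (our own illustration) verifies that divisibility on the divisors of 60 is reflexive, transitive, and antisymmetric, but not total:

```python
# Divisibility as a partial order on the divisors of 60 (cf. Figure 2).

divisors = [d for d in range(1, 61) if 60 % d == 0]
leq = lambda a, b: b % a == 0  # a "divides" b

assert all(leq(a, a) for a in divisors)                      # reflexivity
assert all(leq(a, c)
           for a in divisors for b in divisors for c in divisors
           if leq(a, b) and leq(b, c))                       # transitivity
assert all(a == b
           for a in divisors for b in divisors
           if leq(a, b) and leq(b, a))                       # antisymmetry

# Not a total order: 10 and 15 are incomparable under divisibility.
assert not (leq(10, 15) or leq(15, 10))
```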

This means that for all e_i, e_j, e_k ∈ E we have

e_i ≤ e_i (reflexivity), (3)
(e_i ≤ e_j) ∧ (e_j ≤ e_k) ⟹ e_i ≤ e_k (transitivity), and (4)
(e_i ≤ e_j) ∧ (e_j ≤ e_i) ⟹ e_i = e_j (antisymmetry). (5)

A relation which fulfills these three properties is called a partial order. By checking these properties, one immediately sees that the well-known orders on the natural numbers, integers, rational numbers, and reals are all orders in the above sense. However, they have the additional property of being total orders. This means that for all e_i, e_j ∈ E we have

(e_i ≤ e_j) ∨ (e_j ≤ e_i) (totality). (6)

In Figure 2, we present the divisors of 60 ordered by divisibility, which is only a partial order and not a total order. Please note that only nodes which are directly or indirectly connected by edges are comparable. For example, we know that 1 | 2 (read "1 divides 2") or 10 | 20. However, we do not know whether 10 | 15 or 15 | 10, as these two nodes are not comparable.

5 Lamport Time

In this section, we will present the algorithm described by Leslie Lamport in the paper Time, Clocks, and the Ordering of Events in a Distributed System in 1978 [4]. The algorithm proposed by Lamport uses logical clocks, which we have already formally defined in Section 3. As a reminder: logical clocks can be implemented by counters with no actual timing mechanism, and every event is assigned a strictly increasing number. Because we cannot rely on physical time in such a system, the definition of correctness must be based on the order in which events occur. The strongest reasonable condition is that if an event e_i occurs before another event e_j, then e_i should happen at an earlier time than e_j. We formally define this Clock Condition as follows:

∀e_i, e_j ∈ E: (e_i → e_j) ⟹ (C(e_i) < C(e_j)). (7)

Looking at Figure 1, we can see that the events p_2 and p_3 are concurrent with q_3. Assuming that the converse of condition (7) also holds, both events p_2 and p_3 would have to occur at the same time as q_3.
Since this would contradict the Clock Condition (p_2 → p_3 requires C(p_2) < C(p_3)), we cannot expect this converse condition to be true. Given our definition of →, it is easily derivable that the Clock Condition defined in (7) is satisfied

if these two conditions hold:

If e_i and e_j are events in process P_k, and e_i → e_j, then C_k(e_i) < C_k(e_j), and (8)

If e_i is the sending of a message by process P_k and e_j is the receipt of that message by another process P_l, then C_k(e_i) < C_l(e_j). (9)

Let us consider the clocks in terms of a space-time diagram: it is easily imaginable that the clock of a process ticks through every number, incrementing between every event. For example, if e_i and e_j are successive events in process P_k with C_k(e_i) = 4 and C_k(e_j) = 7, then the ticks 5, 6, and 7 of C_k occur between these two events. If we draw a dashed tick line through all the ticks of the different processes, the space-time diagram presented in Figure 3 below yields a picture similar to the one we used to illustrate the ticks of a physical clock in Figure 1 in Section 2 above. From condition (8), we can derive that there must be a tick line between any two successive events of any process, and condition (9) requires that every message line must cross at least one tick line. Looking at the meaning of → in the space-time diagram in Figure 3, it is easily imaginable that the tick lines represent the coordinate lines of a Cartesian coordinate system on space-time, implying that the two necessary conditions of our Clock Condition are indeed true. If we redraw Figure 1 to straighten these dashed coordinate lines, we obtain a valid alternative way of representing the same system of events. However, without introducing physical time into our system, it is not decidable which of these two possible representations is better. Readers may find it helpful to use a two-dimensional spatial network of processes, yielding a three-dimensional space-time diagram, for visualization. Similarly to our representation in this paper, the alternative representation models processes and messages as lines. However, tick lines are now represented by two-dimensional surfaces.
Now that we have formally defined our requirements, let us assume that the processes represent algorithms, and the events are certain actions during their execution. We will now show how to introduce clocks which satisfy the Clock Condition into the processes. If we use a register C_k to represent the clock of process P_k, and C_k(e_i) is the value of C_k during the event e_i, the value of the register changes only between two events in the same process P_k. For obvious reasons, this change must not constitute an event itself. We can show that this implementation satisfies the Clock Condition by ensuring that it satisfies (8) and (9). Showing that the proposed approach satisfies condition (8) is simple: each process only needs to obey the following implementation rule:

Each process P_k increments C_k between any two successive events. (10)

Figure 3: Three independent processes P, Q, R processing events p_i, q_i, r_i, and sending messages to each other.

Meeting the second condition (9) is slightly more complicated: we must ensure that each message m contains a timestamp T_m equal to the time at which the message was sent. Every time a process P_l receives a message m with the timestamp T_in, the process must advance its clock to be later than T_in. More precisely, we define the following two rules:

If event e_i is the sending of a message m by process P_k, then m contains a timestamp T_m = C_k(e_i), and (11)

Upon receiving a message m with timestamp T_in, process P_l sets C_l to be greater than or equal to its present value and greater than T_in. (12)
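The three rules (10), (11) and (12) can be sketched as follows. This is our own minimal illustration with invented naming; the paper specifies the rules but not an implementation:

```python
# Sketch of Lamport's implementation rules: a per-process counter register
# that is incremented between events and advanced past received timestamps.

class LamportProcess:
    def __init__(self):
        self.clock = 0  # the register C_k

    def local_event(self):
        self.clock += 1           # rule (10): increment between events
        return self.clock

    def send(self):
        self.clock += 1           # rule (10)
        return self.clock         # rule (11): message carries T_m = C_k(e_i)

    def receive(self, t_in):
        # rule (12): set the clock >= its present value and > T_in
        self.clock = max(self.clock, t_in) + 1
        return self.clock

p, q = LamportProcess(), LamportProcess()
t_m = p.send()                    # P sends with timestamp 1
q.local_event(); q.local_event()  # Q's clock is already at 2
t_recv = q.receive(t_m)           # receipt is stamped 3 > T_m
assert t_recv > t_m
```

The receive rule guarantees condition (9): the receipt of a message is always stamped later than its sending, regardless of how far the receiver's clock lags behind.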

Rule (12) requires that the event representing the receipt of the message m occurs after the setting of the receiver's clock. However, we want to note that this is only a small nuisance in the notation, and not relevant in any actual implementation. It is also trivial to show that (11) and (12) satisfy condition (9). Hence, the three simple rules (10), (11) and (12) imply that the Clock Condition is satisfied, and guarantee a correct system of logical clocks.

6 Vector Time

Although the relation → introduced by Lamport [4] is always consistent with the observable behaviour of distributed systems, it only defines one of the possibly many valid event orderings for a given distributed computation, and all other, equally valid, event orderings are lost. Even the partial ordering resulting from the fact that subsets of the events can have the same timestamp does not preserve all potential and valid orderings. In this section, we will introduce a second approach, which retains all possible and valid orderings. While this is the exact opposite of the approach introduced earlier, it is best suited for problems concerned with the global state of a program.

Figure 4: Use of timestamp vectors for asynchronous communication.

In this model, rather than having only a single integer value shared by all processes, timestamps are represented as vectors

(c_1 c_2 ... c_n)^T (13)

with a dedicated integer value for every process in the distributed system. Formally, we define that e_p represents an event e executed by a process p, and that T_ep is the timestamp vector permanently attached to the record of the execution of this event. For example, assume that a timestamp vector with T_x2[2] = 7 and T_x2[4] = 12 is attached to an arbitrary event x in process 2. We can see that the clock value of process 2 was 7 when x was executed, and that the last known clock value of process 4 was 12. It is important to note that the local timestamp of process 4 may have advanced well beyond this value by the time x is executed, but 12 is the most recent value available to process 2.

6.1 Asynchronous Communication

Figure 4 shows an example of asynchronous communication. In this case, the timestamp vectors are managed by the following algorithm:

Initially, all values of the timestamp vector are set to zero. (14)

The local clock value is incremented at least once before each atomic event. (15)

Every outgoing message is augmented with the entire timestamp vector. (16)

Upon receiving a message, a process sets the value of each entry in the timestamp vector to be the maximum of the two corresponding values in the local vector and in the piggybacked vector received. The value corresponding to the sender, however, is a special case and is set to be one greater than the value received (to allow for transit time), but only if the local value is not already greater than that received (to allow for message "overtaking"). (17)

Values in the timestamp vectors are never decremented. (18)

To compare timestamps attached to the stored records of events, we proceed as follows:

e_p → f_q ⟺ T_ep[p] < T_fq[p]. (19)

Under this condition, an event e_p is a predecessor of another event f_q if and only if p has sent a message to q either during or after the execution of e_p.
To achieve transitivity, and therefore make it possible to determine the causal relation between events executed by processes which may never communicate directly, we additionally allow the indirect propagation of timestamp vectors.
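Rules (14)-(19) can be sketched as follows. This is our own illustration with invented naming, following the rules above under the assumption of a fixed number of processes:

```python
# Sketch of the asynchronous vector-clock rules (14)-(19).

def new_vector(n):
    return [0] * n                      # rule (14): all entries start at zero

def before_event(vec, i):
    vec[i] += 1                         # rule (15): tick before each atomic event

def on_send(vec):
    return list(vec)                    # rule (16): piggyback the whole vector

def on_receive(vec, msg_vec, sender):
    for k in range(len(vec)):
        vec[k] = max(vec[k], msg_vec[k])     # rule (17): component-wise maximum
    if vec[sender] <= msg_vec[sender]:       # sender's entry: one greater than
        vec[sender] = msg_vec[sender] + 1    # received, unless already overtaken
    # rule (18): entries are never decremented

def happened_before(t_e, t_f, p):
    return t_e[p] < t_f[p]              # rule (19), e executed by process p

vp, vq = new_vector(3), new_vector(3)   # processes 0 and 1 of a 3-process system
before_event(vp, 0)
t_send = on_send(vp)                    # event e_p: send, vp = [1, 0, 0]
before_event(vq, 1)                     # a local event in q, vq = [0, 1, 0]
on_receive(vq, t_send, sender=0)        # vq becomes [2, 1, 0]
assert happened_before(t_send, vq, p=0) # e_p is a predecessor of the receipt
```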

6.2 Synchronous Communication

Now that we have presented our solution for asynchronous communication, we can show that the synchronous case can be solved easily if we require the exchange of timestamps from both sender and receiver with every message, and that both processes set their local clocks to the maximum of the exchanged timestamps. This is necessary because synchronous communication is always symmetric. To achieve synchronous communication, we define that every time a process receives a message, a dummy message with the local clock value of the receiver is returned to the sender, with both processes adjusting their local clocks according to the received timestamps. As long as all processes adhere to this simple protocol, it is impossible to introduce deadlocks into the system. For brevity, we will not present a proof in this paper; interested readers can find it in the cited paper by Fidge [2].

Figure 5: Algorithm proposed by Lamport, adapted for synchronous communication: a) clock in sender running fast, b) clock in sender running slow.

It is important to note that, due to the symmetry of synchronous communication and how we modeled the exchange of such messages, the direction of information transfer is not important. Therefore, we will omit the directional arrows for synchronous messages in future visualizations. Now that we have formally defined synchronous communication and how we plan to handle it, we provide a modified version of the algorithm presented in Section 6.1 to manage timestamp vectors in the synchronous case:

Initially, all values of the timestamp vector are set to zero. (20)

The local clock value is incremented at least once before each atomic event. (21)

During a communication event, the two processes involved exchange timestamp vectors, and each element in the local vector is set to be the maximum of its old value and the corresponding value in the received vector. (22)

Values in the timestamp vectors are never decremented. (23)

Figure 6: Timestamp vectors for synchronous communication. Each message causes an individual event in both processes involved in the communication, to compensate for the fact that the execution of events is recorded in each process separately.

Similarly, we modify the procedure to compare timestamps of stored records as described below:

e_p → f_q ⟺ (T_ep[p] ≤ T_fq[p]) ∧ (T_ep[q] < T_fq[q]). (24)

The first half of this comparatively complex conjunction ensures that process q has received a clock value from its communication partner p which is at least as recent as the execution of event e_p. If we know that this precondition is satisfied, we also know that e_p must have been executed before f_q. Careful

readers may notice that we use ≤ as comparator rather than < to allow for the possibility that e_p is a communication event itself. In this special case, process q may already have up-to-date information about the other process p when event f_q was executed. The second half of conjunction (24) states that p cannot have up-to-date information about q, i.e. that e_p and f_q are not the same event. This second part is necessary to avoid reflexivity. Since, in our model, processes generate histories of timestamp traces for post-mortem analysis independently, we do not attempt to test for equality of events directly. As an alternative to the comparison presented in (24) above, it is equivalent to compare the entire timestamp vectors, because they are exactly the same if and only if e_p and f_q are the same event. However, since this computation is O(n), with n being the number of processes in the system, it becomes inefficient if the number of processes is large, while the proposed approach is O(1), since it only needs to compare two integer values.

7 Comparison

In this section, we will compare the approach proposed by Lamport [4] to the approach proposed by Fidge [2]. Both papers present an algorithm to handle time in distributed systems, and both avoid using true physical time to order events. The approach introduced by Lamport uses only a single integer value as a timestamp, shared by all processes, while the approach proposed by Fidge uses a timestamp vector with a dedicated integer value for every process. Both approaches have certain advantages over the other. Two very important properties of the algorithm presented in Section 5 are that it is rather easy to implement, and that the memory required to manage and store the timestamp is constant. While it is easy to see why the approach proposed by Lamport is very easy to implement, we do not think that this is a noteworthy advantage: the second algorithm, presented in Section 6, is only slightly more complex.
However, the approach proposed by Fidge requires a timestamp vector with a dedicated integer value for every process, which leaves us with a space complexity of O(n), while the other approach requires only one timestamp value, giving a space complexity of O(1). This can be especially problematic when we have a large number of processes, because it not only requires more memory, it also increases the total size of every message because of the piggybacked timestamp vector. However, while the algorithms differ in space complexity, both have a time complexity of O(1). Another advantage of the approach presented by Lamport is that it can handle dynamic processes. In this report, we have always assumed that there is a fixed number of processes. However, this assumption is problematic because most applications must be able to scale up or down, depending on the current workload. It is easy to see why the single integer timestamp presented in Section 5 has no problem with this, while a timestamp vector with a fixed size, as we assumed in Section 6, cannot be used. To mitigate this fundamental limitation, Fidge proposed to replace the fixed-size timestamp vector with an extendable timestamp list, and to add a slot for every new process. However, we can see at least two

issues with this solution. Firstly, it will severely increase the memory footprint of the timestamp vector, especially for systems which frequently scale their number of processes, because we only ever add new slots to the vector and never remove old and stale ones. Please note that it is not possible to remove the slots of processes which are no longer running, because the absence of messages from a process is not a good indicator of whether it is still alive. That would require a central directory, which brings us to our second concern: allocating new slots requires a central directory, because every process must have a unique slot in the timestamp vector. If two new processes are added at the same time, they may decide to use the same slot if they do not know of each other. In many cases, this requirement of a central directory is not desirable, especially if the distributed system has availability requirements. However, there are also very important advantages to the approach proposed by Fidge: firstly, Lamport time cannot handle synchronous communication, and secondly, the relation defined by Lamport only defines one of many possible valid event orderings for a given distributed computation, and knowledge of any other, equally valid, orderings is lost. As a final note, it is crucial that applications weigh which properties of a distributed event ordering algorithm are important for their use-case, and then decide which approach to choose.

8 Providing high availability using lazy replication

In this section, we will present an example of a distributed system which uses the presented approaches to order events. As already noted earlier, high availability is a requirement for services such as mail or bulletin boards. More precisely, they should be accessible with high probability despite site crashes and network failures.
To achieve this, data must be replicated to multiple independent nodes, and data consistency must be guaranteed. One way to guarantee the required consistency is to force all operations to occur in the exact same order at all instances, which is expensive. Fortunately, not every application requires this strong operation order to preserve the required level of consistency; a weaker, causal order often suffices, yielding improved performance. To achieve this weak causal order, the authors of the paper introduce the concept of lazy replication, which is intended for environments in which individual computers are connected by a communication network. With this architecture, both the nodes and the network can fail without bringing down the system as a whole. Nodes are modeled as fail-stop processors [5], network partitions can happen, messages between nodes may be lost, delayed, duplicated, and delivered out of order, and instances can leave or join the application at any time. Replicated systems are designed as services consisting of multiple computers acting as replicas. Such systems are usually located in a dedicated network, although large systems spanning multiple physical locations are possible. For brevity, we will only look at systems directly connected by a single network. Replicas communicate new information among themselves by lazy exchange of gossip messages. We will also only look at two kinds of operations: update operations, which modify the state of the system but cannot observe it, and query operations, which observe the state but do not modify it. To be able to execute these operations in a valid order, we augment them

with a label indicating the state of the system, and specify which previous label is required to perform the operation. To achieve the best possible efficiency, we need compact representations of labels, and a fast and efficient way to decide whether an operation is ready to be executed. Additionally, labels must be generated by individual instances independently. To achieve these properties, we use multipart timestamps. A multipart timestamp is a vector

(t_1 t_2 ... t_n)^T

where n is the number of replicas in the service. Every entry in this vector is a non-negative integer, which is initially zero. These timestamps are ordered in the intuitive way:

t ≤ s ⟺ (t_1 ≤ s_1 ∧ t_2 ≤ s_2 ∧ ... ∧ t_n ≤ s_n). (25)

Merging two timestamps t and s into a new timestamp u is done by taking their component-wise maximum (u[i] = max(t[i], s[i])). Replicas receive operations (call messages), as well as gossip messages from other nodes. When a replica receives a call message for an update which it has not performed before, the update is assigned a timestamp, and the replica adds information about it to its local log record. Periodically, this information is propagated to other replicas in the network as gossip messages, and then also reflected in the log of the receiving instance. Every node maintains a local timestamp, rep_ts, identifying the set of records contained in the local log, and thus expressing which updates are known by a particular instance. The replica's own part of rep_ts is incremented every time an update call is processed; therefore, its value directly reflects the number of processed updates. All other parts of rep_ts are only incremented when the node receives gossip messages from other replicas. Therefore, every part i of rep_ts counts the number of updates which were processed at replica i and are known by the replica maintaining this timestamp vector.
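The comparison rule (25) and the merge operation can be sketched in a few lines (our own illustration, not code from the paper). Note that, like the vector timestamps of Section 6, multipart timestamps form only a partial order:

```python
# Sketch of multipart-timestamp comparison (25) and component-wise merging.

def leq(t, s):
    """t <= s iff every component of t is <= the matching component of s."""
    return all(ti <= si for ti, si in zip(t, s))

def merge(t, s):
    """Component-wise maximum, used when a replica incorporates gossip."""
    return [max(ti, si) for ti, si in zip(t, s)]

t, s = [1, 4, 2], [2, 3, 2]
assert not leq(t, s) and not leq(s, t)   # incomparable: only a partial order
assert merge(t, s) == [2, 4, 2]          # the merge dominates both inputs
```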
When an update record is known by all replicas in the network, we know that it is reflected in the local state of all instances, and it can be discarded from all logs. To be more precise, if an update is known to be known by all nodes, it can be executed, meaning that it will also be reflected in the local state. The reasoning behind this more general claim is the following: assuming a node knows about some arbitrary update record u, the node knows that this record is known by all instances if and only if it has received gossip messages containing u from all other replicas. Since all gossip messages contain the accumulated log recorded by the sender, it is guaranteed that instances receiving the log entry containing u from replica i have also, either in the current gossip message or in an earlier one, received all operations processed by i before u was executed. Therefore, if a replica has heard about u from all other replicas, it has also heard about all updates u depends on. If that were not true, the gossip could not have contained u, because u's dependencies must have been executed before u. Therefore, u is ready to execute.

8.1 Processing an update message

If an update is late, or if it has already been processed by the replica receiving the call, the message is discarded. Otherwise, the following actions are performed to process the update:

Advances its local timestamp by incrementing its ith part by one while leaving all other parts unchanged. (26)

Computes the timestamp t_s for the update by replacing the ith part of the update's input timestamp with the ith part of the local timestamp. (27)

Constructs the update record r associated with this execution of the update, r = makeUpdateRecord(u, i, t_s), and adds it to the local log. (28)

Executes the update operation u if all the updates that u depends on have already been incorporated into the local state. (29)

Returns the update's timestamp in a reply message. (30)

Since updates can depend on other updates, the local timestamp and the timestamp assigned to the update call u may not be comparable. For example, if replica i receives an update u depending on another update v, and v has been executed by another node j, i may not know about v yet and has to delay the execution of u until v is propagated to i by gossip.

8.2 Processing a query message

When replica i receives a query message q, it compares the query's input timestamp with its own local timestamp, which identifies all locally reflected updates. If the query's input timestamp is smaller than or equal to the timestamp representing the local state, the replica executes the query and returns the result together with the timestamp representing its local state. If the query's input timestamp is not smaller than the local timestamp, required information is missing and the replica waits. In this state, there are two ways to resolve the situation and continue with the query: the replica can either wait for gossip messages containing the missing data, or it can explicitly request the required information from another instance.

8.3 Processing a gossip message

As mentioned earlier, gossip messages are used to propagate update messages to all nodes in the system.
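Before turning to gossip handling, the update and query paths just described (8.1 and 8.2) can be sketched as follows. This is a minimal, single-threaded sketch; the names (`Replica`, the record dictionary, the dependency check) are illustrative assumptions, and a real replica would handle the waiting case of 8.2 asynchronously rather than returning None.

```python
# Illustrative sketch of update processing (26)-(30) and query processing
# (8.2) at replica i. Data structures are assumptions for this sketch.

def leq(t, s):
    return all(a <= b for a, b in zip(t, s))

class Replica:
    def __init__(self, i, n):
        self.i = i                 # index of this replica
        self.rep_ts = [0] * n      # local multipart timestamp (known updates)
        self.val_ts = [0] * n      # timestamp of the locally applied state
        self.log = []              # update records
        self.value = []            # local state: applied update operations

    def process_update(self, u, prev):
        """prev is the update's input timestamp (its dependencies)."""
        self.rep_ts[self.i] += 1                        # (26)
        ts = list(prev)
        ts[self.i] = self.rep_ts[self.i]                # (27)
        r = {"op": u, "node": self.i, "ts": ts, "prev": prev}
        self.log.append(r)                              # (28)
        if leq(prev, self.val_ts):                      # (29) dependencies met
            self.value.append(u)
            self.val_ts = [max(a, b) for a, b in zip(self.val_ts, ts)]
        return ts                                       # (30)

    def process_query(self, q, prev):
        """Execute q only once all updates in prev are reflected locally."""
        if leq(prev, self.val_ts):
            return list(self.value), list(self.val_ts)
        return None  # caller must wait for gossip or fetch the missing data
```

A client would pass the timestamp returned in step (30) as the input timestamp of its next operation; this is how causal dependencies between operations are expressed as label comparisons.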
Therefore, these messages contain the log of the sender, as well as the sender's local timestamp. Processing an individual gossip message consists of the following three steps:

Merging the log in the message with the local log. (31)

Computing the local view of the service state based on the new information. (32)

Discarding records from the log and from the set of records which participated in the update. (33)

For obvious reasons, an incoming gossip message is only processed if the local state does not already reflect its updates. Receiving redundant information in gossip messages can happen for two reasons: messages are delivered out of order, or another gossip message from a different node already contained the updates of this gossip message. If the gossip message m is not discarded, the replica performs the following actions:

Adds the new information in the message to the replica's log: log = log ∪ (m.log \ log). (34)

Merges the replica's timestamp with the timestamp in the message, so that the local timestamp reflects the information now known at the replica. (35)

Finds all the update records that are ready to be added to the local value. (36)

Computes the new local value. (37)

Updates its local timestamp table. (38)

Discards update records from the log if they have been received by all replicas. (39)

Discards records from the list of records which participated in an update if an ack for this update is in the log and there is no update record for that update in the log. (40)

Discards ack records from the log if they are known everywhere and sufficient time has passed. (41)

The decision to delete records from the log only if they are known everywhere can be problematic in the case of a network partition, since it relies on information from all other replicas, which may not be available in this case. Supposing that a partition divides the network into two sides A and

B, and that a record r is known by all nodes in both A and B: if no replica in partition A knows that r is known by all replicas in B, r will not be discarded from the logs of the nodes in A. Once the two network parts A and B reconnect, this problem is resolved automatically.

8.4 Analysis

For brevity, we will not present an analysis of the correctness of the proposed system in this review. Interested readers can find it in the corresponding section of the paper by Ladin, Liskov, Shrira, and Ghemawat [3]. One very important aspect of the performance of the proposed system is that it highly depends on the type and frequency of the executed operations, so we only present a very brief overview of the measurements here.

Figure 7: Capacity of a single replica.

In Figures 7 and 8, we present the response times of a single instance in a system consisting of three replicas for a given mix of operations, and the response times of an unreplicated system, respectively. By comparing the capacity of the unreplicated system to the capacity of the replica, it is possible to derive the savings due to gossip. However, it is worth mentioning that the performance of the system as a whole likely depends on the relative priorities of gossip and operations. The system used to measure the response times visualized in Figures 7 and 8 was configured to prioritize gossip, meaning that gossip is processed whenever there is gossip to send or receive. Configuring the system to prioritize update

or query operations over gossip will likely yield better response times. However, gossip cannot be allowed to lag too far behind, since this would slow down the propagation of information about updates. No experiments with changes in the relative priority were performed during this analysis; real implementations should perform such an analysis to find the optimal configuration.

Figure 8: Capacity of the unreplicated system.

References

[1] (2018): Available at the_divisibility_of_60.svg.
[2] Colin J. Fidge (1987): Timestamps in message-passing systems that preserve the partial ordering.
[3] Rivka Ladin, Barbara Liskov, Liuba Shrira & Sanjay Ghemawat (1992): Providing high availability using lazy replication. ACM Transactions on Computer Systems (TOCS) 10(4).
[4] Leslie Lamport (1978): Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21(7).
[5] Richard D. Schlichting & Fred B. Schneider (1983): Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems (TOCS) 1(3).
[6] Reinhard Schwarz & Friedemann Mattern (1994): Detecting causal relationships in distributed computations: In search of the holy grail. Distributed Computing 7(3).


More information

Mobile and Heterogeneous databases Distributed Database System Transaction Management. A.R. Hurson Computer Science Missouri Science & Technology

Mobile and Heterogeneous databases Distributed Database System Transaction Management. A.R. Hurson Computer Science Missouri Science & Technology Mobile and Heterogeneous databases Distributed Database System Transaction Management A.R. Hurson Computer Science Missouri Science & Technology 1 Distributed Database System Note, this unit will be covered

More information

Notes on Bloom filters

Notes on Bloom filters Computer Science B63 Winter 2017 Scarborough Campus University of Toronto Notes on Bloom filters Vassos Hadzilacos A Bloom filter is an approximate or probabilistic dictionary. Let S be a dynamic set of

More information

Synchronization. Chapter 5

Synchronization. Chapter 5 Synchronization Chapter 5 Clock Synchronization In a centralized system time is unambiguous. (each computer has its own clock) In a distributed system achieving agreement on time is not trivial. (it is

More information

OCL Support in MOF Repositories

OCL Support in MOF Repositories OCL Support in MOF Repositories Joachim Hoessler, Michael Soden Department of Computer Science Technical University Berlin hoessler@cs.tu-berlin.de, soden@cs.tu-berlin.de Abstract From metamodels that

More information

Consul: A Communication Substrate for Fault-Tolerant Distributed Programs

Consul: A Communication Substrate for Fault-Tolerant Distributed Programs Consul: A Communication Substrate for Fault-Tolerant Distributed Programs Shivakant Mishra, Larry L. Peterson, and Richard D. Schlichting Department of Computer Science The University of Arizona Tucson,

More information

Advanced Databases Lecture 17- Distributed Databases (continued)

Advanced Databases Lecture 17- Distributed Databases (continued) Advanced Databases Lecture 17- Distributed Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch www.mniazi.ir Alternative Models of Transaction Processing Notion of a single

More information

Availability versus consistency. Eventual Consistency: Bayou. Eventual consistency. Bayou: A Weakly Connected Replicated Storage System

Availability versus consistency. Eventual Consistency: Bayou. Eventual consistency. Bayou: A Weakly Connected Replicated Storage System Eventual Consistency: Bayou Availability versus consistency Totally-Ordered Multicast kept replicas consistent but had single points of failure Not available under failures COS 418: Distributed Systems

More information

Spanning Trees and IEEE 802.3ah EPONs

Spanning Trees and IEEE 802.3ah EPONs Rev. 1 Norman Finn, Cisco Systems 1.0 Introduction The purpose of this document is to explain the issues that arise when IEEE 802.1 bridges, running the Spanning Tree Protocol, are connected to an IEEE

More information

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN

NOTES ON OBJECT-ORIENTED MODELING AND DESIGN NOTES ON OBJECT-ORIENTED MODELING AND DESIGN Stephen W. Clyde Brigham Young University Provo, UT 86402 Abstract: A review of the Object Modeling Technique (OMT) is presented. OMT is an object-oriented

More information

Reliable Distributed System Approaches

Reliable Distributed System Approaches Reliable Distributed System Approaches Manuel Graber Seminar of Distributed Computing WS 03/04 The Papers The Process Group Approach to Reliable Distributed Computing K. Birman; Communications of the ACM,

More information

CS Amazon Dynamo

CS Amazon Dynamo CS 5450 Amazon Dynamo Amazon s Architecture Dynamo The platform for Amazon's e-commerce services: shopping chart, best seller list, produce catalog, promotional items etc. A highly available, distributed

More information

Symmetric Product Graphs

Symmetric Product Graphs Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 5-20-2015 Symmetric Product Graphs Evan Witz Follow this and additional works at: http://scholarworks.rit.edu/theses

More information

6.852: Distributed Algorithms Fall, Class 21

6.852: Distributed Algorithms Fall, Class 21 6.852: Distributed Algorithms Fall, 2009 Class 21 Today s plan Wait-free synchronization. The wait-free consensus hierarchy Universality of consensus Reading: [Herlihy, Wait-free synchronization] (Another

More information

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer - proposes a formal definition for the timed asynchronous distributed system model - presents measurements of process

More information

Replication and Consistency

Replication and Consistency Replication and Consistency Today l Replication l Consistency models l Consistency protocols The value of replication For reliability and availability Avoid problems with disconnection, data corruption,

More information

15 212: Principles of Programming. Some Notes on Induction

15 212: Principles of Programming. Some Notes on Induction 5 22: Principles of Programming Some Notes on Induction Michael Erdmann Spring 20 These notes provide a brief introduction to induction for proving properties of ML programs. We assume that the reader

More information

Technische Universität München Zentrum Mathematik

Technische Universität München Zentrum Mathematik Technische Universität München Zentrum Mathematik Prof. Dr. Dr. Jürgen Richter-Gebert, Bernhard Werner Projective Geometry SS 208 https://www-m0.ma.tum.de/bin/view/lehre/ss8/pgss8/webhome Solutions for

More information

Behavioural Equivalences and Abstraction Techniques. Natalia Sidorova

Behavioural Equivalences and Abstraction Techniques. Natalia Sidorova Behavioural Equivalences and Abstraction Techniques Natalia Sidorova Part 1: Behavioural Equivalences p. p. The elevator example once more How to compare this elevator model with some other? The cabin

More information

Eventual Consistency Today: Limitations, Extensions and Beyond

Eventual Consistency Today: Limitations, Extensions and Beyond Eventual Consistency Today: Limitations, Extensions and Beyond Peter Bailis and Ali Ghodsi, UC Berkeley Presenter: Yifei Teng Part of slides are cited from Nomchin Banga Road Map Eventual Consistency:

More information

Distributed Algorithms 6.046J, Spring, Nancy Lynch

Distributed Algorithms 6.046J, Spring, Nancy Lynch Distributed Algorithms 6.046J, Spring, 205 Nancy Lynch What are Distributed Algorithms? Algorithms that run on networked processors, or on multiprocessors that share memory. They solve many kinds of problems:

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information