The Design and Implementation of a Fault-Tolerant Media Ingest System


Daniel Lundmark
December 23, credits

Umeå University
Department of Computing Science
SE UMEÅ SWEDEN


Abstract

Instead of handling media on video tapes, television companies are moving towards storing media in digital media archives. Populating a media archive with media files is known as ingesting the media. The DART system is capable of automatically scheduling the ingest of media available through live video feeds. This is a complex process with several single-points-of-failure: points where, if they fail, the entire ingest fails. To avoid these failures, this thesis suggests a redesign of the DART system to make it fault-tolerant.

Fault-tolerance is a widely studied subject in distributed computing. To make a system fault-tolerant it is essential that multiple identical copies are kept of vital data, which requires the use of replication methods. One common theoretical replication model is known as active replication. Active replication has been used successfully, together with other changes, to make the DART system fault-tolerant: no single internal component of the system, or of the network connecting it, can cause the redesigned system to fail. To make this possible, the DART system has been analyzed to find all single-points-of-failure and to determine what changes are necessary when redesigning the system.

The changes have been tested extensively and used by a major television news network in everyday work. After evaluating these results it is clear that the redesign has been largely successful in making the DART system fault-tolerant. This shows that it is possible to use the active replication method as a basis for practical fault-tolerance. The new system also has some drawbacks, such as being more complex and more sensitive to bugs than the original system.


Contents

1 Introduction
   1.1 Background
      1.1.1 Media ingest
      1.1.2 The DART system
      1.1.3 Fault-tolerance
   1.2 Thesis Description
      1.2.1 Goal
      1.2.2 Purpose
      1.2.3 Method

2 Distributed Systems Fundamentals
   2.1 System model
      2.1.1 Synchronous and asynchronous systems
      2.1.2 Process failures
      2.1.3 Handling process failures
   2.2 Important concepts for distributed systems
      2.2.1 Partially and totally ordered sets
      2.2.2 Physical clocks
      2.2.3 Logical clocks and causal order
      2.2.4 Agreement problems
   2.3 Message passing in distributed systems
      2.3.1 Unicast communication
      2.3.2 Broadcast communication
      2.3.3 Total ordered broadcast and the agreement problem
      2.3.4 Implementation of reliable and totally ordered broadcast
      2.3.5 Multicast communication
   2.4 Replication in distributed systems
      2.4.1 A model for describing replication protocols
      2.4.2 Passive replication
      2.4.3 Active replication
      2.4.4 Comparison of replication strategies
   2.5 Existing solutions and protocols
      2.5.1 Virtual router redundancy protocol (VRRP)
      2.5.2 Multiport Video Computer Protocol (MVCP)

3 The DART System
   3.1 Architecture of the DART system
      3.1.1 The DART client
      3.1.2 The dartsessd daemon
      3.1.3 The dartrecd daemon
      3.1.4 Video encoding servers and protocol translators
      3.1.5 Adapters between DART and external systems
      3.1.6 Controlling the video router
   3.2 Communication between components
      3.2.1 Physical connectivity in the DART system
      3.2.2 The Ardendo MessageBus
      3.2.3 Logical communication between components
   3.3 Starting a recording in DART

4 Providing Fault-Tolerance in the DART System
   4.1 Possible failures in DART components
      4.1.1 Failures in computers running the DART client
      4.1.2 Failure of the application server
      4.1.3 Failure of the database server
      4.1.4 Failures in video encoder servers
      4.1.5 Video routing failures
   4.2 Possible failures in network communication
      4.2.1 Network interface card failover
      4.2.2 Default gateway failover
   4.3 Implementing fault-tolerance in DART
      4.3.1 The need for changes in the dartsessd daemon
      4.3.2 Choosing a replication method for DART
      4.3.3 Model based description of DART replication
      4.3.4 Necessary changes in dartsessd
      4.3.5 Message passing in the DART system
      4.3.6 Handling of recordings
      4.3.7 Video routing
      4.3.8 Network related changes
      4.3.9 External systems and the adapter
   4.4 Failover to a new primary system
   4.5 Replication failures
   4.6 Re-synchronizing a failed DART system
   4.7 Starting a recording in the fault-tolerant DART system

5 Evaluation and analysis of the fault-tolerant DART system
   5.1 Function tests
   5.2 Performance tests
   5.3 Failover tests
      5.3.1 Application server failure
      5.3.2 Database server failure
      5.3.3 Video routing failure
      5.3.4 Network failures
      5.3.5 Loss of synchronization
   5.4 Using DART in production

6 Conclusion
   6.1 Feasibility of using active replication
   6.2 Problems with the active approach to DART replication
   6.3 Using a passive replication method
   6.4 Final thoughts

Acknowledgments


List of Figures

1.1 An overview of the ingest process
2.1 A partial order where a ≤ b if a ⊆ b
2.2 Events and logical clocks for processes p, q, r
2.3 Pros and cons of active replication
2.4 Pros and cons of passive replication
3.1 The different parts of the DART system
3.2 The DART graphical user interface
3.3 Format of a message sent using the MessageBus
4.1 Message flow for a replicated request and response
4.2 Format of MessageBus messages with changed from attribute
4.3 The mutual exclusion protocol of the adapters
4.4 Algorithm followed when dartsynch compares the databases
4.5 Algorithm followed when dartsynch synchronizes the databases
4.6 Starting a recording in the fault-tolerant DART system


Chapter 1

Introduction

1.1 Background

This section gives a short introduction to the area of media ingest and to the concept of making a distributed system fault-tolerant.

1.1.1 Media ingest

Organizations handling large amounts of media, such as television companies, are moving away from video-tape based archive solutions. Instead, media is stored digitally in media management systems. This enables an enhanced, faster workflow, making it possible to search and edit media easily. The purpose of the DART system is to automatically ingest media, making it available to media management systems.

At a specific time, the correct video will be available on, e.g., a live satellite feed. To ingest it, the video feed must be connected to a video server with the capability to encode video digitally and store it in a file. To make it possible for a video server to be connected to any of a large number of feeds, both the feeds and the video servers are connected to a video router. The video router can then be used to temporarily connect a specific source to one or more destination video servers, just as a network router creates temporary (perhaps only momentary) connections between computers on a network.

To perform an ingest, multiple steps must be taken. At the time when the video is available, the video router must be controlled to create the route between the correct live feed and a video server. The video server must then be told to start encoding the video that it is receiving. As the video file is being created, the media is catalogued in a media management system [3].

1.1.2 The DART system

Ardendo [2] is a software company based in Stockholm, Sweden, providing complete solutions for managing media in digital archives. One such solution is the Digital Automated Recording Tool, DART. DART enables ingests of media to be done automatically. Recordings can also be scheduled ahead of time so that ingests can be done at any time, without the need for user interaction.
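The ingest steps described above (create the route, start the encoder, catalogue the resulting file) can be sketched as a small orchestration routine. All class and method names below (Route, connect, record, register) are hypothetical illustrations and not part of the actual DART interfaces:

```python
from dataclasses import dataclass

# Hypothetical interfaces for illustration only; the real DART
# components are daemons communicating over a message bus, not
# in-process Python objects.
@dataclass
class Route:
    source: str       # e.g. a satellite feed input on the video router
    destination: str  # an input port of a video encoder server

def perform_ingest(router, encoder, catalog, feed, clip_name):
    """Sketch of one ingest: route the feed, encode it, catalogue it."""
    route = Route(source=feed, destination=encoder.input_port)
    router.connect(route)                   # 1. create the temporary route
    media_file = encoder.record(clip_name)  # 2. encode the feed into a file
    router.disconnect(route)                # 3. tear the route down again
    catalog.register(media_file)            # 4. make the file searchable
    return media_file
```

The point of the sketch is the ordering: the route must exist before encoding starts, and the catalogue entry is made only once a media file exists.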
DART also has support for handling recurring recordings. If the same feed is recorded each day (or, e.g., on certain days of the week) this only needs to be scheduled once in DART; the recording can then be specified to recur on other days.

Figure 1.1: An overview of the ingest process. (The figure shows the client computer, the video source feeds, the video router, the application server, the video encoder servers, and the media management system.)

The DART system is divided into a graphical user interface (GUI), running on the client computers used by the users of the system, and a backend. The backend, running on an application server, consists of a set of daemon programs performing the work required to automate the ingest process. An overview of how the DART system controls the ingest process is given in Figure 1.1.

1.1.3 Fault-tolerance

For the DART system to work properly, many different components need to function as expected. If any of these parts fails, the entire system will fail; each of them constitutes a single-point-of-failure for the entire system. These failures may be major ones, such as a crash of the server hosting the backend daemons, but they may also be quite minor, such as a failed serial cable. Fortunately, as the DART system is distributed over several computers, this opens up the possibility of adding redundant software and hardware components to make the system tolerant to many different types of failures.

1.2 Thesis Description

1.2.1 Goal

This thesis project proposes a change of the design of the DART system so that it becomes tolerant of failures and no longer has any single-point-of-failure, and discusses the results of implementing these changes. This requires changes to the design of many parts of the system.

1.2.2 Purpose

Replication in distributed systems is a problem that has been the subject of much research. A large number of published articles discuss the problem and the two basic theoretical methods for replicating distributed systems. One is the passive replication method ([1, 43]), described in Section 2.4.2. The other is called active replication ([24, 41]) and is described in Section 2.4.3. We study these two methods and the background theory necessary to understand them, then select one of them and use it to make the DART system fault-tolerant. This includes analyzing the DART system to find all places where changes need to be made in order to satisfy the conditions of the replication method. Furthermore, the DART system is analyzed to find all other single-points-of-failure that need to be removed in order for the system to be fully fault-tolerant. The central question that needs to be answered is: Is it feasible to use one of these theoretical replication methods as a basis for redesigning an existing distributed system to be fault-tolerant?

1.2.3 Method

To understand the two general replication methods, and to make a good choice of which one is best suited for redesigning the DART system, it is necessary to perform an in-depth investigation of the area of replication in distributed systems and related topics, such as reliable broadcast and multicast in distributed systems. It is also necessary to study the DART system in order to find all single-points-of-failure, and to find the places where it needs to be modified to satisfy the restrictions placed on it by the replication method. A major part of this thesis work is redesigning the system and implementing and testing the necessary changes.

The author of this thesis has not been the only person working on this project. Instead, the work described in this thesis has been performed by a team of four persons working at Ardendo AB in Danderyd, Sweden. This thesis would not have been possible without the hard work of everyone in this team.


Chapter 2

Distributed Systems Fundamentals

Generally, a distributed system can be assumed to be a collection of services usable by clients [43]. The goal of this thesis is to make the distributed DART system fault-tolerant by adding a redundant server to run the DART services. The result computed from a client request must then also be available in the backup server. The two most common replication methods for distributed systems, active replication and passive replication, are described in Section 2.4 of this chapter. The active replication method has been chosen for implementation in the DART system and is therefore described in more detail. The actual implementation, as well as the reasons for choosing this technique instead of the passive method, is described in Chapter 4. These replication methods depend on ordered broadcast, the ability to pass messages to multiple receivers in a certain order, which is described theoretically in Section 2.3 of this chapter. How the problem is solved in the DART system is also described in Chapter 4.

2.1 System model

When discussing distributed systems, this thesis uses a system model that captures some assumptions about the system [16]. A distributed system is assumed to consist of a set of processes P = {p_1, p_2, ..., p_n} that communicate with each other by sending messages m_1, m_2, ... over a communication channel. Hence, the processes do not have any shared memory, even though they may sometimes execute on the same server. Communication between two processes p_1 and p_2 is performed by having p_1 send a message m, making it available for the communication channel to transport to p_2. Process p_2 then receives m, reading it from the communication channel and making it available to the program that it executes. Each process p_i has a local state that it transforms when it executes its program, e.g., as a result of receiving a message m.

Also, similarly to what is done by Lamport in [24], the delay necessary for communication is assumed to be large enough that it is not negligible when compared to the delay between steps taken in the execution of the processes. The important steps in the execution of a process p are the ones that occur when it changes its internal state and possibly receives and sends messages. These are described as events e^p_1, e^p_2, e^p_3, ... for the process. Which of the steps in the execution of a process are actually regarded as events differs between programs and with what is important for the current discussion. If the actual process that executes an event is not important, the process identifier is excluded. In other cases the letters p, q, r may be used as process identifiers, as well as p_1, p_2, p_3, ....

The channel used for communication between processes is assumed to be reliable for unicast messages (messages sent to only one other process), as offered by, e.g., the TCP protocol [38]. Reliable communication requires several properties to be fulfilled: that messages are delivered reliably (if p_1 sends m_i to p_2, m_i will eventually be received by p_2), that they are delivered in order (if p_1 sends m_i and then m_i+1 to p_2, m_i will be received by p_2 before m_i+1), and that no duplicates of messages are generated (if p_1 sends m_i to p_2, only one copy of m_i will be received by p_2).

2.1.1 Synchronous and asynchronous systems

The design of a distributed system is heavily influenced by what assumptions are made regarding the difference in execution speed of the involved processes and the time necessary to transmit a message between two processes. There are two main types of assumptions that can be made about these parameters: either the system is synchronous, meaning there is a known upper bound for these times, or the system is asynchronous, meaning no upper bounds can be assumed and times can be arbitrarily long. Assuming a synchronous distributed system makes it easier to develop algorithms for the system. Unfortunately, having such exact guarantees of delays is hard in any practical system; in particular, a system implemented using common network protocols such as the TCP/IP protocol suite cannot provide such guarantees [38]. Also, if too low upper bounds are assumed, the system may have no choice but to assume that correct processes have crashed, as they no longer seem to be responding.

At the other end of the spectrum are the asynchronous systems, which make no assumptions at all regarding the length of delays in the system. Distributed systems designed with this assumption are naturally tolerant to the environment in which they are implemented, but unfortunately there are problems that cannot be solved under totally asynchronous assumptions (see Section 2.2.4 for an example). Different degrees of partial synchrony can also be assumed [14]. The distributed system designed in this thesis tries to avoid making synchronous assumptions where not necessary. However, timeouts assumed to be long enough are frequently used to detect failures of processes or failures in communication with processes.

2.1.2 Process failures

An important part of this thesis is studying failures and failure handling in distributed systems. There are several ways in which processes can fail to act as expected. A fault-tolerant distributed system is designed to function as expected even though one or more processes involved in the system may exhibit failures. Distributed systems vary a lot in which kinds of process failures they are designed to handle. The failure classes discussed in this thesis are similar to the ones discussed in many other sources, such as [10], [33], [4] and [16]. They are listed in order from most benign (and easiest to handle) to most severe (and hardest to handle). Each subsequent class contains the earlier ones as more restricted cases.
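Timeouts of the kind mentioned above are the usual practical substitute for a perfect failure detector. A minimal heartbeat-based sketch follows; the timeout value and class name are assumptions for illustration, not taken from DART:

```python
import time

class HeartbeatFailureDetector:
    """Suspects a process of having crashed if no heartbeat has
    arrived within `timeout_s` seconds.

    In an asynchronous system this can be wrong: a slow but correct
    process may be suspected, which is exactly the risk of assuming
    too low upper bounds discussed above.
    """
    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, pid):
        """Record that a heartbeat message from `pid` was received."""
        self.last_seen[pid] = self.clock()

    def suspected(self, pid):
        """True if `pid` has been silent longer than the timeout."""
        last = self.last_seen.get(pid)
        return last is None or self.clock() - last > self.timeout_s
```

Injecting the clock function makes the detector testable without real waiting, which is also how such components are usually unit-tested.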

Crash failures
Processes having crash failures always fail simply by ceasing to execute their programs. No other erroneous behavior is exhibited [19, 4].

Send-omission failures
Apart from having crash failures, a process with send-omission failures may also sometimes fail to send messages that it should send. Messages that are sent are always correct, though [19, 4].

General omission failures
A process with general omission failures may crash, sometimes omit to send messages, and sometimes omit to receive messages. However, all messages that are sent or received are always correct [35].

Byzantine failures
The most severe class of failures is the Byzantine failures, also called arbitrary failures. A process displaying these kinds of failures may crash, omit to send or receive messages, and execute any unexpected instruction apart from its program. This means the messages it sends may contain wrong information. This information could even be such that other processes react to it as if the sending process was in some other part of its program than it is expected to be. Handling Byzantine failures was probably first discussed in a published paper by Lamport in 1978 [23] (Lamport states that the idea existed earlier) that extends the ideas about implementing a distributed system as a state machine (see Sections 2.2.3 and 2.4.3) [24] to also handle failures. The name Byzantine was introduced in 1982 by Lamport, Shostak and Pease [25].¹

2.1.3 Handling process failures

It is much more difficult to design distributed systems to handle Byzantine failures than to handle crash failures. Also, recovering from Byzantine failures requires that two thirds of the processes have no failures [25]. This is therefore not possible with, e.g., only two processes.

Even though recovering from crashed processes may be as easy as having a long enough timeout to detect that they have crashed, designing a distributed system to recover from Byzantine failures complicates the task. Complicating the design opens up the possibility of bugs being introduced in the design or in the implementation. Unfortunately, even if a distributed system is assumed to exhibit only crash failures, bugs in its design or implementation may show up as Byzantine failures: even though the failed process follows its program, it deviates from how the program is expected to function. This may make it necessary to consider Byzantine failures for practical distributed systems.

¹ The article includes a popular example involving three generals of the Byzantine army communicating with each other using messages that could be altered by traitors.
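The two-thirds requirement can be restated as n > 3f: a system of n processes tolerates at most f = floor((n - 1)/3) Byzantine failures, so with only two processes not a single Byzantine failure can be tolerated. A trivial helper, added here only to make that consequence explicit:

```python
def max_byzantine_faults(n):
    """Largest f such that n > 3f, i.e. the number of Byzantine
    failures that n processes can tolerate [25]."""
    if n < 1:
        raise ValueError("need at least one process")
    return (n - 1) // 3
```

For example, two processes tolerate zero Byzantine failures, four processes tolerate one, and seven processes tolerate two.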

Transforming failures to simpler classes

A possible solution to the dilemma of whether to design distributed systems to handle Byzantine failures is given by the possibility of transforming these severe failures into crash failures, by performing a controlled crash of a failed process when it is detected to have displayed an omission failure or a Byzantine failure [33]. In general this is done by adding communication rounds to the algorithm. Note that the methods given in [33] are designed for synchronous distributed systems. Using these methods, systems can be designed to handle crash failures and then be transformed to handle more severe failures without adding unnecessary complexity to the system. More details of these kinds of methods are given in [4], which presents a hierarchy of translation methods, including those given in [33], and discusses their optimality.

The first method given in [33] transforms systems designed for crash failures into systems handling general omission failures. This adds a communication step that ensures that omission is detected both when receiving messages and when sending them. When a process p_1 wants to send a message m to process p_2, it sends m to all processes in P, and all processes must then echo the sending of m. When p_2 receives any echo of m, it receives m. If p_1 receives fewer echoes than the known minimum number of correct processes, it knows that it must have exhibited an omission failure, and it halts its program, triggering a crash of itself. The article also presents a method for converting algorithms handling general omission failures into ones handling Byzantine failures. Methods similar to this make it possible for a crash to be triggered in a process that is suspected of having failed. The crash can be triggered either by the process itself or by another process. This ensures that a process crashes instead of exhibiting more severe failures.
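The echo step can be sketched roughly as follows. This is a simplified, single-round illustration with an in-memory "network" standing in for real message passing; the actual method in [33] is defined for synchronous systems and for translating whole algorithms, not single sends:

```python
class EchoProcess:
    """Sketch of the crash-transformation idea from [33]: a send goes
    through echoes from all processes, and a sender that collects too
    few echoes performs a controlled crash of itself."""

    def __init__(self, pid, peers, min_correct):
        self.pid = pid
        self.peers = peers          # pid -> process, including self
        self.min_correct = min_correct
        self.crashed = False
        self.delivered = []

    def echo(self, msg):
        """A correct process echoes every message it is asked to relay;
        a crashed process stays silent (returns None)."""
        return None if self.crashed else msg

    def deliver(self, msg):
        if not self.crashed and msg not in self.delivered:
            self.delivered.append(msg)

    def send(self, msg, dest):
        if self.crashed:
            return
        echoes = [p.echo(msg) for p in self.peers.values()]
        received = [e for e in echoes if e is not None]
        if len(received) < self.min_correct:
            self.crashed = True     # controlled crash instead of a
                                    # silent omission failure
        else:
            dest.deliver(msg)
```

The key behavior is the last branch: rather than continuing after a suspected omission, the sender halts, so the rest of the system only ever has to reason about crash failures.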
However, as mentioned in [16], a controlled crash is not a perfect solution. It might sometimes force the crash of a process that has been incorrectly suspected of having failed. For some kinds of distributed systems, controlled crashes may result in correct processes being crashed until no working process is left (e.g., for systems with lossy channels). Also, fault-tolerant distributed systems are only capable of handling a specified number of failures, and crashing processes lowers the number of processes available to the system [16]. Nevertheless, assuming that a similar method can be used to tolerate more severe failures, the system model for the distributed system in this thesis is designed to handle only crash failures for processes.

2.2 Important concepts for distributed systems

This section describes some basic concepts associated with distributed systems that are useful when discussing the main parts of the background theory for this thesis: message passing in distributed systems (in Section 2.3) and replication methods for distributed systems (in Section 2.4).

2.2.1 Partially and totally ordered sets

Order theory studies how binary relations (relations between two entities) are used to define an order between objects in a set. Two commonly used relations that define orderings are the partial order and total order relations. For a binary relation ≤ between objects in a set S to be a partial order, the following three properties have to hold for all objects a, b, c, ... in S:

reflexivity: a ≤ a.
antisymmetry: if a ≤ b and b ≤ a, then a = b.
transitivity: if a ≤ b and b ≤ c, then a ≤ c.

Figure 2.1: A partial order where a ≤ b if a ⊆ b. (The figure shows the sets a = {α, β}, b = {α, β, γ}, c = {γ}, d = {γ, δ}, and e = {α, β, γ, δ}, with edges showing how they are ordered by the subset relation.)

An example of a relation that is a partial order, and that orders the objects of a set partially, is the subset relation ⊆ between sets. If a ⊆ b, then set b contains all the elements of set a. It should be noted that not all sets are related by the subset relation. Two sets a, b where neither a ⊆ b nor b ⊆ a holds cannot be ordered based on this relation. In Figure 2.1, where the edges in the graph show how the sets are ordered by this relation, one such example is the sets b and d: these sets have the element γ in common, but neither is a subset of the other, so neither is ordered before the other. An example of sets that are ordered are a and b, where a ⊆ b.

A total order is defined as a relation over objects in a set S having the above three properties together with an additional property:

totalness: for all a, b in S, a ≤ b or b ≤ a.

This property ensures that all elements in the set can be ordered using the relation. A simple example of a total order is the ≤ relation on the natural numbers N: for all a, b in N, a ≤ b or b ≤ a. This relation orders all natural numbers without leaving any pair undecided. The partial order defined by the subset relation above, on the other hand, leaves some pairs undecided. The concepts of partial and total orderings are important when discussing message orderings and the different relations between messages that can be used to order them; see, e.g., Section 2.3.

2.2.2 Physical clocks

Processes in a distributed system execute on different computers, which may be situated in different, physically remote, locations. One of the problems caused by this is that the processes do not have access to a common physical clock from which they can read the current time. Each computer that a process executes on has a hardware clock.
Even if the different clocks were initially perfectly synchronized, they tend to drift apart as time goes on. Also, as the computers are located physically apart, it is usually impossible to synchronize them manually. A message passing protocol is therefore needed in order to do this.

This is made more complicated by the fact that it is impossible to deterministically synchronize the physical clocks of the n processes in a system P closer than (max - min)(1 - 1/n), where max is the maximum delay in communication between any two processes p_i, p_j in P and min is the minimum delay [28]. A probabilistic algorithm given by Cristian [9] improves on this bound, but perfect synchronization is still not possible. As perfect physical clock synchronization cannot be achieved, physical clocks are not always reliable in distributed systems. Clock synchronization protocols such as the Network Time Protocol [30, 31], based on research such as [9], are still commonly used on the Internet and in many distributed algorithms. But when using physical time in distributed algorithms, care must be taken to ensure that the different processes are not dependent on perfect clock synchronization.

2.2.3 Logical clocks and causal order

In distributed systems there is often a need to know in which order events happen. However, some properties of distributed systems often make this impossible to determine. Processes that are part of the system may execute in physically different locations, making it impossible to directly observe in which order events happen. As a rule, information about what happens has to be passed in messages between processes. As mentioned in [24], in a distributed system communication delays are not negligible when compared to process execution delays. Also, both of these delays may vary from time to time. This makes it harder to tell in which order events happen.

Say that a process p receives a message from process q after it executes an event e^p_i. The sending of this message by process q is defined to be an event of its own, event e^q_j. Process p cannot say with any certainty whether the sending of the message, event e^q_j, happened before or after e^p_i. Even though the message was received after e^p_i, the actual event e^q_j might still have happened first; the fact that the message was received after e^p_i might be entirely due to large network delays. But this cannot be known for sure. One possible way to solve this would be to include timestamps taken from a computer clock in messages, and to use these timestamps to order them. But, as discussed in Section 2.2.2, it is not trivial to depend on the values read from the physical clocks of the computers the processes are executing on. A higher timestamp read by one process could very well have been read earlier than a lower timestamp read by another computer.²

² In fact, [24] mentions that due to effects from special relativity there is actually no total order based on time for events in the real world: different observers can disagree on the order in which events happen.

The happened before relation and causal order

In order to find another, sometimes more useful, way to order events happening in different processes, Lamport in [24] defines the happened before relation, denoted by →, as a relation between two events with the following properties:

1. If e^p_i and e^p_j are events in the same process, and e^p_i happens before e^p_j, then e^p_i → e^p_j.
2. If e^p_i is the sending of a message by process p and e^q_j is the receipt of the same message by process q, then e^p_i → e^q_j.
3. If e_i → e_j and e_j → e_k, then e_i → e_k.

When considering that the relation e_i → e_j is meant to hold when event e_i happens before event e_j, the first two properties feel natural. It is trivial to order events happening

in the same process, and it is also trivial to determine that the sending of a message happens before the receiving of it (how long before, though, is not always possible to say). The third property states that the relation is transitive: if it is known that one event happens before another, and that that event happens before a third, it is likewise natural to claim that the first one happens before the third.

Figure 2.2: Events and logical clocks for processes p, q, r. (The figure shows process p with events e^p_1, ..., e^p_4 and clock C_p, process q with events e^q_1, e^q_2 and clock C_q, and process r with events e^r_1, ..., e^r_3 and clock C_r.)

Even though these are three simple properties, it is possible to combine them to form a partial order, as discussed in Section 2.2.1, on the set of all events in a distributed system. This can be illustrated with a directed graph, with the events as vertices, connected by directed edges both where events happen before each other in the same process and where messages are passed between processes. As seen in Figure 2.2, starting from any event, that event happens before all events that can be reached from it by following the arrows; this illustrates the third property of the happened before relation. In this example, p first sends a message to r. Then p executes an event without sending a message, while r sends a message to q. Finally, all three processes execute an event.

Not all events are ordered by this relation. As an example, the events e^p_3 and e^q_1 in Figure 2.2 have no path connecting them. Neither of the two events can be said to occur before the other using the happened before relation; the events are said to be concurrent. Another example is the last events shown for the processes: these three events are all concurrent as well. This is also the reason that the happened before relation is only a partial ordering.
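Because the happened before relation is exactly reachability in this event graph, it can be checked with an ordinary graph search. A small sketch, where the edges are the program-order and message (send, receive) pairs of a Figure 2.2-like execution:

```python
def happened_before(edges, a, b):
    """True if a -> b, i.e. b is reachable from a along program-order
    and message edges (iterative depth-first search). The relation is
    irreflexive, so an event never happens before itself."""
    stack = [succ for pred, succ in edges if pred == a]
    seen = set()
    while stack:
        e = stack.pop()
        if e == b:
            return True
        if e not in seen:
            seen.add(e)
            stack.extend(succ for pred, succ in edges if pred == e)
    return False

def concurrent(edges, a, b):
    """Events unrelated in either direction are concurrent."""
    return not happened_before(edges, a, b) and not happened_before(edges, b, a)
```

The event and edge names used in the test below mirror the example execution described above: p sends a message to r, and r then sends a message to q.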
This relation can also be seen as defining which events have a causal relationship, where an event might in effect be the cause of another event. The ordering is therefore also known as causal ordering.

Logical clocks

Even though the happened before relation can be used to partially order events, it must somehow be implemented in the distributed system for the processes to be able to perform the ordering. This is done in [24] by having each process p hold a value called a logical clock. The logical clock of process p is denoted by C_p, and its value at the time of event e^p_i is C_p(e^p_i). In each message m that a process p sends, it includes the current value t_m of C_p. The goal is to have the logical clock values satisfy C_p(e^p_i) < C_q(e^q_j) whenever e^p_i → e^q_j according to the happened before relation. In [24] the values of the clocks are computed in the following way:

1. Each process p increments C_p between any two successive events.
2. Upon receiving a message m, process p sets C_p to be greater than t_m and greater than or equal to its current value.

The initial value of C_p for a process p can be chosen arbitrarily. As C_p is incremented each time an event occurs, the first property of the happened before relation clearly holds. Because a process p includes its current value t_m = C_p in each message m it sends, and each receiving process q makes sure that C_q > t_m after receiving m, the second property also holds. The third property holds trivially, as the values of C_p are chosen from the natural numbers, which are transitive with regard to the less-than relation (if a < b and b < c then clearly a < c). In Figure 2.2, the logical clocks of processes p, q, r at different steps in their programs are shown beneath the boxes representing the processes.

A total order for events

The partial ordering implemented by logical clocks is enough for many applications. However, it can easily be extended to a total ordering (see Section 2.2.1), as shown in [24], by using any total order relation to order the process identifiers p, q, .... For example, if the identifiers are integers, then the less-than relation is sufficient. Concurrent events e^p_i and e^q_j that have the same clock value, C_p(e^p_i) = C_q(e^q_j), are then ordered according to their process identifiers: if p < q, then e^p_i is ordered before e^q_j.

A possible problem with this ordering is that it may differ from the order in which events actually occur as observed by a user of the system. An example is given in [24] where two users communicating in some way not controlled by the logical clocks in the system (such as speaking over the phone) can use this communication to causally affect the information that is sent in the system using the logical clocks. Naturally, this causal relationship is not noted in the ordering induced by the clocks.
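The two clock rules translate directly into code. A minimal sketch of a Lamport clock, together with the (clock value, process identifier) comparison that yields the total order:

```python
class LamportClock:
    """Logical clock following the two rules from [24]."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        """Rule 1: increment between any two successive local events."""
        self.time += 1
        return self.time

    def send(self):
        """A send is an event; the returned value is the timestamp t_m
        to attach to the outgoing message."""
        return self.tick()

    def receive(self, t_m):
        """Rule 2: advance the clock past the timestamp carried by the
        received message (and never move it backwards)."""
        self.time = max(self.time, t_m) + 1
        return self.time

def total_order(event_a, event_b):
    """Total order over (clock value, pid) pairs: compare clock values
    first, then break ties between concurrent events with the pid."""
    return event_a < event_b
```

The tie-break works because Python compares the tuples lexicographically, which is exactly the extension of the clock order by process identifiers described above.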
The paper then describes a way of using approximately synchronized physical clocks (see Section 2.2.2) as a basis for the ordering. The resulting ordering will then reflect these kinds of relationships as well. As an application of total orderings, the paper describes an algorithm where the orderings are used to provide distributed mutual exclusion for a system of processes. The paper describes modelling this algorithm in the processes as a state machine, where the processes receive messages and transform an internal state. All processes implement the same state machine, and this introduced the state machine approach to replication, also known as active replication. This approach is discussed in Section 2.4.3, together with a description of this application.

2.2.4 Agreement problems

A fundamental problem in distributed computing is the agreement problem. Most distributed systems have a need for their processes to agree on a value or a set of values. Common examples of this include sensor readings, message orderings or current values of clocks. In fact, every time processes need to cooperate there is often a need to agree and come to the same conclusion. If there is no need to tolerate failures this is easily done [17]. Say that all processes p_i ∈ P, where n = |P| is the number of processes in P, have read sensors that produce n values y_i, one for each p_i. These values are approximately the same, but differ slightly. The processes need to agree on a value y to use as a common sensor reading. All p_i now simply

send their values to each other (using either broadcast or multiple unicast communication). After all p_i have received all y_i, they calculate the mean value y = (Σ y_i)/n. However, if processes can fail, either with crash failures or more severe Byzantine failures (see Section 2.1.2), this cannot be done. If processes may crash, some p_i may not receive all y_i. If some processes are subject to Byzantine failures, then these processes may lie and send the wrong y_i values to some p_i, which will then calculate a different y. Agreement problems describe the problem of reaching agreement when failures may occur. A survey of agreement problems can be found in [17], where the following three common and related problems are discussed.

Different agreement problems

The problem that is most often known as the consensus problem (although this term is sometimes used synonymously with agreement) has the goal of making all correct processes p_i ∈ P agree on a single bit y, which is called the consensus value. At first all p_i have an initial bit x_i, and y is somehow calculated (for example by voting) from the x_i values. A more general version of consensus is interactive consistency [34]. In this problem all correct p_i need to agree on the same vector y, which consists of the values y_i = x_i. In other words, the processes p_i need to agree on values for all x_i instead of calculating a final value y from them. The final agreement problem is the generals problem [17]. In this problem a single process p_0 transmits a value x to all other p_i ∈ P, and all correct processes must then agree on the same value y. Note that if p_0 is subject to Byzantine failures, it may be necessary for all correct p_i to agree on a default value that can be used when no other agreement can be made. This problem was introduced as the Byzantine generals problem and described with Byzantine failures [25].
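A voting step of the kind mentioned above can be sketched in a few lines. This is illustrative only; the function name and the tie-breaking default value are assumptions:

```python
from collections import Counter

def majority(values, default=0):
    # Pick the value proposed by the most processes. On a tie, fall back
    # to a predetermined default, as in the generals problem where a
    # default is used when no other agreement can be reached.
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]
```

Note that without further machinery this tolerates no Byzantine behavior: a lying process simply contributes a wrong value to the vote, which is exactly why the multi-phase algorithms described below are needed.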
It is shown in [17] that these three problems are related: distributed systems that handle one of them can also be made to handle the others. The paper then describes how (or whether) the Byzantine generals problem can be solved for different types of distributed systems. Interactive consistency can be solved by solving Byzantine generals n times, each time having a different p_i transmit its value [25]. Consensus is then solved by calculating the final value y from the values y_i in the agreed vector y. As the problems are related, this discussion applies to the other problems as well.

Agreement in synchronous environments

A voting algorithm with multiple communication phases is used in [25] to reach agreement for the Byzantine generals problem in a synchronous distributed system (see Section 2.1.1). Similar algorithms also exist [34]. It will not be described in detail here, but the basic idea is to first have p_0 transmit x to all p_i ∈ P. If the transmission to a p_i fails, that p_i uses a predetermined default value instead. All p_i then perform a recursive procedure based on the same algorithm, transmitting x to all other p_j ∈ P, j ≠ i, j ≠ 0. Depending on m, the maximum number of Byzantine processes, more and more recursive steps will be necessary. After receiving copies x_1, …, x_{n-1} of x from the other processes, each p_i chooses the majority of the values as its value y. It is also mentioned in [25] that if m processes in P are subject to Byzantine failures, it is necessary to have |P| = n ≥ 3m + 1 to reach agreement. This was proven earlier in [23], which was the first paper to describe a synchronous distributed system capable of handling Byzantine failures. A simple example where it is not possible to reach agreement, for n = 3, m = 1, is given in [25]. Both receiving processes, p_1 and p_2, will receive two copies of x, one from p_0 and

one from the other receiver. As any of the three processes may be the Byzantine process, a correct receiver may receive two copies of x with different values and be unable to make a correct decision, while the other correct process may either have been sent a correct copy of x from the Byzantine process or may be the transmitter and automatically have the correct value. Therefore it cannot be guaranteed that the two correct processes agree on the same value y.

Agreement with authentication

A way of improving on these results by using authenticated communication was given in [23] and further described in [34]. One of the reasons that n ≥ 3m + 1 processes are needed is that a Byzantine process may lie to its neighbors about values received from other processes. If processes could authenticate these passed-on values, in such a way that they could tell whether a value is correct or not, then Byzantine processes could be stopped from corrupting the values that are sent. In that case it is enough to have n > m. No matter how many Byzantine processes there are, a correct process will always correctly receive the correct value from all non-Byzantine processes. Processes may crash or fail to send messages, but as values are correctly passed on, all values will eventually be diffused to all correct processes (as long as the communication network is sufficiently connected between the correct processes). One way to perform such authentication is to have each process calculate a digital signature for the messages it sends. The digital signature is a cryptographic technique first introduced in [40] having two functions, sign(m) and verify(m, v). Only the correct sender of a message can calculate the digital signature value sign(m) = v, but anyone can use this signature to verify that a copy m′ of m is really the same message that was signed, by using verify(m′, v), which is only true when m′ = m.
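The sign/verify interface can be illustrated with the kind of simple, non-cryptographic authenticator discussed later in this section: a plain checksum. This only guards against random corruption, not an intelligent adversary; the function names mirror the text, and the checksum choice is an assumption:

```python
import zlib

def sign(message: bytes) -> int:
    # Non-cryptographic authenticator: a CRC-32 checksum. Anyone can
    # compute it (no keys), but a random fault is very unlikely to alter
    # a relayed value without breaking the checksum.
    return zlib.crc32(message)

def verify(message: bytes, v: int) -> bool:
    return zlib.crc32(message) == v
```

A real digital signature, unlike this checksum, cannot be computed by anyone but the holder of the private key.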
This was made possible by the discovery of public-key cryptography [12] 3. As described, a digital signature uses two related values called a public key and a private key. The signature is made using the private key, which is kept secret by the signer, but can be verified using the public key, which anyone may access. The keys also have the important property that it is computationally infeasible to calculate the private key given the public key. Digital signatures are computationally expensive to use, which makes them less useful for agreement algorithms. However, as long as it can be assumed that the Byzantine failures only involve random errors, not an actual intelligent opponent that wants to sabotage the agreement, simpler authentication methods can be used that are not cryptographically secure. These values can be computed by anyone, without using keys, but it is not likely that they would randomly be introduced into a message.

Agreement in asynchronous environments

Unfortunately, agreement problems cannot be reliably solved in purely asynchronous environments. This is proven in [18], even for distributed systems that only assume crash failures. As the system is assumed to be fully asynchronous, no timeouts can be used to detect that processes have crashed. The proof is quite involved, but it shows that if processes can fail, then an agreement protocol can be made to execute infinitely many steps without reaching a state where there is only one possible value to agree on.

3 This paper introduced the concept of public-key cryptography and several important concepts. The first practical public-key cryptosystem was given in [40], though.

This has severe implications for the practicality of constructing distributed systems that are fully asynchronous. However, the agreement problem can still be solved for many kinds of distributed systems that are only partially synchronous (see Section 2.1.1). It is also mentioned in [17] that probabilistic algorithms can be used to reach consensus with very high probability even in asynchronous distributed systems. Perhaps the most common way of solving this dilemma, however, results from the fact that one of the central reasons for the results in [18] is the inability of processes to reliably detect that other processes have failed. The concept of unreliable failure detectors has therefore been introduced [7]. An unreliable failure detector holds a list of processes that it suspects have crashed. This list may be incorrect, and processes can be added to and removed from the list as a process gains more information (such as receiving messages from processes it suspected had crashed). It has also been shown what the weakest properties are that a detector needs to have in order to reach agreement in an asynchronous distributed system [6].

2.3 Message passing in distributed systems

Distributed systems consist of processes running on multiple physically separated computer systems. To make these processes work together, the computer systems are connected as endpoints, or hosts, in a communication network connected by network routers. The processes communicate by sending messages to each other using a network communication protocol. Such communication, sent from a host to a single destination host, is known as unicast communication.

Unicast communication

Communication over a network is generally done on a best-effort basis [37]. Packets sent over a network are repeatedly enqueued in intermediate network routers that forward packets step by step along the path to their destination.
These routers have finite-sized buffers to store packets in and may be forced to drop packets due to heavy traffic. Packets may therefore be lost and never received at their destination. Also, due to interference, e.g. by cosmic radiation, packets may be corrupted while they are sent. Depending on which network protocols are used, packets include redundant data, e.g. checksums, used to determine whether the data has been corrupted or not. If a packet is corrupted, it will be discarded and not received at its destination. The most common protocol used today for such end-to-end communication is TCP [38], one part of the TCP/IP protocol suite used for most network communication today. Being quite an advanced protocol, TCP provides several features that are useful for developing networked applications. Perhaps most importantly, TCP provides reliable communication. Every time a TCP packet is sent to a host, the host responds with an acknowledgement packet (an ACK) sent back to the sender. If the sender does not receive an ACK within a certain time after sending a packet, it will resend it (the length of this timeout is calculated based on the expected delay in communication between the hosts; for details see [38]). As this may result in duplicate packets being received by a host, each packet includes a sequence number that makes it possible for hosts to skip these duplicates. This combination of resends and sequence numbers enables TCP to make sure that all packets sent from a host are reliably received by the destination host and that no duplicate packets are received. The sequence numbers included in the packets also ensure that messages are received in the same order they are sent.

However, even though TCP can provide reliable unicast communication, there is often a need in distributed systems for sending the same message either to all receivers on a network, which is known as broadcast communication, or to a subset of the possible receivers, known as multicast communication. This can of course be done by sending a unicast message to each intended receiver. However, even though each individual TCP connection provides reliable service, the sending host may fail after having sent the message to only some of the receivers, thus having failed to send the message to all receivers. Also, if multiple hosts are sending packets in this way, receivers may receive the packets from different senders in different orders.

Broadcast communication

The TCP/IP protocol suite contains protocols that provide basic broadcast and multicast communication, but these are not reliable. No acknowledgement messages are used, and no guarantees are given with regard to the order in which messages are delivered [11]. This problem, providing distributed systems with reliable broadcast and multicast with ordering guarantees, has been the subject of much research. This section gives an overview of specifications for different kinds of broadcast communication, starting with what is required to have reliable broadcast communication and then covering different kinds of ordering guarantees. This separation into different kinds of broadcasts is given in [20]. It is assumed for these broadcast specifications that each message m which is sent includes two fields:

sender(m) To be able to tell which process sent the message.

seq(m) A sequence number used to tell how many messages the process has sent.

If sender(m) = p and seq(m) = i, then this is the i-th message sent by process p.
Using these fields, all messages will be unique as long as all processes have unique identifiers (for different processes, p will be unique, and for a given process, i will be unique). This is necessary to separate different messages and sometimes to deliver them in a given order. We'll say that a process broadcasts a message when the message is initially sent from that process. Processes that receive broadcast messages (the sending process as well) then deliver the message and make it available to the application making use of the broadcast communication.

Reliable broadcast

Reliable broadcast is specified by using three requirements [20]:

Validity: If a correct process broadcasts a message m, then it eventually delivers m.

Agreement: If a correct process delivers a message m, then all correct processes eventually deliver m.

Integrity: For any message m, every correct process delivers m at most once, and only if m was previously broadcast by sender(m).

It is straightforward to see that if a broadcast algorithm satisfies both validity and agreement, then all broadcast messages will be delivered by all processes and all processes will deliver the same set of messages. Without validity some messages might not be delivered, and without agreement some processes may deliver messages that other

processes do not. It follows from integrity that messages will only be delivered once (even though they may be sent multiple times in broadcast algorithms) and that no extra messages are generated. So far, even a simple algorithm like having the sending process unicast the message to all receivers satisfies these requirements. However, it is necessary that the algorithm works even though some processes may crash. If the sending process crashes after beginning to broadcast a message m, there are two options: either no processes deliver m, say if the sender crashed before sending m to any other process, or all remaining processes must deliver m, if the sender managed to send m to some other process [20]. If a sender using the simple multiple-unicast algorithm were to crash after sending m to only some receivers, the rest would never receive it. This simple algorithm is therefore not reliable. A simple reliable algorithm will be shown later in Section 2.3.

FIFO broadcast

For some applications reliable broadcast might be sufficient, but as discussed in [20], sometimes a message m is dependent on other messages to be meaningful. Unless some other messages are already delivered, it would not be meaningful to deliver m. A common case is when messages sent from a specific sender are dependent on the previous messages this sender has sent. If such a dependency may exist in an application, a FIFO broadcast 4 must be used. It is a reliable broadcast with the following requirement on the order in which messages are delivered:

FIFO Order: If a process broadcasts a message m before it broadcasts a message m′, then no correct process delivers m′ before it has previously delivered m.

This requirement ensures that all messages sent by a process are delivered in the order they are sent. When a process delivers a message m, it will then already have delivered all messages m is dependent on.
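FIFO order is typically enforced with a per-sender hold-back queue that exploits the seq(m) field described above. A minimal sketch, assuming sequence numbers start at 1 and a reliable broadcast layer underneath (class and method names are assumptions):

```python
from collections import defaultdict

class FifoDelivery:
    """Hold back message i from sender p until messages 1..i-1 from p
    have been delivered (sketch only)."""

    def __init__(self):
        self.next_seq = defaultdict(lambda: 1)   # next expected seq per sender
        self.pending = defaultdict(dict)         # sender -> {seq: payload}
        self.delivered = []                      # (sender, seq, payload)

    def receive(self, sender, seq, payload):
        self.pending[sender][seq] = payload
        # Deliver as long as the next expected message is available.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, n, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1
```

Messages arriving out of order from one sender are buffered; messages from different senders are still delivered independently, which is exactly why FIFO order alone gives no causal or total guarantees.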
Implementing FIFO order for a broadcast algorithm will be discussed later in Section 2.3.

Causal broadcast

In some cases messages might be dependent on more than just the messages previously sent by the same process. Since a process might receive and deliver messages from other processes that cause it to react in different ways, the messages a process sends might, for some applications, depend on the messages that the process has delivered. The solution is to order messages causally, as discussed earlier, when delivering them. This way, messages that are causally dependent on other messages will be delivered after them. As defined in [20], causal broadcast is therefore a reliable broadcast that also satisfies this requirement:

Causal Order: If the broadcast of a message m causally precedes the broadcast of a message m′, then no correct process delivers m′ before it has previously delivered m.

In order to more easily prove that a broadcast algorithm satisfies causal order, [20] proves that satisfying causal order is equivalent to satisfying FIFO order and the following requirement:

4 First in, first out. The first message sent must be the first to be delivered.

Local Order: If a process broadcasts a message m and a process delivers m before broadcasting m′, then no correct process delivers m′ unless it has previously delivered m.

Informally, this is easy to see when comparing FIFO and local order to the requirements given earlier for causal order. FIFO order satisfies the first requirement, that events happening after each other on the same process must be ordered after each other. The second requirement, that events be ordered according to causality, is fulfilled by local order: if a message m′ is sent by a process p due to another message m being received, local order states that all processes must deliver m before m′. The third requirement, that this ordering should be transitive, is trivial as both FIFO and local order satisfy it.

Total ordered broadcast

Causal broadcast correctly orders delivery of messages that are dependent on each other based on causality. This includes all cases where a process receives a message from another process. However, messages that are not dependent on each other may be delivered in different orders in different processes. In many cases this will not matter. However, consider the case where two users of an application discuss what actions they should take using their applications. As the messages generated by their applications are not causally dependent (the processes are not aware of the users communicating), they may be delivered in any order, even though the users may be aware that one should precede the other and may notice the difference in their applications. For applications where this may be an issue, total ordered broadcast should be used.
Total order broadcast, sometimes called atomic broadcast, makes sure that all processes deliver messages in the same order by placing the following requirement on delivery order:

Total Order: If correct processes p and q both deliver messages m and m′, then p delivers m before m′ if and only if q delivers m before m′.

As the ordinary requirements of reliable broadcast still hold, totally ordered broadcasts ensure that all processes deliver the same messages in the same order. A couple of ways of implementing total order in a broadcast algorithm will be given later in Section 2.3.

Total FIFO broadcast

Notice, though, that using just total order broadcast does not imply that messages will be delivered in either causal or FIFO order. E.g., as shown in [20], if a process broadcasts a message m but has a temporary failure which results in no process delivering m, and then broadcasts m′, total order broadcast allows processes to deliver m′ (as long as all processes do it). However, as m′ might be dependent on m being delivered (m′ could e.g. contain an update to some value that was defined using m), this violates FIFO order. It is possible for a broadcast algorithm to satisfy both the requirement for FIFO order and total order. This is called total FIFO broadcast. Dependent messages broadcast using this algorithm will then be delivered in FIFO order, and even independent messages will be delivered in the same order by all hosts.

Total causal broadcast

Even when using total FIFO broadcast, messages will not be delivered in causal order. Of course, it is also possible to combine the requirements for causal order and total order. This results in total causal broadcast, which is the strongest broadcast scheme presented here. It will deliver causally dependent messages in an order that preserves the dependencies between them, and make sure that all other messages as well are delivered in the same order by all processes. As mentioned in [20] and [41], a message delivery order based on causal order is needed to implement active replication, which is discussed in Section 2.4.3.

Total ordered broadcast and the agreement problem

When using a totally ordered broadcast, all processes p_i ∈ P agree on which message to deliver when. This is similar to the agreement problem (as discussed in Section 2.2.4).
Indeed, it is proven in [7] that agreement and totally ordered broadcast are equivalent in asynchronous distributed systems. The proof is done by showing that each problem can be solved by using the other. Solving agreement using totally ordered broadcast is easy. Assume an agreement problem where each process p_i proposes a value y_i and the goal is to have all processes agree on the same value y by somehow using these values. This can be done by having each p_i propose its y_i value by broadcasting it to the other processes. The y value can then be chosen as the first y_i that is delivered (as all p_i deliver messages in the same order). Performing a totally ordered broadcast using agreement is somewhat more involved, but amounts to sending messages between processes using any reliable broadcast and then using an agreement algorithm to have each process propose a sequence number for each message. All processes will eventually agree on sequence numbers for all messages, which can then be delivered in the order of the sequence numbers. As shown in [7], this has negative consequences for total ordered broadcast. As discussed in Section 2.2.4, the agreement problem cannot be solved using deterministic algorithms in a fully asynchronous distributed system. As the problems are equivalent, the same holds for total ordered broadcast. Either probabilistic algorithms or unreliable failure detectors must be used to perform a totally ordered broadcast in such a system.
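The easy direction of the equivalence, consensus from totally ordered broadcast, is nearly a one-liner. A sketch, assuming the totally ordered delivery sequence of proposals is available as a list (the function name is an assumption):

```python
def consensus_from_delivery(delivered_proposals):
    # Each p_i broadcasts its proposal y_i with a totally ordered
    # broadcast. Every correct process delivers the proposals in the
    # same order, so all pick the same y: the first proposal delivered.
    return delivered_proposals[0]
```

Since every correct process sees the identical delivery sequence, they all return the same value, which is precisely the agreement property.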

Implementation of reliable and totally ordered broadcast

Algorithms for implementing reliable and ordered broadcast have been extensively researched, and a large number of different algorithms exist. A couple of basic, easy-to-understand algorithms are described here, partly in order to further explain reliable and ordered broadcast, but mainly in order to better understand the discussion of how message passing has been implemented in the DART system, which can be found later in this thesis. One simple algorithm for implementing reliable broadcast is commonly known as message diffusion. The idea of this algorithm is that a process wishing to broadcast a message m sends it to all its neighboring processes. When a process receives a message that it has not already received, it in turn passes it on to its neighbors. This ensures that a message reaches all processes (as long as the network connecting them is not split into different parts). This algorithm does not guarantee order, but can be used as a basis for ordered broadcast as well. An easy-to-read and useful survey of algorithms that implement totally ordered reliable broadcast is given by Défago, Schiper and Urbán [16]. It divides algorithms into five different categories depending on how the ordering mechanism works, as the paper claims that the ordering mechanism is the thing that influences the algorithm the most. The paper also surveys about sixty different existing algorithms and explains their properties. The categories will be summarized here. The survey defines three different roles for processes at the time of a total order broadcast of a message m. The sender process is the process that starts to send m, and a destination process is a process that will receive m. A process can of course be both a sender and a destination process.
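The message diffusion idea can be sketched as a simple flood over the neighbor graph. This is a sketch only; the function signature is an assumption, and a real implementation would also carry the message payload together with its sender(m)/seq(m) identity so duplicates can be recognized:

```python
def diffuse(start, neighbors, deliver):
    # Flood one message from `start` over the graph: each process that
    # sees the message for the first time delivers it and forwards it to
    # all its neighbors; already-seen copies are ignored.
    seen = set()
    frontier = [start]
    while frontier:
        p = frontier.pop()
        if p in seen:
            continue
        seen.add(p)
        deliver(p)                      # process p delivers the message
        frontier.extend(neighbors[p])   # and passes it on to its neighbors
    return seen
```

As long as the graph of correct processes stays connected, the message reaches every process even if some forwarders crash after relaying it once.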
The final role is the sequencer, a special process that can be used to order messages somehow; it may or may not also be a sender or destination process. Not all algorithms use a sequencer; however, the algorithms that do make up the first two categories described in the survey.

Fixed sequencer

In some algorithms the sequencer role is filled by a single, fixed process, e.g. as determined by some kind of voting scheme. That process is normally kept as the sequencer for all steps of the algorithm (unless the process fails, at least). When a sender process is to perform a total order broadcast of a message m, it needs to assign it a sequence number sn(m) using the sequencer. The survey describes three ways this can be done. One alternative is to unicast m to the sequencer, which then broadcasts (m, sn(m)) to all destinations. A second alternative, described as reducing the load of the sequencer and making it easier to handle a failure of the sequencer but generating more messages, is to broadcast m to all destinations and to the sequencer. The sequencer then generates and broadcasts sn(m) to all destinations. The final way, which is not frequently used, is for the sender to unicast a request to the sequencer, which then sends a unicast message sn(m) back to the sender, which can then broadcast (m, sn(m)) to all destination processes.

Moving sequencer

Instead of having a fixed process in the sequencer role, the survey describes that it is possible to have a set of processes, usually all the destinations, rotate this role between them by passing a token message between the sequencers. This is described as being more

complex than the fixed sequencer approach, but making it possible to divide the load of being the sequencer between the processes. The usual variant of the moving sequencer algorithm is to have the sender broadcast m to all sequencers. The sequencers regularly pass a token message between them that contains the next sn(m) value and a list of messages that have already been given a number. When the token is passed on, the current sequencer generates sn(m) values for all m that have not yet been given a number, broadcasts these (m, sn(m)) pairs to the destinations and passes on the token.

Privilege-based

Instead of using a sequencer, it is possible to have the sending processes order messages. The first of two categories where this is done is the privilege-based one, where only one process at a time is given the privilege of being the sender. One way of doing this is to have the senders circulate a token message, similarly to the sequencers in the moving sequencer approach. This token contains the next value that should be used as sn(m). When a process receives the token, it generates numbers for all messages m it wants to send and passes on the token with the new value for sn(m). In order to avoid a starvation scenario, where one process keeps the token and sends a large number of messages before other processes can send, a limit can be placed on the time a process can hold the token, or on the maximum number of messages that can be sent while holding it.

Communication history

Instead of having only one possible sending process at a time, it is possible to let any process send a message m at any time. Order is then guaranteed by having each process p include a timestamp t_p(m) with m and having the destination processes deliver m at a later time, when it can be guaranteed that no message with a timestamp lower than t_p(m) can still be received.
One alternative is for each process p to generate its t_p(m) independently of the other processes. The destination processes then follow a deterministic rule for delivering messages from the different senders, e.g. by first delivering the next message from p_1, then the next one from p_2, and so on. The other major alternative is to use logical clocks to generate the timestamps, as described earlier in this chapter. The destinations then deliver the message with the next logical clock value. If two messages from different processes have the same value, the process identifiers are used as a tie-breaker, as also described earlier. To avoid destination processes having to wait for new messages to be sent from all processes (so that they can be sure none of them has a lower timestamp), processes can regularly send empty messages when they have no real messages to send. These empty messages are discarded instead of delivered. If the distributed system is synchronous and a maximum bound is known for the difference between the physical clocks of the possible senders, physical timestamps can be used. After delaying a message for a length of time equal to this maximum difference, the destinations can be sure that no messages with a lower timestamp can arrive, and can deliver the message.
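The logical-clock variant of the communication history approach can be sketched as a priority queue plus a "safe to deliver" check. This is a sketch under simplifying assumptions (per-sender FIFO channels, strictly increasing timestamps per sender); the class and method names are not from the survey [16]:

```python
import heapq

class CommHistoryDelivery:
    """Deliver the lowest-timestamped pending message once every sender
    has been heard from with a timestamp at least that high, so no
    lower-timestamped message can still arrive (sketch only)."""

    def __init__(self, senders):
        self.latest = {p: -1 for p in senders}  # highest timestamp seen per sender
        self.queue = []                         # heap of (ts, sender, payload)
        self.delivered = []

    def receive(self, sender, ts, payload=None):
        # Empty messages (payload None) only advance the sender's clock.
        self.latest[sender] = max(self.latest[sender], ts)
        if payload is not None:
            heapq.heappush(self.queue, (ts, sender, payload))
        self._try_deliver()

    def _try_deliver(self):
        while self.queue:
            ts, sender, payload = self.queue[0]
            if any(seen < ts for seen in self.latest.values()):
                return  # some sender might still produce a lower timestamp
            heapq.heappop(self.queue)
            self.delivered.append(payload)
```

The (ts, sender) pairs in the heap also encode the tie-breaking by process identifier described above, and the payload=None case models the empty messages that keep slow senders from blocking delivery.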

Destinations agreement

Finally, the last possibility is to have the destination processes use some sort of agreement algorithm to determine the order of the messages. One way this can be done is to have a process send a message m to all destinations, have each destination generate a possible timestamp t(m) for m, and broadcast this t(m) to all destinations. When all destinations have received m and broadcast their suggestions for t(m), all destinations can agree on a sequence number for m by taking the maximum of all broadcast t(m) and using this as the sequence number sn(m) for m. The messages are then delivered in the order that their sn(m) imply. If two messages have the same sn(m), the process identifiers of the senders are used as a tie-break value, as done in the previous category. Apart from this, it is also possible to perform the destinations agreement step as a series of consensus problems or by using an atomic commit protocol. These possibilities are not described in detail in this thesis.

Multicast communication

The different specifications of reliable broadcast discussed in Section 2.3 all make the assumption that messages should be sent to all possible processes. If D(m) is the set of processes receiving a message m, then ∀m : D(m) = P. All processes receive all messages. Often it is interesting to only send messages to a group G of processes, so that D(m) = G = {p_1, p_2, …, p_i} ⊆ P. A message m is then addressed to the group G [8]. This discussion assumes that the multicast groups are open, which means that processes can be added to and removed from G at any time. If a process crashes or wants to stop receiving messages sent to the group, it leaves the group. If a new process wants to receive the messages sent to G, it joins the group.
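The destinations-agreement step described above, where each destination proposes a timestamp and the maximum proposal becomes the sequence number, can be sketched as follows. The function names and the dictionary encoding of messages are illustrative assumptions, not part of any cited protocol.

```python
def agree_sequence_number(proposals):
    """sn(m) is the maximum of all timestamps t(m) proposed by the destinations."""
    return max(proposals)

def delivery_order(messages):
    """Sort messages by (sn(m), sender id), using the sender id as the
    tie-break value described in the text.

    `messages` maps (sender, message) pairs to the list of timestamps
    proposed by the destinations for that message.
    """
    agreed = {key: agree_sequence_number(ts) for key, ts in messages.items()}
    return sorted(agreed, key=lambda k: (agreed[k], k[0]))
```

With proposals [3, 5, 4] and [5, 2, 1] both messages get sn(m) = 5, and the sender identifiers decide the final order.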
The processes that are part of the group need to provide two services, a membership service and a multicast service, in order for messages sent to the group to be delivered reliably to all processes currently in that group.

Group membership service

There are several properties defined for group membership services [8]; the ones most relevant to this discussion are covered here. The concept of delivering a view is central to how the group membership service functions. The service delivers a new view V of the multicast group G to its members when it has detected that a new process p_j has joined G or that a process has left G. Depending on the multicast system, this can be done in different ways, for example by supplying a special message to the application or by using a callback function in the API that the application uses to access the multicast system. Among other things, the information passed through the view delivery must contain an identifier for the view and a list of the processes {p_1, p_2, …, p_j} ⊆ G. One important property of the membership service is whether it allows groups to be divided into partitions. Consider a multicast group G consisting of two sets of processes, P_1 and P_2, executing on computers in two local networks. If the networks were to lose contact with each other, some applications might benefit from allowing P_1 and P_2 each to install views where only their own processes are members, so that both partitions can continue to communicate internally. This is known as a partitionable membership service. It is described [8] as placing a partial order (Section 2.2.1) on the delivered views, so that some views are delivered concurrently. Other applications, e.g. those that require all processes to maintain the same state, may instead require that only one set of processes continues to communicate. This is known as a primary component membership service, and is described as placing a total order (Section 2.2.1) on the delivered views, so that only one view is delivered at a time. Processes that are not allowed to deliver a view may, for example, be blocked from further execution or be forced to crash.

Multicast service

One basic property that a multicast system needs to implement is functionality for synchronizing message delivery with view changes [8]. One way is to have the multicast service guarantee that all receivers of a message m receive it in the same view (after installing view V_i, but before installing the next view V_{i+1}). This is known as same view delivery. A stronger version of this property guarantees both that all receivers deliver m in the same view and that this view is the one the sender of m was in when the message was sent. This is known as sending view delivery. The stronger version allows for an easy-to-use communication model where processes sending messages know which processes will receive them. This can be used, for example, to bring the processes in a new view up to date on the state of the group. Reliability and ordering in multicast are very similar to reliable broadcast, as discussed in Section 2.3. However, multicast normally guarantees them only for messages sent within the same view. Making sure that messages sent in different views uphold reliability and order would require saving sent messages and retransmitting them to processes that join the group [8]. Other than this, the same kinds of orderings apply for reliable multicast as for reliable broadcast.
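The sending view delivery check described above can be sketched as a simple guard at the receiver. This is a deliberate simplification I am assuming for illustration: it only shows the acceptance condition, and it simply rejects messages tagged with an older view rather than resolving them during the view change, as a real view-synchronous system would.

```python
class SendingViewDelivery:
    """Accept a message only if it was sent in the view the receiver has
    currently installed, i.e. the sending view delivery guarantee."""

    def __init__(self, initial_view_id, members):
        self.view_id = initial_view_id
        self.members = set(members)

    def install_view(self, view_id, members):
        # The membership service delivers a new view (id + member list).
        self.view_id = view_id
        self.members = set(members)

    def deliver(self, sender_view_id, sender, payload):
        if sender not in self.members:
            return None          # sender is no longer in the group
        if sender_view_id != self.view_id:
            return None          # would violate sending view delivery
        return payload
```

A message sent by p2 in view 1 is not delivered once view 2 (without p2) has been installed, while a message sent by p1 in view 2 is.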
Multicast systems in which applications are allowed to join multiple groups at the same time have further issues as well. When messages are sent to different groups, systems differ in whether these messages should be delivered in the same order at all destination processes. Giving this guarantee, which not all systems do due to the performance penalty it causes, is known as atomic multicast [8].

2.4 Replication in distributed systems

Since a distributed system by its very nature executes on more than one physical system, it provides the possibility to replicate the services it offers. To replicate a service in a distributed system, the service must somehow be implemented on multiple systems. Each such system, implementing the service, is known as a replica. To be able to describe different replication methods, it is useful to have a common model that can describe them all. One such model, introduced by Wiesmann, Pedone, Schiper, Kemme and Alonso [43], is used to describe replication strategies both for distributed systems and for databases. Two of the replication strategies for distributed systems described there will be described here as well. These two, called passive replication and active replication, are probably the two most common strategies used when constructing replication protocols for distributed systems. Other strategies exist as well; [43] mentions two: semi-active replication [39], which is really a mix of the two main strategies, and semi-passive replication [15], which is based on passive replication but tries to avoid using the group communication primitive used in that strategy (Section 2.4.2). These strategies will not be covered further.

A model for describing replication protocols

The model consists of five phases that the actions of replication protocols are divided into. These phases are described below. Depending on the specific replication method, it may sometimes be possible to skip phases that are not necessary [43].

Request phase: The replication model constructed by Wiesmann et al. [43] is based on the assumption that the actions taken by the replicas are triggered by clients making requests to them, e.g. because a user has performed some action. The first phase in the replication model is therefore the request phase, in which a request is sent from a client to the replicas. This can be accomplished either by the client sending the request to all replicas at the same time, or by the client sending the request to a single replica which then passes the request on to the other replicas. Note that in the first case, the client must be aware that the system is replicated, and must be aware of all replicas.

Server coordination phase: The purpose of the server coordination phase is to make sure that the replicas agree on an order in which the received requests should be handled [43]. The different ways of ordering messages in distributed systems described in Section 2.3 are also used when ordering requests, and depending on the application using the replication protocol, one of them is normally required. The order in which the requests are handled is also important for ensuring the correctness of the replicated system. Finally, for certain replication protocols, this phase can also be used to decide whether the request should be handled at all; it may be skipped if the replicated system is not ready to handle it.
Execution phase: The actual execution that computes the result of the client's request is handled in the execution phase [43]. Different protocols may affect how these executions are done, such as having to avoid non-determinism when the servers execute. Changing the state of the servers is typically delayed from this phase and instead handled in the next phase.

Agreement coordination phase: In some cases requests may not be successfully handled by all replicas, even though they have been ordered in the same way in the server coordination phase. This is mostly the case for database replication protocols, but also for some replication protocols for distributed systems. The protocol then needs to coordinate the computed result between the replicas so that they all persistently store the same result [43].

Client response phase: After all replicas agree on the result of the request, the client needs to receive an answer to it [43]. Depending on the protocol used, the client may receive answers from a single replica, or from all of them. Having all replicas pass their computed result back to the client provides more fault-tolerance, as it may then be possible to mask Byzantine faults, where a minority of replicas return the wrong result, simply by having the client use the result that a majority of the replicas return. However, using the answer that is returned first yields higher performance, as the client does not have to wait for all replicas to return.

Passive replication

The main idea behind passive replication is that one of the replicas in the replicated system is a primary replica. This is the only replica that will handle and execute requests from clients; the other replicas are passive during this time. When the primary has finished handling the request, it will forward the resulting update to the rest of the replicas, which will then store it. When the primary knows that this has been done, it will send the answer back to the client. If the primary replica fails, one of the other replicas must become the new primary. This replication method is commonly known as primary-backup replication [43].

The Alsberg-Day protocol

The Alsberg-Day protocol [1] is believed to be the first passive replication protocol [5]. This protocol will serve as an example of a real protocol implementing the passive replication method. The Alsberg-Day protocol is meant to implement resilient services. Resiliency, as described in [1], is a wider concept than fault-tolerance. It is described as placing four demands on a system:

1. It is able to detect and recover from a given maximum number of errors.
2. It is reliable to a sufficiently high degree that a user of the resilient service can ignore the possibility of service failure.
3. If the service provides perfect detection and recovery from n errors, error n + 1 is not catastrophic. A best effort is made to continue service.
4. The abuse of the service by a single user has negligible effect on other users of the service.

A fault-tolerant system would satisfy the first demand, and to some extent the second.
The parts of the Alsberg-Day protocol implementing the third and fourth demands will be noted but not described in great detail, as those parts provide resiliency rather than fault-tolerance. A resilient service is implemented on a set of server hosts. One of the hosts is the primary host, which performs the execution necessary to provide the service. The rest of the hosts are backup hosts. Clients execute on application hosts. In the paper, the server hosts are ordered in a linear logical architecture. The hosts each have a previous and a next neighbor that they communicate with, except for the primary host and the last backup, which have only one neighbor each. Other communication architectures are possible; e.g. the version of this protocol described in the replication model (Section 2.4.2) communicates using a totally ordered multicast. It is assumed that the network delay between hosts in the system is considerably greater than the time needed for command execution on the application hosts. It is also assumed that communication between hosts is reliable, e.g. by using the TCP protocol [38]. The protocol for implementing a 2-host resilient service, a service where two hosts must fail for the service to fail, consists of three communication steps and differs slightly depending on whether the client sends the initial request to the primary host or to one of the backups. Since the version of this protocol described using the replication model assumes that the client communicates only with the primary replica, only that version of the resiliency protocol is described here. The first communication step consists of the client sending a request to the primary host. The primary executes the request, but does not answer the client yet, as it cannot be sure that the desired degree of resiliency, in this case 2-host resiliency, has been achieved. Instead it sends a cooperate request to the next server host, the first backup host. As that host receives the cooperate request, it stores the update. In the third communication step the first backup host sends a cooperate ACK back to the primary to notify it that the system now satisfies 2-host resiliency. After sending this acknowledgement, the backup host sends the answer for the request back to the client. Since the backup host does not need to wait for the primary host to receive and process the acknowledgement, it can send the answer back to the client at the same time as the acknowledgement (at least if local execution times are low compared to message delays). Because the backup sends the answer back to the client instead of the primary, the client receives the answer faster, as it would otherwise have to wait for another communication delay before the primary has received the acknowledgement. This increases the resiliency of the system, but has no real impact on its fault-tolerance. The backup also sends a backup request to the next server. This may enable the system to handle failure of both the primary and the first backup, but it is not really fault-tolerance, as it cannot be guaranteed that the protocol will handle the failure of both the primary and the first backup.
Rather, this part of the protocol satisfies the third demand placed on resilient systems: to make a best-effort attempt to handle more errors than can be guaranteed. The protocol can be adapted to handle n-host resiliency. One way to do this would be to ensure that n − 1 backups store the update before the cooperate ACK is sent back to the primary and the client is sent an answer. To detect failure of message passing or failure of the downstream host, the protocol suggests using time-outs that are set when sending a request and reset when receiving the ACK message. In order to detect failures before the next request needs to be sent, messages are regularly sent between hosts to let other hosts know that they are still alive. Also, as messages may need to be resent, sequence numbers should be placed in messages to detect duplicates. Note that similar problems are solved in the same way in [38] to implement reliable message passing between hosts, but these mechanisms are still needed here in case hosts fail. If these alive-messages stop arriving, the system needs to recover from the failure of that host. When a host notices that its downstream host has failed, it sends a structure modification message to the primary host notifying it of the failure. The primary then constructs a new chain without the failed host and sends a structure modification message to the first backup host. The backup hosts pass this message on along the new chain until it reaches the last host in the chain. The last backup host then sends an ACK message that is passed back to the primary. Note that this process may have to be performed in the middle of handling a client request, if a failure of the first backup host is detected then. The Alsberg-Day protocol does not give a detailed description of what is needed to handle a failure of the primary host. Some sort of election algorithm is needed to make one of the backup hosts the new primary. One simple way is to choose the first backup host as the new primary. If the primary fails, the client will notice that it does not get any response to its request and resend the request.
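The three communication steps of the 2-host version can be traced in a small in-memory sketch. All names here are illustrative, message passing is replaced by direct function calls, and failure handling is omitted; the point is only the order of the steps and the fact that the answer reaches the client from the backup.

```python
class Host:
    """A server host with a key-value store standing in for service state."""
    def __init__(self, name):
        self.name = name
        self.state = {}

def alsberg_day_request(client_log, primary, backup, key, value):
    # Step 1: client -> primary. The primary executes the request...
    primary.state[key] = value
    # ...but must not answer yet: 2-host resiliency is not achieved.
    # Step 2: primary -> first backup (cooperate request); the backup
    # stores the update.
    backup.state[key] = value
    # Step 3: backup -> primary (cooperate ACK) and, concurrently,
    # backup -> client with the answer, saving one message delay.
    client_log.append(("answer", key, value, "from " + backup.name))
    return "cooperate ACK"

primary, backup = Host("primary"), Host("backup1")
log = []
ack = alsberg_day_request(log, primary, backup, "x", 42)
# After the call both hosts hold the update, so the failure of a single
# host no longer loses it.
```

Note how the answer in the client log originates from the backup, not the primary, matching the description above.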

Model based description of passive replication

Passive replication has later been given a more general description [43]. Perhaps the biggest change is that the backup hosts are not placed in a linear logical architecture. Instead, a view synchronous broadcast primitive (see Section 2.3) is used by the primary replica to send updates to the backup replicas. The purpose of this broadcast is to avoid a scenario where the primary node fails when it is half done sending updates to the backup nodes, so that some of the backups have received the update and some have not. A view synchronous broadcast provides a good way to make sure that when the new view of the group is delivered, one without the failed primary, all of the backup replicas agree on which updates have been received. Similarly to the Alsberg-Day protocol, the backups use some sort of algorithm to elect a new primary after the old one has failed, e.g. by electing the first remaining backup. When using this model to describe passive replication, the server coordination phase is not needed: the primary host alone handles the request, so the request is automatically coordinated.

Request phase: The client sends the request to the primary replica. Note that, depending on how the system is implemented, the client may or may not be aware of which replica it is sending the request to. The client should include a unique ID with each message it sends.

Execution phase: The primary replica executes the request exactly as in a non-fault-tolerant system.

Agreement coordination phase: As the primary replica uses a view-synchronous broadcast to send the result of the update to the backup replicas, either all the replicas will receive the update or none will. Therefore all replicas agree on the result.

Client response phase: The primary replica sends the response back to the client. If the primary fails, the client may not receive the response and can then resend the request.
The new primary can then use the unique ID of the request to check whether to re-execute the request or just resend the answer.

Active replication

Active replication is another common type of replication for fault-tolerance [41]. It is also known as the state machine approach. The main idea is that replicas keep an internal state. When a replica receives a request from a client, the state of that replica is changed and a response may be returned to the client. Replication of the service is done by having all replicas receive the requests sent from clients in the same order and react identically to these requests. The state of all replicas will thus stay identical, so if one of them fails, the others can keep on working without any action being taken. For this approach to work, a couple of requirements are placed on the system. The major constraint placed on the implementation of the service on the replicas is that it needs to execute deterministically [43]. If this is not the case, e.g. if the result of the service's computations depends on the exact value the computer's clock has when a message is received, replicas will react differently to requests received from clients and will end up having different internal states.

As an example of this, consider a simple service used to keep track of how many hours employees are working. When the service receives a message that an employee has arrived at work, it records this fact together with the exact time the message was received. Different replicas will then record different arrival times for the same employees, and will not stay synchronized using pure active replication. Note that even if the computers' clocks are synchronized to within a second, and the arrival times are rounded to the nearest second, it is still possible for the arrival times to be rounded to different seconds. Depending on the computations done with these times, the error may propagate and increase. Another common source of non-determinism in the implementation of services is the use of multi-threading. Using multiple threads of execution in a process (or using multiple processes in a program) means that things may happen in different orders in two executions of the program.

As described by Schneider [41], the most important issue when implementing a fault-tolerant state machine service is replica coordination: making sure that all replicas process client requests in the same order. The paper divides this further into two issues:

Agreement: Every non-faulty state machine replica receives every request.

Order: Every non-faulty state machine replica processes the requests it receives in the same relative order.

Generally these requirements have to hold for all requests sent by clients. Requests that only query the service, without changing its state, do not need to be sent to every replica, so the agreement requirement can be relaxed for these requests. If no response is received, the request can be resent by the client. However, this only holds for systems that are designed for crash failures or omission failures, not severe failures (see Section 2.1.2).
If the replicated system is designed to handle Byzantine failures, received responses may be incorrect. The client must then receive all responses in order to make use of a correct majority. Certain operations done when handling client requests may commute, resulting in an identical state regardless of the order in which they are performed. For a sequence of such requests the order requirement can be relaxed, and they may be performed in any order. In general, however, and when the ordering algorithm has no knowledge of the semantics of the operations being performed, this cannot be done, as reordering operations may then result in different states in the replicas.

Implementing agreement and order

The agreement and order properties are easily satisfied by using a totally, causally ordered reliable broadcast scheme to pass messages to the replicas. It is also mentioned in [41] that it is possible to use an agreement protocol to solve the agreement property. This is known to be equivalent to using totally ordered broadcast (see Section 2.3.3).

Early examples of active replication

As mentioned in Section 2.2.3, using state machines to model replicated distributed systems was first introduced by Lamport [24] as early as 1978. The paper also introduced both the happened before relation and logical clocks (see Section 2.2.3). A later paper by Lamport [23] used the same replication approach, but this time modified the replication method to handle Byzantine failures in a synchronous environment.
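A minimal sketch of the state machine approach under the requirements above: a deterministic transition function applied by every replica to the same totally ordered request sequence. The account-balance service and all names here are illustrative assumptions.

```python
class StateMachineReplica:
    """A replica whose state changes only as a deterministic function of
    the requests it receives, never of clocks, threads or other local
    conditions."""

    def __init__(self):
        self.balance = 0                     # the replicated state

    def apply(self, request):
        op, amount = request
        if op == "deposit":
            self.balance += amount
        elif op == "withdraw":
            self.balance -= amount
        return self.balance                  # the response to the client

# The same totally ordered request sequence is delivered to every replica.
requests = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
replicas = [StateMachineReplica() for _ in range(3)]
responses = [[r.apply(req) for req in requests] for r in replicas]
# All replicas compute identical responses and hold identical state, so
# any one of them can fail without the others having to take any action.
```

If `apply` instead recorded, say, the local wall-clock time of each request, the replicas would diverge, which is exactly the determinism restriction discussed above.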

Lamport's mutual exclusion algorithm

The algorithm given in [24] provides mutual exclusion for a set of processes sharing a resource that can only be accessed by a single process at a time. The paper describes how the algorithm is implemented for each process as an identical state machine. When processes want to access the resource, they broadcast requests to all other processes. These requests are totally ordered using the ordering method introduced in the paper. All processes then locally make the decision to let processes access the resource according to this total order of requests. In this way the mutual exclusion service is reliably replicated in all processes. As all processes run the same program and requests are totally ordered, the requirements for active replication are fulfilled.

Model based description of active replication

When describing active replication, [43] chooses to base the ordering of client requests entirely on a totally ordered broadcast. In the model-based description of active replication, the request and server coordination phases are merged into one. Also, the agreement coordination phase is not needed.

Request and server coordination phase: Clients use a reliable, totally ordered broadcast to send requests to the replicas. This ensures both that all replicas receive the request and that they handle requests in the same order.

Execution phase: As discussed in Section 2.4.3, it is vital that replicas execute in a deterministic manner, so that they compute the same result from received client requests. This e.g. limits the amount of concurrent programming that can be used to implement the service, and makes it harder to make use of physical clock readings in computing the result, as clocks cannot be perfectly synchronized (see Section 2.2.2). All replicas must always compute the same result from a received request.
Agreement coordination phase: As long as all replicas receive the same requests, handle them in the same order and compute the same result, this phase is not needed for active replication. There will never be any differences in the state that results from the execution phase.

Client response phase: Typically all replicas pass their response back to the client, which can then choose the first one. If the system is designed to handle Byzantine failures, the client would rather receive all responses to check for a correct majority.

Comparison of replication strategies

In order to choose a replication strategy as the basis for a real replication protocol, this section compares the two strategies. As summarized in Figures 2.3 and 2.4, both active replication and passive replication have their advantages and disadvantages when compared to each other. One advantage of active replication is that the same program is executed in all replicas, while replicas using passive replication have to execute different parts of their program depending on whether they act as the primary replica or as one of the backup replicas. Also, coordination is required between the replicas to pass on the state changes that are calculated by the primary replica. This asymmetry between replicas makes the system more complex, which may introduce more bugs in the design or implementation.

Active replication
+ Same program being executed in all replicas
+ Failover is easy to perform
+ Can be made to handle Byzantine failures
− Possibly lower performance
− Restricts non-determinism
− Replicas may go out of synch

Figure 2.3: Pros and cons of active replication.

The same arguments apply to the failover process in the two approaches. Using active replication, failover is extremely simple: when a replica crashes, nothing needs to be done by the remaining replicas. The replicated system will keep on working exactly as before, as no special role could have been held by the failed replica. When using passive replication, however, a new primary replica has to be chosen and started up as primary. Client requests that were being handled when the primary failed may need to be re-handled. This failover process is more complex, which probably makes it take longer to perform than the active one and, as above, increases the risk of bugs in the passive replication system. Another possible advantage of the active replication approach is that it can be made to tolerate Byzantine failures (see Section 2.1.2) by having clients receive answers from all replicas and then use a correct majority. If the primary replica suffers Byzantine failures when using passive replication, the entire system can fail, as it may continue to execute in the wrong way without letting a backup replica take over.

Passive replication
+ May have higher performance when everything works
+ No restrictions on implementation of replicas
− Needs to communicate state changes between replicas
− More complicated failover

Figure 2.4: Pros and cons of passive replication.

An advantage of passive replication, on the other hand, is that it may perform less costly, simpler operations when the system works correctly. For example, message passing from the clients to the primary replica can be made simpler than the totally ordered message passing needed for active replication.
As the system is expected to perform correctly most of the time, this means that this part of the design and implementation of the system could be more error-prone for active replication. It may also lead to better performance for the passive replication system, as the primary replica does not need to synchronize its actions with the other replicas until it has handled the entire request. A definite disadvantage of the active replication approach is that it restricts the non-determinism that can exist in the implementation of the replicas. This may keep replicas from running multi-threaded programs, possibly decreasing performance, and may make their design harder. If the design and implementation of the replicas is not done correctly, it may lead to replicas not being synchronized with each other, executing different parts of their programs. This introduces Byzantine failures into an active replication system. Either all parts of the system must be able to handle these errors, or a failed replica must be caused to crash if this happens, so that it does not introduce errors into other parts of the system.

2.5 Existing solutions and protocols

This section contains a short description of two existing protocols that are used, or are possible to use, in the DART system, but that were not developed by Ardendo.

Virtual router redundancy protocol (VRRP)

The Internet Protocol (IP) is used to transmit packets of data (called datagrams in the standard) between source and destination host computers connected by an intermediate system of local networks [37]. Each of these networks has connected devices that forward these packets along hops on a path from the source to the destination. The network local to a device contains the different devices that can be communicated with directly, without sending the packets on another step. The path from the source to a destination is determined from the IP address of the destination host. The path is referred to as a route, and the process of determining it is called routing. Routing is usually done in each intermediate device in the network by comparing the destination IP address with a forwarding table constructed using one of a set of routing algorithms (such as RIP [29] or OSPF [32]). The forwarding table contains a mapping between sets of destination IP addresses and the different neighboring gateways (network devices, routers, connected to multiple IP networks) that a device has. When a packet is to be sent on towards a destination IP address, it is first sent to the gateway listed in the forwarding table [37]. Usually host computers connected to an IP network do not run any routing protocol to construct their forwarding tables. A couple of reasons for this are that the routing protocols used in the network may not be implemented on all architectures used by host computers, and that even if they are, they might require a lot of resources to execute or be too hard to configure on all hosts.
There are some alternatives to this; the most common one is to statically configure a default gateway in the host computers [21]. A statically defined default gateway requires no processing overhead and is also easily configured by a protocol such as DHCP, which can automatically assign an IP address and default gateway to a host as it is connected to a network [13]. However, even though this is a good solution to the problem of configuring host computers, it introduces a single-point-of-failure in the TCP/IP communication of each host. If the default gateway fails, no packets can be sent from the host, as they will all first be sent to the failed default gateway.

The Virtual Router Redundancy Protocol (VRRP) [21] removes this single-point-of-failure by making it possible for a backup router to take over the role of default gateway for a network in case the normal default gateway fails. The same is done by proprietary protocols such as Cisco Systems' Hot Standby Router Protocol (HSRP) [26]; VRRP provides an open alternative to these protocols, with implementations available for several architectures. VRRP provides this redundancy by having the hosts in the network define a virtual router as their default gateway. The virtual IP address associated with this default gateway is held by one of a set of real routers connected to the network, together with a virtual physical address used when communicating on the local network, e.g. using IEEE 802 Ethernet [22]. The real router currently using these addresses is referred to as the master. Other real routers in the network that also run VRRP and are configured to be able to use these addresses are referred to as backups.
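The master/backup arrangement can be sketched as follows: a backup claims the virtual addresses once the master's periodic advertisements stop arriving. The class, field names and timer values below are simplified assumptions for illustration, not the protocol's actual state machine.

```python
import time

class VrrpRouter:
    """Minimal sketch of VRRP failover (fields and timers simplified)."""
    MASTER_DOWN_TIME = 3.0   # seconds without advertisements before takeover

    def __init__(self, router_id, priority):
        self.router_id = router_id
        self.priority = priority
        self.is_master = False
        self.last_advert = time.monotonic()  # when we last heard the master

    def receive_advert(self):
        # A live master periodically multicasts advertisements; hearing
        # one resets the backup's failover timer.
        self.last_advert = time.monotonic()

    def master_down(self, now):
        return now - self.last_advert > self.MASTER_DOWN_TIME

def elect(backups, now):
    """Among the backups that consider the master dead, the one with the
    highest priority claims the virtual IP and virtual physical address."""
    candidates = [r for r in backups if r.master_down(now)]
    if not candidates:
        return None
    new_master = max(candidates, key=lambda r: r.priority)
    new_master.is_master = True
    return new_master

# Two backups that have heard no advertisement for ten seconds:
b1, b2 = VrrpRouter(1, priority=100), VrrpRouter(2, priority=200)
assert elect([b1, b2], time.monotonic() + 10.0) is b2  # higher priority wins
```

Hosts never notice the election: they keep sending to the same virtual addresses throughout.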

The master regularly sends messages to the backups (using IP multicast [11]) to inform them that it is still alive. If it stops sending these updates, the backup configured with the highest priority starts using the virtual IP address and the virtual physical address. When a host wants to send packets to other networks, it first forwards them to the virtual IP address. This is done by using the Address Resolution Protocol (ARP) [36] to physically send the packet to the current VRRP master router. This router then forwards the packet to the next router on the path to the destination by consulting its forwarding table. If the master fails, another router quickly (the delay is short, about a second [21]) takes over the role of master and continues to forward packets. Packets lost during this delay are, at least when TCP is used on top of IP, resent automatically [38].

Apart from providing fault-tolerance for the network communication, VRRP can also provide higher performance. With two real routers, this can be done by having half of the hosts use an address held by one of the routers as their default gateway, and the other half an address held by the other router. If one of the routers fails, its virtual address is taken over by the other router, in addition to its own [21]. This provides a good combination of fault-tolerance and improved performance.

2.5.2 Multiport Video Computer Protocol (MVCP)

The MVCP protocol, developed by SGI [42], is a text-based protocol used in many Ardendo components to control video servers and other related types of equipment. As most of these do not support the MVCP protocol directly, protocol translators are used as described in Section to make it possible to control them indirectly using MVCP.
This section is intended as a short introduction to the MVCP protocol as it is used in the DART system, and how it has been extended in Section for increased fault-tolerance.

The MVCP protocol is based on the concept of a unit, which can be thought of as a virtual video recorder capable of either recording video to a file on the server or playing it out from the server. Normally a server has multiple video ports that units can be created against; these ports are used either for record or for playout (it is also possible to have ports and units capable of both, but that is left out of this description). A server capable of encoding or decoding several video streams at a time will have several ports available.

Creating a unit is done with the UADD command; the response received from the command contains the unit name. Removing a unit is done with the UCLS command.

UADD: adds a unit connected to the specified port.
UCLS: removes a unit.

Once a unit is created, a video clip is loaded on it with the LOAD command. The clip can be loaded both for recording to it (using the IN option) and for playing it out (using the OUT option). If the clip does not already exist, it is created at this time by specifying the CRTE option for the command. The USTA command can be used to get the current status of a unit. The response contains the action currently being performed, the clip that is active on the unit, and the unit's current position in the clip.

LOAD: loads a clip on the specified unit.
USTA: gets information about a unit, such as the current clip and the position in the clip.
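A short sketch of how these commands could be driven from a client may make the flow concrete. The exact wire syntax, response format and port number are not specified above, so everything below apart from the command names (UADD, LOAD) is an assumption.

```python
import socket

def mvcp_line(*words):
    # MVCP is text based: assume one space-separated command per CRLF line.
    return " ".join(words) + "\r\n"

class MvcpClient:
    """Hypothetical sketch of an MVCP client session over TCP."""

    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))
        self.reply = self.sock.makefile("r")

    def send(self, *words):
        self.sock.sendall(mvcp_line(*words).encode("ascii"))
        return self.reply.readline().strip()  # assume one response line

    def create_record_unit(self, video_port, clip_name):
        # UADD creates a virtual recorder against a physical port; the
        # response is assumed to carry the new unit's name.
        unit = self.send("UADD", video_port)
        # IN opens the clip for recording; CRTE creates it if missing.
        self.send("LOAD", unit, clip_name, "IN", "CRTE")
        return unit
```

Keeping the line formatting in a separate function (`mvcp_line`) makes the protocol framing easy to test without a live server.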

Playing back a clip from a unit

After an existing clip has been loaded for playback on a unit, the unit must be prepared to play the clip using the CUE command.

CUE: prepares a unit to play the clip that is currently loaded.

The same kinds of commands usually available on a video tape recorder are available in MVCP as well. As an example, this includes:

PLAY: plays the clip loaded on the unit.
PAUS: freezes the playback of the clip.
REW: plays the clip quickly backwards.
FF: plays the clip quickly forwards.
STOP: stops the playback of the clip; the unit must be cued again to resume.

Recording to a clip

When a clip is loaded for recording on a unit, the unit is made ready to start the recording using the CUER command.

CUER: prepares a unit to start a recording.

After the CUER command completes successfully, the recording is started with the REC command.

REC: starts recording to the clip currently loaded on the unit.

During the recording, the PAUS and STOP commands can be used in the same way as when playing back a clip.
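Putting the pieces together, the full command sequence for one recording can be summarized in order. This merely restates the description above; the argument layout of each command is an assumption.

```python
def record_command_sequence(port, unit, clip):
    """Ordered MVCP commands for one recording, as described above.
    The exact arguments each command takes are assumptions."""
    return [
        ("UADD", port),                      # create a unit on a record port
        ("LOAD", unit, clip, "IN", "CRTE"),  # create the clip, open it for record
        ("CUER", unit),                      # make the unit ready to record
        ("REC", unit),                       # start encoding into the clip
        # ... the recording runs; PAUS/STOP work as during playback ...
        ("STOP", unit),                      # end the recording
        ("UCLS", unit),                      # remove the unit again
    ]

verbs = [c[0] for c in record_command_sequence("vtr1", "U1", "news")]
assert verbs == ["UADD", "LOAD", "CUER", "REC", "STOP", "UCLS"]
```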


Chapter 3

The DART System

To understand why failures can occur in the DART system (as described in Section 4.1) and what changes need to be made to the system to handle these failures (as described in Section 4.3), it is useful to begin with a description of the system. This includes an overview of the architecture of the system, a description of how the different components communicate, and finally an example of what the system does in order to start a recording.

3.1 Architecture of the DART system

[Figure 3.1: The different parts of the DART system: a client computer running the DART user interface, video source feeds, a video router, video encoder servers, an application server running the dartsessd and dartrecd daemons together with an adapter and a protocol translator, and a media management system.]

This is an overview of the architecture of the DART system, a system used to schedule ingest of media content into media management systems. Generally this content is recorded from satellite feeds available to the system through a video router. The DART system automatically controls the video router, as well as other types of equipment such as video encoder servers, to make the recorded media available.

As the system is commonly modified on a per-customer basis to adapt to the needs of the customer, this can only be seen as an example of the architecture. Generally, however, only certain parts are modified, such as which kinds of video servers and video router are used and which (if any) external system it communicates with.

Users interact with the DART system through a client application, a Java applet GUI running in a web browser on the user's local machine. Clients communicate, using TCP/IP, with an application server running a set of daemons that perform all the tasks necessary to manage the DART schedule, as well as communicating with all systems needed to perform ingests. The different parts of the DART system are displayed in Figure 3.1.

DART stores its data on a database server, which could be the same machine as the application server, but for performance reasons is often a separate server. In order to ingest video, DART also needs to work with some kind of video router and a number of video servers that actually encode the video. These can be encoders and routers from a number of different brands. DART is also commonly used to ingest video into a media management system, such as Ardendo's ARDOME, which can then be used to access the media. This system could run on the same server as the DART backend, or on a separate server (which would probably be the case if a system other than ARDOME is used).

3.1.1 The DART client

Users access the DART system through the graphical user interface shown in Figure 3.2. The GUI is available as a Java applet running in a web browser on any desktop PC. It can be used to show and edit the scheduled recordings for each day. When viewing the recordings scheduled on a specific day, the GUI shows a timeline of that day with multiple visible rows. Each row can display the scheduled recordings for either a specific source feed or a specific video encoding server.
To schedule a recording of a source feed, the user presses the mouse button and marks the time span in the timeline during which that source feed should be recorded. If the desired source feed is not currently displayed, a button in the GUI can be used to select it from a list. After doing this, a dialog window pops up that allows the user to enter different categories of descriptions of the recording, called metadata. This metadata can for example include a descriptive title for the recording, a list of participating persons or any other information, as it is fully customizable. The metadata then becomes a vital part of the media as it is ingested into a media management system. If the video is due to be transmitted shortly, a crash recording can be scheduled, where the video starts recording as quickly as possible. This can be done by clicking a single button in the GUI.

Apart from scheduling recordings, it is also possible to edit existing recordings, for example to change the metadata or the length of the recording, as well as to delete recordings from the schedule. As users perform one of these actions, no changes to the DART schedule are made directly by the client program. Instead, a request message is sent to the dartsessd daemon, one of the programs executing in the DART backend. This daemon performs the actual changes and sends a response message back to the clients, so that clients are made aware of changes done by themselves or by any other client.
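This request-response pattern can be sketched as follows. The real message format and transport of the DART system are not described in this chapter, so the class, actions and fields below are invented for illustration only.

```python
import json
import queue

class DartSessd:
    """Sketch of the pattern above: clients never edit the schedule
    directly; the daemon applies each request and notifies all clients."""

    def __init__(self):
        self.schedule = {}   # entry id -> entry data
        self.clients = []    # one outgoing message queue per client

    def connect(self):
        q = queue.Queue()
        self.clients.append(q)
        return q

    def handle(self, request):
        # Apply the requested change to the schedule.
        if request["action"] == "add":
            self.schedule[request["id"]] = request["entry"]
        elif request["action"] == "delete":
            self.schedule.pop(request["id"], None)
        # Send a response to every connected client, so all GUIs see the
        # change, not only the client that requested it.
        for q in self.clients:
            q.put(json.dumps(request))

sessd = DartSessd()
a, b = sessd.connect(), sessd.connect()
sessd.handle({"action": "add", "id": 7, "entry": {"title": "News feed"}})
assert json.loads(a.get()) == json.loads(b.get())  # both clients notified
```

Centralizing all schedule changes in one daemon keeps every client's view consistent, but it is also exactly the kind of single-point-of-failure this thesis sets out to remove.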

[Figure 3.2: The DART graphical user interface.]

3.1.2 The dartsessd daemon

The central and most important daemon of the DART backend is dartsessd. This daemon receives messages from clients and modifies the scheduled entries. The entries are kept in a database, together with information about the setup of the DART system, such as the available video encoder servers and the available sources of video. Requests sent by the DART clients cause dartsessd to add, delete or modify scheduled entries. The daemon also handles requests from clients to view different parts of the schedule or details about scheduled entries.

Along with the entries themselves, DART also stores metadata: data describing the media to be ingested. Through the clients, the user can view and modify this metadata together with the entries themselves. The format of the metadata depends upon the customer's wishes. Handling a client request results in dartsessd modifying the data kept in the database, and/or triggers a search in the database for the entry or the part of the schedule that the user is interested in.

The state of the schedule is affected by the current time of the DART server. The dartsessd daemon regularly monitors the schedule for recordings that are due to start or stop. Shortly before the scheduled start time, dartsessd communicates with other parts of the DART system to control the routing of video, trigger video servers to start encoding the video they are receiving, and catalogue the recording in a media management system. When a recording is due to stop, dartsessd coordinates stopping the video encoder as well as updating external systems. This monitoring is repeated with a few seconds between each pass.

If the database stores a large number of past and future scheduled entries it can take


More information

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How

C 1. Today s Question. CSE 486/586 Distributed Systems Failure Detectors. Two Different System Models. Failure Model. Why, What, and How CSE 486/586 Distributed Systems Failure Detectors Today s Question I have a feeling that something went wrong Steve Ko Computer Sciences and Engineering University at Buffalo zzz You ll learn new terminologies,

More information

UNIT IV 1. What is meant by hardware and software clock? Clock devices can be programmed to generate interrupts at regular intervals in orders that, for example, time slicing can be implemented.the operating

More information

Byzantine Fault Tolerant Raft

Byzantine Fault Tolerant Raft Abstract Byzantine Fault Tolerant Raft Dennis Wang, Nina Tai, Yicheng An {dwang22, ninatai, yicheng} @stanford.edu https://github.com/g60726/zatt For this project, we modified the original Raft design

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

Byzantine Failures. Nikola Knezevic. knl

Byzantine Failures. Nikola Knezevic. knl Byzantine Failures Nikola Knezevic knl Different Types of Failures Crash / Fail-stop Send Omissions Receive Omissions General Omission Arbitrary failures, authenticated messages Arbitrary failures Arbitrary

More information

Distributed Systems: Models and Design

Distributed Systems: Models and Design Distributed Systems: Models and Design Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Architectural Models 2. Interaction Model 3. Design Challenges 4. Case Study: Design of a Client-Server

More information

On the Composition of Authenticated Byzantine Agreement

On the Composition of Authenticated Byzantine Agreement On the Composition of Authenticated Byzantine Agreement Yehuda Lindell Anna Lysyanskaya Tal Rabin July 28, 2004 Abstract A fundamental problem of distributed computing is that of simulating a secure broadcast

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Basic vs. Reliable Multicast

Basic vs. Reliable Multicast Basic vs. Reliable Multicast Basic multicast does not consider process crashes. Reliable multicast does. So far, we considered the basic versions of ordered multicasts. What about the reliable versions?

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

Clock-Synchronisation

Clock-Synchronisation Chapter 2.7 Clock-Synchronisation 1 Content Introduction Physical Clocks - How to measure time? - Synchronisation - Cristian s Algorithm - Berkeley Algorithm - NTP / SNTP - PTP IEEE 1588 Logical Clocks

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

2017 Paul Krzyzanowski 1

2017 Paul Krzyzanowski 1 Question 1 What problem can arise with a system that exhibits fail-restart behavior? Distributed Systems 06. Exam 1 Review Stale state: the system has an outdated view of the world when it starts up. Not:

More information

Recap. CSE 486/586 Distributed Systems Paxos. Paxos. Brief History. Brief History. Brief History C 1

Recap. CSE 486/586 Distributed Systems Paxos. Paxos. Brief History. Brief History. Brief History C 1 Recap Distributed Systems Steve Ko Computer Sciences and Engineering University at Buffalo Facebook photo storage CDN (hot), Haystack (warm), & f4 (very warm) Haystack RAID-6, per stripe: 10 data disks,

More information

Frequently asked questions from the previous class survey

Frequently asked questions from the previous class survey CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [DISTRIBUTED COORDINATION/MUTUAL EXCLUSION] Shrideep Pallickara Computer Science Colorado State University L22.1 Frequently asked questions from the previous

More information

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4

More information

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University

CS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [DISTRIBUTED COORDINATION/MUTUAL EXCLUSION] Shrideep Pallickara Computer Science Colorado State University

More information

Network Protocols. Sarah Diesburg Operating Systems CS 3430

Network Protocols. Sarah Diesburg Operating Systems CS 3430 Network Protocols Sarah Diesburg Operating Systems CS 3430 Protocol An agreement between two parties as to how information is to be transmitted A network protocol abstracts packets into messages Physical

More information

THE TRANSPORT LAYER UNIT IV

THE TRANSPORT LAYER UNIT IV THE TRANSPORT LAYER UNIT IV The Transport Layer: The Transport Service, Elements of Transport Protocols, Congestion Control,The internet transport protocols: UDP, TCP, Performance problems in computer

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14 600.363 Introduction to Algorithms / 600.463 Algorithms I Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/18/14 23.1 Introduction We spent last week proving that for certain problems,

More information

BYZANTINE GENERALS BYZANTINE GENERALS (1) A fable: Michał Szychowiak, 2002 Dependability of Distributed Systems (Byzantine agreement)

BYZANTINE GENERALS BYZANTINE GENERALS (1) A fable: Michał Szychowiak, 2002 Dependability of Distributed Systems (Byzantine agreement) BYZANTINE GENERALS (1) BYZANTINE GENERALS A fable: BYZANTINE GENERALS (2) Byzantine Generals Problem: Condition 1: All loyal generals decide upon the same plan of action. Condition 2: A small number of

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Distributed Algorithms Failure detection and Consensus. Ludovic Henrio CNRS - projet SCALE

Distributed Algorithms Failure detection and Consensus. Ludovic Henrio CNRS - projet SCALE Distributed Algorithms Failure detection and Consensus Ludovic Henrio CNRS - projet SCALE ludovic.henrio@cnrs.fr Acknowledgement The slides for this lecture are based on ideas and materials from the following

More information

User Datagram Protocol

User Datagram Protocol Topics Transport Layer TCP s three-way handshake TCP s connection termination sequence TCP s TIME_WAIT state TCP and UDP buffering by the socket layer 2 Introduction UDP is a simple, unreliable datagram

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Distributed Systems. Multicast and Agreement

Distributed Systems. Multicast and Agreement Distributed Systems Multicast and Agreement Björn Franke University of Edinburgh 2015/2016 Multicast Send message to multiple nodes A node can join a multicast group, and receives all messages sent to

More information

Distributed Consensus Protocols

Distributed Consensus Protocols Distributed Consensus Protocols ABSTRACT In this paper, I compare Paxos, the most popular and influential of distributed consensus protocols, and Raft, a fairly new protocol that is considered to be a

More information

Coordination and Agreement

Coordination and Agreement Coordination and Agreement Nicola Dragoni Embedded Systems Engineering DTU Informatics 1. Introduction 2. Distributed Mutual Exclusion 3. Elections 4. Multicast Communication 5. Consensus and related problems

More information

Occasionally, a network or a gateway will go down, and the sequence. of hops which the packet takes from source to destination must change.

Occasionally, a network or a gateway will go down, and the sequence. of hops which the packet takes from source to destination must change. RFC: 816 FAULT ISOLATION AND RECOVERY David D. Clark MIT Laboratory for Computer Science Computer Systems and Communications Group July, 1982 1. Introduction Occasionally, a network or a gateway will go

More information

MODELS OF DISTRIBUTED SYSTEMS

MODELS OF DISTRIBUTED SYSTEMS Distributed Systems Fö 2/3-1 Distributed Systems Fö 2/3-2 MODELS OF DISTRIBUTED SYSTEMS Basic Elements 1. Architectural Models 2. Interaction Models Resources in a distributed system are shared between

More information

Last Class: Naming. Today: Classical Problems in Distributed Systems. Naming. Time ordering and clock synchronization (today)

Last Class: Naming. Today: Classical Problems in Distributed Systems. Naming. Time ordering and clock synchronization (today) Last Class: Naming Naming Distributed naming DNS LDAP Lecture 12, page 1 Today: Classical Problems in Distributed Systems Time ordering and clock synchronization (today) Next few classes: Leader election

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Distributed Systems Principles and Paradigms

Distributed Systems Principles and Paradigms Distributed Systems Principles and Paradigms Chapter 07 (version 16th May 2006) Maarten van Steen Vrije Universiteit Amsterdam, Faculty of Science Dept. Mathematics and Computer Science Room R4.20. Tel:

More information

Byzantine Fault Tolerance

Byzantine Fault Tolerance Byzantine Fault Tolerance CS 240: Computing Systems and Concurrency Lecture 11 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. So far: Fail-stop failures

More information

Synchronization. Distributed Systems IT332

Synchronization. Distributed Systems IT332 Synchronization Distributed Systems IT332 2 Outline Clock synchronization Logical clocks Election algorithms Mutual exclusion Transactions 3 Hardware/Software Clocks Physical clocks in computers are realized

More information

Distributed Deadlock

Distributed Deadlock Distributed Deadlock 9.55 DS Deadlock Topics Prevention Too expensive in time and network traffic in a distributed system Avoidance Determining safe and unsafe states would require a huge number of messages

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

Lecture 10: Clocks and Time

Lecture 10: Clocks and Time 06-06798 Distributed Systems Lecture 10: Clocks and Time Distributed Systems 1 Time service Overview requirements and problems sources of time Clock synchronisation algorithms clock skew & drift Cristian

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information