Speeding up Consensus by Chasing Fast Decisions

Size: px

Start display at page:

Download "Speeding up Consensus by Chasing Fast Decisions"

Corey Beasley
5 years ago
Views:

1 Speeding up Consensus by Chasing Fast Deisions Balaji Arun, Sebastiano Peluso, Roberto Palmieri, Giuliano Losa, Binoy Ravindran ECE, Virginia Teh, USA Abstrat This paper proposes CAESAR, a novel multi-leader Generalized Consensus protool for geographially repliated sites. The main goal of CAESAR is to overome one of the major limitations of existing approahes, whih is the signifiant performane degradation when appliation workload produes onfliting requests. CAESAR does that by hanging the way a fast deision is taken: its ordering protool does not rejet a fast deision for a lient request if a quorum of nodes reply with different dependeny sets for that request. The effetiveness of CAESAR is demonstrated through an evaluation study performed on Amazon s EC2 infrastruture using 5 geo-repliated sites. CAESAR outperforms other multi-leader (e.g., EPaxos) ompetitors by as muh as 1.7x in the presene of 3% onfliting requests, and single-leader (e.g., Multi-Paxos) by up to 3.5x. Keywords-Consensus, Geo-Repliation, Paxos I. INTRODUCTION Geographially repliated (geo-sale) servies, namely those where ators are spread aross geographi loations and operate on the same shared database, an be implemented in an easy manner by exploiting underlying synhronization mehanisms that provide strong onsisteny guarantees. These mehanisms ultimately rely on implementations of Consensus [1] to globally agree on sequenes of operations to be exeuted. Paxos [2], [3] is a popular algorithm for solving Consensus among partiipants interonneted by asynhronous networks, even in presene of faults, and it an be leveraged for building suh robust servies [4], [5], [6], [7], [8]. An example of Paxos used in a prodution system is Google Spanner [4], [9]. The most deployed version of Paxos is Multi-Paxos [3], where there is a designated node, the leader, that is eleted and responsible for deiding the order of lient-issued ommands. Multi-Paxos solves Consensus in only three ommuniation delays, but in pratie, its performane is tied to the performane of the leader. This relation is risky when Multi- Paxos is deployed in geo-sale beause network delays an be arbitrarily large and unpreditable. In these settings, the leader might often be unreahable or slow, thus ausing the slow down of the entire system. To overome this limitation, protools aimed at allowing multiple nodes to operate as ommand leaders at the same time [1], [11], [12] have been proposed. Suh solutions provide implementations of Generalized Consensus [13], a variant of Consensus that agrees on a ommon order of nonommutative (or onfliting) ommands. These approahes, despite avoiding the bottlenek of the single leader, suffer from other osts whenever a non-trivial amount of onfliting ommands (e.g., 5% 4%) is proposed onurrently, as they do not rely on a unique point of deision. This paper presents the first multi-leader implementation of Generalized Consensus designed for maintaining high performane in the presene of both mostly non-onfliting workloads (named as suh if less than 5% of onfliting ommands are issued) and onfliting workloads (where at most 4% of ommands onflit with eah other). For this reason, our solution is apt for geo-sale deployments. More speifially, state-of-the-art implementations of Generalized Consensus (e.g., EPaxos [1] and M 2 Paxos [14]) redue the minimum number of ommuniation delays required to reah an agreement from three to two in ase a proposed ommand does not enounter any ontention (fast deision). However, they fail in the following aspet: they are not able to minimize the lateny as soon as some ontention on issued ommands arises, with the onsequene of requiring a slow deision, whih onsists of at least four ommuniation delays. To address these aspets, we propose CAESAR, a onsensus layer that deploys an innovative multi-leader ordering sheme. As a high-level intuition, when a onfliting ommand is proposed, CAESAR does not suffer from the ondition that auses a slow deision of that ommand in all existing Generalized Consensus implementations (inluding EPaxos). Suh a ondition is the following: For a proposed ommand, at least two nodes in a quorum are aware of different sets of ommands onfliting with. CAESAR avoids this pitfall beause it approahes the problem of establishing agreement from a different perspetive. When a ommand is proposed, CAESAR seeks an agreement on a ommon delivery timestamp for rather than on its set of onfliting ommands. To failitate this, a loal wait ondition is deployed to prevent ommands onfliting with from interfering with the deision proess of if they have a timestamp greater than s timestamp. The basi idea behind the ordering proess of CAESAR is the following: a ommand is assoiated with a logial timestamp by the sender, and if a quorum of nodes onfirms that the timestamp is still valid, then the ommand is ordered after all the onfliting ommands having a valid earlier timestamp. Otherwise, the timestamp is onsidered invalid, and the ommand is rejeted foring it to undergo two more ommuniation delays (total of four) before being deided. Note that the equality of the sets of onfliting ommands olleted by nodes does not influene the ordering deision. With this sheme, CAESAR boosts timestamp-based ordering protools, suh as Menius [11], by exploiting quorums, whih is a fundamental requirement in geo-sale where ontating all nodes is not feasible. CAESAR does that without relying on a

2 single designated leader unlike Multi-Paxos. Our approah also provides the benefit of a more parallel delivery of ordered ommands when ompared to EPaxos, whih requires analysis of the dependeny graphs. That is beause one the delivery timestamp for a ommand is finalized, the ommand impliitly arries with itself the set of predeessor ommands that have to be delivered before it. This so-alled predeessors set is omputed during the exeution of the ordering algorithm for the deision of the timestamp and not after the delivery of the ommand. We onduted an evaluation study for CAESAR using keyvalue store interfaes. With them, we an injet different workloads by varying the perentage of onfliting ommands and measure various performane parameters. We ontrasted CAESAR against: EPaxos and M 2 Paxos, multi-leader quorumbased Generalized Consensus implementations; Menius, a multi-leader timestamp-based Consensus implementation that does not rely on quorums; and Multi-Paxos, a single-leader Consensus implementation. As a testbed, we deployed 5 georepliated sites using the Amazon EC2 infrastruture. The results onfirm the effetiveness of CAESAR in providing fast deisions, even in the presene of onfliting workloads, while ompetitors slow down. Using workloads with a onflit perentage in the range of 2% %, CAESAR outperforms EPaxos, whih is the losest ompetitor in most of the ases, by reduing lateny up to 6% and inreasing throughput by 1.7. These performane boosts are due to the higher perentage of fast deisions aomplished. With 3% of onfliting workload, CAESAR takes up to 7% fewer slow deisions ompared to EPaxos. II. RELATED WORK In the Paxos [3] algorithm, a value is deided after a minimum of four ommuniation delays. Progress guarantees annot be provided as the initial prepare phase may fail in the presene of multiple onurrent proposals. Multi-Paxos alleviates this by letting promises in the prepare phase over an entire sequene of values. This effetively establishes a distinguished proposer that ats as a single designated leader. Fast Paxos [15] eliminates one ommuniation delay by having proposers broadast their request and bypass the leader. However, a lassi Paxos round exeuted by the leader is needed to resolve a ollision, reahing a total of six ommuniation delays to deide a value. Generalized Paxos [13] relies on a single leader to detet onflits among ommands and enfore an order, and it uses fast quorums as Fast Paxos. Some of its limitations are overome by FGGC [16], whih an use optimal quorum size but still relies on designated leaders. On the ontrary, CAESAR avoids the usage of a single designated leader either to reah an agreement, as in Paxos, or to resolve a onflit, as in Fast and Generalized Paxos. Menius [11] overomes the limitations of a single leader protool by providing a multi-leader ordering sheme based on a pre-assignment of Consensus instanes to nodes. It preassigns sending slots to nodes, and a sender an deide the order of a message at a ertain slot s only after hearing from all nodes about the status of slots that preede s. Clearly this approah is not able to adopt quorums (unlike Paxos), and it may result in poor performane in ase of slow nodes or unbalaned inter-node delays. To alleviate the problem of slow nodes, Fast Menius has been proposed [17]. It uses a mehanism that enables the fast nodes to revoke the slots assigned to the slow nodes. However, Fast Menius still suffers from high lateny in speifi WAN deployments sine it does not rely on quorums for delivering. EPaxos employs dependeny traking and fast quorums to deliver non-onfliting ommands using a fast path. In addition, its graph-based dependeny linearization mehanism that is adopted to define the final order of exeution of ommands may easily suffer from omplex dependeny patterns. Instead, Alvin [12] avoids the expensive omputation on the dependeny graphs enfored by EPaxos via a slot-entri deision, but it still suffers from the same vulnerability to onflits of EPaxos: a ommand s leader is not able to deide on a fast path if it observes disordant opinions from a quorum of nodes. That is not the ase of CAESAR, whose fast deision sheme is optimized to inrease the probability of deiding in two ommuniation delays regardless of disordant feedbaks. M 2 Paxos [14] is a multi-leader onsensus implementation that provides fast deisions while i) adopting only a majority of nodes as quorum size, and ii) avoiding to exhange dependenies of ommands. It does that by embedding an ownership aquisition phase for ommands into the agreement proess, so as to guarantee that a node having the ownership on a set of ommands an autonomously take deisions on those ommands. However, in ase there are multiple nodes that ompete for the deision of non-ommutative ommands, the protool might require an expensive ownership aquisition phase to re-distribute their ownership reords. CAESAR is also related to Clok-RSM [18]. In Clok-RSM, eah node proposes ommands piggybaked with its physial timestamp, whih are then deterministially ordered aording to their assoiated timestamps. Although Clok-RSM is multileader like CAESAR, and it relies on quorums to implement repliation, it suffers from the same drawbaks of Menius, namely the need of a onfirmation that no other ommand with an earlier timestamp has been onurrently proposed. III. SYSTEM MODEL We assume a set of nodes Π = {p 1, p 2,..., p N } that ommuniate through message passing and do not have aess to either a shared memory or a global lok. Nodes may fail by rashing but do not behave maliiously. A node that does not rash is alled orret; otherwise, it is faulty. Messages may experiene arbitrarily long (but finite) delays. Beause of FLP [19], we assume that the system an be enhaned with the weakest type of unreliable failure detetor [2] that is neessary to implement a leader eletion servie [21]. In addition, we assume that at least a strit majority of nodes, i.e., N 2 + 1, is orret. We name lassi quorum (CQ), or more simply quorum, any subset of Π with size at least equal to N We name fast quorum (FQ) any subset of Π

3 p PROPOSE: C OK: C OK: C STABLE: C p PROPOSE: C OK: C {} OK: C {} STABLE: C {} p 1 p 1 p 2 p 2 p 3 PROPOSE: p 4 OK: OK: STABLE: p 3 PROPOSE: p 4 4 OK: {} OK: {C} STABLE: 4 {C} (a) The non-ommutative ommands and are exeuted only after a quorum of nodes reeives them. A total order of the ommands is not enfored in this ase, sine ommands are submitted only via reliable broadast. Fig. 1. (b) The non-ommutative ommands and are exeuted only after a quorum of nodes reeives them. A total order of the ommands is enfored in this ase: is exeuted after on all nodes, sine T = < T = 4, and timestamp are reeived in order by p 2. Reliable broadast exeution vs. CAESAR exeution with size at least equal to 3N 4 (derived by minimizing CQ). As it will be lear in Setion V, a fast quorum is required to ahieve fast deisions in two ommuniation delays, while lassi quorum is required in ase the protool needs more than two ommuniation delays to reah a deision. We follow the definition of Generalized Consensus [13]: eah node an propose a ommand via the PROPOSE() interfae, and nodes deide ommand strutures C-strut s via the DECIDE(s) interfae. The speifiation is suh that: ommands that are inluded in deided C-struts must have been proposed (Non-triviality); if a node deided a C-strut v at any time, then at all later times it an only deide v σ, where σ is a sequene of ommands (Stability); if has been proposed then will be eventually deided in some C-strut (Liveness); and two C-struts deided by two different nodes are prefixes of the same C-strut (Consisteny). Note that the symbol is the append operator as defined in [13]. For simpliity of the presentation, we also use the notation DECIDE() for the deision of a ommand on a node p i, with the following semantis: the sequene of k onseutive alls of DECIDE( 1 ) DECIDE( 2 ) DECIDE( k ) on p i is equivalent to the all of DECIDE( 1 2 k ). We say that two ommands and are non-ommutative, or onfliting, and we write, if the results of the exeution of both and depend on whether has been exeuted before or after. It is worth noting that, as speified in [13], two C- struts are still the same if they only differ by a permutation of non-onfliting ommands. IV. OVERVIEW OF CAESAR We introdue CAESAR inrementally by starting from a base protool, whih only provides reliable broadast of ommands, and then we present the design of the final protool, whih implements Generalized Consensus. We onsider the first protool as a referene point to show the minimal osts that are required to implement our speifiation of Consensus, and we explain how CAESAR is able to maximize the probability to exeute with the same number of ommuniation steps as the referene protool. Setion V provides the details of CAESAR. A neessary ondition for implementing both a reliable broadast protool and the onsisteny property of CAESAR is guaranteeing that if a ommand is delivered to a (orret or faulty) node, then it is eventually delivered to any orret node. This is beause whenever a ommand is exeuted by a node and the result externalized to lients, the ommand must be durable in the system despite rashes. The base protool exeutes as shown in Figure 1(a). When a lient proposes a ommand to the system via the interfae PROPOSE(), the protool hooses a node to be s leader, p in this ase, whih broadasts a PROPOSE message with to all nodes. Afterwards, whenever s leader ollets a quorum of OK replies for, it broadasts a STABLE message for in order to allow all nodes (inluding the leader itself) to deliver and exeute (thik arrows in Figure 1(a)). The base protool is fault-tolerant beause whenever is delivered and exeuted on some node, one of the following onditions is true, regardless of the rash of f nodes: if s leader does not rash, eventually any other orret node reeives the STABLE message for ; or if s leader rashes, there always exists at least one orret node that reeived the PROPOSE message for, so it an take over the rashed leader by re-exeuting the protool for. Moreover, the sheme adopted by the base protool needs two ommuniation delays: one for the PROPOSE message and one for the OK messages, to return the result of an exeution to a lient. Two ommuniation delays are the minimum required to implement onsensus in an asynhronous system [22]. The base protool does not implement Generalized Consensus beause it does not enfore any order on the delivery of non-ommutative ommands. In fat, two onurrent ommands, and, an be delivered and exeuted in any order by different nodes, regardless of their ommutativity relation. CAESAR implements the speifiation of Generalized Consensus by building a novel timestamp-based mehanism on top of the base protool to enfore a total order among nonommutative ommands. We still rely on Figure 1 for showing the intuition. Command is assoiated with a unique logial timestamp T (see Setion V-A for the timestamp assignment), and it an be delivered and exeuted only after a quorum of nodes onfirms that no other ommand with timestamp T, where and T > T, will be exeuted before. Note that in this setion we do not distinguish between fast and lassi quorums, although in Setion V we explain that a fast quorum

4 PROPOSE: PROPOSE: C p OK: C {} OK: C {} STABLE: C {} PROPOSE: C p OK: C {} RETRY: STABLE: C 5 { } C 5 { } NACK: C { } OK: C { } p1 WAIT p 1 WAIT p 2 p 2 p 3 p 4 OK: {C} OK: {} STABLE: 4 4 {C} p 3 PROPOSE: 4 p4 OK: {} OK: {} STABLE: 4 {} (a) p 2 sends an OK message for at timestamp T = beause is in the predeessors set of, and is deided at timestamp T = 4. (b) p 2 rejets at timestamp T = beause is not in the predeessors set of, and is deided at timestamp T = 4. is deided at timestamp 5 after a retry. Fig. 2. Exeution of the wait ondition in CAESAR due to out of order reeption of non-ommutative ommands on node p 2. Command waits for ommand to be stable on node p 2, sine s timestamp T has been reeived after s timestamp T, and T = < T = 4. is required at this stage due to the lower-bound defined in [22]. Here, we assume s leader does not fail or is suspeted; the ase of faulty leaders is disussed in Setion V-E. Figure 1(b) shows how CAESAR applies this idea to the exeution of Figure 1(a). Node p broadasts by proposing it with timestamp ; then a quorum of nodes onfirms sine none of those nodes has already reeived with a timestamp greater than. The onfirmation from a proess p j is sent via an OK message, whih, unlike the base protool, inludes a predeessors set Pred j of the ommands observed by p j, and that should preede. When p 4 broadasts with timestamp 4, it reeives a quorum of replies from p 2, p 3, p 4, whih onfirms that an be exeuted with timestamp 4 and only after has been exeuted. This happens beause p 2 already observed at the time it reeived (see irle in Figure 1(b)), and it inluded Pred 2 = {} in the OK message for. A ommand leader an broadast the STABLE message as soon as it reeives a quorum of OK messages for that ommand, and it also inludes the timestamp and the set Pred, whih is the union of the predeessors sets reeived in the OK messages. Therefore, in CAESAR, unlike the base protool, a node an exeute when it reeives the STABLE message for and only after it has exeuted all the ommands in s Pred. As shown in Figure 1(b), a ommand s leader in CAESAR still guarantees a fast deision in two ommuniation delays as long as the proposed timestamp is onfirmed by a quorum of nodes and despite the non-uniform replies that it olleted (the set of predeessors olleted by p 4 for is different). This also onstitutes a signifiant differene between CAESAR and other state-of-the-art Generalized Consensus implementations, e.g., EPaxos, whih require at least two additional ommuniation delays before the exeution of in the example of Figure 1(b). In the following we answer two questions: what does a node do if it observes out of order timestamps (Setion IV-A)? How does a ommand s leader behave if one of the nodes in the replying quorum rejets a proposed timestamp (Setion IV-B)? A. Out of Order Timestamps Let us now onsider the senario in Figure 2(a), where, unlike the one in Figure 1, node p 2 reeives the PROPOSE for after having reeived the one for (see the irle on p 2 ). In this ase, p 2 annot diretly send an OK message for, beause T = < T = 4, and ould be finally deided at timestamp T without ever onsidering as its predeessor, and hene be exeuted before, with a resulting violation of the order of the timestamps. On the other hand, sending a rejetion for would require additional ommuniation delays, beause s leader would be fored to retry the deision proedure with a new timestamp. This overhead is unneessary if was reeived before on another node, whih ould be part of the quorum of replies to s leader. In this ase, CAESAR enfores a wait ondition for on p 2 (bar labelled wait along p 2 s timeline in Figure 2(a)) in order to prevent the exeution of any step for until p 2 reeives the final deision for. Afterwards, if the final deision for inludes in s Pred, p 2 an reply with an OK message to s leader. As a result, CAESAR is able to inrease the probability of deiding ommands in two ommuniation delays even in the ase of out of order reeption of timestamps. Note that the wait ondition does not ause deadlok sine only ommands with a lower timestamp, e.g.,, wait for the final deision of onfliting ommands with a higher timestamp, e.g.,. B. Rejetion of Timestamps In ase a node annot onfirm a timestamp T proposed for a ommand, it sends a rejetion N ACK to s leader, foring the leader to retry with a timestamp greater than T. This is the ase of Figure 2(b), where p 2 rejets T = for beause it already reeived the STABLE message for with timestamp T > T and is not in s Pred. p 2 also sends bak the set of ommands that aused the rejetion (i.e., ) to aid in hoosing the next timestamp for. In CAESAR, if a ommand s leader reeives at least one N ACK message for the proposed ommand, it assigns a new timestamp T new greater than any suggestion reeived in the N ACK messages, and it broadasts a RETRY message to ask for the aeptane of T new to a quorum of nodes. Note that if a node sends a N ACK message for a ommand to s leader, it means that s leader would reeive at least a N ACK message for from any other quorum due to the way a ommand rejetion is omputed (see Setion V). The RETRY message also ontains the predeessors set Pred, whih is omputed as the union of predeessors reeived in the quorum of replies from the previous phase, as the ase of Setion IV-A. Therefore, in Figure 2(b), p broadasts the RETRY with timestamp T new = 5 and Pred = { } for.

5 Retrying a ommand with a new timestamp does not entail restarting the proedure from the beginning. In fat, unlike the ase of a PROPOSE message, CAESAR guarantees that a RETRY message an never be rejeted (see Setions V-C and V-F). Suh a guarantee ensures starvation-free agreement of ommands. The reply to a RETRY message for ould ontain a set of additional predeessors that were not reeived by s leader during the previous ommuniation phase. This set is sent along with the STABLE message for. V. PROTOCOL DETAILS A ommand that is proposed to CAESAR an go through four phases before it gets deided and the outome of its exeution is returned to the lient. CAESAR shedules the exeution of those four phases in order to provide two modes of deision, alled fast deision and slow deision. A ommand is proposed by one of the nodes, whih assumes the role of s leader and oordinates the deision of by starting the fast proposal phase. If this phase returns a positive outome after having olleted replies from a quorum of FQ nodes, the leader an exeute the final stable phase, whih finalizes the deision of as a fast deision, with a lateny of two ommuniation delays. Otherwise, if the fast proposal phase returns a negative outome, the leader exeutes an additional retry phase, in whih it ontats a quorum of CQ nodes, before issuing the final stable phase. This results in a slow deision, with a lateny of four ommuniation delays. In this setion we desribe CAESAR by detailing the required data strutures in Setion V-A, the proedure for a fast deision in Setion V-B, the proedure for a slow deision in Setion V-C, and the behavior of the protool in ase of failures in Setion V-E. We also explain how CAESAR behaves in ase a leader is not able to ontat a fast quorum of nodes during the exeution of the fast proposal phase for a ommand, as long as no more than f nodes rash. This ase entails the exeution of an additional slow proposal phase after the fast proposal phase and before the remaining retry and stable phases. This part is overviewed in Setion V-D and detailed in the tehnial report [23]. In Figure 4 we provide the main pseudoode of CAESAR for the deision of a ommand. Eah horizontal blok of the figure is a phase, and phases are linked through arrows to indiate the transition from one phase to another. For instane, in ase of fast deision, we have a transition from the fast proposal phase to the stable phase; on the other hand all the other transitions are part of a slow deision. Moreover, the pseudoode is vertially partitioned in order to distinguish the part that is exeuted by the ommand s leader and the part that an be exeuted by any node (inluding the leader); it is also named as aeptor for historial reasons. Finally, the pseudoodes of auxiliary funtions and the reovery from a failure are provided in Figures 3 and 5, respetively. A. Data Strutures per node p i T S i. It is a logial lok with monotonially inreasing values in a totally ordered set of elements, and it is used to generate timestamps for the ommands that are proposed by p i. Its value at a ertain time is greater than the timestamp of any ommand that has been handled by p i before that time. We assume that whenever p i sends a ommand, T S i is updated with a greater value and used as timestamp T for the ommand. Also, whenever p i reeives a ommand with timestamp T, it updates its T S i with a value that is greater than T, if T T S i. We also assume that for any two T S i and T S j, of p i and p j respetively, the value of T S i is different from the value of T S j at any time. This is guaranteed by hoosing the values of T S i (T S j, respetively) in the set { k, i : k N} ({ k, j : k N}, respetively). The total order relation on those values is defined as follows: for any two k 1, i, k 2, j, we have that k 1, i < k 2, j k 1 < k 2 (k 1 = k 2 i < j). The initial value of T S i is, i. H i. It is the data struture reording the status of ommands seen by p i. It is represented as a map of tuples of the form, T, Pred, status, B, f ored where: is a ommand; T is the latest timestamp of ; Pred is the set of ommands that should preede in the final deision; status is the urrent status of, and it has values in the set {f ast-pending, slow-pending, aepted, rejeted, stable}; B is the ballot number assoiated with this event, and it has values in N; and fored is a boolean variable with values in {, }, and it indiates if the info assoiated with this event (e.g., Pred) has been fored by a reovery proedure. Eah tuple in H i is uniquely identified by the first element of the tuple, i.e., the ommand, and thus H i ontains at most one tuple per ommand. For a more ompat representation, we use the don t-are term whenever we are not interested in the value of a speifi element of a tuple. We also use the following notations: H i.update(, T, Pred, status, B, fored) to indiate that the protool appends the tuple, T, Pred, status, B, fored to H i, by first possibly deleting any existing tuple,,,,, from H i ; H i.get() to indiate that the protool retrieves a tuple assoiated with the ommand in H i ; and H i.getpredecessors() to indiate that the protool retrieves the set Pred of a tuple,, Pred,,, in H i. The initial value of H i is an empty map. Ballots i. It is an array mapping ommands to ballots, whih have values in N. Ballots i [] = B means that B is the urrent ballot for whih p i has proessed an event related to ommand. The initial values of Ballots i are. B. Fast Deision A lient proposes a ommand by triggering the event PROPOSE() on one of the nodes of CAESAR (lines I1 I2), whih beomes s leader. Let us all this node p i. p i enters the fast proposal phase for by hoosing the urrent value of T S i as timestamp T ime of. The other parameters of this phase are the ballot number Ballot and the whitelist Whitelist whose values, in this ase, are and empty set, respetively. The meaning of these parameters is stritly related to the reovery proedure due to node failures, and therefore we will

6 provide further details in Setion V-E. However, at this stage, it is enough to know that: - a ballot number for is an identifier of the urrent leader for, and a node p j reeiving a message with ballot number B an proess that message only if its urrent ballot, i.e., Ballots j [], for is not greater than B. - Whitelist for ontains the ommands that should be onsidered as predeessors of aording to the pereption of the node that is exeuting a reovery proedure for. Fast proposal phase. The purpose of the fast proposal phase for a ommand with a timestamp T ime is to propose, to a quorum of nodes, the aeptane of at T ime and ollet, from that quorum, the known predeessor set Pred of ommands that should be deided before at a timestamp less than T ime. To do so, p i broadasts a FASTPROPOSE message with and T ime, and it ollets FASTPROPOSER messages from a quorum of nodes (lines P1 P2). When a node p j reeives a FASTPROPOSE message with and T ime, it omputes the predeessor set Pred j by alling the COMPUTEPREDECESSORS funtion (line P13) and updates the entry for in H j by marking that as fast-pending with T ime and Pred j (line P14), and it alls the funtion WAIT (line P15) to hek the wait ondition, as desribed in Setion IV-A. p j also stores in H j whether the value of Whitelist is different from null or not (line P14). A FASTPROPOSER message for from a node p j ontains a timestamp T ime j and a predeessor set Pred j, and it an be marked with either OK or N ACK. If the message is marked with OK, then T ime j is equal to the proposed T ime, by meaning that p j did not rejet T ime. On the ontrary, if the message is marked with N ACK, then T ime j is greater than T ime meaning that p j rejeted T ime and suggested a greater timestamp for. In both ases, whether T ime has been rejeted or not, the predeessor set Pred j ontains all the ommands that should be deided before aording to the urrent knowledge of p j. WAIT (see lines 4 8 of Figure 3) fores to wait for any ommand in H j that does not ommute with to be marked with either aepted or stable, if s timestamp is greater than s timestamp and is not in s predeessor set. Afterwards, when the wait ondition does not hold anymore, WAIT returns N ACK in ase there still exists suh a ommand, with status either aepted or stable; otherwise the funtion returns OK. If WAIT returns OK, then p j sends T ime and the omputed Pred j bak to s leader by onfirming what the leader proposed (line P2). Otherwise, if WAIT returns N ACK (lines P16 P2), p j rejets the proposed timestamp by: marking the tuple of in H j as rejeted, suggesting the urrent value of T S j as a new timestamp for, and reomputing the predeessor set aording to the new timestamp. The predeessor set Pred j of is omputed as the set of ommands in H j that do not ommute with and have a timestamp smaller than s timestamp, with the following exeption (see lines 1 3 of Figure 3): if the Whitelist in input is not null and is not ontained in Whitelist, then has to appear with a status that is different from fast-pending 1: funtion Set COMPUTEPREDECESSORS(, T ime, Whitelist) 2: Pred j { : ( ) Whitelist = null, T,,,, Hj : T < T ime (Whitelist null Whitelist, T,, slow-pending/aepted/stable,, H j : T < T ime) } 3: return Pred j 4: funtion Boolean WAIT(, T ime) 5: wait until, T, Pred,,, H j, ( T ime < T Pred, T, Pred, aepted/stable,, H j) 6: if, T, Pred, aepted/stable,, H j : T ime < T Pred then 7: return N ACK 8: else return OK 9: funtion BREAKLOOP() 1:, T, Pred, stable, B, H j.get() 11: for all Pred :, T, Pred, stable, B, H j T < T do 12: H j.update(, T, Pred \ {}, stable, B, ) 13: for all Pred :, T, Pred, stable, B, H j T > T do 14: Pred Pred \ { } 15: H j.update(, T, Pred, stable, B, ) 16: funtion Boolean DELIVERABLE() 17: return ( H j.getpredecessors()) Deided j Fig. 3. Auxiliary funtions - node p j in H j in order to be inluded in Pred j. In ase of a fast deision (see FastDeision transition in Figure 4), the ommand leader p i is able to ollet a fast quorum of FQ replies that do not rejet T ime for (line P5). It then submits with the onfirmed T ime and the union of the reeived predeessor sets, i.e., Pred, to the next stable phase (lines P3 P4 and P6). Note that unlike other multi-leader onsensus protools [13], [1], a fast deision in CAESAR is guaranteed in ase a fast quorum onfirms the timestamp for a ommand, although those nodes an reply with non-equal predeessors sets. In the orretness proof of CAESAR (see Setion V-F), we show that suh a ondition is suffiient to guarantee the reoverability of the fast deision for even in ase the ommand leader and at most other f 1 nodes rash. Stable phase. The purpose of the stable phase for a ommand with a timestamp T ime and predeessor set Pred is to ommuniate to all the nodes, via a STABLE message, that has to be deided at timestamp T ime after all the ommands in Pred have been deided (line S1). In partiular, whenever a node p j reeives a STABLE message for, with T ime and set Pred (lines S2 S7), it updates the tuple for in H j with the new values and marks the tuple as stable (line S3). Whenever eah ommand in Pred has been deided (lines of Figure 3), p j an deide by triggering DECIDE() (lines S5 S7). This is orret beause, as we prove in Setion V-F, the phases exeuted before the stable phase guarantee that for any pair of stable and non-ommutative ommands and, with timestamps T ime and T ime respetively, if T ime < T ime then Pred, where Pred is the predeessor set of. Therefore, the deision order of non-ommutative ommands is guaranteed to follow the inreasing order of the ommands timestamps. However, this does not mean that if Pred, then T ime < T ime. Hene the stable phase has to

7 Command Leader Aeptors I1: Algorithm 1 Proposal, Retry, and Stable phases. I2: 1: 2: 3: Propose() T ime T Si FastProposalPhase(,, T ime, null) Fast Proposal Phase 4: FastProposalPhase(, Ballot, T ime, Whitelist) 5: P1: send FastPropose[, Ballot, T ime, Whitelist] to all pj 2 6: P2: reeive FastProposeR[, Ballot, T imej, Predj, OK/N ACK] 7: 8: 9: 1: 11: 12: 13: 14: 15: 16: 17: 18: 19: 2: 21: 22: 23: from all pj 2 S : S = F Q _ (timeout ^ S = CQ) P3: T ime M AX j {T imej : pi reeived SFastProposeR[, Ballot, T imej, Predj, OK/N ACK] from pj } P4: Pred j Predj : pi reeived FastProposeR[, Ballot, T imej, Predj, OK/N ACK] from pj P5: if S = F Q : pi reeived FastProposeR[, Ballot, T imej, Predj, N ACK] from pj then P6: StablePhase(, Ballot, T ime, Pred) P7: else if 9j : pi reeived FastProposeR[, Ballot, T imej, Predj, N ACK] from pj then P8: RetryPhase(, Ballot, T ime, Pred) P9: else P1: SlowProposalPhase(, Ballot, T ime, Pred) SlowProposalPhase(, Ballot, T ime, Pred) send SlowPropose[, Ballot, T ime, Pred] to all pj 2 reeive SlowProposeR[, Ballot, T imej, Predj, OK/N ACK] Slow Proposal from all pphase j 2 S : S = CQ T ime M AX j {T imej : pi reeived SlowProposeR[, Ballot, T imej, Predj, OK/N ACK] from pj } S P21: Pred j Predj : pi reeived P22: SlowProposeR[, Ballot, T imej, Predj, OK/N ACK] from pj : pi reeived SlowProposeR[, Ballot, T imej, Predj, N ACK] from pj then StablePhase(, Ballot, T ime, Pred) P24: else P25: RetryPhase(, Ballot, T ime, Pred) 24: P26: RetryPhase(, Ballot, T ime, Pred) 25: P27:send Retry[, Ballot, T ime, Pred] to all pj 2 26: P28:reeive RetryR[, Ballot, T ime, Predj ] from all pj 2 S : S = CQ S 27: Pred j Predj : pi reeived RetryR[, Ballot, T ime, Predj ] from pj 28: StablePhase(, Ballot, T ime, Pred) 29: 3: P11: P12: P13: P14: P15: P16: P17: P18: P19: P2: FastDeision(, Ballot, Time, Pred) P29: P3: P31: P32: P33: P34: P35: P36: P37: P38: P39: SlowDeisionFromProposal(, Ballot, Time, Pred) StablePhase(, Ballot, T ime, Pred) Phase sendretry Stable[, Ballot, T ime, Pred] to all pj 2 R5: R6: R7: R8: R1: R2: R3: R4: SlowDeisionFromRetry(, Ballot, Time, Pred) 2 Stable Phase S1: S2: S3: S4: S5: S6: S7: Fig. 4. C AESAR s pseudoode. The left part is exeuted by the ommand s leader pi, and the right part an be exeuted by any aeptor pj (inluding pi ). take are of breaking any possible loop that might be reated by the predeessor sets of the stable ommands, before trying to deliver them (line S4 and lines 9 15 of Figure 3). That is done as follows: for any two stable and non-ommutative ommands and with timestamps T and T, respetively, if T > T then is deleted from s predeessor set. When a ommand is stable on all nodes, the information about an be safely garbage olleted. C. Slow Deision In ase the leader of a ommand annot guarantee a fast deision for, then it has to exeute additional phases before the finalization of the stable phase for. This happens beause in the fast proposal phase for (lines I1 I2, P1 P4, and P11 P2), the ommand leader annot ollet a fast quorum of FAST P ROPOSE R messages that are all marked with OK (lines P7 P1) due to the following reasons: the fast quorum of olleted FAST P ROPOSE R messages atually inludes a message that rejets the proposed timestamp for and is marked with N ACK (lines P7 P8, and R1 R8); or the leader is only able to ollet a quorum of CQ FAST P ROPOSE R messages (lines P9 P1), beause either there are no FQ orret nodes in the system or the other N CQ nodes are too slow to provide their reply within a onfigurable timeout

8 to the ommand leader (line P2). In this subsetion, we refer to a slow deision by fousing on the former ase; the latter is explained in Setion V-D. Retry phase. This phase guarantees that a ommand is aepted by a quorum of CQ nodes after the previous proposal phase for ould not provide a fast deision, and before moving to the stable phase for. At this stage, the leader p i of broadasts a RETRY message with the maximum T ime among the ones suggested by the aeptors in the previous phase, and the predeessor set Pred as the union of the sets suggested by the aeptors in the previous phase (line R1). Then p i waits for a quorum of CQ RETRYR replies that onfirm the timestamp T ime for (line R2), before submitting T ime to the next stable phase (line R4). This guarantees that, even with f failures, there always exists a orret node that onfirmed T ime in this phase. It is important to notie that as in the ase of a FAST- PROPOSER message, a RETRYR message from a node p j also ontains p j s view of s predeessors set, whih will be inluded in the final Pred set in input to the next stable phase (line R3). This is beause, as shown in Setion IV-B, s leader has to inlude all the ommands that were not predeessors of aording to the timestamp proposed in the previous proposal phase but that have to be onsidered as predeessors aording to the new timestamp of this phase. Furthermore, a reply from an aeptor in this phase annot rejet the broadast timestamp for, beause, as it will be lear in the proof of orretness (see Setion V-F), at this stage CAESAR guarantees that there does not exist any aeptor p j and ommand suh that is stable on p j with timestamp T > T and is not in s predeessors set. Therefore, when a node p j reeives a RETRYR message with, T ime, and Pred, it only updates the tuple for in its H j by marking it as aepted with T ime and Pred (line R5), and it omputes a new predeessors set Pred j by alling the COMPUTEPREDECESSORS funtion (line R7), like in the fast proposal phase. Then, it sends a onfirmation RETRYR bak to the ommand leader with the new Pred j as well as the one previously reeived by the leader (line R8). D. Unavailability of Fast Quorums In CAESAR, as in other fast onsensus implementations [1], there might exist senarios where no fast quorum is available. This happens due to our hoie on the size of fast quorums, i.e., FQ, whih is greater than the minimum number of orret nodes in the system, i.e., N f. Therefore, under a period of asynhrony of the system, where a message an experiene an arbitrarily long delay, a node is not able to distinguish whether f nodes rashed or not, and hene a ommand leader that waits for replies from a fast quorum of nodes ould wait indefinitely in a fast proposal phase. This issue is solved in CAESAR by adopting a more ommon solution, namely the adoption of timeouts, but it requires the interposition of an additional slow proposal phase after the fast proposal phase and before either the retry or the stable phase (see lines P21 P39). In partiular, a ommand leader an deide to exeute a slow proposal phase without waiting for a fast quorum of FQ replies if it has olleted a quorum of CQ FASTPROPOSER messages for a ommand and none of the messages have rejeted the proposed timestamp (P9 P1). This senario an be onsidered as a orner ase of CAE- SAR s exeution and thus, for the sake of brevity, we deided to detail it in the tehnial report [23]. E. Reovery from Failures Whenever a node p i rashes, there might exist some ommand whose leader is p i and whose deision would never be finalized unless some expliit ation is taken. Indeed, let us suppose there exists a node p k that stores with a status different from stable. Then, aording to the pseudoode of Figure 4, p k would deide only after having reeived a STABLE message from p i. 1: RECOVERYPHASE() 2: Ballots k []++ 3: send RECOVERY[, Ballots k []] to all p j Π 4: reeive RECOVERYR[, Ballots k [],, T j, Pred j,, B j, / /N OP] from all p j S Π : S = CQ 5: MaxBallot MAX{B j : p i reeived RECOVERYR[, Ballots k [],, T j, Pred j,, B j, / ] } 6: ReoverySet { p j, T j, Pred j,, / : p i reeived RECOVERYR[, Ballots k [],, T j, Pred j,, B j, / ] from p j B j = MaxBallot } 7: if p j, T j, Pred j, stable, ReoverySet then 8: STABLEPHASE(, Ballots k [], T j, Pred j) 9: else if p j, T j, Pred j, aepted, ReoverySet then 1: RETRYPHASE(, Ballots k [], T j, Pred j) 11: else if p j, T j, Pred j, rejeted, ReoverySet then 12: T ime T S i 13: FASTPROPOSALPHASE(, Ballots k [], T ime, null) 14: else if p j, T j, Pred j, slow-pending, ReoverySet then 15: SLOWPROPOSALPHASE(, Ballots k [], T j, Pred j) 16: else if ReoverySet > then 17: T ime T j : p j, T j, Pred j, fast-pending, / ReoverySet 18: Pred j Predj : p j, T j, Pred j, fast-pending, / ReoverySet 19: if p j, T j, Pred j, fast-pending, ReoverySet then 2: WhiteList Pred 21: else if ReoverySet CQ then 22: WhiteList { Pred : S ReoverySet, S CQ p j, T j, Pred j, fast-pending, S, Pred j } 23: else 24: WhiteList null 25: FASTPROPOSALPHASE(, Ballots k [], T ime, WhiteList) 26: else 27: T ime T S i 28: FASTPROPOSALPHASE(, Ballots k [], T ime, null) 29: upon reeive RECOVERY[, Ballot] from p k Ballot > Ballots j[] 3: Ballots j[] Ballot 31: if H j.contains() then 32: send RECOVERYR[, Ballots j[], H j.getinfo()] to p k 33: else 34: send RECOVERYR[, Ballots j[], N OP] to p k Fig. 5. RECOVERY phase exeuted by node p k. Node p j is a reeiver of the RECOVERY message. For this reason, CAESAR also inludes an expliit reovery proedure (Figure 5) that finalizes the deision of ommands whose leader either rashed or has been suspeted. Given the aforementioned example, whenever the failure detetor of p k suspets p i, p k attempts to beome s leader and finalizes the deision of. This is done by exeuting a Paxos-like prepare phase, and olleting the most reent information about from

9 a quorum of CQ nodes as follows: p k inrements its urrent ballot for, i.e., Ballots k [], (line 2) and it broadasts a RECOVERY message for with the new ballot (line 3). Then, it waits for a quorum of CQ RECOVERYR replies, whih ontain information about, before finalizing the deision for (line 4). RECOVERYR from p j ontains either the tuple of in H j or N OP if suh a tuple does not exist (lines 31 34). A node p j that reeives a RECOVERY message from p k replies only if its ballot for is lesser than the one it has reeived. In suh a ase, p j also updates its ballot for (lines 29 3). Like in Paxos, this is done to guarantee that no two leaders an ompete to finalize the deision for the same ommand onurrently. In fat, if two leaders p k1 and p k2 both suessfully exeute lines 3 and 4 of the reovery proedure with ballots B 1 and B 2, respetively, then, if B 1 < B 2, for any quorum of nodes S, there always exists a node in S that never replies to p k1 (see the reeption of FASTPROPOSE, SLOWPROPOSE, RETRY, and STABLE messages in Figure 4). When node p k suessfully beomes s leader, it filters the information for that it has reeived by only keeping in ReoverySet the data assoiated with the maximum ballot, named MaxBallot in the pseudoode (lines 5 6). Eah tuple of the set is a sequene of node identifier, timestamp, predeessors set, status, and fored boolean indiating: the node that sent the information, the timestamp, the predeessors set, the status of on that node, and whether that information has been fored by a WhiteList or not on that node. Then, p k takes a deision for aording to the ontent of ReoverySet as follows. i) If there exists a tuple with status stable, then p k starts a stable phase for by using the neessary info from that tuple, e.g., timestamp and predeessors set (lines 7 8). ii) If there exists a tuple with status aepted, then p k starts a retry phase for by using the neessary info from that tuple (lines 9 1). iii) If there exists a tuple with status rejeted or ReoverySet is empty, was never deided, and hene p k starts a fast proposal phase for (lines 11 13, and 26 28) by using a new timestamp (as desribed in Setion V-B). iv) If there exists a tuple with status slow-pending, then p k starts a slow proposal phase for by using the neessary info from that tuple (lines 14 15). v) If the previous onditions are false, then ReoverySet ontains tuples with the same timestamp T ime and status fast-pending (lines 16 25). In this last ase, p k starts a proposal phase for with timestamp T ime beause might have been deided with that timestamp in a previous fast deision (line 25). If so, p k has to also hoose the right predeessors set that was adopted in that deision. Therefore, it has to either hoose a predeessors set in ReoverySet that was fored by a previous reovery, if any (lines 19 2), or it has to build its own WhiteList of ommands that should be fored as predeessors of (lines 21 24). This is done by notiing that: if was deided in a fast deision with ballot MaxBallot then the size of ReoverySet annot be lesser than CQ 2 + 1, whih is the minimum size of the intersetion of any lassi quorum and any fast quorum (lines 21 and 24); if a ommand was previously deided in a fast deision and it has to be a predeessor of, then there annot exist a subset of CQ tuples in ReoverySet, whose predeessors sets do not ontain (line 22). Note that, the ase in whih was previously deided in a slow deision and has to be a predeessor of is handled by the omputation of predeessors set in the fast proposal phase (see line P13 of Figure 4, and lines 1 3 of Figure 3). F. Corretness The omplete formal proof on the orretness of CAESAR is in the tehnial report [23], where we have also formalized a desription of the algorithm in TLA+ [24], whih has been model-heked with TLC model-heker. Here we provide the main intuition on how we proeeded in proving that CAESAR implements the speifiation of Generalized Consensus. Let us also define the prediate DECIDED[,T,Pred,B] as a prediate that is equal to true whenever a node deides a ommand with timestamp T, predeessors set Pred, and ballot B. Then we an prove that CAESAR guarantees Consisteny by proving the following two theorems: -,, (DECIDED[,T,Pred,B] DECIDED[, T,Pred,B] T < T Pred); - ( B, DECIDED[,T,Pred,B] Pred, DE- CIDED[, T,Pred,B] B B,(DECIDED[,T,Pred,B ] T = T Pred = Pred)). VI. IMPLEMENTATION AND EVALUATION We implemented CAESAR in Java and ontrasted it with four state-of-the-art onsensus protools: M 2 Paxos, EPaxos, Multi-Paxos, and Menius. We used the Go language implementations of EPaxos, Multi-Paxos, and Menius from the authors of EPaxos. For M 2 Paxos, we used the open-soure implementation in Go. Note that Go ompiles to native binary while Java runs on top of the Java Virtual Mahine. Thus, we use a warmup phase before eah experiment in order to kikstart the Java JIT Compiler. Competitors have been evaluated on Amazon EC2, using m4.2xlarge instanes (8 vcpu and 32GB RAM) running Ubuntu Linux Our benhmark issues lient ommands to update a given key of a fully repliated Key-Value store. Two ommands are onfliting if they aess the same key. The ommand size is 15 bytes, whih inlude key, value, request ID, and operation type. In our evaluations, we explored both onfliting and nononfliting workloads. When the lients issue onfliting ommands, the key is piked from a shared pool of 1 keys with a ertain probability depending on the experiment. As a result, by ategorizing a workload with 1% of onfliting ommands, we refer to the fat that 1% of the aessed keys belong to the shared pool. To measure lateny, we issued requests in a losed loop by plaing 1 lients o-loated with eah node ( in total), and for throughput the lients injeted requests to the system in an open loop. Performane of ompetitors has been olleted with and without network bathing (the aption indiates that). We deployed the ompetitors on five nodes loated in Virginia (US), Ohio (US), Frankfurt (EU), Ireland (EU), and

10 7 % 2% 1% 3% % 1% Caesar EPaxos M2Paxos Virginia Ohio Frankfurt Ireland Mumbai Lateny (mse) % 2% 1% 3% % 1% % 2% 1% 3% % 1% % 2% 1% 3% % 1% % 2% 1% 3% % 1% % 2% 1% 3% % 1% Fig. 6. Average lateny for ordering and proessing ommands by hanging the perentage of onfliting ommands. Bathing is disabled. Bars are overlapped: e.g., in the ase of 3% onflits in Virginia, lateny values are 9 mse, 18 mse, and 127 mse, for CAESAR, EPaxos, and M 2 Paxos, respetively. Mumbai (India). This onfiguration spreads nodes suh that the lateny to ahieve a quorum is similar for all quorumbased ompetitors. It is worth realling that in a system with 5 nodes, CAESAR requires ontating one node more than other quorum-based ompetitors to reah a fast deision. The round trip time (RTT) that we measured in between nodes in EU and US are all below 1ms. The node in India experienes the following delays with respet to the other nodes: 186ms/VA, 31ms/OH, 112ms/DE, 122ms/IR. As in EPaxos, CAESAR uses separate queues for handling different types of messages, and eah of these queues is handled by a separate pool of threads. In CAESAR, onfliting ommands are traked using a Red-Blak tree data struture ordered by their timestamp. Multi-Paxos is deployed in two settings: one where the leader is loated in Ireland, whih is a node lose to a quorum, and one where the leader is in Mumbai, whih needs to ontat nodes at long distane to have a quorum of responses. A. Non-faulty Senarios In Figure 6, we report the average lateny inurred by CAESAR, EPaxos, and M 2 Paxos to order and exeute a ommand. Given the lateny of a ommand is affeted by the position of the leader that proposes the ommand itself, we show the results olleted in eah site. Eah luster of data shows the behavior of a system while inreasing the perentage of onflits in the range of {% no onflit, 2%, 1%, 3%, %, 1% total order}. At % onflits, EPaxos and M 2 Paxos provide omparable performane beause both employ two ommuniation steps to order ommands and the same size for quorums, with EPaxos slightly faster beause it does not need to aquire the ownership on submitted ommands before ordering. The performane of CAESAR is slightly slower (on average 18%) than EPaxos beause of the need of ontating one more node to reah onsensus. When the perentage of onfliting ommands inreases up to %, CAESAR sustains its performane by providing an almost onstant lateny; all other ompetitors degrade their performane visibly. The reasons vary by protool. EPaxos degrades beause its number of slow deisions inreases aordingly, along with the omplexity of analyzing the onflit graph before delivering. For M 2 Paxos, the degradation is related to the forwarding mehanism implemented when the requested key is logially owned by another node. In that ase, M 2 Paxos passes the ommand to that node, whih beomes responsible to order it. This mehanism introdues an additional ommuniation delay, whih ontributes to degraded performane espeially in geo-sale where the node having the ownership of the key may be faraway. At last, we inluded also the ase of 1% onflits. Here all ompetitors behave poorly given the need for ordering all ommands, whih does not represent their ideal deployment. Lateny (mse) Multi-Paxos-IR Menius Multi-Paxos-IN Caesar Virginia Ohio Frankfurt Ireland Mumbai Fig. 7. Average lateny for ordering ommands of Multi-Paxos (with a lose and faraway leader), Menius, and CAESAR. Bathing is disabled. The lateny provided by the node in India is higher than other nodes. Here CAESAR is % slower than EPaxos only when onflits are low, beause CAESAR has to ontat one more faraway node (e.g., Virginia) to deliver fast. Performanes of Multi-Paxos and Menius are reported in Figure 7 beause these ompetitors are oblivious to the perentage of onfliting ommands injeted in the system. CAESAR % has also been inluded for referene. Menius s performane is similar aross the nodes beause it needs to ollet feedbaks from all onsensus partiipants, and therefore it performs as the slowest node and on average 6% slower than CAESAR. The version of Multi-Paxos with the leader in Mumbai (Multi-Paxos-IN) is not able to provide low lateny due to the delay that ommands experiene while waiting for a response from the leader. On the other hand, if the leader is plaed in Ireland (Multi-Paxos-IR) the quorum an be reahed faster than the ase of Multi-Paxos-IN, thus ommand lateny is signifiantly lower. Compared with results in Figure 6, Multi-Paxos-IR and Multi-Paxos-IN are, on average, 5% and

PBFT: A Byzantine Renaissance. The Setup. What could possibly go wrong? The General Idea. Practical Byzantine Fault-Tolerance (CL99, CL00)

PBFT: A Byzantine Renaissance. The Setup. What could possibly go wrong? The General Idea. Practical Byzantine Fault-Tolerance (CL99, CL00) PBFT: A Byzantine Renaissane Pratial Byzantine Fault-Tolerane (CL99, CL00) first to be safe in asynhronous systems live under weak synhrony assumptions -Byzantine Paos! The Setup Crypto System Model Asynhronous