Geographic State Machine Replication


Università della Svizzera italiana
USI Technical Report Series in Informatics

Geographic State Machine Replication
Paulo Coelho, Fernando Pedone
Faculty of Informatics, Università della Svizzera italiana, Switzerland

Abstract. Many current online services need to serve clients distributed across geographic areas. These systems are subject to stringent availability and performance requirements. In order to meet these requirements, replication is used to tolerate the crash of servers and to improve performance by deploying replicas near the clients. Coordinating geographically distributed replicas, however, is challenging. We present GeoPaxos, a protocol that addresses this challenge by combining three insights: it decouples order from execution in state machine replication, it induces a partial order on the execution of operations instead of a total order, and it exploits geographic locality, typical of geo-distributed online services. GeoPaxos outperforms state-of-the-art approaches by more than an order of magnitude in some cases. We describe GeoPaxos's design and implementation in detail and present an extensive performance evaluation.

Report Info
Published: June 2017
Number: USI-INF-TR
Institution: Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
Online Access

1 Introduction

Many current online services must serve clients distributed across geographic areas. In order to ensure that clients experience high service availability and performance, servers are typically replicated and deployed over geographically distributed sites (i.e., datacenters). By replicating the servers, the service can be configured to tolerate the crash of a few nodes within a single datacenter or the disruption of an entire datacenter. Geographic replication can also improve performance by placing the data close to the clients, which reduces service latency.

GeoPaxos combines three insights to implement efficient state machine replication in geographically distributed environments. Combining these ideas proved to be challenging and resulted in a novel system that can fully exploit common geographically distributed cloud computing infrastructures (e.g., Amazon's EC2) and outperform related approaches by a large margin.

First, GeoPaxos decouples operation ordering from operation execution. Although there are storage systems that distinguish order from execution (e.g., [39]), and Paxos itself introduces different roles for the ordering and execution of operations [2], Paxos-based systems typically combine the two roles in a replica (e.g., [14, 25, 28]). Combining order and execution in a geographically distributed system, however, leads to a performance dilemma. On the one hand, replicas must be deployed near clients to avoid communication with remote servers during the execution of operations. On the other hand, distributing replicas across geographic areas slows down the ordering of operations, since ordering requires replicas to coordinate. By decoupling order from execution, GeoPaxos can use one set of nodes to order operations and another set of nodes, the replicas, to execute operations. As a result, replicas can be geographically distributed without penalizing the ordering of operations.

Second, instead of totally ordering operations before executing them, as traditionally done in state machine replication [19], GeoPaxos introduces a novel partial order protocol. It is well known that total order is not necessary to ensure consistency in state machine replication [32], and a few designs have implemented partial ordering of operations (e.g., [18, 25]).

GeoPaxos differs from these systems in the way it implements partial order. GeoPaxos uses multiple independent instances of Multi-Paxos [5] to order operations; hereafter, we call an instance of Multi-Paxos a partition. Operations are ordered by one or more partitions, depending on the objects they access. Operations that are ordered by a single partition are the most efficient ones, since they involve servers in datacenters in the same region, subject to small communication delays. Operations that involve multiple partitions require coordination among replicas in datacenters that may be far apart. Thus, multi-partition operations perform worse than single-partition operations.

Third, to maximize the number of single-partition operations, GeoPaxos exploits geographic locality. Geographic locality is present in many online services. In online social services, for example, the probability of having a social connection between two individuals decreases as an inverse power of their geographic distance [2]. Some distributed systems exploit locality by sharding the data and placing shards near the users of the data (e.g., [14, 35]). GeoPaxos does not shard the service state; instead, it distributes the responsibility for ordering operations. Operations are ordered by nodes deployed in the region where the clients most likely to access these objects are located. GeoPaxos's flexibility results in excellent performance at the cost of extra nodes to order operations, when compared to traditional approaches that combine order and execution [14, 25, 28].

This paper makes the following contributions:
- It demonstrates the importance of decoupling the ordering from the execution of operations in a geographically distributed system. Although Paxos makes this distinction, Paxos-based systems combine ordering and execution in a replica.
- It proposes a novel partial ordering protocol that can take advantage of public cloud computing infrastructures such as Amazon EC2. In GeoPaxos, redundancy for fault tolerance is provided by replicas in datacenters in different availability zones within the same region; redundancy for performance is provided by replicas in different regions. Although intra-region redundancy does not tolerate catastrophic failures in which all datacenters of a region are wiped out, most applications do not require this level of reliability.
- It shows how these ideas can be combined with geographic locality, a property present in many online services, leading to a state machine replication design that, under some common circumstances, outperforms state-of-the-art protocols by an order of magnitude.

The rest of the paper is structured as follows. Section 2 details the system model and recalls fundamental notions. Section 3 overviews the main contributions of the paper. Section 4 details GeoPaxos. Section 5 describes our prototype. Section 6 presents our performance evaluation. Section 7 reviews related work and Section 8 concludes the paper.

2 Background

In this section, we define our system model and assumptions (§2.1), recall the notions of consensus and state machine replication (§2.2), and briefly describe Paxos (§2.3).

2.1 System model

We consider a message-passing, geographically distributed system. Client and server processes are grouped within datacenters (also known as sites or availability zones) distributed over different regions. The system is asynchronous in that there is no bound on message delays or on relative process speeds, but communication between processes within the same region experiences much shorter delays than communication between processes in different regions. Processes are subject to crash failures and do not behave maliciously (e.g., no Byzantine failures).

The service state can be replicated in servers in datacenters within the same region and across regions. Replication within a datacenter can tolerate the crash of some of the replicas; replication using servers located in different datacenters can tolerate the crash of a whole datacenter. Replication across regions is mostly used to exploit locality, since storing data close to the clients avoids large delays due to expensive inter-region communication. We account for client-data proximity by assuming that clients have a preferred region [36].
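To make the model concrete, the following Python sketch captures the entities of the system model (regions, datacenters, replicas, and clients with a preferred region). It is purely illustrative; the GeoPaxos prototype itself is written in C, and all names below are assumptions made for illustration.

```python
# Illustrative sketch of the system model in Section 2.1; not GeoPaxos code.
from dataclasses import dataclass

@dataclass(frozen=True)
class Datacenter:
    name: str
    region: str             # e.g., "CA", "VA", "EU"

@dataclass(frozen=True)
class Replica:
    datacenter: Datacenter  # each replica holds a full copy of the service state

@dataclass(frozen=True)
class Client:
    name: str
    preferred_region: str   # clients are assumed to have a preferred region [36]

# Replication within a region tolerates replica or datacenter crashes;
# replication across regions is used to place data close to the clients.
replicas = [Replica(Datacenter("DC1", "CA")), Replica(Datacenter("DC4", "VA"))]
client = Client("c1", preferred_region="CA")
local = [r for r in replicas if r.datacenter.region == client.preferred_region]
assert len(local) == 1
```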

2.2 Consensus and replication

Consensus is an abstraction whereby replicas agree on a common value (e.g., the next operation to be executed). More precisely, consensus is defined by three properties: (a) if a replica decides on a value, then the value was proposed by some process (validity); (b) no two replicas decide differently (agreement); and (c) if a non-faulty process proposes a value, then eventually all non-faulty replicas decide some value (termination). Consensus requires additional assumptions to be solved [1, 2]. Since the protocols proposed in this work do not make explicit use of these assumptions, we simply assume that consensus can be implemented [2].

State machine replication is a principled approach to building highly available services [19, 32]. State machine replication regulates how service operations must be propagated to and executed by the replicas. Operation propagation has two requirements: (i) every non-faulty replica must receive every operation, and (ii) no two replicas can disagree on the order of received and executed operations. If operations are deterministic, then replicas will reach the same state and produce the same output upon executing the same sequence of operations. State machine replication can be implemented as a series of consensus instances, where the i-th consensus instance decides on the i-th operation (or batch of operations) to be executed by the replicas [2]. Although such a total order of operations is sufficient to implement state machine replication, it is not necessary [32].

State machine replication guarantees linearizability, a consistency criterion [12]. An execution is linearizable if there is a permutation of the operations in the execution that respects (i) the service's sequential specification and (ii) the real-time precedence of operations as seen by the clients. Operation op_i precedes operation op_j if the response of op_i occurs before the invocation of op_j.

2.3 Classic Paxos

Paxos is a fault-tolerant consensus protocol with important characteristics: it has been proven safe under asynchronous assumptions (i.e., when there are no timing bounds on message propagation and process execution), live under weak synchrony assumptions, and resilience-optimum [2]. Paxos distinguishes the following roles that a process can play: proposers, acceptors and learners. Clients of a replicated service are typically proposers, and propose operations that need to be ordered by Paxos before they are learned and executed by the replicated state machines. The replicas typically play the roles of acceptors (i.e., the processes that actually agree on a value) and learners. Paxos is resilience-optimum in the sense that it tolerates the failure of up to f acceptors out of a total of 2f + 1 acceptors while ensuring progress (i.e., a quorum of f + 1 acceptors must be non-faulty) [22]. In practice, replicated services run multiple executions of the Paxos protocol to achieve consensus on a sequence of values. We refer to multiple executions of Paxos chained together as Multi-Paxos [5] or Atomic Broadcast [4].
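As a minimal illustration of state machine replication as a sequence of consensus instances, the sketch below buffers decided operations and applies them strictly in instance order, so that deterministic replicas converge to the same state. It is a toy model under assumed names, not part of GeoPaxos.

```python
# Toy model of state machine replication over a totally ordered sequence of
# consensus instances: the i-th instance decides the i-th operation.
class Replica:
    def __init__(self, initial_state):
        self.state = initial_state
        self.next_instance = 0
        self.decided = {}          # instance number -> decided operation

    def on_decide(self, instance, operation):
        """Called when consensus instance `instance` decides `operation`."""
        self.decided[instance] = operation
        # Execute operations strictly in instance order; with deterministic
        # operations, all replicas reach the same state and outputs.
        while self.next_instance in self.decided:
            op = self.decided.pop(self.next_instance)
            self.state = op(self.state)     # deterministic execution
            self.next_instance += 1

r = Replica(initial_state=0)
r.on_decide(1, lambda s: s * 2)   # decided out of order, buffered
r.on_decide(0, lambda s: s + 3)   # instances 0 and 1 now execute in order
assert r.state == 6
```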
3 Overview

GeoPaxos combines three strategies: it exploits the different roles implemented by Paxos (§3.1), it induces a partial order on operations (§3.2), and it makes use of geographic locality (§3.3).

3.1 Dissociating order from execution

GeoPaxos takes advantage of the fact that Paxos allows the ordering of operations, performed by the acceptors, to be dissociated from the execution of operations, performed by the learners (i.e., the service replicas). Systems based on Paxos typically combine the acceptor and learner roles in a replica (e.g., [25, 14]). As a consequence, these systems are subject to a tradeoff: on the one hand, placing replicas near remote clients reduces the response time experienced by the clients; on the other hand, distributing replicas across geographic areas may slow down the ordering of commands (e.g., if no quorum exists involving nearby replicas [15]). GeoPaxos is not vulnerable to this tradeoff since the ordering of operations does not depend on the number and placement of replicas.
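The configuration sketch below illustrates one way such a decoupled deployment could look (compare with Figure 1): each partition's acceptors are placed in different availability zones of a single region, so ordering uses local quorums, while replicas (learners) are placed in every region, close to clients. The layout and names are illustrative assumptions, not GeoPaxos's configuration format.

```python
# Assumed deployment layout: ordering (acceptors) is regional, execution
# (replicas/learners) is global. Names are placeholders.
ordering = {
    # partition -> acceptors in different availability zones of ONE region
    "A": ["CA-az1", "CA-az2", "CA-az3"],
    "B": ["VA-az1", "VA-az2", "VA-az3"],
    "C": ["EU-az1", "EU-az2", "EU-az3"],
}

execution = {
    # region -> replicas; every replica holds the full state and one learner
    # per partition, so execution never requires an inter-region quorum
    "CA": ["replica-ca"],
    "VA": ["replica-va"],
    "EU": ["replica-eu"],
}

def quorum(partition):
    """Majority quorum of a partition's acceptors (f + 1 out of 2f + 1)."""
    return len(ordering[partition]) // 2 + 1

assert quorum("A") == 2   # three acceptors in the same region -> local quorum
```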

3.2 Partial versus total order

GeoPaxos induces a partial order on operations. It was observed early on that interference-free operations can be executed in any order by replicas in state machine replication without violating consistency [32]. Two operations are interference-free if they do not access a common object. Intuitively, if two operations interfere, then they must be executed sequentially, in the same order, by each replica. Interference-free operations may be executed in different orders by the replicas, and even concurrently at a replica.

GeoPaxos uses one or more independent instances of Multi-Paxos to order operations, where each instance is called a partition. To determine the partitions responsible for ordering an operation, we assign each object to a partition, the object's owner (we explain how to assign ownership based on geographic locality in §3.3 and how objects can change ownership dynamically in §4.2). Operations that access a single object or multiple objects with the same owner, the single-partition operations, are ordered by the owner partition. Operations that access objects owned by multiple partitions, the multi-partition operations, are ordered by all involved partitions. The challenge is to ensure that such operations are ordered consistently across partitions, that is, if operation op_i is ordered before operation op_j by partition A, then op_j is not ordered before op_i by partition B (we explain how GeoPaxos ensures a consistent order in §4.1). Every replica in GeoPaxos contains one learner per partition. Single-partition operations are learned and executed by the learner that learns the ordered operation. Multi-partition operations are learned by multiple learners but executed by only one learner at the replica.

3.3 Exploiting geographic locality

GeoPaxos differs from other partial-order protocols in that it uses multiple partitions to order operations, as opposed to a single partition [21, 25, 27, 28]. This distinction allows GeoPaxos to account for locality in geographically distributed applications and to assign object ownership in order to (i) minimize multi-partition operations and (ii) maximize single-partition operations that are ordered in the region where they are issued.

[Figure 1: A deployment of GeoPaxos with three regions (A, B and C) and three datacenters in each region. Each partition has one acceptor per datacenter of its region; every replica holds objects X, Y and Z, owned by partitions A, B and C, respectively. Region A has two replicas, at datacenters DC1 and DC2. Regions B and C have one replica only, at DC4 and DC7, respectively. Client C_x's preferred region is A.]

The problem of assigning object ownership to achieve properties (i) and (ii) is application-specific and orthogonal to GeoPaxos [11]. We illustrate a solution with an online social service. Online social services are notorious for exhibiting geographic locality: the probability of having a social connection between two users is inversely proportional to their geographic distance [1, 3, 2]. In a social service, the problem of assigning object ownership can be reduced to a graph partitioning problem. In this graph, clients (i.e., service users) are vertices and their interconnections (i.e., friendship relations) are edges. Operations (i.e., gettimeline and post) involve interconnected clients (see §6.4 for a detailed description of the operations). A partitioning of the graph results in strongly connected subgraphs, each one weakly connected to the other subgraphs. By assigning object ownership in a subgraph to a partition in the preferred region of the clients in the subgraph, we achieve properties (i) and (ii) above.
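As a toy illustration of locality-driven ownership assignment (a real deployment would rely on a graph partitioner such as METIS, as in §6.1), the sketch below assigns each user's object to the partition of the region where the user and most of its friends reside. Function and variable names are hypothetical.

```python
# Naive locality heuristic for ownership assignment; illustrative only.
from collections import Counter

def assign_ownership(users, friends, region_of, partition_of_region):
    """friends: user -> list of friends; region_of: user -> preferred region;
    partition_of_region: region -> partition owning that region's objects."""
    ownership = {}
    for u in users:
        # vote with the regions of the user and of its friends
        votes = Counter([region_of[u]] + [region_of[f] for f in friends[u]])
        best_region, _ = votes.most_common(1)[0]
        ownership[u] = partition_of_region[best_region]
    return ownership

friends = {"alice": ["bob"], "bob": ["alice"], "carol": []}
region_of = {"alice": "CA", "bob": "CA", "carol": "EU"}
ownership = assign_ownership(["alice", "bob", "carol"], friends, region_of,
                             {"CA": "A", "EU": "C"})
assert ownership == {"alice": "A", "bob": "A", "carol": "C"}
```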

Figure 1 illustrates a deployment of GeoPaxos in three regions. Client C_x's preferred region is A, where C_x's friends, represented by object x, are located. Since partition A owns x, if C_x issues an operation from region A (its preferred region), the operation will be a local single-partition operation, ordered by acceptors in partition A and executed by all replicas in all partitions. If C_x issues an operation from region B, the operation will be a remote single-partition operation. A client with friends in partitions B and C will issue multi-partition operations. Local single-partition operations are more efficient than remote single-partition operations, and both are more efficient than multi-partition operations.

4 Design

In this section, we detail how GeoPaxos orders and executes operations (§4.1), present some extensions and improvements to the basic protocol (§4.2), discuss practical aspects (§4.3), and argue about the correctness of GeoPaxos (§4.4).

4.1 The order protocol

Clients can submit operations to any one of the replicas. At a replica, an operation has three attributes:
- The "state" attribute identifies whether the operation is (a) waiting to be ordered, (b) ordered, or (c) done, after the replica has executed the operation.
- The "dst" attribute is set by the replica and contains all partitions that own objects the operation accesses (or a superset of these partitions).
- The "tp" attribute is a timestamp. Timestamps are tuples, with one entry per partition. For two operations op_i and op_j, op_i.tp < op_j.tp if for all x in op_i.dst ∩ op_j.dst, it holds that op_i.tp[x] < op_j.tp[x].

GeoPaxos ensures that if two operations interfere, then replicas execute the operations in the same order. To guarantee this property, GeoPaxos assigns timestamps to operations such that if op_i and op_j interfere, then either op_i.tp < op_j.tp or op_j.tp < op_i.tp, and executes operations in timestamp order. Replicas execute the following five steps to order operations (see also Figure 2).

I. Upon receiving an operation op from a client, the replica initializes the operation's attributes and requests that the operation be ordered in each one of the partitions involved in op. We call this the operation's first communication round. A replica requests operation op to be ordered in partition x (i.e., it atomically broadcasts op in partition x) by executing the primitive abcast[x](op).

II. When a replica learns an ordered operation op in partition x, an event that we identify with the primitive deliver[x](op), the replica assigns to op a tentative order in partition x. To compute the tentative order, replicas implement a logical clock vector, LC, with one entry per partition. The replica first increments partition x's entry in LC and assigns the result as x's tentative order for op. In partition x, op is now waiting for its timestamp.

III. After a tentative order has been assigned to op for each partition involved in op, there are two cases to consider. If op is a single-partition operation, then the order proposed by the partition involved in op becomes its timestamp and op transitions to the ordered state. If op involves multiple partitions, then its timestamp is computed from the tentative orders proposed by the partitions in op.dst. To ensure that replicas update their logical clock vectors consistently, op's timestamp is atomically broadcast to the partitions in op.dst. We call this the operation's second communication round.

IV. Upon delivering op's timestamp in partition x, the replica updates x's logical clock and marks op as ordered in partition x.

V. After op is in the ordered state in each one of the partitions involved in op and the replica has already executed all operations with a timestamp smaller than op's, the replica executes op, responds to the client, and marks op as executed, which allows operations ordered after op to be eventually executed.
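The sketch below illustrates the timestamp relation and the execution rule of step V, assuming timestamps are represented as per-partition dictionaries. It is a simplified model of the protocol under assumed names, not the authors' implementation.

```python
# Sketch of GeoPaxos timestamps and the execution-eligibility check (Sec. 4.1).
def tp_less(op_i, op_j):
    """op_i.tp < op_j.tp iff op_i.tp[x] < op_j.tp[x] for every partition x
    in the intersection of op_i.dst and op_j.dst."""
    common = op_i["dst"] & op_j["dst"]
    return bool(common) and all(op_i["tp"][x] < op_j["tp"][x] for x in common)

def can_execute(op, pending):
    """Step V: op is executable once it is ordered in all its partitions and
    no not-yet-done operation has a smaller timestamp."""
    return (all(s == "ordered" for s in op["state"].values()) and
            not any(o["done"] is False and tp_less(o, op) for o in pending))

# Interfering operations share a partition, so their timestamps are comparable
# and every replica executes them in the same (timestamp) order.
op1 = {"dst": {"A"},      "tp": {"A": 3},
       "state": {"A": "ordered"}, "done": False}
op2 = {"dst": {"A", "B"}, "tp": {"A": 5, "B": 2},
       "state": {"A": "ordered", "B": "ordered"}, "done": False}
assert tp_less(op1, op2) and not tp_less(op2, op1)
assert can_execute(op1, [op1, op2])      # nothing smaller than op1 is pending
assert not can_execute(op2, [op1, op2])  # op1 must execute first
```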
4.2 Extensions and optimizations

In this section, we describe three optimizations to improve the performance of the basic order protocol.

  I.  When operation op is received from a client:
        op.state <- (nil, ..., nil)
        op.dst   <- partitions(op)
        op.tp    <- (-, ..., -)
        for all x in op.dst: abcast[x](round-1, op)

  II. When deliver[x](round-1, op):
        increment LC[x]
        op.tp[x]    <- LC[x]
        op.state[x] <- waiting

  III. When there is op such that state(op) = waiting:
        if |op.dst| = 1:
            let x be the partition in op.dst
            op.state[x] <- ordered
        else:
            for all x in op.dst: op.tp[x] <- tmax(op)
            for all x in op.dst: abcast[x](round-2, op)

  IV. When deliver[x](round-2, op):
        LC[x]       <- max(LC[x], tmax(op))
        op.state[x] <- ordered

  V.  When there is op such that state(op) = ordered and there is no op' with
      state(op') != done and op'.tp < op.tp:
        execute op and respond to the client
        for all x in op.dst: op.state[x] <- done

  where partitions(op) denotes all partitions to be accessed by operation op,
  state(op) = val iff op.state[x] = val for all x in op.dst, and
  tmax(op) = max(op.tp[0], op.tp[1], ...).

Figure 2: The five steps of the GeoPaxos protocol.

4.2.1 Speeding up single-partition operations

In the basic protocol described in the previous section, single-partition operations need one consensus execution and multi-partition operations need two. Let op_s and op_m be two operations that access a common object in partition x (i.e., op_s and op_m interfere), where op_s is single-partition and op_m is multi-partition. Thus, op_s and op_m must be executed in the same order by all replicas. Consider an execution in which op_s's communication round happens between op_m's first and second communication rounds at partition x. From the protocol, op_s is assigned an execution order at partition x greater than the one proposed for op_m, since op_m is handled before op_s. Hence, even though op_s requires only one communication round to be ordered, it can only be executed after op_m completes its second communication round. We call the phenomenon by which the execution of an ordered operation is delayed by the ordering of another operation the "convoy effect". Since there is a significant difference between the response times of single-partition and multi-partition operations, even a small percentage of multi-partition operations in the workload can add substantial delays to the execution of single-partition operations.

Figure 3 depicts the response time CDF of multi-partition and single-partition operations in executions with 0%, 1% and 10% of multi-partition operations. (The details about the setup and the application can be found in §6.) In the workload without multi-partition operations, the 92nd latency percentile of single-partition operations is below 3.5 msec. With 1% of multi-partition operations in the workload, the 92nd latency percentile of single-partition operations reaches almost 80 msec; when the workload contains 10% of multi-partition operations, the 68th latency percentile of single-partition operations reaches almost 80 msec.

[Figure 3: The impact of the convoy effect on the latency of single-partition operations. Latency CDFs of single-partition (local) operations with 0%, 1% and 10% of multi-partition (global) operations in the workload, and of the multi-partition operations themselves.]

To cope with the convoy effect, we let single-partition operations be executed as soon as they are ordered (at the end of step III). Consequently, the second communication round of a multi-partition operation does not delay the execution of a single-partition operation. Note that with this modification, the execution of a single-partition operation op_s and a multi-partition operation op_m may not happen in timestamp order. Intuitively, this does not violate correctness because op_s and op_m are handled in the same total order within a partition, and so all replicas agree that op_s should be executed before op_m is ordered.

4.2.2 Parallel execution of operations

We improve the performance of a replica by multithreading (and parallelizing) the execution of operations that do not interfere [33, 17, 24]. In order to introduce concurrent execution of non-interfering operations, each replica spawns as many threads as the number of partitions, so that operations that access different partitions can execute concurrently. Single-partition operations are executed by the thread in charge of the involved partition. Multi-partition operations require a barrier among the threads involved in the operation. This ensures that only one thread executes the operation and avoids race conditions (see the sketch following §4.3).

4.2.3 Dynamically changing object ownership

Object ownership must be assigned to partitions so as to minimize multi-partition operations and to maximize same-region single-partition operations. Since workloads and locality can vary over time, and it may be unfeasible to predict beforehand in which regions requests for specific objects will be issued (e.g., a user who travels to a different country and wants to access her data), GeoPaxos allows object ownership to change dynamically. In GeoPaxos, object ownership can be reassigned from one partition to another using the move(object_id_list, source, destination) operation, addressed to the source and the destination partitions, where object_id_list contains the objects whose ownership should be re-assigned. Note that a change of ownership does not involve any transfer of actual objects, since every replica contains a full copy of the application state.

4.3 Practical considerations

Partitions are available as long as there is a majority of operational acceptors in the partition. Clients connected to a replica that fails can reconnect to any operational replica, possibly in the client's preferred region. We experimentally evaluate the effect on performance when clients reconnect to a remote replica in §6.7. To recover from failures, the in-memory state of acceptors must be saved on stable storage (i.e., disk). In GeoPaxos, acceptors can persist their state in either asynchronous or synchronous mode. These modes represent a performance versus reliability tradeoff: the asynchronous mode is more efficient but can cause information loss if an acceptor crashes before flushing its state to disk. We evaluate both persistence modes in §6.
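Returning to the parallel execution scheme of §4.2.2, the sketch below shows one way the per-partition executor threads could synchronize on a barrier so that a multi-partition operation is executed exactly once. It is an illustrative model (queues, delivery and error handling are omitted), not the prototype's C code.

```python
# Sketch of per-partition executor threads coordinating on a barrier for
# multi-partition operations; illustrative only.
import threading

class Operation:
    """A multi-partition operation carries a barrier sized to the number of
    involved partitions; single-partition operations need no coordination."""
    def __init__(self, dst, run):
        self.dst = sorted(dst)
        self.run = run
        self.barrier = threading.Barrier(len(self.dst)) if len(self.dst) > 1 else None

def execute(op, my_partition):
    """Called by the executor thread in charge of `my_partition`."""
    if op.barrier is None:
        op.run()                      # single-partition: run directly
        return
    op.barrier.wait()                 # all involved threads rendezvous
    if my_partition == op.dst[0]:     # exactly one of them executes the op
        op.run()
    op.barrier.wait()                 # the others resume only after execution

results = []
op = Operation({"A", "B"}, lambda: results.append("done"))
threads = [threading.Thread(target=execute, args=(op, p)) for p in ("A", "B")]
for t in threads: t.start()
for t in threads: t.join()
assert results == ["done"]            # executed exactly once, race-free
```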

As an optimization, multi-partition operations (with their associated parameters) do not need to be sent to all partitions involved in the operation. It is sufficient that one partition receives the full operation while the other partitions receive only the unique id of the operation, so that the operation can be ordered in all involved partitions.

4.4 Correctness

We first argue that if operations op_i and op_j interfere, then replicas execute them in the same order. Since op_i and op_j interfere, they access a common object, and thus are ordered. The claim follows from two facts: (a) for ordered operations op_i and op_j, either op_i.tp < op_j.tp or op_j.tp < op_i.tp; and (b) replicas execute ordered operations in timestamp order. Without loss of generality, assume op_i.tp < op_j.tp. Thus, for all x in op_i.dst ∩ op_j.dst, op_i.tp[x] < op_j.tp[x]. Fact (a) holds since logical clock values are unique and the timestamp of an ordered operation op is the maximum among the logical clock values proposed by each one of the destinations in op.dst. Fact (b) holds because when an operation op is executed by a replica, there is no operation op' at the replica with a smaller timestamp. Moreover, no future operation can have a smaller timestamp than op's, since timestamps are monotonically increasing.

We now show that GeoPaxos is linearizable. From the definition of linearizability (see §2.2), we must show that there is a permutation π of the operations in any execution of GeoPaxos that respects (i) the real-time ordering of operations as seen by the clients, and (ii) the semantics of the operations. Let op_i and op_j be two operations submitted by clients C_i and C_j, respectively. There are two cases to consider. Case (a): op_i and op_j are interference-free. Thus, op_i and op_j access disjoint sets of objects. Consequently, the execution of one operation does not affect the execution of the other and they can be placed in any relative order in π. We arrange op_i and op_j in π so that their relative order respects their real-time dependencies, if any. Case (b): op_i and op_j interfere. It follows from GeoPaxos's order property above that replicas execute the operations in the same order. Since the two operations execute in sequence, the execution of the operations satisfies their semantics. We now show that the execution order satisfies any real-time constraints among op_i and op_j. Without loss of generality, assume op_i finishes before op_j starts (i.e., op_i precedes op_j in real time). Thus, before op_j is submitted by C_j, op_i has completed (i.e., C_i has received op_i's response). Since op_j is ordered and then executed, we conclude that op_i is ordered before op_j. From the claims above, we can arrange op_i and op_j in π according to their delivery order so that the execution of each operation satisfies its semantics.

5 Implementation

GeoPaxos was implemented in C. Our prototype allows the disk access mode, synchronous or asynchronous, to be configured; by default it is set to asynchronous. We use Libpaxos as the Paxos library. In GeoPaxos, proposers and acceptors are single-threaded processes. To ensure liveness, the system starts with a default distinguished proposer, which exchanges heartbeats with the other proposers to allow progress in the event of a failure. Replicas are multithreaded processes. The learner for each partition is executed as an independent thread and only synchronizes with other learners when an operation involves multiple partitions. An additional thread handles the requests from the clients and, depending on the operation parameters, sets the destination partitions accordingly. Clients are multithreaded, with each thread usually connected to the closest replica. Operations are submitted in a closed loop, i.e., an operation is only sent after the response for the previous operation is received.
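A closed-loop client can be sketched as follows; submit stands for an assumed request/response call to the client's closest replica, and the code is an illustration rather than the prototype's client.

```python
# Sketch of a closed-loop client thread: the next operation is only issued
# after the response for the previous one is received.
import time

def closed_loop_client(submit, operations, latencies):
    """submit(op) -> response; blocks until the replica answers."""
    for op in operations:
        start = time.monotonic()
        submit(op)                       # next op only after this returns
        latencies.append(time.monotonic() - start)

latencies = []
closed_loop_client(lambda op: op, ["post", "gettimeline"], latencies)
assert len(latencies) == 2
```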

6 Evaluation

In this section, we detail our experimental environment and benchmarks (§6.1), compare the performance of GeoPaxos to other protocols under various conditions (§6.2 to §6.6), and assess the behavior of GeoPaxos in the presence of failures (§6.7).

6.1 Environment and benchmarks

The evaluation was conducted in two environments, a local-area network (LAN) and a public wide-area network (WAN). While the LAN allows us to compare the protocols in a controlled environment, the WAN provides a realistic environment, aligned with the conditions GeoPaxos was developed for. The LAN consists of a cluster of nodes, each one with an 8-core Intel Xeon L5420 processor (2.5 GHz), 8 GB of memory, SATA SSD disks, and a 1 Gbps ethernet card. Each node runs CentOS (64 bits). The RTT (round-trip time) between nodes in the cluster is 0.1 msec.

We use Amazon EC2 in the WAN configuration, with each partition deployed in a different region. All the nodes are m3.large instances, with 2 vCPUs and 7.5 GB of memory. For the experiments with 3 partitions, we use 2 datacenters in California (CA), 3 datacenters in North Virginia (VA) and 3 datacenters in Ireland (EU). The regions of Oregon (OR), with 3 datacenters, and Tokyo (JP), with 2 datacenters, are included to complete the 5 partitions. Table 1 summarizes the RTT between these regions. The RTT within a datacenter is smaller than 1 msec, and between datacenters in the same region it is below 2.5 msec.

[Table 1: Average RTT between the regions CA, VA, EU, OR and JP, in milliseconds.]

[Figure 4: Performance in LAN (whiskers: 95% confidence interval for throughput, 99th percentile for latency). (a) Peak throughput and (b) latency of Multi-Paxos, GeoPaxos, EPaxos and M2Paxos with 3 and 5 partitions.]

In the LAN configuration, we use a key-value store service replicated with each of the evaluated protocols. In our workload, all the client requests are 64-byte updates. In the WAN configuration, we use a social network service. Social networks are notorious for exhibiting locality properties of the sort that GeoPaxos can take advantage of [1, 3, 2]. Our social network has 10,000 users. The friendship relations follow a Zipf distribution with skew 1.5. There are two operations: gettimeline and post. The gettimeline operation returns the last messages posted on a specified user's timeline. The post operation appends a message to the timeline of all the followers of the specified user. While gettimeline is always a single-partition operation, post depends on the partitions that own the followers of a user. To assign users to partitions, the social network was partitioned into 3 and 5 partitions using METIS [16], with the following results:
- Three partitions, with 3404, 3170 and 3426 users: 80% of users have followers in the same partition, 18% of users have followers in two partitions, and 2% of users have followers in all partitions.
- Five partitions, with 1998, 1942, 2057, 1943 and 2060 users: 74% of users have followers in the same partition, 22% in two partitions, 2.6% in three partitions, 1% in four partitions, and 0.4% in five partitions.
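The sketch below shows one way such a benchmark graph could be generated, assuming a Zipf-like distribution (skew 1.5) over the number of followers per user. The sizes, caps and helper names are illustrative and not the exact generator used in the paper.

```python
# Illustrative generator for a Zipf-skewed follower graph; not the paper's code.
import random

def zipf_follower_counts(num_users, skew=1.5, max_followers=50, rng=random):
    # probability of follower count k is proportional to 1 / k**skew
    counts = list(range(1, max_followers + 1))
    weights = [1.0 / (k ** skew) for k in counts]
    return [rng.choices(counts, weights=weights)[0] for _ in range(num_users)]

def build_social_graph(num_users, rng=None):
    rng = rng or random.Random(0)
    counts = zipf_follower_counts(num_users, rng=rng)
    users = list(range(num_users))
    return {u: rng.sample([v for v in users if v != u],
                          k=min(counts[u], num_users - 1))
            for u in users}

followers = build_social_graph(100)
assert all(u not in fs for u, fs in followers.items())
# gettimeline is always single-partition; post touches the partitions that own
# a user's followers, so graph locality determines how many ops are multi-partition.
```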

The experiments were executed with 1 clients in each region and a mix of gettimeline and post operations in a proportion of 4:1.

6.2 Performance in the LAN

We compare GeoPaxos to Multi-Paxos (implemented with Libpaxos), EPaxos and M2Paxos in configurations with 3 and 5 partitions. (For EPaxos we used the authors' original code, available at https://github.com/efficient/epaxos; for M2Paxos we used the authors' original code, available at https://bitbucket.org/talex/hyflow-go; both were compiled with Go.) GeoPaxos uses three acceptors per partition, with the partition's replica co-located with an acceptor (see Figure 1). Multi-Paxos, EPaxos and M2Paxos are deployed with one replica per partition. For Multi-Paxos, one of the replicas is the coordinator. Clients run in a closed loop and we increase the number of clients until the system is saturated and no increase in throughput is possible. For GeoPaxos, EPaxos and M2Paxos, where the clients are equally distributed among replicas, 1 simultaneous clients per partition are enough to saturate the system. Multi-Paxos saturates sooner, with around 8 clients per partition, equally distributed among proposers, which forward the operations to the coordinator. Furthermore, we set batching for Multi-Paxos, EPaxos and GeoPaxos to 5 (i.e., at most 5 operations can be ordered with a single Paxos execution). M2Paxos does not provide batching.

As depicted in Figure 4(a), GeoPaxos and EPaxos have similar behavior. This is explained by the absence of a leader in EPaxos and the independent ordering in each partition in GeoPaxos. M2Paxos combines clients and replicas in the same process, imposing high CPU usage. As the number of partitions increases and the proportion of single-partition operations remains high, we expect GeoPaxos's throughput to increase linearly, just like most partially replicated approaches. Multi-Paxos saturates when the coordinator reaches maximum CPU usage. Figure 4(b) shows the latency at peak load for all protocols. GeoPaxos and EPaxos have similar results, substantially lower than Multi-Paxos, which suffers the effects of the overloaded coordinator. M2Paxos has the lowest median latency but a much larger latency tail.

6.3 Latency in the WAN

In the WAN experiments, GeoPaxos contains 3 acceptors and 1 replica per partition, with the acceptors distributed in different availability zones and the replica co-located with one acceptor in a node. Multi-Paxos, M2Paxos and EPaxos use one replica per partition. Clients are in the availability zone of their local replica. Figure 5 shows the median latency and 99th percentile of the protocols in scenarios with a single client. In the "remote client" configurations, the client is in Ireland (EU) and connects to the replica in CA; in all other executions the client is in California (CA). By deploying a single client, we aim to assess the protocols in the absence of queueing effects.

[Figure 5: Latency in WAN (whiskers: 99th percentiles). Latency of Multi-Paxos, GeoPaxos, EPaxos and M2Paxos with local and remote clients and, for GeoPaxos, with single- and multi-partition operations, in the 3- and 5-partition setups.]

Multi-Paxos and GeoPaxos have their latency strictly related to the proximity of the other replicas and the location of the clients. GeoPaxos's latency also depends on the number of partitions that an operation is addressed to. For single-partition operations, the latency of GeoPaxos is around 2 msec, while the best case for EPaxos and Multi-Paxos is around 80 to 90 msec for both 3 and 5 partitions. M2Paxos takes around 63 msec to order a message. Even with a single client, M2Paxos has high latency for remote clients, almost twice GeoPaxos's latency in the 5-partition scenario. Operations that involve two partitions in GeoPaxos have latency between 80 msec and 90 msec. GeoPaxos has higher latency than the other techniques only when an operation involves all partitions, something that is expected to happen very rarely; that latency depends directly on the latency of the farthest replica.

[Figure 6: The impact of the convoy effect on latency (whiskers: 75th percentiles). Latency of Multi-Paxos, GeoPaxos NC, GeoPaxos and EPaxos per region (CA, VA, EU with 3 partitions; CA, VA, EU, OR, JP with 5 partitions); single-, three- and five-partition operations are shown separately.]

6.4 Convoy effect

We now compare GeoPaxos without the optimizations to cope with the convoy effect ("GeoPaxos") and with the optimizations described in §4.2.1 to mitigate the convoy effect ("GeoPaxos NC"). We also include EPaxos and Multi-Paxos in our evaluation. M2Paxos is not evaluated in this setup since the available implementation cannot handle multi-partition operations.

[Figure 7: The impact of the convoy effect on latency. Latency CDFs in the WAN for Multi-Paxos, GeoPaxos NC, GeoPaxos and EPaxos, with (a) 3 partitions and (b) 5 partitions.]

Figure 6 shows the results for 3 and 5 regions. Single-partition operations in GeoPaxos suffer from the convoy effect in all regions. This can be seen in the large difference between the 75th percentile and the median latency values. The OR region suffers less from the convoy effect because the partitioning computed by METIS resulted in almost no multi-partition operations in this region. Our proposed strategy to counter the convoy effect proved effective, as it brought the 75th latency percentile of GeoPaxos NC close to the median latency. Multi-Paxos and EPaxos do not suffer from the convoy effect since all their messages are multi-partition.

To better illustrate the benefits of GeoPaxos's strategy to handle the convoy effect, the cumulative distribution functions (CDF) for both setups are presented in Figure 7. With 3 partitions, GeoPaxos NC brings the percentage of low-latency single-partition operations from around 65% to more than 90%. With 5 partitions, half the single-partition operations experience the convoy effect originally, and less than 15% are penalized with GeoPaxos NC.

Part of the single-partition operations that display high latency with GeoPaxos NC are due to queuing effects and CPU scheduling (2 vCPUs in configurations with as many threads as the number of partitions), and a minor fraction are due to users whose followers are exclusively external to their own partition (single-partition operations ordered by a remote partition, resulting in a remote client).

Figure 8 shows the impact on throughput. GeoPaxos is 2x faster than EPaxos with 3 partitions (from 374 to 745 operations/sec) and GeoPaxos NC is 6x faster (2245 operations/sec). With 5 partitions, the speedup over EPaxos is 3.8x and 7.6x for GeoPaxos and GeoPaxos NC, respectively (366, 1357 and 2767 operations/sec). Multi-Paxos experienced the lowest throughput due to the ordering of all operations by the single coordinator.

[Figure 8: The impact of the convoy effect on throughput. Peak throughput of Multi-Paxos, GeoPaxos, GeoPaxos NC and EPaxos with 3 and 5 partitions.]

6.5 Dynamic ownership changes

In this experiment, conducted using our social network service running in the WAN, we re-assign object ownership dynamically. The execution starts with a random assignment of object ownership to partitions. The first time an object is accessed in an operation (i.e., when a user executes his first post or gettimeline), the object is moved to the user's preferred region, so that further accesses will be local to the user. Figure 9 shows the throughput in the first 140 seconds of the execution and the latency CDF for three time intervals. Until the system has rearranged object ownership, so that objects are owned by the partitions at the clients' preferred regions, performance is low. Once object ownership has been assigned, something that happens after 75 seconds into the execution, most operations are local single-partition operations.

[Figure 9: Impact of dynamic ownership changes on throughput and latency: (a) throughput over time and (b) latency CDF for three time intervals t1, t2 and t3.]

Although conceptually M2Paxos supports the notion of object ownership, we could not compare it experimentally to GeoPaxos since the available implementation does not include object migration.
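The first-access migration policy can be sketched as follows, with move standing in for GeoPaxos's move(object_id_list, source, destination) operation from §4.2.3. The helper signature and policy details are illustrative.

```python
# Sketch of the first-access ownership migration policy used in Section 6.5:
# whenever an accessed object is not owned by the partition of the client's
# preferred region, its ownership is re-assigned there (no data is copied,
# since every replica holds the full state).
def on_access(obj, client_region, ownership, partition_of_region, move):
    current = ownership[obj]
    target = partition_of_region[client_region]
    if current != target:
        move([obj], current, target)      # ownership change only
        ownership[obj] = target

ownership = {"x": "B"}                    # random initial assignment
moves = []
on_access("x", "CA", ownership, {"CA": "A"},
          lambda objs, src, dst: moves.append((tuple(objs), src, dst)))
assert ownership == {"x": "A"} and moves == [(("x",), "B", "A")]
```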

6.6 Synchronous disk writes

In the experiments so far, we have configured acceptors to write asynchronously to disk, which provides better performance than synchronous writes. Although asynchronous writes may result in information loss in the case of failures, controllers typically avoid the problem with battery-backed write caches. We now assess the performance of GeoPaxos with synchronous disk writes in the LAN setup presented in §6.1.

GeoPaxos can benefit from servers with multiple disk drives to improve performance. In order to do that, we create partitions within the same LAN. Each partition uses its own set of acceptors (3 acceptors per partition) and each acceptor uses a different disk in the server. We configure the system with 1 clients per partition, running in closed loop. Figure 10 shows the results as we increase the number of partitions (and clients). Since the disk is the bottleneck in the execution, we see a linear increase in throughput as we add partitions.

[Figure 10: Performance with message logging. Throughput as a function of the number of partitions.]

6.7 Performance under failures

The last set of experiments assesses GeoPaxos in the presence of replica failures. Initially, we configured the system with clients running in closed loop and equally distributed across partitions in order to keep the overall throughput between 1,000 and 1,200 operations per second, without saturating the replicas (see Figure 11).

[Figure 11: Latency and throughput in WAN (3 regions, clients in all regions). Top: latency of clients in CA, VA and EU; the CA replica crash is followed by recovery with a second replica in CA, and the VA replica crash by recovery with the replica in EU. Bottom: throughput over time, with the CA and VA replica failures marked.]
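The client-side failover exercised in this experiment (clients reconnecting to a backup replica, as described next) can be sketched as follows, assuming each client holds an ordered list of replicas, local first and then a backup; connect and the replica names are hypothetical.

```python
# Sketch of client failover: try replicas in preference order and reconnect
# to the next one when the current replica is unreachable. Illustrative only.
def connect_with_failover(replicas, connect):
    last_error = None
    for replica in replicas:              # e.g., ["CA-dc1", "CA-dc2", "EU-dc1"]
        try:
            return connect(replica)
        except ConnectionError as err:
            last_error = err              # replica down: try the next one
    raise last_error

def fake_connect(replica):
    if replica == "CA-dc1":
        raise ConnectionError("replica crashed")
    return replica

assert connect_with_failover(["CA-dc1", "CA-dc2"], fake_connect) == "CA-dc2"
```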

Clients from 3 EC2 regions (CA, VA and EU) connect to their local replicas. Clients in CA have a backup replica in another datacenter in the same region, while clients in VA are configured to connect to the EU replica in case of failures. After 3 seconds into the execution, a replica in CA is killed. Clients in this region immediately connect to the second replica of the region, which has a slightly higher latency (from 0.4 to 1.2 msec), resulting in a small throughput decrease. A second replica is killed 3 seconds later, now in VA. Clients reconnect to the EU replica, but are subject to increased latency (around 14 msec). The replica in EU keeps a constant throughput despite the two failures in two distinct regions. This is possible due to the partial ordering implemented by GeoPaxos and the separation of Paxos roles, which dissociates the acceptors and proposers from the replicas.

In a second configuration, we include the regions of OR and JP and start one replica per region. Clients from CA execute both single- and multi-partition operations. After almost 1 minute, the replica in CA is killed. The 1 clients in CA immediately connect to the VA replica (see Figure 12). The latency of multi-partition operations rises by approximately 1 RTT, from 85 msec to around 160 msec, reflecting the distance from the clients to the remote replica. Single-partition operations are more affected by the local replica crash: their latency of 2.3 msec jumps to the same 160 msec. In this case, the additional 2 RTTs are caused by 4 inter-region messages: (a) from the client to the remote replica; (b) from the remote replica back to the client's local region, where the operation is ordered; (c) from the client's region back to the remote replica, after the operation is ordered; and (d) from the remote replica back to the client. This latency could be reduced if the remote replica, upon noticing the increasing number of requests from the affected region, triggered a move operation for the objects being ordered by that region, saving 1 RTT in further requests from the same clients.

[Figure 12: Latency and throughput in WAN (5 regions, 1 clients in CA). Latency of multi-partition and single-partition operations and throughput over time, with the local replica failure marked.]

7 Related work

If on the one hand state machine replication is widely used to increase service availability (e.g., [3, 6, 13]), on the other hand it is also notably criticized for its performance. From single-leader algorithms, like Paxos [2], to leaderless algorithms (e.g., [25, 38]) and variations that take the semantics of operations into account (e.g., [27, 21]), all efforts have been directed at finding faster ways to order operations. None of these solutions, however, can avoid the large latency imposed by geographically distributed applications, since at least a simple majority quorum of replicas is needed to order operations [22]. Furthermore, existing solutions experience reduced performance as the number of replicas increases.

GeoPaxos improves the performance of state machine replication by exploiting the fact that operations do not need a total order but can instead be partially ordered by multiple totally ordered instances of Multi-Paxos, and by judiciously placing the various Multi-Paxos instances close to their clients. Partially ordering operations with the goal of improving performance has been previously implemented by EPaxos and M2Paxos. EPaxos [25] improved on traditional Paxos [2] by reducing the load on the coordinator and allowing any replica to order operations. Depending on the interference between operations, however, EPaxos can impose two additional communication steps in the ordering of operations, a high price to be paid by geographically distributed applications. Moreover, EPaxos does not take locality into account. M2Paxos [28] is an implementation of Generalized Consensus [21]. M2Paxos does not establish operation dependencies based on conflicts but, similarly to GeoPaxos, maps nodes (partitions in GeoPaxos) to the accessed objects. M2Paxos guarantees that operations that access the same objects are ordered by the same node. It needs at least two communication steps for local operations and one additional step for remote operations. While GeoPaxos has a worst case of two inter-partition message exchanges, M2Paxos's mechanism to deal with multi-partition operations requires changing the ownership of all involved objects and does not provide any guarantee on the maximum number of communication delays. In geographically distributed scenarios, applications cannot afford such uncertainty.

Several solutions that partition (i.e., shard) the data have appeared in the literature. Systems in this category are sometimes referred to as partially replicated systems, as opposed to designs in which each replica has a full copy of the service state, as in GeoPaxos. Spanner [14] is a partitioned distributed database for WANs. It uses a combination of two-phase commit and the TrueTime API to achieve consistent multi-partition transactions. TrueTime uses hardware clocks to derive bounds on clock uncertainty, and is used to assign globally valid timestamps and to provide consistent reads across partitions. It requires elaborate synchronization mechanisms to keep the clock skew among nodes within an acceptable limit. Spinnaker [29] is similar to the approach presented here. It also uses several instances of Multi-Paxos to achieve scalability. However, Spinnaker does not support operations across multiple Multi-Paxos instances. Differently from existing sharded systems, where replicas contain only part of the service state, in GeoPaxos each replica contains the entire state. In doing so, GeoPaxos can improve performance without sacrificing the simplicity of the state machine replication approach. Moreover, there is no need to re-shard data across nodes or to migrate data across nodes for load balancing and in response to failures.

Some systems seek to boost performance by exploiting transactional semantics. The most closely related to GeoPaxos are MDCC, Granola, Geo-DUR and P-Store. MDCC [18] is a replicated transactional data store that also uses several instances of Paxos. MDCC optimizes for commutative transactions and uses Generalized Paxos to relax the order of commuting transactions. Granola [7] is a distributed transaction coordination system that relies on real-time timestamps to provide strong consistency among transactions. It depends on synchronized clocks and needs three communication delays to order multi-repository transactions, in the absence of aborts. P-Store [31] and Geo-DUR [34] are optimized for wide-area networks and have also looked into techniques to reduce the convoy effect. Differently from these systems, GeoPaxos does not require transaction support. For example, GeoPaxos does not need to handle rollbacks in case partitions do not agree on the order of operations [34].

Some solutions have faced the "high latency" of state machine replication by weakening consistency guarantees. One example is eventual consistency [9], which allows replicas to diverge in case of network partitions, with the advantage that the system is always available. However, clients are exposed to conflicts, and reconciliation must be handled at the application level. Walter [35] offers Parallel Snapshot Isolation (PSI) for databases replicated across multiple datacenters. PSI guarantees snapshot isolation and a total order of updates within a site, but only causal ordering across datacenters. COPS [23] ensures a stronger version of causal consistency: in addition to ordering causally related write operations, it also orders writes to the same data items.

GeoPaxos can largely benefit from graph partitioning techniques, although this is not the focus of this paper. Partitioning techniques could be applied to reassign objects and keep the frequency of single-partition operations as high as possible. Several partitioning techniques have been devised that could be used in GeoPaxos (e.g., [8, 26, 37]).


More information

Paxos Replicated State Machines as the Basis of a High- Performance Data Store

Paxos Replicated State Machines as the Basis of a High- Performance Data Store Paxos Replicated State Machines as the Basis of a High- Performance Data Store William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li March 30, 2011 Q: How to build a

More information

High performance recovery for parallel state machine replication

High performance recovery for parallel state machine replication High performance recovery for parallel state machine replication Odorico M. Mendizabal and Fernando Luís Dotti and Fernando Pedone Universidade Federal do Rio Grande (FURG), Rio Grande, Brazil Pontifícia

More information

ZooKeeper. Table of contents

ZooKeeper. Table of contents by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals... 2 1.2 Data model and the hierarchical namespace... 3 1.3 Nodes and ephemeral nodes...

More information

Spanner : Google's Globally-Distributed Database. James Sedgwick and Kayhan Dursun

Spanner : Google's Globally-Distributed Database. James Sedgwick and Kayhan Dursun Spanner : Google's Globally-Distributed Database James Sedgwick and Kayhan Dursun Spanner - A multi-version, globally-distributed, synchronously-replicated database - First system to - Distribute data

More information

Strong Consistency at Scale

Strong Consistency at Scale Strong Consistency at Scale Carlos Eduardo Bezerra University of Lugano (USI) Switzerland Le Long Hoang University of Lugano (USI) Switzerland Fernando Pedone University of Lugano (USI) Switzerland Abstract

More information

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication Important Lessons Lamport & vector clocks both give a logical timestamps Total ordering vs. causal ordering Other issues in coordinating node activities Exclusive access to resources/data Choosing a single

More information

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ)

Data Consistency and Blockchain. Bei Chun Zhou (BlockChainZ) Data Consistency and Blockchain Bei Chun Zhou (BlockChainZ) beichunz@cn.ibm.com 1 Data Consistency Point-in-time consistency Transaction consistency Application consistency 2 Strong Consistency ACID Atomicity.

More information

SpecPaxos. James Connolly && Harrison Davis

SpecPaxos. James Connolly && Harrison Davis SpecPaxos James Connolly && Harrison Davis Overview Background Fast Paxos Traditional Paxos Implementations Data Centers Mostly-Ordered-Multicast Network layer Speculative Paxos Protocol Application layer

More information

WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters

WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters WICE - A Pragmatic Protocol for Database Replication in Interconnected Clusters Jon Grov 1 Luís Soares 2 Alfrânio Correia Jr. 2 José Pereira 2 Rui Oliveira 2 Fernando Pedone 3 1 University of Oslo, Norway

More information

Replication and Consistency. Fall 2010 Jussi Kangasharju

Replication and Consistency. Fall 2010 Jussi Kangasharju Replication and Consistency Fall 2010 Jussi Kangasharju Chapter Outline Replication Consistency models Distribution protocols Consistency protocols 2 Data Replication user B user C user A object object

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein. Copyright 2003 Philip A. Bernstein. Outline 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Copyright 2003 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4. Other Approaches

More information

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union

More information

Enhancing Throughput of

Enhancing Throughput of Enhancing Throughput of NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing Throughput of Partially Replicated State Machines via NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Making Fast Consensus Generally Faster

Making Fast Consensus Generally Faster Making Fast Consensus Generally Faster [Technical Report] Sebastiano Peluso Virginia Tech peluso@vt.edu Alexandru Turcu Virginia Tech talex@vt.edu Roberto Palmieri Virginia Tech robertop@vt.edu Giuliano

More information

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines Hanyu Zhao *, Quanlu Zhang, Zhi Yang *, Ming Wu, Yafei Dai * * Peking University Microsoft Research Replication for Fault Tolerance

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013

Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013 Spanner: Google's Globally-Distributed Database* Huu-Phuc Vo August 03, 2013 *OSDI '12, James C. Corbett et al. (26 authors), Jay Lepreau Best Paper Award Outline What is Spanner? Features & Example Structure

More information

殷亚凤. Consistency and Replication. Distributed Systems [7]

殷亚凤. Consistency and Replication. Distributed Systems [7] Consistency and Replication Distributed Systems [7] 殷亚凤 Email: yafeng@nju.edu.cn Homepage: http://cs.nju.edu.cn/yafeng/ Room 301, Building of Computer Science and Technology Review Clock synchronization

More information

TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton

TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton TAPIR By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton Outline Problem Space Inconsistent Replication TAPIR Evaluation Conclusion Problem

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS CONSISTENCY AND REPLICATION CONSISTENCY MODELS Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Consistency Models Background Replication Motivation

More information

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech Spanner: Google's Globally-Distributed Database Presented by Maciej Swiech What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database." What is Spanner?

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Dynamo Recap Consistent hashing 1-hop DHT enabled by gossip Execution of reads and writes Coordinated by first available successor

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

Introduction to Distributed Systems Seif Haridi

Introduction to Distributed Systems Seif Haridi Introduction to Distributed Systems Seif Haridi haridi@kth.se What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send

More information

Scalable Deferred Update Replication

Scalable Deferred Update Replication Scalable Deferred Update Replication Daniele Sciascia University of Lugano (USI) Switzerland Fernando Pedone University of Lugano (USI) Switzerland Flavio Junqueira Yahoo! Research Spain Abstract Deferred

More information

Documentation Accessibility. Access to Oracle Support

Documentation Accessibility. Access to Oracle Support Oracle NoSQL Database Availability and Failover Release 18.3 E88250-04 October 2018 Documentation Accessibility For information about Oracle's commitment to accessibility, visit the Oracle Accessibility

More information

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE Today s most

More information

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer Group Replication: A Journey to the Group Communication Core Alfranio Correia (alfranio.correia@oracle.com) Principal Software Engineer 4th of February Copyright 7, Oracle and/or its affiliates. All rights

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

Scalable State-Machine Replication

Scalable State-Machine Replication Scalable State-Machine Replication Carlos Eduardo Bezerra, Fernando Pedone, Robbert van Renesse University of Lugano, Switzerland Cornell University, USA Universidade Federal do Rio Grande do Sul, Brazil

More information

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer

The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer The Timed Asynchronous Distributed System Model By Flaviu Cristian and Christof Fetzer - proposes a formal definition for the timed asynchronous distributed system model - presents measurements of process

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Distributed DNS Name server Backed by Raft

Distributed DNS Name server Backed by Raft Distributed DNS Name server Backed by Raft Emre Orbay Gabbi Fisher December 13, 2017 Abstract We implement a asynchronous distributed DNS server backed by Raft. Our DNS server system supports large cluster

More information

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended

More information

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks.

Consensus a classic problem. Consensus, impossibility results and Paxos. Distributed Consensus. Asynchronous networks. Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log

Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Scaling Optimistic Concurrency Control by Approximately Partitioning the Certifier and Log Philip A. Bernstein Microsoft Research Redmond, WA, USA phil.bernstein@microsoft.com Sudipto Das Microsoft Research

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive

More information

Google Spanner - A Globally Distributed,

Google Spanner - A Globally Distributed, Google Spanner - A Globally Distributed, Synchronously-Replicated Database System James C. Corbett, et. al. Feb 14, 2013. Presented By Alexander Chow For CS 742 Motivation Eventually-consistent sometimes

More information

Using Optimistic Atomic Broadcast in Transaction Processing Systems

Using Optimistic Atomic Broadcast in Transaction Processing Systems Using Optimistic Atomic Broadcast in Transaction Processing Systems Bettina Kemme Fernando Pedone Gustavo Alonso André Schiper Matthias Wiesmann School of Computer Science McGill University Montreal, Canada,

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS

EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS LONG KAI THESIS 2013 Long Kai EMPIRICAL STUDY OF UNSTABLE LEADERS IN PAXOS BY LONG KAI THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

arxiv: v1 [cs.dc] 14 May 2018

arxiv: v1 [cs.dc] 14 May 2018 Early Scheduling in Parallel State Machine Replication Eduardo Alchieri, Fernando Dotti 2 and Fernando Pedone Departamento de Ciência da Computação Universidade de Brasília, Brazil 2 Escola Politécnica

More information

Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines

Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines Separating the WHEAT from the Chaff: An Empirical Design for Geo-Replicated State Machines João Sousa and Alysson Bessani LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal Abstract State

More information

Augustus: Scalable and Robust Storage for Cloud Applications

Augustus: Scalable and Robust Storage for Cloud Applications Augustus: Scalable and Robust Storage for Cloud Applications Ricardo Padilha Fernando Pedone University of Lugano, Switzerland Abstract Cloud-scale storage applications have strict requirements. On the

More information

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Consistency and Replication Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Reasons for Replication Reliability/Availability : Mask failures Mask corrupted data Performance: Scalability

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang Naveen Kr. Sharma Adriana Szekeres Arvind Krishnamurthy Dan R. K. Ports University of Washington {iyzhang, naveenks, aaasz, arvind,

More information

No compromises: distributed transactions with consistency, availability, and performance

No compromises: distributed transactions with consistency, availability, and performance No compromises: distributed transactions with consistency, availability, and performance Aleksandar Dragojevi c, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam,

More information

Architecture of a Real-Time Operational DBMS

Architecture of a Real-Time Operational DBMS Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc.

More information

HP: Hybrid Paxos for WANs

HP: Hybrid Paxos for WANs HP: Hybrid Paxos for WANs Dan Dobre, Matthias Majuntke, Marco Serafini and Neeraj Suri {dan,majuntke,marco,suri}@cs.tu-darmstadt.de TU Darmstadt, Germany Neeraj Suri EU-NSF ICT March 2006 Dependable Embedded

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

Consensus, impossibility results and Paxos. Ken Birman

Consensus, impossibility results and Paxos. Ken Birman Consensus, impossibility results and Paxos Ken Birman Consensus a classic problem Consensus abstraction underlies many distributed systems and protocols N processes They start execution with inputs {0,1}

More information

Eventual Consistency 1

Eventual Consistency 1 Eventual Consistency 1 Readings Werner Vogels ACM Queue paper http://queue.acm.org/detail.cfm?id=1466448 Dynamo paper http://www.allthingsdistributed.com/files/ amazon-dynamo-sosp2007.pdf Apache Cassandra

More information

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB s C. Faloutsos A. Pavlo Lecture#23: Distributed Database Systems (R&G ch. 22) Administrivia Final Exam Who: You What: R&G Chapters 15-22

More information

To do. Consensus and related problems. q Failure. q Raft

To do. Consensus and related problems. q Failure. q Raft Consensus and related problems To do q Failure q Consensus and related problems q Raft Consensus We have seen protocols tailored for individual types of consensus/agreements Which process can enter the

More information

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina,

PushyDB. Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, PushyDB Jeff Chan, Kenny Lam, Nils Molina, Oliver Song {jeffchan, kennylam, molina, osong}@mit.edu https://github.com/jeffchan/6.824 1. Abstract PushyDB provides a more fully featured database that exposes

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Federated Array of Bricks Y Saito et al HP Labs. CS 6464 Presented by Avinash Kulkarni

Federated Array of Bricks Y Saito et al HP Labs. CS 6464 Presented by Avinash Kulkarni Federated Array of Bricks Y Saito et al HP Labs CS 6464 Presented by Avinash Kulkarni Agenda Motivation Current Approaches FAB Design Protocols, Implementation, Optimizations Evaluation SSDs in enterprise

More information

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692)

FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) FLAT DATACENTER STORAGE CHANDNI MODI (FN8692) OUTLINE Flat datacenter storage Deterministic data placement in fds Metadata properties of fds Per-blob metadata in fds Dynamic Work Allocation in fds Replication

More information

Presented By: Devarsh Patel

Presented By: Devarsh Patel : Amazon s Highly Available Key-value Store Presented By: Devarsh Patel CS5204 Operating Systems 1 Introduction Amazon s e-commerce platform Requires performance, reliability and efficiency To support

More information

Theoretical Computer Science

Theoretical Computer Science Theoretical Computer Science 496 (2013) 170 183 Contents lists available at SciVerse ScienceDirect Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs Optimizing Paxos with batching

More information

Lecture 6 Consistency and Replication

Lecture 6 Consistency and Replication Lecture 6 Consistency and Replication Prof. Wilson Rivera University of Puerto Rico at Mayaguez Electrical and Computer Engineering Department Outline Data-centric consistency Client-centric consistency

More information

The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage

The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage The Case for Reconfiguration without Consensus: Comparing Algorithms for Atomic Storage Leander Jehl 1 and Hein Meling 2 1 University of Stavanger, Stavanger, Norway leander.jehl@uis.no 2 University of

More information

Interactive Responsiveness and Concurrent Workflow

Interactive Responsiveness and Concurrent Workflow Middleware-Enhanced Concurrency of Transactions Interactive Responsiveness and Concurrent Workflow Transactional Cascade Technology Paper Ivan Klianev, Managing Director & CTO Published in November 2005

More information