Scaling up the Preventive Replication of Autonomous Databases in Cluster Systems 1


Cédric Coulon, Esther Pacitti, Patrick Valduriez
INRIA and LINA, University of Nantes, France
Cedric.Coulon@lina.univ-nantes.fr, Esther.Pacitti@lina.univ-nantes.fr, Patrick.Valduriez@inria.fr

Abstract. We consider the use of a cluster system for Application Service Providers. To obtain high performance and high availability, we replicate databases (and DBMSs) at several nodes so they can be accessed in parallel through applications. The main problem is then to assure the consistency of autonomous replicated databases. Preventive replication [8] provides a good solution that exploits the cluster's high-speed network, without the constraints of synchronous replication. However, the solution in [8] assumes full replication and a restricted class of transactions. In this paper, we address these two limitations in order to scale up to large cluster configurations. The main contribution is a refreshment algorithm that prevents conflicts for partially replicated databases. We describe the implementation of our algorithm over a cluster of 32 nodes running PostGRESQL. Our experimental results show that our algorithm has excellent scale-up and speed-up.

1 Introduction

High performance and high availability of database management have traditionally been achieved with parallel database systems [9], implemented on tightly coupled multiprocessors. Parallel data processing is then obtained by partitioning and replicating the data across the multiprocessor nodes in order to divide processing. Although quite effective, this solution requires the database system to have full control over the data and is expensive in terms of software and hardware.

Clusters of PC servers provide a cost-effective alternative to tightly coupled multiprocessors. They have been used successfully by, for example, Web search engines using high-volume server farms (e.g., Google). However, search engines are typically read-intensive, which makes it easier to exploit parallelism. Cluster systems can make new businesses such as Application Service Providers (ASP) economically viable. In the ASP model, customers' applications and databases (including data and DBMS) are hosted at the provider site and need to be available, typically through the Internet, as efficiently as if they were local to the customer site. To improve performance, applications and data can be replicated at different nodes so that users can be served by any of the nodes depending on the current load [1].

1 Work partially funded by the MDP2P project of the ACI Masses de Données of the French ministry of research.

This arrangement also provides high availability since, in the event of a node failure, other nodes can still do the work. However, managing data replication in the ASP context is far more difficult than in Web search engines since applications can be update-intensive and both applications and databases must remain autonomous. The solution of using a parallel DBMS is not appropriate as it is expensive, requires heavy migration to the parallel DBMS and hurts database autonomy.

In this paper, we consider a cluster system with similar nodes, each having one or more processors, main memory (RAM) and disk. As with multiprocessors, various cluster system architectures are possible: shared-disk, shared-cache and shared-nothing [9]. Shared-disk and shared-cache require a special interconnect that provides a shared space to all nodes, with provision for cache coherence using either hardware or software. Shared-nothing (or distributed memory) is the only architecture that supports our autonomy requirements without the additional cost of a special interconnect. Furthermore, shared-nothing can scale up to very large configurations. Thus, we strive to exploit a shared-nothing architecture.

To obtain high performance and high availability, we replicate databases (and DBMSs) at several nodes, so they can be accessed in parallel through applications. The main problem is then to assure the consistency of autonomous replicated databases. An obvious solution is synchronous replication (e.g. Read-One-Write-All), which updates all replicas within the same (distributed) transaction. Synchronous (or eager) replication enforces mutual consistency of replicas. However, it cannot scale up to large cluster configurations because it makes use of distributed transactions, typically implemented by two-phase commit [5].

A better solution that scales up is lazy replication [6]. With lazy replication, a transaction can commit after updating a replica, called the primary copy, at some node, called the master node. After the transaction commits, the other replicas, called secondary copies, are updated in separate refresh transactions at slave nodes. Lazy replication allows for different replication configurations [7]. A useful configuration is lazy master, where there is only one primary copy. Although it relaxes the property of mutual consistency, strong consistency (for any two nodes, the same sequence of transactions is executed in the same order) is eventually assured. However, it hurts availability since the failure of the master node prevents the replicas from being updated. A more general configuration is (lazy) multi-master, where the same primary copy, called a multi-owner copy, may be stored at and updated by different master nodes, called multi-owner nodes. The advantage of multi-master is high availability and high performance since replicas can be updated in parallel at different nodes. However, conflicting updates of the same primary copy at different nodes can introduce replica divergence.

Preventive replication can prevent the occurrence of conflicts [8] by exploiting the cluster's high-speed network, thus providing strong consistency without the constraints of synchronous replication. The refresher algorithm in [8] supports multi-master configurations and its implementation over a cluster of 8 nodes shows good performance. However, it assumes that databases are fully replicated across all cluster nodes and thus propagates each transaction to each cluster node.
This makes it unsuitable for supporting heavy workloads on large cluster configurations.

Furthermore, it considers only write multi-owner transactions. However, it may happen that a transaction must read from some other copy in order to update a multi-owner copy. In this paper, we address these two limitations in order to scale up to large cluster configurations. The main contribution is a multi-master refreshment algorithm that prevents conflicts for partially replicated databases. We also show the architectural components necessary to implement the algorithm. Finally, we describe the implementation of our algorithm over a cluster of 32 nodes running PostGRESQL and present experimental results that show that it yields excellent scale-up and speed-up.

The rest of the paper is structured as follows. Section 2 presents the cluster system architecture in which we identify the role of the replication manager. Section 3 defines partially replicated configurations. Section 4 presents refreshment management, including the principles of the refresher algorithm and the architectural components necessary for its implementation. Section 5 describes our prototype, our performance model and the experimental results. Finally, Section 6 concludes the paper.

2 Cluster System Architecture

In this section, we introduce the architecture for processing user requests against applications in the cluster system and discuss our general solutions for placing applications, submitting transactions and managing replicas. In doing so, we identify the replication layer together with the other general components.

Our system architecture is fully distributed since any node may handle user requests. Shared-nothing is the only architecture that supports sufficient node autonomy without the additional cost of special interconnects. Thus, we exploit a shared-nothing architecture. However, in order to perform distributed request routing and load balancing, each node must share (i) the user directory, which stores authentication and authorization information on users, and (ii) the list of available nodes and, for each node, its current load and the copies it manages with their access rights (read/write or read-only). To avoid the bottleneck of central routing, we use Distributed Shared Memory [4] to manage this shared information.

Each cluster node has four layers (see Figure 1): Request Router, Application Manager, Transaction Load Balancer and Replication Manager. A request may be a query or an update transaction for a specific application. The general processing of a user request is as follows. When a user request arrives at the cluster, traditionally through an access node, it is sent randomly to a cluster node, say node i. There is no significant data processing at the access node, which avoids bottlenecks. Within node i, the user is authenticated and authorized through the Request Router, available at each node, using a multi-threaded global user directory service. Thus user requests can be managed completely asynchronously. Next, if the request is accepted, the Request Router chooses a node j to which to submit the request. The choice of j involves selecting all nodes at which the required application is available and the copies used by the request are available with the correct access right, and then, among these nodes, the one with the lightest load. Note that i may be equal to j.

The Request Router then routes the user request to an application node using a traditional load balancing algorithm.

Figure 1. Cluster system architecture (each node has a Request Router, Application Manager, Transaction Load Balancer, Replication Manager and DBMS, together with a Current Load Monitor; the nodes share a Global User Directory and Global Data Placement over a fast network)

However, the database accessed by the user request may be placed at another node k, since applications and databases are both replicated and not every node hosts a database system. In this case, the choice of node k will depend on the cluster configuration and the database load at each node.

Each time a new node becomes available, it must register with the other nodes. To accomplish this, it uses the global shared memory that maintains the list of available nodes. If a node is no longer available, it is removed from the list either by itself, if it disconnects normally, or by the request router of another node, if it crashes.

A node's load is reported by a Current Load Monitor available at each node. For each node, the load monitor periodically computes the application and transaction loads. For each type of load, it establishes a load grade and stores the grades in the shared memory; a high grade corresponds to a high load. The Request Router therefore chooses the best node for a specific request using the node grades.
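To make the routing step concrete, the following is a minimal Java sketch of the node-selection rule described above: among the nodes that host the application and all the required copies with the right access mode, pick the one with the lightest load grade. The class and member names (NodeInfo, loadGrade, chooseNode) are illustrative assumptions, not the paper's implementation.

import java.util.*;

// Minimal sketch of the node-selection step performed by the Request Router.
// Names are illustrative; the paper does not prescribe an API.
public class RequestRouterSketch {

    static final class NodeInfo {
        final String id;
        final Set<String> applications;     // applications hosted at this node
        final Map<String, Boolean> copies;  // copy name -> true if writable
        int loadGrade;                      // higher grade = heavier load

        NodeInfo(String id, Set<String> applications,
                 Map<String, Boolean> copies, int loadGrade) {
            this.id = id;
            this.applications = applications;
            this.copies = copies;
            this.loadGrade = loadGrade;
        }
    }

    /** Returns the lightest-loaded node hosting the application and all
     *  required copies with the required access mode, or null if none. */
    static NodeInfo chooseNode(Collection<NodeInfo> nodes, String application,
                               Set<String> requiredCopies, boolean needsWrite) {
        NodeInfo best = null;
        for (NodeInfo n : nodes) {
            if (!n.applications.contains(application)) continue;
            boolean ok = true;
            for (String copy : requiredCopies) {
                Boolean writable = n.copies.get(copy);
                if (writable == null || (needsWrite && !writable)) { ok = false; break; }
            }
            if (ok && (best == null || n.loadGrade < best.loadGrade)) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        NodeInfo n1 = new NodeInfo("N1", Set.of("app"), Map.of("R", true, "S", true), 7);
        NodeInfo n2 = new NodeInfo("N2", Set.of("app"), Map.of("R", true, "S", false), 2);
        NodeInfo chosen = chooseNode(List.of(n1, n2), "app", Set.of("R", "S"), true);
        System.out.println("route to " + (chosen == null ? "none" : chosen.id)); // -> N1
    }
}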

The Application Manager is the layer that manages application instantiation and execution using an application server provider. Within an application, each time a transaction is to be executed, the Transaction Load Balancer layer is invoked, which triggers transaction execution at the best node, using the load grades available at each node. The best node is defined as the one with the lightest transaction load. The Transaction Load Balancer ensures that each transaction execution obeys the ACID (atomicity, consistency, isolation, durability) properties, and then signals the Application Manager to commit or abort the transaction.

The Replication Manager layer manages access to replicated data and assures strong consistency in such a way that transactions that update replicated data are executed in the same serial order at each node. We employ data replication because it provides database access parallelism for applications. Our preventive replication approach, implemented by the multi-master refreshment algorithm, avoids conflicts at the expense of a forced waiting time for transactions, which is negligible thanks to the fast cluster network.

3 Partially Replicated Configurations

In this section, we discuss the configurations supported by our refreshment algorithm and define the terms used for the replication algorithm. Figure 2 shows the various configurations supported by our algorithm with two tables R and S. Primary copies are written in upper case (e.g. S), multi-master copies in underlined upper case (e.g. R1, underlined in the figure) and secondary copies in lower case (e.g. s).

Figure 2. Useful configurations: a) bowtie, b) fully replicated, c) partially replicated, d) partially replicated

First, we have a bowtie lazy-master configuration (Figure 2a) where there are only primary copies and secondary copies. Second, we have a multi-master configuration (Figure 2b) where there are only multi-master copies and all nodes hold all copies; this configuration is therefore also called fully replicated. Then we have two examples of partially replicated configurations (Figures 2c and 2d) where all types of copies (primary, multi-master and secondary) may be stored at any node. For instance, in Figure 2c, node N1 carries the multi-owner copy R1 and a primary copy S, node N2 carries the multi-owner copy R2 and the secondary copy s, node N3 carries the multi-owner copy R3, and node N4 carries the secondary copy s. Such a partially replicated configuration allows an incoming transaction at node N1, such as the one in Example 1, to read S in order to update R1:

Example 1: Incoming transaction at node N1

UPDATE R1 SET att1 = value WHERE att2 IN (SELECT att3 FROM S);
COMMIT;

This transaction can be entirely executed at N1 (to update R1) and at N2 (to update R2). However, it cannot be entirely executed at node N3 (to update R3), which does not hold a copy of S.

A Multi-Owner Transaction (MOT) may be composed of a sequence of read and write operations followed by a commit (as produced by the above statement). This is more general than in [8], which only considers write operations. A Multi-Owner Refresh transaction (MOR) corresponds to the resulting sequence of write operations of an MOT, as written in the Log History [8]. The origin node is the node that receives a MOT (from the client) which updates a set of multi-owner copies. Target nodes are the nodes that carry multi-owner copies which are updated by the incoming MOT. Note that the origin node is also a target node since it holds multi-owner copies updated by the MOT. For instance, in Figure 2b, whenever node N1 receives a MOT that updates R1, then N1 is the origin node and N1, N2, N3 and N4 are the target nodes. In Figure 2c, whenever N3 receives a MOT that updates R3, then the origin node is N3 and the target nodes are N1, N2 and N3.
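As an illustration of these definitions, the following Java sketch models a MOT as a timestamped sequence of read and write operations, derives the corresponding MOR (the write sequence only), and shows how a target node can tell whether it holds all the copies needed to execute the MOT itself. All type and method names are hypothetical; the paper does not define such an API.

import java.util.*;

// Sketch of how a multi-owner transaction (MOT) and its refresh transaction
// (MOR) can be modelled, and how a target node decides whether it can run the
// MOT directly or must wait for the MOR. Names are illustrative only.
public class MotSketch {

    enum Kind { READ, WRITE }

    record Operation(Kind kind, String copy, String sql) {}

    record Transaction(String originNode, long timestamp, List<Operation> ops) {
        /** Copies read or written by the transaction. */
        Set<String> copies(Kind kind) {
            Set<String> result = new HashSet<>();
            for (Operation op : ops) if (op.kind() == kind) result.add(op.copy());
            return result;
        }
        /** The MOR keeps only the write sequence of the MOT. */
        Transaction toRefreshTransaction() {
            List<Operation> writes = new ArrayList<>();
            for (Operation op : ops) if (op.kind() == Kind.WRITE) writes.add(op);
            return new Transaction(originNode, timestamp, writes);
        }
    }

    /** A target node can execute the MOT itself only if it holds every copy
     *  in the MOT's read set and write set; otherwise it waits for the MOR. */
    static boolean canExecuteLocally(Transaction mot, Set<String> localCopies) {
        return localCopies.containsAll(mot.copies(Kind.READ))
            && localCopies.containsAll(mot.copies(Kind.WRITE));
    }

    public static void main(String[] args) {
        // Example 1: read S to update the multi-owner copy R1.
        Transaction mot = new Transaction("N1", System.currentTimeMillis(), List.of(
            new Operation(Kind.READ, "S", "SELECT att3 FROM S"),
            new Operation(Kind.WRITE, "R1", "UPDATE R1 SET att1 = ...")));
        System.out.println(canExecuteLocally(mot, Set.of("R1", "S")));  // N1: true
        System.out.println(canExecuteLocally(mot, Set.of("R3")));       // N3: false, waits for MOR
    }
}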

4 Preventive Refreshment for Partially Replicated Configurations

In this section, we present the preventive refreshment algorithm for partially replicated configurations. We then introduce the architecture used for the Replication Manager layer and show how the algorithm works, using an example.

4.1 Refresher Algorithm

We assume that the network interface provides global First-In First-Out (FIFO) reliable multicast: messages multicast by one node are received at the multicast group nodes in the order they have been sent [3]. We denote by Max the upper bound of the time needed to multicast a message from a node i to any other node j. We also assume that each node has a local clock. For fairness reasons, clocks are assumed to have a bounded drift and to be ε-synchronized [2]. This means that the difference between any two correct clocks is not higher than ε (known as the precision).

To define the refresher algorithm, we need a formal correctness criterion that captures strong consistency. In multi-master configurations, inconsistencies may arise whenever the serial orders of two multi-owner transactions at two nodes are not equal. Therefore, multi-owner transactions must be executed in the same serial order at any two nodes. Global FIFO ordering is thus not sufficient to guarantee the correctness of the refreshment algorithm. Hence the following correctness criterion is necessary:

Definition 3.1 (Total Order). Two multi-owner transactions MOT1 and MOT2 are said to be executed in Total Order if all multi-owner nodes that commit both MOT1 and MOT2 commit them in the same order.

Proposition 3.1. For any cluster configuration C that meets a multi-master configuration requirement, the refresh algorithm that C uses is correct if and only if the algorithm enforces total order.

We now present the refresher algorithm. Each multi-owner transaction is associated with a chronological timestamp value. The principle of the preventive refresher algorithm is to submit a sequence of multi-owner transactions in chronological order at each node. As a consequence, total order is enforced since multi-owner transactions are executed in the same serial order at each node. To assure chronological ordering, before submitting a multi-owner transaction at node i, the algorithm has to check whether there is any older committed multi-owner transaction en route to node i. To accomplish this, the submission time of a new multi-owner transaction MOTi is delayed by Max + ε (recall that Max is the upper bound of the time needed to multicast a message from a node to any other node). After this delay period, all older transactions are guaranteed to have been received at node i. Thus chronological and total orderings are assured.

Whenever a multi-owner transaction MOTi is to be triggered at some node i, that node multicasts MOTi to all nodes 1, 2, ..., n of the multi-owner configuration, including itself. Once MOTi is received at some other node j (note that i may be equal to j), it is placed in the pending queue for the multi-owner triggering node i, called the multi-owner pending queue (noted moqi). Therefore, at each multi-owner node there is a set of multi-owner queues, moq1, moq2, ..., moqn. Each pending queue corresponds to a node of the multi-owner configuration and is used by the refresher to perform chronological ordering.

With partially replicated configurations, some of the target nodes may not be able to perform MOTi because they do not hold all the copies necessary to perform the read set of MOTi (recall the discussion of Example 1). However, the write sequence of MOTi (which corresponds to its MORi) still needs to be ordered using MOTi's timestamp value in order to ensure consistency. So MOTi is delayed by Max + ε to prevent conflicts and waits for the reception of the corresponding MORi (the resulting sequence of write operations of MOTi). At the origin node i, when the commitment of MOTi is detected, the corresponding MORi is produced and node i multicasts MORi towards the involved target nodes. Finally, upon reception of MORi at a target node j, the content of the waiting MOTi is replaced with the content of the incoming MORi. Then MOTi is executed, which enforces strong consistency.

A key aspect of our algorithm is that it relies on the upper bound Max on the transmission time of a message by the global FIFO reliable multicast. Therefore, it is essential to have a value of Max that is not overestimated. The computation of Max resorts to scheduling theory (e.g. see [9]) and takes into account several parameters such as the global reliable network itself, the characteristics of the messages to multicast and the failures to be tolerated. In summary, our approach trades the use of a worst-case multicast time for a reduction of the number of messages exchanged on the network. This is a well-known trade-off. This solution brings simplicity and ease of implementation.
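The following is a minimal, single-threaded Java sketch of the ordering rule at the heart of the refresher: one pending queue per origin node, and a MOT is handed to the Deliverer only once its timestamp plus Max + ε has elapsed on the local clock, so that no older MOT can still be en route. Multicast, clock synchronization and failure handling are deliberately left out, and all names are illustrative rather than taken from the prototype.

import java.util.*;

// Single-threaded sketch of the preventive refresher's ordering rule: a MOT
// timestamped ts is submitted only once the local clock reaches ts + Max + ε,
// and pending MOTs are then executed in timestamp order.
public class RefresherSketch {

    record Mot(String originNode, long timestamp, Runnable body) {}

    private final long maxPlusEpsilon;                       // Max + ε, in ms
    private final Map<String, Deque<Mot>> pendingQueues = new HashMap<>();

    RefresherSketch(long maxPlusEpsilon) { this.maxPlusEpsilon = maxPlusEpsilon; }

    /** Called by the Receiver: one FIFO pending queue per origin node. */
    void receive(Mot mot) {
        pendingQueues.computeIfAbsent(mot.originNode(), k -> new ArrayDeque<>()).addLast(mot);
    }

    /** Called periodically: picks the oldest pending MOT across all queues and
     *  submits it once its waiting period Max + ε has elapsed, which guarantees
     *  that no older MOT is still en route. */
    void step(long localClock) {
        Mot oldest = null;
        for (Deque<Mot> q : pendingQueues.values()) {
            Mot head = q.peekFirst();
            if (head != null && (oldest == null || head.timestamp() < oldest.timestamp()))
                oldest = head;
        }
        if (oldest != null && localClock >= oldest.timestamp() + maxPlusEpsilon) {
            pendingQueues.get(oldest.originNode()).pollFirst();
            oldest.body().run();   // hand over to the Deliverer / DBMS
        }
    }

    public static void main(String[] args) {
        RefresherSketch refresher = new RefresherSketch(100);
        refresher.receive(new Mot("N2", 1_050, () -> System.out.println("MOT from N2")));
        refresher.receive(new Mot("N1", 1_000, () -> System.out.println("MOT from N1")));
        refresher.step(1_120);  // only the MOT timestamped 1000 is old enough: N1 first
        refresher.step(1_200);  // then the MOT timestamped 1050: total order preserved
    }
}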

4.2 Replication Manager Architecture

To implement the multi-master refresher algorithm, we add several components to a regular DBMS. Our goal is to preserve node autonomy, since the implementation does not require knowledge of system internals. Figure 3 shows the architecture for refreshing partially replicated databases in our cluster system. The Replica Interface receives MOTs coming from the clients. The Propagator and the Receiver manage the sending and the reception (respectively) of transactions (MOT and MOR) inside messages on the network. Whenever the Receiver receives a MOT, it places it in the appropriate pending queue, used by the Refresher. Next, the Refresher executes the refresher algorithm to ensure strong consistency. Each time a MOT is chosen for execution by the Refresher, it is written to a running queue, which is then read by the Deliverer in FIFO order. The Deliverer submits the transactions read from the running queue to the DBMS.

For partially replicated configurations, when a MOT is composed of a sequence of reads and writes, the refresher at the target nodes must still assure the correct ordering, but the MOT execution must be delayed until its corresponding MOR is received. This is because the MOR is produced only after the commitment of the corresponding MOT at the origin node. At the target node, the content of the MOT (sequence of read and write operations) is replaced by the content of the MOR (sequence of write operations). Thus, at the target node, when the Receiver receives a MOR, it interacts with the Deliverer. At the origin node, the Log Monitor constantly checks the content of the DBMS log to detect whether replicated copies have been changed. For partially replicated configurations, whenever a MOT is composed of reads and writes, the Log Monitor detects the MOT commitment in order to produce the corresponding MOR. Then the Propagator multicasts the MOR to the involved target nodes.

Figure 3. Preventive Replication Manager architecture (at each node, the Replica Interface, Log Monitor, Propagator, Receiver, Refresher and Deliverer sit between the clients, the DBMS, the Global Data Placement and the network)

Let us now illustrate the refresher algorithm with an example execution. In Figure 4, we assume a simple configuration with 3 nodes (N1, N2 and N3) and 2 copies (R and S). N1 carries a multi-owner copy of R and a primary copy of S.

N2 carries a multi-owner copy of R, and N3 carries a secondary copy of S. The refreshment proceeds in 5 steps. In step 1, N1 (the origin node) receives a MOT from a client, through the Replica Interface, which reads S and updates R. For instance, the MOT can be the resulting read and write sequence produced by the transaction of Example 1. In step 2, N1 multicasts the MOT, using the Propagator, to the involved target nodes, i.e. N1 and N2. N3 is not concerned by the MOT because it only holds a secondary copy s. In step 3, the MOT can be performed using the refresher algorithm at N1. At N2, the MOT is also handled by the Refresher and then put in the running queue; however, it cannot yet be executed at this target node (N2 does not hold S) because the Deliverer needs to wait for the corresponding MOR (see step 4). In step 4, after the commitment of the MOT at the origin node, the Log Monitor produces the MOR and multicasts it to all involved target nodes. In step 5, N2 receives the MOR and the Receiver replaces the content of the MOT with that of the MOR. The Deliverer can then submit the MOT.

Figure 4. Example of preventive refreshment with partial configurations (5 steps over nodes N1, N2 and N3: the client submits MOT(rS, wR) to N1, which multicasts the MOT to N1 and N2, performs it, then multicasts MOR(wR), which N2 finally performs)
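To make steps 3 to 5 concrete, here is a hedged Java sketch of how a Deliverer could keep a not-locally-executable MOT in its running-queue slot and block until the Receiver substitutes the MOR's write sequence. The use of a CompletableFuture and all names are our own illustrative choices, not the prototype's implementation.

import java.util.*;
import java.util.concurrent.*;

// Sketch of the Deliverer side of steps 3-5: a MOT whose read set is not
// locally available keeps its slot in the running queue but blocks until the
// Receiver substitutes the corresponding MOR (write sequence).
public class DelivererSketch {

    /** A slot in the running queue: either the MOT itself is executable
     *  locally, or the slot is completed later with the MOR's writes. */
    static final class RunningSlot {
        final String motId;
        final CompletableFuture<List<String>> writeSequence = new CompletableFuture<>();
        RunningSlot(String motId) { this.motId = motId; }
    }

    private final BlockingQueue<RunningSlot> runningQueue = new LinkedBlockingQueue<>();

    /** Refresher: enqueue in chronological order; complete immediately if the
     *  node holds all copies needed by the MOT. */
    RunningSlot schedule(String motId, List<String> localWritesOrNull) {
        RunningSlot slot = new RunningSlot(motId);
        if (localWritesOrNull != null) slot.writeSequence.complete(localWritesOrNull);
        runningQueue.add(slot);
        return slot;
    }

    /** Receiver: the MOR produced at the origin node replaces the waiting MOT. */
    void onMorReceived(RunningSlot slot, List<String> morWrites) {
        slot.writeSequence.complete(morWrites);
    }

    /** Deliverer loop: submit slots to the DBMS strictly in FIFO order. */
    void deliverNext() throws InterruptedException, ExecutionException {
        RunningSlot slot = runningQueue.take();
        List<String> writes = slot.writeSequence.get();   // waits for the MOR if needed
        System.out.println("executing " + slot.motId + ": " + writes);
    }

    public static void main(String[] args) throws Exception {
        DelivererSketch deliverer = new DelivererSketch();
        RunningSlot waiting = deliverer.schedule("MOT-1", null);         // N2: must wait for MOR
        deliverer.onMorReceived(waiting, List.of("UPDATE R2 SET ..."));  // MOR arrives from N1
        deliverer.deliverNext();                                         // now executable, order kept
    }
}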

5 Validation

The refresher algorithm proposed in [8] for full preventive replication was shown to introduce a negligible loss of data freshness (almost equal to mutual consistency). The same freshness result applies to partial replication. In this section, we describe our implementation, the replication configurations we use and our performance model. Then we describe two experiments that study scale-up and speed-up.

5.1 Implementation

We did our implementation on a cluster of 32 nodes. Each node is configured with a 1 GHz Pentium 3 processor, 512 MB of memory and 40 GB of disk. The nodes are linked by a 1 Gb/s Myrinet network. We use Linux Mandrake 8.0, Java and the Spread toolkit, which provides a reliable FIFO message bus and a high-performance message service among the cluster nodes that is resilient to faults. We use the PostGRESQL Open Source DBMS at each node. We chose PostGRESQL because it is quite complete in terms of transaction support and easy to work with.

Our implementation has four modules: Client, Replicator, Network and Database Server. The Client module simulates the clients of the ASP. It submits multi-owner transactions randomly to any cluster node via RMI-JDBC, which implements the Replica Interface. Each cluster node hosts a Database Server and one instance of the Replicator module. For this validation, which stresses transaction performance, we implemented the Replicator module directly within PostGRESQL. The Replicator module implements all the system components necessary for a multi-owner node: Replica Interface, Propagator, Receiver, Refresher and Deliverer. Each time a transaction is to be executed, it is first sent to the Replica Interface, which checks whether the incoming transaction writes to replicated data. Whenever a transaction does not write to replicated data, it is sent directly to the local transaction manager. Even though we do not consider node failures in our performance evaluation, we implemented all the logs necessary for recovery in order to understand the complete behavior of the algorithm. The Network module interconnects all cluster nodes through the Spread toolkit.

5.2 Configurations

We use 3 configurations involving 3 copies R, S and T with 5 different numbers of nodes (4, 8, 16, 24 and 32). This choice makes it easy to vary the placement of copies while keeping a balanced distribution. In the Fully Replicated configuration (FR), all nodes hold the 3 copies, i.e. {R,S,T}. In the Partially Replicated 1 configuration (PR1), one fourth of the nodes holds {R,S,T}, another holds {R,S}, another holds {R,T} and another holds {S,T}. In the Partially Replicated 2 configuration (PR2), one fourth of the nodes holds {R,S,T}, another holds {R}, another holds {S} and another holds {T}.

Table 1. Placement of copies per configuration
  Fully Replicated (FR):        all nodes hold RST
  Partially Replicated 1 (PR1): one fourth of the nodes each holds RST, RS, RT, ST
  Partially Replicated 2 (PR2): one fourth of the nodes each holds RST, R, S, T
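As an illustration, the placement of Table 1 can be computed as follows. The helper below is a simple sketch with illustrative names (GROUPS, copiesAt) that assigns each quarter of the nodes the copy set prescribed by the chosen configuration; it is not part of our prototype.

import java.util.*;

// Sketch reproducing Table 1's placement: nodes are split into four equal
// groups and each group receives a fixed set of copies per configuration.
public class PlacementSketch {

    /** Copies held by each quarter of the nodes, per configuration (Table 1). */
    static final Map<String, List<Set<String>>> GROUPS = Map.of(
        "FR",  List.of(Set.of("R","S","T"), Set.of("R","S","T"), Set.of("R","S","T"), Set.of("R","S","T")),
        "PR1", List.of(Set.of("R","S","T"), Set.of("R","S"), Set.of("R","T"), Set.of("S","T")),
        "PR2", List.of(Set.of("R","S","T"), Set.of("R"), Set.of("S"), Set.of("T")));

    /** Copies stored at node index i (0-based) in a cluster of n nodes. */
    static Set<String> copiesAt(String configuration, int i, int n) {
        return GROUPS.get(configuration).get((i * 4) / n);   // four equal groups
    }

    public static void main(String[] args) {
        int n = 8;
        for (int i = 0; i < n; i++)
            System.out.println("PR2 node " + (i + 1) + " holds " + copiesAt("PR2", i, n));
    }
}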

5.3 Performance Model

Our performance model takes into account the multi-owner transaction size, the arrival rate and the number of multi-owner nodes. We vary the multi-owner transaction arrival rate (noted λ). We consider workloads that are either low (λ=3s) or bursty (λ=200ms). We define two sizes of multi-owner transactions (denoted |MOT|), determined by the number of write operations: small transactions have size 5, while long transactions have size 50. Multi-owner transactions are either Single Copy (SC), where only one replica is updated, or Multiple Copy (MC), where several replicas are involved. Thus, with our 3 copies, we have 3 SC transactions (on either {R}, {S} or {T}) and 7 MC transactions (on either {R}, {S}, {T}, {R,S}, {R,T}, {S,T} or {R,S,T}).

Table 2. Performance parameters
  Param.            Definition                                               Values
  |MOT|             Multi-owner transaction size                             5; 50
  Transaction Type  Involved replicas: single or multiple copies             SC (single), MC (multi)
  ltr               Long transaction ratio                                   66%
  λ                 Mean time between transactions (workload)                bursty: 200ms; low: 3s
  Nb-Multi-Owner    Number of replicated PostGRESQL instances                4, 8, 16, 24, 32
  M                 Number of update transactions submitted for the tests
  Max + ε           Delay introduced for submitting a MOT                    ms
  Configuration     Replication of R, S and T                                FR, PR1, PR2

To understand the behavior of our algorithm in the presence of short and long transactions, we set a parameter called long transaction ratio to ltr = 66 (66% of update transactions are long). Updates are done on the same attribute of a different tuple. Finally, our copies R, S and T have 2 attributes (the first attribute is used as primary key and the second as data); each copy carries the same number of tuples and no indexes. The parameters of the performance model are described in Table 2.

5.4 Scale-up Experiment

This experiment studies the algorithm's scalability. That is, for the same set of incoming multi-owner transactions, scalability is achieved when the response times remain the same as the number of nodes increases. We consider SC and MC transactions in bursty and low workloads (λ = 200ms and λ = 20s), varying the number of nodes for each configuration (FR, PR1 and PR2). SC and MC transactions randomly update R, S or T. For each test, we measure the average response time per transaction. The duration of this experiment is the time to submit 50 transactions.

The experiment results (see Figure 5) clearly show that scalability is achieved for all tests. In contrast with synchronous algorithms, our algorithm has linear response time behavior since, for each incoming multi-owner transaction, it requires only the multicast of n messages for nodes that carry all required copies plus 2n messages for nodes that do not carry all required copies, where n is the number of target nodes.

The results also show the impact of the configuration on performance. First, on low workloads (see Figures 5.b and 5.d), the impact is very small (transaction response times are similar for all configurations) because at each node each transaction executes alone. In bursty workloads (see Figures 5.a and 5.c), the number of transactions increases; however, several of them execute in parallel due to partial replication.

Thus partial replication increases inter-transaction parallelism by allowing different nodes to process different transactions. The performance improvement of PR over FR is best for SC transactions (see Figure 5.a) and quite significant (a factor of 2.33 for PR1 and of 7 for PR2). For MC transactions (see Figure 5.c), the performance improvement is still a factor of 3.5 for PR1 but only of 1.6 for PR2. This is because FR performs relatively better for MC transactions than for SC transactions. For SC transactions, PR2, which stores individual copies of R, S and T at different nodes, outperforms PR1. But for MC transactions, PR1 outperforms PR2. Thus the configuration and the placement of the copies should be tuned to the expected types of transactions.

Figure 5. Scalability results: response time (ms) versus number of nodes for FR, PR1 and PR2; a) bursty workload - SC, b) low workload - SC, c) bursty workload - MC, d) low workload - MC

5.5 Speed-up Experiment

The second experiment studies the performance improvement (speed-up) for read queries when we increase the number of nodes. To accomplish this, we modified the previous experiment by introducing clients that submit queries. We consider MC read queries in bursty and low workloads (λ = 200ms and λ = 20s), varying the number of nodes for each configuration (FR, PR1 and PR2).

Since each read query involves only one node, the performance difference between SC and MC queries is negligible. Thus, we only consider MC queries, which are more general. The number of read clients is 128. Each client is associated with one node and we produce an even distribution of clients over the nodes; thus, the number of read clients per node is 128 divided by the number of nodes. The clients submit queries sequentially while the experiment is running. For each test, we measured the throughput of the cluster, i.e. the number of read queries per second. The duration of this experiment is the time to submit 1000 read queries.

The experiment results (see Figure 6) clearly show that increasing the number of nodes improves the cluster's throughput. For bursty workloads (Figure 6.a), the improvement obtained with partial replication is moderate because the nodes are already overloaded by the update transactions. The improvement is due to the fact that the nodes are less loaded with PR1 and PR2 than with FR.

Figure 6. Speed-up results: queries per second versus number of nodes for FR, PR1 and PR2; a) bursty workload - MC, b) low workload - MC

For low workloads (Figure 6.b), the speed-up obtained for all three configurations is quite significant and roughly the same. As in the previous experiment, the type of configuration does not matter here because the nodes are not overloaded by update transactions.

6 Conclusion

Preventive replication provides strong copy consistency without the constraints of synchronous replication. In this paper, we addressed the problem of scaling up preventive replication to large cluster configurations. The main contribution is a multi-master refreshment algorithm that prevents conflicts for partially replicated databases. The algorithm supports general multi-owner transactions composed of read and write operations. Cluster nodes can host autonomous databases that are considered as black boxes; thus our solution achieves full node autonomy. We described the system architecture components necessary to implement the refresher algorithm. We validated our refresher algorithm over a cluster of 32 nodes running PostGRESQL.

To stress transaction performance, we implemented our Replicator module directly within PostGRESQL. Our experimental results showed that our multi-master refreshment algorithm scales up very well. In contrast with synchronous algorithms, our algorithm has linear response time behavior. We also showed the impact of the configuration on performance. For low workloads, full replication performs almost as well as partial replication. But for bursty workloads in partially replicated configurations, a significant number of transactions are executed in parallel, so the performance improvement of partial replication can be quite significant (by a factor of up to 7). The speed-up experimental results also showed that increasing the number of nodes can significantly improve the query throughput. However, the performance gains strongly depend on the types of transactions (single versus multiple copy) and on the configuration. Thus an important conclusion is that the configuration and the placement of the copies should be tuned to the expected types of transactions.

References

1. S. Gançarski, H. Naacke, E. Pacitti, P. Valduriez: Parallel Processing with Autonomous Databases in a Cluster System. Int. Conf. on Cooperative Information Systems (CoopIS).
2. L. George, P. Minet: A FIFO Worst Case Analysis for a Hard Real-Time Distributed Problem with Consistency Constraints. Int. Conf. on Distributed Computing Systems (ICDCS).
3. V. Hadzilacos, S. Toueg: Fault-Tolerant Broadcasts and Related Problems. In Distributed Systems, 2nd Edition, S. Mullender (ed.), Addison-Wesley.
4. M.D. Hill et al.: Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors. ACM Trans. on Computer Systems, 11(4).
5. T. Özsu, P. Valduriez: Principles of Distributed Database Systems. Prentice Hall, Englewood Cliffs, New Jersey, 2nd edition.
6. E. Pacitti, P. Valduriez: Replicated Databases: Concepts, Architectures and Techniques. Network and Information Systems Journal, Hermès, 1(3).
7. E. Pacitti, P. Minet, E. Simon: Replica Consistency in Lazy Master Replicated Databases. Distributed and Parallel Databases, Kluwer Academic, 9(3).
8. E. Pacitti, T. Özsu, C. Coulon: Preventive Multi-Master Replication in a Cluster of Autonomous Databases. Int. Conf. on Parallel and Distributed Computing (Euro-Par 2003), Klagenfurt, Austria.
9. K. Tindell, J. Clark: Holistic Schedulability Analysis for Distributed Hard Real-Time Systems. Microprocessors and Microprogramming, 40.
10. P. Valduriez: Parallel Database Systems: Open Problems and New Issues. Int. Journal on Distributed and Parallel Databases, 1(2).


More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Consistency and Replication Jia Rao http://ranger.uta.edu/~jrao/ 1 Reasons for Replication Data is replicated for the reliability of the system Servers are replicated for performance

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Mutual consistency, what for? Replication. Data replication, consistency models & protocols. Difficulties. Execution Model.

Mutual consistency, what for? Replication. Data replication, consistency models & protocols. Difficulties. Execution Model. Replication Data replication, consistency models & protocols C. L. Roncancio - S. Drapeau Grenoble INP Ensimag / LIG - Obeo 1 Data and process Focus on data: several physical of one logical object What

More information

On the Study of Different Approaches to Database Replication in the Cloud

On the Study of Different Approaches to Database Replication in the Cloud Melissa Santana et al.: On the Study of Different Approaches to Database Replication in the Cloud TR ITI-SIDI-2012/011 On the Study of Different Approaches to Database Replication in the Cloud Melissa

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington Distributed storage systems

More information

S-Store: Streaming Meets Transaction Processing

S-Store: Streaming Meets Transaction Processing S-Store: Streaming Meets Transaction Processing H-Store is an experimental database management system (DBMS) designed for online transaction processing applications Manasa Vallamkondu Motivation Reducing

More information

Lazy Database Replication with Freshness Guarantees

Lazy Database Replication with Freshness Guarantees Lazy Database Replication with Freshness Guarantees Khuzaima Daudjee School of Computer Science University of Waterloo Waterloo, Ontario, Canada kdaudjee@uwaterloo.ca Kenneth Salem School of Computer Science

More information

DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK. UNIT I PART A (2 marks)

DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK. UNIT I PART A (2 marks) DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK Subject Code : IT1001 Subject Name : Distributed Systems Year / Sem : IV / VII UNIT I 1. Define distributed systems. 2. Give examples of distributed systems

More information

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union

More information

Chapter 2 Distributed Information Systems Architecture

Chapter 2 Distributed Information Systems Architecture Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Distributed Information Systems Architecture Chapter Outline

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

Database Replication in Tashkent. CSEP 545 Transaction Processing Sameh Elnikety

Database Replication in Tashkent. CSEP 545 Transaction Processing Sameh Elnikety Database Replication in Tashkent CSEP 545 Transaction Processing Sameh Elnikety Replication for Performance Expensive Limited scalability DB Replication is Challenging Single database system Large, persistent

More information

Module 8 Fault Tolerance CS655! 8-1!

Module 8 Fault Tolerance CS655! 8-1! Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!

More information

Consistency & Replication

Consistency & Replication Objectives Consistency & Replication Instructor: Dr. Tongping Liu To understand replication and related issues in distributed systems" To learn about how to keep multiple replicas consistent with each

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication DB Reading Group Fall 2015 slides by Dana Van Aken Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

Cache Coherence in Distributed and Replicated Transactional Memory Systems. Technical Report RT/4/2009

Cache Coherence in Distributed and Replicated Transactional Memory Systems. Technical Report RT/4/2009 Technical Report RT/4/2009 Cache Coherence in Distributed and Replicated Transactional Memory Systems Maria Couceiro INESC-ID/IST maria.couceiro@ist.utl.pt Luis Rodrigues INESC-ID/IST ler@ist.utl.pt Jan

More information

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication Important Lessons Lamport & vector clocks both give a logical timestamps Total ordering vs. causal ordering Other issues in coordinating node activities Exclusive access to resources/data Choosing a single

More information

Consistency and Replication 1/62

Consistency and Replication 1/62 Consistency and Replication 1/62 Replicas and Consistency??? Tatiana Maslany in the show Orphan Black: The story of a group of clones that discover each other and the secret organization Dyad, which was

More information

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Consistency and Replication Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Reasons for Replication Reliability/Availability : Mask failures Mask corrupted data Performance: Scalability

More information

Important Lessons. Today's Lecture. Two Views of Distributed Systems

Important Lessons. Today's Lecture. Two Views of Distributed Systems Important Lessons Replication good for performance/ reliability Key challenge keeping replicas up-to-date Wide range of consistency models Will see more next lecture Range of correctness properties L-10

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees

Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees Department of Computer Science Institute of Information Systems Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees Fuat Akal Can Türker Hans-Jörg Schek Torsten Grabs Yuri Breitbart

More information

Cluster-Based Scalable Network Services

Cluster-Based Scalable Network Services Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Lecture XII: Replication

Lecture XII: Replication Lecture XII: Replication CMPT 401 Summer 2007 Dr. Alexandra Fedorova Replication 2 Why Replicate? (I) Fault-tolerance / High availability As long as one replica is up, the service is available Assume each

More information