RESEARCH PROJECT: Partial Replication and Snapshot Isolation at the Middleware Level

Student: Damián Serrano García
Supervisors: Marta Patiño Martínez, Ricardo Jiménez Peris
Academic year: 2005/2006

Contents

1 Introduction
2 System Model
3 Analytical Model
4 Simulation
4.1 Partial Replication vs. Full Replication
4.2 Partial Replication
4.3 Pure Partial Replication
5 The Protocol
5.1 Protocol Outline
6 Related Work
7 Conclusions and Future Work

List of Figures

1 Scale out for different values of r. (a) w = 0, (b) w = 0.2, (c) w = 0.4
2 so, so_T, so_P, c_T and c_P for w = 0.2
3 Scale out of a pure partially replicated system for different values of r
4 Replica control protocol

1 Introduction

Data replication increases the availability and enhances the scalability of cluster-based information systems. If a replica (a cluster node) fails, the remaining replicas can take over, and the more replicas are added to the cluster, the more load it can handle. Nicola and Jarke describe several data replication models in [15], which combine the fraction of replicated data with the number of data copies. Briefly, the replication models are:

All objects to all sites (full replication): all data is replicated at all replicas.
All objects to some sites (one-dimensional partial replication): the replication degree $r$ is the number of copies of every data object.
Some objects to all sites (one-dimensional partial replication): the replication degree $r$ is the fraction of data that is replicated at all replicas (some data may have only one copy).
Some objects to some sites (two-dimensional partial replication): the replication degree $r$ is a pair $(r_1, r_2)$, where $r_1$ is the fraction of data that is replicated and $r_2$ is the number of copies of the replicated data.
Replication-per-object: the replication degree $r$ is a function that gives the number of copies of each data object.

The main challenge of replication, however, is to prevent replicas from diverging in the presence of updates. This challenge is known as replica control. Gray et al. [8] classify replica control protocols by two parameters: where updates can be performed and when updates are propagated to the other replicas. On the one hand, the parameter where distinguishes primary copy from update everywhere. The primary copy approach only permits updates at one particular replica (the master replica), which then propagates the changes to the rest, whereas the update everywhere approach (or multi-master replication) allows changes to take place at any replica.
On the other hand, the parameter when distinguishes two further, orthogonal approaches: lazy replication and eager replication. A replica control protocol is lazy (also called asynchronous) if changes are propagated after a transaction commits, and eager (also known as synchronous) if changes are propagated as part of the transaction.

The advantage of a primary copy protocol is that it easily prevents two replicas from diverging because of updates, but it also introduces a potential bottleneck and a single point of failure. Update everywhere permits any replica to be modified, so it is more flexible but needs a more complex protocol than primary copy. Eager protocols maintain consistency easily, but require extra communication between replicas before a transaction commits. In lazy protocols inconsistencies may arise, and restoring a consistent state can be non-trivial because the inconsistent data is already committed [13]. This work deals with eager update everywhere replication protocols.

To keep data copies consistent, every update imposes some work on each replica that holds a copy of the updated data. That is, a fraction of the processing capacity of every replica is used for maintaining consistency. This phenomenon is known as update propagation overhead, and scalability depends on it: the higher the update propagation overhead, the lower the scalability. In a fully replicated database the update propagation overhead is incurred at every replica for any update performed at any replica, whereas in a partially replicated database only the replicas holding a copy of the updated data are affected. In this work we study the effects of update propagation overhead on scalability.

Scalability is also limited by the serializability property, which must be provided by any replicated database. A cluster-based replicated database ensures serializability if the execution of transactions is equivalent to a serial execution in a non-replicated database. Transactions executed concurrently in a replicated database must not conflict. Two concurrent transactions conflict if they are executed at different replicas, the sets of data they access intersect, and at least one of them updates data in the intersection. Conflicts can therefore appear and must be resolved. Resolving conflicts usually causes some transactions to abort and, depending on the correctness criterion used, read/write or write/write conflicts may appear.

One-copy snapshot isolation [7] is one of the correctness criteria used for ensuring serializability, and it has gained considerable relevance lately. One-copy snapshot isolation uses snapshot isolation [1] as its isolation level. With snapshot isolation, a transaction T reads from a database snapshot, that is, a database copy that includes all changes made by the transactions that had committed when T started. Write/write conflicts can take place when two concurrent transactions update the same data, but only one of them will be aborted. Writes never conflict with reads because each transaction always reads from a snapshot. Thus, by avoiding read/write conflicts, one-copy snapshot isolation opens the possibility of greater scalability.
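The snapshot isolation behaviour just described can be illustrated with a small sketch. It is a toy, single-node multi-version store written for this purpose, not the mechanism of any particular DBMS; the class `SnapshotStore` and its methods are hypothetical names, and the conflict rule used is the usual first-committer-wins check.

```python
class SnapshotStore:
    """Toy multi-version store illustrating snapshot isolation (hypothetical)."""

    def __init__(self):
        self.commit_seq = 0      # number of transactions committed so far
        self.versions = {}       # key -> list of (commit_seq, value), oldest first

    def begin(self):
        # A transaction remembers which commits were visible when it started.
        return {"start": self.commit_seq, "writes": {}}

    def read(self, txn, key):
        # Own writes first, then the newest version committed before the start.
        if key in txn["writes"]:
            return txn["writes"][key]
        for seq, value in reversed(self.versions.get(key, [])):
            if seq <= txn["start"]:
                return value
        return None

    def write(self, txn, key, value):
        txn["writes"][key] = value   # buffered until commit

    def commit(self, txn):
        # Write/write check: abort if a concurrent transaction has already
        # committed a newer version of any key in this writeset.
        for key in txn["writes"]:
            history = self.versions.get(key, [])
            if history and history[-1][0] > txn["start"]:
                return False
        self.commit_seq += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.commit_seq, value))
        return True
```

In this sketch a reader never blocks or aborts because of a concurrent writer: it keeps seeing the version that was committed when it started. Only two concurrent transactions that write the same key collide, and the second one to commit is the one that aborts.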
The contributions that guide this work are the following. First, we explore partial replication as an additional mechanism to overcome the scalability ceiling of current approaches. To that end, we define the system model (Section 2) and then design an analytical model, described in Section 3, for an all objects to some sites replication model. We then show the benefits of partial replication by discussing the results that the analytical model yields in simulation (Section 4). Finally, we exploit new correctness criteria such as one-copy snapshot isolation by developing a protocol for a partially replicated database (Section 5) at the middleware level. Working at the middleware level provides independence from the database system and allows for heterogeneous systems.

2 System Model

In this section we introduce the system model. We first describe the components of a partially replicated database and then explain how transactions are executed.

A partially replicated database consists of a set of nodes $N = \{N_1, N_2, \ldots, N_n\}$. Nodes, also called sites or replicas, communicate by exchanging messages. Sites may fail by crashing or because of network errors; failed sites can be detected and recovered. After recovering, sites can rejoin the system.

The set of nodes is divided into two disjoint subsets: $T = \{T_1, \ldots, T_t\}$ and $P = \{P_1, \ldots, P_p\}$. We call the first one the total replicas set because every site in $T$ has a full copy of the database. The second is called the partial replicas set; each site in it holds only a fraction of the database. We define $r$ as the replication degree, i.e. the number of database copies in the partial replicas set ($r = p$ means full replication).

We follow a replication protocol like the one described in [14], adapted to the partial replication context. A client interacts with the database by sending transactions to a replica; this replica is called the coordinator for the transactions of that client. A transaction is a sequence of read or write operations. Neither the tables accessed by each operation nor the kind of operation (i.e. query or update) needs to be known in advance. If the coordinator cannot execute an operation, it must redirect the operation to a replica that can. A coordinator cannot execute an operation when it does not store a copy of all the accessed data (e.g. the operation updates a table that is not replicated at the coordinator). Total replicas can execute any operation, so a total replica acting as coordinator never redirects an operation. An operation is called local at the node where it is executed and remote at the other nodes.

After executing all the operations of a transaction, the client sends the transaction commit to the coordinator. The coordinator then propagates the changes made by the transaction to the rest of the replicas. Upon receiving the changes, each replica tests whether there is any conflicting concurrent transaction and, if there is none, all the replicas write the changes.
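The redirection rule of the system model can be sketched as follows. This is a minimal illustration under stated assumptions: the replica and table names, the `placement` map and the choice of "first capable replica in name order" are all hypothetical, since the text only requires that the operation reach some replica able to execute it.

```python
def can_execute(replica_tables, op_tables):
    """An operation is local at a replica that stores every table it accesses."""
    return set(op_tables) <= replica_tables


def route(op_tables, coordinator, placement):
    """Return the name of the replica that should execute the operation.

    placement maps a replica name to the set of tables it stores.
    """
    if can_execute(placement[coordinator], op_tables):
        return coordinator
    # Redirect: pick the first replica (in name order) holding all the tables.
    for name in sorted(placement):
        if can_execute(placement[name], op_tables):
            return name
    raise LookupError("no replica stores all accessed tables")


# Hypothetical placement: one total replica and two partial replicas.
placement = {
    "T1": {"item", "orders", "customer"},   # total replica: full copy
    "P1": {"item", "orders"},               # partial replica
    "P2": {"customer"},                     # partial replica
}
```

With this placement, an operation on `customer` submitted to `P1` is redirected to `P2`, an operation touching both `item` and `customer` can only run at `T1`, and nothing submitted to `T1` is ever redirected.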
3 Analytical Model

In this section we describe the analytical model used for measuring the scalability of partially replicated databases. First we define some concepts and then we explain the model.

As in [12], we use the scale out, which indicates how much of the nominal capacity of the entire database is used for performing local transactions [11]. The analytical model is based on the system model given in Section 2. We list some assumptions before describing the model:

Each replica can execute $C$ transactions per second.
Database copies are evenly distributed among partial replicas. If there are $r$ database copies in the partial replicas and $p$ partial replicas, the database fraction stored at each partial replica is $r/p$.
The workload profile generates a fraction $w$ of update transactions at every replica.
The cost of executing a remote update, $rt$, is smaller than the cost of executing a local transaction, $lt$. A local transaction may include operations, such as a large, heavy query, that need not be executed again at remote sites. We capture this behaviour with a parameter, the writing overhead $wo = rt/lt$; $wo = 1$ means that there is no difference between the cost of local transactions and the cost of remote transactions.

The global system scale out, $so$, is the sum of the scale out of each site. In a partially replicated database with total and partial replicas, there is a scale out factor for total replicas, $so_T$, and a scale out factor for partial replicas, $so_P$. If we define $l_T$ as the number of local transactions per second executed at each total replica and $l_P$ as the number of local transactions per second executed at each partial replica, we obtain $so_T = l_T / C$ and $so_P = l_P / C$. For a database formed by $n$ replicas, $t$ of them total and $p$ of them partial, the global system scale out is:

$$ so = \frac{t \, l_T + p \, l_P}{C} \tag{1} $$

Each replica executes its local transactions and a number of remote updates. The transactions executed by a total replica $T_i$ are its local transactions ($l_T$), the remote updates originated at the other total replicas ($(t-1) \, l_T \, w$) and the updates originated at every partial replica ($p \, l_P \, w$). The transactions executed at $T_i$ can then be expressed as:

$$ C = l_T + \left( (t-1) \, l_T + p \, l_P \right) w \, wo \tag{2} $$

Similarly, we obtain the expression for $C$ at a partial replica. The transactions executed by a partial replica $P_i$ are its $l_P$ local transactions and a number of remote updates. The total replicas execute $t \, l_T$ transactions, which may access any data, but only a fraction $\frac{r}{p} w$ of them generate remote updates at $P_i$, because $P_i$ only stores a fraction $r/p$ of the database. Remote updates executed at $P_i$ that originate at other partial replicas can only be generated by updates that access data stored at $P_i$, i.e. updates originated at the replicas holding the other $r - 1$ copies.
So, the partial replica capacity is:

$$ C = l_P + \left( (r-1) \, l_P + t \, l_T \, \frac{r}{p} \right) w \, wo \tag{3} $$

Given equations 2 and 3, we can solve for $l_T$ and $l_P$:

$$ l_T = \frac{C - p \, l_P \, w \, wo}{1 + (t-1) \, w \, wo} \tag{4} $$

$$ l_P = \frac{C - t \, \frac{r}{p} \, l_T \, w \, wo}{1 + (r-1) \, w \, wo} \tag{5} $$

We can see in equation 4 that $l_T$ could be negative if $C < p \, l_P \, w \, wo$ (and similarly for $l_P$). In that case the remote work exceeds the capacity, i.e. partial replicas overload total replicas, or vice versa if $l_P < 0$. To avoid these undesirable situations, we must control the number of transactions that can be executed by total replicas or partial replicas. We introduce two more parameters $c_T, c_P \in [0, 1]$ that reduce the capacity $C$: the former prevents total replicas from overloading partial replicas, and the latter prevents partial replicas from overloading total replicas.

$$ l_T = \frac{C \, c_T - p \, l_P \, w \, wo}{1 + (t-1) \, w \, wo} \tag{6} $$

$$ l_P = \frac{C \, c_P - t \, \frac{r}{p} \, l_T \, w \, wo}{1 + (r-1) \, w \, wo} \tag{7} $$

Finally, dividing $l_T$ and $l_P$ by $C$, we obtain the expressions for $so_T$ and $so_P$:

$$ so_T = \frac{c_T - p \, so_P \, w \, wo}{1 + (t-1) \, w \, wo} \tag{8} $$

$$ so_P = \frac{c_P - t \, \frac{r}{p} \, so_T \, w \, wo}{1 + (r-1) \, w \, wo} \tag{9} $$

The values of $c_T$ and $c_P$ are obtained with the simplex algorithm. The goal is to maximise the global system scale out, $so = t \, so_T + p \, so_P$, subject to the restrictions $so_T, so_P \geq 0$ and $c_T, c_P \in [0, 1]$.

4 Simulation

In this section we show the results of applying different parameter values to the model explained in Section 3. First, we compare the scale out of a fully replicated database with that of a partially replicated database (with total and partial replicas). Then we analyse the scale out of the partially replicated database in more depth. Finally, we analyse the scale out of a pure partially replicated database (without total replicas).

4.1 Partial Replication vs. Full Replication

Here we compare the scale out obtained with partial replicas with the scale out obtained without them (that is, with a fully replicated database). The parameters used are described in Table 1.

Table 1: Parameters
n = 100
t = n - p
p = 10, 20, ..., 90
r = 10, 20, 30, 40, 50, p
w = 0, 0.2, 0.4
wo = 0.15

Each graph in Figure 1 shows the simulated global scale out ($so$) for a different value of $w$: $w = 0$ in Figure 1(a), $w = 0.2$ in Figure 1(b) and $w = 0.4$ in Figure 1(c). Each graph contains one curve for each value of $r$.

Figure 1(a) shows the simulation results for $w = 0$, i.e. no updates in the workload. Without updates the workload consists only of queries, and therefore no capacity is used for performing remote transactions. So the scale out is always maximal, both in total replication (the $r = p$ line) and in partial replication, independently of the number of database copies $r$.

The scale out in the presence of updates is shown in Figures 1(b) and 1(c). Looking at Figure 1(b) we can say that, as expected, with updates the scale out increases with the number of partial replicas.
However, the more copies the system has, the lower the scale out: e.g. if $p = 40$, the scale out is close to 35 with $r = 10$, but close to 25 with $r = 30$. Moreover, there is a maximum scale out that cannot be exceeded and that does not depend on the number of copies. It marks the point beyond which the global system scale out no longer increases when more partial replicas are added, e.g. if $r = 30$ the scale out stays at the same value, close to 35, for $p = 60, \ldots, 90$. We think that the bound exists because partial replicas overload total replicas. Increasing the proportion of updates ($w = 0.4$ in Figure 1(c)) leaves practically no difference between total replication and partial replication.

Figure 1: Scale out for different values of r. (a) w = 0, (b) w = 0.2, (c) w = 0.4

4.2 Partial Replication

As seen in Figure 1(b), we believe that partial replicas overload total replicas and that, therefore, beyond a certain number of partial replicas the scale out is bounded regardless of the number of copies. Here we analyse the scale out in more depth. The parameters are summarised in Table 2.

Table 2: Parameters
n = 100
t = n - p
p = 10, 20, ..., 90
r = 10, 20, 30, 40, 50
w = 0.2
wo = 0.15

We show the simulation results for the scale out of total replicas ($so_T$), the scale out of partial replicas ($so_P$), the proportion of total replica capacity ($c_T$) and the proportion of partial replica capacity ($c_P$).

In Figure 2(a) we can see that the more partial replicas there are, the lower the scale out of the total replicas, e.g. for $r = 10$, $so_T$ is close to 0.2 if $p = 20$, but close to 0.1 if $p = 30$. There is a value of $p$ beyond which the scale out of the total replicas is zero, e.g. if $p > 60$, the scale out with $r = 20$ is always zero. Looking at Figure 2(c) we can see that total replicas use their full capacity even if $so_T = 0$, e.g. if $p > 60$, $c_T$ with $r = 20$ is always maximal and $so_T = 0$. That is, total replicas are always executing remote updates from partial replicas.

We can confirm that partial replicas overload total replicas by looking at the curve for $r = 10$, as an example, in Figures 2(b) and 2(d). The scale out of partial replicas (Figure 2(b)) increases with the number of partial replicas $p$ until $p = 40$ and then decreases as $p$ grows further. The maximum value of $so_P$ (at $p = 40$) coincides with the minimum number of partial replicas needed to reach the maximum value of $so$ (Figure 1(b)). After that, $so_P$ decreases because the proportion of partial replica capacity decreases as well, e.g. $c_P$ with $r = 10$ (Figure 2(d)) is close to 0.8 if $p = 40$, but close to 0.6 if $p = 60$. So, if $p > 40$, the capacity proportion of partial replicas, and therefore their scale out, decreases to prevent partial replicas from overloading total replicas.

Figure 2: so, so_T, so_P, c_T and c_P for w = 0.2
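The saturation effect discussed in this section can be reproduced numerically from the model of Section 3. The sketch below is an illustration under two assumptions that are not in the text: capacities are normalised to C = 1, and c_T and c_P are read as the fraction of capacity each kind of replica actually uses, so the two capacity equations become "at most 1" constraints. Because the problem has only two unknowns, the sketch enumerates the vertices of the feasible region instead of running the simplex algorithm; `scale_out` is a hypothetical helper name.

```python
def scale_out(t, p, r, w, wo):
    """Maximise so = t*so_T + p*so_P under the per-replica capacity limits.

    Returns (so, so_T, so_P) with capacities normalised to C = 1.
    """
    a = w * wo   # cost of one remote update relative to a local transaction
    # Total replica:   so_T*(1 + (t-1)*a) + so_P*p*a            <= 1
    # Partial replica: so_T*t*(r/p)*a     + so_P*(1 + (r-1)*a)  <= 1
    rows = []
    if t > 0:
        rows.append((1 + (t - 1) * a, p * a))
    if p > 0:
        rows.append((t * r / p * a, 1 + (r - 1) * a))

    def feasible(x, y):
        return x >= -1e-12 and y >= -1e-12 and all(
            cx * x + cy * y <= 1 + 1e-9 for cx, cy in rows)

    # Candidate vertices: origin, axis intercepts, constraint intersection.
    candidates = [(0.0, 0.0)]
    for cx, cy in rows:
        if cx > 0:
            candidates.append((1 / cx, 0.0))
        if cy > 0:
            candidates.append((0.0, 1 / cy))
    if len(rows) == 2:
        (a1, b1), (a2, b2) = rows
        det = a1 * b2 - a2 * b1
        if abs(det) > 1e-12:
            # Cramer's rule for a1*x + b1*y = 1 and a2*x + b2*y = 1.
            candidates.append(((b2 - b1) / det, (a1 - a2) / det))

    so_t, so_p = max((v for v in candidates if feasible(*v)),
                     key=lambda v: t * v[0] + p * v[1])
    return t * so_t + p * so_p, so_t, so_p
```

Under these assumptions, `scale_out(10, 90, 30, 0.2, 0.15)` returns `so_T = 0`: the total replicas spend all their capacity on remote updates, which is the overload situation described for Figure 2. Calling the function with `t = 0` evaluates a system without total replicas, where only the partial replica constraint remains.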

4.3 Pure Partial Replication

Figure 3 shows the scale out curves of a pure partially replicated database, that is, one without total replicas, for different values of $r$. We want to analyse here the scale out behaviour of that kind of replication. The parameters are summarised in Table 3.

Table 3: Parameters
t = 0
p = 10, 20, ..., 90
r = 10, 20, 30, 40, 50
w = 0.2
wo =

Figure 3: Scale out of a pure partially replicated system for different values of r

We can see that, regardless of the value of $r$, the scale out of a pure partially replicated database is always linear: it always increases when the number of partial replicas grows. There is no bound, and for a given number of partial replicas the scale out depends only on the number of copies. For example, the scale out is close to 60 if $r = 10$ and $p = 80$, but close to 40 if $r = 40$ and $p = 80$. Therefore, the more database copies $r$ there are, the lower the scale out.

To conclude, we can say that having total replicas in a partially replicated database avoids distributed transactions, but the scale out is bounded and clearly close to the scale out of total replication.

5 The Protocol

In this section we propose a replica control protocol for a pure partially replicated database that follows the system model described in Section 2. The protocol allows distributed transactions and uses one-copy snapshot isolation as its correctness criterion.

5.1 Protocol Outline

Figure 4 shows the protocol, which is described in terms of events at a replica $R_k$. The auxiliary functions, procedures and objects used in the protocol are described in the function table at the end of this section. Each replica $R_k$ executes the same protocol and communicates with the others by exchanging messages. We assume that a multicast with total order (multicastTO) is available. The total order property ensures that all messages multicast with total order are delivered to all destinations in the same order. Replicas execute transactions, and then a validation phase decides whether a transaction commits or not. The protocol needs some information at each replica: ws_list (a list of already validated transactions) and to_commit_queue (a queue of transactions ready to commit).

A client sends an operation $op_j$ of a transaction $t_i$ to a replica $R_k$, and $R_k$ is marked as the coordinator of $t_i$. The coordinator of $t_i$ gets its current database snapshot and associates it with $t_i$. Then, if $op_j$ is local at $R_k$, $op_j$ is executed in the snapshot of $t_i$, whereas if $op_j$ is not local at $R_k$, $R_k$ sends $op_j$ to another replica $R_q$ that can execute it. After each operation is executed, its results are returned to the client. When the client requests the commit of $t_i$, the writeset of $t_i$ is retrieved, $t_i$ is locally validated and $t_i$ is submitted to the rest of the replicas. Moreover, $t_i$ receives an identifier $t_i$.cert holding the identifier of the last committed transaction, which is used to identify concurrent multicasts. Once $t_i$ (sent in total order) is delivered to all replicas, it is globally validated by every replica and, if validation succeeds, $t_i$ is added to ws_list and to to_commit_queue. Otherwise, $t_i$ is aborted. Finally, when $t_i$ becomes the first transaction in to_commit_queue, it is committed.

Local validation of $t_i$ looks for write/write conflicts by checking that the writeset of $t_i$ does not intersect the writeset of any transaction in to_commit_queue. Already committed transactions were validated by the underlying DBMS. Global validation checks for write/write conflicts with concurrently submitted transactions, i.e. the transactions $t_j$ in ws_list that the comparison of $t_i$.cert and $t_j$.cert identifies as concurrent with $t_i$.
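The validation and commit steps outlined above can be sketched as follows. This is an illustrative approximation, not the protocol of Figure 4 itself: the total-order multicast is simulated by delivering the same list of messages to every replica in the same order, and the concurrency test is an assumption of this sketch (a transaction records how many transactions had been validated when it was prepared, and conflicts only with writesets validated after that point). The class `Replica` and its method names are hypothetical.

```python
class Replica:
    """Validation and commit steps of one replica (illustrative only)."""

    def __init__(self):
        self.to_commit_queue = []     # validated transactions waiting to commit
        self.ws_list = []             # (validation_seq, keys) of validated txns
        self.validated = 0            # transactions validated so far
        self.last_committed_tid = 0
        self.data = {}

    def prepare(self, tid, writeset):
        """Coordinator side: local validation before the total-order multicast."""
        local_ok = all(writeset.keys().isdisjoint(q["writeset"])
                       for q in self.to_commit_queue)
        # Assumption: cert is the validation count seen by the coordinator.
        return {"tid": tid, "writeset": writeset,
                "local_validation": local_ok, "cert": self.validated}

    def deliver(self, msg):
        """Every replica: global validation, run in total-order delivery order."""
        conflict = any(
            seq > msg["cert"] and not msg["writeset"].keys().isdisjoint(keys)
            for seq, keys in self.ws_list)
        if not msg["local_validation"] or conflict:
            return False              # the transaction aborts everywhere
        self.validated += 1
        self.ws_list.append((self.validated, set(msg["writeset"])))
        self.to_commit_queue.append(msg)
        return True

    def commit_next(self):
        """Apply the writeset at the head of the queue and commit it."""
        msg = self.to_commit_queue.pop(0)
        self.data.update(msg["writeset"])
        self.last_committed_tid += 1
        return msg["tid"]
```

Because every replica processes the same messages in the same order and the decision depends only on that history, all replicas reach the same commit or abort outcome without an extra voting round. That is the role the total order plays in this sketch.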
Function                  Specification
sender(op_j)              Returns the sender of operation op_j
coordinator(t_i)          Returns the replica that coordinates transaction t_i
first(op_j, t_i)          Returns true if op_j is the first operation of transaction t_i
getsnapshot(R_k, t_i)     Associates a database snapshot with transaction t_i
local(op_j, R_k)          Returns true if op_j can be executed at R_k
snapshot(t_i)             Gets the snapshot in which t_i is being executed
execute(op_j, s)          Executes operation op_j in the database in the given snapshot s
send(m, c)                Sends a message m to c
redirect(op_j)            Sends operation op_j to a replica that can execute op_j
getwriteset(t_i)          Retrieves the writeset of t_i
setwriteset(t_i)          Applies the writeset of t_i
multicastTO(t_i)          Multicasts t_i in total order

Initialization:
    to_commit_queue = ws_list = ∅
    last_committed_tid = 0

Upon receiving an operation op_j of transaction t_i:
    If sender(op_j) = client then
        coordinator(t_i) = R_k
    If first(op_j, t_i) ∧ R_k = coordinator(t_i) then
        getsnapshot(R_k, t_i)
    If local(op_j, R_k) then
        execute(op_j, snapshot(t_i))
        send(result(op_j), client)
    else
        redirect(op_j)

Upon receiving the commit of t_i:
    getwriteset(t_i)
    If ∃ t_j ∈ to_commit_queue such that t_j.writeset ∩ t_i.writeset ≠ ∅ then
        t_i.local_validation = false
    else
        t_i.local_validation = true
    t_i.cert = last_committed_tid
    multicastTO(t_i)

Upon receiving t_i in total order:
    If ¬t_i.local_validation ∨ (∃ t_j ∈ ws_list such that t_j.writeset ∩ t_i.writeset ≠ ∅ ∧ t_j.cert < t_i.cert) then
        abort(t_i)
    else
        enqueue(t_i, to_commit_queue)
        insert(t_i, ws_list)

Upon t_i becoming the first one in to_commit_queue:
    dequeue(to_commit_queue)
    setwriteset(t_i)
    commit(t_i)
    last_committed_tid++

Figure 4: Replica control protocol

6 Related Work

In this section we summarise previous work that provides a partial replication protocol.

[15] provides an extensive survey of performance models for distributed and replicated database systems. It covers all replication models, from full replication to replication per object, but it only analyses two-dimensional partial replication. That analysis focuses only on symmetric processing, whereas, as demonstrated in [12], asymmetric processing in a replicated database system increases performance.

The protocol in [4] (which extends [16] from total replication to partial replication) is lazy and focuses on the scalability of read-only transactions. A coordinator schedules the transactions of all clients and introduces a delay before submitting a transaction in order to ensure total order. The protocol avoids distributed transactions and needs information about transactions before they are executed. Read/write conflicts are possible.

Holliday et al. in [9] extend the algorithm in [10]. It is based on epidemic communication, which has also been shown to be a good choice for database replication in wide area networks. The algorithm forces an order between all transactions by introducing delays before propagating them. Moreover, writesets as well as readsets are submitted to all replicas, and read/write conflicts are possible. The protocol has two versions: restricted (avoids distributed transactions) and unrestricted (allows distributed transactions).

The protocol in [5] extends [6] to allow partial replication. Each replica executes transactions, and changes are applied in total order. With partial replication, a new voting phase is needed, which implies a communication overhead between replicas. It does not avoid read/write conflicts.

Finally, Cecchet et al. in [3] improve the replication middleware they proposed in [2]. The middleware knows the distribution of data and routes transactions to the proper replica. Distributed transactions are not permitted. Queries are sent to any of the replicas holding the data, but updates are sent to every replica that holds a copy of the accessed data (information about transactions is needed in advance). This algorithm does not scale with updates because every replica executes each update transaction completely and concurrent updates are avoided.
7 Conclusions and Future Work

In this work we have analysed the scale out of a partially replicated database with total and partial replicas. We have come to the conclusion that having total replicas in a partially replicated database avoids distributed transactions, but the scale out of such a system is clearly close to the scale out of a totally replicated database. We have then analysed the scale out of a pure partially replicated database, i.e. one with no total replicas, and we have shown that its scale out is linear independently of the update percentage in the workload. Finally, we have proposed a replica control protocol that provides snapshot isolation at the middleware level. Snapshot isolation avoids read/write conflicts, and working at the middleware level provides independence from the database system.

We plan to implement the replica control protocol at the middleware level and then analyse it using the TPC-W [17] benchmark in a medium-scale cluster. TPC-W is a standard benchmark that simulates an online book retailer.

References

[1] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, and P. O'Neil. A Critique of ANSI SQL Isolation Levels. In Proc. of SIGMOD, pages 1-10, San Jose, USA, May. ACM Press.
[2] E. Cecchet, J. Marguerite, and W. Zwaenepoel. RAIDb: Redundant Array of Inexpensive Databases. Technical Report 4921, INRIA.
[3] Emmanuel Cecchet, Julie Marguerite, and Willy Zwaenepoel. Partial replication: Achieving scalability in redundant arrays of inexpensive databases. In OPODIS, pages 58-70.
[4] Cédric Coulon, Esther Pacitti, and Patrick Valduriez. Consistency management for partial replication in a high performance database cluster. In ICPADS (1).
[5] António Luís Pinto Ferreira de Sousa, Rui Carlos Oliveira, Francisco Moura, and Fernando Pedone. Partial replication in the database state machine. In NCA.
[6] F. Pedone et al. The Database State Machine Approach. PhD thesis.
[7] Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, Patrick O'Neil, and Dennis Shasha. Making snapshot isolation serializable. ACM Trans. Database Syst., 30(2).
[8] J. Gray, P. Helland, P. O'Neil, and D. Shasha. The Dangers of Replication and a Solution. In Proc. of SIGMOD, Montreal.
[9] J. Holliday, D. Agrawal, and A. Abbadi. Partial database replication using epidemic communication.
[10] J. Holliday, R. Steinke, D. Agrawal, and A. Abbadi. Epidemic algorithms for replicated databases.
[11] R. Jiménez-Peris, M. Patiño-Martínez, G. Alonso, and B. Kemme. Improving the scalability of fault-tolerant database clusters. In Proc. of 22nd IEEE Int. Conf. on Distributed Computing Systems, 2002, Vienna, Austria, July.
[12] R. Jiménez-Peris, M. Patiño-Martínez, G. Alonso, and B. Kemme. Are Quorums an Alternative for Data Replication? ACM Transactions on Database Systems, 28(3), September.
[13] B. Kemme. Database Replication for Clusters of Workstations. PhD thesis, Dept. of Computer Science, Swiss Federal Institute of Technology Zurich.
[14] Yi Lin, Bettina Kemme, Marta Patiño-Martínez, and Ricardo Jiménez-Peris. Middleware based data replication providing snapshot isolation. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, NY, USA. ACM Press.
[15] M. Nicola and M. Jarke. Performance Modeling of Distributed and Replicated Databases. IEEE Transactions on Knowledge and Data Engineering, 12(4), July.
[16] E. Pacitti, T. Ozsu, and C. Coulon. Preventive multi-master replication in a cluster of autonomous databases.
[17] Transaction Processing Performance Council. TPC-W v
