Scaling up the Preventive Replication of Autonomous Databases in Cluster Systems 1


Cédric Coulon, Esther Pacitti, Patrick Valduriez
INRIA and LINA, University of Nantes, France
Cedric.Coulon@lina.univ-nantes.fr, Esther.Pacitti@lina.univ-nantes.fr, Patrick.Valduriez@inria.fr

Abstract. We consider the use of a cluster system for Application Service Providers. To obtain high performance and high availability, we replicate databases (and DBMSs) at several nodes so they can be accessed in parallel through applications. The main problem is then to assure the consistency of autonomous replicated databases. Preventive replication [8] provides a good solution that exploits the cluster's high-speed network, without the constraints of synchronous replication. However, the solution in [8] assumes full replication and a restricted class of transactions. In this paper, we address these two limitations in order to scale up to large cluster configurations. The main contribution is a refreshment algorithm that prevents conflicts for partially replicated databases. We describe the implementation of our algorithm over a cluster of 32 nodes running PostGRESQL. Our experimental results show that our algorithm has excellent scale-up and speed-up.

1 Introduction

High performance and high availability of database management have traditionally been achieved with parallel database systems [9], implemented on tightly coupled multiprocessors. Parallel data processing is then obtained by partitioning and replicating the data across the multiprocessor nodes in order to divide processing. Although quite effective, this solution requires the database system to have full control over the data and is expensive in terms of software and hardware.

Clusters of PC servers provide a cost-effective alternative to tightly coupled multiprocessors. They have been used successfully by, for example, Web search engines using high-volume server farms (e.g., Google). However, search engines are typically read-intensive, which makes it easier to exploit parallelism. Cluster systems can make new businesses such as Application Service Providers (ASP) economically viable. In the ASP model, customers' applications and databases (including data and DBMS) are hosted at the provider site and need to be available, typically through the Internet, as efficiently as if they were local to the customer site. To improve performance, applications and data can be replicated at different nodes so that users can be served by any of the nodes depending on the current load [1].

1 Work partially funded by the MDP2P project of the ACI Masses de Données of the French ministry of research.

This arrangement also provides high availability since, in the event of a node failure, other nodes can still do the work. However, managing data replication in the ASP context is far more difficult than in Web search engines since applications can be update-intensive and both applications and databases must remain autonomous. The solution of using a parallel DBMS is not appropriate as it is expensive, requires heavy migration to the parallel DBMS and hurts database autonomy.

In this paper, we consider a cluster system with similar nodes, each having one or more processors, main memory (RAM) and disk. As with multiprocessors, various cluster system architectures are possible: shared-disk, shared-cache and shared-nothing [9]. Shared-disk and shared-cache require a special interconnect that provides a shared space to all nodes, with provision for cache coherence using either hardware or software. Shared-nothing (or distributed memory) is the only architecture that supports our autonomy requirements without the additional cost of a special interconnect. Furthermore, shared-nothing can scale up to very large configurations. Thus, we strive to exploit a shared-nothing architecture.

To obtain high performance and high availability, we replicate databases (and DBMSs) at several nodes, so they can be accessed in parallel through applications. The main problem is then to assure the consistency of autonomous replicated databases. An obvious solution is synchronous replication (e.g. Read-One-Write-All), which updates all replicas within the same (distributed) transaction. Synchronous (or eager) replication enforces mutual consistency of replicas. However, it cannot scale up to large cluster configurations because it makes use of distributed transactions, typically implemented by two-phase commit [5].

A better solution that scales up is lazy replication [6]. With lazy replication, a transaction can commit after updating a replica, called the primary copy, at some node, called the master node. After the transaction commits, the other replicas, called secondary copies, are updated in separate refresh transactions at slave nodes. Lazy replication allows for different replication configurations [7]. A useful configuration is lazy master, where there is only one primary copy. Although it relaxes the property of mutual consistency, strong consistency (for any two nodes, the same sequence of transactions is executed in the same order) is eventually assured. However, it hurts availability since the failure of the master node prevents the replicas from being updated. A more general configuration is (lazy) multi-master, where the same primary copy, called a multi-owner copy, may be stored at and updated by different master nodes, called multi-owner nodes. The advantage of multi-master is high availability and high performance since replicas can be updated in parallel at different nodes. However, conflicting updates of the same primary copy at different nodes can introduce replica divergence.

Preventive replication can prevent the occurrence of conflicts [8] by exploiting the cluster's high-speed network, thus providing strong consistency without the constraints of synchronous replication. The refresher algorithm in [8] supports multi-master configurations and its implementation over a cluster of 8 nodes shows good performance. However, it assumes that databases are fully replicated across all cluster nodes and thus propagates each transaction to each cluster node.
This makes it unsuitable for supporting heavy workloads on large cluster configurations.

Furthermore, it considers only write multi-owner transactions. However, it may happen that a transaction must read from some other copy in order to update a multi-owner copy. In this paper, we address these two limitations in order to scale up to large cluster configurations. The main contribution is a multi-master refreshment algorithm that prevents conflicts for partially replicated databases. We also show the architectural components necessary to implement the algorithm. Finally, we describe the implementation of our algorithm over a cluster of 32 nodes running PostGRESQL and present experimental results that show that it yields excellent scale-up and speed-up.

The rest of the paper is structured as follows. Section 2 presents the cluster system architecture in which we identify the role of the replication manager. Section 3 defines partially replicated configurations. Section 4 presents refreshment management, including the principles of the refresher algorithm and the architectural components necessary for its implementation. Section 5 describes our prototype, our performance model and the experimental results. Finally, Section 6 concludes the paper.

2 Cluster System Architecture

In this section, we introduce the architecture for processing user requests against applications in the cluster system and discuss our general solutions for placing applications, submitting transactions and managing replicas. In doing so, we identify the replication layer together with the other general components.

Our system architecture is fully distributed since any node may handle user requests. Shared-nothing is the only architecture that supports sufficient node autonomy without the additional cost of special interconnects. Thus, we exploit a shared-nothing architecture. However, in order to perform distributed request routing and load balancing, each node must share (i) the user directory, which stores authentication and authorization information on users, and (ii) the list of available nodes and, for each node, its current load and the copies it manages with their access rights (read/write or read-only). To avoid the bottleneck of central routing, we use Distributed Shared Memory [4] to manage this shared information.

Each cluster node has four layers (see Figure 1): Request Router, Application Manager, Transaction Load Balancer and Replication Manager. A request may be a query or an update transaction for a specific application. The general processing of a user request is as follows. When a user request arrives at the cluster, traditionally through an access node, it is sent randomly to a cluster node, say node i. There is no significant data processing at the access node, which avoids bottlenecks. Within node i, the user is authenticated and authorized through the Request Router, available at each node, using a multi-threaded global user directory service. Thus user requests can be managed completely asynchronously. Next, if the request is accepted, the Request Router chooses a node j to which to submit the request. The choice of j involves selecting all nodes at which the required application is available and the copies used by the request are available with the correct access right, and then, among these nodes, the one with the lightest load. Note that i may be equal to j.

The Request Router then routes the user request to an application node using a traditional load balancing algorithm.

Figure 1. Cluster system architecture (each node has a Request Router, Application Manager, Transaction Load Balancer, Replication Manager and DBMS, together with a Current Load Monitor; the nodes share a Global User Directory and Global Data Placement over a fast network)

However, the database accessed by the user request may be placed at another node k, since applications and databases are both replicated and not every node hosts a database system. In this case, the choice of node k will depend on the cluster configuration and the database load at each node.

Each time a new node becomes available, it must register with the other nodes. To accomplish this, it uses the global shared memory that maintains the list of available nodes. If a node is no longer available, it is removed from the list either by itself, if it disconnects normally, or by the request router of another node, if it crashes.

A node's load is reported by a Current Load Monitor available at each node. For each node, the load monitor periodically computes the application and transaction loads. For each type of load, it establishes a load grade and stores the grades in the shared memory; a high grade corresponds to a high load. The Request Router therefore chooses the best node for a specific request using the node grades.
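To make the routing step concrete, the following is a minimal Java sketch of the node-selection rule described above: among the nodes that host the application and all the required copies with the right access mode, pick the one with the lightest load grade. The class and member names (NodeInfo, loadGrade, chooseNode) are illustrative assumptions, not the paper's implementation.

import java.util.*;

// Minimal sketch of the node-selection step performed by the Request Router.
// Names are illustrative; the paper does not prescribe an API.
public class RequestRouterSketch {

    static final class NodeInfo {
        final String id;
        final Set<String> applications;     // applications hosted at this node
        final Map<String, Boolean> copies;  // copy name -> true if writable
        int loadGrade;                      // higher grade = heavier load

        NodeInfo(String id, Set<String> applications,
                 Map<String, Boolean> copies, int loadGrade) {
            this.id = id;
            this.applications = applications;
            this.copies = copies;
            this.loadGrade = loadGrade;
        }
    }

    /** Returns the lightest-loaded node hosting the application and all
     *  required copies with the required access mode, or null if none. */
    static NodeInfo chooseNode(Collection<NodeInfo> nodes, String application,
                               Set<String> requiredCopies, boolean needsWrite) {
        NodeInfo best = null;
        for (NodeInfo n : nodes) {
            if (!n.applications.contains(application)) continue;
            boolean ok = true;
            for (String copy : requiredCopies) {
                Boolean writable = n.copies.get(copy);
                if (writable == null || (needsWrite && !writable)) { ok = false; break; }
            }
            if (ok && (best == null || n.loadGrade < best.loadGrade)) best = n;
        }
        return best;
    }

    public static void main(String[] args) {
        NodeInfo n1 = new NodeInfo("N1", Set.of("app"), Map.of("R", true, "S", true), 7);
        NodeInfo n2 = new NodeInfo("N2", Set.of("app"), Map.of("R", true, "S", false), 2);
        NodeInfo chosen = chooseNode(List.of(n1, n2), "app", Set.of("R", "S"), true);
        System.out.println("route to " + (chosen == null ? "none" : chosen.id)); // -> N1
    }
}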

The Application Manager is the layer that manages application instantiation and execution using an application server provider. Within an application, each time a transaction is to be executed, the Transaction Load Balancer layer is invoked, which triggers transaction execution at the best node, using the load grades available at each node. The best node is defined as the one with the lightest transaction load. The Transaction Load Balancer ensures that each transaction execution obeys the ACID (atomicity, consistency, isolation, durability) properties, and then signals the Application Manager to commit or abort the transaction.

The Replication Manager layer manages access to replicated data and assures strong consistency in such a way that transactions that update replicated data are executed in the same serial order at each node. We employ data replication because it provides database access parallelism for applications. Our preventive replication approach, implemented by the multi-master refreshment algorithm, avoids conflicts at the expense of a forced waiting time for transactions, which is negligible thanks to the fast cluster network.

3 Partially Replicated Configurations

In this section, we discuss the configurations supported by our refreshment algorithm and define the terms used for the replication algorithm. Figure 2 shows the various configurations supported by our algorithm with two tables R and S. Primary copies are written in upper case (e.g. S), multi-master copies in underlined upper case (e.g. R1, underlined in the figure) and secondary copies in lower case (e.g. s).

Figure 2. Useful configurations: a) bowtie, b) fully replicated, c) partially replicated, d) partially replicated

First, we have a bowtie lazy-master configuration (Figure 2a) where there are only primary copies and secondary copies. Second, we have a multi-master configuration (Figure 2b) where there are only multi-master copies and all nodes hold all copies; this configuration is therefore also called fully replicated. Then we have two examples of partially replicated configurations (Figures 2c and 2d) where all types of copies (primary, multi-master and secondary) may be stored at any node. For instance, in Figure 2c, node N1 carries the multi-owner copy R1 and a primary copy S, node N2 carries the multi-owner copy R2 and the secondary copy s, node N3 carries the multi-owner copy R3, and node N4 carries the secondary copy s. Such a partially replicated configuration allows an incoming transaction at node N1, such as the one in Example 1, to read S in order to update R1:

Example 1: Incoming transaction at node N1

UPDATE R1 SET att1 = value WHERE att2 IN (SELECT att3 FROM S);
COMMIT;

This transaction can be entirely executed at N1 (to update R1) and at N2 (to update R2). However, it cannot be entirely executed at node N3 (to update R3), which does not hold a copy of S.

A Multi-Owner Transaction (MOT) may be composed of a sequence of read and write operations followed by a commit (as produced by the above statement). This is more general than in [8], which only considers write operations. A Multi-Owner Refresh transaction (MOR) corresponds to the resulting sequence of write operations of an MOT, as written in the Log History [8]. The origin node is the node that receives a MOT (from the client) which updates a set of multi-owner copies. Target nodes are the nodes that carry multi-owner copies which are updated by the incoming MOT. Note that the origin node is also a target node since it holds multi-owner copies updated by the MOT. For instance, in Figure 2b, whenever node N1 receives a MOT that updates R1, then N1 is the origin node and N1, N2, N3 and N4 are the target nodes. In Figure 2c, whenever N3 receives a MOT that updates R3, then the origin node is N3 and the target nodes are N1, N2 and N3.
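As an illustration of these definitions, the following Java sketch models a MOT as a timestamped sequence of read and write operations, derives the corresponding MOR (the write sequence only), and shows how a target node can tell whether it holds all the copies needed to execute the MOT itself. All type and method names are hypothetical; the paper does not define such an API.

import java.util.*;

// Sketch of how a multi-owner transaction (MOT) and its refresh transaction
// (MOR) can be modelled, and how a target node decides whether it can run the
// MOT directly or must wait for the MOR. Names are illustrative only.
public class MotSketch {

    enum Kind { READ, WRITE }

    record Operation(Kind kind, String copy, String sql) {}

    record Transaction(String originNode, long timestamp, List<Operation> ops) {
        /** Copies read or written by the transaction. */
        Set<String> copies(Kind kind) {
            Set<String> result = new HashSet<>();
            for (Operation op : ops) if (op.kind() == kind) result.add(op.copy());
            return result;
        }
        /** The MOR keeps only the write sequence of the MOT. */
        Transaction toRefreshTransaction() {
            List<Operation> writes = new ArrayList<>();
            for (Operation op : ops) if (op.kind() == Kind.WRITE) writes.add(op);
            return new Transaction(originNode, timestamp, writes);
        }
    }

    /** A target node can execute the MOT itself only if it holds every copy
     *  in the MOT's read set and write set; otherwise it waits for the MOR. */
    static boolean canExecuteLocally(Transaction mot, Set<String> localCopies) {
        return localCopies.containsAll(mot.copies(Kind.READ))
            && localCopies.containsAll(mot.copies(Kind.WRITE));
    }

    public static void main(String[] args) {
        // Example 1: read S to update the multi-owner copy R1.
        Transaction mot = new Transaction("N1", System.currentTimeMillis(), List.of(
            new Operation(Kind.READ, "S", "SELECT att3 FROM S"),
            new Operation(Kind.WRITE, "R1", "UPDATE R1 SET att1 = ...")));
        System.out.println(canExecuteLocally(mot, Set.of("R1", "S")));  // N1: true
        System.out.println(canExecuteLocally(mot, Set.of("R3")));       // N3: false, waits for MOR
    }
}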

4 Preventive Refreshment for Partially Replicated Configurations

In this section, we present the preventive refreshment algorithm for partially replicated configurations. We then introduce the architecture used for the Replication Manager layer and show how the algorithm works, using an example.

4.1 Refresher Algorithm

We assume that the network interface provides global First-In First-Out (FIFO) reliable multicast: messages multicast by one node are received at the multicast group nodes in the order they have been sent [3]. We denote by Max the upper bound of the time needed to multicast a message from a node i to any other node j. We also assume that each node has a local clock. For fairness reasons, clocks are assumed to have a bounded drift and to be ε-synchronized [2]. This means that the difference between any two correct clocks is not higher than ε (known as the precision).

To define the refresher algorithm, we need a formal correctness criterion that captures strong consistency. In multi-master configurations, inconsistencies may arise whenever the serial orders of two multi-owner transactions at two nodes are not equal. Therefore, multi-owner transactions must be executed in the same serial order at any two nodes. Global FIFO ordering is thus not sufficient to guarantee the correctness of the refreshment algorithm. Hence the following correctness criterion is necessary:

Definition 3.1 (Total Order). Two multi-owner transactions MOT1 and MOT2 are said to be executed in Total Order if all multi-owner nodes that commit both MOT1 and MOT2 commit them in the same order.

Proposition 3.1. For any cluster configuration C that meets a multi-master configuration requirement, the refresh algorithm that C uses is correct if and only if the algorithm enforces total order.

We now present the refresher algorithm. Each multi-owner transaction is associated with a chronological timestamp value. The principle of the preventive refresher algorithm is to submit a sequence of multi-owner transactions in chronological order at each node. As a consequence, total order is enforced since multi-owner transactions are executed in the same serial order at each node. To assure chronological ordering, before submitting a multi-owner transaction at node i, the algorithm has to check whether there is any older committed multi-owner transaction en route to node i. To accomplish this, the submission time of a new multi-owner transaction MOTi is delayed by Max + ε (recall that Max is the upper bound of the time needed to multicast a message from a node to any other node). After this delay period, all older transactions are guaranteed to have been received at node i. Thus chronological and total orderings are assured.

Whenever a multi-owner transaction MOTi is to be triggered at some node i, that node multicasts MOTi to all nodes 1, 2, ..., n of the multi-owner configuration, including itself. Once MOTi is received at some other node j (note that i may be equal to j), it is placed in the pending queue for the multi-owner triggering node i, called the multi-owner pending queue (noted moqi). Therefore, at each multi-owner node there is a set of multi-owner queues, moq1, moq2, ..., moqn. Each pending queue corresponds to a node of the multi-owner configuration and is used by the refresher to perform chronological ordering.

With partially replicated configurations, some of the target nodes may not be able to perform MOTi because they do not hold all the copies necessary to perform the read set of MOTi (recall the discussion of Example 1). However, the write sequence of MOTi (which corresponds to its MORi) still needs to be ordered using MOTi's timestamp value in order to ensure consistency. So MOTi is delayed by Max + ε to prevent conflicts and waits for the reception of the corresponding MORi (the resulting sequence of write operations of MOTi). At the origin node i, when the commitment of MOTi is detected, the corresponding MORi is produced and node i multicasts MORi towards the involved target nodes. Finally, upon reception of MORi at a target node j, the content of the waiting MOTi is replaced with the content of the incoming MORi. Then MOTi is executed, which enforces strong consistency.

A key aspect of our algorithm is that it relies on the upper bound Max on the transmission time of a message by the global FIFO reliable multicast. Therefore, it is essential to have a value of Max that is not overestimated. The computation of Max resorts to scheduling theory (e.g. see [9]) and takes into account several parameters such as the global reliable network itself, the characteristics of the messages to multicast and the failures to be tolerated. In summary, our approach trades the use of a worst-case multicast time for a reduction of the number of messages exchanged on the network. This is a well-known trade-off. This solution brings simplicity and ease of implementation.
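The following is a minimal, single-threaded Java sketch of the ordering rule at the heart of the refresher: one pending queue per origin node, and a MOT is handed to the Deliverer only once its timestamp plus Max + ε has elapsed on the local clock, so that no older MOT can still be en route. Multicast, clock synchronization and failure handling are deliberately left out, and all names are illustrative rather than taken from the prototype.

import java.util.*;

// Single-threaded sketch of the preventive refresher's ordering rule: a MOT
// timestamped ts is submitted only once the local clock reaches ts + Max + ε,
// and pending MOTs are then executed in timestamp order.
public class RefresherSketch {

    record Mot(String originNode, long timestamp, Runnable body) {}

    private final long maxPlusEpsilon;                       // Max + ε, in ms
    private final Map<String, Deque<Mot>> pendingQueues = new HashMap<>();

    RefresherSketch(long maxPlusEpsilon) { this.maxPlusEpsilon = maxPlusEpsilon; }

    /** Called by the Receiver: one FIFO pending queue per origin node. */
    void receive(Mot mot) {
        pendingQueues.computeIfAbsent(mot.originNode(), k -> new ArrayDeque<>()).addLast(mot);
    }

    /** Called periodically: picks the oldest pending MOT across all queues and
     *  submits it once its waiting period Max + ε has elapsed, which guarantees
     *  that no older MOT is still en route. */
    void step(long localClock) {
        Mot oldest = null;
        for (Deque<Mot> q : pendingQueues.values()) {
            Mot head = q.peekFirst();
            if (head != null && (oldest == null || head.timestamp() < oldest.timestamp()))
                oldest = head;
        }
        if (oldest != null && localClock >= oldest.timestamp() + maxPlusEpsilon) {
            pendingQueues.get(oldest.originNode()).pollFirst();
            oldest.body().run();   // hand over to the Deliverer / DBMS
        }
    }

    public static void main(String[] args) {
        RefresherSketch refresher = new RefresherSketch(100);
        refresher.receive(new Mot("N2", 1_050, () -> System.out.println("MOT from N2")));
        refresher.receive(new Mot("N1", 1_000, () -> System.out.println("MOT from N1")));
        refresher.step(1_120);  // only the MOT timestamped 1000 is old enough: N1 first
        refresher.step(1_200);  // then the MOT timestamped 1050: total order preserved
    }
}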

4.2 Replication Manager Architecture

To implement the multi-master refresher algorithm, we add several components to a regular DBMS. Our goal is to preserve node autonomy, since the implementation does not require knowledge of system internals. Figure 3 shows the architecture for refreshing partially replicated databases in our cluster system. The Replica Interface receives MOTs coming from the clients. The Propagator and the Receiver manage the sending and the reception (respectively) of transactions (MOT and MOR) inside messages on the network. Whenever the Receiver receives a MOT, it places it in the appropriate pending queue, used by the Refresher. Next, the Refresher executes the refresher algorithm to ensure strong consistency. Each time a MOT is chosen for execution by the Refresher, it is written to a running queue, which is then read by the Deliverer in FIFO order. The Deliverer submits the transactions read from the running queue to the DBMS.

For partially replicated configurations, when a MOT is composed of a sequence of reads and writes, the refresher at the target nodes must still assure the correct ordering, but the MOT execution must be delayed until its corresponding MOR is received. This is because the MOR is produced only after the commitment of the corresponding MOT at the origin node. At the target node, the content of the MOT (sequence of read and write operations) is replaced by the content of the MOR (sequence of write operations). Thus, at the target node, when the Receiver receives a MOR, it interacts with the Deliverer. At the origin node, the Log Monitor constantly checks the content of the DBMS log to detect whether replicated copies have been changed. For partially replicated configurations, whenever a MOT is composed of reads and writes, the Log Monitor detects the MOT commitment in order to produce the corresponding MOR. Then the Propagator multicasts the MOR to the involved target nodes.

Figure 3. Preventive Replication Manager architecture (at each node, the Replica Interface, Log Monitor, Propagator, Receiver, Refresher and Deliverer sit between the clients, the DBMS, the Global Data Placement and the network)

Let us now illustrate the refresher algorithm with an example execution. In Figure 4, we assume a simple configuration with 3 nodes (N1, N2 and N3) and 2 copies (R and S). N1 carries a multi-owner copy of R and a primary copy of S.

N2 carries a multi-owner copy of R, and N3 carries a secondary copy of S. The refreshment proceeds in 5 steps. In step 1, N1 (the origin node) receives a MOT from a client, through the Replica Interface, which reads S and updates R. For instance, the MOT can be the resulting read and write sequence produced by the transaction of Example 1. In step 2, N1 multicasts the MOT, using the Propagator, to the involved target nodes, i.e. N1 and N2. N3 is not concerned by the MOT because it only holds a secondary copy s. In step 3, the MOT can be performed using the refresher algorithm at N1. At N2, the MOT is also handled by the Refresher and then put in the running queue; however, it cannot yet be executed at this target node (N2 does not hold S) because the Deliverer needs to wait for the corresponding MOR (see step 4). In step 4, after the commitment of the MOT at the origin node, the Log Monitor produces the MOR and multicasts it to all involved target nodes. In step 5, N2 receives the MOR and the Receiver replaces the content of the MOT with that of the MOR. The Deliverer can then submit the MOT.

Figure 4. Example of preventive refreshment with partial configurations (5 steps over nodes N1, N2 and N3: the client submits MOT(rS, wR) to N1, which multicasts the MOT to N1 and N2, performs it, then multicasts MOR(wR), which N2 finally performs)
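To make steps 3 to 5 concrete, here is a hedged Java sketch of how a Deliverer could keep a not-locally-executable MOT in its running-queue slot and block until the Receiver substitutes the MOR's write sequence. The use of a CompletableFuture and all names are our own illustrative choices, not the prototype's implementation.

import java.util.*;
import java.util.concurrent.*;

// Sketch of the Deliverer side of steps 3-5: a MOT whose read set is not
// locally available keeps its slot in the running queue but blocks until the
// Receiver substitutes the corresponding MOR (write sequence).
public class DelivererSketch {

    /** A slot in the running queue: either the MOT itself is executable
     *  locally, or the slot is completed later with the MOR's writes. */
    static final class RunningSlot {
        final String motId;
        final CompletableFuture<List<String>> writeSequence = new CompletableFuture<>();
        RunningSlot(String motId) { this.motId = motId; }
    }

    private final BlockingQueue<RunningSlot> runningQueue = new LinkedBlockingQueue<>();

    /** Refresher: enqueue in chronological order; complete immediately if the
     *  node holds all copies needed by the MOT. */
    RunningSlot schedule(String motId, List<String> localWritesOrNull) {
        RunningSlot slot = new RunningSlot(motId);
        if (localWritesOrNull != null) slot.writeSequence.complete(localWritesOrNull);
        runningQueue.add(slot);
        return slot;
    }

    /** Receiver: the MOR produced at the origin node replaces the waiting MOT. */
    void onMorReceived(RunningSlot slot, List<String> morWrites) {
        slot.writeSequence.complete(morWrites);
    }

    /** Deliverer loop: submit slots to the DBMS strictly in FIFO order. */
    void deliverNext() throws InterruptedException, ExecutionException {
        RunningSlot slot = runningQueue.take();
        List<String> writes = slot.writeSequence.get();   // waits for the MOR if needed
        System.out.println("executing " + slot.motId + ": " + writes);
    }

    public static void main(String[] args) throws Exception {
        DelivererSketch deliverer = new DelivererSketch();
        RunningSlot waiting = deliverer.schedule("MOT-1", null);         // N2: must wait for MOR
        deliverer.onMorReceived(waiting, List.of("UPDATE R2 SET ..."));  // MOR arrives from N1
        deliverer.deliverNext();                                         // now executable, order kept
    }
}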

5 Validation

The refresher algorithm proposed in [8] for full preventive replication was shown to introduce a negligible loss of data freshness (almost equal to mutual consistency). The same freshness result applies to partial replication. In this section, we describe our implementation, the replication configurations we use and our performance model. Then we describe two experiments that study scale-up and speed-up.

5.1 Implementation

We did our implementation on a cluster of 32 nodes. Each node is configured with a 1 GHz Pentium 3 processor, 512 MB of memory and 40 GB of disk. The nodes are linked by a 1 Gb/s Myrinet network. We use Linux Mandrake 8.0, Java and the Spread toolkit, which provides a reliable FIFO message bus and a high-performance message service among the cluster nodes that is resilient to faults. We use the PostGRESQL Open Source DBMS at each node. We chose PostGRESQL because it is quite complete in terms of transaction support and easy to work with.

Our implementation has four modules: Client, Replicator, Network and Database Server. The Client module simulates the clients of the ASP. It submits multi-owner transactions randomly to any cluster node via RMI-JDBC, which implements the Replica Interface. Each cluster node hosts a Database Server and one instance of the Replicator module. For this validation, which stresses transaction performance, we implemented the Replicator module directly within PostGRESQL. The Replicator module implements all the system components necessary for a multi-owner node: Replica Interface, Propagator, Receiver, Refresher and Deliverer. Each time a transaction is to be executed, it is first sent to the Replica Interface, which checks whether the incoming transaction writes to replicated data. Whenever a transaction does not write to replicated data, it is sent directly to the local transaction manager. Even though we do not consider node failures in our performance evaluation, we implemented all the logs necessary for recovery in order to understand the complete behavior of the algorithm. The Network module interconnects all cluster nodes through the Spread toolkit.

5.2 Configurations

We use 3 configurations involving 3 copies R, S and T with 5 different numbers of nodes (4, 8, 16, 24 and 32). This choice makes it easy to vary the placement of copies while keeping a balanced distribution. In the Fully Replicated configuration (FR), all nodes hold the 3 copies, i.e. {R,S,T}. In the Partially Replicated 1 configuration (PR1), one fourth of the nodes holds {R,S,T}, another holds {R,S}, another holds {R,T} and another holds {S,T}. In the Partially Replicated 2 configuration (PR2), one fourth of the nodes holds {R,S,T}, another holds {R}, another holds {S} and another holds {T}.

Table 1. Placement of copies per configuration
  Fully Replicated (FR):        all nodes hold RST
  Partially Replicated 1 (PR1): one fourth of the nodes each holds RST, RS, RT, ST
  Partially Replicated 2 (PR2): one fourth of the nodes each holds RST, R, S, T
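As an illustration, the placement of Table 1 can be computed as follows. The helper below is a simple sketch with illustrative names (GROUPS, copiesAt) that assigns each quarter of the nodes the copy set prescribed by the chosen configuration; it is not part of our prototype.

import java.util.*;

// Sketch reproducing Table 1's placement: nodes are split into four equal
// groups and each group receives a fixed set of copies per configuration.
public class PlacementSketch {

    /** Copies held by each quarter of the nodes, per configuration (Table 1). */
    static final Map<String, List<Set<String>>> GROUPS = Map.of(
        "FR",  List.of(Set.of("R","S","T"), Set.of("R","S","T"), Set.of("R","S","T"), Set.of("R","S","T")),
        "PR1", List.of(Set.of("R","S","T"), Set.of("R","S"), Set.of("R","T"), Set.of("S","T")),
        "PR2", List.of(Set.of("R","S","T"), Set.of("R"), Set.of("S"), Set.of("T")));

    /** Copies stored at node index i (0-based) in a cluster of n nodes. */
    static Set<String> copiesAt(String configuration, int i, int n) {
        return GROUPS.get(configuration).get((i * 4) / n);   // four equal groups
    }

    public static void main(String[] args) {
        int n = 8;
        for (int i = 0; i < n; i++)
            System.out.println("PR2 node " + (i + 1) + " holds " + copiesAt("PR2", i, n));
    }
}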

5.3 Performance Model

Our performance model takes into account the multi-owner transaction size, the arrival rate and the number of multi-owner nodes. We vary the multi-owner transaction arrival rate (noted λ). We consider workloads that are either low (λ=3s) or bursty (λ=200ms). We define two sizes of multi-owner transactions (denoted |MOT|), determined by the number of write operations: small transactions have size 5, while long transactions have size 50. Multi-owner transactions are either Single Copy (SC), where only one replica is updated, or Multiple Copy (MC), where several replicas are involved. Thus, with our 3 copies, we have 3 SC transactions (on either {R}, {S} or {T}) and 7 MC transactions (on either {R}, {S}, {T}, {R,S}, {R,T}, {S,T} or {R,S,T}).

Table 2. Performance parameters
  Param.            Definition                                               Values
  |MOT|             Multi-owner transaction size                             5; 50
  Transaction Type  Involved replicas: single or multiple copies             SC (single), MC (multi)
  ltr               Long transaction ratio                                   66%
  λ                 Mean time between transactions (workload)                bursty: 200ms; low: 3s
  Nb-Multi-Owner    Number of replicated PostGRESQL instances                4, 8, 16, 24, 32
  M                 Number of update transactions submitted for the tests
  Max + ε           Delay introduced for submitting a MOT                    ms
  Configuration     Replication of R, S and T                                FR, PR1, PR2

To understand the behavior of our algorithm in the presence of short and long transactions, we set a parameter called long transaction ratio to ltr = 66 (66% of update transactions are long). Updates are done on the same attribute of a different tuple. Finally, our copies R, S and T have 2 attributes (the first attribute is used as primary key and the second as data); each copy carries the same number of tuples and no indexes. The parameters of the performance model are described in Table 2.

5.4 Scale-up Experiment

This experiment studies the algorithm's scalability. That is, for the same set of incoming multi-owner transactions, scalability is achieved when the response times remain the same as the number of nodes increases. We consider SC and MC transactions in bursty and low workloads (λ = 200ms and λ = 20s), varying the number of nodes for each configuration (FR, PR1 and PR2). SC and MC transactions randomly update R, S or T. For each test, we measure the average response time per transaction. The duration of this experiment is the time to submit 50 transactions.

The experiment results (see Figure 5) clearly show that scalability is achieved for all tests. In contrast with synchronous algorithms, our algorithm has linear response time behavior since, for each incoming multi-owner transaction, it requires only the multicast of n messages for nodes that carry all required copies plus 2n messages for nodes that do not carry all required copies, where n is the number of target nodes.

The results also show the impact of the configuration on performance. First, on low workloads (see Figures 5.b and 5.d), the impact is very small (transaction response times are similar for all configurations) because at each node each transaction executes alone. In bursty workloads (see Figures 5.a and 5.c), the number of transactions increases; however, several of them execute in parallel due to partial replication.

Thus partial replication increases inter-transaction parallelism by allowing different nodes to process different transactions. The performance improvement of PR over FR is best for SC transactions (see Figure 5.a) and quite significant (a factor of 2.33 for PR1 and of 7 for PR2). For MC transactions (see Figure 5.c), the performance improvement is still a factor of 3.5 for PR1 but only of 1.6 for PR2. This is because FR performs relatively better for MC transactions than for SC transactions. For SC transactions, PR2, which stores individual copies of R, S and T at different nodes, outperforms PR1. But for MC transactions, PR1 outperforms PR2. Thus the configuration and the placement of the copies should be tuned to the expected types of transactions.

Figure 5. Scalability results: response time (ms) versus number of nodes for FR, PR1 and PR2; a) bursty workload - SC, b) low workload - SC, c) bursty workload - MC, d) low workload - MC

5.5 Speed-up Experiment

The second experiment studies the performance improvement (speed-up) for read queries when we increase the number of nodes. To accomplish this, we modified the previous experiment by introducing clients that submit queries. We consider MC read queries in bursty and low workloads (λ = 200ms and λ = 20s), varying the number of nodes for each configuration (FR, PR1 and PR2).

Since each read query involves only one node, the performance difference between SC and MC queries is negligible. Thus, we only consider MC queries, which are more general. The number of read clients is 128. Each client is associated with one node and we produce an even distribution of clients over the nodes; thus, the number of read clients per node is 128 divided by the number of nodes. The clients submit queries sequentially while the experiment is running. For each test, we measured the throughput of the cluster, i.e. the number of read queries per second. The duration of this experiment is the time to submit 1000 read queries.

The experiment results (see Figure 6) clearly show that increasing the number of nodes improves the cluster's throughput. For bursty workloads (Figure 6.a), the improvement obtained with partial replication is moderate because the nodes are already overloaded by the update transactions. The improvement is due to the fact that the nodes are less loaded with PR1 and PR2 than with FR.

Figure 6. Speed-up results: queries per second versus number of nodes for FR, PR1 and PR2; a) bursty workload - MC, b) low workload - MC

For low workloads (Figure 6.b), the speed-up obtained for all three configurations is quite significant and roughly the same. As in the previous experiment, the type of configuration does not matter here because the nodes are not overloaded by update transactions.

6 Conclusion

Preventive replication provides strong copy consistency without the constraints of synchronous replication. In this paper, we addressed the problem of scaling up preventive replication to large cluster configurations. The main contribution is a multi-master refreshment algorithm that prevents conflicts for partially replicated databases. The algorithm supports general multi-owner transactions composed of read and write operations. Cluster nodes can host autonomous databases that are considered as black boxes; thus our solution achieves full node autonomy. We described the system architecture components necessary to implement the refresher algorithm. We validated our refresher algorithm over a cluster of 32 nodes running PostGRESQL.

To stress transaction performance, we implemented our Replicator module directly within PostGRESQL. Our experimental results showed that our multi-master refreshment algorithm scales up very well. In contrast with synchronous algorithms, our algorithm has linear response time behavior. We also showed the impact of the configuration on performance. For low workloads, full replication performs almost as well as partial replication. But for bursty workloads in partially replicated configurations, a significant number of transactions are executed in parallel, so the performance improvement of partial replication can be quite significant (by a factor of up to 7). The speed-up experimental results also showed that increasing the number of nodes can significantly improve the query throughput. However, the performance gains strongly depend on the types of transactions (single versus multiple copy) and on the configuration. Thus an important conclusion is that the configuration and the placement of the copies should be tuned to the expected types of transactions.

References

1. S. Gançarski, H. Naacke, E. Pacitti, P. Valduriez: Parallel Processing with Autonomous Databases in a Cluster System. Int. Conf. on Cooperative Information Systems (CoopIS).
2. L. George, P. Minet: A FIFO Worst Case Analysis for a Hard Real-Time Distributed Problem with Consistency Constraints. Int. Conf. on Distributed Computing Systems (ICDCS).
3. V. Hadzilacos, S. Toueg: Fault-Tolerant Broadcasts and Related Problems. In Distributed Systems, 2nd Edition, S. Mullender (ed.), Addison-Wesley.
4. M.D. Hill et al.: Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors. ACM Trans. on Computer Systems, 11(4).
5. T. Özsu, P. Valduriez: Principles of Distributed Database Systems. Prentice Hall, Englewood Cliffs, New Jersey, 2nd edition.
6. E. Pacitti, P. Valduriez: Replicated Databases: Concepts, Architectures and Techniques. Network and Information Systems Journal, Hermès, 1(3).
7. E. Pacitti, P. Minet, E. Simon: Replica Consistency in Lazy Master Replicated Databases. Distributed and Parallel Databases, Kluwer Academic, 9(3).
8. E. Pacitti, T. Özsu, C. Coulon: Preventive Multi-Master Replication in a Cluster of Autonomous Databases. Int. Conf. on Parallel and Distributed Computing (Euro-Par 2003), Klagenfurt, Austria.
9. K. Tindell, J. Clark: Holistic Schedulability Analysis for Distributed Hard Real-Time Systems. Microprocessors and Microprogramming, 40.
10. P. Valduriez: Parallel Database Systems: Open Problems and New Issues. Int. Journal on Distributed and Parallel Databases, 1(2).


More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Consistency and Replication Jia Rao http://ranger.uta.edu/~jrao/ 1 Reasons for Replication Data is replicated for the reliability of the system Servers are replicated for performance

More information

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein

10. Replication. CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety. Copyright 2012 Philip A. Bernstein 10. Replication CSEP 545 Transaction Processing Philip A. Bernstein Sameh Elnikety Copyright 2012 Philip A. Bernstein 1 Outline 1. Introduction 2. Primary-Copy Replication 3. Multi-Master Replication 4.

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information

Mutual consistency, what for? Replication. Data replication, consistency models & protocols. Difficulties. Execution Model.

Mutual consistency, what for? Replication. Data replication, consistency models & protocols. Difficulties. Execution Model. Replication Data replication, consistency models & protocols C. L. Roncancio - S. Drapeau Grenoble INP Ensimag / LIG - Obeo 1 Data and process Focus on data: several physical of one logical object What

More information

On the Study of Different Approaches to Database Replication in the Cloud

On the Study of Different Approaches to Database Replication in the Cloud Melissa Santana et al.: On the Study of Different Approaches to Database Replication in the Cloud TR ITI-SIDI-2012/011 On the Study of Different Approaches to Database Replication in the Cloud Melissa

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington Distributed storage systems

More information

S-Store: Streaming Meets Transaction Processing

S-Store: Streaming Meets Transaction Processing S-Store: Streaming Meets Transaction Processing H-Store is an experimental database management system (DBMS) designed for online transaction processing applications Manasa Vallamkondu Motivation Reducing

More information

Lazy Database Replication with Freshness Guarantees

Lazy Database Replication with Freshness Guarantees Lazy Database Replication with Freshness Guarantees Khuzaima Daudjee School of Computer Science University of Waterloo Waterloo, Ontario, Canada kdaudjee@uwaterloo.ca Kenneth Salem School of Computer Science

More information

DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK. UNIT I PART A (2 marks)

DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK. UNIT I PART A (2 marks) DEPARTMENT OF INFORMATION TECHNOLOGY QUESTION BANK Subject Code : IT1001 Subject Name : Distributed Systems Year / Sem : IV / VII UNIT I 1. Define distributed systems. 2. Give examples of distributed systems

More information

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report

Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases. Technical Report Transaction Management using Causal Snapshot Isolation in Partially Replicated Databases Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 Keller Hall 200 Union

More information

Chapter 2 Distributed Information Systems Architecture

Chapter 2 Distributed Information Systems Architecture Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Distributed Information Systems Architecture Chapter Outline

More information

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Distributed Systems. Lec 10: Distributed File Systems GFS. Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Distributed Systems Lec 10: Distributed File Systems GFS Slide acks: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung 1 Distributed File Systems NFS AFS GFS Some themes in these classes: Workload-oriented

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

Database Replication in Tashkent. CSEP 545 Transaction Processing Sameh Elnikety

Database Replication in Tashkent. CSEP 545 Transaction Processing Sameh Elnikety Database Replication in Tashkent CSEP 545 Transaction Processing Sameh Elnikety Replication for Performance Expensive Limited scalability DB Replication is Challenging Single database system Large, persistent

More information

Module 8 Fault Tolerance CS655! 8-1!

Module 8 Fault Tolerance CS655! 8-1! Module 8 Fault Tolerance CS655! 8-1! Module 8 - Fault Tolerance CS655! 8-2! Dependability Reliability! A measure of success with which a system conforms to some authoritative specification of its behavior.!

More information

Consistency & Replication

Consistency & Replication Objectives Consistency & Replication Instructor: Dr. Tongping Liu To understand replication and related issues in distributed systems" To learn about how to keep multiple replicas consistent with each

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication DB Reading Group Fall 2015 slides by Dana Van Aken Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports

More information

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University

740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University 740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess

More information

Cache Coherence in Distributed and Replicated Transactional Memory Systems. Technical Report RT/4/2009

Cache Coherence in Distributed and Replicated Transactional Memory Systems. Technical Report RT/4/2009 Technical Report RT/4/2009 Cache Coherence in Distributed and Replicated Transactional Memory Systems Maria Couceiro INESC-ID/IST maria.couceiro@ist.utl.pt Luis Rodrigues INESC-ID/IST ler@ist.utl.pt Jan

More information

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication

Important Lessons. A Distributed Algorithm (2) Today's Lecture - Replication Important Lessons Lamport & vector clocks both give a logical timestamps Total ordering vs. causal ordering Other issues in coordinating node activities Exclusive access to resources/data Choosing a single

More information

Consistency and Replication 1/62

Consistency and Replication 1/62 Consistency and Replication 1/62 Replicas and Consistency??? Tatiana Maslany in the show Orphan Black: The story of a group of clones that discover each other and the secret organization Dyad, which was

More information

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary

Consistency and Replication. Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Consistency and Replication Some slides are from Prof. Jalal Y. Kawash at Univ. of Calgary Reasons for Replication Reliability/Availability : Mask failures Mask corrupted data Performance: Scalability

More information

Important Lessons. Today's Lecture. Two Views of Distributed Systems

Important Lessons. Today's Lecture. Two Views of Distributed Systems Important Lessons Replication good for performance/ reliability Key challenge keeping replicas up-to-date Wide range of consistency models Will see more next lecture Range of correctness properties L-10

More information

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05

Engineering Goals. Scalability Availability. Transactional behavior Security EAI... CS530 S05 Engineering Goals Scalability Availability Transactional behavior Security EAI... Scalability How much performance can you get by adding hardware ($)? Performance perfect acceptable unacceptable Processors

More information

Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees

Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees Department of Computer Science Institute of Information Systems Fine-Grained Lazy Replication with Strict Freshness and Correctness Guarantees Fuat Akal Can Türker Hans-Jörg Schek Torsten Grabs Yuri Breitbart

More information

Cluster-Based Scalable Network Services

Cluster-Based Scalable Network Services Cluster-Based Scalable Network Services Suhas Uppalapati INFT 803 Oct 05 1999 (Source : Fox, Gribble, Chawathe, and Brewer, SOSP, 1997) Requirements for SNS Incremental scalability and overflow growth

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Lecture XII: Replication

Lecture XII: Replication Lecture XII: Replication CMPT 401 Summer 2007 Dr. Alexandra Fedorova Replication 2 Why Replicate? (I) Fault-tolerance / High availability As long as one replica is up, the service is available Assume each

More information