GlobeTP: Template-Based Database Replication for Scalable Web Applications
  Tobias Groothuyse, Swaminathan Sivasubramanian, and Guillaume Pierre.
  In Proceedings of WWW 2007, May 8-12, 2007, Banff, Alberta, Canada.
  Dina Adel Said, dsaid@vt.edu
Problem Definition
  How can we provide a scalable infrastructure for hosting dynamically generated web content?
  Past solutions:
    1. Cache generated pages.
    2. Distribute the computation across multiple application servers.
    3. Cache the results of DB queries.
  Problem: the bottleneck then resides in the throughput of the origin DB.
Problem Definition (cont.)
  Solution: use DB replication.
  Problem: it doesn't scale linearly, because every update, delete, and insert (UDI) query must be executed on each DB replica.
  Past solutions:
    1. Increase the throughput of each individual server.
    2. Partial replication.
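To make the scaling problem concrete, here is a minimal sketch of a simplified load model. It assumes reads are spread evenly over the replicas while every UDI query runs on all of them; the function names and the numbers in the usage note are illustrative and not taken from the paper.

```python
def per_replica_load(read_load: float, udi_load: float, replicas: int) -> float:
    """Load on one replica under full replication.

    Assumption: reads are divided evenly among replicas,
    while each UDI query is executed on every replica.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    return read_load / replicas + udi_load


def speedup(read_load: float, udi_load: float, replicas: int) -> float:
    """Throughput gain relative to a single server under the same model."""
    single = per_replica_load(read_load, udi_load, 1)
    return single / per_replica_load(read_load, udi_load, replicas)
```

Under this model, a workload with 90 units of read load and 10 units of UDI load gains only a factor of 5 from 9 replicas, since the UDI share is never divided.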
Partial Replication
  Past solutions:
    Relying on the application programmer: Gao et al. [2003].
    GlobeDB: Sivasubramanian et al. [2005].
      Record-level replication granularity.
      Provides excellent query latency.
      A central server maintains all the updates, then sends batch updates to the other servers.
      Does not improve throughput, because the central server is a bottleneck.
GlobeTP: Template-Based Solution
  The queries issued by a web application typically belong to a small number of query templates.
  Query template: a parameterized SQL query whose parameters are passed at run time.
  Knowing these templates, table placements are selected to ensure maximum throughput and reasonable latency.
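As an illustration of what a query template looks like in practice, the sketch below uses Python's standard `sqlite3` module. The `book` table, the `BOOKS_BY_AUTHOR` template, and the helper names are hypothetical and only serve the example.

```python
import sqlite3

# Hypothetical query template: the SQL text is fixed,
# and only the "?" parameter changes at run time.
BOOKS_BY_AUTHOR = "SELECT title FROM book WHERE author_id = ? ORDER BY title"


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database for the example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (title TEXT, author_id INTEGER)")
    conn.executemany(
        "INSERT INTO book VALUES (?, ?)",
        [("A", 1), ("B", 2), ("C", 1)],
    )
    return conn


def books_by_author(conn: sqlite3.Connection, author_id: int) -> list:
    """Instantiate the template with a run-time parameter."""
    rows = conn.execute(BOOKS_BY_AUTHOR, (author_id,))
    return [title for (title,) in rows]
```

Every call to `books_by_author` touches the same set of tables regardless of the parameter value, which is the property that template-based placement relies on.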
Models
  Application model: the application programmer is required to explicitly specify the application's query templates.
  System model:
Main problems to consider
  1. Cluster identification: ensure that the placement of tables leaves at least one server able to execute each query template.
  2. Consider all the defined templates, read or UDI, and determine the placement that provides the maximum throughput.
  3. Define a load-balancing algorithm that distributes read queries efficiently.
Data Placement: Cluster Identification
  Goal: determine the sets of tables that must be replicated together so that every template can execute correctly, while minimizing the number of servers that must execute each UDI query.
  Characterize each query template by:
    1. Whether it is a read or a UDI query.
    2. The set of tables that it accesses.
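The characterization above can be sketched as a small data structure plus a validity check for a candidate placement. This is an illustrative sketch, not the paper's algorithm: the `Template` class, the function names, and the rule that a UDI query runs on every server holding any table it touches are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    qid: str            # hypothetical template identifier
    is_udi: bool        # True for update/delete/insert, False for read
    tables: frozenset   # tables the template accesses


def servers_for_read(template, placement):
    """Servers that hold every table the read template accesses."""
    return [s for s, tables in placement.items() if template.tables <= tables]


def servers_for_udi(template, placement):
    """Servers that hold any table the UDI template touches (assumed rule)."""
    return [s for s, tables in placement.items() if template.tables & tables]


def placement_is_valid(templates, placement):
    """Every template must have at least one server able to execute it."""
    for t in templates:
        targets = servers_for_udi(t, placement) if t.is_udi else servers_for_read(t, placement)
        if not targets:
            return False
    return True
```

Here `placement` maps a server name to the set of tables it stores. The length of `servers_for_udi(...)` is the UDI fan-out that the goal above tries to keep small.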
Data Placement: Load Analysis
  Determine the load received by each cluster.
  Load on table clusters depends on:
    Whether the query is a read or a UDI query.
    The frequency of template occurrence.
    The computational cost of executing the query, estimated by either:
      Using DB system tools to estimate the actual execution time.
      Running the query in a live system.
  Determine the load on DB servers (read or UDI queries).
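A minimal sketch of this load analysis follows. It assumes a template's load is its frequency multiplied by its per-query cost, that read load is split evenly among the servers able to serve it, and that UDI load is paid in full by every server holding an affected table. The dictionary keys and the function name are hypothetical.

```python
def server_loads(templates, placement):
    """Estimate the load on each DB server for a given table placement.

    templates: list of dicts with keys "tables" (set), "is_udi" (bool),
               "frequency" (queries per unit time) and "cost_ms" (per query).
    placement: dict mapping server name -> set of tables stored there.
    """
    loads = {server: 0.0 for server in placement}
    for t in templates:
        load = t["frequency"] * t["cost_ms"]
        if t["is_udi"]:
            # Assumed rule: each server holding an affected table runs the UDI.
            targets = [s for s, tabs in placement.items() if t["tables"] & tabs]
            share = load
        else:
            # Reads need all their tables on one server; split them evenly.
            targets = [s for s, tabs in placement.items() if t["tables"] <= tabs]
            if not targets:
                raise ValueError("no server can execute a read template")
            share = load / len(targets)
        for s in targets:
            loads[s] += share
    return loads
```

The even split for reads is a simplification; the actual distribution depends on the query routing policy.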
Data Placement: Cluster Placement
  Determine the placement of the clusters across the set of DB servers such that the load on each replica is minimized.
  An exhaustive search costs $O(2^{N \cdot T} / N!)$, where $T$ is the number of tables and $N$ is the number of nodes.
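For very small inputs, an exhaustive search can be sketched directly. The version below is an illustration under stated assumptions, not the paper's procedure: it places each table on a non-empty subset of servers, scores a placement by the load of its most loaded server, and reuses a simplified load model (reads split evenly, UDIs paid on every server holding an affected table). All names are hypothetical.

```python
from itertools import combinations, product


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    items = list(items)
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def max_load(templates, placement):
    """Load of the busiest server, or None if some read cannot be served."""
    loads = {s: 0.0 for s in placement}
    for tables, is_udi, load in templates:
        if is_udi:
            targets = [s for s in placement if tables & placement[s]]
            share = load
        else:
            targets = [s for s in placement if tables <= placement[s]]
            if not targets:
                return None
            share = load / len(targets)
        for s in targets:
            loads[s] += share
    return max(loads.values())


def best_placement(templates, tables, servers):
    """Try every assignment of tables to server subsets; keep the best one."""
    tables = sorted(tables)
    options = list(nonempty_subsets(servers))
    best = None
    for choice in product(options, repeat=len(tables)):
        placement = {s: set() for s in servers}
        for table, hosts in zip(tables, choice):
            for s in hosts:
                placement[s].add(table)
        score = max_load(templates, placement)
        if score is not None and (best is None or score < best[0]):
            best = (score, placement)
    return best
```

The number of candidates grows exponentially with both the number of tables and the number of servers, so this sketch is only practical for toy inputs.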
Query Routing
  Round Robin (RR):
    Efficient if all incoming queries have the same cost.
  RR-QID (RR by Query ID):
    Each query template is identified by its QID.
    Each queue is associated with the set of DB servers that can serve a certain QID.
    Queries are dispatched in RR fashion within each queue.
  Cost-based routing:
    Upon arrival of an incoming query, the query router estimates the current load on each DB server.
    The query is scheduled on the least loaded DB server that can serve it.
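The two template-aware policies can be sketched as follows. The class names, the static per-QID cost table, and the way outstanding cost is tracked as a load estimate are assumptions made for illustration; the slides do not specify how the router estimates load.

```python
from itertools import cycle


class RRQIDRouter:
    """Round robin within the set of servers able to serve each QID."""

    def __init__(self, servers_by_qid):
        # One independent rotation per query template.
        self._queues = {qid: cycle(servers) for qid, servers in servers_by_qid.items()}

    def route(self, qid):
        return next(self._queues[qid])


class CostBasedRouter:
    """Send each query to the least loaded server that can serve it."""

    def __init__(self, servers_by_qid, cost_by_qid):
        self.servers_by_qid = servers_by_qid
        self.cost_by_qid = cost_by_qid  # assumed known cost per template
        self.load = {s: 0.0 for servers in servers_by_qid.values() for s in servers}

    def route(self, qid):
        server = min(self.servers_by_qid[qid], key=lambda s: self.load[s])
        self.load[server] += self.cost_by_qid[qid]  # query is now outstanding
        return server

    def complete(self, server, qid):
        """Call when the query finishes so the estimate drops again."""
        self.load[server] -= self.cost_by_qid[qid]
```

With a cheap and an expensive template sharing two servers, the cost-based router keeps sending cheap queries to the server that is not busy with the expensive one, whereas plain RR would alternate regardless of cost.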
Experiments
  Compare GlobeTP with full DB replication using:
    TPC-W: a standard e-commerce benchmark.
    RUBBoS: a bulletin-board benchmark modeled after slashdot.org.
Experiments (cont.)
  Query latency distributions using 4 servers.
Experiments (cont.)
  Maximum achievable throughputs with 90% of queries processed within 100 ms.
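The throughput criterion above can be checked on a set of measured latencies with a few lines of code. This is a sketch only: the function name is hypothetical, and treating "within 100 ms" as less than or equal to 100 ms is an assumption.

```python
def meets_latency_target(latencies_ms, limit_ms=100.0, fraction=0.90):
    """Return True if at least `fraction` of queries finished within `limit_ms`."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    within = sum(1 for latency in latencies_ms if latency <= limit_ms)
    return within / len(latencies_ms) >= fraction
```

The maximum achievable throughput is then the highest offered load at which this check still passes.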
Advantages
  Easily coupled with a distributed DB query cache.
  Does not require any modification to the application itself.
Disadvantages
  Does not support transactions. However, they could be implemented in the query router.
  Limited by the table-level granularity of its partial replication.
  Fault-tolerance issues.
  Does not take into consideration the long-term load variations that must be expected when operating a popular dynamic web site.
References
  Lei Gao, Mike Dahlin, Amol Nayate, Jiandan Zheng, and Arun Iyengar. Application specific data replication for edge services. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, pp. 449-460, Budapest, Hungary. 2003. ISBN 1-58113-680-3.
  Swaminathan Sivasubramanian, Gustavo Alonso, Guillaume Pierre, and Maarten van Steen. GlobeDB: autonomic data replication for web applications. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pp. 33-42, Chiba, Japan. 2005. ISBN 1-59593-046-9.
Thank you
  dsaid@vt.edu