GBO Preliminary Research Proposal


Xiaodan Wang
Johns Hopkins University, USA
xwang@cs.jhu.edu

I. INTRODUCTION

Gray and Szalay [1] documented the data avalanche problem in the sciences, in which improvements in physical instruments and better data pipelines lead to exponential growth in data size. Paralleling this exponential trend is the accumulation of data at multiple, autonomous data sources. Exploring the resulting massive, widely distributed data is of immense scientific value: as Gray and Szalay observe, the number of scientific discoveries increases polynomially with the number of participating data sources [2]. Federated databases are an attractive solution for managing and sharing scientific data that are geographically distributed.

Increasingly, scientific discoveries are made by scanning large portions of the data to find correlations, mine data, extract features, and compute joins across distributed data sources [3]. These needle-in-a-haystack queries are long running and data intensive, so query throughput limits performance. To facilitate data exploration, various scientific disciplines have built federations of databases. Federations allow data that are too large to be widely replicated or stored at a single site to be managed independently. Examples include SkyQuery [4], Genbank [5], and EcoliHub [6]. To ensure high job throughput and prevent starvation of traditional workloads in a federated environment, we propose new query processing disciplines.

Scientific database federations built at a global scale render many goals of distributed query processing obsolete. Specifically, queries are I/O intensive and require non-indexed scans of multi-terabyte tables, which may take several hours to complete [3]. Data size and geography also dictate that transmitting data takes large amounts of time and has a profound impact on query performance.
Thus, workloads are not latency sensitive because of the large data sizes (accessing data at each site takes tens of seconds at best).

Our goal is to maximize query throughput in the federation through query scheduling techniques that incorporate network structure (exploiting high-capacity network paths) and account for data access requirements (maximizing data sharing among queries). A guiding principle of our work is that, rather than choosing the optimal plan for each query, we choose plans that penalize other concurrent queries minimally and improve overall query throughput. For example, algorithms that minimize completion time over-utilize the network by consuming all available resources to achieve a locally optimal plan [7]. Thus, we limit the amount of parallelism in query schedules to avoid multiple data transfers across large geographies. We also delay completion of individual queries to permit sharing of full table scans across multiple queries. While our approach increases the response time of certain queries, we observe orders of magnitude improvement in system throughput for astronomy workloads. This in turn expands the scale of exploration in astronomy by supporting queries over a larger federation and spatial region.

[Fig. 1. Spatial join in SkyQuery. The figure shows the query SELECT... FROM SDSS o, TWOMASS t, USNOB p WHERE XMATCH(o, t, p) < 3.5 and REGION(circle) submitted through the Web mediator, which reaches each archive through a wrapper; the cardinalities shown are SDSS: 30, TWOMASS: 100, USNOB: 800.]

II. QUERY SCHEDULING IN SKYQUERY

Astronomy presents a good case study of data-intensive queries against scientific database federations. SkyQuery, a federation of astronomy databases [4], typifies a data-intensive database federation, with dozens of sites distributed across three continents. Each member is a multi-terabyte survey cataloging celestial objects from regions in the sky across multiple spectral properties.
However, SkyQuery is facing a scalability crisis as data sizes and the number of sites continue to expand [8]. For example, it is not uncommon for join queries to yield intermediate results that are hundreds of megabytes in size [9]. Also, with roughly thirty sites distributed across three continents, the physical distribution and connectivity of sites in SkyQuery will grow considerably in the near future [8]. We study query performance in SkyQuery to gain insight into issues that are common among scientific database federations.

SkyQuery provides federated services to the public through Web services and Web forms. Users submit queries through a mediator, which communicates with member databases via a shared wrapper interface. The principal query is cross-match [4], which joins observations of the same astronomical object from several databases by correlating their locations in space. Cross-match extends SQL by adding two clauses. The region clause specifies an area for conducting the search.

The xmatch clause specifies an unordered list of databases to visit. Figure 1 illustrates a sample execution of a cross-match query involving three sites. The mediator produces a serial schedule (a left-deep join order that visits each site serially) that joins sites by ascending cardinality, defined as the number of rows that satisfy the region clause at each site. This approach is currently employed in SkyQuery and is effective at eliminating, early on, tuples that do not participate in the join at subsequent sites. As a result, it minimizes computation for databases on uniform networks under two assumptions: uniform join probabilities, and I/O and processing costs that are linear in the number of tuples.

While the current approach to query scheduling performs reasonably well, improvements can be made to reduce the cost of transmitting join results between sites and the cost of computing spatial joins at each site. A global-scale federation exhibits network heterogeneity (differing capacity on network paths between sites), and communication overhead accounts for up to 90% of query response time in SkyQuery. Thus, schedules should account for network structure rather than joining sites in strictly ascending cardinality order. Another observation is that queries tend to overlap in specific spatial regions of interest. As a result, I/O cost can be shared by reusing results from the same region of interest across multiple queries that exhibit shared data access. Thus, our goal is to improve query performance by minimizing both network and I/O costs for data-intensive workloads in a federated environment.

III. CONTRIBUTIONS

In this section, we present our contributions from previously published work before mapping out a path for further research. Our first step is to model network costs in query scheduling decisions using a balanced network utilization metric.
The metric rewards schedules that utilize paths with excess capacity and produce small intermediate results. We then present algorithms for scheduling inter-site joins, which exploit network locality and avoid narrow, long-haul paths in the federation (e.g., transfers between sites that cross continental boundaries) by making these paths more costly during optimization. Next, we introduce LifeRaft, a data-driven batch processing algorithm designed to improve query throughput at a single site. LifeRaft batches queries with overlapping data requirements at each site and executes them against an ordering of the data that maximizes data sharing among queries. This decreases I/O and increases cache utility. Together, balanced network utilization and LifeRaft form a holistic solution for improving query throughput in a database federation.

A. Related Works

Kossmann [10] presents a detailed survey of both past and current query processing and optimization techniques that minimize computation and communication costs by exploiting, for instance, intra-query parallelism, caching, and data replication. Most commercial optimizers still rely on System R-style dynamic programming algorithms [11] that exhibit exponential-time complexity [12][13] and are unsuitable for large-scale database federations. Moreover, works that address the communication costs of distributed queries assume network uniformity [14][15]. They minimize the size of intermediate results, which in turn reduces computation and network costs on uniform networks. Compared with previous approaches, we simplify one aspect of query optimization (dealing with join selectivity), which allows us to consider non-uniform and non-metric network costs and balance network utilization over all paths. Our algorithms have low, polynomial-time complexity and thus scale to large federations with hundreds of sites.
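To make the contrast with cardinality-only ordering concrete, the following is a minimal sketch of how a network-aware cost might rank two serial schedules. It is not the proposal's balanced network utilization metric; it is a simplified stand-in that charges each transfer its size divided by the throughput of the path it uses. The function names, the throughput table, and the "rows per second" unit are assumptions for illustration, and the intermediate result size uses the perfect join selectivity assumption stated later in the proposal.

```python
from typing import Dict, List, Tuple

# Hypothetical input type: measured bulk-transfer throughput between
# pairs of sites, expressed here in rows per second (an assumed unit).
Throughput = Dict[Tuple[str, str], float]


def path_throughput(tp: Throughput, a: str, b: str) -> float:
    """Look up the throughput of the undirected path between two sites."""
    return tp[(a, b)] if (a, b) in tp else tp[(b, a)]


def schedule_cost(order: List[str], card: Dict[str, int], tp: Throughput) -> float:
    """Total transfer time of a serial schedule that visits sites in `order`.

    Assumes perfect join selectivity: after joining a prefix of sites, the
    intermediate result has as many rows as the smallest site seen so far.
    """
    cost = 0.0
    rows = card[order[0]]
    for src, dst in zip(order, order[1:]):
        # Ship the current intermediate result over the src-dst path.
        cost += rows / path_throughput(tp, src, dst)
        # Joining at dst can only shrink the result.
        rows = min(rows, card[dst])
    return cost


if __name__ == "__main__":
    # Cardinalities taken from Figure 1; throughputs are made up.
    card = {"SDSS": 30, "TWOMASS": 100, "USNOB": 800}
    tp = {
        ("SDSS", "TWOMASS"): 10.0,    # assumed narrow, long-haul path
        ("SDSS", "USNOB"): 100.0,     # assumed high-capacity path
        ("USNOB", "TWOMASS"): 100.0,  # assumed high-capacity path
    }
    by_cardinality = ["SDSS", "TWOMASS", "USNOB"]
    network_aware = ["SDSS", "USNOB", "TWOMASS"]
    print(schedule_cost(by_cardinality, card, tp))
    print(schedule_cost(network_aware, card, tp))
```

Under these made-up throughputs, the ascending-cardinality order pays for one transfer over the narrow path, while the alternative order avoids that path entirely, which is the kind of difference a network-aware cost is meant to expose.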
Evaluating distributed queries using semi-joins can provide substantial network savings [14][16][17]. Semi-joins ship only the attributes that are necessary for evaluating a join to another site in order to eliminate tuples that fail to satisfy the join predicate. In many settings, semi-joins are not attractive because computational overhead outweighs network savings on local area networks [18]. For network-bound queries on wide-area networks, semi-joins become more attractive because communication costs dominate performance. Our application of semi-joins differs in that we use them to limit attribute aggregation, rather than to reduce the cardinality of intermediate results.

We also look to data-centric routing in wireless sensor networks for inspiration. Several works perform in-network aggregation of data from multiple sensors to a single base station [19][20][21]. They organize sensors into a spanning tree rooted at the base station and aggregate data along the tree in order to conserve power by minimizing network usage. For example, Meliou et al. [22] study the NP-hard problem of finding an optimal tour for gathering data from a subset of sensors. They employ a similar metric to capture network utilization, but solve a different problem with different techniques: optimizing power consumption using dynamic programming alone. We explore more complex network structure and variable-size intermediate results, and employ dynamic programming on top of spanning tree solutions.

In addition to communication cost, we minimize I/O for scan-intensive queries through batch processing. The query batching paradigm was studied for workloads against large datasets on tertiary storage in order to minimize I/O cost [23-25]. Yu and DeWitt [25] explored this in the Paradise system by reordering queries over data stored on magnetic tape.
The reordering achieves sequential I/O by collecting data requirements during a pre-execution phase (without physically performing the I/O), reordering tape requests, and finally executing queries concurrently in one batch. However, queries participating in the join midway must wait until the entire batch finishes. Our approach is not limited to sequential data processing. Sarawagi [24] provides non-sequential processing by partitioning the data into fragments that are physically contiguous on the tertiary device and scheduling concurrent queries on a per-fragment basis. While our work leverages some of the ideas described in these works, we also

explore additional metrics for high query throughput, namely the amount of data contention and query starvation.

Google's Map-Reduce [26] is an attractive paradigm for parallel computation and is evolving to encompass more data-intensive tasks. Yang et al. [27] extend the Map-Reduce paradigm to more efficiently support relational joins by adding a merge phase that processes heterogeneous datasets simultaneously. More recently, Olston et al. [28] combined the procedural style of Map-Reduce with declarative SQL constructs in a parallel programming paradigm. Agrawal et al. [29] incorporate batch processing for Map-Reduce environments and identify data sharing among map tasks. Jobs that scan the same files are co-scheduled to maximize throughput. While their results are theoretical and do not consider caching, we plan to adapt their solution to query scheduling.

We also highlight the current approaches used in SkyQuery to achieve high throughput. The CasJobs [3] system avoids starvation of short queries by data-intensive scan queries through a multi-queue job submission system. The distinction between long and short queries is arbitrarily decided. Extra hardware is used to assign queries from each class to different servers, which is problematic since the longest short queries interfere with the short queue and the shortest long queries are starved. The throughput of long-running queries is further improved by partitioning the data and evaluating the queries in parallel across servers. However, achieving a balanced distribution of the workload across multiple servers is difficult because certain regions in the sky are accessed more frequently. Our work does not use ad hoc mechanisms to distinguish long- and short-running queries. Instead, queries of all sizes are supported in a single system.

B. Network-Aware Join Processing

As the SkyQuery federation expands geographically, the data and scientific queries become large and naturally distributed, which leads to poor query processing performance. Each site may produce hundreds of megabytes of data to be joined at the other sites. Query processing involves sending data from site to site, accumulating results, and eventually delivering query results to the scientist. Given the large data sizes and geographic distance between sites, query processing consumes vast amounts of network resources.

Traditional query processing techniques are poorly suited to the scale and heterogeneity of database federations deployed at a global scale. Previous work focuses on minimizing query completion time. This includes parallel computation at multiple sites and reducing the volume of network traffic. Algorithms that focus on reducing the volume of traffic underutilize the network because paths with excess capacity may be overlooked. Our solution balances the utilization of all network paths, improving performance by an order of magnitude for queries that include ten or more sites [30][31]. Our algorithms identify network structure, such as the throughput of paths and clusters of sites. We then use this structure to identify excess capacity in the network and schedule joins on those paths.

[Fig. 2. Comparison of join schedules in SkyQuery: (a) Previous Scheduler; (b) Spanning Tree Approximation.]

Figure 2 illustrates the benefits of our scheduling techniques for a user query taken from SkyQuery's Web logs. Previous Scheduler denotes SkyQuery's existing algorithm, which minimizes the volume of network traffic but does not account for network distances. The resulting plan crosses the Atlantic several times and suffers from long data transfer times. In contrast, our Spanning Tree Approximation algorithm accounts for network structure during plan generation and produces a schedule that avoids long-haul paths.
For this example, our spanning-tree-based approximation algorithm achieves a twelve-fold reduction in network utilization: whereas the previous scheduler ignores network distances and crosses the Atlantic several times, extracting network structure and including it in plan generation yields a schedule that avoids long-haul paths.

Our scheduling technique uses a balanced network utilization metric to capture network structure. The metric accounts for the capacity of network paths (measured by the TCP throughput of bulk transfers between sites) and the size of the intermediate join results produced by a join schedule. Minimizing this metric for all queries reduces the total time during which the network's paths carry data associated with join queries, thereby reducing contention for network resources. However, there is no guarantee that the solution yields a minimum-response-time schedule for an individual query.

Scheduling is performed based on local information and aggregate statistics collected about the system prior to optimization. This allows for decentralized optimization decisions, achieving scale and incurring no communication overhead during optimization. Providing more timely knowledge during optimization about concurrent queries initiated from all sites in the network and the state of every network path would incur significant overhead and is not desirable.

A Spanning Tree Approximation algorithm is

used to minimize network utilization. The algorithm adapts the two-approximate solution to the traveling salesman problem (TSP) to query scheduling. Namely, join queries are initiated at the minimum-cardinality site, and intermediate results traverse paths along a minimum spanning tree. We also explore parallel execution strategies by observing that geographically co-located sites form highly connected clusters, which can compute the join in parallel and send the results to sites that are close to the mediator. To achieve a polynomial-time solution, we assume perfect join selectivity: the join of three relations with $r_m$, $r_n$, and $r_o$ tuples produces a result with $\min(r_m, r_n, r_o)$ tuples. This assumption holds for SkyQuery. We experimented with exhaustive, dynamic programming solutions that made no assumptions about selectivity at the expense of exponential complexity, but the performance benefits were negligible.

C. Data-Driven Batch Scheduling

Sites in SkyQuery service millions of queries each month [9] and are an ideal environment for batch processing. Many cross-match queries have long execution times (several hours or an entire day) and are not response-time sensitive: they navigate the entire sky and perform full database scans. Thus, evaluating multiple scan-intensive queries concurrently places substantial demands on the disk and limits the scale of exploration. Fortunately, select data regions experience frequent reuse (e.g., queries concentrated around star clusters of interest): 2% of the data accounts for more than half of the I/O requests. Queries that overlap in data access also occur close together in time, which benefits caching. Relaxing in-order scheduling in SkyQuery can reduce redundant I/O and achieve large improvements in query throughput.
Specifically, rather than executing queries in arrival order, we can interleave I/O requests from multiple queries (possibly increasing the wait time of existing queries) based on the amount of contention between queries for shared data. This is accomplished by first pre-processing each query to identify its data access requirements. (Pre-processing should be inexpensive relative to I/O, which is true for scan-based workloads on data that is spatially or temporally defined.) We can then co-schedule queries that access the same data to 1) eliminate redundant accesses to the disk and 2) amortize the cost of data access over multiple queries. However, co-scheduling must also account for the wait times of existing queries. Short-lived queries (minutes or seconds) that focus on a small region of the sky and are highly selective also exist in SkyQuery, and starving these queries is undesirable.

[Fig. 3. Co-scheduling queries to amortize I/O. Panels: No Share; LifeRaft.]

We developed a data-driven batch processing scheduler, LifeRaft [32], that identifies data sharing among queries and co-schedules queries against the data exhibiting the highest degree of sharing. Our approach is data-driven because it focuses on the data requirements of each query, instead of the arrival order, in order to coordinate query processing with access to secondary storage. This is accomplished by partitioning relational data tables into equal-sized buckets (each holding the same number of objects). (We use a space-filling curve to order the spatial data while preserving spatial proximity within each bucket.) Incoming queries are pre-processed to determine a list of sub-queries that satisfy the following property: each sub-query operates on a single bucket, and sub-queries can be processed in any order. The result of the original query is obtained by combining the sub-query results.
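The decomposition into per-bucket sub-queries can be sketched as follows. This is a minimal illustration and not LifeRaft itself: it assumes each query has already been pre-processed into the set of bucket ids it touches, treats all queries as present at once, and ignores arrival times, caching, and starvation, all of which the actual scheduler must weigh. The function names and the sample workload are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical representation: each query is the set of bucket ids that
# its sub-queries operate on (one sub-query per bucket).
Workload = Dict[str, Set[int]]


def decompose(queries: Workload) -> Dict[int, List[str]]:
    """Map each bucket id to the queries with a pending sub-query on it."""
    pending: Dict[int, List[str]] = defaultdict(list)
    for name in sorted(queries):
        for bucket in sorted(queries[name]):
            pending[bucket].append(name)
    return dict(pending)


def batch_order(queries: Workload) -> List[Tuple[int, List[str]]]:
    """List buckets by decreasing number of pending sub-queries.

    Ties are broken by bucket id so the order is deterministic.
    """
    pending = decompose(queries)
    return sorted(pending.items(), key=lambda kv: (-len(kv[1]), kv[0]))


def reads_without_sharing(queries: Workload) -> int:
    """Bucket reads if every query reads its own buckets independently."""
    return sum(len(buckets) for buckets in queries.values())


def reads_with_sharing(queries: Workload) -> int:
    """Bucket reads if each bucket is read once for all pending sub-queries."""
    return len(decompose(queries))


if __name__ == "__main__":
    # Made-up workload: three queries that all touch bucket 3.
    workload = {"Q1": {1, 2, 3}, "Q2": {3, 4}, "Q3": {3, 5}}
    for bucket, names in batch_order(workload):
        print(bucket, names)
    print(reads_without_sharing(workload), reads_with_sharing(workload))
```

Because sub-queries are independent and bound to a single bucket, grouping them by bucket is enough to read each bucket once for every query that needs it; the difference between the two read counts is the redundant I/O that co-scheduling removes in this toy setting.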
Finally, the scheduler reads buckets from disk one at a time in order of decreasing contention (the number of pending queries for shared data), so that queries whose workload (list of sub-queries) overlaps the bucket are processed concurrently, incurring no additional I/O.

Figure 3 illustrates the benefits of co-scheduling queries. The data table is partitioned into five buckets that are joined with three cross-match queries. We assume a uniform cost of one second for reading each bucket from disk. Also, only a single bucket can be cached at a time. The inter-query arrival time is one second; that is, the second and third queries arrive one and two seconds after the first, respectively. No Share denotes in-order processing by the database, which first processes the earliest query by reading and joining its buckets through $B_5$ in sequential order. This schedule does not account for data overlap (i.e., two queries accessing the same bucket) and results in redundant I/O. The LifeRaft scheduler illustrates an execution order with minimal I/O. Here, we delay the join of a query against some of its buckets, including $B_3$, until the remaining queries arrive, which allows us to co-schedule queries that share data. However, such reordering leads to one query completing before both of the others. Potentially, this can delay the completion of certain queries indefinitely.

IV. FUTURE WORK

So far, we have decoupled the discussion of network-aware join processing and data-driven batch processing. LifeRaft schedules scan-based workloads in a single-system environment. Looking forward, we want to extend LifeRaft to a federated environment, where it is not clear how join processing can be coordinated across multiple databases. Batch processing is effective in a single system because the data requirements of each query are provided a priori. This allows the scheduler to anticipate data regions that are accessed in the future and reorder existing queries appropriately. Applied to a federation,

this means that every site is aware of the data accessed by each query before queries visit and join these sites. This allows different sites to coordinate query execution order and maximize batch size over all sites (e.g., amortizing I/O cost over more queries). In essence, this is a distributed version of data-driven batch processing. We can sub-divide queries into buckets across multiple databases to facilitate inter-database batch processing.

[Fig. 4. Batch processing across sites.]

Figure 4 illustrates coordinated batch processing across two sites. Given sites A and B, pending sub-queries exist at B that operate on two buckets, $B_1$ and a second bucket, while a query is pending at A that also needs to visit and join against B. Because that query joins with the second bucket, B buffers work against it and schedules sub-queries against $B_1$ first. Once B receives the query's intermediate join results from A, it co-schedules the previously pending work and incurs no additional I/O for the incoming query. The drawback is that B may unduly starve requests while waiting for the workload from A so that a desired batch size is achieved. Moreover, this approach constrains memory at B because, to achieve a large batch size, it needs to buffer more data.

Our immediate goal is to generalize and improve batch processing for a single site. Given our initial empirical results [32], we want to provide a theoretical treatment of the issues by adapting a recent work [29], which studies a similar problem in the Map-Reduce framework. Their work does not include caching, and we plan to explore the impact of cache eviction policies. We also plan to address workload overflow, in which intermediate join results of queries will need to be stored to disk and fetched into memory for processing. This requires the scheduler to migrate matching pairs of join results and buckets into memory for evaluation. We note that in federations, the queries may be quite large because they include intermediate results from other sites that need to be joined with the current site. Moreover, we want to provide robust quality-of-service guarantees for interactive, short-lived queries in a unified framework. Short queries should finish quickly (regardless of the arrival order) and not risk starvation from prior, long-running queries.

Finally, we want to evaluate the applicability of our work to applications outside of scientific databases. With respect to network-aware scheduling, OLAP and DSS applications are starting to employ federations rather than summary databases and data warehouses [33][34], which exhibit similar challenges: real-time processing over data generated all over the globe. Even without perfect join selectivity, incorporating scheduling optimizations that capture network heterogeneity is valuable. We provide solutions for applications that depend on the accuracy of selectivity estimates and can tolerate exponential complexity. We also want to experiment with our batch processing algorithm on other temporal-spatial databases in which queries can be sub-divided into data-defined units of work. The Turbulence database [35] is one such example, which is evolving toward larger data sets and workloads.

REFERENCES

[1] J. Gray and A. Szalay, "Where the Rubber Meets the Sky: Bridging the Gap Between Databases and Science," IEEE Data Engineering Bulletin, vol. 27, no. 4, pp. 3-11.
[2] J. Gray and A. Szalay. (2003) "Online Science: The World-Wide Telescope as a Prototype for the New Computational Science." Presentation at the Supercomputing Conference. [Online]. Available: Gray/JimGrayTalks.htm
[3] W. O'Mullane, N. Li, M. Nieto-Santisteban, A. Szalay, and A. Thakar, "Batch is Back: CasJobs, Serving Multi-TB Data on the Web," in ICWS.
[4] T. Malik, A. S. Szalay, A. S. Budavári, and A. R. Thakar, "SkyQuery: A Web Service Approach to Federate Databases," in CIDR.
[5] A. Kementsietsidis, F. Neven, D. V. de Craen, and S. Vansummeren, "Scalable MultiQuery Optimization for Exploratory Queries over Federated Scientific Databases," in VLDB.
[6] The EcoliHub Web Service.
[7] S. Ganguly, W. Hasan, and R. Krishnamurthy, "Query Optimization for Parallel Execution," in SIGMOD.
[8] A. Szalay, J. Gray, A. Thakar, P. Kuntz, T. Malik, J. Raddick, C. Stoughton, and J. Vandenberg, "The SDSS SkyServer - Public Access to the Sloan Digital Sky Server Data," in SIGMOD.
[9] T. Malik, R. Burns, and A. Chaudhary, "Bypass Caching: Making Scientific Databases Good Network Citizens," in ICDE.
[10] D. Kossmann, "The State of the Art in Distributed Query Processing," ACM Comput. Surv., vol. 32, no. 4.
[11] D. Kossmann and K. Stocker, "Iterative Dynamic Programming: A New Class of Query Optimization Algorithms," ACM Trans. on Database Systems, vol. 25, no. 1.
[12] A. Deshpande and J. Hellerstein, "Decoupled Query Optimization for Federated Database Systems," in ICDE.
[13] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price, "Access Path Selection in a Relational Database Management System," in SIGMOD.
[14] A. L. P. Chen and V. O. K. Li, "Optimizing Star Queries in a Distributed Database System," in VLDB.
[15] P. Scheuermann and E. I. Chong, "Distributed Join Processing Using Bipartite Graphs," in ICDCS.
[16] D.-M. Chiu, P. A. Bernstein, and Y.-C. Ho, "Optimizing Chain Queries in a Distributed Database System," SIAM J. Comput., vol. 13, no. 1.
[17] Y. Kambayashi, M. Yoshikawa, and S. Yajima, "Query Processing for Distributed Databases using Generalized Semi-Joins," in SIGMOD.
[18] H. Lu and M. J. Carey, "Some Experimental Results on Distributed Join Algorithms in a Local Network," in VLDB.
[19] C. Intanagonwiwat, R. Govindan, and D. Estrin, "Directed Diffusion: A Scalable and Robust Communication Paradigm for Sensor Networks," in MOBICOM.
[20] B. Krishnamachari, D. Estrin, and S. B. Wicker, "The Impact of Data Aggregation in Wireless Sensor Networks," in ICDCSW.
[21] N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri, "Medians and Beyond: New Aggregation Techniques for Sensor Networks," in SenSys.
[22] A. Meliou, D. Chu, J. Hellerstein, C. Guestrin, and W. Hong, "Data Gathering Tours in Sensor Networks," in IPSN, 2006.
[23] J. Myllymaki and M. Livny, "Relational Joins for Data on Tertiary Storage," in ICDE.
[24] S. Sarawagi, "Query Processing in Tertiary Memory Databases," in VLDB.
[25] J.-B. Yu and D. J. DeWitt, "Query Pre-Execution and Batching in Paradise: A Two-Pronged Approach to the Efficient Processing of Queries on Tape-Resident Raster Images," in SSDBM.
[26] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI.
[27] H. C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters," in SIGMOD.
[28] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in SIGMOD.
[29] P. Agrawal, D. Kifer, and C. Olston, "Scheduling Shared Scans of Large Data Files," in VLDB.
[30] X. Wang, R. Burns, A. Terzis, and A. Deshpande, International Conference on Data Engineering, in ICDE.
[31] X. Wang, R. Burns, and A. Terzis, "Throughput-Optimized, Global-Scale Join Processing in Scientific Federations," in NetDB.
[32] X. Wang, R. Burns, and T. Malik, "LifeRaft: Data-Driven, Batch Processing for the Exploration of Scientific Databases," in CIDR.
[33] J. M. Hellerstein, M. Stonebraker, and R. Caccia, "Independent, Open Enterprise Data Integration," IEEE Data Engineering Bulletin, vol. 22, no. 1.
[34] M. Stonebraker, P. M. Aoki, W. Litwin, A. Pfeffer, A. Sah, J. Sidell, C. Staelin, and A. Yu, "Mariposa: A Wide-Area Distributed Database System," VLDB Journal, vol. 5, no. 1.
[35] E. Perlman, R. Burns, Y. Li, and C. Meneveau, "Data Exploration of Turbulence Simulations Using a Database Cluster," in SC, 2007.
