Servicing Range Queries on Multidimensional Datasets with Partial Replicas


Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz
Department of Computer Science and Engineering
Department of Biomedical Informatics
Ohio State University, Columbus, OH

Abstract

Partial replication is one type of optimization to speed up execution of queries submitted to large datasets. In partial replication, a portion of the dataset is extracted, re-organized, and re-distributed across the storage system. The objective is to reduce the volume of I/O and increase I/O parallelism for different types of queries and for the portions of the dataset that are likely to be accessed frequently. When multiple partial replicas of a dataset exist, a query execution plan should be generated so as to use the best combination of subsets of partial replicas (and possibly the original dataset) to minimize query execution time. In this paper, we present a compiler and runtime approach for range queries submitted against distributed scientific datasets. A heuristic algorithm is proposed to choose the set of replicas that reduces query execution time. We show the efficiency of the proposed method using datasets and queries from oil reservoir simulation studies on a cluster machine.

1 Introduction

A number of challenges must be overcome to carry out on-demand, interactive analysis of scientific datasets. The very large size of data in many scientific applications is a major problem; in many situations, large dataset sizes necessitate the use of distributed storage. Another challenge is to efficiently extract subsets of data for processing and data analysis. Data subsets can be specified by spatio-temporal range queries or by queries on other attributes such as pressure, permeability, and speed. Several optimizations can be employed to efficiently search for and extract the data of interest for analysis.
It is a common optimization to partition a dataset into chunks, each of which holds a subset of data elements, and use chunks as the unit of I/O and data management. When data is stored and retrieved in chunks, the number of disk seeks, and hence I/O latency, is reduced. In addition, managing a dataset as a set of chunks decreases the cost of metadata management, indexing, and data declustering. Indexing is another optimization technique to speed up query execution [11]. It enables rapid searching and extraction of metadata for data elements that intersect a given query. Efficient access to data also depends on how well the data has been distributed across multiple storage nodes. The goal of declustering [9, 14] is to distribute the data across as many storage units as possible so that data elements that satisfy a query can be retrieved from many sources in parallel. Caching is yet another optimization that targets multiple-query workloads [1, 10, 19, 21]. Caching attempts to reduce the execution time of queries through data and computation reuse by maintaining the results of queries previously executed in the system. The performance of these techniques depends on the data access and processing patterns of an application. If there is only one type of query, the dataset can be declustered and indexed to optimize the execution of that type of query. However, when there are multiple types of queries, a single indexing and declustering scheme may not give good performance for all query types. Partial replication is an optimization technique that can be employed when multiple queries and/or multiple types of queries are expected.

(This research was supported by the National Science Foundation under Grants #ACI (UC Subcontract # ), #EIA , #EIA , #ANI , #ACI , #ACI , #CCF , #CNS , #CNS , Lawrence Livermore National Laboratory under Grant #B and #B (UC Subcontract # ), NIH NIBIB BISTI #P20EB000591, and Ohio Board of Regents BRTTC #BRTT.)
The objective is to decrease the volume of I/O and increase I/O parallelism by re-organizing and re-distributing a portion of the dataset across the storage system. For example, if the elements of a dataset are organized into chunks and indexed based on a single attribute or a subset of attributes (e.g., atomic species or 3D coordinates), queries that involve only those attributes can be processed efficiently. Queries that involve another attribute, on the other hand, may not be answered efficiently. If a secondary index is created for that attribute, a full scan of the dataset can be avoided. However, this may still yield inefficient execution due to random seeks to different locations in data files to access tuples that match the query criteria. When partial replication optimizations are employed, a query can be answered from data subsets and replicas of datasets. Since partial replication can reorganize and redistribute a selected portion of a dataset, there may not be a one-to-one mapping between the chunks in the original dataset and those in the replica. In addition, it is possible that no single replica can fully answer the query. Thus, it becomes challenging to figure out how to piece together the requested data from multiple replicas and possibly from the original dataset. In this paper, we present a compiler and runtime approach for efficient execution of multidimensional range queries when partial replicas of a dataset exist. We present a compile-time query planning strategy to select the best combination of replicas in order to minimize query execution time in a distributed environment. The efficiency of the proposed strategy is demonstrated on a parallel machine using queries for analysis of data from oil reservoir simulations.

2 Related Work

Several research projects have looked at improving I/O performance using different declustering techniques [9, 14]. Parallel file systems and I/O libraries have also been widely studied, and many such systems and libraries have been developed [3, 7, 13]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. In the context of replication, most of the previous work targets issues like data availability during disk and/or network failures, as well as speeding up I/O performance by creating exact replicas of datasets [4, 5, 6, 16, 18, 20]. File-level and dataset-level replication and replica management have been studied [6, 16, 12]. In the area of partial replication [4, 5, 20], the focus has been on creating exact replicas of portions of a dataset to achieve better I/O performance.
Our goal is different in the sense that we investigate strategies for evaluating queries when (partial) replicas are stored. In the data caching context, Dahlin et al. [8] have evaluated algorithms that make use of remote memory in a cluster of workstations. Franklin et al. [10] have studied ways of augmenting server memory by utilizing client resources. Shrira and Yoder [19] have proposed a technique for managing cooperative caches in untrusted environments. Venkataraman et al. [21] have proposed a new buffer management policy, which approximates a global LRU page replacement policy. Andrade et al. [1] implement an active semantic cache framework for data reuse among multiple data analysis queries submitted against scientific multi-dimensional datasets. While these approaches can be employed for the management and replacement of replicas, our work focuses on improving query performance by re-organizing portions of input datasets.

3 Overview

3.1 Motivating Application

In simulation-based oil reservoir management studies [17], the goal is to investigate changes in reservoir characteristics and assess how injection and production wells should be placed to maximize oil production and at the same time minimize effects on the environment. An oil reservoir is a 3-dimensional subsurface volume composed of different types of rocks, soil, holes, and underground waterways. A complex numerical model simulates 17 different variables (such as gas pressure, oil saturation, and oil velocity) at each point of the reservoir mesh over many time steps. Due to computational and storage requirements, large-scale simulation runs require the use of distributed computers and storage systems. In order to improve I/O performance, a simulation dataset can be partitioned into a set of chunks, i.e., data elements are organized into application-specific groups and stored on disk.
A chunk is the unit of I/O, i.e., data is retrieved from disk in chunks, even if only a subset of the data elements in a chunk is needed. One possible grouping of data elements divides each grid along the x, y, and z dimensions as well as the time dimension into chunks (i.e., 4-dimensional rectangles). The chunks are stored on disk in data files. Each chunk contains a subset of grid points and time steps and the values of the variables computed at those grid points and time steps. The metadata associated with a chunk is the minimum bounding box of the chunk, the name of the file that contains the chunk, the offset in the file, and the size of the chunk. An R-Tree index [11] can be built on the space and time dimensions to store the bounding boxes of chunks. The index enables rapid search of chunks that intersect a given query. On a system with multiple storage devices, the chunks can be distributed across storage devices to achieve parallelism when a query accesses them.

3.2 Runtime Middleware

We employed a middleware framework, called STORM, to implement the runtime support for execution of queries. STORM [15] is a suite of services that collectively provide basic database support for 1) selection of the data of interest from scientific datasets stored in files and 2) transfer of selected data from storage nodes to compute nodes for processing. The data of interest can be selected based on attribute values, ranges of values, and user-defined filters. STORM supports user-defined distribution of data elements across destination nodes and transfer of the selected data subset from distributed source nodes to destination nodes according to that distribution. Scientific datasets are usually stored in a set of flat files with application-specific file formats. In order to provide common support for different applications, STORM employs virtual tables and select query abstractions, which are based on the object-relational database model.

STORM has been developed using a component-based framework, called DataCutter [2], which enables execution of application data processing components in a distributed environment. Using the DataCutter runtime system, STORM implements several optimizations to reduce the execution time of queries. Distributed Execution of Filtering Operations: Both data and task parallelism can be employed to execute filtering operations in a distributed manner. If a select expression contains multiple (user-defined) filters, a network of filters can be formed and executed on a distributed collection of machines. Parallel Data Transfer: Data can be transferred from multiple data sources to multiple destination processors by STORM data mover components. Data movers can be instantiated on multiple storage units and destination processors to achieve parallelism during data transfer.

4 Query Execution with Partial Replicas

In this section we present an algorithm for efficient execution of range queries when partial replicas of a dataset exist. We first describe the type of partial replicas we consider and give an overview of our system.

4.1 Partial Replicas Considered

Our main focus is on answering queries in the presence of partial replicas. Our system assumes that partial replicas have been created and that information about them is stored in a replica information file. A variety of approaches could be taken for selecting partial replicas. One possible approach is to use a group of representative queries to identify the portions of the dataset to be replicated. We assume that a hot region in a dataset is defined by a multi-dimensional bounding box. A partial replica of the dataset is then created from the data elements whose coordinates in the underlying multidimensional space fall into this bounding box. The contents of this bounding box are partitioned into chunks and these chunks are distributed across the system.
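To make the chunking of a hot region concrete, the sketch below enumerates the chunk bounding boxes of a region for a given chunk extent per dimension, traversing them in a chosen dimension order. This is an illustrative sketch under our own assumptions, not the system's actual replica-creation code; all names are invented for illustration.

```python
from itertools import product

def make_chunks(region, chunk_shape, dim_order):
    """Partition a hot region into chunk bounding boxes.

    region: dict mapping dimension name -> (low, high), inclusive.
    chunk_shape: dict mapping dimension name -> chunk extent.
    dim_order: traversal order; the last dimension varies fastest,
    which determines the layout order of the chunks.
    """
    per_dim = {}
    for d, (lo, hi) in region.items():
        step = chunk_shape[d]
        # split [lo, hi] into consecutive intervals of length `step`
        per_dim[d] = [(s, min(s + step - 1, hi)) for s in range(lo, hi + 1, step)]
    return [dict(zip(dim_order, combo))
            for combo in product(*(per_dim[d] for d in dim_order))]
```

For example, an 8x8 region with 4x4 chunks yields four chunk boxes, and changing `dim_order` changes only the order in which they are laid out, not the set of chunks.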
We allow flexible chunking of partial replicas, i.e., different replicas can have different chunk shapes and sizes, even if they cover the same multi-dimensional region. Moreover, the layout of chunks on disk can follow a different dimension order. This is important, as the dimension order influences the file seek time for retrieving a set of chunks. Once a partial replica is chunked and a dimension order is chosen, chunk layout is done by traversing the chunks in that dimension order. To maximize I/O parallelism when retrieving data, we assume that each chunk of a replica is partitioned across all available data source nodes. Note that if the replicas have been created with different chunk shapes and sizes, there will not be a one-to-one mapping between data chunks in the original dataset and those in the replicas.

4.2 System Overview

Our support for servicing range queries in the presence of partial replicas has been built on top of our prior work on supporting SQL Select queries on multidimensional datasets in a cluster environment [22]. A high-level overview of our system is shown in Figure 1. The underlying runtime system we use, STORM, was briefly described in the previous section. The STORM system requires the user to provide an index function and an extractor function. The index function chooses the file chunks that need to be processed for a given query. The extractor is responsible for reading the file(s) and creating rows of the virtual table. In our previous work, we described how a code generation tool can use metadata about the layout of the datasets to generate indexing and extractor functions [22]. The information about the replicas available in the system is stored in the replica information file. This file captures the multi-dimensional range covered by each (partial) replica, the size and shape of the chunks, and the dimension order for the layout of the chunks.
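The sketch below shows the kind of information the replica information file captures per replica, and how a query range could be matched against it. Field names and values are assumptions for illustration, not the system's actual on-disk format.

```python
# One entry per partial replica: the covered multi-dimensional range,
# the chunk shape, and the dimension order of the chunk layout.
# All values here are illustrative, not rows from the paper's Table 1.
replica_info = [
    {"id": 1,
     "range": {"time": (0, 999), "x": (0, 15), "y": (0, 63), "z": (0, 63)},
     "chunk_shape": {"time": 10, "x": 4, "y": 4, "z": 4},
     "dim_order": ["time", "x", "y", "z"]},
    {"id": 2,
     "range": {"time": (500, 1499), "x": (0, 7), "y": (0, 31), "z": (0, 31)},
     "chunk_shape": {"time": 10, "x": 8, "y": 8, "z": 8},
     "dim_order": ["time", "z", "y", "x"]},
]

def overlaps(a, b):
    """True if two per-dimension (low, high) range dicts intersect."""
    return all(a[d][0] <= b[d][1] and b[d][0] <= a[d][1] for d in a)

def candidate_replicas(query_range, info=replica_info):
    """IDs of replicas whose covered range intersects the query range."""
    return [r["id"] for r in info if overlaps(query_range, r["range"])]
```

A real system would then consult the per-replica index to find the individual intersecting chunks; this sketch stops at the replica level.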
Given a range query, the replica selection algorithm uses the replica information file and indices to determine which partial replicas, and which chunks within them, can be used to answer the query. Based on this information, it chooses a strategy, or execution plan, for answering the query. This module then interacts with our code generation modules. It is possible that no single replica can fully answer the query. In that case, it is necessary to generate subqueries for each selected replica that partially answers the query. A subquery corresponds to the use of one replica (or the original dataset) for answering a part of the query. The code generation modules generate indexing and extraction services for each subquery.

4.3 Query Execution

4.3.1 Computing Goodness Value

To execute a range query efficiently, we should minimize the total cost associated with executing the query on the required data. To that end, we need to establish a goodness value for replicas so that we can choose the ones that achieve the best query execution performance. Because the chunk is the basic unit of data access from disk, the goodness value is calculated on chunks. We should note that 1) the I/O cost we pay to retrieve a chunk might be too high when the chunk is only partially intersected by the query bounding box, and 2) the cost of a chunk is proportional to its size. Thus, we need to take into account both the cost per chunk and the amount of useful data the chunk contains to answer the query. Our goodness value is defined as the total amount of useful data in a chunk normalized by the estimated retrieval cost of the chunk:

goodness = useful_data_per_chunk / cost_per_chunk

The retrieval cost of a chunk is computed as:

[Figure 1. Overview of Our System]

[Figure 2. Using Partial Replicas for Answering a Query: (a) query and intersecting datasets, (b) 4 fragments from 2 replicas, (c) chunks retrieved.]

cost = k1 * C_read + k2 * C_seek

We measure the cost of the read operation, C_read, as the number of pages fetched from disk when answering a query; k1 is the average read time for a page. The cost of the seek operation, C_seek, is measured as the number of seeks used to locate the start point of a chunk during query execution; k2 is the average seek time for one read operation. This cost metric is relatively simplistic. However, the experiments presented in Section 5 demonstrate that it captures the dominant cost factor in query execution. As illustrated in Figure 2(a), Replica 1 has two kinds of chunks that are needed to answer the query. The chunks whose boundaries intersect the query boundary are called partial chunks, since they contain redundant tuples, which are not used to answer the query and which incur extra filtering operations. The chunks whose bounding boxes fall completely within the query bounding box are called full chunks. In order to compute the contribution of full and partial chunks when estimating the goodness value, we introduce an intermediate unit, called a fragment, between a replica and its chunks. A fragment is defined as a group of partial or full chunks in a replica having the same retrieval goodness value. Obviously, all full chunks of one replica are grouped as one fragment. In Figure 2(b), Replica 1 has three fragments and Replica 2 has only one fragment. In our implementation, each fragment has a goodness value associated with it:

goodness = useful_data_per_fragment / cost_per_fragment

Using this cost model, we estimate the execution cost for different combinations of replicas to answer the query and choose the best, i.e., the least expensive, combination of replicas.
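The cost and goodness formulas above can be sketched directly in code. The constants below are illustrative placeholders for k1 and k2, not measured values from the paper.

```python
def fragment_cost(pages_read, seeks, k1=0.0005, k2=0.009):
    """Estimated retrieval cost of a fragment: k1 is the average time
    to read one page, k2 the average seek time (illustrative values)."""
    return k1 * pages_read + k2 * seeks

def fragment_goodness(useful_bytes, pages_read, seeks):
    """Amount of useful data normalized by estimated retrieval cost,
    mirroring the per-fragment goodness formula above."""
    return useful_bytes / fragment_cost(pages_read, seeks)
```

A fragment of full chunks has all of its bytes useful to the query; a fragment of partial chunks pays the same retrieval cost for fewer useful bytes, so it scores lower and is chosen later (or not at all).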
A simple exhaustive search algorithm needs to explore a large search space to find the best combination of replicas. Suppose we are given r replicas with n chunks in total. In the worst case, the exhaustive algorithm needs to check up to 2^n potential combinations to find the most efficient way to answer an issued query. In order to avoid the exponential search cost and find an efficient combination quickly, we use a heuristic algorithm to recommend a set of candidate replicas.

4.3.2 Replica Selection Algorithm

Our approach has two main steps: a greedy heuristic (step 1) and an optimization extension (step 2). In the first step we obtain a set of candidate replicas, i.e., a set of candidate fragments, which are chosen based on their goodness values in a greedy manner. If the union of the fragments does not cover the query range completely, a remainder query range is generated to be directed to the original dataset as well. The candidate replicas, with or without
the original dataset, are our initial solution. This initial solution may contain chunks (from different replicas) that intersect each other in the underlying multi-dimensional space of the dataset. This results in redundant I/O volume, as two or more chunks may contain the same data element. In the second step, we attempt to eliminate redundant I/O by checking chunks in one candidate replica against the others. When multiple partial replicas exist, our algorithm finds all the replicas that can be used to answer the query, selects a combination of replicas with the minimum cost value, and performs an efficient set union operation among the data elements selected from these replicas to answer the query. The set union operation is needed since two replicas may contain overlapping sets of data elements selected by the query.

INPUT: Query Q; a given set of partial replicas R; the original dataset D.
OUTPUT: a list of candidate replicas (fragments), with or without the original dataset D, for answering the issued query.
// Let F be the set of all fragments intersecting the query boundary
 1. F <- {}
 2. Foreach R_i in R with R_i intersecting Q
 3.     Classify the chunks that intersect the query into fragments
 4.     Calculate the goodness value of each fragment
 5.     Insert the fragments of R_i into F
 6. End Foreach
// Let S be the ordered list of the candidate fragments in
// decreasing order of their goodness values
 7. S <- empty list
 8. While F is not empty
 9.     Find the fragment F_i with the maximum goodness value
10.     Remove F_i from F
11.     Append F_i to S
12.     Q <- Q - (F_i intersect Q)
13.     Foreach F_j in F
14.         If F_j intersects F_i
15.             Subtract the region of overlap from F_j
16.             Re-compute the goodness value of F_j
17.     End Foreach
18. End While
19. If Q is not empty
20.     Use D to answer the subquery Q
21.     Append D, for the range Q, to S

Figure 3. Greedy Algorithm

The first step of the algorithm is shown in Figure 3.
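The greedy first step of Figure 3 can be rendered as a runnable sketch. To keep the range subtraction of steps 12 and 15 simple, a fragment's coverage is modeled here as a set of discrete cell ids rather than a multi-dimensional box; the names and data layout are our own assumptions, not the actual implementation.

```python
def greedy_select(fragments, query_cells):
    """fragments: list of (name, cells, cost) tuples, cost > 0.
    Returns the chosen fragment names in selection order, plus the
    remainder cells that must be answered from the original dataset."""
    remaining = set(query_cells)
    # each candidate tracks which of its cells are still useful
    pool = [{"name": n, "cells": set(c) & remaining, "cost": cost}
            for n, c, cost in fragments]
    chosen = []
    while pool:
        # pick the fragment with the best useful-data/cost ratio
        best = max(pool, key=lambda f: len(f["cells"]) / f["cost"])
        pool.remove(best)
        if not best["cells"]:
            continue  # nothing useful left in this fragment
        chosen.append(best["name"])
        remaining -= best["cells"]
        # subtract the chosen region from the other candidates; their
        # goodness is implicitly re-computed on the next iteration
        for f in pool:
            f["cells"] -= best["cells"]
    return chosen, remaining  # remainder goes to the original dataset
```

For instance, a cheap fragment covering many query cells is taken first, overlapping cells are removed from the remaining candidates, and any cells no fragment covers are returned as the remainder subquery.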
Chunks in each replica that intersect the query are categorized as partial or full chunks and grouped into fragments, and the respective goodness values of the fragments are calculated (steps 2-6). For a given query Q, let us denote the set of all fragments as F and the list of all chosen fragments, in decreasing order of goodness value, as S. We apply our greedy search over F (the while loop of steps 8-18). We choose the fragment with the largest goodness value, move it from F to S, and modify Q by subtracting the range covered by this fragment. Note that for the other fragments in F, if they overlap with the chosen fragment, the area of overlap is subtracted and their goodness values are re-computed. To simplify the implementation, we re-compute the goodness value using the per-fragment formula instead of the per-chunk one. As shown in Figure 2(a), if we choose the only fragment of Replica 2 in the beginning, we need to subtract the portion of Fragment 2 in Replica 1 that overlaps with the fragment of Replica 2 and modify its goodness value. The while loop iterates until F is empty. At this point, the chosen fragment list S is the initial candidate solution. We may need to include the original dataset in the initial solution and direct a subquery to it, if the union of the ranges in S cannot completely answer the query (steps 19-21).

INPUT: a list of candidate replicas (fragments) S in decreasing order of goodness values.
OUTPUT: a set of replicas, with or without the original dataset D, for answering the issued query.
// Let Intersect be the variable indicating whether we can
// improve the candidate solution
 1. Intersect <- 0
 2. Foreach F_i in S
 3.     Foreach F_j in S with F_j != F_i
 4.         If F_i intersects F_j
 5.             Intersect <- 1 and break
 6.     End Foreach
 7. End Foreach
 8. If Intersect = 1
 9.     Foreach F_i in S, from the head of S
10.         Foreach chunk C in F_i
                // Let r be the union range contained by the
                // filtered areas of all other fragments
11.             If chunk C is completely within r
12.                 Drop C from F_i
13.             Foreach F_k after F_i in list S
14.                 If C intersects F_k
15.                     Modify F_k to retrieve the intersection
16.                     Subtract the intersection from C
17.                     If C is empty, break
18.             End Foreach
19.         End Foreach
20.     End Foreach

Figure 4. Extension to the Greedy Algorithm

The second step, shown in Figure 4, attempts to improve the initial solution obtained in the first step. Consider the example in Figure 2(a). Suppose that after the first step, the 4 candidate fragments in Figure 2(b), together with the original dataset, are the initial solution. Fragment 2 in Replica 1 requires additional filtering operations because the data in the regions of overlap can be retrieved from Fragment 4 in Replica 2. Clearly, the better solution is to delete the two chunks that fall in the region of overlap from Fragment 4 and use Fragment 2 for that region, as illustrated in
Figure 2(c). This solution results in fewer I/O operations and less filtering computation. In the second step, a check is done chunk by chunk for each fragment in S against the other fragments. If the data contained in a chunk is also contained in the data area marked to be filtered by other fragments, the chunk is deleted from the fragment. The running time of categorizing chunks and calculating the goodness values is O(n), where n is the number of chunks intersecting the query bounding box. The first step loops O(m) times, where m is the total number of fragments containing those n chunks. Within the while loop, the running time is O(m) for finding the fragment with the largest goodness value and O(m) for modifying the other fragments if they overlap with the chosen fragment. Thus, the total running time of the first step is O(m^2). The second step executes in O(n^2). Overall, the algorithm takes O(m^2 + n^2) to compute the set of replicas to be used. Since generally m << n, the runtime complexity of the algorithm is O(n^2).

5 Experimental Results

We carried out an experimental performance evaluation of the proposed algorithm using a large oil reservoir simulation dataset [17] generated using an application emulator to control its size and chunking. Table 1 displays the properties of the original dataset and its replicas that were used in the experiments. The columns TIME (the time dimension) and X, Y, and Z (the three spatial dimensions) in the table contain the time and space ranges in each dimension and the length of each division along that dimension. The value s:e:l denotes the start value (s), the end value (e), and the length of division (l) in the corresponding dimension. Each grid point of the reservoir mesh is represented with a tuple containing 17 floats and 4 integers; hence each tuple is 84 bytes. The approximate chunk size of each dataset is also displayed in the table.
Both the original and replica datasets are declustered across the disks of 8 nodes. In the experiments, we used three groups of replicas (along with the original dataset). These groups are denoted as Set #1, Set #2, and Set #3 in the table. The four queries used in the experiments are shown in Table 2. All of the experiments were carried out on a Linux cluster in which each node has a PIII 933MHz CPU, 512 MB of main memory, and three 100GB IDE disks. The nodes are inter-connected via Switched Fast Ethernet. The first set of experiments demonstrates the scalability of our method with increasing data size. Query #1 from Table 2 was executed with TIMEVAL set to 1199, 1399, 1599, and 1799 against the datasets in Set #1 of Table 1. Query execution time and the amount of data processed are displayed in Figure 5. Out of the 6 replicas in Set #1, the algorithm chose to use 4: replicas 0, 1, 2, and 4. When the query is executed using the original dataset only, it filters out more than 83% of the data retrieved from disk. While the replicas do not eliminate filtering completely, they reduce it drastically, thus improving performance: about 25% of the data retrieved from disk is filtered when the query uses replicas.

[Figure 5. Query execution time and amount of data processed with and without replicas.]

[Figure 6. Query execution time as the number of nodes is varied.]

The second set of experiments demonstrates the performance of the proposed method when the number of nodes is varied. Query #2 from Table 2 was executed using the datasets in Set #1 of Table 1. Again, out of the 6 replicas, the algorithm chose to use only 4: replicas 0, 1, 2, and 4. Query execution time, as shown in Figure 6, scales almost linearly up to 4 nodes. Although the use of 8 nodes provides an improvement over 4 nodes, execution time is not reduced by half. We attribute this to the partitioning of replica chunks.
When replicas are created, each chunk is partitioned across all storage nodes in order to provide better load balance. For the replicas displayed in Table 1, chunk sizes are not large enough to result in good-sized chunk parts when they are partitioned across 8 nodes. Hence, seek time starts dominating the I/O overhead, resulting in poor I/O performance.

Table 1. Properties of the original dataset and its replicas. In the TIME, X, Y, Z columns, s:e:l denotes the start value, end value, and the length of division along each dimension.

Replica     TIME        X         Y         Z          Chunk Size
orig        0:1999:1    0:16:17   0:64:65   0:64:65    6MB
Set #1:
  0         :1799:10    0:7:4     0:31:4    0:31:4     54KB
  1         :1799:10    0:15:4    0:15:4    0:63:16    215KB
  2         :1799:10    0:15:4    0:63:16   0:15:4     215KB
  3         :1799:10    0:15:16   0:7:4     0:7:4      215KB
  4         :1799:10    4:11:8    16:47:8   16:47:8    430KB
  5         :1799:10    4:15:6    34:63:6   34:63:6    181KB
Set #2:
  0         :1799:10    0:15:4    0:31:4    0:63:4     54KB
  1         :1799:10    1:16:8    33:64:8   1:64:8     430KB
Set #3:
  0         :1199:200   0:16:1    0:64:1    0:64:1     17KB
  1         :1199:10    0:16:17   0:64:65   0:64:1     928KB
  2         :1199:10    0:16:17   0:64:1    33:64:32   457KB

Query No.  Query Expression
1          select * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;
2          select * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11 and Y>=0 and Y<=31 and Z>=0 and Z<=31;
3          select * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=15 and Y>=0 and Y<=63 and Z>=0 and Z<=63;
4          select * from IPARS where TIME>=1000 and TIME<=1199;

Table 2. Queries used in the experiments.

The goal of the third set of experiments is to show the robustness of the proposed algorithm. If we were to execute Query #3 of Table 2 against the datasets in Set #1 of Table 1 for TIMEVAL values of 1199, 1399, 1599, and 1799, we would not be able to answer the query using only the replicas. Moreover, the data chunks to be retrieved from the original dataset would cover all of the data that could be retrieved from the replicas. As a result, the use of replicas is not beneficial in this case. The second step of the proposed algorithm (see Section 4.3.2) is designed to detect such cases and avoid using replica chunks. As seen in Figure 7, if the replicas are used, we actually get worse performance than just using the original dataset.
In the last set of experiments, we execute Query #4 of Table 2 against the original dataset, Set #2, and Set #3. The results are shown in Table 3. If the cost function took into account only the amount of data to be processed, the algorithm would favor the original dataset or Set #2 over Set #3, because with the original dataset and Set #2 slightly less data needs to be processed. However, Set #3 requires fewer chunks to produce the output, and its execution time is dominated by the volume of I/O. The others require processing of more chunks, in which case disk seeks start to dominate the I/O overhead. This experiment shows that a more accurate modeling of execution time is necessary for better estimates.

                           original   Set #2   Set #3
  execution time (seconds)
  data processed (MB)
  number of seeks

Table 3. Results from the execution of Query #4.

In our target applications, datasets are stored in a set of files. A partial replication strategy could be devised based on replication of subsets of files, and we could use methods developed for file-based replica selection. Such a strategy aims to improve performance by placing files on faster storage systems and distributing them across more nodes. We expect that our strategy will achieve better performance than file-replication-only strategies, since it not only takes advantage of faster storage and parallel I/O but also decreases the amount of data retrieved by reorganizing and repartitioning the data. We plan to carry out experiments in future work to quantify the performance improvements of our strategy over file-replication-based strategies.

6 Conclusions

In this paper we have investigated a compiler-runtime approach for execution of range queries on distributed machines when partial replica optimization is employed. A challenging issue is that when multiple partial replicas of a
[Figure 7. Query execution time and amount of data processed with and without replicas. The algorithm uses only the original dataset to answer the query.]

dataset exist, a query execution plan needs to be devised to efficiently compute the results of the query using a subset of replicas. We have proposed a cost metric and an algorithm to select the set of replicas that minimizes the execution time of a given query. Our experimental results show that 1) the algorithm scales well when the size of the query is increased, 2) it not only decreases the volume of I/O, but also increases the ratio of useful data to the total amount of data retrieved from disk, and 3) it is robust in that it can revert to the original dataset if using partial replicas is not beneficial.

References

[1] H. Andrade, T. Kurc, A. Sussman, and J. Saltz. Active Proxy-G: Optimizing the query execution process in the Grid. In Proceedings of the 2002 ACM/IEEE SC02 Conference. ACM Press, Nov.
[2] M. D. Beynon, T. Kurc, U. Catalyurek, C. Chang, A. Sussman, and J. Saltz. Distributed processing of very large datasets with DataCutter. Parallel Computing, 27(11), Oct.
[3] P. H. Carns, W. B. Ligon, R. B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Oct.
[4] C.-M. Chen and C. T. Cheng. Replication and retrieval strategies of multidimensional data on parallel disks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management. ACM Press.
[5] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2), June.
[6] A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, and B. Tierney. Giggle: a framework for constructing scalable replica location services.
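As an illustration of why charging for seeks matters in the cost estimate discussed above, the following Python sketch compares candidate plans under a simple model: one seek per chunk plus sequential transfer time. All names, constants, and the chunk/byte figures are assumptions made for the example, not measurements or code from the paper.

```python
# Seek-aware I/O cost sketch for choosing among partial replicas
# (illustrative only; constants are assumed, not measured).

SEEK_TIME_S = 0.008      # assumed average disk seek time (seconds)
BANDWIDTH_BPS = 50e6     # assumed sequential read bandwidth (bytes/s)

def estimate_cost(num_chunks, bytes_read,
                  seek_time=SEEK_TIME_S, bandwidth=BANDWIDTH_BPS):
    """Estimated I/O time: one seek per chunk plus sequential transfer."""
    return num_chunks * seek_time + bytes_read / bandwidth

def pick_plan(candidates):
    """candidates: {name: (num_chunks, bytes_read)}; return the cheapest plan."""
    return min(candidates, key=lambda name: estimate_cost(*candidates[name]))

# Hypothetical plans: the original dataset and Set #2 read slightly less
# data, but Set #3 touches far fewer chunks.
plans = {
    "original": (4000, 480e6),
    "set2":     (3500, 480e6),
    "set3":     (300,  500e6),
}
print(pick_plan(plans))  # -> set3
```

A volume-only metric would prefer the original dataset or Set #2 (fewer bytes), but once each chunk is charged a seek, Set #3 wins, mirroring the behavior observed for Query #4.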


More information

Workloads Programmierung Paralleler und Verteilter Systeme (PPV)

Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Workloads Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Workloads 2 Hardware / software execution environment

More information

Design and Implementation of A P2P Cooperative Proxy Cache System

Design and Implementation of A P2P Cooperative Proxy Cache System Design and Implementation of A PP Cooperative Proxy Cache System James Z. Wang Vipul Bhulawala Department of Computer Science Clemson University, Box 40974 Clemson, SC 94-0974, USA +1-84--778 {jzwang,

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Bitmap Indices for Fast End-User Physics Analysis in ROOT

Bitmap Indices for Fast End-User Physics Analysis in ROOT Bitmap Indices for Fast End-User Physics Analysis in ROOT Kurt Stockinger a, Kesheng Wu a, Rene Brun b, Philippe Canal c a Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA b European Organization

More information

Efficient Execution of Multi-Query Data Analysis Batches Using Compiler Optimization Strategies

Efficient Execution of Multi-Query Data Analysis Batches Using Compiler Optimization Strategies Efficient Execution of Multi-Query Data Analysis Batches Using Compiler Optimization Strategies Henrique Andrade Ý Þ, Suresh Aryangat Ý, Tahsin Kurc Þ, Joel Saltz Þ, Alan Sussman Ý Ý Dept. of Computer

More information

Efficient Retrieval of Multidimensional Datasets Through Parallel I/O

Efficient Retrieval of Multidimensional Datasets Through Parallel I/O Efficient Retrieval of Multidimensional Datasets Through Parallel I/O Sunil Prabhakar Computer Sciences Purdue University W. Lafayette, IN 47907 sunil@cs.purdue.edu Khaled Abdel-Ghaffar Elect. & Computer

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Tertiary Storage Organization for Large Multidimensional Datasets

Tertiary Storage Organization for Large Multidimensional Datasets Tertiary Storage Organization for Large Multidimensional Datasets Sachin More, Alok Choudhary Department of Electrical and Computer Engineering Northwestern University Evanston, IL 60028 fssmore,choudharg@ece.nwu.edu

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

On Multiple Query Optimization in Data Mining

On Multiple Query Optimization in Data Mining On Multiple Query Optimization in Data Mining Marek Wojciechowski, Maciej Zakrzewicz Poznan University of Technology Institute of Computing Science ul. Piotrowo 3a, 60-965 Poznan, Poland {marek,mzakrz}@cs.put.poznan.pl

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

The Fusion Distributed File System

The Fusion Distributed File System Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique

More information

Framework for replica selection in fault-tolerant distributed systems

Framework for replica selection in fault-tolerant distributed systems Framework for replica selection in fault-tolerant distributed systems Daniel Popescu Computer Science Department University of Southern California Los Angeles, CA 90089-0781 {dpopescu}@usc.edu Abstract.

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information