Stop-and-Resume Style Execution for Long Running Decision Support Queries


Surajit Chaudhuri (Microsoft Research)    Raghav Kaushik (Microsoft Research)    Abhijit Pol (University of Florida)    Ravi Ramamurthy (Microsoft Research)

ABSTRACT

Long running decision support queries can be resource intensive and often lead to resource contention in data warehousing systems. Today, the only real option available to the DBAs is to carefully select one or more queries and terminate them. However, the work done by such terminated queries is entirely lost even if they were very close to completion, and these queries will need to be run in their entirety at a later time. In this paper, we show how instead we can support a Stop-and-Resume style query execution that can partially leverage the work done in the initial query execution. In order to re-execute only the remaining work of the query, a Stop-and-Resume execution would need to save all the previous work. But this technique would clearly incur high overheads, which is undesirable. In contrast, we present a technique that can be used to save information selectively from the past execution so that the overhead can be bounded. Despite saving only limited information, our technique is able to reduce the running time of the resumed queries substantially. We show the effectiveness of our approach using TPC-H queries.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Database Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permissions from the publisher, ACM. VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM /07/09.

1. INTRODUCTION

Decision support queries can be long running. For example, recent TPC-H [7] benchmark results show that these queries might take even hours to execute on large datasets. When multiple long running queries are executed concurrently, they compete for limited resources including CPU, main memory, and disk space, especially the workspace area used to store temporary results, sort runs, and spilled hash partitions. Contention for valuable resources can substantially increase the execution times of the queries. Today, the only real option available to the DBAs is to carefully select one or more queries (based on criteria such as the importance of the query, the amount of resources used by it, or progress information) and terminate them. Terminating one or more queries releases all resources (CPU, memory, and disk space) allocated to them, which can then be used to complete other running queries. In today's database systems, the work done by such terminated queries is entirely lost even if they were very close to completion, and these queries will need to be run in their entirety at a later time. In this paper we show how instead we can support a Stop-and-Resume style query execution that partially leverages the work done in the initial query execution to optimize the execution time if we were to resume its execution.

One way to implement such a Stop-and-Resume style query execution is to suspend the execution threads of one or more low-priority queries and resume them at a later time whenever appropriate.
However, the main problem with this approach is that suspending the execution of a query only releases the CPU resources; the memory and disk resources are still retained until the query execution thread is resumed. In fact, any attempt to reuse all intermediate results without recomputation would require us to keep the state corresponding to those tuples, in the worst case requiring very large memory and/or disk resources (for instance, hash tables in memory, sort runs on disk, etc.), which could amount to large overheads. Therefore, in this paper we investigate Stop-and-Resume style query execution that is constrained to save and reuse only a fixed number of records (intermediate records or output records) and thus releases all other resources. We formulate a bounded query checkpointing problem. Such a formulation is attractive since it limits the resources retained by a query that has been terminated. The central issue then is to identify the set of records (whose cardinality does not exceed the prescribed bound) to save and reuse such that the work done during the query restart is minimized. The key challenge is to identify such a set of records irrespective of when the query is terminated. Given that there is a tradeoff between the amount of state saved and the efficiency of the query restart, there will naturally be cases where most of the work has to be redone if the bound is small. However, our experiments, based on a prototype built by modifying Microsoft SQL Server 2005, indicate that there are many cases where bounded query checkpointing can be effective even when only a small amount of state is saved and reused. Of course, our algorithm yields increasing benefits when more state is allowed.

Our proposal for Stop-and-Resume style execution is targeted at data-warehousing environments geared towards decision support. We assume that the database is read only except for a batched update window when no queries are executed. Finally, although saving and reusing intermediate results is reminiscent of the techniques used for dynamic query optimization [][6][5], the constraint of saving and reusing only a bounded amount of intermediate results is unique to our problem (see Section 7 for a more detailed comparison).

The outline of the paper is as follows. In Section 2 we review some preliminary concepts and also present desirable properties of Stop-and-Resume style query execution. Section 3 discusses Stop-and-Resume execution for query plans consisting of a single pipeline. In that section, we introduce how Stop-and-Resume query execution can be supported in the framework of the query processing engine of relational database systems. Based on a model of work for query execution, we also present an algorithm to optimally compute the set of intermediate results to save and reuse for the case of single pipeline execution plans. In Section 4, we extend our algorithms to work for more complex execution plans. We present an experimental evaluation of our prototype system in Section 5 and discuss important extensions in Section 6. Section 7 presents related work.

2. PRELIMINARIES

2.1 Definitions

Stop-and-Resume style query execution involves two distinct phases, the initial run and the restart run. The main approach we take in this paper focuses on saving intermediate results generated during the initial run (the first query execution, until it is terminated) and avoiding recomputation of the saved results during the restart run (the re-execution of the same query at a later time).

Stop-and-Resume style query execution is described around query execution plans. A query execution plan is a tree where the nodes of the tree are physical operators. Each operator exposes a GetNext interface and query execution proceeds in a demand driven fashion [7]. In this paper, for simplicity we do not consider parallel query execution plans. An operator is a blocking operator if it produces no output until it consumes at least one of its inputs completely. Consider the example of a Hash Join between two tables (A and B). The Hash Join is a blocking operator: the probe phase cannot begin until the entire build relation is hashed. A pipeline can be thought of as a maximal subtree of operators in an execution plan that execute concurrently. The examples in Figure 1 illustrate plans that execute in a single pipeline. Every pipeline has one or more source nodes, the operators that are the source of the records operated upon by the remaining nodes in the pipeline. The Table Scan of A in Figure 1(a) and the Index Scan of A in Figure 1(b) are examples of source nodes. We discuss execution plans consisting of multiple pipelines in Section 4.

2.2 Desirable Properties of a Stop-and-Resume Style Query Execution

In this section we discuss desirable properties of a Stop-and-Resume style query execution.

Correctness: The query when restarted must produce exactly the same set of output records that would have been produced by the traditional query execution engine without checkpointing.

Low Overhead: An ideal Stop-and-Resume style query execution should incur very low overheads during the first execution. If we do not terminate the query, the performance should be comparable to normal query execution. This is an important design point for viability. In addition, query termination must proceed instantaneously; this sharply constrains the amount of state that we can save.

Generality: A Stop-and-Resume framework should be applicable irrespective of when the query is terminated and should also work for a wide range of query execution plans.

Efficiency of Restart: The sum of the execution time before the query is terminated and the execution time after it is restarted should be as close as possible to the execution time of uninterrupted query execution. The key metric is how much of the work done during the initial run can be saved during the restart run.

2.3 Modeling Work during Query Execution

As hinted in the introduction, our approach is to identify the best set of intermediate records to save and reuse in order to have an efficient restart.
In order to reason about the best set, we require a way to measure the amount of work done during query execution. One could naturally consider using the optimizer cost model for this purpose; however, we are looking for a more lightweight alternative. In this paper we use the total number of GetNext calls measured over all the operators to model the work done during query execution. Recent work on progress estimation for queries also employs similar schemes to model the work done during query execution [3][3][4][4]. It is shown in [3] that the GetNext model has good correlation with execution times. We also present a validation of the GetNext model in Section 5.1, where we find that the amount of savings as measured by the GetNext model is almost the same as the savings in execution times.

3. SINGLE PIPELINE QUERY PLANS

In this section, we begin our discussion of Stop-and-Resume query execution with the class of queries that consist of a single execution pipeline. We extend our solution to query execution plans consisting of multiple pipelines in Section 4. We focus our discussion on pipelines that consist of a single source node and where the results of the pipeline are obtained by invoking the operator tree on each source tuple and taking their union in order. For such a pipeline P, the result of invoking the operator tree on tuple t from the source is denoted as P(t). If the set of tuples returned by the source node is S, then the result of executing the pipeline is Ǔ_{t ∈ S} P(t), where Ǔ denotes an in-order union. We denote source tuples using the symbols t_1, t_2, ..., and tuples produced at the root of P using the symbols r_1, r_2, .... We note that there are pipelines (having operators such as Top or Merge-Join) that do not fall in this class. Our techniques are applicable for such cases, but for simplicity we restrict ourselves to pipelines as defined above.

[Figure 1. Examples of single pipeline query execution plans: (a) a Filter over a Table Scan of A; (b) an Index Nested Loops Join with an Index Scan of A as the outer input and an Index Seek on B as the inner input.]
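To make the pipeline abstraction and the GetNext model of work (Section 2.3) concrete, here is a minimal Python sketch of our own (it is not the prototype's code, and all class and variable names are hypothetical) that models the plan of Figure 1(a) as demand-driven iterators and counts GetNext calls as a proxy for work:

    # Minimal model of a demand-driven pipeline (Figure 1(a)): a Filter over a Table Scan.
    # Work is modeled as the total number of GetNext calls across all operators.
    class Counter:
        def __init__(self):
            self.getnext_calls = 0

    class TableScan:
        def __init__(self, rows, counter):
            self.rows, self.pos, self.counter = rows, 0, counter
        def get_next(self):
            self.counter.getnext_calls += 1
            if self.pos == len(self.rows):
                return None                      # end of stream (EOS)
            row = self.rows[self.pos]
            self.pos += 1
            return row

    class Filter:
        def __init__(self, child, predicate, counter):
            self.child, self.predicate, self.counter = child, predicate, counter
        def get_next(self):
            self.counter.getnext_calls += 1
            while True:
                row = self.child.get_next()
                if row is None or self.predicate(row):
                    return row                   # pass through qualifying rows (or EOS)

    counter = Counter()
    root = Filter(TableScan([("a", 1), ("b", 7), ("c", 3)], counter), lambda r: r[1] > 2, counter)
    while (r := root.get_next()) is not None:
        pass                                     # drain the pipeline root
    print(counter.getnext_calls)                 # total work in the GetNext model

The records returned by a real source node would additionally carry the tupleid values used by the Source() primitive introduced in Section 3.2.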

3.1 A Framework for Stop-and-Resume Execution

The setting for Stop-and-Resume execution is as follows. During the initial run of the query, the query can be killed at any point. The system is allowed to keep track of some state during the initial run. When the query is killed, this state is saved, coupled with a modified execution plan that utilizes this state. During the restart run, this modified plan is executed. In general, we are allowed to save not only intermediate results (the output of an operator) but also the internal state of an operator (such as hash tables and sort runs). In this paper, we focus on saving only intermediate results. We discuss the internal state of operators in Section 6.

Consider a simple stop-resume technique that saves all intermediate results generated during the initial run. We save records at the root of the pipeline as and when they are produced. For example, for the plan shown in Figure 1(a), we save all the records returned by the Filter operator. When the query is terminated, these results are persisted. During the restart run, we can avoid re-computing these saved results by remembering the position of the record of Table A that was accessed last by the Scan node before termination. The query plan used for the restart run would first read the saved results, skip to the above position in the Scan node, and continue scanning Table A from that position. We refer to this strategy as the SAVE-ALL strategy. In order to implement the SAVE-ALL strategy, we need to modify the database server with the following mechanisms: saving records at the root node of a pipeline, needed during the initial run, and skipping a set of records in the source node of a pipeline, needed during the restart run. We discuss these mechanisms in detail in Section 3.2.
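As a rough illustration of these two mechanisms, the following self-contained Python sketch (our own simplification, not the modified server; the function names and the in-memory representation are hypothetical) saves the records produced at the pipeline root together with the last source position consumed, and builds a restart that first replays the saved records and then resumes the scan past that position:

    # SAVE-ALL sketch: a toy stand-in for the two server-side mechanisms.
    def initial_run_save_all(rows, predicate, stop_after_source_rows):
        """Initial run of a Filter-over-Scan pipeline; returns the state to persist."""
        saved, last_pos = [], 0
        for pos, row in enumerate(rows[:stop_after_source_rows], start=1):
            last_pos = pos                       # last source position accessed by the scan
            if predicate(row):
                saved.append(row)                # record produced at the pipeline root
        return saved, last_pos                   # state kept when the query is killed

    def restart_save_all(rows, predicate, saved, last_pos):
        """Restart run: replay the saved results, then continue the scan past last_pos."""
        yield from saved
        for row in rows[last_pos:]:              # skip the already-processed prefix
            if predicate(row):
                yield row

    rows = list(range(10))
    pred = lambda x: x % 3 == 0
    saved, last_pos = initial_run_save_all(rows, pred, stop_after_source_rows=6)
    assert list(restart_save_all(rows, pred, saved, last_pos)) == [x for x in rows if pred(x)]

The actual system implements the skip inside the scan operator itself (the skip-scan operator of Section 3.2); the sketch only mimics the control flow.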
Though simple and correct, the SAVE-ALL scheme has the following drawback. Since the query can be terminated at any point during execution, the number of intermediate results generated can be arbitrarily large. Consider the execution plan in Figure 1(a). If the selectivity of the filter predicate is close to 1, then the filter operator would return almost all the rows in Table A, resulting in a large set of intermediate results. In general, the amount of state retained by the SAVE-ALL strategy is unbounded. Thus, the overhead involved in persisting this state when the query is killed is unbounded, which is clearly undesirable.

This limitation suggests that we restrict the amount of state that can be persisted during the initial run. We thus formulate a bounded query checkpointing problem. While we defer the formal specification of this problem to Section 3.3, our strategy is similar to SAVE-ALL except that we save only a bounded subset of the results persisted by SAVE-ALL and thus skip a smaller set of records in the source node. For a given bound on the amount of state we can persist, there are various choices of which subset of intermediate records to save. Our goal is to pick an appropriate subset so that the amount of work saved during the restart run is maximum. Since we do not know when the query is going to be terminated, this subset has to be maintained during query execution, and we are faced with an online problem for which we propose an optimal online algorithm in Section 3.4.

Finally, we note that there is an inherent tradeoff between the amount of state that is saved and the amount of work we can save during restart. If we sharply bound the amount of state that can be saved, there will be cases when almost the whole work has to be redone. Consider queries that scan all the tuples in a table (e.g., select * from T) or sort all the tuples in a table. Irrespective of which k tuples we save, any scheme can skip at most k records in the scan. If k is small, most of the work has to be redone. Note that this has nothing to do with any specific algorithm for bounded checkpointing, but rather with the paradigm itself. This should set our expectations. But as we discuss in this section and demonstrate in the experimental section (Section 5), there are many cases where bounded checkpointing can yield significant dividends at a low overhead.

3.2 Primitives for Stop-and-Resume Execution

In this section, we outline some important primitives for supporting Stop-and-Resume execution.

TupleIDs at the Source Node: We assume that there is a unique tupleid value that can be used to identify each record in the source node. This can be done by using a tupleid value or by adding the primary key value to the key of a clustering index. For instance, Microsoft SQL Server adds the primary key to uniquefy any index key. We also assume that for any intermediate record r generated, the root of the pipeline can keep track of the tupleid value for the corresponding source record. We denote this primitive as Source(r).

Skip-Scan Operator: We introduce a new operator that can be used to implement the skip functionality outlined in the previous section; it can be thought of as a generalized version of a scan operator. It takes two parameters as input, a starting position (say LB) and an ending position (UB). The semantics of the operator is as follows: it scans the source node and stops the scan after processing the tuple at position LB. It then skips to the ending position (UB) and resumes the scan from the next tuple. The operator will include the record at position LB and will not include the record at position UB. We refer to this primitive as the skip-scan operator. Figure 2 shows how the skip-scan operator can be used for the resumption of the execution plan shown in Figure 1(a). Conceptually, the results that are cached corresponding to the skipped region are unioned with the rest of the result of the filter operator. The skip-scan operator can be built on top of existing operators such as Table Scan and Clustered Index Scan, utilizing the random access primitives from the storage manager. For instance, in a clustered index we can seek to the UB value using the key (we assume keys are unique). In the case of heap files we could remember the page position (pageid, slotid) to resume the scan from. In general, the skip-scan operator can be extended to skip multiple ranges. In this paper we primarily focus on skipping a single contiguous range at the source node.

Signaling Mechanism: Operators in the query execution plan (that can serve as the root of an execution pipeline) are extended with the ability to process the set of saved intermediate results. In addition, there needs to be a simple signaling mechanism to coordinate between the skip-scan and the root of the pipeline at the time of the restart. Similar to the end-of-stream (EOS) message that a scan operator sends at termination, the skip-scan operator sends an end-of-LB (EOLB) message before it skips to the UB. On receiving the EOLB message, the pipeline root processes the saved records. This mechanism is important in ensuring that the ordering of the tuples at the output of the pipeline is preserved. Figure 2 indicates how processing the saved records after getting the EOLB message can ensure the ordering of the tuples at the output of the filter operator.

[Figure 2. The Skip-Scan Operator: the saved results for the skipped region between LB and UB are unioned with the Filter output over the resumed scan.]

3.3 Bounded Query Checkpointing

We now formally define the bounded query checkpointing problem. Based on the primitives discussed above, we first define a space of restart plans. Fix a pipeline P. Suppose that the tuples returned at the root are r_1, r_2, ... in order. We refer to record r_i as the i-th record. An ordered set I consisting of tuples r_i, ..., r_{i+j} returned contiguously at the root of the pipeline is referred to as a window of result tuples, whose size is j+1 tuples. Such a window identifies a restart plan R(I) as follows. Let LB and UB denote the boundaries of the source region that can be skipped by saving the tuples in I. Recall that for an intermediate record r, Source(r) denotes the tupleid value of the source record that generated it. Formally: LB(I) is Source(r_{i-1}) if i > 1, and 0 (the start of the scan) if i = 1; UB(I) is Source(r_{i+j}). Then, R(I) is obtained by taking P, replacing the source scan with a skip-scan operator seeded with these LB and UB values, and taking an ordered union of the tuples in I at the root of the pipeline. The window I is said to be valid if R(I) is equivalent to P.

The bounded query checkpointing problem is the following online problem. Fix a budget of k tuples. As the pipeline P is executing, the goal is to maintain a valid window I of result tuples such that (a) its size is at most k, and (b) among all valid windows that satisfy condition (a), the execution cost of R(I) as measured in GetNext calls is minimum. While our approach can be extended to cover the case where the storage constraint is given in terms of number of bytes, in this paper we consider the constraint specified in terms of number of records for simplicity.

We now introduce the notion of the benefit of a restart plan, which we use instead of its cost. This notion is based on the number of GetNext calls skipped. We discuss the equivalence of the restart plan in Section 3.5. For record r_i, let N_i be the total number of GetNext calls issued in the pipeline until the point tuple r_i was generated at the root. We define the benefit of saving a record r_i as:

    Benefit(r_i) = N_i - N_{i-1}   if i > 1
                 = N_i             if i = 1

The Benefit of a window (r_i, ..., r_{i+j}), denoted Benefit(r_i, r_{i+j}), is the sum of the Benefits of all its records. Benefit(r_i, r_{i+j}) measures the amount of work done in generating this set of intermediate results. We observe that:

Claim: Let N be the total number of GetNext calls issued if the pipeline were run to completion. The cost of the restart plan corresponding to the window (r_i, ..., r_{i+j}) is N - Benefit(r_i, r_{i+j}).

Thus, our goal is to find the result window that has the maximum Benefit.
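To illustrate this bookkeeping with hypothetical numbers (the running counters N_i below are made up for the example), the following small Python snippet computes per-record Benefit values, the Benefit of candidate windows, and the restart cost implied by the Claim above:

    # Benefit bookkeeping sketch. N[i-1] holds N_i: the total GetNext calls issued in the
    # pipeline up to the point record r_i was produced at the root (hypothetical values).
    N = [120, 150, 900, 940, 1000]

    def benefit(i):
        """Benefit(r_i): GetNext calls attributable to producing r_i (1-based i)."""
        return N[i - 1] - (N[i - 2] if i > 1 else 0)

    def window_benefit(i, j):
        """Benefit of window (r_i, ..., r_j); the sum telescopes to N_j - N_{i-1}."""
        return sum(benefit(x) for x in range(i, j + 1))

    total_work = 1500                            # N: GetNext calls if run to completion
    best = max(((i, i + 1) for i in range(1, len(N))), key=lambda w: window_benefit(*w))
    print(best, window_benefit(*best))           # (3, 4): saving r_3 and r_4 skips 790 calls
    print(total_work - window_benefit(*best))    # restart cost per the Claim: 1500 - 790 = 710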
Figure 3 illustrates the problem for the query plan in Figure 1(a), where the constraint k on the number of intermediate tuples is set to the value 3. The Best-3 Region highlighted shows the window of result tuples with the maximum Benefit value.

[Figure 3. Table A's disk representation with the distribution of output records of the Filter operator, highlighting the First-3 and Best-3 regions.]

Note that bounding the state that is maintained and persisted is critical to our problem formulation. We desire that the overhead incurred owing to Stop-and-Resume execution be minimal. On the other hand, it is critical that even with a sharply bounded overhead, we are able to avoid a significant amount of work when the query is resumed. It is important to find out whether there exists a small window of result tuples that together yield significant benefit. By saving this small number of tuples, we can then avoid most of the work during resumption. We now report a study that measures the variation in the per-tuple benefit. We study this for a query over the TPC-H data that selects all rows from the lineitem table that satisfy a range predicate of the form l_shipdate <= constant. The execution plan for this query is identical to the plan shown in Figure 1(a). We execute this query over the TPC-H dataset where the data distribution is skewed (refer to Section 5 for details). Figure 4 plots the benefit of an intermediate tuple r_i against i. We note that, as desired, there is significant non-uniformity in the benefit values. Indeed, most of the benefit is clustered in a range of about 2000 tuples between r_4000 and r_6000, even though the lineitem table has 6 million tuples. This indicates that even with a very tight budget on the state, we can achieve significant savings. We next present our online algorithm to exploit this non-uniformity and optimally compute the set of records to save for single pipeline plans.

[Figure 4. Benefit(r_i) vs. i (y-axis: Total Number of GetNext Calls; x-axis: Number of Intermediate Results).]

3.4 Opt-Skip Algorithm

We now present our online algorithm Opt-Skip for the bounded query checkpointing problem, which is guaranteed to preserve the optimal plan as query execution proceeds.

    /* B = buffer for current window, k = total budget */
    /* BestB = buffer for best window */
    Algorithm Opt-Skip
    For each intermediate record r_i do:
        Append r_i to B
        If B.Size() <= k and Valid(B) then BestB = B
        Else if B.Size() > k then
            B = (r_{i-k+1}, ..., r_i)          // advance window
            If Benefit(B) > Benefit(BestB) and Valid(B) then BestB = B

    Figure 5. Algorithm Opt-Skip.

If the number of records returned at the pipeline root is less than or equal to the budget k, then we can simply save all of them. Otherwise, we always keep track of the valid window of result records with size at most k that has the best Benefit value. We use an additional buffer of size k to progressively search for a window of k records with a higher Benefit value. This is achieved by maintaining a sliding window of size k as records are returned at the pipeline root. Whenever the current window is valid and has a higher benefit than the current best, we reset the current best appropriately. The sliding window ensures that no (ordered) set of records with a higher Benefit value is missed. Algorithm Opt-Skip is outlined formally in Figure 5. The current best window is maintained in a buffer BestB while the sliding window of k tuples is maintained in buffer B. Note that the Opt-Skip algorithm is guaranteed to give the optimal solution.
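A runnable rendering of the algorithm of Figure 5 in Python is sketched below (our own transcription; the per-record benefits are hypothetical and the validity test is passed in as a pluggable predicate, anticipating Section 3.5):

    # Opt-Skip sketch: maintain the best valid window of at most k root records online.
    def opt_skip(records, k, valid):
        """records: iterable of (record, benefit) pairs produced at the pipeline root, in order."""
        B, B_benefit = [], 0                     # sliding window of the most recent <= k records
        best, best_benefit = [], 0               # BestB: best valid window seen so far
        for r, b in records:
            B.append((r, b)); B_benefit += b
            if len(B) <= k:
                if valid(B):
                    best, best_benefit = list(B), B_benefit
            else:
                dropped = B.pop(0)               # advance window to (r_{i-k+1}, ..., r_i)
                B_benefit -= dropped[1]
                if B_benefit > best_benefit and valid(B):
                    best, best_benefit = list(B), B_benefit
        return [r for r, _ in best], best_benefit

    # Toy run with hypothetical per-record benefits; every window is treated as valid here.
    recs = [("r%d" % i, b) for i, b in enumerate([120, 30, 750, 40, 60], start=1)]
    print(opt_skip(recs, k=2, valid=lambda w: True))   # -> (['r3', 'r4'], 790)

In the actual operator, Valid(B) is evaluated using the Source() ids at the edges of the window, as discussed next; the sketch also omits the two extra buffers (k+2 in total) that Section 3.5 introduces to handle duplicates.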
3.5 Checking Validity

Notice that in Algorithm Opt-Skip, we update the BestB buffer only if B is valid. We now discuss our method for checking that a window of result records is valid. Intuitively, we require that the set of source records skipped in the restart plan and the saved window of results be such that the skipped records generate exactly these results. We refer to this property as the skipping invariant. We capture this condition formally as follows. Consider a window of result records B = (r_i, ..., r_{i+j}). The set of source tuples S skipped consists of all tuples with id > LB(B) and <= UB(B). Then the plan R(B) is equivalent to P if Ǔ_{t ∈ S} P(t) = B. For any window B of result tuples, we can observe that B is contained in Ǔ_{t ∈ S} P(t). However, the converse need not necessarily hold. Consider for example an Index Nested Loops Join between Tables A and B (Figure 6). Say the outer relation has keys 1 through N, and some record in the outer relation has more than k matches in the inner relation owing to duplicates. Figure 6(a) illustrates how this can lead to a violation of the skipping invariant (using k=3).

We handle this issue by allocating two extra buffers in addition to the current window of k tuples (buffer B in Algorithm Opt-Skip). We use these k+2 buffers to maintain a sliding window of size k+2. Thus, if B is the window (r_i, ..., r_{i+k-1}), then we also store r_{i-1} and r_{i+k} (if they exist; corner cases can be easily handled). We check that Ǔ_{t ∈ S} P(t) is contained in B by checking that (Source(r_{i-1}) != Source(r_i)) and (Source(r_{i+k-1}) != Source(r_{i+k})). Figure 6 illustrates this. Using a window size of k+2, we can detect that cases 6(b) and 6(c) are incorrect and that only 6(d) maintains the skipping invariant in the presence of duplicates. Thus Opt-Skip uses k+2 buffers.

[Figure 6. Handling duplicates using k+2 buffers: (a) a sliding window of size k=3 over Tables A and B with more than k duplicate matches; (b)-(d) a sliding window of size k+2=5, of which only (d) yields a valid BestB window.]

4. SOLUTION FOR GENERAL CASE

In the previous section, we introduced Stop-and-Resume execution for plans with single pipelines. The main idea behind our solution was to save results at the root of the pipeline and skip the corresponding portion of the source node of the pipeline. In this section, we extend our solution to more complex query execution plans that consist of multiple pipelines.

4.1 Execution Plans with Multiple Pipelines

A query execution plan involving blocking operators (such as sort and hash join) can be modeled as a partial order of pipelines, called its component pipelines, where each blocking operator is the root of some pipeline. For example, the execution plan in Figure 7 features two pipelines, P1 and P2. These pipelines correspond to the build side and probe side of a Hash Join, respectively. In pipeline P1, Table A is scanned and the records that pass the selection criteria of the Filter operator are used in the build phase of the Hash Join. The execution of P2 commences after hashing is finished on the build side. The index on Table B is scanned and records are probed into the hash table for matches.

[Figure 7. Execution Plan with Multiple Pipelines: a Hash Join whose build-side pipeline P1 is a Filter over a Table Scan of A and whose probe-side pipeline P2 is an Index Scan of B.]

4.2 Bounded Query Checkpointing for Multi-pipeline Plans

The space of restart plans defined in Section 3 for single pipeline plans (which we call single-pipeline restart plans in this section) naturally extends to multi-pipeline plans. We are now allowed to save the tuples produced at the root of more than one pipeline and use the skip-scan operator to skip reading portions of the corresponding source nodes. For instance, in the execution plan in Figure 7, we could save tuples produced by the filter operator in pipeline P1 and use the skip-scan operator for Table A. Another alternative is to save tuples produced by the probe side of the hash join operator and use the skip-scan operator on Table B. A multi-pipeline restart plan is obtained by replacing some subset of the component pipelines with corresponding single-pipeline restart plans. A multi-pipeline restart plan is valid if all its component restart plans are valid. It is not hard to see that validity is sufficient for equivalence to the original plan. Our goal, as with single pipeline plans, is to find a valid restart plan such that the total state saved, counted as the number of tuples, is bounded and the cost of the plan measured in terms of GetNext calls is minimized. Again, as with single pipeline plans, we introduce the notion of the benefit of a tuple returned at the root of some pipeline, which is identical to the single-pipeline case. Thus, we are left with the online problem of maintaining the valid restart plan that yields the maximum benefit.

One limitation of the above approach is that we restrict ourselves to pipelines where the source nodes support the skip-scan operation. In general there may be pipelines where the source node is, say, a sort node. We cannot currently exploit the skip-scan operator in sort pipelines. But the techniques described in Section 4.3 can still exploit the other pipelines in such execution plans.

Similar to the case of simple selection queries (see Figure 4), the Benefit values for tuples that result from a join pipeline can also vary non-uniformly. We show this empirically for a query over the TPC-H data. The query we choose is such that the plan generated is the one in Figure 7, where we are building a hash table on Orders.Orderkey and probing using Lineitem.Orderkey. We enforce a predicate of the form o_orderdate <= constant on Orders. Figure 8 plots the benefit of join result tuple r_i against i. We find a significant non-uniformity in the benefit values, similar to Figure 4.

[Figure 8. Distribution of GetNext calls: Benefit(r_i) vs. i for the join pipeline (y-axis: Total Number of GetNext Calls; x-axis: Number of Intermediate Results).]
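As a rough illustration of the bookkeeping this formulation implies (our own sketch; the field names, ids, and benefit numbers are hypothetical), a multi-pipeline checkpoint can be described by one saved window per chosen pipeline, with the tuple budget and the Benefit aggregated across pipelines:

    # Multi-pipeline checkpoint sketch: one saved window per chosen pipeline.
    from dataclasses import dataclass

    @dataclass
    class PipelineWindow:
        pipeline_id: int
        saved: list        # window of records saved at this pipeline's root
        lb: int            # source tupleid bounding the skipped region (exclusive)
        ub: int            # source tupleid bounding the skipped region (inclusive)
        benefit: int       # GetNext calls skipped by this window's single-pipeline restart plan

    def within_budget(windows, k):
        """The bounded checkpointing constraint: total saved tuples across pipelines <= k."""
        return sum(len(w.saved) for w in windows) <= k

    def total_benefit(windows):
        return sum(w.benefit for w in windows)

    # Hypothetical example for the plan of Figure 7: a window on the build side (P1)
    # and a window on the probe side (P2); the Benefit is aggregated across pipelines.
    p1 = PipelineWindow(1, saved=["a", "b", "c"], lb=40, ub=70, benefit=900)
    p2 = PipelineWindow(2, saved=["x", "y"], lb=10, ub=25, benefit=300)
    candidate = [p1, p2]
    if within_budget(candidate, k=5):
        print(total_benefit(candidate))          # 1200 GetNext calls avoided on restart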
4.3 Algorithms

Any pipeline in an execution plan can be in one of three states: it has either completed execution, is currently executing, or has not yet started. Naturally, the candidates for the best-k tuples must be picked from the output of either the currently executing pipeline or the pipelines that have completed their execution. The Opt-Skip algorithm outlined in Section 3 can potentially be applied to one or more pipelines. Notice that the Benefit value can be aggregated across the pipelines. Given a fixed number of intermediate results (k) to save, the challenge is in distributing the k tuples among different pipelines to maximize the benefit. Computing the optimal distribution of k tuples in the multi-pipeline case could, however, require excessive bookkeeping: the algorithm would need to keep track of the best-k values for different values of k for different pipelines. The amount of working space required can be as high as nk buffers, in contrast to the working space of k buffers required by the Opt-Skip algorithm for a single pipeline. In order to keep the overheads of our solution low, we examine different ways in which we can use the single pipeline solution (Section 3.3) as a building block in order to compute the best set of tuples to save for query execution plans with multiple pipelines. At any point, we need a valid restart plan. We always maintain the BestB buffer containing k tuples (see Section 3.4) for the current pipeline. Whenever a pipeline finishes execution or the query is terminated, we need to find the best distribution of the k tuples among the current and previous pipelines. We now outline three approaches that consider multiple pipelines in computing the best k tuples to save.

Current-Pipeline Approach: A simple scheme is to only consider the currently executing pipeline and apply the Opt-Skip algorithm from the previous section. But this could potentially be a bad choice. Consider a join query between the Lineitem and Orders tables where the query plan is similar to the one shown in Figure 7 (assume A = Lineitem, B = Orders). Let the currently executing pipeline be pipeline P2, involving the Index Scan on the Orders table. If the query is terminated at this point, note that any scheme that uses the skip-scan operator on pipeline P2 cannot avoid rescanning the entire Lineitem table during the restart. A better strategy might actually utilize the skip-scan operator at the input to the join (pipeline P1). In our experimental section, we show that exclusively using the current pipeline can sometimes turn out to be a bad decision.

Max-Pipeline Approach: This approach picks the single pipeline that can provide the best Benefit value given a budget of k intermediate tuples. One interesting property is that this algorithm requires a working space of only k buffers, which is the same as the case for single pipeline plans. The algorithm is a generalization of the algorithm Opt-Skip. Note that the Opt-Skip algorithm described in the previous section keeps track of the current best (BestB) and a sliding window B that slides forward to find a better set. We can compare the BestB from the pipelines that have already completed execution with the window B for the pipeline currently executing. At any point we are then guaranteed to find the single best pipeline in which to cache k tuples across all previous pipelines.

Merge-Pipeline Approach: This approach uses a heuristic to allocate the k tuples among different pipelines. It uses the Opt-Skip algorithm to compute the k tuples to save for each pipeline independently. Consider the point where the second pipeline has finished executing. We now have 2k tuples cached, corresponding to two contiguous regions in the source nodes of the two pipelines. Let the intermediate tuples saved be (r_1, ..., r_k) and (s_1, ..., s_k). We need to eliminate k tuples from this set of 2k tuples to satisfy the constraint of bounded query checkpointing. We employ a greedy heuristic that works as follows. Let rmin denote the tuple that has the smaller Benefit value among r_1 and r_k, and smin denote the tuple that has the smaller Benefit value among s_1 and s_k. We eliminate the tuple that has the minimum Benefit value among rmin and smin. Intuitively, the greedy heuristic iteratively removes the tuple that has the minimum Benefit value among the edges of the two ranges. This technique distributes the k tuples while still retaining contiguous ranges in both pipelines (a sketch of this elimination step is given after the approach descriptions). After every pipeline completes execution, this algorithm is invoked to get rid of k tuples. The working space required by this algorithm is 3k buffers.

Note that we currently use the GetNext model as a rough estimate to compare the benefit across different pipelines and pick the best one. For complex queries involving expensive operations such as UDFs, a weighted version of the GetNext model becomes necessary to account for different weights for different pipelines.
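The greedy elimination step of the Merge-Pipeline heuristic can be sketched as follows (our own rendering with hypothetical per-tuple Benefit values; the windows remain contiguous because tuples are only trimmed from their edges):

    # Merge-Pipeline sketch: trim two saved windows (k tuples each) down to a combined
    # budget of k by repeatedly dropping the edge tuple with the smallest Benefit value.
    def merge_windows(win_r, win_s, k):
        """win_r, win_s: lists of (record, benefit) pairs, each a contiguous window."""
        wins = [list(win_r), list(win_s)]
        while sum(len(w) for w in wins) > k:
            # candidate victims: the first and last tuple of each non-empty window
            edges = [(w[idx][1], i, idx) for i, w in enumerate(wins) if w
                     for idx in {0, len(w) - 1}]
            _, i, idx = min(edges)               # smallest-Benefit edge tuple
            wins[i].pop(idx)
        return wins[0], wins[1]

    r = [("r1", 5), ("r2", 80), ("r3", 70), ("r4", 9)]
    s = [("s1", 12), ("s2", 60), ("s3", 30), ("s4", 4)]
    print(merge_windows(r, s, k=4))
    # -> ([('r2', 80), ('r3', 70)], [('s2', 60), ('s3', 30)])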
Sub-tree Caching: There is, however, one special case for execution plans consisting of multiple pipelines. If for any pipeline the total number of intermediate results generated is less than the budget k, then one can not only skip the corresponding leaf but also skip evaluating all the pipelines that preceded the current pipeline. We refer to this case as sub-tree caching. We ensure that the benefit value for the set of tuples is appropriately set to the number of GetNext calls issued over the entire sub-tree.

In the following section, we present an experimental evaluation that quantifies the benefits of bounded checkpointing.

5. EXPERIMENTS

In this section, we present an experimental evaluation of bounded query checkpointing. Our prototype was built by modifying the query execution engine of Microsoft SQL Server 2005. We added the primitives required for Stop-and-Resume execution outlined in Section 3.2. The clustered index scan operator was extended to accept LB and UB values and implement the skip-scan operator as described in Section 3.2. Currently, in the prototype the saved intermediate results from the initial run are cached in memory and reused during the restart. The main goals of the experiments are as follows: to understand when using the skip-scan operator is most beneficial; to examine the utility of bounded query checkpointing for complex queries; and to compare the different algorithms presented for the case of multi-pipeline execution plans.

Database: We use both benchmark and real data sets to evaluate our prototype. We use the 1 GB TPC-H database. The original data generator does not introduce any skew in the data; we used a data generator that introduces a skewed (zipfian) distribution for each column independently in a relation. Every table had a clustered index on the primary key. For our workload, we used all the queries that reference the Lineitem table; these are typically the long running queries in the TPC-H suite. In order to understand how the prototype works on real data sets, we also present an evaluation using the personal edition of the SkyServer database [8]. This is a publicly available database of astronomy data.

Evaluation Metric: Queries that are terminated have to be rerun entirely in the absence of the techniques suggested in this paper. The main evaluation metric we use is how much of the work done by the query during the initial run is saved in the restart due to the algorithms presented in this paper. We use the GetNext model of work (see Section 2.3 for details). Let T denote the total number of GetNext calls in the execution of a query Q. Let T1 denote the number of GetNext calls invoked during the initial run before the query is stopped. Let T2 be the number of GetNext calls invoked during the restart to reach the same point in execution where the query was originally stopped. Note that this point in execution is well defined for our restart plans. We define the Percentage Work Saved as:

    Percentage Work Saved = (T1 - T2) / T * 100

A main advantage of using the GetNext model for the work done is that we can talk precisely about when a query is stopped and restarted (e.g., after 50% of T is done) and that the total number of GetNext calls for a query plan is fixed.
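As a hypothetical worked example of this metric (the numbers are ours, not measured): if a query issues T = 1,000,000 GetNext calls when run to completion, T1 = 600,000 calls are issued before it is stopped, and the restart plan needs only T2 = 150,000 calls to reach the same point, then the Percentage Work Saved is (600,000 - 150,000) / 1,000,000 * 100 = 45%; rerunning the query from scratch would instead redo all 600,000 calls' worth of work.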

We instrumented the query execution to write out the total number of GetNext calls seen thus far as well as the amount of work that can be saved if the query were terminated at that point (this is the Benefit value corresponding to the best set of intermediate results saved until that point). After the query execution is complete, we can analyze this log to figure out what would happen if the query were terminated at any particular point in the past (for the current k value).

5.1 Use of the GetNext Model

We observed that for the TPC-H queries used in our experiments the leaf nodes in the query execution plans were all clustered index scans. For such query plans, we observed that the Percentage Work Saved measured using the GetNext model was strongly correlated with the corresponding speed-up in execution time. Prior work on progress estimation indicates that the GetNext model is robust for such scan-based plans [3]. For instance, Figure 9 illustrates the difference between the Percentage Work Saved computed using execution times and using the GetNext model for the following join query (the execution times were measured on a single commodity machine):

    SELECT COUNT(*) FROM lineitem, orders
    WHERE l_orderkey = o_orderkey AND l_receiptdate > [date constant]

The query plan is a hash join with lineitem as the build side and represents a simple case of a multi-pipeline plan. The constraint k value (250) was not large enough to cache all intermediate tuples satisfying the filter on the lineitem table. Figure 9 shows the absolute error in estimating the Percentage Work Saved (using the GetNext model versus actual execution times) as a function of when the query is terminated. Note that the absolute error is uniformly low (around 1%) across different termination points of the query. We also observed similar results for different selectivity values of the predicate.

[Figure 9. GetNext Model vs. Execution Times: absolute error vs. point of termination (25%, 50%, 75%, 90%).]

We do note that for execution plans involving operators such as index seeks, a weighted version of the GetNext model would be more appropriate.

5.2 Effectiveness of the Skip-Scan Operator

In this section, we first study the skip-scan operator in the context of a simple query in order to understand where it is likely to yield the best dividends and what its limitations are. Note that the key aspect that influences the efficiency of the skip-scan operator (for a given selectivity value) is how the selected tuples are laid out with respect to the original ordering of the table (see Figure 3). In this section, we study the effectiveness of the skip-scan operator by varying these parameters. We use a simple join query between the lineitem and orders tables that has a range predicate on the l_receiptdate column of the lineitem table. We vary the selectivity from around 0.005% to 10% by suitably varying the predicate on the l_receiptdate column. We picked a k value that is large enough to save all the intermediate tuples only for the lowest selectivity values. For the above query, we used different versions of the lineitem table, one clustered on the l_orderkey attribute and another clustered on the l_shipdate attribute. We also studied the case in which the selected tuples are distributed with uniform gaps in the base relation (i.e., the predicate has no correlation with the clustering column).
Figure 10 plots the Percentage Work Saved for all three cases (clustered indexes on l_shipdate and l_orderkey, and the uniform case). We assume the query is stopped at 50%; the results are similar for other points of termination.

[Figure 10. Effect of Clustering: Percentage Work Saved vs. selectivity for the uniform, l_orderkey, and l_shipdate cases.]

The first point to note is that for the cases where the entire set of intermediate results can be saved (the lowest selectivity values), the Percentage Work Saved is maximal independent of the clustering of the tuples that pass the selection. In general, if the gaps between the selected tuples are uniformly distributed, then any set of k tuples would have the same benefit value, and moreover the benefit would decrease with increasing selectivity. In such cases, one can expect only relatively small benefits in the restart for small values of k. The clustered index on l_shipdate represents an interesting case. Even though the predicate is on the l_receiptdate attribute, the l_receiptdate is always greater than the corresponding l_shipdate: tuples with a higher l_shipdate value are guaranteed to have a higher l_receiptdate value. Thus, tuples that satisfy a range predicate on l_receiptdate are always clustered together, and the remaining portions can be skipped during the restart. This represents the best-case scenario for the skip-scan operator, as this implicit correlation ensures that the skip-scan operator remains effective even with increasing selectivity. The clustered index on l_orderkey represents an intermediate case (tuples with a higher l_orderkey value are likely to have a higher l_receiptdate, but this is not guaranteed). For this case the correlation (and as a result the benefits) are not as good as for the l_shipdate clustered index, but it is still better than the uniform case. Thus, the skip-scan operator yields the maximum benefits when either the selectivity is low or there is a strong correlation between the predicate column and the clustering column.

5.3 Evaluation on TPC-H Queries

In this section, we study the performance of our algorithms for bounded query checkpointing in the presence of complex decision support queries. One important factor is the monitoring overhead imposed by our techniques in the normal case (when the query is not terminated). For the TPC-H workload, we observed that for most of the queries the overheads were within 3% of the original query execution times. Figure 11 shows the Percentage Work Saved for the TPC-H workload, assuming the queries are terminated at the 50% point. We used a k value of 250 (if we assume a database page size of 8K and an average intermediate record size of 32 bytes, 250 tuples can fit in a single page).

[Figure 11. Percentage Work Saved for the TPC-H workload (k=250, termination at 50%); x-axis: TPC-H Query.]

The plots in the figure show that we can save a significant portion of the work done in the initial run by utilizing only a small amount of state. Among the 17 queries in the workload, we can save 40% or more of the work done in the initial run for nearly half the queries. Query 7 represents a case where we can avoid redoing all the initial work (the Percentage Work Saved is 100%); this is because the predicates are very selective and we can cache all the intermediate results generated thus far in a pipeline. An important point to note is that if the SAVE-ALL technique were restricted to the same budget, it would only be applicable for Query 7 in the workload, while the techniques proposed in this paper are more generally applicable. On the other hand, for Queries 1 and 3 we get practically no benefit. Query 3, for instance, selects most of the tuples in lineitem, so saving just 250 tuples does not help much. These represent cases where the selected tuples are distributed with uniform gaps (see Section 5.2) and, as a result, the skip-scan operator is not as effective. We observed that for most of the queries the Group-By operator was applied at the top of the query execution plan and at the 50% termination point the operator had not yet started execution. Thus for most of the queries, saving internal operator state would not have been effective. The only exceptions are Query 1 and Query 6, which are single pipeline queries. Query 1 selects 97% of the tuples and computes a set of aggregates. For this case, saving and reusing the partial sums corresponding to the aggregates would have resulted in saving almost all the initial work. For Query 6, even though we do not save partial sums, we still obtain a noticeable improvement. We further discuss partial operator state in Section 6.

5.4 Algorithm Choice for Multiple Pipelines

In Section 4.3, we presented three algorithms for the case of multiple pipelines (Current-Pipeline, Max-Pipeline and Merge-Pipeline). We had mentioned that it could potentially be useful to retain information about previous pipelines and that the Current-Pipeline algorithm may not be optimal. Among the 17 queries executed, we observed that for nearly half of the queries this observation held (k=250). Figure 12 illustrates how the Percentage Work Saved varies as a function of when the query is terminated for the Current-Pipeline algorithm.
For instance, Query 10, if stopped at around 60% completion, can save a sizable fraction of the work during the restart. But if the same query were stopped at 80% completion, there are practically no benefits. The key point to note is that there can be high variance in the benefits of the Current-Pipeline algorithm depending on the effectiveness of the skip-scan operator at different pipelines.

[Figure 12. Ignoring Previous Pipelines: Percentage Work Saved vs. point of termination (30%-90%) for the Current-Pipeline algorithm on several TPC-H queries.]

For Query 10 (a join of 4 relations), we also illustrate how the Max-Pipeline and the Merge-Pipeline algorithms compare to the Current-Pipeline algorithm in Figure 13. For this query, the best pipeline to skip is the one in which the source node is the Lineitem table. Both the Max-Pipeline and Merge-Pipeline algorithms are able to identify this; the Current-Pipeline algorithm, however, always goes with the current pipeline, and in this case the later pipelines (after the 70% point) are not good candidates for utilizing the skip-scan operator. In this case the savings obtained by the Max-Pipeline and the Merge-Pipeline algorithms were the same.

[Figure 13. Multi-Pipeline Algorithms: comparing Current-Pipeline, Max-Pipeline, and Merge-Pipeline (Percentage Work Saved vs. point of termination).]

In our experiments, we found that the Max-Pipeline algorithm performed as well as the Merge-Pipeline algorithm in many cases. But there were instances where utilizing the skip-scan operator at multiple pipelines made a significant difference. For instance, Figure 14 shows how the Merge-Pipeline algorithm outperforms the Max-Pipeline algorithm for a TPC-H query that involves self-joins of the Lineitem table: the Merge-Pipeline algorithm can utilize the skip functionality on multiple scans of the Lineitem table, whereas the Max-Pipeline algorithm can use the skip-scan operator on at most one scan node. Notice that the savings are equal until the first pipeline finishes execution (at around 50%). The Merge-Pipeline algorithm can effectively utilize multiple pipelines after this point in execution.

[Figure 14. Max-Pipeline vs. Merge-Pipeline: Percentage Work Saved vs. point of termination for the query with self-joins of Lineitem.]

5.5 Effect of k Value

In Figure 15, we study how the effectiveness of the algorithm changes with the number of intermediate results that are allowed to be saved. We show results for three k values (approximately 1000, 5000, and 50000) and assume the termination point is 50%. As the plots indicate, the benefits do not necessarily increase linearly with increasing k. There are a couple of reasons for this. Firstly, based on the intermediate result sizes and the k values, sub-tree caching (Section 4) can make a huge difference. For instance, for the k=50000 case, we can save 100% of the work for 5 of the queries (compared to the single query in Figure 11). Another reason is the significant non-uniformity in the amount of work done for each intermediate record in the TPC-H workload (as the plots in Figures 4 and 8 showed). Notice that if the work done in generating each intermediate result in a pipeline were uniform, then one would expect a linear return for an increasing value of k. These results (in conjunction with those in Section 5.3) show that our algorithms can be effective even when only a small amount of state is saved and reused. Of course, our algorithm yields increasing benefits when more state is allowed.

[Figure 15. Percentage Work Saved as a function of k for the TPC-H queries.]

5.6 Evaluation on SkyServer Database

In this section, we study the effectiveness of bounded checkpointing on real data, using the publicly available personal edition of the SkyServer database [8]. Figure 16 shows the Percentage Work Saved for all the long running queries from this workload (the termination point is 50%).

[Figure 16. Evaluation on the SkyServer database: Percentage Work Saved per SkyServer query (k=250).]

The results show that bounded checkpointing is useful for real workloads; for nearly half of the queries we can provide a savings of 40% or more. An important point to note is that queries 1 and 3 are single pipeline queries with no group-by or aggregations. Thus the numbers reported are optimal (see Section 3.3); we cannot do better for these queries under the constraint of bounded checkpointing.

5.7 Summary

The experimental results indicate that bounded query checkpointing can indeed be an effective alternative to terminating the query and losing all the work done in the initial run. Another important factor to consider is the overhead imposed by our proposed techniques in the normal case (if the query were not terminated). For the TPC-H workload, we compared the execution times of the queries with and without the overheads of the proposed techniques. For most of the queries, the overheads were within 3% of the query execution times.
We summarize the experimental results below: there are many cases where the skip-scan operator is useful (selective predicates, sub-tree caching, or correlation between the predicates and the clustering column).


More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 12, Part A Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments Administrivia Midterm on Thursday 10/18 CS 133: Databases Fall 2018 Lec 12 10/16 Prof. Beth Trushkowsky Assignments Lab 3 starts after fall break No problem set out this week Goals for Today Cost-based

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Overview of Implementing Relational Operators and Query Evaluation

Overview of Implementing Relational Operators and Query Evaluation Overview of Implementing Relational Operators and Query Evaluation Chapter 12 Motivation: Evaluating Queries The same query can be evaluated in different ways. The evaluation strategy (plan) can make orders

More information

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Processing. Introduction to Databases CompSci 316 Fall 2017 Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 Implementation of Relational Operations CS 186, Fall 2002, Lecture 19 R&G - Chapter 12 First comes thought; then organization of that thought, into ideas and plans; then transformation of those plans into

More information

Midterm Review. March 27, 2017

Midterm Review. March 27, 2017 Midterm Review March 27, 2017 1 Overview Relational Algebra & Query Evaluation Relational Algebra Rewrites Index Design / Selection Physical Layouts 2 Relational Algebra & Query Evaluation 3 Relational

More information

Diagnosing Estimation Errors in Page Counts Using Execution Feedback

Diagnosing Estimation Errors in Page Counts Using Execution Feedback Diagnosing Estimation Errors in Page Counts Using Execution Feedback Surajit Chaudhuri, Vivek Narasayya, Ravishankar Ramamurthy Microsoft Research One Microsoft Way, Redmond WA USA. surajitc@microsoft.com

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

PS2 out today. Lab 2 out today. Lab 1 due today - how was it? 6.830 Lecture 7 9/25/2017 PS2 out today. Lab 2 out today. Lab 1 due today - how was it? Project Teams Due Wednesday Those of you who don't have groups -- send us email, or hand in a sheet with just your

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part VI Lecture 17, March 24, 2015 Mohammad Hammoud Today Last Two Sessions: DBMS Internals- Part V External Sorting How to Start a Company in Five (maybe

More information

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing Outline introduction selection projection join set & aggregate operations Prakash 2018 VT CS 4604 2

More information

CS122 Lecture 15 Winter Term,

CS122 Lecture 15 Winter Term, CS122 Lecture 15 Winter Term, 2014-2015 2 Index Op)miza)ons So far, only discussed implementing relational algebra operations to directly access heap Biles Indexes present an alternate access path for

More information

Dtb Database Systems. Announcement

Dtb Database Systems. Announcement Dtb Database Systems ( 資料庫系統 ) December 10, 2008 Lecture #11 1 Announcement Assignment #5 will be out on the course webpage today. 2 1 External Sorting Chapter 13 3 Why learn sorting again? O (n*n): bubble,

More information

Columnstore and B+ tree. Are Hybrid Physical. Designs Important?

Columnstore and B+ tree. Are Hybrid Physical. Designs Important? Columnstore and B+ tree Are Hybrid Physical Designs Important? 1 B+ tree 2 C O L B+ tree 3 B+ tree & Columnstore on same table = Hybrid design 4? C O L C O L B+ tree B+ tree ? C O L C O L B+ tree B+ tree

More information

External Sorting Implementing Relational Operators

External Sorting Implementing Relational Operators External Sorting Implementing Relational Operators 1 Readings [RG] Ch. 13 (sorting) 2 Where we are Working our way up from hardware Disks File abstraction that supports insert/delete/scan Indexing for

More information

CompSci 516 Data Intensive Computing Systems

CompSci 516 Data Intensive Computing Systems CompSci 516 Data Intensive Computing Systems Lecture 9 Join Algorithms and Query Optimizations Instructor: Sudeepa Roy CompSci 516: Data Intensive Computing Systems 1 Announcements Takeaway from Homework

More information

Triangle SQL Server User Group Adaptive Query Processing with Azure SQL DB and SQL Server 2017

Triangle SQL Server User Group Adaptive Query Processing with Azure SQL DB and SQL Server 2017 Triangle SQL Server User Group Adaptive Query Processing with Azure SQL DB and SQL Server 2017 Joe Sack, Principal Program Manager, Microsoft Joe.Sack@Microsoft.com Adaptability Adapt based on customer

More information

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415 Faloutsos 1 introduction selection projection

More information

University of Waterloo Midterm Examination Solution

University of Waterloo Midterm Examination Solution University of Waterloo Midterm Examination Solution Winter, 2011 1. (6 total marks) The diagram below shows an extensible hash table with four hash buckets. Each number x in the buckets represents an entry

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Overview of Query Evaluation. Overview of Query Evaluation

Overview of Query Evaluation. Overview of Query Evaluation Overview of Query Evaluation Chapter 12 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Overview of Query Evaluation v Plan: Tree of R.A. ops, with choice of alg for each op. Each operator

More information

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1 CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query Sub-System Queries Select * From Blah B Where B.blah = blah Query Parser Query Optimizer Plan Generator Plan Cost

More information

Parser: SQL parse tree

Parser: SQL parse tree Jinze Liu Parser: SQL parse tree Good old lex & yacc Detect and reject syntax errors Validator: parse tree logical plan Detect and reject semantic errors Nonexistent tables/views/columns? Insufficient

More information

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing:

6.830 Lecture 8 10/2/2017. Lab 2 -- Due Weds. Project meeting sign ups. Recap column stores & paper. Join Processing: Lab 2 -- Due Weds. Project meeting sign ups 6.830 Lecture 8 10/2/2017 Recap column stores & paper Join Processing: S = {S} tuples, S pages R = {R} tuples, R pages R < S M pages of memory Types of joins

More information

Introduction Alternative ways of evaluating a given query using

Introduction Alternative ways of evaluating a given query using Query Optimization Introduction Catalog Information for Cost Estimation Estimation of Statistics Transformation of Relational Expressions Dynamic Programming for Choosing Evaluation Plans Introduction

More information

Optimization of Queries with User-Defined Predicates

Optimization of Queries with User-Defined Predicates Optimization of Queries with User-Defined Predicates SURAJIT CHAUDHURI Microsoft Research and KYUSEOK SHIM Bell Laboratories Relational databases provide the ability to store user-defined functions and

More information

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu Elke A. Rundensteiner Department of Computer Science, Worcester Polytechnic Institute Worcester, MA 01609-2280 (binliu rundenst)@cs.wpi.edu

More information

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch

Advanced Databases. Lecture 4 - Query Optimization. Masood Niazi Torshiz Islamic Azad university- Mashhad Branch Advanced Databases Lecture 4 - Query Optimization Masood Niazi Torshiz Islamic Azad university- Mashhad Branch www.mniazi.ir Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

CS330. Query Processing

CS330. Query Processing CS330 Query Processing 1 Overview of Query Evaluation Plan: Tree of R.A. ops, with choice of alg for each op. Each operator typically implemented using a `pull interface: when an operator is `pulled for

More information

GSLPI: a Cost-based Query Progress Indicator

GSLPI: a Cost-based Query Progress Indicator GSLPI: a Cost-based Query Progress Indicator Jiexing Li #1, Rimma V. Nehme 2, Jeffrey Naughton #3 # Computer Sciences, University of Wisconsin Madison 1 jxli@cs.wisc.edu 3 naughton@cs.wisc.edu Microsoft

More information

EECS 647: Introduction to Database Systems

EECS 647: Introduction to Database Systems EECS 647: Introduction to Database Systems Instructor: Luke Huan Spring 2009 External Sorting Today s Topic Implementing the join operation 4/8/2009 Luke Huan Univ. of Kansas 2 Review DBMS Architecture

More information

Chapter 11: Query Optimization

Chapter 11: Query Optimization Chapter 11: Query Optimization Chapter 11: Query Optimization Introduction Transformation of Relational Expressions Statistical Information for Cost Estimation Cost-based optimization Dynamic Programming

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1) Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two

More information

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems Query Processing: A Systems View CPS 216 Advanced Database Systems Announcements (March 1) 2 Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Module 4: Tree-Structured Indexing

Module 4: Tree-Structured Indexing Module 4: Tree-Structured Indexing Module Outline 4.1 B + trees 4.2 Structure of B + trees 4.3 Operations on B + trees 4.4 Extensions 4.5 Generalized Access Path 4.6 ORACLE Clusters Web Forms Transaction

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012

Outline. Database Management and Tuning. Outline. Join Strategies Running Example. Index Tuning. Johann Gamper. Unit 6 April 12, 2012 Outline Database Management and Tuning Johann Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Unit 6 April 12, 2012 1 Acknowledgements: The slides are provided by Nikolaus Augsten

More information

Index Merging. Abstract. 1. Introduction

Index Merging. Abstract. 1. Introduction Index Merging Surajit Chaudhuri and Vivek Narasayya 1 Microsoft Research, Redmond Email: {surajitc, viveknar}@microsoft.com http://research.microsoft.com/~{surajitc, viveknar} Abstract Indexes play a vital

More information

RELATIONAL OPERATORS #1

RELATIONAL OPERATORS #1 RELATIONAL OPERATORS #1 CS 564- Spring 2018 ACKs: Jeff Naughton, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? Algorithms for relational operators: select project 2 ARCHITECTURE OF A DBMS query

More information

CSE 190D Spring 2017 Final Exam

CSE 190D Spring 2017 Final Exam CSE 190D Spring 2017 Final Exam Full Name : Student ID : Major : INSTRUCTIONS 1. You have up to 2 hours and 59 minutes to complete this exam. 2. You can have up to one letter/a4-sized sheet of notes, formulae,

More information

Beyond EXPLAIN. Query Optimization From Theory To Code. Yuto Hayamizu Ryoji Kawamichi. 2016/5/20 PGCon Ottawa

Beyond EXPLAIN. Query Optimization From Theory To Code. Yuto Hayamizu Ryoji Kawamichi. 2016/5/20 PGCon Ottawa Beyond EXPLAIN Query Optimization From Theory To Code Yuto Hayamizu Ryoji Kawamichi 2016/5/20 PGCon 2016 @ Ottawa Historically Before Relational Querying was physical Need to understand physical organization

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution Announcements (March 1) 2 Query Processing: A Systems View CPS 216 Advanced Database Systems Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System

Review. Relational Query Optimization. Query Optimization Overview (cont) Query Optimization Overview. Cost-based Query Sub-System Review Relational Query Optimization R & G Chapter 12/15 Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops: simple, exploits extra memory

More information

Next-Generation Parallel Query

Next-Generation Parallel Query Next-Generation Parallel Query Robert Haas & Rafia Sabih 2013 EDB All rights reserved. 1 Overview v10 Improvements TPC-H Results TPC-H Analysis Thoughts for the Future 2017 EDB All rights reserved. 2 Parallel

More information

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages Overview of Query Processing Query Parser Query Processor Evaluation of Relational Operations Query Rewriter Query Optimizer Query Executor Yanlei Diao UMass Amherst Lock Manager Access Methods (Buffer

More information

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques 376a. Database Design Dept. of Computer Science Vassar College http://www.cs.vassar.edu/~cs376 Class 16 Query optimization What happens Database is given a query Query is scanned - scanner creates a list

More information

Intro to DB CHAPTER 12 INDEXING & HASHING

Intro to DB CHAPTER 12 INDEXING & HASHING Intro to DB CHAPTER 12 INDEXING & HASHING Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing

More information

Overview of Query Evaluation

Overview of Query Evaluation Overview of Query Evaluation Chapter 12 Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Overview of Query Evaluation Plan: Tree of R.A. ops, with choice of alg for each op. Each operator

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi Evaluation of Relational Operations: Other Techniques Chapter 14 Sayyed Nezhadi Schema for Examples Sailors (sid: integer, sname: string, rating: integer, age: real) Reserves (sid: integer, bid: integer,

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

CMPS 181, Database Systems II, Final Exam, Spring 2016 Instructor: Shel Finkelstein. Student ID: UCSC

CMPS 181, Database Systems II, Final Exam, Spring 2016 Instructor: Shel Finkelstein. Student ID: UCSC CMPS 181, Database Systems II, Final Exam, Spring 2016 Instructor: Shel Finkelstein Student Name: Student ID: UCSC Email: Final Points: Part Max Points Points I 15 II 29 III 31 IV 19 V 16 Total 110 Closed

More information

SQL QUERY EVALUATION. CS121: Relational Databases Fall 2017 Lecture 12

SQL QUERY EVALUATION. CS121: Relational Databases Fall 2017 Lecture 12 SQL QUERY EVALUATION CS121: Relational Databases Fall 2017 Lecture 12 Query Evaluation 2 Last time: Began looking at database implementation details How data is stored and accessed by the database Using

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Database Management System

Database Management System Database Management System Lecture Join * Some materials adapted from R. Ramakrishnan, J. Gehrke and Shawn Bowers Today s Agenda Join Algorithm Database Management System Join Algorithms Database Management

More information

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS. Chapter 18 Strategies for Query Processing We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS. 1 1. Translating SQL Queries into Relational Algebra and Other Operators - SQL is

More information

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing" Database System Concepts, 6 th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Chapter 11: Indexing and Hashing" Basic Concepts!

More information

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky

Administriva. CS 133: Databases. General Themes. Goals for Today. Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky Administriva Lab 2 Final version due next Wednesday CS 133: Databases Fall 2018 Lec 11 10/11 Query Evaluation Prof. Beth Trushkowsky Problem sets PSet 5 due today No PSet out this week optional practice

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation.

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation. Query Processing QUERY PROCESSING refers to the range of activities involved in extracting data from a database. The activities include translation of queries in high-level database languages into expressions

More information

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 9: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11 DATABASE PERFORMANCE AND INDEXES CS121: Relational Databases Fall 2017 Lecture 11 Database Performance 2 Many situations where query performance needs to be improved e.g. as data size grows, query performance

More information