Query Optimization to Meet Performance Guarantees for Wide Area Applications


Vladimir Zadorozhny    Louiqa Raschid
University of Maryland, College Park, MD 20742

Abstract

Recent technology advances have enabled mediated query processing with Internet accessible WebSources. A characteristic of WebSources is that their access costs exhibit transient behavior. These costs depend on the network and server workloads, which are often affected by the time of day, the day of the week, etc. Given a distribution of access costs, a potential optimization strategy is based on the least expected cost (LEC). An LEC optimizer chooses WebSource(s) for remote relations, and query execution plan(s), that minimize the expected cost of the plan. A drawback of LEC optimization is that it is not a good predictor of the actual performance of a query. In a noisy environment, the actual latency (response time) of a query will occur within some range of values. A performance guarantee (PG) for a noisy environment will correspond to "at least X percent of the queries will have a latency of less than T units of time". In this paper, we propose an optimizer strategy that is sensitive to the objective of meeting performance guarantees (PG). For each query plan, a PG sensitive optimizer uses both the expected value of the cost distribution of the plan and the expected delay of the plan. The delay is the deviation above the expected value. The plan is then characterized by an Expected Cost Delay (ECD) utility measure based on the expected value and expected delay of the cost distribution. The optimizer can choose different strategies to match a desired performance guarantee. The optimizer behavior can be optimistic, where it ignores the expected delay, or it can be conservative, where it emphasizes the expected delay. This trade-off of optimizer behavior is controlled by a cost factor and a delay factor. We examine various characteristics that impact the behavior of the PG optimizer. We validate our strategy using a simulation based study of the optimizer's aggregate behavior, for a set of queries and a set of remote relations on WebSources. We also experimentally validate the optimizer using traces of access costs for real WebSources.

1 Introduction

Recent technology advances have enabled mediated query processing in the wide area environment with remote relations (objects) that are resident on Internet accessible WebSources. There are many characteristics of mediated query processing in this environment that pose substantial challenges. One characteristic is that typically there are several alternate WebSources for each remote relation. This is due to architectures of mirror sites or replicas, proxy sites with cached copies, or portals that gather information from several different sites. Thus, a mediator subquery (on a relation) can be submitted to multiple candidate WebSources. In order to choose between alternate WebSources, an optimizer has to compare their access costs. A second characteristic of Internet accessible WebSources, however, is that their access costs exhibit transient behavior. The costs may be described as spatio-temporal, since the response times depend on the client, the remote server, and the network and server workloads.

(This research has been partially supported by the Defense Advanced Research Projects Agency under grant , and the National Science Foundation under grant IRI .)

These workloads are often affected by the time of day, the day of the week, etc. They are typically represented as cost distributions. The transient behavior of WebSources makes it difficult to directly compare the cost of plan execution on (particular combinations of) WebSources. LEC (least expected cost) is a proposed approach [7] to compare the expected values of cost distributions for execution plans. However, the expected value reflects aggregate behavior, and is not a good reflection of the actual response time, i.e., the time to execute the query and download results from the source. A noisy source with high variance in its response time will behave differently from a stable source that is less noisy, while both sources may have the same, or similar, expected values of the response time. This difference in actual performance is independent of the expected values of their access cost distributions.

In this paper we consider the relationship between the choice of sources and a performance guarantee. A performance guarantee (PG) will correspond to "at least X percent of the queries will have a latency of less than T units of time". Consider the following example of two queries (or their execution plans), on sources S1 and S2, respectively. The sources have different cost distributions. Figure 1 represents the response times for these two queries. The graphs show the percentage of queries and their response time, i.e., a quantile plot of response times, for hundreds of queries submitted to the two sources. Suppose we consider a performance guarantee "100% of the queries must execute in less than 4 seconds". It is clear that in the noisy Internet environment, neither of the queries (plans) can meet this performance guarantee. Figure 1 also identifies the risk of choosing either source; intuitively this corresponds to those queries whose response time exceeds 4 seconds. As can be seen in Figure 1, the risk associated with source S1 appears to be greater than the risk associated with S2. Conversely, we associate each of these choices with a value of benefit, i.e., queries with response time less than 4 seconds. The benefit of S1 also appears to be greater than that of S2.

Figure 1: Risk and benefit of Sources S1 and S2 for a performance guarantee of 4 seconds

To summarize, given several noisy WebSources, each characterized by a cost distribution, an optimizer has to select a good combination of sources, and generate a good query execution plan for this choice of sources. A performance guarantee (PG) sensitive optimizer will have to choose among the sources and execution plans, to better meet a desired performance guarantee.
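To make the notion of a performance guarantee concrete, the sketch below checks whether a set of observed response times satisfies a guarantee of the form "at least X percent of the queries finish in less than T units of time". It is only an illustration: the function name and the sample latencies are hypothetical and are not taken from the paper's experiments.

```python
# Minimal sketch: testing a performance guarantee (PG) against observed
# latencies. The sample data below is invented for illustration.

def meets_guarantee(latencies, t_limit, min_fraction):
    """True if at least min_fraction of the latencies are below t_limit."""
    within = sum(1 for x in latencies if x < t_limit)
    return within / len(latencies) >= min_fraction

# Hypothetical response-time samples (seconds) for two candidate sources.
s1_latencies = [1.2, 1.5, 1.8, 2.2, 2.5, 3.1, 3.8, 4.4, 5.2, 9.5]
s2_latencies = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.3, 3.6, 3.9, 4.7]

# PG: "at least 90% of the queries must execute in less than 4 seconds".
for name, lat in (("S1", s1_latencies), ("S2", s2_latencies)):
    print(name, meets_guarantee(lat, t_limit=4.0, min_fraction=0.9))
```

Under these made-up samples, S1 misses the guarantee (two of ten queries exceed 4 seconds) while S2 meets it; this is the kind of comparison that the quantile plots of Figure 1 support visually.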

There has been extensive prior research on query optimization and evaluation that is relevant to our task. A recent issue of the IEEE Data Engineering Bulletin highlights several projects. Reactive query evaluation techniques at the plan level are described in [1, 4, 21, 26]. Alternately, adaptive evaluation techniques at the plan and operator implementation level are described in [3, 13, 14, 16, 25]. Research in [2, 7, 19, 15, 17, 32] has addressed various aspects of the task of modifying an optimizer to handle distributions for various parameters, e.g., available memory, or intermediate join cardinality. The Tukwila project [16] provides an adaptive framework composed of rules, and adaptive operators that are sensitive to transient factors. Its optimizer also has a re-optimization capability based on pipelined fragments of query execution. The Telegraph project explores adaptive fine-grained data flow based query processing techniques using rivers, eddies, ripple join, XJoin, etc. [3, 13, 25]. They also focus on continuous and high frequency response to changing feedback. Query scrambling [1, 26] is a query optimization and evaluation technique to deal with transient conditions, e.g., unexpected delay. It is based on plan re-organization to avoid idle time, and the synthesis of new operators to execute in the absence of other work. It makes a key contribution of a response time based query optimizer to deal with delay, instead of the more traditional cost based approach. We draw upon this concept when we introduce the expected delay of a cost distribution. Finally, several key contributions were drawn from the least expected cost (LEC) optimizer of [7], and this is described elsewhere in the paper. In summary, while prior research has addressed distributions and transient conditions, it has not addressed our specific task of choosing among noisy sources, described by cost distributions, in order to develop an optimizer that is sensitive to performance guarantees in a noisy environment.

Our approach to PG sensitive optimization is to consider a cost distribution for each query plan. We characterize the plan with the expected cost of the plan, as well as the expected delay. The expected delay represents the deviation in excess of the expected cost. We combine the expected cost and the expected delay using a cost factor and a delay factor, to obtain a combined utility measure, the Expected Cost Delay (ECD). The optimizer will choose a plan to maximize utility, i.e., it will minimize the ECD value. The optimizer can choose different strategies to meet a desired performance guarantee. We control the behavior of the optimizer by varying the values of the cost factor and delay factor. This will also vary the value of the ECD utility for the candidate plans. With higher values for the delay factor, the optimizer is more conservative, i.e., it tries to minimize the risk of queries that do not meet the performance guarantee, while choosing sources and plans. With lower values for the delay factor, the optimizer is more optimistic, i.e., it tries to maximize the benefit of queries that do meet the performance guarantee, while choosing sources and plans. We demonstrate how we manipulate the cost factor and delay factor, and use the value of ECD to choose plans following a more conservative or optimistic strategy, to better meet a performance guarantee. The first contribution of our research is this approach to PG sensitive optimization in a noisy environment.
An important issue for an optimizer when dealing with cost distributions is generating good candidate plans, so that the optimizer choice is close to optimal [7]. Our search space is large since we must consider alternate sources, or combinations of sources. Our second contribution is a two phase technique to generate good candidate plans. In the first phase, we use the concept of a pre-plan [27, 3] to represent a choice of sources and their cost distributions. We use a heuristic to determine the (approximate) cost of a pre-plan and we choose pre-plans to reflect an optimistic or conservative optimizer strategy. In the second phase, we use an extended relational optimizer to generate good plans for the pre-plans.

Figure 2: Mediator Wrapper Architecture for WebSources

Using a simulation based study, we verify that the PG sensitive optimizer's behavior is predictable and scalable with respect to the following: alternate sources with varying values of expected cost and expected delay; a mix of queries of varying complexity; a large search space of pre-plans and plans; and access cost distributions obtained from real WebSources.

This paper is organized as follows: In section 2, we briefly review the mediator architecture and the components of the optimizer. In section 3, we use examples to illustrate sources that meet or do not meet performance guarantees. In section 4, we describe the implementation of the PG sensitive optimizer. We define the Expected Cost Delay (ECD) utility measure, and describe the two phase approach to generating pre-plans and plans. We then describe our simulation based study of the PG optimizer behavior. We present the optimizer's aggregate behavior, for a set of queries and a set of remote relations on WebSources, in section 5. We also experimentally validate the optimizer using traces of access costs for real WebSources. In section 6, we study the optimizer strategy and the ECD trade-off. We illustrate how the optimizer can use an optimistic or conservative strategy to choose sources and plans that best meet some performance guarantee.

2 Architecture

Wrapper mediator architectures have been proposed for interoperability of heterogeneous information sources [1, 19, 2, 22, 29]. Figure 2 presents the components of our architecture tailored for Internet accessible WebSources. A WebSource is accessible via the HTTP protocol; a forms-based interface provides a limited query capability, and returns answers in XML or HTML. The mediator is an extension of the Predator ORDBMS [23]; our mediator uses the relational data model. The mediator has the task of decomposing a mediator query into subqueries; identifying relevant WebSources that can answer a subquery; and providing query optimization and evaluation functionalities.

Wrappers reflect the limited query capability of WebSources and handle mismatch between the mediator and the WebSource. We developed a novel technique for mediator query optimization using a pre-optimizer; the pre-optimizer produces a pre-plan for a mediator query. For each remote relation in the mediator query, the pre-plan identifies a relevant WebSource, its query processing capabilities, and its access cost distribution. A pre-plan is a higher level abstraction that circumscribes and describes a space of query execution plans, or plans. The cost distributions of the WebSources characterize the cost of each pre-plan, as well as the cost of the (physical) plan for evaluating the mediator query. This is discussed in Section 4.

A major challenge for our optimizer is the unpredictable behavior of WebSources, dictated by network and server workloads. These workloads may be affected by the time of day, the day of the week, etc. We developed a tool, the Web Prediction Tool (WebPT), to predict access cost distributions for WebSources [5, 12]. The WebPT is designed for a dynamic environment such as the Internet, which must be continually monitored. The WebPT is based on an online learning and prediction technique. The WebPT measures and predicts query feedback, i.e., the end-to-end latency to access a WebSource and to download data. It uses a number of parameters, including the time of day and the day of the week, to predict latency. We used the WebPT to develop a cost model and a Catalog of access cost distributions for WebSources [31]. For the study, data was gathered from several WebSources, e.g., the USGS Geographic Names Information System (GNIS) [24], Landings Aviation Search Engines (Aircraft) [9] and the EPA Toxic Releases Inventory Database (EPA) [8]. The sources represent a diversity of real sources in a variety of application domains. Using experimental data of query feedback from WebSources, we were able to characterize sources as having High or Low Prediction Accuracy, with respect to the ability of the Catalog and cost model to predict access costs. We also identified the following WebSource characteristics that are correlated with High or Low prediction accuracy:

- The significance of the time of day, the day of the week, etc., on predicting access costs.
- The level of noise and the corresponding variability of the access cost distribution.
- The variability of result cardinality and result size, for different queries submitted to the WebSource.

To explain, in some WebSources, e.g., EPA and the ACM Digital Library [18], the time of day and the day of the week were good predictors of network and source workloads. Alternately, access costs may be affected by the result cardinality; in ACM, for example, the data resides in a database, and a page of answers is dynamically assembled. The level of noise also varied; e.g., EPA had a low level of noise, whereas Aircraft was a noisy source. Our study shows that the significance of time and day on predicting access costs could be exploited to improve the accuracy of prediction, even in cases where the sources are noisy. For example, GNIS and ACM are noisy sources, but the cost model nevertheless developed an accurate cost distribution. Figure 3 plots the actual response times when submitting a query 5 times to a source with high prediction accuracy, EPA, and to a noisier source with low prediction accuracy, Aircraft. These cost distributions are used in our later experiments.
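The WebPT and the derived cost model are described in [5, 12, 31]; the sketch below is only a rough illustration of the kind of catalog lookup such a cost model can support, keyed on the time of day and the day of the week. The class, its methods, and the sample feedback values are assumptions made for illustration, not the WebPT's actual design.

```python
# Illustrative sketch only: an access cost catalog keyed by (source, day of
# week, hour of day). The real WebPT predictor and Catalog are more elaborate.
from collections import defaultdict

class CostCatalog:
    def __init__(self):
        # (source, day, hour) -> list of observed end-to-end latencies (seconds)
        self.samples = defaultdict(list)

    def record(self, source, day, hour, latency):
        self.samples[(source, day, hour)].append(latency)

    def distribution(self, source, day, hour):
        """Empirical cost distribution as (value, probability) pairs."""
        obs = self.samples[(source, day, hour)]
        prob = 1.0 / len(obs) if obs else 0.0
        return [(v, prob) for v in obs]

catalog = CostCatalog()
for latency in (8.0, 9.5, 30.0, 11.0):      # hypothetical query feedback
    catalog.record("EPA", "Mon", 10, latency)
print(catalog.distribution("EPA", "Mon", 10))
```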

Figure 3: Response times for (a) EPA and (b) Aircraft WebSources

3 Motivating Example

Consider two alternate sources, S1 and S2, for a remote mediator relation. The access cost (response time) distributions for these sources are represented by normal (Gaussian) distributions (used here only for purposes of explanation), with different values for the expected value and the variance. The distributions are in Figure 4a. One choice for the optimizer is to choose a source with the least expected value, i.e., source S1. However, the performance of S1, i.e., the actual response time of S1, is not always better than the performance of S2. Consider Figure 4b, which represents a quantile plot of the response times for the two sources. To construct the quantile plots, we used the distributions of Figure 4a to generate the actual response time for some statistically significant (large) number of queries submitted to these sources; the quantile plot represents the percentage of queries and their actual response time. We see that there is a crossover of the quantile plots of sources S1 and S2. If we consider the 50-th percentile, source S1 provides a performance guarantee that the response time of 50% of the queries will not exceed 4 seconds. However, if we consider the 90-th percentile, source S2 provides a performance guarantee that the response time of 90% of the queries will not exceed 6 seconds, whereas source S1 can only guarantee that the response time of 90% of the queries will not exceed 8 seconds. This illustrates that in noisy environments, the expected value alone is insufficient to compare sources and their performance. The optimizer needs a more flexible strategy that would allow it to consider both the risk and benefit of its choice. Figure 5 illustrates the risk and benefit of choosing S2 instead of S1. Recall that S1 had the least expected value. Intuitively, the benefit corresponds to those queries where S2 has lower values of response time, compared to S1, while the risk corresponds to those queries where S2 has higher values of response time.

A performance guarantee will correspond to "at least X percent of the queries will have a response time (latency) of less than T units of time". Different applications may favor different performance guarantees.

Figure 4: Response time distributions (a) and quantile plots (b) for sources S1 and S2

Figure 5: Risk and benefit in preferring S2 over S1

An application may only benefit from some information within a certain time period, e.g., less than 3 seconds. Source S1 has a higher percentile of queries that meet this guarantee, and it should be the optimizer's choice. In order to make this choice, the optimizer must ignore the risk of each source of not meeting this guarantee, and emphasize the benefit of each source of meeting this guarantee. This is defined to be an optimistic optimizer strategy. A different application may have a restriction that exceeding a certain deadline, e.g., exceeding 8 seconds, will be very expensive. Source S2 has a lower percentile of queries that do not meet this guarantee, and it should be the optimizer's choice. In order to choose S2, the optimizer must emphasize the risk of each source of not meeting the performance guarantee. This is defined to be a conservative strategy.

Our objective is to develop an optimizer strategy that is sensitive to meeting certain performance guarantees in a noisy environment. Instead of choosing a source, or an execution plan for a query, that has the least expected cost, the optimizer must use a strategy that best matches the given performance guarantee. In doing so, the optimizer emphasizes or ignores the risk and benefit of each source.

4 A Performance Guarantee Sensitive Optimizer

The LEC approach to optimization [7] maximizes utility in choosing a query execution plan (or plan) for a query. Utility is associated with the expected cost of a plan. Our performance guarantee (PG) sensitive optimizer associates a utility measure that considers both the expected cost and the expected delay of a plan. We briefly review Least Expected Cost (LEC) based optimization. We introduce the concept of Expected Delay for a cost distribution and use it to express the PG sensitive risk and benefit. We define the Expected Cost Delay (ECD) utility measure for a query plan. Then we describe the implementation of the PG sensitive optimizer. We describe a pre-plan; it identifies a set of WebSources and their querying capabilities, for the remote relations of the mediator query. The pre-plan characterizes a space of (multiple) plans for a mediator query [27]. The PG sensitive optimizer has two phases. In the first phase, a remote cost pre-optimizer generates candidate pre-plans for a query. These candidate pre-plans are generated to reflect more optimistic or more conservative characteristics. In the second phase, a randomized relational optimizer uses the pre-plan to generate plans.

4.1 LEC Optimization

Accurate cost estimation for a plan is very difficult since it requires accurate a priori knowledge of costs and details of the run-time environment. The cost of a plan is typically determined using a cost formula and the values for some vector of parameters V that affect the cost of the plan. While the cost of a plan depends on a wide number of parameters, in this paper we only consider the impact of remote access costs (latency) associated with mediator relations resident on WebSources. The traditional approach is to first obtain some specific fixed value(s) for the set of parameters; typically this is the expected value. Then, the optimizer uses the specific value(s), and finds a plan p which has the least cost. This is the least specific cost method of obtaining a plan. The drawback of this approach is that when the fixed value is different from the actual value, and when there are discontinuities in the cost formula, this approach does not produce the best plan.

Research in [7] has proposed an alternative, least expected cost (LEC), approach. They consider a probability distribution of values for the vector of parameters V. It is represented by discrete (or continuous) values for each parameter in V, together with a probability value. They also consider a set of candidate plans. Now, for each candidate plan p, they determine the expected cost of this plan, given the probability distribution of values for V. Suppose V is represented by k discrete values, where each value V_k has probability of occurrence Prob(V_k). Then the expected cost of plan p, or EC(p), can be expressed as EC(p) = Σ_k Cost-plan-p(V_k) · Prob(V_k). They compare the expected cost of each plan, and choose the candidate plan with the least expected cost (LEC). The LEC plan maximizes the expected utility, where the utility is the negation of the cost. By maximizing utility, they minimize the expected cost, given some distribution of values for V.

4.2 Expected Delay

In the previous section we illustrated that the cost distribution for a source determines its performance. Further, we showed that the expected value of the cost distribution, the Expected Cost (EC), reflects its aggregate behavior. It is insufficient to determine its actual performance. A source may have high values for response time because the distribution has a high expected value. Alternately, the expected value may be small, but access to the source may often be very noisy, leading to high response times. We introduce the concept of the Expected Delay (ED) for a distribution. Together, EC and ED more accurately reflect the actual performance.

Consider a specific mediator client accessing a remote WebSource S. Assume that we characterize the access cost distribution of S, distS, as k discrete values, where each value val_k has a probability of occurrence prob_k. Then the expected value [7] of this distribution is EC(distS) = Σ_k val_k · prob_k. For any source that is characterized by a distribution, one must also consider the expected deviation from the expected value EC(distS) of the distribution. To determine this deviation, we compute the excess of each of the k particular values val_k of the distribution, compared to the expected value EC(distS); this is excess_k. Figure 6 illustrates a normal (Gaussian) distribution with expected value µ; the deviation excess_k for some value val_k is also illustrated. When the excess is a negative value, we set excess_k to 0. Since we only consider the positive excess, i.e., when the value val_k is greater than EC(distS), this deviation represents a delay, in comparison to the aggregate behavior characterized by the expected value EC(distS). Thus, excess_k can be considered to be some delay_k. The probability of this value of delay is prob_k. Then the expected delay of the distribution distS is ED(distS) = Σ_k delay_k · prob_k.

We refer to the access cost distributions of Figure 7a. Based on the expected value of these distributions, the source ordering is [S1 < S2 < S3], where S1 has the least expected value. However, if we examine the distributions, we see that source S1 is a noisy source, compared to source S2, i.e., the variance of S2 is lower than the variance of S1. Figure 7b indicates both the expected value and expected delay for the sources. Based on the expected delay, the source ordering is [S2 < S3 < S1].
To summarize, a (possibly) noisy source is more accurately characterized by both the expected value and the expected delay, based on its access cost distribution. The first value reflects the aggregate behavior for queries, and the second value reflects the deviation from the aggregate behavior. Since we only considered the positive excess, or the positive deviation, this deviation is a delay.
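The two quantities just defined translate directly into code. The following is a minimal sketch of that transcription (the function names and the two example distributions are mine, chosen only to show that a noisy source and a stable source can have similar EC but very different ED):

```python
# Expected cost (EC) and expected delay (ED) of a discrete cost distribution,
# as defined in Section 4.2. The example distributions are invented.

def expected_cost(dist):
    """dist is a list of (value, probability) pairs; EC = sum of value * prob."""
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    """ED = sum of max(value - EC, 0) * prob: only the positive excess counts."""
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

noisy  = [(10, 0.5), (20, 0.3), (90, 0.2)]   # hypothetical noisy source
stable = [(25, 0.5), (30, 0.5)]              # hypothetical stable source
for name, dist in (("noisy", noisy), ("stable", stable)):
    print(name, expected_cost(dist), expected_delay(dist))
```

With these made-up numbers the two sources have comparable expected costs (29 vs. 27.5), but the noisy source has a much larger expected delay (12.2 vs. 1.25), which is exactly the distinction the ED measure is meant to capture.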

Figure 6: Explanation of Expected Delay

Figure 7: Expected Cost versus Expected Delay: (a) multiple distributions; (b) expected cost vs. expected delay

Finally, we express the PG sensitive risk (benefit) of a cost distribution. Assume that the PG corresponds to response times of less than T units of time. Then the risk is Risk(distS, T) = Σ_k risk_k · prob_k, where risk_k is the positive excess of val_k compared to T. The expression for Benefit(distS, T) is similar. In section 6, we discuss how Risk(distS, T) or Benefit(distS, T) can be used to meet a specific performance guarantee.

4.3 Expected Cost Delay (ECD) Measure

As discussed in section 3, a cost distribution is described by both an expected value and the expected deviation from this value. The positive excess of this deviation can be considered to represent the expected delay. Consider the distribution of k values for the vector of parameters V. For each of the k discrete values V_k, with probability Prob(V_k), we can associate a value of delay DelV_k with the same probability. Then, given a plan p, the expected cost is EC(p) = Σ_k Cost-plan-p(V_k) · Prob(V_k). The expected delay for this plan, or ED(p), is expressed as ED(p) = Σ_k Delay-plan-p(DelV_k) · Prob(V_k). We will describe how Delay-plan-p(DelV_k) for a plan p is calculated from the cost formula for calculating the cost of the plan, Cost-plan-p(V_k).

We now consider the concept of Expected Cost Delay (ECD), which uses a cost factor cf and a delay factor df to combine the expected cost and expected delay of a plan into a single measure. The ECD measure for plan p is defined as follows: ECD(p) = cf · EC(p) + df · ED(p). Given a set of candidate plans, and some value(s) for cf and df, the optimizer chooses among the candidate plans for one that maximizes utility, i.e., it minimizes the ECD measure.

4.4 Calculating ECD for a Query Plan

We now describe an algorithm to calculate the values of EC(p) and ED(p) for a plan. To do so, we extend the cost formula used by the optimizer to use access cost distributions for accessing sources. We assume that the access cost is an end-to-end latency. The cost of a plan is estimated in a bottom-up manner, starting from the terminal nodes of the plan tree. A terminal node that is a scan on a mediator relation resident on a remote WebSource is associated with an access cost distribution. We use these distributions to calculate the cost distribution for each interior node of the plan tree. The cost formula combines the distributions of child nodes to obtain the distribution for the parent node.

Consider a three-way join query on relations R1, R2, and R3. Suppose that WebSources S1, S2, and S3 have been assigned to R1, R2, and R3, respectively. Figure 8 shows the plan tree and the cost distributions for each node. First, the scan on relation R1 (or R2 or R3) is assigned the cost distribution for the corresponding source. Next, we calculate the cost distribution for the join of R1 and R2. For simplicity, we assume that the join is implemented as a hash join, and we ignore the local (in memory) cost of the hash join implementation. Then, the cost of the join is the sum of the costs of the two scans on R1 and R2. We also assume that the cost distributions for S1 and S2 are independent, and we use this assumption to calculate the joint probability distribution.
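Under the assumptions just stated (a hash join whose cost is the sum of the two scan costs, with independent access cost distributions), the cost distribution of a join node is the convolution of its children's distributions. The sketch below builds that joint distribution and evaluates the ECD measure for given cf and df; it is an illustration of the calculation, not the optimizer's code, and apart from the (120 seconds, .2) and (25 seconds, .4) entries quoted in the example that follows, the input values are invented.

```python
# Sketch: joint cost distribution of a hash join under the independence
# assumption, plus the ECD utility cf*EC + df*ED. Except for the (120, 0.2)
# and (25, 0.4) entries mentioned in the text, all values below are invented.

def join_distribution(dist_a, dist_b):
    """Join cost = scan cost A + scan cost B; probabilities multiply."""
    return [(va + vb, pa * pb) for va, pa in dist_a for vb, pb in dist_b]

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

def ecd(dist, cf, df):
    return cf * expected_cost(dist) + df * expected_delay(dist)

s1 = [(120, 0.2), (70, 0.8)]   # scan cost distribution for R1 on S1 (seconds)
s2 = [(25, 0.4), (60, 0.6)]    # scan cost distribution for R2 on S2 (seconds)
joined = join_distribution(s1, s2)
print(joined)                  # includes (145, 0.08), as in the example below
print(ecd(joined, cf=0.5, df=0.5))
```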

For example, a scan cost of 120 seconds for R1 with probability .2 (from S1) and a scan cost of 25 seconds for R2 with probability .4 (from S2) will result in a join cost of 145 seconds with probability .08.

This calculation of cost distributions may result in a distribution that has values with a very low probability, or adjacent values that are very close in value. Such a fine granularity in the distribution is not very useful to the optimizer. In addition, it can significantly decrease the optimizer performance for multi-way join queries. To solve this problem, we use a special technique to merge values in the calculated distribution that do not meet some threshold. This is illustrated in Figure 8. Consider the initial cost distribution for the root node of the plan tree (a 3-way join); it is labeled D1. Values whose probability is below a probability threshold, in this case a value of .05, are merged to obtain the distribution labeled D2. Finally, adjacent values whose cost difference is less than a threshold, in this case 10 seconds, are also merged, to obtain the final distribution labeled D3. To illustrate, the first two instances of D1 have a probability of .04 (less than .05), and they are merged. The merged value is the (weighted) mean of the two initial values, and its probability is the sum of the initial probability values. Thus, distribution D2 has an instance whose value is 215 seconds and a probability of .08. Distribution D2 has several adjacent instances whose values do not meet the cost difference threshold of 10 seconds, e.g., 215 and 220 seconds, and 175 and 170 seconds. These values are merged as before to obtain the final distribution D3.

The above transformations to merge values in the distribution are especially useful when applied to cost distributions for the interior nodes. With appropriate values for the minimum probability and the minimum cost difference threshold, they allow the optimizer to avoid unnecessary computation. Using the final cost distribution D3 for the root node of the plan, we can calculate the expected cost EC(p) and the expected delay ED(p) for plan p. The expected cost EC(p) is the expected value of D3. Next, we calculate the expected delay ED(p) as described in section 3. We only consider those values of the cost distribution that are greater than the expected cost, since the delay only considers the positive excess. For example, a value of 218 seconds contributes its excess over EC(p) as a delay, with a probability of .14, towards ED(p), whereas the value of 155 seconds does not contribute to the expected delay. Finally, using the values for the cost factor cf and the delay factor df, we calculate the Expected Cost Delay measure for plan p: ECD(p) = cf · EC(p) + df · ED(p).
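The merging step can be sketched as two passes over the sorted distribution, one for the probability threshold and one for the cost-difference threshold. This is my own simplification of the technique described above (the optimizer may group low-probability values differently), and the example distribution is invented; its output would then feed the EC/ED/ECD computation sketched earlier.

```python
# Sketch of the distribution-coarsening step (D1 -> D2 -> D3): fold values
# below a probability threshold into a neighbor, then fold adjacent values
# whose cost difference is below a threshold. A simplification, not the
# optimizer's actual merge policy; the example distribution is invented.

def weighted_merge(a, b):
    (va, pa), (vb, pb) = a, b
    p = pa + pb
    return ((va * pa + vb * pb) / p, p)   # probability-weighted mean

def merge_distribution(dist, min_prob=0.05, min_diff=10.0):
    dist = sorted(dist)                    # (cost, probability), by cost
    # Pass 1: fold values below the probability threshold into their predecessor.
    coarse = []
    for item in dist:
        if coarse and item[1] < min_prob:
            coarse[-1] = weighted_merge(coarse[-1], item)
        else:
            coarse.append(item)
    # Pass 2: fold adjacent values whose cost difference is below min_diff.
    final = []
    for item in coarse:
        if final and item[0] - final[-1][0] < min_diff:
            final[-1] = weighted_merge(final[-1], item)
        else:
            final.append(item)
    return final

d1 = [(150, 0.30), (170, 0.28), (175, 0.30), (210, 0.04), (222, 0.08)]
print(merge_distribution(d1))
```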
4.5 Two Phase Approach to Generate Candidate Pre-Plans and Produce Plans that Maximize ECD Utility

Two issues that are important for LEC optimization to be practical and useful are described in [7], and these issues are relevant to the PG optimizer as well. The first issue is accurately determining the cost distributions for the parameters that affect the cost of a plan. In this paper, we only consider cost distributions for the remote access costs (end-to-end latency) of accessing mediator relations resident on WebSources.

The second issue is the extent of modification required to generate an appropriate set of candidate plans, so that the LEC optimizer is close to optimal. An LEC optimizer algorithm that generates b candidate plans, and invokes the optimizer once for each plan, is presented in [7]. They choose each of the b plans to represent some discrete value (or some range of continuous values) for the vector of parameters, and show that each candidate plan is close to optimal for that value or range of values.

Figure 8: Distributions associated with the query plan

The complexity of their (heuristic based) algorithm is shown to be much lower than the complexity of the least specific cost optimizer (say, System R style) for the same query. For the above algorithm to be close to optimal, the b values must be chosen to be representative of the distributions. Further, instead of only choosing the best plan for each of the b ranges, one may choose the top c plans for each of the b ranges. The choice of a representative set of b ranges for the cost distributions, and the choice of the top c candidate plans, becomes important when there are discontinuities in the cost formula. An example of a discontinuity is the cost formula for the nested loop join when the size of one of the relations is either larger or smaller than the amount of available memory.

We now describe our two phase approach to generate candidate plans for a PG sensitive optimizer. We first define a pre-plan and describe how we calculate its ECD. We then (informally) present our strategy to generate plans. The following hypothesis must hold for our approach to be effective: given a mediator query where N of the relations are remote, it is the choice of a particular WebSource and its access cost distribution, for each of these N relations, that has the greatest impact on the performance guarantee. Thus, for the PG sensitive optimizer, the choice of WebSource for each of the N relations is analogous to the choice of b representative values, or ranges of values, for the vector of parameters that affect the cost of the plan.

The choice of WebSource(s) and access cost distribution(s) for each of the remote relations corresponds to the information of a pre-plan. A pre-plan was defined in [27, 3]. A pre-plan associates a relevant WebSource, its query capability and access cost distribution, with each remote relation of the mediator query. A pre-plan also has information about dependencies between the query capabilities of WebSources. This dependency information is in a pre-plan dependency graph. A pre-plan characterizes a search space of query execution plans, or plans, each of which respects the pre-plan dependency graph [27]. We give an example of pre-plan(s) and we explain how we calculate the approximate ECD measure for the pre-plan. We then describe how the PG optimizer generates plans.

Generating Pre-Plans

Consider an N-way join query over five relations R1 ... R5. Suppose R2 is a remote relation and it is implemented on two (alternate) WebSources, S21 and S22, with access cost distributions D21 and D22, respectively. Further, suppose that the query capability of source S21 is such that it requires an input binding from the relation R3, whereas S22 does not have any required input bindings. Then, we obtain the following two pre-plans:

pre-plan 1: { {R1, R3, R4, R5}, {R2(S21, D21)} } ∪ { R3 → R2 }
pre-plan 2: { R1, R2(S22, D22), R3, R4, R5 } ∪ ∅

The first element of each pre-plan specifies a partitioning of R1 ... R5: two partitions for pre-plan 1 and one partition for pre-plan 2. The choice of source and cost distribution for the remote relation R2 is also specified. The second term specifies the dependencies; it is empty in pre-plan 2. To explain further, in the first pre-plan, the dependency (R3 → R2) partitions the relations R1 ... R5 into the two groups {R1, R3, R4, R5} and {R2}. Every plan obtained from this pre-plan must respect the dependency (R3 → R2), i.e., R3 must precede R2 in the join ordering of each plan. On the other hand, there is no dependency in the second pre-plan, and hence there are no restrictions on the join orderings of any plan obtained from the second pre-plan.

Next, we briefly describe how we calculate the cost of a pre-plan; details are in [27]. First, since we approximate the cost of the pre-plan, we ignore all local processing costs, e.g., scans or joins of locally resident mediator relations, and we only focus on the remote relations. Suppose that there are no dependencies in the pre-plan (and the plan) between the remote relations, as in the second pre-plan. The pre-plan cost is then the cost of accessing the remote relations, e.g., R2 from the WebSource, and the cost of any join operators involving R2; in this case, we assume that the cost of the join is calculated based on the hash join implementation. Suppose instead that there is exactly one dependency, e.g., (R3 → R2). This is implemented as a dependent join [6], and the cost is calculated using the nested loop join implementation. To correctly determine the cost of the nested loop join, where R2 is the operand on the right subtree, we have to calculate the (output) cardinality of the left subtree tuples (that include relation R3) that join with R2. We assume that the optimizer will choose a good join ordering of the other relations to reduce (minimize) this cardinality. To do so, the optimizer will need to partition the other relations and determine which subset must occur in the left subtree with R3, in a good (optimal) plan. This subset (of selections and joins) will act as a filter and reduce the cardinality of this subtree. The case where there is more than one dependent join in the pre-plan is much more complex, since finding a good (optimal) partitioning of the subgoals is more expensive. Details are in [27].
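As a rough paraphrase of the two simplest costing cases above (no dependency, costed with a hash join; a single dependency, costed as a dependent nested loop join), the sketch below approximates a pre-plan's remote cost. The distributions, the cardinality estimate, and the use of the expected cost alone are my own simplifications; the heuristic in [27] works with the full distributions and the ECD measure.

```python
# Rough paraphrase of the pre-plan costing heuristic for the two simple cases
# described in the text. Distributions and the cardinality estimate are
# invented; the real heuristic in [27] is distribution-based, not EC-based.

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def preplan_cost_no_dependency(remote_dist):
    # Hash join case: one probe of the WebSource; local join cost is ignored.
    return expected_cost(remote_dist)

def preplan_cost_one_dependency(remote_dist, outer_cardinality):
    # Dependent (nested loop) join: roughly one remote probe per binding
    # produced by the left subtree; outer_cardinality is an assumed estimate.
    return outer_cardinality * expected_cost(remote_dist)

d22 = [(8.0, 0.7), (20.0, 0.3)]   # access cost distribution, no binding needed
d21 = [(2.0, 0.9), (5.0, 0.1)]    # access cost distribution, binding from R3
print(preplan_cost_no_dependency(d22))        # approximate cost of pre-plan 2
print(preplan_cost_one_dependency(d21, 40))   # pre-plan 1, ~40 bindings assumed
```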

Generating Plans

We can now (informally) describe our strategy to generate an appropriate candidate set of pre-plans and plans for the PG sensitive optimizer. In the first phase, for values of cf and df varying in the range [0.0, 1.0], we utilize a remote cost pre-optimizer to generate the best pre-plan. We note that for the results reported in this paper, we display the optimizer choices for values of cf and df adding up to 1.0, i.e., the value of cf = (1.0 - df). These values were found to work well to illustrate the optimizer strategy. The pre-optimizer chooses a set of candidate WebSources and their cost distributions, and calculates the (approximate) ECD value, for the particular values of cf and df, using the method just described. The pre-optimizer can either exhaustively consider the combinations of WebSources, or it can use a randomized search strategy that is described in [28]. The granularity of the values of cf and df is varied by the pre-optimizer. For each pair of values (cf, df), it will choose either a single pre-plan or the top c pre-plans, based on maximizing the ECD utility, i.e., the minimum ECD value.

In the second phase, for each of the pre-plans, and the corresponding value of (cf, df), a randomized relational optimizer generates plans that respect the dependencies of the pre-plan. This task of generating plans that respect the pre-plan is described in [27, 3]. The optimizer can choose either a single plan, or the top c plans, based on maximizing the ECD utility, i.e., the minimum ECD value. Finally, depending on the desired optimizer strategy, i.e., optimistic or conservative, the optimizer would choose a pair of values (cf, df) that represents that strategy, and the corresponding winning pre-plan and plan. This would be a single winner or the top c winners, as described above.

5 Behavior of the Performance Guarantee (PG) Sensitive Optimizer

5.1 Simulation Environment

Figure 2 presents the components of our architecture and simulation environment. The mediator extends the Predator ORDBMS [23], which is implemented in C++. The pre-optimizer to generate pre-plans was implemented in Quintus Prolog and integrated into the Predator mediator. The Predator evaluation engine was extended with an external scan operator and a dependent join operator [6, 1] to implement the (limited) access to WebSources, and the mediator Catalog and cost model were appropriately extended. Synthetic access cost (Gaussian) distributions were generated using the Java Open Source Libraries for High Performance Scientific and Technical Computing [11]. These access cost distributions are labeled using the values of µ and σ for the distribution. In addition, experimental data of access to real WebSources, e.g., EPA [8], were gathered to develop a cost model for WebSources [31]. These traces were also used in our simulations and are labeled by their name. The Wrapper cost model provides relevant statistics and access costs for the WebSource. For the simulation, the Predator mediator was executed on a Sun Ultra SPARC 1 with 64MB of memory, running Solaris 2.6. While validating access to WebSources, this machine was connected via a 1 Mbps Ethernet cable to the domain umiacs.umd.edu. This domain is connected to its ISP via a 27 Mbps DS3 line.

The mediator queries were complex N-way join queries, and we used up to 12 queries in the simulation. They were generated to reflect the complex decision support queries and statistics of the TPC-D Benchmark. For each query, up to 5 relations were identified as relations resident on remote WebSources. For each such relation, we identified up to 3 alternate WebSources, with their dependencies and access cost distributions. All costs are end-to-end delays for downloading data. The mediator statistics were modified so as to vary the percentage of remote processing cost versus the total cost of the queries. Unless otherwise noted, we report on the aggregate behavior of the PG optimizer over all the queries.

5.2 Expected Cost Delay Trade-Off for Pre-Plans and Plans

We report on the Expected Cost Delay (ECD) utility measure and the utility trade-off of the optimizer, as we varied sources and queries. In these experiments, we used synthetic access cost distributions. We typically report on the value of the ECD measure; the optimizer will maximize utility by minimizing the ECD measure.

In the first experiment (see Figure 9), we picked one relation to be a remote relation. There were 4 alternate WebSources for this relation; each WebSource had a cost distribution with a different value for the expected cost EC and the expected delay ED. One of the sources was an ideal source, i.e., its value of ED was close to 0. The EC values for the other sources were comparable to that of the ideal source, and the values of ED varied.

Figure 9: Aggregate over Multiple Queries; Relative ECD; One Remote Relation; Varying Delay Factor

Figure 9a reports on the quantile plots for the response times of the query plans, one for each of the sources. To construct the quantile plots, we used the synthetic distributions to generate response times for the access to the WebSource. The quantile plot reports on the end-to-end delay or response time for the query trials. We report on the aggregate behavior over the query mix. Figure 9b compares the relative ECD for each plan/source, compared to the ECD of the plan on the ideal source, as we vary the delay factor df. Figure 9a illustrates that each of the plans has some benefit compared to the ideal plan. Correspondingly, each of the plans also has some risk. We compare the quantile plot for the plan labeled > in Figure 9a, and its relative ECD, compared to the plan on the ideal source, in Figure 9b. This plan incurs the greatest risk compared to the plan on the ideal source, as seen in Figure 9a. Correspondingly, as we increase the delay factor df, we see that the ECD for this plan, compared to the plan on the ideal source, also increases.
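To see why the relative ECD of a risky plan grows with the delay factor, the toy computation below compares a noisy source against a near-ideal one over a range of df values (with cf = 1 - df, as described above). The two distributions are synthetic stand-ins chosen by me, not the distributions used in this experiment.

```python
# Toy illustration of the trend in Figure 9b: as the delay factor df grows,
# a plan on a noisy source falls further behind a plan on a near-ideal source.
# Both distributions below are synthetic stand-ins, not the experiment's data.

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

def ecd(dist, df):
    cf = 1.0 - df                      # cf and df sum to 1, as in the paper
    return cf * expected_cost(dist) + df * expected_delay(dist)

ideal = [(50.0, 1.0)]                                # ED is essentially 0
noisy = [(30.0, 0.6), (60.0, 0.2), (150.0, 0.2)]     # comparable EC, large ED
for df in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(df, ecd(noisy, df) - ecd(ideal, df))       # gap widens with df
```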

Figure 10: Relative ECD with and without dependencies

In the next experiment (see Figure 10), we picked two remote relations. Each remote relation is associated with three alternate WebSources, and their response time distributions. In addition, we varied the query capability implemented by the WebSources. We chose two query capabilities, one of which introduced a dependency between the remote relation and a mediator relation. This had an impact on the query plan, i.e., the query capability with a dependency introduced a dependent join operator in the plan. The plans with and without the dependent join operator had different ranges of ECD values. As before, the ideal source(s) had a value of ED close to 0. We chose two ideal sources, each of which introduces a dependency, and two ideal sources without a dependency in their query capability. We compared the plans with noisy sources to the plan with two ideal sources. Figure 10a reports on the relative ECD for the noisy plans, compared to the ideal plan, when all sources had a dependency in their query capability. Figure 10b reports on the relative ECD for the noisy plans, compared to the ideal plan, where none of the sources had a dependency. Both graphs report on the aggregate behavior over five queries.

Figure 10b, on the right, had no dependencies. As the value of df increases, the noisy plans with higher values of ED exhibit increasing values of ECD. This result is similar to our previous experiment with a single remote relation. However, when we examine Figure 10a, on the left, which has dependencies, we note that as the value of df increases, the ECD for the noisy plans with higher values of ED increases much more rapidly, in comparison to the plans with lower values of ED. Thus, we see a grouping of source behavior, based on their high or low values of ED, as we increase the delay factor df. This illustrates that when there are dependencies in the query capability, the choice of WebSource and cost distribution, in particular the value of ED, has a greater impact on the optimizer's performance guarantee.

In a final experiment, we studied the scalability of the pre-optimizer and optimizer. We chose up to 5 of the relations to be remote. For each of the remote relations, there were up to 3 remote WebSources. Thus, the space of pre-plans, where each pre-plan corresponds to a choice of source for each of the 5 remote relations, is large. We report on our first set of results, when only a limited number of pre-plans were explored, in the top two graphs, and we report on the results where all the pre-plans were exhaustively explored, in the bottom two graphs of Figure 11.

Figure 11: Scalability of the pre-optimizer and optimizer. The four panels plot Expected Delay (msec) versus Expected Cost (msec) for 5 remote relations x 3 remote sources, for pre-plans and plans, under limited search (top) and exhaustive search (bottom), with DF = 0, 0.25, 0.5, and 1.

To simplify the presentation, we only present the results for one complex N-way join query. Figure 11 identifies the values of EC and ED, for the winning pre-plan and plan, for varying values of the delay factor df. We first discuss the two lower graphs, where the search was exhaustive. We expect that for lower values of df, when the optimizer is optimistic, the winning pre-plan and plan will have lower values for EC. The winning pre-plans and plans for df = 0.0, 0.25, and 0.5 are marked in the figure; for this particular query, the same pre-plan and plan are winners for these df values. We see that the value of ED is comparatively large, since the optimizer is optimistic and ignores ED. On the other hand, when the optimizer is conservative, i.e., with higher values for df, the winning pre-plan and plan may have larger values for EC, but the values for ED should be low. The winning pre-plan and plan for a value of df = 1.0 indeed reflect this behavior.

Finally, we compare the winning pre-plans and plans when the search is not exhaustive (top two graphs) with the exhaustive search. We observe some anomalies in their behavior. For example, we expect that the value of ED for the winning pre-plan should decrease for df = 0.25 or 0.5, compared to the value of ED for df = 0.0. However, the top left graph shows us that this value of ED actually increases. This anomaly is due to limitations of our current implementation of the randomized pre-optimizer. This experiment illustrates that the PG sensitive optimizer's behavior is consistent and scalable up to a large search space of pre-plans and plans.

In the next set of experiments, we explore the optimizer behavior on cost distributions obtained from query traces of real WebSources. The sources are EPA [8] and Aircraft [9]. Figure 3 from section 2 plots the actual response times for these sources. EPA had a low level of noise, whereas Aircraft was a noisy source. The optimizer used the EPA and Aircraft distributions to generate query plans P1 and P2, and to calculate cost distributions for the plans P1 and P2. Figure 12a represents quantile plots of response time for P1 and P2 that were generated from those distributions. We observe that there is a significant risk and a very small benefit associated with the choice of P2 instead of P1. ECD values for P1 and P2 are shown in Figure 12b for different values of the delay factor df. The plan P1, based on the less noisy EPA, is the winner for all ranges of df, i.e., for all ranges of the optimizer strategies. This is consistent with the significant risk and small benefit associated with the choice of P2. Figure 12b shows the relative difference in ECD of P1 and P2. When the optimizer is more optimistic, the relative ECD has a lower value. It increases slightly with increasing df, and this reflects that when the optimizer is more conservative, it places more emphasis on the risk of choosing P2. Thus, the behavior of the optimizer is consistent with respect to the amount of risk and benefit in choosing query plans on real WebSources.

To summarize, we have verified that the optimizer behavior was both predictable and scalable, with respect to the following: alternate sources for a remote relation with varying values of EC and ED; alternate sources with and without dependencies; aggregate behavior over several queries; synthetic and experimental cost distributions; and a large search space of pre-plans and plans.


Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring New Directions in Traffic Measurement and Accounting C. Estan and G. Varghese Presented by Aaditeshwar Seth 1 Need for traffic measurement Internet backbone monitoring Short term Detect DoS attacks Long

More information

Mining Temporal Indirect Associations

Mining Temporal Indirect Associations Mining Temporal Indirect Associations Ling Chen 1,2, Sourav S. Bhowmick 1, Jinyan Li 2 1 School of Computer Engineering, Nanyang Technological University, Singapore, 639798 2 Institute for Infocomm Research,

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

IV Statistical Modelling of MapReduce Joins

IV Statistical Modelling of MapReduce Joins IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of

More information

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1)

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 71 CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 4.1 INTRODUCTION One of the prime research objectives of this thesis is to optimize

More information

Efficient Remote Data Access in a Mobile Computing Environment

Efficient Remote Data Access in a Mobile Computing Environment This paper appears in the ICPP 2000 Workshop on Pervasive Computing Efficient Remote Data Access in a Mobile Computing Environment Laura Bright Louiqa Raschid University of Maryland College Park, MD 20742

More information

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution Announcements (March 1) 2 Query Processing: A Systems View CPS 216 Advanced Database Systems Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments Administrivia Midterm on Thursday 10/18 CS 133: Databases Fall 2018 Lec 12 10/16 Prof. Beth Trushkowsky Assignments Lab 3 starts after fall break No problem set out this week Goals for Today Cost-based

More information

An Adaptive Query Execution Engine for Data Integration

An Adaptive Query Execution Engine for Data Integration An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington Presented by Peng Li@CS.UBC 1 Outline The Background

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Estimating the Quality of Databases

Estimating the Quality of Databases Estimating the Quality of Databases Ami Motro Igor Rakov George Mason University May 1998 1 Outline: 1. Introduction 2. Simple quality estimation 3. Refined quality estimation 4. Computing the quality

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building

More information

A Search Theoretical Approach to P2P Networks: Analysis of Learning

A Search Theoretical Approach to P2P Networks: Analysis of Learning A Search Theoretical Approach to P2P Networks: Analysis of Learning Nazif Cihan Taş Dept. of Computer Science University of Maryland College Park, MD 2742 Email: ctas@cs.umd.edu Bedri Kâmil Onur Taş Dept.

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems Query Processing: A Systems View CPS 216 Advanced Database Systems Announcements (March 1) 2 Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Parser: SQL parse tree

Parser: SQL parse tree Jinze Liu Parser: SQL parse tree Good old lex & yacc Detect and reject syntax errors Validator: parse tree logical plan Detect and reject semantic errors Nonexistent tables/views/columns? Insufficient

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Ch. 7: Benchmarks and Performance Tests

Ch. 7: Benchmarks and Performance Tests Ch. 7: Benchmarks and Performance Tests Kenneth Mitchell School of Computing & Engineering, University of Missouri-Kansas City, Kansas City, MO 64110 Kenneth Mitchell, CS & EE dept., SCE, UMKC p. 1/3 Introduction

More information

Improving PostgreSQL Cost Model

Improving PostgreSQL Cost Model Improving PostgreSQL Cost Model A PROJECT REPORT SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Engineering IN Faculty of Engineering BY Pankhuri Computer Science and Automation

More information

Evaluation of Performance of Cooperative Web Caching with Web Polygraph

Evaluation of Performance of Cooperative Web Caching with Web Polygraph Evaluation of Performance of Cooperative Web Caching with Web Polygraph Ping Du Jaspal Subhlok Department of Computer Science University of Houston Houston, TX 77204 {pdu, jaspal}@uh.edu Abstract This

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Query processing and optimization

Query processing and optimization Query processing and optimization These slides are a modified version of the slides of the book Database System Concepts (Chapter 13 and 14), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan.

More information

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper Profile of CopperEye Indexing Technology A CopperEye Technical White Paper September 2004 Introduction CopperEye s has developed a new general-purpose data indexing technology that out-performs conventional

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 12, Part A Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset

More information

Chapter 11: Query Optimization

Chapter 11: Query Optimization Chapter 11: Query Optimization Chapter 11: Query Optimization Introduction Transformation of Relational Expressions Statistical Information for Cost Estimation Cost-based optimization Dynamic Programming

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Introduction Alternative ways of evaluating a given query using

Introduction Alternative ways of evaluating a given query using Query Optimization Introduction Catalog Information for Cost Estimation Estimation of Statistics Transformation of Relational Expressions Dynamic Programming for Choosing Evaluation Plans Introduction

More information

Search and Optimization

Search and Optimization Search and Optimization Search, Optimization and Game-Playing The goal is to find one or more optimal or sub-optimal solutions in a given search space. We can either be interested in finding any one solution

More information

HYRISE In-Memory Storage Engine

HYRISE In-Memory Storage Engine HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University

More information

ATYPICAL RELATIONAL QUERY OPTIMIZER

ATYPICAL RELATIONAL QUERY OPTIMIZER 14 ATYPICAL RELATIONAL QUERY OPTIMIZER Life is what happens while you re busy making other plans. John Lennon In this chapter, we present a typical relational query optimizer in detail. We begin by discussing

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Avoiding Sorting and Grouping In Processing Queries

Avoiding Sorting and Grouping In Processing Queries Avoiding Sorting and Grouping In Processing Queries Outline Motivation Simple Example Order Properties Grouping followed by ordering Order Property Optimization Performance Results Conclusion Motivation

More information

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1) Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14 Query Optimization Chapter 14: Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14 Query Optimization Chapter 14: Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14: Query Optimization Chapter 14 Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

Tradeoffs in Processing Multi-Way Join Queries via Hashing in Multiprocessor Database Machines

Tradeoffs in Processing Multi-Way Join Queries via Hashing in Multiprocessor Database Machines Tradeoffs in Processing Multi-Way Queries via Hashing in Multiprocessor Database Machines Donovan A. Schneider David J. DeWitt Computer Sciences Department University of Wisconsin This research was partially

More information

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Copyright 2018, Oracle and/or its affiliates. All rights reserved. Beyond SQL Tuning: Insider's Guide to Maximizing SQL Performance Monday, Oct 22 10:30 a.m. - 11:15 a.m. Marriott Marquis (Golden Gate Level) - Golden Gate A Ashish Agrawal Group Product Manager Oracle

More information

Chapter 3. Algorithms for Query Processing and Optimization

Chapter 3. Algorithms for Query Processing and Optimization Chapter 3 Algorithms for Query Processing and Optimization Chapter Outline 1. Introduction to Query Processing 2. Translating SQL Queries into Relational Algebra 3. Algorithms for External Sorting 4. Algorithms

More information

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu Elke A. Rundensteiner Department of Computer Science, Worcester Polytechnic Institute Worcester, MA 01609-2280 (binliu rundenst)@cs.wpi.edu

More information

Part III. Issues in Search Computing

Part III. Issues in Search Computing Part III Issues in Search Computing Introduction to Part III: Search Computing in a Nutshell Prior to delving into chapters discussing search computing in greater detail, we give a bird s eye view of its

More information

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu FMA901F: Machine Learning Lecture 6: Graphical Models Cristian Sminchisescu Graphical Models Provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate

More information

Efficient in-network evaluation of multiple queries

Efficient in-network evaluation of multiple queries Efficient in-network evaluation of multiple queries Vinayaka Pandit 1 and Hui-bo Ji 2 1 IBM India Research Laboratory, email:pvinayak@in.ibm.com 2 Australian National University, email: hui-bo.ji@anu.edu.au

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 11/15/12 Agenda Check-in Centralized and Client-Server Models Parallelism Distributed Databases Homework 6 Check-in

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing

More information

Question 1: What is a code walk-through, and how is it performed?

Question 1: What is a code walk-through, and how is it performed? Question 1: What is a code walk-through, and how is it performed? Response: Code walk-throughs have traditionally been viewed as informal evaluations of code, but more attention is being given to this

More information

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS 6.1 Introduction Gradient-based algorithms have some weaknesses relative to engineering optimization. Specifically, it is difficult to use gradient-based algorithms

More information

Chapter 13: Query Optimization. Chapter 13: Query Optimization

Chapter 13: Query Optimization. Chapter 13: Query Optimization Chapter 13: Query Optimization Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 13: Query Optimization Introduction Equivalent Relational Algebra Expressions Statistical

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation.

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation. Query Processing QUERY PROCESSING refers to the range of activities involved in extracting data from a database. The activities include translation of queries in high-level database languages into expressions

More information

New Bucket Join Algorithm for Faster Join Query Results

New Bucket Join Algorithm for Faster Join Query Results The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 701 New Bucket Algorithm for Faster Query Results Hemalatha Gunasekaran 1 and ThanushkodiKeppana Gowder 2 1 Department Of

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

DBMS Query evaluation

DBMS Query evaluation Data Management for Data Science DBMS Maurizio Lenzerini, Riccardo Rosati Corso di laurea magistrale in Data Science Sapienza Università di Roma Academic Year 2016/2017 http://www.dis.uniroma1.it/~rosati/dmds/

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Chapter 14: Query Optimization

Chapter 14: Query Optimization Chapter 14: Query Optimization Database System Concepts 5 th Ed. See www.db-book.com for conditions on re-use Chapter 14: Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo

Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo Overview Examples and motivations Scalable coding for network transmission Techniques for multiple description coding 2 27/05/2013

More information

GSLPI: a Cost-based Query Progress Indicator

GSLPI: a Cost-based Query Progress Indicator GSLPI: a Cost-based Query Progress Indicator Jiexing Li #1, Rimma V. Nehme 2, Jeffrey Naughton #3 # Computer Sciences, University of Wisconsin Madison 1 jxli@cs.wisc.edu 3 naughton@cs.wisc.edu Microsoft

More information

An Evaluation of Alternative Designs for a Grid Information Service

An Evaluation of Alternative Designs for a Grid Information Service An Evaluation of Alternative Designs for a Grid Information Service Warren Smith, Abdul Waheed *, David Meyers, Jerry Yan Computer Sciences Corporation * MRJ Technology Solutions Directory Research L.L.C.

More information

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management Outline n Introduction & architectural issues n Data distribution n Distributed query processing n Distributed query optimization n Distributed transactions & concurrency control n Distributed reliability

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information