Query Optimization to Meet Performance Guarantees for Wide Area Applications


Vladimir Zadorozhny    Louiqa Raschid
University of Maryland, College Park, MD 20742

Abstract

Recent technology advances have enabled mediated query processing with Internet accessible WebSources. A characteristic of WebSources is that their access costs exhibit transient behavior. These costs depend on the network and server workloads, which are often affected by the time of day, the day of the week, etc. Given a distribution of access costs, a potential optimization strategy is based on the least expected cost (LEC). An LEC optimizer chooses WebSource(s) for remote relations, and query execution plan(s), that minimize the expected cost of the plan. A drawback of LEC optimization is that it is not a good predictor of the actual performance of a query. In a noisy environment, the actual latency (response time) of a query will occur within some range of values. A performance guarantee (PG) for a noisy environment will correspond to "at least X percent of the queries will have a latency of less than T units of time". In this paper, we propose an optimizer strategy that is sensitive to the objective of meeting performance guarantees (PG). For each query plan, a PG sensitive optimizer uses both the expected value of the cost distribution of the plan and the expected delay of the plan. The delay is the deviation above the expected value. The plan is then characterized by an Expected Cost Delay (ECD) utility measure based on the expected value and expected delay of the cost distribution. The optimizer can choose different strategies to match a desired performance guarantee. The optimizer behavior can be optimistic, where it ignores the expected delay, or it can be conservative, where it emphasizes the expected delay. This trade-off of optimizer behavior is controlled by a cost factor and a delay factor. We examine various characteristics that impact the behavior of the PG optimizer. We validate our strategy using a simulation based study of the optimizer's aggregate behavior, for a set of queries and a set of remote relations on WebSources. We also experimentally validate the optimizer using traces of access costs for real WebSources.

1 Introduction

Recent technology advances have enabled mediated query processing in the wide area environment with remote relations (objects) that are resident on Internet accessible WebSources. There are many characteristics of mediated query processing in this environment that pose substantial challenges. One characteristic is that typically there are several alternate WebSources for each remote relation. This is due to architectures of mirror sites or replicas, proxy sites with cached copies, or portals that gather information from several different sites. Thus, a mediator subquery (on a relation) can be submitted to multiple candidate WebSources. In order to choose between alternate WebSources, an optimizer has to compare their access costs. A second characteristic of Internet accessible WebSources, however, is that their access costs exhibit transient behavior. The costs may be described as spatio-temporal, since the response times depend on the client, the remote server, and the network and server workloads.

(This research has been partially supported by the Defense Advanced Research Projects Agency under grant , and the National Science Foundation under grant IRI .)

These workloads are often affected by the time of day, the day of the week, etc. They are typically represented as cost distributions. The transient behavior of WebSources makes it difficult to directly compare the cost of plan execution on (particular combinations of) WebSources. LEC (least expected cost) is a proposed approach [7] to compare the expected values of cost distributions for execution plans. However, the expected value reflects aggregate behavior, and is not a good reflection of the actual response time, i.e., the time to execute the query and download results from the source. A noisy source with high variance in its response time will behave differently from a stable source that is less noisy, while both sources may have the same, or similar, expected values of the response time. This difference in actual performance is independent of the expected values of their access cost distributions.

In this paper we consider the relationship between the choice of sources and a performance guarantee. A performance guarantee (PG) will correspond to "at least X percent of the queries will have a latency of less than T units of time". Consider the following example of two queries (or their execution plans), on sources S1 and S2, respectively. The sources have different cost distributions. Figure 1 represents the response times for these two queries. The graphs show the percentage of queries and their response time, i.e., a quantile plot of response times, for hundreds of queries submitted to the two sources. Suppose we consider a performance guarantee "100% of the queries must execute in less than 4 seconds". It is clear that in the noisy Internet environment, neither of the queries (plans) can meet this performance guarantee. Figure 1 also identifies the risk of choosing either source; intuitively this corresponds to those queries whose response time exceeds 4 seconds. As can be seen in Figure 1, the risk associated with source S1 appears to be greater than the risk associated with S2. Conversely, we associate each of these choices with a value of benefit, i.e., queries with response time less than 4 seconds. The benefit of S1 also appears to be greater than that of S2.

Figure 1: Risk and benefit of Sources S1 and S2 for a performance guarantee of 4 seconds

To summarize, given several noisy WebSources, each characterized by a cost distribution, an optimizer has to select a good combination of sources, and generate a good query execution plan for this choice of sources. A performance guarantee (PG) sensitive optimizer will have to choose among the sources and execution plans, to better meet a desired performance guarantee.
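To make the notion of a performance guarantee concrete, the sketch below checks whether a set of observed response times satisfies a guarantee of the form "at least X percent of the queries finish in less than T units of time". It is only an illustration: the function name and the sample latencies are hypothetical and are not taken from the paper's experiments.

```python
# Minimal sketch: testing a performance guarantee (PG) against observed
# latencies. The sample data below is invented for illustration.

def meets_guarantee(latencies, t_limit, min_fraction):
    """True if at least min_fraction of the latencies are below t_limit."""
    within = sum(1 for x in latencies if x < t_limit)
    return within / len(latencies) >= min_fraction

# Hypothetical response-time samples (seconds) for two candidate sources.
s1_latencies = [1.2, 1.5, 1.8, 2.2, 2.5, 3.1, 3.8, 4.4, 5.2, 9.5]
s2_latencies = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.3, 3.6, 3.9, 4.7]

# PG: "at least 90% of the queries must execute in less than 4 seconds".
for name, lat in (("S1", s1_latencies), ("S2", s2_latencies)):
    print(name, meets_guarantee(lat, t_limit=4.0, min_fraction=0.9))
```

Under these made-up samples, S1 misses the guarantee (two of ten queries exceed 4 seconds) while S2 meets it; this is the kind of comparison that the quantile plots of Figure 1 support visually.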

There has been extensive prior research on query optimization and evaluation that is relevant to our task. A recent issue of the IEEE Data Engineering Bulletin highlights several projects. Reactive query evaluation techniques at the plan level are described in [1, 4, 21, 26]. Alternately, adaptive evaluation techniques at the plan and operator implementation level are described in [3, 13, 14, 16, 25]. Research in [2, 7, 19, 15, 17, 32] has addressed various aspects of the task of modifying an optimizer to handle distributions for various parameters, e.g., available memory, or intermediate join cardinality. The Tukwila project [16] provides an adaptive framework composed of rules, and adaptive operators that are sensitive to transient factors. Its optimizer also has a re-optimization capability based on pipelined fragments of query execution. The Telegraph project explores adaptive fine-grained data flow based query processing techniques using rivers, eddies, ripple join, XJoin, etc. [3, 13, 25]. They also focus on continuous and high frequency response to changing feedback. Query scrambling [1, 26] is a query optimization and evaluation technique to deal with transient conditions, e.g., unexpected delay. It is based on plan re-organization to avoid idle time, and the synthesis of new operators to execute in the absence of other work. It makes a key contribution of a response time based query optimizer to deal with delay, instead of the more traditional cost based approach. We draw upon this concept when we introduce the expected delay of a cost distribution. Finally, several key contributions were drawn from the least expected cost (LEC) optimizer of [7], and this is described elsewhere in the paper. In summary, while prior research has addressed distributions and transient conditions, it has not addressed our specific task of choosing among noisy sources, described by cost distributions, in order to develop an optimizer that is sensitive to performance guarantees in a noisy environment.

Our approach to PG sensitive optimization is to consider a cost distribution for each query plan. We characterize the plan with the expected cost of the plan, as well as the expected delay. The expected delay represents the deviation in excess of the expected cost. We combine the expected cost and the expected delay using a cost factor and a delay factor, to obtain a combined utility measure, the Expected Cost Delay (ECD). The optimizer will choose a plan to maximize utility, i.e., it will minimize the ECD value. The optimizer can choose different strategies to meet a desired performance guarantee. We control the behavior of the optimizer by varying the values of the cost factor and delay factor. This will also vary the value of the ECD utility for the candidate plans. With higher values for the delay factor, the optimizer is more conservative, i.e., it tries to minimize the risk of queries that do not meet the performance guarantee, while choosing sources and plans. With lower values for the delay factor, the optimizer is more optimistic, i.e., it tries to maximize the benefit of queries that do meet the performance guarantee, while choosing sources and plans. We demonstrate how we manipulate the cost factor and delay factor, and use the value of ECD to choose plans following a more conservative or optimistic strategy, to better meet a performance guarantee. The first contribution of our research is this approach to PG sensitive optimization in a noisy environment.
An important issue for an optimizer when dealing with cost distributions is generating good candidate plans, so that the optimizer choice is close to optimal [7]. Our search space is large since we must consider alternate sources, or combinations of sources. Our second contribution is a two phase technique to generate good candidate plans. In the first phase, we use the concept of a pre-plan [27, 3] to represent a choice of sources and their cost distributions. We use a heuristic to determine the (approximate) cost of a pre-plan and we choose pre-plans to reflect an optimistic or conservative optimizer strategy. In the second phase, we use an extended relational optimizer to generate good plans for the pre-plans.

Figure 2: Mediator Wrapper Architecture for WebSources

Using a simulation based study, we verify that the PG sensitive optimizer's behavior is predictable and scalable with respect to the following: alternate sources with varying values of expected cost and expected delay; a mix of queries of varying complexity; a large search space of pre-plans and plans; and access cost distributions obtained from real WebSources.

This paper is organized as follows: In section 2, we briefly review the mediator architecture and the components of the optimizer. In section 3, we use examples to illustrate sources that meet or do not meet performance guarantees. In section 4, we describe the implementation of the PG sensitive optimizer. We define the Expected Cost Delay (ECD) utility measure, and describe the two phase approach to generating pre-plans and plans. We then describe our simulation based study of the PG optimizer behavior. We present the optimizer's aggregate behavior, for a set of queries and a set of remote relations on WebSources, in section 5. We also experimentally validate the optimizer using traces of access costs for real WebSources. In section 6, we study the optimizer strategy and the ECD trade-off. We illustrate how the optimizer can use an optimistic or conservative strategy to choose sources and plans that best meet some performance guarantee.

2 Architecture

Wrapper mediator architectures have been proposed for interoperability of heterogeneous information sources [1, 19, 2, 22, 29]. Figure 2 presents the components of our architecture tailored for Internet accessible WebSources. A WebSource is accessible via the HTTP protocol; a forms-based interface provides a limited query capability, and returns answers in XML or HTML. The mediator is an extension of the Predator ORDBMS [23]; our mediator uses the relational data model. The mediator has the task of decomposing a mediator query into subqueries; identifying relevant WebSources that can answer a subquery; and providing query optimization and evaluation functionalities.

Wrappers reflect the limited query capability of WebSources and handle mismatch between the mediator and the WebSource. We developed a novel technique for mediator query optimization using a pre-optimizer; the pre-optimizer produces a pre-plan for a mediator query. For each remote relation in the mediator query, the pre-plan identifies a relevant WebSource, its query processing capabilities, and its access cost distribution. A pre-plan is a higher level abstraction that circumscribes and describes a space of query execution plans, or plans. The cost distributions of the WebSources characterize the cost of each pre-plan, as well as the cost of the (physical) plan for evaluating the mediator query. This is discussed in Section 4.

A major challenge for our optimizer is the unpredictable behavior of WebSources, dictated by network and server workloads. These workloads may be affected by the time of day, the day of the week, etc. We developed a tool, the Web Prediction Tool (WebPT), to predict access cost distributions for WebSources [5, 12]. The WebPT is designed for a dynamic environment such as the Internet, which must be continually monitored. The WebPT is based on an online learning and prediction technique. The WebPT measures and predicts query feedback, i.e., the end-to-end latency to access a WebSource and to download data. It uses a number of parameters, including the time of day and the day of the week, to predict latency. We used the WebPT to develop a cost model and a Catalog of access cost distributions for WebSources [31]. For the study, data was gathered from several WebSources, e.g., the USGS Geographic Names Information System (GNIS) [24], Landings Aviation Search Engines (Aircraft) [9] and the EPA Toxic Releases Inventory Database (EPA) [8]. The sources represent a diversity of real sources in a variety of application domains. Using experimental data of query feedback from WebSources, we were able to characterize sources as having High or Low Prediction Accuracy, with respect to the ability of the Catalog and cost model to predict access costs. We also identified the following WebSource characteristics that are correlated with High or Low prediction accuracy:

- The significance of the time of day, the day of the week, etc., on predicting access costs.
- The level of noise and the corresponding variability of the access cost distribution.
- The variability of result cardinality and result size, for different queries submitted to the WebSource.

To explain, in some WebSources, e.g., EPA and the ACM Digital Library [18], the time of day and the day of the week were good predictors of network and source workloads. Alternately, access costs may be affected by the result cardinality; in ACM, for example, the data resides in a database, and a page of answers is dynamically assembled. The level of noise also varied; e.g., EPA had a low level of noise, whereas Aircraft was a noisy source. Our study shows that the significance of time and day on predicting access costs could be exploited to improve the accuracy of prediction, even in cases where the sources are noisy. For example, GNIS and ACM are noisy sources, but the cost model nevertheless developed an accurate cost distribution. Figure 3 plots the actual response times when submitting a query 5 times to a source with high prediction accuracy, EPA, and to a noisier source with low prediction accuracy, Aircraft. These cost distributions are used in our later experiments.
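The WebPT and the derived cost model are described in [5, 12, 31]; the sketch below is only a rough illustration of the kind of catalog lookup such a cost model can support, keyed on the time of day and the day of the week. The class, its methods, and the sample feedback values are assumptions made for illustration, not the WebPT's actual design.

```python
# Illustrative sketch only: an access cost catalog keyed by (source, day of
# week, hour of day). The real WebPT predictor and Catalog are more elaborate.
from collections import defaultdict

class CostCatalog:
    def __init__(self):
        # (source, day, hour) -> list of observed end-to-end latencies (seconds)
        self.samples = defaultdict(list)

    def record(self, source, day, hour, latency):
        self.samples[(source, day, hour)].append(latency)

    def distribution(self, source, day, hour):
        """Empirical cost distribution as (value, probability) pairs."""
        obs = self.samples[(source, day, hour)]
        prob = 1.0 / len(obs) if obs else 0.0
        return [(v, prob) for v in obs]

catalog = CostCatalog()
for latency in (8.0, 9.5, 30.0, 11.0):      # hypothetical query feedback
    catalog.record("EPA", "Mon", 10, latency)
print(catalog.distribution("EPA", "Mon", 10))
```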

Figure 3: Response times for (a) EPA and (b) Aircraft WebSources

3 Motivating Example

Consider two alternate sources, S1 and S2, for a remote mediator relation. The access cost (response time) distributions for these sources are represented by normal (Gaussian) distributions (used here only for purposes of explanation), with different values for the expected value and the variance. The distributions are in Figure 4a. One choice for the optimizer is to choose a source with the least expected value, i.e., source S1. However, the performance of S1, i.e., the actual response time of S1, is not always better than the performance of S2. Consider Figure 4b, which represents a quantile plot of the response times for the two sources. To construct the quantile plots, we used the distributions of Figure 4a to generate the actual response time for some statistically significant (large) number of queries submitted to these sources; the quantile plot represents the percentage of queries and their actual response time. We see that there is a crossover of the quantile plots of sources S1 and S2. If we consider the 50-th percentile, source S1 provides a performance guarantee that the response time of 50% of the queries will not exceed 4 seconds. However, if we consider the 90-th percentile, source S2 provides a performance guarantee that the response time of 90% of the queries will not exceed 6 seconds, whereas source S1 can only guarantee that the response time of 90% of the queries will not exceed 8 seconds. This illustrates that in noisy environments, the expected value alone is insufficient to compare sources and their performance. The optimizer needs a more flexible strategy that would allow it to consider both the risk and benefit of its choice. Figure 5 illustrates the risk and benefit of choosing S2 instead of S1. Recall that S1 had the least expected value. Intuitively, the benefit corresponds to those queries where S2 has lower values of response time, compared to S1, while the risk corresponds to those queries where S2 has higher values of response time.

A performance guarantee will correspond to "at least X percent of the queries will have a response time (latency) of less than T units of time". Different applications may favor different performance guarantees.

Figure 4: Response time distributions (a) and quantile plots (b) for sources S1 and S2

Figure 5: Risk and benefit in preferring S2 over S1

An application may only benefit from some information within a certain time period, e.g., less than 3 seconds. Source S1 has a higher percentile of queries that meet this guarantee, and it should be the optimizer's choice. In order to make this choice, the optimizer must ignore the risk of each source of not meeting this guarantee, and emphasize the benefit of each source of meeting this guarantee. This is defined to be an optimistic optimizer strategy. A different application may have a restriction that exceeding a certain deadline, e.g., exceeding 8 seconds, will be very expensive. Source S2 has a lower percentile of queries that do not meet this guarantee, and it should be the optimizer's choice. In order to choose S2, the optimizer must emphasize the risk of each source of not meeting the performance guarantee. This is defined to be a conservative strategy.

Our objective is to develop an optimizer strategy that is sensitive to meeting certain performance guarantees in a noisy environment. Instead of choosing a source, or an execution plan for a query, that has the least expected cost, the optimizer must use a strategy that best matches the given performance guarantee. In doing so, the optimizer emphasizes or ignores the risk and benefit of each source.

4 A Performance Guarantee Sensitive Optimizer

The LEC approach to optimization [7] maximizes utility in choosing a query execution plan (or plan) for a query. Utility is associated with the expected cost of a plan. Our performance guarantee (PG) sensitive optimizer associates a utility measure that considers both the expected cost and the expected delay of a plan. We briefly review Least Expected Cost (LEC) based optimization. We introduce the concept of Expected Delay for a cost distribution and use it to express the PG sensitive risk and benefit. We define the Expected Cost Delay (ECD) utility measure for a query plan. Then we describe the implementation of the PG sensitive optimizer. We describe a pre-plan; it identifies a set of WebSources and their querying capabilities, for the remote relations of the mediator query. The pre-plan characterizes a space of (multiple) plans for a mediator query [27]. The PG sensitive optimizer has two phases. In the first phase, a remote cost pre-optimizer generates candidate pre-plans for a query. These candidate pre-plans are generated to reflect more optimistic or more conservative characteristics. In the second phase, a randomized relational optimizer uses the pre-plan to generate plans.

4.1 LEC Optimization

Accurate cost estimation for a plan is very difficult since it requires accurate a priori knowledge of costs and details of the run-time environment. The cost of a plan is typically determined using a cost formula and the values for some vector of parameters V that affect the cost of the plan. While the cost of a plan depends on a wide number of parameters, in this paper we only consider the impact of remote access costs (latency) associated with mediator relations resident on WebSources. The traditional approach is to first obtain some specific fixed value(s) for the set of parameters; typically this is the expected value. Then, the optimizer uses the specific value(s), and finds a plan p which has the least cost. This is the least specific cost method of obtaining a plan. The drawback of this approach is that when the fixed value is different from the actual value, and when there are discontinuities in the cost formula, this approach does not produce the best plan.

Research in [7] has proposed an alternative, least expected cost (LEC), approach. They consider a probability distribution of values for the vector of parameters V. It is represented by discrete (or continuous) values for each parameter in V, together with a probability value. They also consider a set of candidate plans. Now, for each candidate plan p, they determine the expected cost of this plan, given the probability distribution of values for V. Suppose V is represented by k discrete values, where each value V_k has probability of occurrence Prob(V_k). Then the expected cost of plan p, or EC(p), can be expressed as EC(p) = Σ_k Cost-plan-p(V_k) · Prob(V_k). They compare the expected cost of each plan, and choose the candidate plan with the least expected cost (LEC). The LEC plan maximizes the expected utility, where the utility is the negation of the cost. By maximizing utility, they minimize the expected cost, given some distribution of values for V.

4.2 Expected Delay

In the previous section we illustrated that the cost distribution for a source determines its performance. Further, we showed that the expected value of the cost distribution, the Expected Cost (EC), reflects its aggregate behavior. It is insufficient to determine its actual performance. A source may have high values for response time because the distribution has a high expected value. Alternately, the expected value may be small, but access to the source may often be very noisy, leading to high response times. We introduce the concept of the Expected Delay (ED) for a distribution. Together, EC and ED more accurately reflect the actual performance.

Consider a specific mediator client accessing a remote WebSource S. Assume that we characterize the access cost distribution of S, distS, as k discrete values, where each value val_k has a probability of occurrence prob_k. Then the expected value [7] of this distribution is EC(distS) = Σ_k val_k · prob_k. For any source that is characterized by a distribution, one must also consider the expected deviation from the expected value EC(distS) of the distribution. To determine this deviation, we compute the excess of each of the k particular values val_k of the distribution, compared to the expected value EC(distS); this is excess_k. Figure 6 illustrates a normal (Gaussian) distribution with expected value µ; the deviation excess_k for some value val_k is also illustrated. When the excess is a negative value, we set excess_k to 0. Since we only consider the positive excess, i.e., when the value val_k is greater than EC(distS), this deviation represents a delay, in comparison to the aggregate behavior characterized by the expected value EC(distS). Thus, excess_k can be considered to be some delay_k. The probability of this value of delay is prob_k. Then the expected delay of the distribution distS is ED(distS) = Σ_k delay_k · prob_k.

We refer to the access cost distributions of Figure 7a. Based on the expected value of these distributions, the source ordering is [S1 < S2 < S3], where S1 has the least expected value. However, if we examine the distributions, we see that source S1 is a noisy source, compared to source S2, i.e., the variance of S2 is lower than the variance of S1. Figure 7b indicates both the expected value and expected delay for the sources. Based on the expected delay, the source ordering is [S2 < S3 < S1].
To summarize, a (possibly) noisy source is more accurately characterized by both the expected value and the expected delay, based on its access cost distribution. The first value reflects the aggregate behavior for queries, and the second value reflects the deviation from the aggregate behavior. Since we only considered the positive excess, or the positive deviation, this deviation is a delay.
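The two quantities just defined translate directly into code. The following is a minimal sketch of that transcription (the function names and the two example distributions are mine, chosen only to show that a noisy source and a stable source can have similar EC but very different ED):

```python
# Expected cost (EC) and expected delay (ED) of a discrete cost distribution,
# as defined in Section 4.2. The example distributions are invented.

def expected_cost(dist):
    """dist is a list of (value, probability) pairs; EC = sum of value * prob."""
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    """ED = sum of max(value - EC, 0) * prob: only the positive excess counts."""
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

noisy  = [(10, 0.5), (20, 0.3), (90, 0.2)]   # hypothetical noisy source
stable = [(25, 0.5), (30, 0.5)]              # hypothetical stable source
for name, dist in (("noisy", noisy), ("stable", stable)):
    print(name, expected_cost(dist), expected_delay(dist))
```

With these made-up numbers the two sources have comparable expected costs (29 vs. 27.5), but the noisy source has a much larger expected delay (12.2 vs. 1.25), which is exactly the distinction the ED measure is meant to capture.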

Figure 6: Explanation of Expected Delay

Figure 7: Expected Cost versus Expected Delay: (a) multiple distributions; (b) expected cost vs. expected delay

Finally, we express the PG sensitive risk (benefit) of a cost distribution. Assume that the PG corresponds to response times of less than T units of time. Then the risk is Risk(distS, T) = Σ_k risk_k · prob_k, where risk_k is the positive excess of val_k compared to T. The expression for Benefit(distS, T) is similar. In section 6, we discuss how Risk(distS, T) or Benefit(distS, T) can be used to meet a specific performance guarantee.

4.3 Expected Cost Delay (ECD) Measure

As discussed in section 3, a cost distribution is described by both an expected value and the expected deviation from this value. The positive excess of this deviation can be considered to represent the expected delay. Consider the distribution of k values for the vector of parameters V. For each of the k discrete values V_k, with probability Prob(V_k), we can associate a value of delay DelV_k with the same probability. Then, given a plan p, the expected cost is EC(p) = Σ_k Cost-plan-p(V_k) · Prob(V_k). The expected delay for this plan, or ED(p), is expressed as ED(p) = Σ_k Delay-plan-p(DelV_k) · Prob(V_k). We will describe how Delay-plan-p(DelV_k) for a plan p is calculated from the cost formula for calculating the cost of the plan, Cost-plan-p(V_k).

We now consider the concept of Expected Cost Delay (ECD), which uses a cost factor cf and a delay factor df to combine the expected cost and expected delay of a plan into a single measure. The ECD measure for plan p is defined as follows: ECD(p) = cf · EC(p) + df · ED(p). Given a set of candidate plans, and some value(s) for cf and df, the optimizer chooses among the candidate plans for one that maximizes utility, i.e., it minimizes the ECD measure.

4.4 Calculating ECD for a Query Plan

We now describe an algorithm to calculate the values of EC(p) and ED(p) for a plan. To do so, we extend the cost formula used by the optimizer to use access cost distributions for accessing sources. We assume that the access cost is an end-to-end latency. The cost of a plan is estimated in a bottom-up manner, starting from the terminal nodes of the plan tree. A terminal node that is a scan on a mediator relation resident on a remote WebSource is associated with an access cost distribution. We use these distributions to calculate the cost distribution for each interior node of the plan tree. The cost formula combines the distributions of child nodes to obtain the distribution for the parent node.

Consider a three-way join query on relations R1, R2, and R3. Suppose that WebSources S1, S2, and S3 have been assigned to R1, R2, and R3, respectively. Figure 8 shows the plan tree and the cost distributions for each node. First, the scan on relation R1 (or R2 or R3) is assigned the cost distribution for the corresponding source. Next, we calculate the cost distribution for the join of R1 and R2. For simplicity, we assume that the join is implemented as a hash join, and we ignore the local (in memory) cost of the hash join implementation. Then, the cost of the join is the sum of the costs of the two scans on R1 and R2. We also assume that the cost distributions for S1 and S2 are independent, and we use this assumption to calculate the joint probability distribution.
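Under the assumptions just stated (a hash join whose cost is the sum of the two scan costs, with independent access cost distributions), the cost distribution of a join node is the convolution of its children's distributions. The sketch below builds that joint distribution and evaluates the ECD measure for given cf and df; it is an illustration of the calculation, not the optimizer's code, and apart from the (120 seconds, .2) and (25 seconds, .4) entries quoted in the example that follows, the input values are invented.

```python
# Sketch: joint cost distribution of a hash join under the independence
# assumption, plus the ECD utility cf*EC + df*ED. Except for the (120, 0.2)
# and (25, 0.4) entries mentioned in the text, all values below are invented.

def join_distribution(dist_a, dist_b):
    """Join cost = scan cost A + scan cost B; probabilities multiply."""
    return [(va + vb, pa * pb) for va, pa in dist_a for vb, pb in dist_b]

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

def ecd(dist, cf, df):
    return cf * expected_cost(dist) + df * expected_delay(dist)

s1 = [(120, 0.2), (70, 0.8)]   # scan cost distribution for R1 on S1 (seconds)
s2 = [(25, 0.4), (60, 0.6)]    # scan cost distribution for R2 on S2 (seconds)
joined = join_distribution(s1, s2)
print(joined)                  # includes (145, 0.08), as in the example below
print(ecd(joined, cf=0.5, df=0.5))
```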

For example, a scan cost of 120 seconds for R1 with probability .2 (from S1) and a scan cost of 25 seconds for R2 with probability .4 (from S2) will result in a join cost of 145 seconds with probability .08.

This calculation of cost distributions may result in a distribution that has values with a very low probability, or adjacent values that are very close in value. Such a fine granularity in the distribution is not very useful to the optimizer. In addition, it can significantly decrease the optimizer performance for multi-way join queries. To solve this problem, we use a special technique to merge values in the calculated distribution that do not meet some threshold. This is illustrated in Figure 8. Consider the initial cost distribution for the root node of the plan tree (a 3-way join); it is labeled D1. Values whose probability is below a probability threshold, in this case a value of .05, are merged to obtain the distribution labeled D2. Finally, adjacent values whose cost difference is less than a threshold, in this case 10 seconds, are also merged, to obtain the final distribution labeled D3. To illustrate, the first two instances of D1 have a probability of .04 (less than .05), and they are merged. The merged value is the (weighted) mean of the two initial values, and its probability is the sum of the initial probability values. Thus, distribution D2 has an instance whose value is 215 seconds and a probability of .08. Distribution D2 has several adjacent instances whose values do not meet the cost difference threshold of 10 seconds, e.g., 215 and 220 seconds, and 175 and 170 seconds. These values are merged as before to obtain the final distribution D3.

The above transformations to merge values in the distribution are especially useful when applied to cost distributions for the interior nodes. With appropriate values for the minimum probability and the minimum cost difference threshold, they allow the optimizer to avoid unnecessary computation. Using the final cost distribution D3 for the root node of the plan, we can calculate the expected cost EC(p) and the expected delay ED(p) for plan p. The expected cost EC(p) is the expected value of D3. Next, we calculate the expected delay ED(p) as described in section 3. We only consider those values of the cost distribution that are greater than the expected cost, since the delay only considers the positive excess. For example, a value of 218 seconds contributes its excess over EC(p) as a delay, with a probability of .14, towards ED(p), whereas the value of 155 seconds does not contribute to the expected delay. Finally, using the values for the cost factor cf and the delay factor df, we calculate the Expected Cost Delay measure for plan p: ECD(p) = cf · EC(p) + df · ED(p).
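The merging step can be sketched as two passes over the sorted distribution, one for the probability threshold and one for the cost-difference threshold. This is my own simplification of the technique described above (the optimizer may group low-probability values differently), and the example distribution is invented; its output would then feed the EC/ED/ECD computation sketched earlier.

```python
# Sketch of the distribution-coarsening step (D1 -> D2 -> D3): fold values
# below a probability threshold into a neighbor, then fold adjacent values
# whose cost difference is below a threshold. A simplification, not the
# optimizer's actual merge policy; the example distribution is invented.

def weighted_merge(a, b):
    (va, pa), (vb, pb) = a, b
    p = pa + pb
    return ((va * pa + vb * pb) / p, p)   # probability-weighted mean

def merge_distribution(dist, min_prob=0.05, min_diff=10.0):
    dist = sorted(dist)                    # (cost, probability), by cost
    # Pass 1: fold values below the probability threshold into their predecessor.
    coarse = []
    for item in dist:
        if coarse and item[1] < min_prob:
            coarse[-1] = weighted_merge(coarse[-1], item)
        else:
            coarse.append(item)
    # Pass 2: fold adjacent values whose cost difference is below min_diff.
    final = []
    for item in coarse:
        if final and item[0] - final[-1][0] < min_diff:
            final[-1] = weighted_merge(final[-1], item)
        else:
            final.append(item)
    return final

d1 = [(150, 0.30), (170, 0.28), (175, 0.30), (210, 0.04), (222, 0.08)]
print(merge_distribution(d1))
```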
4.5 Two Phase Approach to Generate Candidate Pre-Plans and Produce Plans that Maximize ECD Utility

Two issues that are important for LEC optimization to be practical and useful are described in [7], and these issues are relevant to the PG optimizer as well. The first issue is accurately determining the cost distributions for the parameters that affect the cost of a plan. In this paper, we only consider cost distributions for the remote access costs (end-to-end latency) of accessing mediator relations resident on WebSources.

The second issue is the extent of modification required to generate an appropriate set of candidate plans, so that the LEC optimizer is close to optimal. An LEC optimizer algorithm that generates b candidate plans, and invokes the optimizer once for each plan, is presented in [7]. They choose each of the b plans to represent some discrete value (or some range of continuous values) for the vector of parameters, and show that each candidate plan is close to optimal for that value or range of values.

Figure 8: Distributions associated with the query plan

The complexity of their (heuristic based) algorithm is shown to be much lower than the complexity of the least specific cost optimizer (say, System R style) for the same query. For the above algorithm to be close to optimal, the b values must be chosen to be representative of the distributions. Further, instead of only choosing the best plan for each of the b ranges, one may choose the top c plans for each of the b ranges. The choice of a representative set of b ranges for the cost distributions, and the choice of the top c candidate plans, becomes important when there are discontinuities in the cost formula. An example of a discontinuity is the cost formula for the nested loop join when the size of one of the relations is either larger or smaller than the amount of available memory.

We now describe our two phase approach to generate candidate plans for a PG sensitive optimizer. We first define a pre-plan and describe how we calculate its ECD. We then (informally) present our strategy to generate plans. The following hypothesis must hold for our approach to be effective: given a mediator query where N of the relations are remote, it is the choice of a particular WebSource and its access cost distribution, for each of these N relations, that has the greatest impact on the performance guarantee. Thus, for the PG sensitive optimizer, the choice of WebSource for each of the N relations is analogous to the choice of b representative values, or ranges of values, for the vector of parameters that affect the cost of the plan.

The choice of WebSource(s) and access cost distribution(s) for each of the remote relations corresponds to the information of a pre-plan. A pre-plan was defined in [27, 3]. A pre-plan associates a relevant WebSource, its query capability and access cost distribution, with each remote relation of the mediator query. A pre-plan also has information about dependencies between the query capabilities of WebSources. This dependency information is in a pre-plan dependency graph. A pre-plan characterizes a search space of query execution plans, or plans, each of which respects the pre-plan dependency graph [27]. We give an example of pre-plan(s) and we explain how we calculate the approximate ECD measure for the pre-plan. We then describe how the PG optimizer generates plans.

Generating Pre-Plans

Consider an N-way join query over five relations R1 ... R5. Suppose R2 is a remote relation and it is implemented on two (alternate) WebSources, S21 and S22, with access cost distributions D21 and D22, respectively. Further, suppose that the query capability of source S21 is such that it requires an input binding from the relation R3, whereas S22 does not have any required input bindings. Then, we obtain the following two pre-plans:

pre-plan 1: { {R1, R3, R4, R5}, {R2(S21, D21)} } ∪ { R3 → R2 }
pre-plan 2: { R1, R2(S22, D22), R3, R4, R5 } ∪ ∅

The first element of each pre-plan specifies a partitioning of R1 ... R5: two partitions for pre-plan 1 and one partition for pre-plan 2. The choice of source and cost distribution for the remote relation R2 is also specified. The second term specifies the dependencies; it is empty in pre-plan 2. To explain further, in the first pre-plan, the dependency (R3 → R2) partitions the relations R1 ... R5 into the two groups {R1, R3, R4, R5} and {R2}. Every plan obtained from this pre-plan must respect the dependency (R3 → R2), i.e., R3 must precede R2 in the join ordering of each plan. On the other hand, there is no dependency in the second pre-plan, and hence there are no restrictions on the join orderings of any plan obtained from the second pre-plan.

Next, we briefly describe how we calculate the cost of a pre-plan; details are in [27]. First, since we approximate the cost of the pre-plan, we ignore all local processing costs, e.g., scans or joins of locally resident mediator relations, and we only focus on the remote relations. Suppose that there are no dependencies in the pre-plan (and the plan) between the remote relations, as in the second pre-plan. The pre-plan cost is then the cost of accessing the remote relations, e.g., R2 from the WebSource, and the cost of any join operators involving R2; in this case, we assume that the cost of the join is calculated based on the hash join implementation. Suppose instead that there is exactly one dependency, e.g., (R3 → R2). This is implemented as a dependent join [6], and the cost is calculated using the nested loop join implementation. To correctly determine the cost of the nested loop join, where R2 is the operand on the right subtree, we have to calculate the (output) cardinality of the left subtree tuples (that include relation R3) that join with R2. We assume that the optimizer will choose a good join ordering of the other relations to reduce (minimize) this cardinality. To do so, the optimizer will need to partition the other relations and determine which subset must occur in the left subtree with R3, in a good (optimal) plan. This subset (of selections and joins) will act as a filter and reduce the cardinality of this subtree. The case where there is more than one dependent join in the pre-plan is much more complex, since finding a good (optimal) partitioning of the subgoals is more expensive. Details are in [27].
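As a rough paraphrase of the two simplest costing cases above (no dependency, costed with a hash join; a single dependency, costed as a dependent nested loop join), the sketch below approximates a pre-plan's remote cost. The distributions, the cardinality estimate, and the use of the expected cost alone are my own simplifications; the heuristic in [27] works with the full distributions and the ECD measure.

```python
# Rough paraphrase of the pre-plan costing heuristic for the two simple cases
# described in the text. Distributions and the cardinality estimate are
# invented; the real heuristic in [27] is distribution-based, not EC-based.

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def preplan_cost_no_dependency(remote_dist):
    # Hash join case: one probe of the WebSource; local join cost is ignored.
    return expected_cost(remote_dist)

def preplan_cost_one_dependency(remote_dist, outer_cardinality):
    # Dependent (nested loop) join: roughly one remote probe per binding
    # produced by the left subtree; outer_cardinality is an assumed estimate.
    return outer_cardinality * expected_cost(remote_dist)

d22 = [(8.0, 0.7), (20.0, 0.3)]   # access cost distribution, no binding needed
d21 = [(2.0, 0.9), (5.0, 0.1)]    # access cost distribution, binding from R3
print(preplan_cost_no_dependency(d22))        # approximate cost of pre-plan 2
print(preplan_cost_one_dependency(d21, 40))   # pre-plan 1, ~40 bindings assumed
```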

Generating Plans

We can now (informally) describe our strategy to generate an appropriate candidate set of pre-plans and plans for the PG sensitive optimizer. In the first phase, for values of cf and df varying in the range [0.0, 1.0], we utilize a remote cost pre-optimizer to generate the best pre-plan. We note that for the results reported in this paper, we display the optimizer choices for values of cf and df adding up to 1.0, i.e., the value of cf = (1.0 - df). These values were found to work well to illustrate the optimizer strategy. The pre-optimizer chooses a set of candidate WebSources and their cost distributions, and calculates the (approximate) ECD value, for the particular values of cf and df, using the method just described. The pre-optimizer can either exhaustively consider the combinations of WebSources, or it can use a randomized search strategy that is described in [28]. The granularity of the values of cf and df is varied by the pre-optimizer. For each pair of values (cf, df), it will choose either a single pre-plan or the top c pre-plans, based on maximizing the ECD utility, i.e., the minimum ECD value.

In the second phase, for each of the pre-plans, and the corresponding value of (cf, df), a randomized relational optimizer generates plans that respect the dependencies of the pre-plan. This task of generating plans that respect the pre-plan is described in [27, 3]. The optimizer can choose either a single plan, or the top c plans, based on maximizing the ECD utility, i.e., the minimum ECD value. Finally, depending on the desired optimizer strategy, i.e., optimistic or conservative, the optimizer would choose a pair of values (cf, df) that represents that strategy, and the corresponding winning pre-plan and plan. This would be a single winner or the top c winners, as described above.

5 Behavior of the Performance Guarantee (PG) Sensitive Optimizer

5.1 Simulation Environment

Figure 2 presents the components of our architecture and simulation environment. The mediator extends the Predator ORDBMS [23], which is implemented in C++. The pre-optimizer to generate pre-plans was implemented in Quintus Prolog and integrated into the Predator mediator. The Predator evaluation engine was extended with an external scan operator and a dependent join operator [6, 1] to implement the (limited) access to WebSources, and the mediator Catalog and cost model were appropriately extended. Synthetic access cost (Gaussian) distributions were generated using the Java Open Source Libraries for High Performance Scientific and Technical Computing [11]. These access cost distributions are labeled using the values of µ and σ for the distribution. In addition, experimental data of access to real WebSources, e.g., EPA [8], were gathered to develop a cost model for WebSources [31]. These traces were also used in our simulations and are labeled by their name. The Wrapper cost model provides relevant statistics and access costs for the WebSource. For the simulation, the Predator mediator was executed on a Sun Ultra SPARC 1 with 64MB of memory, running Solaris 2.6. While validating access to WebSources, this machine was connected via a 1 Mbps Ethernet cable to the domain umiacs.umd.edu. This domain is connected to its ISP via a 27 Mbps DS3 line.

The mediator queries were complex N-way join queries, and we used up to 12 queries in the simulation. They were generated to reflect the complex decision support queries and statistics of the TPC-D Benchmark. For each query, up to 5 relations were identified as relations resident on remote WebSources. For each such relation, we identified up to 3 alternate WebSources, with their dependencies and access cost distributions. All costs are end-to-end delays for downloading data. The mediator statistics were modified so as to vary the percentage of remote processing cost versus the total cost of the queries. Unless otherwise noted, we report on the aggregate behavior of the PG optimizer over all the queries.

5.2 Expected Cost Delay Trade-Off for Pre-Plans and Plans

We report on the Expected Cost Delay (ECD) utility measure and the utility trade-off of the optimizer, as we varied sources and queries. In these experiments, we used synthetic access cost distributions. We typically report on the value of the ECD measure; the optimizer will maximize utility by minimizing the ECD measure.

In the first experiment (see Figure 9), we picked one relation to be a remote relation. There were 4 alternate WebSources for this relation; each WebSource had a cost distribution with a different value for the expected cost EC and the expected delay ED. One of the sources was an ideal source, i.e., its value of ED was close to 0. The EC values for the other sources were comparable to that of the ideal source, and the values of ED varied.

Figure 9: Aggregate over Multiple Queries; Relative ECD; One Remote Relation; Varying Delay Factor

Figure 9a reports on the quantile plots for the response times of the query plans, one for each of the sources. To construct the quantile plots, we used the synthetic distributions to generate response times for the access to the WebSource. The quantile plot reports on the end-to-end delay or response time for the query trials. We report on the aggregate behavior over the query mix. Figure 9b compares the relative ECD for each plan/source, compared to the ECD of the plan on the ideal source, as we vary the delay factor df. Figure 9a illustrates that each of the plans has some benefit compared to the ideal plan. Correspondingly, each of the plans also has some risk. We compare the quantile plot for the plan labeled > in Figure 9a, and its relative ECD, compared to the plan on the ideal source, in Figure 9b. This plan incurs the greatest risk compared to the plan on the ideal source, as seen in Figure 9a. Correspondingly, as we increase the delay factor df, we see that the ECD for this plan, compared to the plan on the ideal source, also increases.
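To see why the relative ECD of a risky plan grows with the delay factor, the toy computation below compares a noisy source against a near-ideal one over a range of df values (with cf = 1 - df, as described above). The two distributions are synthetic stand-ins chosen by me, not the distributions used in this experiment.

```python
# Toy illustration of the trend in Figure 9b: as the delay factor df grows,
# a plan on a noisy source falls further behind a plan on a near-ideal source.
# Both distributions below are synthetic stand-ins, not the experiment's data.

def expected_cost(dist):
    return sum(v * p for v, p in dist)

def expected_delay(dist):
    ec = expected_cost(dist)
    return sum(max(v - ec, 0.0) * p for v, p in dist)

def ecd(dist, df):
    cf = 1.0 - df                      # cf and df sum to 1, as in the paper
    return cf * expected_cost(dist) + df * expected_delay(dist)

ideal = [(50.0, 1.0)]                                # ED is essentially 0
noisy = [(30.0, 0.6), (60.0, 0.2), (150.0, 0.2)]     # comparable EC, large ED
for df in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(df, ecd(noisy, df) - ecd(ideal, df))       # gap widens with df
```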

Figure 10: Relative ECD with and without dependencies

In the next experiment (see Figure 10), we picked two remote relations. Each remote relation is associated with three alternate WebSources, and their response time distributions. In addition, we varied the query capability implemented by the WebSources. We chose two query capabilities, one of which introduced a dependency between the remote relation and a mediator relation. This had an impact on the query plan, i.e., the query capability with a dependency introduced a dependent join operator in the plan. The plans with and without the dependent join operator had different ranges of ECD values. As before, the ideal source(s) had a value of ED close to 0. We chose two ideal sources, each of which introduces a dependency, and two ideal sources without a dependency in their query capability. We compared the plans with noisy sources to the plan with two ideal sources. Figure 10a reports on the relative ECD for the noisy plans, compared to the ideal plan, when all sources had a dependency in their query capability. Figure 10b reports on the relative ECD for the noisy plans, compared to the ideal plan, where none of the sources had a dependency. Both graphs report on the aggregate behavior over five queries.

Figure 10b, on the right, had no dependencies. As the value of df increases, the noisy plans with higher values of ED exhibit increasing values of ECD. This result is similar to our previous experiment with a single remote relation. However, when we examine Figure 10a, on the left, which has dependencies, we note that as the value of df increases, the ECD for the noisy plans with higher values of ED increases much more rapidly, in comparison to the plans with lower values of ED. Thus, we see a grouping of source behavior, based on their high or low values of ED, as we increase the delay factor df. This illustrates that when there are dependencies in the query capability, the choice of WebSource and cost distribution, in particular the value of ED, has a greater impact on the optimizer's performance guarantee.

In a final experiment, we studied the scalability of the pre-optimizer and optimizer. We chose up to 5 of the relations to be remote. For each of the remote relations, there were up to 3 remote WebSources. Thus, the space of pre-plans, where each pre-plan corresponds to a choice of source for each of the 5 remote relations, is large. We report on our first set of results, when only a limited number of pre-plans were explored, in the top two graphs, and we report on the results where all the pre-plans were exhaustively explored, in the bottom two graphs of Figure 11.

Figure 11: Scalability of the pre-optimizer and optimizer. The four panels plot Expected Delay (msec) versus Expected Cost (msec) for 5 remote relations x 3 remote sources, for pre-plans and plans, under limited search (top) and exhaustive search (bottom), with DF = 0, 0.25, 0.5, and 1.

To simplify the presentation, we only present the results for one complex N-way join query. Figure 11 identifies the values of EC and ED, for the winning pre-plan and plan, for varying values of the delay factor df. We first discuss the two lower graphs, where the search was exhaustive. We expect that for lower values of df, when the optimizer is optimistic, the winning pre-plan and plan will have lower values for EC. The winning pre-plans and plans for df = 0.0, 0.25, and 0.5 are marked in the figure; for this particular query, the same pre-plan and plan are winners for these df values. We see that the value of ED is comparatively large, since the optimizer is optimistic and ignores ED. On the other hand, when the optimizer is conservative, i.e., with higher values for df, the winning pre-plan and plan may have larger values for EC, but the values for ED should be low. The winning pre-plan and plan for a value of df = 1.0 indeed reflect this behavior.

Finally, we compare the winning pre-plans and plans when the search is not exhaustive (top two graphs) with the exhaustive search. We observe some anomalies in their behavior. For example, we expect that the value of ED for the winning pre-plan should decrease for df = 0.25 or 0.5, compared to the value of ED for df = 0.0. However, the top left graph shows us that this value of ED actually increases. This anomaly is due to limitations of our current implementation of the randomized pre-optimizer. This experiment illustrates that the PG sensitive optimizer's behavior is consistent and scalable up to a large search space of pre-plans and plans.

In the next set of experiments, we explore the optimizer behavior on cost distributions obtained from query traces of real WebSources. The sources are EPA [8] and Aircraft [9]. Figure 3 from section 2 plots the actual response times for these sources. EPA had a low level of noise, whereas Aircraft was a noisy source. The optimizer used the EPA and Aircraft distributions to generate query plans P1 and P2, and to calculate cost distributions for the plans P1 and P2. Figure 12a represents quantile plots of response time for P1 and P2 that were generated from those distributions. We observe that there is a significant risk and a very small benefit associated with the choice of P2 instead of P1. ECD values for P1 and P2 are shown in Figure 12b for different values of the delay factor df. The plan P1, based on the less noisy EPA, is the winner for all ranges of df, i.e., for all ranges of the optimizer strategies. This is consistent with the significant risk and small benefit associated with the choice of P2. Figure 12b shows the relative difference in ECD of P1 and P2. When the optimizer is more optimistic, the relative ECD has a lower value. It increases slightly with increasing df, and this reflects that when the optimizer is more conservative, it places more emphasis on the risk of choosing P2. Thus, the behavior of the optimizer is consistent with respect to the amount of risk and benefit in choosing query plans on real WebSources.

To summarize, we have verified that the optimizer behavior was both predictable and scalable, with respect to the following: alternate sources for a remote relation with varying values of EC and ED; alternate sources with and without dependencies; aggregate behavior over several queries; synthetic and experimental cost distributions; and a large search space of pre-plans and plans.


Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

V Conclusions. V.1 Related work

V Conclusions. V.1 Related work V Conclusions V.1 Related work Even though MapReduce appears to be constructed specifically for performing group-by aggregations, there are also many interesting research work being done on studying critical

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring

New Directions in Traffic Measurement and Accounting. Need for traffic measurement. Relation to stream databases. Internet backbone monitoring New Directions in Traffic Measurement and Accounting C. Estan and G. Varghese Presented by Aaditeshwar Seth 1 Need for traffic measurement Internet backbone monitoring Short term Detect DoS attacks Long

More information

Mining Temporal Indirect Associations

Mining Temporal Indirect Associations Mining Temporal Indirect Associations Ling Chen 1,2, Sourav S. Bhowmick 1, Jinyan Li 2 1 School of Computer Engineering, Nanyang Technological University, Singapore, 639798 2 Institute for Infocomm Research,

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

IV Statistical Modelling of MapReduce Joins

IV Statistical Modelling of MapReduce Joins IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of

More information

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1)

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 71 CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 4.1 INTRODUCTION One of the prime research objectives of this thesis is to optimize

More information

Efficient Remote Data Access in a Mobile Computing Environment

Efficient Remote Data Access in a Mobile Computing Environment This paper appears in the ICPP 2000 Workshop on Pervasive Computing Efficient Remote Data Access in a Mobile Computing Environment Laura Bright Louiqa Raschid University of Maryland College Park, MD 20742

More information

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution

Announcements (March 1) Query Processing: A Systems View. Physical (execution) plan. Announcements (March 3) Physical plan execution Announcements (March 1) 2 Query Processing: A Systems View CPS 216 Advanced Database Systems Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments

Administrivia. CS 133: Databases. Cost-based Query Sub-System. Goals for Today. Midterm on Thursday 10/18. Assignments Administrivia Midterm on Thursday 10/18 CS 133: Databases Fall 2018 Lec 12 10/16 Prof. Beth Trushkowsky Assignments Lab 3 starts after fall break No problem set out this week Goals for Today Cost-based

More information

An Adaptive Query Execution Engine for Data Integration

An Adaptive Query Execution Engine for Data Integration An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington Presented by Peng Li@CS.UBC 1 Outline The Background

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Estimating the Quality of Databases

Estimating the Quality of Databases Estimating the Quality of Databases Ami Motro Igor Rakov George Mason University May 1998 1 Outline: 1. Introduction 2. Simple quality estimation 3. Refined quality estimation 4. Computing the quality

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building

More information

A Search Theoretical Approach to P2P Networks: Analysis of Learning

A Search Theoretical Approach to P2P Networks: Analysis of Learning A Search Theoretical Approach to P2P Networks: Analysis of Learning Nazif Cihan Taş Dept. of Computer Science University of Maryland College Park, MD 2742 Email: ctas@cs.umd.edu Bedri Kâmil Onur Taş Dept.

More information

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross

Track Join. Distributed Joins with Minimal Network Traffic. Orestis Polychroniou! Rajkumar Sen! Kenneth A. Ross Track Join Distributed Joins with Minimal Network Traffic Orestis Polychroniou Rajkumar Sen Kenneth A. Ross Local Joins Algorithms Hash Join Sort Merge Join Index Join Nested Loop Join Spilling to disk

More information

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems

Query Processing: A Systems View. Announcements (March 1) Physical (execution) plan. CPS 216 Advanced Database Systems Query Processing: A Systems View CPS 216 Advanced Database Systems Announcements (March 1) 2 Reading assignment due Wednesday Buffer management Homework #2 due this Thursday Course project proposal due

More information

Parser: SQL parse tree

Parser: SQL parse tree Jinze Liu Parser: SQL parse tree Good old lex & yacc Detect and reject syntax errors Validator: parse tree logical plan Detect and reject semantic errors Nonexistent tables/views/columns? Insufficient

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Ch. 7: Benchmarks and Performance Tests

Ch. 7: Benchmarks and Performance Tests Ch. 7: Benchmarks and Performance Tests Kenneth Mitchell School of Computing & Engineering, University of Missouri-Kansas City, Kansas City, MO 64110 Kenneth Mitchell, CS & EE dept., SCE, UMKC p. 1/3 Introduction

More information

Improving PostgreSQL Cost Model

Improving PostgreSQL Cost Model Improving PostgreSQL Cost Model A PROJECT REPORT SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Engineering IN Faculty of Engineering BY Pankhuri Computer Science and Automation

More information

Evaluation of Performance of Cooperative Web Caching with Web Polygraph

Evaluation of Performance of Cooperative Web Caching with Web Polygraph Evaluation of Performance of Cooperative Web Caching with Web Polygraph Ping Du Jaspal Subhlok Department of Computer Science University of Houston Houston, TX 77204 {pdu, jaspal}@uh.edu Abstract This

More information

Inital Starting Point Analysis for K-Means Clustering: A Case Study

Inital Starting Point Analysis for K-Means Clustering: A Case Study lemson University TigerPrints Publications School of omputing 3-26 Inital Starting Point Analysis for K-Means lustering: A ase Study Amy Apon lemson University, aapon@clemson.edu Frank Robinson Vanderbilt

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Query processing and optimization

Query processing and optimization Query processing and optimization These slides are a modified version of the slides of the book Database System Concepts (Chapter 13 and 14), 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan.

More information

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper

Profile of CopperEye Indexing Technology. A CopperEye Technical White Paper Profile of CopperEye Indexing Technology A CopperEye Technical White Paper September 2004 Introduction CopperEye s has developed a new general-purpose data indexing technology that out-performs conventional

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data

PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II

MASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 12, Part A Database Management Systems, R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset

More information

Chapter 11: Query Optimization

Chapter 11: Query Optimization Chapter 11: Query Optimization Chapter 11: Query Optimization Introduction Transformation of Relational Expressions Statistical Information for Cost Estimation Cost-based optimization Dynamic Programming

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Introduction Alternative ways of evaluating a given query using

Introduction Alternative ways of evaluating a given query using Query Optimization Introduction Catalog Information for Cost Estimation Estimation of Statistics Transformation of Relational Expressions Dynamic Programming for Choosing Evaluation Plans Introduction

More information

Search and Optimization

Search and Optimization Search and Optimization Search, Optimization and Game-Playing The goal is to find one or more optimal or sub-optimal solutions in a given search space. We can either be interested in finding any one solution

More information

HYRISE In-Memory Storage Engine

HYRISE In-Memory Storage Engine HYRISE In-Memory Storage Engine Martin Grund 1, Jens Krueger 1, Philippe Cudre-Mauroux 3, Samuel Madden 2 Alexander Zeier 1, Hasso Plattner 1 1 Hasso-Plattner-Institute, Germany 2 MIT CSAIL, USA 3 University

More information

ATYPICAL RELATIONAL QUERY OPTIMIZER

ATYPICAL RELATIONAL QUERY OPTIMIZER 14 ATYPICAL RELATIONAL QUERY OPTIMIZER Life is what happens while you re busy making other plans. John Lennon In this chapter, we present a typical relational query optimizer in detail. We begin by discussing

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Avoiding Sorting and Grouping In Processing Queries

Avoiding Sorting and Grouping In Processing Queries Avoiding Sorting and Grouping In Processing Queries Outline Motivation Simple Example Order Properties Grouping followed by ordering Order Property Optimization Performance Results Conclusion Motivation

More information

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1)

Algorithms for Query Processing and Optimization. 0. Introduction to Query Processing (1) Chapter 19 Algorithms for Query Processing and Optimization 0. Introduction to Query Processing (1) Query optimization: The process of choosing a suitable execution strategy for processing a query. Two

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14 Query Optimization Chapter 14: Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14 Query Optimization Chapter 14: Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Chapter 14 Query Optimization

Chapter 14 Query Optimization Chapter 14: Query Optimization Chapter 14 Query Optimization! Introduction! Catalog Information for Cost Estimation! Estimation of Statistics! Transformation of Relational Expressions! Dynamic Programming

More information

Supplementary Figure 1. Decoding results broken down for different ROIs

Supplementary Figure 1. Decoding results broken down for different ROIs Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas

More information

Tradeoffs in Processing Multi-Way Join Queries via Hashing in Multiprocessor Database Machines

Tradeoffs in Processing Multi-Way Join Queries via Hashing in Multiprocessor Database Machines Tradeoffs in Processing Multi-Way Queries via Hashing in Multiprocessor Database Machines Donovan A. Schneider David J. DeWitt Computer Sciences Department University of Wisconsin This research was partially

More information

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Copyright 2018, Oracle and/or its affiliates. All rights reserved. Beyond SQL Tuning: Insider's Guide to Maximizing SQL Performance Monday, Oct 22 10:30 a.m. - 11:15 a.m. Marriott Marquis (Golden Gate Level) - Golden Gate A Ashish Agrawal Group Product Manager Oracle

More information

Chapter 3. Algorithms for Query Processing and Optimization

Chapter 3. Algorithms for Query Processing and Optimization Chapter 3 Algorithms for Query Processing and Optimization Chapter Outline 1. Introduction to Query Processing 2. Translating SQL Queries into Relational Algebra 3. Algorithms for External Sorting 4. Algorithms

More information

Revisiting Pipelined Parallelism in Multi-Join Query Processing

Revisiting Pipelined Parallelism in Multi-Join Query Processing Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu Elke A. Rundensteiner Department of Computer Science, Worcester Polytechnic Institute Worcester, MA 01609-2280 (binliu rundenst)@cs.wpi.edu

More information

Part III. Issues in Search Computing

Part III. Issues in Search Computing Part III Issues in Search Computing Introduction to Part III: Search Computing in a Nutshell Prior to delving into chapters discussing search computing in greater detail, we give a bird s eye view of its

More information

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu FMA901F: Machine Learning Lecture 6: Graphical Models Cristian Sminchisescu Graphical Models Provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate

More information

Efficient in-network evaluation of multiple queries

Efficient in-network evaluation of multiple queries Efficient in-network evaluation of multiple queries Vinayaka Pandit 1 and Hui-bo Ji 2 1 IBM India Research Laboratory, email:pvinayak@in.ibm.com 2 Australian National University, email: hui-bo.ji@anu.edu.au

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 11/15/12 Agenda Check-in Centralized and Client-Server Models Parallelism Distributed Databases Homework 6 Check-in

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing

More information

Question 1: What is a code walk-through, and how is it performed?

Question 1: What is a code walk-through, and how is it performed? Question 1: What is a code walk-through, and how is it performed? Response: Code walk-throughs have traditionally been viewed as informal evaluations of code, but more attention is being given to this

More information

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS 6.1 Introduction Gradient-based algorithms have some weaknesses relative to engineering optimization. Specifically, it is difficult to use gradient-based algorithms

More information

Chapter 13: Query Optimization. Chapter 13: Query Optimization

Chapter 13: Query Optimization. Chapter 13: Query Optimization Chapter 13: Query Optimization Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 13: Query Optimization Introduction Equivalent Relational Algebra Expressions Statistical

More information

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments

WHITE PAPER Application Performance Management. The Case for Adaptive Instrumentation in J2EE Environments WHITE PAPER Application Performance Management The Case for Adaptive Instrumentation in J2EE Environments Why Adaptive Instrumentation?... 3 Discovering Performance Problems... 3 The adaptive approach...

More information

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation.

DBMS Y3/S5. 1. OVERVIEW The steps involved in processing a query are: 1. Parsing and translation. 2. Optimization. 3. Evaluation. Query Processing QUERY PROCESSING refers to the range of activities involved in extracting data from a database. The activities include translation of queries in high-level database languages into expressions

More information

New Bucket Join Algorithm for Faster Join Query Results

New Bucket Join Algorithm for Faster Join Query Results The International Arab Journal of Information Technology, Vol. 12, No. 6A, 2015 701 New Bucket Algorithm for Faster Query Results Hemalatha Gunasekaran 1 and ThanushkodiKeppana Gowder 2 1 Department Of

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

DBMS Query evaluation

DBMS Query evaluation Data Management for Data Science DBMS Maurizio Lenzerini, Riccardo Rosati Corso di laurea magistrale in Data Science Sapienza Università di Roma Academic Year 2016/2017 http://www.dis.uniroma1.it/~rosati/dmds/

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Chapter 14: Query Optimization

Chapter 14: Query Optimization Chapter 14: Query Optimization Database System Concepts 5 th Ed. See www.db-book.com for conditions on re-use Chapter 14: Query Optimization Introduction Transformation of Relational Expressions Catalog

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo

Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo Coding for the Network: Scalable and Multiple description coding Marco Cagnazzo Overview Examples and motivations Scalable coding for network transmission Techniques for multiple description coding 2 27/05/2013

More information

GSLPI: a Cost-based Query Progress Indicator

GSLPI: a Cost-based Query Progress Indicator GSLPI: a Cost-based Query Progress Indicator Jiexing Li #1, Rimma V. Nehme 2, Jeffrey Naughton #3 # Computer Sciences, University of Wisconsin Madison 1 jxli@cs.wisc.edu 3 naughton@cs.wisc.edu Microsoft

More information

An Evaluation of Alternative Designs for a Grid Information Service

An Evaluation of Alternative Designs for a Grid Information Service An Evaluation of Alternative Designs for a Grid Information Service Warren Smith, Abdul Waheed *, David Meyers, Jerry Yan Computer Sciences Corporation * MRJ Technology Solutions Directory Research L.L.C.

More information

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management

Outline. q Database integration & querying. q Peer-to-Peer data management q Stream data management q MapReduce-based distributed data management Outline n Introduction & architectural issues n Data distribution n Distributed query processing n Distributed query optimization n Distributed transactions & concurrency control n Distributed reliability

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information