Exploring the Semantics of Plan Diagrams

Size: px

Start display at page:

Download "Exploring the Semantics of Plan Diagrams"

Donna Garrison
6 years ago
Views:

1 Exploring the Semantics of Plan Diagrams A Thesis Submitted for the Degree of Master of Science (Engineering) in the Faculty of Engineering By Bruhathi HS Supercomputer Education and Research Centre INDIAN INSTITUTE OF SCIENCE BANGALORE , INDIA September 2011

2 Abstract Database Management Systems (DBMS) are software packages which are popularly used for storing and managing data. They store data in the form of tables called relations and the information present in these database relations are queried using Structured Query Language (SQL), which is declarative in nature. Given a database, the system configuration on which the database is running and a query on the database, the number of possible strategies (plans) by which the query can be executed is exponential. With the cost of plans being measured in terms of query response time, the difference in cost between the best execution plan and a randomly chosen one can be very high. Hence database systems make use of a module called query optimizer, which explores the exponential space of plans using dynamic programming techniques to find the best execution strategy for the query. This query optimizer module is especially important for supporting complex queries operating on varied datasets. Since the efficient execution of a given query depends on the optimizer s plan choice which is in turn primarily a function of the selectivities of the base relations participating in the query it is important to analyse the behavior of optimizers with the change in selectivity of the query. This analysis is aided by the production of a plan diagram that is a color-coded pictorial enumeration of the execution plan choices of a database query optimizer over this relational selectivity space. Our contributions in this thesis are three-fold, as outlined below - 1. In recent years, XML document format, which can be flexibly structured, has been extensively used to store data and has found its way into main-stream database technology. Database systems are now developed to store XML natively, to create indexes and statistics on XML data and also to query these XML documents. This has led to the dei

3 ABSTRACT ii velopment of optimizers for XML query processing. With monadic second-order logic (MSO) being the foundation for XML queries, as opposed to first order logic being the basis for relational queries, XML queries have more expressive power than their relational counterparts. The flexibility of XML along with the new features of XML queries brings new challenges for the XML query optimizer. In this thesis we have analysed one such optimizer, which is also a hybrid optimizer (optimizes both SQL and XML queries in a unified framework), through the production of XML plan diagrams. We have obtained plan diagrams for a variety of XML database benchmarks - TPoX, XBench and TPCH X (XML version of TPC-H benchamrk) using a set of XQuery templates. We observed that plan diagrams often feature complicated patterns even when the number of plans are very few and also that the optimizers selectivity estimation was in error when attributes had values in the negative range. 2. The plan diagrams which are currently produced have a hard-coded coloring scheme with the biggest plan (plan with the largest area) being assigned red, the second biggest being given blue, and so on. With this current coloring scheme, we will need to drill down into the plans in order to see the difference between them. In this thesis, we take a radically different approach to color the plans based on their semantic differences with other plans i.e., we try to color each plan such that the difference between the colors of any pair of plans reflects the differences in their operator trees. With this approach, the plan diagram will itself be a first-cut reflection of the differences between any pair of plans. We first discuss the plan-tree differencing strategy i.e., how to assign quantitative differences between every pair of plans based on their operator tree structure, which is also the input to our coloring problem. Our output should be colors, which makes the output space a color space. With RGB (Red, Green, Blue) being the color space chosen, our output space is three-dimensional. The challenge now is to take these distances between pairs of plans and assign equivalent color distances. Since there is no restriction on the dimensionality of the input space, scaling it down to a 3D space is inherently lossy. Also, this makes it infeasible to have an exact mapping of the plan distances in color space. So, we aim to

4 ABSTRACT iii optimize the function m 1 m v i =1 v j =v i +1 (P landist(v i, v j ) ColorDist(v i, v j )) 2 where P landist(v i, v j ) is the distance between the plans v i and v j and ColorDist(v i, v j ) is the distance assigned to the plans v i and v j in color space. We adopt Kruskal s Steepest Descent algorithm which is a well known Multidimensional technique as a solution to our optimization problem. This coloring strategy has been tested by coloring plan diagrams obtained for representative TPC-H based query templates on a suite of three industrialstrength database engines. The results show that the fidelity of this technique is greater than 70% for more than 80% of the plan distances. 3. The join order of an execution plan tree of a query is the order in which its relations are joined. Finding the optimal join order is one of the most crucial tasks in both SQL and XQuery optimization, with any savings in this front during optimization being extremely beneficial. This makes the study of join orders interesting and useful. The final contribution in this thesis is regarding the analysis on the number of join orders present in plan diagrams. Using plan diagrams obtained on a representative set of TPC-H query templates on a suite three industrial-strength engines, we have provided statistics on the number of join orders present in each of them. We have also gathered intra-engine statistics on the number of join orders that were similar across database engines, for a particular query template. We found that the number of join orders are quite substantial in plan diagrams (20-50) although they decreased when relations of relatively smaller sizes were not considered as part of the join order. With respect to common join orders across engines, we found that it was surprisingly very sparse (0-4). All the three features have been incorporated in the publicly-available Picasso optimizer visualization tool.

5 Contents Abstract List of Figures List of Tables i vii x 1 Introduction Overview of Picasso XML data Coloring of Plan Diagrams Roll-up of Plan Diagrams Contributions Organization Survey of Related Research Challenges of SQL Query Optimization Strategies for Plan Selection Refinements of Plan Choices at Run-time Industrial Strength SQL Optimizers XML query processing Multidimensional Scaling Join Orders iv

6 CONTENTS v 3 Analysis of Hybrid optimizers XML Statistics Collection XQuery Template Parsing XML Selectivity Computation Relational Selectivities XML Selectivities Computation Methods Selectivity Estimation Errors Plan Parsing Experimental Results Experimental Setup TPoX XBench TPCH X Selectivity Computation Problems Plan Coloring based on Difference Formal Definition Plan Distance Metrics Multi-Dimensional Scaling Kruksal s Steepest Descent Modelling our coloring problem as an MDS problem Steepest Descent adopted to our coloring problem Experimental Results and Analysis TPCH QT TPCH QT TPCH QT XML plan diagrams

7 CONTENTS vi 5 Join Orders Enumeration of Join Orders in Plan Diagrams Join Orders in XML plan diagrams Inter Engine Join Orders Conclusions Future Work Bibliography 88 A SQL templates 94 B XQuery templates 99 C Sample TPCH X documents 103

8 List of Figures 1.1 Sample SQL Query Sample SQL Query Execution Plan Example SQL Query Template (QT8) Picasso Diagrams for QT8 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: LINEITEM L EXTENDEDPRICE) Example XQuery Example XQuery plan tree Example Plan Tree for Join Order Picasso tables for storing db2cat information Example XML node and Statistics Example Picasso XQuery Template Xquery Template Constraints Predicate Selectivity Computation Example Plan Tree Database Schema for TPCH X XQuery Template for TPoX (QTX SEC) Picasso Diagrams for TPoX QTX SEC (X-Axis: CUSTACC /Customer/Accounts/Account/Balance/WorkingBalance, Y-Axis: ORDER /FIXML/Order/OrdQty/@Cash) Plan trees (QTX SEC) Selectivity Logs (QTX SEC) Picasso Diagrams for XBench QTX19 (X-Axis: ORDER /order/total, Y-Axis: /customers/customer/address id) vii

9 CONTENTS viii 3.13 XQuery template for XBench (QTX19) Plan Trees (QTX19) Selectivity Logs (QTX19) Picasso Query Templates Picasso Diagrams for TPCH X QTX5 (X-Axis: SUPPLIERS /Suppliers/Supplier/AcctBal, Y- Axis: CUSTOMERS /Customers/Customer/AcctBal Annotated Plan trees Partial (QTX5) Selectivity Logs (QTX5) XQuery template for TPCH X (QTX2) Selectivity Log Errors XQuery Templates for Selectivity error Trees for Jaccard metric Output Color Space Placement of Plans Picasso Diagrams for QT2 (X-Axis: PART P RETAILPRICE, Y-Axis: PARTSUPP PS SUPPLYCOST) Jaccard distances in QT Picasso Diagrams for QT8 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: LINEITEM L EXTENDEDPRICE) Jaccard distances in QT Picasso Diagrams for QT9 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: PARTSUPP PS SUPPLYCOST) Jaccard distances in QT Picasso Diagrams for XML benchmarks Jaccard distances in XML plan diagrams A.1 QT A.2 QT A.3 QT A.4 QT B.1 XQuery Template for TPoX (QTX SEC) B.2 XQuery Template for XBench (QTX19)

10 CONTENTS ix B.3 XQuery Template for TPCH X C.1 Sample XML Documents for TPCH X Schema

11 List of Tables 3.1 Document Count for TPCH X Statistics for QT2 with weights Statistics for QT8 with weights Statistics for QT9 with weights Statistics for XML QTs with weights Join Order Statistics for QT Join Order Statistics for QT Join Order Statistics for QT Join Order Statistics for XML plan diagrams Inter Engine Join Order Statistics Inter Engine Join Order Statistics w/o Nation and Region Inter Engine Join Order Statistics w/o Nation, Region and Supplier x

12 Chapter 1 Introduction To identify the most efficient strategy to execute SQL queries, database systems use query optimizers. The efficiency of these strategies, called plans, is usually measured in terms of query response time. An example query execution plan for the SQL query listed in Figure 1.1, is shown in Figure 1.2. This SQL query lists the mode of shipping for all items whose quantity is less than or equal to 45, and forms part of an order of price 45 or less. The query execution plan for this query uses the Hash Join algorithm to join the relations ORDERS and LINEITEM, after sequentially scanning both of them. select l shipmode from orders, lineitem where o orderkey = l orderkey and o totalprice 45 and l quantity 45 group by l shipmode order by l shipmode Figure 1.1: Sample SQL Query The difference in cost between the best execution plan and a randomly chosen one can be very high because of which optimization is a must. The importance of query optimizers has elevated in recent times due to the high degree of complexity in queries, operating on large data 1

13 CHAPTER 1. INTRODUCTION 2 Figure 1.2: Sample SQL Query Execution Plan stores, typically found in data warehousing applications. This is evident in benchmarks like TPC-H [49]. 1.1 Overview of Picasso Picasso [50] is a graphical tool for visualizing the behavior of modern database query optimizers, developed at the Indian Institute of Science over the past five years. The tool, which is now in use at educational and industrial sites worldwide, has provided a fresh perspective on the design and analysis of query optimizers through the introduction and development of the Plan Diagram notion. A plan diagram is a visual representation of the execution plan choices made by the optimizer over an input parameter space, whose dimensions may include database, query and system-related features. In a nutshell, plan diagrams visually capture the geometries of

14 CHAPTER 1. INTRODUCTION 3 the optimality regions of the parametric optimal set of plans (POSP) [4]. Currently, Picasso supports parameter spaces whose dimensions are comprised of relational selectivities. Consider a parametrized SQL query template that defines a relational selectivity space for example, QT8 shown in Figure 1.3, which is based on Query 8 of the TPC-H benchmark. This query determines the market share of a nation within a given region for a given part type when the s acctbal of the SUPPLIER relation and the l extendedprice of the LINEITEM relation are within their respective parametrized values. Here, selectivity variations on the SUPPLIER and LINEITEM relations are specified through the s acctbal :varies 1 and l extendedprice :varies predicates, respectively. select o year, sum(case when nation = BRAZIL then volume else 0 end) / sum(volume) from (select YEAR(o orderdate) as o year, l extendedprice * (1 - l discount) as volume, n2.n name as nation from part, supplier, lineitem, orders, customer, nation n1, nation n2, region where p partkey = l partkey and s suppkey = l suppkey and l orderkey = o orderkey and o custkey = c custkey and c nationkey = n1.n nationkey and n1.n regionkey = r regionkey and s nationkey = n2.n nationkey and r name = AMERICA and p type = ECONOMY ANODIZED STEEL and s acctbal :varies and l extendedprice :varies ) as all nations group by o year order by o year Figure 1.3: Example SQL Query Template (QT8) Plan Diagrams. The corresponding plan diagram for QT8, produced with Picasso on the DB2 relational engine, is shown in Figure 1.4(a). In this picture 2, each colored region represents a specific plan, and a set of 42 different optimal plans (from the optimizer s perspective), P1 through P42, cover the selectivity space. The value associated with each plan in the legend indicates the percentage area covered by that plan in the diagram the biggest, P1, for example, 1 To implement :varies, we use one-sided range predicates of the form Relation.attribute <= constant 2 The figures in this thesis should ideally be viewed from a color copy, as the grayscale version may not clearly register the features.

15 CHAPTER 1. INTRODUCTION 4 (a) Plan Diagram (b) Cost Diagram (c) Card Diagram (d) Reduced Plan Diagram (λ = 20%) Figure 1.4: Picasso Diagrams for QT8 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: LINEITEM L EXTENDEDPRICE)

16 CHAPTER 1. INTRODUCTION 5 covers about a quarter (26.86%) of the region, whereas the smallest, P42, is chosen in only 0.01% of the space. The mode of plan diagram production is the following [50]: Given a d-dimensional query template and a plot resolution of r, Picasso generates r d queries that are uniformly distributed over the selected selectivity space. Then, for each of these query locations, based on the associated selectivity values, a query with the appropriate constants is instantiated the constants are determined by carrying out an inverse-transform of the statistical meta-data available from the optimizer, typically in the form of histograms. This query is then submitted to the query optimizer to be explained, that is, to have its optimal plan computed and returned. After the plans corresponding to all the query points are obtained, a different color is associated with each unique plan, and each point is colored with its associated plan s color. Then, the rest of the diagram is filled in by painting the hyper-rectangle around each point with the same color. For example, in a 2-D plan diagram with a uniform grid resolution of 10, there are 100 real query points, and around each such point a square of dimension 10x10 is painted with the point s associated plan color. Plan colors are hard coded, with the biggest sized plan (size is measured in terms of volume coverage in the n-d space) being colored red, the second biggest being colored blue and so on. Related Diagrams. In conjunction with the plan diagram, Picasso also produces two additional diagrams, the Cost Diagram and the Card Diagram. The cost diagram quantitatively depicts the optimizer s (estimated) query processing costs of the plans shown in the plan diagram, while the cardinality diagram displays the optimizer s (estimated) query result cardinalities. The cost and card diagrams for the QT8 example are shown in Figures 1.4(b) and 1.4(c), respectively. Perhaps the most appealing aspect of Picasso is its construction of Reduced Plan Diagrams. Here, given a plan diagram and a cost-increase-threshold (λ) specified by the user, the plan diagram is recolored to a simpler picture, featuring only a subset of the original plans, such that the cost of any recolored query point does not increase by more than λ percent, relative to its original cost. That is, some of the original plans are completely swallowed by their siblings, leading to a reduced number of plans in the diagram, without materially affecting the query processing quality. For example, the QT8 plan diagram (Figure 1.4(a)) can be reduced with

17 CHAPTER 1. INTRODUCTION 6 XQUERY for $a in db2-fn:xmlcolumn( ORDER.ORDER )/order where $a/total <= return <Output> {$a/@id} {$a/order date} {$a/ship type} </Output> Figure 1.5: Example XQuery λ =20% to the diagram shown in Figure 1.4(d), where only three of the original 42 plans are retained. 1.2 XML data In recent years, data represented in the hierarchically-structured XML document format has found its way into mainstream database technology. As a case in point, DB2, IBM s popular and long-standing database system offering, provides organic support for XML data processing [6], beginning with version 9.1, released in More specifically, DB2 is now a hybrid or multistructured DBMS, which augments the traditional SQL-based relational platform with XMLtuned storage technologies, indexing mechanisms, query languages, and processing strategies. A new data type, called XML, has been introduced which stores the associated data in a tree-like structure, thus natively maintaining the logical hierarchies expressed in XML documents. Users interface with this XML data through queries expressed in either pure XQuery [53], or through a combination of XQuery and SQL. XPath and its super set XQuery are probably the two most popular languages to query XML. An example XQuery taken from the DCMD (Data Centric Multiple Documents) workload of the XBench benchmark [27] is shown in the Figure 1.5. The XQuery has been modified to conform to the syntax DB2 expects. This query lists the orders (order id, order date and ship date), with total amount larger than or equal to a certain number (110.0).

CHAPTER 1. INTRODUCTION 7 Figure 1.6: Example XQuery plan tree A sample plan for the XQuery of Figure 1.5 chosen by the optimizer is shown in Figure 1.6. This is the execution strategy designated as being optimal by the DB2 XQuery optimizer.

18 CHAPTER 1. INTRODUCTION 7 Figure 1.6: Example XQuery plan tree A sample plan for the XQuery of Figure 1.5 chosen by the optimizer is shown in Figure 1.6. This is the execution strategy designated as being optimal by the DB2 XQuery optimizer. Picasso is currently operational on a rich suite of industrial-strength database engines, including DB2, Oracle, SQL Server, Sybase and PostgreSQL and has been of great use in the analysis of the behavior of these optimizers. However, it is restricted to handling SQL query templates on relational schemas. In this thesis, we attempt to profile hybrid optimizers using Picasso. Here, by hybrid optimizers, we mean optimizers which optimize other query languages along with SQL. We are particularly interested in analysing the behavior of hybrid optimizers in their handling of XQuery, the recommended querying language for XML [54]. We have worked with DB2 hybrid optimizer, specifically with DB2 version 9.7, which we will hereafter generically refer to as DB2 XML. As a first step towards the required analysis, we describe in detail, our efforts in making Picasso compliant with DB2 XML. The porting process required collecting statistics of the

19 CHAPTER 1. INTRODUCTION 8 stored XML data from DB2 XML. This involved understanding how DB2 XML generates, stores and maintains statistical information for XML data and ways to retrieve this information. It also necessitated extending Picasso to handle XQuery templates along with SQL templates and introducing mechanisms to compute selectivities from the statistical information already gathered. Finally, Picasso s plan tree visualizer was extended to handle the new operators which were introduced by the XML plan trees. After the description on porting, we go on to describe the experimental results obtained on three XML benchmarks. With these results in hand, we describe the optimizers behavior in various scenarios and provide reasons for the same. We also provide details on a few cases where the optimizer was in error with respect to its estimation of selectivity values. 1.3 Coloring of Plan Diagrams With the introduction of plan diagrams and its original coloring scheme already done, we now present a new approach to color these plan diagrams. To reiterate, in Picasso versions 2.1 and older, plan colors are assigned based on a size-ordered sequencing of the plans (size is measured in terms of volume coverage in the n-d space). Specifically, the biggest plan is always assigned a red color, the second-biggest is assigned a blue color, and so on. The new approach is a semantically richer option wherein the colorings of plans also reflect their structural differences. For instance, if the biggest and second-biggest plans are very similar, they would both be assigned close shades of the same color, say red. With this new approach to coloring, the plan diagram itself provides a first-cut reflection of the plan-tree differences without having to go through the explicit invocation of the plan-differencing operator. The plan-differencing operator is a feature provided in Picasso where we visually display the differences between the trees of any two plans. In the new coloring approach, we have adopted this plan-differencing feature to assign differences between every pair of plans, we have then attempted to device an efficient coloring scheme that as far as possible visually displays these distances in the diagram. In this thesis, we first describe a basic metric that could be used as our plan differencing strategy. We provide examples where it falls short on taking database semantics into account,

20 CHAPTER 1. INTRODUCTION 9 and use the plan differencing methods which were employed in Picasso for the plan tree differencing feature. We then go on to describe the techniques used to optimize our problem. Since our problem is to convert plan differences assigned between every pair of plans into equivalent distances in color, we have the color space as our output. We have chosen RGB (Red, Blue, Green) color range as our output color space. We elaborate Kruskal s Steepest Descent approach [47, 48] adopted to our coloring problem. We also provide details about why we used this method, as opposed to the other iterative techniques, which in fact have guarantees on the rate of convergence. Finally, we provide extensive experimental results on a set of TPC-H based query templates. We also analyse the quality of coloring obtained on these plan diagrams using our MDS technique. We found that the coloring obtained is quite satisfactory with a moderately high percentage ( 70%) of the input plan dissimilarities being represented in color space with less than 30% error threshold. 1.4 Roll-up of Plan Diagrams In the plan diagrams obtained through Picasso, each plan can differ from the rest of the plans by one or more of the following - Number of operator nodes present (operator nodes are non-relation nodes). Names of certain operator nodes. Structure of the plan tree. Parameters supplied to certain operator nodes. When the difference between any two plan trees is only due to the parameters passed to certain operator nodes, it is not considered to be important. Hence, we have provided a feature in Picasso wherein we display the plan diagram at the operator level i.e., we consider all the plan trees which differ only in the parameters passed to certain operators to be the same and assign a single plan number and color to all of them. One of the more important differences between plan trees would be differences found in their structure. Structure of the plan tree is largely dependent on its join order. The join order

CHAPTER 1. INTRODUCTION 10 Figure 1.7: Example Plan Tree for Join Order of an execution plan tree of a query is the order in which its relations are joined.

21 CHAPTER 1. INTRODUCTION 10 Figure 1.7: Example Plan Tree for Join Order of an execution plan tree of a query is the order in which its relations are joined. As an example, consider the plan tree shown in the Figure 1.7. The join order of this plan tree would be NATION (((PART (SUPPLIER LINEITEM)) ORDERS ) PARTSUPP). Extending the idea of visualizing plan diagrams at the operator level, in this thesis, we have dealt with visualizing plan diagrams at the join order level. All plans which have the same join order are considered equivalent and assigned the same number and color. It has to be noted that the plans which are merged with this criteria may differ in other aspects. Through a set of TPC- H based query templates, we have provided results on the number of join orders present in plan diagrams for three different industrial strength engines. We also provide results on the number of common join orders present, across different engines for the same query template. We found that, in plan diagrams the number of join orders are low in comparison to the number of plans,

22 CHAPTER 1. INTRODUCTION 11 but not low enough to get anorexia. Also, we found very few join orders being common across engines for a given query template. 1.5 Contributions The contributions made in this thesis are as follows - 1. We have worked on extending Picasso to support the XML features of DB2. In this thesis, we describe analysis of a hybrid optimizer DB2 XML in its handling of XQuery and the initial results obtained on the optimizer over a variety of benchmark environments. The benchmark environment include TPoX [5] and XBench [27], which are native XML benchmarks, as well as TPCH X, our XML equivalent of the classical TPC-H [49] relational benchmark. We have also identified and enumerated the conditions that need to be satisfied in constructing valid XQuery templates that can be input to Picasso for producing semantically meaningful diagrams. With respect to the experimental results, we find that the plan diagrams produced with DB2 XML often feature patterns that merit further investigation. Further, there are considerable differences between the plan diagrams produced between equivalent relational and XML database environments. The operator trees characterizing the XML processing plans are also different from their SQL counterparts in their structural aspects. 2. A solution to the problem of semantically coloring plan diagrams was found. This technique represents the plan tree differences in color space with reasonable accuracy. The approach used is a Multidimensional Scaling technique called Kruskal s Steepest Descent algorithm. This algorithm was slightly modified in order to handle certain constraints in the coloring problem. Further, we have applied the coloring strategy to a host of plan diagrams from various engines and evaluated the quality of coloring. It can be easily seen that this technique is responsible for translating distances between plans to distances in color space. So, in order to reflect distances between plans in terms of color, it is important that we have a plan distance metric which measures distances

23 CHAPTER 1. INTRODUCTION 12 between plans from a Database perspective. Along these lines, we have illustrated how one of the naive metrics performs poorly in certain cases because of which we have adopted the plan distance metric from the plan-differencing operator of Picasso [50], briefly described in [19]. Finally, we have also provided an analysis on the similarity of plans in plan diagrams. 3. We have extended Picasso to provide visualization of plan diagrams at the join Order level. Along with this, we have done a study on the number of join orders present in numerous plan diagrams and have also to a certain extent provided comparative study of the number of join orders present across various optimization engines. 1.6 Organization The remainder of this thesis is organized as follows: Chapter 2 provides an overview of related research. The issues addressed in porting Picasso to DB2 XML and the results obtained through experiments are described in Chapter 3. The techniques used for semantic based coloring of plan diagrams, the metrics employed and the corresponding test results are elaborated in Chapter 4. Further, plan diagrams at the join order level along with the analysis of the join orders present are dealt with in Chapter 5. Finally, we conclude in Chapter 6 with a discussion of the open issues, as well as promising avenues for further investigation.

24 Chapter 2 Survey of Related Research Over the past few decades, a lot of research has been done in the field SQL query optimization. There are excellent surveys such as [25] and [26] which give a comprehensive view of the progress achieved over the years and an interested reader can refer the same. For a survey of work on query optimization related to Picasso we refer the reader to [40]. A brief summary of the same is provided in Sections 2.1 and 2.2. We then go on to describe the challenges in optimizing XML queries and to what extent it is feasible to reuse techniques developed for SQL query optimization. Further, we provide a short description of the available MDS techniques and their applicability to our semantic based coloring problem. Finally, we describe the importance and the recent research advances on join order enumeration. 2.1 Challenges of SQL Query Optimization The job of the SQL query optimizer is to take parsed SQL as input and to generate an efficient plan to execute the same. It is computationally very challenging to produce an efficient plan for a given query, since the search space of possible plans can be of the order of Ω(3 n ) [45], with n being the number of base relations in the given query. There are some important considerations in the design of an optimizer, some of which are elaborated below. 13

25 CHAPTER 2. SURVEY OF RELATED RESEARCH Strategies for Plan Selection There are different plan selection strategies including strategies like Randomized algorithms and use of set of heuristic rules. One of the important strategies which is widely incorporated in commercial optimizers is the exhaustive enumeration of search space using a cost based approach. The seminal paper by Selinger et al. [28] (System-R project) was one of the key contributors in cost based query optimization even though their search space was restricted to the set of left deep trees. They introduced the notion of interesting order, which is the usefulness of the output ordering of an operator in some subsequent operation. System-R s cost based optimizer uses dynamic programming to efficiently find a good plan. In this approach, optimal solutions to sub-problems are combined to form the optimal solution of the main problem. This is the principle of optimality [59]. The dynamic programming approach performs well for queries which do not have more than around a dozen base relations, after which it becomes infeasible due to the exponential increase in the search space complexity. To overcome this issue, a number of techniques have been proposed in literature, including Iterative Dynamic Programming (IDP) [33, 34], Skyline Dynamic Programming (SDP) [35] and Parametric Query Optimization (PQO) which was first proposed in [36] Refinements of Plan Choices at Run-time Inaccurate estimates of various parameters, lead to poor plan choices by the cost based query optimizer. When moderately accurate values of some of these parameters are available at run time and if their values remain constant throughout the execution time, then certain strategies, including PQO are proposed to improve upon the execution plan quality. One of the important strategies is to optimize partly at compile time and delay decisions which involve those parameters to run time. For parameters like predicate selectivity, whose values cannot be known at the start of the query execution, strategies are proposed by Kabra and DeWitt [21], Antonshenkov [24] and Roy et al [23] to improve upon the query execution plan.

26 CHAPTER 2. SURVEY OF RELATED RESEARCH Industrial Strength SQL Optimizers In order to study the cost distribution of query plans in industrial strength SQL optimizers Waas and Galindo [22] have devised various algorithms. From the observations made using these algorithms, they found that only 1% of the plans in the entire search space of plans, have cost within twice the optimum cost. Also, they found that for queries having more than 4-5 relations, the distributions correspond to Gamma-distributions with shape parameter close to 1. On a different note, Reddy and Haritsa [20] have studied the optimality space of industrial strength optimizers using the plan diagram notion. They observed that optimizers make extremely fine grained choices with changes in the query plan chosen, with minute changes in selectivity. They also observed that optimal plans often have irregular boundaries and highly intricate patterns, indicating strong non-linear cost models. In the follow up of this work [3], Harish et al. show that, through a process of plan reduction, the number of plans in the plan diagram can be substantially brought down (to around 10!) without materially affecting the plan costs (20% error tolerance threshold). 2.3 XML query processing XML is becoming the de facto standard for information interchange. With this being the case, databases are now being largely used to store and query the XML documents. There are many approaches to storing XML data 1. Shredding the XML documents to be stored in relational columns. This implies that the user given XML query will have to be converted to its equivalent SQL and also the results returned will be converted to their equivalent XML format if needed. 2. Store the XML as BLOB (Binary large object) or CLOB (Character large object) data types in relational columns. With this approach, it is generally not feasible to have indexes and very often, all documents may be need to be processed in their entirety with either XQuery or SQL, whichever the engine supports. 3. Develop specialized data management for XML, also known as native XML databases.

27 CHAPTER 2. SURVEY OF RELATED RESEARCH A hybrid approach where XML is stored natively along with relational data. Regarding the querying of XML documents in databases, D. Maier elaborated desired characteristics of an XML query language [60]. The development of some XML query languages are highly influenced by his criteria. Some of the important issues addressed in his proposal include XML output of the query, independence of schema, schema exploitation if possible, and optimized query operations. The operations defined in the proposal are: selection - choosing document or document element. extraction - pulling out elements of a document. reduction - removing sub-elements, restructuring or constructing a new set of element instances combination - merging operation carried out over two or more elements resulting in only single element. XQuery [54] has been established as the W3C recommendation in 2007 for querying XML. However prior to the establishment of W3C standard, there had been several researches proposing query languages for XML, including XML-QL [37], Lorel [38], Quilt [39] and XQL [61]. As given in [18], XQuery semantics is highly influenced by Quilt and XPath [62].. XQuery uses XPath for path expressions and FLWOR structure for describing the whole query. Also, it embodies many of the desired characteristics elaborated by D. Maier. While extensive research has been done in the field of SQL query optimization in the relational world, optimizing XQuery on XML is a relatively new area. The reason for this has been mentioned in [17]. XML was useful only for human understanding rather than data storage until a few years ago. Hence optimization on small documents did not seem very useful. Now that XML usage is shifting towards large scale data storage, more efforts are invested in XQuery/XPath optimization. There are several challenges introduced by XQuery which are elaborated in [1]. They include XML data being heterogeneous and hierarchical in nature as opposed to homogeneous relational data which has flat schemas, NULL values in XML being indicated by the absence of

28 CHAPTER 2. SURVEY OF RELATED RESEARCH 17 certain elements, with no need to be specified explicitly and the varying of XML schemas from document to document, and sometimes within the same column. The first cost-based optimizer was designed and implemented for the Lore system [11] and is given in [13]. The query language used in this system is Lorel which is an extension of OQL. They developed new indexing schemes and database statistics which describe the shape of the data graph. They also described plan enumeration strategies with numerous access methods using logical and physical query plans and a cost model. Following this, a lot of research has been done in XQuery optimization with the development of several XML compliant database systems. [42], [15], [16], [18] and [10] provide a survey on XML query languages and XML query processing. Also, an overview of XQuery optimization techniques and its implementation is given in [9]. We now concentrate on the optimization techniques and implementation details of the XQuery processing system in DB2 XML, which is our optimizer of interest. An excellent description of the cost based optimization done in DB2 XML is given in [1]. We give a brief summary of the same. DB2 XML has Native XML support along with relational data and so now supports XQuery along with SQL. This hybrid characteristic naturally brings support for SQL/XML language. They even have a hybrid optimizer where XQuery and SQL optimization is done in a unified framework. Given this, XQuery optimization largely reuses the already in-place relational optimization techniques. They use holistic algorithms and heuristics to process complex path expression within a single operator. This, they say reduces the search space without compromising on the quality of plans. They also use structural and value indexes in a combined fashion during plan generation. To this effect, they have introduced new operators such as XSCAN, XANDOR and XISCAN. 2.4 Multidimensional Scaling Multidimensional Scaling (MDS) comprises a family of techniques and has been researched extensively. MDS has been broadly classified into Metric MDS and Non-Metric MDS [43]. The basic distinction between these two is that techniques developed for Metric MDS are meant for

29 CHAPTER 2. SURVEY OF RELATED RESEARCH 18 interval level data, where as Non-Metric MDS techniques work on ordinal level data. But it is important to note that Non-Metric MDS techniques can work on higher level data as well. Among the various techniques developed, one of the more popular techniques for Metric Multidimensional techniques is the SMACOF [46] (Scaling by majorizing a complicated function) algorithm which has guarantees on, and, the rate of convergence. Also, one of the oldest and well known techniques for Non-Metric MDS is the Steepest Descent approach by Kruskal [47, 48]. There are other various techniques and many other features of MDS which have been thoroughly investigated in [46]. As per our knowledge, semantic coloring of plan diagrams with the application of an MDS technique is the first of its kind. MDS has been used extensively for various purposes, primarily for visualization of data, but from a database perspective, it is used for visualizing semantic differences in plan trees for the first time. 2.5 Join Orders Finding the optimal join order is one of the most crucial tasks in both SQL and XQuery optimization and consumes a lot of resources. Also significant attention has been given to join order enumeration in past research done on query optimization. This is true for both Top-down and Bottom-up optimizers. Selinger et al., in their seminal paper, proposed a dynamic programming algorithm to find the optimal join order for a given conjunctive query along with introducing cost-based query optimization [28]. They considered only the search space of left-deep trees, but the general idea of their algorithm can be used to explore the space of bushy trees as well. State of the art commercial query optimizers like DB2 [29] still have this as a core part of their optimization. Ono and Lohman [30] have elaborated on the computational complexity of enumerating join orders for a single select-project-join query block for a variety of join graphs. Further, more recently in 2006, Moerkotte and Neumann [31] showed that traditional bottom-up join enumeration algorithms are far worse than than the lower bound given by Ono and Lohman s and went on to describe a new bottom-up enumeration algorithm that is optimal over any join

30 CHAPTER 2. SURVEY OF RELATED RESEARCH 19 graph. Thus, join order enumeration being very important, but computationally very expensive, any savings during optimization would be very beneficial. So, if the number of join orders in any plan diagram turns out to be very less, the join order templates can be fed as skeletons to the optimization process thus saving a lot of optimization cost. To our knowledge no prior work has been done on join order analysis with respect to plan diagrams. However, the concept of re-use of plans was used extensively in [32], even though not in the plan diagram setting. Another closely related work is done by Satyanarayana et al. [41], where they re-use plans as in [32], but they characterize the plans in the form of join trees and join order templates, gather statistics of the join trees and join order templates, and use them to discover the optimal plan of a new query.

31 Chapter 3 Analysis of Hybrid optimizers A lot of effort has been invested in the recent past for analysing the behavior of SQL optimizers. One such tool, which has been widely popular and used as a graphical analyser is Picasso [50]. XML is now emerging as a competitive data exchange format and several popular database systems have been extended and developed to support XML storage and XML query processing along with relational data. There has been considerable interest in the past decade for optimizing XML queries and in paricular XQuery, which has been the W3C recommended standard for querying XML []. A single optimizer may be used for optimizing SQL and XML queries, in which case it is called a hybrid optimizer. In this chapter we focus on analysing the behavior of one such hybrid optimizer (DB2 XML) in the way it handles XQuery. Picasso has been enhanced to support plan and related diagrams for this hybrid optimizer and used as our analysis tool. Enhancing Picasso to generate plan and other diagrams from DB2 XML comprised of the following main tasks: XML Statistics Collection: Mechanisms for generation and storage of statistics on XML columns had to be first identified. Then, methods had to be devised to access and incorporate this information in the Picasso server. XQuery Template Parsing: The Picasso query template parser, which processes SQL templates to identify the :varies selectivity predicates, had to be extended to first denote, and then identify, such predicates in XQuery templates. 20

32 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 21 XML Selectivity Computations: Mechanisms for identifying, interpreting and computing selectivities from the statistical summaries and the optimizer s plan output had to be designed. Execution Plan Parsing: The Picasso visualizer had to be extended to include XML-specific operators (e.g. XSCAN), and to construct graphical descriptions of operator trees that semantically match those provided by the native DB2 XML plan visualizer. We describe each of the above steps in greater detail in this chapter. Finally, we provide extensive experimental analysis on DB2 XML s behavior using a set of three benchmarks. 3.1 XML Statistics Collection In the relational world of DB2, statistics collection is at the granularity of individual attribute columns, with all relevant information stored in the SYSIBM.SYSCOLDIST internal system table, which can be directly accessed through SQL queries. For XML, however, statistical information is organized with respect to element paths (e.g. /Customers/Customer/AccountBalance), and the storage and access of the information is considerably more involved, as explained next. We will hereafter use the term XML R to refer to a database relation that has one or more XML columns. Given an XML R, executing the runstats utility on this relation provides the LOW2KEY and HIGH2KEY values computed over all paths in the XML documents referenced by this relation, as well as the total number of such instances. Further, to obtain distributional information, the runstats utility has to be invoked using the with distribution option a special requirement in DB2 XML is that this option is effective only after an index has been created on the associated path. The statistics generated for the XML Rs corresponding to the element paths that define the dimensions of the Picasso diagram, are dumped into a text file using the db2cat utility. Due to the lack of utilities for processing this file, we wrote our own Perl parser to analyze the contents. The extracted information is stored in two tables, PICASSO XML SUMMARY and PICASSO XML HIST, whose schemas are shown in Figure 3.1. A restriction here is that the

33 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 22 parser has to be executed on the database engine before invoking Picasso. This is because db2cat is a local engine script that cannot be invoked by external clients [55], and its output has to be stored on the filesystem of the machine hosting the database engine. PICASSO XML SUMMARY (colid bigint, tbname varchar(50), elementpath long varchar, card bigint, high2key varchar(100), low2key varchar(100)) PICASSO XML HIST (indexid bigint, indexpath long varchar, colid bigint, tbname varchar(50), dtype varchar(32), value varchar(100), valcount bigint, distinctcount bigint) Figure 3.1: Picasso tables for storing db2cat information The PICASSO XML SUMMARY table contains the low keys, high keys and cardinalities of the various element paths, whereas the PICASSO XML HIST table contains the histogram values and data types for the terminating elements of the indexed paths. The information is separated over two tables because the db2cat utility dumps statistics for all paths of XML documents stored in a XML column, some of which may have an index, while others may not, in which case they would lack distributional information. Further, even in the presence of indexes, the information present in PICASSO XML SUMMARY would be repeated as many times as the number of buckets in the histogram summary, violating normalization imperatives. To make the above notions concrete, consider an example bank customer database, where the XML nodes have the structure shown in Figure 3.2(a), consisting of the customer s account number, name and balance. Sample summary statistics and a snippet of the (equi-depth) histogram on the acct element path (/root()/customer/acct/text()), as produced by DB2 XML, are shown in Figures 3.2(b) and 3.2(c), respectively. 3.2 XQuery Template Parsing The relational columns that form the dimensions of SQL-based Picasso plan diagrams are called Picasso Selectivity Predicates (PSP). These PSPs are easily specified using the :varies syntax embedded in SQL, as previously shown in Figure 1.3. With XQuery, however, the specification is more complicated since the plan diagram dimensions are now element paths in the XML

34 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 23 <customer> < acct> 1881 < /acct> < name> Pablo Picasso < /name> < bal> 1000 < /bal> < /customer> (a) XML node PathID = /root()/customer/acct/text() 2nd Highest Key = 5:6059 2nd Lowest Key = 3:01 Sum Node Cnt = 3721 Value Count Distinct Count (c) Frequency Histogram (b) Summary Stats Figure 3.2: Example XML node and Statistics hierarchy, and it is difficult to directly infer these paths from the :varies predicate in the query. In particular, the complete specification of a Picasso dimension requires information on the path structure, namespaces, table names, aliases and column names. We will heareafter use the term PSX (Picasso Selectivity XML element path) 1 to denote these dimensions and distinguish them from the relational PSP notation. Rather than constructing a complicated parser to resolve the above, we have taken the simple expedient work-around of having the user explicitly provide the PSX location information as a prologue to the query template, as shown in Figure 3.3, where there are two PSXs corresponding to customer accounts and orders. This XQuery template retrieves those accounts that have a working balance within a parametrized value and, for each of these accounts, extracts the associated orders involving cash transactions within a parametrized value. 1 Our usage of the term element here denotes both XML elements as well as XML attributes.

35 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 24 #NAMESPACE declare default element namespace ; #NAMESPACE declare namespace c= ; #PSX /c:customer/c:accounts/c:account/c:balance/c:workingbalance CUSTACC CUSTACC CADOC #PSX /FIXML/Order/OrdQty/@Cash ORDER ORDER ODOC # # XQUERY declare default element namespace ; declare namespace c= ; for $acct in db2-fn:xmlcolumn( CUSTACC.CADOC ) /c:customer/c:accounts/c:account[c:balance/c:workingbalance :varies] for $ord in db2-fn:xmlcolumn( ORDER.ODOC ) /FIXML/Order[@Acct=$acct/@id/fn:string(.)] where $ord/ordqty/@cash :varies return <result> {$acct} {$ord} </result> Figure 3.3: Example Picasso XQuery Template Here, the namespaces corresponding to the XQuery template are enumerated in the beginning of the prologue followed by PSXs. Each PSX in turn consists of its element path, table name (i.e. its XML R), alias and column name (i.e. the XML column), listed in that order. The beginning of each namespace and each PSX is indicated through the use of #NAMESPACE and #PSX, respectively, and the termination of the prologue is indicated through # #. Note that the :varies keywords are still embedded inside the Xquery template so as to unambiguously determine where values would be substituted for varying the PSX. As a final point, the order in which the PSX locations are listed in the prologue should match the lexical order of the corresponding :varies predicates within the XQuery template. 2 2 The explain command in DB2 XML requires single quotes to appear after the keyword XQUERY and at the end of the query. To conform with this syntax, all single quotes within XQuery templates are converted to double

36 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 25 Template Constraints. In order to meaningfully cover the full range of selectivities shown in the various Picasso diagrams, an XQuery template should satisfy a variety of conditions, which are enumerated below (for reference, the equivalent constraint in the SQL framework is also listed): 1. Each XML column of a relation can participate in at most one PSX. (In SQL templates, each relation can participate in at most one PSP.) 2. The element path in a PSX predicate typically consists of a logical segment, denoted L, that defines the semantic object whose selectivity is desired to be varied, and a physical segment, denoted P, which is downstream of L and whose value is actually varied in the query template. To make this concrete, consider the PSX /c:customer/c:accounts/c:account/ c:balance/c:workingbalance in the XQuery template of Figure 3.3. Here L is the segment (/c:customer/c:accounts/c:account), while P is the downstream segment /c:account/c:balance/c:workingbalance the parametrized variation across Customer Accounts is achieved through the varying of WorkingBalance values. Similarly, for the second PSX, the logical segment is /FIXML/Order and the physical segment is /Order/OrdQty/@Cash. As a final point, note that it is also acceptable to have templates where the logical segment itself terminates with the variable element, and there is no distinct P segment. For a PSX predicate with an explicit P segment, it should be ensured that many-to-one relationships do not occur in this segment. That is, in the graphical representation of the XML schema [8], there should be no * node in the path corresponding to P. (In SQL templates, there is no equivalent to these notions of logical and physical segments since hierarchies are inherently not present.) 3. If along with the PSX predicate, other predicates are also specified, then it should be ensured that the application of these additional predicates is deferred in the plan tree. Specifically, these non-psx predicates should not be applied in conjunction with the PSX quotes by the Picasso client.

37 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 26 predicate, at the leaf level, in the plan tree. (In SQL templates, PSP relations can feature in join predicates but not in any other equality or range predicates.) 4. The permissible data types for PSX element paths are integer, float, string and date. (The same set of data types are supported in SQL templates.) 5. The PSX element paths must have pre-generated statistical summaries. (The same constraint holds for PSPs in SQL templates.) 6. The PSXs should be on dense-domain paths in high cardinality XML columns. (The same constraint holds for PSPs in SQL templates.) To illustrate the constraints on XQuery templates, consider an XML database with CUS- TOMER and ORDER documents adhering to the schema graphs shown in Figures 3.4(a) and 3.4(b), respectively. On this database, consider the XQuery template shown in Figure 3.4(c), which returns the names and phone numbers of customers located within a parametrized address value and whose orders feature item quantities within a parametrized value this template is compatible with respect to all of the above conditions. However, if the query template were to be slightly altered as shown in Figure 3.4(d), it would become invalid as Condition 2 of the above requirements would be violated specifically, there is now a * node in the physical segment (order lines/( * )order line/quantity of item). 3.3 XML Selectivity Computation After the statistics are collected and the information regarding PSXs are retrieved from the XQuery template, the next step in making Picasso compliant to DB2 XML is to determine the constants that would result in the desired PSX selectivity values. In the following discussion, we first explain the selectivity computation process in the relational environment, and then describe its extension to the XML world.

38 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 27 (a) CUSTOMER Schema Graph (b) ORDER Schema Graph #PSX /customers/customer/address id CUSTOMER CUSTOMER CUSTOMER #PSX /order/order lines/order line/quantity of item ORDER ORDER ORDER # # XQUERY for $cust in db2-fn:xmlcolumn( CUSTOMER.CUSTOMER ) /customers/customer[address id :varies] for $order in db2-fn:xmlcolumn( ORDER.ORDER ) /order[customer id=$cust/@id]/order lines/order line[quantity of item :varies] return <Output> {$order} {$cust/first name} {$cust/last name} {$cust/phone number} </Output> (c) Valid XQuery Template #PSX /customers/customer/address id CUSTOMER CUSTOMER CUSTOMER #PSX /order/order lines/order line/quantity of item ORDER ORDER ORDER # # XQUERY for $cust in db2-fn:xmlcolumn( CUSTOMER.CUSTOMER ) /customers/customer[address id :varies] for $order in db2-fn:xmlcolumn( ORDER.ORDER )/order[customer id=$cust/@id and order lines/order line/quantity of item :varies] return <Output> {$order} {$cust/first name} {$cust/last name} {$cust/phone number} </Output> (d) Invalid XQuery Template Figure 3.4: Xquery Template Constraints

39 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS Relational Selectivities As mentioned earlier, Picasso estimates the constants that would result in the desired selectivities through an inverse-transform of the PSP statistical summaries. To verify that Picasso and the optimizer are mutually consistent with regard to this estimation, three types of selectivities are computed: Picasso Selectivity, Predicate Selectivity and Plan Selectivity, which are accessible through the Selectivity Log tab of the Picasso tool. These selectivities, whose values are expected to be close to each other in a valid execution, are defined as follows: Picasso Selectivity: This is the selectivity of the PSP as determined by the Picasso software through the statistical summaries of the database engine. Predicate Selectivity: This is the optimizer s estimated selectivity for the PSP when the unidimensional query select * from Table T where PSP(T) is optimized. Plan Selectivity: This is the selectivity associated with the node containing the PSP in the optimizer s plan choice for the user s query XML Selectivities For the above computations in the relational world, selectivity is simply and consistently interpreted as the fractional number of rows in a table that are relevant to processing the user s query. However, with XML data, the definition is not so straight-forward since information is organized in the form of nodes and documents containing these nodes. Therefore, selectivities can be computed at the granularity of nodes or documents. For example, consider the scenario where 100 XML nodes are organized in a single document, and the other extreme where there are 100 documents, each containing one of these nodes. In the former case, the document selectivity will always be 0 (no node in the document satisfies the predicate) or 1 (at least one node in the document satisfies the predicate), whereas in the latter, the document selectivity will represent the fractional number of nodes satisfying the predicate.

40 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 29 From a query processing perspective, node selectivities constitute a more meaningful metric than document selectivities since they essentially correspond to rows in tables. Unfortunately, the current implementation of indexes in DB2 XML is such that although information to compute node selectivities is available, only document selectivities are finally returned this issue is explicitly outlined in the following snippet from [1]: A scan of an XML index returns references to individual nodes that satisfy an index expression, but today, DB2 XML does not fully exploit this granularity. We currently use the result of an XML index scan to qualify entire documents that satisfy an XML index expression. Subsequent navigation from the root of each qualifying document determines the exact set of nodes that satisfies the full XPath expression. As a further complication, the document selectivities computed at the leaves of the plan trees later morph into node selectivities at the higher internal levels of the plan tree Computation Methods Modulo the above conceptual issues, the Predicate and Plan selectivities of the DB2 XML optimizer, which are shown in the Picasso Selectivity Logs, are computed in the following manner: Predicate Selectivity: The representative XQuery structure shown in Figure 3.5(a) is submitted to the Explain facility in DB2, with its place holders NAMESPACES, TABLE NAME, COLUMN NAME, PSX and constant-value filled up using the data from the XQuery template under consideration. The Predicate Selectivity is taken to be the estimated number of XML results in the final output, i.e., the number present as output in the RETURN operator of the plan tree returned by Plan Explain. As a concrete example, consider the PSX /FIXML/Order/OrderQty/@Cash featured in the example XQuery template of Figure 3.3. For this PSX, predicate selectivities are computed using the XQuery instance shown in Figure 3.5(b). Plan Selectivity: Similar to the implementation in the relational world, the Plan Selectivity of a PSX is extracted from the annotated plan tree produced by the Explain facility. It is

41 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 30 XQUERY NAMESPACES for $x in db2-fn:xmlcolumn( TABLE NAME.COLUMN NAME )PSX where $x <= constant-value return $x (a) XQuery Structure XQUERY declare default element namespace ; declare namespace c= ; for $x in db2-fn:xmlcolumn( ORDER.ODOC )/FIXML/Order/OrderQty/@Cash where $x<=20000 return $x (b) Example XQuery Figure 3.5: Predicate Selectivity Computation evaluated as the selectivity of the parent node of the XML relation associated with the PSX in the plan tree this parent node is typically a scan operator such as TBSCAN or XISCAN Selectivity Estimation Errors Apart from the above conceptual issue of document versus node selectivities, we have also encountered what appear, within our limited understanding, to be bugs in the DB2 XML implementation of selectivity estimation. Specifically, when the constants used in the PSX are negative values, then the Predicate and Plan selectivity estimations are way off from what would be expected from the statistical summaries. Secondly, when the PSX attribute is of type string, as opposed to integer or float, the selectivity estimations again bear little relationship to the expected values. These situations are explicitly shown later in Section of the experimental study.

CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 31 3.4 Plan Parsing Figure 3.

42 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS Plan Parsing Figure 3.6: Example Plan Tree To cater to the complex data processing requirements of XML, the DB2 XML engine has incorporated new XML-specific operators, which appear in the execution plan trees. These new operators include XSCAN, XISCAN, XANDOR and XUNNEST, and a sample execution plan featuring some of them is shown in Figure 3.6. All the operators have been integrated in the Picasso server modules. In order to display the plan trees graphically, the explain output stored in the DB2 XML explain tables (EXPLAIN PREDICATE, EXPLAIN ARGUMENT, EXPLAIN INSTANCE, EX- PLAIN OPERATOR, EXPLAIN OBJECT, EXPLAIN STATEMENT and EXPLAIN STREAM) were retrieved and parsed. Picasso has a generic visualizer that attempts to semantically mimic the native visualizers of

43 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 32 the various database engines and provide a common interface across these engines. The validity of our representation was confirmed through comparison with the native visualizer for all the templates. 3.5 Experimental Results In this section, we describe the experimental framework on which we evaluated the DB2 XML optimizer using Picasso, and the initial results obtained on this framework Experimental Setup All our experiments were carried out on a vanilla hardware platform specifically, a SUN ULTRA 20 system provisioned with 4GB RAM running Ubuntu Linux 64 bit edition With respect to DB2 XML, both EXTENDED OPTIMIZATION and HASH-JOIN were enabled Databases We worked with three different XML databases: TPoX, XBench and TPCH X. TPoX and XBench are native XML benchmarks, while TPCH X is an XML equivalent of the classical TPCH benchmark used in SQL databases. TPoX models a transaction processing environment, whereas XBench and TPCH X are representative of decision-support environments their construction details are given below. TPoX. The TPoX benchmark database was populated with CUSTACC, ORDER and SECURITY documents, corresponding to the XXS scale, which takes about 1GB of space. The data in these documents follow a variety of distributions ranging from uniform to highly skewed, as per the benchmark specifications [57]. The richest set of indexes recommended by the benchmark was created and statistics were collected on all these paths. XBench. The XBench benchmark [27] supports a choice of four different kinds of databases, namely TC/SD (Text centric, Single Document), TC/MD (Text centric, Multiple documents), DC/SD (Data centric, Single Document) and DC/MD (Data centric, Multiple documents). In

44 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 33 our study, we chose the DC/MD flavor since it appeared to be the most challenging from the optimizer s perspective. Specifically, DB2 XML was populated with the large option for DC/MD, resulting in a database size of around 1 GB with all data conforming to the uniform distribution. For each of the XQuery templates, indexes were created for all paths appearing in the template and statistics were collected on these paths. TPCH X. The NATION, REGION, SUPPLIER and CUSTOMER relations of the TPCH benchmark were converted to their equivalent NATIONS, REGIONS, SUPPLIERS and CUSTOMERS XML Rs. Further, the ORDER and the LINEITEM relations were combined into the ORDERS XML R, while the PART and PARTSUPP relations were merged into a single PARTS XML R. The schemas of the documents stored in these XML Rs are shown in Figure 3.7. These schemas were constructed and the data was generated using the examples provided with the Toxgene tool [2]. The database was populated with around 1GB of data, and indexes were created for all paths present in the XQuery templates. For each of the 6 tables shown in the schema, multiple documents were loaded, with an entity per document i.e., the NATIONS, REGIONS, CUSTOMERS, SUPPLIERS, PARTS and ORDERS tables were populated with one document per nation, region, customer, supplier, part and order entity, respectively. Each order document in turn included its corresponding lineitems and each part document had its supply information inlined. The number of documents populated in each of these tables is tabulated in Table 3.1 and a sample document for each of the 5 tables is given in Appendix C. The database was made roughly equivalent to the TPCH benchmark in terms of the cardinalities of the tables (except for ORDERS, which was populated with only documents as opposed to a target value of this was due to memory heap overflow problems occurring while generating data using Toxgene). Finally, the generated data was uniformly distributed over the associated domains Query Templates We have considered only two-dimensional XQuery templates in this study. These templates have been verified to be compatible with the constraints specified for legal XQuery templates in Section 3.2. All the predicates are on floating-point element values. The Picasso diagrams are produced at a resolution of 300 points in each dimension, resulting in almost a hundred

45 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 34 create table nations (nations xml) create table regions (regions xml) create table customers (customers xml) create table suppliers (suppliers xml) create table parts (parts xml) create table orders (orders xml) Figure 3.7: Database Schema for TPCH X Table Name Number of Documents nations 25 regions 5 customers suppliers parts orders Table 3.1: Document Count for TPCH X thousand queries being explained over the space. The DB2 XML optimizer itself was run at its default optimization level of 5 we intend to explore the Picasso diagram behavior at the highest optimization level of 9 in the near future. As described in Section 3.3, the constants to be used in the query templates are estimated through a linear-interpolation-based inverse transform of the histogram summaries obtained through the db2cat utility. Using these constants, Picasso outputs a Selectivity Log that explicitly tabulates the three types of selectivities delineated in Section 3.3, namely, Picasso, Predicate and Plan selectivities. Additionally, the log also indicates the relative and absolute differences between the Picasso and Predicate selectivity values. The cost-increase threshold used for plan diagram reduction in all our experiments is λ = 20%. The choice of this setting was based on the observations of [3], wherein anorexic reduction was usually observed at this value for SQL-based query templates.

46 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS TPoX We constructed a Picasso template from the cust sold security.xqr XQuery provided with the TPoX benchmark, as shown in Figure 3.8. This template, which we will hereafter refer to as QTX SEC, returns the customer accounts whose working balance is within a parametrized value and which have been used to trade one or more securities, and, for each of these accounts, the trading amounts of the associated orders being within a parametrized value, the result being alphabetically sorted on Account titles. #NAMESPACE declare default element namespace ; #NAMESPACE declare namespace c= ; #PSX /c:customer/c:accounts/c:account/c:balance/c:workingbalance CUSTACC CUSTACC CADOC #PSX /FIXML/Order/OrdQty/@Cash ORDER ORDER ODOC # # XQUERY declare default element namespace ; declare namespace c= ; for $cust in db2-fn:xmlcolumn( CUSTACC.CADOC ) /c:customer/c:accounts/c:account[c:balance/c:workingbalance :varies] for $ord in db2-fn:xmlcolumn( ORDER.ODOC )/FIXML/Order [@Acct=$cust/@id/fn:string(.) and OrdQty/@Cash :varies] order by $cust/c:accounttitle/text() return <Customer> {$cust/c:accounttitle} {$cust/c:currency} </Customer> Figure 3.8: XQuery Template for TPoX (QTX SEC) Picasso Diagrams. The Picasso diagram suite for QTX SEC is shown in Figures 3.9(a) through 3.9(d). The plan diagram consists of 14 plans with plan P1 occupying about three-quarters of the space and plan P14 present in only 0.001% of the diagram, resulting in an overall Gini co-efficient of Further, the number of plans decreases to 6 when parameter differences between plan trees are not considered. The low density of plans may perhaps be due to the

47 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 36 (a) Plan Diagram (b) Cost Diagram (c) Card Diagram (d) Reduced Plan Diagram (λ = 20%) Figure 3.9: Picasso Diagrams for TPoX QTX SEC (X-Axis: CUSTACC /Customer/Accounts/Account/Balance/WorkingBalance, Y-Axis: ORDER /FIXML/Order/OrdQty/@Cash) queries being simpler, from an optimization perspective, in a transaction processing environment. We also see that the plan diagram predominantly consists of vertical blue bands (plan P2) on the red region (plan P1). Only close to the X and Y-axes do we find other plans, such as the yellow horizontal stripe of plan P4, and the brown and orange vertical bands of plans P3 and P6, respectively. The operator trees for plans P1 (red) and P2 (blue) are shown in Figures 3.10(a) and 3.10(b), respectively. We see here that while they both have the same join order, namely

48 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 37 CUSTACC ORDER, there is a structural difference between the two plans due to the repositioning of an NLJOIN-XSCAN pair. As is evident in the trees, the indexes ACCT BAL and ORDER ACCOUNTID are used to scan the CUSTACC and ORDER relations, respectively. The purple vertical stripe (plan P5) also has the same join order as that of plans P1 and P2, but uses the index ORDER CASH, instead of ORDER ACCOUNTID, to access the ORDER relation. In the region near and parallel to the X-axis, the join order changes to ORDER CUSTACC for plans P4, P7, P8, P9 and P14, with all five collectively occupying less than 2% of the area. Finally, when the plan diagram is reduced with λ = 20%, the number of plans comes down to 11 plans, as shown in Figure 3.9(d). The blue vertical bands of plan P2 require a λ of 70% in order to disappear due to swallowing by the red plan P1. The cost and card diagrams are as shown in Figures 3.9(b) and 3.9(c). Interestingly, we observe here that while the result cardinality has a simple affine relationship with the PSX selectivities, the cost diagram in contrast exhibits a highly non-linear behavior. Selectivity Logs. The first and last five entries of the selectivity logs for the /Customer/Accounts/Account/Balance/WorkingBalance and /FIXML/Order/OrdQty/@Cash PSXs are shown in Figures 3.11(a) and 3.11(b), respectively. We observe here that the Picasso and Predicate selectivities are comparable throughout the selectivity range. However, with regard to Plan selectivities, we observe certain discrepancies in both the PSXs for the WorkingBalance PSX, the selectivities are accurate till the 25.70% mark but subsequently saturate and remain at this level, while with PSX, they are actually 0% for most of the range XBench We now turn our attention to the XBench native XML benchmark. For this benchmark, a representative XQuery template, QTX19, which is based on XQuery 19 of the queries listed for the DC/MD database, is shown in Figure It attempts to retrieve all orders within a parametrized total value for which the associated customers are located within a parametrized address value, the result being sorted by the order date.

49 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 38 (a) Plan P1 (Red)

50 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 39 (b) Plan P2 (Blue) Figure 3.10: Plan trees (QTX SEC)

51 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 40 Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (a) PSX: /Customer/Accounts/Account/Balance/WorkingBalance Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (b) PSX: Figure 3.11: Selectivity Logs (QTX SEC) Picasso Diagrams. The Picasso diagram suite for QTX19 is shown in Figures 3.12(a) through 3.12(d). We see here that the plan diagram consists of 42 plans, and the area distribution of these plans is highly skewed, with the largest plan occupying a little over 20% of the space and the smallest taking 0.001%, the overall Gini co-efficient being as high as Most of the differences between the plans are subtle, arising out of their operator parameters, rather than the plan tree structures themselves for example, a common parameter difference between plans is attributable to the TMPCMPRS parameter, which is present in the TEMP and SORT operators. When such parameter differences are ignored, the number of plans comes down sharply to just

52 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 41 (a) Plan Diagram (b) Cost Diagram (c) Card Diagram (d) Reduced Plan Diagram (λ = 20%) Figure 3.12: Picasso Diagrams for XBench QTX19 (X-Axis: ORDER /order/total, Y-Axis: /customers/customer/address id) 10. In Figure 3.12(a), plan P2 (dark blue) blends into the plan P1 s region (red) in a wave-like pattern. The plan trees for P1 and P2 are shown in Figures 3.14(c) and 3.14(b), respectively. We see here that the plans have different join orders P1 computes ORDER CUSTOMER whereas P2 evaluates CUSTOMER ORDER that is, they differ on which relation is outer and which is inner in the join. In addition, there are also a few other structural differences between the plans. Near and parallel to the Y-axis, we see a yellow vertical strip (plan P4) which is sprinkled

53 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 42 #PSX /order/total ORDER ORDER ORDER #PSX /customers/customer/address id CUSTOMER CUSTOMER CUSTOMER # # XQUERY for $order in db2-fn:xmlcolumn( ORDER.ORDER )/order, $cust in db2-fn:xmlcolumn ( CUSTOMER.CUSTOMER )/customers/customer where $order/customer id = $cust/@id and $order/total :varies and $cust/address id :varies order by $order/order date return <Output> {$order/@id} {$order/order status} {$cust/address id} {$cust/first name} {$cust/last name} {$cust/phone number} </Output> Figure 3.13: XQuery template for XBench (QTX19) with light orange spots (plan P16). Plan P4 is shown in Figure 3.14(a) and differs from P16 only in the positioning of an NLJOIN-XSCAN pair. The cost and card diagrams are shown in Figures 3.12(b) and 3.12(c), respectively. We observe here a simple affine relationship for both cost and cardinality with regard to the selectivity values. When reduction with λ = 20% is applied to the plan diagram, the number of plans goes down from 42 to 21, as shown in Figure 3.12(d). Interestingly, the wave pattern produced by plan P2 remains intact it is eliminated only when λ is increased to 30%. However, it is possible that the P2-associated wave pattern would have been removed at a lower λ if our reduction algorithm had supported partial swallowing, wherein a subset of the points associated with a plan are replaced currently, Picasso only supports total swallowing, allowing a plan s points to be replaced only if all its points are replaced. We plan to investigate this issue in further detail by implementing partial swallowing in the Picasso tool.

54 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 43 (a) Plan P4 (Yellow) (b) Plan P2 (Blue)

55 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 44 (c) Plan P1 (Red) Figure 3.14: Plan Trees (QTX19)

56 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 45 Selectivity Logs. The first and last five entries of the selectivity logs corresponding to the /order/total and /customers/customer/address id PSXs are shown in Figures 3.15(a) and 3.15(b), respectively. The Picasso selectivities for /order/total show marginal differences with regard to the Predicate selectivities for the initial entries and, with increasing selectivity, eventually become close to these values. However, Plan selectivity values remain close to the expected values only until the 50% mark, after which they saturate and remain the same irrespective of the constant used, similar to our observations in the TPoX environment. Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (a) PSX: /order/total Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (b) PSX: /customers/customer/address id Figure 3.15: Selectivity Logs (QTX19)

57 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 46 On the other hand, for /customers/customer/address id, the Picasso selectivities match the Predicate selectivities with tolerable deviations for most of the selectivity range however, the Predicate selectivities for the last four entries (98.83% through 99.83%) remain static at a value of 98.60%. This is in spite of close to 3000 records being present between the Picasso constants associated with 98.83% and 99.83%. Further, our plan selectivity extraction mechanism does not work even partially in this case, returning a value of 0% throughout the selectivity range TPCH X In our earlier work on SQL databases [50], we had constructed a variety of TPCH-based parametrized SQL query templates. For example, QT5, the template shown in Figure 3.16(a), is based on TPCH Query 5. This query lists the revenue volume done through local suppliers, when the c acctbal of the CUSTOMER relation and the s acctbal of the SUPPLIER relation are within their respective parametrized values. XQuery templates similar to these SQL templates were created for evaluating DB2 XML as a case in point, the QTX5 template shown in Figure 3.16(b) is similar to QT5. Picasso Diagrams. The suite of Picasso diagrams for QTX5 is shown in Figures 3.17(a) through 3.17(d). Further, for comparative purposes, we also show the plan diagram for the SQL version, QT5, in Figure 3.17(e). In Figure 3.17(a), we can see that there are 25 plans in the plan diagram with a wide variation in the areas covered by these plans. The top three plans occupy more than 70% of the space, and the last three occupy only 0.068% of the region this skew in area coverage is captured through the Gini coefficient, whose value is When parameter differences between plan trees are not considered, the number of plans reduces to 16 in the plan diagram. The plan diagram is also characterized by a variety of intricate patterns, and has several regions where many plans intermingle the most prominent example is found in the upper half region of the plan diagram where vertical bands of Plan P1 (red), Plan P2 (blue), Plan P3 (brown) and Plan P6 (orange) alternate with one another in a convoluted manner. Yet another example is the intermingling of plan P1 (red), plan P3 (brown), plan P5 (purple) and plan P9

58 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 47 select n name, sum(l extendedprice * (1 - l discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c custkey = o custkey and l orderkey = o orderkey and l suppkey = s suppkey and c nationkey = s nationkey and s nationkey = n nationkey and n regionkey = r regionkey and r name = ASIA and o orderdate >= and o orderdate < and c acctbal :varies and s acctbal :varies group by n name order by revenue desc (a) SQL template for TPCH (QT5) #PSX /Suppliers/Supplier/AcctBal SUPPLIERS SUPPLIERS SUPPLIERS #PSX /Customers/Customer/AcctBal CUSTOMERS CUSTOMERS CUSTOMERS # # XQUERY for $reg in db2-fn:xmlcolumn( REGIONS.REGIONS ) /Regions/Region[Name= AFRICA ] for $nat in db2-fn:xmlcolumn( NATIONS.NATIONS ) /Nations/Nation[xs:double(RegionKey)=$reg/@key/xs:double(.)] let $sup := db2-fn:xmlcolumn( SUPPLIERS.SUPPLIERS )/Suppliers /Supplier[(xs:double(NationKey)=$nat/@key/xs:double(.)) and (AcctBal :varies)], $cust := db2-fn:xmlcolumn( CUSTOMERS.CUSTOMERS )/Customers /Customer[(xs:double(NationKey)=$sup/NationKey/xs:double(.)) and (AcctBal :varies)], $line := db2-fn:xmlcolumn( ORDERS.ORDERS )/Orders/Order [xs:double(custkey) = $cust/@key/xs:double(.) and OrderDate[. > and. < ]] /LineItem[xs:double(SuppKey) = $sup/@key/xs:double(.)] let $summand :=(for $x in $line return $x/extendedprice*(1 -$x/discount)) order by sum($summand) return <result> {$nat/@key} {$nat/name} <revenue> {sum($summand)} </revenue> </result> (b) XQuery template for TPCH X (QTX5) Figure 3.16: Picasso Query Templates

59 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 48 (a) Plan Diagram (b) Cost Diagram (c) Card Diagram (d) Reduced Plan Diagram (λ = 20%) (e) Plan Diagram (QT5) Figure 3.17: Picasso Diagrams for TPCH X QTX5 (X-Axis: SUPPLIERS /Suppliers/Supplier/AcctBal, Y-Axis: CUSTOMERS /Customers/Customer/AcctBal

60 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 49 (turquoise) in the lower half of the diagram. Let us denote the XML Rs in the XQuery template of Figure 3.16(b), namely REGIONS, NATIONS, SUPPLIERS, CUSTOMERS and ORDERS, as R, N, S, C and O, respectively. When we drill down into the plan internals, we find that there are only three different join orders in the whole plan space! Plans P1, P3, P13 and P17 have the left-deep join order (((R N) S) C) O, plans P2, P5, P6, P9, P10, P11, P14, P15 and P20 have the bushy join order ((R N) (S C)) O and plans P4, P7, P8, P12, P16, P18, P19 and P21-P25 also have a bushy join order which is ((R N) S) (C O). The explicit trees for these plans are too large to include here, and have therefore been made available at our project URL [56]. Turning our attention to the Cost Diagram (Figure 3.17(b)), we observe that the cost steadily increases with the increase in selectivity of the CUSTOMERS XML R. The effect of the SUP- PLIERS relation seems mild in comparison, perhaps because the CUSTOMERS relation is about 15 times the size of the SUPPLIERS relation. We also see that the cost plateaus out when approximately 50% selectivity is reached in the PSX dimensions. To analyze this behavior, we looked into the annotated plan trees at different points in the selectivity space. Annotated plan trees at roughly 40%, 60% and 98% selectivities in both dimensions are partially shown in Figures 3.18(a), 3.18(b) and 3.18(c). The total cost of these plans being 4.87e5, 6.08e5 and 6.10e5, respectively, it can be inferred by looking at these subtrees that the overall costs are dominated by the NLJOIN operator in the CUSTOMERS subtree. For a selectivity increase selectivity from 40% to 60%, a cost increase of 1.12e5 is observed in the NLJOIN operator corresponding to the CUSTOMERS relation. However, on moving from 60% to 98%, the cost goes up by merely 0.01e5. It appears puzzling that the NLJOIN cost does not scale up at least linearly with increasing selectivity and this issue needs further investigation. Moving on to the cardinality diagram (Figure 3.17(c)), it shows a constant result set size of 1 throughout the space since the query computes an aggregate operation (sum) in the final step. Now, when the plan diagram is subject to λ = 20% reduction, the original 25 plans come down to 6, as shown in Figure 3.17(d). Further, several of the intricate patterns are eliminated for example, a part of the lower half region of rapidly alternating plans is singly dissolved by the red plan (Plan P1).

61 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 50 (a) 40% Selectivity (b) 60% Selectivity

CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 51 (c) 98% Selectivity Figure 3.18: Annotated Plan trees Partial (QTX5) Finally, when we compare the XML and SQL-based plan diagrams in Figures 3.17(a) and 3.

62 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 51 (c) 98% Selectivity Figure 3.18: Annotated Plan trees Partial (QTX5) Finally, when we compare the XML and SQL-based plan diagrams in Figures 3.17(a) and 3.17(e), respectively, we notice that the XML diagram is significantly more complex in comparison in terms of their spatial layouts even though the numbers of plans is the SQL-based diagram is more than twice the number found in the XML diagram. This certainly merits further investigation. Selectivity Logs. The selectivity logs for the /Suppliers/Supplier/AcctBal and /Customers/Customer/AcctBal PSXs are shown in Figures 3.19(a) and 3.19(b), respectively specifically, the first five and last five entries are extracted and shown here. The Picasso selectivities for /Suppliers/Supplier/AcctBal show tolerable differences with regard to the Predicate selectivities for the initial entries and, with increasing selectivity, eventually become close to these values. On the other hand, for /Customers/Customer/AcctBal, the Picasso selectivities match the Predicate selectiv-

63 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 52 ities with tolerable deviations for most of the selectivity range however, for the last eight Picasso selectivity entries (97.50% through 99.83%), the Predicate selectivity stays put at 97.24%. This is in spite of the fact that there are more than 3500 records between the constants associated with 97.50% and 99.83%. Again, this is an issue that needs to be investigated further. Finally, the Plan selectivity values for both the PSXs show behavior similar to the previous cases the selectivity values saturate at the 50% mark Query Complexity Problem Apart from the above issue with plan selectivities, we also ran into another problem when we evaluated the XQuery template QTX2 shown in Figure 3.20, which is based on Query 2 of TPCH. For this template, the plan diagram features only two very similar plans, the difference being a slight variation in their join orders, with the first NLJOIN operator, which combines the SUPPLIERS and PARTS relations, having SUPPLIERS as the outer relation in P1 (red), and PARTS as the outer relation in P2 (blue). It is certainly in the realm of possibility that this plan diagram genuinely features only a few plans, but we also think that the following warning message issued by the DB2 XML optimizer may have impacted the observed behavior: SQL0437W Performance of this complex query may be sub-optimal. Reason code: 1". SQLSTATE=01602" This warning persisted even when the optimization level was lowered and the heap space was sufficiently increased (upto 3GB). However, when the template was simplified by removing some constructs, the message went away and the resulting diagram in fact featured quite a few more plans than those seen with QTX2.

64 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 53 Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (a) PSX: /Suppliers/Supplier/AcctBal Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (b) PSX: /Customers/Customer/AcctBal Figure 3.19: Selectivity Logs (QTX5)

65 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 54 #PSX /Parts/Part/RetailPrice PARTS PART1 PARTS #PSX /Parts/Part/PartSupp/SupplyCost PARTS PART2 PARTS # # XQUERY for $part in db2-fn:xmlcolumn( PARTS.PARTS )/Parts/Part [RetailPrice :varies]/partsupp, $sup in db2-fn:xmlcolumn( SUPPLIERS.SUPPLIERS )/Suppliers/Supplier, $nat in db2-fn:xmlcolumn( NATIONS.NATIONS )/Nations/Nation, $reg in db2-fn:xmlcolumn( REGIONS.REGIONS )/Regions/Region where xs:double($sup/@key)=xs:double($part/suppkey) and $sup/nationkey/xs:double(.)=$nat/@key/xs:double(.) and $nat/regionkey/xs:double(.)=$reg/@key/xs:double(.) and $reg/name = EUROPE and $part/supplycost <=(let $reg := db2-fn:xmlcolumn( REGIONS.REGIONS ) /Regions/Region[Name= EUROPE ], $nat := db2-fn:xmlcolumn( NATIONS.NATIONS ) /Nations/Nation[xs:double(RegionKey)=$reg/@key/xs:double(.)], $sup := db2-fn:xmlcolumn( SUPPLIERS.SUPPLIERS ) /Suppliers/Supplier[$nat/@key/xs:double(.) = xs:double(nationkey)], $partsupp := db2-fn:xmlcolumn( PARTS.PARTS ) /Parts/Part[xs:double(PartKey)=$part/../PartKey/xs:double(.)]/PartSupp[ ($sup/@key/xs:double(.)=xs:double(suppkey)) and (SupplyCost :varies) ] return min($partsupp/supplycost)) order by $sup/acctbal descending, $nat/name, $sup/name, $part/../partkey return <final> {$sup/acctbal} {$sup/name} {$nat/name} {$part/../partkey} {$part/../mfgr} {$part/supplycost} {$sup/address} {$sup/phone} {$sup/comment} </final> Figure 3.20: XQuery template for TPCH X (QTX2)

66 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS Selectivity Computation Problems As pointed out earlier, selectivity estimation in DB2 XML seemed to be aberrant when the constants used in the predicate were negative numbers or of the string data type. An example for both of them is presented in this section Case: Negative Numbers An extract of the Selectivity log in the case of negative numbers is given in Figure 3.21(a). The XQuery template for which the selectivity log was obtained is shown in Figure 3.22(a) and this template corresponds to the TPCH X database environment. It can be seen that in the Predicate Selectivity column, a constant of 0.26 appears in the first 5 rows. In fact this appears till the numbers in the Constant column of the Selectivity log are in the domain of negative values, and, changes to correct estimation values when the positive domain is entered. This seems to be an error since there are quite a few elements in the negative domain. This fact is confirmed by the observation that the errors in Predicate Selectivity estimation disappeared when the data values were linearly translated without affecting their distribution (a constant was added to all the data values, so that all values reside only in the positive domain). The Selectivity log for the linearly translated data is shown in Figure 3.21(b). Data sets corresponding to the case of negative numbers and to the linearly translated version can be downloaded from [17]. In fact, in a test scenario, where all the values of a certain PSX were situated in the negative domain, erroneous selectivity estimation was observed throughout the range. As a case in point, 100% selectivity was erroneously reported as 0.0% Case: String Data Type An extract of the Selectivity log for this case and its corresponding XQuery template are shown in Figures 3.21(c) and 3.22(b), respectively. This particular scenario was stumbled upon in the TPoX database environment. In this case, the Predicate selectivity estimation values, as seen in the Predicate Selectivity column in Figure 3.21(c) were explicitly verified with the statistical summaries obtained through db2cat and were found to be inconsistent.

67 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 56 Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (a) Selectivity Log: Negative Numbers Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Selectivity % % % % % % % % % % (b) Selectivity Log: Positive Values Picasso Constant Predicate Relative Absolute Plan Selectivity Selectivity Difference Difference Sel AAKSOCX % ABGOBJVG % ABXSKUN % ACGYEIN % ACVFJ % WMQFUMK % WPOUFSO % WTDAICW % XMATHH % ZAUTV % (c) Selectivity Log: String Data Type Figure 3.21: Selectivity Log Errors

68 CHAPTER 3. ANALYSIS OF HYBRID OPTIMIZERS 57 XQUERY for $x in db2-fn:xmlcolumn( CUSTOMERS.CUSTOMERS ) /Customers/Customer/AcctBal where $x :varies return $x (a) XQuery template: Negative Numbers XQUERY declare default element namespace ; declare namespace c= ; for $x in db2-fn:xmlcolumn( ORDER.ODOC ) /FIXML/Order/Instrmt/@Sym where $x :varies return $x (b) XQuery template: String Data Type Figure 3.22: XQuery Templates for Selectivity error

69 Chapter 4 Plan Coloring based on Difference With the ever increasing complexity of queries, there is an increase in the number and intricacy of plans in the plan diagrams. This holds true for both relational and XML queries. So, the analysis of the plan diagrams is certainly not easy. In order to alleviate this problem to a certain extent, we have attempted to color the plan diagrams such that the difference in colors between any two plans reflects the differences in their underlying tree structure. Given m number of plans, along with their plan trees, we first describe a basic metric (Jaccard metric) that attempts to quantitatively describe the semantic difference between any two plans. We go on to show how this metric might perform undesirably in certain cases and then describe an extended version of the same, adopted from the plan-differencing operator of Picasso, which is also briefly described in [19]. Further, given any plan distance metric, we explicate our techniques for semantically coloring plan diagrams as described above. It is important to note that these techniques are metric-agnostic and can be used in conjunction with any metric. Finally, we provide experimental results obtained by employing the described techniques. We also provided analysis on the quality of coloring obtained in each of the plan diagrams. 4.1 Formal Definition Consider an undirected complete graph G(V, E) with m vertices, where each vertex v V represents a POSP (Parametric Optimal Set of Plans) plan. A vertex v has an associated weight w v 58

70 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 59 in the range (0, 1], representing the fractional volume covered by the plan in the plan diagram. Further, each edge e ij E has an associated weight w eij in (0, 1], representing the distance between the plans at its vertices v i and v j. The objective now is to come up with an efficient algorithm that assigns a coloring to all the vertices in V such that, given any pair of vertices (v i, v j ), the color distance between them has as close a correspondence as possible to the weight w eij of the edge joining these vertices. Further, if a choice has to be made, vertices with larger weights receive preferential treatment in obtaining a more accurate coloring. That is, our aim is to minimize the objective function - m 1 m v i =1 v j =v i +1 w eij (P landist(v i, v j ) ColorDist(v i, v j )) 2 where P landist(v i, v j ) is the distance between the plans at its vertices v i and v j and ColorDist(v i, v j ) is the distance assigned to the points v i and v j in color space by our technique. 4.2 Plan Distance Metrics One of the first metrics we experimented with was the Jaccard metric, alternatively known as Jaccard Distance/Dissimilarity. It is a function J defined as J(A, B) = 1 A B A B where A and B are sets, which in our context correspond to the sets of nodes of the trees of any two plans. Specifically, if T i and T j are sets of nodes of two plan trees (all the nodes of a plan tree form its set) corresponding to the vertices (plans) v i and v j, respectively, then the Jaccard distance between the vertices v i and v j would be J(T i, T j ) = 1 T i T j T i T j To give an example, consider two plan trees of vertices (plans) v 1 and v j shown in Figures 4.1(a) and 4.1(b), respectively. The node sets corresponding to these trees will be T 1 = {RETURN, JOIN A, JOIN B, REL A, REL B, REL C, REL D} and T 2 = {RETURN, JOIN C, JOIN B, REL A, REL B, REL C, REL D},

71 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 60 (a) T 1 (b) T 2 Figure 4.1: Trees for Jaccard metric respectively. So, the Jaccard distance between these two vertices would be, J(T i, T j ) = 1 T i T j T i T j = 1 6/8 = 0.25 This low value indicates that these two plans are very similar to each other. But by a mere glance at the plan trees, we can observe that the join orders are totally different and so are the structures of the plan trees. This is an inherent weakness of the metric as a result of its set based similarity computation, which does not take into account the position of the node in the tree and other nodes in its vicinity. Due to this drawback, we adopted the plan tree differencing strategy from the plan-differencing operator of Picasso, which is briefly described in [19]. There are a few additional details which are necessary here. In [19], parameter differences between operators are not considered i.e., if two nodes have the same name, same input relations and nodes, but have different parameters passed to them, then they are essentially considered to be the same. In the original plan-differencing operator, a little more importance is given to these differences because, there are existence of plans which are exactly the same but for these kind of differences. Ignoring these differences will be equivalent to coloring the plan diagrams with fewer number number of plans than what is present (plan diagram at operator level). Hence in order to accurately color the original parameter level plan diagram, taking into account all differences, big and small, we assign a small dissimilarity value, ɛ to nodes which differ in just their parameters. Using this method described, we calculated the plan distances between every two plans resulting in a matrix of dissimilarities. It is important to note that the value of

72 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 61 dissimilarity between any two vertices, as given by this metric is always in the range (0, 1], as required. We now go on to describe the techniques employed to convert the plan distances to distances in color space. 4.3 Multi-Dimensional Scaling As described in [46], Multidimensional scaling (MDS) is a method that represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points of a low-dimensional multidimensional space. The classifications of MDS are provided in [58, 43]. Since our input data is at the interval level, Metric multidimensional scaling is the technique suitable for our purposes. In metric multidimensional scaling, one of the very popular strategies is the Stress Majorization technique using the SMACOF (Scaling by majorizing a complicated function) algorithm [46]. Due to reasons which are brought to light later in this chapter, we could not use this technique successfully even though it has guarantees on, and rate of, convergence. Instead, we used a Non-metric MDS technique, which is Kruskal s Iterative Steepest Descent [47, 48] approach and slightly modified it to achieve our goal. The Steepest Descent approach is as outlined below Kruksal s Steepest Descent The basic algorithm is same as the standard Kruskal s Steepest Descent. Given a function f for which the minima needs to be determined, we initially, assign a random point as our minima to kick start the iterative process. In any iteration, we determine the direction of movement from the current point to the true minima by computing the slope of f at the current point. We start moving in the direction of the negative gradient by a chosen step size h. We continue this iterative process till it either converges or a user-given convergence criteria is met. Choosing the step size (h) carefully is important for the algorithm to converge quickly and also to monotonically decrease the objective function in every iteration. Also, starting at a bad

73 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 62 random value might end up in a local minima which is nowhere close to the global minima. To reduce the possibility of this happening, it is recommended to run the method on a given input with a variety of random positions [48]. We have adopted the recommendation of optimizing with many random inputs and have also carefully chosen h as elaborated in the next section. 4.4 Modelling our coloring problem as an MDS problem In the previous section we introduced MDS and briefly described one of the methods. Here, we will elaborate on viewing our problem in the light of MDS. One of the first design considerations before employing MDS is the number of dimensions in the output space. Since our problem, as defined in 4.1 is to convert the dissimilarities into equivalent distances in color, we have the color space as our output. This being the case, we have two choices to make, one the RGB (Red, Green and Blue) color space, and the other, the CMYK (Cyan, Magenta, Yellow, Key - Black) model. If we choose the CMYK model, when the percentage of the Key (Black) color is 100%, no change in any color by any amount will make any difference to the final color. This will make a part of the output space redundant and the colors of the vertices placed in that area by the MDS program will not reflect the dissimilarity between them. So, to prevent such situations, we chose the RGB model which makes our output space three dimensional. Thus, we have the mapping g : D ([0, 1] 3 ) n where g represents the scaling function, D stands for the set of all symmetric matrices of the type [0 1] n n (i.e., the set of all dissimilarity matrices), one of which is given to us as input, n represents the number of vertices or equivalently, plans in the plan diagram and the three coordinates of the (0, 1] 3 space represent the colors Red, Green and Blue, respectively. The next design issue to be dealt is the variant of MDS to use - Metric or Non-metric, and after choosing one of them, the specific technique to be employed. As already mentioned in 4.3, we have chosen a Non-Metric MDS technique, the Steepest Descent approach, even though our input dissimilarity data is at the interval level for reasons elaborated as follows. We know that all input dissimilarity values are in the range (0, 1] and that each of the color

4. But the output of the SMACOF algorithm is in R 3.

74 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 63 Figure 4.2: Output Color Space values (Red, Green, Blue) are represented in the range [0, 1]. The latter part necessitates every point in the output of our MDS program to lie in the [0, 1] 3 range, thus making the output space into an unit cube as shown in the Figure 4.4. But the output of the SMACOF algorithm is in R 3. Even after translating the output towards the origin, many points stayed outside the unit cube and scaling down the values into the unit cube lead to deterioration in the quality of coloring. Because of these reasons, we could not employ the SMACOF algorithm and a quick analysis of the Steepest Descent algorithm led to the observation that it can easily be modified to accommodate our constraint. Hence we adopted Steepest Descent method even though it is meant to be a Non-metric MDS technique. The slight modification done is discussed further in this section. There is another important feature specific to our coloring problem. As already mentioned, out output space is an unit cube with maximum and minimum values possible being (0.0, 0.0, 0.0) and (1.0, 1.0, 1.0), respectively. The maximum possible distance between any two points in any cube would be along its diagonal, which is shown by a red line joining the vertices A and H in Figure 4.4. Hence in our unit cube, the maximum possible distance is 3. But the maximum possible distance in our input space is limited to 1. This would prove to be a handicap since we would be confined to a smaller color space and thus may lead to inaccurate representation of distances in the color space. To overcome this drawback, we scaled up all the dissimilarities by a factor of 3.

75 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 64 Figure 4.3: Placement of Plans It should also be noted that it might not be possible to get a zero objective value or in other words, ideal coloring using this technique or otherwise. This is because it may not be possible to embed all the plans in 3D space such that the distances between them satisfy their respective dissimilarities. To give an example, consider the following dissimilarity matrix for 5 plans - P1 to P In 3D space, we can have at most 4 points (P1 - P4), to be at equal distances (0.5) from each other as shown in the Figure 4.4. There is no means to place the fifth point P5, such that it is at a distance of 0.5 from every point (P1 - P4) so as to obtain ideal coloring. This example clearly shows the infeasibility of obtaining an optimal solution, no matter what the technique Steepest Descent adopted to our coloring problem The input to the Steepest Descent algorithm would be a matrix of dissimilarities, corresponding to every pair of plans. We then compute the following steps iteratively till we meet our conver-

76 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 65 gence criteria. Our convergence criteria is that the change in the value of the objective function between two iterations is sufficiently low (< 1.0E 8 ) and the change in position of the vertices (plans) between the same two iterations is not more than a certain value (< 1.0E 6 ). Computing Slopes. This is done by taking the partial derivative of our objective function at every every co-ordinate of every vertex and by adding them up. If any of the co-ordinates of any point is near the border of the unit cube (our output space), then we make the corresponding slope to be zero so that there is no movement along these co-ordinates. Step Size. After the gradient is computed, we need to calculate the step size in order to move the vertices towards attaining the minima of our objective function. The step size is the minimum of the maximum distance any co-ordinate of any vertex can move inside the cube, in the direction of the negative slope. Move the vertices. We move the vertices along their negative gradient with the step size decided. If the objective value increases, then we halve the step size and again move the vertices. We continue this procedure till a suitable step size is found or no move can be made in this iteration. 4.5 Experimental Results and Analysis Till now we dealt with the theoretical aspects of Multi-dimensional techniques, particularly the Steepest Descent method of Kruskal. Now, let us look at the coloring of plan diagrams with the application of these methods. Our test bed consists of executing the Picasso Database Optimizer Visualizer and the database engines on an Intel Core 2 Duo machine having a clock speed of 2.10 GHz (T8100), with 4GB RAM (3GB usable), running a copy of Windows 7 Professional (32 bit). We have taken queries from the TPCH benchmark and modified them slightly to make them into query templates. We have currently worked with only two dimensional query templates. We generated MDS colored plan diagrams for these query templates on three optimizers, anonymously referred to as Opt A, Opt B and Opt C 1. Experiments are conducted by 1 Note for review purposes only: These three optimizers are IBM DB2 9.7, Oracle 11g and Microsoft SQL Server 2008, respectively.

77 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 66 substituting w ei j in the equation of Section 4.1 with A i + A j, where A i and A j are the areas of the plans i and j - given a choice, the plans will greater area will be treated better than the rest. 1, indicating that all plans are given equal weightage. 1, when we want the plans with relatively lesser area to be colored better than the A i + A j plans with larger area. We elaborate on the first case, and for other cases, we mention only cases we found to be interesting. Also, we provide details about the MDS colored plan diagrams and statistics about the quality of coloring obtained for query template 2 in Subsection 4.5.1, and in the following two subsections, we briefly described the characteristics of the coloring obtained for query templates 8 and 9, and provide details wherever necessary as similar analysis follows TPCH QT2 The plan diagrams for query template 2 of the TPCH benchmark, colored using MDS are shown as shown in Figures 4.4(a), 4.4(b) and 4.4(c) for the optimizers Opt A, Opt B and Opt C, respectively. In the process of analysing the plans in these plan diagrams, let us first look at the distribution of Jaccard dissimilarities between plans for all optimizers. A graph of the same is shown in Figure 4.5. It is very evident that Opt C, along with having a large number of plans (80) has a very conspicuous peak at the point where the Jaccard dissimilarity is around 1.5, indicating that with our adopted metric, most plans in Opt C are highly dissimilar from each other. This is in contrast to Opt A, which not only has very few plans (15) in comparison, but also the maximum dissimilarity between any two plans is around 07, which is less than half the maximum dissimilarity in Opt A. Also, the distribution of dissimilarities is more or less uniform in the range (0.0, 0.7). Opt B has a moderate number of plans (29), and a slight peak at Jaccard dissimilarity of 1.2. The maximum Jaccard distance observed in Opt B is Now, with these numbers in

78 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 67 mind, it would not be wrong to expect a plan diagram which is largely differently colored in the case of Opt C and with fewer colors for both Optimizers A and B. If we look at Figure 4.4(a), the plan diagram for Opt A, the coloring approximately matches our prediction, with shades of maroon in the upper half of the diagram and blue in the bottom half. The maximum color distance of is seen between plans P9 and P14 with (Jaccard dissimilarity is 0.727). These plans correspond to the dark blue in the centre of the bottom half and a thin vertical bar of light orange in the leftmost part of the central patch. The maximum Jaccard dissimilarity corresponds to P3 and P10 (0.794), which are the light blue squarish patch in the centre of the bottom part and the dark orange vertical strip at the leftmost part of the upper half. The color dissimilarity corresponding to this pair is Further, looking at the plan diagram for Opt B, we can see that a large part of the diagram is colored with the blue-green shade. But when we look closely, we can observe that there are a large number of plans varying widely in their colors near the axes. In fact, the biggest color difference is 1.732, which is the highest possible, corresponding to one of the diagonals in the cube. This is between P7 and P18 (Jaccard dissimilarity being 1.544), which in the plan diagram are the bright green patch near the Y-axis and the bright pink near the X-axis. The maximum Jaccard distance also happens to be in this case, and is shared by more than a pair of plans. But all pairs are not situated on opposite sides of the diagonal in the cube. Finally, in the case of Opt C, where we expected a plan diagram with large color variations, looking at Figure 4.4(c) is quite surprising. A large part of the plan diagram has shades of red along with orange yellow and pink. The explanation for this is that a large area of the diagram is occupied by very few plans, this being evident by the high Gini co-efficient of Quite a few plans are found concentrated near the axes, and these plans do have colors which are far apart in color space. We have plans P15 and P23, colored white and black, respectively, both being thin horizontal strips near the parallel to the X-axis. This would be one of the diagonals in our cube. Also, there is the P20 and P50 pair having bright green and pink, respectively which are situated in the ends of another diagonal in the cube. The accuracy of the interpretations that can be drawn from the MDS colored plan diagrams does not seem far from reality when we look at the statistics of the generated coloring in Ta-

79 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 68 (a) Opt A (b) Opt B (c) Opt C Figure 4.4: Picasso Diagrams for QT2 (X-Axis: PART P RETAILPRICE, Y-Axis: PARTSUPP PS SUPPLYCOST)

80 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 69 Figure 4.5: Jaccard distances in QT2 ble In the following discussion, we refer to pairs of plans being treated well, and by that we mean that the ratio of the color distance between any two plans to the Jaccard dissimilarity between them is between 0.7 and 1.3. (error tolerance level is 30%). In the case of minimizing our objective function with weights, each of the plans is associated with the area covered by the plan in the plan diagram. Hence, it is perhaps reasonable to look at the weighted set of plans pairs treated well. The third column % of points pairs treated well gives gives the sum of weights of the set of plan pairs treated well divided by the total sum of weights of the plan pairs. Naturally, like the weights assigned in the objective function, the weight given to a plan pair is the sum of the areas of the two plans. Note that, when each point in the plan diagram represents a different plan, % of point pairs treated well is equal to % of plan pairs treated well.

81 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 70 Optimizer No of % of point pairs % of plan pairs Lowest Highest Objective Objective plans treated well treated well Ratio ratio with weights w/o weights A % 86.67% E 3 4.9E 3 B % 89.4% E C % 67.94% Table 4.1: Statistics for QT2 with weights From the table we can see that, in the case of all three optimizers, more than 90% of the point pairs are colored within our error threshold. The number of plan pairs treated well is above 85% for Opt A and B and is slightly less, around 67% for Opt C. The reason for this low percentage is that we have tried to optimize the objective function with weights being the areas of the plans. Because of this, plans with larger areas tend to be colored well, and since we have a high skew (Gini Co-efficient of 0.79), a large part of the plan diagram will be occupied by very few plans, which will be colored well at the cost of many other plans (which occupy very less area) not being colored within our error tolerance range. Also, one can notice the increase in the above two percentages as the value of the objective function decreases. Even though our objective function directly does not optimize the % of point pairs treated well or the % of plan pairs treated well, it has the necessary effect in our situation. The objective function value calculated without weights is more than the value obtained when it is calculated with weights as we can see in columns 7 and 8. This is obvious since the optimization was carried out with weights. The columns highest ratio and lowest ratio provide information about the worst case performance of our technique. Since there is no bound on how badly the technique can perform, the ratios can get as bad as possible, like in the case of Opt C, the lowest ratio is as low as 5.60E TPCH QT8 Similar to the previous Subsection 4.5.1, MDS colored plan diagrams and statistics on coloring for the three optimizers, Opt A, Opt B and Opt C for the case of query template 8 of the TPCH

82 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 71 Optimizer No of % of point pairs % of plan pairs Lowest Highest Objective Objective plans treated well treated well Ratio ratio with weights w/o weights A % 77.17% E B % 71.54% C % 60.17% Table 4.2: Statistics for QT8 with weights benchmark are given in Figure 4.6 and Table 4.2, respectively. The graph of Jaccard distances vs frequency of occurrence of Jaccard distances is also provided in Figure 4.7. For this particular query template, the three optimizers, Opt A, B and C have 40, 23 and 128 plans, respectively. Similar to what we saw in the previous discussion, Opt C, having a relatively large number of plans (128), has a rather flat peak in the dissimilarity range [1, 1.5] indicating that there a large number of plans which are highly dissimilar from each other. Opt B, with a low number of plans (23), has a relatively flat distribution with no apparent peaks and has a Jaccard dissimilarity range of (0, 1.4]. Opt A, like in the previous case has a smaller dissimilarity range than the other two (0, 1.1]. It is interesting to see that even though the number of plans in Opt A has increased and is now more than that of Opt B (unlike the previous case), the dissimilarity range is still lower than that of Opt B. This shows that larger number of plans does not imply higher dissimilarity between plans. The maximum color distance achieved in all three engines is slightly higher than the corresponding maximum Jaccard distance in each of them. In Opt A, the maximum color distance is 1.20, and maximum Jaccard distance is Both these distances are between P1 and P23, P1 being the dark green colored plan at the top right and P23 being the slightly purple patch near the Y-axis. In Opt B, the maximum color and Jaccard distances are 1.63 and 1.44, between plans (P1, P15) and (P5, P16), respectively. P1 is the bright red plan, P5 is the dark brown patch at the top right, right next to P1. P15 is the bright green vertical bar parallel and very close to the Y-axis and P16 is the white patch near the origin, close to P15. The maximum color and Jaccard distances in Opt C are 1.73 and 1.70 corresponding to plan pairs (P117, P22) and (P122, P31), respectively. Plans P117 and P122 occupy too small a area to be noticed by the eye in the plan diagram. The objective value achieved in the case of all three optimizers is sufficiently low to indicate

83 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 72 (a) Opt A (b) Opt B (c) Opt C Figure 4.6: Picasso Diagrams for QT8 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: LINEITEM L EXTENDEDPRICE)

84 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 73 Figure 4.7: Jaccard distances in QT8 that the plan diagram is reasonably well colored. Further, to justify this claim we can see that the third column which is % of point pairs treated well, has more than 85% of the point pairs being colored within our error threshold. The fourth column which refers to % of plan pairs treated well has relatively low percentages. This can be attributed to two reasons. One being that we are optimizing with weights and hence it is not expected that it performs very well with respect to plan pairs, and the other reason is that it could also be the result of a weak plan distance metric. The highest and lowest ratios, as explained previously are unbounded and hence can be arbitrarily high or low. These are tabulated in columns 5 and 6, respectively.

85 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 74 Optimizer No of % of point pairs % of plan pairs Lowest Highest Objective Objective plans treated well treated well Ratio ratio with weights w/o weights A % 62.7% 7.0E B % 72.82% 6.6E C % 66.33% 6.5E Table 4.3: Statistics for QT9 with weights TPCH QT9 MDS colored plan diagrams for QT9 for all three optimizers are given in Figure 4.8. A plot of Jaccard distances vs frequency of occurences and the table of statistics which reflects the performance of MDS on these plan diagrams are also given in Figure 4.9 and Table 4.3, respectively. Inferences similar to the previous two subsections can be drawn regarding the coloring of plan diagrams, the jaccard vs frequency plot and the performance statistics of the coloring obtained XML plan diagrams We now present MDS colored plan diagrams for the XML benchmarks. The production and analysis of these diagrams were described in Chapter 3. Figures 4.10(a), 4.10(c) and 4.10(e) are the MDS colored plan diagrams of TPoX, XBench and TPCH X benchmarks, respecctively. A plot of the Jaccard distances vs frequency for all these three benchmarks is given in Figure Finally, the statistics regarding the quality of coloring obtained in all three benchmarks is tabulated in Table If we look at the Jaccard distances vs frequency plot in Figure 4.11, we can see that the plot for the XBench benchmark has highly conspicuous peaks across its dissmilarity range. The XBench benchmark has the highest number of plans, and the highest maximum Jaccard distance (1.066) among the three benchmarks. TPCH X, on the other hand has the lowest maximum Jaccard distance of The plot for TPoX has a break in the range , with no pair of plans having dissimilarities in this range.

86 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 75 (a) Opt A (b) Opt B (c) Opt C Figure 4.8: Picasso Diagrams for QT9 (X-Axis: SUPPLIER S ACCTBAL, Y-Axis: PARTSUPP PS SUPPLYCOST)

87 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 76 Figure 4.9: Jaccard distances in QT9 The coloring of the plan diagram of TPCH X reflects its short dissimilarity range, with similar colors throughout the diagram. The biggest color distance is between plans P14 and P21 (0.53), which also have the maximum Jaccard dissimilarity between them (0.48). The location of these plans are hard to point out since they occupy very less area. In the case of TPoX, it is easy to observe that more than 96% of the plan diagram is occupied by the top two plans which are colored green and black. In the plan diagram corresponding to XBench, the maximum Jaccard dissimilarity is between plans P2 and P8, corresponding to the purple sine wave like pattern in middle of the plan diagram and the bottom dirty yellow horizontal patch, respectively. The maximum color distance is between plans P27 and P42 (1.12), which also happen to be another pair having the maximum Jaccard dissimialrity of 1.06 in the diagram. Finally, the statistics for the coloring obtained is as given in Table We have achieved

88 CHAPTER 4. PLAN COLORING BASED ON DIFFERENCE 77 (a) TPoX (c) XBench (e) TPCH X Figure 4.10: Picasso Diagrams for XML benchmarks

Analyzing the Behavior of a Distributed Database. Query Optimizer. Deepali Nemade

Analyzing the Behavior of a Distributed Database Query Optimizer A Project Report Submitted in partial fulfilment of the requirements for the Degree of Master of Engineering in Computer Science and Engineering