The Use of Statistics in Semantic Query Optimisation


Ayla Sayli (saylia@essex.ac.uk) and Barry Lowden (lowdb@essex.ac.uk)
University of Essex, Dept. of Computer Science, Wivenhoe Park, Colchester, CO4 3SQ, Essex, UK

Abstract

An important aspect of semantic query optimisation is automatic rule derivation. The derived rules are used to make query processing more intelligent. However, the set of generated rules may become very large, and some rules in the set may not be useful. This is normally referred to as the utility problem. Our paper is concerned with limiting the rules set by using the chi-square test from statistics to determine the degree of relationship between the antecedent attribute of a candidate rule and its consequent attribute. We construct the chi-square table from the condition on the antecedent attribute of the rule and the condition on its consequent attribute. If the rule does not have a strong relationship degree, it is added to a secondary rules set, which can be used as a filter to avoid deriving similar weak rules for future queries. Otherwise, the rule is added to a primary rules set, which contains all strong rules. For large databases, it is hoped that this additional test can reduce the size of the rules set for semantic query optimisation.

1 Introduction

Semantic Query Optimisation (SQO) is a comparatively recent approach which transforms a query into an alternative query that has the same answer but can be processed more efficiently. The main difference between the semantic approach and other optimisation approaches is its use of rules during query optimisation [Graefe and Dewitt, 1987; King, 1981]. Moreover, these rules are derived automatically for the given query when they are needed. Derivation and use of rules makes SQO more intelligent and less expensive.
Using any of a number of approaches to SQO, it is possible to derive all rules from a given query and database. The approaches may be classified as heuristic-based systems [Siegel et al., 1992], logic-based systems [Chakravarthy et al., 1990], graph-based systems [Shenoy and Ozsoyoglu, 1989] and data-driven systems [Hsu and Knoblock, 1994; Lowden et al., 1995; Shekhar et al., 1993]. However, a large rules set remains a problem in all the existing systems, since rules are produced automatically regardless of how effective they might be in the query transformation process [Chan and Wong, 1991; Han et al., 1993; Piatetsky-Shapiro and Matheus, 1993; Savnik and Flach, 1993; Ziarko, 1991]. This is known as the utility problem. For this reason we suggest using the chi-square test from statistics to measure the relationship degree of a Candidate Rule (CR) for a given confidence level and degree of freedom [Chan and Wong, 1991]. If the relationship degree of the CR is greater than or equal to the decision value of the test, we store it in a primary rules set as a strong rule. Otherwise it is placed in a secondary rules set as a weak rule.

The remainder of this paper is structured as follows. Our approach to SQO with statistical additions is described in section 2, and section 2.1 presents automatic rule derivation for unmatched CRs using the chi-square test. The test is given in more detail in section 3. In section 4, we demonstrate the applicability of the test using a data-driven system approach [Lowden et al., 1995]. Finally, section 5 gives the results of three experiments using the system.

2 Semantic Query Optimisation

In general, SQO takes a given query in a query language such as SQL, which is the language used in this paper. First, it adopts one of the approaches [Chakravarthy et al., 1990; Lowden et al., 1995; Shenoy and Ozsoyoglu, 1989; Siegel et al., 1992] to generate a CR from the represented query and a given database.
Secondly, a check is made to see whether there is a rule in the rules set which matches the CR; a matching rule may be used to derive an alternative query. Otherwise the unmatched CR goes into the rule derivation process for future queries. In this paper we use two rules sets, a primary rules set and a secondary rules set. We first check the CR against the primary rules set. If there is no matching rule, we examine the secondary rules set to see whether a weak rule matches the CR. If the CR is unmatched in both rules sets, it can be used to derive a new rule for future queries (see section 3). Thirdly, when a matching rule is found, SQO transforms the query into an alternative query according to the standard transformation rules of Constraint INtroduction (CIN) and Constraint REmoval (CRE) [Chakravarthy et al., 1990; Graefe and Dewitt, 1987; King, 1981; Shenoy and Ozsoyoglu, 1989; Siegel et al., 1992]. Finally, when all alternative queries have been found together with their transformation costs, the optimum query is selected. A matching rule is more likely to be found in the primary rules set, since this set contains the strongest rules for the given database. The system is illustrated in Figure 1.
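The two-level lookup described above might be sketched as follows. This is only an illustration under stated assumptions: the `Rule` class, the `match_candidate` function and its three return labels are hypothetical names, and "matching" is reduced to exact equality of the antecedent and consequent conditions, whereas a real optimiser would compare conditions semantically. Treating a match in the secondary rules set as "skip re-derivation" follows the filter role the paper gives to that set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule or candidate rule of the form antecedent -> consequent."""
    antecedent: str  # e.g. "dcode = 'MKTG'"
    consequent: str  # e.g. "dname = 'Marketing'"


def match_candidate(cr, primary, secondary):
    """Decide how a candidate rule (CR) is handled.

    'transform'  : a strong rule matches, so the query can be transformed
    'known_weak' : a weak rule matches, so the CR is not derived again
    'derive'     : unmatched in both sets, so a new rule is derived
    Matching is plain equality here, which is a simplifying assumption.
    """
    if cr in primary:      # the primary rules set is checked first
        return "transform"
    if cr in secondary:    # then the secondary rules set
        return "known_weak"
    return "derive"


# Hypothetical contents of the two rules sets.
primary = {Rule("dcode = 'MKTG'", "dname = 'Marketing'")}
secondary = {Rule("dname = 'Marketing'", "project >= 2")}

strong_cr = Rule("dcode = 'MKTG'", "dname = 'Marketing'")
weak_cr = Rule("dname = 'Marketing'", "project >= 2")
new_cr = Rule("dcode = 'PRSN'", "dname = 'Personnel'")

for cr in (strong_cr, weak_cr, new_cr):
    print(cr.antecedent, "->", cr.consequent, ":",
          match_candidate(cr, primary, secondary))
```

Keeping both sets as hash sets makes each lookup constant time, so the extra check against the secondary set adds little to the matching step.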

Figure 1. Semantic Query Optimiser (components: presenting a given query in SQL; producing CRs of the given query by one of the approaches in SQO; matching CRs against the primary rules set; automatic rule derivation; database schema; transforming the given query by matching rules using CIN or CRE and estimating the costs of the transformations; optimum query).

Figure 2. Automatic Rule Derivation (components: unmatched CRs and costs of CRs; rule derivation processing by one of the approaches for SQO; chi-square test, chi-square value >= decision value; database schema; primary rules set; secondary rules set; system catalog, consulted if the rule contains indexed attributes).

2.1 Automatic Rule Derivation

As mentioned before, an important aspect of SQO is that it derives its own rules automatically when needed. For example, the process of automatic rule derivation in heuristic-based systems [Schkolnick and Tiberio, 1985; Siegel et al., 1992] consists of four modules: defining rule characteristics, selecting requested rules, query generation, and rule management. However, the set of these rules contains non-useful rules and may become very large. Non-useful rules make query processing in SQO inefficient and slow. Moreover, the system may have to perform a large number of comparisons, which makes SQO relatively expensive. For these reasons, our suggested approach is to use the chi-square test to calculate a relationship degree for each new rule using a given confidence level and degree of freedom. If the calculated relationship degree of the new rule is greater than or equal to the decision value in the list of chi-square decision values, the new rule is entered into the primary rules set; otherwise it is entered into the secondary rules set. In contrast to CHAID [Kass, 1980], our method of using the chi-square test does not split the rules set into two sets against a given database. Our method uses the conditions of the CR to construct the chi-square table in order to measure the relationship degree of each rule, not the relationship degree of its attribute classes.
In the case of categorising dependent variables in very large databases, CHAID may be very useful in future work. Our system incorporating the chi-square statistical test is illustrated in Figure 2.

3 Analysing the Rules Set by Chi-square Test

As mentioned earlier, if CRs are kept in rules sets according to their reliability in the database, it is possible to limit the number of rules in the sets [Chan and Wong, 1991; Imam et al., 1993; Piatetsky-Shapiro and Matheus, 1993; Siegel et al., 1992]. We now explain, step by step, how we use the chi-square test in the rule derivation process, and then give examples of the use of the test in SQO in section 4.

3.1 Chi-square Test on the Rule Derivation Process

Assume that an unmatched CR is $A\,\theta_1\,a \rightarrow B\,\theta_2\,b$, where R is a relation of a given database, A in the relation R is the antecedent attribute of the rule, B in the relation R is the consequent attribute of the rule, $\theta_1$ is the condition on the antecedent attribute (A), $\theta_2$ is the condition on the consequent attribute (B), and a and b are constants. It is important to determine whether this rule should be in the primary rules set. The first step of the test is to arrange a chi-square table as follows:

i) Rows and columns in the table are represented as below:
- The first row characterises occurrences of the condition ($\theta_1$) on the antecedent attribute (A) in the database, $(A\,\theta_1\,a)$.
- The second row characterises occurrences of the negation of the condition ($\theta_1$) on the antecedent attribute (A) in the database, $\neg(A\,\theta_1\,a)$.
- The first column characterises occurrences of the condition ($\theta_2$) on the consequent attribute (B), $(B\,\theta_2\,b)$.
- The second column characterises occurrences of the negation of the condition ($\theta_2$) on the consequent attribute (B), $\neg(B\,\theta_2\,b)$.

ii) Cell values of the table are the counted numbers of occurrences according to the conditions on the attributes. For example, the first cell value, $X_{11}$, is the number of occurrences of $B\,\theta_2\,b$ where $A\,\theta_1\,a$ is true; the second cell value, $X_{12}$, is the number of occurrences of $\neg(B\,\theta_2\,b)$ where $A\,\theta_1\,a$ is true; and so on.

iii) The values in the last row and the last column of Table 1 are the totals of the corresponding column and row values. For example, $TR_1$ is the total of all values in row 1 ($TR_1 = X_{11} + X_{12}$), $TR_2$ is the total of all values in row 2 ($TR_2 = X_{21} + X_{22}$), and so on. T is the total of all values in the chi-square table ($T = X_{11} + X_{12} + X_{21} + X_{22}$).

Table 1 is the table of the chi-square test showing all the symbols used.

Chi-square Table            $B\,\theta_2\,b$    $\neg(B\,\theta_2\,b)$    TR
$A\,\theta_1\,a$            $X_{11}$            $X_{12}$                  $TR_1$
$\neg(A\,\theta_1\,a)$      $X_{21}$            $X_{22}$                  $TR_2$
TC                          $TC_1$              $TC_2$                    $T$

Table 1. Table of the chi-square test

The second step is to measure the relationship degree of the rule, which can be found using the values of Table 1 and the chi-square formula from statistics.
This formula, (1), is given as:

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(X_{ij} - Y_{ij})^2}{Y_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(X_{ij})^2}{Y_{ij}} - T, \qquad (m \geq 2,\ n \geq 2) \tag{1}$$

where $Y_{ij} = (TR_i * TC_j) / T$, $TR_i = \sum_{j=1}^{m} X_{ij}$ and $TC_j = \sum_{i=1}^{n} X_{ij}$.

For a 2*2 chi-square table, a shorter formula can be used instead of (1). This formula, (2), is given below:

$$\chi^2 = \frac{T * (X_{11} * X_{22} - X_{12} * X_{21})^2}{TC_1 * TC_2 * TR_1 * TR_2} \tag{2}$$

After the relationship degree has been determined, it is compared to the decision value of the chi-square test for a given confidence level and degree of freedom. The degree of freedom is equal to $v = (n-1)*(m-1) = (2-1)*(2-1) = 1$, where n is the number of rows in the table (n=2) and m is the number of columns in the table (m=2).

The final stage is to evaluate the CR. If the relationship degree of the rule is less than the decision value and the CR does not contain any indexed attributes, the CR may be excluded from the primary rules set directly, because it has a low relationship degree between the antecedent attribute and the consequent attribute of the rule. In other words, it is not likely to be relevant to future queries. However, the same weak rule may be derived again for future queries. Therefore, if we store this tested rule in the secondary rules set as a weak rule, we can avoid a subsequent derivation process. Another advantage of having the secondary rules set in SQO is that it limits the primary rules set, because all weak rules are stored in the secondary rules set. Otherwise, the CR is added to the primary rules set since it has a high relationship degree in the database; in other words, it is a promising rule for optimising future queries. Limiting the size of the primary rules set can be a solution to the utility problem in large databases.

4 Applicability of Chi-square Test on Automatic Rule Derivation for Semantic Query Optimisation

As mentioned before, the applicability of the test can be demonstrated using one of the current SQO approaches.
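Before turning to the examples, the test of section 3 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the counts in the usage lines are made up for illustration, and the decision value 6.635 is the standard chi-square critical value for one degree of freedom at the 99% confidence level.

```python
# Standard chi-square critical value for v = 1 at the 99% confidence level.
DECISION_VALUE_99_DF1 = 6.635


def chi_square(table):
    """General formula: sum over all cells of (X_ij - Y_ij)^2 / Y_ij.

    `table` is a list of rows of observed counts X_ij. Every row total and
    column total is assumed to be non-zero, otherwise Y_ij would be zero.
    """
    row_totals = [sum(row) for row in table]          # TR_i
    col_totals = [sum(col) for col in zip(*table)]    # TC_j
    total = sum(row_totals)                           # T
    value = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # Y_ij
            value += (observed - expected) ** 2 / expected
    return value


def chi_square_2x2(x11, x12, x21, x22):
    """Short formula for a 2*2 table (same totals must be non-zero)."""
    tr1, tr2 = x11 + x12, x21 + x22
    tc1, tc2 = x11 + x21, x12 + x22
    t = tr1 + tr2
    return t * (x11 * x22 - x12 * x21) ** 2 / (tc1 * tc2 * tr1 * tr2)


def classify(value, has_indexed_attribute,
             decision_value=DECISION_VALUE_99_DF1):
    """Final stage: a CR is weak only if its relationship degree is below
    the decision value and it contains no indexed attribute."""
    if value >= decision_value or has_indexed_attribute:
        return "primary"
    return "secondary"


# Illustrative (made-up) counts: X11=20, X12=5, X21=10, X22=15.
general = chi_square([[20, 5], [10, 15]])
short = chi_square_2x2(20, 5, 10, 15)
print(round(general, 3), round(short, 3))   # both forms agree: 8.333 8.333
print(classify(short, has_indexed_attribute=False))  # primary
```

For a 2*2 table the two functions return the same value, so the short form is simply the cheaper way to evaluate the statistic once the four cell counts are known.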
Our software is based on a data-driven approach [Lowden et al., 1995]. Our examples are based on a small database, the DEPARTMENT relation, which has 15 instances and 4 attributes (Dcode, Dname, Project and Manager), and we assume initially that the system does not have any rules in the primary rules set or in the secondary rules set. Dcode is an index attribute of the relation. The database is given in Table 2. In this section, examples are given to show how the chi-square test can be used to limit the rules set in the rule derivation process for SQO.

Dcode     Dname          Project   Manager
'ACCT'    'Accounting'   1         'ACCT01'
'ACCT'    'Accounting'   1         'ACCT02'
'ACCT'    'Accounting'   5         'ACCT01'
'MKTG'    'Marketing'    2         'MKTG01'
'MKTG'    'Marketing'    2         'MKTG04'
'MKTG'    'Marketing'    2         'MKTG05'
'MKTG'    'Marketing'    3         'MKTG02'
'MKTG'    'Marketing'    4         'MKTG03'
'MKTG'    'Marketing'    5         'MKTG01'
'MKTG'    'Marketing'    5         'MKTG04'
'MKTG'    'Marketing'    5         'MKTG05'
'MKTG'    'Marketing'    6         'MKTG02'
'MKTG'    'Marketing'    6         'MKTG03'
'PRSN'    'Personnel'    7         'PRSN01'
'PRSN'    'Personnel'    8         'PRSN02'

Table 2. The database of the DEPARTMENT relation

Assume that we are looking for dname where dcode = 'MKTG'. This query can be represented in SQL as:

select dname from DEPARTMENT where dcode = 'MKTG';

Using the data-driven approach, two rules will be derived: (i) dcode = 'MKTG' $\rightarrow$ dname = 'Marketing' and (ii) dname = 'Marketing' $\rightarrow$ project >= 2. We take the first rule (i) to show how the system works. Firstly, we construct the chi-square table of the rule using Table 2, where $A\,\theta_1\,a$ = (dcode = 'MKTG'), $\neg(A\,\theta_1\,a)$ = (dcode $\neq$ 'MKTG'), $B\,\theta_2\,b$ = (dname = 'Marketing') and $\neg(B\,\theta_2\,b)$ = (dname $\neq$ 'Marketing'). The table is as follows:

Chi-square Table            $B\,\theta_2\,b$    $\neg(B\,\theta_2\,b)$    TR
$A\,\theta_1\,a$            10                  0                         10
$\neg(A\,\theta_1\,a)$      0                   5                         5
TC                          10                  5                         15

Using formula (1) and the table, the relationship degree of the rule is found to be 15. This degree is greater than the decision value for the 99% confidence level and one degree of freedom (v = 1). Consequently, this rule should be added to the primary rules set as a strong rule. When the same process is performed on the second rule (ii), the relationship degree of the rule is found to be 4.6. This is less than the decision value, and the CR does not contain the index attribute. Therefore it is added to the secondary rules set.
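The two relationship degrees quoted above can be reproduced with a short script. It is a sketch under assumptions: the rows are transcribed from Table 2 without the Manager column, the Project values of the first two ACCT rows are taken to be 1 (any value below 2 gives the same counts), the helper names are hypothetical, and 6.635 is the standard chi-square critical value for v = 1 at the 99% confidence level.

```python
# (dcode, dname, project) for the 15 rows of Table 2; Manager is not needed.
# Assumption: the first two ACCT rows have project = 1.
ROWS = [
    ("ACCT", "Accounting", 1), ("ACCT", "Accounting", 1),
    ("ACCT", "Accounting", 5),
    ("MKTG", "Marketing", 2), ("MKTG", "Marketing", 2),
    ("MKTG", "Marketing", 2), ("MKTG", "Marketing", 3),
    ("MKTG", "Marketing", 4), ("MKTG", "Marketing", 5),
    ("MKTG", "Marketing", 5), ("MKTG", "Marketing", 5),
    ("MKTG", "Marketing", 6), ("MKTG", "Marketing", 6),
    ("PRSN", "Personnel", 7), ("PRSN", "Personnel", 8),
]

DECISION_VALUE = 6.635  # 99% confidence level, one degree of freedom


def cell_counts(rows, antecedent, consequent):
    """Count the four cells (X11, X12, X21, X22) of the chi-square table."""
    x11 = x12 = x21 = x22 = 0
    for row in rows:
        a, b = antecedent(row), consequent(row)
        if a and b:
            x11 += 1
        elif a:
            x12 += 1
        elif b:
            x21 += 1
        else:
            x22 += 1
    return x11, x12, x21, x22


def chi_square_2x2(x11, x12, x21, x22):
    """Short chi-square formula for a 2*2 table."""
    tr1, tr2 = x11 + x12, x21 + x22
    tc1, tc2 = x11 + x21, x12 + x22
    t = tr1 + tr2
    return t * (x11 * x22 - x12 * x21) ** 2 / (tc1 * tc2 * tr1 * tr2)


# Rule (i): dcode = 'MKTG' -> dname = 'Marketing'
rule_i = cell_counts(ROWS, lambda r: r[0] == "MKTG",
                     lambda r: r[1] == "Marketing")
# Rule (ii): dname = 'Marketing' -> project >= 2
rule_ii = cell_counts(ROWS, lambda r: r[1] == "Marketing",
                      lambda r: r[2] >= 2)

degree_i = chi_square_2x2(*rule_i)
degree_ii = chi_square_2x2(*rule_ii)
print(rule_i, round(degree_i, 1), degree_i >= DECISION_VALUE)
# (10, 0, 0, 5) 15.0 True
print(rule_ii, round(degree_ii, 1), degree_ii >= DECISION_VALUE)
# (10, 0, 3, 2) 4.6 False
```

Rule (i) is a perfect association, so its statistic equals the table total T = 15, while rule (ii) stays below the decision value because three of the five non-Marketing rows also satisfy project >= 2.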
Experiment 3 of the next section shows how likely it is that weak rules and strong rules are derived.

5 Computational Results

Our first experiment on the query optimisation process used only the primary rules set; its results are given in Table 3. The second experiment was carried out using all rules; its results are given in Table 4. Table 5 then shows the time savings between the first and second experiments. Our approach has been tested on a database of 7000 instances. The relational schema of the database is given below.

STUDENT (name char(30), regno integer, logname char(8), advisor integer, entry integer, year char(2), scheme char(6), uccacode char(6), status char(1), examno integer, study char(8), school char(4))

Figure 3. Relational schema of the STUDENT database

The database is indexed on the name attribute and the regno attribute. The ten specified queries were as follows:

Q1) select * from student where status = 'A' and school = 'CSG';
Q2) select * from student where advisor = 479 and uccacode = 'B70002';
Q3) select * from student where entry < 9 and entry > 83 and examno = 0 and school = 'SEG';
Q4) select * from student where advisor = 259 and entry = 90 and status = 'A';
Q5) select scheme from student where entry = 90 and study = 'PHD CHEM';
Q6) select name, regno, advisor from student where entry = 85 and year = 'G';
Q7) select * from student where uccacode = 'Q008' and entry = 90;
Q8) select advisor from student where uccacode = 'B70002';
Q9) select name, regno, advisor from student where entry = 85 and regno <= and regno >= ;
Q10) select examno from student where school = 'SEG';

The above queries were chosen for the rule derivation procedure either for their features or to illustrate the query optimisation process. All queries in the tables are shown with their numbers, e.g. Query 9 is presented as Q9. If a query does not appear in one of the tables, the query is not useful. All results are averages obtained by executing each query five times.

5.1 Experiment 1

Results shown in Table 3 are for the query optimisation procedure using the primary rules set only. The first column of the table shows the times (in seconds) for the given queries without reformulation. The second column shows the times with reformulation using the primary rules set. The third column gives the number of fired rules for the given queries. The last column shows the difference in execution time between the first column and the second column. The times in the last column show that reformulation of the given queries can reduce the execution times of the original queries. In Table 3, the greatest savings are from Q5 and Q10, because the conditions of Q5 are refuted by a rule in the primary rules set and the answer to Q10 is found from a rule without executing the given query. Q4, Q6 and Q7 are transformed by rules containing indexed attributes. The execution time of the reformulated query of Q9 is higher than that of the original query, because the original query contains indexed attribute(s) in its where clause. This type of query is already inexpensive to execute without reformulation.

Table 3. System operating with the primary rules set from Query to Query2. The system has 33 strong rules in the primary rules set and 12 weak rules in the secondary rules set. (Columns: execution time in seconds without reformulation; execution time in seconds with reformulation; number of fired rules; saved time. Rows: Q1 to Q10.)

Table 4. System operating with all rules (45 rules). (Columns: execution time in seconds without reformulation; execution time in seconds with reformulation; number of fired rules; saved time. Rows: Q1 to Q10.)

5.2 Experiment 2

This experiment was performed to show the query optimisation process using all rules, i.e. both primary and secondary, in order to compare the results of our system. Results are shown in Table 4. Table 5 gives a comparison between Experiments 1 and 2. The differences between the numbers of fired rules in Table 3 and Table 4 show how many weak rules are fired to transform the given queries.

Table 5. Efficiency of using only the primary rules set instead of all rules. (Columns: saved time (1); saved time (2); saved time [(1) - (2)]. Rows: Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q10.)

5.3 Experiment 3

Our last experiment measured timings for the rule derivation process both with and without the chi-square test; these are shown in Graph 1. The graph is based on the averaged times for each rule for a running query. The proportion of weak and strong rules in the rules sets was found to be 29% weak rules and 71% strong rules on the STUDENT database. Finally, an 8% saving was made by splitting the rules into two rules sets according to the chi-square test. We are currently testing the system on a database which has 27,266 instances.

Graph 1: Evaluation times for each rule in the rule derivation process (seconds), with the test and without the test.

6 Conclusion

This paper has focused on the use of the chi-square test in automatic rule derivation to limit the set of rules. From the results, the test promises to eliminate weak rules. It may be possible to use other statistical methods to limit the rules set and to select only useful rules in the process. Our aim is to extend our current work to other methods in statistics and knowledge discovery systems in order to find the most appropriate cost function for query processing based on derivability, maintenance and selectivity factors [Graefe and Dewitt, 1987; Schkolnick and Tiberio, 1985; Ziarko, 1991].

Acknowledgement

Ayla Sayli is sponsored by the University of Yildiz.

References

[Chakravarthy et al., 1990] U. S. Chakravarthy, J. Grant and J. Minker. Logic-based approach to semantic query optimisation. ACM Transactions on Database Systems, Vol. 15, No. 2, June 1990.
[Chan and Wong, 1991] K. C. Chan and A. K. C. Wong. A statistical test for extracting classificatory knowledge from databases. Knowledge Discovery in Databases, Ed., The AAAI Press, 07-23, 1991.
[Graefe and Dewitt, 1987] G. Graefe and D. Dewitt. The EXODUS optimiser generator. In Proceedings of the 1987 ACM-SIGMOD Conference on Management of Data, San Francisco, 60-7, May 1987.
[Han et al., 1993] J. Han, Y. Cai and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, 29-40, February 1993.
[Hanson and Widom, 1993] E. N. Hanson and J. Widom. An overview of production rules in database systems. The Knowledge Engineering Review, Vol. 8:2, 2-43, 1993.
[Hsu and Knoblock, 1994] C. Hsu and C. A. Knoblock. Rule induction for semantic query optimisation. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.
[Imam et al., 1993] I. F. Imam, R. S. Michalski and L. Kerschberg. Discovering attribute dependence in database by integrating symbolic learning and statistical analysis tests. Knowledge Discovery in Databases Workshop, 1993.
[Kass, 1980] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Appl. Statist., 29, No. 2, 9-27, 1980.
[King, 1981] J. J. King. QUIST: A system for semantic query optimisation in relational databases. In Proceedings of the 7th VLDB Conference, 50-57, Sept. 1981.
[Lowden et al., 1995] B. G. T. Lowden, J. Robinson and K. Y. Lim. A semantic query optimiser using automatic rule derivation. Proc. Fifth Annual Workshop on Information Technologies and Systems, Netherlands, 68-76, December 1995.
[Piatetsky-Shapiro and Matheus, 1993] G. Piatetsky-Shapiro and C. Matheus. Measuring data dependencies in large databases. Knowledge Discovery in Databases Workshop, 62-73, 1993.
[Savnik and Flach, 1993] I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. Knowledge Discovery in Databases Workshop, 74-85, 1993.
[Schkolnick and Tiberio, 1985] M. Schkolnick and P. Tiberio. Estimating the cost of updates in a relational database. ACM Trans. Database Systems, 10, 2, 63-79, June 1985.
[Shekhar et al., 1993] S. Shekhar, B. Hamidzadeh and A. Kohli. Learning transformation rules for semantic query optimisation: a data-driven approach. IEEE, 1993.
[Shenoy and Ozsoyoglu, 1989] S. T. Shenoy and Z. M. Ozsoyoglu. Design and implementation of semantic query optimiser. IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 3, Sept. 1989.
[Siegel et al., 1992] M. D. Siegel, E. Sciore and S. Salveter. A method for automatic rule derivation to support semantic query optimisation. ACM Transactions on Database Systems, Vol. 17, No. 4, Dec. 1992.
[Yu and Sun, 1989] C. Yu and W. Sun. Automatic knowledge acquisition and maintenance for semantic query optimisation. IEEE Trans. Knowl. Data Eng., 1, 3, Sept. 1989.
[Ziarko, 1991] W. Ziarko. The discovery, analysis, and representation of data dependencies in databases. In Knowledge Discovery in Databases, The AAAI Press, 1991.
