A Statistical Approach to Rule Selection in Semantic Query Optimisation


Barry G. T. Lowden and Jerome Robinson
Department of Computer Science, The University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, Essex, United Kingdom

Abstract. Semantic Query Optimisation uses the semantic knowledge of a database (rules) to perform query transformation. Rules are normally learned from earlier queries submitted by the user. Over time, however, this can result in the rule set becoming very large, thereby degrading the efficiency of the system as a whole. This is known as the utility problem. This paper seeks to provide a solution to the utility problem through the use of statistical techniques in selecting and maintaining an optimal rule set. Statistical methods have been used widely in the field of Knowledge Discovery to identify and measure relationships between attributes. Here we extend the approach to Semantic Query Optimisation using the Chi-square statistical method, which is integrated into a prototype query optimiser developed by the authors. We also present a new technique for calculating Chi-square, which is faster and more efficient than the traditional method in this situation.

1 Introduction

Semantic query optimisation is the process of transforming a query into an alternative query that is semantically equivalent but executes in a shorter time. A semantically equivalent query returns the same result set as the original query and is generated from it by the application of rules. Rules can consist of integrity constraints that hold for all states of the database or dynamic rules that hold only for a given state. Generally, rules can be supplied by the database engineer or derived automatically. Automatic rule generation methods include heuristic based systems [13], logic based systems [1], graph based systems [12] and data driven systems [7, 11].
With the development of automatic rule derivation, it becomes increasingly important to filter out ineffective rules, which can lead to a deterioration in optimisation performance. The overall size of the rule set (n) is also of crucial importance since some optimisation algorithms run in $O(n^2)$ time. The problem of a large rule set degrading the efficiency of the reformulation process is referred to as the utility problem [4, 8, 10, 14]. Earlier work by the authors [6] partially addressed the utility problem by time stamping rules when created,

modified or used by the reformulation process. When the number of rules exceeds a specified limit, those which are recorded as unused for more than a given period of time, or which have a low usage count, are removed from the rule set. To explore this problem further we have examined a range of statistical tests as applied in knowledge discovery research [5, 8, 2]. In this paper we propose the use of Chi-square for the analysis and selection of effective rules for query optimisation. In section 2 we briefly describe a prototype optimiser, developed by the authors, on which we base our experiments. We then discuss a method of calculating Chi-square which is particularly appropriate to the present application, and in section 4 we present our computational results.

2 The ARDOR prototype

In most rule derivation systems, the characteristics of worthwhile rules are first identified and then a query is issued to the database in an attempt to derive such rules. The ARDOR system [7] takes a different approach whereby actual queries to the system are used to learn new rules. For each query (q) an optimum alternative (q′) is constructed, using a simple inductive process. The equality q = q′ is then used to deduce the minimum set of rules needed to perform the transformation algorithmically, and these rules are added to the rule base. The system is illustrated in Figure 1 and consists of (1) query reformulation, and (2) the rule derivation and learning modules. The query reformulation process takes the original query submitted by the user and attempts to reformulate it into a more cost effective query by matching the conditions of the query to the rule set using a best first search strategy. For issues relating to bounded optimality of search strategies see [9]. By this means, the original query is transformed into a semantically equivalent query, known as a reformulated query, using the standard transformation rules of constraint introduction and removal [3].
This query will give the same result as the original query but normally with a shorter execution time.

[Figure 1 components: Original Query, Query Reformulation, Reformulated Query, SQL Engine, Results, Rules, Rule Analysis, Rule Derivation and Learning]
Fig. 1. Overview of Query Optimiser
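Constraint introduction, as used in the reformulation step, can be pictured with a small Python sketch. This is not the ARDOR implementation: the function name `introduce_constraints`, the dictionary representation of a query, and the restriction to equality conditions are all assumptions made for illustration, and the best first search and query costing are omitted. The rule values are illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a rule has one antecedent and one consequent,
# each an (attribute, value) equality condition.
Condition = Tuple[str, object]
Rule = Tuple[Condition, Condition]


def introduce_constraints(
    query: Dict[str, object], rules: List[Rule]
) -> Optional[Dict[str, object]]:
    """Add every condition implied by the rules to the query's conditions.

    Returns the extended set of conditions, or None when an implied
    condition contradicts one already present, in which case the query
    cannot return any rows.
    """
    conditions = dict(query)
    changed = True
    while changed:  # repeat until no rule adds anything new
        changed = False
        for (a_attr, a_val), (c_attr, c_val) in rules:
            if conditions.get(a_attr) != a_val:
                continue  # antecedent does not match the query
            if c_attr not in conditions:
                conditions[c_attr] = c_val  # constraint introduction
                changed = True
            elif conditions[c_attr] != c_val:
                return None  # contradiction: empty result
    return conditions


# Illustrative rule: uccacode = 'B7000' implies advisor = 10.
rules = [(("uccacode", "B7000"), ("advisor", 10))]

extended = introduce_constraints({"uccacode": "B7000"}, rules)
contradicted = introduce_constraints({"uccacode": "B7000", "advisor": 300}, rules)
print(extended)
print(contradicted)
```

In this sketch a `None` return corresponds to a query whose answer is known to be empty, while an introduced condition on the selected attribute corresponds to a query whose answer can be read from the rules.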

It is also possible for the reformulation process to return a query result without accessing the database at all, leading to very substantial time savings. This occurs when the result of the query can be deduced from the rules alone, as illustrated in Figure 2 by two simple examples. All rules in the ARDOR system consist of a single antecedent and a single consequent. Rules sharing both common antecedent and consequent attributes are said to belong to the same rule class. Each time a user query is entered to the system, the learning process attempts to generate new rules by constructing matching conditions on all attributes other than those present in the original query. Matching conditions are those which evaluate true with respect to the answer set.

Original Query (1): select advisor from student where uccacode = 'B7000'
Rule: uccacode = 'B7000' → advisor = 10
Result: advisor = 10

Original Query (2): select advisor from student where uccacode = 'B7000' and advisor = 300
Rule: uccacode = 'B7000' → advisor = 10
Result: null result returned

Fig. 2. Results returned without need for database lookup

Each new rule generated is added to the current rule set which, over time, may become excessively large. Many of the rules may also eventually become obsolete with respect to the present database state and should be discarded. A rule set containing entries that are not useful to the query optimiser wastes storage space and, more importantly, degrades the performance of the transformation algorithm.

3 Statistics and rule maintenance

Chi-square is used in knowledge discovery to determine the correlation between the decision attribute and all the other attributes in a dataset. An attribute is considered to be strongly correlated to the decision attribute if the calculated Chi-square value is significantly greater than the tabled value. To illustrate the calculation of Chi-square, we make reference to the example relation shown in Figure 3.

dcode  dname       project  manager
ACCT   Accounting  1        ACCT01
ACCT   Accounting  1        ACCT02
ACCT   Accounting  5        ACCT01
MKTG   Marketing   2        MKTG01
MKTG   Marketing   2        MKTG04
MKTG   Marketing   2        MKTG05
MKTG   Marketing   3        MKTG02
MKTG   Marketing   4        MKTG03
MKTG   Marketing   5        MKTG01
MKTG   Marketing   5        MKTG04
MKTG   Marketing   5        MKTG05
MKTG   Marketing   6        MKTG02
MKTG   Marketing   6        MKTG03
PRSN   Personnel   7        PRSN01
PRSN   Personnel   8        PRSN02

Fig. 3. An example relation

The first step in calculating the Chi-square correlation is to arrange the data in a frequency or contingency table. Each column represents a group and each row represents a category of the sample relation. Figure 4 depicts such a table.

                 Group
Category    1      2      3      Total
1           n_11   n_12   n_13   R_1
2           n_21   n_22   n_23   R_2
3           n_31   n_32   n_33   R_3
Total       C_1    C_2    C_3    N

Fig. 4. A (3x3) frequency or contingency table

The Chi-square significance can be calculated by the following formula:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

or

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{n_{ij}^2}{E_{ij}} - N$$

where $n_{ij}$ is the observed number of cases in the ith row and jth column, and $E_{ij}$ is the number of cases expected in the ith row and jth column.

Under the assumption of independence, the expected frequency of the observation in each cell should be proportional to the distribution of row and column totals. This expected frequency is the product of the row and column totals divided by the total number of observations. Thus $E_{ij}$ may be calculated from the following formula:

$$E_{ij} = \frac{R_i C_j}{N}$$

The total frequency $R_i$ in the ith row is

$$R_i = \sum_{j=1}^{c} n_{ij}$$

Similarly, the total frequency in the jth column is

$$C_j = \sum_{i=1}^{r} n_{ij}$$

The statistical degree of freedom (df) for the Chi-square is calculated as:

$$df = (r - 1)(c - 1)$$

Given, for example, the rule:

dname = 'Accounting' → dcode = 'ACCT'

we may construct the contingency table for the associated rule class, Figure 5.

                        Group (dcode)
Category (dname)   ACCT   MKTG   PRSN   Total
Accounting         3      0      0      3
Marketing          0      10     0      10
Personnel          0      0      2      2
Total              3      10     2      15

Fig. 5. Frequency table for dname and dcode

Using the earlier expression, the Chi-square value calculated is 29.80 with degree of freedom (df) = 4. Choosing a confidence level of 97.5%, the tabled Chi-square value is 11.14. Since the calculated value is greater than the tabled value, the rule is considered to be relevant and is retained in the rule set.

In certain circumstances it may be appropriate to construct a contingency table limited to a restricted subset of categories, where knowledge of the data indicates that this subset shows a markedly different correlation between antecedent and consequent attributes compared with other subsets in the same rule class. For example, in the case of shoe size and age, it would be known that the correlation is much greater for small shoe sizes since, in general, these relate to children and are consequently associated with a restricted age spectrum.
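As an illustration, the test for the dname/dcode rule class can be reproduced in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `chi_square` and the constant `CRITICAL_97_5_DF4` are introduced here for illustration, the pair counts are read off the example relation of Figure 3, and the critical value is the standard tabulated figure for 4 degrees of freedom at the 97.5% level.

```python
from collections import Counter
from typing import Hashable, List, Tuple


def chi_square(pairs: List[Tuple[Hashable, Hashable]]) -> Tuple[float, int]:
    """Pearson Chi-square and degrees of freedom for (antecedent, consequent) pairs."""
    cell = Counter(pairs)                    # observed frequencies n_ij
    row = Counter(a for a, _ in pairs)       # row totals R_i
    col = Counter(c for _, c in pairs)       # column totals C_j
    n = len(pairs)                           # total number of observations N
    stat = 0.0
    for a in row:
        for c in col:
            expected = row[a] * col[c] / n   # E_ij = R_i * C_j / N
            observed = cell.get((a, c), 0)
            stat += (observed - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return stat, df


# (dname, dcode) pairs counted from the example relation.
pairs = (
    [("Accounting", "ACCT")] * 3
    + [("Marketing", "MKTG")] * 10
    + [("Personnel", "PRSN")] * 2
)

CRITICAL_97_5_DF4 = 11.143  # assumed constant: tabulated value for df = 4

stat, df = chi_square(pairs)
retain = stat > CRITICAL_97_5_DF4  # keep the rule class if the statistic exceeds the table value
print(round(stat, 2), df, retain)
```

Exact arithmetic gives 30.0 for this table; a hand calculation that rounds the expected frequencies to two decimal places comes out slightly lower. Either way the statistic is well above the critical value, so the rule class is retained.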

3.1 Algorithm design and memory limitation

The implementation of the Chi-square statistical test normally requires the construction of a contingency table to hold the sampled frequencies of each row and column, the size of the table being dependent on the number of categories. If the table becomes excessively large, this may well degrade the performance and efficiency of the statistical algorithm. The extreme case is where the table grows beyond the memory limits of the system, so that page swapping occurs. These problems imply that, in some cases, it may not be feasible to employ this technique for rule analysis, and an alternative method of calculation is now presented. Consider the contingency table shown in Figure 6:

                  Consequent
Antecedent                          Total
value(1)    n_11    n_12    n_13    T_R1
value(2)    n_21    n_22    n_23    T_R2
value(3)    n_31    n_32    n_33    T_R3
Total       T_C1    T_C2    T_C3    N

Fig. 6. A (3 x 3) contingency table

As before, the Chi-square is given as follows:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}$$

Applying this formula to the above table, i.e. r = 3 and c = 3, we have:

$$\chi^2 = \frac{(n_{11} - E_{11})^2}{E_{11}} + \frac{(n_{12} - E_{12})^2}{E_{12}} + \frac{(n_{13} - E_{13})^2}{E_{13}} + \frac{(n_{21} - E_{21})^2}{E_{21}} + \frac{(n_{22} - E_{22})^2}{E_{22}} + \frac{(n_{23} - E_{23})^2}{E_{23}} + \frac{(n_{31} - E_{31})^2}{E_{31}} + \frac{(n_{32} - E_{32})^2}{E_{32}} + \frac{(n_{33} - E_{33})^2}{E_{33}}$$

where

$$E_{ij} = \frac{T_{Ri} \, T_{Cj}}{N}$$

If, for some cell, $n_{ij} = 0$, then the contribution to $\chi^2$ from that cell is just $E_{ij}^2 / E_{ij} = E_{ij}$. So the contribution from all the cells having zero observed frequency is simply the sum of the corresponding $E_{ij}$.
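The zero-cell observation can be turned into a calculation that touches only the non-zero cells, because the expected frequencies over the whole table sum to N. The Python sketch below is illustrative only: the function name `sparse_chi_square`, the dictionary of pair counts, and the labels `v1` to `w3` are assumptions, not part of the prototype. It uses a 3 x 3 table whose only non-zero cells lie on the diagonal, with counts 3, 4 and 2.

```python
from collections import Counter
from typing import Dict, Hashable, Tuple

Pair = Tuple[Hashable, Hashable]


def sparse_chi_square(pair_counts: Dict[Pair, int]) -> Tuple[float, float, float]:
    """Chi-square from non-zero cells only.

    pair_counts maps (antecedent value, consequent value) to its non-zero
    observed count. Returns (total, non_zero_part, zero_part).
    """
    row: Counter = Counter()
    col: Counter = Counter()
    for (a, c), k in pair_counts.items():
        row[a] += k                      # row totals
        col[c] += k                      # column totals
    n = sum(pair_counts.values())        # total number of observations N

    non_zero_part = 0.0
    expected_seen = 0.0
    for (a, c), k in pair_counts.items():
        e = row[a] * col[c] / n          # expected frequency of this cell
        non_zero_part += (k - e) ** 2 / e
        expected_seen += e

    # Each zero cell contributes its own expected frequency, and all
    # expected frequencies sum to N, so no zero cell is ever visited.
    zero_part = n - expected_seen
    return non_zero_part + zero_part, non_zero_part, zero_part


counts = {("v1", "w1"): 3, ("v2", "w2"): 4, ("v3", "w3"): 2}
total, non_zero_part, zero_part = sparse_chi_square(counts)
print(round(total, 4), round(non_zero_part, 4), round(zero_part, 4))
```

The work done is proportional to the number of distinct (antecedent, consequent) pairs actually observed, rather than to the product of the numbers of antecedent and consequent values, which is the point of avoiding the full table.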

Furthermore, there is no need to calculate their individual values. We know that the sum of the complete set of expected frequencies is N. If we sum the expected frequencies, $E_{ij}$, of all the non-zero cells and subtract this value from N, we have the total $\chi^2$ contribution from the cells with zero observed frequencies. Therefore the only computation needed is for cells with a non-zero frequency count. We illustrate the method by means of the following example. Let us assume that $n_{11}$, $n_{22}$ and $n_{33}$ have values 3, 4 and 2 respectively and all other cells contain a zero, so that N = 9. The traditional way of calculating Chi-square would be:

$$\chi^2 = \frac{(3 - 1)^2}{1} + \frac{(0 - \frac{4}{3})^2}{\frac{4}{3}} + \frac{(0 - \frac{2}{3})^2}{\frac{2}{3}} + \frac{(0 - \frac{4}{3})^2}{\frac{4}{3}} + \frac{(4 - \frac{16}{9})^2}{\frac{16}{9}} + \frac{(0 - \frac{8}{9})^2}{\frac{8}{9}} + \frac{(0 - \frac{2}{3})^2}{\frac{2}{3}} + \frac{(0 - \frac{8}{9})^2}{\frac{8}{9}} + \frac{(2 - \frac{4}{9})^2}{\frac{4}{9}} = 18$$

Our proposed alternative is to compute the cells that have non-zero counts separately from those that have zero frequency counts, as follows.

$$\text{Non-zero Chi-square} = \frac{(3 - 1)^2}{1} + \frac{(4 - \frac{16}{9})^2}{\frac{16}{9}} + \frac{(2 - \frac{4}{9})^2}{\frac{4}{9}} \approx 12.22$$

$$\text{Zero Chi-square} = 9 - \left(1 + \frac{16}{9} + \frac{4}{9}\right) \approx 5.78$$

The sum of these two values gives the same result as before without the need to construct a contingency table. This method is not restricted to diagonal tables alone and so may be used to represent rule classes of the form A → B OR C.

4 Experimental Results

A series of experiments was carried out in order to measure any increase in optimiser performance achieved through the statistical analysis and subsequent elimination of ineffective rules.

In each case a specified query collection was executed against the database in order to generate a full associated rule set using the ARDOR rule derivation and learning modules. The rule set was then subjected to the Chi-square analysis, as described in the above section, to filter out those rules considered to be weak, thus creating a restricted rule set. The initial query execution was then repeated, first using the full rule set and then the restricted set, to perform query transformation and optimisation. In one representative experiment the following results were obtained by running a collection of ten queries, shown below, against a 7184 instance database containing real data pertaining to student records at the University of Essex.

Query Collection:
Q1) select * from student where status = 'A' and school = 'CSG';
Q2) select * from student where advisor = 479 and uccacode = 'B7000';
Q3) select * from student where entry < 91 and entry > 83 and examno = 0 and school = 'SG';
Q4) select * from student where advisor = 59 and entry = 90 and status = 'A';
Q5) select scheme from student where entry = 90 and study = 'PHD CHM';
Q6) select name, regno, advisor from student where entry = 85 and year = 'G';
Q7) select * from student where uccacode = 'Q10081' and entry = 90;
Q8) select advisor from student where uccacode = 'B7000';
Q9) select name, regno, advisor from student where entry = 85 and regno < and regno > ;
Q10) select examno from student where school = 'SG';

The queries include range, numeric and set membership conditions and make reference to both indexed attributes (name and regno) and non-indexed attributes. The relational schema is shown in Fig. 7.

STUDENT (name char(30), regno integer, logname char(8), advisor integer, entry integer, year char(2), scheme char(6), uccacode char(6), status char(1), examno integer, study char(8), school char(4))

Fig. 7. Relational Scheme of STUDENT database

Query execution resulted in the generation of 51 rules.
The rule derivation module runs as a background task and so does not adversely affect the execution process.

Application of the Chi-square test to this rule set identified 16 rules which could be considered weak, and which were therefore discarded, leaving 35 strong rules considered to be particularly effective for query optimisation. Finally the query optimiser was run using both the full and restricted rule sets, with the results shown in Figure 8. These results were averaged over 50 runs and include query transformation costs. Queries Q2, Q8 and Q10 show the greatest overall savings since they can be answered from the rule set alone without the need to access the database itself. Queries Q1, Q3 and Q4 are examples where the transformation process cannot improve on the original query. In such cases the query costing routines incorporated within the optimiser return the original query as optimal for execution.

[Figure 8: table of timings (secs) for queries Q1 to Q10, with columns Unoptimised, Full set, Restr. set and Tot. Savings. Full rule set 51, restricted set 35.]
Fig. 8. Experimental Results

Overall an average time saving of 30.5% was achieved across all the queries executed using the full rule set, with a further improvement to 38.8% using the restricted rule set. It should be emphasised, however, that the performance increase shown results from applying the Chi-square rule filter to a recently derived and therefore generally relevant rule set. Even greater savings, with respect to the use of the full rule set, may be achieved by the Chi-square analysis of rule sets which have become out of date with respect to the current database state. The use of Chi-square, therefore, ensures not only the quality of rules when they are first derived but also their subsequent relevance and effectiveness to the ongoing optimisation process.

5 Conclusions

The efficiency of a semantic optimiser is largely dependent on the number and quality of the rules used in the query transformation process. In this paper we have described

how the Chi-square statistical test may be used to identify and eliminate rules which are considered to be weak and ineffective. Experimental results show that significantly improved performance is achieved using this approach, which can be run in conjunction with other rule selection techniques. A revised Chi-square calculation procedure has also been presented which does not require the construction of a contingency table.

References

1. U. S. CHAKRAVARTHY, J. GRANT and J. MINKER. Logic-based approach to semantic query optimisation. ACM Transactions on Database Systems, Vol. 15, No. 2, June 1990.
2. K. C. CHAN and A. K. C. WONG. A statistical test for extracting classificatory knowledge from databases. Knowledge Discovery in Databases, Ed., The AAAI Press, 1991.
3. G. GRAEFE and D. DEWITT. The EXODUS optimiser generator. Proceedings of the ACM-SIGMOD Conference on Management of Data, San Francisco, May 1987.
4. J. HAN, Y. CAI and N. CERCONE. Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 1, February 1993.
5. I. F. IMAM, R. S. MICHALSKI and L. KERSCHBERG. Discovering attribute dependence in database by integrating symbolic learning and statistical analysis tests. Knowledge Discovery in Databases Workshop, 1993.
6. B. G. T. LOWDEN and K. Y. LIM. A data driven semantic optimiser. CSM 11, Internal Publication, University of Essex.
7. B. G. T. LOWDEN, J. ROBINSON and K. Y. LIM. A semantic optimiser using automatic rule derivation. Proceedings of Workshop on Information Technologies and Systems, 1995.
8. G. PIATETSKY-SHAPIRO and C. MATHEUS. Measuring data dependencies in large databases. Knowledge Discovery in Databases Workshop, 1993.
9. S. RUSSELL. Rationality and Intelligence. Artificial Intelligence 94, 1997.
10. I. SAVNIK and P. A. FLACH. Bottom-up induction of functional dependencies from relations. Knowledge Discovery in Databases Workshop, 1993.
11. S. SHEKHAR, B. HAMIDZADEH and A. KOHLI. Learning transformation rules for semantic query optimisation: a data-driven approach. IEEE, 1993.
12. S. T. SHENOY and Z. M. OZSOYOGLU. Design and implementation of semantic query optimiser. IEEE Transactions on Knowledge and Data Engineering, Vol. 1, No. 3, Sept. 1989.
13. M. D. SIEGEL, E. SCIORE and S. SALVETER. A method for automatic rule derivation to support semantic query optimisation. ACM Transactions on Database Systems, Vol. 17, No. 4, Dec. 1992.
14. W. ZIARKO. The discovery, analysis, and representation of data dependencies in databases. In Knowledge Discovery in Databases, The AAAI Press, 1991.
