Evolving SQL Queries for Data Mining

Majid Salim and Xin Yao
School of Computer Science, The University of Birmingham
Edgbaston, Birmingham B15 2TT, UK
{msc30mms,x.yao}@cs.bham.ac.uk

Abstract. This paper presents a methodology for applying the principles of evolutionary computation to knowledge discovery in databases by evolving SQL queries that describe datasets. In our system, the fittest queries are rewarded by having their attributes given a higher probability of surviving in subsequent queries. The advantages of using SQL queries include their readability for non-experts and their ease of integration with existing databases. The evolutionary algorithm (EA) used in our system is very different from existing EAs, but appears to be effective and efficient in experiments to date with three different test datasets.

1 Introduction

Data mining studies the identification and extraction of useful knowledge from large amounts of data [5]. There are a number of different fields of inquiry within data mining, of which classification is particularly popular. Machine learning algorithms that can learn to classify data correctly can be applied to a wide variety of problem domains, including credit card fraud detection and medical diagnostics [1,2,3]. An important aspect of such algorithms is ensuring that they are easy to comprehend, so that machine-discovered knowledge can be transferred to people easily [4].

This paper presents a framework for discovering classification knowledge hidden in a database through evolutionary computation techniques applied to SQL queries. The task is related to, but different from, the conventional classification problem. Instead of trying to learn a classifier for predicting unseen examples, we are most interested in discovering the underlying knowledge and concepts that best describe a given set of data from a large database.

SQL is a standardised data manipulation language that is widely supported by database vendors. Constructing a data mining framework using SQL is therefore very useful, as it would inherit SQL's portability and readability. Ryu and Eick [7] proposed a genetic programming (GP) based approach to deriving queries from examples. However, there are two major differences between the work presented here and theirs. First, the query languages used are different and, as a result, the chromosome representations are different. Our use of SQL has made the whole system much simpler and more portable. Second, the evolutionary algorithms used are different. While Ryu and Eick [7] used GP, we have developed a much simpler algorithm which does not use any conventional crossover and mutation operators. Instead, the idea of self-adaptation at the gene level is exploited. Our initial experimental studies have shown that such a simple scheme is very easy to implement, yet very effective and efficient.

The rest of this paper is structured as follows. Section 2 describes the architecture of the proposed framework, justifying the design decisions made and explaining the benefits and drawbacks perceived in the process. Section 3 presents initial results obtained with the framework, and Section 4 concludes the paper with a brief discussion of planned future work.

2 Evolving SQL Queries

It was necessary to find a way of representing SQL queries genotypically, to allow for the application of evolutionary search operators. Another issue was the design of a fitness function to apply evolutionary pressure to the queries, guiding them towards the correct classification rules. Genotypes were required to encode the list of conditional constraints that specify the criteria by which records should be selected. Each conditional constraint in SQL follows the structure [attribute name] [logical operator] [value]. This sequence was chosen as the basic unit of information, or gene, from which genotypes would be constructed. Genotypes varied randomly in length.

2.1 Evolutionary Search

The algorithm that was implemented is described in this section. An initial population of 100 genotypes was constructed by randomly selecting attribute names, logical operators and values. Each attribute in the dataset began with a 0.5 probability of being included in any given genotype. Genotypes were then translated into SQL by initialising a string with the value SELECT * FROM [tablename] WHERE, and then appending each gene in the genotype to the end of the string. For example, a genotype such as

    (LEGS = 4) (PREDATOR = TRUE) (FEATHERS = FALSE) (VENOMOUS = FALSE)

would be translated, through the random addition of AND and OR conditionals, into the following SQL query:

    SELECT * FROM Animals WHERE LEGS = 4 AND PREDATOR = true
    AND FEATHERS = false OR VENOMOUS = false
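As an illustration of the representation and translation step just described, the sketch below shows one way it could be implemented. Python is used purely for illustration (the paper does not name an implementation language), and the schema, table name and helper names are hypothetical.

    import random

    # Hypothetical Zoo-style schema: attribute name -> candidate values.
    SCHEMA = {
        "LEGS": [0, 2, 4, 6, 8],
        "PREDATOR": ["true", "false"],
        "FEATHERS": ["true", "false"],
        "VENOMOUS": ["true", "false"],
    }
    OPERATORS = ["=", "<>", "<", ">"]   # the paper's "logical operators"
    CONNECTIVES = ["AND", "OR"]         # added at random during translation

    def random_genotype(selection_prob):
        """Build a variable-length genotype: one (attribute, operator, value) gene per
        attribute that survives its selection probability (0.5 for every attribute at the start)."""
        genotype = []
        for attr, values in SCHEMA.items():
            if random.random() < selection_prob[attr]:
                genotype.append((attr, random.choice(OPERATORS), random.choice(values)))
        return genotype

    def to_sql(genotype, table="Animals"):
        """Translate a genotype into an SQL query by appending each gene to a
        'SELECT * FROM ... WHERE' string, joined by randomly chosen AND/OR connectives."""
        if not genotype:
            return f"SELECT * FROM {table}"   # degenerate genotype with no constraints
        query = f"SELECT * FROM {table} WHERE "
        for i, (attr, op, value) in enumerate(genotype):
            if i > 0:
                query += f" {random.choice(CONNECTIVES)} "
            query += f"{attr} {op} {value}"
        return query

    # An initial population of 100 genotypes, every attribute starting at probability 0.5.
    probs = {attr: 0.5 for attr in SCHEMA}
    population = [random_genotype(probs) for _ in range(100)]
    print(to_sql(population[0]))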

Such SQL queries, once constructed, were sent to the database and the results analysed. Each genotype was assigned a fitness value according to the extent to which its results corresponded with a target result set T. The fitness function used was

    fitness = 100 - falsePositives - (2 * falseNegatives)

where 100 was an arbitrarily chosen constant. This fitness function was adapted from Ryu and Eick [7], whose work deals with deriving queries from object-oriented databases. falsePositives is the number of records that were incorrectly identified as belonging to T, and falseNegatives is the number of records that should have been included in T but were not. The fitness function punishes false negatives more than false positives. If a query returns no false negatives but several false positives, it can be seen to be correctly identifying the target result set while generalising too much, whereas a query that returns false negatives is simply incorrect. By punishing false negatives more, it was hoped to apply evolutionary pressure favouring queries that better classified the training data.

After fitness values were assigned to the 100 queries, the three best and three worst genotypes were selected. If a perfect classifier was found (with a fitness of 100), the evolution terminated; otherwise the attributes had their probabilities re-weighted. Every attribute that appeared in the top three fittest genotypes had its selection probability incremented by 1%. Every attribute in the worst three genotypes had its probability decremented by 1%. The old genotypes were then discarded, and a new set of 100 genotypes was randomly created using the self-adapted probabilities. Over a period of generations, attributes that contributed to higher fitness values came to dominate the genotype set, whereas attributes that contributed little to a genotype featured less and less.

2.2 Discussions

Our algorithm departs from the metaphor commonly used in evolutionary algorithms; however, it does offer a mechanism through which the genotypes iteratively converge on the region of the search space that offers the greatest classification utility. Although the genetic information of parents is not inherited directly by offspring, the genetic information of the whole population is inherited by the next population. This inheritance is probabilistically biased toward more useful genetic material; hence, more useful genetic material occurs more frequently in the population. It is hoped that classification rules may be discovered as a consequence of this.
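Continuing the hypothetical sketch above, and reusing its SCHEMA, random_genotype and to_sql names, the following shows how the fitness function and the per-attribute probability re-weighting could fit together; run_query is an assumed helper that executes a query and returns the set of matching record identifiers.

    def fitness(result_ids, target_ids):
        """Fitness as defined above; false negatives are punished twice as hard as false positives."""
        false_positives = len(result_ids - target_ids)
        false_negatives = len(target_ids - result_ids)
        return 100 - false_positives - (2 * false_negatives)

    def evolve(run_query, target_ids, generations=100, pop_size=100):
        """One run of the self-adaptive scheme: no crossover or mutation, only per-attribute
        selection probabilities re-weighted by +/- 1% each generation."""
        probs = {attr: 0.5 for attr in SCHEMA}
        best_sql, best_fit = None, float("-inf")
        for _ in range(generations):
            population = [random_genotype(probs) for _ in range(pop_size)]
            scored = []
            for genotype in population:
                sql = to_sql(genotype)
                score = fitness(run_query(sql), target_ids)  # run_query returns a set of record ids
                scored.append((score, genotype, sql))
            scored.sort(key=lambda entry: entry[0], reverse=True)
            if scored[0][0] > best_fit:
                best_fit, best_sql = scored[0][0], scored[0][2]
            if best_fit >= 100:                              # perfect classifier: stop early
                return best_sql, best_fit
            for _, genotype, _ in scored[:3]:                # reward attributes in the best three
                for attr, _, _ in genotype:
                    probs[attr] = min(1.0, probs[attr] + 0.01)
            for _, genotype, _ in scored[-3:]:               # penalise attributes in the worst three
                for attr, _, _ in genotype:
                    probs[attr] = max(0.0, probs[attr] - 0.01)
        return best_sql, best_fit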

3 Experimental Studies

Several experiments have been carried out to evaluate the effectiveness and efficiency of the proposed framework. All datasets were downloaded from the UCI Machine Learning Repository (http://www1.ics.uci.edu/~mlearn/mlrepository.html). Each dataset was tested with 20 independent runs. If a perfect classifier was not found after 100 generations, the best classifier found to date was returned. The results were averaged over the 20 runs and are presented below.

3.1 The Zoo Dataset

The Zoo dataset contains data items that describe animals. In total 14 attributes are provided, of which 13 are boolean and one has a predefined integer range. The animals are classified into 7 different types. Table 1 describes the results for the Zoo dataset. ANG refers to the average number of generations that it took for our algorithm to find a perfect classifier.

Table 1. Results for the Zoo dataset, showing the performance of the evolved classifying queries for each animal type, averaged over 20 runs.

    Type  False Positives  False Negatives  ANG   Accuracy
    1     0                0                0.8   100.0%
    2     0                0                0.7   100.0%
    3     1                0                n/a   83.3%
    4     0                0                4.7   100.0%
    5     0                0                21.0  100.0%
    6     0                0                44.5  100.0%
    7     2                0                n/a   83.3%

It can be seen that our algorithm performed well on most of the classification tasks. The two instances in which it failed to find perfect classifiers are the most difficult tasks within the dataset, as both involve a very small set of animals. In both cases, however, the best queries did not include false negatives.

3.2 Monk's Problems

The Monk's Problems dataset involves data items with six attributes, all of which are predefined integers between 1 and 4. The first Monk's problem is the identification of data patterns where (B = C) or (E = 1). The second problem is the identification of all data patterns that feature exactly two of (B = 1, C = 1, D = 1, E = 1, F = 1, G = 1). The third Monk's problem is the identification of data patterns where (F = 3 and E = 1) or (F != 4 and C != 3), with 5% noise added to the training set. The results, averaged over 20 runs, are summarised in Table 2.

Table 2. Results for the Monk's Problems dataset, showing the performance of the best queries for each problem. ANG refers to the average number of generations that it took for our algorithm to find a perfect classifier.

    Problem    False Positives  False Negatives  ANG   Accuracy
    Problem 1  0                0                40.6  100.0%
    Problem 2  85               0                n/a   16.9%
    Problem 3  5                3                n/a   94.7%

Our algorithm performed perfectly on the first problem and very well on the third, but performed poorly on the second problem. Part of the reason lies in SQL's inherent difficulty in expressing the desired condition. The second Monk's problem requires a solution that compares relative attribute values, whereas SQL is usually used to select records according to a set of disjunctive attribute constraints.
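To make the representational limitation concrete, the hypothetical query below (shown as a Python string for consistency with the earlier sketches) expresses the second problem's "exactly two of" concept directly: it has to count satisfied constraints, for example with a sum of CASE expressions, a form that lies outside the flat AND/OR chains of [attribute] [operator] [value] genes that the algorithm can construct. The table name is an assumption.

    # Hypothetical hand-written query for the second Monk's problem: "exactly two of
    # B=1, C=1, D=1, E=1, F=1, G=1". Counting satisfied constraints needs CASE
    # expressions (or similar), which the evolved gene format cannot produce.
    monks2_target = """
    SELECT *
    FROM Monks2
    WHERE (CASE WHEN B = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN C = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN D = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN E = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN F = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN G = 1 THEN 1 ELSE 0 END) = 2
    """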

3.3 Credit Card Approval

The credit card approval dataset contains anonymised information on credit card application approvals and rejections. The dataset contains a variety of attribute types, with some attributes having predefined values and others having continuous values. The dataset also features 5% noise. Our algorithm succeeded in correctly identifying, on average, 82.9% of the rejections. However, this relative success is countered by the fact that the classifier also included a large number of false positives: 101 on average, accounting for nearly 20% of the dataset size.

3.4 Discussion of the Results

The results for the Zoo and Monk's Problems datasets are encouraging. Our algorithm performs worst on the second Monk's problem, which may be because the problem is not structurally conducive to an SQL-based classification rule, although future refinements of our algorithm will hopefully improve upon these results.

The results with the credit card approval dataset also show room for improvement. This may be due to its inclusion of continuous variables. Our algorithm performs poorly with continuous-valued attributes because, although it can identify attributes that are valuable in making a classification, it cannot make the same distinction for logical operators or values. The algorithm needs to find suitable operators and values, as well as suitable attributes, for good classification. It is proposed that logical operators will also be given initial selection probabilities, which will be incremented or decremented according to the effect they have upon the fitness value of their genotype.
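A minimal sketch of how this proposed extension might look, again reusing the hypothetical OPERATORS list and random module from the earlier sketches: each operator receives its own selection probability, used as a weight when a gene is built.

    def weighted_gene(attr, values, op_probs):
        """Pick the gene's operator using per-operator selection probabilities (as weights),
        instead of uniformly at random."""
        ops, weights = zip(*op_probs.items())
        op = random.choices(ops, weights=weights, k=1)[0]
        return (attr, op, random.choice(values))

    # Operators would start with equal selection probabilities and, like attributes, be
    # incremented/decremented by 1% according to the genotypes in which they appear.
    op_probs = {op: 0.5 for op in OPERATORS}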

4 Conclusions

By using evolutionary computation techniques to evolve SQL queries, it is possible to create a data mining framework that both produces easily readable results and can be applied to any SQL-compliant database system. The problem considered here is somewhat different from the conventional classification problem. The key question we are addressing is: given a subset of data in a large database, how can we gain a better understanding of it? Our solution is to evolve human-comprehensible SQL queries that describe the data.

The algorithm proposed in this paper differs from many traditional evolutionary algorithms in that it does not use the metaphor of selection, whereby the fittest individuals have their traits inherited by the new generation through operations such as crossover or mutation. Rather, it rewards the attributes that make individuals successful, and then repeats the initial step of creation. In other words, rather than survival of the fittest, this work operates on the principle of survival of the qualities that make the fittest fit. Although many genetic algorithms feature mutation, it is usually scaled down so that it does not destroy useful structures that evolution may already have constructed. Our approach differs in that it divorces the importance of an attribute from the values that the attribute happens to take in a given gene. As such, it effects an evolutionary fluidity that results in an appealingly diverse population, one more likely to distribute itself over the entire search space than to converge on a local optimum.

Although our preliminary experimental results are promising, they also leave room for improvement. It is hoped that future improvements in dealing with continuous variables will improve performance.

References

1. X. Yao and Y. Liu, A new evolutionary system for evolving artificial neural networks, IEEE Transactions on Neural Networks, 8(3):694-713, May 1997.
2. X. Yao and Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 28(3):417-425, June 1998.
3. Y. Liu, X. Yao and T. Higuchi, Evolutionary ensembles with negative correlation learning, IEEE Transactions on Evolutionary Computation, 4(4):380-387, November 2000.
4. J. Bobbin and X. Yao, Evolving rules for nonlinear control, in M. Mohammadian (ed.), New Frontier in Computational Intelligence and its Applications, IOS Press, Amsterdam, 2000, pp. 197-202.
5. A. A. Freitas, A genetic programming framework for two data mining tasks: classification and knowledge discovery, in Genetic Programming 1997: Proc. 2nd Annual Conference, Stanford University, 1997, pp. 96-101.
6. A. A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, in A. Ghosh and S. Tsutsui (eds.), Advances in Evolutionary Computation, Springer-Verlag, 2001.
7. T. W. Ryu and C. F. Eick, Deriving queries from results using genetic programming, in Proc. 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 303-306.