Evolving SQL Queries for Data Mining

Majid Salim and Xin Yao
School of Computer Science, The University of Birmingham
Edgbaston, Birmingham B15 2TT, UK
{msc30mms,x.yao}@cs.bham.ac.uk

Abstract. This paper presents a methodology for applying the principles of evolutionary computation to knowledge discovery in databases by evolving SQL queries that describe datasets. In our system, the fittest queries are rewarded by having their attributes given a higher probability of surviving in subsequent queries. The advantages of using SQL queries include their readability for non-experts and their ease of integration with existing databases. The evolutionary algorithm (EA) used in our system is very different from existing EAs, but appears to be effective and efficient in experiments to date with three different test datasets.

1 Introduction

Data mining studies the identification and extraction of useful knowledge from large amounts of data [5]. There are a number of different fields of inquiry within data mining, of which classification is particularly popular. Machine learning algorithms that can learn to classify data correctly can be applied to a wide variety of problem domains, including credit card fraud detection and medical diagnostics [1,2,3]. An important aspect of such algorithms is ensuring that they are easy to comprehend, so that machine-discovered knowledge can be transferred to people easily [4].

This paper presents a framework for discovering classification knowledge hidden in a database through evolutionary computation techniques applied to SQL queries. The task is related to, but different from, the conventional classification problem. Instead of trying to learn a classifier for predicting unseen examples, we are most interested in discovering the underlying knowledge and concepts that best describe a given set of data from a large database.

SQL is a standardised data manipulation language that is widely supported by database vendors. Constructing a data mining framework using SQL is therefore very useful, as it would inherit SQL's portability and readability. Ryu and Eick [7] proposed a genetic programming (GP) based approach to deriving queries from examples. However, there are two major differences between the work presented here and theirs. First, the query languages used are different and, as a result, the chromosome representations are different. Our use of SQL has made the whole system much simpler and more portable. Second, the evolutionary algorithms used are different. While Ryu and Eick [7] used GP, we have developed a much simpler algorithm which does not use any conventional crossover and mutation operators. Instead, the idea of self-adaptation at the gene level is exploited. Our initial experimental studies have shown that such a simple scheme is very easy to implement, yet very effective and efficient.

The rest of this paper is structured as follows. Section 2 describes the architecture of the proposed framework, justifying the design decisions made and explaining the benefits and drawbacks perceived in the process. Section 3 presents initial results obtained with the framework, and Section 4 concludes the paper with a brief discussion of planned future work.

2 Evolving SQL Queries

It was necessary to find a way of representing SQL queries genotypically, to allow for the application of evolutionary search operators. Another issue was the design of a fitness function to apply evolutionary pressure to the queries, guiding them towards the correct classification rules. Genotypes were required to encode the list of conditional constraints that specify the criteria by which records should be selected. Each conditional constraint in SQL follows the structure [attribute name] [logical operator] [value]. This sequence was chosen as the basic unit of information, or gene, from which genotypes would be constructed. Genotypes varied randomly in length.

2.1 Evolutionary Search

The algorithm that was implemented is described in this section. An initial population of 100 genotypes was constructed by randomly selecting attribute names, logical operators and values. Each attribute in the dataset began with a 0.5 probability of being included in any given genotype. Genotypes were then translated into SQL by initialising a string with the value SELECT * FROM [tablename] WHERE, and then appending each gene in the genotype to the end of the string. For example, a genotype such as

    (LEGS = 4) (PREDATOR = TRUE) (FEATHERS = FALSE) (VENOMOUS = FALSE)

would be translated, through the random addition of AND and OR conditionals, into the following SQL query:

    SELECT * FROM Animals WHERE LEGS = 4 AND PREDATOR = true
    AND FEATHERS = false OR VENOMOUS = false
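As an illustration of the representation and translation step just described, the sketch below shows one way it could be implemented. Python is used purely for illustration (the paper does not name an implementation language), and the schema, table name and helper names are hypothetical.

    import random

    # Hypothetical Zoo-style schema: attribute name -> candidate values.
    SCHEMA = {
        "LEGS": [0, 2, 4, 6, 8],
        "PREDATOR": ["true", "false"],
        "FEATHERS": ["true", "false"],
        "VENOMOUS": ["true", "false"],
    }
    OPERATORS = ["=", "<>", "<", ">"]   # the paper's "logical operators"
    CONNECTIVES = ["AND", "OR"]         # added at random during translation

    def random_genotype(selection_prob):
        """Build a variable-length genotype: one (attribute, operator, value) gene per
        attribute that survives its selection probability (0.5 for every attribute at the start)."""
        genotype = []
        for attr, values in SCHEMA.items():
            if random.random() < selection_prob[attr]:
                genotype.append((attr, random.choice(OPERATORS), random.choice(values)))
        return genotype

    def to_sql(genotype, table="Animals"):
        """Translate a genotype into an SQL query by appending each gene to a
        'SELECT * FROM ... WHERE' string, joined by randomly chosen AND/OR connectives."""
        if not genotype:
            return f"SELECT * FROM {table}"   # degenerate genotype with no constraints
        query = f"SELECT * FROM {table} WHERE "
        for i, (attr, op, value) in enumerate(genotype):
            if i > 0:
                query += f" {random.choice(CONNECTIVES)} "
            query += f"{attr} {op} {value}"
        return query

    # An initial population of 100 genotypes, every attribute starting at probability 0.5.
    probs = {attr: 0.5 for attr in SCHEMA}
    population = [random_genotype(probs) for _ in range(100)]
    print(to_sql(population[0]))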

Such SQL queries, once constructed, were sent to the database and the results analysed. Each genotype was assigned a fitness value according to the extent to which its results corresponded with a target result set T. The fitness function used was

    fitness = 100 - falsePositives - (2 * falseNegatives)

where 100 was an arbitrarily chosen constant. This fitness function was adapted from Ryu and Eick [7], whose work deals with deriving queries from object-oriented databases. falsePositives is the number of records that were incorrectly identified as belonging to T, and falseNegatives is the number of records that should have been included in T but were not. The fitness function punishes false negatives more than false positives. If a query returns no false negatives but several false positives, it can be seen to be correctly identifying the target result set while generalising too much, whereas a query that returns false negatives is simply incorrect. By punishing false negatives more, it was hoped to apply evolutionary pressure favouring queries that better classified the training data.

After fitness values were assigned to the 100 queries, the three best and three worst genotypes were selected. If a perfect classifier was found (with a fitness of 100), the evolution terminated; otherwise the attributes had their probabilities re-weighted. Every attribute that appeared in the top three fittest genotypes had its selection probability incremented by 1%. Every attribute in the worst three genotypes had its probability decremented by 1%. The old genotypes were then discarded, and a new set of 100 genotypes was randomly created using the self-adapted probabilities. Over a period of generations, attributes that contributed to higher fitness values came to dominate the genotype set, whereas attributes that contributed little to a genotype featured less and less.

2.2 Discussions

Our algorithm departs from the metaphor commonly used in evolutionary algorithms; however, it does offer a mechanism through which the genotypes iteratively converge on the region of the search space that offers the greatest classification utility. Although the genetic information of parents is not inherited directly by offspring, the genetic information of the whole population is inherited by the next population. This inheritance is probabilistically biased toward more useful genetic material; hence, more useful genetic material occurs more frequently in the population. It is hoped that classification rules may be discovered as a consequence of this.
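Continuing the hypothetical sketch above, and reusing its SCHEMA, random_genotype and to_sql names, the following shows how the fitness function and the per-attribute probability re-weighting could fit together; run_query is an assumed helper that executes a query and returns the set of matching record identifiers.

    def fitness(result_ids, target_ids):
        """Fitness as defined above; false negatives are punished twice as hard as false positives."""
        false_positives = len(result_ids - target_ids)
        false_negatives = len(target_ids - result_ids)
        return 100 - false_positives - (2 * false_negatives)

    def evolve(run_query, target_ids, generations=100, pop_size=100):
        """One run of the self-adaptive scheme: no crossover or mutation, only per-attribute
        selection probabilities re-weighted by +/- 1% each generation."""
        probs = {attr: 0.5 for attr in SCHEMA}
        best_sql, best_fit = None, float("-inf")
        for _ in range(generations):
            population = [random_genotype(probs) for _ in range(pop_size)]
            scored = []
            for genotype in population:
                sql = to_sql(genotype)
                score = fitness(run_query(sql), target_ids)  # run_query returns a set of record ids
                scored.append((score, genotype, sql))
            scored.sort(key=lambda entry: entry[0], reverse=True)
            if scored[0][0] > best_fit:
                best_fit, best_sql = scored[0][0], scored[0][2]
            if best_fit >= 100:                              # perfect classifier: stop early
                return best_sql, best_fit
            for _, genotype, _ in scored[:3]:                # reward attributes in the best three
                for attr, _, _ in genotype:
                    probs[attr] = min(1.0, probs[attr] + 0.01)
            for _, genotype, _ in scored[-3:]:               # penalise attributes in the worst three
                for attr, _, _ in genotype:
                    probs[attr] = max(0.0, probs[attr] - 0.01)
        return best_sql, best_fit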

3 Experimental Studies

Several experiments have been carried out to evaluate the effectiveness and efficiency of the proposed framework. All datasets were downloaded from the UCI Machine Learning Repository (http://www1.ics.uci.edu/~mlearn/mlrepository.html). Each dataset was tested with 20 independent runs. If a perfect classifier was not found after 100 generations, the best classifier found to date was returned. The results were averaged over the 20 runs and are presented below.

3.1 The Zoo Dataset

The Zoo dataset contains data items that describe animals. In total 14 attributes are provided, of which 13 are boolean and one has a predefined integer range. The animals are classified into 7 different types. Table 1 describes the results for the Zoo dataset. ANG refers to the average number of generations that it took for our algorithm to find a perfect classifier.

Table 1. Results for the Zoo dataset, showing the performance of the evolved classifying queries for each animal type, averaged over 20 runs.

    Type  False Positives  False Negatives  ANG   Accuracy
    1     0                0                0.8   100.0%
    2     0                0                0.7   100.0%
    3     1                0                n/a   83.3%
    4     0                0                4.7   100.0%
    5     0                0                21.0  100.0%
    6     0                0                44.5  100.0%
    7     2                0                n/a   83.3%

It can be seen that our algorithm performed well on most of the classification tasks. The two instances in which it failed to find perfect classifiers are the most difficult tasks within the dataset, as both involve a very small set of animals. In both cases, however, the best queries did not include false negatives.

3.2 Monk's Problems

The Monk's Problems dataset involves data items with six attributes, all of which are predefined integers between 1 and 4. The first Monk's problem is the identification of data patterns where (B = C) or (E = 1). The second problem is the identification of all data patterns that feature exactly two of (B = 1, C = 1, D = 1, E = 1, F = 1, G = 1). The third Monk's problem is the identification of data patterns where (F = 3 and E = 1) or (F != 4 and C != 3), with 5% noise added to the training set. The results, averaged over 20 runs, are summarised in Table 2.

Table 2. Results for the Monk's Problems dataset, showing the performance of the best queries for each problem. ANG refers to the average number of generations that it took for our algorithm to find a perfect classifier.

    Problem    False Positives  False Negatives  ANG   Accuracy
    Problem 1  0                0                40.6  100.0%
    Problem 2  85               0                n/a   16.9%
    Problem 3  5                3                n/a   94.7%

Our algorithm performed perfectly on the first problem and very well on the third, but performed poorly on the second problem. Part of the reason lies in SQL's inherent difficulty in expressing the desired condition. The second Monk's problem requires a solution that compares relative attribute values, whereas SQL is usually used to select records according to a set of disjunctive attribute constraints.
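To make the representational limitation concrete, the hypothetical query below (shown as a Python string for consistency with the earlier sketches) expresses the second problem's "exactly two of" concept directly: it has to count satisfied constraints, for example with a sum of CASE expressions, a form that lies outside the flat AND/OR chains of [attribute] [operator] [value] genes that the algorithm can construct. The table name is an assumption.

    # Hypothetical hand-written query for the second Monk's problem: "exactly two of
    # B=1, C=1, D=1, E=1, F=1, G=1". Counting satisfied constraints needs CASE
    # expressions (or similar), which the evolved gene format cannot produce.
    monks2_target = """
    SELECT *
    FROM Monks2
    WHERE (CASE WHEN B = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN C = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN D = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN E = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN F = 1 THEN 1 ELSE 0 END)
        + (CASE WHEN G = 1 THEN 1 ELSE 0 END) = 2
    """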

3.3 Credit Card Approval

The credit card approval dataset contains anonymised information on credit card application approvals and rejections. The dataset contains a variety of attribute types, with some attributes having predefined values and others having continuous values. The dataset also features 5% noise. Our algorithm succeeded in correctly identifying, on average, 82.9% of the rejections. However, this relative success is countered by the fact that the classifier also included a large number of false positives: 101 on average, accounting for nearly 20% of the dataset size.

3.4 Discussion of the Results

The results for the Zoo and Monk's Problems datasets are encouraging. Our algorithm performs worst on the second Monk's problem, which may be because the problem is not structurally conducive to an SQL-based classification rule, although future refinements of our algorithm will hopefully improve upon these results.

The results with the credit card approval dataset also show room for improvement. This may be due to its inclusion of continuous variables. Our algorithm performs poorly with continuous-valued attributes because, although it can identify attributes that are valuable in making a classification, it cannot make the same distinction for logical operators or values. The algorithm needs to find suitable operators and values, as well as suitable attributes, for good classification. It is proposed that logical operators will also be given initial selection probabilities, which will be incremented or decremented according to the effect they have upon the fitness value of their genotype.
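A minimal sketch of how this proposed extension might look, again reusing the hypothetical OPERATORS list and random module from the earlier sketches: each operator receives its own selection probability, used as a weight when a gene is built.

    def weighted_gene(attr, values, op_probs):
        """Pick the gene's operator using per-operator selection probabilities (as weights),
        instead of uniformly at random."""
        ops, weights = zip(*op_probs.items())
        op = random.choices(ops, weights=weights, k=1)[0]
        return (attr, op, random.choice(values))

    # Operators would start with equal selection probabilities and, like attributes, be
    # incremented/decremented by 1% according to the genotypes in which they appear.
    op_probs = {op: 0.5 for op in OPERATORS}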

4 Conclusions

By using evolutionary computation techniques to evolve SQL queries, it is possible to create a data mining framework that both produces easily readable results and can be applied to any SQL-compliant database system. The problem considered here is somewhat different from the conventional classification problem. The key question we are addressing is: given a subset of data in a large database, how can we gain a better understanding of it? Our solution is to evolve human-comprehensible SQL queries that describe the data.

The algorithm proposed in this paper differs from many traditional evolutionary algorithms in that it does not use the metaphor of selection, whereby the fittest individuals have their traits inherited by the new generation through operations such as crossover or mutation. Rather, it rewards the attributes that make individuals successful, and then repeats the initial step of creation. In other words, rather than survival of the fittest, this work operates on the principle of survival of the qualities that make the fittest fit. Although many genetic algorithms feature mutation, it is usually scaled down so that it does not destroy useful structures that evolution may already have constructed. Our approach differs in that it divorces the importance of an attribute from the values that the attribute happens to take in a given gene. As such, it effects an evolutionary fluidity that results in an appealingly diverse population, one more likely to distribute itself over the entire search space than to converge on a local optimum.

Although our preliminary experimental results are promising, they also leave room for improvement. It is hoped that future improvements in dealing with continuous variables will improve performance.

References

1. X. Yao and Y. Liu, A new evolutionary system for evolving artificial neural networks, IEEE Transactions on Neural Networks, 8(3):694-713, May 1997.
2. X. Yao and Y. Liu, Making use of population information in evolutionary artificial neural networks, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 28(3):417-425, June 1998.
3. Y. Liu, X. Yao and T. Higuchi, Evolutionary ensembles with negative correlation learning, IEEE Transactions on Evolutionary Computation, 4(4):380-387, November 2000.
4. J. Bobbin and X. Yao, Evolving rules for nonlinear control, in M. Mohammadian (ed.), New Frontier in Computational Intelligence and its Applications, IOS Press, Amsterdam, 2000, pp. 197-202.
5. A. A. Freitas, A genetic programming framework for two data mining tasks: classification and knowledge discovery, in Genetic Programming 1997: Proc. 2nd Annual Conference, Stanford University, 1997, pp. 96-101.
6. A. A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, in A. Ghosh and S. Tsutsui (eds.), Advances in Evolutionary Computation, Springer-Verlag, 2001.
7. T. W. Ryu and C. F. Eick, Deriving queries from results using genetic programming, in Proc. 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 303-306.