MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Alfredo Ferro, Rosalba Giugno, Piera Laura Puglisi, and Alfredo Pulvirenti
Dept. of Mathematics and Computer Sciences, University of Catania
{ferro,giugno,lpuglisi,apulvirenti}@dmi.unict.it

Abstract. The development of predictive applications built on top of knowledge bases is growing rapidly; as a result, database systems, especially commercial ones, are being equipped with native data mining and analytical tools. In this paper, we present an integration of data mining primitives on top of MySQL 5.1. In particular, we extended MySQL to support frequent itemset computation and classification based on C4.5 decision trees. These primitives are recognized by the parser, which was extended to support new SQL statements. Moreover, the implemented algorithms were engineered and integrated into the MySQL source code in order to support large-scale applications and fast response times. Finally, a graphical interface guides the user in exploring the new data mining facilities.

Key words: Data Mining, MySQL, APRIORI, Decision trees

1 Introduction

Commercial database systems such as Oracle and SQL Server are equipped with a wide range of native data mining primitives. They provide predictive analytical tools with graphical user interfaces that allow users to access and explore data in order to find patterns, relations and hidden knowledge. Among open source databases, a widely used system such as MySQL lacks such data mining primitives. Some basic mining tasks can be performed by writing complex SQL queries, and others can be run through stand-alone suites (WEKA [6], RAPIDMINER 1). However, those approaches do not scale well with the size of the data and are therefore unsuitable for most applications. In this paper, we present MySQL Data Mining 2, a web-based tool that integrates frequent itemset computation [1] and classification based on the C4.5 algorithm [3] on top of MySQL.
These algorithms were implemented in C++ and integrated into the standard distribution of MySQL version 5.1 on Linux. In order to execute these commands, the parser of the MySQL server was modified and extended to support new SQL statements. Moreover, the implemented algorithms were engineered to support large-scale applications and fast response times. Finally, a graphical interface guides the user in exploring the new data mining facilities.

The demo is organized as follows. Section 2 briefly reviews the data mining algorithms. Section 3 describes the new SQL statements and the main steps of the integration process. Section 4 shows the navigation of the graphical interface. Section 5 reports preliminary performance results. Finally, Section 6 reports conclusions and proposes future extensions.

1 http://rapid-i.com/
2 http://kdd.dmi.unict.it/tedata/

2 Data Mining algorithms in MySQL Data Mining

This section briefly describes the data mining algorithms integrated in MySQL.

The Frequent Itemsets computation algorithm. Mining frequent itemsets can support business decision-making processes such as cross-marketing or the analysis of customer buying behavior. APRIORI is an algorithm proposed in [1] for finding all frequent itemsets in a transactional database. It uses an iterative level-wise approach based on candidate generation, in which (k + 1)-itemsets are explored starting from the previously generated frequent k-itemsets. Let L_k be the set of frequent k-itemsets and C_k the set of candidate k-itemsets. Our implementation consists of two steps:

1. Join Step: find C_k by joining L_{k-1} with itself [1].
2. Find Step: find L_k, i.e. the subset of C_k consisting of frequent itemsets. This step is implemented following the strategy presented in MAFIA [2]. It uses a vertical bitmap representation of the transactions and performs bitwise AND operations to determine the frequency of the itemsets.

The algorithm iterates until L_k = ∅.

Classification based on C4.5 algorithm. Classification allows the extraction of models describing important data classes, which can then be used for future predictions. A typical example is a classification model that categorizes bank loan applications as either safe or risky. Data classification is a two-step process.
In the first step, a classifier is built describing a predetermined set of data classes. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing a training set consisting of tuples and their associated class labels. In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated; then, if the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples. MySQL was extended to support data classification using Quinlan's implementation of the C4.5 algorithm [3]. This algorithm uses decision trees as classifiers, in which internal nodes are tests on an attribute and branches represent the outcomes of the test. Leaf nodes contain a class label.

3 Integrating data mining algorithms into MySQL

The MySQL architecture consists of five subsystems (the query engine and the storage, buffer, transaction, and recovery managers) that interact with each other in order to accomplish the user tasks. In particular, the query engine contains the syntax parser, the query optimizer and the execution component. The syntax parser decomposes the received SQL commands into a form that can be understood by the MySQL engine. The query optimizer prepares the execution plan and passes it to the execution component, which interprets it and retrieves the records by interacting with the storage manager.

The integration of the new data mining procedures required the following steps:

1. implementation and optimization of the algorithms described in Section 2;
2. definition of new SQL statements for the execution of 1. and extension of the Bison grammar file (MySQL syntax parser);
3. integration of 1. in the MySQL server by modifying the query engine.

The first step is described in Section 2. The next sections describe the other steps.

3.1 Extension of MySQL syntax parser

As an example, Figure 2 shows the main phases of the integration of Apriori. In order to define the new command, we modified the parser by:

- extending MySQL's list of symbols with new keywords (e.g. APRIORI): lexical analysis recognizes the new symbols once they are defined as MySQL keywords;
- adding to the parser (i.e. the Bison grammar file) new grammar rules for the introduced primitive: the parser matches the grammar rules corresponding to SQL statements and executes the code associated with those rules.

The syntax of the SQL statement for APRIORI is reported in Figure 1 (a).

Fig. 1. SQL statements syntax. (a) Apriori command. (b) Create model and classify commands.

The table name represents the input table for Apriori. Moreover, the user has to specify the minimum support (threshold) and the columns containing transaction ids (col name tid) and item ids (col name item). Other optional parameters can be used to (i) limit the size of the itemsets to compute, (ii) report other information related to items (such as the details of the related transactions) and (iii) specify the type of storage engine (MyISAM, InnoDB, etc.). The default storage engine is MyISAM. This command is recognized and executed by MySQL Data Mining. The result is then stored in the database and remains available to the user for further analysis sessions.

Figure 1 (b) reports the syntax of the SQL statements for the training and classification phases. The integration of these SQL commands followed exactly the same steps as the Apriori integration. Here, the generation of the model requires the specification of the training set (training table) and the attribute representing the class (class name). The model supports Information Gain (the default) and Gain Ratio (GR) as splitting conditions. In this phase, the classification rules are stored in the database. Next, the classification of data tuples (new table) is performed by selecting a previously generated model (rules table).

3.2 Extension of MySQL execution component

Figure 2 reports the modifications to the MySQL engine. Here, the MySQL execution component was modified in the following way: (i) the main MySQL procedures (server side) were modified to support the execution of the new Apriori command; (ii) new C++ code (computing frequent itemsets as described in Section 2) was added to the standard distribution. Moreover, CREATE and INSERT SQL statements are executed at the low level of the engine in order to store the frequent itemsets in a relational table, which is then available to the user for querying the result.

Fig. 2. Framework of MySQL: syntax parser and MySQL engine modifications.
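
As a rough illustration of the kind of computation performed by the C++ code in step (ii), the join and find steps of Section 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the code integrated into MySQL: the names Itemset, Bitmap, andCount and aprioriBitmap are hypothetical, the input is assumed to be a list of (transaction id, item id) pairs with transaction ids in the range [0, numTx), and the subset-based candidate pruning of [1] is omitted because the support count alone already discards infrequent candidates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

using Itemset = std::vector<int>;            // item ids in increasing order
using Bitmap  = std::vector<std::uint64_t>;  // bit t is set if transaction t contains the itemset

// Store (a AND b) in out and return its number of set bits, i.e. the support.
std::size_t andCount(const Bitmap& a, const Bitmap& b, Bitmap& out) {
    out.assign(a.size(), 0);
    std::size_t support = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] & b[i];
        for (std::uint64_t w = out[i]; w != 0; w &= w - 1) ++support;
    }
    return support;
}

// rows holds (transaction id, item id) pairs; returns every frequent itemset with its support.
std::map<Itemset, std::size_t> aprioriBitmap(
        const std::vector<std::pair<int, int>>& rows,
        std::size_t numTx, std::size_t minSupport) {
    assert(minSupport > 0);
    const std::size_t words = (numTx + 63) / 64;

    // Vertical representation: one bitmap over the transactions per item.
    std::map<int, Bitmap> itemBits;
    for (const auto& row : rows) {
        assert(row.first >= 0 && static_cast<std::size_t>(row.first) < numTx);
        const std::size_t tid = static_cast<std::size_t>(row.first);
        Bitmap& bits = itemBits[row.second];
        bits.resize(words, 0);
        bits[tid / 64] |= std::uint64_t{1} << (tid % 64);
    }

    std::map<Itemset, std::size_t> frequent;
    std::map<Itemset, Bitmap> level;  // L_k together with the bitmap of each itemset
    Bitmap scratch;
    for (const auto& entry : itemBits) {
        const std::size_t support = andCount(entry.second, entry.second, scratch);
        if (support >= minSupport) {
            level[Itemset{entry.first}] = entry.second;
            frequent[Itemset{entry.first}] = support;
        }
    }

    while (!level.empty()) {  // iterate until L_k is empty
        std::map<Itemset, Bitmap> next;
        for (auto a = level.begin(); a != level.end(); ++a) {
            for (auto b = std::next(a); b != level.end(); ++b) {
                const Itemset& x = a->first;
                const Itemset& y = b->first;
                // Join step: x and y must agree on all items except the last one.
                // The map is sorted, so once the prefix differs no later y can match.
                if (!std::equal(x.begin(), x.end() - 1, y.begin())) break;
                Itemset candidate = x;
                candidate.push_back(y.back());
                // Find step: the support is the bit count of the ANDed bitmaps.
                Bitmap bits;
                const std::size_t support = andCount(a->second, b->second, bits);
                if (support >= minSupport) {
                    frequent[candidate] = support;
                    next[candidate] = std::move(bits);
                }
            }
        }
        level = std::move(next);
    }
    return frequent;
}
```

Calling aprioriBitmap(rows, numTx, minSupport) on the pairs read from a two-column input table returns a map from each frequent itemset to its support, which is the kind of result the engine would then write to a relational table. Keeping the itemsets in a sorted map makes itemsets with a common prefix adjacent, so the join step only scans neighbouring entries.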

4 The user interface

We equipped MySQL Data Mining with a web interface based on the LAMPP 3 framework. The user starts by logging into the system with a MySQL account and then selects the database and the data mining algorithms to use. The web interface also contains a loader to upload data coming from external sources. The results of each task are stored in the database and remain available to the user in future sessions. This demo is available online 4.

4.1 The Frequent Itemsets computation web interface

The APRIORI interface includes the following modules:

1. data preparation module: the user can load data from external sources or choose the data from tables stored in the database. The input table for Apriori must have at least two columns, in which the first one contains the transaction ID and the second one contains the item ID (see Fig. 3 (b));
2. statement preparation module: the user can set the input parameters to generate the frequent itemsets, that is, the input table, the fields representing the transactions and the items, the support threshold and an optional limit on the size of the frequent itemsets (see Fig. 3 (a));
3. data analysis module: it is possible to visualize and query the tables containing the frequent itemsets.

4.2 The Decision tree algorithm

The generation of the model is supported by a simple interface which guides the user in selecting the input table and the class from the list of possible attributes. Optionally, the user can set Gain Ratio as the splitting condition. Moreover, the user can create an input table by getting the schema and the data from external files (.names, .data). The classification is performed by choosing a previously generated model and a set of tuples to be classified. Such tuples are provided by the user in a table whose schema must be consistent with the selected model and must contain a column corresponding to the classifier attribute.
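
To make the two splitting conditions concrete, the following minimal sketch computes Information Gain and Gain Ratio for a single candidate attribute. It is an illustration under stated assumptions rather than the code used by the system: the names entropy, SplitScore and scoreSplit are hypothetical, the attribute is assumed to be categorical, and the treatment of continuous attributes and missing values in C4.5 [3] is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Shannon entropy (in bits) of a distribution given as counts that sum to total.
double entropy(const std::map<std::string, std::size_t>& counts, std::size_t total) {
    double h = 0.0;
    for (const auto& kv : counts) {
        if (kv.second == 0) continue;
        const double p = static_cast<double>(kv.second) / total;
        h -= p * std::log2(p);
    }
    return h;
}

struct SplitScore {
    double infoGain;   // entropy reduction obtained by splitting on the attribute
    double gainRatio;  // infoGain normalised by the entropy of the split itself
};

// rows holds (attribute value, class label) pairs for one candidate attribute.
SplitScore scoreSplit(const std::vector<std::pair<std::string, std::string>>& rows) {
    assert(!rows.empty());
    std::map<std::string, std::size_t> classCounts;
    std::map<std::string, std::size_t> valueCounts;
    std::map<std::string, std::map<std::string, std::size_t>> classCountsPerValue;
    for (const auto& row : rows) {
        ++classCounts[row.second];
        ++valueCounts[row.first];
        ++classCountsPerValue[row.first][row.second];
    }

    const std::size_t n = rows.size();
    // Expected class entropy after the split, weighted by partition size.
    double remainder = 0.0;
    for (const auto& kv : classCountsPerValue) {
        const std::size_t m = valueCounts[kv.first];
        remainder += static_cast<double>(m) / n * entropy(kv.second, m);
    }

    const double gain = entropy(classCounts, n) - remainder;
    const double splitInfo = entropy(valueCounts, n);
    // An attribute with a single value carries no split information; report 0.
    const double ratio = splitInfo > 0.0 ? gain / splitInfo : 0.0;
    return {gain, ratio};
}
```

A tree builder would evaluate scoreSplit for every candidate attribute at a node and split on the one with the highest infoGain (the default condition) or the highest gainRatio (when Gain Ratio is selected). Gain Ratio penalises attributes with many distinct values, which plain Information Gain tends to favour.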
Fig. 3. (a) Statement preparation module. (b) Data preparation module. (c) Data analysis module.

5 Performance Analysis

In this section, we report some preliminary experiments on the performance of frequent itemset (FI) computation in our MySQL Data Mining system. The experiments were performed on an HP Proliant DL380 with 4 GB of RAM running Debian Linux. We used two benchmark datasets, called mushroom and chess, obtained from the FIMI (Frequent Itemset Mining Implementations) repository 5. The mushroom dataset contains 8124 transactions with 23 items per transaction, whereas the chess dataset contains 3196 transactions with 37 items per transaction.

Fig. 4 reports the running time of our MySQL Apriori without I/O operations (Apriori-DM) and the total execution time needed by the SQL statement (Apriori-SQL). We also show a comparison of our standalone Apriori algorithm (Apriori-extern) with two freely available standalone Apriori implementations, by Bodon 6 [4] and Borgelt 7 [5] respectively. Comparisons with MAFIA are not reported, since that algorithm is optimized for maximal frequent itemsets (MFI).

On the mushroom dataset (Fig. 4 (a)), Apriori-DM outperforms Bodon [4] because of the use of bitmaps during the verification phase [2]. However, Borgelt's implementation yields the best results. On the chess dataset (Fig. 4 (b)), Apriori-DM and Apriori-SQL outperform all the standalone implementations. This is because the number of generated FIs is very high and I/O operations in a DBMS are faster than I/O operations on text files.

3 http://www.apachefriends.org/it/xampp-linux.html
4 http://kdd.dmi.unict.it/tedata/ (use guest/guest10 to access)
5 http://fimi.cs.helsinki.fi/
6 http://www.cs.bme.hu/~bodon/en/apriori/
7 http://www.borgelt.net/apriori.html

Fig. 4. Running times for varying support thresholds. (a) Mushroom dataset. (b) Chess dataset.

6 Conclusions and future work

We have presented an integration of data mining algorithms into MySQL. Unlike other database systems, MySQL lacks such features. Although users can work around this limitation, for example with complex SQL queries or stand-alone suites, such workarounds are unsuitable for most applications. The main advantage of this approach is that users can perform complex data mining analyses through simple SQL commands. Future work includes the integration of a wider range of data mining algorithms together with statistical primitives.

Acknowledgement

We thank all the students who collaborated on the development of the system, in particular Aurelio Giudice, Luciano Gusmano, Antonino Schembri, and Tiziana Zapperi.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
2. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. of the 17th International Conference on Data Engineering, pages 77-90, April 2001.
3. J. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
4. F. Bodon. Surprising results of trie-based FIM algorithms. In 2nd Workshop of Frequent ItemSet Mining Implementations (FIMI 2004, Brighton, UK).
5. C. Borgelt. Recursion Pruning for the Apriori Algorithm. In 2nd Workshop of Frequent ItemSet Mining Implementations (FIMI 2004, Brighton, UK).
6. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.