MySQL Data Mining: Extending MySQL to support data mining primitives (demo)

Alfredo Ferro, Rosalba Giugno, Piera Laura Puglisi, and Alfredo Pulvirenti
Dept. of Mathematics and Computer Sciences, University of Catania
{ferro,giugno,lpuglisi,apulvirenti}@dmi.unict.it

Abstract. The development of predictive applications built on top of knowledge bases is growing rapidly; as a result, database systems, especially commercial ones, are increasingly equipped with native data mining and analytical tools. In this paper, we present an integration of data mining primitives on top of MySQL 5.1. In particular, we extended MySQL to support frequent itemset computation and classification based on C4.5 decision trees. These commands are recognized by the parser, which was extended to support new SQL statements. Moreover, the implemented algorithms were engineered and integrated into the MySQL source code in order to support large-scale applications and a fast response time. Finally, a graphical interface guides the user in exploring the new data mining facilities.

Key words: Data Mining, MySQL, APRIORI, Decision trees

1 Introduction

Commercial database systems such as Oracle and SQL Server are equipped with a wide range of native data mining primitives. They provide predictive analytical tools with graphical user interfaces that allow users to access and explore data to find patterns, relations and hidden knowledge. Among open source databases, widely used systems such as MySQL lack such data mining primitives. Some basic mining tasks can be performed by writing complex SQL queries; others can be carried out through standalone suites (WEKA [6], RAPIDMINER (footnote 1)). However, those approaches do not scale well with the size of the data and are therefore unsuitable for most applications.

In this paper, we present MySQL Data Mining (footnote 2), a web-based tool that integrates frequent itemset computation [1] and classification based on the C4.5 algorithm [3] on top of MySQL. These algorithms were implemented in C++ and integrated into the standard distribution of MySQL version 5.1 on Linux. In order to execute these commands, the parser of the MySQL server was modified and extended to support new SQL statements. Moreover, the implemented algorithms were engineered to support large-scale applications and a fast response time. Finally, a graphical interface guides the user in exploring the new data mining facilities.

The demo is organized as follows. Section 2 briefly reviews the data mining algorithms. Section 3 describes the new SQL statements and the main steps of the integration process. Section 4 shows the navigation of the graphical interface. Section 5 reports a preliminary performance analysis. Finally, Section 6 reports conclusions and proposes future extensions.

1 http://rapid-i.com/
2 http://kdd.dmi.unict.it/tedata/

2 Data Mining algorithms in MySQL Data Mining

This section briefly describes the data mining algorithms integrated in MySQL.

The Frequent Itemsets computation algorithm. Mining frequent itemsets can support business decision-making processes such as cross-marketing or the analysis of customer buying behavior. APRIORI is an algorithm proposed in [1] for finding all frequent itemsets in a transactional database. It uses an iterative level-wise approach based on candidate generation, exploring (k + 1)-itemsets from previously generated k-itemsets. Let $L_k$ be the set of frequent k-itemsets and $C_k$ the set of candidate k-itemsets. Our implementation consists of two steps:

1. Join Step: find $C_k$ by joining $L_{k-1}$ with itself [1].
2. Find Step: find $L_k$, i.e. the subset of $C_k$ containing the frequent itemsets. This step is implemented following the strategy presented in MAFIA [2]. It uses a vertical bitmap representation of transactions and performs bitwise AND operations to determine the frequency of the itemsets.

The algorithm iterates until $L_k = \emptyset$.

Classification based on C4.5 algorithm. Classification extracts models describing important data classes that can be used for future predictions. A typical example is a classification model that categorizes bank loan applications as either safe or risky. Data classification is a two-step process.
In the first step, a classifier is built describing a predetermined set of data classes. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing a training set consisting of tuples and their associated class labels. In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated; then, if the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples. MySQL was extended to support data classification using the implementation of Quinlan's C4.5 algorithm [3]. This algorithm uses decision trees as classifiers, in which internal nodes are tests on an attribute, branches represent the outcomes of the test, and leaf nodes contain a class label.

3 Integrating data mining algorithms into MySQL

The MySQL architecture consists of five subsystems (the query engine and the storage, buffer, transaction, and recovery managers) that interact with each other in order to accomplish the user's tasks. In particular, the query engine contains the syntax parser, the query optimizer and the execution component. The syntax parser decomposes the received SQL commands into a form that can be understood by the MySQL engine. The query optimizer prepares the execution plan and passes it to the execution component, which interprets it and retrieves the records by interacting with the storage manager.

The integration of new data mining procedures required the following steps:

1. implementation and optimization of the algorithms described in Section 2;
2. definition of new SQL statements for the execution of 1. and extension of the Bison grammar file (MySQL syntax parser);
3. integration of 1. into the MySQL server by modifying the query engine.

The first step is described in Section 2. The next sections describe the other steps.

3.1 Extension of MySQL syntax parser

As an example of computation, Figure 2 shows the main phases of the integration of Apriori. In order to define the new command, we modified the parser by:

- extending MySQL's list of symbols with new keywords (e.g. APRIORI): lexical analysis recognizes the new symbols once they are defined as MySQL keywords;
- adding to the parser (i.e. the Bison grammar file) new grammar rules for the introduced primitive: the parser matches the grammar rules corresponding to SQL statements and executes the code associated with those rules.

The syntax of the SQL statement for APRIORI is reported in Figure 1 (a).

Fig. 1. SQL statements syntax. (a) Apriori command. (b) Create model and classify commands.

The table name represents the input table for Apriori. Moreover, the user has to specify the minimum support (threshold) and the columns containing transaction ids (col name tid) and item ids (col name item). Other optional parameters can be used to (i) limit the size of the itemsets to compute, (ii) report other information related to items (such as the details of the related transactions) and (iii) specify the type of storage engine (MyISAM, InnoDB, etc.). The default storage engine is MyISAM. This command is recognized and executed by MySQL Data Mining. The result is then stored in the database and remains available to the user for further analysis sessions.

Figure 1 (b) also reports the syntax of the SQL statements for the training and classification phases. The integration of these SQL commands followed exactly the same steps as the Apriori integration. Here, the generation of the model requires the specification of the training set (training table) and the attribute representing the class (class name). The model implements Information Gain (the default) and Gain Ratio (GR) as splitting conditions. In this phase, classification rules are stored in the database. Next, the classification of data tuples (new table) is performed by selecting a previously generated model (rules table).

3.2 Extension of MySQL execution component

Figure 2 reports the modifications to the MySQL engine. Here, the MySQL execution component was modified in the following way: (i) the main MySQL procedures (server side) were modified to support the execution of the new Apriori command; (ii) new C++ code (computing frequent itemsets as described in Section 2) was added to the standard distribution. Moreover, CREATE and INSERT SQL statements are executed at the low level of the engine in order to store the frequent itemsets in a relational table, which the user can then query to inspect the result.

Fig. 2. Framework of MySQL: syntax parser and MySQL engine modifications.
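To make the frequent itemset code mentioned in Section 3.2 more concrete, the following is a minimal C++ sketch of the join and find steps described in Section 2. It is not the authors' implementation: the names `apriori`, `Level`, `Itemset`, `Bitmap` and the fixed bound `kMaxTransactions` are assumptions made for illustration, the input is taken to be a list of (transaction id, item id) pairs, and the candidate pruning of the original Apriori algorithm [1] is omitted for brevity.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// Assumption for this sketch: at most 64 transactions, with ids in [0, 64).
constexpr std::size_t kMaxTransactions = 64;
using Bitmap  = std::bitset<kMaxTransactions>;  // bit t is set iff transaction t contains the itemset
using Itemset = std::vector<int>;               // item ids in ascending order
using Level   = std::map<Itemset, Bitmap>;      // itemsets of one size, in lexicographic order

// Returns every itemset whose support is at least minSupport, with its bitmap.
// rows holds (transaction id, item id) pairs, like the two-column input table.
Level apriori(const std::vector<std::pair<int, int>>& rows, std::size_t minSupport) {
    // Vertical layout: one bitmap per item, one bit per transaction.
    std::map<int, Bitmap> itemBits;
    for (const auto& row : rows) {
        assert(row.first >= 0 && static_cast<std::size_t>(row.first) < kMaxTransactions);
        itemBits[row.second].set(static_cast<std::size_t>(row.first));
    }

    // L_1: the frequent single items.
    Level current;
    for (const auto& entry : itemBits) {
        if (entry.second.count() >= minSupport) current[Itemset{entry.first}] = entry.second;
    }

    Level all;
    while (!current.empty()) {  // iterate until L_k is empty
        all.insert(current.begin(), current.end());
        Level next;
        for (auto a = current.begin(); a != current.end(); ++a) {
            for (auto b = std::next(a); b != current.end(); ++b) {
                // Join step: combine two itemsets that differ only in their last item.
                if (!std::equal(a->first.begin(), a->first.end() - 1, b->first.begin())) continue;
                Itemset candidate = a->first;
                candidate.push_back(b->first.back());
                // Find step: a bitwise AND yields the transactions containing the candidate.
                const Bitmap support = a->second & b->second;
                if (support.count() >= minSupport) next[candidate] = support;
            }
        }
        current = std::move(next);
    }
    return all;
}
```

For example, with the four transactions {1,2,3}, {1,2}, {1,3} and {2,3} and a minimum support of 2, the sketch keeps the three single items and the three pairs, and discards {1,2,3}, which occurs in only one transaction. A production version would replace the fixed-size `std::bitset` with a dynamically sized bitmap.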

4 The user interface

We equipped MySQL Data Mining with a web interface based on the LAMPP (footnote 3) framework. The user starts by logging into the system with a MySQL account and then selects the database and the data mining algorithm to use. The web interface also contains a loader to upload data coming from external sources. The results of each task are stored in the database and remain available to the user in future sessions. This demo is available online (footnote 4).

4.1 The Frequent Itemsets computation web interface

The APRIORI interface includes the following modules:

1. data preparation module: the user can load data from external sources or choose the data from tables stored in the database. The input table for Apriori must have at least two columns, in which the first one contains the transaction ID and the second one contains the item ID (see Fig. 3 (b));
2. statement preparation module: the user can set the input parameters to generate the frequent itemsets, that is, the input table, the fields representing the transactions and the items, the support threshold and an optional limit on the size of the frequent itemsets (see Fig. 3 (a));
3. data analysis module: the user can visualize and query the tables containing the frequent itemsets.

4.2 The Decision tree algorithm

The generation of the model is supported by a simple interface that guides the user in selecting the input table and the class from the list of possible attributes. Optionally, the user can set Gain Ratio as the splitting condition. Moreover, the user can create an input table by getting the schema and the data from external files (.names, .data). The classification is performed by choosing a previously generated model and a set of tuples to be classified. Such tuples are provided by the user in a table whose schema must be consistent with the selected model and must contain a column corresponding to the classifier attribute.
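The two splitting conditions offered by the interface, Information Gain and Gain Ratio, can be illustrated with a short C++ sketch. This is an assumption-laden illustration rather than the code integrated in MySQL: the names `entropy`, `scoreSplit` and `SplitScore` are hypothetical, only a single categorical attribute is scored, and the handling of continuous attributes and pruning in C4.5 [3] is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Shannon entropy, in bits, of a distribution given as counts that sum to total.
double entropy(const std::map<std::string, std::size_t>& counts, std::size_t total) {
    double h = 0.0;
    for (const auto& kv : counts) {
        if (kv.second == 0) continue;
        const double p = static_cast<double>(kv.second) / static_cast<double>(total);
        h -= p * std::log2(p);
    }
    return h;
}

struct SplitScore {
    double infoGain;   // reduction in class entropy obtained by the split
    double gainRatio;  // infoGain divided by the entropy of the split itself
};

// Scores one categorical attribute. Each row is an (attribute value, class label) pair.
SplitScore scoreSplit(const std::vector<std::pair<std::string, std::string>>& rows) {
    const std::size_t n = rows.size();
    if (n == 0) return {0.0, 0.0};

    std::map<std::string, std::size_t> classCounts;
    std::map<std::string, std::size_t> valueCounts;
    std::map<std::string, std::map<std::string, std::size_t>> classCountsPerValue;
    for (const auto& row : rows) {
        ++classCounts[row.second];
        ++valueCounts[row.first];
        ++classCountsPerValue[row.first][row.second];
    }

    // Weighted class entropy that remains after partitioning on the attribute.
    double remainder = 0.0;
    for (const auto& kv : classCountsPerValue) {
        const std::size_t nv = valueCounts[kv.first];
        remainder += static_cast<double>(nv) / static_cast<double>(n) * entropy(kv.second, nv);
    }

    const double gain = entropy(classCounts, n) - remainder;
    const double splitInfo = entropy(valueCounts, n);
    return {gain, splitInfo > 0.0 ? gain / splitInfo : 0.0};
}
```

In the bank loan example of Section 2, an attribute whose two values separate the safe applications from the risky ones perfectly in a balanced training set gets an information gain of 1 bit and a gain ratio of 1, whereas an attribute that is independent of the class gets 0 for both. Gain Ratio normalizes the gain by the entropy of the split, which penalizes attributes with many distinct values.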
5 Performance Analysis

In this section, we report some preliminary experiments on the performance of frequent itemset (FI) computation in our MySQL Data Mining system. The experiments were performed on an HP ProLiant DL380 with 4GB of RAM running Debian Linux. We used two different benchmark datasets, called mushroom and chess, obtained from the FIMI (Frequent Itemset Mining Implementations) repository (footnote 5). The mushroom dataset contains 8124 transactions and 23 items per transaction, whereas the chess dataset contains 3196 transactions and 37 items per transaction.

Fig. 3. (a) Statement preparation module. (b) Data preparation module. (c) Data analysis module.

Fig. 4 reports the running time of our MySQL Apriori (Apriori-DM) without I/O operations and the total execution time needed by the SQL statement (Apriori-SQL). We also show a comparison of our standalone Apriori algorithm (Apriori-extern) with two freely available standalone Apriori implementations, by Bodon (footnote 6) [4] and Borgelt (footnote 7) [5]. Comparisons with MAFIA are not reported since that algorithm is optimized for maximal frequent itemsets (MFI).

On the mushroom dataset (Fig. 4 (a)), Apriori-DM outperforms Bodon [4] because of the use of bitmaps during the verification phase [2]. However, Borgelt's implementation yields the best results. On the chess dataset (Fig. 4 (b)), Apriori-DM and Apriori-SQL outperform all the standalone implementations. This is because the number of generated FI is very high and I/O operations in a DBMS are faster than I/O operations on text files.

3 http://www.apachefriends.org/it/xampp-linux.html
4 http://kdd.dmi.unict.it/tedata/ (use guest/guest10 to access)
5 http://fimi.cs.helsinki.fi/
6 http://www.cs.bme.hu/~bodon/en/apriori/
7 http://www.borgelt.net/apriori.html

Fig. 4. Running times varying the threshold. (a) Mushroom dataset. (b) Chess dataset.

6 Conclusions and future work

We have presented an integration of data mining algorithms into MySQL. Unlike other database systems, MySQL lacks such features. Although users may work around their absence, the workarounds are unsuitable for most applications. The main advantage of this approach is that users can run complex data mining analyses with simple SQL commands. Future work includes the integration of a wider range of data mining algorithms together with statistical primitives.

Acknowledgement

We thank all the students who collaborated on the development of the system, in particular Aurelio Giudice, Luciano Gusmano, Antonino Schembri, and Tiziana Zapperi.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conf., pages 487-499, 1994.
2. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. of the 17th International Conference on Data Engineering, pages 77-90, April 2001.
3. J. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
4. F. Bodon. Surprising results of trie-based FIM algorithms. In 2nd Workshop on Frequent Itemset Mining Implementations (FIMI 2004), Brighton, UK.
5. C. Borgelt. Recursion pruning for the Apriori algorithm. In 2nd Workshop on Frequent Itemset Mining Implementations (FIMI 2004), Brighton, UK.
6. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann, 2005.