On the Use of Data Mining Tools for Data Preparation in Classification Problems

2012 IEEE/ACIS 11th International Conference on Computer and Information Science

On the Use of Data Mining Tools for Data Preparation in Classification Problems

Paulo M. Gonçalves Jr., Roberto S. M. Barros and Davi C. L. Vieira
Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, 50.740-560, Recife, Brasil
Email: {pmgj,roberto}@cin.ufpe.br
Instituto Federal de Pernambuco, Cidade Universitária, 50.740-540, Recife, Brasil

Abstract: The data preparation phase is a critical step in the KDD (Knowledge Discovery in Databases) process. This phase is crucial for a good data mining result because, if data is not correctly prepared, all the following phases of the process are compromised. DMPML is a framework that stores preprocessed data for different data mining algorithms in an XML document and retrieves the correct codification, by means of an XSLT document, according to the needs of the data mining algorithm. This paper presents a comparison between DMPML and three data mining applications (Weka, RapidMiner, and KNIME) that implement the directed graph approach, concerning the time spent to create and execute the data preparation tasks for two data mining algorithms. The tests were executed using different types of data sets: numerical, categorical, and mixed. We observed that the scheme used by DMPML can simplify the usage of different data mining algorithms and significantly reduce the time spent creating the data preparation tasks.

Index Terms: Data preparation, DMPML, XML, Tools comparison

I. INTRODUCTION

The data preparation phase of the KDD process is responsible for data cleaning, integration, selection, and transformation [1], making the data suitable for a data mining algorithm. According to Pyle [2], data preparation consumes 60 to 90% of the time needed to mine data and contributes 75 to 90% to the mining project's success. Many tools can nowadays be used to perform this phase.
The most prevalent approach for describing not only the data preparation phase but the whole KDD process is the use of directed graphs. In this approach, the nodes of the graph represent the tasks to be performed, and the arrows represent how the data flows from one task to another. This approach is used in tools such as Weka [3], RapidMiner [4], and KNIME [5], among others. An example of a directed graph in the RapidMiner tool is presented in Fig. 1.

Usually, during the preparation phase it is not yet known which data mining algorithm best fits the data set. So, testing the data set with a broad range of data mining algorithms, and even with different parameters for a single algorithm, is extremely important. For example, in the 2008 KDD Cup, one of the winning teams tested the data set with AdaBoost, two variations of support vector machines, and a linear classifier trained using a genetic algorithm [6], and combined them all in a classifier committee. Considering data streams, even when one data mining algorithm is found to be the best choice for the data, as time goes by and the patterns represented by the data change (a situation known as concept drift), the algorithm that best fits the data may change as well. Using a scheme that eases the training and testing of diverse data mining algorithms can reduce the time and effort spent during this phase.

DMPML (Data Mining Preparation Markup Language) is a framework that stores directives (such as which values are outliers, missing, or similar, and how to treat them) and the codification of the data in an XML document. To obtain the correct codification for a specific algorithm, the codification types of the variables are passed to an XSLT [7] file that creates a document with the data ready to be processed by that data mining algorithm. This approach makes it unnecessary to re-execute all the data preparation tasks every time it is necessary to train or test with a data mining algorithm.
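To make the DMPML idea concrete, the sketch below shows one XML document holding several codifications of the same variable, with the codification matching a given algorithm retrieved on demand. The paper does not give the DPDM schema, so the element and attribute names here are invented for illustration, and a plain tree lookup stands in for the XSLT transformation the real framework uses.

```python
# Illustrative sketch only: the real DPDM schema is not published in this
# paper, so element names below are assumptions, and the lookup replaces
# the XSLT step used by DMPML.
import xml.etree.ElementTree as ET

DPDM_EXAMPLE = """
<dpdm>
  <variable name="temperature">
    <codification algorithm="ann" type="normalized">0.42</codification>
    <codification algorithm="id3" type="discretized">bin-4</codification>
  </variable>
</dpdm>
"""

def codification_for(xml_text, variable, algorithm):
    """Return the stored codification of `variable` for `algorithm`."""
    root = ET.fromstring(xml_text)
    for var in root.iter("variable"):
        if var.get("name") == variable:
            for cod in var.iter("codification"):
                if cod.get("algorithm") == algorithm:
                    return cod.text
    return None

# The same prepared variable is available in the form each algorithm needs,
# without re-running any preparation task:
print(codification_for(DPDM_EXAMPLE, "temperature", "ann"))   # 0.42
print(codification_for(DPDM_EXAMPLE, "temperature", "id3"))   # bin-4
```

The design point this illustrates is that preparation is done once and stored per algorithm family, so switching algorithms only changes which stored codification is retrieved.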
The main objective of this paper is to compare two different approaches to performing the data preparation phase of the KDD process: the approach used by DMPML (in the form of the DMPML-TS application, as presented in [8]) and the directed graph approach. Thus, we compared DMPML-TS to three data mining tools that implement the directed graph approach (Weka, RapidMiner, and KNIME). We measured the time spent creating the directed graphs and the codification file in DMPML for data mining algorithms that deal with numeric and categorical data, as well as the time spent executing the data preparation tasks.

The rest of this paper is organized as follows: Section II presents the tools, how they perform the preparation phase, and their advantages and disadvantages. Section III presents the data sets used in the tests, their characteristics, how they were obtained, why they were chosen, and the preparation tasks applied to them. Section IV presents the results of the tests and considerations about how the tools execute the data preparation. Finally, Section V presents our conclusions.

978-0-7695-4694-0/12 $26.00 © 2012 IEEE. DOI 10.1109/ICIS.2012.79

Fig. 1. Directed graph in RapidMiner.

II. TOOLS

In this section we present the tools used to perform the tests. The three tools that implement the directed graph approach were chosen because they are all recognized for their quality and widespread usage, and they are all open source software, which helps in obtaining and testing them.

The first tool tested was Weka [9]. It has been developed at the University of Waikato, New Zealand, and is composed of a set of data preparation tasks and data mining algorithms used to prepare and analyze data. The version used in the tests was 3.6. Weka offers different graphical user interfaces, depending on the task to be performed by the user. The interface that provides the directed graph approach, and that was tested in this experiment, is the KnowledgeFlow. It offers the same operations available in the Explorer interface, in the form of nodes. If there are nodes that create a branch in the graph (with two or more outgoing arrows), Weka does not allow the user to choose which of the paths should be performed; it executes the whole graph, path by path, sequentially. At present, Weka natively supports multi-threaded execution of only some tasks; for example, it processes individual cross-validation folds in parallel, in separate threads.

The second tool used in the tests was KNIME [10], developed at the University of Konstanz, Germany. The version used in these tests was 2.3.4. KNIME represents graphs using a folder for each operator. Each folder contains information about the operator, such as its parameters, in XML files, and the result of the operator's execution in a binary file. Thus, if some task in the graph contains an error, it is not necessary to re-execute the previous tasks, as their results are stored; it suffices to correct the task or its parameters and resume the execution. KNIME, unlike Weka, allows the user to select which nodes (and, consequently, which paths) should be executed. Another possibility is to select a task and have it execute all its preceding tasks.
If the execution of the whole directed graph is necessary, KNIME supports parallel execution of branches on multi-core systems.

The third tool used was RapidMiner [4]. It was initially developed at the University of Dortmund, Germany, and is currently maintained by the Rapid-I company, which offers two versions of the software: the community version, which is free, and the enterprise version, which is paid. The version used in the tests presented here was the community edition 5.1. RapidMiner represents the graph using an operator tree, based on XML, similarly to Weka, but the two formats are incompatible. RapidMiner offers extensions that add new features to the tool, such as the Weka extension, which gives access to about 100 additional modeling schemes, and the Parallel Processing extension, which adds new versions of many operators that can execute in parallel on a multi-core machine.

The three tools presented above operate in a similar way. All of them support the directed graph approach, as presented in the previous section. The advantage of this approach is that the user can specify the tasks to be performed and their order by simply dragging and dropping the tasks and connecting them graphically. The user interaction is simple and direct, and the user can rapidly start using the tool without much training. On the other hand, this approach has some points for improvement:
- If it is necessary to train and/or test the processed data with different data mining algorithms, or with one data mining algorithm under different parameters, the time spent creating the graph increases, and the graph grows in complexity as well.
- If the data set changes slightly, it is necessary to re-execute all the tasks to obtain the new processed data.

One approach to reducing the time spent creating the data preparation phase, and to simplifying it, is DMPML [8].
The DMPML framework describes how to store the codifications of the data for several data mining algorithms so that it is not necessary to re-execute all the data preparation tasks over and over. It works in the following manner: based on the raw data (in XML format), a program asks the user to inform the characteristics of the attributes, such as outliers and missing values, how to treat them, and what kinds of codification should be generated for the variables, among other options. Based on this information, a file is created containing the codifications to be used, possibly for multiple data mining algorithms, named DPDM (Data Processing for Data Mining). To obtain the data prepared for a specific algorithm, an XSLT file is executed, which parses the raw and the codified data to obtain the correct codification. The next section presents the data sets, how they were chosen, and the data preparation tasks applied to them.

III. DATA SETS

The chosen data sets are presented in Table I, including the number of instances, the number of attributes, and the attribute types. They are all part of the UCI Machine Learning Repository [11], a site that hosts many data sets used in research and provides many filters to select the appropriate data set. Classification was chosen as the default task because it is the most common task in the repository and all the tools can handle it. The data type chosen was multivariate, as we intended to test data sets with multiple attributes. We chose data sets with more than 1,000 instances and the matrix format type.

TABLE I
INFORMATION ABOUT THE DATA SETS

  Data set            Instances   Attributes   Attribute types
  Mushroom                8,124           22   Categorical
  Adult                  32,561           14   Categorical, Integer
  Statlog (Shuttle)      58,000            9   Integer
  Poker Hand          1,025,010           11   Categorical, Integer

At this point, four data sets had been chosen: one with categorical attributes, one with numerical attributes, and two with mixed categorical and numerical attributes, making it possible to test the tools with different tasks and different attribute types. These are the attribute types most common in the repository and most used by data mining algorithms. The tests were performed on a computer running Linux Ubuntu 10.04 64 bits, with an Intel Core i3 330M processor and 3.2 GB of main memory available to each application.
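These data sets are distributed by UCI as CSV files, and the paper's first preparation step converts them to XRFF, Weka's XML data format. A minimal conversion sketch follows; the element layout below is a simplified approximation of XRFF, not a schema-exact writer.

```python
# Hedged sketch: convert a UCI-style CSV data set into an XRFF-like XML
# document. The dataset/header/body structure approximates Weka's XRFF
# format but omits details (class attribute marking, nominal value lists).
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xrff(csv_text, relation_name, attribute_types):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    dataset = ET.Element("dataset", name=relation_name)
    attrs = ET.SubElement(ET.SubElement(dataset, "header"), "attributes")
    for name, atype in zip(header, attribute_types):
        ET.SubElement(attrs, "attribute", name=name, type=atype)

    instances = ET.SubElement(ET.SubElement(dataset, "body"), "instances")
    for row in data:
        inst = ET.SubElement(instances, "instance")
        for value in row:
            ET.SubElement(inst, "value").text = value
    return ET.tostring(dataset, encoding="unicode")

# Two attributes and two instances, in the style of the Adult data set:
sample = "age,workclass\n39,State-gov\n50,Private\n"
xml_out = csv_to_xrff(sample, "adult", ["numeric", "nominal"])
```

The value of the XML representation for the authors' pipeline is that the same document can then be fed to stream parsers and XSLT processors, as discussed in Section IV.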
Regarding the tasks performed on the data, our first activity was to convert the data from the format used in the UCI repository, which is CSV, to the XRFF format. To test the time spent creating and executing the graphs, it was necessary to define a group of tasks to apply to the data sets, considering the attribute types and the data mining algorithms to be used. The tools do not offer the same set of tasks, and even when they offer the same task, the algorithms and/or parameters used are not necessarily the same. So, it was necessary to analyze the tasks provided by the tools, define a group of tasks offered by them all, and set these to the same algorithms and parameters, in order to have a standard base of comparison.

The process started by preparing data for an algorithm that deals with numerical data only, for example, an artificial neural network (ANN) [12]. The tasks used, and the data sets to which they were applied, are presented in Table II; the numbers in the remove attribute row are the indexes of the removed attributes. The normalization algorithm chosen was range normalization [1, p. 71] (also called min-max), with the values normalized to the [0,1] interval. Concerning the replace missing values task, missing categorical attributes were substituted by the mode value, and numerical attributes by the mean value. These were chosen because they are the only ones all the tools support.

TABLE II
TASKS APPLIED TO THE DATA SETS FOR AN ANN ALGORITHM

  Task                Adult             Mushroom   Poker Hand   Statlog
  Remove attribute    3, 5, 6, 13, 14   16         1, 2         1
  Normalization       X                            X            X
  Nominal to Binary   X                 X          X
  Replace missing     X                 X
  Discretization      X                            X            X

After measuring the time spent creating and executing the graphs for a numerical-only data mining algorithm, we started modifying the graphs so that they could prepare data for a categorical-only algorithm, for example, the ID3 decision tree algorithm [13].
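The shared preparation tasks just described can be sketched as follows. These are minimal stand-ins for what the tools implement internally (min-max normalization to [0,1], mode/mean imputation of missing values, and equal-interval binning into 10 bins), not the tools' actual code.

```python
# Minimal sketches of the three shared preparation tasks, as described in
# the text; edge-case handling (e.g. constant attributes) is an assumption.
from statistics import mean, mode

def min_max_normalize(values):
    """Range (min-max) normalization to the [0,1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def replace_missing(values, numeric, missing="?"):
    """Replace missing values by the mean (numeric) or mode (categorical)."""
    present = [v for v in values if v != missing]
    fill = mean(present) if numeric else mode(present)
    return [fill if v == missing else v for v in values]

def equal_interval_bins(values, k=10):
    """Equal-interval (equal-width) binning: bin index 0..k-1 over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    return [min(int((v - lo) / width), k - 1) for v in values]

print(min_max_normalize([10, 20, 30]))                       # [0.0, 0.5, 1.0]
print(replace_missing(["a", "?", "a", "b"], numeric=False))  # ['a', 'a', 'a', 'b']
print(equal_interval_bins([0, 5, 10], k=10))                 # [0, 5, 9]
```

Nominal-to-binary encoding and attribute removal, the other tasks in Table II, are straightforward column operations and are omitted here.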
The remove attributes and replace missing values tasks were reused, and a discretization task was used to convert numerical values to categorical ones. The discretization algorithm chosen was equal-interval binning [14, pp. 297-298], with 10 bins. An example of the directed graph created in RapidMiner to prepare data for the numerical and categorical mining algorithms can be seen in Fig. 1.

Our objective with the proposed tasks was not to verify whether they were appropriate for a specific data mining algorithm, for example, whether the accuracy of the algorithm increased with the usage of these data preparation tasks. Rather, the objective was to measure how much time the user spent creating the data graph, selecting its parameters, and executing the tasks. The time spent creating the graphs assumes that the user knows in advance which tasks to use, so only the time the user interacts with the tools is measured. Computing the time the user takes to identify which tasks to perform is much more complex and was not measured in this experiment.

IV. RESULTS

Table III presents the time spent by each tool to perform the data preparation tasks. Each activity was performed ten times, and the average of each value is presented in the table. Column 1 presents the time spent creating the data preparation tasks for a numerical data mining algorithm; for the first three tools, this corresponds to creating the directed graph and fine-tuning its parameters, and for DMPML-TS, to setting its parameters in the user interface. Column 2 presents the time spent executing this graph. Column 3 presents the time spent changing the graph to prepare data for a categorical data mining algorithm. Column 4 presents the time spent executing the new graph.

TABLE III
AVERAGE TIME SPENT RUNNING THE PREPARATION TASKS (IN SECONDS)

  Weka                  Col 1    Col 2    Col 3    Col 4
    Adult              108.69    16.20    34.89    21.00
    Mushroom            98.47     6.20    36.71     7.70
    Poker Hand              -        -        -        -
    Statlog (Shuttle)  130.45    10.00    49.94    17.90
  RapidMiner
    Adult              104.32     9.10    36.70    14.90
    Mushroom            71.10     4.10    19.83     4.00
    Poker Hand              -        -        -        -
    Statlog (Shuttle)   60.05     9.00    33.32    23.10
  KNIME
    Adult              115.72    16.02    39.05    18.39
    Mushroom           111.52     7.59    25.55     9.03
    Poker Hand              -        -        -        -
    Statlog (Shuttle)   84.87     7.78    49.06    11.65
  DMPML-TS
    Adult               41.76    45.48     6.52    67.52
    Mushroom            30.52     6.93     6.36    20.88
    Poker Hand          64.09   342.40     6.54   426.66
    Statlog (Shuttle)   17.19   183.97     6.48   184.05

As can be observed in column 1, the times are considerably higher for the tools that use the directed graph approach than for the DMPML approach. One reason for this is that, when a data set is selected, those tools try to load it completely into memory. The larger the data set, the more time is spent waiting for the tool to load it before the user can continue selecting the other tasks.

Loading the entire data set into memory has another problem: it limits the amount of data the tools can handle. Analyzing Table III, the only tool that could perform the data preparation tasks for the Poker Hand data set (more than 1 million instances) was DMPML-TS; none of the tools that use the directed graph approach could execute all the data preparation tasks on the entire data set in main memory. DMPML-TS does not suffer from this problem for two reasons. First, when the user selects the input file, it is loaded in a different thread.
So, while the program loads the file, the user can start setting the parameters of the data preparation tasks. Second, DMPML-TS parses the XRFF document using SAX (Simple API for XML) to create the DPDM file. The SAX parser walks through the XML input file and raises events when it encounters starting and ending elements, comments, element content, and so on. Thus, by using SAX, it is not necessary to load the entire data set into memory, which allows the manipulation of huge data sets while using less main memory during data preparation.

Another reason why the tools that implement the directed graph approach spend more time creating the data preparation tasks, compared to DMPML-TS, is the amount of user interaction they require. To perform the data preparation phase, the user needs to: (a) identify where the tasks to be performed are, (b) select them in the user interface, (c) click on the specified area of the application, (d) select their proper settings, (e) connect them using edges (which demands precision, because the connection points can be very small), and finally (f) execute the whole graph.

Executing the graphs, on the other hand, an activity for which the computer is responsible, was comparatively fast in the directed graph approach. Comparing the second column to the first in Table III, the time spent executing the graphs is much smaller than the time spent creating them, varying from 5% to 13% of the total time spent creating and executing the graphs. Thus, the smaller the data set, the faster it executes, and the time spent creating the graph becomes the bottleneck of the process. In DMPML-TS it is exactly the opposite: the time spent transforming the data is much higher than the time spent setting up the preparation tasks, varying from 18% to 91% of the total time. The bottleneck of this approach is clearly the efficiency of the XSL processor when dealing with big XML files.
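The event-driven SAX parsing described above can be sketched with Python's standard xml.sax module. The handler below is a simplified illustration, not DMPML-TS code; element names mimic the XRFF body, and instances could be processed and discarded one at a time, which is what keeps memory usage independent of data set size.

```python
# Sketch of event-driven (SAX) parsing of an XRFF-like body: the parser
# raises events per element, so the whole data set never sits in memory.
import xml.sax

class InstanceCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.instances = 0
        self.values = []
        self._text = []
        self._in_value = False

    def startElement(self, name, attrs):
        if name == "value":
            self._in_value = True
            self._text = []

    def characters(self, content):
        if self._in_value:
            self._text.append(content)

    def endElement(self, name):
        if name == "value":
            self.values.append("".join(self._text))
            self._in_value = False
        elif name == "instance":
            # One complete instance has been seen; a real pipeline would
            # codify and emit it here, then let it be garbage-collected.
            self.instances += 1

doc = "<instances><instance><value>1</value><value>a</value></instance></instances>"
handler = InstanceCounter()
xml.sax.parseString(doc.encode(), handler)
print(handler.instances)  # 1
print(handler.values)     # ['1', 'a']
```

In contrast, a DOM-style parser would materialize the full tree before any instance could be processed, which is exactly the limitation the text attributes to the directed graph tools.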
After the execution of the data preparation tasks for the numerical algorithm, the preparation for the categorical algorithm was performed by reusing the previous graph and adding the discretization task. Weka and RapidMiner offer no support for selecting which tasks should be executed, so the whole directed graph must be executed even if only one of its branches is needed. This solution was used, instead of creating a new graph, because we wanted to reduce the time the user spends manipulating the tool. As can be seen in column 3, DMPML-TS demands virtually no user interaction, because the previously generated DPDM file already contains the codifications for the categorical data mining algorithm: it is only necessary to inform the new codifications to be used and the path of the destination file. One characteristic of DMPML-TS that helps reduce the time spent selecting the correct codifications for the variable types is that the values selected by the user are stored in a configuration file, so the user never needs to inform them again.

A. Results analysis

Comparing the times spent preparing data for the numerical algorithm in the DMPML and directed graph approaches, DMPML-TS performed at least 30% better than the other tools. If we also consider the time spent preparing data for the categorical algorithm, the advantage rises to at least 40%. This emphasizes that the reduction in the time spent creating the data preparation tasks is one of the advantages of using DMPML-TS.

Fig. 2 presents a Student's t test with a 95% confidence interval, based on the time spent manipulating the tools for the numerical and categorical data mining algorithms. The time spent by DMPML-TS is statistically lower than that of the other tools, allowing us to state that DMPML-TS performs better in tool manipulation than the directed graph approach.

Fig. 2. Comparative performance on interaction with 95% confidence.

On the other hand, DMPML-TS currently requires more time to generate the output file when dealing with big data sets: the efficiency of the XSL processor directly impacts the performance of DMPML. Despite this disadvantage, as XSL is an open standard, there are many processors available, and it is possible to test which one performs best without needing to recreate the data preparation tasks. For example, the Saxon processor [15] provides a scheme whereby it pre-compiles the XSL file, generating a file with the XPath commands previously resolved, and uses this pre-compiled file to process the XRFF file. However, it was not used in this experiment because this feature is only available in the paid version.

An attempt to minimize the impact of the transformations was made by using parallelism. DMPML-TS performs the transformations in different threads: while it is executing the transformation for the numerical algorithm, the user can inform the codifications for the categorical data mining algorithm and execute it. The program then runs both transformations simultaneously, reducing the negative impact of the XSL processor.

In summary, DMPML-TS requires less user interaction, while the directed graph approach is faster at executing the graphs. Which is more important?
We consider reducing the time spent by the human user more important, because the user's time is much more expensive than the time the computer spends executing the data preparation tasks. As soon as the user finishes creating the graph and the computer starts executing the tasks, he/she can simply switch to another activity. Another point is that it is comparatively simple to perform the task on another computer, with a faster processor or more main memory, or even to use another program that executes the tasks faster; changing the user is much more difficult. Besides that, with the increasing performance of computers, the impact of the transformations in the DMPML approach tends to be reduced.

Another point of comparison is the whole process of preparing data for both the numerical and the categorical algorithms, shown in Table IV. The times presented for the first three tools are the time needed to create the graphs for the numerical and categorical algorithms plus the time needed to execute them both. In DMPML-TS, each transformation was executed in a different thread, simultaneously.

TABLE IV
TOTAL TIME NEEDED BY THE TOOLS TO PERFORM THE DATA PREPARATION TASKS (IN SECONDS)

  Data set     Weka     RapidMiner   KNIME    DMPML-TS
  Adult        164.58       155.91   173.16     128.44
  Mushroom     142.88        94.92   146.11      66.50
  Poker Hand        -            -        -     523.04
  Statlog      198.29       116.47   145.58     243.98

It can be seen that, despite the negative impact of the XSL processor's efficiency, DMPML-TS had the best performance on the categorical and mixed data sets. This can be explained by two reasons. First, the transformations are executed in parallel, considerably reducing the time needed to perform both of them. Considering the Statlog data set, executing the transformations sequentially took 391.69 seconds in total, while executing them in parallel took 243.98 seconds, a decrease of approximately 37%.
The other reason to explain the better efficiency of DMPML is the form it deals with attributes. Attributes constituted of a limited set of values, make their representation in the DPDM file small. So, the XSL processor does not need to parse big chunks of data to encounter its codification nor to verify if it is an outlier or missing value, to identify to what value it should be converted, etc. Thus, DMPML deals nicely with data sets with these characteristics (usually categorical attributes). Therefore, in general, DMPML has a better performance on categorical over numerical data sets. We can confirm this statement analyzing the Statlog data set, which is constituted only of numerical variables. This was the data set where DMPML had its worse overall performance. But, even in this situation, the user needed to spend approximately 66% less time manipulating the tool compared to the best directed graph tool. Fig. 3 presents the confidence interval of the tools considering the whole data preparation process, as presented in Table IV. Based on the figure, we can state, with confidence interval of 95%, that DMPML-TS had a better performance 184 177

in the categorical and mixed data sets. In the numerical data set, RapidMiner had statistically the best performance.

Fig. 3. Overall performance test with 95% confidence.

V. CONCLUSION

Data preparation is one of the most effort- and time-consuming phases of the KDD process. In this phase, data is cleaned, integrated, selected, and transformed to be served to a data mining algorithm. As there is no way to know in advance which data mining algorithm best fits a specific data set, it is important to test the data set with many different algorithms and many different parameters.

This paper presented a comparison between three tools that implement the directed graph approach (Weka, KNIME, and RapidMiner) and DMPML-TS. It was possible to observe that the tools that use the directed graph approach make the user spend more time creating the directed graphs than executing them; the smaller the data set, the more time is spent creating the data preparation tasks relative to the time spent executing them. The DMPML approach, on the other hand, does not demand much user interaction but spends more time processing the XML documents. Since reducing the time the user spends manipulating the tools should be the priority, because it is much easier to upgrade the computer or use another one with a faster processor and/or more main memory, the DMPML approach is very interesting and promising.

DMPML represents a good solution to store the data codification for different data mining algorithms, allowing the user to generate the data for different algorithms in parallel, making him/her spend less time dealing with the tools, and simplifying the execution of different algorithms in search of the best one for a given data set. With the usage of DMPML it was possible to significantly reduce the time spent creating the directed graphs: the savings exceeded 47% of the time needed to create the data preparation tasks.
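The 95% confidence intervals behind this comparison (Fig. 3) can be computed as in the following sketch. The run times used here are hypothetical placeholders, since the raw measurements behind Fig. 3 are not reproduced in the text, and the interval uses the simple normal approximation (z = 1.96) rather than whatever exact procedure the paper employed.

```python
import math
import statistics

def mean_ci95(samples):
    """Return (mean, half-width) of a 95% confidence interval for the
    mean of the samples, using the normal approximation (z = 1.96)."""
    m = statistics.mean(samples)
    s = statistics.stdev(samples)
    half = 1.96 * s / math.sqrt(len(samples))
    return m, half

# Hypothetical repeated run times (seconds) of one tool on one data set.
runs = [128.4, 130.1, 127.8, 129.5, 128.9]
m, h = mean_ci95(runs)
print(f"mean = {m:.2f} s, 95% CI = [{m - h:.2f}, {m + h:.2f}]")
```

Two tools can then be called statistically different at the 95% level when their intervals do not overlap, which is how statements such as "RapidMiner had statistically the best performance on the numerical data set" can be read.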
For the Mushroom data set, the DMPML approach was approximately 30% faster than the best tool implementing the directed graph approach, considering the time spent preparing and executing the data preparation phase for two different algorithms. It is important to notice that, as the number of tasks increases, more time is spent creating the graph, and the advantages of using DMPML become more noticeable. If it is necessary to test a data set with different data mining algorithms, DMPML also offers a good solution.