On the Use of Data Mining Tools for Data Preparation in Classification Problems


2012 IEEE/ACIS 11th International Conference on Computer and Information Science

Paulo M. Gonçalves Jr., Roberto S. M. Barros and Davi C. L. Vieira
Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, Recife, Brasil
Instituto Federal de Pernambuco, Cidade Universitária, Recife, Brasil

Abstract: The data preparation phase is a critical step in the KDD (Knowledge Discovery in Databases) process. This phase is crucial for a good data mining result because, if data is not correctly prepared, all the following phases of the process are compromised. DMPML is a framework that stores preprocessed data for different data mining algorithms in an XML document and retrieves the correct codification, through an XSLT document, according to the needs of the data mining algorithm. This paper presents a comparison between DMPML and three data mining applications (Weka, RapidMiner, and KNIME) that implement the directed graph approach, concerning the time spent to create and execute the data preparation tasks for two data mining algorithms. The tests were executed using different types of data sets: numerical, categorical, and mixed. We observed that the scheme used by DMPML can simplify the usage of different data mining algorithms and significantly reduce the time spent creating the data preparation tasks.

Index Terms: Data preparation, DMPML, XML, Tools comparison

I. INTRODUCTION

The data preparation phase of the KDD process is responsible for data cleaning, integration, selection, and transformation [1], making the data suitable for a data mining algorithm. According to Pyle [2], data preparation consumes 60 to 90% of the time needed to mine data and contributes 75 to 90% to the mining project's success. Many tools can be used nowadays to perform this phase. The most prevalent approach for describing not only the data preparation phase but the whole KDD process is the use of directed graphs. In this approach, the nodes of the graph represent the tasks to be performed, and the arrows represent how the data flows from one task to another. This approach is used in tools such as Weka [3], RapidMiner [4], and KNIME [5], among others. One example of a directed graph in the RapidMiner tool is presented in Fig. 1.

Usually, in the preparation phase it is not known in advance which data mining algorithm best fits the data set. So, testing the data set with a broad range of data mining algorithms, and even with different parameters on a single algorithm, is extremely important. For example, in the 2008 KDD Cup, one of the competition's winners tested the data set with AdaBoost, two variations of support vector machines, and a linear classifier trained using a genetic algorithm [6], and combined them all in a classifier committee. Considering data streams, even when one data mining algorithm is found to be the best choice for the data, as time goes by and the patterns represented by the data change (a situation known as concept drift), the best algorithm for the data may change as well. Using a scheme that eases the training and testing of diverse data mining algorithms can reduce the time and effort spent during this phase.

DMPML (Data Mining Preparation Markup Language) is a framework that stores directives (such as outliers, missing and similar values, and how to treat them) and the codification of the data in an XML document.
To obtain the correct codification for a specific algorithm, the types of codification of the variables are passed to an XSLT [7] file that creates a document with the data ready to be processed by that data mining algorithm. With this approach, it becomes unnecessary to re-execute all the data preparation tasks every time it is necessary to train or test with a data mining algorithm.

The main objective of this paper is to compare two different approaches to performing the data preparation phase of the KDD process: the approach used by DMPML (in the form of the DMPML-TS application, as presented in [8]) and the directed graph approach. Thus, we compared DMPML-TS to three data mining tools that implement the directed graph approach (Weka, RapidMiner, and KNIME). We observed the time spent creating the directed graphs and the codification file in DMPML for data mining algorithms that deal with numeric and categorical data, as well as the time spent executing the data preparation tasks.

The rest of this paper is organized as follows: Section II presents the tools, how they perform the preparation phase, and their advantages and disadvantages. Section III presents the data sets used in the tests, their characteristics, how they were obtained, why they were chosen, and the preparation tasks applied to them. Section IV presents the results of the tests and considerations about how the tools execute the data preparation. Finally, Section V presents our conclusions.

II. TOOLS

In this section we present the tools used to perform the tests. The three tools that implement the directed graph approach were chosen because they are all recognized for their quality and widespread usage and they are all open source software, which helps in obtaining and testing the software.

Fig. 1. Directed graph in RapidMiner.

The first tool tested was Weka [9]. It has been developed at the University of Waikato, New Zealand, and consists of a set of data preparation tasks and data mining algorithms used to prepare and analyze data. The version used in the tests was 3.6. Weka offers different graphical user interfaces, depending on the task to be performed by the user. The interface that provides the directed graph approach, and that was tested in this experiment, is the KnowledgeFlow. It offers the same operations available in the Explorer interface, in the form of nodes. If there are nodes that create a branch in the graph (with two or more outgoing arrows), Weka does not allow the user to choose which of the paths should be performed, executing the whole graph, path by path, sequentially. At the moment, Weka natively supports multi-threaded execution of only some tasks; for example, it processes individual cross-validation folds in parallel, in separate threads.

The second tool used in the tests was KNIME [10], developed at the University of Konstanz, Germany. The representation of the graphs uses a folder for each operator. Each folder contains information about the operator, such as its parameters, in XML files, and the result of the execution of the operator in a binary file. Thus, if some task in the graph contains an error, it is not necessary to re-execute the previous tasks, as their results are stored; it is only necessary to correct the task or its parameters and resume the execution. KNIME, differently from Weka, allows the user to select which nodes (and, consequently, which paths) should be executed. Another possibility is to select a task and have it execute all of its preceding tasks. If the execution of the whole directed graph is necessary, KNIME supports parallel execution of branches on multi-core systems.

The third tool used was RapidMiner [4]. It was initially developed at the University of Dortmund, Germany, and is currently maintained by the Rapid-I company. It offers different versions of the software, named community and enterprise; the community version is free and the enterprise version is paid. The version used in the tests presented here was the community edition 5.1. It represents the graph using an operator tree, based on XML, similarly to Weka, but the two formats are incompatible. RapidMiner offers extensions which add new features to the tool, such as the Weka extension, which gives access to about 100 additional modeling schemes, and the Parallel Processing extension, which adds new versions of many operators that can execute in parallel on a multi-core machine.

The three tools presented above operate in a similar way. All of them support the directed graph approach, as presented in the previous section. The advantage of this approach is that the user can specify the tasks to be performed and their order by simply dragging and dropping the tasks and connecting them graphically. The user interaction with this approach is simple and direct, and the user can rapidly start using the tool without much training.
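The same kind of preparation pipeline can also be scripted directly against the Weka API, outside the graphical interfaces. The sketch below is only illustrative (the input file name is hypothetical and it is not how the KnowledgeFlow is used); it chains three typical preparation filters: replacing missing values, converting nominal attributes to binary ones, and min-max normalization.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreparationPipeline {
    public static void main(String[] args) throws Exception {
        // Hypothetical input file; any ARFF/XRFF/CSV source supported by Weka works here.
        Instances data = DataSource.read("adult.arff");

        // Replace missing values (mean for numeric attributes, mode for nominal ones).
        ReplaceMissingValues replace = new ReplaceMissingValues();
        replace.setInputFormat(data);
        data = Filter.useFilter(data, replace);

        // Convert nominal attributes to binary indicator attributes.
        NominalToBinary toBinary = new NominalToBinary();
        toBinary.setInputFormat(data);
        data = Filter.useFilter(data, toBinary);

        // Min-max normalization of numeric attributes to the [0, 1] interval.
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        data = Filter.useFilter(data, normalize);

        System.out.println("Prepared " + data.numInstances() + " instances.");
    }
}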
On the other hand, this approach offers some points for improvement. If it is necessary to train and/or test the processed data with different data mining algorithms, or with one data mining algorithm under different parameters, the time spent creating the graph increases and the graph grows in complexity as well. Moreover, if the data set changes slightly, it is necessary to re-execute all the tasks to obtain the new processed data.

One approach to reducing the time spent on the data preparation phase and simplifying it is DMPML [8]. The DMPML framework describes how to store the codifications of the data for several data mining algorithms, so that it is not necessary to re-execute all the data preparation tasks over and over. It works in the following manner: based on the raw data (in XML format), a program asks the user to inform the characteristics of the attributes, such as outliers and missing values, how to treat them, and what kinds of codification should be generated for the variables, among other options. Based on this information, a file is created containing the codifications to be used, possibly for multiple data mining algorithms, named DPDM (Data Processing for Data Mining). To obtain the data prepared for a specific algorithm, an XSLT file is executed, which parses the raw and the codified data to obtain the correct codification.
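To make the last step concrete, the following sketch shows one way such a transformation could be invoked from Java using the standard javax.xml.transform API. The file names and the stylesheet parameter are hypothetical; the actual structure of the DPDM and XSLT documents is defined by DMPML [8], and this is not the DMPML-TS implementation itself.

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GenerateCodification {
    public static void main(String[] args) throws Exception {
        // Hypothetical files: the raw data in XRFF, the DPDM stylesheet,
        // and the output ready for a specific data mining algorithm.
        File rawData    = new File("adult.xrff");
        File stylesheet = new File("dpdm-to-numeric.xsl");
        File prepared   = new File("adult-numeric.xrff");

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));
        // A parameter could select which codification each variable should receive.
        transformer.setParameter("codification", "ann");
        transformer.transform(new StreamSource(rawData), new StreamResult(prepared));
    }
}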

The next section presents the data sets, how they were chosen, and the data preparation tasks applied to them.

III. DATA SETS

The chosen data sets are presented in Table I, including information such as the number of instances, the number of attributes, and the types of the attributes. They are all part of the UCI Machine Learning Repository [11], a site that hosts many data sets used in research and provides many filters to select the appropriate data set. Concerning the default task, classification was chosen because it is the most common task in the repository and all the tools can handle it. The data type chosen was multivariate, as the intention was to test data sets with multiple attributes. We chose data sets with more than 1,000 instances and format type matrix.

TABLE I
INFORMATION ABOUT THE DATA SETS.

Data set            Instances   Attributes   Attribute types
Mushroom            8,124       22           Categorical
Adult               32,561      14           Categorical, Integer
Statlog (Shuttle)   58,000      9            Integer
Poker Hand          1,025,010   10           Categorical, Integer

At this point, four data sets had been chosen: one with categorical attributes, one with numerical attributes, and two with mixed categorical and numerical attributes, making it possible to test the tools with different tasks and different attribute types. These are the attribute types most common in the repository and most used by the data mining algorithms. The tests were performed on a computer running Linux Ubuntu, with an Intel Core i3 330M processor and 3.2 GB of main memory available to each application.

Regarding the tasks performed on the data, our first activity was to convert the data from the format used in the UCI repository, which is CSV, to the XRFF format. To test the time spent creating and executing the graphs, it was necessary to define a group of tasks to apply to the data sets, considering the attribute types and the data mining algorithms to be used. The tools do not offer the same set of tasks and, even when they offer the same task, the algorithms and/or parameters used are not necessarily the same. So, it was necessary to analyze the tasks provided by the tools, define a group of tasks offered by all of them, and set them to the same algorithms and parameters, in order to have a standard basis of comparison.

The process started by preparing data for an algorithm that deals with numerical data only; for example, an artificial neural network (ANN) [12]. The tasks applied to each data set are presented in Table II; the numbers in the remove attribute row are the indexes of the removed attributes. The normalization algorithm chosen was range normalization [1, p. 71] (also called min-max), with values normalized to the [0, 1] interval. Concerning the replace missing values task, missing categorical attributes were substituted by the mode value, and numerical attributes by the mean value. These strategies were chosen because they are the only ones all the tools support.

TABLE II
TASKS APPLIED TO THE DATA SETS FOR AN ANN ALGORITHM.

Task                Adult             Mushroom   Poker Hand   Statlog
Remove attribute    3, 5, 6, 13, 14   16         1, 2         1
Normalization       X                            X            X
Nominal to Binary   X                 X          X
Replace missing     X                 X
Discretization      X                            X            X

After measuring the time spent creating and executing the graphs for a numerical-only data mining algorithm, we started modifying the graph so that it could prepare data for a categorical-only algorithm; for example, the ID3 decision tree algorithm [13]. The remove attributes and replace missing values tasks were reused, and a discretization task was added to convert numerical values into categorical ones. The discretization algorithm chosen was equal-interval binning [14], with size 10. An example of the directed graph created to prepare data for the numerical and categorical mining algorithms in RapidMiner can be seen in Fig. 1.

Our objective with the proposed tasks was not to verify whether they were appropriate for a specific data mining algorithm, for example, whether the accuracy of the algorithm increased with the usage of these data preparation tasks. Rather, the objective was to measure how much time the user spent creating the data graph, selecting its parameters, and executing the tasks. The time spent creating the graphs assumes that the user knows in advance which tasks to use, so only the time the user interacts with the tools is measured. Computing the time the user takes to identify what tasks to perform is much more complex and was not measured in this experiment.
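For reference, the normalization and discretization tasks used above are standard transformations. The helpers below are only a minimal sketch of min-max normalization and equal-interval binning, not the implementation any of the tools actually ships with; the values in the usage example are made up.

public final class SimpleTransforms {

    // Min-max normalization: maps a value from [min, max] into [0, 1].
    static double normalize(double value, double min, double max) {
        if (max == min) {
            return 0.0; // degenerate attribute: all values identical
        }
        return (value - min) / (max - min);
    }

    // Equal-interval binning: returns the index (0..bins-1) of the bin containing value.
    static int discretize(double value, double min, double max, int bins) {
        if (max == min) {
            return 0;
        }
        int bin = (int) ((value - min) / (max - min) * bins);
        return Math.min(bin, bins - 1); // value == max falls into the last bin
    }

    public static void main(String[] args) {
        System.out.println(normalize(7.5, 0.0, 10.0));       // 0.75
        System.out.println(discretize(7.5, 0.0, 10.0, 10));  // bin 7
    }
}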

IV. RESULTS

Table III presents the time spent, by each tool, to perform the data preparation tasks. Each activity was performed ten times and the average of each value is presented in the table. Column 1 presents the time spent creating the data preparation tasks for a numerical data mining algorithm; for the first three tools, this corresponds to creating the directed graph and fine-tuning its parameters, while for DMPML-TS it corresponds to setting its parameters in the user interface. Column 2 presents the time spent executing this graph. Column 3 presents the time spent changing the graph to prepare data for a categorical data mining algorithm. Column 4 presents the time spent executing the new graph.

TABLE III
AVERAGE TIME SPENT RUNNING THE PREPARATION TASKS (IN SECONDS).
(One row per tool and data set: Weka, RapidMiner, KNIME, and DMPML-TS, each on the Adult, Mushroom, Poker Hand, and Statlog (Shuttle) data sets; the numeric values were not preserved in this transcription.)

As can be observed in column 1, the values are considerably higher for the tools that use the directed graph approach than for the DMPML approach. One reason for this is that, when a data set is selected, these tools try to load it completely into memory. The larger the data set, the more time is spent waiting for the tool to load it before the user can continue selecting the other tasks. Trying to load the entire data set into memory has a further problem: the tools are limited in the amount of data they can handle. Analyzing Table III, the only tool that could perform the data preparation tasks for the Poker Hand data set (with more than 1 million instances) was DMPML-TS; none of the tools that use the directed graph approach could execute all the data preparation tasks on the entire data set in main memory.

DMPML-TS does not suffer from this problem due to two characteristics. First, when the user selects the input file, it is loaded in a separate thread, so while the program loads the file the user can already start setting the parameters of the data preparation tasks. Second, DMPML-TS parses the XRFF document using SAX (Simple API for XML) to create the DPDM file. The SAX parser walks through the XML input file and raises events when it encounters starting and ending elements, comments, element content, etc. Thus, by using SAX it is not necessary to load the entire data set into memory, which allows the manipulation of huge data sets and uses less main memory during data preparation.

Another reason why the tools that implement the directed graph approach spend more time creating the data preparation tasks, compared to DMPML-TS, is the amount of user interaction they require. To perform the data preparation phase, the user needs to: (a) identify where the tasks to be performed are, (b) select them in the user interface, (c) click on the specified area in the application, (d) select their proper settings, (e) connect them using edges (which demands precision because the connection points can be very small), and finally (f) execute the whole graph.

Executing the graphs, on the other hand, an activity for which the computer is responsible, was comparatively faster in the directed graph approach. Comparing the second column to the first one in Table III, we can see that the time spent executing the graphs is much smaller than the time spent creating them, representing from 5% to 13% of the total time spent creating and executing the graphs. Thus, the smaller the data set, the faster it executes and, therefore, the time spent creating the graph becomes the bottleneck of the process. In DMPML-TS it is exactly the opposite: the time spent transforming the data is much higher than the time spent preparing it, ranging from 18% to 91% of the total time. This was clearly the bottleneck of this approach: the efficiency of the XSL processor when dealing with big XML files.
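As described above, DMPML-TS reads the XRFF input with SAX instead of building the whole document in memory. The sketch below illustrates that style of parsing only; it is not the DMPML-TS reader itself, the input file name is hypothetical, and the handler merely counts the <instance> elements that XRFF uses to wrap each example.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingXrffReader {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // The handler reacts to events; the document is never fully loaded into memory.
        DefaultHandler handler = new DefaultHandler() {
            private long instances = 0;

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                if ("instance".equals(qName)) {   // one event per example in the XRFF file
                    instances++;                  // a real reader would convert/emit values here
                }
            }

            @Override
            public void endDocument() {
                System.out.println("Read " + instances + " instances.");
            }
        };

        parser.parse(new File("pokerhand.xrff"), handler);  // hypothetical input file
    }
}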
After the execution of the data preparation tasks for the numerical algorithm, the preparation for the categorical algorithm was performed by reusing the previous graph and adding the discretization task. In Weka and RapidMiner, there is no support for selecting which tasks should be executed, so the whole directed graph must be executed even if only one of the branches of the graph is needed. This solution was used, instead of creating a new graph, because we wanted to reduce the time the user spent manipulating the tool. As can be seen in column 3, DMPML-TS demands virtually no user interaction because the previously generated DPDM file already contains the codifications for the categorical data mining algorithm; it is only necessary to inform the new codifications to be used and the path of the destination file to process the data. One characteristic of DMPML-TS that helps reduce the time spent selecting the correct codifications for the variable types is that the values selected by the user are stored in a configuration file, so the user never needs to inform these values again.

A. Results analysis

Comparing the times spent to prepare data for the numerical algorithm in the DMPML and directed graph approaches, DMPML-TS performed at least 30% better than the other tools. If we also consider the time spent to prepare data for the categorical algorithm, the advantage rises to at least 40%. This emphasizes the fact that the reduction in the time spent creating the data preparation tasks is one of the advantages of using DMPML-TS.

Fig. 2 presents a Student's t-test with a 95% confidence interval, based on the time spent manipulating the tools for the numerical and categorical data mining algorithms. It is possible to notice that the time spent by DMPML-TS is statistically lower than that of the other tools, allowing us to state that DMPML-TS performs better in tool manipulation than the directed graph approach.

Fig. 2. Comparative performance on interaction with 95% confidence.

On the other hand, at the moment DMPML-TS requires more time to generate the output file when dealing with big data sets. The efficiency of the XSL processor directly impacts the DMPML performance. Despite this current disadvantage, as XSL is an open standard there are many processors available, and it is possible to try different processors and test which performs best without needing to recreate the data preparation tasks. For example, the Saxon processor [15] provides a scheme where it pre-compiles the XSL file, generating a file with the XPath expressions already resolved, and uses this pre-compiled file to process the XRFF file. However, this feature was not used in this experiment because it is only available in the paid version.

An attempt to minimize the impact of the transformations was made by using parallelism. DMPML-TS performs the transformations in different threads, so while it is executing the transformation for the numerical algorithm, the user can inform the codifications for the categorical data mining algorithm and execute it. The program then executes both transformations simultaneously, reducing the negative impact of the XSL processor.

So, we argue that DMPML-TS requires less user interaction, while the directed graph approach is faster at executing the graphs. Which is more important? We consider reducing the time spent by the human user more important, because the time of the user is much more expensive than the time spent by the computer to execute the data preparation tasks. As soon as the user finishes creating the graph and the computer starts executing the data preparation tasks, he/she can simply switch to another activity. Another point to be noted is that it is comparatively simpler to use another computer to perform the task, with a faster processor or more main memory, or even to use another program that executes the tasks faster; changing the user is much more difficult. Besides that, with the increasing performance of computers, the impact of the transformations in the DMPML approach tends to be reduced.

Another point of comparison is to consider the whole process of preparing data for both the numerical and the categorical algorithms, which can be seen in Table IV. The times presented for the first three tools are the time needed to create the graphs for the numerical and categorical algorithms plus the time needed to execute them both. In DMPML-TS, each transformation was executed simultaneously in a different thread.

TABLE IV
TOTAL TIME NEEDED BY THE TOOLS TO PERFORM THE DATA PREPARATION TASKS (IN SECONDS).
(Columns: Weka, RapidMiner, KNIME, and DMPML-TS; rows: Adult, Mushroom, Poker Hand, and Statlog; the numeric values were not preserved in this transcription.)

It can be seen that, despite the negative impact of the XSL processor efficiency, DMPML-TS had the best performance on the categorical and mixed data sets. This can be explained by two reasons. First, the transformations are executed in parallel, considerably reducing the time needed to perform both transformations.
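The parallel execution mentioned above needs nothing beyond standard Java threading. The following is a rough sketch under assumed file and stylesheet names (it is not the DMPML-TS code); each task builds its own Transformer because Transformer instances are not thread-safe.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ParallelTransformations {

    private static Runnable transformation(final String stylesheet, final String output) {
        return new Runnable() {
            public void run() {
                try {
                    // Each thread gets its own Transformer (they are not thread-safe).
                    Transformer t = TransformerFactory.newInstance()
                            .newTransformer(new StreamSource(new File(stylesheet)));
                    t.transform(new StreamSource(new File("statlog.xrff")),
                                new StreamResult(new File(output)));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Hypothetical stylesheets: one codification for the ANN, one for ID3.
        pool.execute(transformation("dpdm-to-numeric.xsl", "statlog-numeric.xrff"));
        pool.execute(transformation("dpdm-to-categorical.xsl", "statlog-categorical.xrff"));
        pool.shutdown();                        // no new tasks; wait for both to finish
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}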
Considering the Statlog data set, executing the transformations in parallel instead of sequentially decreased the total transformation time by approximately 37%. The other reason for the better efficiency of DMPML is the way it handles attributes. Attributes that take values from a limited set have a small representation in the DPDM file, so the XSL processor does not need to parse big chunks of data to find their codification, to verify whether a value is an outlier or missing, to identify to what value it should be converted, and so on. Thus, DMPML deals well with data sets that have these characteristics (usually categorical attributes). Therefore, in general, DMPML performs better on categorical than on numerical data sets. We can confirm this statement by analyzing the Statlog data set, which consists only of numerical variables: this was the data set where DMPML had its worst overall performance. But, even in this situation, the user needed to spend approximately 66% less time manipulating the tool compared to the best directed graph tool.

Fig. 3 presents the confidence intervals of the tools considering the whole data preparation process, as presented in Table IV. Based on the figure, we can state, with a 95% confidence interval, that DMPML-TS had a better performance on the categorical and mixed data sets.

On the numerical data set, RapidMiner had statistically the best performance.

Fig. 3. Overall performance test with 95% confidence.

V. CONCLUSION

Data preparation is one of the most important and most effort- and time-consuming phases of the KDD process. In this phase, data is cleaned, integrated, selected, and transformed before being fed to a data mining algorithm. As there is no way to know in advance which data mining algorithm best fits a specific data set, it is important to test the data set with many different algorithms and many different parameters.

This paper presented a comparison between three tools that implement the directed graph approach (Weka, KNIME, and RapidMiner) and DMPML-TS. It was possible to observe that the tools that use the directed graph approach make the user spend more time creating the directed graphs than executing them; the smaller the data set, the more time is spent creating the data preparation tasks relative to the time spent executing them. The DMPML approach, on the other hand, does not demand a lot of user interaction but spends more time processing the XML documents. But, as reducing the time the user spends manipulating the tools should be more important, because it is much easier to upgrade the computer or use another computer with a faster processor and/or more main memory, the DMPML approach is very interesting and promising.

DMPML represents a good solution to store the data codification for different data mining algorithms, allowing the user to generate the data for different algorithms in parallel, making him/her spend less time dealing with the tools, and simplifying the execution of different algorithms in order to discover the best one for a given data set. With the usage of DMPML it was possible to significantly reduce the time spent creating the directed graphs: the time savings were more than 47% of the time needed to create the data preparation tasks. For the Mushroom data set, the DMPML approach was approximately 30% faster than the best tool that implements the directed graph approach, considering the time spent preparing and executing the data preparation phase for two different algorithms. It is important to notice that, as the number of tasks increases, more time is spent creating the graph and more advantages are perceived in using DMPML. If it is necessary to test a data set with different data mining algorithms, DMPML also offers a good solution.

REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann Publishers, Inc., 2006.
[2] D. Pyle, "Data collection, preparation, quality, and visualization," in The Handbook of Data Mining, N. Ye, Ed. Lawrence Erlbaum Associates, Inc., Publishers, 2003.
[3] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10-18, 2009.
[4] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "YALE: rapid prototyping for complex data mining tasks," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '06. New York, NY, USA: ACM, 2006.
[5] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel, "KNIME: The Konstanz Information Miner," in Data Analysis, Machine Learning and Applications, ser. Studies in Classification, Data Analysis, and Knowledge Organization, C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, Eds. Springer Berlin Heidelberg, 2008.
[6] H.-Y. Lo, C.-M. Chang, T.-H. Chiang, C.-Y. Hsiao, A. Huang, T.-T. Kuo, W.-C. Lai, M.-H. Yang, J.-J. Yeh, C.-C. Yen, and S.-D. Lin, "Learning to improve area-under-FROC for imbalanced medical data classification using an ensemble method," SIGKDD Explorations Newsletter, vol. 10, no. 2, 2008.
[7] M. Kay, "XSL Transformations (XSLT) version 2.0," W3C Recommendation, January 2007.
[8] P. M. Gonçalves Jr. and R. S. M. Barros, "Automating data preprocessing with DMPML and KDDML," in International Conference on Computer and Information Science, ser. ICIS '11, S. Xu, W. Du, and R. Lee, Eds. Los Alamitos, CA, USA: IEEE Computer Society, May 2011.
[9] E. Frank, M. Hall, G. Holmes, R. Kirkby, B. Pfahringer, I. H. Witten, and L. Trigg, "Weka: a machine learning workbench for data mining," in Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach, Eds. Springer US, 2010.
[10] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel, "KNIME - the Konstanz Information Miner: version 2.0 and beyond," SIGKDD Explorations Newsletter, vol. 11, no. 1, 2009.
[11] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml
[12] S. Haykin, Neural Networks and Learning Machines, 3rd ed. New York, NY, USA: Prentice Hall, 2009.
[13] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[14] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA, USA: Morgan Kaufmann, 2005.
[15] M. Kay, "The Saxon XSLT and XQuery Processor."


More information

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering World Journal of Computer Application and Technology 5(2): 24-29, 2017 DOI: 10.13189/wjcat.2017.050202 http://www.hrpub.org Outlier Detection and Removal Algorithm in K-Means and Hierarchical Clustering

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a

Research on Applications of Data Mining in Electronic Commerce. Xiuping YANG 1, a International Conference on Education Technology, Management and Humanities Science (ETMHS 2015) Research on Applications of Data Mining in Electronic Commerce Xiuping YANG 1, a 1 Computer Science Department,

More information

SSV Criterion Based Discretization for Naive Bayes Classifiers

SSV Criterion Based Discretization for Naive Bayes Classifiers SSV Criterion Based Discretization for Naive Bayes Classifiers Krzysztof Grąbczewski kgrabcze@phys.uni.torun.pl Department of Informatics, Nicolaus Copernicus University, ul. Grudziądzka 5, 87-100 Toruń,

More information

LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES

LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION TECHNIQUES 8 th International Conference on DEVELOPMENT AND APPLICATION SYSTEMS S u c e a v a, R o m a n i a, M a y 25 27, 2 0 0 6 LOAD BALANCING IN MOBILE INTELLIGENT AGENTS FRAMEWORK USING DATA MINING CLASSIFICATION

More information

COMP 465 Special Topics: Data Mining

COMP 465 Special Topics: Data Mining COMP 465 Special Topics: Data Mining Introduction & Course Overview 1 Course Page & Class Schedule http://cs.rhodes.edu/welshc/comp465_s15/ What s there? Course info Course schedule Lecture media (slides,

More information

1 Topic. Image classification using Knime.

1 Topic. Image classification using Knime. 1 Topic Image classification using Knime. The aim of image mining is to extract valuable knowledge from image data. In the context of supervised image classification, we want to assign automatically a

More information

Real-Time Data Analysis in ClowdFlows

Real-Time Data Analysis in ClowdFlows 2013 IEEE International Conference on Big Data Real-Time Data Analysis in ClowdFlows Janez Kranjc 1,2 Vid Podpečan 1 1 Jožef Stefan Institute 2 International Postgraduate School Jožef Stefan Jamova 39,

More information

Multi-relational Decision Tree Induction

Multi-relational Decision Tree Induction Multi-relational Decision Tree Induction Arno J. Knobbe 1,2, Arno Siebes 2, Daniël van der Wallen 1 1 Syllogic B.V., Hoefseweg 1, 3821 AE, Amersfoort, The Netherlands, {a.knobbe, d.van.der.wallen}@syllogic.com

More information

Speeding up Logistic Model Tree Induction

Speeding up Logistic Model Tree Induction Speeding up Logistic Model Tree Induction Marc Sumner 1,2,EibeFrank 2,andMarkHall 2 Institute for Computer Science University of Freiburg Freiburg, Germany sumner@informatik.uni-freiburg.de Department

More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

PARAMETER OPTIMIZATION FOR AUTOMATED SIGNAL ANALYSIS FOR CONDITION MONITORING OF AIRCRAFT SYSTEMS. Mike Gerdes 1, Dieter Scholz 1

PARAMETER OPTIMIZATION FOR AUTOMATED SIGNAL ANALYSIS FOR CONDITION MONITORING OF AIRCRAFT SYSTEMS. Mike Gerdes 1, Dieter Scholz 1 AST 2011 Workshop on Aviation System Technology PARAMETER OPTIMIZATION FOR AUTOMATED SIGNAL ANALYSIS FOR CONDITION MONITORING OF AIRCRAFT SYSTEMS Mike Gerdes 1, Dieter Scholz 1 1 Aero - Aircraft Design

More information

Detection and Deletion of Outliers from Large Datasets

Detection and Deletion of Outliers from Large Datasets Detection and Deletion of Outliers from Large Datasets Nithya.Jayaprakash 1, Ms. Caroline Mary 2 M. tech Student, Dept of Computer Science, Mohandas College of Engineering and Technology, India 1 Assistant

More information

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis , pp-01-05 WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis P.B.Khanale 1, Vaibhav M. Pathak 2 1 Department of Computer Science,Dnyanopasak College,Parbhani 431 401 e-mail

More information

WEKA KnowledgeFlow Tutorial for Version 3-5-6

WEKA KnowledgeFlow Tutorial for Version 3-5-6 WEKA KnowledgeFlow Tutorial for Version 3-5-6 Mark Hall Peter Reutemann June 1, 2007 c 2007 University of Waikato Contents 1 Introduction 2 2 Features 3 3 Components 4 3.1 DataSources..............................

More information

A Novel Feature Selection Framework for Automatic Web Page Classification

A Novel Feature Selection Framework for Automatic Web Page Classification International Journal of Automation and Computing 9(4), August 2012, 442-448 DOI: 10.1007/s11633-012-0665-x A Novel Feature Selection Framework for Automatic Web Page Classification J. Alamelu Mangai 1

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information