
A Comparative Study of Classification Methods in Data Mining using RapidMiner Studio

Vishnu Kumar Goyal
Dept. of Computer Engineering, Govt. R.C. Khaitan Polytechnic College, Jaipur, India
vishnugoyal_jaipur@yahoo.co.in

Abstract: Data mining is the knowledge discovery process that analyses large volumes of data from various perspectives and summarizes them into useful information. It has become an essential component in many fields of daily life and is used to identify hidden patterns in large data sets. Classification is an important data mining technique with broad applications; it is used to classify the kinds of data found in nearly every field of human life. In this paper we apply several classification algorithms to different datasets in order to assess the efficiency of each algorithm. The paper analyzes five major classification algorithms, k-nearest neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Decision Stump (DS) and Rule Induction (RI), and compares their performance. The results are tested on five datasets, namely Weighting, Golf, Iris, Deals and Labor, using RapidMiner Studio.

Index Terms: Data Mining, Classification, RapidMiner.

I. INTRODUCTION

Classification is a classic machine learning and data mining technique. It is used to assign each item in a set of data to one of a predefined set of classes or groups [2]. Classification methods use mathematical techniques such as decision trees, linear programming, neural networks and statistics. In classification, we develop software that can learn how to classify data items into groups [3]. For example, given the records of all students who left a college, classification can be used to predict which current students will probably leave in a future period. In this case, we divide the student records into two groups named "leave" and "stay", and then ask the data mining software to classify the students into these groups.

II. METHODOLOGY

In this paper, RapidMiner Studio 6 [8] was used to perform the experiments, taking past project data from the repositories. Five well-known classification algorithms, k-nearest neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Decision Stump (DS) and Rule Induction (RI), were applied to the Weighting, Golf, Iris, Deals and Labor datasets, and the outputs were tabulated and plotted in two-dimensional graphs. Each algorithm was run over the five predefined datasets in turn. For every run, the numbers of correctly and incorrectly classified instances were recorded, and performance was evaluated in terms of accuracy.

III. THE RAPIDMINER TOOL

RapidMiner Studio 6 was used to implement the classification experiments. RapidMiner is one of the world's most widespread and most used open source data mining solutions [8]. The project was born at the University of Dortmund in 2001 and has been developed further by Rapid-I GmbH since 2007. With this academic background, RapidMiner continues to address not only business clients but also universities and researchers from the most diverse disciplines.

Fig. 1. RapidMiner User Interface

RapidMiner has a comfortable user interface (Fig. 1), where analyses are configured in a process view. RapidMiner uses a modular concept for this: each step of an analysis (e.g. a pre-processing step or a learning procedure) is represented by an operator in the analysis process. These operators have input and output ports through which they communicate with other operators, either to receive input data or to pass the changed data and generated models on to the operators that follow. A data flow is thus created through the entire analysis process, as shown in Fig. 2.

Fig. 2. A RapidMiner process with operators for model production

The most complex analysis situations and needs can be handled by so-called super-operators, which in turn can contain a complete sub-process. A well-known example is cross-validation, which contains two sub-processes. The first sub-process produces a model from the respective training data. The second sub-process is given this model and any other generated results, applies them to the test data, and measures the quality of the model in each case. A typical application is shown in Fig. 3.

Fig. 3. The internal sub-processes of a cross-validation

IV. DATASET

Performing the comparison requires past project datasets, and a number of data sets were selected for running the tests. To limit bias, some data sets were downloaded from the UCI Machine Learning Repository [6] and some were taken from RapidMiner Studio. Table I lists the selected data sets. Each dataset is described by its data type, the number of attributes that describe it, and the number of instances it contains. These data sets were chosen because they have different characteristics and address different areas.

TABLE I. DATASETS DESCRIPTION

Dataset     Data Type      Attributes   Instances
Weighting   Multivariate   7            500
Golf        Multivariate   5            14
Iris        Multivariate   6            150
Deals       Multivariate   4            1000
Labor       Multivariate   17           40

V. EXPERIMENTAL STUDY AND RESULTS

All five algorithms discussed above are implemented in RapidMiner Studio 6, on which the experiments were carried out in order to measure the performance of the algorithms over the datasets. The results are summarized in the following tables and graphs.
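The two sub-processes of cross-validation described in Section III, together with the correct/incorrect counts and accuracy reported in the result tables, can be illustrated with a small standard-library sketch. It is not RapidMiner code, and the paper does not state which validation scheme produced its tables. The function names, the choice of five folds, k = 3, and the toy "stay"/"leave" points are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(rows, n_folds=5, k=3, seed=0):
    """Return (correct, incorrect, accuracy_percent) summed over n_folds folds."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    correct = incorrect = 0
    for i, test_fold in enumerate(folds):
        # Training sub-process: the "model" is every row outside the held-out fold.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        # Testing sub-process: apply the model to the held-out fold and count hits.
        for features, label in test_fold:
            if knn_predict(train, features, k) == label:
                correct += 1
            else:
                incorrect += 1
    total = correct + incorrect
    return correct, incorrect, 100.0 * correct / total


# Hypothetical, well-separated toy data: two clusters labelled "stay" / "leave".
toy_rows = (
    [((x * 0.1, x * 0.1), "stay") for x in range(10)]
    + [((5 + x * 0.1, 5 + x * 0.1), "leave") for x in range(10)]
)
correct, incorrect, accuracy = cross_validate(toy_rows)
print(correct, incorrect, f"{accuracy:.2f}")
```

The accuracy value is simply the number of correctly classified instances divided by the total number of instances, expressed as a percentage; for example, 444 correct out of 500 gives 88.80.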

TABLE II. PERFORMANCE OF KNN ALGORITHM

Datasets    Correctly Classified   Incorrectly Classified   Accuracy (%)
Weighting   444                    56                       88.80
Golf        6                      8                        42.86
Iris        144                    6                        96.00
Deals       973                    27                       97.30
Labor       34                     6                        85.00

Fig. 4. KNN Algorithm: Percentage

The KNN algorithm performed well for the Iris and Deals datasets. It also performed well for the Weighting and Labor datasets, but for the Golf dataset the accuracy is low.

TABLE III. PERFORMANCE OF NAIVE BAYES ALGORITHM

Datasets    Correctly Classified   Incorrectly Classified   Accuracy (%)
Weighting   451                    49                       90.20
Golf        8                      6                        57.14
Iris        143                    7                        95.33
Deals       926                    74                       92.60
Labor       35                     5                        87.50

Fig. 5. Naive Bayes Algorithm: Percentage
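For readers unfamiliar with the Naive Bayes classifier evaluated in Table III, the following is a minimal sketch for categorical attributes. It picks the class with the highest prior probability multiplied by the per-attribute conditional probabilities, using add-one smoothing. This is an illustrative assumption and not RapidMiner's implementation; the function names and the small weather-style rows are hypothetical and are not the Golf dataset used in the paper.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows):
    """rows: list of (feature_tuple, label). Returns the counts needed for prediction."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)   # (feature_index, label) -> value -> count
    values_seen = defaultdict(set)        # feature_index -> distinct values
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict_naive_bayes(model, features):
    """Return the label maximizing log P(label) + sum_i log P(feature_i | label)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(features):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy rows: (outlook, humidity) -> play
toy_rows = [
    (("sunny", "high"), "no"),
    (("sunny", "high"), "no"),
    (("overcast", "high"), "yes"),
    (("rain", "normal"), "yes"),
    (("rain", "normal"), "yes"),
    (("sunny", "normal"), "yes"),
]
model = train_naive_bayes(toy_rows)
print(predict_naive_bayes(model, ("sunny", "high")))
print(predict_naive_bayes(model, ("rain", "normal")))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters more as the number of attributes grows.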

As shown in Fig. 5, the Naive Bayes algorithm performed well for the Weighting, Iris, Deals and Labor datasets. The accuracy is somewhat lower for the Golf dataset.

TABLE IV. PERFORMANCE OF DECISION TREE ALGORITHM

Datasets    Correctly Classified   Incorrectly Classified   Accuracy (%)
Weighting   442                    58                       88.40
Golf        7                      7                        50.00
Iris        140                    10                       93.33
Deals       996                    4                        99.60
Labor       22                     18                       55.00

As shown in Fig. 6, the Decision Tree algorithm performed well for the Weighting, Iris and Deals datasets, but for the Golf and Labor datasets the accuracy is low.

Fig. 6. Decision Tree Algorithm: Percentage

TABLE V. PERFORMANCE OF DECISION STUMP ALGORITHM

Datasets    Correctly Classified   Incorrectly Classified   Accuracy (%)
Weighting   407                    93                       81.40
Golf        9                      5                        64.29
Iris        100                    50                       66.67
Deals       732                    268                      73.20
Labor       24                     16                       60.00

As shown in Fig. 7, the accuracy of the Decision Stump algorithm is good for the Weighting and Deals datasets and average for the Golf, Iris and Labor datasets.
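A decision stump is a decision tree with a single split. The sketch below shows one common way to fit such a stump on numeric attributes: it tries every midpoint threshold on every attribute and keeps the split with the fewest training errors. The split criterion, the function names and the toy rows are assumptions for illustration; RapidMiner's Decision Stump operator may select its split differently.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows):
    """Find the single (feature, threshold) split with the fewest training errors.

    rows: list of (numeric_feature_tuple, label); at least one feature must
    take two or more distinct values.
    Returns (feature_index, threshold, left_label, right_label).
    """
    best = None
    n_features = len(rows[0][0])
    for f in range(n_features):
        values = sorted({features[f] for features, _ in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [label for features, label in rows if features[f] <= threshold]
            right = [label for features, label in rows if features[f] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(l != left_label for l in left)
                      + sum(r != right_label for r in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def predict_stump(stump, features):
    """Apply the single split: one label on each side of the threshold."""
    f, threshold, left_label, right_label = stump
    return left_label if features[f] <= threshold else right_label


# Hypothetical toy rows: only the first attribute separates the classes.
toy_rows = [
    ((1.0, 7.0), "a"),
    ((2.0, 3.0), "a"),
    ((3.0, 8.0), "b"),
    ((4.0, 2.0), "b"),
]
stump = fit_stump(toy_rows)
print(stump)
print(predict_stump(stump, (1.2, 9.9)), predict_stump(stump, (3.9, 0.0)))
```

Because the model has only two leaves, it can output at most two distinct labels, which limits what a stump of this binary form can achieve on datasets with more than two classes.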

Fig. 7. Decision Stump Algorithm: Percentage

TABLE VI. PERFORMANCE OF RULE INDUCTION ALGORITHM

Datasets    Correctly Classified   Incorrectly Classified   Accuracy (%)
Weighting   432                    68                       86.40
Golf        9                      5                        64.29
Iris        142                    8                        94.67
Deals       960                    40                       96.00
Labor       24                     16                       60.00

As shown in Fig. 8, the Rule Induction algorithm performed well for the Iris and Deals datasets and average for the Golf and Labor datasets.

Fig. 8. Rule Induction Algorithm: Percentage

VI. COMPARISON

The k-nearest neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Decision Stump (DS) and Rule Induction (RI) classification techniques were applied to the Weighting, Golf, Iris, Deals and Labor datasets using RapidMiner Studio. The consolidated outputs are tabulated in Table VII and plotted in a two-dimensional graph in Fig. 9.

TABLE VII. PERFORMANCE COMPARISON (ACCURACY, %)

Dataset     KNN     NB      DT      DS      RI
Weighting   88.80   90.20   88.40   81.40   86.40
Golf        42.86   57.14   50.00   64.29   64.29
Iris        96.00   95.33   93.33   66.67   94.67
Deals       97.30   92.60   99.60   73.20   96.00
Labor       85.00   87.50   55.00   60.00   60.00
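As a numerical cross-check on Table VII, the sketch below computes two simple summaries from the tabulated accuracies: the unweighted mean accuracy of each algorithm across the five datasets, and the best-scoring algorithm or algorithms per dataset. Both summaries are assumptions introduced here, not the comparison procedure described by the author. The unweighted mean gives a 14-instance dataset the same weight as a 1000-instance one.

```python
from statistics import mean

ALGORITHMS = ["KNN", "NB", "DT", "DS", "RI"]

# Accuracy (%) copied from Table VII, one row per dataset.
TABLE_VII = {
    "Weighting": [88.80, 90.20, 88.40, 81.40, 86.40],
    "Golf":      [42.86, 57.14, 50.00, 64.29, 64.29],
    "Iris":      [96.00, 95.33, 93.33, 66.67, 94.67],
    "Deals":     [97.30, 92.60, 99.60, 73.20, 96.00],
    "Labor":     [85.00, 87.50, 55.00, 60.00, 60.00],
}


def mean_accuracy(table):
    """Unweighted mean accuracy per algorithm, rounded to two decimals."""
    columns = zip(*table.values())
    return {name: round(mean(col), 2) for name, col in zip(ALGORITHMS, columns)}


def best_per_dataset(table):
    """For each dataset, list the algorithm(s) with the highest accuracy."""
    result = {}
    for dataset, row in table.items():
        top = max(row)
        result[dataset] = [a for a, v in zip(ALGORITHMS, row) if v == top]
    return result


for name, value in mean_accuracy(TABLE_VII).items():
    print(f"{name}: {value:.2f}")
for dataset, winners in best_per_dataset(TABLE_VII).items():
    print(f"{dataset}: {', '.join(winners)}")
```

Other summaries, such as instance-weighted accuracy or rank averages, are equally defensible and may order the algorithms differently.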

VII. CONCLUSION

For the Weighting dataset, all the algorithms performed well. For the Golf dataset, Naive Bayes, Decision Stump and Rule Induction performed average, while the accuracy of KNN and Decision Tree is low. For the Iris dataset, KNN, Naive Bayes, Decision Tree and Rule Induction performed well, and Decision Stump performed average. For the Deals dataset, KNN, Naive Bayes, Decision Tree and Rule Induction performed well, and Decision Stump performed average. For the Labor dataset, KNN and Naive Bayes performed well, and Decision Tree, Decision Stump and Rule Induction performed average.

For the given datasets, the Decision Stump algorithm performs worst and Naive Bayes performs average. The Decision Tree shows some improvement over these two. KNN and Rule Induction perform well among all the algorithms, and KNN can be considered the best among these algorithms for these datasets.

Fig. 9. Classification Algorithms: Percentage

REFERENCES

[1] J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1967, pp. 281-297.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2nd ed., 2006.
[3] Margaret H. Dunham, S. Sridhar, Data Mining: Introductory and Advanced Topics, Pearson Education, 1st ed., 2006.
[4] Anshul Goyal, Rajni Mehta, "Performance Comparison of Naive Bayes and J48 Classification Algorithms", IJAER, Vol. 7, No. 11, 2012.
[5] Milan Kumari, Sunila Godara, "Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction", IJCST, Vol. 2, Issue 2, 2011, pp. 304-308.
[6] UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science. http://www.ics.uci.edu/~mlearn/mlrepository.html
[7] Surjeet Kumar Yadav and Saurabh Pal, "Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification", World of Computer Science and Information Technology Journal (WCSIT), Vol. 2, No. 2, 2012.
[8] RapidMiner, an open source learning environment for data mining and machine learning. https://rapidminer.com
[9] Sanjay D. Sawaitul, K. P. Wagh, P. N. Chatur, "Classification and Prediction of Future Weather by using Back Propagation Algorithm: An Approach", International Journal of Emerging Technology and Advanced Engineering (www.ijetae.com, ISSN 2250-2459), Volume 2, Issue 1, 2012.
[10] Qasem A. Al-Radaideh and Eman Al Nagi, "Using Data Mining Techniques to Build a Classification Model for Predicting Employees Performance", International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 2, 2012.