CHAPTER 6 EXPERIMENTS

6.1 HYPOTHESIS

On the basis of the trends revealed by data mining techniques, it is possible to draw conclusions about business organizations and the commercial software industry. In business institutions there are areas where efficiency can be improved through effective data mining techniques, data processing services and quality. Certain controllable factors affect the efficiency of the industry. The hypothesis is that, by controlling these factors, industrial organizations, business organizations, educational organizations, engineering institutions, software industries and website designers can improve their data processing and operational performance.

6.2 CONTRIBUTION TO KNOWLEDGE

The detailed study of knowledge discovery and data processing using data mining makes a significant contribution to knowledge. Further, it provides suggestions of practical significance for data processing using data mining techniques aimed at high-quality data processing. Data mining techniques can improve services and quality at all levels by bringing out danger spots and revealing the complexity of knowledge discovery data processing. These techniques help industrial organizations, business organizations, educational organizations, engineering institutions, web designers and the software industry by finding out whether policies and procedures are complied with, by studying new data mining ideas and directions of further development, and by suggesting the equipment to be used, or whether existing equipment can be effectively employed, in knowledge discovery data processing for effective business.

6.3 EXPERIMENTS PERFORMED

Experiments were performed to evaluate and compare the performance of different data mining techniques. For each technique, representative algorithms were selected. In particular, empirical evaluations of the following algorithms were performed.

1. Classification algorithms
   a. k-Nearest Neighbor
   b. Naive Bayes
   c. Decision Tree
   d. Decision Stump
   e. Rule Induction
2. Decision tree algorithms
   a. BF Tree
   b. FT Tree
   c. J48 Tree
   d. LAD Tree
3. Neural network algorithms
   a. Multilayer Perceptron
   b. Radial Basis Function
4. Association rule mining algorithms
   a. Apriori
   b. FP-Growth

6.4 EVALUATION OF CLASSIFICATION ALGORITHMS

In this work, RapidMiner Studio 6 [16] was used to perform the experiments, taking past project data from the repositories [15]. Five well-known classification algorithms, k-Nearest Neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Decision Stump (DS) and Rule Induction (RI), were applied to the Weighting, Golf, Iris, Deals and Labor datasets, and the outputs were tabulated and plotted in two-dimensional graphs. Then, one by one,

these datasets were evaluated and their accuracy recorded. The numbers of correctly and incorrectly classified instances were noted. Each algorithm was run over the five predefined datasets and its performance was evaluated in terms of accuracy.

6.4.1 Dataset used

Performing the comparative analysis requires past project datasets, and a number of datasets were selected for running the tests. To limit bias, some datasets were downloaded from the UCI repository [15] and some were taken from RapidMiner Studio. Table 6.1 shows the datasets selected for testing. As shown in the table, each dataset is described by its data type, the number of instances it stores and the number of attributes that describe it. These datasets were chosen because they have different characteristics and address different areas (Table 6.1).

It is assumed that a dataset with a high number of instances yields high performance, because it provides enough instances for training. To verify this assumption, the datasets for general classification were kept small (from 14 to 1000 instances), while for decision trees and neural networks larger datasets were taken (up to 8000 instances). Varying the dataset size across algorithms in this way makes it possible to test whether the assumption holds.

Table 6.1: Dataset for classification algorithms

Dataset    Data Type     Attributes  Instances
Weighting  Multivariate  6           500
Golf       Multivariate  5           14
Iris       Multivariate  6           150
Deals      Multivariate  4           1000
Labor      Multivariate  16          40
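The evaluation procedure described above (hold out part of a dataset for training, classify the rest, and count correctly and incorrectly classified instances) can be sketched in a few lines. The toy dataset and the 1-nearest-neighbor classifier below are illustrative stand-ins, not the RapidMiner operators or datasets used in the actual experiments.

```python
# Minimal sketch of the evaluation loop of Section 6.4: train on part of
# a dataset, classify the rest, and report correctly / incorrectly
# classified instances and accuracy. Toy data, for illustration only.

def nn_classify(train, query):
    """1-nearest-neighbor: return the label of the closest training instance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda inst: dist2(inst[0], query))
    return label

def evaluate(train, test):
    """Count correct/incorrect classifications and compute accuracy."""
    correct = sum(1 for x, y in test if nn_classify(train, x) == y)
    incorrect = len(test) - correct
    accuracy = correct / len(test)
    return correct, incorrect, accuracy

# Toy two-class dataset: (features, label)
data = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
        ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B")]
train, test = data[:4], data[4:]

correct, incorrect, accuracy = evaluate(train, test)
print(correct, incorrect, accuracy)  # -> 2 0 1.0
```

The same loop, repeated over each algorithm and each dataset, yields the accuracy tables reported in the following sections.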

6.5 EVALUATION OF DECISION TREE ALGORITHMS

For the decision tree investigation, Weka 3.6.8 [17] was used. The BF Tree, FT Tree, J48 Tree and LAD Tree algorithms were applied to five datasets and the outputs were tabulated and plotted in two-dimensional graphs. The datasets were then evaluated one by one and their accuracy recorded. The numbers of correctly and incorrectly classified instances were noted. Each algorithm was run over the five predefined datasets and its performance was evaluated in terms of accuracy.

6.5.1 Dataset used

We took five datasets of the nominal attribute type, that is, all of these datasets contain continuous attributes (Table 6.2). These datasets have been taken from the UCI machine learning repository [15].

Table 6.2: Dataset for decision tree algorithms

Dataset      Attributes  Instances
diabetes     9           668
hypothyroid  30          3662
mushroom     23          8124
optdigits    65          5620
segment      20          2310

6.6 EVALUATION OF NEURAL NETWORK ALGORITHMS

In this experiment the performance of two neural network algorithms, the Multilayer Perceptron (MLP) and the Radial Basis Function (RBF) network, was evaluated and compared using IBM SPSS Statistics [18]. The purpose of the experiments was twofold. The first aspect was to verify that RBF networks do in fact provide consistently better results than an MLP network. The second was to investigate the effect of dataset variation on the performance of the two networks.
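The practical difference between the two network types compared here lies in the hidden-unit activation: an MLP unit responds to a weighted sum of all inputs (a global response), while an RBF unit responds to the distance of the input from a learned centre (a local response). The weights, centre and width in the sketch below are arbitrary illustrative values, not parameters taken from the SPSS models.

```python
import math

# Sketch of the two hidden-unit types compared in Section 6.6.
# All weights, centres and widths are arbitrary illustrative values.

def mlp_unit(x, w, b):
    """MLP hidden unit: sigmoid of a weighted sum of the inputs."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))

def rbf_unit(x, centre, sigma):
    """RBF hidden unit: Gaussian of the squared distance to a centre."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2.0 * sigma ** 2))

x = (0.5, -0.2)
print(round(mlp_unit(x, w=(1.0, 2.0), b=0.0), 4))
print(round(rbf_unit(x, centre=(0.5, -0.2), sigma=1.0), 4))  # 1.0 at its centre
```

Because an RBF unit peaks at its centre and decays with distance, RBF networks tend to fit locally clustered data quickly, which is one reason their behavior can differ from an MLP as the datasets vary.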

6.6.1 Dataset used

Four datasets with large numbers of instances were chosen to evaluate and compare the Multilayer Perceptron and the Radial Basis Function network. These datasets (Table 6.3) have been taken from the IBM SPSS Statistics repository [18].

Table 6.3: Dataset for neural network algorithms

Dataset       Attributes  Instances
worldsales    3           1000
tv-survey     6           906
debate        4           1296
cable-survey  10          6000

6.7 EVALUATION OF ASSOCIATION RULE MINING ALGORITHMS

In this experiment the performance of two association rule mining algorithms, Apriori and FP-Growth, was evaluated and compared using Weka 3.6.8 [17]. Again the purpose of the experiments was twofold. The first aspect was to compare performance in terms of execution time, to find which algorithm is better than the other. The second was to investigate the effect of varying the number of instances on the performance of the two algorithms.

6.7.1 Dataset used

The Supermarket dataset was used for the experimentation. This dataset contains 4627 instances and 217 attributes. The performance of the Apriori and FP-Growth algorithms was evaluated on the basis of execution time for different numbers of instances. This dataset has been taken from the UCI repository [15].
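The execution-time comparison in this experiment amounts to timing how long each algorithm takes to find the frequent itemsets over a growing number of instances. The sketch below illustrates the core Apriori idea, counting frequent 1-itemsets and then building candidate pairs only from frequent items, together with a simple timer; the toy transactions and support threshold are illustrative, not the Supermarket data or Weka's implementation.

```python
import time
from itertools import combinations

# Sketch of Apriori-style frequent-itemset counting (Section 6.7).
# Toy transactions and minimum support are illustrative only.

def frequent_itemsets(transactions, min_support):
    """Return frequent 1- and 2-itemsets with support >= min_support."""
    n = len(transactions)
    # Pass 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    f1 = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    # Pass 2: candidate pairs built only from frequent items
    # (the Apriori pruning step: no superset of an infrequent set is frequent).
    items = sorted({i for s in f1 for i in s})
    f2 = set()
    for a, b in combinations(items, 2):
        support = sum(1 for t in transactions if a in t and b in t) / n
        if support >= min_support:
            f2.add(frozenset([a, b]))
    return f1, f2

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
start = time.perf_counter()
f1, f2 = frequent_itemsets(transactions, min_support=0.5)
elapsed = time.perf_counter() - start  # the quantity compared in this experiment
print(sorted(sorted(s) for s in f1))
print(sorted(sorted(s) for s in f2))
```

Repeating such a timed run for increasing instance counts, for Apriori and then for FP-Growth, gives the execution-time curves on which the comparison in this section is based.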