CHAPTER 6
EXPERIMENTS

6.1 HYPOTHESIS
On the basis of the trends revealed by the data mining techniques, conclusions can be drawn about business organizations and the commercial software industry. Within business institutions there are areas where efficiency can be improved through effective data mining techniques and better data processing service and quality. Certain controllable factors affect the efficiency of the industry. The hypothesis is that, by controlling these controllable factors, industrial, business, educational and engineering organizations, software industries and website designers can improve their data processing operational performance.

6.2 CONTRIBUTION TO KNOWLEDGE
The detailed study of knowledge discovery data processing using data mining makes a significant contribution to knowledge. Further, it provides suggestions of practical significance for high-quality data processing using data mining techniques. Data mining can serve services and quality at all levels by bringing out danger spots and highlighting the complexity of knowledge discovery data processing. It helps industrial, business, educational and engineering organizations, web designers and the software industry by finding out whether policies and procedures are complied with, by studying new data mining ideas and directions for further development, and by suggesting which equipment should be used, or whether existing equipment can be effectively employed, in knowledge discovery data processing for effective business.
6.3 EXPERIMENTS PERFORMED
Experiments were performed to evaluate and compare the performance of different data mining techniques. For each technique several algorithms were selected. In particular, empirical evaluations of the following algorithms were performed.

1. Classification algorithms
   a. k-nearest neighbor
   b. Naive Bayes
   c. Decision Tree
   d. Decision Stump
   e. Rule Induction
2. Decision tree algorithms
   a. BF Tree
   b. FT Tree
   c. J48 Tree
   d. LAD Tree
3. Neural network algorithms
   a. Multilayer Perceptron
   b. Radial Basis Function
4. Association rule mining algorithms
   a. Apriori
   b. FP-Growth

6.4 EVALUATION OF CLASSIFICATION ALGORITHMS
In this work RapidMiner Studio 6 [16] was used to perform the experiments, taking past project data from the repositories [15]. Five well-known classification algorithms, k-nearest neighbor (KNN), Naive Bayes (NB), Decision Tree (DT), Decision Stump (DS) and Rule Induction (RI), were applied on the Weighting, Golf, Iris, Deals and Labor datasets, and the outputs were tabulated and plotted in two-dimensional graphs. The datasets were then evaluated one by one and the accuracy of each algorithm was computed: the numbers of correctly and incorrectly classified instances were recorded. Each algorithm was run over the five predefined datasets and its performance was evaluated in terms of accuracy.

6.4.1 Dataset used
Performing the comparative analysis requires past project datasets. A number of datasets were selected for running the tests. To reduce bias, some datasets were downloaded from the UCI repository [15] and some were taken from RapidMiner Studio. Table 6.1 lists the selected datasets; each is described by its data type, the number of instances it contains and the number of attributes that describe each instance. These datasets were chosen because they have different characteristics and address different areas (Table 6.1).

It is assumed that a dataset with a large number of instances yields higher performance, because it provides enough instances for training. To test this assumption, the datasets for general classification were kept small (from 14 to 1000 instances), while those for the decision tree and neural network experiments were taken large (up to 8124; see Table 6.2). Varying the dataset size across algorithms in this way makes it possible to verify whether the assumption holds.

Table 6.1: Dataset for classification algorithms

Dataset    Data Type     Attributes  Instances
Weighting  Multivariate  6           500
Golf       Multivariate  5           14
Iris       Multivariate  6           150
Deals      Multivariate  4           1000
Labor      Multivariate  16          40
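The RapidMiner workflow itself is built graphically, but the style of comparison described in Section 6.4 can be sketched in Python with scikit-learn. This is a sketch only: the Iris data and the 70/30 hold-out split are illustrative assumptions (RapidMiner's evaluation settings may differ), and Decision Stump and Rule Induction have no direct scikit-learn counterpart, so three of the five classifiers are shown.

```python
# Sketch of the Section 6.4 comparison using scikit-learn, not the
# RapidMiner Studio 6 workflow used in the thesis.  Dataset choice and
# the 70/30 split are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    # Record correctly / incorrectly classified instances, as in Section 6.4
    correct = int(accuracy_score(y_te, preds, normalize=False))
    results[name] = (correct, len(y_te) - correct)
    print(f"{name}: {correct} correct, {len(y_te) - correct} incorrect")
```

Running each classifier over all five datasets of Table 6.1 and tabulating the counts reproduces the shape of the experiment, if not the exact RapidMiner numbers.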
6.5 EVALUATION OF DECISION TREE ALGORITHMS
For the decision tree evaluation, Weka 3.6.8 [17] was used to aid the investigation. The BF Tree, FT Tree, J48 Tree and LAD Tree algorithms were applied on five datasets and the outputs were tabulated and plotted in two-dimensional graphs. The datasets were then evaluated one by one and the accuracy of each algorithm was computed: the numbers of correctly and incorrectly classified instances were recorded. Each algorithm was run over the five predefined datasets and its performance was evaluated in terms of accuracy.

6.5.1 Dataset used
Five datasets containing both nominal and continuous attribute types were taken (Table 6.2). These datasets were taken from the UCI machine learning repository [15].

Table 6.2: Dataset for decision tree algorithms

Dataset      Attributes  Instances
diabetes     9           668
hypothyroid  30          3662
mushroom     23          8124
optdigits    65          5620
segment      20          2310

6.6 EVALUATION OF NEURAL NETWORK ALGORITHMS
In this experiment the performance of the neural network algorithms, namely the Multilayer Perceptron (MLP) and the Radial Basis Function (RBF) network, was evaluated and compared using IBM SPSS Statistics [18]. The purpose of the experiments was twofold: first, to verify whether RBF networks do in fact provide consistently better results than an MLP network; second, to investigate the effect of dataset variation on the performance of the two networks.
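SPSS configures both networks through its GUI; as a rough, non-authoritative sketch of the same comparison, the code below trains an MLP with scikit-learn and hand-builds a simple RBF network (k-means centres, Gaussian basis features, and a linear output layer). The Iris data, the hidden-layer size, the number of centres and the gamma value are all assumptions for illustration, since the files in Table 6.3 are SPSS samples.

```python
# Sketch only: the thesis ran MLP and RBF networks inside IBM SPSS
# Statistics.  Here the MLP uses scikit-learn and the RBF network is
# hand-built (k-means centres + Gaussian features + linear classifier).
# Dataset, layer sizes and gamma are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Multilayer Perceptron: one hidden layer of 10 units
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X_tr_s, y_tr)
acc_mlp = mlp.score(X_te_s, y_te)

# RBF network: Gaussian features around k-means centres, linear output
centres = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_tr_s).cluster_centers_

def rbf_features(X, centres, gamma=1.0):
    # Squared distance of every sample to every centre -> Gaussian response
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

rbf_out = LogisticRegression(max_iter=1000)
rbf_out.fit(rbf_features(X_tr_s, centres), y_tr)
acc_rbf = rbf_out.score(rbf_features(X_te_s, centres), y_te)

print(f"MLP accuracy: {acc_mlp:.3f}  RBF accuracy: {acc_rbf:.3f}")
```

Comparing the two accuracies across datasets of different sizes mirrors the twofold purpose stated above, although SPSS's automatic architecture selection will generally choose different layer sizes.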
6.6.1 Dataset used
Four datasets with large numbers of instances were chosen to evaluate and compare the Multilayer Perceptron and Radial Basis Function networks. These datasets (Table 6.3) were taken from the IBM SPSS Statistics data repository [18].

Table 6.3: Dataset for neural network algorithms

Dataset       Attributes  Instances
worldsales    3           1000
tv-survey     6           906
debate        4           1296
cable-survey  10          6000

6.7 EVALUATION OF ASSOCIATION RULE MINING ALGORITHMS
In this experiment the performance of the association rule mining algorithms Apriori and FP-Growth was evaluated and compared using Weka 3.6.8 [17]. Again the purpose of the experiments was twofold: first, to compare performance in terms of execution time and determine which algorithm is better; second, to investigate the effect of varying the number of instances on the performance of the two algorithms.

6.7.1 Dataset used
The Supermarket dataset, containing 4627 instances and 217 attributes, was used for the experimentation. The performance of the Apriori and FP-Growth algorithms was evaluated on the basis of execution time for different numbers of instances. This dataset was taken from the UCI repository [15].
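The timing experiments above used Weka's implementations on the Supermarket data. To make concrete what Apriori actually computes, here is a minimal pure-Python sketch on an invented toy transaction set; the candidate generation is simplified (it omits the classical subset-based pruning step), so this illustrates the level-wise search rather than an efficient implementation.

```python
# Minimal Apriori sketch in pure Python.  The thesis experiments used
# Weka's Apriori and FP-Growth; the transactions and min_support below
# are invented for illustration.
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidate (k+1)-itemsets: unions of surviving k-itemsets
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
freq = apriori(transactions, min_support=2)
for itemset, count in sorted(freq.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

Apriori re-scans the transactions at every level, which is why its execution time grows quickly with the number of instances; FP-Growth avoids repeated scans by compressing the transactions into an FP-tree, which is the difference the Section 6.7 experiment measures.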