IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde



Contents

I. Dataset Description
II. Data Classification
III. Class Imbalance
IV. Classification after Sampling
V. Final Model

Arizona State University, Shilpa Madhavan Shinde

I. Dataset Description

Attributes: 109
Class variable: y (2 classes)
o R (1089 rows)
o G (1933 rows)

The dataset is imbalanced with respect to the class variable: G accounts for about 64% of the 3022 rows.

II. Data Classification

Several classifiers were tried on the dataset, and all results are based on 10-fold cross validation. Naïve Bayes, K-Nearest Neighbor and Random Forest show the largest difference between the error rates of the two classes. The SVM, Multilayer Perceptron, Rule Based and Decision Tree classifiers produce approximately the same overall and per-class error rates. Boosting or bagging does not help reduce the error rates for any of these classifiers. The rule based classifier copes better with the class imbalance problem, so the boosting and bagging results for the rule based classifier are shown in Table 1.
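The 10-fold cross validation used throughout can be illustrated with a small stratified split. This is a minimal sketch in plain Python, not Weka's own procedure: the function name `stratified_folds`, the round-robin assignment and the seed are assumptions made for illustration. Only the class sizes (1089 R, 1933 G) come from the report.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each row index to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Class sizes taken from the dataset description.
labels = ["R"] * 1089 + ["G"] * 1933
folds = stratified_folds(labels)
fold_sizes = [len(fold) for fold in folds]
r_counts = [sum(1 for i in fold if labels[i] == "R") for fold in folds]
print(fold_sizes)
print(r_counts)
```

Each fold serves once as the test set while the other nine are used for training, so every row is tested exactly once.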

Table 1: Cross Validation Errors (10-fold cross validation)

Classifier                             R error      G error      Total Error  Out of Bag Error
                                       (1089 rows)  (1933 rows)
Naïve Bayes                            35.63%       3.36%        14.99%       -
KNN                                    57.48%       8.69%        26.27%       -
SVM Linear Kernel, c=2, exponent=1     17.17%       9.83%        12.48%       -
Multilayer Perceptron                  17.45%       11.17%       13.43%       -
Rule Based                             18.92%       10.09%       13.27%       -
Decision tree                          19.56%       9.93%        13.40%       -
RF trees, 10 features                  39.67%       1.19%        15.07%       15.06%
ADABoost Rule based                    19.38%       8.79%        12.61%       -
Bagging Rule based                     21.67%       7.24%        12.44%       17.11%

Figure 1: Cross Validation Error Rate. [Bar chart of R-error, G Error and Total Error (axis: Error Rates) for each classifier in Table 1.]
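The total error in each row is the row-weighted average of the two per-class error rates, which is why a high R error can hide behind a moderate total. The sketch below is a worked check of that arithmetic. The helper `total_error` is a hypothetical name introduced here, and the inputs are the Naïve Bayes and SVM rows of Table 1.

```python
def total_error(class_errors, class_sizes):
    """Overall error rate as the row-weighted average of per-class error rates."""
    misclassified = sum(class_errors[c] * class_sizes[c] for c in class_sizes)
    return misclassified / sum(class_sizes.values())

sizes = {"R": 1089, "G": 1933}

# Per-class error rates copied from Table 1.
naive_bayes = total_error({"R": 0.3563, "G": 0.0336}, sizes)
svm = total_error({"R": 0.1717, "G": 0.0983}, sizes)

print(f"Naive Bayes total error: {naive_bayes:.2%}")
print(f"SVM total error: {svm:.2%}")
```

Because G has almost twice as many rows as R, the total error sits closer to the G error than to the R error.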

III. Class Imbalance

The error rate for the R class does not improve with any of the enhancements to the original classifiers, so sampling is used to handle the class imbalance problem. A hybrid approach is used, combining undersampling of the majority class with oversampling of the rare class. It is implemented with the filter option in Weka:

o The rarer class (R) is upsampled to 1471 rows (originally 1089 rows).
o The majority class (G) is downsampled to 1551 rows (originally 1933 rows).

The total therefore stays at 3022 rows. The results of the analysis using this approach are given in the section below.

IV. Classification after Sampling

The classifiers used with the original dataset were also applied to the dataset after sampling. The results obtained from the classifiers are listed in Table 2 & Figure 2.
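The hybrid sampling described in Section III can be illustrated as follows. This is a minimal sketch in plain Python under stated assumptions, not a reproduction of the Weka filter: the function `hybrid_resample`, the seed, and the choice to keep every original minority row before adding duplicates are all illustrative. Only the class sizes and the target sizes (1471 and 1551) come from the report.

```python
import random

def hybrid_resample(rows, labels, targets, seed=1):
    """Resample each class to its target size (illustrative, not Weka's filter)."""
    rng = random.Random(seed)
    out_rows, out_labels = [], []
    for cls, target in targets.items():
        members = [row for row, label in zip(rows, labels) if label == cls]
        if target <= len(members):
            # Undersample: draw rows without replacement.
            chosen = rng.sample(members, target)
        else:
            # Oversample: keep every original row, then add draws with replacement.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([cls] * target)
    return out_rows, out_labels

# Row indices stand in for the 109-attribute records.
labels = ["R"] * 1089 + ["G"] * 1933
rows = list(range(len(labels)))
new_rows, new_labels = hybrid_resample(rows, labels, {"R": 1471, "G": 1551})
print(new_labels.count("R"), new_labels.count("G"), len(new_rows))
```

With these targets the two classes end up close to balanced while the dataset keeps its original size.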

Table 2: Error Rates after sampling (10-fold cross validation, hybrid sampling)

Classifier                             R error      G error      Total Error  Out of Bag Error
                                       (1471 rows)  (1551 rows)
Naïve bayes                            12.85%       14.89%       12.67%       -
SVM Linear Kernel, c=2, exponent=1     6.46%        8.77%        7.64%        -
KNN                                    4.49%        6.58%        5.56%        -
Rule Based                             2.58%        7.03%        6.98%        -
Trees                                  2.58%        2.97%        2.78%        -
ADA Boost trees                        2.72%        2.13%        2.42%        2.42%
ADA Boost Rule Based                   2.11%        2.45%        2.28%        -
RF 500 trees, 10 features              2.58%        1.61%        2.05%        2.08%

Figure 2: Error Rates after Sampling. [Bar chart of R-error, G-Error and Total Error (axis: Error Rates with Sampling) for each classifier in Table 2.]

The results indicate that sampling has improved the results for the rarer class for all of the classifiers. Random Forest using 500 trees and 10 features gives the best results among all the options considered. The overall error rate of the model is 2.05% and the Out Of Bag error is 2.08%. The error rate is 1.61% for the majority class and 2.58% for the rarer class.

V. Final Model

The final model chosen is a random forest of 500 trees, each constructed while considering 10 random features. The results from the final model are listed below.

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 500 -K 10 -S 1
Relation: Train Data for Project 2009-weka.filters.unsupervised.attribute.Remove-R1- weka.filters.supervised.instance.resample-b1.0-s1-z weka.filters.supervised.instance.resample-b1.0-s1-z100.0-weka.filters.allfilterweka.filters.allfilter-weka.filters.supervised.instance.resample-b0.0-s1-z100.0
Instances: 3022
Attributes: 109 [list of attributes omitted]
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
Random forest of 500 trees, each constructed while considering 10 random features.
Out of bag error:
Time taken to build model: seconds

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances %
Incorrectly Classified Instances %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error 16.47%
Root relative squared error 31.84%
Total Number of Instances 3022

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                                                          R
                                                          G
Weighted Avg

=== Confusion Matrix ===
a  b  Error Rate  <-- classified as
      %           a = R
      %           b = G
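The Out Of Bag error reported for the random forest relies on the rows that a given tree never saw: each tree is grown on a bootstrap sample drawn with replacement, and the rows missing from that sample can serve as its test set. The sketch below only illustrates how large that left-out share is for 3022 rows. The function `oob_fraction` and the seed are assumptions for illustration and do not reproduce Weka's computation.

```python
import math
import random

def oob_fraction(n_rows, seed=1):
    """Share of rows absent from one bootstrap sample of size n_rows."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(in_bag) / n_rows

n_rows = 3022  # instances in the resampled dataset
expected = (1 - 1 / n_rows) ** n_rows  # chance a given row is never drawn
simulated = oob_fraction(n_rows)

print(f"expected out-of-bag share:  {expected:.3f}")
print(f"simulated out-of-bag share: {simulated:.3f}")
print(f"limit 1/e:                  {math.exp(-1):.3f}")
```

Roughly a third of the rows are out of bag for any single tree, so with 500 trees every row is out of bag for many of them and the forest can be scored without a separate hold-out set.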
