Best First and Greedy Search Based CFS and Naïve Bayes Algorithms for Hepatitis Diagnosis


CHAPTER 3

BEST FIRST AND GREEDY SEARCH BASED CFS AND NAÏVE BAYES ALGORITHMS FOR HEPATITIS DIAGNOSIS

3.1 Introduction

Feature selection [37, 38, 49, 88] is a technique used to reduce the dimensionality of data by eliminating irrelevant features, thereby improving predictive accuracy. The search begins with an empty set of features and generates all possible single-feature expansions; the subset with the maximum accuracy is chosen and expanded in the same way by adding single features. If expanding the current subset no longer improves its accuracy, the search moves on to the next best unexpanded subset. Finally, the subset with the maximum accuracy is selected as the reduced feature set [7, 33, 59].

The objective of this study is to predict the life expectancy of patients with hepatitis from the hepatitis data and to improve the classification accuracy. The Naive Bayes algorithm is used to obtain the accuracy of classification and prediction. To increase its accuracy, Correlation based Feature Selection (CFS) with the best first and greedy search strategies is employed, ensuring that noisy or irrelevant features are removed. The prediction accuracy of Naive Bayes is then compared with that of other classification algorithms such as J48, Multilayer Perceptron (MLP), Sequential Minimal Optimization (SMO) and Radial Basis Function (RBF).

This chapter is organized as follows: Section 3.2 deals with the concepts of CFS, best first search and greedy search; Section 3.3 presents the proposed methodology; and Section 3.4 illustrates the experimental results.

3.2 Correlation Based Feature Selection Filter

Attribute selection involves searching through all possible combinations of attributes in the data to find the subset of attributes that works best for prediction. For that purpose an attribute evaluator and a search method are needed. The evaluator determines how a worth is assigned to each subset of attributes, and the search method determines the style of search that is performed. In this work CFS is used as the evaluator, with Best First Search (BFS) and Greedy Search (GS) as the search methods.

3.2.1 Feature Evaluation

CFS [41, 57, 61, 100] is a heuristic approach for evaluating the worth, or merit, of a subset of features. It ranks the relevance of features by measuring the correlation between features and classes and among the features themselves. Given k features and classes c, CFS defines the relevance of a feature subset using Pearson's correlation:

M_S = \frac{k\,\overline{r}_{cf}}{\sqrt{k + k(k-1)\,\overline{r}_{ff}}}    (3.1)

where M_S is the relevance (merit) of the feature subset, \overline{r}_{cf} is the average linear correlation coefficient between the features and the classes, and \overline{r}_{ff} is the average linear correlation coefficient among the different features. Normally, CFS adds (forward selection) or deletes (backward selection) one feature at a time; in this work, however, best first search (BFS) and greedy hill climbing search are used to obtain the best results [24, 25].
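As a concrete illustration of Equation (3.1), the following minimal Python sketch (an illustration added here, not part of the WEKA-based experiments reported in this chapter) computes the merit of a candidate subset from its feature-class and feature-feature correlations; the correlation values in the example are hypothetical.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit M_S of a feature subset as defined in Equation (3.1).

    feature_class_corr   -- correlations r_cf between each subset feature and the class
    feature_feature_corr -- pairwise correlations r_ff among the subset features
                            (empty list for a single-feature subset)
    """
    k = len(feature_class_corr)
    r_cf = sum(feature_class_corr) / k
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical example: three features with class correlations 0.55, 0.48, 0.40
# and pairwise inter-feature correlations 0.20, 0.15, 0.25
print(cfs_merit([0.55, 0.48, 0.40], [0.20, 0.15, 0.25]))
```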

3.2.2 Searching the Feature Subset

Feature selection can be viewed as a search problem, with each state in the search space specifying a subset of the possible features. Search strategies can also be influenced by the search direction. In this section greedy search and best first search are elaborated.

Greedy Hill Climbing Search (GS)

Searching the space of feature subsets within reasonable time constraints is necessary if a feature selection algorithm is to operate on data with a large number of features. One simple search strategy, called greedy hill climbing, considers local changes to the current feature subset; frequently, a local change is the addition or deletion of a single feature. When the algorithm considers only additions to the feature subset it is known as forward selection, and when it considers only deletions it is known as backward elimination [24, 25, 15]. An alternative approach, called stepwise bi-directional search, uses both additions and deletions. In each of these variations the search method may either consider all possible local changes to the current subset and choose the best one, or accept the first change that improves the merit of the current feature subset. In both cases, once a change is accepted it is never reconsidered.
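The forward-selection variant of greedy hill climbing can be sketched in Python as follows. This is an illustrative reconstruction rather than the WEKA implementation used in this work; the `merit` argument stands for any subset evaluator, such as the CFS merit of Equation (3.1), and the toy merit function in the usage example is hypothetical.

```python
def greedy_forward_selection(all_features, merit):
    """Greedy hill climbing restricted to additions (forward selection).

    all_features -- iterable of candidate feature identifiers
    merit        -- function mapping a frozenset of features to a score
    """
    current = frozenset()
    current_merit = merit(current)
    while True:
        # consider every local change: adding one remaining feature
        candidates = [current | {f} for f in all_features if f not in current]
        if not candidates:
            break
        best = max(candidates, key=merit)
        if merit(best) > current_merit:      # accept only improving changes
            current, current_merit = best, merit(best)
        else:
            break                            # local optimum reached
    return current

# Toy merit: reward subsets containing features 1 and 3, penalise subset size
toy_merit = lambda s: len(s & {1, 3}) - 0.1 * len(s)
print(greedy_forward_selection(range(5), toy_merit))
```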

Best First Search (BFS)

Best first search is an important AI search strategy that allows backtracking along the search path [24, 25]. Like greedy hill climbing, best first search moves through the search space by making local changes to the current feature subset. Unlike hill climbing, however, if the path being explored begins to look less promising, best first search can backtrack to a more promising previous subset and continue the search from there. Given enough time, a best first search will explore the entire search space, so it is common to use a stopping criterion; normally this involves limiting the number of consecutive fully expanded subsets that result in no improvement.

3.2.3 Correlation Measures

Correlation measures are important for obtaining the merit of a feature subset. To estimate the merit of a feature subset, it is necessary to compute the correlation (dependence) among attributes by applying Equation (3.1). Research on decision tree induction has provided a number of methods for estimating the quality of an attribute, that is, how predictive one attribute is of another. For discrete class problems, CFS first discretizes numeric features using the technique of Fayyad and Irani and then uses a modified information gain measure (symmetrical uncertainty) to estimate the degree of association among discrete features. If X and Y are discrete random variables, Equations (3.2) and (3.3) give the entropy of Y before and after observing X:

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)    (3.2)

H(Y|X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)    (3.3)

The amount by which the entropy of Y decreases reflects the additional information about Y provided by X and is called the information gain:

Gain = H(Y) - H(Y|X) = H(X) - H(X|Y) = H(Y) + H(X) - H(X,Y)    (3.4)

Information gain is a symmetrical measure; that is, the amount of information gained about Y after observing X is equal to the amount of information gained about X after observing Y. Unfortunately, information gain is biased in favour of features with more values: attributes with greater numbers of values will appear to gain more information than those with fewer values, even if they are actually no more informative. Furthermore, the correlations in Equation (3.1) should be normalized to ensure that they are comparable and have the same effect. Symmetrical uncertainty compensates for information gain's bias toward attributes with more values and normalizes its value to the range [0, 1]:

Symmetrical\ Uncertainty = \frac{2.0 \times Gain}{H(Y) + H(X)}    (3.5)

To handle unknown (missing) values of an attribute, CFS distributes their counts across the represented values in proportion to their relative frequencies.
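To make Equations (3.2) to (3.5) concrete, the following Python sketch (illustrative only; this work relies on WEKA's implementation) computes the entropy, conditional entropy, information gain and symmetrical uncertainty of two discrete variables given as lists of values; the feature and class values in the example are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """H(Y) of Equation (3.2) for a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X) of Equation (3.3): entropy of y within each value of x, weighted by p(x)."""
    n = len(x)
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(xv, []).append(yv)
    return sum((len(ys) / n) * entropy(ys) for ys in groups.values())

def symmetrical_uncertainty(y, x):
    """Equation (3.5): information gain normalised to the range [0, 1]."""
    gain = entropy(y) - conditional_entropy(y, x)          # Equation (3.4)
    denom = entropy(y) + entropy(x)
    return 2.0 * gain / denom if denom > 0 else 0.0

# Hypothetical discretised feature and class labels
feature = ['low', 'low', 'high', 'high', 'high', 'low']
label   = ['die', 'die', 'live', 'live', 'die', 'die']
print(symmetrical_uncertainty(label, feature))
```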

3.3 Proposed Work

The proposed approach comprises two parts. First, the number of features of the hepatitis disease dataset is reduced from nineteen to ten by the CFS evaluator with best first search, and the resulting subset is classified using the Naive Bayes classification algorithm. The model is evaluated with performance measures such as accuracy, sensitivity, specificity and precision. Second, the number of features of the hepatitis disease dataset is reduced from nineteen to ten by the CFS evaluator with greedy search, and the resulting subset is classified using the Naive Bayes classification algorithm. This model is evaluated with the same performance measures. The architectural diagram of the proposed methodology is shown in Fig. 3.1.

Fig. 3.1 Proposed Methodology of BFSCFS-NB and GSCFS-NB (Hepatitis Dataset → Dimensionality Reduction by CFS with Best First Search and CFS with Greedy Search → Reduced Subset 1 and Reduced Subset 2 → Classification by Naive Bayes → Performance Evaluation)

3.3.1 BFSCFS-NB Algorithm

A copy of the training data is first discretized and then passed to CFS. CFS calculates feature-class and feature-feature correlations using the symmetrical uncertainty measure and then searches the feature subset space with best first search. The subset with the highest merit found during the search is used to reduce the dimensionality of both the original training data and the testing data, so that they contain only the features selected by CFS. The dimensionally reduced data can then be passed to the Naïve Bayes algorithm for classification and prediction.
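The dimensionality-reduction step simply keeps, in both the training and testing data, the columns of the features selected by CFS. A minimal NumPy sketch of this step is shown below; the data and the selected attribute indices are hypothetical.

```python
import numpy as np

def reduce_to_subset(X_train, X_test, selected_indices):
    """Keep only the CFS-selected feature columns in both data splits."""
    return X_train[:, selected_indices], X_test[:, selected_indices]

# Hypothetical data: 6 training and 3 test instances with 19 attributes each,
# and a CFS-selected subset of 10 attribute indices
X_train = np.random.rand(6, 19)
X_test = np.random.rand(3, 19)
selected = [0, 1, 5, 8, 10, 11, 13, 15, 16, 18]
X_train_red, X_test_red = reduce_to_subset(X_train, X_test, selected)
print(X_train_red.shape, X_test_red.shape)   # (6, 10) (3, 10)
```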

The working principle of BFSCFS-NB is shown in Fig. 3.2.

Fig. 3.2 Proposed Model for CFS and Naïve Bayes Algorithm (training data → data pre-processing and discretization → CFS: feature-class and feature-feature correlations, BFS search, feature evaluation of subset merits → dimensionality reduction → Naïve Bayes algorithm → final evaluation on the testing data)

The proposed BFSCFS-NB algorithm is as follows:

Step 1: Start with the OPEN list containing the start state, the CLOSED list empty and BEST ← start state.
Step 2: Let s = arg max e(x) over the states x in OPEN (the state from OPEN with the highest evaluation).
Step 3: Remove s from OPEN and add it to CLOSED.
Step 4: If e(s) ≥ e(BEST), then BEST ← s.
Step 5: For every child t of s that is not in the OPEN or CLOSED list, evaluate it and add it to OPEN.
Step 6: If BEST changed in the last set of expansions, go to Step 2.
Step 7: Return BEST.
Step 8: Obtain the new (reduced) data set.
Step 9: Make both the training and test data discrete.
Step 10: Estimate the prior probabilities P(C_j), j = 1, ..., k, from the training data, where k is the number of classes.
Step 11: Estimate the conditional probabilities P(A_i = a_l | C_j), i = 1, ..., D, j = 1, ..., k, l = 1, ..., d, from the training data, where D is the number of features and d is the number of discretization levels.
Step 12: Estimate the posterior probabilities P(C_j | A) for each test example x represented by a feature vector A.
Step 13: Assign x to the class C* such that C* = arg max_{j=1,2} P(C_j | A).
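Steps 1 to 7 can be sketched in Python as follows. This is an illustrative reconstruction of the search loop rather than the WEKA implementation used in this work: `evaluate` stands for any subset evaluator such as the CFS merit and, as a simplification of the stopping criterion discussed above, the loop stops after the first round of expansions that fails to improve BEST. The toy evaluator in the usage example is hypothetical.

```python
def best_first_search(all_features, evaluate):
    """Best first search over feature subsets (Steps 1-7 of BFSCFS-NB)."""
    start = frozenset()                       # empty subset as the start state
    open_list = {start: evaluate(start)}      # OPEN: subset -> evaluation e(s)
    closed = set()                            # CLOSED: already expanded subsets
    best, best_eval = start, open_list[start]

    while open_list:
        # Step 2: take the state from OPEN with the highest evaluation
        s = max(open_list, key=open_list.get)
        s_eval = open_list.pop(s)             # Step 3: move s from OPEN to CLOSED
        closed.add(s)

        changed = False
        if s_eval >= best_eval:               # Step 4
            best, best_eval, changed = s, s_eval, True

        # Step 5: evaluate every unseen child (subset extended by one feature)
        for f in all_features:
            child = s | {f}
            if child != s and child not in open_list and child not in closed:
                open_list[child] = evaluate(child)

        if not changed:                       # Step 6: stop when BEST no longer improves
            break
    return best                               # Step 7

# Toy evaluator: reward features 2 and 4, penalise subset size
print(best_first_search(range(5), lambda s: len(s & {2, 4}) - 0.05 * len(s)))
```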

The first half of the algorithm, Steps 1 to 8, selects the feature subset using best first search, and the second half, Steps 9 to 13, performs classification using Naïve Bayes.

3.3.2 GSCFS-NB Algorithm

The working principle of GSCFS-NB is shown in Fig. 3.3.

Fig. 3.3 Proposed Model for CFS and Naïve Bayes Algorithm (training data → data pre-processing and discretization → CFS: feature-class and feature-feature correlations, greedy search, feature evaluation of subset merits → dimensionality reduction → Naïve Bayes algorithm → final evaluation on the testing data)

A copy of the training data is first discretized and then passed to CFS. CFS calculates feature-class and feature-feature correlations using the symmetrical uncertainty measure and then searches the feature subset space using greedy search. The subset with the highest merit found during the search is used to reduce the dimensionality of both the original training data and the testing data, so that they contain only the features selected by CFS. The dimensionally reduced data can then be passed to Naïve Bayes classification for prediction. The proposed GSCFS-NB algorithm is given below.

Step 1: Let s ← start state.
Step 2: Expand s by making each possible local change.
Step 3: Evaluate each child t of s.
Step 4: Let s' ← the child t with the highest evaluation e(t).
Step 5: If e(s') ≥ e(s), then s ← s' and go to Step 2.
Step 6: Return s.
Step 7: Obtain the new (reduced) data set.
Step 8: Make both the training and test data discrete.
Step 9: Estimate the prior probabilities P(C_j), j = 1, ..., k, from the training data, where k is the number of classes.
Step 10: Estimate the conditional probabilities P(A_i = a_l | C_j), i = 1, ..., D, j = 1, ..., k, l = 1, ..., d, from the training data, where D is the number of features and d is the number of discretization levels.
Step 11: Estimate the posterior probabilities P(C_j | A) for each test example x represented by a feature vector A.
Step 12: Assign x to the class C* such that C* = arg max_{j=1,2} P(C_j | A).

The first half of the algorithm, Steps 1 to 7, selects the feature subset using greedy search, and the second half, Steps 8 to 12, performs classification using Naïve Bayes.
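The classification half of the algorithm (Steps 8 to 12, estimating the priors, conditionals and posteriors on discretised data) can be sketched as follows. This is an illustrative re-implementation, not WEKA's Naive Bayes; the add-one (Laplace) smoothing is an extra safeguard against zero counts that is not stated in the steps above, and the training data in the example are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate priors P(C_j) and conditionals P(A_i = a_l | C_j) from discretised data."""
    n = len(y)
    priors = {c: cnt / n for c, cnt in Counter(y).items()}
    # counts[class][feature index][feature value]
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, c in zip(X, y):
        for i, value in enumerate(row):
            counts[c][i][value] += 1
    return priors, counts, Counter(y)

def predict(priors, counts, class_totals, row, levels):
    """Assign the class C* = arg max_j P(C_j) * prod_i P(A_i = a_i | C_j)."""
    best_class, best_score = None, float('-inf')
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            # add-one (Laplace) smoothing over the d discretisation levels
            score *= (counts[c][i][value] + 1) / (class_totals[c] + levels)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical discretised training data with two classes
X = [['low', 'yes'], ['low', 'no'], ['high', 'yes'], ['high', 'no']]
y = ['die', 'die', 'live', 'live']
priors, counts, totals = train_naive_bayes(X, y)
print(predict(priors, counts, totals, ['high', 'yes'], levels=2))
```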

3.4 Experimental Results and Discussion

The hepatitis dataset was used to evaluate the proposed method. The whole dataset is divided into a training set and a test set in the ratio 66%:34%. The training set is used to estimate each model's parameters, while the test set is used to independently assess the individual models. The experiments were implemented in the WEKA data mining workbench on an Intel(R) Core(TM) i3-2328M CPU at 2.20 GHz.

In this work CFS was used as the feature selection method, with best first search and greedy search as the search strategies. After feature selection, the attributes selected by greedy search based CFS were age, sex, malaise, spiders, ascites, varices, bilirubin, albumin, protime and histology. The attributes selected by best first search based CFS were likewise age, sex, malaise, spiders, ascites, varices, bilirubin, albumin, protime and histology. After identifying the subset, the Naïve Bayes algorithm was used for classification. In the next stage the results were compared with those of existing classification algorithms such as Sequential Minimal Optimization, J48, Multilayer Perceptron and Radial Basis Function.

3.4.1 Performance Evaluation

The experimental results are assessed with various measures used to evaluate the model for classification and prediction. In this work the accuracy, sensitivity, specificity, precision and kappa statistics are elaborated.

Accuracy

The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier. The associated class label of each test tuple is compared with the learned classifier's class prediction for that tuple. The ten-fold cross validation method is used for estimating classifier accuracy. Equation (3.6) measures the accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3.6)

Sensitivity

Sensitivity is also referred to as the true positive rate, that is, the proportion of positive tuples that are correctly identified. Equation (3.7) is used to measure the sensitivity:

Sensitivity = TP / (TP + FN)    (3.7)

Specificity

Specificity is the true negative rate, that is, the proportion of negative tuples that are correctly identified. Equation (3.8) is used to measure the specificity:

Specificity = TN / (TN + FP)    (3.8)

Precision

A false positive (FP) occurs when the outcome is incorrectly predicted as yes (or positive) when it is actually no (negative). A false negative (FN) occurs when the outcome is incorrectly predicted as no when it is actually yes. Equation (3.9) is used to measure the precision:

Precision = TP / (TP + FP)    (3.9)

The accuracy, time, precision, sensitivity and specificity for Naive Bayes, J48, Multilayer Perceptron, SMO and RBF [91, 92] are shown in Table 3.1.

Table 3.1 Performance Measures: Before Feature Selection

Classification Algorithm    Accuracy    Time    Precision    Sensitivity    Specificity
Multi Layer Perceptron      80.00%                           80.0%          85.17%
RBF                         83.80%                           83.8%          87.86%
SMO                         83.16%                           83.2%          86.95%
J48                                                          83.9%          87.93%
Naive Bayes                 84.51%                           84.5%          88.61%

The accuracy, time, precision, sensitivity and specificity for CFS-MLP, CFS-RBF, CFS-SMO, CFS-J48 and CFS-NB are listed in Table 3.2.

Table 3.2 Performance Measures: After Feature Selection based on BFSCFS

Classification Algorithm    Accuracy    Time    Precision    Sensitivity    Specificity
CFS-MLP                     84.51%                           84.5%          91.05%
CFS-RBF                     86.40%                           86.5%          91.49%
CFS-SMO                     83.22%                           83.2%          90.30%
CFS-J48                     81.29%                           81.3%          89.68%
CFS-NB                      88.53%                           88.7%          92.86%
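Equations (3.6) to (3.9) translate directly into code, as in the following sketch; the confusion-matrix counts used in the example are hypothetical.

```python
def classification_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision (Equations 3.6-3.9)."""
    return {
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),
        'specificity': tn / (tn + fp),
        'precision':   tp / (tp + fp),
    }

# Hypothetical counts for a two-class (DIE/LIVE) test set
print(classification_measures(tp=20, tn=18, fp=4, fn=3))
```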

Based on the classification accuracy, sensitivity and specificity, the models were evaluated. In this proposed method, CFS with best first search and the Naïve Bayes algorithm were applied; using this model, a prediction accuracy of 88.53% is achieved.

The accuracy, time, precision, sensitivity and specificity for CFS-MLP, CFS-RBF, CFS-SMO, CFS-J48 and CFS-NB are given in Table 3.3.

Table 3.3 Performance Measures: After Feature Selection based on GSCFS

Classification Algorithm    Accuracy    Time    Precision    Sensitivity    Specificity
CFS-MLP                     84.51%                           84.5%          89.90%
CFS-RBF                     86.45%                           86.5%          89.15%
CFS-SMO                     83.22%                           83.2%          88.30%
CFS-J48                     81.29%                           81.3%          89.38%
CFS-NB                      87.74%                           87.7%          91.86%

Based on the classification accuracy, sensitivity and specificity, the models were evaluated. In this work CFS with greedy search and the Naïve Bayes algorithm were applied; using this model, a prediction accuracy of 87.74% is achieved.

3.4.2 K-Fold Cross Validation

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds, D_1, D_2, ..., D_k, each of approximately equal size, and training and testing are performed k times. In iteration i, partition D_i is kept as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D_2, ..., D_k collectively serve as the training set in order to obtain the first model, which is tested on D_1; the second iteration is trained on D_1, D_3, ..., D_k and tested on D_2; and so on. Here, each sample is used the same number of times for training and exactly once for testing. The accuracy estimate for classification is the overall number of correct classifications from the k iterations divided by the total number of tuples in the initial data. The error estimate for prediction can be computed as the total loss from the k iterations divided by the total number of initial tuples.

Leave-one-out is a special case of k-fold cross-validation in which k is set to the number of initial tuples, so that only one sample is left out at a time for the test set. In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in each fold is approximately the same as that in the initial data. In this work 10-fold cross-validation is used for estimating accuracy.
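A minimal sketch of plain (unstratified) k-fold cross-validation as described above is given below; the `train` and `test` callables are placeholders for any learning scheme, and the toy majority-class scheme in the example is hypothetical.

```python
import random

def k_fold_accuracy(data, labels, k, train, test):
    """Plain k-fold cross-validation: overall correct classifications / total tuples."""
    indices = list(range(len(data)))
    random.shuffle(indices)                       # random partition into k folds
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds if f is not test_idx for j in f]
        model = train([data[j] for j in train_idx], [labels[j] for j in train_idx])
        predictions = test(model, [data[j] for j in test_idx])
        correct += sum(p == labels[j] for p, j in zip(predictions, test_idx))
    return correct / len(data)

# Toy scheme: "train" memorises the majority class, "test" always predicts it
majority_train = lambda X, y: max(set(y), key=y.count)
majority_test = lambda model, X: [model] * len(X)
data = list(range(20))
labels = ['live'] * 14 + ['die'] * 6
print(k_fold_accuracy(data, labels, k=10, train=majority_train, test=majority_test))
```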

3.4.3 Kappa Statistics

The kappa parameter measures pairwise agreement between two different observers, corrected for the expected chance agreement [40]. For instance, a value of one means that there is complete agreement between the classifier and the real-world value. The kappa value is calculated using the following equations:

K = \frac{P(A) - P(E)}{1 - P(E)}    (3.10)

P(A) = (TP + TN) / N    (3.11)

P(E) = \frac{(TP + FN)(TP + FP) + (TN + FP)(TN + FN)}{N^2}    (3.12)

where N is the total number of instances used, P(A) is the percentage of agreement between the classifier and the underlying truth, calculated by Equation (3.11), and P(E) is the chance agreement, calculated by Equation (3.12). In this study the kappa value for CFS-NB based on best first search and greedy search is calculated by Equation (3.10).

The mean absolute error is a quantity used to measure how close the predictions are to the eventual outcomes, and is given by

Mean\ Absolute\ Error = \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|    (3.13)

that is, the average of the absolute errors |y_i - y_i'|, where y_i is the prediction and y_i' is the true value.

The root mean square error is the square root of the mean of the squares of the errors. It squares the errors before they are averaged and therefore gives a relatively high weight to large errors. The root mean square error E_i of an individual program i is evaluated by

E_i = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left(P_{(ij)} - T_j\right)^2}    (3.14)

where P_{(ij)} is the value predicted by the individual program i for fitness case j and T_j is the target value for fitness case j.
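The statistics of Equations (3.10) to (3.14) can be computed as in the following sketch (illustrative only; the counts and value lists in the example are hypothetical).

```python
import math

def kappa(tp, tn, fp, fn):
    """Kappa statistic from two-class counts (Equations 3.10-3.12)."""
    n = tp + tn + fp + fn
    p_a = (tp + tn) / n
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (p_a - p_e) / (1 - p_e)

def mean_absolute_error(predicted, actual):
    """Equation (3.13)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def root_mean_square_error(predicted, actual):
    """Equation (3.14)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical counts and predicted versus true values
print(kappa(tp=20, tn=18, fp=4, fn=3))
print(mean_absolute_error([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]))
print(root_mean_square_error([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]))
```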

The mean absolute error, root mean square error and kappa statistics of the classifiers are shown in Tables 3.4 and 3.5.

Table 3.4 Classifier Statistical Results based on BFSCFS

Classifier    Mean Absolute Error    Root Mean Square Error    Kappa Statistics
CFS-MLP
CFS-RBF
CFS-SMO
CFS-J48
CFS-NB

Table 3.5 Classifier Statistical Results based on GSCFS

Classifier    Mean Absolute Error    Root Mean Square Error    Kappa Statistics
CFS-MLP
CFS-RBF
CFS-SMO
CFS-J48
CFS-NB

3.4.4 Confusion Matrix

The confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes. A confusion matrix for two classes is shown in Table 3.6.

Table 3.6 Different Outcomes of a Two-Class Prediction

                              Predicted Class
Actual Class                  Yes                 No
Yes                           True Positive       False Negative
No                            False Positive      True Negative

Given m classes, a confusion matrix is a table of at least size m by m. An entry CM_{i,j} in the first m rows and m columns indicates the number of tuples of class i that were labelled by the classifier as class j. For a classifier to have good accuracy, most of the tuples should ideally lie along the diagonal of the confusion matrix, from entry CM_{1,1} to entry CM_{m,m}, with the rest of the entries being close to zero. The table may have additional rows or columns to provide totals or recognition rates per class.

Given two classes, the tuples can be considered as positive and negative. True positives are the positive tuples that were correctly labelled by the classifier, while true negatives are the negative tuples that were correctly labelled. False positives are the negative tuples that were incorrectly labelled, and false negatives are the positive tuples that were incorrectly labelled. These terms are useful when analyzing a classifier's ability. Confusion matrices were calculated for the Naive Bayes, BFSCFS-NB and GSCFS-NB classifiers to interpret the results, and are shown in Tables 3.7, 3.8 and 3.9.

Table 3.7 Confusion Matrix: Before Feature Selection based on NB

a       b       Classified as
                a = DIE
                b = LIVE

Table 3.8 Confusion Matrix: After Feature Selection based on BFSCFS-NB

a       b       Classified as
23      9       a = DIE
                b = LIVE

Table 3.9 Confusion Matrix: After Feature Selection based on GSCFS-NB

a       b       Classified as
                a = DIE
                b = LIVE
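A confusion matrix such as those in Tables 3.7 to 3.9 can be built from the actual and predicted class labels as in this sketch; the label lists in the example are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """m x m matrix where entry (i, j) counts tuples of class i labelled as class j."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

# Hypothetical DIE/LIVE predictions for a small test set
actual    = ['DIE', 'DIE', 'LIVE', 'LIVE', 'LIVE', 'DIE']
predicted = ['DIE', 'LIVE', 'LIVE', 'LIVE', 'DIE', 'DIE']
for row in confusion_matrix(actual, predicted, classes=['DIE', 'LIVE']):
    print(row)
```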

3.4.5 Graph Results

This section illustrates the graphical results for the accuracy, sensitivity, specificity, precision and time of the Naive Bayes, J48, Multilayer Perceptron, Sequential Minimal Optimization and Radial Basis Function algorithms.

Fig. 3.4 shows the performance analysis related to accuracy of the various classification algorithms for CFS.

Fig. 3.4 Performance related to Accuracy for CFS-NB

Fig. 3.5 shows the performance analysis related to time of the various classification algorithms for CFS.

Fig. 3.5 Performance related to Time for CFS-NB

Fig. 3.6 shows the performance analysis related to sensitivity of the various classification algorithms for CFS.

Fig. 3.6 Performance related to Sensitivity for CFS-NB

Fig. 3.7 shows the performance analysis related to specificity of the various classification algorithms for CFS.

Fig. 3.7 Performance related to Specificity for CFS-NB

Fig. 3.8 shows the performance analysis related to precision of the various classification algorithms for CFS.

Fig. 3.8 Performance related to Precision for CFS-NB

Fig. 3.9 shows the performance analysis related to accuracy of the various classification algorithms for CFS.

Fig. 3.9 Performance related to Accuracy for CFS-NB

Fig. 3.10 shows the performance analysis related to time of the various classification algorithms for CFS.

Fig. 3.10 Performance related to Time for CFS-NB

Fig. 3.11 shows the performance analysis related to sensitivity of the various classification algorithms for CFS.

Fig. 3.11 Performance related to Sensitivity for CFS-NB

Fig. 3.12 shows the performance analysis related to specificity of the various classification algorithms for CFS.

Fig. 3.12 Performance related to Specificity for CFS-NB

Fig. 3.13 shows the performance analysis related to precision of the various classification algorithms for CFS.

Fig. 3.13 Performance related to Precision for CFS-NB

3.5 Summary

In this work an enhanced medical diagnostic method for addressing the hepatitis diagnosis problem is developed. Experimental results on various portions of the hepatitis dataset showed that the new approach performs better in distinguishing the live patients from the dead ones. It is observed that CFS-NB achieved the best classification accuracies with a reduced feature subset containing ten features. A comparative study was also conducted with the CFS-MLP, CFS-SMO, CFS-J48 and CFS-RBF methods. The experimental results show that CFS-NB performed advantageously over the other methods in terms of classification accuracy, sensitivity, specificity and time. Other measures such as the kappa statistic and the classification error measures are also reported. The performance of the best first search based CFS-NB is better than that of the other algorithms.

In this work best first search and greedy search techniques were used. In further work, Particle Swarm Optimization and Genetic search based algorithms, PSOCFS-NB and GNSCFS-NB, are developed to reduce the dimensionality of the data and the computational cost. These algorithms are described in the subsequent chapter.
